美文网首页
GDAS002-Bioconductor备忘录

GDAS002-Bioconductor备忘录

作者: backup备份 | 来源:发表于2019-11-02 09:24 被阅读0次

    title: GDAS002-Bioconductor备忘录
    date: 2019-09-02 12:0:00
    type: "tags"
    tags:

    • Bioconductor
      categories:
    • Genomics Data Analysis Series

    前言

    这篇笔记的主要内容是关于Bioconductor的一些备忘录(cheat sheet),作者总结的不错,我就按原文翻译了下来,由于一些知识点并没有学到,因此未翻译,后面学到后,再回头过来翻译。

    安装

    安装细节参考http://bioconductor.org/install/,具体代码如下所示:

    if (!requireNamespace("BiocManager"))
        install.packages("BiocManager")
    BiocManager::install()
    BiocManager::install(c("package1","package2")
    BiocManager::valid() # are packages up to date?
    
    # what Bioc version is release right now?
    http://bioconductor.org/bioc-version
    # what Bioc versions are release/devel?
    http://bioconductor.org/js/versions.js
    

    R语言帮助

    简单的帮忙如下所示:

    ?函数名
    ?"eSet-class" # 如果要查看类的帮助,需要后缀'-class'
    help(package="foo",help_type="html") # 使用web浏览器来查看帮助文档
    vignette("topic")
    browseVignettes(package="package") # 某包的简单案例
    

    高级用户的帮助功能

    functionName # 直接输入函数名(不带括号),则显示函数代码
    getMethod(method,"class")  # 显示某类的方法代码
    selectMethod(method, "class") # 查看继承来的方法
    showMethods(classes="class") # 显示某类的所有方法show all methods for class
    methods(class="GRanges") # 此函数限于R >=3.2
    ?"functionName,class-method" # 用于查看S4对象的方法文档
    ?"plotMA,data.frame-method" # 需要加载geneplotter包,from library(geneplotter)
    ?"method.class" # 查看S3对象的方法 
    ?"plot.lm"
    sessionInfo() # 获取R语言版本的相关帮助信息 
    packageVersion("foo") # 查看某个包的版本
    

    Bioconductor相关的论坛: https://support.bioconductor.org

    如果你使用的Rstudio,则可以使用?help来查看帮忙文档,如果使用是命令行模式(例如Linux),则可以使用下面的代码通过浏览器来查看帮助文档(我使用的Rstudio,因此未尝试以下代码):

    alias rhelp="Rscript -e 'args <- commandArgs(TRUE); help(args[2], package=args[3], help_type=\"html\"); Sys.sleep(5)' --args"
    

    debugging R

    traceback() # 查看哪一步导致了R错误 
    # debug a function
    debug(myFunction) # 逐行遍历函数中的代码 
    undebug(myFunction) # 停止debugging
    debugonce(myFunction) # 功能如上,但是不需要undebug()
    # 在写代码的过程中,可以将函数browser()放到一个关系的节点上
    # this plus devtools::load_all() can be useful for programming
    # to jump in function on error:
    options(error=recover)
    # turn that behavior off:
    options(error=NULL)
    # debug, e.g. estimateSizeFactors from DESeq2...
    # debugging an S4 method is more difficult; this gives you a peek inside:
    trace(estimateSizeFactors, browser, exit=browser, signature="DESeqDataSet")
    

    显示某个类的包特异方法

    以下两个段代码实现的功能大致相同:查看特定的某个包中,某个特定类对象的操作方法。给定类的对象的操作方法。

    intersect(sapply(strsplit(as.character(methods(class="DESeqDataSet")), ","), `[`, 1), ls("package:DESeq2"))
    sub("Function: (.*) \\(package .*\\)","\\1",grep("Function",showMethods(classes="DESeqDataSet", 
        where=getNamespace("DESeq2"), printTo=FALSE), value=TRUE))
    

    注释Annotations

    这里需要先更新一下AnnotationHub,现在我们查看一下AnnotationHub的某个案例:

    https://master.bioconductor.org/packages/release/bioc/html/AnnotationHub.html

    以下代码是如何使用数据库包,以及biomart,先看AnnotationDbi

    # 某用包个注释包
    library(AnnotationDbi)
    library(org.Hs.eg.db) # 这个是人类注释包,即Homo.sapiens
    columns(org.Hs.eg.db)
    keytypes(org.Hs.eg.db) # columns与keytypes返回的结果一样,即有什么样的注释内容
    k <- head(keys(org.Hs.eg.db, keytype="ENTREZID"))
    
    # 返回一个字符向量(1对1的关系),查看mapIds则可以查看多个选项
    res <- mapIds(org.Hs.eg.db, keys=k, column="ENSEMBL", keytype="ENTREZID")
    
    # 产生1对多的映射关系 
    res <- select(org.Hs.eg.db, keys=k,
      columns=c("ENTREZID","ENSEMBL","SYMBOL"),
      keytype="ENTREZID")
    

    biomaRt

    这个包主要用于各种基因ID的转换。

    # 使用biomart包,将一个注释与另外一个注释进行映射
    library(biomaRt)
    m <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
    map <- getBM(mart = m,
      attributes = c("ensembl_gene_id", "entrezgene"),
      filters = "ensembl_gene_id", 
      values = some.ensembl.genes)
    

    GenomicRanges包

    GenomicRanges这个包用于分析高通量测序数据,它能有效地表壳与操作基因组注释信息(genomic annotations)与比对结果(alignments)。GenomicRanges包定义了一个通用容器,此容器用于储存与操作基因间隔(genomic intervals)与变量(variables)。在GenomicAlignments包和SummarizedExperiment包中定义了一个更加特殊的容器,GenomicAlignments包中的这个更加特殊的容器可以根据一个参考基因组来表示和操作一些短比对结果(short alignments),SummarizedExperiment包中容器可以构建一个实验的矩阵样汇总信息。

    此包的使用如下所示:

    library(GenomicRanges)
    z <- GRanges("chr1",IRanges(1000001,1001000),strand="+")
    start(z)
    end(z)
    width(z)
    strand(z)
    mcols(z) # the 'metadata columns', any information stored alongside each range
    ranges(z) # gives the IRanges
    seqnames(z) # the chromosomes for each ranges
    seqlevels(z) # the possible chromosomes
    seqlengths(z) # the lengths for each chromosome
    

    Intra-range methods

    不清楚功能,后面补充。

    Affects ranges independently

    function description
    shift moves left/right
    narrow narrows by relative position within range
    resize resizes to width, fixing start for +, end for -
    flank returns flanking ranges to the left +, or right -
    promoters similar to flank
    restrict restricts ranges to a start and end position
    trim trims out of bound ranges
    +/- expands/contracts by adding/subtracting fixed amount
    * zooms in (positive) or out (negative) by multiples

    Inter-range methods

    Affects ranges as a group

    function description
    range one range, leftmost start to rightmost end
    reduce cover all positions with only one range
    gaps uncovered positions within range
    disjoin breaks into discrete ranges based on original starts/ends

    Nearest methods

    Given two sets of ranges, x and subject, for each range in x, returns...

    function description
    nearest index of the nearest neighbor range in subject
    precede index of the range in subject that is directly preceded by the range in x
    follow index of the range in subject that is directly followed by the range in x
    distanceToNearest distances to its nearest neighbor in subject (Hits object)
    distance distances to nearest neighbor (integer vector)

    A Hits object can be accessed with queryHits, subjectHits and mcols if a distance is associated.

    set methods

    If y is a GRangesList, then use punion, etc. All functions have default ignore.strand=FALSE, so are strand specific.

    union(x,y) 
    intersect(x,y)
    setdiff(x,y)
    

    Overlaps

    x %over% y  # logical vector of which x overlaps any in y
    fo <- findOverlaps(x,y) # returns a Hits object
    queryHits(fo)   # which in x
    subjectHits(fo) # which in y 
    

    Seqnames and seqlevels

    GenomicRanges and GenomeInfoDb

    gr.sub <- gr[seqlevels(gr) == "chr1"]
    seqlevelsStyle(x) <- "UCSC" # convert to 'chr1' style from "NCBI" style '1'
    

    Sequences

    Biostrings是一个操作序列或序列集的包,简要操作可以参考Biostrings Quick Overview PDF,关于一些物种的,缩写的信息可以参考cheat sheet for annotation

    library(BSgenome.Hsapiens.UCSC.hg19)
    dnastringset <- getSeq(Hsapiens, granges) # returns a DNAStringSet
    # also Views() for Bioconductor >= 3.1
    
    library(Biostrings)
    dnastringset <- readDNAStringSet("transcripts.fa")
    
    substr(dnastringset, 1, 10) # to character string
    subseq(dnastringset, 1, 10) # returns DNAStringSet
    Views(dnastringset, 1, 10) # lightweight views into object
    complement(dnastringset)
    reverseComplement(dnastringset)
    matchPattern("ACGTT", dnastring) # also countPattern, also works on Hsapiens/genome
    vmatchPattern("ACGTT", dnastringset) # also vcountPattern
    letterFrequecy(dnastringset, "CG") # how many C's or G's
    # also letterFrequencyInSlidingView
    alphabetFrequency(dnastringset, as.prob=TRUE)
    # also oligonucleotideFrequency, dinucleotideFrequency, trinucleotideFrequency
    # transcribe/translate for imitating biological processes
    

    测序数据

    Rsamtools中的 scanBam 返回BAM文件中的原始数值:

    library(Rsamtools)
    which <- GRanges("chr1",IRanges(1000001,1001000))
    what <- c("rname","strand","pos","qwidth","seq")
    param <- ScanBamParam(which=which, what=what)
    # for more BamFile functions/details see ?BamFile
    # yieldSize for chunk-wise access
    bamfile <- BamFile("/path/to/file.bam")
    reads <- scanBam(bamfile, param=param)
    res <- countBam(bamfile, param=param) 
    # for more sophisticated counting modes
    # see summarizeOverlaps() below
    
    # quickly check chromosome names
    seqinfo(BamFile("/path/to/file.bam"))
    
    # DNAStringSet is defined in the Biostrings package
    # see the Biostrings Quick Overview PDF
    dnastringset <- scanFa(fastaFile, param=granges)
    

    GenomicAlignments 返回一个Bioconductor对象(GRanges-based)

    library(GenomicAlignments)
    ga <- readGAlignments(bamfile) # single-end
    ga <- readGAlignmentPairs(bamfile) # paired-end
    

    转录本数据库

    GenomicFeatures

    # get a transcript database, which stores exon, trancript, and gene information
    library(GenomicFeatures)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    
    # or build a txdb from GTF file (e.g. downloadable from Ensembl FTP site)
    txdb <- makeTranscriptDbFromGFF("file.GTF", format="gtf")
    
    # or build a txdb from Biomart (however, not as easy to reproduce later)
    txdb <- makeTranscriptDbFromBiomart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
    
    # in Bioconductor >= 3.1, also makeTxDbFromGRanges
    
    # saving and loading
    saveDb(txdb, file="txdb.sqlite")
    loadDb("txdb.sqlite")
    
    # extracting information from txdb
    g <- genes(txdb) # GRanges, just start to end, no exon/intron information
    tx <- transcripts(txdb) # GRanges, similar to genes()
    e <- exons(txdb) # GRanges for each exon
    ebg <- exonsBy(txdb, by="gene") # exons grouped in a GRangesList by gene
    ebt <- exonsBy(txdb, by="tx") # similar but by transcript
    
    # then get the transcript sequence
    txSeq <- extractTranscriptSeqs(Hsapiens, ebt)
    

    根据范围(ranges)和实验汇总信息

    SummarizedExperiment是高维数据的储存类, 它与GRangesGRangesList配合使用(使用每个基因外显子的read计数)。

    library(GenomicAlignments)
    fls <- list.files(pattern="*.bam$")
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    ebg <- exonsBy(txdb, by="gene")
    # see yieldSize argument for restricting memory
    bf <- BamFileList(fls)
    library(BiocParallel)
    register(MulticoreParam(4))
    # lots of options in the man page
    # singleEnd, ignore.strand, inter.features, fragments, etc.
    se <- summarizeOverlaps(ebg, bf)
    
    # operations on SummarizedExperiment
    assay(se) # the counts from summarizeOverlaps
    colData(se)
    rowRanges(se)
    

    Salmon是一个定量工具,如果数据占有GC依赖的数据,那么就添加上--gcBias参数,接着使用 tximport,以下是使用案例:

    coldata <- read.table("samples.txt")
    rownames(coldata) <- coldata$id
    files <- coldata$files; names(files) <- coldata$id
    txi <- tximport(files, type="salmon", tx2gene=tx2gene)
    dds <- DESeqDataSetFromTximport(txi, coldata, ~condition)
    

    Rsubread包中的featureCounts函数也可以用于read计算,速度可能更快。

    library(Rsubread)
    res <- featureCounts(files, annot.ext="annotation.gtf",
      isGTFAnnotationFile=TRUE,
      GTF.featureType="exon",
      GTF.attrType="gene_id")
    res$counts
    

    RNA-seq分析方法

    DESeq2:My preferred pipeline for DESeq2 users is to start with a lightweight transcript
    abundance quantifier such as Salmon
    and to use tximport, followed
    by DESeqDataSetFromTximport.

    Here, coldata is a data.frame with group as a column.

    library(DESeq2)
    # from tximport
    dds <- DESeqDataSetFromTximport(txi, coldata, ~ group)
    # from SummarizedExperiment
    dds <- DESeqDataSet(se, ~ group)
    # from count matrix
    dds <- DESeqDataSetFromMatrix(counts, coldata, ~ group)
    # minimal filtering helps keep things fast 
    # one can set 'n' to e.g. min(5, smallest group sample size)
    keep <- rowSums(counts(dds) >= 10) >= n 
    dds <- dds[keep,]
    dds <- DESeq(dds)
    res <- results(dds) # no shrinkage of LFC, or:
    res <- lfcShrink(dds, coef = 2, type="apeglm") # shrink LFCs
    

    edgeR

    # this chunk from the Quick start in the edgeR User Guide
    library(edgeR) 
    y <- DGEList(counts=counts,group=group)
    keep <- filterByExpr(y)
    y <- y[keep,]
    y <- calcNormFactors(y)
    design <- model.matrix(~group)
    y <- estimateDisp(y,design)
    fit <- glmFit(y,design)
    lrt <- glmLRT(fit)
    topTags(lrt)
    # or use the QL methods:
    qlfit <- glmQLFit(y,design)
    qlft <- glmQLFTest(qlfit)
    topTags(qlft)
    

    limma-voom

    library(limma)
    design <- model.matrix(~ group)
    y <- DGEList(counts)
    keep <- filterByExpr(y)
    y <- y[keep,]
    y <- calcNormFactors(y)
    v <- voom(y,design)
    fit <- lmFit(v,design)
    fit <- eBayes(fit)
    topTable(fit)
    

    更多有关RNA-seq操作的包可以参考:Many more RNA-seq packages

    表达数据集Expression set

    library(Biobase)
    data(sample.ExpressionSet)
    e <- sample.ExpressionSet
    exprs(e)
    pData(e)
    fData(e)
    

    下载GEO数据库

    library(GEOquery)
    e <- getGEO("GSE9514")
    

    芯片分析

    library(affy)
    library(limma)
    phenoData <- read.AnnotatedDataFrame("sample-description.csv")
    eset <- justRMA("/celfile-directory", phenoData=phenoData)
    design <- model.matrix(~ Disease, pData(eset))
    fit <- lmFit(eset, design)
    efit <- eBayes(fit)
    topTable(efit, coef=2)
    

    iCOBRA包

    iCOBRA包用于计算和可视化排序数据以及二分类匹配后的数据。例如,iCOBRA包的一个典型用途就是用于评估基因表达实验中不同组之间的比较方法,我们可以视之为一个排序问题(评估正确的效应量(effect size)以及按照显著性进行排列的基因),还是一个二分类比较问题(binary assignment problem)(例如将基因分为差异表达和非差异表达)。

    library(iCOBRA)
    cd <- COBRAData(pval=pval.df, padj=padj.df, score=score.df, truth=truth.df)
    cp <- calculate_performance(cd, binary_truth = "status", cont_truth = "logFC")
    cobraplot <- prepare_data_for_plot(cp)
    plot_fdrtprcurve(cobraplot)
    # interactive shiny app:
    COBRAapp(cd)
    

    参考资料

    1. Genomic annotation in Bioconductor: Cheat sheet

    2. HarvardX Biomedical Data Science Open Online Training

    相关文章

      网友评论

          本文标题:GDAS002-Bioconductor备忘录

          本文链接:https://www.haomeiwen.com/subject/zojcbctx.html