美文网首页
GDAS006-深入理解SummarizedExperiment

GDAS006-深入理解SummarizedExperiment

作者: backup备份 | 来源:发表于2019-11-07 19:28 被阅读0次

    title: GDAS006-深入理解SummarizedExperiment类
    date: 2019-09-05 12:0:00
    type: "tags"
    tags:

    • Bioconductor
      categories:
    • Genomics Data Analysis Series

    前言

    在数据管理部分,我们应用summarizeOverlaps 来管理RNAseqData.HNRNPC.bam.chr14中的BAM文件。现在我们再次来操作一下。

    使用SummarizedExperiment来管理BAM文件

    library(RNAseqData.HNRNPC.bam.chr14)
    bfp = RNAseqData.HNRNPC.bam.chr14_BAMFILES
    library(Rsamtools)
    bfl = BamFileList(file=bfp)
    hnrnpcLoc = GRanges("chr14", IRanges(21677296, 21737638))
    library(GenomicAlignments)
    library(BiocParallel)
    register(SerialParam())
    hnse = summarizeOverlaps(hnrnpcLoc,bfl)
    hnse
    ## class: RangedSummarizedExperiment 
    ## dim: 1 8 
    ## metadata(0):
    ## assays(1): counts
    ## rownames: NULL
    ## rowData names(0):
    ## colnames(8): ERR127306 ERR127307 ... ERR127304 ERR127305
    ## colData names(0):
    

    hnseRangedSummarizedExperiment 类的一实例。这个类就类似于 ExpressionSet ,不过有着更多的内容来管理元数据,它的流程如下所示:

    plot of chunk lkseee

    有效地使用SummarizedExperiment实例涉及学习它的一些方法。为了获取HNRNPC基因的读长/区域(read/region),我们可以使用assay方法,如下所示:

    assay(hnse)
    ##      ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303 ERR127304
    ## [1,]      5422      6320      5896      5558       172       196       316
    ##      ERR127305
    ## [1,]       282
    

    以上就是最基本的结果表示方法。列名则是样本标识符,但是有关区域检查的一些信息则已经丢失。

    SummarizedExperiment中的元数据

    hnse 对象还有一些其它的信息,如下所示:

    rowRanges(hnse)
    ## GRanges object with 1 range and 0 metadata columns:
    ##       seqnames               ranges strand
    ##          <Rle>            <IRanges>  <Rle>
    ##   [1]    chr14 [21677296, 21737638]      *
    ##   -------
    ##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
    seqinfo(hnse)
    ## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
    ##   seqnames seqlengths isCircular genome
    ##   chr14            NA         NA   <NA>
    metadata(hnse)
    ## list()
    

    我们还可以进一步分析差异信息,从而输出更多,更广泛的信息。

    通过添加输入信息来更有效地生成SummarizedExperiment

    使用元数据定义感兴趣的区域

    We have seen that it is sufficient to define a single GRanges to drive summarizeOverlaps over a set of BAM files. We’d like to preserve more metadata about the regions examined. We’ll use the TxDb infrastructure, to be described in more detail later, to get a structure defining gene regions on chr14. We’ll also use the Homo.sapiens annotation package to add gene symbols.

    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
    gr14 = genes(txdb, vals=list(tx_chrom="chr14"))
    ## Warning: 'elementLengths' is deprecated.
    ## Use 'elementNROWS' instead.
    ## See help("Deprecated")
    
    ## Warning: 'elementLengths' is deprecated.
    ## Use 'elementNROWS' instead.
    ## See help("Deprecated")
    gr14$symbol = mapIds(Homo.sapiens, keys=gr14$gene_id, keytype="ENTREZID",
       column="SYMBOL")
    ## 'select()' returned 1:1 mapping between keys and columns
    gr14
    ## GRanges object with 781 ranges and 2 metadata columns:
    ##             seqnames                 ranges strand |     gene_id
    ##                <Rle>              <IRanges>  <Rle> | <character>
    ##       10001    chr14 [ 71050957,  71067384]      - |       10001
    ##   100113389    chr14 [ 45580078,  45580176]      + |   100113389
    ##   100113391    chr14 [ 20794600,  20794698]      - |   100113391
    ##   100124539    chr14 [ 91592770,  91592896]      + |   100124539
    ##   100126297    chr14 [101507700, 101507781]      + |   100126297
    ##         ...      ...                    ...    ... .         ...
    ##        9870    chr14 [ 75127955,  75179807]      - |        9870
    ##        9878    chr14 [ 21945335,  21967319]      + |        9878
    ##        9895    chr14 [102829300, 102968818]      + |        9895
    ##        9950    chr14 [ 93260650,  93306304]      + |        9950
    ##        9985    chr14 [ 24641234,  24649463]      + |        9985
    ##                  symbol
    ##             <character>
    ##       10001        MED6
    ##   100113389    SNORD127
    ##   100113391    SNORD126
    ##   100124539    SNORA11B
    ##   100126297      MIR300
    ##         ...         ...
    ##        9870       AREL1
    ##        9878        TOX4
    ##        9895      TECPR2
    ##        9950      GOLGA5
    ##        9985        REC8
    ##   -------
    ##   seqinfo: 93 sequences (1 circular) from hg19 genome
    

    定义BAM文件的样本信息

    现在我们有三个样本,一个用于控制,两个用于敲低。我们使用GenomicFiles来绑定样本信息的元数据,如下所示:

    char = rep(c("hela_wt", "hela_hkd"), each=4)
    bff = GenomicFiles(files=path(bfl))
    colData(bff)$condition = char
    sid = c(1,1,1,1,2,2,3,3)
    bff$sample = sid
    bff
    ## GenomicFiles object with 0 ranges and 8 files: 
    ## files: ERR127306_chr14.bam, ERR127307_chr14.bam, ..., ERR127304_chr14.bam, ERR127305_chr14.bam 
    ## detail: use files(), rowRanges(), colData(), ...
    

    比较读长覆盖区,保留元数据

    我们来查看5个基因,其中就包括HNRNPC。当计算后,我们会将样本信息再绑定回结果,如下所示:

    hnse = summarizeOverlaps(gr14[c(1:4,305)],files(bff))
    colData(hnse) = cbind(colData(hnse), colData(bff))
    hnse
    ## class: RangedSummarizedExperiment 
    ## dim: 5 8 
    ## metadata(0):
    ## assays(1): counts
    ## rownames(5): 10001 100113389 100113391 100124539 3183
    ## rowData names(2): gene_id symbol
    ## colnames(8): ERR127306 ERR127307 ... ERR127304 ERR127305
    ## colData names(2): condition sample
    assay(hnse)
    ##           ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303
    ## 10001           114       156       129       144       175       213
    ## 100113389         0         0         0         0         0         0
    ## 100113391         0         0         0         0         0         0
    ## 100124539         0         0         0         0         0         0
    ## 3183           2711      3160      2948      2779        86        98
    ##           ERR127304 ERR127305
    ## 10001           210       165
    ## 100113389         0         0
    ## 100113391         0         0
    ## 100124539         0         0
    ## 3183            158       141
    

    从上面我们可以看出,行标识符是随计数矩阵一起出现的,现在我们绘制出这几个基因的箱线图,如下所示:

    par(mfrow=c(2,2))
    for (i in 2:5) {
      boxplot(assay(hnse)[i,]~hnse$condition, ylab=rowRanges(hnse)$symbol[i])
    }
    
    plot of chunk lkbo

    从ExpressionSet提取数据

    从ExpressionSet中提取数据很容易,但是还需要基于基因组范围查询的阵列探针的子集,如下所示:

    library(ALL)
    data(ALL)
    allse = makeSummarizedExperimentFromExpressionSet(ALL)
    allse
    ## class: RangedSummarizedExperiment 
    ## dim: 12625 128 
    ## metadata(3): experimentData annotation protocolData
    ## assays(1): exprs
    ## rownames(12625): 1000_at 1001_at ... AFFX-YEL021w/URA3_at
    ##   AFFX-YEL024w/RIP1_at
    ## rowData names(0):
    ## colnames(128): 01005 01010 ... 83001 LAL4
    ## colData names(21): cod diagnosis ... f.u date.last.seen
    rowRanges(allse)
    ## Warning: 'elementLengths' is deprecated.
    ## Use 'elementNROWS' instead.
    ## See help("Deprecated")
    ## GRangesList object of length 12625:
    ## $$1000_at 
    ## GRanges object with 0 ranges and 0 metadata columns:
    ##    seqnames    ranges strand
    ##       <Rle> <IRanges>  <Rle>
    ## 
    ## $$1001_at 
    ## GRanges object with 0 ranges and 0 metadata columns:
    ##      seqnames ranges strand
    ## 
    ## $$1002_f_at 
    ## GRanges object with 0 ranges and 0 metadata columns:
    ##      seqnames ranges strand
    ## 
    ## ...
    ## <12622 more elements>
    ## -------
    ## seqinfo: no sequences
    

    总结

    RangedSummarizedExperiment类实例化了一些Bioconductor数据结构设计的一些关键原则:

    • 基于样本特征的数据分析和元数据分析,并将它们以某种方式绑定在一起。
    • 类似于矩阵的子集直接用于分析和样本数据分析。
    • 基于范围的子集设置适用于可通过基因组坐标寻址的分析。
    • 可以在mcol中提供关键分析特征的任意元数据(rowRanges(se))。
    • 可以通过metadata(se)<- 来添加任何元数据。

    在后面的内容里我们还将会了解更多的关于SummarizedExperiment改造的一些数据结构,从而用于专门用于RNA-seq的多个阶段处理和数据分析。

    参考资料

    1. SummarizedExperiment class in depth

    相关文章

      网友评论

          本文标题:GDAS006-深入理解SummarizedExperiment

          本文链接:https://www.haomeiwen.com/subject/cctbbctx.html