美文网首页R作图
GenomicRanges 数据结构

GenomicRanges 数据结构

作者: BeeBee生信 | 来源:发表于2021-07-02 17:41 被阅读0次

R 包 GenomicRanges 是 Bioconductor 项目承载基因组位置信息的基础包,为诸如 BSgenome,rtracklayer,VariantAnnotation 及其他 R 包提供了基础数据结构。
GenomicRanges 包含了 3 类对象:GRanges,GPos,GRangesList. 本文主要讨论 GRanges 结构和操作函数。

GRanges 对象是基因组范围(Genomic Ranges)的集合,每个基因组范围有一个在基因组的起点和终点位置。GRanges 可用于存储诸如转录本、外显子等基因组特征。

下面代码用 genes 函数提取 hg38 基因。

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
hg38Gene <- genes(txdb)

class 可以查看它属于 GRanges 对象。

> class(hg38Gene)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

> show(hg38Gene)
GRanges object with 25750 ranges and 1 metadata column:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
          1    chr19   58345178-58362751      - |           1
         10     chr8   18391282-18401218      + |          10
        100    chr20   44619522-44652233      - |         100
       1000    chr18   27950966-28177130      - |        1000
  100009613    chr11   70072434-70075433      - |   100009613
        ...      ...                 ...    ... .         ...
       9991     chr9 112217716-112333667      - |        9991
       9992    chr21   34364024-34371389      + |        9992
       9993    chr22   19036282-19122454      - |        9993
       9994     chr6   89829894-89874436      + |        9994
       9997    chr22   50523568-50526461      - |        9997
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

show 函数显示 GRanges 结构,可见用竖线分为 2 部分,左边为基因组座标信息,包含 seqnames,ranges 等;右边为 metadata 存储相关联的信息,具备非常高的可拓展性。

左边信息根据列名,GenomicRanges 定义了相应函数提取详细信息。

> seqnames(hg38Gene)
factor-Rle of length 25750 with 19826 runs
  Lengths:     1     1     1     1     1     1 ...     1     1     1     1     1
  Values : chr19 chr8  chr20 chr18 chr11 chr10 ... chr9  chr21 chr22 chr6  chr22
Levels(595): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1

> ranges(hg38Gene)
IRanges object with 25750 ranges and 0 metadata columns:
                start       end     width
            <integer> <integer> <integer>
          1  58345178  58362751     17574
         10  18391282  18401218      9937
        100  44619522  44652233     32712
       1000  27950966  28177130    226165
  100009613  70072434  70075433      3000
        ...       ...       ...       ...
       9991 112217716 112333667    115952
       9992  34364024  34371389      7366
       9993  19036282  19122454     86173
       9994  89829894  89874436     44543
       9997  50523568  50526461      2894

granges 函数提取左边信息。

> granges(hg38Gene)
GRanges object with 25750 ranges and 0 metadata columns:
            seqnames              ranges strand
               <Rle>           <IRanges>  <Rle>
          1    chr19   58345178-58362751      -
         10     chr8   18391282-18401218      +
        100    chr20   44619522-44652233      -
       1000    chr18   27950966-28177130      -
  100009613    chr11   70072434-70075433      -
        ...      ...                 ...    ...
       9991     chr9 112217716-112333667      -
       9992    chr21   34364024-34371389      +
       9993    chr22   19036282-19122454      -
       9994     chr6   89829894-89874436      +
       9997    chr22   50523568-50526461      -
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

mcols 函数提取 metadata 信息。

> mcols(hg38Gene)
DataFrame with 25750 rows and 1 column
              gene_id
          <character>
1                   1
10                 10
100               100
1000             1000
100009613   100009613
...               ...
9991             9991
9992             9992
9993             9993
9994             9994
9997             9997

也可以修改 metadata, 比如:

> mcols(hg38Gene)$test <- "something"
> show(hg38Gene)
GRanges object with 25750 ranges and 2 metadata columns:
            seqnames              ranges strand |     gene_id        test
               <Rle>           <IRanges>  <Rle> | <character> <character>
          1    chr19   58345178-58362751      - |           1   something
         10     chr8   18391282-18401218      + |          10   something
        100    chr20   44619522-44652233      - |         100   something
       1000    chr18   27950966-28177130      - |        1000   something
  100009613    chr11   70072434-70075433      - |   100009613   something
        ...      ...                 ...    ... .         ...         ...
       9991     chr9 112217716-112333667      - |        9991   something
       9992    chr21   34364024-34371389      + |        9992   something
       9993    chr22   19036282-19122454      - |        9993   something
       9994     chr6   89829894-89874436      + |        9994   something
       9997    chr22   50523568-50526461      - |        9997   something
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

GRanges 可以取子集,类似于向量进行切片操作。

> hg38Gene[1:3]
GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     gene_id        test
         <Rle>         <IRanges>  <Rle> | <character> <character>
    1    chr19 58345178-58362751      - |           1   something
   10     chr8 18391282-18401218      + |          10   something
  100    chr20 44619522-44652233      - |         100   something
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome
  
# 指定需要的 metadata 列
> hg38Gene[1:3, "gene_id"]
GRanges object with 3 ranges and 1 metadata column:
      seqnames            ranges strand |     gene_id
         <Rle>         <IRanges>  <Rle> | <character>
    1    chr19 58345178-58362751      - |           1
   10     chr8 18391282-18401218      + |          10
  100    chr20 44619522-44652233      - |         100
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

而用 split 函数可以将 GRanges 分割为多个 GRanges 对象,从而组织成 GRangesList 对象。

> sp <- split(hg38Gene, rep_len(1:2, length.out = 25750))
> show(sp)
GRangesList object of length 2:
$`1`
GRanges object with 12875 ranges and 2 metadata columns:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
          1    chr19   58345178-58362751      - |           1
        100    chr20   44619522-44652233      - |         100
  100009613    chr11   70072434-70075433      - |   100009613
  100009676     chr3 101676475-101679217      + |   100009676
      10002    chr15   71792638-71818259      + |       10002
        ...      ...                 ...    ... .         ...
       9987     chr4   82422564-82430408      - |        9987
       9989    chr18     9546791-9615240      - |        9989
       9990    chr15   34229784-34338060      - |        9990
       9992    chr21   34364024-34371389      + |        9992
       9994     chr6   89829894-89874436      + |        9994
                   test
            <character>
          1   something
        100   something
  100009613   something
  100009676   something
      10002   something
        ...         ...
       9987   something
       9989   something
       9990   something
       9992   something
       9994   something
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

$`2`
GRanges object with 12875 ranges and 2 metadata columns:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
         10     chr8   18391282-18401218      + |          10
       1000    chr18   27950966-28177130      - |        1000
  100009667    chr10   68010205-68010862      - |   100009667
      10001    chr14   70581257-70641204      - |       10001
  100033411    chr16   75693893-75701461      - |   100033411
        ...      ...                 ...    ... .         ...
       9988     chr7   87152361-87196337      + |        9988
        999    chr16   68737292-68835541      + |         999
       9991     chr9 112217716-112333667      - |        9991
       9993    chr22   19036282-19122454      - |        9993
       9997    chr22   50523568-50526461      - |        9997
                   test
            <character>
         10   something
       1000   something
  100009667   something
      10001   something
  100033411   something
        ...         ...
       9988   something
        999   something
       9991   something
       9993   something
       9997   something
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

多个 GRanges 对象可以用 c 或者 append 函数合并。

> c(sp[[1]], sp[[2]])
GRanges object with 25750 ranges and 2 metadata columns:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
          1    chr19   58345178-58362751      - |           1
        100    chr20   44619522-44652233      - |         100
  100009613    chr11   70072434-70075433      - |   100009613
  100009676     chr3 101676475-101679217      + |   100009676
      10002    chr15   71792638-71818259      + |       10002
        ...      ...                 ...    ... .         ...
       9988     chr7   87152361-87196337      + |        9988
        999    chr16   68737292-68835541      + |         999
       9991     chr9 112217716-112333667      - |        9991
       9993    chr22   19036282-19122454      - |        9993
       9997    chr22   50523568-50526461      - |        9997
                   test
            <character>
          1   something
        100   something
  100009613   something
  100009676   something
      10002   something
        ...         ...
       9988   something
        999   something
       9991   something
       9993   something
       9997   something
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

对基因区域记录的修改分为 intra-range methods,inter-range methods,between-range methods 三类。intra-range methods 单独修改每一个区域记录,有 flank,resize,shift 等函数。比如用 shift 函数移动每个区域 100bp.

> shift(hg38Gene, 100)
GRanges object with 25750 ranges and 2 metadata columns:
            seqnames              ranges strand |     gene_id
               <Rle>           <IRanges>  <Rle> | <character>
          1    chr19   58345278-58362851      - |           1
         10     chr8   18391382-18401318      + |          10
        100    chr20   44619622-44652333      - |         100
       1000    chr18   27951066-28177230      - |        1000
  100009613    chr11   70072534-70075533      - |   100009613
        ...      ...                 ...    ... .         ...
       9991     chr9 112217816-112333767      - |        9991
       9992    chr21   34364124-34371489      + |        9992
       9993    chr22   19036382-19122554      - |        9993
       9994     chr6   89829994-89874536      + |        9994
       9997    chr22   50523668-50526561      - |        9997
                   test
            <character>
          1   something
         10   something
        100   something
       1000   something
  100009613   something
        ...         ...
       9991   something
       9992   something
       9993   something
       9994   something
       9997   something
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

Inter-range methods 涉及多个区域记录的比较,包含 reduce,gaps,coverage 等函数。如 coverage 函数统计区域间重叠情况。

> coverage(hg38Gene)
RleList of length 595
$chr1
integer-Rle of length 248956422 with 5195 runs
  Lengths:  11868   2535      6  15161 ...  47051  13751  36476
  Values :      0      1      2      1 ...      0      1      0

$chr2
integer-Rle of length 242193529 with 3397 runs
  Lengths:  38813   8057 171265  46731 ...   5123  34215 424713
  Values :      0      1      0      1 ...      0      1      0

$chr3
integer-Rle of length 198295559 with 2954 runs
  Lengths:  11744  13105  84857  87056 ...    914   4486 251839
  Values :      0      1      0      1 ...      2      1      0

$chr4
integer-Rle of length 190214555 with 2071 runs
  Lengths:   53285   34923   36292 ... 1278990    3363  146691
  Values :       0       1       0 ...       0       1       0

$chr5
integer-Rle of length 181538259 with 2377 runs
  Lengths:  92150  97822   1522   4840 ...  94960    995 169997
  Values :      0      1      0      1 ...      0      1      0

...
<590 more elements>

Between-range methods 是多个 GRanges 对象间操作,包含 union,intersect,setdiff 等函数。如 intersect 取 2 个 GRanges 交集。

> hg38Gene2 <- hg38Gene[3:5]
> intersect(hg38Gene, hg38Gene2)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames            ranges strand
         <Rle>         <IRanges>  <Rle>
  [1]    chr11 70072434-70075433      -
  [2]    chr18 27950966-28177130      -
  [3]    chr20 44619522-44652233      -
  -------
  seqinfo: 595 sequences (1 circular) from hg38 genome

参考资料
An Introduction to the GenomicRanges Package

相关文章

网友评论

    本文标题:GenomicRanges 数据结构

    本文链接:https://www.haomeiwen.com/subject/dthyultx.html