R 包 GenomicRanges 是 Bioconductor 项目承载基因组位置信息的基础包,为诸如 BSgenome,rtracklayer,VariantAnnotation 及其他 R 包提供了基础数据结构。
GenomicRanges 包含了 3 类对象:GRanges,GPos,GRangesList. 本文主要讨论 GRanges 结构和操作函数。
GRanges 对象是基因组范围(Genomic Ranges)的集合,每个基因组范围有一个在基因组的起点和终点位置。GRanges 可用于存储诸如转录本、外显子等基因组特征。
下面代码用 genes
函数提取 hg38 基因。
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
hg38Gene <- genes(txdb)
用 class
可以查看它属于 GRanges 对象。
> class(hg38Gene)
[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"
> show(hg38Gene)
GRanges object with 25750 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 58345178-58362751 - | 1
10 chr8 18391282-18401218 + | 10
100 chr20 44619522-44652233 - | 100
1000 chr18 27950966-28177130 - | 1000
100009613 chr11 70072434-70075433 - | 100009613
... ... ... ... . ...
9991 chr9 112217716-112333667 - | 9991
9992 chr21 34364024-34371389 + | 9992
9993 chr22 19036282-19122454 - | 9993
9994 chr6 89829894-89874436 + | 9994
9997 chr22 50523568-50526461 - | 9997
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
用 show
函数显示 GRanges 结构,可见用竖线分为 2 部分,左边为基因组座标信息,包含 seqnames,ranges 等;右边为 metadata 存储相关联的信息,具备非常高的可拓展性。
左边信息根据列名,GenomicRanges 定义了相应函数提取详细信息。
> seqnames(hg38Gene)
factor-Rle of length 25750 with 19826 runs
Lengths: 1 1 1 1 1 1 ... 1 1 1 1 1
Values : chr19 chr8 chr20 chr18 chr11 chr10 ... chr9 chr21 chr22 chr6 chr22
Levels(595): chr1 chr2 chr3 ... chrUn_KI270756v1 chrUn_KI270757v1
> ranges(hg38Gene)
IRanges object with 25750 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
1 58345178 58362751 17574
10 18391282 18401218 9937
100 44619522 44652233 32712
1000 27950966 28177130 226165
100009613 70072434 70075433 3000
... ... ... ...
9991 112217716 112333667 115952
9992 34364024 34371389 7366
9993 19036282 19122454 86173
9994 89829894 89874436 44543
9997 50523568 50526461 2894
用 granges
函数提取左边信息。
> granges(hg38Gene)
GRanges object with 25750 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
1 chr19 58345178-58362751 -
10 chr8 18391282-18401218 +
100 chr20 44619522-44652233 -
1000 chr18 27950966-28177130 -
100009613 chr11 70072434-70075433 -
... ... ... ...
9991 chr9 112217716-112333667 -
9992 chr21 34364024-34371389 +
9993 chr22 19036282-19122454 -
9994 chr6 89829894-89874436 +
9997 chr22 50523568-50526461 -
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
用 mcols
函数提取 metadata 信息。
> mcols(hg38Gene)
DataFrame with 25750 rows and 1 column
gene_id
<character>
1 1
10 10
100 100
1000 1000
100009613 100009613
... ...
9991 9991
9992 9992
9993 9993
9994 9994
9997 9997
也可以修改 metadata, 比如:
> mcols(hg38Gene)$test <- "something"
> show(hg38Gene)
GRanges object with 25750 ranges and 2 metadata columns:
seqnames ranges strand | gene_id test
<Rle> <IRanges> <Rle> | <character> <character>
1 chr19 58345178-58362751 - | 1 something
10 chr8 18391282-18401218 + | 10 something
100 chr20 44619522-44652233 - | 100 something
1000 chr18 27950966-28177130 - | 1000 something
100009613 chr11 70072434-70075433 - | 100009613 something
... ... ... ... . ... ...
9991 chr9 112217716-112333667 - | 9991 something
9992 chr21 34364024-34371389 + | 9992 something
9993 chr22 19036282-19122454 - | 9993 something
9994 chr6 89829894-89874436 + | 9994 something
9997 chr22 50523568-50526461 - | 9997 something
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
GRanges 可以取子集,类似于向量进行切片操作。
> hg38Gene[1:3]
GRanges object with 3 ranges and 2 metadata columns:
seqnames ranges strand | gene_id test
<Rle> <IRanges> <Rle> | <character> <character>
1 chr19 58345178-58362751 - | 1 something
10 chr8 18391282-18401218 + | 10 something
100 chr20 44619522-44652233 - | 100 something
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
# 指定需要的 metadata 列
> hg38Gene[1:3, "gene_id"]
GRanges object with 3 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 58345178-58362751 - | 1
10 chr8 18391282-18401218 + | 10
100 chr20 44619522-44652233 - | 100
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
而用 split
函数可以将 GRanges 分割为多个 GRanges 对象,从而组织成 GRangesList 对象。
> sp <- split(hg38Gene, rep_len(1:2, length.out = 25750))
> show(sp)
GRangesList object of length 2:
$`1`
GRanges object with 12875 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 58345178-58362751 - | 1
100 chr20 44619522-44652233 - | 100
100009613 chr11 70072434-70075433 - | 100009613
100009676 chr3 101676475-101679217 + | 100009676
10002 chr15 71792638-71818259 + | 10002
... ... ... ... . ...
9987 chr4 82422564-82430408 - | 9987
9989 chr18 9546791-9615240 - | 9989
9990 chr15 34229784-34338060 - | 9990
9992 chr21 34364024-34371389 + | 9992
9994 chr6 89829894-89874436 + | 9994
test
<character>
1 something
100 something
100009613 something
100009676 something
10002 something
... ...
9987 something
9989 something
9990 something
9992 something
9994 something
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
$`2`
GRanges object with 12875 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
10 chr8 18391282-18401218 + | 10
1000 chr18 27950966-28177130 - | 1000
100009667 chr10 68010205-68010862 - | 100009667
10001 chr14 70581257-70641204 - | 10001
100033411 chr16 75693893-75701461 - | 100033411
... ... ... ... . ...
9988 chr7 87152361-87196337 + | 9988
999 chr16 68737292-68835541 + | 999
9991 chr9 112217716-112333667 - | 9991
9993 chr22 19036282-19122454 - | 9993
9997 chr22 50523568-50526461 - | 9997
test
<character>
10 something
1000 something
100009667 something
10001 something
100033411 something
... ...
9988 something
999 something
9991 something
9993 something
9997 something
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
多个 GRanges 对象可以用 c
或者 append
函数合并。
> c(sp[[1]], sp[[2]])
GRanges object with 25750 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 58345178-58362751 - | 1
100 chr20 44619522-44652233 - | 100
100009613 chr11 70072434-70075433 - | 100009613
100009676 chr3 101676475-101679217 + | 100009676
10002 chr15 71792638-71818259 + | 10002
... ... ... ... . ...
9988 chr7 87152361-87196337 + | 9988
999 chr16 68737292-68835541 + | 999
9991 chr9 112217716-112333667 - | 9991
9993 chr22 19036282-19122454 - | 9993
9997 chr22 50523568-50526461 - | 9997
test
<character>
1 something
100 something
100009613 something
100009676 something
10002 something
... ...
9988 something
999 something
9991 something
9993 something
9997 something
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
对基因区域记录的修改分为 intra-range methods,inter-range methods,between-range methods 三类。intra-range methods 单独修改每一个区域记录,有 flank,resize,shift 等函数。比如用 shift
函数移动每个区域 100bp.
> shift(hg38Gene, 100)
GRanges object with 25750 ranges and 2 metadata columns:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 58345278-58362851 - | 1
10 chr8 18391382-18401318 + | 10
100 chr20 44619622-44652333 - | 100
1000 chr18 27951066-28177230 - | 1000
100009613 chr11 70072534-70075533 - | 100009613
... ... ... ... . ...
9991 chr9 112217816-112333767 - | 9991
9992 chr21 34364124-34371489 + | 9992
9993 chr22 19036382-19122554 - | 9993
9994 chr6 89829994-89874536 + | 9994
9997 chr22 50523668-50526561 - | 9997
test
<character>
1 something
10 something
100 something
1000 something
100009613 something
... ...
9991 something
9992 something
9993 something
9994 something
9997 something
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
Inter-range methods 涉及多个区域记录的比较,包含 reduce,gaps,coverage 等函数。如 coverage
函数统计区域间重叠情况。
> coverage(hg38Gene)
RleList of length 595
$chr1
integer-Rle of length 248956422 with 5195 runs
Lengths: 11868 2535 6 15161 ... 47051 13751 36476
Values : 0 1 2 1 ... 0 1 0
$chr2
integer-Rle of length 242193529 with 3397 runs
Lengths: 38813 8057 171265 46731 ... 5123 34215 424713
Values : 0 1 0 1 ... 0 1 0
$chr3
integer-Rle of length 198295559 with 2954 runs
Lengths: 11744 13105 84857 87056 ... 914 4486 251839
Values : 0 1 0 1 ... 2 1 0
$chr4
integer-Rle of length 190214555 with 2071 runs
Lengths: 53285 34923 36292 ... 1278990 3363 146691
Values : 0 1 0 ... 0 1 0
$chr5
integer-Rle of length 181538259 with 2377 runs
Lengths: 92150 97822 1522 4840 ... 94960 995 169997
Values : 0 1 0 1 ... 0 1 0
...
<590 more elements>
Between-range methods 是多个 GRanges 对象间操作,包含 union,intersect,setdiff 等函数。如 intersect 取 2 个 GRanges 交集。
> hg38Gene2 <- hg38Gene[3:5]
> intersect(hg38Gene, hg38Gene2)
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr11 70072434-70075433 -
[2] chr18 27950966-28177130 -
[3] chr20 44619522-44652233 -
-------
seqinfo: 595 sequences (1 circular) from hg38 genome
网友评论