- 46.《Bioinformatics Data Skills》之
- 28.《Bioinformatics-Data-Skills》之
- 18.《Bioinformatics-Data-Skills》之
- 19.《Bioinformatics-Data-Skills》之
- 【shell笔记>生信|专项】生信数据处理技能手札(3):
- Bioinformatics Data Skills
- 17.《Bioinformatics-Data-Skills》之
- 25.《Bioinformatics-Data-Skills》之
- 25.《Bioinformatics-Data-Skills》之
- 23.《Bioinformatics-Data-Skills》之
1. 创建GRanges对象
1.1. 基本信息
前面我们学习了IRanges
对象,GenomicRanges
包引入了新的对象叫做GRanges
,额外包含了序列名字(即染色体编号)与链信息:
> library(GenomicRanges)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"))
> gr
GRanges object with 4 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 5-14 +
[2] chr1 6-15 -
[3] chr2 7-16 -
[4] chr3 8-17 +
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
1.2. 元数据
GRanges
还可以添加任意元数据(meta-data),例如添加GC含量列:
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"),
gc = round(runif(4), 3))
> gr
GRanges object with 4 ranges and 1 metadata column:
seqnames ranges strand | gc
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 5-14 + | 0.821266
[2] chr1 6-15 - | 0.119132
[3] chr2 7-16 - | 0.764710
[4] chr3 8-17 + | 0.827638
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
2. 定义序列长度
由于不同版本的参考基因组的差异,需要自己设定染色体的长度,才能准确地计算测序深度等信息(没有序列长度的话只能知道范围的终点,而不是序列的终点)。有两种方式设定序列长度,其一是直接在生成GRanges
时定义:
gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"),
gc = round(runif(4), 3),
seqlengths=c(chr1=152, chr2=432, chr3=903))
> gr
GRanges object with 4 ranges and 1 metadata column:
seqnames ranges strand | gc
<Rle> <IRanges> <Rle> | <numeric>
[1] chr1 5-14 + | 0.185549
[2] chr1 6-15 - | 0.295794
[3] chr2 7-16 - | 0.978961
[4] chr3 8-17 + | 0.352265
-------
seqinfo: 3 sequences from an unspecified genome
其二是生成GRanges
之后使用seqlengths
函数设定:
seqlengths(gr) <- c(chr1=152, chr2=432, chr3=903)
3. 信息提取
GRanges
基本信息提取方式与IRanges
相同(见IRanges基本操作),如:
> start(gr)
[1] 5 6 7 8
> end(gr)
[1] 14 15 16 17
> width(gr)
[1] 10 10 10 10
同样可以查看长度,设定名字:
> length(gr)
[1] 4
> names(gr) <- letters[1:length(gr)]
> gr
GRanges object with 4 ranges and 1 metadata column:
seqnames ranges strand | gc
<Rle> <IRanges> <Rle> | <numeric>
a chr1 5-14 + | 0.185549
b chr1 6-15 - | 0.295794
c chr2 7-16 - | 0.978961
d chr3 8-17 + | 0.352265
-------
seqinfo: 3 sequences from an unspecified genome
GRanges
通过额外的函数提取seqnames
,strand
信息:
> seqnames(gr)
factor-Rle of length 4 with 3 runs
Lengths: 2 1 1
Values : chr1 chr2 chr3
Levels(3): chr1 chr2 chr3
> strand(gr)
factor-Rle of length 4 with 3 runs
Lengths: 1 2 1
Values : + - +
Levels(3): + - *
这两者都采用了Rle
对象存储。
通过mcols
函数提取所有元数据列:
> mcols(gr)
DataFrame with 4 rows and 1 column
gc
<numeric>
a 0.185549
b 0.295794
c 0.978961
d 0.352265
元数据列采用的是DataFrame数据格式,是更加扩展的data.frame格式,可以支持run-length编码数据以及更多的列操作。
可以使用ranges
函数提取gr
的所有范围:
> ranges(gr)
IRanges object with 4 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
a 5 14 10
b 6 15 10
c 7 16 10
d 8 17 10
4. 提取范围子集
GRanges
取子集的操作和IRanges
类似,如取起始位点大于7的范围:
> start(gr) > 7
[1] FALSE FALSE FALSE TRUE
> gr[start(gr) > 7]
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gc
<Rle> <IRanges> <Rle> | <numeric>
d chr3 8-17 + | 0.352265
-------
seqinfo: 3 sequences from an unspecified genome
提取1号染色体上的所有范围:
> gr[seqnames(gr) == "chr1"]
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gc
<Rle> <IRanges> <Rle> | <numeric>
a chr1 5-14 + | 0.185549
b chr1 6-15 - | 0.295794
-------
seqinfo: 3 sequences from an unspecified genome
结合以上内容,可以容易地进行更加复杂的操作,例如计算1号染色体上所有范围GC含量平均值:
> mean(mcols(gr[seqnames(gr) == "chr1"])$gc)
[1] 0.2406719
网友评论