美文网首页生物信息学生物信息数据科学
46.《Bioinformatics Data Skills》之

46.《Bioinformatics Data Skills》之

作者: DataScience | 来源:发表于2021-07-15 20:47 被阅读0次

1. 创建GRanges对象

1.1. 基本信息

前面我们学习了IRanges对象,GenomicRanges包引入了新的对象叫做GRanges,额外包含了序列名字(即染色体编号)与链信息:

> library(GenomicRanges)
> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"))
> gr
GRanges object with 4 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      5-14      +
  [2]     chr1      6-15      -
  [3]     chr2      7-16      -
  [4]     chr3      8-17      +
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

1.2. 元数据

GRanges还可以添加任意元数据(meta-data),例如添加GC含量列:

> gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"),
gc = round(runif(4), 3))
> gr
GRanges object with 4 ranges and 1 metadata column:
      seqnames    ranges strand |        gc
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     chr1      5-14      + |  0.821266
  [2]     chr1      6-15      - |  0.119132
  [3]     chr2      7-16      - |  0.764710
  [4]     chr3      8-17      + |  0.827638
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

2. 定义序列长度

由于不同版本的参考基因组的差异,需要自己设定染色体的长度,才能准确地计算测序深度等信息(没有序列长度的话只能知道范围的终点,而不是序列的终点)。有两种方式设定序列长度,其一是直接在生成GRanges时定义:

gr <- GRanges(seqname=c("chr1", "chr1", "chr2", "chr3"),
ranges=IRanges(start=5:8, width=10),
strand=c("+", "-", "-", "+"),
gc = round(runif(4), 3),
seqlengths=c(chr1=152, chr2=432, chr3=903))
> gr
GRanges object with 4 ranges and 1 metadata column:
      seqnames    ranges strand |        gc
         <Rle> <IRanges>  <Rle> | <numeric>
  [1]     chr1      5-14      + |  0.185549
  [2]     chr1      6-15      - |  0.295794
  [3]     chr2      7-16      - |  0.978961
  [4]     chr3      8-17      + |  0.352265
  -------
  seqinfo: 3 sequences from an unspecified genome

其二是生成GRanges之后使用seqlengths函数设定:

seqlengths(gr) <- c(chr1=152, chr2=432, chr3=903)

3. 信息提取

GRanges基本信息提取方式与IRanges相同(见IRanges基本操作),如:

> start(gr)
[1] 5 6 7 8
> end(gr)
[1] 14 15 16 17
> width(gr)
[1] 10 10 10 10

同样可以查看长度,设定名字:

> length(gr)
[1] 4
> names(gr) <- letters[1:length(gr)]
> gr
GRanges object with 4 ranges and 1 metadata column:
    seqnames    ranges strand |        gc
       <Rle> <IRanges>  <Rle> | <numeric>
  a     chr1      5-14      + |  0.185549
  b     chr1      6-15      - |  0.295794
  c     chr2      7-16      - |  0.978961
  d     chr3      8-17      + |  0.352265
  -------
  seqinfo: 3 sequences from an unspecified genome

GRanges通过额外的函数提取seqnamesstrand信息:

> seqnames(gr)
factor-Rle of length 4 with 3 runs
  Lengths:    2    1    1
  Values : chr1 chr2 chr3
Levels(3): chr1 chr2 chr3
> strand(gr)
factor-Rle of length 4 with 3 runs
  Lengths: 1 2 1
  Values : + - +
Levels(3): + - *

这两者都采用了Rle对象存储

通过mcols函数提取所有元数据列:

> mcols(gr)
DataFrame with 4 rows and 1 column
         gc
  <numeric>
a  0.185549
b  0.295794
c  0.978961
d  0.352265

元数据列采用的是DataFrame数据格式,是更加扩展的data.frame格式,可以支持run-length编码数据以及更多的列操作。

可以使用ranges函数提取gr的所有范围:

> ranges(gr)
IRanges object with 4 ranges and 0 metadata columns:
        start       end     width
    <integer> <integer> <integer>
  a         5        14        10
  b         6        15        10
  c         7        16        10
  d         8        17        10

4. 提取范围子集

GRanges取子集的操作和IRanges类似,如取起始位点大于7的范围:

> start(gr) > 7
[1] FALSE FALSE FALSE  TRUE
> gr[start(gr) > 7]
GRanges object with 1 range and 1 metadata column:
    seqnames    ranges strand |        gc
       <Rle> <IRanges>  <Rle> | <numeric>
  d     chr3      8-17      + |  0.352265
  -------
  seqinfo: 3 sequences from an unspecified genome

提取1号染色体上的所有范围:

> gr[seqnames(gr) == "chr1"]
GRanges object with 2 ranges and 1 metadata column:
    seqnames    ranges strand |        gc
       <Rle> <IRanges>  <Rle> | <numeric>
  a     chr1      5-14      + |  0.185549
  b     chr1      6-15      - |  0.295794
  -------
  seqinfo: 3 sequences from an unspecified genome

结合以上内容,可以容易地进行更加复杂的操作,例如计算1号染色体上所有范围GC含量平均值:

> mean(mcols(gr[seqnames(gr) == "chr1"])$gc)
[1] 0.2406719

相关文章

网友评论

    本文标题:46.《Bioinformatics Data Skills》之

    本文链接:https://www.haomeiwen.com/subject/bymmpltx.html