对于提取序列来说,我们一般会选择用samtools、bedtools等,因为他们有非常完善的操作序列的功能,比如取两个bed文件的交集,提取某段序列等等。但实际上,他们对于区间的操作是不如R方便的,比如取基因body,取第一个外显子,取启动子区域等等。所以,我们有时候就会想能否在R里面,花式找了些区间之后,提取他们的序列,实际上,这也是可以的,需要分成几步。
首先我们需要先加载一些包
# BSgenome.Athaliana.TAIR.TAIR9是拟南芥的基因组序列包
# TAIR9和TAIR10是一样的,无非是TAIR10自身的基因注释丰富了些
library(BSgenome.Athaliana.TAIR.TAIR9)
# TxDb.Athaliana.BioMart.plantsmart28是拟南芥的Txdb包
# 就是我们ChIPSeeker做peak注释的那个Txdb
library(TxDb.Athaliana.BioMart.plantsmart28)
Txdb_package <- TxDb.Athaliana.BioMart.plantsmart28
# 也可以用自己的gtf做
# 所以这些操作也可以针对非模式物种
Txdb_gtf <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf")
然后是提取基因的区间
> gene_package <- genes(Txdb_package)
> gene_package
GRanges object with 33602 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 1 3631-5899 + | AT1G01010
AT1G01020 1 5928-8737 - | AT1G01020
AT1G01030 1 11649-13714 - | AT1G01030
AT1G01040 1 23146-31227 + | AT1G01040
AT1G01046 1 28500-28706 + | AT1G01046
... ... ... ... . ...
ATMG01370 Mt 360717-361052 - | ATMG01370
ATMG01380 Mt 361062-361179 - | ATMG01380
ATMG01390 Mt 361350-363284 - | ATMG01390
ATMG01400 Mt 363725-364042 + | ATMG01400
ATMG01410 Mt 366086-366700 - | ATMG01410
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome
# 注意这里有个no seqlengths
> gene_gtf <- genes(Txdb_gtf)
> gene_gtf
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 Chr1 3631-5899 + | AT1G01010
AT1G01020 Chr1 6788-9130 - | AT1G01020
AT1G01030 Chr1 11649-13714 - | AT1G01030
AT1G01040 Chr1 23121-31227 + | AT1G01040
AT1G01050 Chr1 31170-33171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 181268-181347 + | ATMG09730
ATMG09740 ChrM 191954-192025 - | ATMG09740
ATMG09950 ChrM 333651-333725 - | ATMG09950
ATMG09960 ChrM 347266-347348 + | ATMG09960
ATMG09980 ChrM 359666-359739 - | ATMG09980
-------
seqinfo: 7 sequences (2 circular) from an unspecified genome; no seqlengths
然后根据基因区域来提取启动子序列
> promoter_package <- promoters(gene_package,upstream = 4000,downstream = 500)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
GRanges object contains 11 out-of-bound ranges located on sequences 1, 2, 3, 4, and Pt. Note that
ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not
considered out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags
of the underlying sequences). You can use trim() to trim these ranges. See
?`trim,GenomicRanges-method` for more information.
> promoter_gtf <- promoters(gene_gtf,upstream = 4000,downstream = 500)
这里就出现了一个坑了,就是你在用Txdb包的时候,出现了warning,但用自建gtf构建的Txdb没有出现了warning。实际上,这种warning是因为你的区间取过界了。
> promoter_package
GRanges object with 33602 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 1 -369-4130 + | AT1G01010
AT1G01020 1 8238-12737 - | AT1G01020
AT1G01030 1 13215-17714 - | AT1G01030
AT1G01040 1 19146-23645 + | AT1G01040
AT1G01046 1 24500-28999 + | AT1G01046
... ... ... ... . ...
ATMG01370 Mt 360553-365052 - | ATMG01370
ATMG01380 Mt 360680-365179 - | ATMG01380
ATMG01390 Mt 362785-367284 - | ATMG01390
ATMG01400 Mt 359725-364224 + | ATMG01400
ATMG01410 Mt 366201-370700 - | ATMG01410
-------
seqinfo: 7 sequences (1 circular) from an unspecified genome
而promoter_gtf没有报错是因为他不知道你的染色体的长度范围在哪里。
所以下面我的操作都是基于promoter_gtf,这样还可以应用在非模式物种里面。首先你在构建Txdb的时候就需要给他传进去染色体长度了。但你怎么知道每条染色体是多长呢,这就需要你的fai文件了。fai文件是给基因组fa做索引的时候出现的,没有的话你可以直接用samtools faidx做下。
$ samtools faidx Athaliana.fa
# 这里的第2列是染色体长度
Chr1 30427671 6 79 80
Chr2 19698289 30812844 79 80
Chr3 23459830 50760485 79 80
Chr4 18585056 74517281 79 80
Chr5 26975502 93337597 79 80
ChrM 366924 120654568 79 80
ChrC 154478 121026144 79 80
然后我们就可以在构建的时候手动输进对应的长度了,但这只是拟南芥这个组装完整的基因组,如果是一些非模式物种,scaffold级别的话,就会有数百条染色体长度了,根本不可能手动输。所以我们还需要读进R里面进行一些操作。
# 注意不要变成因子
> seq_length <- read.table("~/reference/genome/TAIR10/Athaliana.fa.fai",
+ stringsAsFactors = F)[,1:2]
> seq_length$V1
[1] "Chr1" "Chr2" "Chr3" "Chr4" "Chr5" "ChrM" "ChrC"
> seq_length$V2
[1] 30427671 19698289 23459830 18585056 26975502 366924 154478
> Txdb_gtf <- makeTxDbFromGFF("~/reference/annoation/Athaliana/Araport11/Araport11_GFF3_genes_transposons.201606.gtf",
+ chrominfo = Seqinfo(seqnames = seq_length$V1,
+ seqlengths = seq_length$V2))
这样我们提取启动子序列的时候就会报错了
> gene_gtf <- genes(Txdb_gtf)
# 此时no seqlengths消失了
> gene_gtf
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 Chr1 3631-5899 + | AT1G01010
AT1G01020 Chr1 6788-9130 - | AT1G01020
AT1G01030 Chr1 11649-13714 - | AT1G01030
AT1G01040 Chr1 23121-31227 + | AT1G01040
AT1G01050 Chr1 31170-33171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 181268-181347 + | ATMG09730
ATMG09740 ChrM 191954-192025 - | ATMG09740
ATMG09950 ChrM 333651-333725 - | ATMG09950
ATMG09960 ChrM 347266-347348 + | ATMG09960
ATMG09980 ChrM 359666-359739 - | ATMG09980
-------
seqinfo: 7 sequences from an unspecified genome
# 出现了报错
> promoter_gtf <- promoters(gene_gtf,upstream = 4000,downstream = 500)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
GRanges object contains 15 out-of-bound ranges located on sequences Chr1, Chr2, Chr3, Chr4, Chr5, ChrC, and ChrM.
Note that ranges located on a sequence whose length is unknown (NA) or on a circular sequence are not considered
out-of-bound (use seqlengths() and isCircular() to get the lengths and circularity flags of the underlying
sequences). You can use trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more information.
# 我们用trim函数去掉越过界的区间
# 原本过界的就变成了1了
> promoter_gtf <- trim(promoter_gtf)
> promoter_gtf
GRanges object with 37330 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 Chr1 1-4130 + | AT1G01010
AT1G01020 Chr1 8631-13130 - | AT1G01020
AT1G01030 Chr1 13215-17714 - | AT1G01030
AT1G01040 Chr1 19121-23620 + | AT1G01040
AT1G01050 Chr1 32672-37171 - | AT1G01050
... ... ... ... . ...
ATMG09730 ChrM 177268-181767 + | ATMG09730
ATMG09740 ChrM 191526-196025 - | ATMG09740
ATMG09950 ChrM 333226-337725 - | ATMG09950
ATMG09960 ChrM 343266-347765 + | ATMG09960
ATMG09980 ChrM 359240-363739 - | ATMG09980
-------
seqinfo: 7 sequences from an unspecified genome
然后我们只取前2个基因出来看看效果
# 这边的GRange是带名字的GRange,所以我们取区间的时候可以用基因名字
> promoter_gtf_part <- promoter_gtf[c("AT1G01010","AT1G01020")]
> promoter_gtf_part
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
AT1G01010 Chr1 1-4130 + | AT1G01010
AT1G01020 Chr1 8631-13130 - | AT1G01020
-------
seqinfo: 7 sequences from an unspecified genome
# 用的是Biostrings包的getSeq
# 注意你这里的Grange的seqnames要和genome的seqnames要一致
# 有时候Grange是1,而Genome是Chr1
> getSeq(BSgenome.Athaliana.TAIR.TAIR9,promoter_gtf_part)
A DNAStringSet instance of length 2
width seq names
[1] 4130 CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCC...GGTTTCTGGTAAATGGAAGCTTACCGGAGAATCTGTTGAGGTCAAGG AT1G01010
[2] 4500 ATCCGCAACAATTCACCAATTGAAGAACAAGAGAAAGGTTTAAACTT...GAGAGAGAGCAATGGCGGCGAGTGAACACAGATGCGTGGGATGTGGT AT1G01020
promoter_gtf_part_seq <- getSeq(BSgenome.Athaliana.TAIR.TAIR9,promoter_gtf_part)
# 然后用writeXStringSet写成fa
writeXStringSet(promoter_gtf_part_seq,
filepath = "test.fasta",
format = "fasta")
>AT1G01010
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATG
AATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTAT
CGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGT
GGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATT
TGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTT
GGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTG
网友评论