虽然我们已经找到在R语言中读取GTF文件的最好方法
但是不妨碍我们学习新的方法
实际上rtracklayer::import就是个另类,一般而言大家读入GTF文件后只有9行
refGenome也是,但是他可以把第九行轻松地取出来
install.packages("refGenome")
library(refGenome)
create ensemblGenome object for storing Ensembl genomic annotation data
ens <- ensemblGenome()
read GTF file into ensemblGenome object
read.gtf(ens, "Homo_sapiens.GRCh38.90.chr.gtf")
read.gtf(ens, "Homo_sapiens.GRCh38.90.chr.gtf")
[read.gtf.refGenome] Reading file 'Homo_sapiens.GRCh38.90.chr.gtf'.
[GTF] 2612134 lines processed.
[read.gtf.refGenome] Extracting genes table.
[read.gtf.refGenome] Found 58,243 gene records.
[read.gtf.refGenome] Finished.
看一下他属性
class(ens)
[1] "ensemblGenome"
attr(,"package")
[1] "refGenome"
本次的重点来了create table of genes
my_gene <- getGenePositions(ens)
class(my_gene)
[1] "data.frame"
dim(my_gene)
[1] 58243 26
这26个是变量分别是:
testmygene <- my_gene[1:5,]
View(testmygene)
names(testmygene)
[1] "id" "seqid"
[3] "source" "start"
[5] "end" "score"
[7] "strand" "frame"
[9] "ccds_id" "protein_version"
[11] "protein_id" "exon_version"
[13] "transcript_support_level" "tag"
[15] "exon_number" "gene_id"
[17] "transcript_name" "gene_version"
[19] "gene_name" "gene_source"
[21] "gene_biotype" "transcript_id"
[23] "transcript_source" "exon_id"
[25] "transcript_version" "transcript_biotype"
获取也是非常容易。
其实这个功能完全是可以自己写个函数来搞定的,之前已经写过了。
根据这个帖子结合dplyr,他还可以做很多事情
Read GTF file into R
同时这个文章的作者已经坚持写作7年,是个生物信息学的大神,收入书签!
比如可以每个染色体上有哪些gene类型
my_gene %>% group_by(seqid, gene_biotype) %>% summarise(count = n()) -> my_tally
ggplot(my_tally, aes(x =seqid, y = log2(count))) +
geom_bar(aes(fill = gene_biotype), stat = 'identity', position = 'dodge')
mark
网友评论