在R语言中读取GTF文件的最好方法

作者: 9d760c7ce737 | 来源:发表于2017-11-23 16:49 被阅读1385次

在R中使用refGenome阅读GTF格式的文件
在R语言中读取GTF文件的最好方法
R语言读取gtf文件
annovar注释非常规参考基因组的snp
GTF文件读取
R语言 CSV文件
R语言读取文件
gffread软件使用教程
爬虫_014_文件操作
常用读写数据

下载的TCGA数据是没有注释的，需要从ensemble上面下载GTF文件，现在需要把GTF文件读入R

第一种方法rtracklayer::import

source("https://bioconductor.org/biocLite.R")
biocLite("rtracklayer")
biocLite("SummarizedExperiment")
gtf1 <- rtracklayer::import('Homo_sapiens.GRCh38.90.chr.gtf')
gtf_df <- as.data.frame(gtf1)

最终读入27个变量，2612129个观测，测试一下显示的不错

test <- gtf_df[1:5,]
View(test)

取出我需要的gene_id,gene_biotype,gene_name，成为新的数据框

geneid_df <- dplyr::select(gtf_df,c(gene_name,gene_id,gene_biotype))

第二种方法read.table

gtf2 <- read.table('Homo_sapiens.GRCh38.90.chr.gtf', header = FALSE, sep = '\t')

最后发现，速度奇慢无比,取消参考了Y叔的帖子增加参数，读入成功，需要1分钟,读入9个变量
根据GTF画基因的多个转录本结构

gtf2 <- read.table('Homo_sapiens.GRCh38.90.chr.gtf', stringsAsFactors = F, header = FALSE, sep = '\t',comment = "#")
test2  <- gtf2[1:5,]

第三种方法readr包里面的read_table2

速度很快，有进度条，读入了18个变量，但是最gene_biotype显示不对

gtf3 <- readr::read_table2("Homo_sapiens.GRCh38.90.chr.gtf",comment = "#")
gtf_df3 <- as.data.frame(gtf3) #转成data.frame
test3 <- gtf_df3[1:5,]
View(test3)

第四种方法read.table，fread

读入数据跟read.table一样，

library(data.table)
genes <- fread("Homo_sapiens.GRCh38.90.chr.gtf")
test4<- genes[1:5,]
View(test4)

上一步用时12秒，速度极快,只是读入后没有名字，需要自己添加
读入9个变量,跟read.table一样

setnames(genes, names(genes), c("chr","source","type","start","end","score","strand","phase","attributes") )

可选步骤，type那一项有转录本和gene，选取gene，其他有很多转录本

genes <- genes[type == "gene"]

我要的信息粗存在attributes中，构建函数提取

extract_attributes <- function(gtf_attributes, att_of_interest){
  att <- strsplit(gtf_attributes,"; ")
  att <- gsub("\"","",unlist(att))
  if(!is.null(unlist(strsplit(att[grep(att_of_interest, att)], " ")))){
    return( unlist(strsplit(att[grep(att_of_interest, att)], " "))[2])
  }else{
    return(NA)}
}

举例子解释,加入我们需要gene_id

gtf_attributes <- genes[1,9]
att <- strsplit(gtf_attributes,"; ")

这一步提示non-character argument，后面我查询才知道, 对于第九列的第一行的的获取两种方式返回的格式不一样

genes[1,9] #data.frame
genes$attributes[1] #character

而strsplit的对象是character,重来

gtf_attributes <- genes$attributes[1]
att <- strsplit(gtf_attributes,"; ")
att <- gsub("\"","",unlist(att))

grep获取位置，获取到内容后分隔,得到的第二个元素是我们需要的

unlist(strsplit(att[grep("gene_id", att)], " "))[2]

利用函数获得基因列表

genes$gene_id <- unlist(lapply(genes$attributes, extract_attributes, "gene_id"))

如果我们选择使用第一种方法rtracklayer::import，这些事情都是不需要做的！！！

总结：rtracklayer::import最简单，fread最快，read.table可选，read_table没戏！

网友评论

本文标题：在R语言中读取GTF文件的最好方法

本文链接：https://www.haomeiwen.com/subject/mwnzvxtx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

在R语言中读取GTF文件的最好方法

第一种方法rtracklayer::import

第二种方法read.table

第三种方法readr包里面的read_table2

第四种方法read.table，fread

总结：rtracklayer::import最简单，fread最快，read.table可选，read_table没戏！

相关文章

在R中使用refGenome阅读GTF格式的文件

在R语言中读取GTF文件的最好方法

R语言读取gtf文件

annovar注释非常规参考基因组的snp

GTF文件读取

R语言 CSV文件

R语言读取文件

gffread软件使用教程

爬虫_014_文件操作

常用读写数据

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读