很久没遇到转录组的报错了,最近在曾老师3月的生信入门马拉松课里有个学员遇到了这样的疑问。
在转录组分析中,用feature counts进行定量时出现了这种报错

比如我查看gtf的时候,其第9列明明包含了gene_id,但仍会报错。

早在曾老师那里实习时,我遇到过这个问题,归根结底是因为物种不常见,基因组和注释文件都不够完善。而且不同的人上传的gtf格式还是略有不同的,如果选中gene_id作为-g参数会报错,可能是gene_id存在空值,去掉就好了。
查看gtf文件
cat genomic.gtf | grep gene_id\ \"\"\ # 四个\是转义符,将后面的字符转义为字符串
NC_001788.1 RefSeq exon 1 71 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Phe"; exon_number "1";
NC_001788.1 RefSeq exon 72 1045 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "rRNA"; product "s-rRNA"; exon_number "1";
NC_001788.1 RefSeq exon 1046 1112 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Val"; exon_number "1";
NC_001788.1 RefSeq exon 1113 2692 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "rRNA"; product "l-rRNA"; exon_number "1";
NC_001788.1 RefSeq exon 2693 2766 . + . gene_id ""; transcript_id "unknown_transcript_1"; codons "2,3"; gbkey "tRNA"; product "tRNA-Leu"; exon_number "1";
NC_001788.1 RefSeq exon 3725 3793 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Ile"; exon_number "1";
NC_001788.1 RefSeq exon 3791 3863 . - . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Gln"; exon_number "1";
NC_001788.1 RefSeq exon 3866 3934 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Met"; exon_number "1";
NC_001788.1 RefSeq exon 4974 5042 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Trp"; exon_number "1";
NC_001788.1 RefSeq exon 5048 5116 . - . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Ala"; exon_number "1";
NC_001788.1 RefSeq exon 5118 5190 . - . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Asn"; exon_number "1";
NC_001788.1 RefSeq exon 5223 5288 . - . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Cys"; exon_number "1";
NC_001788.1 RefSeq exon 5289 5355 . - . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Tyr"; exon_number "1";
NC_001788.1 RefSeq exon 6899 6967 . - . gene_id ""; transcript_id "unknown_transcript_1"; codons "4,5,6,7"; gbkey "tRNA"; product "tRNA-Ser"; exon_number "1";
NC_001788.1 RefSeq exon 6976 7042 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Asp"; exon_number "1";
NC_001788.1 RefSeq exon 7730 7798 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Lys"; exon_number "1";
NC_001788.1 RefSeq exon 9425 9494 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Gly"; exon_number "1";
NC_001788.1 RefSeq exon 9842 9910 . + . gene_id ""; transcript_id "unknown_transcript_1"; gbkey "tRNA"; product "tRNA-Arg"; exon_number "1";
处理
通过查看我们发现,gtf文件的gene_id存在一些空值,去掉即可。
sed -i -e '/gene_id\ \"\"\;/d' genomic2.gtf
转录组分析是高度依赖基因组和注释文件的质量的,我们组早在14年测有一个不常见物种的转录组数据,一直到今年才开始分析,就是因为该物种的注释文件去年才发表。如果你觉得现有的注释文件不能满足你,以上我只提供了一种思路,具体的处理方法就见仁见智了。
网友评论