htseq-count的一个坑

作者: TOP生物信息 | 来源:发表于2019-03-10 18:56 被阅读0次

htseq-count的一个坑
samtools sort
mRNA-seq学习（三）：htseq-count计数
HTSeq-count原理
htseq-count file merge
Hisat2+FeatureCounts+DESeq2流程+作图
usage: htseq-count
表达矩阵的合并
【小结】2018-10-16目前遇到的问题和可能的解决方案
【转录组】基因定量分析 featureCounts的使用

已经在多个地方看到有人吐槽htseq-count了，原因就是它对bam的排序有严格要求，必须是按照reads name来排序才能运行。本着”实践是检验真理的唯一标准“的理念，我自己又试了一下。

#SRR3286802.s.bam已经按坐标排好序了
htseq-count -s no -r pos -f bam /map/SRR3286802.s.bam /ref/gene27655.gff \
> /count/SRR3286802.count 2> /count/SRR3286802.err

结果真的报错了

Error occured when processing SAM input (record #30072152 in file /ifs1/Grp3/huangsiyuan/learn_rnaseq/mRNA-seq/map/SRR3286802.s.bam):
  Maximum alignment buffer size exceeded while pairing SAM alignments.
  [Exception type: ValueError, raised in __init__.py:686]

我在SEQanswers上找到了一个解答，提供了两种解决方案，我都试了一遍。

'
Two options: （换成reads name排序方式；换软件）
You can either re-sort your files lexographically/alphabetically and run htseq-counts with the lexigraphic order (in other words, NOT --order=pos)
or
You can switch to the much faster and generally more user friendly (easier options, easier output file format) program "featureCounts". Part of the "subread" package. [http://subread.sourceforge.net/](http://subread.sourceforge.net/)
I would switch to subread. htseq-counts is incredibly slow and very poor at handling large files without crashing.
'

1. 重新排序

$ cat 2.sh 
ls SRR328680{2..7}.bam | while read id
do
samtools sort -n ${id} -o ${id%.*}.namesort.bam -@ 2 &
done

$ cat 3.sh 
ls *.namesort.bam | while read id
do
htseq-count -s no -r name -f bam  ${id} /ref/gene27655.gff > /count/${id%%.*}.count 2> /count/${id%%.*}.err &
done

理论上是可行的，以前试过。但这次又出现了新问题，跑出来的结果都是空文件。

$ cat SRR3286807.count
__no_feature    45205743
__ambiguous 0
__too_low_aQual 0
__not_aligned   1336845
__alignment_not_unique  2045753

这时我猜测，htseq-count要求bam(chr1,chr2...)和gff(1,2...)中的染色体名称相同，不然无法判断，更改gff的chr名称后，再按同样的命令运行一次。

awk '{ print "chr"$0 }' gene27655.gff > gene27655_re_name_chr.gff

结果还是一样空的?! —— (运行过程中提示Warning: No features of type 'exon' found.)我只能猜测是自己前面提取gene区间至gff（为了直接统计gene区间内的reads数）这一步有些多此一举了。
于是又拿完整的gff文件试了下，Arabidopsis_thaliana.TAIR10.42.gff3，又说某行第九列注释信息中没有gene_id信息，看了下确实没有。

1       araport11       exon    3631    3913    .       +       .       \
Parent=transcript:AT1G01010.1;\
Name=AT1G01010.1.exon1;\
constitutive=1;\
ensembl_end_phase=1;\
ensembl_phase=-1;\
exon_id=AT1G01010.1.exon1;\
rank=1

又去NCBI下载了一个gtf注释文件（基因组版本是一样的）, GCF_000001735.4_TAIR10.1_genomic.gtf.gz，这回终于有gene_id了。更改染色体名称后，重新用htseq-count计数。
总结一下
通过上面的折腾，对htseq-count计数有了新的认识。

bam中的reads根据名称排序；
基因组feature都保留，不要多此一举专门提出feature为gene的记录；
gff/gtf的第9列注释要有gene_id；
bam和gff/gtf的染色体名称保持一致。

其中，第2、3点可以自己另外设置，见htseq-count计数

2. 更换软件

nohup wget https://nchc.dl.sourceforge.net/project/subread/subread-1.6.3/subread-1.6.3-source.tar.gz &
tar -zxvf subread-1.6.3-source.tar.gz
cd src/
make -f Makefile.Linux #编译成功后，可执行文件存储在../bin/中

Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...

nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 \
-a ~/learn_rnaseq/mRNA-seq/ref/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/test \
~/learn_rnaseq/mRNA-seq/map/SRR328680*.s.bam &

上面这一步运行完之后会多出test.summary，test这两个文件

nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -p \
-a ~/learn_rnaseq/mRNA-seq/ref/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/all \
~/learn_rnaseq/mRNA-seq/map/SRR328680*.s.bam &

上面这行只多了-p选项，为了看看在双端测序中统计fragments和reads有什么区别。运行完了多出all.summary，all两个文件。

从all和text两个矩阵文件中，看不出什么明显的差别，但是加-p之后，运行时间长了很多。但不管怎么样，比htseq-count快多了。