之所以有这个需求,是因为在做一个单细胞转录组测序数据的分析,发现一些非常诡异的比对情况。比如:尽管是全局比对率达到80%以上,但是过半数居然左右失衡,也就是说只有双端测序reads的其中一条成功比对了,另一条莫名其妙的比对失败。这种情况可以发生,只是超过50%就太夸张了,非常值得探究,究竟是reads本身的特性呢,还是比对工具的选择不够合适。
5个比对软件
我这里探究了hisat2,subjunc,star,bwa,bowtie2这5个比对工具,它们的安装方式文件不赘述了,使用代码如下:
hisat='/home/jianmingzeng/biosoft/HISAT/current/hisat2';
star="/home/jianmingzeng/biosoft/STAR/STAR-2.5.3a/bin/Linux_x86_64/STAR";
subjunc="/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc";
tophat2="/home/jianmingzeng/biosoft/TopHat/current/tophat2";
bowtie="/home/jianmingzeng/biosoft/bowtie/bowtie2-2.2.9/bowtie2";
hisat2_mm10_index='/home/jianmingzeng/reference/index/hisat/mm10/genome';
star_mm10_index='/home/jianmingzeng/reference/index/star/mm10/';
subjunc_mm10_index='/home/jianmingzeng/reference/index/subread/mm10';
bowtie2_mm10_index='/home/jianmingzeng/reference/index/bowtie/mm10';
bwa_mm10_index='/home/jianmingzeng/reference/index/bwa/mm10';
fq1=SC2-01_S1_L002_R1_001.fastq.gz
fq2=SC2-01_S1_L002_R2_001.fastq.gz
sample=SC2-01_S1_L002
$hisat -p 5 -x $hisat2_mm10_index -1 $fq1 -2 $fq2 -S $sample.sam 2>$sample.hisat.log
samtools sort -O bam -@ 5 -o ${sample}_hisat.bam $sample.sam
$subjunc -T 5 -i $subjunc_mm10_index -r $fq1 -R $fq2 -o ${sample}_subjunc.bam
## 很有趣,虽然subjunc直接输出bam,但是是按照reads名字排序,不是按照基因组坐标排序。
bwa mem -t 5 -M $bwa_mm10_index $fq1 $fq2 1>$sample.sam 2>/dev/null
samtools sort -O bam -@ 5 -o ${sample}_bwa.bam $sample.sam
$bowtie -p 5 -x $bowtie2_mm10_index -1 $fq1 -2 $fq2 | samtools sort -O bam -@ 5 -o - >${sample}_bowtie.bam
## star软件载入参考基因组非常耗时,约10分钟,也比较耗费内存,但是比对非常快,5M的序列就两分钟即可
$star --runThreadN 5 --genomeDir $star_mm10_index --readFilesCommand zcat --readFilesIn $fq1 $fq2 --outFileNamePrefix ${sample}_star
## --outSAMtype BAM 可以用这个参数设置直接输出排序好的bam文件
samtools sort -O bam -@ 5 -o ${sample}_star.bam ${sample}_starAligned.out.sam
把比对后的bam文件统一进行samtools flagstat
统计,整理一下比对情况如下表:
bowtie2 | bwa | subjunc | hisat | star | |
---|---|---|---|---|---|
4,457,852 | 5,028,723 | 4,457,852 | 4,636,906 | 3,732,166 | in total (QC-passed reads + QC-failed reads) |
0 | 570,871 | 0 | 179,054 | 327,880 | secondary |
0 | 0 | 0 | 0 | 0 | supplementary |
0 | 0 | 0 | 0 | 0 | duplicates |
2,846,896 | 4,898,291 | 3,694,805 | 3,418,483 | 3,732,166 | mapped(比对率) |
4,457,852 | 4,457,852 | 4,457,852 | 4,457,852 | 3,404,286 | paired in sequencing |
2,228,926 | 2,228,926 | 2,228,926 | 2,228,926 | 1,702,143 | read1 |
2,228,926 | 2,228,926 | 2,228,926 | 2,228,926 | 1,702,143 | read2 |
1,977,832 | 3,741,658 | 2,284,400(太低了) | 3,014,912 | 3,404,286 | properly paired(双端reads比对率) |
2,568,660 | 4,320,816 | 3,531,396 | 3,150,876 | 3,404,286 | with itself and mate mapped |
278,236 | 6,604 | 163,409 | 88,553 | 0 | singletons |
65,730 | 102,160 | 41,864 | 8,860 | 0 | with mate mapped to a different chr |
25,582 | 77,062 | 41,864 | 6,474 | 0 | with mate mapped to a different chr |
让我更不能理解的是,这个是RNA-seq数据,可是居然用BWA可以成功比对率是最高的,虽然比对文件里面有着大量的H/S情况,但是不影响它仍然可以比对成功。高达97.41% 的比对率,还有83.93% 左右两端reads匹配的比对率,遥遥领先与其它比对工具。完全颠覆了我以前的想象,一直以为对转录组数据不能用bwa来比对。(大家可以思考一下其中的原理) 可以看到bwa默认是容许多比对情况的,一条reads可以输出多个比对记录。
## samtools flagstat SC2-01_S1_L002_bwa.bam
5028723 + 0 in total (QC-passed reads + QC-failed reads)
570871 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
4898291 + 0 mapped (97.41% : N/A)
4457852 + 0 paired in sequencing
2228926 + 0 read1
2228926 + 0 read2
3741658 + 0 properly paired (83.93% : N/A)
4320816 + 0 with itself and mate mapped
6604 + 0 singletons (0.15% : N/A)
102160 + 0 with mate mapped to a different chr
77062 + 0 with mate mapped to a different chr (mapQ>=5)
而subjunc的比对结果里面的双端reads比对率与全局reads的比对率始终是太诡异了,如下,明明还有82.88%的比对率,但是左右两端匹配的比对率下降到51.24% ,理论上我应该好好探究一下各种原因。不过,本文先放过这个细节。(欢迎大家留言提出自己的见解)
## samtools flagstat SC2-01_S1_L002_subjunc.bam
4457852 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
3694805 + 0 mapped (82.88% : N/A)
4457852 + 0 paired in sequencing
2228926 + 0 read1
2228926 + 0 read2
2284400 + 0 properly paired (51.24% : N/A)
3531396 + 0 with itself and mate mapped
163409 + 0 singletons (3.67% : N/A)
41864 + 0 with mate mapped to a different chr
41864 + 0 with mate mapped to a different chr (mapQ>=5)
而且这里star软件取了一个巧,只在bam文件里面输出那些成功比对的reads,这样比对率就都是100%啦,不过也可能是我对star软件的说明书没有吃透,参数不恰当。
而且我还应该再探究一下那些被subjunc比对不上的序列却能被bwa软件成功比对到参考基因组的是写什么情况。这个也以后再分享。
表达量探索
因为转录组测序最重要是得到表达量,我们并不关心这些reads具体比对到参考基因组是被切掉了还是跨越了内含子的成功比对 (如果要考虑转录本定量,或者可变剪切,就需要真正的跨越外显子的比对)
这里我们还是选用最开始便捷的featureCounts来进行基于基因的定量,代码如下;
featureCounts='/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts';
mm10_gtf='/home/jianmingzeng/reference/gtf/gencode/gencode.vM12.annotation.gtf';
# Summarize paired-end reads and count fragments (instead of reads):
$featureCounts -T 5 -p -t exon -g gene_id \
-a $mm10_gtf -o counts.txt *.bam
这里只有两个reads都成功比对到参考基因组才计数,所以数的是fragment数量,比对总结如下:
Status | bowtie.bam | bwa.bam | hisat.bam | star.bam | subjunc.bam |
---|---|---|---|---|---|
Assigned | 1,180,215 | 2,183,516 | 1,295,795 | 1,311,956 | 1,518,380 |
Unassigned_Unmapped | 666,360 | 61,914 | 564,935 | 0 | 299,819 |
Unassigned_MappingQuality | 0 | 0 | 0 | 0 | 0 |
Unassigned_Chimera | 0 | 0 | 0 | 0 | 0 |
Unassigned_FragmentLength | 0 | 0 | 0 | 0 | 0 |
Unassigned_Duplicate | 0 | 0 | 0 | 0 | 0 |
Unassigned_MultiMapping | 0 | 0 | 165,425 | 254,209 | 0 |
Unassigned_Secondary | 0 | 0 | 0 | 0 | 0 |
Unassigned_Nonjunction | 0 | 0 | 0 | 0 | 0 |
Unassigned_NoFeatures | 282,738 | 408,715 | 228,180 | 232,006 | 307,639 |
Unassigned_Overlapping_Length | 0 | 0 | 0 | 0 | 0 |
Unassigned_Ambiguity | 99,613 | 145,544 | 73,526 | 67,912 | 103,088 |
可以看到bwa软件把几乎所有的RNA-seq测序的reads都成功的比对到了基因组,并且成功的注释到了基因,计数成为了基因的表达量。从这一点来说,BWA最大限度的保留了这个RNA-seq测序得到的基因表达信息,现在只需要探究它们的表达量的相关性,看看BWA那种截取reads的部分片段进行比对,是否会出现大幅度的基因定量的偏差。
简单的R代码探究如下:
a=read.table("tmp.txt",header=T) ## 49585 genes in gencode
head(a)
cor(a)
write.table(cor(a),file="tmp.sum",quote=F,sep="\t")
cor(a[order(apply(a,1,mad), decreasing = T)[1:50],])
cor(a[order(apply(a,1,mad), decreasing = T)[1:500],])
dim(a[rowSums(a)>10,]) ## 5823 expressed genes in single cells
cor(a[rowSums(a)>10,])
相关性结果如下:
bowtie.bam | bwa.bam | hisat.bam | star.bam | subjunc.bam | |
---|---|---|---|---|---|
bowtie.bam | 1 | 0.885440896 | 0.879128734 | 0.862654678 | 0.910767788 |
bwa.bam | 0.885440896 | 1 | 0.921680804 | 0.923104636 | 0.954310013 |
hisat.bam | 0.879128734 | 0.921680804 | 1 | 0.996418718 | 0.971852796 |
star.bam | 0.862654678 | 0.923104636 | 0.996418718 | 1 | 0.970620816 |
subjunc.bam | 0.910767788 | 0.954310013 | 0.971852796 | 0.970620816 | 1 |
发现只有bowtie2这个软件的比对跟其它工具得到的表达量相关性比较差。hisat,star,subjunc这3款软件都是设计来专门做转录组数据分析,所以它们之间的相关性非常高。
虽然相关性非常高,也还是有些基因区域被这5款软件计数后的表达量差异却非常大,如下:
ENSMUSG00000077463.1 chr1 24548841 24548977 - 137 0 0 0 0 0 |
---|
ENSMUSG00000099997.1 chr1 24594087 24595423 - 1337 1 9 1 2 2 |
ENSMUSG00000100131.1 chr1 24611535 24612410 - 876 95 141 0 11 13 |
ENSMUSG00000067736.2 chr1 24612407 24612700 - 294 49 66 0 3 8 |
ENSMUSG00000101939.1 chr1 24612775 24613119 - 345 25 67 11 16 18 |
ENSMUSG00000101111.1 chr1 24613189 24613971 - 783 1581 2842 174 154 524 |
ENSMUSG00000100862.1 chr1 24613974 24614651 - 678 1088 1789 8 37 270 |
ENSMUSG00000102070.1 chr1 24614885 24615565 - 681 881 1468 10 60 276 |
ENSMUSG00000101249.1 chr1 24615706 24616197 - 492 1946 2883 34 101 664 |
上表中最后五列是不同比对得到的表达量,分别是bowie2,bwa,hisat2,star,subjunc。 可以看到对于这种单外显子基因来说,bwa和bowtie2比较类似,但是跟hisat2,star,subjunc这3款转录组比对工具完全不一样。真奇怪。
先用sickle把fastq过滤
过滤代码很简单
cat $config |while read id
do
arr=($id)
fq1=${arr[1]}
fq2=${arr[2]}
sample=${arr[0]}
echo $sample
~/biosoft/sickle/sickle-master/sickle pe -q 30 -l 36 -g -f $fq1 -r $fq2 -t sanger -o $sample.clean.R1.fq.gz -p $sample.clean.R2.fq.gz -s $sample.trash.fq.gz 1 > $sample.log 2>&1
done
再用本文最开头的5个比对软件把它们再比对一次,总结比对情况,如下表:
bowtie2 | bwa | hisat2 | star | subjunc | |
---|---|---|---|---|---|
3,618,770 | 35,014 | 3,775,457 | 2,925,028 | 3,618,770 | in total (QC-passed reads + QC-failed reads) |
0 | 416,244 | 156,687 | 277,104 | 0 | secondary |
0 | 0 | 0 | 0 | 0 | supplementary |
0 | 0 | 0 | 0 | 0 | duplicates |
2,367,830(65.43% | 3,934,849(97.52% | 2,853,784(84.90% | 2,925,028 | 3,078,617(85.07% | mapped |
3,618,770 | 3,618,770 | 3,618,770 | 2,647,924 | 3,618,770 | paired in sequencing |
1,809,385 | 1,809,385 | 1,809,385 | 1,324,102 | 1,809,385 | read1 |
1,809,385 | 1,809,385 | 1,809,385 | 1,323,822 | 1,809,385 | read2 |
1,642,810(45.40% | 3,072,416(84.90% | 2,414,588(75.59% | 2,645,342 | 1,941,150(53.64% | properly paired (双端reads比对率) |
2,138,294 | 3,516,004 | 2,637,368 | 2,645,342 | 2,991,132 | with itself and mate mapped |
229,536 | 2,601 | 59,729 | 2,582 | 87,485 | singletons |
53,050 | 62,750 | 10,862 | 0 | 34,144 | with mate mapped to a different chr |
19,185 | 52,158 | 6,479 | 0 | 34,144 | with mate mapped to a different chr |
可以看到之前的2.228M序列被过滤成了1.809M的序列,然后比对率都略微有提升。
同样的进行featureCounts定量,结果如下:
Status | SC2-01_bowtie.bam | SC2-01_bwa.bam | SC2-01_hisat.bam | SC2-01_star.bam | SC2-01_subjunc.bam |
---|---|---|---|---|---|
Assigned | 979,423 | 1,737,888 | 1,067,257 | 1,016,223 | 1,241,309 |
Unassigned_Unmapped | 510,702 | 48,782 | 430,972 | 0 | 226,334 |
Unassigned_MappingQuality | 0 | 0 | 0 | 0 | 0 |
Unassigned_Chimera | 0 | 0 | 0 | 0 | 0 |
Unassigned_FragmentLength | 0 | 0 | 0 | 0 | 0 |
Unassigned_Duplicate | 0 | 0 | 0 | 0 | 0 |
Unassigned_MultiMapping | 0 | 0 | 146,667 | 213,261 | 0 |
Unassigned_Secondary | 0 | 0 | 0 | 0 | 0 |
Unassigned_Nonjunction | 0 | 0 | 0 | 0 | 0 |
Unassigned_NoFeatures | 239,597 | 325,603 | 194,491 | 182,369 | 259,568 |
Unassigned_Overlapping_Length | 0 | 0 | 0 | 0 | 0 |
Unassigned_Ambiguity | 79,663 | 113,301 | 59,189 | 52,063 | 82,174 |
网友评论