比对工具大揭秘

作者: 因地制宜的生信达人 | 来源:发表于2018-07-01 16:12 被阅读374次

比对工具大揭秘
比对工具大揭秘
2020-12-29 生信4.0序列比较（第二部分）
Module-SVN使用说明
beyond compare比对工具
RNA-seq名词解释（4）
DNA/RNA序列比对软件整理
2020-09-03 Papara 安装
10大白帽黑客专用的 Linux 操作系统
「R绘图」minimap2的PAF文件如何进行可视化？

之所以有这个需求，是因为在做一个单细胞转录组测序数据的分析，发现一些非常诡异的比对情况。比如：尽管是全局比对率达到80%以上，但是过半数居然左右失衡，也就是说只有双端测序reads的其中一条成功比对了，另一条莫名其妙的比对失败。这种情况可以发生，只是超过50%就太夸张了，非常值得探究，究竟是reads本身的特性呢，还是比对工具的选择不够合适。

5个比对软件

我这里探究了hisat2,subjunc,star,bwa,bowtie2这5个比对工具，它们的安装方式文件不赘述了，使用代码如下:

hisat='/home/jianmingzeng/biosoft/HISAT/current/hisat2';
star="/home/jianmingzeng/biosoft/STAR/STAR-2.5.3a/bin/Linux_x86_64/STAR";
subjunc="/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/subjunc";
tophat2="/home/jianmingzeng/biosoft/TopHat/current/tophat2";
bowtie="/home/jianmingzeng/biosoft/bowtie/bowtie2-2.2.9/bowtie2";


hisat2_mm10_index='/home/jianmingzeng/reference/index/hisat/mm10/genome'; 
star_mm10_index='/home/jianmingzeng/reference/index/star/mm10/'; 
subjunc_mm10_index='/home/jianmingzeng/reference/index/subread/mm10'; 
bowtie2_mm10_index='/home/jianmingzeng/reference/index/bowtie/mm10';
bwa_mm10_index='/home/jianmingzeng/reference/index/bwa/mm10';

fq1=SC2-01_S1_L002_R1_001.fastq.gz
fq2=SC2-01_S1_L002_R2_001.fastq.gz
sample=SC2-01_S1_L002 

$hisat -p 5 -x $hisat2_mm10_index -1 $fq1 -2 $fq2 -S $sample.sam 2>$sample.hisat.log
samtools sort -O bam -@ 5  -o ${sample}_hisat.bam $sample.sam

$subjunc -T 5  -i $subjunc_mm10_index -r $fq1  -R $fq2 -o ${sample}_subjunc.bam
## 很有趣，虽然subjunc直接输出bam，但是是按照reads名字排序，不是按照基因组坐标排序。

bwa mem -t 5 -M  $bwa_mm10_index $fq1 $fq2 1>$sample.sam 2>/dev/null 
samtools sort -O bam -@ 5  -o ${sample}_bwa.bam $sample.sam

$bowtie -p 5 -x $bowtie2_mm10_index -1 $fq1  -2 $fq2 | samtools sort  -O bam  -@ 5 -o - >${sample}_bowtie.bam

## star软件载入参考基因组非常耗时，约10分钟，也比较耗费内存，但是比对非常快，5M的序列就两分钟即可
$star --runThreadN  5 --genomeDir $star_mm10_index --readFilesCommand zcat --readFilesIn  $fq1 $fq2 --outFileNamePrefix  ${sample}_star 
## --outSAMtype BAM  可以用这个参数设置直接输出排序好的bam文件
samtools sort -O bam -@ 5  -o ${sample}_star.bam ${sample}_starAligned.out.sam

把比对后的bam文件统一进行samtools flagstat统计，整理一下比对情况如下表：

bowtie2	bwa	subjunc	hisat	star
4,457,852	5,028,723	4,457,852	4,636,906	3,732,166	in total (QC-passed reads + QC-failed reads)
0	570,871	0	179,054	327,880	secondary
0	0	0	0	0	supplementary
0	0	0	0	0	duplicates
2,846,896	4,898,291	3,694,805	3,418,483	3,732,166	mapped(比对率)
4,457,852	4,457,852	4,457,852	4,457,852	3,404,286	paired in sequencing
2,228,926	2,228,926	2,228,926	2,228,926	1,702,143	read1
2,228,926	2,228,926	2,228,926	2,228,926	1,702,143	read2
1,977,832	3,741,658	2,284,400(太低了)	3,014,912	3,404,286	properly paired（双端reads比对率）
2,568,660	4,320,816	3,531,396	3,150,876	3,404,286	with itself and mate mapped
278,236	6,604	163,409	88,553	0	singletons
65,730	102,160	41,864	8,860	0	with mate mapped to a different chr
25,582	77,062	41,864	6,474	0	with mate mapped to a different chr

让我更不能理解的是，这个是RNA-seq数据，可是居然用BWA可以成功比对率是最高的，虽然比对文件里面有着大量的H/S情况，但是不影响它仍然可以比对成功。高达97.41% 的比对率，还有83.93% 左右两端reads匹配的比对率，遥遥领先与其它比对工具。完全颠覆了我以前的想象，一直以为对转录组数据不能用bwa来比对。(大家可以思考一下其中的原理) 可以看到bwa默认是容许多比对情况的，一条reads可以输出多个比对记录。

## samtools flagstat SC2-01_S1_L002_bwa.bam
5028723 + 0 in total (QC-passed reads + QC-failed reads)
570871 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
4898291 + 0 mapped (97.41% : N/A)
4457852 + 0 paired in sequencing
2228926 + 0 read1
2228926 + 0 read2
3741658 + 0 properly paired (83.93% : N/A)
4320816 + 0 with itself and mate mapped
6604 + 0 singletons (0.15% : N/A)
102160 + 0 with mate mapped to a different chr
77062 + 0 with mate mapped to a different chr (mapQ>=5)

而subjunc的比对结果里面的双端reads比对率与全局reads的比对率始终是太诡异了，如下，明明还有82.88%的比对率，但是左右两端匹配的比对率下降到51.24% ，理论上我应该好好探究一下各种原因。不过，本文先放过这个细节。(欢迎大家留言提出自己的见解)

## samtools flagstat SC2-01_S1_L002_subjunc.bam
4457852 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
3694805 + 0 mapped (82.88% : N/A)
4457852 + 0 paired in sequencing
2228926 + 0 read1
2228926 + 0 read2
2284400 + 0 properly paired (51.24% : N/A)
3531396 + 0 with itself and mate mapped
163409 + 0 singletons (3.67% : N/A)
41864 + 0 with mate mapped to a different chr
41864 + 0 with mate mapped to a different chr (mapQ>=5)

而且这里star软件取了一个巧，只在bam文件里面输出那些成功比对的reads，这样比对率就都是100%啦，不过也可能是我对star软件的说明书没有吃透，参数不恰当。

而且我还应该再探究一下那些被subjunc比对不上的序列却能被bwa软件成功比对到参考基因组的是写什么情况。这个也以后再分享。

表达量探索

因为转录组测序最重要是得到表达量，我们并不关心这些reads具体比对到参考基因组是被切掉了还是跨越了内含子的成功比对 (如果要考虑转录本定量，或者可变剪切，就需要真正的跨越外显子的比对)

这里我们还是选用最开始便捷的featureCounts来进行基于基因的定量，代码如下；

featureCounts='/home/jianmingzeng/biosoft/featureCounts/subread-1.5.3-Linux-x86_64/bin/featureCounts';
mm10_gtf='/home/jianmingzeng/reference/gtf/gencode/gencode.vM12.annotation.gtf';
# Summarize paired-end reads and count fragments (instead of reads):
$featureCounts -T 5 -p -t exon -g gene_id \
-a $mm10_gtf -o counts.txt   *.bam

这里只有两个reads都成功比对到参考基因组才计数，所以数的是fragment数量，比对总结如下：

Status	bowtie.bam	bwa.bam	hisat.bam	star.bam	subjunc.bam
Assigned	1,180,215	2,183,516	1,295,795	1,311,956	1,518,380
Unassigned_Unmapped	666,360	61,914	564,935	0	299,819
Unassigned_MappingQuality	0	0	0	0	0
Unassigned_Chimera	0	0	0	0	0
Unassigned_FragmentLength	0	0	0	0	0
Unassigned_Duplicate	0	0	0	0	0
Unassigned_MultiMapping	0	0	165,425	254,209	0
Unassigned_Secondary	0	0	0	0	0
Unassigned_Nonjunction	0	0	0	0	0
Unassigned_NoFeatures	282,738	408,715	228,180	232,006	307,639
Unassigned_Overlapping_Length	0	0	0	0	0
Unassigned_Ambiguity	99,613	145,544	73,526	67,912	103,088

可以看到bwa软件把几乎所有的RNA-seq测序的reads都成功的比对到了基因组，并且成功的注释到了基因，计数成为了基因的表达量。从这一点来说，BWA最大限度的保留了这个RNA-seq测序得到的基因表达信息，现在只需要探究它们的表达量的相关性，看看BWA那种截取reads的部分片段进行比对，是否会出现大幅度的基因定量的偏差。

简单的R代码探究如下：

a=read.table("tmp.txt",header=T) ## 49585 genes in gencode 
head(a)
cor(a) 
write.table(cor(a),file="tmp.sum",quote=F,sep="\t")
cor(a[order(apply(a,1,mad), decreasing = T)[1:50],])
cor(a[order(apply(a,1,mad), decreasing = T)[1:500],])
dim(a[rowSums(a)>10,]) ## 5823 expressed genes in single cells
cor(a[rowSums(a)>10,])

	bowtie.bam	bwa.bam	hisat.bam	star.bam	subjunc.bam
bowtie.bam	1	0.885440896	0.879128734	0.862654678	0.910767788
bwa.bam	0.885440896	1	0.921680804	0.923104636	0.954310013
hisat.bam	0.879128734	0.921680804	1	0.996418718	0.971852796
star.bam	0.862654678	0.923104636	0.996418718	1	0.970620816
subjunc.bam	0.910767788	0.954310013	0.971852796	0.970620816	1

ENSMUSG00000077463.1 chr1 24548841 24548977 - 137 0 0 0 0 0
ENSMUSG00000099997.1 chr1 24594087 24595423 - 1337 1 9 1 2 2
ENSMUSG00000100131.1 chr1 24611535 24612410 - 876 95 141 0 11 13
ENSMUSG00000067736.2 chr1 24612407 24612700 - 294 49 66 0 3 8
ENSMUSG00000101939.1 chr1 24612775 24613119 - 345 25 67 11 16 18
ENSMUSG00000101111.1 chr1 24613189 24613971 - 783 1581 2842 174 154 524
ENSMUSG00000100862.1 chr1 24613974 24614651 - 678 1088 1789 8 37 270
ENSMUSG00000102070.1 chr1 24614885 24615565 - 681 881 1468 10 60 276
ENSMUSG00000101249.1 chr1 24615706 24616197 - 492 1946 2883 34 101 664

先用sickle把fastq过滤

过滤代码很简单

cat $config |while read id
do
    arr=($id)
    fq1=${arr[1]}
    fq2=${arr[2]}
    sample=${arr[0]}
    echo $sample
    ~/biosoft/sickle/sickle-master/sickle pe -q 30 -l 36 -g -f $fq1 -r $fq2 -t sanger -o $sample.clean.R1.fq.gz -p $sample.clean.R2.fq.gz -s $sample.trash.fq.gz  1 > $sample.log 2>&1
done

再用本文最开头的5个比对软件把它们再比对一次，总结比对情况，如下表：

bowtie2	bwa	hisat2	star	subjunc
3,618,770	35,014	3,775,457	2,925,028	3,618,770	in total (QC-passed reads + QC-failed reads)
0	416,244	156,687	277,104	0	secondary
0	0	0	0	0	supplementary
0	0	0	0	0	duplicates
2,367,830(65.43%	3,934,849(97.52%	2,853,784(84.90%	2,925,028	3,078,617(85.07%	mapped
3,618,770	3,618,770	3,618,770	2,647,924	3,618,770	paired in sequencing
1,809,385	1,809,385	1,809,385	1,324,102	1,809,385	read1
1,809,385	1,809,385	1,809,385	1,323,822	1,809,385	read2
1,642,810(45.40%	3,072,416(84.90%	2,414,588(75.59%	2,645,342	1,941,150(53.64%	properly paired （双端reads比对率）
2,138,294	3,516,004	2,637,368	2,645,342	2,991,132	with itself and mate mapped
229,536	2,601	59,729	2,582	87,485	singletons
53,050	62,750	10,862	0	34,144	with mate mapped to a different chr
19,185	52,158	6,479	0	34,144	with mate mapped to a different chr

可以看到之前的2.228M序列被过滤成了1.809M的序列，然后比对率都略微有提升。

同样的进行featureCounts定量，结果如下：

Status	SC2-01_bowtie.bam	SC2-01_bwa.bam	SC2-01_hisat.bam	SC2-01_star.bam	SC2-01_subjunc.bam
Assigned	979,423	1,737,888	1,067,257	1,016,223	1,241,309
Unassigned_Unmapped	510,702	48,782	430,972	0	226,334
Unassigned_MappingQuality	0	0	0	0	0
Unassigned_Chimera	0	0	0	0	0
Unassigned_FragmentLength	0	0	0	0	0
Unassigned_Duplicate	0	0	0	0	0
Unassigned_MultiMapping	0	0	146,667	213,261	0
Unassigned_Secondary	0	0	0	0	0
Unassigned_Nonjunction	0	0	0	0	0
Unassigned_NoFeatures	239,597	325,603	194,491	182,369	259,568
Unassigned_Overlapping_Length	0	0	0	0	0
Unassigned_Ambiguity	79,663	113,301	59,189	52,063	82,174