STAR's twopassMode question

Author: 因地制宜的生信达人 | Published: 2018-01-23 10:30 | Read 135 times

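With a single sample, adding --twopassMode Basic is usually recommended: it gives more accurate alignments and makes downstream variant calling easier. The official documentation explains it as follows: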

Annotated junctions will be included in both the 1st and 2nd passes. To run STAR 2-pass mapping for each sample separately, use --twopassMode Basic option. STAR will perform the 1st pass mapping, then it will automatically extract junctions, insert them into the genome index, and, finally, re-map all reads in the 2nd mapping pass.

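With many samples, however, running twopassMode on each one wastes time. The usual approach is to map all samples once, collect the SJ.out.tab files they produce, rebuild the reference genome index with those junctions, and then map all samples again in batch. The index is rebuilt like this: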

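## Note: the FASTA and the GTF must describe the same assembly. The paths below pair mm10 (mouse)
## with a GENCODE v25lift37 (human, GRCh37) annotation, so adjust them to your own data.
~/biosoft/STAR/STAR-2.5.3a/bin/Linux_x86_64/STAR --runMode genomeGenerate \
--genomeDir  second_index  \
--genomeFastaFiles ~/reference/genome/mm10/mm10.fa \
--sjdbGTFfile ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf \
--sjdbFileChrStartEnd all_raw_star.tab --runThreadN 4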
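
The post does not show how all_raw_star.tab is produced. One possible way, as a minimal sketch: concatenate the first-pass SJ.out.tab files, keep junctions supported by enough uniquely mapped reads, and de-duplicate. The function name merge_sj and the read-count cutoff are hypothetical choices, not something the author specifies. The strand code in column 4 (0/1/2) is rewritten to ./+/- to match the chr, start, end, strand layout that --sjdbFileChrStartEnd expects.

```shell
# Merge first-pass SJ.out.tab files into one junction list.
# SJ.out.tab columns used: 1=chr, 2=intron start, 3=intron end,
# 4=strand (0 undefined, 1 +, 2 -), 7=number of uniquely mapped reads.
merge_sj() {
    # $1 = minimum uniquely mapped read count (hypothetical cutoff)
    # remaining arguments = SJ.out.tab files from the first pass
    min_unique="$1"; shift
    awk -v min="$min_unique" 'BEGIN { OFS = "\t" }
        $7 >= min {
            strand = ($4 == 1) ? "+" : (($4 == 2) ? "-" : ".")
            print $1, $2, $3, strand
        }' "$@" | sort -u
}

# Example call (file pattern is an assumption):
#   merge_sj 1 ./*SJ.out.tab > all_raw_star.tab
```

A cutoff of 1 keeps every junction that has at least one uniquely mapped read; raising it trades sensitivity for a smaller, cleaner junction set.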

Then map again against the new index:

$star --runThreadN  5  --genomeLoad  LoadAndKeep   --limitBAMsortRAM 13045315604 \
--outSAMtype BAM SortedByCoordinate  --genomeDir $second_index  \
--readFilesCommand zcat --readFilesIn  $fq1 $fq2 --outFileNamePrefix  ${sample}_
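
For the batch step, the same command can be wrapped in a loop over samples. The sketch below only prints the commands so they can be reviewed before running. The function name second_pass_cmds and the FASTQ naming scheme (sample_1.fq.gz, sample_2.fq.gz) are assumptions for illustration. Because --genomeLoad LoadAndKeep leaves the genome in shared memory, the last printed command releases it with --genomeLoad Remove.

```shell
# Print (do not run) one second-pass STAR command per sample.
# Assumes hypothetical FASTQ names <fq_dir>/<sample>_1.fq.gz and <sample>_2.fq.gz.
second_pass_cmds() {
    index="$1"; fq_dir="$2"; shift 2
    for sample in "$@"; do
        echo "STAR --runThreadN 5 --genomeLoad LoadAndKeep --limitBAMsortRAM 13045315604" \
             "--outSAMtype BAM SortedByCoordinate --genomeDir $index" \
             "--readFilesCommand zcat --readFilesIn $fq_dir/${sample}_1.fq.gz $fq_dir/${sample}_2.fq.gz" \
             "--outFileNamePrefix ${sample}_"
    done
    # Release the shared-memory genome once all samples are done.
    echo "STAR --genomeLoad Remove --genomeDir $index"
}

# Example call (sample names are placeholders):
#   second_pass_cmds second_index ./fastq DH01 DH02 > second_pass.sh
```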

A normal single-pass alignment of one sample looks like this:

## Sequencing data:
6.7G Dec 12 15:55 clean.1.fq.gz
6.6G Dec 12 18:03 clean.2.fq.gz
## Alignment command; you need to have installed the software and prepared the reference genome files and index yourself
$star --runThreadN  5 --genomeDir $hg19_star_index --readFilesCommand zcat --outSAMtype BAM  SortedByCoordinate \
--readFilesIn  $fq1 $fq2 --outFileNamePrefix  ${sample}_star ## --alignEndsType EndToEnd

The alignment output:

14G Dec 30 22:38 DH01_starAligned.sortedByCoord.out.bam
1.9K Dec 30 22:38 DH01_starLog.final.out
21K Dec 30 22:38 DH01_starLog.out
4.6K Dec 30 22:38 DH01_starLog.progress.out
8.1M Dec 30 22:38 DH01_starSJ.out.tab

Alignment timing:

Dec 30 21:43:03 ..... started STAR run
Dec 30 21:43:03 ..... loading genome
Dec 30 21:49:28 ..... started mapping
Dec 30 22:27:57 ..... started sorting BAM
Dec 30 22:38:20 ..... finished successfully

The two-pass alignment:

$star --runThreadN  5 --genomeDir $hg19_star_index --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate  \
--twopassMode Basic --outReadsUnmapped None --chimSegmentMin 12 \
--chimJunctionOverhangMin 12  --alignSJDBoverhangMin 10  --alignMatesGapMax 100000 \
--alignIntronMax 100000 --chimSegmentReadGapMax 3  --alignSJstitchMismatchNmax 5 -1 5 5 \
--readFilesIn  $fq1 $fq2 --outFileNamePrefix  ${sample}_star ## --alignEndsType EndToEnd
## Note: this uses considerably more memory

It uses more memory and also takes longer:

Dec 31 09:33:17 ..... started STAR run
Dec 31 09:33:17 ..... loading genome
Dec 31 09:38:55 ..... started 1st pass mapping
Dec 31 10:14:14 ..... finished 1st pass mapping
Dec 31 10:14:29 ..... inserting junctions into the genome indices
Dec 31 10:19:55 ..... started mapping
Dec 31 11:24:54 ..... started sorting BAM
Dec 31 11:31:11 ..... finished successfully

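As you can see, the basic alignment took just under an hour (about 55 minutes from start to finish), while the two-pass alignment took about 2 hours (roughly 1 h 58 min).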
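
The durations can be checked directly from the first and last timestamps in the two logs. A minimal sketch, assuming both stamps fall on the same day; the helper name elapsed is made up for this example:

```shell
# Elapsed time between two HH:MM:SS stamps on the same day.
elapsed() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, s, ":"); split(b, e, ":")
        d = (e[1] * 3600 + e[2] * 60 + e[3]) - (s[1] * 3600 + s[2] * 60 + s[3])
        printf "%dh%02dm%02ds\n", int(d / 3600), int((d % 3600) / 60), d % 60
    }'
}

elapsed 21:43:03 22:38:20   # single pass: prints 0h55m17s
elapsed 09:33:17 11:31:11   # --twopassMode Basic: prints 1h57m54s
```

The same helper applied to the intermediate stamps shows where the extra time goes: the first pass mapping alone (09:38:55 to 10:14:14) and the junction insertion (10:14:29 to 10:19:55) come before the second mapping pass even starts.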

The resulting files:

6.7G Dec 12 15:55 clean.1.fq.gz
6.6G Dec 12 18:03 clean.2.fq.gz
12G Dec 31 11:31 DH01_starAligned.sortedByCoord.out.bam
126M Dec 31 11:25 DH01_starChimeric.out.junction
874M Dec 31 11:25 DH01_starChimeric.out.sam
1.9K Dec 31 11:31 DH01_starLog.final.out
24K Dec 31 11:31 DH01_starLog.out
12K Dec 31 11:31 DH01_starLog.progress.out
8.0M Dec 31 11:31 DH01_starSJ.out.tab

I enabled chimeric detection, so there are a few extra output files; they are mainly there to prepare for fusion gene detection.

Article link: https://www.haomeiwen.com/subject/ebajaxtx.html
