最近做的项目是关于地衣的转录组和基因组,拿到的结果是真菌部位的基因组以及不同时期和部位的转录组,地下部位的藻类部由于太复杂而且杂质过多不好分离就没有测基因组,目前想知道地下部位的藻类到底是什么?根据之前的观察,推测是Coccomyxa subellipsoidea,想要具体确定的话还需要进一步证明。
![](https://img.haomeiwen.com/i16572814/621cd123bd54fada.png)
可以看到该地衣的地上部位是一种担子菌,地下部位不仅有藻类还有其他的物种。
转录组比对到真菌参考基因组
目前有真菌的测序基因组,第一步就是想把转录组数据比对到真菌的参考基因组上
比对工具我选择STAR,使用地上部位(真菌部_Lh3-1)和地下部位(藻类部 _Lh1-1)的转录组比对到基因组上
建立索引
S$ATR --runMode genomeGenerate --genomeDir /datadisk02/mpoly/hun_db/ --genomeFastaFiles /datadisk02/mpoly/hudsoiana.fa
比对
$STAR --runThreadN 20 --genomeDir /datadisk02/mpoly/hun_db/ --readFilesCommand zcat --readFilesIn _Lh2-1_S110_L007_R1_001.fastq.gz _Lh2-1_S110_L007_R2_001.fastq.gz --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5 --outFileNamePrefix S2-1
$STAR --runThreadN 20 --genomeDir /datadisk02/mpoly/hun_db/ --readFilesCommand zcat --readFilesIn _Lh3-1_S108_L007_R1_001.fastq.gz _Lh3-1_S108_L007_R2_001.fastq.gz --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5 --outFileNamePrefix S1-1
比对结束后我最关心的是比对率,按照预期,地下部位的比对率比地上部位低。
$less S1-1Log.final.out
Started job on | Aug 20 05:48:00
Started mapping on | Aug 20 05:48:01
Finished on | Aug 20 06:42:35
Mapping speed, Million of reads per hour | 26.61
Number of input reads | 24198466
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 15397710
Uniquely mapped reads % | 60.89%
$less S3-1Log.final.out
Started job on | Aug 20 08:38:29
Started mapping on | Aug 20 08:38:30
Finished on | Aug 20 08:49:15
Mapping speed, Million of reads per hour | 134.06
Number of input reads | 24018884
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 19151198
Uniquely mapped reads % | 79.73%
可以看到结果符合预期,地下部位的reads比对率为64%,地上部位为80%。
地下部位转录组分析
接下来用地下部位的转录组分析确定共生藻,首先的想法是利用转录组数据比对到初步确定的共生藻Coccomyxa subellipsoidea上,使用Coccomyxa subellipsoidea的基因组建库进行比对
S$ATR --runMode genomeGenerate --genomeDir /datadisk02/mpoly/cocco/ --genomeFastaFiles /datadisk02/mpoly/Coccomyxa.fa
$STAR --runThreadN 20 --genomeDir /datadisk02/mpoly/cocco/ --readFilesCommand zcat --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --outSAMtype BAM SortedByCoordinate --outBAMsortingThreadN 5 --outFileNamePrefix S1-1C
结果有点出乎意料,只有零点几的比对率,
$less S1-1CLog.final.out
Started job on | Aug 19 14:25:11
Started mapping on | Aug 19 14:25:16
Finished on | Aug 19 20:51:41
Mapping speed, Million of reads per hour | 3.96
Number of input reads | 25517519
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 92457
Uniquely mapped reads % | 0.36%
由于地下部位比对到真菌部也有64%的比对率,所以我重新跑了一遍STAR,利用unmapped reads进行比对。
$STAR --runThreadN 32 --genomeDir hun_db/ --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --readFilesCommand zcat --outSAMtype SAM --outFileNamePrefix S1-1 --outReadsUnmapped Fastx
$awk -F " " '{print $1}' S1-1Unmapped.out.mate1| awk '{if (NR%4==1||NR%4=3) print}' > S1-1.left.fa
$sed -i "s/@/>/g" S1-1.left.fa
$awk -F " " '{print $1}' S1-1Unmapped.out.mate2| awk '{if (NR%4==1||NR%4=3) print}' > S1-1.right.fa
$sed -i "s/@/>/g" S1-1.right.fa
$STAR --runThreadN 20 --genomeDir cocco/ --readFilesIn S1-1.left.fa S1-1.right.fa --outSAMtype SAM --outFileNamePrefix S1-1UC --outReadsUnmapped Fastx
$less S1-1UCLog.final.out
Started job on | Aug 26 10:59:26
Started mapping on | Aug 26 10:59:39
Finished on | Aug 26 15:32:14
Mapping speed, Million of reads per hour | 1.58
Number of input reads | 7180320
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 96710
Uniquely mapped reads % | 1.35%
比对率也只有可怜的1.35%,一开始我怀疑是不是藻类物种鉴定错了,后来我用了ncbi中所有藻类的基因组,比对率都不到1%。
后来去网上搜索比对率低的原因,在github上发现相应解答,加入--outFilterScoreMinOverLread 0.3和--outFilterMatchNminOverLread 0.3 参数
$STAR --runThreadN 20 --genomeDir hun_db/ --readFilesCommand zcat --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --outFileNamePrefix S1-1fli --outReadsUnmapped Fastx --outSAMtype SAM --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
$less S1-1fliLog.finall.out
Started job on | Aug 27 09:52:31
Started mapping on | Aug 27 09:52:37
Finished on | Aug 27 13:02:11
Mapping speed, Million of reads per hour | 8.08
Number of input reads | 25517519
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 16371292
Uniquely mapped reads % | 64.16%
$STAR --runThreadN 20 --genomeDir cocco/ --readFilesCommand zcat --readFilesIn _Lh1-1_S110_L007_R1_001.fastq.gz _Lh1-1_S110_L007_R2_001.fastq.gz --outFileNamePrefix S1-1Cfli --outReadsUnmapped Fastx --outSAMtype SAM --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
$less S1-1CfliLog.final.out
Started job on | Aug 27 09:52:51
Started mapping on | Aug 27 09:52:53
Finished on | Aug 28 04:02:41
Mapping speed, Million of reads per hour | 1.40
Number of input reads | 25517519
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 1475579
Uniquely mapped reads % | 5.78%
对率有5.78%,提高了一点。
接下来,我使用真菌的unmapped reads进行比对,看看没有比到真菌上的reads有多少是藻类的。
$awk -F " " '{print $1}' S1-1fliUnmapped.out.mate1| awk '{if (NR%4==1||NR%4=3) print}' > S1-1fli.left.fa
$sed -i "s/@/>/g" S1-1.left.fa
$awk -F " " '{print $1}' S1-1fliUnmapped.out.mate2| awk '{if (NR%4==1||NR%4=3) print}' > S1-1fli.right.fa
$sed -i "s/@/>/g" S1-1.right.fa
$STAR --runThreadN 20 --genomeDir cocco/ --readFilesIn S1-1fli.left.fa S1-1fli.right.fa --outSAMtype SAM --outFileNamePrefix S1-1Ufli --outReadsUnmapped Fastx --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
$less S1-1UfliLog.final.out
Started job on | Aug 27 14:28:39
Started mapping on | Aug 27 14:28:44
Finished on | Aug 27 21:38:00
Mapping speed, Million of reads per hour | 0.86
Number of input reads | 6123885
Average input read length | 300
UNIQUE READS:
Uniquely mapped reads number | 1441904
Uniquely mapped reads % | 23.55%
也就是说,在剩余的36%未比对的reads中有23%属于Coccomyxa,总比对率有70%左右,地上部位按照相同的步骤进行比对有83%的比对率,这个现象也正常,因为地下部位的组织特别复杂,在取样时肯定包含了很多杂质,比对率肯定比地上部位低。
总结
目前已经可以确定地下部位的藻类就是Coccomyxa属的,是不是Coccomyxa subellipsoidea目前还不确定,接下来的工作是确定地下部位未比对上reads的所属。
转载请注明:周小钊的博客>>>根据转录组数据鉴定共生物种
网友评论