生物的DNA都是由四种碱基构成的,这四种碱基A,T,G和C的排列顺序包含了物种的所有遗传信息。这些碱基的任意一个排列可以叫做一个k-mer,其中k指的是这个排列的长度,例如 AATT就是一个4-mer。在分析海量的NGS数据时,有时会需要统计序列里的k-mer分布情况来估计测序数据的质量和覆盖度,检测基因组的重复度等等。
Overrepresented Kmers
如果某k个bp的短序列在reads中大量出现,其频率高于统计期望的话,fastqc将其记为over-represented k-mer。默认的k = 5,可以用-k --kmers选项来调节,范围是2-10。出现频率总体上3倍于期望或是在某位置上5倍于期望的k-mer被认为是over-represented。fastqc除了列出所有over-represented k-mers,还会把前6个的per base distribution画出来。
当有出现频率总体上3倍于期望或是在某位置上5倍于期望的k-mer时,报"黄色!";
当有出现频率在某位置上10倍于期望的k-mer时报"红色×"。
# 安装:
install: Installing and running khmer — khmer 3.0.0a1+98.gfe0ce11 documentation
sudo apt-get install python3-dev python3-venv build-essential
conda create --name py3.6 python=3.6
source activate py3.6
pip install khmer
# 使用:
ref: khmer’s command-line interface — khmer 3.0.0a1+98.gfe0ce11 documentation
# 对质控前后的数据统计单端丰度距离
abundance-dist-single.py -M 1000M -k 21 SRR1976948_1.fastq.gz SRR1976948_1.fastq.gz.dist
abundance-dist-single.py -M 1000M -k 21 SRR1976948_1.qc.fq.gz SRR1976948_1.qc.fq.gz.dist
# 只对高覆盖度中的低丰度kmer剪切(更可能是测序错误);低覆盖度保留
interleave-reads.py SRR1976948_1.paired.fq.gz SRR1976948_2.paired.fq.gz | trim-low-abund.py -V -M 1000M -C 3 -Z 10 - -o SRR1976948.trim.fq
-V Only trim low-abundance k-mers from sequences that have high coverage.
-C --cutoff remove k-mers below this abundance
-Z --trim-at-coverage trim reads when entire read above this coverage

# 查看质控后的结果:
unique-kmers.py SRR1976948_1.paired.fq.gz SRR1976948_2.paired.fq.gz
Total Sequences 885734
Total Sequences 885734
Estimated number of unique 32-mers in SRR1976948_1.qc.fq.gz: 65344914
Estimated number of unique 32-mers in SRR1976948_2.qc.fq.gz: 85395776
Total estimated number of unique 32-mers: 112,758,982
unique-kmers.py SRR1976948.trim.fq
Total Sequences 1757752
Total estimated number of unique 32-mers: 101,285,633
结果: 只经过了简单的尾部过滤,k-mer的数量减少了10%以上,对下游分析的准确度和速度都非常有帮助
# 如果不对R1R2的fastq进行交叉合并呢?
trim-low-abund.py -V -M 1000M -C 3 -Z 10 SRR1976948_1.paired.fq.gz -o SRR1976948_1.trim.fq
trim-low-abund.py -V -M 1000M -C 3 -Z 10 SRR1976948_2.paired.fq.gz -o SRR1976948_2.trim.fq
unique-kmers.py SRR1976948_1.paired.fq.gz SRR1976948_2.paired.fq.gz
Estimated number of unique 32-mers in SRR1976948_1.qc.fq.gz: 65344914
Estimated number of unique 32-mers in SRR1976948_2.qc.fq.gz: 85395776
Total estimated number of unique 32-mers: 112,758,982
unique-kmers.py SRR1976948_1.trim.fq SRR1976948_2.trim.fq
Estimated number of unique 32-mers in SRR1976948_1.trim.fq: 63068781
Estimated number of unique 32-mers in SRR1976948_2.trim.fq: 81720503
Total estimated number of unique 32-mers: 106,996,366
结果: 没问题! 对结果影响不是很大
ref: 科学网—[转载]宏基因组实战2. 数据质控fastqc, Trimmomatic, MultiQC, khmer - 刘永鑫的博文
网友评论