相比version2,version3数据库拓了很多,还有3支持自建数据库了。
Github: https://github.com/biobakery/MetaPhlAn
Manual: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0
Wiki: https://github.com/biobakery/biobakery/wiki/metaphlan3
MetaPhlAn依赖~1.1M的clade-specific的marker genes,
mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
这些marker genes由~100,000 reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic)确定。该数据可用于物种分类,株水平分析,定量,快速计算。
安装 - manual 给的方法
conda create --name mpa -c bioconda python=3.7 metaphlan
conda activate mpa
不出意外的话,会报错
undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv
bowtie2-build-s: symbol lookup error, undefined symbol
安装 - 针对tbb版本问题
conda create -n metaphlan python=3.7
conda activate metaphlan
conda install tbb=2020.2
conda install bowtie2
conda install -c bioconda metaphlan
metaphlan --version
# MetaPhlAn version 3.0.13 (27 Jul 2021)
bowtie2 --version
# bowtie2-align-s version 2.4.4
# Compiler: gcc version 9.3.0
这里顺便安装了phylophlan3
数据库获取
下载地址:
Google Driver
Zenodo
Google Driver 文件列表:
bcftools https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACPUjNhnsIIJn16-ww6-MWOa/bcftools?dl=1
metaphlan2_homebrew_counter.txt https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAD6ImeA91we2nBWBhpfaEOqa/metaphlan2_homebrew_counter.txt?dl=1
mpa_latest https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAyoJpOgcjop41VIHAGWIVLa/mpa_latest?dl=1
mpa_v20_m200_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABKU_RAK5yOzyhV27NpOduDa/mpa_v20_m200_marker_info.txt.bz2?dl=1
mpa_v20_m200.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADS8nukx3dSoiR82OHw6dOka/mpa_v20_m200.md5?dl=1
mpa_v20_m200.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAASBOj-2gAbA53cV1bXBULYa/mpa_v20_m200.tar?dl=1
mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADhE2Ur7JqirifOdgi4fGQEa/mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v29_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADdxAWsjLPLjy10VICgSAEPa/mpa_v29_CHOCOPhlAn_201901.md5?dl=1
mpa_v29_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAACeczBU6P9lIBD4ZYtxwKva/mpa_v29_CHOCOPhlAn_201901.tar?dl=1
mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADoxBNynVopWt2shYYKQ2Mba/mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v292_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAApZLRLD0Bkb86bvH-Y3-tUa/mpa_v292_CHOCOPhlAn_201901.md5?dl=1
mpa_v292_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABK4Zns2PYUh_R-sLGEx_Bza/mpa_v292_CHOCOPhlAn_201901.tar?dl=1
mpa_v293_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAsg2PN5Ng6uwHnEemlqo3-a/mpa_v293_CHOCOPhlAn_201901.md5?dl=1
mpa_v293_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABFg8C3gMyNpYNTAY5PcSONa/mpa_v293_CHOCOPhlAn_201901.tar?dl=1
mpa_v294_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA6KBc-nd8_C4bJOPVjzWW7a/mpa_v294_CHOCOPhlAn_201901.md5?dl=1
mpa_v294_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAJSiG2qkfNMcYzhvz4Rxsga/mpa_v294_CHOCOPhlAn_201901.tar?dl=1
mpa_v295_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACIWLuO_0ixZsxiSZZG-4M1a/mpa_v295_CHOCOPhlAn_201901.md5?dl=1
mpa_v295_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA0y5qvPjJdxVw86Ovjinipa/mpa_v295_CHOCOPhlAn_201901.tar?dl=1
mpa_v296_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAbRMEnsewv_pHFsLv2Be0va/mpa_v296_CHOCOPhlAn_201901.md5?dl=1
mpa_v296_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABAD51gd3wr0___2MWO0xD-a/mpa_v296_CHOCOPhlAn_201901.tar?dl=1
mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAv_ShZiz7pNTaT_YONJTF7a/mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAlyQITZuUCtBUJxpxhIroIa/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
mpa_v30_CHOCOPhlAn_201901.tar https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADlxibskzbPHPoDl6S-FyKka/mpa_v30_CHOCOPhlAn_201901.tar?dl=1
mpa_v30_CHOCOPhlAn_201901.md5 https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACTzoUYDqZps8u2JqWCNCODa/mpa_v30_CHOCOPhlAn_201901.md5?dl=1
SRS019033.fastq https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACh1NQExDk39RXzZOyTPmQwa/SRS019033.fastq?dl=1
strainphlan_homebrew_counter.txt https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAZO5uydQ2QQkOSkYu5DxRha/strainphlan_homebrew_counter.txt?dl=1
Win下载:marker genes数据
# 解压 15M -> 336M
bzip2 -d mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
less -S mpa_v30_CHOCOPhlAn_201marker_info.txt | wc -l
# 1132198 # 113万行
一个物种有多个marker genes,因此总基因大于物种数
# 剖开看看
39444__GeneID:11747987 {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le
39444__GeneID:11747985 {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le
39444__GeneID:11747982 {'ext': ['PRJNA15142', 'PRJNA14595', 'PRJNA20981'], 'sco
39444__GeneID:11747981 {'ext': [], 'score': 0.0, 'clade': 's__Cotia_virus', 'le
bzip2压缩了20倍,牛皮呀呀呀呀呀呀
Win下载:基因组数据
tar -xvf mpa_v30_CHOCOPhlAn_201901.tar
rm mpa_v30_CHOCOPhlAn_201901.tar
bzip2 -d mpa_v30_CHOCOPhlAn_201901.fna.bz2
# 敲开kankan>1000373__GeneID:11569613
less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | head
# 是基因的序列信息
>1000373__GeneID:11604940
>1000373__GeneID:11604942
>1000373__GeneID:11604944
>100053__V6HSA9__LEP1GSC062_4244 UniRef90_V6HSA9;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HSV5__LEP1GSC062_0144 UniRef90_V6HSV5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HSX2__LEP1GSC062_1624 UniRef90_V6HSX2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HT46__LEP1GSC062_1702 UniRef90_V6HT46;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HT56__LEP1GSC062_1778 UniRef90_V6HT56;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
>100053__V6HTZ5__LEP1GSC062_0133 UniRef90_V6HTZ5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | awk -F" " '{print $2}' | sed -e '/^$/d' | awk -F";" '{print $2}' | uniq | wc -l
# 10164 行(物种)
bowtie2建数据库索引
bowtie2-build --threads 8 \
-f mpa_v30_CHOCOPhlAn_201901.fna \
mpa_v30_CHOCOPhlAn_201901
metaphlan3注释物种
shell脚本:
#!/usr/bin/env bash
# 帮助文档
helpdoc(){
cat <<EOF
Description:
This is a help document
- Plot circos
Usage:
$0 -i <input file> -o <output file>
Option:
-i this is a input file/fold
-o this is a output file/fold
EOF
}
# 参数传递
while getopts ":i:o:" opt
do
case $opt in
i)
infile=`echo $OPTARG`
;;
o)
outfile=`echo $OPTARG`
;;
?)
echo "未知参数"
exit 1;;
esac
done
# 若无指定任何参数则输出帮助文档
if [ $# = 0 ]
then
helpdoc
exit 1
fi
# 核心
source /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
conda activate metaphlan
route="/hwfssz5/ST_META/TMP1T.2/soil"
metaphlan3db="/hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/metaphlan_v30/"
mkdir $route/Result/metaphlan3/$infile
cat $route/Result/kneaddata/$infile/${infile}_1_kneaddata.repeats.removed.*.fastq > $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq
metaphlan $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq \
--input_type fastq \
--bowtie2db $metaphlan3db \
-x mpa_v30_CHOCOPhlAn_201901 \
--nproc 8 \
-o $route/Result/metaphlan3/$infile/${infile}_metaphlan3.out
rm $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq
--input_type {fastq,fasta,bowtie2out,sam}
--bowtie2db METAPHLAN_BOWTIE2_DB
-x INDEX Specify the id of the database version to use
--bt2_ps BowTie2 presets [default very-sensitive]
--add_viruses Allow the profiling of viral organisms
--ignore_eukaryotes Do not profile eukaryotic organisms
--ignore_bacteria Do not profile bacterial organisms
--ignore_archaea Do not profile archeal organisms
--nproc N The number of CPUs to use for parallelizing the mapping [default 4]
qsub提交任务:
route="/hwfssz5/ST_META/TMP1T.2/soil"
qsub -cwd -l vf=80G,p=8 -q st_supermem.q -P st_supermem -binding linear:8 $route/script/q_metaphlan3.sh -i 49
结果:
样本合并
conda activate metaphlan
python /route/metaphlan/bin/merge_metaphlan_tables.py \
-o ./abundance/abundance_merged.txt \
./abundance/*.out
unifrac样本距离
下载metaphlan nwk文件:
https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk
R安装依赖:
conda activate r411
conda install r-vegan r-ape r-rbiom
获取计算R脚本:
https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/calculate_unifrac.R
计算距离:
Rscript $unifrac_route/calculate_unifrac.R \
abundance_merged.txt \
$nwk_route/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk \
abundance_merged_unifrac_profiles.txt
警告
nwk信息不全
WARNING: 4 species not present in the tree were removed from the input profile.
Nitrolancea_hollandica
Singulisphaera_acidiphila
Leptothrix_ochracea
Sutterella_parvirubra
hclust2.py画热图
安装:
conda activate -c bioconda hclust2
运行:
hclust2.py \
-i abundance_merged.txt \
-o hclust.sqrt_scale.png
热图太惨了,不要了
更多:
MetaPhlAn3安装及使用
网友评论