美文网首页宏基因组
MetaPhlAn3宏基因组物种注释

MetaPhlAn3宏基因组物种注释

作者: 胡童远 | 来源:发表于2021-06-23 16:53 被阅读0次

    相比version2,version3数据库拓了很多,还有3支持自建数据库了。

    Github: https://github.com/biobakery/MetaPhlAn
    Manual: https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-3.0
    Wiki: https://github.com/biobakery/biobakery/wiki/metaphlan3

    MetaPhlAn依赖~1.1M的clade-specific的marker genes,
    mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
    这些marker genes由~100,000 reference genomes (~99,500 bacterial and archaeal and ~500 eukaryotic)确定。该数据可用于物种分类,株水平分析,定量,快速计算。

    安装 - manual 给的方法

    conda create --name mpa -c bioconda python=3.7 metaphlan
    conda activate mpa
    

    不出意外的话,会报错

    undefined symbol: _ZN3tbb10interface78internal15task_arena_base19internal_initializeEv
    

    bowtie2-build-s: symbol lookup error, undefined symbol

    安装 - 针对tbb版本问题

    conda create -n metaphlan python=3.7
    conda activate metaphlan
    conda install tbb=2020.2
    conda install bowtie2
    conda install -c bioconda metaphlan
    metaphlan --version
    # MetaPhlAn version 3.0.13 (27 Jul 2021)
    bowtie2 --version
    # bowtie2-align-s version 2.4.4
    # Compiler: gcc version 9.3.0
    

    这里顺便安装了phylophlan3

    数据库获取

    下载地址:
    Google Driver
    Zenodo

    Google Driver 文件列表:

    bcftools    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACPUjNhnsIIJn16-ww6-MWOa/bcftools?dl=1
    metaphlan2_homebrew_counter.txt https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAD6ImeA91we2nBWBhpfaEOqa/metaphlan2_homebrew_counter.txt?dl=1
    mpa_latest  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAyoJpOgcjop41VIHAGWIVLa/mpa_latest?dl=1
    mpa_v20_m200_marker_info.txt.bz2    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABKU_RAK5yOzyhV27NpOduDa/mpa_v20_m200_marker_info.txt.bz2?dl=1
    mpa_v20_m200.md5    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADS8nukx3dSoiR82OHw6dOka/mpa_v20_m200.md5?dl=1
    mpa_v20_m200.tar    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAASBOj-2gAbA53cV1bXBULYa/mpa_v20_m200.tar?dl=1
    mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADhE2Ur7JqirifOdgi4fGQEa/mpa_v29_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
    mpa_v29_CHOCOPhlAn_201901.md5   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADdxAWsjLPLjy10VICgSAEPa/mpa_v29_CHOCOPhlAn_201901.md5?dl=1
    mpa_v29_CHOCOPhlAn_201901.tar   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAACeczBU6P9lIBD4ZYtxwKva/mpa_v29_CHOCOPhlAn_201901.tar?dl=1
    mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADoxBNynVopWt2shYYKQ2Mba/mpa_v292_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
    mpa_v292_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAApZLRLD0Bkb86bvH-Y3-tUa/mpa_v292_CHOCOPhlAn_201901.md5?dl=1
    mpa_v292_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABK4Zns2PYUh_R-sLGEx_Bza/mpa_v292_CHOCOPhlAn_201901.tar?dl=1
    mpa_v293_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAsg2PN5Ng6uwHnEemlqo3-a/mpa_v293_CHOCOPhlAn_201901.md5?dl=1
    mpa_v293_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABFg8C3gMyNpYNTAY5PcSONa/mpa_v293_CHOCOPhlAn_201901.tar?dl=1
    mpa_v294_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA6KBc-nd8_C4bJOPVjzWW7a/mpa_v294_CHOCOPhlAn_201901.md5?dl=1
    mpa_v294_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAJSiG2qkfNMcYzhvz4Rxsga/mpa_v294_CHOCOPhlAn_201901.tar?dl=1
    mpa_v295_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACIWLuO_0ixZsxiSZZG-4M1a/mpa_v295_CHOCOPhlAn_201901.md5?dl=1
    mpa_v295_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAA0y5qvPjJdxVw86Ovjinipa/mpa_v295_CHOCOPhlAn_201901.tar?dl=1
    mpa_v296_CHOCOPhlAn_201901.md5  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAbRMEnsewv_pHFsLv2Be0va/mpa_v296_CHOCOPhlAn_201901.md5?dl=1
    mpa_v296_CHOCOPhlAn_201901.tar  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AABAD51gd3wr0___2MWO0xD-a/mpa_v296_CHOCOPhlAn_201901.tar?dl=1
    mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2  https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAv_ShZiz7pNTaT_YONJTF7a/mpa_v296_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
    mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAlyQITZuUCtBUJxpxhIroIa/mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2?dl=1
    mpa_v30_CHOCOPhlAn_201901.tar   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AADlxibskzbPHPoDl6S-FyKka/mpa_v30_CHOCOPhlAn_201901.tar?dl=1
    mpa_v30_CHOCOPhlAn_201901.md5   https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACTzoUYDqZps8u2JqWCNCODa/mpa_v30_CHOCOPhlAn_201901.md5?dl=1
    SRS019033.fastq https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AACh1NQExDk39RXzZOyTPmQwa/SRS019033.fastq?dl=1
    strainphlan_homebrew_counter.txt    https://www.dropbox.com/sh/7qze7m7g9fe2xjg/AAAZO5uydQ2QQkOSkYu5DxRha/strainphlan_homebrew_counter.txt?dl=1
    

    Win下载:marker genes数据

    # 解压 15M -> 336M
    bzip2 -d mpa_v30_CHOCOPhlAn_201901_marker_info.txt.bz2
    less -S mpa_v30_CHOCOPhlAn_201marker_info.txt | wc -l
    # 1132198 # 113万行
    

    一个物种有多个marker genes,因此总基因大于物种数

    # 剖开看看
    39444__GeneID:11747987  {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le
    39444__GeneID:11747985  {'ext': [], 'score': 2.0, 'clade': 's__Cotia_virus', 'le
    39444__GeneID:11747982  {'ext': ['PRJNA15142', 'PRJNA14595', 'PRJNA20981'], 'sco
    39444__GeneID:11747981  {'ext': [], 'score': 0.0, 'clade': 's__Cotia_virus', 'le
    

    bzip2压缩了20倍,牛皮呀呀呀呀呀呀

    Win下载:基因组数据

    tar -xvf mpa_v30_CHOCOPhlAn_201901.tar
    rm mpa_v30_CHOCOPhlAn_201901.tar
    bzip2 -d mpa_v30_CHOCOPhlAn_201901.fna.bz2
    
    # 敲开kankan>1000373__GeneID:11569613
    less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | head
    # 是基因的序列信息
    >1000373__GeneID:11604940
    >1000373__GeneID:11604942
    >1000373__GeneID:11604944
    >100053__V6HSA9__LEP1GSC062_4244 UniRef90_V6HSA9;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
    >100053__V6HSV5__LEP1GSC062_0144 UniRef90_V6HSV5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
    >100053__V6HSX2__LEP1GSC062_1624 UniRef90_V6HSX2;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
    >100053__V6HT46__LEP1GSC062_1702 UniRef90_V6HT46;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
    >100053__V6HT56__LEP1GSC062_1778 UniRef90_V6HT56;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
    >100053__V6HTZ5__LEP1GSC062_0133 UniRef90_V6HTZ5;k__Bacteria|p__Spirochaetes|c__Spirochaetia|o__Spirochaetia_unclassified|f__Leptospiraceae|g__Leptospira|s__Leptospira_alexanderi;GCA_000243815
    
    less -S mpa_v30_CHOCOPhlAn_201901.fna | grep "^>" | awk -F" " '{print $2}' | sed -e '/^$/d' | awk -F";" '{print $2}' | uniq  | wc -l
    # 10164 行(物种)
    

    bowtie2建数据库索引

    bowtie2-build --threads 8 \
    -f mpa_v30_CHOCOPhlAn_201901.fna \
    mpa_v30_CHOCOPhlAn_201901
    

    metaphlan3注释物种

    shell脚本:

    #!/usr/bin/env bash
    # 帮助文档
    helpdoc(){
        cat <<EOF
    Description:
    
        This is a help document
        - Plot circos
    
    Usage:
    
        $0 -i <input file> -o <output file>
    
    Option:
    
        -i    this is a input file/fold
        -o    this is a output file/fold
    EOF
    }
    
    # 参数传递
    while getopts ":i:o:" opt
    do
        case $opt in
            i)
            infile=`echo $OPTARG`
            ;;
            o)
            outfile=`echo $OPTARG`
            ;;
            ?)
            echo "未知参数"
            exit 1;;
        esac
    done
    
    # 若无指定任何参数则输出帮助文档
    if [ $# = 0 ]
    then
        helpdoc
        exit 1
    fi
    
    # 核心
    source /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/software/miniconda3/etc/profile.d/conda.sh
    conda activate metaphlan
    
    route="/hwfssz5/ST_META/TMP1T.2/soil"
    metaphlan3db="/hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/database/humann3/metaphlan_v30/"
    
    mkdir $route/Result/metaphlan3/$infile
    cat $route/Result/kneaddata/$infile/${infile}_1_kneaddata.repeats.removed.*.fastq > $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq
    
    metaphlan $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq \
    --input_type fastq \
    --bowtie2db $metaphlan3db \
    -x mpa_v30_CHOCOPhlAn_201901 \
    --nproc 8 \
    -o $route/Result/metaphlan3/$infile/${infile}_metaphlan3.out
    
    rm $route/Result/kneaddata/$infile/${infile}_1_kneaddata.merge.fastq
    

    --input_type {fastq,fasta,bowtie2out,sam}
    --bowtie2db METAPHLAN_BOWTIE2_DB
    -x INDEX Specify the id of the database version to use
    --bt2_ps BowTie2 presets [default very-sensitive]
    --add_viruses Allow the profiling of viral organisms
    --ignore_eukaryotes Do not profile eukaryotic organisms
    --ignore_bacteria Do not profile bacterial organisms
    --ignore_archaea Do not profile archeal organisms
    --nproc N The number of CPUs to use for parallelizing the mapping [default 4]

    qsub提交任务:

    route="/hwfssz5/ST_META/TMP1T.2/soil"
    qsub -cwd -l vf=80G,p=8 -q st_supermem.q -P st_supermem -binding linear:8 $route/script/q_metaphlan3.sh -i 49
    

    结果:

    样本合并

    conda activate metaphlan
    python /route/metaphlan/bin/merge_metaphlan_tables.py \
    -o ./abundance/abundance_merged.txt \
    ./abundance/*.out
    

    unifrac样本距离

    下载metaphlan nwk文件:
    https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk

    R安装依赖:

    conda activate r411
    conda install r-vegan r-ape r-rbiom
    

    获取计算R脚本:
    https://github.com/biobakery/MetaPhlAn/blob/master/metaphlan/utils/calculate_unifrac.R

    计算距离:

    Rscript $unifrac_route/calculate_unifrac.R \
    abundance_merged.txt \
    $nwk_route/mpa_v30_CHOCOPhlAn_201901_species_tree.nwk \
    abundance_merged_unifrac_profiles.txt
    

    警告nwk信息不全

    WARNING: 4 species not present in the tree were removed from the input profile.
    Nitrolancea_hollandica
    
    Singulisphaera_acidiphila
    
    Leptothrix_ochracea
    
    Sutterella_parvirubra
    

    hclust2.py画热图

    安装:

    conda activate -c bioconda hclust2
    

    运行:

    hclust2.py \
      -i abundance_merged.txt \
      -o hclust.sqrt_scale.png
    

    热图太惨了,不要了

    更多:
    MetaPhlAn3安装及使用

    相关文章

      网友评论

        本文标题:MetaPhlAn3宏基因组物种注释

        本文链接:https://www.haomeiwen.com/subject/ywjbyltx.html