Assessing Genome Assembly and Completeness with BUSCO

Author: Hu Tongyuan (胡童远) | Published: 2021-05-18 15:32

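    BUSCO stands for Benchmarking Universal Single-Copy Orthologs. It is an open-source Python tool that assesses the completeness of genome assemblies and annotations. The assessment is reference-based: the input is compared against sets of single-copy orthologs that, on evolutionary grounds, are expected to be present in the lineage.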

    Literature:
    Paper: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015
    Citations: 4695 (at the time of writing)
    Book chapter: BUSCO: Assessing Genome Assembly and Annotation Completeness. Methods in Molecular Biology 2019

    Abstract:
    Motivation: Genomics has revolutionized biological research, but quality assessment of the resulting assembled sequences is complicated and remains mostly limited to technical measures like N50.
    Results: We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with sets of Benchmarking Universal Single-Copy Orthologs, named BUSCO.
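    In short: there are few methods for assessing genome assemblies beyond technical metrics such as N50, and BUSCO is open source and easy to use.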

    Methods:
    Website: https://busco.ezlab.org/
    Manual: https://busco.ezlab.org/busco_userguide.html

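    Installing with conda:
    Package page: https://anaconda.org/bioconda/busco
    Any one of the following commands will do, though they may install an older release such as v4.1.2. Note that the bioconda/label/broken channel holds packages that have been flagged as broken, so the first command is the safer choice.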

    conda install -c bioconda busco
    conda install -c bioconda/label/broken busco
    conda install -c bioconda/label/cf201901 busco 
    

    To install the latest version from bioconda (v5.1.2 at the time of writing), follow the manual:

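    # Check the current configuration and add the conda-forge channel if it is missing
    conda config --show 
    conda config --add channels conda-forge
    
    # Install with conda into a dedicated environment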
    conda create -n busco
    conda activate busco
    conda install -c bioconda -c conda-forge busco=5.1.2
    busco --help
    busco --version
    # BUSCO 5.1.2
    

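    Databases:
    Many other tutorials download the plant reference dataset, but that link no longer seems to work:

    # BUSCO dataset for plants (legacy odb9 link)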
    wget -c https://busco.ezlab.org/datasets/embryophyta_odb9.tar.gz
    

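    OrthoDB (https://www.orthodb.org/?page=filelist) also appears to host a lot of data.

    The manual gives the source of the lineage datasets, https://busco-data.ezlab.org/v5/data/, which is indeed the latest v5 database.

    Its lineages directory, https://busco-data.ezlab.org/v5/data/lineages/, lists datasets for many taxonomic groups, with files updated as recently as this month (May 2021). Pick whichever fits your organism; here I download bacteria_odb10 and unpack it: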

    wget -c https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2020-03-06.tar.gz
    tar -zxvf bacteria_odb10.2020-03-06.tar.gz
    cd bacteria_odb10
    

    Using BUSCO:

    Automated lineage selection mode, from the manual:

    # Consider all domains
    busco -m MODE -i INPUT -o OUTPUT --auto-lineage
    # Prokaryotes only: ignores eukaryotes to save runtime, if compatible with your experimental goal
    busco -m MODE -i INPUT -o OUTPUT --auto-lineage-prok
    # Eukaryotes only: ignores non-eukaryotes to save runtime, if compatible with your experimental goal
    busco -m MODE -i INPUT -o OUTPUT --auto-lineage-euk
    

    Targeted lineage mode, which the manual recommends:

    db_busco="/database/BUSCO/bacteria_odb10"
    busco --in AF04-12.fna \
    --lineage_dataset $db_busco \
    --out ./output/ \
    -m genome --offline
    

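    This run fails with an error.

    As the error message says, the value given to --out must not contain a slash. Writing to another location would require editing the configuration file, which is best avoided to be safe. Removing the slashes makes the run succeed. For batch processing, simply move into a newly created folder for each run and back out again.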

    busco --in AF04-12.fna \
    --lineage_dataset $db_busco \
    --out output \
    -m genome --offline
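
The batch approach described above (enter a results folder, run BUSCO with a slash-free --out, leave again) might be sketched as follows. This is only an illustration under stated assumptions: the function name run_busco_batch, the BUSCO_CMD override and the layout (one <id>.fna per sample in a genome folder) are hypothetical and not part of the original workflow; the BUSCO options are the ones used in the command above.

```shell
#!/usr/bin/env bash
# Sketch: run BUSCO once per sample from inside a results folder, so that
# --out is always a plain name without slashes.
# Hypothetical layout: <genome_dir>/<id>.fna as input, <work_root>/<id>/ as output.

run_busco_batch() {
    local id_list=$1      # text file with one sample ID per line
    local genome_dir=$2   # folder holding the <id>.fna assemblies
    local lineage=$3      # path to an unpacked lineage dataset
    local work_root=$4    # folder in which BUSCO creates one subfolder per sample
    local busco_cmd=${BUSCO_CMD:-busco}   # set BUSCO_CMD=echo for a dry run

    mkdir -p "$work_root"
    # Resolve absolute paths, because the working directory changes below.
    genome_dir=$(cd "$genome_dir" && pwd) || return 1
    lineage=$(cd "$lineage" && pwd) || return 1

    local id
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue          # skip blank lines
        (
            # The subshell enters the folder and leaves it again when it exits.
            cd "$work_root" || exit 1
            "$busco_cmd" --in "$genome_dir/$id.fna" \
                --lineage_dataset "$lineage" \
                --out "$id" -m genome --offline < /dev/null
        ) || echo "BUSCO failed for $id" >&2
    done < "$id_list"
}
```

A call such as run_busco_batch 76_strain_id.list genomes /database/BUSCO/bacteria_odb10 BUSCO/illumina would then leave each result under BUSCO/illumina/<id>/; the genomes folder name here is a placeholder.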
    

    Results:

    full_table.tsv

    # BUSCO version is: 5.1.2
    # The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
    # Busco id      Status  Sequence        Gene Start      Gene End        Strand  Score   Length  OrthoDB url     Description
    4421at2 Complete        AF04-12.Scaf40_36       46725   51011   +       1675.3  1205    https://www.orthodb.org/v10?query=4421at2        DNA-directed RNA polymerase subunit beta'
    9601at2 Complete        AF04-12.Scaf40_35       42874   46686   +       1169.7  804     https://www.orthodb.org/v10?query=9601at2        DNA-directed RNA polymerase subunit beta
    26038at2        Complete        AF04-12.Scaf8_42        54773   58477   +       212.5   371     https://www.orthodb.org/v10?query=26038at2      phosphoribosylformylglycinamidine synthase
    91428at2        Complete        AF04-12.Scaf45_20       22437   25052   +       540.6   530     https://www.orthodb.org/v10?query=91428at2      alanine--tRNA ligase
    95696at2        Complete        AF04-12.Scaf4_63        73584   75617   +       714.7   504     https://www.orthodb.org/v10?query=95696at2      excinuclease ABC subunit B
    143460at2       Complete        AF04-12.Scaf1_51        58613   60415   +       512.5   441     https://www.orthodb.org/v10?query=143460at2     GTP-binding protein
    182107at2       Complete        AF04-12.Scaf17_16       11979   13760   +       709.2   491     https://www.orthodb.org/v10?query=182107at2     elongation factor 4
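
For a quick cross-check of a table like this against the summary file, the rows can be tallied by status. The function below is a minimal sketch, not part of BUSCO: the name count_busco_status is made up, and it assumes that the real file is tab-separated and that a duplicated BUSCO appears on one row per copy, so each BUSCO id is counted only once per status.

```shell
#!/usr/bin/env bash
# Sketch: tally BUSCO ids per status from a full_table.tsv.
# Assumption: tab-separated columns, column 1 = BUSCO id, column 2 = status.

count_busco_status() {
    local full_table=$1
    awk -F'\t' '
        /^#/ || NF < 2 { next }            # skip header comments and empty lines
        !seen[$1 FS $2]++ { n[$2]++ }      # count each (id, status) pair once
        END { for (s in n) print s "\t" n[s] }
    ' "$full_table" | LC_ALL=C sort
}
```

Calling count_busco_status full_table.tsv prints one line per status with its count, sorted by status name.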
    

    missing_busco_list.tsv

    POG091H008J
    POG091H00BL
    POG091H00TK
    ... (illustrative only: no BUSCOs are missing in this run, so the list is actually empty)
    

    short_summary.txt

    # BUSCO version is: 5.1.2
    # The lineage dataset is: bacteria_odb10 (Creation date: 2020-03-06, number of genomes: 4085, number of BUSCOs: 124)
    # Summarized benchmarking in BUSCO notation for file /hwfssz5/ST_META/P18Z10200N0423_ZYQ/MiceGutProject/hutongyuan/analysis/platform/test/AF04-12.fna
    # BUSCO was run in mode: genome
    # Gene predictor used: prodigal
    
            ***** Results: *****
    
            C:100.0%[S:97.6%,D:2.4%],F:0.0%,M:0.0%,n:124
            124     Complete BUSCOs (C)
            121     Complete and single-copy BUSCOs (S)
            3       Complete and duplicated BUSCOs (D)
            0       Fragmented BUSCOs (F)
            0       Missing BUSCOs (M)
            124     Total BUSCO groups searched
    
    Dependencies and versions:
            hmmsearch: 3.1
            prodigal: 2.6.3
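
The one-line BUSCO notation in this summary (C, S, D, F and M percentages plus n) is convenient to extract for downstream tables. Here is a small sketch using sed; the function name parse_busco_notation is hypothetical, and the pattern assumes the line has exactly the layout shown above.

```shell
#!/usr/bin/env bash
# Sketch: extract the numbers from a line such as
#   C:100.0%[S:97.6%,D:2.4%],F:0.0%,M:0.0%,n:124
# Reads a short summary on stdin and prints "C S D F M n", space-separated.

parse_busco_notation() {
    sed -n -E 's/.*C:([0-9.]+)%\[S:([0-9.]+)%,D:([0-9.]+)%\],F:([0-9.]+)%,M:([0-9.]+)%,n:([0-9]+).*/\1 \2 \3 \4 \5 \6/p'
}
```

Usage would look like parse_busco_notation < short_summary.txt; lines that do not match the notation are ignored.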
    

    Merging BUSCO results:

    ## Summarize the BUSCO results of all strains in one table
    task="illumina"
    # Write the header with ">" so that a rerun does not append a second header
    echo -e "id\tc\ts\td\tf\tm" > BUSCO/${task}_busco.txt
    
    for i in $(cat 76_strain_id.list)
    do
        summary="BUSCO/$task/$i/run_bacteria_odb10/short_summary.txt"
        # The count is the first field of each matching line
        c=$(grep "Complete BUSCOs" "$summary" | awk '{print $1}')
        s=$(grep "Complete and single-copy BUSCOs" "$summary" | awk '{print $1}')
        d=$(grep "Complete and duplicated BUSCOs" "$summary" | awk '{print $1}')
        f=$(grep "Fragmented BUSCOs" "$summary" | awk '{print $1}')
        m=$(grep "Missing BUSCOs" "$summary" | awk '{print $1}')
        echo -e "$i\t$c\t$s\t$d\t$f\t$m" >> BUSCO/${task}_busco.txt
        echo -e "\033[32m $i done... \033[0m"
    done
    

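    Visualization:
    This step needs the plotting script that ships with BUSCO. The official procedure is shown here; you could also put together something of your own. I did not run this step myself.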

    cp XX1/short_summary.*.lineage_odb10.XX1.txt BUSCO_summaries/.
    cp XX2/short_summary.*.lineage_odb10.XX2.txt BUSCO_summaries/.
    cp XX3/short_summary.*.lineage_odb10.XX3.txt BUSCO_summaries/.
    
    python3 scripts/generate_plot.py -wd BUSCO_summaries
    python3 scripts/generate_plot.py -wd /full/path/to/my/folder/BUSCO_summaries
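
With many samples, the cp lines above become repetitive. A loop along these lines could gather the summaries before plotting. It is a rough sketch: the function name collect_busco_summaries is made up, and it relies only on the short_summary.*.txt naming pattern visible in the cp commands above.

```shell
#!/usr/bin/env bash
# Sketch: copy every short_summary.*.txt found directly inside the given
# BUSCO output folders into one destination folder, and print how many
# files were copied.

collect_busco_summaries() {
    local dest=$1; shift              # remaining arguments are BUSCO output folders
    mkdir -p "$dest"
    local run_dir f copied=0
    for run_dir in "$@"; do
        for f in "$run_dir"/short_summary.*.txt; do
            [ -e "$f" ] || continue   # the glob matched nothing in this folder
            cp "$f" "$dest/" && copied=$((copied + 1))
        done
    done
    echo "$copied"
}
```

For example, collect_busco_summaries BUSCO_summaries XX1 XX2 XX3 would prepare the folder that is then passed to generate_plot.py.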
    

    Further reading:
    BUSCO - assembly quality assessment
