美文网首页
Call种内微生物SNP用Parsnp

Call种内微生物SNP用Parsnp

作者: 胡童远 | 来源:发表于2021-11-04 14:17 被阅读0次

    parsnp可进行核心基因组对齐、call核心基因组SNP、构建SNP进化树。

    Github: https://github.com/marbl/parsnp
    Manual: https://harvest.readthedocs.io/en/latest/content/parsnp/tutorial.html

    文章:The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes
    中文:用于快速核心基因组比对和数千种种内微生物基因组可视化的 Harvest 套件
    杂志:Genome Biology
    时间:2014
    被引:913

    文章知识点:
    1 核心基因组是全基因组的自己,这些同源序列可用于多基因组序列对齐
    2 核心基因垂直传递,信噪比更强
    3 核心基因SNP是研究进化关系最可靠的突变
    4 核心基因组SNP是目前研究近源微生物从大标准方法
    5 core-genome SNP typing有三种方法:
    read mapping: 成本最低
    k-mer analyses
    whole-genome alignment:原方法,检测SNP INDEL金标准,依赖组装

    Parsnp原理:
    1 基于suffix graph data structure快速检测maximal unique matches (MUMs)
    MUM是很多pairwise/multiple基因组对齐的基础
    2 Gingr可视化比对结构
    3 输入要求:同亚种或ANI>=97%,基于MUMi distance指定纳入标准
    4 Directed Acyclic Graph (DAG) 数据结构,也称Compressed Suffix Graph (CSG)
    5 multi-MUMs检测locally collinear blocks (LCBs),构建核心基因组对齐的基础
    6 LCB作为anchor,MUSCLE定位gap
    7 以上结果包含核心基因组所有的:SNP, INDEL, 结构变异
    8 确定核心基因组SNP的方法:(1) repetitive sequence; (2) small LCB size; (3) poor alignment quality; (4) poor base quality; and (5) possible recombination。FreeBayes推测碱基质量,PhiPack检测碱基重排。
    9 FastTree2构建进化树

    方法比较:

    引用:
    1 Genome sequence of a clinical Salmonella Enteritidis sequence type 11 strain from South Africa. Journal of Global Antimicrobial Resistance. 2019
    2 OutbreakFinder: a visualization tool for rapid detection of bacterial strain clusters based on optimized multidimensional scaling. PeerJ. 2019

    安装

    conda create -n parsnp
    conda install parsnp
    parsnp --version
    #|--Parsnp 1.5.6--|
    #For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
    #parsnp 1.5.6
    

    使用

    parsnp \
    -c -p 4 -n mafft \
    -r ./T2103096416.fna \
    -d ./input/*.fna \
    -o ./parsnp/
    

    参数
    -d Input genomes
    -g <reference_genbank>
    -r <reference_fasta>
    -o OUTPUT_DIR
    -c, --curated (c)urated genome directory
    --use-fasttree Use fasttree instead of RaxML
    --vcf Generate VCF file.
    -p THREADS, --threads THREADS
    -n {mafft,muscle,fsa,prank}, --alignment-program
    -u, --unaligned Ouput unaligned regions

    过程

    19:51:26 - INFO - <<Parsnp started>>
    19:51:26 - INFO - No genbank file provided for reference annotations, skipping..
    19:51:26 - INFO - Running Parsnp multi-MUM search and libMUSCLE aligner...
    19:51:45 - INFO - Reconstructing core genome phylogeny...
    19:51:45 - INFO - Aligned 4 genomes in 18.35 seconds
    19:51:45 - INFO - Parsnp finished! All output available in ./parsnp/
    

    结果


    tree 树文件
    xmfa 基因组对齐结果
    Ini 运行配置

    xmfa文件格式同mauve对齐结果文件,但丢弃了unalign的序列

    >1:0-109 + cluster58 :p1
    >2:1399845-1399954 + cluster58 s11:p323
    >3:2214877-2214986 - cluster58 s41:p1147
    >4:2220684-2220793 - cluster58 s51:p849
    =
    >1:138-164 + cluster59 s1:p138
    >2:1399997-1400023 + cluster59 s11:p475
    >3:2214808-2214834 - cluster59 s41:p1013
    >4:2220615-2220641 - cluster59 s51:p715
    =
    

    Gingr可视化

    Manual: https://harvest.readthedocs.io/en/latest/content/gingr.html

    只有MAC Linux可视化软件才能用,无命令行模式,使用测试数据直接打开如下:

    Harvest处理parsnap结果

    Manual:https://harvest.readthedocs.io/en/latest/content/harvest/quickstart.html#download-install-run

    下载解压可直接使用

    wget -c https://github.com/marbl/harvest-tools/releases/download/v1.2/harvesttools-Linux64-v1.2.tar.gz/
    tar -zxvf harvesttools-Linux64-v1.2.tar.gz
    /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools -h
    

    参数:

    harvesttools options:
       -i <Gingr input>
       -b <bed filter intervals>,<filter name>,"<description>"
       -B <output backbone intervals>
       -f <reference fasta>
       -F <reference fasta out>
       -g <reference genbank>
       -a <MAF alignment input>
       -m <multi-fasta alignment input>
       -M <multi-fasta alignment output (concatenated LCBs)>
       -n <Newick tree input>
       -N <Newick tree output>
       --midpoint-reroot (reroot the tree at its midpoint after loading)
       -o <Gingr output>
       -S <output for multi-fasta SNPs>
       -u 0/1 (update the branch values to reflect genome length)
       -v <VCF input>
       -V <VCF output>
       -x <xmfa alignment file>
       -X <output xmfa alignment file>
       -h (show this help)
       -q (quiet mode)
    

    利用ggr文件获取snp fasta

    /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools \
    -i parsnp.ggr \
    -S parsnp.snps
    # 过程
    Loading parsnp.ggr...
    Writing parsnp.snps...
    # remove enter
    python3 /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/analysis/oral/hgt/db_blast/trim_enter.py \
    parsnp.snps parsnp.snps2
    

    结果:

    利用ggr文件获取snp vcf

    /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools \
    -i parsnp.ggr \
    -V parsnp.vcf
    

    结果:

    ##FILTER=<ID=IND,Description="Column contains indel">
    ##FILTER=<ID=N,Description="Column contains N">
    ##FILTER=<ID=LCB,Description="LCB smaller than 200bp">
    ##FILTER=<ID=CID,Description="SNP in aligned 100bp window with < 50% column % ID">
    ##FILTER=<ID=ALN,Description="SNP in aligned 100b window with > 20 indels">
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  MF3KT15001B.fna.ref        BF3KT15003B.fna MF3KT15004.fna  PF3KT15003.fna
    MF3KT15001B.Scaf2       48063   ACAGTGGGAT.AACTTACCCG   A       G       40      PASS       NA      GT      0       1       0       0
    MF3KT15001B.Scaf2       48079   CCCGACTGGC.AACAAATTGG   A       G       40      PASS       NA      GT      0       1       0       0
    MF3KT15001B.Scaf6       74973   AGTTGCTTGG.TTTAGAGTTT   T       C       40      PASS       NA      GT      0       1       1       1
    MF3KT15001B.Scaf6       102844  ACATCATGAT.CTTCAAGTTC   C       G       40      PASS       NA      GT      0       1       0       0
    

    更多:
    细菌基因组重测序:使用snippy一条命令从sra到vcf
    如何对细菌基因组做SNP calling

    相关文章

      网友评论

          本文标题:Call种内微生物SNP用Parsnp

          本文链接:https://www.haomeiwen.com/subject/jlmaoltx.html