parsnp可进行核心基因组对齐、call核心基因组SNP、构建SNP进化树。
Github: https://github.com/marbl/parsnp
Manual: https://harvest.readthedocs.io/en/latest/content/parsnp/tutorial.html
文章:The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes
中文:用于快速核心基因组比对和数千种种内微生物基因组可视化的 Harvest 套件
杂志:Genome Biology
时间:2014
被引:913
文章知识点:
1 核心基因组是全基因组的自己,这些同源序列可用于多基因组序列对齐
2 核心基因垂直传递,信噪比更强
3 核心基因SNP是研究进化关系最可靠的突变
4 核心基因组SNP是目前研究近源微生物从大标准方法
5 core-genome SNP typing有三种方法:
read mapping: 成本最低
k-mer analyses
whole-genome alignment:原方法,检测SNP INDEL金标准,依赖组装
Parsnp原理:
1 基于suffix graph data structure快速检测maximal unique matches (MUMs)
MUM是很多pairwise/multiple基因组对齐的基础
2 Gingr可视化比对结构
3 输入要求:同亚种或ANI>=97%,基于MUMi distance指定纳入标准
4 Directed Acyclic Graph (DAG) 数据结构,也称Compressed Suffix Graph (CSG)
5 multi-MUMs检测locally collinear blocks (LCBs),构建核心基因组对齐的基础
6 LCB作为anchor,MUSCLE定位gap
7 以上结果包含核心基因组所有的:SNP, INDEL, 结构变异
8 确定核心基因组SNP的方法:(1) repetitive sequence; (2) small LCB size; (3) poor alignment quality; (4) poor base quality; and (5) possible recombination。FreeBayes推测碱基质量,PhiPack检测碱基重排。
9 FastTree2构建进化树
方法比较:
引用:
1 Genome sequence of a clinical Salmonella Enteritidis sequence type 11 strain from South Africa. Journal of Global Antimicrobial Resistance. 2019
2 OutbreakFinder: a visualization tool for rapid detection of bacterial strain clusters based on optimized multidimensional scaling. PeerJ. 2019
安装
conda create -n parsnp
conda install parsnp
parsnp --version
#|--Parsnp 1.5.6--|
#For detailed documentation please see --> http://harvest.readthedocs.org/en/latest
#parsnp 1.5.6
使用
parsnp \
-c -p 4 -n mafft \
-r ./T2103096416.fna \
-d ./input/*.fna \
-o ./parsnp/
参数
-d
Input genomes
-g
<reference_genbank>
-r
<reference_fasta>
-o
OUTPUT_DIR
-c
, --curated (c)urated genome directory
--use-fasttree
Use fasttree instead of RaxML
--vcf
Generate VCF file.
-p
THREADS, --threads THREADS
-n
{mafft,muscle,fsa,prank}, --alignment-program
-u
, --unaligned Ouput unaligned regions
过程
19:51:26 - INFO - <<Parsnp started>>
19:51:26 - INFO - No genbank file provided for reference annotations, skipping..
19:51:26 - INFO - Running Parsnp multi-MUM search and libMUSCLE aligner...
19:51:45 - INFO - Reconstructing core genome phylogeny...
19:51:45 - INFO - Aligned 4 genomes in 18.35 seconds
19:51:45 - INFO - Parsnp finished! All output available in ./parsnp/
结果
tree 树文件
xmfa 基因组对齐结果
Ini 运行配置
xmfa文件格式同mauve对齐结果文件,但丢弃了unalign的序列
>1:0-109 + cluster58 :p1
>2:1399845-1399954 + cluster58 s11:p323
>3:2214877-2214986 - cluster58 s41:p1147
>4:2220684-2220793 - cluster58 s51:p849
=
>1:138-164 + cluster59 s1:p138
>2:1399997-1400023 + cluster59 s11:p475
>3:2214808-2214834 - cluster59 s41:p1013
>4:2220615-2220641 - cluster59 s51:p715
=
Gingr可视化
Manual: https://harvest.readthedocs.io/en/latest/content/gingr.html
只有MAC Linux可视化软件才能用,无命令行模式,使用测试数据直接打开如下:
Harvest处理parsnap结果
Manual:https://harvest.readthedocs.io/en/latest/content/harvest/quickstart.html#download-install-run
下载解压可直接使用
wget -c https://github.com/marbl/harvest-tools/releases/download/v1.2/harvesttools-Linux64-v1.2.tar.gz/
tar -zxvf harvesttools-Linux64-v1.2.tar.gz
/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools -h
参数:
harvesttools options:
-i <Gingr input>
-b <bed filter intervals>,<filter name>,"<description>"
-B <output backbone intervals>
-f <reference fasta>
-F <reference fasta out>
-g <reference genbank>
-a <MAF alignment input>
-m <multi-fasta alignment input>
-M <multi-fasta alignment output (concatenated LCBs)>
-n <Newick tree input>
-N <Newick tree output>
--midpoint-reroot (reroot the tree at its midpoint after loading)
-o <Gingr output>
-S <output for multi-fasta SNPs>
-u 0/1 (update the branch values to reflect genome length)
-v <VCF input>
-V <VCF output>
-x <xmfa alignment file>
-X <output xmfa alignment file>
-h (show this help)
-q (quiet mode)
利用ggr文件获取snp fasta
/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools \
-i parsnp.ggr \
-S parsnp.snps
# 过程
Loading parsnp.ggr...
Writing parsnp.snps...
# remove enter
python3 /hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/analysis/oral/hgt/db_blast/trim_enter.py \
parsnp.snps parsnp.snps2
结果:
利用ggr文件获取snp vcf
/hwfssz1/ST_HEALTH/P18Z10200N0423/hutongyuan/softwares/harvesttools-Linux64-v1.2/harvesttools \
-i parsnp.ggr \
-V parsnp.vcf
结果:
##FILTER=<ID=IND,Description="Column contains indel">
##FILTER=<ID=N,Description="Column contains N">
##FILTER=<ID=LCB,Description="LCB smaller than 200bp">
##FILTER=<ID=CID,Description="SNP in aligned 100bp window with < 50% column % ID">
##FILTER=<ID=ALN,Description="SNP in aligned 100b window with > 20 indels">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MF3KT15001B.fna.ref BF3KT15003B.fna MF3KT15004.fna PF3KT15003.fna
MF3KT15001B.Scaf2 48063 ACAGTGGGAT.AACTTACCCG A G 40 PASS NA GT 0 1 0 0
MF3KT15001B.Scaf2 48079 CCCGACTGGC.AACAAATTGG A G 40 PASS NA GT 0 1 0 0
MF3KT15001B.Scaf6 74973 AGTTGCTTGG.TTTAGAGTTT T C 40 PASS NA GT 0 1 1 1
MF3KT15001B.Scaf6 102844 ACATCATGAT.CTTCAAGTTC C G 40 PASS NA GT 0 1 0 0
网友评论