Polishing a third-generation (long-read) genome assembly

Author: 多啦A梦的时光机_648d | Published: 2019-10-15 20:18


I. Polishing the long-read assembly with short-read data using Pilon

1. Download the latest Pilon release

$wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar
$java -Xmx10G -jar pilon-1.23.jar

Running the jar raised an error:

$java -Xmx10G -jar pilon-1.23.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/pilon/Pilon : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

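The error shows that the installed Java runtime is older than the version the jar was compiled for (class-file version 52.0 corresponds to Java 8). I therefore fetched the older pilon-1.22.jar, which runs without problems. Upgrading the runtime to Java 8 or later would also allow pilon-1.23.jar to run.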
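To make the version mismatch concrete, the sketch below converts the class-file major version from the error message into a Java release number and prints the runtime found on PATH. The helper name class_major_to_java is made up for this illustration; the offset of 44 is the standard relationship between class-file major versions and Java releases (49 is Java 5, 52 is Java 8).

```shell
#!/usr/bin/env bash
# Map a class-file major version (the number before ".0" in the error
# message) to the Java release that introduced it: 49 -> 5, ..., 52 -> 8.
# class_major_to_java is a hypothetical helper name.
class_major_to_java() {
    local major=$1
    echo $(( major - 44 ))
}

# The error above reports "major.minor version 52.0".
echo "pilon-1.23.jar needs Java $(class_major_to_java 52) or newer"

# Show which runtime is actually on PATH, if any.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
fi
```

If the reported runtime is older than the release printed on the first line, either upgrade Java or fall back to an older jar as done here.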

Download pilon-1.22.jar and run it:

$java -Xmx10G -jar pilon-1.22.jar
Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400


    Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]
                 [...other options...]
           pilon --help for option details

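2. Prepare the data

Genome assembled from the long-read data: draft.fa
Quality-controlled paired-end Illumina reads: read1_fq.gz read2_fq.gz

3. Alignment (bwa)

Build the index (the index/ directory has to exist first):

$mkdir -p index
$bwa index -p index/draft draft.fa

Align and sort:

$bwa mem -t 16 index/draft read1_fq.gz read2_fq.gz |samtools sort -@ 10 -O bam -o align.bam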

Index the sorted BAM file:

$samtools index -@ 10 align.bam

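4. Mark duplicates

$sambamba markdup -t 10 align.bam align_markup.bam

5. Keep only high-quality alignments

Keep alignments with MAPQ of at least 30 and write BAM output (-b); without -b, samtools view writes SAM text and the indexing step fails:

$samtools view -@ 10 -b -q 30 align_markup.bam >align_filter.bam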
$samtools index -@ 10 align_filter.bam
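As an illustration of what the -q 30 filter does, here is a minimal sketch that applies the same rule to SAM text with awk: header lines are kept, and a record survives only if its MAPQ (column 5) is at least the threshold. The function name filter_mapq and the two toy records are invented for this example; on real data samtools view should be used.

```shell
# Keep SAM header lines (starting with "@") and records whose MAPQ
# (column 5) is at least the threshold given as the first argument.
# filter_mapq is a hypothetical helper, not part of samtools.
filter_mapq() {
    awk -v min="$1" -F '\t' '/^@/ || $5 + 0 >= min'
}

# Two toy records (made-up data): r1 has MAPQ 60, r2 has MAPQ 3.
printf '@HD\tVN:1.6\nr1\t0\tctg1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tctg1\t9\t3\t4M\t*\t0\t0\tACGT\tIIII\n' \
    | filter_mapq 30
```

Only the header and r1 are printed; r2 is dropped because reads with a low MAPQ are typically placed ambiguously, for example in repeats.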

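6. Polish with Pilon

$java -Xmx10G -jar pilon-1.22.jar --genome draft.fa --frags align_filter.bam --fix snps,indels --output pilon_polished --vcf >pilon.log 2>&1 &

Pilon options

--frags: the input BAM comes from a paired-end library with inserts under 1 kb; --jumps is for mate-pair libraries with inserts over 1 kb; --bam lets Pilon guess the library type
--vcf: also write a VCF file with information for every base
--fix: the categories of issues Pilon will try to fix; snps and indels are usually enough
--variant: heuristic variant calling, equivalent to --vcf --fix all,breaks; do not use it when polishing
--minmq: minimum mapping quality for a read to count in Pilon's pileups; the default is 0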
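The options above can be tied together in a small dry-run helper. This is only a sketch: pilon_cmd is a made-up function that prints the command line rather than running it, the file names are the ones used in this post, and the minimum MAPQ of 30 is an illustrative choice.

```shell
# Print (not run) the Pilon command for one polishing round.
# Arguments: jar, draft genome, BAM, output prefix, minimum MAPQ.
# pilon_cmd is a hypothetical helper for illustration.
pilon_cmd() {
    local jar=$1 genome=$2 bam=$3 prefix=$4 minmq=${5:-0}
    echo "java -Xmx10G -jar $jar --genome $genome --frags $bam" \
         "--fix snps,indels --minmq $minmq --output $prefix --vcf"
}

pilon_cmd pilon-1.22.jar draft.fa align_filter.bam pilon_polished 30
```

Printing the command first makes it easy to review the options before committing to a long run; the printed line can then be pasted into the terminal with the log redirection of your choice.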

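II. Polishing a PacBio assembly with its own reads

1. Prepare the software

samtools
arrow
pbmm2
pbindex
The last two are most easily installed with conda, which can pull in PacBio's whole tool suite:

# install
conda create -n pb-assembly pb-assembly
# activate
conda activate pb-assembly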

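2. Prepare the data

The assembled genome, raw_assembly.fa (produced by falcon, canu, mecat2, flye or similar); the commands below call it assembly.fa
The raw BAM file from the sequencing provider (named like XXX.subreads.bam)

3. Run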

$samtools faidx assembly.fa
$pbmm2 align assembly.fa pacbio.subreads.bam | samtools sort -@ 16 > map.pacbio.bam
$pbindex map.pacbio.bam
$arrow -j 16 -r assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam

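GenomicConsensus, which provides arrow, runs only on Python 2 and has been superseded by gcpp, so the final arrow step can also be run with gcpp.
gcpp is used much like GenomicConsensus and takes similar options, except that its help text asks for multiple outputs as one comma-separated list after -o. The last step then becomes: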

$gcpp -j 16 -r assembly.fa -o consensus.fasta,variants.vcf map.pacbio.bam
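Both consensus callers expect the indexes created in the earlier steps. The following sketch checks for them before a long run is started. It assumes the default index names, <ref>.fai from samtools faidx and <bam>.pbi from pbindex; missing_indexes is a hypothetical helper.

```shell
# Print the index files that are missing for a reference/BAM pair.
# Assumes the default names: <ref>.fai from samtools faidx and
# <bam>.pbi from pbindex. missing_indexes is a hypothetical helper.
missing_indexes() {
    local ref=$1 bam=$2 f
    for f in "$ref.fai" "$bam.pbi"; do
        [ -f "$f" ] || echo "$f"
    done
}

missing=$(missing_indexes assembly.fa map.pacbio.bam)
if [ -n "$missing" ]; then
    echo "missing index file(s):" $missing
else
    echo "indexes found, ready to run arrow or gcpp"
fi
```

If anything is reported as missing, rerun samtools faidx or pbindex on the corresponding file.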

Its full list of options and usage is reproduced here:

gcpp - Compute genomic consensus from alignments and call variants relative to the reference.

Usage:
  gcpp [options] <input.bam>

  input.bam                    STR    The input BAM file.

Required input/output files:
  -r,--reference               FILE   The filename of the reference FASTA file.
  -o,--output                  STR    The output filename(s), as a comma-separated list. Valid output formats are
                                      .fa/.fasta, .fq/.fastq, .gff, .vcf

Output filtering:
  -q,--min-confidence          INT    The minimum confidence for a variant call to be output to variants.{gff,vcf} [40]
  -x,--min-coverage            INT    The minimum site coverage that must be achieved for variant calls and consensus
                                      to be calculated for a site. [5]
  --no-evidence-call           STR    The consensus base that will be output for sites with no effective coverage.
                                      Valid choices: (nocall, reference, lowercasereference). [lowercasereference]

Read selection/filtering:
  -X,--coverage                INT    A designation of the maximum coverage level to be used for analysis. Exact
                                      interpretation is algorithm-specific. The meaningful range of this argument is
                                      [1, 1000]. [100]
  --min-accuracy               FLOAT  The minimum acceptable window-global alignment accuracy for reads that will be
                                      used for the analysis (arrow-only). [0.82]
  -m,--min-map-qv              INT    The minimum MapQV for reads that will be used for analysis. [10]
  --min-read-score             FLOAT  The minimum ReadScore for reads that will be used for analysis (arrow-only).
                                      [0.65]
  --min-snr                    FLOAT  The minimum acceptable signal-to-noise over all channels for reads that will be
                                      used for analysis (arrow-only). [2.5]
  -w,--windows                 STR    The window (or multiple comma-delimited windows) of the reference to be
                                      processed, in the format refGroup:refStart-refEnd (default: entire reference).

Algorithm and parameter settings:
  --algorithm                  STR    The consensus algorithm used. Valid choices: (arrow, plurality, poa). [arrow]
  --mask-radius                INT    Radius of window to use when excluding local regions for exceeding
                                      maskMinErrorRate, where 0 disables any filtering (arrow-only). [3]
  --mask-error-rate            FLOAT  Maximum local error rate before the local region defined bymaskRadius is excluded
                                      from polishing (arrow-only). [0.7]
  -P,--parameters-file         STR    Path to a model file or directory containing model files.
  -p,--parameters-spec         STR    Name of chemistry or model to use, overriding default selection.
  --max-iterations             INT    Maximum number of iterations to polish the template. [40]
  --max-poa-coverage           INT    Maximum number of sequences to use for consensus calling. [11]
  --mutation-separation        INT    Find the best mutations within a separation window for iterative polishing. [10]
  --mutation-neighborhood      INT    Find nearby mutations within neighborhood for iterative polishing. [20]
  --read-stumpiness-threshold  FLOAT  Filter out reads whose aligned length along a subread is lower than a percentage
                                      of its corresponding reference length. [0.1]

Verbosity and debugging:
  -d,--dump-evidence           STR    Dump evidence data. Valid choices: (variants, all, outliers, none). [none]
  --evidence-directory         DIR    Directory to dump evidence into.
  --annotate-gff                      Augment GFF variant records with additional information
  --report-effective-coverage         Additionally record the *post-filtering* coverage at variant sites.

Advanced configuration options:
  -C,--reference-chunk-size    INT    Size of reference chunks. [500]
  --reference-chunk-overlap    INT    Size of reference chunk overlaps. [5]
  --simple-chunking                   Disable adaptive reference chunking.
  --sort-strategy              STR    Read sorting strategy. Valid choices: (longest_and_strand_balanced, longest,
                                      spanning, file_order). [longest_and_strand_balanced]
  --min-poa-coverage           INT    Minimum number of reads required within a window to call consensus and variants
                                      using arrow or poa. [3]

  -h,--help                           Show this help and exit.
  --version                           Show application version and exit.
  -j,--num-threads             INT    Number of threads to use, 0 means autodetection. [0]
  --log-level                  STR    Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
  --log-file                   FILE   Log to a file, instead of stderr.

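If that feels like too much work, you can run hoptop's arrow_polish.sh instead. The difference from the steps above is that every raw BAM file has to be listed in a file-of-filenames, input.fofn.

input.fofn holds one path per line, each pointing to an xxx.subreads.bam file supplied by the sequencing provider.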

#!/bin/bash

set -e
set -o pipefail
set -u

REF=$1      # assembly FASTA
BAM=$2      # subreads BAM, or a file-of-filenames listing them
THREADS=100

source activate pb-assembly

# index the reference once
if [ ! -f "$REF.fai" ]; then
    samtools faidx "$REF"
fi

# align the subreads to the assembly unless aln.bam already exists
if [ ! -f aln.bam ]; then
pbalign \
    --tmpDir=./ --nproc=${THREADS} \
    --minAccuracy=0.75 --minLength=50 \
    --minAnchorSize=12 --maxDivergence=30 --concordant --algorithm=blasr \
    --algorithmOptions=--useQuality --maxHits=1 --hitPolicy=random --seed=1 \
    "$BAM" "${REF}" aln.bam
fi

# call the consensus with arrow
variantCaller --algorithm=arrow \
    -x 5 -X 120 -q 20 -j 24 \
    -r "$REF" aln.bam \
    -o cns.fasta -o cns.fastq || echo arrow failed

Finally, run the script:

$bash arrow_polish.sh raw_assembly.fa input.fofn
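The file-of-filenames passed as the second argument can be generated rather than typed by hand. A minimal sketch, assuming all subread BAMs sit in one directory and end in .subreads.bam; make_fofn is a made-up helper that prints absolute paths, one per line.

```shell
# Print the absolute path of every *.subreads.bam in a directory,
# one per line, ready to be redirected into a file-of-filenames.
# make_fofn is a hypothetical helper for illustration.
make_fofn() {
    local dir=$1 f
    for f in "$dir"/*.subreads.bam; do
        [ -e "$f" ] || continue   # no match: the glob stays literal
        echo "$(cd "$(dirname "$f")" && pwd)/$(basename "$f")"
    done
}

# List the subread BAMs in the current directory.
# To create the file: make_fofn /path/to/bams > input.fofn
make_fofn .
```

Absolute paths are used so that the list stays valid if the script is later run from a different working directory.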

III. Finally, merge the assemblies produced by different assemblers to improve assembly quality

quickmerge

1. Installing quickmerge

$unzip quickmerge-master.zip
$cd quickmerge-master
$bash make_merger.sh 
$export PATH=/data1/spider/ytbiosoft/soft/quickmerge-master:/data1/spider/ytbiosoft/soft/quickmerge-master/MUMmer3.23:$PATH
or
$source /data1/spider/ytbiosoft/soft/quickmerge-master/.quickmerge

Once the build has finished, remember to add the install directory to your PATH.
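A quick way to confirm that the PATH change took effect is to look up each executable used in the following steps. This is a sketch only; check_tools is a hypothetical helper built on the shell's command -v.

```shell
# Report every required executable that cannot be found on PATH.
# Returns non-zero if at least one is missing.
# check_tools is a hypothetical helper for illustration.
check_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool" >&2
            status=1
        fi
    done
    return "$status"
}

check_tools quickmerge nucmer delta-filter || echo "fix PATH before continuing"
```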

Take a look at the help text:

Usage: quickmerge -d out.delta -q query.fasta -r reference.fasta -hco (default=5.0) -c (default=1.5) -l seed_length_cutoff -ml merging_length_cutoff -p prefix
=========================================================
quickmerge version 0.3
   Options:
       -d : delta alignment file from nucmer
       -q : fasta used as query in nucmer
       -r : fasta used as reference in nucmer
     -hco : seed alignment HCO cutoff (default=5.0)
       -c : high confidence overlap cutoff (default=1.5)
       -l : seed alignment length cutoff (long integer)
      -ml : merging length cutoff (integer)
       -p : output prefix
-h/--help : prints this help

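Options

-d              delta file produced by nucmer
-q              FASTA used as the query in nucmer
-r              FASTA used as the reference in nucmer
-hco            seed alignment HCO cutoff (default 5.0)
-c              high-confidence overlap cutoff (default 1.5)
-l              seed alignment length cutoff (long integer)
-ml             merging length cutoff (integer)
-p              output file prefix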

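2. Running quickmerge

1 The simplest way is to run the bundled Python wrapper script:

$merge_wrapper.py hybrid_assembly.fasta self_assembly.fasta

Run merge_wrapper_v2.py -h for the detailed options.

2 Run the steps one by one

The output prefix given to nucmer has to match the delta file name used in the next command. The -t threads option is left out here because it is only available in MUMmer 4, not in the MUMmer 3.23 bundled with quickmerge.

$nucmer -l 100 -p out reference.fa query.fa
$delta-filter -i 95 -r -q out.delta > out.rq.delta
$quickmerge -d out.rq.delta -q query.fa -r reference.fa -hco 5.0 -c 1.5 -l 520000 -ml 10000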

A common starting value for -l is the N50 of the reference (-r) assembly; -ml is usually set above 5000.
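Since -l is usually seeded with the N50 of the reference assembly, here is a small sketch for computing that value from a FASTA file with awk and sort. fasta_n50 is a made-up helper name and the four toy contigs are invented for the demonstration; any assembly statistics tool gives the same number.

```shell
# Print the N50 of a FASTA file: the length L such that contigs of
# length >= L together cover at least half of the assembly.
# fasta_n50 is a hypothetical helper for illustration.
fasta_n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" \
    | sort -rn \
    | awk '{ lens[NR] = $1; total += $1 }
           END { for (i = 1; i <= NR; i++) {
                     run += lens[i]
                     if (run * 2 >= total) { print lens[i]; exit }
                 } }'
}

# Toy assembly with contigs of 8, 6, 4 and 2 bp (total 20 bp).
toy=$(mktemp)
printf '>a\nACGTACGT\n>b\nACG\nTAC\n>c\nACGT\n>d\nAC\n' > "$toy"
fasta_n50 "$toy"    # prints 6: 8 + 6 = 14 bp is the first sum to reach half of 20
rm -f "$toy"
```

On a real assembly, run fasta_n50 reference.fa and pass the result to quickmerge as the initial -l value.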
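Note that nucmer and delta-filter are both programs from the MUMmer package. quickmerge ships with its own copy of MUMmer, but you can also download it yourself to explore further:
MUMmer website
MUMmer on GitHub
nucmer options and usage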

$nucmer  [options]  <Reference>  <Query>

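-l|minmatch       minimum length of a single match (default 20)
-p|prefix         prefix for the output files (default out)

delta-filter options and usage

$delta-filter  [options]  <deltafile>

-i float         minimum alignment identity [0,100] (default 0)
-r               map each position of each reference to its best hit in the query, allowing query overlaps
-q               map each position of each query to its best hit in the reference, allowing reference overlaps

Some recommendations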

1 It can be used to merge two different long molecule only assemblies (e.g. one generated with PBcR or canu and another generated with FALCON).
2 You can run Ka-kit's finisherSC after running quickmerge to improve the contiguity even further.
3 Assembly polishing with Quiver and pilon before and after assembly merging is strongly recommended.

Finally

There are also some tips available on tuning the quickmerge settings.
