美文网首页基因组生物信息学微生物信息学
对组装之后的三代基因组进行polish

对组装之后的三代基因组进行polish

作者: 多啦A梦的时光机_648d | 来源:发表于2019-10-15 20:18 被阅读0次

一:利用pilon软件进行二代数据对三代数据polish

1.下载最新的pilon包

$wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar
$java -Xmx10G -jar pilon-1.23.jar

编译的时候发现报错

$java -Xmx10G -jar pilon-1.23.jar
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/pilon/Pilon : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)

可以看到这里执行java的版本低于被编译的版本,所以就去下了一个低版本的pilon-1.22.jar,再去编译发现就可以了!

下载pilon-1.22.jar

$java -Xmx10G -jar pilon-1.22.jar
Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400


    Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]
                 [...other options...]
           pilon --help for option details

2. 准备数据


  1. 三代数据组装好的基因组文件:draft.fa
  2. illumina的双端测序数据经过质控之后的数据:read1_fq.gz read2_fq.gz

3. 比对(bwa)

  • 构建索引
$bwa index -p index/draft draft.fa
  • 比对并排序
$bwa mem -t 16 index/draft raed1_fq.gz read2_fq.gz |samtools sort -@ 10 -O bam -o align.bam
  • 对比对好的bam文件建索引
$samtools index -@ 10 align.bam

4. 标记重复

$sambamba markup -t 10 align.bam align_markup.bam

5. 过滤高质量比对的reads

$samtools view -@ 10 -q 30 align_markup.bam >align_filter.bam
$samtools index -@ 10 align_filter.bam

6. 使用pilon进行polish

java -Xmx10G -jar pilon-1.23.jar --genome draft.fa --frags align_filter.bam --fix snp,indels --output pilon_polished --vcf & >pilon.log

  • pilon的参数

--frags: 表示输入的是1kb以内的paired-end文库,--jumps表示 大于1k以上的mate pair文库,--bam则是让软件自己猜测
-vcf: 输出一个vcf文件,包含每个碱基的信息
--fix: Pilon将会处理的内容,基本上选snps和indels就够了
--variant: 启发式的变异检测,等价于--vcf --fix all,breaks, 如果是polish不要使用该选项
--minmq: 用于Pilon堆叠的read最低比对质量,默认是0。

二:利用Pacbio组装的自身数据进行polish

1.准备软件

samtools
arrow
pbmm2
pbindex 最后两个可以推荐用conda安装pacbio公司的工具全家桶。

# 安装
conda create -n pb-assembly pb-assembly
# 启动
conda activate pb-assembly

2. 准备数据

  • 组装得到的基因组文件raw_assembly.fa[falcon, canu, mecat2以及flye等软件组装好的数据]
  • 公司给的raw bam文件【类似这样的XXX.subreads.bam】

3. 运行

$samtools faidx assembly.fa
$pbmm2 align pacbio.subreads.bam assembly.fa | samtools sort -@ 16 > map.pacbio.bam
$pbindex map.pacbio.bam
$arrow -j 16 -r assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam

由于GenomicConsensus只能在Python上运行,所以已经被gcpp取代了,因此最后一步arrow也可以用gcpp运行:
gcpp用法与GenomicConsensus类似,参数都类似,所以最后一步可以改为:

$gcpp -j 16 -r assembly.fa -o variants.vcf -o consensus.fasta map.pacbio.bam

最后可以看看他的详细参数和用法:

gcpp - Compute genomic consensus from alignments and call variants relative to the reference.

Usage:
  gcpp [options] <input.bam>

  input.bam                    STR    The input BAM file.

Required input/output files:
  -r,--reference               FILE   The filename of the reference FASTA file.
  -o,--output                  STR    The output filename(s), as a comma-separated list. Valid output formats are
                                      .fa/.fasta, .fq/.fastq, .gff, .vcf

Output filtering:
  -q,--min-confidence          INT    The minimum confidence for a variant call to be output to variants.{gff,vcf} [40]
  -x,--min-coverage            INT    The minimum site coverage that must be achieved for variant calls and consensus
                                      to be calculated for a site. [5]
  --no-evidence-call           STR    The consensus base that will be output for sites with no effective coverage.
                                      Valid choices: (nocall, reference, lowercasereference). [lowercasereference]

Read selection/filtering:
  -X,--coverage                INT    A designation of the maximum coverage level to be used for analysis. Exact
                                      interpretation is algorithm-specific. The meaningful range of this argument is
                                      [1, 1000]. [100]
  --min-accuracy               FLOAT  The minimum acceptable window-global alignment accuracy for reads that will be
                                      used for the analysis (arrow-only). [0.82]
  -m,--min-map-qv              INT    The minimum MapQV for reads that will be used for analysis. [10]
  --min-read-score             FLOAT  The minimum ReadScore for reads that will be used for analysis (arrow-only).
                                      [0.65]
  --min-snr                    FLOAT  The minimum acceptable signal-to-noise over all channels for reads that will be
                                      used for analysis (arrow-only). [2.5]
  -w,--windows                 STR    The window (or multiple comma-delimited windows) of the reference to be
                                      processed, in the format refGroup:refStart-refEnd (default: entire reference).

Algorithm and parameter settings:
  --algorithm                  STR    The consensus algorithm used. Valid choices: (arrow, plurality, poa). [arrow]
  --mask-radius                INT    Radius of window to use when excluding local regions for exceeding
                                      maskMinErrorRate, where 0 disables any filtering (arrow-only). [3]
  --mask-error-rate            FLOAT  Maximum local error rate before the local region defined bymaskRadius is excluded
                                      from polishing (arrow-only). [0.7]
  -P,--parameters-file         STR    Path to a model file or directory containing model files.
  -p,--parameters-spec         STR    Name of chemistry or model to use, overriding default selection.
  --max-iterations             INT    Maximum number of iterations to polish the template. [40]
  --max-poa-coverage           INT    Maximum number of sequences to use for consensus calling. [11]
  --mutation-separation        INT    Find the best mutations within a separation window for iterative polishing. [10]
  --mutation-neighborhood      INT    Find nearby mutations within neighborhood for iterative polishing. [20]
  --read-stumpiness-threshold  FLOAT  Filter out reads whose aligned length along a subread is lower than a percentage
                                      of its corresponding reference length. [0.1]

Verbosity and debugging:
  -d,--dump-evidence           STR    Dump evidence data. Valid choices: (variants, all, outliers, none). [none]
  --evidence-directory         DIR    Directory to dump evidence into.
  --annotate-gff                      Augment GFF variant records with additional information
  --report-effective-coverage         Additionally record the *post-filtering* coverage at variant sites.

Advanced configuration options:
  -C,--reference-chunk-size    INT    Size of reference chunks. [500]
  --reference-chunk-overlap    INT    Size of reference chunk overlaps. [5]
  --simple-chunking                   Disable adaptive reference chunking.
  --sort-strategy              STR    Read sorting strategy. Valid choices: (longest_and_strand_balanced, longest,
                                      spanning, file_order). [longest_and_strand_balanced]
  --min-poa-coverage           INT    Minimum number of reads required within a window to call consensus and variants
                                      using arrow or poa. [3]

  -h,--help                           Show this help and exit.
  --version                           Show application version and exit.
  -j,--num-threads             INT    Number of threads to use, 0 means autodetection. [0]
  --log-level                  STR    Set log level. Valid choices: (TRACE, DEBUG, INFO, WARN, FATAL). [WARN]
  --log-file                   FILE   Log to a file, instead of stderr.

要是你觉得很麻烦,你也可以用hoptop的arrow_polish.sh运行也是可以的,不过跟上面不同的是需要把每一个raw bam文件写到一个input.info文档里面。

  • input.info,里面每一行类似于xxx.subreads.bam, 是公司提供的subread数据。
#!/bin/bash


set -e
set -o pipefail
set -u


REF=$1
BAM=$2
THREADS=100


source activate pb-assembly


if [ ! -f $REF.fai ]; then
    samtools faidx $REF
fi


if [ ! f aln.bam ];then
pbalign \
    --tmpDir=./ --nproc=${THREADS} \
    --minAccuracy=0.75 --minLength=50 \
    --minAnchorSize=12 --maxDivergence=30 --concordant --algorithm=blasr \
    --algorithmOptions=--useQuality --maxHits=1 --hitPolicy=random --seed=1 \
    $BAM ${REF} aln.bam
fi
variantCaller --algorithm=arrow \
    -x 5 -X 120 -q 20 -j 24 \
    -r $REF aln.bam \
    -o cns.fasta -o cns.fastq || echo quvier failed

最后运行代码

$bash arrow_polish.sh raw_assembly.fa input.fofn

三. 最后利用多个软件拼接的结果进行合并,来提高组装质量.。

quickmerge

1. quickmerge安装

$unzip quickmerge-master.zip
$cd quickmerge-master
$bash make_merger.sh 
$export PATH=/data1/spider/ytbiosoft/soft/quickmerge-master:/data1/spider/ytbiosoft/soft/quickmerge-master/MUMmer3.23:$PATH
或者
$source /data1/spider/ytbiosoft/soft/quickmerge-master/.quickmerge

安装好之后记得加入环境变量哦!

  • 看一下帮助文档
Usage: quickmerge -d out.delta -q query.fasta -r reference.fasta -hco (default=5.0) -c (default=1.5) -l seed_length_cutoff -ml merging_length_cutoff -p prefix
=========================================================
quickmerge version 0.3
   Options:
       -d : delta alignment file from nucmer
       -q : fasta used as query in nucmer
       -r : fasta used as reference in nucmer
     -hco : seed alignment HCO cutoff (default=5.0)
       -c : high confidence overlap cutoff (default=1.5)
       -l : seed alignment length cutoff (long integer)
      -ml : merging length cutoff (integer)
       -p : output prefix
-h/--help : prints this help
  • 参数
-d              nucmer生成的delta文件
-q              nucmer所用到的query文件
-r              nucmer所用到的reference文件
-hco            default=5.0
-c              高可信度的overlap cutoff(默认为1.5)

-l              seed对齐的length cutoff(长整数)
-ml             合并的length cutoff(整数)
-p              输出文件的前缀

2.开始

  • 1 最简单就是运他的一个py脚本就可以了:
$merge_wrapper.py hybrid_assembly.fasta self_assembly.fasta

自己merge_wrapper_v2.py -h去看详细介绍。

  • 2分步运行
$nucmer -l 100 -p out1 -t 8 reference.fa query.fa
$delta-filter -i 95 -r -q out.delta > out.rq.delta
$quickmerge -d out.rq.delta -q query.fa -r reference.fa -hco 5.0 -c 1.5 -l 520000 -ml 10000
  • 一般-l选择引用(-r)程序集的N50作为初始值,-ml一般大于5000。
    这里讲一下nucmer和delta-filter都是mumer里面的程序包,quickmerge里面自带了mummer,要是想进一步了解也可以自己下载:
    mummer官网
    github的mummer
  • nucmer参数及用法
$nucmer  [options]  <Reference>  <Query>

-l|minmatch       设置单个匹配的最小长度(默认20)
-p|prefix         设置输出文件的前缀(默认为out)
  • delta-filter参数及用法
$delta-filter  [options]  <deltafile>

-i float         设置最小对齐标识[0,100],默认为0
-r               允许query overlaps(多对多)
-q               允许reference overlaps(多对多)

一些建议

  • 1 It can be used to merge two different long molecule only assemblies (e.g. one generated with PBcRor canu and another generated with FALCON).

  • 2 You can run Ka-kit's finisherSC after running quickmerge to improve the contiguity even further.

  • 3 Assembly polishing with Quiver and pilon before and after assembly merging is strongly recommended.

最后

这里有一些quickmerge设置的技巧

相关文章

网友评论

    本文标题:对组装之后的三代基因组进行polish

    本文链接:https://www.haomeiwen.com/subject/sgcemctx.html