美文网首页
HyPo: Super Fast & Accurate Poli

HyPo: Super Fast & Accurate Poli

作者: 生信师姐 | 来源:发表于2020-07-23 17:28 被阅读0次

HyPo--全称 Hybrid Polisher.使用short reads或者long reads对组装的基因组序列进行polish. 它利用基因组中唯一的kmers,通过部分排好序的read-segments比对片段来选择性地polish contigs的segments。

请注意,“short reads”不一定非得是NGS short reads; 由PacBio SequelII产生的HiFi基因组reads(例如CCS)也可以。要求:这些reads必须高度准确(>98%的准确度)。

为什么会有一个racon的比较,一个
As demonstrated on human genome assemblies, Hypo generates significantly more accurate polished assembly in about one-third time with about half the memory requirements in comparison to contemporary widely used polishers like Racon.

Hypo 的输入文件需满足如下要求:

  • Short reads/HiFi reads (FASTA/FASTQ,可以是压缩文件)
  • Draft contigs (in FASTA/FASTQ,可以是压缩文件)
  • Alignments between short reads (or HiFi reads) and the draft (in sam/bam format; should contain CIGAR).Long (noisy) reads 也可以, 需和draft基因组文件比对生成sam/bam格式文件.
  • 预期 short reads(或HiFi reads)的平均覆盖率和基因组的大致大小。

在下面的内容中,HiFi reads可以替代short reads:

  • 理论上,将draft contigs(未纠错的)大致分为两类区域(片段):强区域和弱区域。
  • 强区域是那些有足够证据支持,证明其正确性的区域,因此不需要polish。
  • 弱区域将使用POA进行polish。每个弱区域将使用short reads或long reads进行polish;short reads优先于long reads。
  • 为了确定强区域,我们使用solid kmers(预估的唯一基因组kmers)。强区域在选择read-segments来polish相邻的弱区时也起到了一定的作用。此外,我们的方法也将long reads及他们组装产生的homopolymer error考虑其中。

Installation

Hypo is only available for Unix-like platforms (Linux and MAC OS). We recommend using Option_1 for a convenient installation and in the case where target machine is different from the one on which hypo is installed. On the other hand, Option_2 is more suitable if the binary of hypo is to be run on the same machine on which it is compiled because then a machine-specific optimised (and thus slightly faster) binary can be produced using the flag -Doptimise_for_native=ON.

Option_1: Conda Package Installation

The convenient way of installation is using the conda package as follows:

conda install -c bioconda hypo

Htslib 1.10 may cause conflicts with some of the already installed packages. In such a case, hypo can be installed in a new environment as follows:

conda create --name hypo_env
conda activate hypo_env  
conda install -c bioconda hypo

Option_2: Installation from the source

CmakeLists is provided in the project root folder.

Pre-requisites

For installing from the source, the following requirements are assumed to be installed already (with path to their binaries available in $PATH).

  • Zlib

  • OpenMP

  • GCC (>=7.3)

    • Following are the commands to update GCC (say to GCC 8) on an Ubuntu machine (from say GCC 5):
      sudo apt-get update; sudo apt-get install build-essential software-properties-common -y;
      sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y; sudo apt update; 
      sudo apt install gcc-snapshot -y; sudo apt update
      sudo apt install gcc-8 g++-8 -y; 
      sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 60 --slave /usr/bin/g++ g++ /usr/bin/g++-8
      sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5;
    
  • HTSLIB (version >=1.10)

    • If htslib version 1.10 or higher is not installed, we recommend using install_deps.sh in the project folder to install it locally.
  • KMC3

    • Required for suk
    • Can also be installed using conda as follows:
    conda install -c bioconda kmc
    

Building Executable

Run the following commands to build a binary (executable) hypo in build/bin : If htslib version 1.10 or higher is installed:

  git clone --recursive https://github.com/kensung-lab/hypo hypo
  cd hypo
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
  make -j 8

If htslib is not installed or the version is smaller than 1.10:

  git clone --recursive https://github.com/kensung-lab/hypo hypo
  cd hypo
  chmod +x install_deps.sh
  ./install_deps.sh
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
  make -j 8

Notes:

  • If --recursive was omitted from git clone, please run git submodule init and git submodule update before compiling.
  • If target machine is different from the one on which Hypo is being compiled, exclude the flag -Doptimise_for_native=ON.

Usage of the tool:

 Usage: hypo <args>

 ** Mandatory args:
    -r, --reads-short <str>
    Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.

    -d, --draft <str>
    Input file name containing the draft contigs (in fasta/fastq format; can be compressed). 

    -b, --bam-sr <str>
    Input file name containing the alignments of short reads against the draft (in bam/sam format; must have CIGAR information). 

    -c, --coverage-short <int>
    Approximate mean coverage of the short reads. 

    -s, --size-ref <str>
    Approximate size of the genome (a number; could be followed by units k/m/g; e.g. 10m, 2.3g). 

** Optional args:
    -B, --bam-lr <str>
    Input file name containing the alignments of long reads against the draft (in bam/sam format; must have CIGAR information). 
    [Only Short reads polishing will be performed if this argument is not given]

    -o, --output <str>
    Output file name. 
    [Default] hypo_<draft_file_name>.fasta in the working directory.

    -t, --threads <int>
    Number of threads. 
    [Default] 1.

    -p, --processing-size <int>
    Number of contigs to be processed in one batch. Lower value means less memory usage but slower speed. 
    [Default] All the contigs in the draft.

    -k, --kind-sr <str>
    Kind of the short reads. 
    [Valid Values] 
        sr  (Corresponding to NGS reads like Illumina reads) 
        ccs (Corresponding to HiFi reads like PacBio CCS reads) 
    [Default] sr.

    -m, --match-sr <int>
    Score for matching bases for short reads. 
    [Default] 5.

    -x, --mismatch-sr <int>
    Score for mismatching bases for short reads. 
    [Default] -4.

    -g, --gap-sr <int>
    Gap penalty for short reads (must be negative). 
    [Default] -8.

    -M, --match-lr <int>
    Score for matching bases for long reads. 
    [Default] 3.

    -X, --mismatch-lr <int>
    Score for mismatching bases for long reads. 
    [Default] -5.

    -G, --gap-lr <int>
    Gap penalty for long reads (must be negative). 
    [Default] -4.

    -n, --ned-th <int>
    Threshold for Normalised Edit Distance of long arms allowed in a window (in %). Higher number means more arms allowed which may slow down the execution.
    [Default] 20.

    -q, --qual-map-th <int>
    Threshold for mapping quality of reads. The reads with mapping quality below this threshold will not be taken into consideration. 
    [Default] 2.

    -i, --intermed
    Store or use (if already exist) the intermediate files. 
    [Currently, only Solid kmers are stored as an intermediate file.].

    -h, --help
    Print the usage. 

输出文件

如果没有提供参数--output (-o) ,则默认在和draft contig相同的目录下生成 hypo_X.fasta

中间文件

如果使用参数 --intermed (-i) , hypo将在第一次运行时将 solid kmers对应的中间文件存储在名为“aux”的文件夹中。另一个指示运行进度的文件也将在该文件夹中创建。在运行下一轮时(使用“-i”),将使用这些中间文件,而不是运行“suk”模块。建议在使用Hypo进行多轮polish时使用此标志,以便使后续轮更快

资源使用

可选参数 --processing-size (-p) 控制每一batch中处理的contigs的数目。默认情况下,所有的 contigs 将会被 single batch处理. 在一个batch中处理的contigs数目越多,内存消耗就越高。另一方面,较低的contigs数目可能无法有效地利用线程数量。作为参考,对于整个人类基因组,我们在48核机器上使用 -p 96,使用了大约380G内存,3小时内完成了运行(仅Illumina polishing)。建议使用 -p-t的倍数。如果基因组大小对于可用的RAM来说不是太大,建议在一个批次中处理所有的contigs(即避免指定 -p)。

Normalised Edit Distance

可选参数 --ned-th (or -n)仅和long reads的polish相关。它设置了标准化编辑距离(Normalised Edit Distance)的阈值,该距离测量long read与其对齐的draft segment之间的相似性。标准化编辑距离=(对齐的read_segment 和draft /aligned_length of draft之间的1bp编辑距离)x100。比阈值越高,更多的read-segments 将会被考虑。我们建议对整个人类基因组polishing使用默认值。

例1 (using Illumina paired-end reads)

Mapping the short reads to contigs:

$R1, $R2:short reads (paired-end) ;
$Draft:draft contigs;
$NUMTH :线程数;

minimap2 --secondary=no --MD -ax sr -t $NUMTH $DRAFT $R1 $R2 | samtools view -Sb - > mapped-sr.bam
samtools sort -@$NUMTH -o mapped-sr.sorted.bam mapped-sr.bam
samtools index mapped-sr.sorted.bam
rm mapped-sr.bam

Mapping the long reads to contigs:

$LONGR:long reads (PacBio or ONT) ;
$Draft:draft contigs;
$NUMTH:线程数;
$RTYPEmap-pb for PacBio; map-ont for ONT reads.

minimap2 --secondary=no --MD -ax $RTYPE -t $NUMTH $DRAFT $LONGR | samtools view -Sb - > mapped-lg.bam
samtools sort -@$NUMTH -o mapped-lg.sorted.bam mapped-lg.bam
samtools index mapped-lg.sorted.bam
rm mapped-lg.bam

Running Hypo:

首先,创建一个含有short reads的文本文件。

echo -e "$R1\n$R2" > il_names.txt

基因组大约3Gbp(-s),Illumina reads的覆盖度大约55x(-c). 使用48线程(-t),每个batch运行96个contigs(-p).

仅仅使用short reads 进行polish:

./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -p 96 -t 48 -o whole_genome.h.fa

使用short reads 和long reads进行polish:

./bin/hypo -d $DRAFT -r @il_names.txt -s 3g -c 55 -b mapped-sr.sorted.bam -B mapped-lg.sorted.bam -p 96 -t 48 -o whole_genome.h2.fa

例 2 (using CCS reads)

Mapping the CCS reads to contigs:

$READS:CCS reads;
$Draft:draft contigs;
$NUMTH:线程数;

minimap2 --secondary=no --MD -ax asm20 -t $NUMTH $DRAFT $READS | samtools view -Sb - > mapped-ccs.bam
samtools sort -@$NUMTH -o mapped-ccs.sorted.bam mapped-ccs.bam
samtools index mapped-ccs.sorted.bam
rm mapped-ccs.bam

Running Hypo:

基因组大约3Gbp(-s),CCS reads的覆盖度大约33x(-c). 使用48线程(-t),一个batch运行全部的contigs.

仅仅 CCS polish:

./bin/hypo -d $DRAFT -r $READS -s 3g -c 30 -b mapped-ccs.sorted.bam -t 48 -o whole_genome.h.fa

Method and Results

对于整个人类基因组(HG002)的组装,仅使用Illumina reads,Hypo花了大约3个小时和大约380G RAM来polish(在48核48线程的机器上)。对于使用Illumina和PacBio reads进行抛光,使用约4小时15分钟和大约410G RAM。方法和部分结果可在BioRxiv找到.

HyPo: Super Fast & Accurate Polisher for Long Read Genome Assemblies
Ritu Kundu, Joshua Casey, Wing-Kin Sung
bioRxiv 2019.12.19.882506; doi: https://doi.org/10.1101/2019.12.19.882506 

Limitations

  • 待polish的contig/scaffold中的N将和碱基一样对待,(即,N将被对应POA图中与之align的最频繁的碱基所取代)。
  • contig/scaffold的最大长度被限制在4,294,967,295(32位无符号整数的最大值)

External Libraries

  • sdsl-lite 用于秩选择和位向量数据结构.
  • slog 被用来打印时间和内存使用情况。
  • suk 被用作计算solid (唯一的) kmers的模块。
  • POA使用了spoa库的改编版本。

官网:
https://github.com/kensung-lab/hypo

相关文章

网友评论

      本文标题:HyPo: Super Fast & Accurate Poli

      本文链接:https://www.haomeiwen.com/subject/ctybcktx.html