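Preface

Genomic structural variation is an important cause of many cancers, genetic disorders, and other diseases. Detecting structural variants from second-generation (short-read) sequencing data has major limitations, while third-generation (long-read) sequencing suffers from a relatively high error rate, among other problems. Most software performs poorly on complex structural variants in particular. To address this, researchers developed the genome aligner NGMLR and the structural variant caller Sniffles, which the authors report provide unprecedented sensitivity and precision for variant detection. NGMLR and Sniffles can also filter spurious events automatically and operate on low-coverage data, which reduces sequencing cost.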

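Introduction

NGMLR and Sniffles are structural variant detection tools designed for long-read sequencing. The aligner NGMLR builds on short-read alignment methods while accounting for the characteristics of data produced by the PacBio and Oxford Nanopore platforms. The variant caller Sniffles scans the resulting alignments to detect structural variants precisely.

Figure: Main steps of NGMLR (left) and Sniffles (right)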

NGMLR

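Installation

Installing with conda is recommended (the package is distributed through the bioconda channel):

conda install -c bioconda ngmlr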

Usage

For PacBio data (the default preset):

ngmlr -t 4 -r reference.fasta -q reads.fastq -o test.sam

For Oxford Nanopore data:

ngmlr -t 4 -r reference.fasta -q reads.fastq -o test.sam -x ont
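
The two invocations differ only in the -x preset. As a minimal sketch (not part of NGMLR itself), the wrapper below picks the preset from a platform name and then sorts and indexes the output with samtools, which is assumed to be installed. The names step and align_long_reads, the RUN switch, and all file names are hypothetical. By default the script only prints the commands; setting RUN=1 executes them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Echo a command; execute it only when RUN=1 (hypothetical dry-run switch).
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# align_long_reads <pacbio|ont> <reference> <reads> <prefix> [threads]
align_long_reads() {
    local platform=$1 ref=$2 reads=$3 prefix=$4 threads=${5:-4}
    case "$platform" in
        pacbio|ont) ;;   # the two presets listed for ngmlr -x
        *) echo "unknown platform: $platform" >&2; return 1 ;;
    esac
    step ngmlr -t "$threads" -x "$platform" -r "$ref" -q "$reads" -o "$prefix.sam"
    # samtools (assumed installed) converts the SAM into a sorted, indexed BAM
    step samtools sort -@ "$threads" -o "$prefix.sort.bam" "$prefix.sam"
    step samtools index "$prefix.sort.bam"
}

# Example call with placeholder file names
align_long_reads ont reference.fasta reads.fastq mapped
```

Keeping the intermediate SAM file makes each step easy to inspect; since ngmlr writes to stdout when -o is omitted, the alignment could also be piped straight into samtools sort.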

Parameters

Usage: ngmlr [options] -r <reference> -q <reads> [-o <output>]

Input/output options:
    -r <file>,  --reference <file>
        (required)  Path to the reference genome (FASTA/Q, can be gzipped)
    -q <file>,  --query <file>
        Path to the read file (FASTA/Q) [/dev/stdin]
    -o <string>,  --output <string>
        Path to output file [stdout]
    --skip-write
        Don't write reference index to disk [false]
    --bam-fix
        Report reads with > 64k CIGAR operations as unmapped. Required to be compatible with the BAM format [false]
    --rg-id <string>
        Adds RG:Z:<string> to all alignments in SAM/BAM [none]
    --rg-sm <string>
        RG header: Sample [none]
    --rg-lb <string>
        RG header: Library [none]
    --rg-pl <string>
        RG header: Platform [none]
    --rg-ds <string>
        RG header: Description [none]
    --rg-dt <string>
        RG header: Date (format: YYYY-MM-DD) [none]
    --rg-pu <string>
        RG header: Platform unit [none]
    --rg-pi <string>
        RG header: Median insert size [none]
    --rg-pg <string>
        RG header: Programs [none]
    --rg-cn <string>
        RG header: sequencing center [none]
    --rg-fo <string>
        RG header: Flow order [none]
    --rg-ks <string>
        RG header: Key sequence [none]

General options:
    -t <int>,  --threads <int>
        Number of threads [1]
    -x <pacbio, ont>,  --presets <pacbio, ont>
        Parameter presets for different sequencing technologies [pacbio]
    -i <0-1>,  --min-identity <0-1>
        Alignments with an identity lower than this threshold will be discarded [0.65]
    -R <int/float>,  --min-residues <int/float>
        Alignments containing less than <int> or (<float> * read length) residues will be discarded [0.25]
    --no-smallinv
        Don't detect small inversions [false]
    --no-lowqualitysplit
        Split alignments with poor quality [false]
    --verbose
        Debug output [false]
    --no-progress
        Don't print progress info while mapping [false]

Advanced options:
    --match <float>
        Match score [2]
    --mismatch <float>
        Mismatch score [-5]
    --gap-open <float>
        Gap open score [-5]
    --gap-extend-max <float>
        Gap open extend max [-5]
    --gap-extend-min <float>
        Gap open extend min [-1]
    --gap-decay <float>
        Gap extend decay [0.15]
    -k <10-15>,  --kmer-length <10-15>
        K-mer length in bases [13]
    --kmer-skip <int>
        Number of k-mers to skip when building the lookup table from the reference [2]
    --bin-size <int>
        Sets the size of the grid used during candidate search [4]
    --max-segments <int>
        Max number of segments allowed for a read per kb [1]
    --subread-length <int>
        Length of fragments reads are split into [256]
    --subread-corridor <int>
        Length of corridor sub-reads are aligned with [40]
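
The -R option above accepts either an absolute residue count or a fraction of the read length. The sketch below illustrates the arithmetic only. It assumes that values below 1 are interpreted as fractions and that the result is truncated to an integer; neither detail is stated in the help text. The function name min_residues is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# min_residues <read_length> <R>
# Prints the minimum number of aligned residues implied by a given -R value.
# Assumption: R below 1 is a fraction of the read length, otherwise an absolute count.
min_residues() {
    awk -v len="$1" -v r="$2" 'BEGIN {
        if (r < 1) printf "%d\n", len * r   # fraction of the read length (truncated)
        else       printf "%d\n", r         # absolute number of residues
    }'
}

min_residues 10000 0.25   # default -R 0.25 on a 10 kb read: prints 2500
min_residues 10000 500    # -R 500 on the same read: prints 500
```

With the default of 0.25, a 10 kb read would therefore need roughly 2,500 aligned residues for its alignment to be kept under these assumptions.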

Sniffles

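Installation

Installing with conda is recommended (the package is distributed through the bioconda channel):

conda install -c bioconda sniffles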

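Usage

sniffles -m mapped.sort.bam -v output.vcf

mapped.sort.bam must be a coordinate-sorted BAM file built from NGMLR or bwa alignments. NGMLR writes SAM, so its output has to be sorted and converted to BAM first, for example with samtools sort. If the alignments come from bwa, run bwa mem with the -M option, which marks shorter split hits as secondary alignments. The -m and -v options shown here follow the Sniffles 1.x command line; run sniffles --help to confirm the options of the installed version.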
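
For the bwa route, the steps can be sketched as follows. This is an illustration under stated assumptions rather than a recommended protocol: bwa and samtools are assumed to be installed, the -o option of bwa mem is assumed to be available (it exists in recent bwa releases), and the names step and bwa_to_sniffles, the RUN switch, and the file names are hypothetical. By default the script only prints the commands; setting RUN=1 executes them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Echo a command; execute it only when RUN=1 (hypothetical dry-run switch).
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# bwa_to_sniffles <reference> <reads> <prefix> [threads]
bwa_to_sniffles() {
    local ref=$1 reads=$2 prefix=$3 threads=${4:-4}
    step bwa index "$ref"
    # -M marks shorter split hits as secondary; -x pacbio is bwa's PacBio preset
    step bwa mem -M -x pacbio -t "$threads" -o "$prefix.sam" "$ref" "$reads"
    # Sort and index so that the BAM matches the mapped.sort.bam input above
    step samtools sort -@ "$threads" -o "$prefix.sort.bam" "$prefix.sam"
    step samtools index "$prefix.sort.bam"
    # Sniffles 1.x style options, as used in this section
    step sniffles -m "$prefix.sort.bam" -v "$prefix.vcf"
}

# Example call with placeholder file names
bwa_to_sniffles reference.fasta reads.fastq mapped
```

For Oxford Nanopore reads, bwa mem offers the ont2d preset in place of pacbio.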

References

Sedlazeck F J, Rescheneder P, Smolka M, et al. Accurate detection of complex structural variations using single-molecule sequencing[J]. Nature Methods, 2018.
Sniffles
NGMLR