一个完整的WGS变异检测snakemake流程详解

作者: BioGou | 来源:发表于2022-04-07 10:33 被阅读0次

写在前边的话

在此要感谢碱基矿工，第一次做全外显子测序的变异检测时便是参考了他写的教程，当时的GATK还是3.8版本，而现在已经是4.1了。使用GATK来做变异检测非常好用，但是在有些步骤是非常耗时的，如果不能用多线程去处理的话，尤其是有多个样本时，就会效率很差。所以才打算写一个snakemake流程化处理，最后会附上一个完整脚本供大家参考。

第一步：建立index以及下载数据库

Index 使用UCSC数据库的hg38作为reference，使用bwa进行比对。下载hg38数据存为hg38.chrom.fasta, 建立bwa比对index：

$ bwa index -a bwtsw -p hg38.chrom.fasta hg38.chrom.fasta

参数说明：
-p 指定输出index文件名字
第二个hg38.chrom.fasta为当前文件夹下的reference

另外，还需要一个hg38.chrom.dict文件和一个hg38.chrom.fasta.fai文件, 命令如下：

$ gatk CreateSequenceDictionary -R hg38.chrom.fasta -O hg38.chrom.dict
$ samtools faidx hg38.chrom.fasta

参数说明：
-R 基因组reference文件
-O 输出文件路径
生成文件和输入文件都是在当前目录下

index基本建立完成，然后是已知突变数据库下载，gatk提供了相关网站bundle可下载相关数据。下载hg38的所有数据到本地：

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/*

第二步：数据处理

1. 去接头

使用trim_galore进行去接头，自动检测接头序列：

input:
    R1 = '/Volumes/RawData1/WGS/Fastq/{sample}_R1.fastq.gz',
    R2 = '/Volumes/RawData1/WGS/Fastq/{sample}_R2.fastq.gz'
output:
    'Trim/{sample}_R1_val_1.fq.gz',
    'Trim/{sample}_R2_val_2.fq.gz'
shell:
    '''
        trim_galore \
        -q 20 \         
        --length 25 \ 
        --e 0.1 \
        --paired \
        {input.R1} \
        {input.R2} \
        --gzip \
        -o Trim
    '''

2. 序列比对

BWA 的比对速度实在是太慢了，而hisat2速度相对快一些，而且一样可以用于dna测序比对；比对之后利用samtools进行转换为bam并排序：

input:
    r1 = 'Trim/{sample}_R1_val_1.fq.gz',
    r2 = 'Trim/{sample}_R2_val_2.fq.gz',
    index = index
output:
    bam = 'Mapping/{sample}.sorted.bam',
    sum = 'Mapping/{sample}_aln_sum.txt'
shell:
    '''
        hisat2 \
        -p {threads} \
        -x {input.index}/genome \
        -1 {input.r1} \
        -2 {input.r2} \
        --summary-file {output.sum} | \
        samtools view -Sb -q 30 - | \
        samtools sort -@ {threads} -m 2G -O bam \
        -T Mapping/{wildcards.sample}.tmp -o {output.bam}
    '''

3. 添加头信息

由于不是用bwa比对，所以没办法在一开始就为reads加上头信息，所以这一步使用gatk去添加：

input:
    'Mapping/{sample}.sorted.bam'
output:
    'Mapping/{sample}.addhead.bam'
shell:
    '''
        gatk AddOrReplaceReadGroups \
        -I {input} \
        -O {output} \
        -LB {wildcards.sample} \
        -PL illumina \ # 测序平台不能乱写，其他随意
        -PU {wildcards.sample} \
        -SM {wildcards.sample} \
        -SO coordinate
    '''

4.去掉pcr重复

这一步只是把重复reads标记了出来，并没有删除，可以通过更改参数去删除reads：

input:
    'Mapping/{sample}.addhead.bam'
output:
    bam = 'MarkDup/{sample}.markdup.bam',
    met = 'MarkDup/{sample}.metrics.txt'
shell:
    '''
        gatk MarkDuplicates \
        -I {input} \
        -O {output.bam} \
        -M {output.met}
    '''

做完这一步需要建立一个index，方便后续调用bam文件：

input:
    'MarkDup/{sample}.markdup.bam'
output:
    'MarkDup/{sample}.markdup.bam.bai'
shell:
    'samtools index {input}'

5. 碱基质量校正

由于比对到SNP或INDEL上的reads附近会有很多错配，为了避免出现过多假阳性，需要对这部分reads进行局部重新比对;这个过程用到了很多已知变异集，即已知的可靠的变异位点，重比对将主要围绕这些位点进行：

# SNP and INDEL datasets
omni='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/1000G_omni2.5.hg38.vcf.gz'
thg='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/1000G_phase1.snps.high_confidence.hg38.vcf.gz'
mill='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz'
db146='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/dbsnp_146.hg38.vcf.gz'
hap='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/hapmap_3.3.hg38.vcf.gz'
dbsnp='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/beta/Homo_sapiens_assembly38.dbsnp.vcf.gz'
kn_indel='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/beta/Homo_sapiens_assembly38.known_indels.vcf.gz'
gold='/Volumes/RawData1/WGS/Reference/hg38_snp_datasets/beta/Homo_sapiens_assembly38.variantEvalGoldStandard.vcf.gz'

第一步是重比对的过程：

input:
    bam = 'MarkDup/{sample}.markdup.bam',
    ref = ref,
    db146 = db146,
    mill = mill,
    thg = thg,
    hap = hap,
    omni = omni,
    kn_indel = kn_indel,
    gold = gold,
    dbsnp = dbsnp
output:
    'BQSR/{sample}.markdup.recal.table'
shell:
    '''
        gatk BaseRecalibrator \
        -R {input.ref} \
        -I {input.bam} \
        --known-sites {input.db146} \
        --known-sites {input.mill} \
        --known-sites {input.thg} \
        --known-sites {input.hap} \
        --known-sites {input.omni} \
        --known-sites {input.kn_indel} \
        --known-sites {input.gold} \
        --known-sites {input.dbsnp} \
        -O {output}
    '''

第二步是把重比对的结果再写入到bam文件中：

input:
    bam = 'MarkDup/{sample}.markdup.bam',
    table = 'BQSR/{sample}.markdup.recal.table',
    ref = ref
output:
    'BQSR/{sample}.BQSR.bam'
shell:
    '''
        gatk ApplyBQSR \
        --bqsr-recal-file {input.table} \
        -R {input.ref} \
        -I {input.bam} \
        -O {output}
    '''

第三步依然是为bam文件建立index：

input:
    'BQSR/{sample}.BQSR.bam'
output:
    'BQSR/{sample}.BQSR.bam.bai'
shell:
    'samtools index {input}''

6. 开始真正的变异calling

在对reads做了校正之后，就可以拿来做Variant calling 了，这一步只是拿到初始的变异位点，因为后续还有对这些变异位点进行过滤和校正:

input:
    bam = 'BQSR/{sample}.BQSR.bam',
    ref = ref
output:
    'HC/{sample}.HC.vcf.gz'
shell:
    '''
        gatk HaplotypeCaller \
        -R {input.ref} \
        -I {input.bam} \
        -O {output}
    '''

7. 对上一步得到的变异进行过滤

该过程也分为两步，第一步是根据已知变异集的变异位点信息，利用自己的测序数据建立一个高斯模型，用来区分好的变异位点和坏的变异位点；好的变异位点即为已知变异集相同或相似的位点，坏的变异位点则相反：

首先是SNP筛选

第一步建立模型

input:
    vcf = 'HC/{sample}.HC.vcf.gz',
    ref = ref,
    hap = hap,
    omi = omni,
    thg = thg,
    dbs = db146,
    dbsnp = dbsnp,
    gold = gold
output:
    R = 'VQSR/{sample}.snps.plots.R',
    tr = 'VQSR/{sample}.snps.tranches',
    recal = 'VQSR/${sample}.snps.recal'
shell:
    '''
        gatk VariantRecalibrator \
        -R {input.ref} \
        -V {input.vcf} \
        --resource hapmap,known=false,training=true,truth=true,prior=15.0:{input.hap} \
        --resource omini,known=false,training=true,truth=true,prior=12.0:{input.omi} \
        --resource 1000G,known=false,training=true,truth=false,prior=10.0:{input.thg} \
        --resource dbsnp,known=true,training=false,truth=false,prior=2.0:{input.dbs} \
        --resource dbsnp38,known=true,training=false,truth=false,prior=2.0:{input.dbsnp} \
        --resource gold,known=true,training=false,truth=false,prior=2.0:{input.gold} \
        -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \ # 全基因组测序还有一个参数是 -an QD ，外显子测序没有
        -mode SNP \
        -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
        --rscript-file {output.R} \
        --tranches-file {output.tr} \
        -O {output.recal}
    '''

第二步筛选变异位点：

input:
    vcf = 'HC/{sample}.HC.vcf.gz',
    ref = ref,
    tr = 'VQSR/{sample}.snps.tranches',
    recal = 'VQSR/${sample}.snps.recal'
output:
    'VQSR/{sample}.snp.vcf'
shell:
    '''
        gatk ApplyVQSR \
        -R {input.ref} \
        -V {input.vcf} \
        --ts-filter-level 99.0 \
        --tranches-file {input.tr} \
        --recal-file {input.recal} \
        -mode SNP \
        -O {output}
    '''

然后是INDEL筛选

INDEL的筛选是建立在snp筛选基础上的，所以snp筛选用过的变异集在这一步就不再用了。

第一步建立模型

input:
    vcf = 'VQSR/{sample}.snp.vcf',
    ref = ref,
    mill = mill,
    kn_indel = kn_indel
output:
    tr = 'VQSR/{sample}.indel.tranches',
    R = 'VQSR/{sample}.indel.plots.R',
    recal = 'VQSR/${sample}.indel.recal'
shell:
    '''
        gatk VariantRecalibrator \
        -R {input.ref} \
        -V {input.vcf} \
        --resource mills,known=true,training=true,truth=true,prior=12.0:{input.mill} \
        --resource kn_indel,known=true,training=true,truth=true,prior=10.0:{input.kn_indel} \
        -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP \ # 全基因组测序还有一个参数是 -an QD ，外显子测序没有
        -mode INDEL \
        --max-gaussians 6 \
        --rscript-file {output.R} \
        --tranches-file {output.tr} \
        -O {output.recal}
    '''

第二步筛选变异位点

input:
    vcf = 'VQSR/{sample}.snp.vcf',
    ref = ref,
    tr = 'VQSR/{sample}.indel.tranches',
    recal = 'VQSR/${sample}.indel.recal'
output:
    'VQSR/{sample}.snp.indel.vcf'
shell:
    '''
        gatk ApplyVQSR \
        -R {input.ref} \
        -V {input.vcf} \
        --ts-filter-level 99.0 \
        --tranches-file {input.tr} \
        --recal-file {input.recal} \
        -mode INDEL \
        -O {output}
    '''

8.拆分SNP和INDEL结果

由于前两步产生的结果都在同一个文件里，所以需要从中把SNP和INDEL分别拆分出来：

input:
    vcf = 'VQSR/{sample}.snp.indel.vcf',
    ref = ref
output:
    snp = 'SNP_INDEL/{sample}.snp.vcf.gz',
    indel = 'SNP_INDEL/{sample}.indel.vcf.gz'
shell:
    '''
        gatk SelectVariants -R {input.ref} -select-type SNP --variant {input.vcf} -O {output.snp}
        gatk SelectVariants -R {input.ref} -select-type INDEL --variant {input.vcf} -O {output.indel}
    '''

9. 选择通过筛选的变异

在前两步的结果中，通过筛选的变异会加上一个'PASS'的标签，我们通过这个标签可以选择出对应的变异位点：

input:
    snp = 'SNP_INDEL/{sample}.snp.vcf.gz',
    indel = 'SNP_INDEL/{sample}.indel.vcf.gz'
output:
    snp = 'PASS/{sample}.filtered.snp.vcf',
    indel = 'PASS/{sample}.filtered.indel.vcf'
shell:
    '''
        zgrep 'PASS' {input.snp} > {output.snp}
        zgrep 'PASS' {input.indel} > {output.indel}
    '''

10. 对变异位点进行注释

最后得到的变异位点我们并不能直接看出它的功能等信息，因此需要利用已知的变异库去注释，然后去观察它是否和一些疾病等有关。注释软件我们选择ANNOVAR,因为它下载即可使用，而且提供常用的变异库，使用起来非常方便：

input:
    snp = 'PASS/{sample}.filtered.snp.vcf',
    indel = 'PASS/{sample}.filtered.indel.vcf',
    humandb = humandb,
    xref = xref
output:
    snp = 'ANNOVAR/{sample}.filtered.snp',
    indel = 'ANNOVAR/{sample}.filtered.indel'
shell:
    '''
        table_annovar.pl {input.snp} {input.humandb} -buildver hg38 -out {output.snp} \
        -remove -protocol refGene,cytoBand,exac03,dbnsfp33a,avsnp150,cosmic70,dbscsnv11 \
        -operation gx,r,f,f,f,f,f -nastring . \
        -polish -xref {input.xref}
    

        table_annovar.pl {input.indel} {input.humandb} -buildver hg38 -out {output.indel} \
        -remove -protocol refGene,cytoBand,exac03,dbnsfp33a,avsnp150,cosmic70,dbscsnv11 \
        -operation gx,r,f,f,f,f,f -nastring . \
        -polish -xref {input.xref}
    '''

最后，我们得到的是Tab分割的表格，分别储存着变异位点的位置、序列、相关基因等信息，可以利用这些信息去进行可视化，例如利用IGV去观察或者用circos或cirlize去作图。

一个完整的WGS变异检测snakemake流程详解

写在前边的话

第一步：建立index以及下载数据库

第二步：数据处理

1. 去接头

2. 序列比对

3. 添加头信息

4.去掉pcr重复

5. 碱基质量校正

6. 开始真正的变异calling

7. 对上一步得到的变异进行过滤

首先是SNP筛选

然后是INDEL筛选

8.拆分SNP和INDEL结果

9. 选择通过筛选的变异

10. 对变异位点进行注释

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

说不定以后用到呢

生物信息学与基因组学