使用GATK从RNA-seq数据中call variants

作者: RenJ_9394 | 来源:发表于2019-12-26 11:39 被阅读0次

使用GATK从RNA-seq数据中call variants
BSR-(RNA-seq)数据进行BSR分析-更新中
GATK中MNP MIXED SYMBOLIC等不同类型解释
RNA-seq 检测变异之 GATK 最佳实践流程
GATK's Pipeline for calling
bwa+samtools+picardtools+GATK ca
通过lftp命令下载GATK官网Resource bundle数
WGS数据分析（二）
胚系变异(SNP/InDel)频率波动的影响因素
GATK4+mutect2 call somatic mutat

GATK官方给出了从RNA-seq数据中寻找变异位点的流程，但这个示意图比较简洁，实际操作时一不小心就会报错，故经过探索，记录下这个流程的细节以及半自动化的脚本。

image

1. 软件及环境

samtools: 1.9

STAR: 2.7.0e

GATK: 4.1.4.1

操作系统：Red Hat 4.8.5-28

2.流程概述

STAR进行比对生成BAM文件 -> GATK内置Picard工具对BAM文件进行处理，以适配后续分析流程，否则会报错 -> HaplotypeCaller寻找变异 -> 分SNP，Indel两种模式对变异进行过滤。

*若有生物学重复，可以在call变异这一步为每一个样本生成gvcf文件，然后合并gvcf，进行joint-calling。

3.代码

1）mapping

对于RNA-seq数据，GATK建议使用STAR这一比对软件。这一步我并没有写到一个脚本里。

首先建立index：


STAR --runThreadN 20 \ #线程数

    --runMode genomeGenerate \

    --genomeDir /your_path/STAR/1.index \ #生成index的路径

    --genomeFastaFiles /your_path/hg38_UCSC.fa \  #参考基因组文件

    --sjdbGTFfile /your_path/gencode.v32.annotation.gtf \ #参考基因组注释文件

    --sjdbOverhang 140 #reads长度-1

结果如下：

image

然后进行mapping：


STAR --runMode alignReads\

      --runThreadN 20 \ 

      --genomeDir /your_path/STAR/1.index/  \ #上一步生成的index路径

      --outFileNamePrefix /your_path/2.mapping/${sample} \ #mapping结果的路径及前缀

      --outSAMmapqUnique 255  \ #唯一比对的质量值设为255

      --outSAMtype BAM SortedByCoordinate \ #直接输出排序好的BAM文件

      --outBAMsortingThreadN 10 \

      --readFilesIn /your_path/${sample}_1.clean.trimmed.fq /your_path/${sample}_2.clean.trimmed.fq #双端测序文件

结果如下：

image

接着利用上一步mapping结果收集到的junction信息，也就是SJ.out.tab文件第二次构建index，多个样本时将它们的结果一起输入：*


STAR --runThreadN 40 \

    --runMode genomeGenerate \

    --genomeDir  /your_path/STAR/3.index \

    --genomeFastaFiles /your_path/hg38_UCSC.fa \

    --sjdbGTFfile /your_path/gencode.v32.annotation.gtf \

    --sjdbFileChrStartEnd /your_path/STAR/2.mapping/*SJ.out.tab \ #第一次mapping得到的SJ.out.tab文件

    --sjdbOverhang 140 \

    --limitSjdbInsertNsj 1100000 #由于我的样本太多，得到的junction数量超过上限，因此加上这个参数更改上限（默认值：1000000）

最后进行第二次mapping，代码和第一次mapping类似，只是更换下index和output路径，代码就不贴了。

2）多样本模式进行variants calling

首先从bam文件到gvcf文件，我写到了一个脚本里，命名bam2gvcf.bash：


*#！/bin/bash* 

#mapping BAM to vcf

#软件, reference genome路径，根据实际情况修改，=两边不能有空格

gatk=your path to GATK

samtools=your path to samtools

ref_genome=your path to reference genome

dbsnp=your path to dbsnp

#输入参数

BAM=$1 #mapping后的BAM文件

RGID=$2 #read group ID

library=$3 #测序文库编号

sample=$4 #sample名称

outdir=$5 #输出文件路径

#按样本设置目录

outdir = ${outdir}/${sample}

#output diretory

if [ ! -d $outdir/variants ]

then mkdir -p $outdir/variants

fi

if [ ! -d $outdir/processBAM ]

then mkdir -p $outdir/processBAM

fi

#为BAM文件添加RG标签，这步省略后续会报错

time $gatk AddOrReplaceReadGroups \

    --INPUT $BAM \

    --OUTPUT $outdir/processBAM/${sample}.RG.bam \

    --RGID $RGID \

    --RGLB $library \

    --RGPL illumina \

    --RGPU snpcall \

    --RGSM ${sample} && echo "** ADD RG done **"

#标记因PCR产生的重复reads，mark duplicates & index

time $gatk MarkDuplicates \

    -I $outdir/processBAM/${sample}.RG.bam \

    -M $outdir/processBAM/${sample}.markdup_metrics.txt \

    -O $outdir/processBAM/${sample}.RG.markdup.bam &&\

time $samtools index $outdir/processBAM/${sample}.RG.markdup.bam && echo "** ${sample}.RG.markdup.bam index done **"

#split N Trim，去除落到内含子上的片段，但运行结果发现仍然有很多intron上的变异

time $gatk SplitNCigarReads \

    --R $ref_genome \

    --I $outdir/processBAM/${sample}.RG.markdup.bam \

    --O $outdir/processBAM/${sample}.RG.markdup.split.bam && echo "** ${sample}.RG.markdup.bam split N done **"

#HaplotypeCaller

time $gatk HaplotypeCaller \

    -R $ref_genome \

    -I $outdir/processBAM/${sample}.RG.markdup.split.bam \

    -O $outdir/variants/${sample}.g.vcf.gz \

    -dont-use-soft-clipped-bases \

    --emit-ref-confidence GVCF \

    --standard-min-confidence-threshold-for-calling 20 \

    --dbsnp $dbsnp && echo "** ${sample}.gvcf done **"


#执行脚本，参数根据实际情况填进去

bash bam2gvcf.sh bam文件 RG LB SAMPLE_ID OUTPUT

然后合并gvcf，并对变异进行硬过滤，脚本名gvcf2vcf.sh：


#!/bin/bash

*#gvcf to vcf* 

#路径

gatk=path to GATK

ref_genome=path to reference genome

#执行参数

samples=$1 #所有的样本ID， 用","分隔开

indir=$2 #输入目录的路径，与上个脚本的输出路径相同

outname=$3 #输出文件名的前缀

outdir=$indir

#设置群体变异检测结果输出目录

if [ ! -d $outdir/population ]

then mkdir -p $outdir/population

fi

#按照","，把所有样本ID拆分出来存至数组中

samples = $(echo $samples | tr "," "\n")

##群体的joint genotyping, 先合并所有的gvcf结果，然后统一进行GenotypeGVCFs

sample_gvcfs=""

for sample in $samples; do

    sample_gvcfs=${sample_gvcfs}"-V $outdir/${sample}/variants/${sample}.g.vcf.gz "

done

time $gatk CombineGVCFs \

    -R $ref_genome \

    ${sample_gvcfs} \

    -O $outdir/population/${outname}.combine.g.vcf.gz && echo "** ${outname}.combine.g.vcf done **" && \

time $gatk GenotypeGVCFs \

    -R $ref_genome \

    -V $outdir/population/${outname}.combine.g.vcf.gz \

    -O $outdir/population/${outname}.vcf.gz && echo "** ${outname}.vcf.gz done **"

#select SNP & filter

time $gatk SelectVariants \

    -select-type SNP \

    -V $outdir/population/${outname}.vcf.gz \

    -O $outdir/population/${outname}.snp.vcf.gz && \

time $gatk VariantFiltration \

    -V $outdir/population/${outname}.snp.vcf.gz \

    -O $outdir/population/${outname}.snp.filter.vcf.gz \

    -R $ref_genome \

    --filter-expression 'QD < 2.0 || FS > 60.0 || MQ <25.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' \ #过滤的阈值可以根据实验进行设置

    --filter-name "my_filter" && echo "** ${outname}.snp filter done **"

对于indel，过滤的方法类似，但由于我的实验目的不关心indel，因此脚本中不涉及。

3）单样本进行variants calling

可以放在一个脚本中：


#！/bin/bash

#mapping BAM to vcf

#软件, reference genome路径，根据实际情况修改

gatk=path to GATK

samtools=path to samtools

ref_genome=path to reference genome

dbsnp=path to dbsnp

#参数

BAM=$1 #mapping后的BAM文件

RGID=$2 #read group ID

library=$3 #测序文库编号

sample=$4 #sample名称

outdir=$5 #输出文件路径

#按样本设置目录

outdir = ${outdir}/${sample}

#output diretory

if [ ! -d $outdir/variants ]

then mkdir -p $outdir/variants

fi

if [ ! -d $outdir/processBAM ]

then mkdir -p $outdir/processBAM

fi

#add RG & sort

time $gatk AddOrReplaceReadGroups \

    --INPUT $BAM \

    --OUTPUT $outdir/processBAM/${sample}.RG.bam \

    --RGID $RGID \

    --RGLB $library \

    --RGPL illumina \

    --RGPU snpcall \

    --RGSM ${sample} && echo "** ADD RG done **"

#mark duplicates & index

time $gatk MarkDuplicates \

    -I $outdir/processBAM/${sample}.RG.bam \

    -M $outdir/processBAM/${sample}.markdup_metrics.txt \

    -O $outdir/processBAM/${sample}.RG.markdup.bam &&\

time $samtools index $outdir/processBAM/${sample}.RG.markdup.bam && echo "** ${sample}.RG.markdup.bam index done **"

#split N Trim

time $gatk SplitNCigarReads \

    --R $ref_genome \

    --I $outdir/processBAM/${sample}.RG.markdup.bam \

    --O $outdir/processBAM/${sample}.RG.markdup.split.bam && echo "** ${sample}.RG.markdup.bam split N done **"

#HaplotypeCaller

time $gatk HaplotypeCaller \

    -R $ref_genome \

    -I $outdir/processBAM/${sample}.RG.markdup.split.bam \

    -O $outdir/variants/${sample}.vcf.gz \

    -dont-use-soft-clipped-bases \

    --standard-min-confidence-threshold-for-calling 20 \

    --dbsnp $dbsnp && echo "** ${sample}.vcf done **"

#选择SNP，并filter

time $gatk SelectVariants \

    -select-type SNP \

    -V $outdir/variants/${sample}.vcf.gz \

    -O $outdir/variants/${sample}.snp.vcf.gz && \

time $gatk VariantFiltration \

    -V $outdir/variants/${sample}.vcf.gz \

    -O $outdir/variants/${sample}.snp.filter.vcf.gz \

    -R $ref_genome \

    --filter-expression 'QD < 2.0 || FS > 60.0 || MQ <25.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' \ #过滤的阈值可以根据实验进行设置

    --filter-name "my_filter" && echo "** ${sample}.snp filter done **"

接下来就可以用ANNOVAR对变异进行注释了。

使用GATK从RNA-seq数据中call variants
GATK官方给出了从RNA-seq数据中寻找变异位点的流程，但这个示意图比较简洁，实际操作时一不小心就会报错，故经...
BSR-(RNA-seq)数据进行BSR分析-更新中
使用工具GATK4。GATK基础RNA-seq分析GATK官方RNA-seq calling流程WT 3个单株，混...
GATK中MNP MIXED SYMBOLIC等不同类型解释
What types of variants can GATK tools detect / handle? Th...
RNA-seq 检测变异之 GATK 最佳实践流程
tags： RNA-seq GATK STAR SNP INDEL RNA-seq 序列比对对 RNA-seq ...
GATK's Pipeline for calling
#!/bin/bash PICARD=/home/lq/storage/call_variants/softwar...
bwa+samtools+picardtools+GATK ca
call SNP GATK bwa+samtools 参考文档一、原始数据处理测序得到的RAW D...
通过lftp命令下载GATK官网Resource bundle数
通过lftp命令下载GATK官网Resource bundle数据 # 如果使用GATK软件calling var...
WGS数据分析（二）
manta call SV 结构变异（structural variants）主要找 ---缺失(Deletion...
胚系变异(SNP/InDel)频率波动的影响因素
背景在二代数据处理中，用GATK HaplotypeCaller去call二倍体胚系变异应该属于比较常用的检...
GATK4+mutect2 call somatic mutat
最近看了GATK的网站，以前经常用GATK call somatic mutation，但是用的版本太老了，另外...