美文网首页
全外显子组测序(WES)分析1: GATK4变异检测(bwa+s

全外显子组测序(WES)分析1: GATK4变异检测(bwa+s

作者: cc的生信随记 | 来源:发表于2024-06-20 09:45 被阅读0次

1. 全外显子组测序技术简介

全外显子组测序(whole exome sequencing,WES)是指利用序列捕获技术将全基因组外显子区域DNA捕捉并富集,然后通过高深度的高通量测序发现与蛋白质功能变异相关的遗传突变分析方法。外显子组约占全基因组的1%,却包含约85%的致病突变。因此,相比全基因组测序,全外显子测序更加经济、高效。外显子组测序主要用于识别和研究与疾病、种群进化相关的编码区及重要调控功能的非转录区域(Untranslated Regions,UTR)内的变异。结合大量的公共数据库提供的外显子数据,有利于更好地解释所得变异与疾病的关系。

优势:
1)外显子区域占比小(1%),更容易做到更高深度测序,检测到更多低频和罕见变异,同时也能降低测序费用和存储空间
2)直接对蛋白编码序列进行测序,找出影响蛋白结构的变异

2. 全外显子组测序分析流程

2.1 质量控制(Quality Control,QC)

# 安装指控软件fastp
$ conda install fastp

# 检查是否安装成功
$ fastp --help
## usage: fastp [options] ...
## options:
##   -i, --in1                            read1 input file name (string [=])
##   -o, --out1                           read1 output file name (string [=])
##   -I, --in2                            read2 input file name (string [=])
##   -O, --out2                           read2 output file name (string [=])
## ......

$ fastp -i input.R1.fq -o output.R1.fq -I input.R2.fq -O output.R2.fq

# 批量
$ bash run_fastp.sh

ꔷ run_fastp.sh:

#!/bin/bash
for i in ~/FASTQ/*_R1_001.fastq.gz
do
        fileName=`echo "$i" |awk -F "/" '{print $NF;exit}'|awk -F "_R" '{print $1;exit}'`
        sampleName=`echo "$i" |awk -F"/" '{print $NF;exit}'|awk -F "_" '{print $1;exit}'`

        fastp -i "$fileName"_R1_001.fastq.gz -o "$sampleName"_1.fastp.fq.gz -I "$fileName"_R2_001.fastq.gz -O "$sampleName"_2.fastp.fq.gz -h "$sampleName".html
done

ꔷ 主要参数设置:

-i, --in1 输入read1文件名
-o, --out1 输出read1文件名
-I, --in2 输入read2文件名
-O, --out2 输出read2文件名,软件默认是根据扩展名识别压缩文件,所以输出文件需要加上*.gz扩展名;

2.2 GATK使用流程

2.2.1 工具安装

  1. BWA

BWA,即Burrows-Wheeler-Alignment Tool,是一种能够将差异度较小的序列比对到一个较大的参考基因组上的软件包。由三个不同的算法组成:
ꔷ BWA-backtrack: 是用来比对较短的序列,reads 长度最长能到 100bp
ꔷ BWA-SW: 用于比对 long-read ,支持的长度为 70bp-1Mbp;同时支持剪接性比对。(通常用于对比如 PacBio 或 Nanopore 这样的长读取)
ꔷ BWA-MEM: 最常用的模式,支持较长的read长度,同时支持剪接性比对(split alignments),且 BWA-MEM 对于 70bp-100bp 的 Illumina 数据来说,效果也更好些。

sudo apt install bwa -y

bwa # 检验是否安装成功
## 
## Program: bwa (alignment via Burrows-Wheeler transformation)
## Version: 0.7.17-r1188
## Contact: Heng Li <lh3@sanger.ac.uk>
## 
## Usage:   bwa <command> [options]
## 
## Command: index         index sequences in the FASTA format
##          mem           BWA-MEM algorithm
##          fastmap       identify super-maximal exact matches
##          pemerge       merge overlapping paired ends (EXPERIMENTAL)
##          aln           gapped/ungapped alignment
##          samse         generate alignment (single ended)
##          sampe         generate alignment (paired ended)
##          bwasw         BWA-SW for long queries
## 
##          shm           manage indices in shared memory
##          fa2pac        convert FASTA to PAC format
##          pac2bwt       generate BWT from PAC
##          pac2bwtgen    alternative algorithm for generating BWT
##          bwtupdate     update .bwt to the new format
##          bwt2sa        generate SA from BWT and Occ
## 
## Note: To use BWA, you need to first index the genome with `bwa index'.
##       There are three alignment algorithms in BWA: `mem', `bwasw', and
##       `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
##       first. Please `man ./bwa.1' for the manual.
## 
  1. samtools
wget https://github.com/samtools/samtools/releases/download/1.20/samtools-1.20.tar.bz2 # 下载
tar jxvf samtools-1.20.tar.bz2 #解压

cd samtools-1.20/
./configure --prefix=/software/samtools-1.20
make
make install

./samtools # 检验是否安装成功
## 
## Program: samtools (Tools for alignments in the SAM format)
## Version: 1.20 (using htslib 1.20)
## 
## Usage:   samtools <command> [options]
## 
## Commands:
##   -- Indexing
##      dict           create a sequence dictionary file
##      faidx          index/extract FASTA
##      fqidx          index/extract FASTQ
##      index          index alignment
## 
##   -- Editing
##      calmd          recalculate MD/NM tags and '=' bases
##      fixmate        fix mate information
##      reheader       replace BAM header
##      targetcut      cut fosmid regions (for fosmid pool only)
##      addreplacerg   adds or replaces RG tags
##      markdup        mark duplicates
##      ampliconclip   clip oligos from the end of reads
## 
##   -- File operations
##      collate        shuffle and group alignments by name
##      cat            concatenate BAMs
##      consensus      produce a consensus Pileup/FASTA/FASTQ
##      merge          merge sorted alignments
##      mpileup        multi-way pileup
##      sort           sort alignment file
##      split          splits a file by read group
##      quickcheck     quickly check if SAM/BAM/CRAM file appears intact
##      fastq          converts a BAM to a FASTQ
##      fasta          converts a BAM to a FASTA
##      import         Converts FASTA or FASTQ files to SAM/BAM/CRAM
##      reference      Generates a reference from aligned data
##      reset          Reverts aligner changes in reads
## 
##   -- Statistics
##      bedcov         read depth per BED region
##      coverage       alignment depth and percent coverage
##      depth          compute the depth
##      flagstat       simple stats
##      idxstats       BAM index stats
##      cram-size      list CRAM Content-ID and Data-Series sizes
##      phase          phase heterozygotes
##      stats          generate stats (former bamcheck)
##      ampliconstats  generate amplicon specific stats
## 
##   -- Viewing
##      flags          explain BAM flags
##      head           header viewer
##      tview          text alignment viewer
##      view           SAM<->BAM<->CRAM conversion
##      depad          convert padded BAM to unpadded BAM
##      samples        list the samples in a set of SAM/BAM/CRAM files
## 
##   -- Misc
##      help [cmd]     display this help message or help for [cmd]
##      version        detailed version information
## 
  1. sambamba
# 下载文件
wget https://github.com/biod/sambamba/archive/refs/tags/v1.0.1.tar.gz
tar -zxvf v1.0.1.tar.gz

cd sambamba-1.0.1/
make # 若报错,需先安装ldc2(wget https://github.com/ldc-developers/ldc/releases/download/v1.35.0/ldc2-1.35.0-linux-x86_64.tar.xz)

echo "PATH=$PATH:/sambamba-1.0.1/bin/path" >> ~/.bashrc
source ~/.bashrc

sambamba-1.0.1 --help  # 查看是否安装成功
## 
## sambamba 1.0.1
##  by Artem Tarasov and Pjotr Prins (C) 2012-2023
##     LDC 1.35.0 / DMD v2.105.2 / LLVM16.0.6 / bootstrap LDC - the LLVM D compiler (1.35.0)
## 
## 
## Usage: sambamba [command] [args...]
## 
##   Available commands:
## 
##     view        view contents and convert from one format
##                 to another (SAM/BAM/JSON/UNPACK)
##     index       build index (BAI)
##     merge       merge files (BAM)
##     sort        sort file (BAM)
##     slice       slice file (BAM using BED)
##     markdup     mark or remove duplicates (BAM)
##     subsample   subsample (BAM)
##     flagstat    output statistics (BAM)
##     depth       output statistics (BAM)
##     validate    simple validator (BAM)
## 
##   No longer recommended:
## 
##     mpileup     parallel execution of samtools (BAM)
## 
## To get help on a particular command, call it without args.
## 
## Global options
## 
##     -q          quiet mode (do not show banner)
## 
## For bug reports and feature requests see
## 
##        https://github.com/biod/
## 
  1. GATK4
wget https://github.com/broadinstitute/gatk/releases/download/4.5.0.0/gatk-4.5.0.0.zip

unzip gatk-4.5.0.0.zip # 解压安装包
cd gatk-4.5.0.0

echo "PATH=$PATH:/gatk-4.5.0.0/path" >> ~/.bashrc
source ~/.bashrc

gatk -h  # 查看是否安装成功
## 
##  Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
##     gatk AnyTool toolArgs
## 
##  Usage template for Spark tools (will NOT work on non-Spark tools)
##     gatk SparkTool toolArgs  [ -- --spark-runner <LOCAL | SPARK | GCS> sparkArgs ]
## 
##  Getting help
##     gatk --list       Print the list of available tools
## 
##     gatk Tool --help  Print help on a particular tool
## 
##  Configuration File Specification
##      --gatk-config-file                PATH/TO/GATK/PROPERTIES/FILE
## 
##  gatk forwards commands to GATK and adds some sugar for submitting spark jobs
## 
##    --spark-runner <target>    controls how spark tools are run
##      valid targets are:
##      LOCAL:      run using the in-memory spark runner
##      SPARK:      run using spark-submit on an existing cluster
##                  --spark-master must be specified
##                  --spark-submit-command may be specified to control the Spark submit command
##                  arguments to spark-submit may optionally be specified after --
##      GCS:        run using Google cloud dataproc
##                  commands after the -- will be passed to dataproc
##                  --cluster <your-cluster> must be specified after the --
##                  spark properties and some common spark-submit parameters will be translated
##                  to dataproc equivalents
## 
##    --dry-run      may be specified to output the generated command line without running it
##    --java-options 'OPTION1[ OPTION2=Y ... ]'   optional - pass the given string of options to the
##                  java JVM at runtime.
##                  Java options MUST be passed inside a single string with space-separated values.
## 
##    --debug-port <number> sets up a Java VM debug agent to listen to debugger connections on a
##                          particular port number. This in turn will add the necessary java VM arguments
##                          so that you don't need to explicitly indicate these using --java-options.
##    --debug-suspend       sets the Java VM debug agent up so that the run get immediatelly suspended
##                          waiting for a debugger to connect. By default the port number is 5005 but
##                          can be customized using --debug-port
## 
  1. bcftools
sudo apt-get install bcftools #  Ubuntu 系统

bcftools --help  # 查看是否安装成功
## 
## Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
## Version: 1.10.2 (using htslib 1.10.2-3ubuntu0.1)
## 
## Usage:   bcftools [--version|--version-only] [--help] <command> <argument>
## 
## Commands:
## 
##  -- Indexing
##     index        index VCF/BCF files
## 
##  -- VCF/BCF manipulation
##     annotate     annotate and edit VCF/BCF files
##     concat       concatenate VCF/BCF files from the same set of samples
##     convert      convert VCF/BCF files to different formats and back
##     isec         intersections of VCF/BCF files
##     merge        merge VCF/BCF files files from non-overlapping sample sets
##     norm         left-align and normalize indels
##     plugin       user-defined plugins
##     query        transform VCF/BCF into user-defined formats
##     reheader     modify VCF/BCF header, change sample names
##     sort         sort VCF/BCF file
##     view         VCF/BCF conversion, view, subset and filter VCF/BCF files
## 
##  -- VCF/BCF analysis
##     call         SNP/indel calling
##     consensus    create consensus sequence by applying VCF variants
##     cnv          HMM CNV calling
##     csq          call variation consequences
##     filter       filter VCF/BCF files using fixed thresholds
##     gtcheck      check sample concordance, detect sample swaps and contamination
##     mpileup      multi-way pileup producing genotype likelihoods
##     roh          identify runs of autozygosity (HMM)
##     stats        produce VCF/BCF stats
## 
##  Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
##  automatically even when streaming from a pipe. Indexed VCF and BCF will work
##  in all situations. Un-indexed VCF and BCF and streams will work in most but
##  not all situations.
## 

2.2.2 原始数据处理

  1. 下载参考基因组数据并建立索引
wget https://github.com/broadinstitute/gatk/blob/master/src/test/resources/large/Homo_sapiens_assembly38.fasta.gz
gunzip Homo_sapiens_assembly38.fasta.gz # 解压

bwa index  /data/Ref/Homo_sapiens_assembly38.fasta.gz # 大于2G的参考基因组要用参数-a bwtsw
## [bwa_index] Construct BWT for the packed sequence...
## [BWTIncCreate] textLength=383669262, availableWord=38996220
## [BWTIncConstructFromPacked] 10 iterations done. 64326510 characters processed.
## [BWTIncConstructFromPacked] 20 iterations done. 118839214 characters processed.
## [BWTIncConstructFromPacked] 30 iterations done. 167286270 characters processed.
## [BWTIncConstructFromPacked] 40 iterations done. 210342094 characters processed.
## [BWTIncConstructFromPacked] 50 iterations done. 248606174 characters processed.
## ......

生成5个文件:


GATK-1
  1. 比对到参考基因组
# 此处一个样本有4个fastq,分开处理,将同⼀通道的两个fq文件⼀起处理,生成⼀个sam/bam⽂件,后面进行merge
bwa mem -t 20 -M -Y -R '@RG\tID:V300059041_L02_658\tPL:ILLUMINA\tLB:lib1\tSM:organoid_V300059041_L02' /data/CC56organoid/clean_data/V300059041_L01_658_1.fq.gz /data/CC56organoid/clean_data/V300059041_L01_658_2.fq.gz > V300059041_L01_658.sam 2> bwa_1.log
bwa mem -t 20 -M -Y -R '@RG\tID:V300059041_L02_658\tPL:ILLUMINA\tLB:lib1\tSM:organoid_V300059041_L02' /data/Ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa /data/CC56organoid/clean_data/V300059041_L02_658_1.fq.gz /data/CC56organoid/clean_data/V300059041_L02_658_2.fq.gz > V300059041_L02_658.sam 2> bwa_2.log

# sam转bam格式 (bam是二进制文件,运算速度快)
samtools view -bS V300059041_L01_658.sam > V300059041_L01_658.bam
samtools view -bS V300059041_L02_658.sam > V300059041_L02_658.bam

# 排序
samtools sort V300059041_L01_658.bam -o V300059041_L01_658_sorted.bam
samtools sort V300059041_L02_658.bam -o V300059041_L02_658_sorted.bam

# 合并
samtools merge CC56organoid.bam V300059041_L01_658_sorted.bam V300059041_L02_658_sorted.bam

2.2.3 运行GATK4

  1. 去除重复reads
# sambamba去重复时会自动建立索引
sambamba-1.0.1 markdup -t 10 -r -p V300059041_L01_658_sorted.bam V300059041_L01_658_sorted.markdup.bam 2> sambamba_markdup_1_log.txt
sambamba-1.0.1 markdup -t 10 -r -p V300059041_L02_658_sorted.bam V300059041_L02_658_sorted.markdup.bam 2> sambamba_markdup_2_log.txt

sambamba-1.0.1 markdup -t 10 -r -p  CC56organoid.bam CC56organoid.markdup.bam  2> sambamba_markdup_log.txt

# 建立reference genome序列的dict
gatk CreateSequenceDictionary -R Homo_sapiens_assembly38.fasta
  1. Germline variant discovery:SNPs & Indels (HaplotypeCaller)
# 生成中间g.vcf文件
gatk --java-options "-Xmx4g" HaplotypeCaller  \
   -R Homo_sapiens_assembly38.fasta \
   -I V300059041_L01_658_sorted.markdup.bam \
   -O V300059041_L01_658_sorted.markdup.g.vcf \
   -ERC GVCF

gatk --java-options "-Xmx4g" HaplotypeCaller  \
   -R Homo_sapiens_assembly38.fasta \
   -I V300059041_L02_658_sorted.markdup.bam \
   -O V300059041_L02_658_sorted.markdup.g.vcf \
   -ERC GVCF

# 合并GVCF文件
gatk CombineGVCFs -R Homo_sapiens_assembly38.fasta \
    --variant V300059041_L01_658_sorted.markdup.g.vcf  \
    --variant V300059041_L02_658_sorted.markdup.g.vcf \
    -O CC56organoid.HaplotypeCaller.g.vcf

# 基因型调用
gatk GenotypeGVCFs -R Homo_sapiens_assembly38.fasta -V CC56organoid.HaplotypeCaller.g.vcf -O  CC56organoid.HaplotypeCaller.final.vcf
  1. Somatic variant discovery: SNPs & Indels
# 突变检测,生成初步的 VCF 文件 (Mutect2)
gatk --java-options "-Xmx10G" Mutect2 -R Homo_sapiens_assembly38.fasta -I CC56organoid.markdup.bam -O CC56organoid.markdup.unfiltered.vcf

# 过滤
gatk --java-options "-Xmx10G" FilterMutectCalls -R Homo_sapiens_assembly38.fasta -V CC56organoid.markdup.unfiltered.vcf -O CC56organoid.markdup.filtered.vcf

# 进一步过滤 (可选)
bcftools filter -i 'DP>10 && QUAL>30' CC56organoid.markdup.filtered.vcf -o final_CC56organoid.vcf
  1. GATK bundle 数据下载
    GATK 在官网中提供了resource bundle,里面包含了所需要的很多数据。目前提供FTP 和 Google Cloud bucket 2种下载方式,Google可以https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0?pli=1自行下载,FTP可使用lftp工具进行访问和下载
# 安装lftp(需要root账号或权限)
$ sudo apt install lftp

$ lftp ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/ # 密码为空,直接回车
GATK-2
lftp gsapubftp-anonymous@ftp.broadinstitute.org:/bundle> ls hg38
-rw-rw-r--   1 vdauwera broad    53238342 Jan  5  2016 1000G_omni2.5.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     1544108 Jan  5  2016 1000G_omni2.5.hg38.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad    1888262073 Jan  5  2016 1000G_phase1.snps.high_confidence.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     2128536 Jan  5  2016 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad     3053999 Dec  1  2016 Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad      421689 Dec  1  2016 Axiom_Exome_Plus.genotypes.all_populations.poly.hg38.vcf.gz.tbi
drwxrwxr-x   2 vdauwera broad         644 Dec  1  2016 beta
-rw-rw-r--   1 vdauwera broad    1560889937 Dec  1  2016 dbsnp_138.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     2321422 Jan  5  2016 dbsnp_138.hg38.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad    3215686850 Jan  5  2016 dbsnp_144.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     2434204 Jan  5  2016 dbsnp_144.hg38.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad    3411143311 Dec  1  2016 dbsnp_146.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     2466606 Dec  1  2016 dbsnp_146.hg38.vcf.gz.tbi
-rw-r--r--   1 vdauwera broad    179601790 Feb 10  2017 hapmap_3.3_grch38_pop_stratified_af.vcf.gz
-rw-r--r--   1 vdauwera broad     1547721 Feb 10  2017 hapmap_3.3_grch38_pop_stratified_af.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad    62043448 Jan  5  2016 hapmap_3.3.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     1552123 Jan  5  2016 hapmap_3.3.hg38.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad      581712 Jan  6  2016 Homo_sapiens_assembly38.dict
-rw-r--r--   1 vdauwera broad      487553 Dec  1  2016 Homo_sapiens_assembly38.fasta.64.alt
-rw-rw-r--   1 vdauwera broad      160928 Dec  1  2016 Homo_sapiens_assembly38.fasta.fai
-rw-rw-r--   1 vdauwera broad    890202736 Jan  5  2016 Homo_sapiens_assembly38.fasta.gz
-rw-rw-r--   1 vdauwera broad    20685880 Jan  5  2016 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
-rw-rw-r--   1 vdauwera broad     1500013 Jan  5  2016 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
-rw-rw-r--   1 vdauwera broad      599399 Dec  1  2016 wgs_calling_regions.hg38.interval_list

# 根据文件夹内容可自行下载
wget -b -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
wget -b -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_138.hg38.vcf.gz
wget -b -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  1. 注释VCF文件 (VariantAnnotator)
gatk --java-options "-Xmx8g -Xms8g" VariantAnnotator \
   -R /data/shumin/WES/Index/ref/ref_fa/Homo_sapiens_assembly38.fasta \
   -I /data/shumin/WES/Index/ref/alignment/CC56organoid.markdup.bam \
   -V CC56organoid.HaplotypeCaller.final.vcf \
   -O CC56organoid.HaplotypeCaller.final.annotate.vcf \
   --dbsnp /data/WES/ref_data/dbsnp_138.hg38.vcf.gz
  1. Indel / SNP 过滤 (VariantRecalibrator)
##Indel Mode
gatk --java-options "-Xmx8g -Xms8g" VariantRecalibrator \
   -R /data/shumin/WES/Index/ref/ref_fa/Homo_sapiens_assembly38.fasta \
   -V CC56tissueA.HaplotypeCaller.final.annotate.vcf \
   -resource:mills,known=true,training=true,truth=true,prior=12.0 /data/shumin/WES/ref_data/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /data/shumin/WES/ref_data/dbsnp_138.hg38.vcf.gz \
   -an QD    -an SOR  \
   --mode INDEL \
   --max-gaussians 4 \
   --rscript-file  CC56tissueA.plots.R \
   --tranches-file CC56tissueA.tranches \
   -O CC56tissueA.recal \
   1> CC56tissueA.indel.vqsr.log 2>&1

gatk --java-options "-Xmx8g -Xms8g" ApplyVQSR \
   -R /data/shumin/WES/Index/ref/ref_fa/Homo_sapiens_assembly38.fasta \
   -V CC56tissueA.HaplotypeCaller.final.annotate.vcf \
   --truth-sensitivity-filter-level 99.0 \
   --tranches-file CC56tissueA.tranches \
   --recal-file CC56tissueA.recal \
   --mode INDEL \
   -O CC56tissueA.indels.VQSR.vcf \
   1>CC56tissueA.indels.ApplyVQSR.log 2>&1
   echo "** Indels VQSR done **"

##SNP Mode
gatk --java-options "-Xmx8g -Xms8g" VariantRecalibrator \
   -R /data/shumin/WES/Index/ref/ref_fa/Homo_sapiens_assembly38.fasta \
   -V CC56tissueA.indels.VQSR.vcf  \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/shumin/WES/ref_data/hapmap_3.3.hg38.vcf.gz \
   -resource:omini,known=false,training=true,truth=false,prior=12.0 /data/shumin/WES/ref_data/1000G_omni2.5.hg38.vcf.gz \
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/shumin/WES/ref_data/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
   -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /data/shumin/WES/ref_data/dbsnp_138.hg38.vcf.gz \
   -an QD  -an MQ  \
   --mode SNP \
   --max-gaussians 4 \
   -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 95.0 -tranche 90.0 \
   --rscript-file CC56tissueA.indels.snps.plots.R \
   --tranches-file CC56tissueA.indels.snps.tranches \
   -O CC56tissueA.indels.snps.recal  \
   1>CC56tissueA.indels.snps.vqsr.log 2>&1 

gatk --java-options "-Xmx8g -Xms8g" ApplyVQSR \
   -R /data/shumin/WES/Index/ref/ref_fa/Homo_sapiens_assembly38.fasta \
   -V CC56tissueA.indels.VQSR.vcf \
   --truth-sensitivity-filter-level 99.0 \
   --tranches-file CC56tissueA.indels.snps.tranches \
   --recal-file CC56tissueA.indels.snps.recal \
   --mode SNP \
   -O CC56tissueA.indels.snps.VQSR.vcf \
   1>CC56tissueA.indels.snps.ApplyVQSR.log 2>&1 
   echo "** SNPs VQSR done **" 
  1. 转化为maf格式
perl /data/shumin/software/vcf2maf/vcf2maf.pl --input-vcf CC56tissueA.indels.snps.VQSR.vcf --output-maf CC56tissueA.indels.snps.VQSR.maf --ref-fasta /data/shumin/WES/Index/ref/ref_fa/Homo_sapiens_assembly38.fasta --tumor-id CC56tissueA --vep-path /data/shumin/software/ensembl-vep-release-112.0 --vep-data /data/shumin/software/ensembl-vep-release-112.0/vep_cache --cache-version 112 --ncbi-build GRCh38 

相关文章

网友评论

      本文标题:全外显子组测序(WES)分析1: GATK4变异检测(bwa+s

      本文链接:https://www.haomeiwen.com/subject/utzwqjtx.html