Learning the WES Data Analysis Pipeline

Author: 程凉皮儿 | Published 2020-06-22 17:51

    Reference materials:
    notes
    video tutorial
    knowledge base

    First, install the software

    First install conda, using the Tsinghua (TUNA) mirror; instructions: https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/
    Then download and install Miniconda from: https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/
    Within mainland China you need to switch the channel configuration to a mirror (I rented a server at a US HPC center, so I did not need to set a mirror)
    Before downloading and installing a tool, first check whether it is already available on Bioconda: https://bioconda.github.io/recipes.html
    For the mirror/channel configuration, see 卖萌哥's continuously updated conda tutorial

    Then you can use conda to install the series of tools the pipeline needs.
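
    A rough sketch of what that looks like (the TUNA channel URLs come from the help page linked above; the exact package list is my assumption, based on the tools used later in this post):

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
    conda config --set show_channel_urls yes
    ## create and populate the "wes" environment used throughout this post
    conda create -n wes -y
    conda activate wes
    conda install -y fastqc multiqc trim-galore bwa samtools bcftools bedtools qualimap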

    First, set up the analysis environment

    To keep things organized, first create folders to hold the software packages, the database files, and the results produced during the analysis.

    ## First create a wes_cancer folder in the home directory as the working directory
    mkdir ~/wes_cancer
    cd ~/wes_cancer
    ## Create three folders in ~/wes_cancer: biosoft, project, data
    ## biosoft holds software installation packages
    ## project holds files produced during the analysis
    ## data holds database files
    mkdir biosoft project data
    cd project
    ## In project, create one folder per analysis step
    mkdir -p 0.sra 1.raw_fq 2.clean_fq 3.qc/{raw_qc,clean_qc} 4.align/{qualimap,flagstat,stats} 5.gatk/gvcf 6.mutect 7.annotation/{vep,annovar,funcotator,snpeff} 8.cnv/{gatk,cnvkit,gistic,facet} 9.pyclone 10.signature
    

    For two important software-installation issues, see my earlier notes on downloading and installing GATK, and on ANNOVAR | annotation.

    Getting familiar with the reference genome and required databases

    The gatk_hg38.* index files are generated with BWA; building the index and downloading the many files it depends on from the /bundle/hg38/ directory takes quite a while:

    cd ~/wes_cancer/data/
    wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz
    

    Other files needed by GATK

    Since there are quite a few, you can submit all the download commands to the background with nohup ... & (if the network is flaky, downloads may fail, so check the files after downloading and before using them; a quick check is sketched after the download block).

    ## gatk
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.fai & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.dict & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz &
    
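    A quick way to sanity-check the compressed downloads before using them (a minimal sketch; a failed file can simply be re-fetched with wget -c):

    for f in *.gz
    do
        gzip -t "$f" && echo "$f OK" || echo "$f looks truncated or corrupted"
    done
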

    Other database files:

    ## bed
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt
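    ## parse the "Public" CCDS entries into per-exon lines (chrom, start, end, gene), add a "chr" prefix, and keep only intervals with end > start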
    cat CCDS.current.txt | grep  "Public" | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{if($3>$2) print "chr"$0"\t0\t+"}'  > hg38.exon.bed
    

    Commands used to build the BWA index:

    conda activate wes
    cd ~/wes_cancer/data
    gunzip Homo_sapiens_assembly38.fasta.gz
    time bwa index -a bwtsw -p gatk_hg38 ~/wes_cancer/data/Homo_sapiens_assembly38.fasta
    cd ~/wes_cancer/project
    

    The index-building run looked like this:

    (base) root@1100150:~# conda activate wes
    (wes) root@1100150:~# cd ~/wes_cancer/data
    (wes) root@1100150:~/wes_cancer/data# time bwa index -a bwtsw -p gatk_hg38 ~/wes_cancer/data/Homo_sapiens_assembly38.fasta
    [bwa_index] Pack FASTA... 52.77 sec
    [bwa_index] Construct BWT for the packed sequence...
    [BWTIncCreate] textLength=6434693834, availableWord=464768632
    [BWTIncConstructFromPacked] 10 iterations done. 99999994 characters processed.
    [BWTIncConstructFromPacked] 20 iterations done. 199999994 characters processed.
    [BWTIncConstructFromPacked] 30 iterations done. 299999994 characters processed.
    [BWTIncConstructFromPacked] 40 iterations done. 399999994 characters processed.
    ...
    [bwt_gen] Finished constructing BWT in 711 iterations.
    [bwa_index] 5822.96 seconds elapse.
    [bwa_index] Update BWT... 33.52 sec
    [bwa_index] Pack forward-only FASTA... 39.86 sec
    [bwa_index] Construct SA from BWT and Occ... 1753.57 sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa index -a bwtsw -p gatk_hg38 /root/wes_cancer/data/Homo_sapiens_assembly38.fasta
    [main] Real time: 7785.516 sec; CPU: 7702.678 sec
    
    real    129m45.519s
    user    126m56.258s
    sys 1m26.421s
    

    The index and bundle directories after completion (roughly 100 GB of database files).
    Because my workplace network blocks the remote server's IP, I could not connect for a while, so the listing below is taken from Dr. Zeng's tutorial.

    /public/biosoft/GATK/resources/bundle/hg38/bwa_index/
    |-- [ 20K]  gatk_hg38.amb
    |-- [445K]  gatk_hg38.ann
    |-- [3.0G]  gatk_hg38.bwt
    |-- [767M]  gatk_hg38.pac
    |-- [1.5G]  gatk_hg38.sa
    |-- [6.2K]  hg38.bwa_index.log 
    `-- [ 566]  run.sh
    
    /public/biosoft/GATK/resources/bundle/hg38/
    |-- [1.8G]  1000G_phase1.snps.high_confidence.hg38.vcf.gz
    |-- [2.0M]  1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
    |-- [3.2G]  dbsnp_146.hg38.vcf.gz
    |-- [3.0M]  dbsnp_146.hg38.vcf.gz.tbi
    |-- [ 59M]  hapmap_3.3.hg38.vcf.gz
    |-- [1.5M]  hapmap_3.3.hg38.vcf.gz.tbi
    |-- [568K]  Homo_sapiens_assembly38.dict
    |-- [3.0G]  Homo_sapiens_assembly38.fasta
    |-- [157K]  Homo_sapiens_assembly38.fasta.fai
    |-- [ 20M]  Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    |-- [1.4M]  Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
    
    

    After connecting to the server over mobile data, here are the index and database files I downloaded myself (they are not split into two directories, and hapmap_3.3.hg38.vcf.gz is missing, which may affect the downstream analysis):

    (base) root@1100150:~# cd wes_cancer/data/
    (base) root@1100150:~/wes_cancer/data# ll
    total 26715496
    drwxr-xr-x 2 root root          20 Jun 19 02:15 ./
    drwxr-xr-x 5 root root           5 Jun 18 08:37 ../
    -rw-r--r-- 1 root root  1888262073 Jun 18 12:28 1000G_phase1.snps.high_confidence.hg38.vcf.gz
    -rw-r--r-- 1 root root     2128536 Jun 18 12:25 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
    -rw-r--r-- 1 root root     9839781 Jun 18 12:26 CCDS.current.txt
    -rw-r--r-- 1 root root      581712 Jun 18 12:25 Homo_sapiens_assembly38.dict
    -rw-r--r-- 1 root root  3249912778 Jun 18 12:25 Homo_sapiens_assembly38.fasta
    -rw-r--r-- 1 root root      160928 Jun 18 12:25 Homo_sapiens_assembly38.fasta.fai
    -rw-r--r-- 1 root root    20685880 Jun 18 12:25 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    -rw-r--r-- 1 root root     1500013 Jun 18 12:25 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
    -rw-r--r-- 1 root root  3411143311 Jun 18 12:29 dbsnp_146.hg38.vcf.gz
    -rw-r--r-- 1 root root     2466606 Jun 18 12:25 dbsnp_146.hg38.vcf.gz.tbi
    -rw-r--r-- 1 root root 15381038028 Jun 18 12:33 funcotator_dataSources.v1.6.20190124s.tar.gz
    -rw-r--r-- 1 root root       20199 Jun 19 01:46 gatk_hg38.amb
    -rw-r--r-- 1 root root      455474 Jun 19 01:46 gatk_hg38.ann
    -rw-r--r-- 1 root root  3217347004 Jun 19 01:45 gatk_hg38.bwt
    -rw-r--r-- 1 root root   804336731 Jun 19 01:46 gatk_hg38.pac
    -rw-r--r-- 1 root root  1608673512 Jun 19 02:15 gatk_hg38.sa
    -rw-r--r-- 1 root root     6813165 Jun 18 12:27 hg38.exon.bed
    -rw------- 1 root root    32237249 Jun 18 12:33 nohup.out
    
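    The missing HapMap file can be fetched from the same bundle directory, following the same pattern as the other downloads above:

    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/hapmap_3.3.hg38.vcf.gz &
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/hapmap_3.3.hg38.vcf.gz.tbi &
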

    Step 1: QC

    Since my example data is already clean.fq, I skipped this step.
    It consists of checking sequencing quality with fastqc and multiqc, then filtering low-quality reads and trimming adapters with trim_galore.

    mkdir ~/project/boy
    wkd=/home/jmzeng/project/boy
    cd $wkd
    mkdir {raw,clean,qc,align,mutation}
    cd qc
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' | xargs fastqc -t 10 -o ./
    
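    multiqc is mentioned above but not shown; aggregating the fastqc reports afterwards is a one-liner (a sketch, run in the directory holding the fastqc output):

    multiqc ./ -o ./
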

    If the quality turns out to be poor, filter:

    ### step3: filter the bad quality reads and remove adaptors. 
    mkdir $wkd/clean 
    cd $wkd/clean
    
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' | grep 1.fastq.gz > 1
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' | grep 2.fastq.gz > 2
    paste 1 2 > config
    ### Create a file qc.sh and write the following content into it:
    source activate wes
    
    bin_trim_galore=trim_galore
    dir=$wkd/clean
    cat config  |while read id
    do
            arr=(${id})
            fq1=${arr[0]}
            fq2=${arr[1]} 
            echo  $dir  $fq1 $fq2 
    nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir  $fq1 $fq2 & 
    done 
    
    source deactivate
    
    
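    The script can then simply be launched in the background, e.g.:

    export wkd=/home/jmzeng/project/boy   ## qc.sh refers to $wkd, so export it (or hard-code dir inside the script)
    nohup bash qc.sh > qc.sh.log 2>&1 &
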

    Align the sequencing data once read quality is acceptable

    Run the test data first

    ## First extract small fq files for testing
    source activate wes
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' > fq.txt
    cat fq.txt |while read id ;do (zcat $id|head -10000 > $(basename $id ".gz"));done
    ## Then align the small fq files one by one
    sample='7E5239'
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38 7E5239.L1_1.fastq 7E5239.L1_2.fastq  | samtools sort -@ 5 -o 7E5239.bam -
    sample='7E5240'
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38  7E5240_L1_A001.L1_1.fastq 7E5240_L1_A001.L1_2.fastq  | samtools sort -@ 5 -o 7E5240.bam -
    sample='7E5241'
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38 7E5241.L1_1.fastq 7E5241.L1_2.fastq   | samtools sort -@ 5 -o 7E5241.bam -
    ## Or do the alignment in a loop over a config file:
    # 7E5239    7E5239.L1_1.fastq   7E5239.L1_2.fastq
    # 7E5240    7E5240_L1_A001.L1_1.fastq   7E5240_L1_A001.L1_2.fastq
    # 7E5241    7E5241.L1_1.fastq   7E5241.L1_2.fastq
    INDEX=/public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
    cat config |while read id
    do
    
    arr=($id)
    fq1=${arr[1]}
    fq2=${arr[2]}
    sample=${arr[0]}
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2  | samtools sort -@ 5 -o $sample.bam - 
    
    done 
    
    

    Following this example, I extracted small fq files and aligned them into a small bam:

    (base) root@1100150:~/wes_cancer/project# source activate wes
    (wes) root@1100150:~/wes_cancer/project# find ~/wes_cancer/project/2.clean_fq/ -name *gz
    /root/wes_cancer/project/2.clean_fq/9Y1640.R1.clean.fastq.gz
    /root/wes_cancer/project/2.clean_fq/9Y1640.R2.clean.fastq.gz
    (wes) root@1100150:~/wes_cancer/project# find ~/wes_cancer/project/2.clean_fq/ -name *gz > fq.txt
    (wes) root@1100150:~/wes_cancer/project# cat fq.txt |while read id ; do (zcat $id|head -10000 > $(basename $id ".gz")); done
    (wes) root@1100150:~/wes_cancer/project# cat fq.txt
    /root/wes_cancer/project/2.clean_fq/9Y1640.R1.clean.fastq.gz
    /root/wes_cancer/project/2.clean_fq/9Y1640.R2.clean.fastq.gz
    (wes) root@1100150:~/wes_cancer/project# ll
    total 791
    drwxr-xr-x 13 root root     16 Jun 22 06:59 ./
    drwxr-xr-x  5 root root      5 Jun 18 08:37 ../
    drwxr-xr-x  2 root root      2 Jun 18 09:02 0.sra/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 1.raw_fq/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 10.signature/
    drwxr-xr-x  2 root root      4 Jun 21 06:48 2.clean_fq/
    drwxr-xr-x  4 root root      4 Jun 18 09:02 3.qc/
    drwxr-xr-x  5 root root      6 Jun 21 06:59 4.align/
    drwxr-xr-x  3 root root      3 Jun 18 09:02 5.gatk/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 6.mutect/
    drwxr-xr-x  6 root root      6 Jun 18 09:02 7.annotation/
    drwxr-xr-x  6 root root      6 Jun 18 09:02 8.cnv/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 9.pyclone/
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R1.clean.fastq
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R2.clean.fastq
    -rw-r--r--  1 root root    122 Jun 22 06:57 fq.txt
    (wes) root@1100150:~/wes_cancer/project# sample='9Y1640'
    (wes) root@1100150:~/wes_cancer/project# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq  | samtools sort -@ 5 -o 7E5239.bam -
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [M::process] read 5000 sequences (708078 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 2045, 0, 0)
    [M::mem_pestat] skip orientation FF as there are not enough pairs
    [M::mem_pestat] analyzing insert size distribution for orientation FR...
    [M::mem_pestat] (25, 50, 75) percentile: (142, 182, 238)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 430)
    [M::mem_pestat] mean and std.dev: (193.44, 69.83)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 526)
    [M::mem_pestat] skip orientation RF as there are not enough pairs
    [M::mem_pestat] skip orientation RR as there are not enough pairs
    [M::mem_process_seqs] Processed 5000 reads in 1.654 CPU sec, 0.360 real sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa mem -t 5 -R @RG\tID:9Y1640\tSM:9Y1640\tLB:WGS\tPL:Illumina /root/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq
    [main] Real time: 31.238 sec; CPU: 14.275 sec
    [bam_sort_core] merging from 0 files and 5 in-memory blocks...
    (wes) root@1100150:~/wes_cancer/project# ll
    total 1250
    drwxr-xr-x 13 root root     17 Jun 22 07:10 ./
    drwxr-xr-x  5 root root      5 Jun 18 08:37 ../
    drwxr-xr-x  2 root root      2 Jun 18 09:02 0.sra/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 1.raw_fq/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 10.signature/
    drwxr-xr-x  2 root root      4 Jun 21 06:48 2.clean_fq/
    drwxr-xr-x  4 root root      4 Jun 18 09:02 3.qc/
    drwxr-xr-x  5 root root      6 Jun 21 06:59 4.align/
    drwxr-xr-x  3 root root      3 Jun 18 09:02 5.gatk/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 6.mutect/
    drwxr-xr-x  6 root root      6 Jun 18 09:02 7.annotation/
    -rw-r--r--  1 root root 451049 Jun 22 07:10 7E5239.bam
    drwxr-xr-x  6 root root      6 Jun 18 09:02 8.cnv/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 9.pyclone/
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R1.clean.fastq
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R2.clean.fastq
    -rw-r--r--  1 root root    122 Jun 22 06:57 fq.txt
    

    The files dated Jun 22 are the results of following the example, which shows the process works (when copying Dr. Zeng's code I forgot to change some names, e.g. 7E5239.bam; I will fix that when running the full data).

    If there are many samples, write a loop; then run the real data:

    ls /home/jmzeng/project/boy/clean/*1.fq.gz >1
    ls /home/jmzeng/project/boy/clean/*2.fq.gz >2
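    # note: "-f 7" assumes exactly this path depth; for other paths, adjust the field number or derive the sample name with basename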
    cut -d"/" -f 7 1 |cut -d"_" -f 1  > 0
    paste 0 1 2 > config 
    source activate wes
    INDEX=/public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
    cat config |while read id
    do
    
    arr=($id)
    fq1=${arr[1]}
    fq2=${arr[2]}
    sample=${arr[0]}
    echo $sample $fq1 $fq2 
     bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2  | samtools sort -@ 5 -o $sample.bam -   
    
    done 
    
    

    Since I only have one sample, I just did it the same way as with the test data:

    (wes) root@1100150:~# cd wes_cancer/project/2.clean_fq/
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# ll
    total 5561163
    drwxr-xr-x  2 root root          4 Jun 21 06:48 ./
    drwxr-xr-x 13 root root         14 Jun 22 07:13 ../
    -rw-r--r--  1 root root 2897369062 Jun 21 06:48 9Y1640.R1.clean.fastq.gz
    -rw-r--r--  1 root root 2798611010 Jun 21 06:50 9Y1640.R2.clean.fastq.gz
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# sample='9Y1640'
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq  | samtools sort -@ 5 -o 9Y1640.bam -
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [E::main_mem] fail to open file `9Y1640.R1.clean.fastq'.
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# sample='9Y1640'
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq.gz 9Y1640.R2.clean.fastq.gz  | samtools sort -@ 5 -o 9Y1640.bam -
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [M::process] read 354436 sequences (50000086 bp)...
    [M::process] read 353998 sequences (50000030 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (113, 144879, 0, 173)
    [M::mem_pestat] analyzing insert size distribution for orientation FF...
    [M::mem_pestat] (25, 50, 75) percentile: (128, 187, 267)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 545)
    [M::mem_pestat] mean and std.dev: (197.98, 92.21)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 684)
    ...
    

    Because WES data is large, even a single sample takes a long time to run, so I did not record the whole process (it really was slow; after an hour there was still no sign of finishing, so next time a command like this should go to the background, see the sketch after the output below).
    The output after it finished:

    [M::mem_process_seqs] Processed 63974 reads in 26.351 CPU sec, 5.102 real sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa mem -t 5 -R @RG\tID:9Y1640\tSM:9Y1640\tLB:WGS\tPL:Illumina /root/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq.gz 9Y1640.R2.clean.fastq.gz
    [main] Real time: 5758.605 sec; CPU: 28791.178 sec
    [bam_sort_core] merging from 30 files and 5 in-memory blocks...
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# ll
    total 8745086
    drwxr-xr-x  2 root root          5 Jun 22 10:13 ./
    drwxr-xr-x 13 root root         14 Jun 22 07:13 ../
    -rw-r--r--  1 root root 2897369062 Jun 21 06:48 9Y1640.R1.clean.fastq.gz
    -rw-r--r--  1 root root 2798611010 Jun 21 06:50 9Y1640.R2.clean.fastq.gz
    -rw-r--r--  1 root root 3261116989 Jun 22 10:13 9Y1640.bam
    
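    For next time, a minimal sketch of running the long alignment in the background with a log (the script and log names are just examples):

    cat > run_bwa.sh << 'EOF'
    sample=9Y1640
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" ~/wes_cancer/data/gatk_hg38 \
        ${sample}.R1.clean.fastq.gz ${sample}.R2.clean.fastq.gz | samtools sort -@ 5 -o ${sample}.bam -
    EOF
    nohup bash run_bwa.sh > run_bwa.log 2>&1 &
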

    The simplest variant-calling pipeline

    To understand what the parameters mean, see Dr. Zeng's post 【直播】我的基因组25:用bcftools来call variation.

    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    source activate wes
    time samtools mpileup -ugf $ref  *.bam | bcftools call -vmO z -o out.vcf.gz
    ls *.bam |xargs -i samtools index {}
    ## PCR duplicates have not been removed
    
    
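    Note that in recent samtools releases the mpileup-to-BCF step has moved to bcftools, so the equivalent modern form of the command above is:

    bcftools mpileup -Ou -f $ref *.bam | bcftools call -vmO z -o out.vcf.gz
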

    Removing PCR duplicates

    Understand the parameters and why tutorials become outdated (bioinformatics moves fast; the point is to understand the logic of the code so you can reuse it).

    # samtools markdup -r 7E5241.bam 7E5241.rm.bam   ## -r: remove duplicate reads
    # samtools markdup -S 7E5241.bam 7E5241.mk.bam   ## -S: also flag supplementary alignments of duplicates; without -r, duplicates are only marked, not removed
    
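    For reference, the usual samtools markdup workflow needs name-sorted input run through fixmate -m and then coordinate sorting; a minimal sketch using the file names from the example above:

    samtools sort -n -@ 4 -o 7E5241.namesort.bam 7E5241.bam
    samtools fixmate -m 7E5241.namesort.bam 7E5241.fixmate.bam
    samtools sort -@ 4 -o 7E5241.positionsort.bam 7E5241.fixmate.bam
    samtools markdup -r 7E5241.positionsort.bam 7E5241.rm.bam
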

    The full GATK pipeline

    source activate wes
    GATK=/home/jmzeng/biosoft/gatk4/gatk-4.0.6.0/gatk
    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    snp=/public/biosoft/GATK/resources/bundle/hg38/dbsnp_146.hg38.vcf.gz
    indel=/public/biosoft/GATK/resources/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz  
    
    for sample in {7E5239.L1,7E5240,7E5241.L1}
    do 
    echo $sample  
    # Elapsed time: 7.91 minutes
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates \
        -I $sample.bam \
        -O ${sample}_marked.bam \
        -M $sample.metrics \
        1>${sample}_log.mark 2>&1 
    
    ## Elapsed time: 13.61 minutes
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation \
        -I ${sample}_marked.bam \
        -O ${sample}_marked_fixed.bam \
        -SO coordinate \
        1>${sample}_log.fix 2>&1 
    
    samtools index ${sample}_marked_fixed.bam
    
    ##  17.2 minutes
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
        -R $ref  \
        -I ${sample}_marked_fixed.bam  \
        --known-sites $snp \
        --known-sites $indel \
        -O ${sample}_recal.table \
        1>${sample}_log.recal 2>&1 
    
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"   ApplyBQSR \
        -R $ref  \
        -I ${sample}_marked_fixed.bam  \
        -bqsr ${sample}_recal.table \
        -O ${sample}_bqsr.bam \
        1>${sample}_log.ApplyBQSR  2>&1 
    
    ## Call variants with GATK HaplotypeCaller
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
         -R $ref  \
         -I ${sample}_bqsr.bam \
          --dbsnp $snp \
          -O ${sample}_raw.vcf \
          1>${sample}_log.HC 2>&1  
    
    done 
    
    

    Checking alignment and variant calls in a gene region of interest

    Visualize in IGV to deepen your grasp of the pipeline.

    chr17   HAVANA  gene    43044295        43170245
    3.5G Jul 21 18:01 7E5240.bam
    7.1G Jul 21 21:40 7E5240_bqsr.bam
    4.7G Jul 21 20:28 7E5240_marked.bam
    4.8G Jul 21 20:44 7E5240_marked_fixed.bam
    
    

    Pull the reads covering the BRCA1 gene out of these bam files:

    samtools  view -h  7E5240.bam chr17:43044295-43170245 |samtools sort -o  7E5240.brca1.bam -
    samtools  view -h  7E5240_bqsr.bam chr17:43044295-43170245 |samtools sort -o  7E5240_bqsr.brca1.bam -
    samtools  view -h  7E5240_marked.bam chr17:43044295-43170245 |samtools sort -o  7E5240_marked.brca1.bam -
    samtools  view -h  7E5240_marked_fixed.bam chr17:43044295-43170245 |samtools sort -o  7E5240_marked_fixed.brca1.bam -
    ls  *brca1.bam|xargs -i samtools index {}
    
    

    With these gene-region bam files, you can call variants just for that gene:

    source activate wes
    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    samtools mpileup -ugf $ref   7E5240_bqsr.brca1.bam   | bcftools call -vmO z -o 7E5240_bqsr.vcf.gz
    
    

    Run all samples through the samtools mpileup + bcftools call pipeline

    Again, the reference is the live-genome series post 【直播】我的基因组25:用bcftools来call variation.

    source activate wes
    wkd=/home/jmzeng/project/boy
    cd $wkd/align 
    ls  *_bqsr.bam  |xargs -i samtools index {}
    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    nohup samtools mpileup -ugf $ref   *_bqsr.bam | bcftools call -vmO z -o all_bqsr.vcf.gz & 
    

    QC of the alignment and variant-calling results

    Both raw and clean fastq files should be checked with fastqc.

    The bam files from each alignment stage can also be QC'd.

    Use qualimap to summarize sequencing depth and coverage of the WES bam files.

    source activate wes
    wkd=/home/jmzeng/project/boy
    cd $wkd/clean
    ls *.gz |xargs fastqc -t 10 -o ./
    
    cd $wkd/align
    rm *_marked*.bam
    ls  *.bam  |xargs -i samtools index {} 
    ls  *.bam  | while read id ;do (samtools flagstat $id > $(basename $id ".bam").stat);done
    
    conda install bedtools
    cat /public/annotation/CCDS/human/CCDS.20160908.txt  |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{print "chr"$0"\t0\t+"}'  > $wkd/align/hg38.exon.bed 
    
    exon_bed=hg38.exon.bed 
    ls  *_bqsr.bam | while read id;
    do
    sample=${id%%.*}
    echo $sample
    qualimap bamqc --java-mem-size=20G -gff $exon_bed -bam $id  & 
    done 
    
    ### featureCounts 
    gtf=/public/reference/gtf/gencode/gencode.v25.annotation.gtf.gz
    featureCounts -T 5 -p -t exon -g gene_id    \
    -a  $gtf   *_bqsr.bam -o  all.id.txt  1>counts.id.log 2>&1 & 
    
    

    Comparing the output of the two variant callers at the same site (the first record below is the multi-sample bcftools call result; the three records after it are the per-sample GATK HaplotypeCaller calls):

    chr1    139213  .       A       G       388     .   
    DP=984;VDB=0.831898;SGB=-223.781;RPB=0.599582;MQB=0.0001971;MQSB=0.00457372;BQB=2.59396e-09;MQ0F=0.281504;ICB=0.333333;HOB=0.5;AC=3;AN=6;DP4=371,169,160,84;MQ=21
    GT:PL   0/1:198,0,255   0/1:176,0,255   0/1:50,0,158
    
    
    chr1    139213  rs370723703 A   G   3945.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.999;ClippingRankSum=0.000;DB;DP=244;ExcessHet=3.0103;FS=2.256;MLEAC=1;MLEAF=0.500;MQ=29.33;MQRankSum=-0.929;QD=16.17;ReadPosRankSum=1.462;SOR=0.863  GT:AD:DP:GQ:PL  0/1:136,108:244:99:3974,0,6459
    chr1    139213  rs370723703 A   G   2261.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.191;ClippingRankSum=0.000;DB;DP=192;ExcessHet=3.0103;FS=9.094;MLEAC=1;MLEAF=0.500;MQ=32.03;MQRankSum=-0.533;QD=11.78;ReadPosRankSum=0.916;SOR=0.321  GT:AD:DP:GQ:PL  0/1:126,66:192:99:2290,0,7128
    chr1    139213  rs370723703 A   G   2445.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.495;ClippingRankSum=0.000;DB;DP=223;ExcessHet=3.0103;FS=10.346;MLEAC=1;MLEAF=0.500;MQ=30.18;MQRankSum=0.486;QD=10.97;ReadPosRankSum=-0.808;SOR=0.300 GT:AD:DP:GQ:PL  0/1:152,71:223:99:2474,0,7901
    
    

    Downstream analysis of the VCF

    Mainly: annotation and filtering

    Annotation

    VEP, snpEff, ANNOVAR

    1. Annovar usage notes (http://www.bio-info-trainee.com/641.html)

    2. Annotating SNPs with annovar (http://www.bio-info-trainee.com/441.html)

    3. Calling variants for genes of interest (http://www.bio-info-trainee.com/2013.html)

    4. WES (part 6): annotating with annovar (http://www.bio-info-trainee.com/1158.html)

    cd ~/biosoft/
    ln -s /public/biosoft/ANNOVAR/ ANNOVAR
    source activate wes
    wkd=/home/jmzeng/project/boy
    cd $wkd/mutation
    mv ../align/*.vcf ./

    ~/biosoft/ANNOVAR/annovar/convert2annovar.pl -format vcf4old 7E5240_raw.vcf > 7E5240.annovar
    
    ~/biosoft/ANNOVAR/annovar/annotate_variation.pl \
    -buildver hg38  \
    --outfile 7E5240.anno \
    7E5240.annovar \
    ~/biosoft/ANNOVAR/annovar/humandb/
    
    

    Strictly speaking the VCF file itself is not required; all the software needs is a chromosome plus coordinates, and for our VCF-format variant files the software usually does some reformatting before annotating. ANNOVAR's annotation comes in three main flavors: gene-based, region-based, and filter-based.

    To annotate against multiple databases with annovar in one go:

    dir=/home/jianmingzeng/biosoft/ANNOVAR/annovar
    db=$dir/humandb/ 
    ls $db
    
    wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2018/clinvar_20180603.vcf.gz 
    
    perl $dir/annotate_variation.pl --downdb --webfrom annovar --buildver hg38 clinvar_20180603 $db
    # perl $bin  --downdb --webfrom annovar --buildver hg38 gnomad_genome $db
    mkdir annovar_results 
     $dir/convert2annovar.pl -format vcf4old highQ.vcf  1> highQ.avinput  2>/dev/null 
    perl $dir/annotate_variation.pl  -buildver hg38 -filter -dbtype clinvar_20180603  --outfile annovar_results/highQ_clinvar  highQ.avinput  $db
    
    perl $dir/table_annovar.pl \
    -buildver hg38 \
    highQ.avinput $db \
    -out test \
    -remove \
    -protocol refGene,clinvar_20180603 \
    -operation g,f \
    -nastring NA
    ## note: clinvar is a filter-based database, so its -operation is "f", and the
    ## protocol name should match the version actually downloaded above (clinvar_20180603)
    
    
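    With -out test, the merged annotation lands in test.hg38_multianno.txt (table_annovar names its output <prefix>.<buildver>_multianno.txt).
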

    Additional exercises

    Use the Variant Effect Predictor to annotate all the variants. Filter out SNPs already known in the dbSNP and 1000 Genomes Project databases.

    Use the OMIM database (http://omim.org/) to look up protein structure and function. Use SIFT, PolyPhen-2, and PROVEAN to predict how strongly each SNV affects protein function; only when all three tools predict a large functional impact is the variant considered highly deleterious. Use PROVEAN to predict the impact of indels on protein function.

    In fact the dbNSFP database already annotates the effect of these variants on protein function (see the annovar sketch below).
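
    A hedged sketch of pulling dbNSFP scores in through annovar, reusing $dir, $db, and highQ.avinput from above (the release name dbnsfp35a is my assumption; check the ANNOVAR download page for the current hg38 build):

    perl $dir/annotate_variation.pl --downdb --webfrom annovar --buildver hg38 dbnsfp35a $db
    perl $dir/table_annovar.pl highQ.avinput $db -buildver hg38 -out test_dbnsfp \
        -remove -protocol refGene,dbnsfp35a -operation g,f -nastring NA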

    Filtering variant sites

    Use GATK joint calling; for filtering, see: https://mp.weixin.qq.com/s/W8Vfv1WmW6M7U0tIcPtlng

    Note that a lot of material goes out of date

    For example, you can still find GATK3 code (http://baserecalibrator1.rssing.com/chan-10751514/all_p13.html), but it can be abandoned now; in other words, tutorials are often outdated.

    GVCF tutorial (see the sketch below)
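
    The gist of the GVCF route, as a hedged sketch (standard GATK4 tools; the per-sample file names are placeholders, and $GATK/$ref are the variables defined in the pipeline above):

    ## per sample: emit a GVCF instead of a plain VCF
    $GATK HaplotypeCaller -R $ref -I ${sample}_bqsr.bam -ERC GVCF -O ${sample}.g.vcf.gz
    ## merge the per-sample GVCFs, then genotype them jointly
    $GATK CombineGVCFs -R $ref -V sampleA.g.vcf.gz -V sampleB.g.vcf.gz -O combined.g.vcf.gz
    $GATK GenotypeGVCFs -R $ref -V combined.g.vcf.gz -O joint.vcf.gz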
