Learning the WES Data Analysis Pipeline

Author: 程凉皮儿 | Published 2020-06-22 17:51

    Reference materials:
    notes
    video tutorial
    knowledge base

    First, install the software

    First install conda, using the Tsinghua (TUNA) mirror; instructions: https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/
    Then download and install Miniconda from: https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/
    Within mainland China you need to switch the channel configuration to a mirror (I rented a server at a US HPC center, so I did not need to set a mirror)
    Before downloading and installing a tool, first check whether it is already available on Bioconda: https://bioconda.github.io/recipes.html
    For the mirror/channel configuration, see 卖萌哥's continuously updated conda tutorial

    Then you can use conda to install the series of tools the pipeline needs.
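
    A rough sketch of what that looks like (the TUNA channel URLs come from the help page linked above; the exact package list is my assumption, based on the tools used later in this post):

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
    conda config --set show_channel_urls yes
    ## create and populate the "wes" environment used throughout this post
    conda create -n wes -y
    conda activate wes
    conda install -y fastqc multiqc trim-galore bwa samtools bcftools bedtools qualimap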

    First, set up the analysis environment

    To keep things organized, first create folders to hold the software packages, the database files, and the results produced during the analysis.

    ## First create a wes_cancer folder in the home directory as the working directory
    mkdir ~/wes_cancer
    cd ~/wes_cancer
    ## Create three folders in ~/wes_cancer: biosoft, project, data
    ## biosoft holds software installation packages
    ## project holds files produced during the analysis
    ## data holds database files
    mkdir biosoft project data
    cd project
    ## In project, create one folder per analysis step
    mkdir -p 0.sra 1.raw_fq 2.clean_fq 3.qc/{raw_qc,clean_qc} 4.align/{qualimap,flagstat,stats} 5.gatk/gvcf 6.mutect 7.annotation/{vep,annovar,funcotator,snpeff} 8.cnv/{gatk,cnvkit,gistic,facet} 9.pyclone 10.signature
    

    For two important software-installation issues, see my earlier notes on downloading and installing GATK, and on ANNOVAR | annotation.

    Getting familiar with the reference genome and required databases

    The gatk_hg38.* index files are generated with BWA; building the index and downloading the many files it depends on from the /bundle/hg38/ directory takes quite a while:

    cd ~/wes_cancer/data/
    wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz
    

    Other files needed by GATK

    Since there are quite a few, you can submit all the download commands to the background with nohup ... & (if the network is flaky, downloads may fail, so check the files after downloading and before using them; a quick check is sketched after the download block).

    ## gatk
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.fai & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.dict & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi & 
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/funcotator_dataSources.v1.6.20190124s.tar.gz &
    
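    A quick way to sanity-check the compressed downloads before using them (a minimal sketch; a failed file can simply be re-fetched with wget -c):

    for f in *.gz
    do
        gzip -t "$f" && echo "$f OK" || echo "$f looks truncated or corrupted"
    done
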

    Other database files:

    ## bed
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_human/CCDS.current.txt
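    ## parse the "Public" CCDS entries into per-exon lines (chrom, start, end, gene), add a "chr" prefix, and keep only intervals with end > start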
    cat CCDS.current.txt | grep  "Public" | perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{if($3>$2) print "chr"$0"\t0\t+"}'  > hg38.exon.bed
    

    Commands used to build the BWA index:

    conda activate wes
    cd ~/wes_cancer/data
    gunzip Homo_sapiens_assembly38.fasta.gz
    time bwa index -a bwtsw -p gatk_hg38 ~/wes_cancer/data/Homo_sapiens_assembly38.fasta
    cd ~/wes_cancer/project
    

    The index-building run looked like this:

    (base) root@1100150:~# conda activate wes
    (wes) root@1100150:~# cd ~/wes_cancer/data
    (wes) root@1100150:~/wes_cancer/data# time bwa index -a bwtsw -p gatk_hg38 ~/wes_cancer/data/Homo_sapiens_assembly38.fasta
    [bwa_index] Pack FASTA... 52.77 sec
    [bwa_index] Construct BWT for the packed sequence...
    [BWTIncCreate] textLength=6434693834, availableWord=464768632
    [BWTIncConstructFromPacked] 10 iterations done. 99999994 characters processed.
    [BWTIncConstructFromPacked] 20 iterations done. 199999994 characters processed.
    [BWTIncConstructFromPacked] 30 iterations done. 299999994 characters processed.
    [BWTIncConstructFromPacked] 40 iterations done. 399999994 characters processed.
    ...
    [bwt_gen] Finished constructing BWT in 711 iterations.
    [bwa_index] 5822.96 seconds elapse.
    [bwa_index] Update BWT... 33.52 sec
    [bwa_index] Pack forward-only FASTA... 39.86 sec
    [bwa_index] Construct SA from BWT and Occ... 1753.57 sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa index -a bwtsw -p gatk_hg38 /root/wes_cancer/data/Homo_sapiens_assembly38.fasta
    [main] Real time: 7785.516 sec; CPU: 7702.678 sec
    
    real    129m45.519s
    user    126m56.258s
    sys 1m26.421s
    

    The index and bundle directories after completion (roughly 100 GB of database files).
    Because my workplace network blocks the remote server's IP, I could not connect for a while, so the listing below is taken from Dr. Zeng's tutorial.

    /public/biosoft/GATK/resources/bundle/hg38/bwa_index/
    |-- [ 20K]  gatk_hg38.amb
    |-- [445K]  gatk_hg38.ann
    |-- [3.0G]  gatk_hg38.bwt
    |-- [767M]  gatk_hg38.pac
    |-- [1.5G]  gatk_hg38.sa
    |-- [6.2K]  hg38.bwa_index.log 
    `-- [ 566]  run.sh
    
    /public/biosoft/GATK/resources/bundle/hg38/
    |-- [1.8G]  1000G_phase1.snps.high_confidence.hg38.vcf.gz
    |-- [2.0M]  1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
    |-- [3.2G]  dbsnp_146.hg38.vcf.gz
    |-- [3.0M]  dbsnp_146.hg38.vcf.gz.tbi
    |-- [ 59M]  hapmap_3.3.hg38.vcf.gz
    |-- [1.5M]  hapmap_3.3.hg38.vcf.gz.tbi
    |-- [568K]  Homo_sapiens_assembly38.dict
    |-- [3.0G]  Homo_sapiens_assembly38.fasta
    |-- [157K]  Homo_sapiens_assembly38.fasta.fai
    |-- [ 20M]  Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    |-- [1.4M]  Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
    
    

    After connecting to the server over mobile data, here are the index and database files I downloaded myself (they are not split into two directories, and hapmap_3.3.hg38.vcf.gz is missing, which may affect the downstream analysis):

    (base) root@1100150:~# cd wes_cancer/data/
    (base) root@1100150:~/wes_cancer/data# ll
    total 26715496
    drwxr-xr-x 2 root root          20 Jun 19 02:15 ./
    drwxr-xr-x 5 root root           5 Jun 18 08:37 ../
    -rw-r--r-- 1 root root  1888262073 Jun 18 12:28 1000G_phase1.snps.high_confidence.hg38.vcf.gz
    -rw-r--r-- 1 root root     2128536 Jun 18 12:25 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
    -rw-r--r-- 1 root root     9839781 Jun 18 12:26 CCDS.current.txt
    -rw-r--r-- 1 root root      581712 Jun 18 12:25 Homo_sapiens_assembly38.dict
    -rw-r--r-- 1 root root  3249912778 Jun 18 12:25 Homo_sapiens_assembly38.fasta
    -rw-r--r-- 1 root root      160928 Jun 18 12:25 Homo_sapiens_assembly38.fasta.fai
    -rw-r--r-- 1 root root    20685880 Jun 18 12:25 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
    -rw-r--r-- 1 root root     1500013 Jun 18 12:25 Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
    -rw-r--r-- 1 root root  3411143311 Jun 18 12:29 dbsnp_146.hg38.vcf.gz
    -rw-r--r-- 1 root root     2466606 Jun 18 12:25 dbsnp_146.hg38.vcf.gz.tbi
    -rw-r--r-- 1 root root 15381038028 Jun 18 12:33 funcotator_dataSources.v1.6.20190124s.tar.gz
    -rw-r--r-- 1 root root       20199 Jun 19 01:46 gatk_hg38.amb
    -rw-r--r-- 1 root root      455474 Jun 19 01:46 gatk_hg38.ann
    -rw-r--r-- 1 root root  3217347004 Jun 19 01:45 gatk_hg38.bwt
    -rw-r--r-- 1 root root   804336731 Jun 19 01:46 gatk_hg38.pac
    -rw-r--r-- 1 root root  1608673512 Jun 19 02:15 gatk_hg38.sa
    -rw-r--r-- 1 root root     6813165 Jun 18 12:27 hg38.exon.bed
    -rw------- 1 root root    32237249 Jun 18 12:33 nohup.out
    
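    The missing HapMap file can be fetched from the same bundle directory, following the same pattern as the other downloads above:

    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/hapmap_3.3.hg38.vcf.gz &
    nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/hapmap_3.3.hg38.vcf.gz.tbi &
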

    Step 1: QC

    Since my example data is already clean.fq, I skipped this step.
    It consists of checking sequencing quality with fastqc and multiqc, then filtering low-quality reads and trimming adapters with trim_galore.

    mkdir ~/project/boy
    wkd=/home/jmzeng/project/boy
    cd $wkd
    mkdir {raw,clean,qc,align,mutation}
    cd qc
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' | xargs fastqc -t 10 -o ./
    
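    multiqc is mentioned above but not shown; aggregating the fastqc reports afterwards is a one-liner (a sketch, run in the directory holding the fastqc output):

    multiqc ./ -o ./
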

    If the quality turns out to be poor, filter:

    ### step3: filter the bad quality reads and remove adaptors. 
    mkdir $wkd/clean 
    cd $wkd/clean
    
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' | grep 1.fastq.gz > 1
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' | grep 2.fastq.gz > 2
    paste 1 2 > config
    ### Create a file qc.sh and write the following content into it:
    source activate wes
    
    bin_trim_galore=trim_galore
    dir=$wkd/clean
    cat config  |while read id
    do
            arr=(${id})
            fq1=${arr[0]}
            fq2=${arr[1]} 
            echo  $dir  $fq1 $fq2 
    nohup $bin_trim_galore -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o $dir  $fq1 $fq2 & 
    done 
    
    source deactivate
    
    
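    The script can then simply be launched in the background, e.g.:

    export wkd=/home/jmzeng/project/boy   ## qc.sh refers to $wkd, so export it (or hard-code dir inside the script)
    nohup bash qc.sh > qc.sh.log 2>&1 &
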

    Align the sequencing data once read quality is acceptable

    Run the test data first

    ## First extract small fq files for testing
    source activate wes
    find /public/project/clinical/beijing_boy -name "*gz" | grep -v '\._' > fq.txt
    cat fq.txt |while read id ;do (zcat $id|head -10000 > $(basename $id ".gz"));done
    ## Then align the small fq files one by one
    sample='7E5239'
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38 7E5239.L1_1.fastq 7E5239.L1_2.fastq  | samtools sort -@ 5 -o 7E5239.bam -
    sample='7E5240'
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38  7E5240_L1_A001.L1_1.fastq 7E5240_L1_A001.L1_2.fastq  | samtools sort -@ 5 -o 7E5240.bam -
    sample='7E5241'
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  /public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38 7E5241.L1_1.fastq 7E5241.L1_2.fastq   | samtools sort -@ 5 -o 7E5241.bam -
    ## Or do the alignment in a loop over a config file:
    # 7E5239    7E5239.L1_1.fastq   7E5239.L1_2.fastq
    # 7E5240    7E5240_L1_A001.L1_1.fastq   7E5240_L1_A001.L1_2.fastq
    # 7E5241    7E5241.L1_1.fastq   7E5241.L1_2.fastq
    INDEX=/public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
    cat config |while read id
    do
    
    arr=($id)
    fq1=${arr[1]}
    fq2=${arr[2]}
    sample=${arr[0]}
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2  | samtools sort -@ 5 -o $sample.bam - 
    
    done 
    
    

    Following this example, I extracted small fq files and aligned them into a small bam:

    (base) root@1100150:~/wes_cancer/project# source activate wes
    (wes) root@1100150:~/wes_cancer/project# find ~/wes_cancer/project/2.clean_fq/ -name *gz
    /root/wes_cancer/project/2.clean_fq/9Y1640.R1.clean.fastq.gz
    /root/wes_cancer/project/2.clean_fq/9Y1640.R2.clean.fastq.gz
    (wes) root@1100150:~/wes_cancer/project# find ~/wes_cancer/project/2.clean_fq/ -name *gz > fq.txt
    (wes) root@1100150:~/wes_cancer/project# cat fq.txt |while read id ; do (zcat $id|head -10000 > $(basename $id ".gz")); done
    (wes) root@1100150:~/wes_cancer/project# cat fq.txt
    /root/wes_cancer/project/2.clean_fq/9Y1640.R1.clean.fastq.gz
    /root/wes_cancer/project/2.clean_fq/9Y1640.R2.clean.fastq.gz
    (wes) root@1100150:~/wes_cancer/project# ll
    total 791
    drwxr-xr-x 13 root root     16 Jun 22 06:59 ./
    drwxr-xr-x  5 root root      5 Jun 18 08:37 ../
    drwxr-xr-x  2 root root      2 Jun 18 09:02 0.sra/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 1.raw_fq/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 10.signature/
    drwxr-xr-x  2 root root      4 Jun 21 06:48 2.clean_fq/
    drwxr-xr-x  4 root root      4 Jun 18 09:02 3.qc/
    drwxr-xr-x  5 root root      6 Jun 21 06:59 4.align/
    drwxr-xr-x  3 root root      3 Jun 18 09:02 5.gatk/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 6.mutect/
    drwxr-xr-x  6 root root      6 Jun 18 09:02 7.annotation/
    drwxr-xr-x  6 root root      6 Jun 18 09:02 8.cnv/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 9.pyclone/
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R1.clean.fastq
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R2.clean.fastq
    -rw-r--r--  1 root root    122 Jun 22 06:57 fq.txt
    (wes) root@1100150:~/wes_cancer/project# sample='9Y1640'
    (wes) root@1100150:~/wes_cancer/project# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq  | samtools sort -@ 5 -o 7E5239.bam -
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [M::process] read 5000 sequences (708078 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 2045, 0, 0)
    [M::mem_pestat] skip orientation FF as there are not enough pairs
    [M::mem_pestat] analyzing insert size distribution for orientation FR...
    [M::mem_pestat] (25, 50, 75) percentile: (142, 182, 238)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 430)
    [M::mem_pestat] mean and std.dev: (193.44, 69.83)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 526)
    [M::mem_pestat] skip orientation RF as there are not enough pairs
    [M::mem_pestat] skip orientation RR as there are not enough pairs
    [M::mem_process_seqs] Processed 5000 reads in 1.654 CPU sec, 0.360 real sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa mem -t 5 -R @RG\tID:9Y1640\tSM:9Y1640\tLB:WGS\tPL:Illumina /root/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq
    [main] Real time: 31.238 sec; CPU: 14.275 sec
    [bam_sort_core] merging from 0 files and 5 in-memory blocks...
    (wes) root@1100150:~/wes_cancer/project# ll
    total 1250
    drwxr-xr-x 13 root root     17 Jun 22 07:10 ./
    drwxr-xr-x  5 root root      5 Jun 18 08:37 ../
    drwxr-xr-x  2 root root      2 Jun 18 09:02 0.sra/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 1.raw_fq/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 10.signature/
    drwxr-xr-x  2 root root      4 Jun 21 06:48 2.clean_fq/
    drwxr-xr-x  4 root root      4 Jun 18 09:02 3.qc/
    drwxr-xr-x  5 root root      6 Jun 21 06:59 4.align/
    drwxr-xr-x  3 root root      3 Jun 18 09:02 5.gatk/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 6.mutect/
    drwxr-xr-x  6 root root      6 Jun 18 09:02 7.annotation/
    -rw-r--r--  1 root root 451049 Jun 22 07:10 7E5239.bam
    drwxr-xr-x  6 root root      6 Jun 18 09:02 8.cnv/
    drwxr-xr-x  2 root root      2 Jun 18 09:02 9.pyclone/
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R1.clean.fastq
    -rw-r--r--  1 root root 874855 Jun 22 07:05 9Y1640.R2.clean.fastq
    -rw-r--r--  1 root root    122 Jun 22 06:57 fq.txt
    

    The files dated Jun 22 are the results of following the example, which shows the process works (when copying Dr. Zeng's code I forgot to change some names, e.g. 7E5239.bam; I will fix that when running the full data).

    If there are many samples, write a loop; then run the real data:

    ls /home/jmzeng/project/boy/clean/*1.fq.gz >1
    ls /home/jmzeng/project/boy/clean/*2.fq.gz >2
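    # note: "-f 7" assumes exactly this path depth; for other paths, adjust the field number or derive the sample name with basename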
    cut -d"/" -f 7 1 |cut -d"_" -f 1  > 0
    paste 0 1 2 > config 
    source activate wes
    INDEX=/public/biosoft/GATK/resources/bundle/hg38/bwa_index/gatk_hg38
    cat config |while read id
    do
    
    arr=($id)
    fq1=${arr[1]}
    fq2=${arr[2]}
    sample=${arr[0]}
    echo $sample $fq1 $fq2 
     bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2  | samtools sort -@ 5 -o $sample.bam -   
    
    done 
    
    

    Since I only have one sample, I just did it the same way as with the test data:

    (wes) root@1100150:~# cd wes_cancer/project/2.clean_fq/
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# ll
    total 5561163
    drwxr-xr-x  2 root root          4 Jun 21 06:48 ./
    drwxr-xr-x 13 root root         14 Jun 22 07:13 ../
    -rw-r--r--  1 root root 2897369062 Jun 21 06:48 9Y1640.R1.clean.fastq.gz
    -rw-r--r--  1 root root 2798611010 Jun 21 06:50 9Y1640.R2.clean.fastq.gz
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# sample='9Y1640'
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq 9Y1640.R2.clean.fastq  | samtools sort -@ 5 -o 9Y1640.bam -
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [E::main_mem] fail to open file `9Y1640.R1.clean.fastq'.
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# sample='9Y1640'
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina"  ~/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq.gz 9Y1640.R2.clean.fastq.gz  | samtools sort -@ 5 -o 9Y1640.bam -
    [M::bwa_idx_load_from_disk] read 0 ALT contigs
    [M::process] read 354436 sequences (50000086 bp)...
    [M::process] read 353998 sequences (50000030 bp)...
    [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (113, 144879, 0, 173)
    [M::mem_pestat] analyzing insert size distribution for orientation FF...
    [M::mem_pestat] (25, 50, 75) percentile: (128, 187, 267)
    [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 545)
    [M::mem_pestat] mean and std.dev: (197.98, 92.21)
    [M::mem_pestat] low and high boundaries for proper pairs: (1, 684)
    ...
    

    Because WES data is large, even a single sample takes a long time to run, so I did not record the whole process (it really was slow; after an hour there was still no sign of finishing, so next time a command like this should go to the background, see the sketch after the output below).
    The output after it finished:

    [M::mem_process_seqs] Processed 63974 reads in 26.351 CPU sec, 5.102 real sec
    [main] Version: 0.7.17-r1188
    [main] CMD: bwa mem -t 5 -R @RG\tID:9Y1640\tSM:9Y1640\tLB:WGS\tPL:Illumina /root/wes_cancer/data/gatk_hg38 9Y1640.R1.clean.fastq.gz 9Y1640.R2.clean.fastq.gz
    [main] Real time: 5758.605 sec; CPU: 28791.178 sec
    [bam_sort_core] merging from 30 files and 5 in-memory blocks...
    (wes) root@1100150:~/wes_cancer/project/2.clean_fq# ll
    total 8745086
    drwxr-xr-x  2 root root          5 Jun 22 10:13 ./
    drwxr-xr-x 13 root root         14 Jun 22 07:13 ../
    -rw-r--r--  1 root root 2897369062 Jun 21 06:48 9Y1640.R1.clean.fastq.gz
    -rw-r--r--  1 root root 2798611010 Jun 21 06:50 9Y1640.R2.clean.fastq.gz
    -rw-r--r--  1 root root 3261116989 Jun 22 10:13 9Y1640.bam
    
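    For next time, a minimal sketch of running the long alignment in the background with a log (the script and log names are just examples):

    cat > run_bwa.sh << 'EOF'
    sample=9Y1640
    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" ~/wes_cancer/data/gatk_hg38 \
        ${sample}.R1.clean.fastq.gz ${sample}.R2.clean.fastq.gz | samtools sort -@ 5 -o ${sample}.bam -
    EOF
    nohup bash run_bwa.sh > run_bwa.log 2>&1 &
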

    The simplest variant-calling pipeline

    To understand what the parameters mean, see Dr. Zeng's post 【直播】我的基因组25:用bcftools来call variation.

    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    source activate wes
    time samtools mpileup -ugf $ref  *.bam | bcftools call -vmO z -o out.vcf.gz
    ls *.bam |xargs -i samtools index {}
    ## PCR duplicates have not been removed
    
    
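    Note that in recent samtools releases the mpileup-to-BCF step has moved to bcftools, so the equivalent modern form of the command above is:

    bcftools mpileup -Ou -f $ref *.bam | bcftools call -vmO z -o out.vcf.gz
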

    Removing PCR duplicates

    Understand the parameters and why tutorials become outdated (bioinformatics moves fast; the point is to understand the logic of the code so you can reuse it).

    # samtools markdup -r 7E5241.bam 7E5241.rm.bam   ## -r: remove duplicate reads
    # samtools markdup -S 7E5241.bam 7E5241.mk.bam   ## -S: also flag supplementary alignments of duplicates; without -r, duplicates are only marked, not removed
    
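    For reference, the usual samtools markdup workflow needs name-sorted input run through fixmate -m and then coordinate sorting; a minimal sketch using the file names from the example above:

    samtools sort -n -@ 4 -o 7E5241.namesort.bam 7E5241.bam
    samtools fixmate -m 7E5241.namesort.bam 7E5241.fixmate.bam
    samtools sort -@ 4 -o 7E5241.positionsort.bam 7E5241.fixmate.bam
    samtools markdup -r 7E5241.positionsort.bam 7E5241.rm.bam
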

    The full GATK pipeline

    source activate wes
    GATK=/home/jmzeng/biosoft/gatk4/gatk-4.0.6.0/gatk
    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    snp=/public/biosoft/GATK/resources/bundle/hg38/dbsnp_146.hg38.vcf.gz
    indel=/public/biosoft/GATK/resources/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz  
    
    for sample in {7E5239.L1,7E5240,7E5241.L1}
    do 
    echo $sample  
    # Elapsed time: 7.91 minutes
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates \
        -I $sample.bam \
        -O ${sample}_marked.bam \
        -M $sample.metrics \
        1>${sample}_log.mark 2>&1 
    
    ## Elapsed time: 13.61 minutes
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" FixMateInformation \
        -I ${sample}_marked.bam \
        -O ${sample}_marked_fixed.bam \
        -SO coordinate \
        1>${sample}_log.fix 2>&1 
    
    samtools index ${sample}_marked_fixed.bam
    
    ##  17.2 minutes
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"  BaseRecalibrator \
        -R $ref  \
        -I ${sample}_marked_fixed.bam  \
        --known-sites $snp \
        --known-sites $indel \
        -O ${sample}_recal.table \
        1>${sample}_log.recal 2>&1 
    
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./"   ApplyBQSR \
        -R $ref  \
        -I ${sample}_marked_fixed.bam  \
        -bqsr ${sample}_recal.table \
        -O ${sample}_bqsr.bam \
        1>${sample}_log.ApplyBQSR  2>&1 
    
    ## Call variants with GATK HaplotypeCaller
    $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
         -R $ref  \
         -I ${sample}_bqsr.bam \
          --dbsnp $snp \
          -O ${sample}_raw.vcf \
          1>${sample}_log.HC 2>&1  
    
    done 
    
    

    Checking alignment and variant calls in a gene region of interest

    Visualize in IGV to deepen your grasp of the pipeline.

    chr17   HAVANA  gene    43044295        43170245
    3.5G Jul 21 18:01 7E5240.bam
    7.1G Jul 21 21:40 7E5240_bqsr.bam
    4.7G Jul 21 20:28 7E5240_marked.bam
    4.8G Jul 21 20:44 7E5240_marked_fixed.bam
    
    

    Pull the reads covering the BRCA1 gene out of these bam files:

    samtools  view -h  7E5240.bam chr17:43044295-43170245 |samtools sort -o  7E5240.brca1.bam -
    samtools  view -h  7E5240_bqsr.bam chr17:43044295-43170245 |samtools sort -o  7E5240_bqsr.brca1.bam -
    samtools  view -h  7E5240_marked.bam chr17:43044295-43170245 |samtools sort -o  7E5240_marked.brca1.bam -
    samtools  view -h  7E5240_marked_fixed.bam chr17:43044295-43170245 |samtools sort -o  7E5240_marked_fixed.brca1.bam -
    ls  *brca1.bam|xargs -i samtools index {}
    
    

    With these gene-region bam files, you can call variants just for that gene:

    source activate wes
    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    samtools mpileup -ugf $ref   7E5240_bqsr.brca1.bam   | bcftools call -vmO z -o 7E5240_bqsr.vcf.gz
    
    

    Run all samples through the samtools mpileup + bcftools call pipeline

    Again, the reference is the live-genome series post 【直播】我的基因组25:用bcftools来call variation.

    source activate wes
    wkd=/home/jmzeng/project/boy
    cd $wkd/align 
    ls  *_bqsr.bam  |xargs -i samtools index {}
    ref=/public/biosoft/GATK/resources/bundle/hg38/Homo_sapiens_assembly38.fasta
    nohup samtools mpileup -ugf $ref   *_bqsr.bam | bcftools call -vmO z -o all_bqsr.vcf.gz & 
    

    QC of the alignment and variant-calling results

    Both raw and clean fastq files should be checked with fastqc.

    The bam files from each alignment stage can also be QC'd.

    Use qualimap to summarize sequencing depth and coverage of the WES bam files.

    source activate wes
    wkd=/home/jmzeng/project/boy
    cd $wkd/clean
    ls *.gz |xargs fastqc -t 10 -o ./
    
    cd $wkd/align
    rm *_marked*.bam
    ls  *.bam  |xargs -i samtools index {} 
    ls  *.bam  | while read id ;do (samtools flagstat $id > $(basename $id ".bam").stat);done
    
    conda install bedtools
    cat /public/annotation/CCDS/human/CCDS.20160908.txt  |perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i |awk '{print "chr"$0"\t0\t+"}'  > $wkd/align/hg38.exon.bed 
    
    exon_bed=hg38.exon.bed 
    ls  *_bqsr.bam | while read id;
    do
    sample=${id%%.*}
    echo $sample
    qualimap bamqc --java-mem-size=20G -gff $exon_bed -bam $id  & 
    done 
    
    ### featureCounts 
    gtf=/public/reference/gtf/gencode/gencode.v25.annotation.gtf.gz
    featureCounts -T 5 -p -t exon -g gene_id    \
    -a  $gtf   *_bqsr.bam -o  all.id.txt  1>counts.id.log 2>&1 & 
    
    

    Comparing the output of the two variant callers at the same site (the first record below is the multi-sample bcftools call result; the three records after it are the per-sample GATK HaplotypeCaller calls):

    chr1    139213  .       A       G       388     .   
    DP=984;VDB=0.831898;SGB=-223.781;RPB=0.599582;MQB=0.0001971;MQSB=0.00457372;BQB=2.59396e-09;MQ0F=0.281504;ICB=0.333333;HOB=0.5;AC=3;AN=6;DP4=371,169,160,84;MQ=21
    GT:PL   0/1:198,0,255   0/1:176,0,255   0/1:50,0,158
    
    
    chr1    139213  rs370723703 A   G   3945.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.999;ClippingRankSum=0.000;DB;DP=244;ExcessHet=3.0103;FS=2.256;MLEAC=1;MLEAF=0.500;MQ=29.33;MQRankSum=-0.929;QD=16.17;ReadPosRankSum=1.462;SOR=0.863  GT:AD:DP:GQ:PL  0/1:136,108:244:99:3974,0,6459
    chr1    139213  rs370723703 A   G   2261.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.191;ClippingRankSum=0.000;DB;DP=192;ExcessHet=3.0103;FS=9.094;MLEAC=1;MLEAF=0.500;MQ=32.03;MQRankSum=-0.533;QD=11.78;ReadPosRankSum=0.916;SOR=0.321  GT:AD:DP:GQ:PL  0/1:126,66:192:99:2290,0,7128
    chr1    139213  rs370723703 A   G   2445.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-2.495;ClippingRankSum=0.000;DB;DP=223;ExcessHet=3.0103;FS=10.346;MLEAC=1;MLEAF=0.500;MQ=30.18;MQRankSum=0.486;QD=10.97;ReadPosRankSum=-0.808;SOR=0.300 GT:AD:DP:GQ:PL  0/1:152,71:223:99:2474,0,7901
    
    

    Downstream analysis of the VCF

    Mainly: annotation and filtering

    Annotation

    VEP, snpEff, ANNOVAR

    1. Annovar usage notes (http://www.bio-info-trainee.com/641.html)

    2. Annotating SNPs with annovar (http://www.bio-info-trainee.com/441.html)

    3. Calling variants for genes of interest (http://www.bio-info-trainee.com/2013.html)

    4. WES (part 6): annotating with annovar (http://www.bio-info-trainee.com/1158.html)

    cd ~/biosoft/
    ln -s /public/biosoft/ANNOVAR/ ANNOVAR
    source activate wes
    wkd=/home/jmzeng/project/boy
    cd $wkd/mutation
    mv ../align/*.vcf ./

    ~/biosoft/ANNOVAR/annovar/convert2annovar.pl -format vcf4old 7E5240_raw.vcf > 7E5240.annovar
    
    ~/biosoft/ANNOVAR/annovar/annotate_variation.pl \
    -buildver hg38  \
    --outfile 7E5240.anno \
    7E5240.annovar \
    ~/biosoft/ANNOVAR/annovar/humandb/
    
    

    Strictly speaking the VCF file itself is not required; all the software needs is a chromosome plus coordinates, and for our VCF-format variant files the software usually does some reformatting before annotating. ANNOVAR's annotation comes in three main flavors: gene-based, region-based, and filter-based.

    To annotate against multiple databases with annovar in one go:

    dir=/home/jianmingzeng/biosoft/ANNOVAR/annovar
    db=$dir/humandb/ 
    ls $db
    
    wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2018/clinvar_20180603.vcf.gz 
    
    perl $dir/annotate_variation.pl --downdb --webfrom annovar --buildver hg38 clinvar_20180603 $db
    # perl $bin  --downdb --webfrom annovar --buildver hg38 gnomad_genome $db
    mkdir annovar_results 
     $dir/convert2annovar.pl -format vcf4old highQ.vcf  1> highQ.avinput  2>/dev/null 
    perl $dir/annotate_variation.pl  -buildver hg38 -filter -dbtype clinvar_20180603  --outfile annovar_results/highQ_clinvar  highQ.avinput  $db
    
    perl $dir/table_annovar.pl \
    -buildver hg38 \
    highQ.avinput $db \
    -out test \
    -remove \
    -protocol refGene,clinvar_20180603 \
    -operation g,f \
    -nastring NA
    ## note: clinvar is a filter-based database, so its -operation is "f", and the
    ## protocol name should match the version actually downloaded above (clinvar_20180603)
    
    
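    With -out test, the merged annotation lands in test.hg38_multianno.txt (table_annovar names its output <prefix>.<buildver>_multianno.txt).
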

    Additional exercises

    Use the Variant Effect Predictor to annotate all the variants. Filter out SNPs already known in the dbSNP and 1000 Genomes Project databases.

    Use the OMIM database (http://omim.org/) to look up protein structure and function. Use SIFT, PolyPhen-2, and PROVEAN to predict how strongly each SNV affects protein function; only when all three tools predict a large functional impact is the variant considered highly deleterious. Use PROVEAN to predict the impact of indels on protein function.

    In fact the dbNSFP database already annotates the effect of these variants on protein function (see the annovar sketch below).
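
    A hedged sketch of pulling dbNSFP scores in through annovar, reusing $dir, $db, and highQ.avinput from above (the release name dbnsfp35a is my assumption; check the ANNOVAR download page for the current hg38 build):

    perl $dir/annotate_variation.pl --downdb --webfrom annovar --buildver hg38 dbnsfp35a $db
    perl $dir/table_annovar.pl highQ.avinput $db -buildver hg38 -out test_dbnsfp \
        -remove -protocol refGene,dbnsfp35a -operation g,f -nastring NA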

    Filtering variant sites

    Use GATK joint calling; for filtering, see: https://mp.weixin.qq.com/s/W8Vfv1WmW6M7U0tIcPtlng

    Note that a lot of material goes out of date

    For example, you can still find GATK3 code (http://baserecalibrator1.rssing.com/chan-10751514/all_p13.html), but it can be abandoned now; in other words, tutorials are often outdated.

    GVCF tutorial (see the sketch below)
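
    The gist of the GVCF route, as a hedged sketch (standard GATK4 tools; the per-sample file names are placeholders, and $GATK/$ref are the variables defined in the pipeline above):

    ## per sample: emit a GVCF instead of a plain VCF
    $GATK HaplotypeCaller -R $ref -I ${sample}_bqsr.bam -ERC GVCF -O ${sample}.g.vcf.gz
    ## merge the per-sample GVCFs, then genotype them jointly
    $GATK CombineGVCFs -R $ref -V sampleA.g.vcf.gz -V sampleB.g.vcf.gz -O combined.g.vcf.gz
    $GATK GenotypeGVCFs -R $ref -V combined.g.vcf.gz -O joint.vcf.gz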
