美文网首页
NGS流程-全外显子测序分析记录

NGS流程-全外显子测序分析记录

作者: Lucas21423 | 来源:发表于2021-05-25 15:37 被阅读0次

检查md5,查看数据是否完整

cat md5.txt | while read id
do
arr=($id)
md5=${arr[0]}
filename=${arr[1]}
echo `md5sum $filename`
done >md5.check

diff -b -y md5.txt md5.check

数据上传完整

去接头

这一步在raw目录下进行

所有的数据都是原始的,没有去接头,随便展示一个:


image-20210525102843317.png
image-20210525102853277.png

接头里面包含了illumina universal adapter,这里用trimmomatic去掉

构建config批量处理

ls *1.fastq.gz > 1
ls *2.fastq.gz | while read id ; do echo ${id%%.*}; done > 0
paste 0 1 2 >config
rm 0 1 2

因为数据比较大,共98个,这里分成10组并行批处理,数据信息分别存放在g1-g10里面

创建脚本remove_adaptor_trimmomatic.sh内容如下:

#!/bin/bash
TRIMMOMATIC=/hanlab/Sicheng/biosoft/trimmomatic/Trimmomatic-0.39/trimmomatic-0.39.jar

cat $1 | while read id
do
arr=($id)
sample=${arr[0]}
fq1=${arr[1]}
fq2=${arr[2]}
java -jar $TRIMMOMATIC PE -threads 10 -phred33 -trimlog ../clean/${sample}.log $fq1 $fq2 ../clean/${sample}_1.paried.clean.fq.gz ../clean/${sample}_1.unparied.clean.fq.gz ../clean/${sample}_2.paried.clean.fq.gz ../clean/${sample}_2.unparied.clean.fq.gz ILLUMINACLIP:/hanlab/Sicheng/biosoft/trimmomatic/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done

执行

for i in {1..10}
do
(nohup bash ./remove_adaptor_trimmomatic.sh g${i} >g${i}.log &)
done

可优化内容:计算每组的运行时间等附加信息。这里单纯跑流程,暂且不做。此外,不清楚需要消耗的资源,不敢设计使用太多线程和内存,因此也可以优化资源利用,加快速度。

关于trimmomatic参数的介绍:This will perform the following in this order :

Remove Illumina adapters provided in the TruSeq3-PE.fa file (provided). Initially Trimmomatic will look for seed matches (16 bases) allowing maximally 2 mismatches. These seeds will be extended and clipped if in the case of paired end reads a score of 30 is reached (about 50 bases), or in the case of single ended reads a score of 10, (about 17 bases).

Remove leading low quality or N bases (below quality 3)

Remove trailing low quality or N bases (below quality 3)

Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15

Drop reads which are less than 36 bases long after these steps

查看任务完成度:

for i in {1..10};
do echo g${i};
echo `cat g${i}.log | grep successfully|wc`
done

去掉接头、质控后的数据,用fastqc检查:


image-20210525144147079.png

至于duplication,后续步骤会进行处理

数据预处理

参考GATK的数据预处理方法:https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery

流程:clean数据利用BWA比对到参考基因组hg38上面,samtools排序,利用MarkDuplicates去重,BaseRecalibrator和ApplyBQSR矫正数据质量,得到用于下游分析的bam文件

这一步运行在align文件夹下

首先构建比对文件的metadata,记录样本名称和双端测序文件

ls *1.paried.clean.fq.gz>1
ls *2.paried.clean.fq.gz>2
ls *2.paried.clean.fq.gz | while read id ; do echo ${id%%_*}; done > 0
paste 0 1 2 >config
rm 0 1 2

同样的,这里分成10组并行批处理,数据信息分别存放在g1-g10里面

创建脚本wes_pre_process.sh,内容如下:

#!/bin/bash
INDEX=/hanlab/Sicheng/reference/hg38/bwa_index/gatk_hg38
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
snp1=/hanlab/Sicheng/reference/hg38/dbsnp_146.hg38.vcf.gz
snp2=/hanlab/Sicheng/reference/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
indel=/hanlab/Sicheng/reference/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz  

cat $1 |while read id
do 
arr=($id)
sample=${arr[0]}
fq1=${arr[1]}
fq2=${arr[2]}

## BWA Alignment
echo "$sample bwa started at $(date)"
bwa mem -t 8 -M -R "@RG\tID:$sample\tSM:$sample\tLB:WES\tPL:Illumina" $INDEX ../clean/$fq1 ../clean/$fq2 |samtools sort -@ 4 -o $sample.bam -  
echo "$sample bwa done at $(date)"

## MarkDuplicate
echo "$sample gatk-MarkDuplicates started at $(date)"
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" MarkDuplicates \
-I ${sample}.bam \
-O ../rmdufix/${sample}_marked.bam \
-M ../rmdufix/$sample.metrics \
--REMOVE_DUPLICATES true 
echo "$sample gatk-MarkDuplicates done at $(date)"

## BQSR
echo "$sample gatk-BaseRecalibrator started at $(date)"
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" BaseRecalibrator \
   -I ../rmdufix/${sample}_marked.bam \
   -R $ref \
   --known-sites $snp1 \
   --known-sites $snp2 \
   --known-sites $indel \
   -O ../rmdufix/${sample}.recal_data.table
echo "$sample gatk-BaseRecalibrator done at $(date)"

echo "$sample gatk-ApplyBQSR started at $(date)"
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" ApplyBQSR \
   -R $ref \
   -I ../rmdufix/${sample}_marked.bam \
   --bqsr-recal-file ../rmdufix/${sample}.recal_data.table \
   -O ../rmdufix/${sample}_last.bam
echo "$sample gatk-ApplyBQSR done at $(date)"
done

执行

for i in {1..10}
do
(nohup bash ./wes_pre_process.sh g${i} >g${i}.log &)
done

可优化内容:bwa和samtools线程调用起来还是有点恐怖,CUP运行最大值有98%,下次可以改成-6和-2

可以预见的是,由于每个group的样本数据大小不一样,而最消耗资源的是bwa和samtools过程,当第一组比对完成之后,gatk消耗的资源没有多少,任务先后完成,应该就会出现空窗期。所以只有刚开始的时候会消耗大量资源,一段时间后就没事了。

查看任务完成度

for i in {1..10};
do echo g${i};
echo `cat g${i}.log | grep -E 'started|done'`
done

SNV+Indels

参考GATK体细胞短变异发现(SNVs+Indels)流程: https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-

流程:利用Mutect2对肿瘤和匹配的正常样本进行SNV和indels的原始检出,

GetPileupSummaries CalculateContamination

LearnReadOrientationModel

最后,用FilterMutectCalls得到最终的结果,并用Funcotator注释变异结果得到maf文件

Mutect2

参考资料:https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2

将last.bam文件分成CA(肿瘤)和N(正常)两组,构建一个config,包括样本名称、肿瘤和正常三列,用Mutect2来发现原始变异 (批量处理节省时间)

ls *CA_last.bam | while read id; do echo ${id%%_*}; done >CA
ls *N_last.bam | while read id; do echo ${id%%_*}; done >N
ls *N_last.bam | while read id; do echo ${id%%-*}; done >sample
paste sample CA N > config
rm sample CA N

同理还是用g1-g10来快一些(如果可以处理跑20组也可以,就是创建分组的时候有点难)

最后选择了9组,第一个样本为第一组实验,后续样本每6个为一组,共8组。49个样本分成了1*1+6*9,就是消耗的资源不多,有点浪费时间,3个为一组会快一些

随意一个组的构建情况

YZQ YZQ-CA  YZQ-N
ZFL ZFL-CA  ZFL-N
ZR  ZR-CA   ZR-N
ZSB ZSB-CA  ZSB-N
ZSM ZSM-CA  ZSM-N

创建脚本wes_mutect2.sh,内容如下(==发现一个脚本缺陷,必须在特定的文件夹下面才可以运行,尝试用一下绝对路径?==,当然也不是必须的):

#!/bin/bash
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
bed=/hanlab/Sicheng/reference/CCDS/hg38.exon.bed
germline=/hanlab/Sicheng/reference/hg38/af-only-gnomad.hg38.vcf.gz
pon=/hanlab/Sicheng/reference/hg38/1000g_pon.hg38.vcf.gz

cat $1 | while read id
do
arr=($id)
sample=${arr[0]}
tumor=${arr[1]}
normal=${arr[2]}

gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" Mutect2 \
     -R $ref \
     -I ${tumor}_last.bam \
     -I ${normal}_last.bam \
     -normal ${normal} \
     -L $bed \
     --germline-resource $germline \
     --panel-of-normals $pon \
     --f1r2-tar-gz ../f1r2/${sample}.f1r2.tar.gz \
     -O ../vcfraw/${sample}.vcf.gz 
done

执行

for i in {1..9}
do
(nohup bash ./wes_mutect2.sh g${i} >g${i}.log &)
done

查看任务完成度,这里我没有在脚本里面加时间,所以没有什么特征点可以提取,不看也行

for i in {1..9};
do echo g${i};
echo `cat g${i}.log`
done

GetpileupSummaries

参考资料:https://gatk.broadinstitute.org/hc/en-us/articles/360056969392-GetPileupSummaries

==这里,可以和mutect2并行着来==

GetPileupSummaries只需要tumor.bam和normal.bam即可

目前的路径情况如下:

.
├── align
├── clean
├── f1r2
├── pileups
├── raw
├── rmdufix
└── vcfraw

仍然是分组,正常5组,癌症5组这么并行。这一步在pileups下面运行。

构建脚本wes_pileup.sh,内容如下

#!/bin/bash
biallelic=/hanlab/Sicheng/reference/hg38/small_exac_common_3.hg38.vcf.gz

cat $1 |while read id
do
sample=${id%%_*}
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" GetPileupSummaries \
   -I ../rmdufix/${sample}_last.bam \
   -V $biallelic \
   -L $biallelic \
   -O ${sample}.pileups.table
done

执行

for i in {1..10}
do
(nohup bash ./wes_pileup.sh g${i} >g${i}.log &)
done

查看任务进度,同理,没有什么参考价值

for i in {1..10};
do echo g${i};
echo `cat g${i}.log`
done

CalculateContamination

参考资料:https://gatk.broadinstitute.org/hc/en-us/articles/360056968632-CalculateContamination

这一步需要GetpileupSummaries运行完成之后才能进行,需要的输入文件是正常和肿瘤的pileups.table,输出一个contamination.table作为FilterMutectCalls的输入文件。

构建匹配的模板,计算contamination

ls *CA.pileups.table >../contamination/CA
ls *N.pileups.table >../contamination/N
ls *CA.pileups.table | while read id; do echo ${id%%-*}; done >../contamination/sample
cd ../contamination/
paste sample CA N >config
rm CA N sample

同样,构建g1-g10,批处理运行(这一步能不能自动化呢?感觉很麻烦)

head -7 filename | tail -7
#因为这里只有49个病人
n=1
for i in $(seq 7 7 49)
do 
head -$i config |tail -7 >g${n}
n=`expr $n + 1`
done

构建成功,欢呼!

构建wes_calculatecontamination.sh,内容如下:

#!/bin/bash
cat $1 |while read id
do
arr=($id)
sample=${arr[0]}
tumor=${arr[1]}
normal=${arr[2]}
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" CalculateContamination \
   -I ../pileups/${tumor} \
   -matched ../pileups/${normal} \
   -O ${sample}.table \
   --tumor-segmentation ${sample}.segments.tsv
done

执行

for i in {1..7}
do
(nohup bash ./wes_calculatecontamination.sh g${i} >g${i}.log &)
done

查看任务进度

for i in {1..7};
do echo g${i};
echo `cat g${i}.log`
done

LearnReadOrientationModel

参考资料:https://gatk.broadinstitute.org/hc/en-us/articles/360057439111-LearnReadOrientationModel

这一步需要CollectF1R2Counts or by Mutect2, with the --f1r2-tar-gz argument运行完成之后才能进行,需要的输入文件是肿瘤和匹配正常结果的f1r2.tar.gz,输出一个artifact-prior.tar.gz作为FilterMutectCalls的输入文件。

分组:

ls *gz >config
n=1
for i in $(seq 7 7 49)
do 
head -$i config |tail -7 >g${n}
n=`expr $n + 1`
done

构建wes_learnmodel.sh,内容如下:

#!/bin/bash
cat $1 |while read id
do
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" LearnReadOrientationModel \
   --input $id \
   --output ${id%%.*}_artifact-prior.tar.gz
done

执行:

for i in {1..7}
do
(nohup bash ./wes_learnmodel.sh g${i} >g${i}.log &)
done

查看任务进度

for i in {1..7};
do echo g${i};
echo `cat g${i}.log`
done

FilterMutectCalls

参考资料:https://gatk.broadinstitute.org/hc/en-us/articles/360057438571--Tool-Documentation-Index#FilterMutectCalls

这一步用上述3个步骤得到的结果,过滤得到confident的变异结果

用病人的名字作为分组,共分成7组,进行脚本编写

n=1
for i in $(seq 7 7 49)
do 
head -$i config |tail -7 >g${n}
n=`expr $n + 1`
done

构建wes_filtervcf.sh,内容如下(在vcfclean目录下面操作)

#!/bin/bash
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta

cat $1 |while read id
do
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" FilterMutectCalls \
   -R $ref \
   -V ../vcfraw/${id}.vcf.gz \
   --contamination-table ../contamination/${id}.table \
   --tumor-segmentation ../contamination/${id}.segments.tsv \
   --orientation-bias-artifact-priors ../f1r2/${id}_artifact-prior.tar.gz \
   -O ${id}.filtered.vcf.gz
done

执行

for i in {1..7}
do
(nohup bash ./wes_filtervcf.sh g${i} >g${i}.log &)
done

查看任务进度

for i in {1..7};
do echo g${i};
echo `cat g${i}.log`
done

Funcotator

参考资料:https://gatk.broadinstitute.org/hc/en-us/articles/360056967772-Funcotator

将变异注释到基因上面,分组情况就用上一步骤的g1-g7,不过记录log的文档可以稍微修改一下。

构建wes_funcotator.sh,内容如下

#!/bin/bash
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
dataSourcesFolder=/hanlab/Sicheng/reference/functator/funcotator_dataSources.v1.7.20200521s/

cat $1 |while read id
do
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" Funcotator \
   -R $ref \
   -V ${id}.filtered.vcf.gz \
   -O ../maf/${id}.maf \
   --output-file-format MAF \
   --data-sources-path $dataSourcesFolder/ \
   --ref-version hg38 
done

执行

for i in {1..7}
do
(nohup bash ./wes_funcotator.sh g${i} >g${i}.funcotator.log &)
done

查看任务进度

for i in {1..7};
do echo g${i};
echo `cat g${i}.funcotator.log`
done

合并maf文件

参考资料:https://www.yuque.com/biotrainee/wes/qmkaqo

构建一个config,里面记录对应的肿瘤名称和正常名称,类似于:

BJ  BJ-CA   BJ-N
CGX CGX-CA  CGX-N
CKY CKY-CA  CKY-N
CQ  CQ-CA   CQ-N
CZX CZX-CA  CZX-N

修改一下maf的文件内容

cat config | while read id
do
arr=($id)
sample=${arr[0]}
tumor=${arr[1]}
normal=${arr[2]} 
    grep -v '^#' ${sample}.maf| grep -v '^Hugo_Symbol'| awk -v T=${tumor} -v N=${normal} 'BEGIN{FS="\t";OFS="\t"}{$16=T;$17=N;print $0}' >gatk/${sample}_funcotator.maf 
done

## 取出一个列名
grep 'Hugo_Symbol' BJ.maf >gatk/header

## 删除掉旧文件
rm *.maf
## 合并所有的样本的maf
cat header `ls *maf` > merged.maf

结语

至此,全外显子测序上游分析结束,之后将进行下游分析。也可以参考语雀的详细介绍。

参考资料

去接头: GitHub - usadellab/Trimmomatic, http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

数据预处理:https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery

SNVs+Indels: https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-

Mutect2: https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2

GetpileupSummaries:https://gatk.broadinstitute.org/hc/en-us/articles/360056969392-GetPileupSummaries

CalculateContamination:https://gatk.broadinstitute.org/hc/en-us/articles/360056968632-CalculateContamination

LearnReadOrientationModel:https://gatk.broadinstitute.org/hc/en-us/articles/360057439111-LearnReadOrientationModel

FilterMutectCalls:https://gatk.broadinstitute.org/hc/en-us/articles/360057439111-LearnReadOrientationModel

Funcotator:https://gatk.broadinstitute.org/hc/en-us/articles/360056967772-Funcotator

合并maf:https://www.yuque.com/biotrainee/wes/qmkaqo

相关文章

网友评论

      本文标题:NGS流程-全外显子测序分析记录

      本文链接:https://www.haomeiwen.com/subject/qsvksltx.html