Check the MD5 sums to verify that the data are complete:
cat md5.txt | while read id
do
    arr=($id)
    md5=${arr[0]}
    filename=${arr[1]}
    md5sum "$filename"
done > md5.check
diff -b -y md5.txt md5.check
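If md5.txt is in the standard `<md5>  <filename>` layout that md5sum itself emits (an assumption), `md5sum -c` can do the whole comparison directly; a minimal sketch with a fabricated file:

```shell
# Sketch: let md5sum verify the checksums itself (assumes md5.txt uses
# the standard "<md5>  <filename>" layout that md5sum writes).
echo "some sequencing data" > demo.fq.gz
md5sum demo.fq.gz > md5.txt
md5sum -c md5.txt    # prints "demo.fq.gz: OK" when the file is intact
```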
The upload is complete and the data are intact.
Adapter trimming
This step is carried out in the raw directory.
All of the data are raw reads that have not been adapter-trimmed; an arbitrary example:
(figure: image-20210525102843317.png)
(figure: image-20210525102853277.png)
The adapters contain the Illumina universal adapter, which is removed here with Trimmomatic.
Build a config file for batch processing:
ls *1.fastq.gz > 1
ls *2.fastq.gz > 2
ls *2.fastq.gz | while read id ; do echo ${id%%.*}; done > 0
paste 0 1 2 > config
rm 0 1 2
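The `${id%%.*}` expansion above strips everything from the first dot onward to derive a name column; a quick demonstration of the two expansions this document uses (the file names are made up):

```shell
# Bash parameter expansions used when building the configs
# (the file names here are hypothetical examples).
id="S01_2.fastq.gz"
echo "${id%%.*}"    # longest suffix match of ".*" removed -> S01_2
id="S01_2.paried.clean.fq.gz"
echo "${id%%_*}"    # longest suffix match of "_*" removed -> S01
```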
Because the data are large (98 files in total), they are split into 10 groups for parallel batch processing, with the group information stored in g1-g10.
Create the script remove_adaptor_trimmomatic.sh with the following content:
#!/bin/bash
TRIMMOMATIC=/hanlab/Sicheng/biosoft/trimmomatic/Trimmomatic-0.39/trimmomatic-0.39.jar
cat $1 | while read id
do
    arr=($id)
    sample=${arr[0]}
    fq1=${arr[1]}
    fq2=${arr[2]}
    java -jar $TRIMMOMATIC PE -threads 10 -phred33 \
        -trimlog ../clean/${sample}.log \
        $fq1 $fq2 \
        ../clean/${sample}_1.paried.clean.fq.gz ../clean/${sample}_1.unparied.clean.fq.gz \
        ../clean/${sample}_2.paried.clean.fq.gz ../clean/${sample}_2.unparied.clean.fq.gz \
        ILLUMINACLIP:/hanlab/Sicheng/biosoft/trimmomatic/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 \
        LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
done
Run it:
for i in {1..10}
do
(nohup bash ./remove_adaptor_trimmomatic.sh g${i} >g${i}.log &)
done
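The launch pattern above backgrounds each group inside a subshell, so the parent shell cannot `wait` on the jobs. A sketch (with a trivial stand-in command) of launching tracked background jobs instead:

```shell
# Sketch: run the groups as tracked background jobs so `wait` blocks
# until all of them finish (sleep/echo stand in for the real script).
for i in 1 2 3
do
    ( sleep 0.1; echo "group g${i} done" ) > g${i}.log &
done
wait    # returns only after every background job has exited
cat g1.log g2.log g3.log
```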
Possible improvements: record extra information such as per-group runtime. Here I am just running the pipeline, so that is skipped for now. Also, since the resource consumption was unknown, I did not dare to use too many threads or too much memory, so resource utilization could be tuned to speed things up.
About the Trimmomatic parameters (from the manual): This will perform the following in this order:
Remove Illumina adapters provided in the TruSeq3-PE.fa file (provided). Initially Trimmomatic will look for seed matches (16 bases) allowing maximally 2 mismatches. These seeds will be extended and clipped if in the case of paired end reads a score of 30 is reached (about 50 bases), or in the case of single ended reads a score of 10 (about 17 bases).
Remove leading low quality or N bases (below quality 3)
Remove trailing low quality or N bases (below quality 3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15
Drop reads which are less than 36 bases long after these steps
Check job completion:
for i in {1..10}
do
    echo g${i}
    grep -c successfully g${i}.log
done
After adapter removal and quality trimming, check the cleaned data with FastQC:
(figure: image-20210525144147079.png)
As for duplicates, they are handled in a later step.
Data pre-processing
Following the GATK data pre-processing workflow: https://gatk.broadinstitute.org/hc/en-us/articles/360035535912-Data-pre-processing-for-variant-discovery
Workflow: align the clean data to the hg38 reference genome with BWA, sort with samtools, remove duplicates with MarkDuplicates, recalibrate base qualities with BaseRecalibrator and ApplyBQSR, and obtain the BAM files used for downstream analysis.
This step runs in the align directory.
First, build the alignment metadata recording each sample name and its paired-end files:
ls *1.paried.clean.fq.gz>1
ls *2.paried.clean.fq.gz>2
ls *2.paried.clean.fq.gz | while read id ; do echo ${id%%_*}; done > 0
paste 0 1 2 >config
rm 0 1 2
As before, split into 10 groups for parallel batch processing, with group information stored in g1-g10.
Create the script wes_pre_process.sh with the following content:
#!/bin/bash
INDEX=/hanlab/Sicheng/reference/hg38/bwa_index/gatk_hg38
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
snp1=/hanlab/Sicheng/reference/hg38/dbsnp_146.hg38.vcf.gz
snp2=/hanlab/Sicheng/reference/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz
indel=/hanlab/Sicheng/reference/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
cat $1 |while read id
do
arr=($id)
sample=${arr[0]}
fq1=${arr[1]}
fq2=${arr[2]}
## BWA Alignment
echo "$sample bwa started at $(date)"
bwa mem -t 8 -M -R "@RG\tID:$sample\tSM:$sample\tLB:WES\tPL:Illumina" $INDEX ../clean/$fq1 ../clean/$fq2 |samtools sort -@ 4 -o $sample.bam -
echo "$sample bwa done at $(date)"
## MarkDuplicate
echo "$sample gatk-MarkDuplicates started at $(date)"
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" MarkDuplicates \
-I ${sample}.bam \
-O ../rmdufix/${sample}_marked.bam \
-M ../rmdufix/$sample.metrics \
--REMOVE_DUPLICATES true
echo "$sample gatk-MarkDuplicates done at $(date)"
## BQSR
echo "$sample gatk-BaseRecalibrator started at $(date)"
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" BaseRecalibrator \
-I ../rmdufix/${sample}_marked.bam \
-R $ref \
--known-sites $snp1 \
--known-sites $snp2 \
--known-sites $indel \
-O ../rmdufix/${sample}.recal_data.table
echo "$sample gatk-BaseRecalibrator done at $(date)"
echo "$sample gatk-ApplyBQSR started at $(date)"
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" ApplyBQSR \
-R $ref \
-I ../rmdufix/${sample}_marked.bam \
--bqsr-recal-file ../rmdufix/${sample}.recal_data.table \
-O ../rmdufix/${sample}_last.bam
echo "$sample gatk-ApplyBQSR done at $(date)"
done
Run it:
for i in {1..10}
do
(nohup bash ./wes_pre_process.sh g${i} >g${i}.log &)
done
Possible improvement: the bwa and samtools thread usage is still a bit scary; CPU utilization peaked at 98%. Next time the thread counts could be lowered, e.g. to -t 6 for bwa and -@ 2 for samtools.
Predictably, since the groups differ in data size and bwa/samtools are the most resource-hungry steps, once a group moves on to the GATK steps it consumes far less, and as the jobs finish at different times idle windows appear. So heavy resource consumption only occurs at the start; after a while the load drops off.
Check job completion:
for i in {1..10}
do
    echo g${i}
    grep -E 'started|done' g${i}.log
done
SNV+Indels
Following the GATK somatic short variant discovery (SNVs + Indels) workflow: https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-
Workflow: call raw SNVs and indels with Mutect2 on each tumor and matched normal pair; estimate cross-sample contamination with GetPileupSummaries and CalculateContamination; model read-orientation artifacts with LearnReadOrientationModel; finally, filter with FilterMutectCalls to obtain the final call set, and annotate the variants with Funcotator to produce MAF files.
Mutect2
Reference: https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2
Split the last.bam files into CA (tumor) and N (normal) groups, build a config with three columns (sample name, tumor, normal), and call raw variants with Mutect2 (batched to save time):
ls *CA_last.bam | while read id; do echo ${id%%_*}; done >CA
ls *N_last.bam | while read id; do echo ${id%%_*}; done >N
ls *N_last.bam | while read id; do echo ${id%%-*}; done >sample
paste sample CA N > config
rm sample CA N
As before, groups are used to speed things up (20 groups would also work if the machine can handle it, but building the groups gets awkward).
In the end 9 groups were used: the first sample formed a trial group, and the remaining 48 samples were split 6 per group into 8 more groups. With the 49 pairs split this way, resource consumption turned out to be low, so some time was wasted; 3 samples per group would have been faster.
The contents of an arbitrary group:
YZQ YZQ-CA YZQ-N
ZFL ZFL-CA ZFL-N
ZR ZR-CA ZR-N
ZSB ZSB-CA ZSB-N
ZSM ZSM-CA ZSM-N
Create the script wes_mutect2.sh with the following content (==a script defect: it must be run from a particular directory; try absolute paths?==, though this is not strictly required):
#!/bin/bash
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
bed=/hanlab/Sicheng/reference/CCDS/hg38.exon.bed
germline=/hanlab/Sicheng/reference/hg38/af-only-gnomad.hg38.vcf.gz
pon=/hanlab/Sicheng/reference/hg38/1000g_pon.hg38.vcf.gz
cat $1 | while read id
do
arr=($id)
sample=${arr[0]}
tumor=${arr[1]}
normal=${arr[2]}
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" Mutect2 \
-R $ref \
-I ${tumor}_last.bam \
-I ${normal}_last.bam \
-normal ${normal} \
-L $bed \
--germline-resource $germline \
--panel-of-normals $pon \
--f1r2-tar-gz ../f1r2/${sample}.f1r2.tar.gz \
-O ../vcfraw/${sample}.vcf.gz
done
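The directory-dependence noted above could be removed by resolving the data directories relative to the script itself; a sketch (the directory layout here is assumed, not verified):

```shell
# Sketch: derive absolute paths from the script's own location so it
# can be launched from any working directory (layout assumed).
DIR="$(cd "$(dirname "$0")" && pwd)"
BAMDIR="$DIR/../rmdufix"    # where the *_last.bam files would live
echo "resolved BAM directory: $BAMDIR"
```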
Run it:
for i in {1..9}
do
(nohup bash ./wes_mutect2.sh g${i} >g${i}.log &)
done
Check job completion. I did not add timestamps in this script, so there is no good marker to extract; skipping this check is fine too.
for i in {1..9}
do
    echo g${i}
    cat g${i}.log
done
GetPileupSummaries
Reference: https://gatk.broadinstitute.org/hc/en-us/articles/360056969392-GetPileupSummaries
==This step can run in parallel with Mutect2.==
GetPileupSummaries only needs tumor.bam and normal.bam as input.
The current directory layout:
.
├── align
├── clean
├── f1r2
├── pileups
├── raw
├── rmdufix
└── vcfraw
Again the samples are grouped: 5 normal groups and 5 tumor groups run in parallel. This step runs in the pileups directory.
Create the script wes_pileup.sh with the following content:
#!/bin/bash
biallelic=/hanlab/Sicheng/reference/hg38/small_exac_common_3.hg38.vcf.gz
cat $1 |while read id
do
sample=${id%%_*}
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" GetPileupSummaries \
-I ../rmdufix/${sample}_last.bam \
-V $biallelic \
-L $biallelic \
-O ${sample}.pileups.table
done
Run it:
for i in {1..10}
do
(nohup bash ./wes_pileup.sh g${i} >g${i}.log &)
done
Check job progress (again, not very informative):
for i in {1..10}
do
    echo g${i}
    cat g${i}.log
done
CalculateContamination
Reference: https://gatk.broadinstitute.org/hc/en-us/articles/360056968632-CalculateContamination
This step can only run after GetPileupSummaries has finished. Its inputs are the normal and tumor pileups.table files, and it outputs a contamination.table used as input to FilterMutectCalls.
Build the matched template and calculate the contamination:
ls *CA.pileups.table >../contamination/CA
ls *N.pileups.table >../contamination/N
ls *CA.pileups.table | while read id; do echo ${id%%-*}; done >../contamination/sample
cd ../contamination/
paste sample CA N >config
rm CA N sample
Likewise, build the groups for batch processing (could this step be automated? It feels tedious). The pattern head -$i config | tail -7 extracts consecutive 7-line chunks; there are only 49 patients here:
n=1
for i in $(seq 7 7 49)
do
head -$i config |tail -7 >g${n}
n=`expr $n + 1`
done
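The manual chunking above can also be done with `split`, which may answer the automation question (a sketch; the rename loop for the g1, g2, ... names is an assumption about the desired layout):

```shell
# Sketch: cut a config into 7-line groups g1..g7 with split instead of
# head/tail (a 49-line demo config is fabricated here).
seq 1 49 | sed 's/^/patient_/' > config
split -l 7 -d -a 2 config gtmp_     # writes gtmp_00 .. gtmp_06
n=1
for f in gtmp_*
do
    mv "$f" g${n}
    n=$((n+1))
done
wc -l g1 g7    # each group holds 7 lines
```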
Groups built successfully, hooray!
Create wes_calculatecontamination.sh with the following content:
#!/bin/bash
cat $1 |while read id
do
arr=($id)
sample=${arr[0]}
tumor=${arr[1]}
normal=${arr[2]}
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" CalculateContamination \
-I ../pileups/${tumor} \
-matched ../pileups/${normal} \
-O ${sample}.table \
--tumor-segmentation ${sample}.segments.tsv
done
Run it:
for i in {1..7}
do
(nohup bash ./wes_calculatecontamination.sh g${i} >g${i}.log &)
done
Check job progress:
for i in {1..7}
do
    echo g${i}
    cat g${i}.log
done
LearnReadOrientationModel
Reference: https://gatk.broadinstitute.org/hc/en-us/articles/360057439111-LearnReadOrientationModel
This step requires the F1R2 counts produced by CollectF1R2Counts, or by Mutect2 with the --f1r2-tar-gz argument. Its inputs are the tumor/matched-normal f1r2.tar.gz files, and it outputs an artifact-prior.tar.gz used as input to FilterMutectCalls.
Grouping:
ls *gz >config
n=1
for i in $(seq 7 7 49)
do
head -$i config |tail -7 >g${n}
n=`expr $n + 1`
done
Create wes_learnmodel.sh with the following content:
#!/bin/bash
cat $1 |while read id
do
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" LearnReadOrientationModel \
--input $id \
--output ${id%%.*}_artifact-prior.tar.gz
done
Run it:
for i in {1..7}
do
(nohup bash ./wes_learnmodel.sh g${i} >g${i}.log &)
done
Check job progress:
for i in {1..7}
do
    echo g${i}
    cat g${i}.log
done
FilterMutectCalls
This step takes the outputs of the three steps above and filters them down to a confident set of variant calls.
The patients are again split into 7 groups by name:
n=1
for i in $(seq 7 7 49)
do
head -$i config |tail -7 >g${n}
n=`expr $n + 1`
done
Create wes_filtervcf.sh with the following content (run in the vcfclean directory):
#!/bin/bash
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
cat $1 |while read id
do
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" FilterMutectCalls \
-R $ref \
-V ../vcfraw/${id}.vcf.gz \
--contamination-table ../contamination/${id}.table \
--tumor-segmentation ../contamination/${id}.segments.tsv \
--orientation-bias-artifact-priors ../f1r2/${id}_artifact-prior.tar.gz \
-O ${id}.filtered.vcf.gz
done
Run it:
for i in {1..7}
do
(nohup bash ./wes_filtervcf.sh g${i} >g${i}.log &)
done
Check job progress:
for i in {1..7}
do
    echo g${i}
    cat g${i}.log
done
Funcotator
Reference: https://gatk.broadinstitute.org/hc/en-us/articles/360056967772-Funcotator
Annotate the variants to genes. The grouping reuses g1-g7 from the previous step, though the log file names are changed slightly.
Create wes_funcotator.sh with the following content:
#!/bin/bash
ref=/hanlab/Sicheng/reference/hg38/Homo_sapiens_assembly38.fasta
dataSourcesFolder=/hanlab/Sicheng/reference/functator/funcotator_dataSources.v1.7.20200521s/
cat $1 |while read id
do
gatk --java-options "-Xmx8G -Djava.io.tmpdir=./" Funcotator \
-R $ref \
-V ${id}.filtered.vcf.gz \
-O ../maf/${id}.maf \
--output-file-format MAF \
--data-sources-path $dataSourcesFolder/ \
--ref-version hg38
done
Run it:
for i in {1..7}
do
(nohup bash ./wes_funcotator.sh g${i} >g${i}.funcotator.log &)
done
Check job progress:
for i in {1..7}
do
    echo g${i}
    cat g${i}.funcotator.log
done
Merge the MAF files
Reference: https://www.yuque.com/biotrainee/wes/qmkaqo
Build a config recording the matching tumor and normal sample names, like:
BJ BJ-CA BJ-N
CGX CGX-CA CGX-N
CKY CKY-CA CKY-N
CQ CQ-CA CQ-N
CZX CZX-CA CZX-N
Fix up the MAF contents (set the tumor/normal barcode columns):
cat config | while read id
do
arr=($id)
sample=${arr[0]}
tumor=${arr[1]}
normal=${arr[2]}
grep -v '^#' ${sample}.maf| grep -v '^Hugo_Symbol'| awk -v T=${tumor} -v N=${normal} 'BEGIN{FS="\t";OFS="\t"}{$16=T;$17=N;print $0}' >gatk/${sample}_funcotator.maf
done
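The awk one-liner above overwrites MAF columns 16 (Tumor_Sample_Barcode) and 17 (Matched_Norm_Sample_Barcode); a minimal demonstration on a fabricated 17-column row:

```shell
# Demo: rewrite columns 16 and 17 of a tab-separated row, mirroring the
# awk call above (the input row and barcode values are made up).
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\tc12\tc13\tc14\tc15\toldT\toldN\n' \
    | awk -v T=BJ-CA -v N=BJ-N 'BEGIN{FS="\t";OFS="\t"}{$16=T;$17=N;print $0}'
# fields 16 and 17 become BJ-CA and BJ-N
```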
## extract a header line
grep 'Hugo_Symbol' BJ.maf > gatk/header
## remove the old files
rm *.maf
## merge all samples' MAFs
cat header *maf > merged.maf
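A quick sanity check on the merge (a sketch with fabricated files): the merged MAF should contain exactly one header line plus every data line from the per-sample files.

```shell
# Sketch: verify merged.maf = 1 header line + all per-sample data lines
# (tiny fabricated MAF fragments; real files have many more columns).
printf 'Hugo_Symbol\tChromosome\n' > header
printf 'TP53\tchr17\nKRAS\tchr12\n' > A_funcotator.maf
printf 'EGFR\tchr7\n' > B_funcotator.maf
cat header A_funcotator.maf B_funcotator.maf > merged.maf
wc -l < merged.maf    # 1 header + 3 data lines = 4
```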
Conclusion
This concludes the upstream analysis of the whole-exome sequencing data; downstream analysis comes next. See also the detailed write-up on Yuque.
References
Adapter trimming: GitHub - usadellab/Trimmomatic, http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
SNVs+Indels: https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-
Mutect2: https://gatk.broadinstitute.org/hc/en-us/articles/360056969692-Mutect2
GetPileupSummaries: https://gatk.broadinstitute.org/hc/en-us/articles/360056969392-GetPileupSummaries
CalculateContamination:https://gatk.broadinstitute.org/hc/en-us/articles/360056968632-CalculateContamination
LearnReadOrientationModel:https://gatk.broadinstitute.org/hc/en-us/articles/360057439111-LearnReadOrientationModel
FilterMutectCalls:https://gatk.broadinstitute.org/hc/en-us/articles/360057439111-LearnReadOrientationModel
Funcotator:https://gatk.broadinstitute.org/hc/en-us/articles/360056967772-Funcotator