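Preface: I started my PICU rotation on July 16. The day before yesterday I shadowed a night shift, and, maybe because newcomers attract patients, admissions started at noon and kept us busy until past 3 a.m. Yesterday I was up shortly after 6 a.m. to work; by the time rounds, orders, and progress notes were done it was past 11, my brain was completely spent, and I went back to my place to catch up on sleep. Today I left for work a little after 6. I had to wait for some test results and only processed one discharge, so it was 12 by the time everything was done. I went to a group meeting a little after 1, and a bit past 2 I was finally free to look at the earlier results.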
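Checking the BAM conversion, I found that some samples hit an unknown error and were not converted. I extracted those sample names into a new config1 and reran the conversion.
# build config1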
basename -a *bam.tmp.0000* >tmp
cat tmp| while read id; do sample=${id%%.hg38.sort*}; echo $sample; done >config1
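The `${id%%.hg38.sort*}` expansion above removes the longest suffix that matches `.hg38.sort*`, leaving only the sample name. A minimal sketch with a made-up file name (the sample name `S01` and the exact temp-file suffix are assumptions for illustration, not taken from the real run):

```shell
# Hypothetical leftover temp-file name; real names depend on the sort step.
id="S01.hg38.sorted.bam.tmp.0000.bam"
# %% strips the longest suffix matching the pattern .hg38.sort*
sample=${id%%.hg38.sort*}
echo "$sample"
```

If one sample left several temp files behind, its name would appear more than once in config1; piping the list through `sort -u` is one way to drop the repeats.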
# delete the leftover temp files
rm -rf *.bam.tmp.0*
# activate the conda environment and restart the conversion
conda activate wes
# nohup has to wrap the whole loop, not only `cat`, so the loop runs inside bash -c
nohup bash -c '
while read -r id; do
  bam=~/CHD_pooling_seq/${id}.dedup.bam
  if [ ! -f ~/project/0.bwa/ok.${id}_marked.status ]; then
    echo "start CrossMap for ${id}" $(date)
    python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38 1>~/project/0.bwa/${id}_log.mark 2>&1
    if [ $? -eq 0 ]; then touch ~/project/0.bwa/ok.${id}_marked.status; fi
    echo "end CrossMap for ${id}" $(date)
  fi
done < config1
' &
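The `ok.${id}_marked.status` files work as checkpoints: a marker is written only when the conversion exits with status 0, so rerunning the loop skips samples that already finished. A self-contained sketch of that pattern follows; `convert_one` is a hypothetical stand-in for the real CrossMap call, and the sample names A, B, C are invented:

```shell
# Sketch of the checkpoint pattern: a marker file is written only on success,
# so a rerun skips finished samples.
workdir=$(mktemp -d)
printf 'A\nB\nC\n' > "$workdir/config1"
touch "$workdir/ok.A_marked.status"      # pretend sample A already finished

convert_one() {                          # hypothetical; fails for sample C on purpose
  [ "$1" != "C" ]
}

ran=""
while read -r id; do
  marker="$workdir/ok.${id}_marked.status"
  if [ ! -f "$marker" ]; then
    ran="$ran$id "                       # record which samples were attempted
    if convert_one "$id"; then
      touch "$marker"                    # mark success so later runs skip it
    fi
  fi
done < "$workdir/config1"
echo "attempted: $ran"
```

Here A is skipped, B is converted and marked, and C is attempted but left unmarked, so only C would be retried on the next run.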
Meanwhile, run variant calling:
The single-sample calling script is wesFlow_multi_to_gvcf.sh:
(base) root@1100150:~/project# vi wesFlow_multi_to_gvcf.sh
(base) root@1100150:~/project# cat wesFlow_multi_to_gvcf.sh
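#!/usr/bin/bash
# usage: bash ~/project/wesFlow_multi_to_gvcf.sh $sample
# This is a WES flow for only one sample
# take the sample name from the first argument; the caller's $sample is not exported to this script
sample=$1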
samtools=samtools
GATK=~/biosoft/gatk-4.1.7.0/gatk
#references
ref=~/reference/genome/Homo_sapiens_assembly38.fasta
gatk_ref=~/reference/genome/Homo_sapiens_assembly38.fasta
gatk_bundle=~/annotation/variation/GATK
dbsnp=$gatk_bundle/dbsnp_146.hg38.vcf.gz
indel=$gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
G1000=$gatk_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz
hapmap=$gatk_bundle/hapmap_3.3.hg38.vcf.gz
omini=$gatk_bundle/1000G_omni2.5.hg38.vcf.gz
mills=$gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
outdir=~/project
## outdir directory
if [ ! -d $outdir/0.bwa ]
then mkdir -p $outdir/0.bwa
fi
if [ ! -d $outdir/gatk ]
then mkdir -p $outdir/gatk
fi
## start the gatk analysis
## with one sample
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" MarkDuplicates \
-I $outdir/0.bwa/$sample.hg38.sorted.bam \
-O $outdir/0.bwa/${sample}.sorted.marked.bam \
-M $outdir/0.bwa/$sample.metrics \
1>$outdir/0.bwa/${sample}_log.mark 2>&1 && echo "MarkDuplicates done!"
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" FixMateInformation \
-I $outdir/0.bwa/${sample}.sorted.marked.bam \
-O $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
-SO coordinate \
1>$outdir/0.bwa/${sample}_log.fix 2>&1
## 86 minutes
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" BaseRecalibrator \
-R $ref \
-I $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
--known-sites $dbsnp \
--known-sites $indel \
--known-sites $G1000 \
-O $outdir/0.bwa/${sample}_recal.table \
1>$outdir/0.bwa/${sample}_log.recal 2>&1 && echo "BaseRecalibrator done!"
## 45 minutes
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" ApplyBQSR \
-R $ref \
-I $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
-bqsr $outdir/0.bwa/${sample}_recal.table \
-O $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam \
1>$outdir/0.bwa/${sample}_log.ApplyBQSR 2>&1 && echo "ApplyBQSR done!"
## 449m for 16G data
## (--dbsnp $dbsnp is left out; a commented-out line inside the command would cut the line continuation)
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" HaplotypeCaller \
-R $ref \
-I $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam \
-O $outdir/gatk/${sample}.HC.vcf.gz \
1>$outdir/0.bwa/${sample}_log.HC 2>&1 && echo "HaplotypeCaller done!"
time $samtools index $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam && echo "** ${sample}.sorted.marked.fixed.bqsr.bam index done! **"
# VQSR
# SNP mode first; the quality of SNP and INDEL variant sites is evaluated separately
# SNP mode
time $GATK VariantRecalibrator \
-R $ref \
-V $outdir/gatk/$sample.HC.vcf.gz \
--max-gaussians 4 \
-resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
-resource:omini,known=false,training=true,truth=false,prior=12.0 $omini \
-resource:1000G,known=false,training=true,truth=false,prior=10.0 $G1000 \
-resource:snp,known=true,training=false,truth=false,prior=10.0 $dbsnp \
-an DP -an QD -an SOR -an ReadPosRankSum -an MQRankSum \
-mode SNP \
--rscript-file $outdir/gatk/${sample}.HC.snps.plots.R \
--tranches-file $outdir/gatk/${sample}.HC.snps.tranches \
-O $outdir/gatk/${sample}.HC.snps.recal
time $GATK ApplyVQSR \
-R $ref \
-V $outdir/gatk/$sample.HC.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file $outdir/gatk/$sample.HC.snps.tranches \
--recal-file $outdir/gatk/$sample.HC.snps.recal \
-mode SNP \
-O $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz && echo "** SNPs VQSR done **"
## Indel mode
time $GATK VariantRecalibrator \
-R $ref \
-V $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz \
--max-gaussians 6 \
-resource:mills,known=false,training=true,truth=true,prior=15.0 $mills \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode INDEL \
--rscript-file $outdir/gatk/${sample}.HC.snps.indels.plots.R \
--tranches-file $outdir/gatk/${sample}.HC.snps.indels.tranches \
-O $outdir/gatk/${sample}.HC.snps.indels.recal
time $GATK ApplyVQSR \
-R $ref \
-V $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file $outdir/gatk/$sample.HC.snps.indels.tranches \
--recal-file $outdir/gatk/$sample.HC.snps.indels.recal \
-mode INDEL \
-O $outdir/gatk/$sample.HC.snps.indels.VQSR.vcf.gz && echo "** SNPs and Indels VQSR $sample done **"
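A script with this many path variables is easy to break with one misspelled name, because bash expands an unset variable to an empty string without complaining. The sketch below compares the default behaviour with `set -u`; `dbsnp_typo` is a deliberately undefined, made-up name, and adding `set -u` to the real script is only a suggestion:

```shell
# Default: an unset variable silently expands to nothing.
default_result=$(set +u; echo "--known-sites $dbsnp_typo")

# With set -u, the subshell aborts as soon as the unset variable is used.
if (set -u; echo "--known-sites $dbsnp_typo") >/dev/null 2>&1; then
  strict_result="ran"
else
  strict_result="aborted"
fi
echo "default: '$default_result'  strict: $strict_result"
```

With the default settings a tool would receive `--known-sites` with no file after it; under `set -u` the script stops at the first undefined name instead.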
Run it in a loop over all samples:
bash CHD_Flow_multi.sh
The script contains:
(wes) root@1100150:~/project# cat CHD_Flow_multi.sh
cat config | while read sample ; do echo $sample; bash ~/project/wesFlow_multi_to_gvcf.sh $sample; done
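The loop hands each sample name to the per-sample script as its first argument. A minimal sketch of what the child process can actually see, assuming nothing has been exported; `child.sh` is a throwaway script and `S01` an invented sample name:

```shell
# Sketch: a child bash process does not inherit an unexported loop variable;
# it only receives what is passed on its command line.
tmpdir=$(mktemp -d)
cat > "$tmpdir/child.sh" <<'EOF'
echo "var=[${sample:-}] arg=[${1:-}]"
EOF
unset sample                             # make sure nothing is inherited
out=$(printf 'S01\n' | while read -r sample; do
  bash "$tmpdir/child.sh" "$sample"
done)
echo "$out"
```

The child sees the name only as `$1`, so the per-sample script has to read it from there (for example with `sample=$1`), or the loop would have to `export sample` before calling it.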
Switch to the directory that holds config and submit the job to the background:
cd ~/project/
nohup bash CHD_Flow_multi.sh &
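This is the first time I have used this single-sample calling script, so the run times noted in its comments may not all be accurate. I'll see how it goes and continue the analysis tomorrow.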