2020-07-23 Targeted Capture Sequencing Data Analysis Log 5

Author: 程凉皮儿 | Published 2020-07-23 15:24

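Foreword: I started my PICU rotation on July 16. Two nights ago I shadowed a night shift, and perhaps because newcomers are said to bring a rush of patients, admissions started at noon and kept us busy until after 3 a.m. Yesterday I was up a little after 6 a.m. to work; by the time I had finished rounds, written the orders, and completed the progress notes it was past 11. My brain was completely spent, so I went back to my place to catch up on sleep. Today I left for work a little after 6. Because I had to wait for some test results, I only processed one discharge, and everything was done by 12. I attended a group meeting a little after 1, and after 2 I finally had free time to look at the earlier results.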

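Checking the BAM conversion showed that some samples hit an unknown error and were not converted. I extracted the names of those samples, built a new list called config1, and ran the conversion again for them.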

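# Build config1 from the leftover temporary files
basename -a *bam.tmp.0000* >tmp
cat tmp| while read id; do sample=${id%%.hg38.sort*}; echo $sample; done >config1
# Delete the leftover temporary files
rm -rf *.bam.tmp.0*
# Activate the conda environment and restart the conversion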
conda activate wes
nohup cat config1 | while read id ; do bam=~/CHD_pooling_seq/${id}.dedup.bam; if [ ! -f ~/project/0.bwa/ok.${id}_marked.status ]; then echo "start CrossMap for ${id}" `date`; python /root/miniconda3/envs/py3/bin/CrossMap.py bam ~/biosoft/liftover/hg19ToHg38.over.chain.gz ${bam} ~/project/0.bwa/${id}.hg38 1>~/project/0.bwa/${id}_log.mark 2>&1; if [ $? -eq 0 ]; then touch ~/project/0.bwa/ok.${id}_marked.status; fi; echo "end CrossMap for ${id}" `date`; fi; done &
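
The commands above rely on two shell idioms: suffix stripping with `${id%%.hg38.sort*}` to recover the sample name, and a status file that marks a finished sample so a rerun skips it. The following is a minimal sketch of both, not the author's script. The helper names `sample_from_tmp` and `run_once`, the example file name, and the use of `true`/`false` in place of the CrossMap call are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the example file name are hypothetical.

# Strip everything from the first ".hg38.sort" onward, as ${id%%.hg38.sort*} does.
sample_from_tmp() {
    local name=$1
    printf '%s\n' "${name%%.hg38.sort*}"
}

# Run a command at most once per marker file: skip it if the marker exists,
# and create the marker only when the command exits with status 0.
run_once() {
    local marker=$1
    shift
    if [ -f "$marker" ]; then
        echo "skip: $marker already exists"
        return 0
    fi
    if "$@"; then
        touch "$marker"
    else
        echo "failed: $*" >&2
        return 1
    fi
}

# Demo with harmless commands in a temporary directory.
demo_dir=$(mktemp -d)
sample_from_tmp "S01.hg38.sorted.bam.tmp.0000.bam"
run_once "$demo_dir/ok.S01.status" true
run_once "$demo_dir/ok.S01.status" true
run_once "$demo_dir/ok.S02.status" false || echo "S02 will be retried on the next run"
rm -rf "$demo_dir"
```

Because the marker is written only after a successful exit, a sample that fails halfway stays eligible for the next pass over the list.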

Variant calling runs at the same time:

The per-sample calling script, wesFlow_multi_to_gvcf.sh:

(base) root@1100150:~/project# vi wesFlow_multi_to_gvcf.sh
(base) root@1100150:~/project# cat wesFlow_multi_to_gvcf.sh
#!/usr/bin/bash
# usage: bash ~/project/wesFlow_multi_to_gvcf.sh $sample
# This is a WES workflow for a single sample
sample=$1   # sample name, passed as the first argument

samtools=samtools
GATK=~/biosoft/gatk-4.1.7.0/gatk

#references
ref=~/reference/genome/Homo_sapiens_assembly38.fasta
gatk_ref=~/reference/genome/Homo_sapiens_assembly38.fasta
gatk_bundle=~/annotation/variation/GATK

dbsnp=$gatk_bundle/dbsnp_146.hg38.vcf.gz
indel=$gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
G1000=$gatk_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz
hapmap=$gatk_bundle/hapmap_3.3.hg38.vcf.gz
omini=$gatk_bundle/1000G_omni2.5.hg38.vcf.gz
mills=$gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

outdir=~/project

## outdir directory

if [ ! -d $outdir/0.bwa ]
then mkdir -p $outdir/0.bwa
fi

if [ ! -d $outdir/gatk ]
then mkdir -p $outdir/gatk
fi

## start the gatk analysis

## with one sample
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" MarkDuplicates \
    -I $outdir/0.bwa/$sample.hg38.sorted.bam \
    -O $outdir/0.bwa/${sample}.sorted.marked.bam \
    -M $outdir/0.bwa/$sample.metrics \
    1>$outdir/0.bwa/${sample}_log.mark 2>&1 && echo "MarkDuplicates done!"

time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" FixMateInformation \
    -I $outdir/0.bwa/${sample}.sorted.marked.bam \
    -O $outdir/0.bwa/${sample}.sorted.marked.fixed.bam \
    -SO coordinate \
    1>$outdir/0.bwa/${sample}_log.fix 2>&1
## 86 minutes
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp"  BaseRecalibrator \
    -R $ref  \
    -I $outdir/0.bwa/${sample}.sorted.marked.fixed.bam  \
    --known-sites $dbsnp \
    --known-sites $indel \
    --known-sites $G1000 \
    -O $outdir/0.bwa/${sample}_recal.table \
    1>$outdir/0.bwa/${sample}_log.recal 2>&1 && echo "BaseRecalibrator done!"
## 45 minutes
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp"   ApplyBQSR \
    -R $ref  \
    -I $outdir/0.bwa/${sample}.sorted.marked.fixed.bam  \
    -bqsr $outdir/0.bwa/${sample}_recal.table \
    -O $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam \
    1>$outdir/0.bwa/${sample}_log.ApplyBQSR  2>&1 && echo "ApplyBQSR done!"
## 449m for 16G data
## optional and disabled here: --dbsnp $dbsnp
## (kept outside the command: a comment line inside a backslash-continued command ends it early)
time $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" HaplotypeCaller \
    -R $ref  \
    -I $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam \
    -O $outdir/gatk/${sample}.HC.vcf.gz \
    1>$outdir/0.bwa/${sample}_log.HC 2>&1 && echo "HaplotypeCaller done!"

time $samtools index $outdir/0.bwa/${sample}.sorted.marked.fixed.bqsr.bam && echo "** ${sample}.sorted.marked.fixed.bqsr.bam index done! **"

# VQSR
# SNP mode first: the quality of SNP and INDEL variant sites is assessed separately
# SNP mode
time $GATK VariantRecalibrator \
    -R $ref \
    -V $outdir/gatk/$sample.HC.vcf.gz \
    --max-gaussians 4 \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 $hapmap \
    -resource:omini,known=false,training=true,truth=false,prior=12.0 $omini \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 $G1000 \
    -resource:snp,known=true,training=false,truth=false,prior=10.0 $dbsnp \
    -an DP -an QD -an SOR -an ReadPosRankSum -an MQRankSum \
    -mode SNP \
    --rscript-file $outdir/gatk/${sample}.HC.snps.plots.R \
    --tranches-file $outdir/gatk/${sample}.HC.snps.tranches \
    -O $outdir/gatk/${sample}.HC.snps.recal

time $GATK ApplyVQSR \
    -R $ref \
    -V $outdir/gatk/$sample.HC.vcf.gz \
    --truth-sensitivity-filter-level 99.0 \
    --tranches-file $outdir/gatk/$sample.HC.snps.tranches \
    --recal-file $outdir/gatk/$sample.HC.snps.recal \
    -mode SNP \
    -O $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz && echo "** SNPs VQSR done **"

## Indel mode
time $GATK VariantRecalibrator \
    -R $ref \
    -V $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz \
    --max-gaussians 6 \
    -resource:mills,known=false,training=true,truth=true,prior=15.0 $mills \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    --rscript-file $outdir/gatk/${sample}.HC.snps.indels.plots.R \
    --tranches-file $outdir/gatk/${sample}.HC.snps.indels.tranches \
    -O $outdir/gatk/${sample}.HC.snps.indels.recal

time $GATK ApplyVQSR \
    -R $ref \
    -V $outdir/gatk/$sample.HC.snps.VQSR.vcf.gz \
    --truth-sensitivity-filter-level 99.0 \
    --tranches-file $outdir/gatk/$sample.HC.snps.indels.tranches \
    --recal-file $outdir/gatk/$sample.HC.snps.indels.recal \
    -mode INDEL \
    -O $outdir/gatk/$sample.HC.snps.indels.VQSR.vcf.gz && echo "** SNPs and Indels VQSR $sample done **"
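
One way to switch an optional flag such as `--dbsnp` on and off without editing a backslash-continued command is to collect the arguments in a bash array. This is a minimal sketch under assumptions: `build_hc_args`, the global array `hc_args`, and the file names in the demo are hypothetical, and the sketch only prints the argument list instead of calling GATK.

```shell
#!/usr/bin/env bash
# Sketch only: build_hc_args and hc_args are hypothetical names.

# Fill the global array hc_args with HaplotypeCaller-style arguments.
# The optional flag is appended conditionally instead of being commented
# out in the middle of a continued command line.
build_hc_args() {
    local ref=$1 bam=$2 out=$3 dbsnp=${4:-}
    hc_args=(HaplotypeCaller -R "$ref" -I "$bam" -O "$out")
    if [ -n "$dbsnp" ]; then
        hc_args+=(--dbsnp "$dbsnp")
    fi
}

# Without the optional resource.
build_hc_args ref.fasta S01.bqsr.bam S01.HC.vcf.gz
printf '%s ' "${hc_args[@]}"; echo

# With the optional resource.
build_hc_args ref.fasta S01.bqsr.bam S01.HC.vcf.gz dbsnp.vcf.gz
printf '%s ' "${hc_args[@]}"; echo

# A real call would then look like: "$GATK" "${hc_args[@]}"
```

Quoting `"${hc_args[@]}"` passes each element as one argument, so paths containing spaces survive intact.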

Run it in a loop.
The contents of CHD_Flow_multi.sh are as follows:

(wes) root@1100150:~/project# cat CHD_Flow_multi.sh
cat config | while read sample ; do echo $sample; bash ~/project/wesFlow_multi_to_gvcf.sh $sample; done
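
A slightly more defensive form of this loop could skip blank lines, keep the per-sample command from reading the sample list on stdin, and report which samples failed. The sketch below is an assumption-laden illustration rather than the script actually used: `run_all` and `demo_worker` are hypothetical, and `demo_worker` stands in for the per-sample script.

```shell
#!/usr/bin/env bash
# Sketch only: run_all and demo_worker are hypothetical names.

# Call "$worker <sample>" for every non-empty line of a config file and
# print a line for each sample whose worker exits with a non-zero status.
run_all() {
    local config=$1 worker=$2 sample
    while IFS= read -r sample; do
        if [ -z "$sample" ]; then
            continue                      # skip blank lines
        fi
        # </dev/null stops the worker from consuming the rest of the list
        if ! "$worker" "$sample" </dev/null; then
            echo "FAILED $sample"
        fi
    done <"$config"
}

# Demo worker standing in for the per-sample script: it fails for S02 only.
demo_worker() { [ "$1" != "S02" ]; }

demo_config=$(mktemp)
printf 'S01\n\nS02\nS03\n' >"$demo_config"
run_all "$demo_config" demo_worker
rm -f "$demo_config"
```

The worker is passed by name, so a function or a small wrapper script around the real per-sample call can be substituted.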

Switch to the directory that contains config and submit the job to run in the background:

cd ~/project/
nohup bash CHD_Flow_multi.sh &
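
Once the background job has been running for a while, it can help to list the samples that still lack a final output file. A minimal sketch, assuming the final file is the `*.HC.snps.indels.VQSR.vcf.gz` written by the last ApplyVQSR step; `list_unfinished` is a hypothetical helper and the demo uses a temporary directory rather than the real project paths.

```shell
#!/usr/bin/env bash
# Sketch only: list_unfinished is a hypothetical helper. The suffix follows
# the final ApplyVQSR output name used in the per-sample script.

# Print every sample in the config file whose final VCF is missing or empty.
list_unfinished() {
    local config=$1 vcf_dir=$2 sample
    while IFS= read -r sample; do
        if [ -z "$sample" ]; then
            continue
        fi
        if [ ! -s "$vcf_dir/$sample.HC.snps.indels.VQSR.vcf.gz" ]; then
            printf '%s\n' "$sample"
        fi
    done <"$config"
}

# Demo in a temporary directory: S01 has an output file, S02 does not.
demo=$(mktemp -d)
printf 'S01\nS02\n' >"$demo/config"
echo data >"$demo/S01.HC.snps.indels.VQSR.vcf.gz"
list_unfinished "$demo/config" "$demo"
rm -rf "$demo"
```

On the real data this might be called as `list_unfinished ~/project/config ~/project/gatk`, assuming those are the locations in use. A non-empty file only shows that something was written, not that the step finished cleanly, so the logs are still worth checking.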

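This is the first time I have used this single-sample calling script, so the run times noted in its comments may not all be accurate. I'll see how it goes and continue the analysis tomorrow.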
