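Background: I have recently started working through whole-genome resequencing analysis for wheat, so here are my notes. The wheat genome is huge (about 17 Gb), so a pipeline that runs fine on other species can hit bugs on wheat that make you question your life choices.
Workflow: split chromosomes - build indexes - align - horizontal analysis (read coverage, mapping rate, sequencing depth) - vertical analysis (SNP, indel)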
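1. Splitting the chromosomes
(Step 0 is specific to my own project, which needs a synthetic pseudo reference genome. Most people will not need it and can jump straight to the real step 1.)
0. Get the line number of each chromosome header so that each chromosome can be cut out (the clumsiest possible method; if you know a better way, please tell me)
(First download the two reference genomes from EnsemblPlants)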
cat Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa |grep -n "dna:chromosome" >Aet_position.txt
cat Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa |grep -n "dna:chromosome" >WEW_position.txt
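Read off the position of each Aegilops_tauschii chromosome (Aet_position.txt)
Read off the position of each Triticum_dicoccoides chromosome (WEW_position.txt)
# Split out each chromosome by its line range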
sed -n '1,8372172p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >1D
sed -n '8372173,19233192p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >2D
sed -n '19233193,29686238p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >3D
sed -n '29686239,38453219p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >4D
sed -n '38453220,48076148p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >5D
sed -n '48076149,56343142p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >6D
sed -n '56343143,70578551p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >7D
sed -n '1,21402080p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >1AB
sed -n '21402081,47711240p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >2AB
sed -n '47711241,74300756p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >3AB
sed -n '74300757,97639496p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >4AB
sed -n '97639497,121190107p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >5AB
sed -n '121190108,143267599p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >6AB
sed -n '143267600,167984010p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >7AB
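For anyone who wants to avoid copying line numbers by hand, here is a minimal Python sketch of the same idea. It is not the method used above: it writes one file per chromosome (for example 1A and 1B separately, so the later merge would be cat 1A 1B 1D), and it assumes that the first word of each header is the chromosome name and that chromosome headers contain "dna:chromosome". The function name split_fasta_by_record and the demo data are made up for illustration.

```python
import os
import tempfile


def split_fasta_by_record(fasta_path, out_dir, keyword="dna:chromosome"):
    """Write each record whose header contains `keyword` to its own file.

    Each output file is named after the first word of the header (e.g. "1D").
    Records without the keyword (e.g. supercontigs) are skipped.
    Returns the list of names written, in file order.
    """
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: close the previous output, if any.
                if out is not None:
                    out.close()
                    out = None
                if keyword in line:
                    name = line[1:].split()[0]
                    out = open(os.path.join(out_dir, name), "w")
                    written.append(name)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return written


if __name__ == "__main__":
    # Tiny made-up FASTA, only to show the behaviour.
    demo = (">1D dna:chromosome demo\nACGT\nAC\n"
            ">2D dna:chromosome demo\nGG\n"
            ">S1 dna:supercontig demo\nTT\n")
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "demo.fa")
        with open(src, "w") as fh:
            fh.write(demo)
        print(split_fasta_by_record(src, tmp))
```

The file is read line by line, so memory use stays small even for a multi-gigabase FASTA.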
# Merge the chromosomes in the order you need. I want 1A1B1D, 2A2B2D, ...: first merge 1AB and 1D into 1ABD, then concatenate 1ABD 2ABD 3ABD ... 7ABD into the final file, called pseudo_wheat_ref
cat 1AB 1D >1ABD
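# Merge groups 2-7 the same way
# Each header line carries a lot of extra information, so the headers need renaming: first find the line numbers of the header lines, then replace those lines. I cannot write scripts, so I replaced them one at a time, which is tedious. The commands: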
cat -n 1ABD |grep "dna:chromosome"
sed '1c >chr1A' 1ABD >chr1A
sed '9893116c >chr1B' chr1A >chr1B
sed '21402081c >chr1D' chr1B >chr1D
# Rename the result to chr1ABD
mv chr1D chr1ABD
# After replacing, check that nothing was changed by mistake
cat -n chr1ABD |grep "chr"
cat -n chr2ABD |grep "chr"
cat -n chr3ABD |grep "chr"
cat -n chr4ABD |grep "chr"
cat -n chr5ABD |grep "chr"
cat -n chr6ABD |grep "chr"
cat -n chr7ABD |grep "chr"
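# Output in the following form means the renaming is correct
# <line number> >chr1A
# <line number> >chr1B
# <line number> >chr1D
# Search again for > to see whether any other sequences are present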
cat -n chr1ABD |grep ">"
cat -n chr2ABD |grep ">"
cat -n chr3ABD |grep ">"
cat -n chr4ABD |grep ">"
cat -n chr5ABD |grep ">"
cat -n chr6ABD |grep ">"
cat -n chr7ABD |grep ">"
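# The first six are clean, but the seventh contains supercontigs whose headers start with >, and I decided to drop them.
# Note: grep -v "dna:supercontig" would delete only the header lines and leave the supercontig sequence attached to the preceding chromosome, so drop whole records instead:
awk '/^>/{keep = !/dna:supercontig/} keep' chr7ABD >chr7ABD_pure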
Merge
cat chr1ABD chr2ABD chr3ABD chr4ABD chr5ABD chr6ABD chr7ABD_pure >pure_ref_wheat.fasta
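The header renaming and the supercontig removal in step 0 could also be done in a single pass. The sketch below is one possible way, not the commands used above. It assumes that every chromosome header contains "dna:chromosome" and starts with the chromosome name (for example "7D"), and the function name clean_records is hypothetical.

```python
def clean_records(lines, keyword="dna:chromosome", prefix="chr"):
    """Yield FASTA lines, keeping only records whose header contains `keyword`.

    Kept headers are shortened to ">" + prefix + first word of the header,
    e.g. ">7D dna:chromosome ..." becomes ">chr7D".
    Records without the keyword are dropped together with their sequence.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = keyword in line
            if keep:
                yield ">" + prefix + line[1:].split()[0] + "\n"
        elif keep:
            yield line


if __name__ == "__main__":
    # Made-up records, only to show the behaviour.
    demo = [
        ">7D dna:chromosome demo\n", "ACGT\n",
        ">Sc1 dna:supercontig demo\n", "TTTT\n",
    ]
    for out_line in clean_records(demo):
        print(out_line, end="")
```

Because it works on any iterable of lines, it can be fed an open file handle and written straight to the output file without loading the genome into memory.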
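1. Splitting the chromosomes
1.1 Because the wheat genome is so large, the aligner can still produce SAM/BAM files, but samtools index then fails. My error is shown below (it only cost me two weeks :( )
# Indexing fails when the reference genome has not been split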
samtools index -@ 20 4AL_TTD.sort.uniq_mkdup.bam
samtools index error when indexing a BAM aligned to the unsplit reference genome
The explanation:
read mapping with mappers such as BWA, Tophat or STAR as the BAM output format used by these mappers limits the reference contig size to (2^29 - 1) bp (512 MB). Strictly speaking, the BAM files will be valid, however they cannot be indexed with "samtools index" so that random access to chromosomal regions is not possible.
samtools index --help does list an option for CSI indices, but CSI is hard to use in the downstream GATK analysis, so the chromosomes have to be split
-m INT Set minimum interval size for CSI indices to 2^INT [14]
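To see in advance which sequences would run into the limit quoted above, the lengths in a FASTA index (.fai, as written by samtools faidx) can be compared with 2^29 - 1. This is a small sketch; the function names and the demo lengths are made up, and only the first two .fai columns (name and length) are used.

```python
BAI_MAX_CONTIG = 2**29 - 1  # 536,870,911 bp, the limit quoted above


def read_fai_lengths(lines):
    """Parse .fai lines (name<TAB>length<TAB>...) into a {name: length} dict."""
    lengths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def contigs_over_limit(lengths, limit=BAI_MAX_CONTIG):
    """Return the names of sequences longer than `limit`, in input order."""
    return [name for name, length in lengths.items() if length > limit]


if __name__ == "__main__":
    # Made-up lengths, only for illustration.
    demo_fai = [
        "chrDemo1\t700000000\t10\t60\t61\n",
        "chrDemo2\t400000000\t20\t60\t61\n",
    ]
    print(contigs_over_limit(read_fai_lengths(demo_fai)))
```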
1.2 Chromosome-splitting script [Ref 1]
touch split_wheat_chrom.py
vim split_wheat_chrom.py
import argparse
from itertools import product

from Bio import SeqIO

parser = argparse.ArgumentParser()
parser.add_argument("fasta")
args = parser.parse_args()

# [sequence name, split position]; chrUn is written out unsplit
chr2_split_position = [['chr1A', 416530229], ['chr1B', 485877090], ['chr1D', 398856833], ['chr2A', 441318416], ['chr2B', 452187122], ['chr2D', 471155576], ['chr3A', 331357656], ['chr3B', 422034810], ['chr3D', 495018548], ['chr4A', 370990501], ['chr4B', 459574265], ['chr4D', 410577824], ['chr5A', 431266944], ['chr5B', 460582272], ['chr5D', 404944879], ['chr6A', 451707583], ['chr6B', 441332031], ['chr6D', 448911032], ['chr7A', 388979184], ['chr7B', 445475846], ['chr7D', 431442263], ['chrUn','null']]

with open(args.fasta) as handle:
    for record, ch in product(SeqIO.parse(handle, "fasta"), chr2_split_position):
        if record.id == ch[0] and ch[0] != 'chrUn':
            # part1 is everything before the split position, part2 is the rest
            print(">" + record.id + '_part1\n' + str(record.seq[:int(ch[1]) - 1]))
            print(">" + record.id + '_part2\n' + str(record.seq[int(ch[1]) - 1:]))
        if record.id == ch[0] and ch[0] == 'chrUn':
            print(">" + record.id + '\n' + str(record.seq))
:wq
# Run the script
python split_wheat_chrom.py pure_ref_wheat.fasta >pure_ref_wheat_parts.fasta
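The split positions in the script come from the cited Chinese Spring post, so for a different assembly it may be worth checking that both parts of each chromosome really end up below the (2^29 - 1) bp limit. The sketch below mirrors the slicing used in the script (part1 is seq[:position - 1]). The helper names split_at and parts_fit are hypothetical, and the total length in the demo is made up.

```python
BAI_MAX_CONTIG = 2**29 - 1  # contig size limit for samtools index (.bai)


def split_at(seq, position):
    """Split `seq` the same way as the script: part1 = seq[:position - 1]."""
    cut = position - 1
    return seq[:cut], seq[cut:]


def parts_fit(total_length, position, limit=BAI_MAX_CONTIG):
    """Return True if both parts of a chromosome stay within `limit`."""
    cut = position - 1
    return cut <= limit and (total_length - cut) <= limit


if __name__ == "__main__":
    part1, part2 = split_at("ACGTACGTAC", 4)
    # The two parts must always add back up to the original sequence.
    assert part1 + part2 == "ACGTACGTAC"
    # 800,000,000 is a made-up chromosome length, split at the chr1A position.
    print(parts_fit(800_000_000, 416530229))
```

Chromosome lengths for the check can be taken from the .fai index of the unsplit FASTA.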
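1.3 Installing the module the split script needs
Python is required. I use conda, so Python is already there; the only missing piece is Biopython, which provides the "from Bio import SeqIO" used in the script. Install it with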
pip install biopython
From the official site:
Recent versions of Python (starting with Python 2.7.9 and Python 3.4) include the Python package management tool pip, which allows an easy installation from the command line on all platforms.
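References:
[Ref 1] [Continued] Release of version 2.0 of the Chinese Spring genome
See the official Biopython website for details
Merging chromosomes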
2. Building indexes
2.1 Index the reference genome for alignment
mkdir 0.index && cd 0.index
# Index the split reference, since that is the file used for mapping in step 3
bwa index ~/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta
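2.2 Build a sequence dictionary (.dict); without it GATK fails when generating the VCF
# -R: input reference genome, either fasta or fasta.gz
# -O: output file, a SAM file that contains only the sequence dictionary; by default it takes the basename of the input file and ends in .dict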
gatk CreateSequenceDictionary -R /home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta -O pure_ref_wheat_parts.dict
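In the generated .dict, the second column is the sequence (chromosome) name and the third column is the sequence length.
Contents of the .dict file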
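The two columns mentioned above are the SN (name) and LN (length) tags of the @SQ lines. As a rough sketch, they can be pulled into a Python dict like this; the function name parse_sequence_dict is hypothetical and the demo lines are made up.

```python
def parse_sequence_dict(lines):
    """Collect {sequence name: length} from the @SQ lines of a .dict file."""
    lengths = {}
    for line in lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header lines
        fields = line.rstrip("\n").split("\t")[1:]
        # Each field looks like TAG:value; split only on the first colon.
        tags = dict(field.split(":", 1) for field in fields)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


if __name__ == "__main__":
    # Made-up header lines in the .dict layout.
    demo = [
        "@HD\tVN:1.6\n",
        "@SQ\tSN:chrDemo_part1\tLN:1000\tM5:0123abcd\tUR:file:/tmp/demo.fasta\n",
        "@SQ\tSN:chrDemo_part2\tLN:800\n",
    ]
    print(parse_sequence_dict(demo))
```

A quick sanity check after splitting is that no length in the result exceeds 2^29 - 1.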
2.3 Build the FASTA index (.fai)
nohup samtools faidx pure_ref_wheat_parts.fasta &
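3. Alignment
3.1 Align to the reference genome and sort
Why sort? bwa mem writes alignments in the order of the input reads, whereas samtools index and most downstream tools (including GATK) expect a BAM sorted by genomic coordinate, so the output is piped through samtools sort [Ref 2]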
cat >mapping.sh
vim mapping.sh
ref="/home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta"
fa="/home/huawei/raw_data/YSQ/4AL_resequence/input"
bwa mem -t 20 -R "@RG\tID:4AL_resequence\tSM:4AL_resequence\tLB:WGS\tPL:Illumina" \
$ref $fa/4AL_1.clean.fq.gz $fa/4AL_2.clean.fq.gz |samtools sort -@ 20 -o 4AL_TTD.sort.bam - 1>4AL_log.mark 2>&1
:wq
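3.2 Keep uniquely mapped reads
Filter the aligned reads
# -h keep the header; without it GATK fails later
# -q keep only reads with mapping quality (MAPQ) of at least the given integer, here 1
# -F drop reads that have any of the given FLAG bits set (FLAG is column 2): 4 = read unmapped, 256 = secondary alignment
# -v (grep) invert the match; XA:Z marks alternative hits and SA:Z marks supplementary (chimeric) alignments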
samtools view -@ 20 -h -q 1 -F 256 4AL_TTD.sort.bam |grep -v XA:Z |grep -v SA:Z |samtools view -@ 20 -b - >4AL_TTD.sort.uniq.bam
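To make the filter logic explicit, here is a rough Python equivalent of the test applied to each SAM record. It is a sketch, not a replacement for the pipeline: the function name is made up, and unlike grep, which matches XA:Z or SA:Z anywhere in the line, it only looks at the optional tag fields (column 12 onwards).

```python
def is_unique_alignment(sam_line, min_mapq=1):
    """Return True if a SAM record passes the uniqueness filter.

    Mirrors: samtools view -q 1 -F 256 | grep -v XA:Z | grep -v SA:Z
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    tags = fields[11:]      # column 12 onwards: optional tags
    if mapq < min_mapq:
        return False        # -q 1 (unmapped reads have MAPQ 0, so they go too)
    if flag & 256:
        return False        # -F 256: secondary alignment
    if any(tag.startswith(("XA:Z:", "SA:Z:")) for tag in tags):
        return False        # alternative hits or chimeric alignment
    return True


if __name__ == "__main__":
    # A made-up record, only to show the call.
    demo = "r1\t99\tchrDemo\t100\t60\t5M\t=\t300\t205\tACGTA\tFFFFF\tNM:i:0"
    print(is_unique_alignment(demo))
```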
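4. Horizontal analysis: sequencing statistics
4.1 Mapping rate
# samtools flagstat reports the number and percentage of mapped reads
nohup samtools flagstat 4AL_TTD.sort.bam >./4AL_TTD.sort.bam.flagstat &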
4.2 Sequencing depth
# -a: report every position, including those with zero depth
nohup samtools depth -a 4AL_TTD.sort.bam >./4AL_TTD.sort.bam.depth &
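The samtools depth output has three tab-separated columns (sequence name, position, depth), so the mean depth and the fraction of positions covered by at least one read can be summarised with a few lines of Python. This is a minimal sketch with a hypothetical function name and made-up demo data; it reads line by line because the depth file for a genome of this size is very large.

```python
def summarize_depth(lines):
    """Summarise 'name<TAB>pos<TAB>depth' lines.

    Returns (positions, mean_depth, fraction_covered), where
    fraction_covered is the share of positions with depth > 0.
    """
    positions = 0
    covered = 0
    total_depth = 0
    for line in lines:
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        depth = int(fields[2])
        positions += 1
        total_depth += depth
        if depth > 0:
            covered += 1
    if positions == 0:
        return 0, 0.0, 0.0
    return positions, total_depth / positions, covered / positions


if __name__ == "__main__":
    # Made-up depth lines, only for illustration.
    demo = ["chrDemo\t1\t0\n", "chrDemo\t2\t4\n", "chrDemo\t3\t2\n", "chrDemo\t4\t2\n"]
    print(summarize_depth(demo))
```

Passing an open file handle (for example open("4AL_TTD.sort.bam.depth")) works the same way as the demo list.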
4.3 Read coverage
nohup genomeCoverageBed -ibam 4AL_TTD.sort.bam -g /home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pseudo_wheat_ref.fa -bga >./4AL_TTD_cov.bedgraph &
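5. Vertical analysis: variants
5.1.1 Mark PCR duplicates
Purpose of this step:
• Mark or remove reads that are PCR duplicates
• Make the later variant calls more reliable by removing false positives
Software: GATK4 (MarkDuplicates)
Open question: what does "-Xmx20G -Djava.io.tmpdir=./" do? With it the job used a lot of CPU; without it the job failed. (These are JVM options: -Xmx20G caps the Java heap at 20 GB, and -Djava.io.tmpdir=./ puts temporary files in the current directory instead of the system temp directory.)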
mkdir gatk_markdup && cd gatk_markdup
cat >gatk_markdup.sh
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I 4AL_TTD.sort.uniq.bam -O 4AL_TTD.sort.uniq_mkdup.bam -M 4AL_TTD_mkdup.metrics 1>4AL_TTD_mkdup_log.mark 2>&1
Index the BAM; the index is required before the BAM can be used for VCF generation
samtools index -@ 20 4AL_TTD.sort.uniq_mkdup.bam
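5.1.2 Assign all reads to a read group
This step simply replaces the read-group header set during mapping, "@RG\tID:4AL_resequence\tSM:4AL_resequence\tLB:WGS\tPL:Illumina", with -LB WGS -PL illumina -PU bwa -SM 4AL_TTD. If the read group was already set at the mapping step, this step can be skipped
# List the available tools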
gatk --list
# AddOrReplaceReadGroups: Assigns all the reads in a file to a single new read-group.
# Show the arguments of AddOrReplaceReadGroups
gatk AddOrReplaceReadGroups
# Required arguments: input, output, -LB library, -PL sequencing platform, -PU platform unit (e.g. run barcode), -SM sample name:
#--INPUT,-I:String Input file (BAM or SAM or a GA4GH url). Required.
#--OUTPUT,-O:File Output file (BAM or SAM). Required.
#--RGLB,-LB:String Read-Group library Required.
#--RGPL,-PL:String Read-Group platform (e.g. illumina, solid) Required.
#--RGPU,-PU:String Read-Group platform unit (eg. run barcode) Required.
#--RGSM,-SM:String Read-Group sample name Required.
cat >add_bam.sh
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" AddOrReplaceReadGroups -I 4AL_TTD.sort.uniq_mkdup.bam -O 4AL_TTD.sort.uniq_mkdup_add.bam -LB WGS -PL illumina -PU bwa -SM 4AL_TTD
Index the BAM, as in 5.1.1
samtools index -@ 20 4AL_TTD.sort.uniq_mkdup_add.bam
5.2 Calling variants
5.2.1 Generate the raw VCF (make sure the FASTA has both an index and a .dict)
mkdir vcf && cd vcf
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller -R /home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta -I /home/wdd/WGS/4AL_TTD/1.mapping/4AL_TTD.sort.uniq_mkdup_add.bam -O 4AL_TTD_raw.vcf 1>4AL_TTD_log.HC 2>&1
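5.2.1.0 Running HaplotypeCaller in parallel
After splitting there are 42 sequences. Because the sample differs a lot from the reference, this step takes about 24 h per sequence, and a power cut can hit in the middle of a run. What to do? Run one HaplotypeCaller job per sequence in the background, and skip any sequence whose output file already exists: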
touch chr_vcf.sh
vim chr_vcf.sh
REF='/home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta'
bam='/home/wdd/WGS/4AL_TTD/1.mapping/4AL_TTD.sort.uniq_mkdup_add.bam'
# Sequence names are taken from the FASTA header lines
chroms=($(grep '>' $REF |sed 's/>//' | tr '\n' ' '))
for chr in ${chroms[@]}
do
    # Skip sequences that already have an output file, so the script can be rerun
    if [ ! -f 4AL.${chr}.vcf.gz ]; then
        gatk HaplotypeCaller -R $REF -I $bam --genotyping-mode DISCOVERY \
            --intervals ${chr} --sample-ploidy 6 \
            -O 4AL.${chr}.vcf.gz &
    fi
done && wait
This is very fast, but there is a catch: running all 42 sequences at once will overwhelm the server. So I run them in 3 batches, listed in cs1.bed, cs2.bed and cs3.bed
cat cs1.bed
#chr1A_part1
#chr1A_part2
#chr1B_part1
#chr1B_part2
#chr1D_part1
#chr1D_part2
#chr2A_part1
#chr2A_part2
#chr2B_part1
#chr2B_part2
#chr2D_part1
#chr2D_part2
#chr3A_part1
touch chr.sh
vim chr.sh
REF='/home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta'
bam='/home/wdd/WGS/4AL_TTD/1.mapping/4AL_TTD.sort.uniq_mkdup_add.bam'
# Sequence names for this batch come from the first column of cs1.bed
chroms=($(awk '{print $1}' cs1.bed |tr '\n' ' '))
for chr in ${chroms[@]}
do
    # Skip sequences that already have an output file, so the script can be rerun
    if [ ! -f 4AL.${chr}.vcf.gz ]; then
        gatk HaplotypeCaller -R $REF -I $bam --genotyping-mode DISCOVERY \
            --intervals ${chr} --sample-ploidy 6 \
            -O 4AL.${chr}.vcf.gz &
    fi
done && wait
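The three batch files can be written by hand, but the grouping itself is easy to generate. The sketch below builds the 42 sequence names from the chr1A ... chr7D and _part1/_part2 naming used above and splits them into equal consecutive batches. The function name make_batches is hypothetical, and writing each batch to cs1.bed, cs2.bed and cs3.bed is left out.

```python
def make_batches(items, n_batches):
    """Split `items` into `n_batches` consecutive groups of near-equal size."""
    size, extra = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def wheat_part_names():
    """Names of the split sequences: chr1A_part1, chr1A_part2, ..., chr7D_part2."""
    return [
        f"chr{group}{sub}_part{part}"
        for group in range(1, 8)
        for sub in "ABD"
        for part in (1, 2)
    ]


if __name__ == "__main__":
    for number, batch in enumerate(make_batches(wheat_part_names(), 3), start=1):
        print(f"batch {number}: {len(batch)} sequences, {batch[0]} to {batch[-1]}")
```

How many jobs a server can take at once depends on its memory and cores, so the number of batches is something to tune rather than a fixed rule.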
Each sequence gets its own VCF and matching index
Generated VCFs and indexes
5.2.1.1 Merging the per-sequence VCFs after the parallel run
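# SM is the sample prefix of the per-sequence VCFs written above (4AL.<sequence>.vcf.gz)
SM=4AL
# cs.bed is assumed to list all sequence names, one per line
chroms=($(awk '{print $1}' cs.bed |tr '\n' ' '))
merge_vcfs=""
for chr in ${chroms[@]}; do
    merge_vcfs=${merge_vcfs}" -I ${SM}.${chr}.vcf.gz"
done && gatk MergeVcfs ${merge_vcfs} -O ${SM}.HC.vcf.gz && echo "Vcfs have been successfully merged"
[GATK 4] How to improve the efficiency of HaplotypeCaller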
5.2.2 Select and filter SNPs
SNP/indel detection was performed using the GATK HaplotypeCaller (version 3.5-0 g36282e4) set for diploids with default filtering settings [54]. SNPs were preliminarily filtered using GATK VariantFiltration with the parameter --filterExpression "QD < 2.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0 || MQ < 40.0." The filtering settings for indels were "QD < 2.0, FS > 200.0," and "ReadPosRankSum < -20.0."
gatk SelectVariants -select-type SNP -V 4AL_resequence.vcf -O ./snp/4AL_resequence.snp.vcf
cd snp
gatk VariantFiltration -V 4AL_resequence.snp.vcf --filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "Filter" -O 4AL_filter.snp.vcf
gatk SelectVariants --exclude-filtered true -V 4AL_filter.snp.vcf -O 4AL_filtered.snp.vcf
5.2.3 Select and filter indels
gatk SelectVariants -select-type INDEL -V 4AL_resequence.vcf -O ./indel/4AL_resequence.indel.vcf
cd indel
gatk VariantFiltration -V 4AL_resequence.indel.vcf --filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 200.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -20.0" --filter-name "Filter" -O 4AL_filter.indel.vcf
gatk SelectVariants --exclude-filtered true -V 4AL_filter.indel.vcf -O 4AL_filtered.indel.vcf
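To make the thresholds in the two filter expressions easier to read, here they are written out as data with a small checking function. This is only an illustration of the logic, with hypothetical names. In particular, skipping a rule when its annotation is missing is a simplifying assumption of this sketch and should not be taken as a description of how GATK evaluates the expression.

```python
# (annotation, comparison, threshold): a variant is filtered if any rule is true.
SNP_RULES = [
    ("QUAL", "<", 30.0), ("QD", "<", 2.0), ("MQ", "<", 40.0),
    ("FS", ">", 60.0), ("SOR", ">", 3.0),
    ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0),
]
INDEL_RULES = [
    ("QUAL", "<", 30.0), ("QD", "<", 2.0), ("MQ", "<", 40.0),
    ("FS", ">", 200.0), ("SOR", ">", 3.0),
    ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -20.0),
]


def fails_hard_filter(annotations, rules=SNP_RULES):
    """Return True if any rule is triggered by the given annotation values."""
    for key, comparison, threshold in rules:
        value = annotations.get(key)
        if value is None:
            continue  # assumption of this sketch: missing annotation, rule skipped
        if comparison == "<" and value < threshold:
            return True
        if comparison == ">" and value > threshold:
            return True
    return False


if __name__ == "__main__":
    # Made-up annotation values, only for illustration.
    print(fails_hard_filter({"QUAL": 50.0, "QD": 1.5}))
    print(fails_hard_filter({"QUAL": 50.0, "FS": 100.0}, INDEL_RULES))
```

The only differences between the SNP and indel rule sets are the FS and ReadPosRankSum thresholds.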
SNPs that did not meet the following criteria were further excluded: (1) a total read depth (DP) > 240 and < 2200; (2) minor allele frequency (MAF) ≥0.05 for each population, and for Ae. tauschii (n = 5), MAF should be ≥ 0.2; (3) a maximum missing rate < 0.1; and (4) biallelic alleles.
Viewing in IGV
A must-read beginner tutorial for IGV
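Supplement:
1. VCF filtering
# --vcf input file in VCF format
# --minDP minimum sequencing depth
# --maxDP maximum sequencing depth
# --maf minimum minor allele frequency
# --max-missing <float> completeness, between 0 and 1: 0 allows sites that are entirely missing and 1 allows no missing data, so a missing rate below 0.1 corresponds to 0.9
vcftools --vcf 4AL_filtered.snp.vcf --minDP 240 --maxDP 2200 --maf 0.05 --max-missing 0.9 --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out 4AL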
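For reference, this is roughly what the site-level criteria from the quoted paper (biallelic, MAF of at least 0.05, missing rate below 0.1) mean when computed from the genotypes at one site. It is a sketch with a hypothetical function name; it assumes VCF-style genotype strings such as 0/1 or ./. and leaves the depth criterion out.

```python
def passes_site_filters(genotypes, n_alt_alleles, min_maf=0.05, max_missing_rate=0.1):
    """Check one site against the biallelic, MAF and missing-rate criteria.

    genotypes: list of strings such as "0/0", "0/1", "1|1" or "./."
    n_alt_alleles: number of ALT alleles listed for the site
    """
    if n_alt_alleles != 1 or not genotypes:
        return False  # not biallelic, or no samples at all
    missing = 0
    counts = {"0": 0, "1": 0}
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            missing += 1
            continue
        for allele in alleles:
            counts[allele] = counts.get(allele, 0) + 1
    if missing / len(genotypes) >= max_missing_rate:
        return False
    called = sum(counts.values())
    if called == 0:
        return False
    maf = min(counts["0"], counts["1"]) / called
    return maf >= min_maf


if __name__ == "__main__":
    # Made-up genotypes for ten samples.
    demo = ["0/0"] * 8 + ["0/1", "1/1"]
    print(passes_site_filters(demo, n_alt_alleles=1))
```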
A detailed guide to vcftools usage
On SNP filtering (2): how to filter SNPs with vcftools
SNP and indel annotations were performed according to the wheat genome annotation using the software SnpEff (version 4.3p)
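2. CNV: finding copy number variation
Still learning
DNA copy number variation (CNV) detection: basic concepts
Installing cnvnator
CNV detection with CNVnator
A short introduction to installing and using cnvnator
CNVnator
3. Plotting with circos to display SNPs, indels and SVs
Still learning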