「博客翻译」SNP过滤教程（一）

作者: xuzhougeng | 来源:发表于2019-05-10 16:16 被阅读43次

「博客翻译」SNP过滤教程（二）
「博客翻译」SNP过滤教程（一）
SNP过滤教程（通用过滤）
关于SNP的过滤（2）：如何使用vcftools进行SNP过滤
SNP过滤
pytest的中文翻译
SNP过滤时常用参数的解析（群体遗传）
SNP过滤时常用参数的解析（群体遗传）
vcftools过滤SNP数据2021-02-25
博客过滤效果Blogger Presets Lr预设系列

原文地址: SNP Filtering Tutorial

感觉文章写的真棒，于是就翻译了一遍

目标

学习如何使用VCFtools根据缺失信息，基因型深度(genotype depth)，位点质量得分，次等位基因频率和基因型总深度(genotype call depth)去过滤VCF文件
学习使用vcflib过滤 FreeBayes分析RAD数据后得到VCF文件
根据群体内HWE过滤VCF
学习将VCF文件分开输出为SNP和INDEL
使用单倍型鉴定脚本（haplotyping scripts）进一步过滤SNP，处理旁系同源基因和基因型识别时的报错

正文

欢迎大家来到SNP过滤练习部分。我们第一部分的练习适用于基本上所有的VCF文件。第二部分练习，我们将会处理FreeBayes得到的VCF文件。当然，其他SNP caller也能输出类似的注释信息，所以主要学习分析思想，而不是直接套用代码。

让我们先来创建分析的工作文件目录

mkdir filtering
cd filtering

然后大家根据我提供的微云地址下载后续需要的数据，地址为https://share.weiyun.com/5VeHoQ3 密码：xqeyqf

unzip data.zip
ls -l

文件信息

或者你可以尝试原文提供的下载方式。

通用过滤

在热身阶段，我们会用VCFtools(http://vcftools.sourceforge.net) 对我们的vcf文件进行一波过滤。该程序由一个二进制运行程序和一些perl脚本组成，用途就和名字一样，就是用于处理VCF。(PS: 作者认为0.1.11版本的vcftools有更多使用的过滤命令，而我用的是0.1.17)

我们先处理 raw.vcf , 该文件存在很多有可能出错的位点，并且还有很多位点只在一个样本出现。我们将会分成三步进行讲解

第一步: 我们只保留那些一般群体中出现的位点，并且最低的质量分数不低于30，次等位基因深度(minor allele count)不少于3

vcftools --gzvcf raw.vcf.gz --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out raw.g5mac3

在这行代码中，我们用--gzvcf传递了一个压缩的vcf文件，--max-missing 0.5表示过滤掉低于缺失率高于50%的位点，--mac 3则是过滤次等位基因深度低于3。该参数设置和具体的群体有关，这里用的群体至少包含了1个纯合个体，1个或3个杂合个体。

--recode表示输出过滤后的VCF文件，--recode-INFO-all表示包含原来文件中所有INFO信息，--out则是输出文件的前缀。

代码运行信息可能会是下面这个样子

After filtering, kept 40 out of 40 Individuals
Outputting VCF file...
After filtering, kept 78434 out of a possible 147540 Sites
Run Time = 13.00 seconds

两个简单的过滤标准就干掉了将近50%的数据，这样子后面一些步骤运行速度也会更快了

vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3

现在，我么有了一个过滤后的VCF文件，即raw.g5mac3.recode.vcf 。下一步就是根据最低深度进行过滤(还可以加上平均深度)

vcftools --vcf raw.g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --out raw.g5mac3dp3

这一步我们保留了在至少有3个read的位点。由于GATK/FreeBayes在群体 SNP calling时会同时分析所有的样本，因此即便位点的覆盖度低，可信度也很高。

当然作者还担心你不相信，于是还提供了一个脚本帮你评估可能的错误率

curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/ErrorCount.sh
chmod +x ErrorCount.sh 
./ErrorCount.sh raw.g5mac3dp3.recode.vcf

输出信息如下

This script counts the number of potential genotyping errors due to low read depth
It report a low range, based on a 50% binomial probability of observing the second allele in a heterozygote and a high range based on a 25% probability.
Potential genotyping errors from genotypes from only 1 read range from 0 to 0.0
Potential genotyping errors from genotypes from only 2 reads range from 0 to 0.0
Potential genotyping errors from genotypes from only 3 reads range from 15986 to 53714.22
Potential genotyping errors from genotypes from only 4 reads range from 6230 to 31502.04
Potential genotyping errors from genotypes from only 5 reads range from 2493 to 18914
40 number of individuals and 78434 equals 3137360 total genotypes
Total genotypes not counting missing data 2380094
Total potential error rate is between 0.0103815227466 and 0.0437504821238
SCORCHED EARTH SCENARIO
WHAT IF ALL LOW DEPTH HOMOZYGOTE GENOTYPES ARE ERRORS?????
The total SCORCHED EARTH error rate is 0.129149100834.

目前来看，我们的VCF中基因型由于覆盖度低于5的最大错误率低于5%，因此没有啥可以操心。

下一步则是删除那些测序结果不是很好的样本，也就是缺失值过多的样本。先统计下每个样本的缺失比例

vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv
cat out.imiss

在结果中你发现部分个体的缺失率高达99.6%，这个样本是必然要被删掉的。不过在此之前，我们可以用柱状图来看下分布

mawk '!/IN/' out.imiss | cut -f5 > totalmissing
gnuplot << \EOF 
set terminal dumb size 120, 30
set autoscale 
unset label
set title "Histogram of % missing data per individual"
set ylabel "Number of Occurrences"
set xlabel "% of missing data"
#set yr [0:100000]
binwidth=0.01
bin(x,width)=width*floor(x/width) + binwidth/2.0
plot 'totalmissing' using (bin($1,binwidth)):(1.0) smooth freq with boxes
pause -1
EOF

缺失分布

绝大部分的个体的缺失率低于0.5，因此我们用0.5作为我们的阈值进行过滤。

那么问题来了，怎么挑选出缺失率大于50%的个体呢？下面的代码是其中一种解决方法

mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv

下一步我们用vcftools根据提供的样本编号进行过滤

vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm

作者提供了自动化的过滤脚本方便，用于完成刚才的几步。这个脚本会自动推断出比较好的阈值，不过你可以选择no然后自己输入一个阈值。

curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/filter_missing_ind.sh
chmod +x filter_missing_ind.sh
./filter_missing_ind.sh raw.g5mac3dp3.recode.vcf DP3g95maf05

在过滤一些样本后，我们可以根据最大缺失值比例，平均深度和次等位基因频率(MAF)进行进一步过滤

vcftools --vcf raw.g5mac3dplm.recode.vcf --max-missing 0.95 --maf 0.05 --recode --recode-INFO-all --out DP3g95maf05 --min-meanDP 20

这一步过后，我们得到了12,754个候选位点。

不过上面是对群体的所有样本进行过滤。假如你的群体来自于多个区域，你想对不同的群体的样本分别过滤，那么应该怎么做呢？VCFtools工具本身不支持这个功能，但是我们可以将该任务进行拆分成不同步骤，达到我们想要的结果。

你得提供一个包含样本来源信息的文本，比如说我们解压缩后得到的popmap

head popmap 
BR_002  BR
BR_004  BR
BR_006  BR
BR_009  BR
BR_013  BR
BR_015  BR
BR_016  BR
BR_021  BR
BR_023  BR
BR_024  BR

然后我们按照第二列的来源信息对文件进行拆分

mawk '$2 == "BR"' popmap > 1.keep && mawk '$2 == "WL"' popmap > 2.keep

下一步，用VCFtools分别估计不同群体的缺失比例

vcftools --vcf DP3g95maf05.recode.vcf --keep 1.keep --missing-site --out 1
vcftools --vcf DP3g95maf05.recode.vcf --keep 2.keep --missing-site --out 2

于是我们就会得到1.lmiss和2.lmiss

我们将两个文件进行合并，然后根据最后一列提取出缺失率大于0.1的样本.

cat 1.lmiss 2.lmiss | mawk '!/CHR/' | mawk '$6 > 0.1' | cut -f1,2 >> badloci

利用VCFtools进行过滤

vcftools --vcf DP3g95maf05.recode.vcf --exclude-positions badloci --recode --recode-INFO-all --out DP3g95p5maf05

作者用两个来源的群体作为例子进行讲解，实际情况是我们可能会遇到超过2个的情况。作者就写了一个脚本进行处理

curl -L -O https://github.com/jpuritz/dDocent/raw/master/scripts/pop_missing_filter.sh
chmod +x pop_missing_filter.sh
./pop_missing_filter.sh

FreeBayes输出结果过滤

下面的教程假设你的结果文件来自于FreeBayes的输出。

FreeBayes的输出信息非常丰富，我们可以根据RAD-seq的特点和freebayes提供的信息，进行更加复杂的过滤操作。先让我们看看FreeBayes到底有哪些信息

mawk '/#/' DP3g95maf05.recode.vcf

我们选择其中部分INFO 标签进行展示

INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">
INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
INFO=<ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
INFO=<ID=CIGAR,Number=A,Type=String,Description="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.">
INFO=<ID=MQM,Number=A,Type=Float,Description="Mean mapping quality of observed alternate alleles">
INFO=<ID=MQMR,Number=1,Type=Float,Description="Mean mapping quality of observed reference alleles">
INFO=<ID=PAIRED,Number=A,Type=Float,Description="Proportion of observed alternate alleles which are supported by properly paired read fragments">
INFO=<ID=PAIREDR,Number=1,Type=Float,Description="Proportion of observed reference alleles which are supported by properly paired read fragments">
FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">

第一个过滤标准是Allele balance。所谓Allele balance就是0和1的比值，0表示和基因组序列相同，1表示和基因组序列不同。由于RAD-seq是对基因组默写特定位点进行测序，因此我们认为Allele balance应该是0.5左右。

作者使用了 vcflib 的vcffilter进行过滤

vcffilter -s -f "AB > 0.25 & AB < 0.75 | AB < 0.01" DP3g95p5maf05.recode.vcf > DP3g95p5maf05.fil1.vcf

上面这条语句保留了AB在0.25和0.75之间，或者AB<0.01的位点。对于AB < 0.01，作者认为这是固定变异(fixed variants), -s表示将这条语句作用于所有位点，而不只是allels。

我们看下过滤前后的位点数

mawk '!/#/' DP3g95p5maf05.recode.vcf | wc -l
mawk '!/#/' DP3g95p5maf05.fil1.vcf | wc -l

vcffilter -f "SAF / SAR > 100 & SRF / SRR > 100 | SAR / SAF > 100 & SRR / SRF > 100" -s DP3g95p5maf05.fil1.vcf > DP3g95p5maf05.fil2.vcf

mawk '!/#/' DP3g95p5maf05.fil2.vcf | wc -l

恭喜你，完成了SNP过滤这一步

「博客翻译」SNP过滤教程（二）
原文地址: SNP Filtering Tutorial 接「博客翻译」SNP过滤教程（一） FreeBayes输...
「博客翻译」SNP过滤教程（一）
原文地址: SNP Filtering Tutorial 感觉文章写的真棒，于是就翻译了一遍目标学习如何使用V...
SNP过滤教程（通用过滤）
https://zhuanlan.zhihu.com/p/65720484[https://zhuanlan.zh...
关于SNP的过滤（2）：如何使用vcftools进行SNP过滤
继续上一节课教程的下半节，继续介绍一些其他过滤SNP的工具，欢迎大家跟着学习。从现在开始，过滤步骤假定vcf文件...
SNP过滤
得到vcf文件后，则需对其snp进行过滤，先对几种简单方法进行汇总 1. 第一步粗过滤若使用GATK进行call...
pytest的中文翻译
原文：Pytest权威教程(官方教程翻译) - 韩志超 - 博客园 (cnblogs.com)[https://w...
SNP过滤时常用参数的解析（群体遗传）
关于为何要进行SNP的过滤：第一，低质量和无信息的SNP会影响后续群体结构或GWAS的分析结果，甚至影响后续对生...
SNP过滤时常用参数的解析（群体遗传）
关于为何要进行SNP的过滤：第一，低质量和无信息的SNP会影响后续群体结构或GWAS的分析结果，甚至影响后续对生...
vcftools过滤SNP数据2021-02-25
一、计算每个SNP缺失频率获得的SNP_missing结果，最后一栏为缺失率：二、过滤掉缺失率大于10%和非b...
博客过滤效果Blogger Presets Lr预设系列
博客过滤效果lr预设-博客过滤效果Lr预设- Mac下载博客过滤效果Blogger Presets预设是一款非常...