由于某些vcf文件比较大,检索感兴趣的位点,时间比较慢。尤其是当想批量获取位点信息时。
以下是一个linux 软件工具 或在python2脚本中导入该模块,提取某个位点的rs编号、深度等信息。
from cyvcf2 import VCF
for variant in VCF('some.vcf.gz'): # 输入vcf文件,然后就可以按照表头提取响应的值
variant.REF, variant.ALT # e.g. REF='A', ALT=['C', 'T']
variant.CHROM, variant.start, variant.end, variant.ID, \
variant.FILTER, variant.QUAL
# numpy arrays of specific things we pull from the sample fields.
# gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT
variant.gt_types, variant.gt_ref_depths, variant.gt_alt_depths # numpy arrays
variant.gt_phases, variant.gt_quals, variant.gt_bases # numpy array
## INFO Field.
## extract from the info field by it's name:
variant.INFO.get('DP') # int
variant.INFO.get('FS') # float
variant.INFO.get('AC') # float
# convert back to a string.
str(variant)
## sample info...
# Get a numpy array of the depth per sample:
dp = variant.format('DP')
# or of any other format field:
sb = variant.format('SB')
assert sb.shape == (n_samples, 4) # 4-values per
# to do a region-query:
vcf = VCF('some.vcf.gz')
for v in vcf('11:435345-556565'): #这里指我要提取11:435345-556565位置的信息
if v.INFO["AF"] > 0.1: continue
print(str(v))
下面的脚本是我想从dbsnp的vcf中提取,我感兴趣的位点rs编号和ref、alt碱基,
输入文件,argv[1]: dbsnp vcf文件; 第二个文件:argv[2] , 一些基因组位置表(chr,start...)
import sys
from cyvcf2 import VCF
dbsnp = VCF(sys.argv[1])
fp = open(sys.argv[2], "r")
for line in fp:
lst = line.strip().split()
chrom, st, rs, pro = lst[:]
id = " "
for v in dbsnp("%s:%s-%s" % (chrom, st, st)):
id = v.ID
print("%s\t%s\t%s\t%s" % (line.strip(), v.REF, v.ALT, id))
fp.close()
参考网址:cyvcf2
网友评论