用Python处理vcf文件的代码

作者: 邱俊辉 | 来源:发表于2019-09-29 10:34 被阅读0次

问题背景：

我现在有几百个同一菌株的基因组，但是我只需要差别较大的样品去做后续分析，我要把相似的冗余样品去掉那么我该怎么办呢？直接用mega构建进化树是不可行的，因为基因组太大，分析不出来。我把这些样品中的一个样品作为参考基因组，将剩余样品比对到这个样品上，然后call SNP，最终得到数百个vcf文件，这些vcf文件包含了这些样品相对于参考基因组的SNP，写一个脚本将这些SNP分别连接起来作为一条序列，所有序列输出为一个fasta文件，再用mega构建进化树就可以了！

以下是python代码

#Pyvcf是python专门处理vcf文件的一个包
import vcf
import os
import numpy
import collections
#vcf文件所在的路径
filepath=r'C:\Users\18019\Desktop\VCF\Vibrio.VCF'
filelist=os.listdir(filepath)
output_name=[]
for ech in filelist:
    output_name.append(ech.replace('_',''))
output_name+=['VibrioFF75']
dicREF_list=[]
dicALT_list=[]
for ech in filelist:
    ech_vcf=vcf.Reader(filename=r'C:\Users\18019\Desktop\VCF\Vibrio.VCF\%s' % ech)
    dicREF,dicALT={},{}
    for SNP in ech_vcf:
        if SNP.is_snp == 1:
            dicREF[SNP.CHROM + '_' + str(SNP.POS)] = SNP.REF
            if len(SNP.ALT) > 1:
                dicALT[SNP.CHROM + '_' + str(SNP.POS)] = SNP.ALT[0]
            else:
                dicALT[SNP.CHROM + '_' + str(SNP.POS)] = SNP.ALT
    dicREF_list.append(dicREF)
    dicALT_list.append(dicALT)
#求出所有vcf文件中snp的位点
SNP_REF={}
for i in dicREF_list:
    SNP_REF=dict(SNP_REF,**i)
pos_list=list(SNP_REF.keys())
pos_list.sort()
print(len(pos_list))
#将每个vcf中的snp连接起来，如果该位点存在突变，则输出ALT，否则输出REF，保证每条SNP序列长度相同
all_list=[]
for ech_dic in dicALT_list:
    ech_dic_index = dicALT_list.index(ech_dic)
    ech_name=output_name[ech_dic_index]
    ech_list=[]
    for pos in pos_list:
        if pos in ech_dic:
            theSNP = str(ech_dic[pos]).replace('[', '').replace(']', '')
            ech_list.append(theSNP)
        else:
            theSNP = str(SNP_REF[pos]).replace('[', '').replace(']', '')
            ech_list.append(theSNP)
    all_list.append(ech_list)
#如果将SNP连接起来的序列还是过长，则需要进行二次过滤
snp_array=[]
for pos in pos_list:
    theSNP=str(SNP_REF[pos]).replace('[','').replace(']','')
    snp_array.append(theSNP)

for i in all_list:
    snp_array=numpy.row_stack((snp_array,i))
snp_array=numpy.transpose(snp_array)
array_len=len(snp_array)
print(array_len)
snp_least=[]
i=0
while i<array_len:
    snp_least.append(collections.Counter(snp_array[i]).most_common()[-1])
    i=i+1
print(len(snp_least))
pos_index=0
pos_index_list=[]
while pos_index<array_len:
    if ((snp_least[pos_index][1]) < 5):#5为阈值，如果snp只有5个以内的样品存在则将这个位点过滤掉
        pos_index_list.append(pos_index)
    pos_index=pos_index+1
print(len(pos_index_list))
for pos_index in pos_index_list:
    pos_list[pos_index]='0'
pos_list=set(pos_list)
pos_list=list(pos_list)
test_list=pos_list[:]
for i in pos_list:
    if i=='0':
        test_list.remove(i)
pos_list=test_list
print(len(pos_list))

allStr=''
for ech_dic in dicALT_list:
    ech_dic_index=dicALT_list.index(ech_dic)
    ech_name=output_name[ech_dic_index]
    ech_Str='>'+ech_name+'\n'
    for pos in pos_list:
        if pos in ech_dic:
            theSNP = str(ech_dic[pos]).replace('[', '').replace(']', '')
            ech_Str += theSNP
        else:
            theSNP = str(SNP_REF[pos]).replace('[', '').replace(']', '')
            ech_Str += theSNP
    ech_Str+='\n'
    allStr+=ech_Str
allStr+='>'+output_name[-1]+'\n'
for pos in pos_list:
    theSNP=str(SNP_REF[pos]).replace('[','').replace(']','')
    allStr+=theSNP
output=open(r'C:\Users\18019\Desktop\VCF\Vibrio.VCF\total_snp.txt','w')
output.write(allStr)
output.close()

代码仅供参考，具体问题具体分析。。。

网友评论

本文标题：用Python处理vcf文件的代码

本文链接：https://www.haomeiwen.com/subject/czrtyctx.html

延伸阅读

深度阅读

您也可以注册成为美文阅读网的作者，发表您的原创作品、分享您的心情！

用Python处理vcf文件的代码

问题背景：

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

生物信息学

python