The method is simple: walk through the gene sets of all genomes and intersect them with set & set.
1 Genome cluster list
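As a minimal illustration of what set & set does, here is a toy example. The three genome names and their gene memberships are made up for this sketch; only the COG IDs are borrowed from the sample listing further down.

```python
# Toy gene sets (hypothetical memberships, for illustration only)
genome_a = {"COG0006", "COG0008", "COG0009"}
genome_b = {"COG0008", "COG0009", "COG0012"}
genome_c = {"COG0009", "COG0015"}

# `&` keeps only the elements present in both operands
shared = genome_a & genome_b

# Intersecting again with a third genome can only shrink the result
core = shared & genome_c

print(sorted(shared))  # ['COG0008', 'COG0009']
print(sorted(core))    # ['COG0009']
```

Because intersection is associative and commutative, the order in which the genomes are processed does not affect the final core set.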
head total_lacto.list
CNGBCC1950658
CNGBCC1950669
CNGBCC1950686
CNGBCC1950698
CNGBCC1950902
2 Gene list for each genome
head ../cog_uniq/CNGBCC1950658.tsv
COG0006
COG0008
COG0009
COG0012
COG0015
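3 Computing core genes with Python
Approach:
Read the genome list line by line and strip the newline characters.
Take the first gene list as head.
Then open each remaining gene list in turn as a temporary tail, and compute the intersection with set & set.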
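#!/usr/bin/env python
# List of genome IDs, one per line
g_list = "total_lacto.list"
with open(g_list, 'r') as g_list_file:
    # Build the path of each genome's gene list file
    tmp_list = []
    for tmp in g_list_file:
        tmp = tmp.strip()
        if not tmp:
            continue  # skip blank lines
        tmp_list.append("../cog_uniq/{}.tsv".format(tmp))
# Successive intersection
num = 1
with open(tmp_list[0]) as head_file:
    # Strip newlines so a last line without a trailing newline still matches
    head = set(line.strip() for line in head_file if line.strip())
print("\t head done...")
for tail_path in tmp_list[1:]:
    with open(tail_path) as tail_file:
        tail = set(line.strip() for line in tail_file if line.strip())
    # Core step: keep only the genes also found in this genome
    head = head & tail
    num = num + 1
    print("\t intersect {} done...".format(num))
# Output, sorted so the result is reproducible between runs
out_name = "./lacto_core_cog.tsv"
with open(out_name, 'w') as o:
    o.write(''.join(gene + "\n" for gene in sorted(head)))
print("\t write done...")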
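The same idea can also be written as a small reusable function. This is only a sketch, not the script above: `read_gene_set` and `core_genes` are hypothetical helper names, and the demo uses throwaway files with made-up contents rather than the real `../cog_uniq/` data.

```python
import os
import tempfile


def read_gene_set(path):
    """Return the non-empty, whitespace-stripped lines of a file as a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def core_genes(paths):
    """Intersect the gene sets of all files in `paths` (hypothetical helper)."""
    gene_sets = [read_gene_set(p) for p in paths]
    if not gene_sets:
        return set()  # no genomes given, so there is no core to report
    return set.intersection(*gene_sets)


if __name__ == "__main__":
    # Toy input: three small gene lists written to a temporary directory
    toy = {
        "a.tsv": "COG0006\nCOG0008\nCOG0009\n",
        "b.tsv": "COG0008\nCOG0009\nCOG0012\n",
        "c.tsv": "COG0009\nCOG0008",  # no trailing newline on purpose
    }
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = []
        for name, content in toy.items():
            path = os.path.join(tmp_dir, name)
            with open(path, "w") as handle:
                handle.write(content)
            paths.append(path)
        print(sorted(core_genes(paths)))  # ['COG0008', 'COG0009']
```

Stripping each line before building the set matters here: without it, `COG0008` at the end of a file with no trailing newline would not match `COG0008\n` from another file and would silently drop out of the core.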
Manually spot-check a sample of the results to verify that the algorithm is correct.
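One possible way to back up the manual spot check is to recompute the core set by a different route, counting in how many genomes each gene appears, and compare. This is a sketch under that assumption; `presence_counts` and `check_core` are hypothetical names, and the gene sets below are toy data.

```python
from collections import Counter


def presence_counts(gene_sets):
    """Count in how many genomes each gene occurs."""
    counts = Counter()
    for genes in gene_sets:
        counts.update(set(genes))  # set() so duplicates within a genome count once
    return counts


def check_core(core, gene_sets):
    """Return True if `core` is exactly the set of genes found in every genome."""
    counts = presence_counts(gene_sets)
    expected = {gene for gene, n in counts.items() if n == len(gene_sets)}
    return set(core) == expected


if __name__ == "__main__":
    # Toy data (made-up memberships)
    genomes = [
        {"COG0006", "COG0008", "COG0009"},
        {"COG0008", "COG0009", "COG0012"},
        {"COG0008", "COG0009"},
    ]
    print(check_core({"COG0008", "COG0009"}, genomes))  # True
    print(check_core({"COG0008"}, genomes))             # False
```

A gene belongs to the core exactly when its count equals the number of genomes, so a mismatch in either direction (a missing core gene or an extra one) makes the check fail.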