生信编程直播第一题:人类基因组的外显子区域到底有多长
方法1:分别以外显子坐标的起始、终止坐标作为字典的键值对
#计算人基因组的外显子长度
f5 = r'E:\Bioinformatics\data\human\CCDS.current.txt'
head = 1
exon = {}
for line in open(f5):
if head:
head -= 1
continue
temp = line.split('\t')
if temp[9] != '-': #添加条件:如果基因坐标不为空
coordinate = temp[9].lstrip('[').rstrip(']').split(', ')
for i in range(len(coordinate)):
startCDS = coordinate[i].split('-')[0]
exon[startCDS] = coordinate[i].split('-')[1]
length = 0
for startCDS, endCDS in exon.items():
length += int(endCDS.strip("'")) - int(startCDS.strip("'")) +1 #要去除以字符串形式存储的数字的单引号,否则会报错
print(length)
方法2:以外显子起始、终止坐标创建等长列表进行计算
f5 = r'E:\Bioinformatics\data\human\CCDS.current.txt'
head = 1
startCDS = []
endCDS = []
for line in open(f5):
if head:
head -= 1
continue
temp = line.split('\t')
if temp[9] != '-':
coordinate = temp[9].lstrip('[').rstrip(']').split(', ')
for i in range(len(coordinate)):
startCDS.append(coordinate[i].split('-')[0])
endCDS.append(coordinate[i].split('-')[1])
length = 0
for i in range(len(startCDS)):
length += int(endCDS[i]) - int(startCDS[i]) +1 #要去除以字符串形式存储的数字的单引号,否则会报错
print(length)
生信编程直播第二题-hg19基因组序列的一些探究
因为自己目前研究番茄,所以以番茄基因组fasta文件为操作对象
首先,看一下该文件的大小:
不到一个G,约为822M
番茄一共有12条染色体,对每条染色体进行GC%, N%, Length, A, T, G, C, N统计(注意:在计算GC%时,需要去除N):
方法一:
思路为首先创建以染色体名称为key,序列为value的字典;随后,针对该字典中的value进行操作,针对序列进行遍历计数操作;最后,根据所记录的字符数进行GC%及N%的计算,并打印至屏幕。该方法非常消耗计算机资源。
import time
start = time.clock()
f = r'E:\Bioinformatics\data\S_lycopersicum_chromosomes.3.00.fa\S_lycopersicum_chromosomes.3.00.fa'
genome = {}
for line in open(f):
if line.startswith('>'):
chr_id = line.strip()
genome[chr_id] = []
else:
genome[chr_id].append(line.strip())
print('Chr\tGC_ratio\tN_ratio\tLength\tA\tT\tG\tC\tN')
for chrom, seq in genome.items():
seqTotal = ''.join(seq)
ANum = 0
TNum = 0
GNum = 0
CNum = 0
NNum = 0
GC_ratio = 0
N_ratio = 0
for i in seqTotal:
if i == 'A' or i == 'a':
ANum += 1
elif i == 'T' or i == 't':
TNum += 1
elif i == 'G' or i == 'g':
GNum += 1
elif i == 'C' or i == 'c':
CNum += 1
elif i == 'N' or i == 'n':
NNum += 1
GC_ratio = '%.04f' % ((GNum + CNum)/(len(seqTotal)-NNum))
N_ratio = '%.04f' % (NNum/len(seqTotal))
print('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(chrom, GC_ratio, N_ratio, len(seqTotal), ANum,
TNum, GNum, CNum, NNum))
end = time.clock()
print("The function run time is : %.03f seconds" %(end-start))
番茄基因组信息
可以看到,耗时极长,共358秒。
对该程序进行改进是否会减少计算资源的消耗呢?
#改进版本,利用字符串计数方法 .count()
import time
start = time.clock()
f = r'E:\Bioinformatics\data\S_lycopersicum_chromosomes.3.00.fa\S_lycopersicum_chromosomes.3.00.fa'
genome = {}
for line in open(f):
if line.startswith('>'):
chr_id = line.strip()
genome[chr_id] = []
else:
genome[chr_id].append(line.strip())
print('Chr\tGC_ratio\tN_ratio\tLength\tA\tT\tG\tC\tN')
for chrom, seq in genome.items():
seqTotal = ''.join(seq)
ANum = 0
TNum = 0
GNum = 0
CNum = 0
NNum = 0
GC_ratio = 0
N_ratio = 0
ANum = seqTotal.count('A') + seqTotal.count('a')
TNum = seqTotal.count('T') + seqTotal.count('t')
GNum = seqTotal.count('G') + seqTotal.count('g')
CNum = seqTotal.count('C') + seqTotal.count('c')
NNum = seqTotal.count('N') + seqTotal.count('n')
GC_ratio = '%.04f' % ((GNum + CNum)/(len(seqTotal)-NNum))
N_ratio = '%.04f' % (NNum/len(seqTotal))
print('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(chrom, GC_ratio, N_ratio, len(seqTotal), ANum,
TNum, GNum, CNum, NNum))
end = time.clock()
print("The function run time is : %.03f seconds" %(end-start))
方法一改进版
该方法居然快了近10倍!!看来在处理序列类对象的时候,尽量使用针对该类的方法,能不用循环就不用,这之间是“质”的差距。
网友评论