一:读取和写入多序列fasta文件,并将每条记录(序列+标题)写入不同的文件中。
只适合去掉换行符的序列,用perl去掉换行符
perl -pe '/^>/ ? print "\n" : chomp' multi.fa | tail -n +2 > multi.fa_noN
seq = ''
n = 0
for line in open('C:\\shiyan\\honghua\\multi.fa','r'):
if line[0] == '>':
n += 1
out_file = open(str(n) + '_header','w')
out_file.write(line)
out_file.close()
else:
out_file = open(str(n) + '.fa','w')
out_file.write(line)
out_file.close()
二:读取和过滤FASTA文件,仅当起始氨基酸为甲硫氨酸(M)且含有至少一个KED三连氨基酸(KED)时,将记录写入新文件。
for line in open('C:\\shiyan\\honghua\\uniprot_test.fq','r'):
if line[0] == '>' and seq == '':
header = line
elif line[0] != '>':
seq = seq + line
elif line[0] == '>' and seq != '': ##证明指针走到新的序列,输出上一条序列
if seq[0] == 'M' and seq.count('KED') >= 1:
OutputFile = open(header[4:10].strip()+'.fa','w')
OutputFile.write(header + seq)
OutputFile.close()
print(header + seq)
seq = ''
header = line
# take care of the very last record of the input file
if seq[0] == 'M' and seq.count('KED') >= 1:
OutputFile = open(header[4:10].strip()+'.fa','w')
OutputFile.write(header + seq)
OutputFile.close()
print(header + seq)
三:计算在FASTA格式中多个DNA序列的核苷酸频率。
seq = ''
for line in open('C:\\shiyan\\honghua\\multi.fa','r'):
if line[0] == '>' and seq == '':
header = line
elif line[0] != '>':
seq = seq + line
elif line[0] == '>' and seq != '': ##表面指针走到下一条序列,输出上一条序列内容
print(header)
length = float(len(seq))
for nt in 'ATCG':
number = seq.count(nt)
print(nt,number/length)
seq = ''
header = line
# take care of the very last record of the input file
print( header)
for nt in "ATCG":
number = seq.count(nt)
print(nt,number/length)
>AT1G16710.2 cds chromosome:TAIR10:1:5714219:5722614:1 gene:AT1G16710 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:HAC12 description:Histone acetyltransferase HAC12 [Source:UniProtKB/Swiss-Prot;Acc:Q9FWQ5]
A 0.30128956623681125
T 0.2532239155920281
C 0.19949198905822588
G 0.22958186791715515
>AT1G15825.1 cds chromosome:TAIR10:1:5448073:5448608:-1 gene:AT1G15825 gene_biotype:protein_coding transcript_biotype:protein_coding description:Hydroxyproline-rich glycoprotein family protein [Source:UniProtKB/TrEMBL;Acc:F4I177]
A 0.23711340206185566
T 0.22164948453608246
C 0.3556701030927835
G 0.16752577319587628
>AT1G48390.1 cds chromosome:TAIR10:1:17879376:17881243:-1 gene:AT1G48390 gene_biotype:protein_coding transcript_biotype:protein_coding description:RNI-like superfamily protein [Source:UniProtKB/TrEMBL;Acc:F4HYE7]
A 0.884020618556701
T 1.0335051546391754
C 0.6314432989690721
G 0.6520618556701031
四:统计每条染色体长度,ATCGN的含量
seq = ''
print("chromsome\tA\tT\tC\tG\tN\tlength")
for line in open('C:\\shiyan\\honghua\\chr.fa','r'):
line = line.upper()
if line[0] == '>' and seq == '':
header = line.strip()
elif line[0] != '>':
seq = seq + line.strip()
elif line[0] == '>' and seq != '':
print(header, end='\t')
length = len(seq)
for nt in "ATCGN":
if nt == "N":
print(str(seq.count(nt)) + "\t" + str(length), end='\n') ## 要转换成字符串才能拼接
else:
print(str(seq.count(nt)), end='\t')
seq = ''
header = line.strip()
# 注意最后一条数据
print(header, end='\t')
length = len(seq)
for nt in "ATCGN":
if nt == "N":
print(str(seq.count(nt)) + "\t" + str(length), end='\n')
else:
print(str(seq.count(nt)), end='\t')
chromsome A T C G N length
>CHR_1 13 10 7 10 4 44
>CHR_2 11 6 5 8 4 34
>CHR_3 10 6 3 10 4 33
>CHR_4 9 4 2 7 3 25
网友评论