美文网首页
解析数据记录

解析数据记录

作者: 多啦A梦的时光机_648d | 来源:发表于2020-04-02 14:13 被阅读0次

一:读取和写入多序列fasta文件,并将每条记录(序列+标题)写入不同的文件中。

只适合去掉换行符的序列,用perl去掉换行符
perl -pe '/^>/ ? print "\n" : chomp' multi.fa | tail -n +2 > multi.fa_noN

seq = ''
n = 0
for line in open('C:\\shiyan\\honghua\\multi.fa','r'):

    if line[0] == '>':
        n += 1
        out_file = open(str(n) + '_header','w')
        out_file.write(line)
        out_file.close()
    else:
        out_file = open(str(n) + '.fa','w')
        out_file.write(line)
        out_file.close()

二:读取和过滤FASTA文件,仅当起始氨基酸为甲硫氨酸(M)且含有至少一个KED三连氨基酸(KED)时,将记录写入新文件。

for line in open('C:\\shiyan\\honghua\\uniprot_test.fq','r'):
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':   ##证明指针走到新的序列,输出上一条序列
        if seq[0] == 'M' and seq.count('KED') >= 1:
            OutputFile = open(header[4:10].strip()+'.fa','w')
            OutputFile.write(header + seq)
            OutputFile.close()
            print(header + seq)
        seq = ''
        header = line
# take care of the very last record of the input file
if seq[0] == 'M' and seq.count('KED') >= 1:
    OutputFile = open(header[4:10].strip()+'.fa','w')
    OutputFile.write(header + seq)
    OutputFile.close()
    print(header + seq)

三:计算在FASTA格式中多个DNA序列的核苷酸频率。

seq = ''
for line in open('C:\\shiyan\\honghua\\multi.fa','r'):
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
    elif line[0] == '>' and seq != '':   ##表面指针走到下一条序列,输出上一条序列内容
        print(header)
        length = float(len(seq))
        for nt in 'ATCG':
            number = seq.count(nt)
            print(nt,number/length)
        seq = ''
        header = line
# take care of the very last record of the input file
print( header)
for nt in "ATCG":
    number = seq.count(nt)
    print(nt,number/length)

>AT1G16710.2 cds chromosome:TAIR10:1:5714219:5722614:1 gene:AT1G16710 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:HAC12 description:Histone acetyltransferase HAC12 [Source:UniProtKB/Swiss-Prot;Acc:Q9FWQ5]

A 0.30128956623681125
T 0.2532239155920281
C 0.19949198905822588
G 0.22958186791715515
>AT1G15825.1 cds chromosome:TAIR10:1:5448073:5448608:-1 gene:AT1G15825 gene_biotype:protein_coding transcript_biotype:protein_coding description:Hydroxyproline-rich glycoprotein family protein [Source:UniProtKB/TrEMBL;Acc:F4I177]

A 0.23711340206185566
T 0.22164948453608246
C 0.3556701030927835
G 0.16752577319587628
>AT1G48390.1 cds chromosome:TAIR10:1:17879376:17881243:-1 gene:AT1G48390 gene_biotype:protein_coding transcript_biotype:protein_coding description:RNI-like superfamily protein [Source:UniProtKB/TrEMBL;Acc:F4HYE7]

A 0.884020618556701
T 1.0335051546391754
C 0.6314432989690721
G 0.6520618556701031

四:统计每条染色体长度,ATCGN的含量

seq = ''
print("chromsome\tA\tT\tC\tG\tN\tlength")
for line in open('C:\\shiyan\\honghua\\chr.fa','r'):
    line = line.upper()
    if line[0] == '>' and seq == '':
        header = line.strip()
    elif line[0] != '>':
        seq = seq + line.strip()
    elif line[0] == '>' and seq != '':
        print(header, end='\t')
        length = len(seq)
        for nt in "ATCGN":
            if nt == "N":
                print(str(seq.count(nt)) + "\t" + str(length), end='\n')  ## 要转换成字符串才能拼接
            else:
                print(str(seq.count(nt)), end='\t')
        seq = ''
        header = line.strip()
# 注意最后一条数据
print(header, end='\t')
length = len(seq)
for nt in "ATCGN":
    if nt == "N":
        print(str(seq.count(nt)) + "\t" + str(length), end='\n')
    else:
        print(str(seq.count(nt)), end='\t')

chromsome   A   T   C   G   N   length
>CHR_1  13  10  7   10  4   44
>CHR_2  11  6   5   8   4   34
>CHR_3  10  6   3   10  4   33
>CHR_4  9   4   2   7   3   25

相关文章

网友评论

      本文标题:解析数据记录

      本文链接:https://www.haomeiwen.com/subject/xvkouhtx.html