Biopython Chapter 3: Biological Sequence Objects

Author: 安哥生个信 | Published: 2018-06-13 22:14

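    Sequences and alphabets

    Bio.Alphabet.IUPAC provides the basic alphabet definitions for proteins, DNA, and RNA.
    Available alphabets:
    Protein: IUPAC.protein (the 20 standard amino acids); IUPAC.extended_protein (adds less common amino acid and ambiguity letters)
    DNA: IUPAC.unambiguous_dna (the basic letters A, C, G, T); IUPAC.ambiguous_dna (adds ambiguity letters); IUPAC.extended_dna (adds modified bases)
    RNA: IUPAC.unambiguous_rna (the basic letters A, C, G, U); IUPAC.ambiguous_rna (adds ambiguity letters)

    Defining a sequence without a specific alphabet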

    In[2]: from Bio.Seq import Seq
    In[3]: my_seq = Seq("AGTACACTGGT")
    In[4]: my_seq
    Out[4]: 
    Seq('AGTACACTGGT', Alphabet())
    In[5]: my_seq.alphabet
    Out[5]: 
    Alphabet()
    

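    Seq() creates a basic sequence object; when no alphabet is given, the object gets the generic Alphabet().

    Defining a DNA sequence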

    In[6]: from Bio.Seq import Seq
    In[7]: from Bio.Alphabet import IUPAC
    In[8]: my_seq = Seq("AGCTGCAGCGAGCGAGC", IUPAC.unambiguous_dna)
    In[9]: my_seq
    Out[9]: 
    Seq('AGCTGCAGCGAGCGAGC', IUPACUnambiguousDNA())
    In[10]: my_seq.alphabet
    Out[10]: 
    IUPACUnambiguousDNA()
    

    Working with sequences

    Iterating over elements

    In[11]: from Bio.Seq import Seq
    In[12]: from Bio.Alphabet import IUPAC
    In[13]: my_seq = Seq("AGTCGA", IUPAC.unambiguous_dna)
    In[15]: for index,letter in enumerate(my_seq):
       ...:     print(index,letter)
       ...:     
    0 A
    1 G
    2 T
    3 C
    4 G
    5 A
    
    

    enumerate() iterates over the letters of the sequence together with their indices.

    Getting the length

    In[17]: my_seq
    Out[17]: 
    Seq('AGTCGA', IUPACUnambiguousDNA())
    In[18]: print(len(my_seq))
    6
    

    Accessing individual elements

    In[19]: print(my_seq[0])
    A
    In[20]: print(my_seq[2])
    T
    

    Non-overlapping count

    In[21]: Seq("AAAAA").count("AA")
    Out[21]: 
    2
    In[22]: "AAAAA".count("AA")
    Out[22]: 
    2
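
    Both calls above return 2 because the count is non-overlapping. If overlapping matches matter, they have to be counted by hand. The following is a minimal pure-Python sketch on plain strings; count_overlapping is a hypothetical helper name, not a Biopython function.

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Move forward by one character only, so overlapping matches are found.
        start = text.find(sub, start + 1)
    return count


non_overlapping = "AAAAA".count("AA")           # str.count skips past each match
overlapping = count_overlapping("AAAAA", "AA")  # matches at positions 0, 1, 2, 3
print(non_overlapping, overlapping)
```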
    

    Computing GC content

    In[27]: from Bio.SeqUtils import GC
    In[28]: my_seq
    Out[28]: 
    Seq('AGTCGA', IUPACUnambiguousDNA())
    In[29]: GC(my_seq)
    Out[29]: 
    50.0
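
    To make the number above concrete, here is a rough pure-Python equivalent for plain strings. It is a sketch rather than Biopython's implementation: gc_percent is a hypothetical name, and ambiguity codes are simply ignored.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a string (case-insensitive)."""
    if not sequence:
        return 0.0  # avoid dividing by zero for an empty sequence
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100.0 * gc_count / len(upper)


print(gc_percent("AGTCGA"))  # 3 of the 6 letters are G or C
```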
    

    Slicing

    In[30]: my_seq = Seq("AGCTGACTGACGCATGAACGATAGCA", IUPAC.unambiguous_dna)
    In[31]: my_seq[4:12]
    Out[31]: 
    Seq('GACTGACG', IUPACUnambiguousDNA())
    In[32]: my_seq[4:12:3]
    Out[32]: 
    Seq('GTC', IUPACUnambiguousDNA())
    

    The new object produced by a slice keeps the alphabet of the original Seq object.
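
    Slicing with a step of 3 is a convenient way to look at codon positions. The sketch below uses a plain Python string with the same letters as the example above; split_codons is a hypothetical helper, and dropping a trailing partial codon is an assumption made for this illustration.

```python
def split_codons(sequence):
    """Split a sequence into complete codons, dropping a trailing partial codon."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


seq = "AGCTGACTGACGCATGAACGATAGCA"   # 26 letters, so 2 letters are left over
codons = split_codons(seq)
third_positions = seq[2::3]          # every third letter, starting at index 2
print(codons)
print(third_positions)
```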

    Reversing a sequence

    In[33]: my_seq[::-1]
    Out[33]: 
    Seq('ACGATAGCAAGTACGCAGTCAGTCGA', IUPACUnambiguousDNA())
    

    Converting to a string

    In[34]: str(my_seq)
    Out[34]: 
    'AGCTGACTGACGCATGAACGATAGCA'
    In[35]: print(my_seq)
    AGCTGACTGACGCATGAACGATAGCA
    In[36]: fasta = ">Name\n%s\n" % my_seq
    In[37]: print(fasta)
    >Name
    AGCTGACTGACGCATGAACGATAGCA
    

    print() and the % formatting operator convert a Seq object to a string automatically.
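
    The one-line FASTA record above works for short sequences. For longer ones the sequence is usually wrapped across several lines. A minimal sketch for plain strings follows; to_fasta and the default width of 60 are assumptions for illustration, not part of Biopython.

```python
def to_fasta(name, sequence, width=60):
    """Format one FASTA record, wrapping the sequence every `width` letters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">{}\n{}\n".format(name, "\n".join(lines))


record = to_fasta("Name", "AGCTGACTGACGCATGAACGATAGCA")
print(record)
```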

    Concatenating sequences

    Same alphabet

    In[39]: dna1 = Seq("AGCTAGCGA",IUPAC.unambiguous_dna)
    In[40]: dna2 = Seq("AGTCCGATG", IUPAC.unambiguous_dna)
    In[41]: dna = dna1 + dna2
    In[42]: dna
    Out[42]: 
    Seq('AGCTAGCGAAGTCCGATG', IUPACUnambiguousDNA())
    

    Different alphabets

    In[49]: protein = Seq("EVRNAK", IUPAC.protein)
    In[50]: from Bio.Alphabet import generic_alphabet
    In[51]: protein.alphabet = generic_alphabet
    In[52]: dna.alphabet = generic_alphabet
    In[53]: dna + protein
    Out[53]: 
    Seq('AGCTAGCGAAGTCCGATGEVRNAK', Alphabet())
    

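    To concatenate sequences with different alphabets, first convert both to the generic alphabet; otherwise the addition raises an error:
    TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACProtein()

    Changing case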

    In[56]: my_seq = Seq("acgGATC",generic_alphabet)
    In[57]: my_seq.upper()
    Out[57]: 
    Seq('ACGGATC', Alphabet())
    In[58]: my_seq.lower()
    Out[58]: 
    Seq('acggatc', Alphabet())
    

    Complement and reverse complement

    In[61]: my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
    In[62]: my_seq.complement()
    Out[62]: 
    Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
    In[63]: my_seq.reverse_complement()
    Out[63]: 
    Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
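
    The two operations can be reproduced on plain strings with a translation map. This is a sketch that handles only the unambiguous letters A, C, G, T (any other character passes through unchanged); the function names are hypothetical and this is not how Biopython implements them.

```python
# Map each unambiguous base to its partner, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, reversed."""
    return complement(dna)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(dna))
print(reverse_complement(dna))
```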
    

    Simulating biological processes

    Transcription

    In[64]: coding_dna = Seq("AGTCGATCGATGACTAGCATGACGCATGACT", IUPAC.unambiguous_dna)
    In[65]: coding_dna
    Out[65]: 
    Seq('AGTCGATCGATGACTAGCATGACGCATGACT', IUPACUnambiguousDNA())
    In[66]: template_dna = coding_dna.reverse_complement()
    In[67]: template_dna
    Out[67]: 
    Seq('AGTCATGCGTCATGCTAGTCATCGATCGACT', IUPACUnambiguousDNA())
    In[68]: mRNA = coding_dna.transcribe()
    In[69]: mRNA
    Out[69]: 
    Seq('AGUCGAUCGAUGACUAGCAUGACGCAUGACU', IUPACUnambiguousRNA())
    In[70]: template_dna.reverse_complement().transcribe()
    Out[70]: 
    Seq('AGUCGAUCGAUGACUAGCAUGACGCAUGACU', IUPACUnambiguousRNA())
    

    transcribe() replaces T with U and adjusts the alphabet accordingly.

    Back-transcription

    In[71]: mRNA.back_transcribe()
    Out[71]: 
    Seq('AGTCGATCGATGACTAGCATGACGCATGACT', IUPACUnambiguousDNA())
    

    back_transcribe() replaces U with T and changes the alphabet back to DNA.
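
    On the letter level, both directions are simple substitutions applied to the coding strand. The sketch below works on plain strings and does no alphabet checking; the function names mirror the methods above but are stand-alone illustrations.

```python
def transcribe(coding_dna):
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return coding_dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """Turn an RNA string back into DNA by replacing U with T."""
    return rna.replace("U", "T").replace("u", "t")


coding_dna = "AGTCGATCGATGACTAGCATGACGCATGACT"
mrna = transcribe(coding_dna)
print(mrna)
print(back_transcribe(mrna) == coding_dna)  # the round trip restores the input
```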

    Translation

    In[73]: dna_seq = Seq("ATGCGTAGCTAGCTGACGTACGTAGCA",IUPAC.unambiguous_dna)
    In[74]: len(dna_seq)
    Out[74]: 
    27
    In[75]: mrna_seq = dna.transcribe()
    In[76]: mrna_seq.translate()
    Out[76]: 
    Seq('S*RSPM', HasStopCodon(ExtendedIUPACProtein(), '*'))
    In[77]: dna.translate()
    Out[77]: 
    Seq('S*RSPM', HasStopCodon(ExtendedIUPACProtein(), '*'))
    

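    Note that In[75] and In[77] call transcribe() and translate() on dna, the 18-base sequence AGCTAGCGAAGTCCGATG built in the concatenation section, not on the dna_seq defined in In[73]; that is why the result has six letters.

    The sequence length should be a multiple of 3. In the Biopython version used here, translate() issues a BiopythonWarning for a trailing partial codon (see the last example in this article), and the warning text says this may become an error in the future.

    translate(table, stop_symbol, to_stop, cds)
    table selects the genetic code table; the standard genetic code is the default. See NCBI's genetic code documentation for details.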

    In[79]: dna.translate(table="Yeast Mitochondrial")
    Out[79]: 
    Seq('S*RSPM', HasStopCodon(ExtendedIUPACProtein(), '*'))
    

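    This translates using the yeast mitochondrial code table.

    to_stop=True translates only up to the first in-frame stop codon and then stops; the stop codon itself is not translated.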

    In[80]: dna.translate(to_stop =  True)
    Out[80]: 
    Seq('S', ExtendedIUPACProtein())
    

    stop_symbol sets the character used to represent stop codons.

    In[82]: dna.translate(stop_symbol = "?")
    Out[82]: 
    Seq('S?RSPM', HasStopCodon(ExtendedIUPACProtein(), '?'))
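
    To show what to_stop and stop_symbol do, here is a small pure-Python translator for the standard genetic code (NCBI table 1). It is a sketch, not Biopython's implementation: it accepts only A, C, G, T (anything else raises KeyError), silently drops a trailing partial codon, and does not support other tables or the cds option.

```python
from itertools import product

# Standard genetic code in NCBI order: bases T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, stop_symbol="*", to_stop=False):
    """Translate a DNA string codon by codon using the standard table."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = STANDARD_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            if to_stop:
                break  # stop before the first in-frame stop codon
            aa = stop_symbol
        protein.append(aa)
    return "".join(protein)


dna = "AGCTAGCGAAGTCCGATG"
print(translate(dna))
print(translate(dna, to_stop=True))
print(translate(dna, stop_symbol="?"))
```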
    

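    cds=True treats the sequence as a complete coding sequence: the first codon must be a valid start codon for the chosen table and is translated as M (GTG in the example below), the length must be a multiple of 3, and the single stop codon at the end is left out of the result.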

    In[85]: from Bio.Seq import Seq
    In[86]: from Bio.Alphabet import generic_dna
    In[87]: gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
       ...:            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
       ...:             "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
       ...:             "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
       ...:             "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
       ...:             generic_dna)
    In[88]: gene.translate(table="Bacterial")
    Out[88]: 
    Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
    In[89]: gene.translate(table="Bacterial", cds=True)
    Out[89]: 
    Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
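
    The checks implied by cds=True can be sketched in pure Python. The start-codon set below is an illustrative assumption (a subset of the start codons used in bacterial genes), and is_complete_cds is a hypothetical helper that returns a boolean instead of raising an error.

```python
# Assumption: an illustrative subset of start codons, not a complete list.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(dna, start_codons=START_CODONS, stop_codons=STOP_CODONS):
    """Return True if dna looks like a complete coding sequence."""
    dna = dna.upper()
    # Need at least a start and a stop codon, and only whole codons.
    if len(dna) < 6 or len(dna) % 3 != 0:
        return False
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[0] not in start_codons:
        return False
    if codons[-1] not in stop_codons:
        return False
    # No stop codon is allowed between the start and the final stop.
    return not any(codon in stop_codons for codon in codons[1:-1])


print(is_complete_cds("GTGAAATAA"))     # start, one codon, stop
print(is_complete_cds("ATGTAAAAATAA"))  # contains an internal stop
```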
    

    Codon tables

    Online codon tables

    NCBI's genetic code documentation

    Built-in codon tables

    In[90]: from Bio.Data import CodonTable
    In[92]: print(CodonTable.unambiguous_dna_by_name["Standard"])  # look up the table by name
    Table 1 Standard, SGC0
    
      |  T      |  C      |  A      |  G      |
    --+---------+---------+---------+---------+--
    T | TTT F   | TCT S   | TAT Y   | TGT C   | T
    T | TTC F   | TCC S   | TAC Y   | TGC C   | C
    T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
    T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
    --+---------+---------+---------+---------+--
    C | CTT L   | CCT P   | CAT H   | CGT R   | T
    C | CTC L   | CCC P   | CAC H   | CGC R   | C
    C | CTA L   | CCA P   | CAA Q   | CGA R   | A
    C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
    --+---------+---------+---------+---------+--
    A | ATT I   | ACT T   | AAT N   | AGT S   | T
    A | ATC I   | ACC T   | AAC N   | AGC S   | C
    A | ATA I   | ACA T   | AAA K   | AGA R   | A
    A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
    --+---------+---------+---------+---------+--
    G | GTT V   | GCT A   | GAT D   | GGT G   | T
    G | GTC V   | GCC A   | GAC D   | GGC G   | C
    G | GTA V   | GCA A   | GAA E   | GGA G   | A
    G | GTG V   | GCG A   | GAG E   | GGG G   | G
    --+---------+---------+---------+---------+--
    
    In[93]: print(CodonTable.unambiguous_dna_by_id[1])  # look up the table by numeric ID
    Table 1 Standard, SGC0
    
      |  T      |  C      |  A      |  G      |
    --+---------+---------+---------+---------+--
    T | TTT F   | TCT S   | TAT Y   | TGT C   | T
    T | TTC F   | TCC S   | TAC Y   | TGC C   | C
    T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
    T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
    --+---------+---------+---------+---------+--
    C | CTT L   | CCT P   | CAT H   | CGT R   | T
    C | CTC L   | CCC P   | CAC H   | CGC R   | C
    C | CTA L   | CCA P   | CAA Q   | CGA R   | A
    C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
    --+---------+---------+---------+---------+--
    A | ATT I   | ACT T   | AAT N   | AGT S   | T
    A | ATC I   | ACC T   | AAC N   | AGC S   | C
    A | ATA I   | ACA T   | AAA K   | AGA R   | A
    A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
    --+---------+---------+---------+---------+--
    G | GTT V   | GCT A   | GAT D   | GGT G   | T
    G | GTC V   | GCC A   | GAC D   | GGC G   | C
    G | GTA V   | GCA A   | GAA E   | GGA G   | A
    G | GTG V   | GCG A   | GAG E   | GGG G   | G
    --+---------+---------+---------+---------+--
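
    The same table can be held in a plain dictionary, which makes it easy to ask questions such as which codons are stops or which codons encode a given amino acid. The sketch below rebuilds the standard table from its compact NCBI string; the variable names are hypothetical and no Biopython objects are involved.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Forward table: codon -> one-letter amino acid ("*" marks a stop codon).
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Reverse table: amino acid -> list of codons that encode it.
codons_for = defaultdict(list)
for codon, aa in codon_table.items():
    codons_for[aa].append(codon)

stop_codons = sorted(codons_for["*"])
print(stop_codons)
print(codons_for["M"], codons_for["W"])
```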
    

    Seq objects

    Comparison

    In[2]: from Bio.Seq import Seq
    In[3]: from Bio.Alphabet import IUPAC
    In[4]: seq1 = Seq("AGCT", IUPAC.unambiguous_dna)
    In[5]: seq2 = Seq("AGCT", IUPAC.unambiguous_dna)
    In[6]: seq1 == seq2
    Out[6]: 
    True
    In[12]: id(seq1) == id(seq2)
    Out[12]: 
    False
    In[13]: id(seq1)
    Out[13]: 
    2111559244880
    In[14]: id(seq2)
    Out[14]: 
    2111559374160
    In[15]: str(seq1) == str(seq2)
    Out[15]: 
    True
    

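    The two Seq objects have the same sequence and the same alphabet, so seq1 == seq2 returns True. They are nevertheless two separate objects in memory, which is why id(seq1) == id(seq2) returns False. To compare only the letters, regardless of alphabet, convert both with str() and compare the resulting strings.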
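
    The difference between equal content and identical objects is general Python behaviour. The sketch below uses a hypothetical minimal Sequence class (not Biopython's Seq) whose __eq__ compares the stored letters.

```python
class Sequence:
    """Hypothetical minimal wrapper around a string of letters."""

    def __init__(self, data):
        self.data = data

    def __eq__(self, other):
        # Equality is defined by content, not by memory location.
        if isinstance(other, Sequence):
            return self.data == other.data
        return NotImplemented

    def __hash__(self):
        return hash(self.data)


seq1 = Sequence("AGCT")
seq2 = Sequence("AGCT")
same_content = seq1 == seq2  # compares the letters
same_object = seq1 is seq2   # compares identity, like id(seq1) == id(seq2)
print(same_content, same_object)
```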

    Mutability

    tomutable()

    In[19]: my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
    In[20]: my_seq[5] = "G"
    Traceback (most recent call last):
      File "C:\Users\AnLau\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-20-56a40d7fb976>", line 1, in <module>
        my_seq[5] = "G"
    TypeError: 'Seq' object does not support item assignment
    In[21]: mutable_seq = my_seq.tomutable()
    In[22]: mutable_seq
    Out[22]: 
    MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
    In[23]: mutable_seq[5]="G"
    In[24]: mutable_seq
    Out[24]: 
    MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
    

    Seq objects are immutable.
    tomutable() converts a Seq object into a MutableSeq object.

    Creating a MutableSeq object

    In[27]: from Bio.Seq import MutableSeq
    In[28]: mutable_seq = MutableSeq("AGCGATGAC",IUPAC.unambiguous_dna)
    In[29]: mutable_seq
    Out[29]: 
    MutableSeq('AGCGATGAC', IUPACUnambiguousDNA())
    In[30]: mutable_seq[0]="T"
    In[31]: mutable_seq
    Out[31]: 
    MutableSeq('TGCGATGAC', IUPACUnambiguousDNA())
    In[32]: mutable_seq.remove("T")
    In[33]: mutable_seq
    Out[33]: 
    MutableSeq('GCGATGAC', IUPACUnambiguousDNA())
    In[34]: mutable_seq.reverse()
    In[35]: mutable_seq
    Out[35]: 
    MutableSeq('CAGTAGCG', IUPACUnambiguousDNA())
    In[36]: new_seq = mutable_seq.toseq()
    In[37]: new_seq
    Out[37]: 
    Seq('CAGTAGCG', IUPACUnambiguousDNA())
    In[38]: new_seq.reverse_complement()
    Out[38]: 
    Seq('CGCTACTG', IUPACUnambiguousDNA())
    In[39]: new_seq
    Out[39]: 
    Seq('CAGTAGCG', IUPACUnambiguousDNA())
    

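    toseq() converts a MutableSeq object back into a Seq object.
    MutableSeq methods such as remove() and reverse() modify the object in place, whereas Seq methods such as reverse_complement() return a new object and leave the original unchanged.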
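
    Plain Python has the same split between immutable and mutable containers: a str cannot be edited in place, while a list of letters can. The sketch below replays the MutableSeq steps above on a list, as an analogy only.

```python
text = "AGCGATGAC"
try:
    text[0] = "T"  # strings, like Seq objects, reject item assignment
    str_is_immutable = False
except TypeError:
    str_is_immutable = True

bases = list(text)   # a list of single letters can be edited in place
bases[0] = "T"       # item assignment
bases.remove("T")    # removes the first "T" only
bases.reverse()      # reverses the list in place
result = "".join(bases)  # back to an immutable string
print(str_is_immutable, result)
```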

    UnknownSeq objects

    In[40]: from Bio.Seq import UnknownSeq
    In[41]: unk = UnknownSeq(20)
    In[42]: unk
    Out[42]: 
    UnknownSeq(20, alphabet = Alphabet(), character = '?')
    In[43]: unk_dna = UnknownSeq(20,IUPAC.unambiguous_dna)
    In[44]: unk_dna
    Out[44]: 
    UnknownSeq(20, alphabet = IUPACUnambiguousDNA(), character = 'N')
    In[45]: print(unk)
    ????????????????????
    In[46]: print(unk_dna)
    NNNNNNNNNNNNNNNNNNNN
    

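    An UnknownSeq stores only a single character (such as "N") and the required sequence length as an integer, which saves memory compared with storing every letter.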
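
    The memory-saving idea can be sketched with a small class that keeps just a length and a character and builds the full string only on demand. CompactUnknown is a hypothetical name for this illustration and does not reproduce the UnknownSeq API.

```python
class CompactUnknown:
    """Hypothetical stand-in: a run of one repeated character, stored compactly."""

    def __init__(self, length, character="N"):
        if length < 0:
            raise ValueError("length must be non-negative")
        self.length = length
        self.character = character

    def __len__(self):
        return self.length

    def __str__(self):
        # The full string is only materialised when it is actually needed.
        return self.character * self.length

    def __getitem__(self, index):
        if isinstance(index, slice):
            new_length = len(range(*index.indices(self.length)))
            return CompactUnknown(new_length, self.character)
        if not -self.length <= index < self.length:
            raise IndexError("index out of range")
        return self.character


unk = CompactUnknown(20)
print(len(unk), str(unk), len(unk[5:10]))
```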

    Working directly with strings

    In[47]: from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
    In[48]: dna_string = "AGTCGATCGATCGACTGCGACGTCGA"
    In[49]: reverse_complement(dna_string)
    Out[49]: 
    'TCGACGTCGCAGTCGATCGATCGACT'
    In[50]: transcribe(dna_string)
    Out[50]: 
    'AGUCGAUCGAUCGACUGCGACGUCGA'
    In[51]: translate(dna_string)
    Out[51]: 
    'SRSIDCDV'
    C:\Users\AnLau\Anaconda3\lib\site-packages\Bio\Seq.py:2309: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
      BiopythonWarning)
    
