Counting amino acid (aa) frequencies in a sequence

Author: 每天都想睡觉的阿源 | Published 2018-03-23 18:21

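      I have been reading up on Python syntax for quite a while but never put it into practice. After today's class on multiple sequence alignment I suddenly wanted to try Python for real, and it turned out to be surprisingly quick to pick up.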

      First, make up a string that looks roughly like a protein sequence:

    >>> proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

    Indexing in Python

    >>> 'MEafw' [1]

    'E'

    >>> 'MEafw' [0]

    'M'

    >>> 'MEafw' [-1]

    'w'

    >>> proseq [1]

    'W'

    Note that Python counts from 0, so index 0 is the first character. A negative index counts backwards from the end of the string.
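
    To make the zero-based and negative indexing rules concrete, here is a small sketch using the made-up proseq string from this post. The variable names first, last and same_last are only for illustration.

```python
# Made-up example sequence (the same string as proseq in this post).
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

first = proseq[0]                     # index 0 is the first character
last = proseq[-1]                     # index -1 is the last character
same_last = proseq[len(proseq) - 1]   # the equivalent non-negative index

print(first, last, same_last)

# An index past the end raises IndexError rather than returning nothing.
try:
    proseq[len(proseq)]
except IndexError as err:
    print('IndexError:', err)
```

    So the valid indices run from 0 to len(proseq) - 1, or equivalently from -len(proseq) to -1.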

    Slicing

    >>> proseq [0:3]

    'AWJ'

    >>> proseq [3:]

    'FBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

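    Slicing works like indexing, except that it extracts a range of characters. The start and end positions are separated by ":", the end position itself is not included, and if the end is omitted the slice runs to the end of the string.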
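
    A few more slices on the same made-up string may help here. This is only a sketch; head, middle and tail are illustrative names.

```python
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'

head = proseq[:3]      # start omitted: the slice begins at index 0
middle = proseq[3:6]   # indices 3, 4 and 5; index 6 is excluded
tail = proseq[-3:]     # the last three characters

print(head, middle, tail)

# A slice [a:b] has length b - a, and the two halves rejoin cleanly.
assert len(proseq[3:6]) == 6 - 3
assert proseq[:3] + proseq[3:] == proseq
```

    Because the end position is excluded, proseq[:n] and proseq[n:] never overlap, which is handy when cutting a sequence into pieces.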

    String operations

    >>> 'pro' * 2

    'propro'

    >>> 'pro' + 'pro'

    'propro'

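    Interestingly, operators in Python are not limited to numbers: strings can be "calculated" with too. * repeats a string and + joins two strings.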

    >>> len(proseq)

    49

    The len() function returns the length of a string.

    >>> proseq.count('A')

    7

    The .count() method counts how many times a character (or substring) occurs in the string.

    Why is len() written in front of its argument, while .count() comes after the string?

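    The difference is whether or not it is a built-in function. len() is a built-in function and is very general: it works on many kinds of objects. .count() is a method, and methods belong to particular data types such as strings.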
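
    The function-versus-method distinction is easier to see side by side. A small sketch follows; the seq, residues and names values are invented just for this illustration.

```python
seq = 'AWJFBA'
residues = ['Ala', 'Trp', 'Ala']
names = {'A': 'Ala', 'W': 'Trp'}

# len() is a built-in function: the same name works for str, list and dict.
lengths = (len(seq), len(residues), len(names))
print(lengths)

# .count() is a method: it is looked up on the object it is called on.
print(seq.count('A'))          # occurrences of the character 'A'
print(residues.count('Ala'))   # lists define their own count() method

# dict defines no count() method, so calling names.count(...) would fail.
print(hasattr(names, 'count'))
```

    In other words, whether .count() is available depends on the type of the object in front of the dot, while len() can be applied to any object that has a length.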

    The for loop

    Basic syntax:

    for <index variable> in <sequence>:
        <command 1>
        <command 2>
        ...
        <command x>

    You have to get the syntax right, otherwise nothing else will work.

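    <sequence> can be a string or a collection of objects. <index variable> is a variable name that holds the element taken out on each pass. The first pass takes the first element, the next pass takes the second, and so on. The loop body is marked by indentation (four spaces by convention), and the loop exits after the body has run for the last element.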
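
    Here is a minimal sketch of how indentation decides what belongs to the loop body. The names residue and seen are made up for this example.

```python
seen = []
for residue in 'AWJ':               # residue is 'A', then 'W', then 'J'
    seen.append(residue)            # indented: part of the loop body
    print('inside loop:', residue)  # indented: runs once per character
print('after loop:', seen)          # not indented: runs once, after the loop
```

    The two indented lines run three times, once per character, and the last line runs only once because it sits outside the loop body.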

    >>> for amino_acid in 'ABCDEFGHIKLMNPQRSTVWY':

    ...    number = proseq.count(amino_acid)

    ...    print(amino_acid , number)

    ...

    A 7

    B 3

    C 0

    D 3

    E 0

    F 6

    G 0

    H 1

    I 0

    K 7

    L 2

    M 2

    N 4

    P 0

    Q 2

    R 0

    S 2

    T 0

    V 0

    W 2

    Y 0

    Note that in Python 3 print is a function and must be written as print(). It can also print several values at once if they are separated by commas.
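
    A short sketch of a few ways to print two values together. The sep keyword and f-strings are standard Python 3 features; the values below are just the A count from the output above.

```python
amino_acid = 'A'
number = 7

print(amino_acid, number)              # values separated by a single space
print(amino_acid, number, sep='\t')    # sep changes the separator to a tab

line = f'{amino_acid}: {number}'       # an f-string builds the text first
print(line)
```

    The tab-separated form is convenient if the counts are going to be pasted into a spreadsheet.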


    Practice: which amino acid occurs most often in telomerase reverse transcriptase?

    >>> telomerase = '''MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF

    ... HKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD

    ... VTQSAAYVHGSDLWRKVSMRLGTDITRYLFESCSVFVAVPPSCLFQVCGIPIYDCFSLAT

    ... ASLGFSLQSRGCRERCLGVNSMKRRAFNVKRYLRKRKTETDQKDEARVCSGKRRRVMEED

    ... KVSCETMQDGESGKTTLVQKQPGSKKRSEMEATLLPLEGGPSWRSGTFPPLPPSQSFMRT

    ... LGFLYGGRGMRSFLLNRKKKTAEGFRKIQGRDLIRIVFFEGVLYLNGLERKPKKLPRRFF

    ... NMVPLFSQLLRQHRRCPYSRLLQKTCPLVGIKDAGQAELSSFLPQHCGSHRVYLFVRECL

    ... LAVIPQELWGSEHNRLLYFARVRFFLRSGKFERLSVAELMWKIKVNNCDWLKISKTGRVP

    ... PSELSYRTQILGQFLAWLLDGFVVGLVRACFYATESMGQKNAIRFYRQEVWAKLQDLAFR

    ... SHISKGQMVELTPDQVAALPKSTIISRLRFIPKTDGMRPITRVIGADAKTRLYQSHVRDL

    ... LDMLRACVCSTPSLLGSTVWGMTDIHKVLSSIAPAQKEKPQPLYFVKMDVSGAYESLPHN

    ... KLIEVINQVLTPVLNEVFTIRRFAKIWADSHEGLKKAFIRQADFLEANMGSINMKQFLTS

    ... LQKKGKLHHSVLVEQIFSSDLEGKDALQFFTQILKGSVIQFGKKTYRQCQGVPQGSAVSS

    ... VLCCLCYGHMENVLFKDIINKKSCLMRLVDDFLLITPNLHDAQTFLKILLAGVPQYGLVV

    ... NPQKVVVNFEDYGSTDSCPGLRVLPLRCLFPWCGLLLDTHTLDIYKDYSSYADLSLRYSL

    ... TLGSCHSAGHQMKRKLMGILRLKCHALFLDLKTNSLEAIYKNIYKLLLLHALRFHVCAQS

    ... LPFGQSVAKNPAYFLLMIWDMVEYTNYLIRLSNNGLISGSTSQTGSVQYEAVELLFCLSF

    ... LLVLSKHRRLYKDLLLHLHKRKRRLEQCLGDLRLARVRQAANPRNPLDFLAIKT'''

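    (Note that the sequence has to be wrapped in triple quotes ''' ... ''' here. An ordinary single-quoted string cannot contain the line breaks of a pasted multi-line sequence, and adding a \ to the end of every line is not practical.)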

    >>> telomerase

    'MSITDLSPTLGILRSLYPHVQVLVDFADDIVFREGHKATLIEESDTSHFKSFVRGIFVCF\nHKELQQVPSCNQICTLPELLAFVLNSVKRKRKRNVLAHGYNFQSLAQEERDADQFKLQGD\nVTQSAAYVHGSDLWRKVSMRLGTDITRYLFESCSVFVAVPPSCLFQVCGIPIYDCFSLAT\nASLGFSLQSRGCRERCLGVNSMKRRAFNVKRYLRKRKTETDQKDEARVCSGKRRRVMEED\nKVSCETMQDGESGKTTLVQKQPGSKKRSEMEATLLPLEGGPSWRSGTFPPLPPSQSFMRT\nLGFLYGGRGMRSFLLNRKKKTAEGFRKIQGRDLIRIVFFEGVLYLNGLERKPKKLPRRFF\nNMVPLFSQLLRQHRRCPYSRLLQKTCPLVGIKDAGQAELSSFLPQHCGSHRVYLFVRECL\nLAVIPQELWGSEHNRLLYFARVRFFLRSGKFERLSVAELMWKIKVNNCDWLKISKTGRVP\nPSELSYRTQILGQFLAWLLDGFVVGLVRACFYATESMGQKNAIRFYRQEVWAKLQDLAFR\nSHISKGQMVELTPDQVAALPKSTIISRLRFIPKTDGMRPITRVIGADAKTRLYQSHVRDL\nLDMLRACVCSTPSLLGSTVWGMTDIHKVLSSIAPAQKEKPQPLYFVKMDVSGAYESLPHN\nKLIEVINQVLTPVLNEVFTIRRFAKIWADSHEGLKKAFIRQADFLEANMGSINMKQFLTS\nLQKKGKLHHSVLVEQIFSSDLEGKDALQFFTQILKGSVIQFGKKTYRQCQGVPQGSAVSS\nVLCCLCYGHMENVLFKDIINKKSCLMRLVDDFLLITPNLHDAQTFLKILLAGVPQYGLVV\nNPQKVVVNFEDYGSTDSCPGLRVLPLRCLFPWCGLLLDTHTLDIYKDYSSYADLSLRYSL\nTLGSCHSAGHQMKRKLMGILRLKCHALFLDLKTNSLEAIYKNIYKLLLLHALRFHVCAQS\nLPFGQSVAKNPAYFLLMIWDMVEYTNYLIRLSNNGLISGSTSQTGSVQYEAVELLFCLSF\nLLVLSKHRRLYKDLLLHLHKRKRRLEQCLGDLRLARVRQAANPRNPLDFLAIKT'
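
    The \n characters in the output above are the line breaks kept by the triple-quoted string. They do not change the result of .count() for a letter, but they are included in len(). A small sketch of removing them follows, using a short stand-in (the first 26 residues, re-wrapped at 10 per line) rather than the full sequence; raw and clean are illustrative names.

```python
# Short stand-in for a pasted multi-line sequence.
raw = '''MSITDLSPTL
GILRSLYPHV
QVLVDF'''

print(len(raw))                  # includes the two '\n' characters
clean = raw.replace('\n', '')    # drop the line breaks
print(len(clean))                # residues only
print(clean)
```

    Cleaning the string first matters as soon as counts are turned into fractions of the sequence length.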

    >>>

    >>> for amino in "ABCDEFGHIKLMNPQRSTVWY":

    ...    number = telomerase.count(amino)

    ...    print(amino , number) 

    ...

    A 56

    B 0

    C 32

    D 46

    E 46

    F 59

    G 64

    H 28

    I 47

    K 73

    L 146

    M 24

    N 31

    P 44

    Q 54

    R 79

    S 83

    T 46

    V 73

    W 11

    Y 32

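    OK, now it is obvious that L (Leu, leucine) is the most frequent and B is the least frequent, with none at all. But something suddenly looks a bit off, so let's take a look with len():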

    >>> len('ABCDEFGHIKLMNPQRSTVWY')

    21

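    And suddenly it all makes sense: there are only 20 standard amino acids, but my string has 21 letters, because B is not one of them. Next time I get the one-letter aa codes wrong, I'm going to go face the wall and reflect on it!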
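
    To avoid this kind of slip, the alphabet can be written down once and checked. Below is a rough sketch, not the only way to do it: STANDARD_AA and aa_frequencies are names I made up, and collections.Counter from the standard library replaces the repeated .count() calls. It also returns each count as a fraction of the sequence length, so it reports actual frequencies.

```python
from collections import Counter

STANDARD_AA = 'ACDEFGHIKLMNPQRSTVWY'   # the 20 standard one-letter codes
assert len(STANDARD_AA) == 20          # guards against a stray letter like B

def aa_frequencies(seq):
    """Return {amino acid: (count, fraction)} for the 20 standard residues."""
    seq = seq.replace('\n', '').upper()   # drop line breaks from pasted text
    counts = Counter(seq)                 # missing letters count as 0
    total = len(seq)                      # includes any non-standard letters
    return {aa: (counts[aa], counts[aa] / total) for aa in STANDARD_AA}

# The made-up sequence from the start of this post.
proseq = 'AWJFBAJDKAJNFKAFLMMALFWSHFBJSBFKJANDKAJNDKJQNKJQK'
freqs = aa_frequencies(proseq)

# max() keeps the first letter it meets when several share the top count.
top = max(freqs, key=lambda aa: freqs[aa][0])
print(top, freqs[top])
```

    For the telomerase question, calling aa_frequencies(telomerase) and taking the maximum in the same way should single out the most frequent residue without reading the whole table by eye.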
