问题描述

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].

思路分析

蠢蠢的我以为要枚举长度为10的子串并用KMP算法进行匹配，因此去复习了一波KMP，然后就超时啦（手动再见）。
正确的做法是，由于字母只有ATCG四个，可以根据字母对一个字符串进行编码，也可以理解为将ATCG对应为0123，即将字符串转换为4进制数，然后对数字进行匹配就好了。其中还需要注意，不用每次都重算字符串的encode，因为前一个字符串的值后一个可以利用，而因为10个字母正好对应20位二进制数，因此可以通过“&0xFFFFF”来去掉10个字母之前的值，最后是注意用dict提高速度。

AC代码

class Solution(object):
    def findRepeatedDnaSequences(self, s):
        """
        :type s: str
        :rtype: List[str]
        """
        n = len(s)
        rst = set()
        once = set()
        code = 0
        code_map = {'A':0, 'C':1, 'G':2, 'T':3}
        for i in range(n):
            code = code * 4 + code_map[s[i]] & 0xFFFFF
            if i < 9:
                continue
            if code in once:
                rst.add(s[i-9 : i+1])
            else:
                once.add(code)
        return list(rst)

KMP复习

def getNext(p):
    n = len(p)
    next = [0 for i in range(n)]
    k = 0
    for i in range(1, n):
        while k > 0 and p[k] != p[i]:
            k = next[k-1]
        if p[k] == p[i]:
            k += 1
        next[i] = k
    return next

def kmp(s, p):
    i = 0
    j = 0
    n = len(s)
    m = len(p)
    next = getNext(p)
    while j < m and i < n:
        if s[i] == p[j]:
            i += 1
            j += 1
        else:
            if j == 0:
                i += 1
            else:
                j = next[j-1]
    if j == m:
        return True
    else:
        return False