美文网首页算法散记
Find first occurrence of W in S

Find first occurrence of W in S

作者: DjangoW | 来源:发表于2018-08-15 23:44 被阅读0次

    1. Problem

    Find the start position of first occurrence of String W in String S? (Leetcode Address)

    2. Simple Solution

    Each round (i from 0 to len(S)-len(W)), start from the ith position and match if the following substring match W or not, if match return i, go to next round( start from i+1 position) if not. Time complexity is O(n*m).
    Example:S = "THIS IS A SIMPLE EXAMPLE" W = "EXAMPLE"

    Solution:

    round 1:

    THIS IS A SIMPLE EXAMPLE
    EXAMPLE
    

    round 2:

    THIS IS A SIMPLE EXAMPLE
     EXAMPLE
    

    ......
    round 18:

    THIS IS A SIMPLE EXAMPLE
                     EXAMPLE
    

    match successful, return position of E (17).

    Show me the code (haystack is S, needle is W):

        public int strStr(String haystack, String needle) {
            if(needle.equals("")) return 0;
            if(needle.length() > haystack.length()) return -1;
            for(int i = 0; i < haystack.length(); i ++){
                if(haystack.length() - i < needle.length()) return -1;
                int j = 0;
                while(j < needle.length()){
                    if(haystack.charAt(i+j) != needle.charAt(j)) break;
                    j ++;
                }
                if(j == needle.length()) return i;
            }
            return -1;
        }
    

    3. KMP Alg.

    KMP is the abbr. of Knuth-Morris-Pratt, one of the authors of this algorithm. The other two are Donald Knuth and Vaughan Pratt.

    They solve the above problem in a more efficient way. Let's explain it using an example.
    The last example is not good enough for explain this algorithm, let's change another one: S = "BBCABCDAB ABCDABCDABDE", W = "ABCDABD"

    Solution:

    pre-process:(partial match table)
    traversal W once (O(m)) we can get table shown below, the 1 under A represents how many char match from the beginning.

    A B C D A B D
    0 0 0 0 1 2 0

    round 1:

    BBCABCDAB ABCDABCDABDE
    ABCDABD
    

    round 4:

    BBCABCDAB ABCDABCDABDE
       ABCDABD
    

    ' ' and D not match, and ' ' not part of ABCDABD, so move to the position next to ' '.
    round 5:

    BBCABCDAB ABCDABCDABDE
              ABCDABD
    

    as can be seen that till D, already 6 chars match, and partial match number is 2, so number of positions to move is num(match) - num(partial match) = (6 -2) = 4. And comparing should start from num(partial match) + 1 = 2 + 1 = 3 after finish moving. Then we get round 11.
    round 6:

    BBCABCDAB ABCDABCDABDE
                  ABCDABD
    

    start comparing from num(partial match) + 1 = 2 + 1 = 3 position of ABCDABD, match successful, return position of first element (10).
    As we can see, if use the simple solution, it would take 15 rounds to find the position, but with KMP it only takes 6 round to find it.

    Time Complexity:
    table algorithm is O(k), where k is the length of W. the traversal algorithm is O(n), where n is the length of S. so the time complexity of the whole algorithm is O(k+n).

    Show me the code!

    public static int kmp(String S, String W) {
            if(W.length() == 0) return 0;
            //calculate partial table
            int count = 0;
            Set<Character> map = new HashSet<>();
            int[] partialTable = new int[W.length()];
            partialTable[0] = 0;
            map.add(W.charAt(0));
            for(int k = 1; k < W.length(); k ++) {
                map.add(W.charAt(k));
                if(W.charAt(k) == W.charAt(count)) {
                    partialTable[k] = (++count);
                }else {
                    if(count != 0 && W.charAt(k) == W.charAt(k - 1) && partialTable[k - 1] == partialTable[count -1] + 1) {//deal with special case (last 'a' in 'aabaaa')
                        partialTable[k] = count;
                    }else {
                        count = 0;
                        if(W.charAt(k) == W.charAt(count)) partialTable[k] = (++count);
                        else partialTable[k] = count;
                    }
                }
            }
            // start to find if W exist in S
            int i = 0, j = 0;
            while (i <= S.length() - W.length()) {
                for(; j < W.length(); j ++) {
                    if(S.charAt(i + j) == W.charAt(j)) {
                        if(j == W.length() - 1) return i;
                    }else {
                        if(!map.contains(S.charAt(i + j))) {
                            i += (j+1);
                            j = 0;
                            break;
                        }
                        if(j == 0) {
                            i ++;
                            j = 0;
                            break;
                        }
                        i += (j - partialTable[j - 1]); //jump to the proper position
                        j = partialTable[j - 1]; //jump the one already compared
                        break;
                    }
                }
            }
            return -1;
        }
    

    4. BM Alg.

    BM algorithm was created by Boyer-Moore and J Strother Moore. Which is more efficient that KMP in most cases, especially in real world cases. That's why BM is widely used in content matching in files in real world.

    Back to the example:S = "THIS IS A SIMPLE EXAMPLE" W = "EXAMPLE"

    Solution:

    round 1:

    THIS IS A SIMPLE EXAMPLE
    EXAMPLE
    

    start matching from the end of EXAMPLE, E not match, and S not in EXAMPLE, so direct jump to the position after S:
    round 2:

    THIS IS A SIMPLE EXAMPLE
           EXAMPLE
    

    E not match P, P is treated as a "bad character" and P's last position in EXAMPLE is 4 (count from 0), so should move pst(mapped by 'P' in 'EXAMPLE') - pst(last 'P' in 'EXAMPLE') = 6 - 4 = 2 steps forward.
    round 3:

    THIS IS A SIMPLE EXAMPLE
             EXAMPLE
    

    As we can see, MPLE match (treat as "good character"), A and I not match ("bad character").
    good character rule: pst(from 0, good character's position) - pst(last position of good character) = 3 - (-1) = 4
    bad character rule: pst(bad character's positon) - pst(last position of bad character) = 2-(-1) = 3
    4 > 3, so move 4 steps forward.
    round 4:

    THIS IS A SIMPLE EXAMPLE
                 EXAMPLE
    

    E not match A, A is bad character, its current mapping position is 6, A's last position in EXAMPLE is 2, so should move 6 - 2 = 4 steps forward.
    round 5:

    THIS IS A SIMPLE EXAMPLE
                     EXAMPLE
    

    match, return start position in S: 17.

    Show me the code!

        public static int bm(String S, String W) {
            if(W.length() == 0) return 0;
            Map<Character, List<Integer>> map = new HashMap<>();// to record each char's position
            for(int i = 0; i < W.length(); i ++) {
                if(map.get(W.charAt(i)) == null) 
                    map.put(W.charAt(i), new ArrayList<>());
                map.get(W.charAt(i)).add(i);
            }
            int stepNum = 0;
            for(int k = 0; k <= S.length() - W.length(); k += stepNum) {
                for(int m = W.length() - 1; m >= 0; m --) {
                    if(S.charAt(k + m) == W.charAt(m)) {//good char
                        if(m == 0) return k;//match found
                    }else {//bad char
                        // steps calculate based on bad char 
                        int badStepNum = 0;
                        List<Integer> list = map.get(S.charAt(k + m));
                        if(list == null) {//not exist in W
                            badStepNum = m - (-1);
                        }else {
                            int badCharLastPst = -1;
                            for(int i = list.size() - 1; i >= 0; i --)
                                if(m > list.get(i) && m - list.get(i) < m - badCharLastPst)
                                    badCharLastPst = list.get(i);
                            badStepNum = m - badCharLastPst;
                        }
                        // steps calculate based on good char 
                        int goodStepNum = 0;
                        if(m < W.length() - 1) {
                            list = map.get(S.charAt(k + m + 1));
                            if(list == null) {//not exist in W
                                goodStepNum = (m + 1) - (-1);
                            }else {
                                int goodCharLastPst = -1;
                                for(int i = list.size() - 1; i >= 0; i --)
                                    if(m + 1 > list.get(i) && m + 1 - list.get(i) < m + 1 - goodCharLastPst)
                                        goodCharLastPst = list.get(i);
                                goodStepNum = m + 1 - goodCharLastPst;
                            }
                        }
                        stepNum = Math.max(badStepNum, goodStepNum);
                        break;
                    }
                }
            }
            return -1;
        }
    

    Reference

    https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm
    https://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm

    相关文章

      网友评论

        本文标题:Find first occurrence of W in S

        本文链接:https://www.haomeiwen.com/subject/xejgbftx.html