Find first occurrence of W in S

作者: DjangoW | 来源:发表于2018-08-15 23:44 被阅读0次

1. Problem

Find the start position of first occurrence of String W in String S? (Leetcode Address)

2. Simple Solution

Each round (i from 0 to len(S)-len(W)), start from the ith position and match if the following substring match W or not, if match return i, go to next round( start from i+1 position) if not. Time complexity is O(n*m).
Example：S = "THIS IS A SIMPLE EXAMPLE" W = "EXAMPLE"

Solution:

round 1:

THIS IS A SIMPLE EXAMPLE
EXAMPLE

round 2:

THIS IS A SIMPLE EXAMPLE
 EXAMPLE

......
round 18：

THIS IS A SIMPLE EXAMPLE
                 EXAMPLE

match successful, return position of E (17).

Show me the code (haystack is S, needle is W)：

    public int strStr(String haystack, String needle) {
        if(needle.equals("")) return 0;
        if(needle.length() > haystack.length()) return -1;
        for(int i = 0; i < haystack.length(); i ++){
            if(haystack.length() - i < needle.length()) return -1;
            int j = 0;
            while(j < needle.length()){
                if(haystack.charAt(i+j) != needle.charAt(j)) break;
                j ++;
            }
            if(j == needle.length()) return i;
        }
        return -1;
    }

3. KMP Alg.

KMP is the abbr. of Knuth-Morris-Pratt, one of the authors of this algorithm. The other two are Donald Knuth and Vaughan Pratt.

They solve the above problem in a more efficient way. Let's explain it using an example.
The last example is not good enough for explain this algorithm, let's change another one: S = "BBCABCDAB ABCDABCDABDE", W = "ABCDABD"

Solution:

pre-process:(partial match table)
traversal W once (O(m)) we can get table shown below, the 1 under A represents how many char match from the beginning.

A	B	C	D	A	B	D
0	0	0	0	1	2	0

round 1:

BBCABCDAB ABCDABCDABDE
ABCDABD

round 4:

BBCABCDAB ABCDABCDABDE
   ABCDABD

' ' and D not match, and ' ' not part of ABCDABD, so move to the position next to ' '.
round 5：

BBCABCDAB ABCDABCDABDE
          ABCDABD

as can be seen that till D, already 6 chars match, and partial match number is 2, so number of positions to move is num(match) - num(partial match) = (6 -2) = 4. And comparing should start from num(partial match) + 1 = 2 + 1 = 3 after finish moving. Then we get round 11.
round 6：

BBCABCDAB ABCDABCDABDE
              ABCDABD

start comparing from num(partial match) + 1 = 2 + 1 = 3 position of ABCDABD, match successful, return position of first element (10).
As we can see, if use the simple solution, it would take 15 rounds to find the position, but with KMP it only takes 6 round to find it.

Time Complexity:
table algorithm is O(k), where k is the length of W. the traversal algorithm is O(n), where n is the length of S. so the time complexity of the whole algorithm is O(k+n).

Show me the code!

public static int kmp(String S, String W) {
        if(W.length() == 0) return 0;
        //calculate partial table
        int count = 0;
        Set<Character> map = new HashSet<>();
        int[] partialTable = new int[W.length()];
        partialTable[0] = 0;
        map.add(W.charAt(0));
        for(int k = 1; k < W.length(); k ++) {
            map.add(W.charAt(k));
            if(W.charAt(k) == W.charAt(count)) {
                partialTable[k] = (++count);
            }else {
                if(count != 0 && W.charAt(k) == W.charAt(k - 1) && partialTable[k - 1] == partialTable[count -1] + 1) {//deal with special case (last 'a' in 'aabaaa')
                    partialTable[k] = count;
                }else {
                    count = 0;
                    if(W.charAt(k) == W.charAt(count)) partialTable[k] = (++count);
                    else partialTable[k] = count;
                }
            }
        }
        // start to find if W exist in S
        int i = 0, j = 0;
        while (i <= S.length() - W.length()) {
            for(; j < W.length(); j ++) {
                if(S.charAt(i + j) == W.charAt(j)) {
                    if(j == W.length() - 1) return i;
                }else {
                    if(!map.contains(S.charAt(i + j))) {
                        i += (j+1);
                        j = 0;
                        break;
                    }
                    if(j == 0) {
                        i ++;
                        j = 0;
                        break;
                    }
                    i += (j - partialTable[j - 1]); //jump to the proper position
                    j = partialTable[j - 1]; //jump the one already compared
                    break;
                }
            }
        }
        return -1;
    }

4. BM Alg.

BM algorithm was created by Boyer-Moore and J Strother Moore. Which is more efficient that KMP in most cases, especially in real world cases. That's why BM is widely used in content matching in files in real world.

Back to the example：S = "THIS IS A SIMPLE EXAMPLE" W = "EXAMPLE"

Solution:

round 1:

THIS IS A SIMPLE EXAMPLE
EXAMPLE

start matching from the end of EXAMPLE, E not match, and S not in EXAMPLE, so direct jump to the position after S:
round 2:

THIS IS A SIMPLE EXAMPLE
       EXAMPLE

E not match P, P is treated as a "bad character" and P's last position in EXAMPLE is 4 (count from 0), so should move pst(mapped by 'P' in 'EXAMPLE') - pst(last 'P' in 'EXAMPLE') = 6 - 4 = 2 steps forward.
round 3:

THIS IS A SIMPLE EXAMPLE
         EXAMPLE

As we can see, MPLE match (treat as "good character"), A and I not match ("bad character").
good character rule: pst(from 0, good character's position) - pst(last position of good character) = 3 - (-1) = 4
bad character rule: pst(bad character's positon) - pst(last position of bad character) = 2-(-1) = 3
4 > 3, so move 4 steps forward.
round 4:

THIS IS A SIMPLE EXAMPLE
             EXAMPLE

E not match A, A is bad character, its current mapping position is 6, A's last position in EXAMPLE is 2, so should move 6 - 2 = 4 steps forward.
round 5:

THIS IS A SIMPLE EXAMPLE
                 EXAMPLE

match, return start position in S: 17.

Show me the code!

    public static int bm(String S, String W) {
        if(W.length() == 0) return 0;
        Map<Character, List<Integer>> map = new HashMap<>();// to record each char's position
        for(int i = 0; i < W.length(); i ++) {
            if(map.get(W.charAt(i)) == null) 
                map.put(W.charAt(i), new ArrayList<>());
            map.get(W.charAt(i)).add(i);
        }
        int stepNum = 0;
        for(int k = 0; k <= S.length() - W.length(); k += stepNum) {
            for(int m = W.length() - 1; m >= 0; m --) {
                if(S.charAt(k + m) == W.charAt(m)) {//good char
                    if(m == 0) return k;//match found
                }else {//bad char
                    // steps calculate based on bad char 
                    int badStepNum = 0;
                    List<Integer> list = map.get(S.charAt(k + m));
                    if(list == null) {//not exist in W
                        badStepNum = m - (-1);
                    }else {
                        int badCharLastPst = -1;
                        for(int i = list.size() - 1; i >= 0; i --)
                            if(m > list.get(i) && m - list.get(i) < m - badCharLastPst)
                                badCharLastPst = list.get(i);
                        badStepNum = m - badCharLastPst;
                    }
                    // steps calculate based on good char 
                    int goodStepNum = 0;
                    if(m < W.length() - 1) {
                        list = map.get(S.charAt(k + m + 1));
                        if(list == null) {//not exist in W
                            goodStepNum = (m + 1) - (-1);
                        }else {
                            int goodCharLastPst = -1;
                            for(int i = list.size() - 1; i >= 0; i --)
                                if(m + 1 > list.get(i) && m + 1 - list.get(i) < m + 1 - goodCharLastPst)
                                    goodCharLastPst = list.get(i);
                            goodStepNum = m + 1 - goodCharLastPst;
                        }
                    }
                    stepNum = Math.max(badStepNum, goodStepNum);
                    break;
                }
            }
        }
        return -1;
    }

Reference

https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm
https://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm

Find first occurrence of W in S

1. Problem

2. Simple Solution

Solution:

3. KMP Alg.

Solution:

4. BM Alg.

Solution:

Reference

相关文章

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

算法散记