AC算法(Aho-Corasick)

作者: 致虑 | 来源:发表于2018-08-30 20:10 被阅读0次

AC算法(Aho-Corasick)
AC自动机-去除敏感字符
Aho-Corasick 多模式匹配算法、AC自动机详解
敏感词过滤的算法原理之 Aho-Corasick 算法
AC自动机
AC自动机
AC自动机专题整理
LeetCode127. 单词接龙
敏感词过滤算法Aho-Corasick
Go-汉诺塔递归算法

Aho-Corasick Algorithm 简称简称AC算法，通过将模式串预处理为确定有限状态自动机，扫描文本一遍就能结束。其复杂度为O(n)，即与模式串的数量和长度无关；与其相当的就是Wu-Manber algorithm了（由吳昇博士跟UdiManber所提出）。

AC算法的主要思想就是构造的有限状态自动机，根据有限状态自动机会根据输入进行模式串匹配。有限状态自动机会随着字符的输入而发生状态转移，转移的状态有如下三种：

1.success状态，即AC自动机根据输入有能直接到达的状态（没有发生跳转）；

2.failure状态，即AC自动机根据输入没有直接到达的状态，这时候就会发生跳转，跳转到其他一个路径（比如AC根节点就是其第一个孩子的所有failure状态）

3.output状态，即成功匹配到一个输入段

以上三个阶段分别对应算法中的三个步骤：

1.建立Pattern tree；即建立自动机，简单来说就是根据输入的字符串构造一棵“树”；

2.建立failure状态，即在每个叶子节点上加上failure状态（根节点不需要），即标注当前输入串到当前叶子节点时，若不能继续匹配所能跳转的路径；

3.比对text，即成功到达output状态的时候，代表一次匹配成功。

举例说明：

如果有ABCD,BELIEVE,ELF,ELEPHANT,PHANTOM五个字符串，则构成AC自动机如下：

image.png

其中黑色箭头代表success状态走向，红色箭头代表failure状态走向（其余所有节点没有标注红色箭头的全部指向根节点，包括根节点本身），红色节点代表output状态；即输入一个模式串，沿着上面的pattern tree状态走，只要走到红色节点，代表一次匹配成功。

匹配流程：输入一个字符串，先匹配success状态（程序中用一个数组记录AC自动机每个节点的success状态），若success状态匹配无法匹配，则匹配failure状态（程序中用一个数组记录AC自动机每个节点的failure状态），直到找到output节点。

代码细节：代码主要参照GitHub上面，写的非常清晰，非常好，这里主要截取几个重要的方法：

1.通过添加模式串，构造trie树，即我们上述所讲的pattern树（此过程会建立success状态表）：

/**
     * 添加一个模式串
     *
     * @param keyword
     */
    public void addKeyword(String keyword) {
        if (keyword == null || keyword.length() == 0) {
            return;
        }
        State currentState = this.rootState;
        for (Character character : keyword.toCharArray()) {
            currentState = currentState.addState(character);
        }
        currentState.addEmit(keyword);
    }

image.gif

使用方法：

   Trie trie = new Trie();
        trie.addKeyword("ABCD");
        trie.addKeyword("BELIEVE");
        trie.addKeyword("ELF");
        trie.addKeyword("ELEPHANT");

image.gif

2.建立failure状态表：

/**
     * 建立failure表
     */
    private void constructFailureStates() {

        //阻塞队列
        Queue<State> queue = new LinkedBlockingDeque<State>();

        // 第一步，将深度为1的节点的failure设为根节点
        for (State depthOneState : this.rootState.getStates()) {
            depthOneState.setFailure(this.rootState);
            queue.add(depthOneState);
        }
        this.failureStatesConstructed = true;

        // 第二步，为深度 > 1 的节点建立failure表，这是一个bfs
        while (!queue.isEmpty()) {
            State currentState = queue.remove();

            for (Character transition : currentState.getTransitions()) {
                State targetState = currentState.nextState(transition);
                queue.add(targetState);

                State traceFailureState = currentState.failure();
                while (traceFailureState.nextState(transition) == null) {
                    traceFailureState = traceFailureState.failure();
                }
                State newFailureState = traceFailureState.nextState(transition);
                targetState.setFailure(newFailureState);
                targetState.addEmit(newFailureState.emit());
            }
        }
    }

image.gif

3.状态跳转：

/**
     * 跳转到下一个状态
     *
     * @param currentState
     *            当前状态
     * @param character
     *            接受字符
     * @return 跳转结果
     */
    private static State getState(State currentState, Character character) {
        State newCurrentState = currentState.nextState(character); // 先按success跳转
        while (newCurrentState == null) // 跳转失败的话，按failure跳转
        {
            currentState = currentState.failure();
            newCurrentState = currentState.nextState(character);
        }
        return newCurrentState;
    }

image.gif

4.匹配过程：

/**
     * 模式匹配
     *
     * @param text
     *            待匹配的文本
     * @return 匹配到的模式串
     */
    @SuppressWarnings("unchecked")
    public Collection<Emit> parseText(String text) {
        checkForConstructedFailureStates();

        int position = 0;
        State currentState = this.rootState;
        List<Emit> collectedEmits = new ArrayList<Emit>();
        for (Character character : text.toCharArray()) {
            if (trieConfig.isCaseInsensitive()) {
                character = Character.toLowerCase(character);
            }
            currentState = getState(currentState, character);
            storeEmits(position, currentState, collectedEmits);
            ++position;
        }

        if (trieConfig.isOnlyWholeWords()) {
            removePartialMatches(text, collectedEmits);
        }

        if (!trieConfig.isAllowOverlaps()) {
            IntervalTree intervalTree = new IntervalTree((List<Intervalable>) (List<?>) collectedEmits);
            intervalTree.removeOverlaps((List<Intervalable>) (List<?>) collectedEmits);
        }

        return collectedEmits;
    }