
Custom Tokenizers

Author: 烂泥_119c | Published 2021-02-21 22:27

    Preface

    Elasticsearch owes its fast full-text search not only to the idea of the inverted index at its core, but also to its analyzers.

    Analyzers

    • Elasticsearch ships with a number of commonly used analyzers. An analyzer is built from three kinds of components:
      • character filter: pre-processes the text before it is tokenized, e.g. stripping HTML tags
      • tokenizer: splits the text into individual tokens
      • token filter: post-processes the emitted tokens, e.g. lowercase conversion
    • Order: character filter -> tokenizer -> token filter
    • Cardinality: character filters (0 or more) + tokenizer (exactly 1) + token filters (0 or more)
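    • To make the pipeline concrete, here is a minimal sketch of index settings that combine all three kinds of components: the built-in html_strip character filter, the standard tokenizer and the lowercase token filter. The analyzer name html_lower is made up for this example; the body is sent when creating an index:
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "html_lower": {
              "type": "custom",
              "char_filter": ["html_strip"],
              "tokenizer": "standard",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }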

    Built-in analyzers

    • Elasticsearch comes with the following built-in analyzers (a quick way to try any of them with the _analyze API is shown after the list):
    Standard Analyzer - the default; splits on word boundaries and lowercases
    Simple Analyzer - splits on non-letter characters (symbols are dropped) and lowercases
    Stop Analyzer - lowercases and removes stop words (the, a, is, ...)
    Whitespace Analyzer - splits on whitespace, does not lowercase
    Keyword Analyzer - no tokenization; the whole input becomes a single token
    Pattern Analyzer - splits on a regular expression, \W+ (non-word characters) by default
    Language Analyzers - analyzers for 30+ common languages
    Custom Analyzer - an analyzer you assemble yourself
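    • For example, posting the following body to the _analyze endpoint runs the standard analyzer over a sample sentence and returns the resulting tokens (the sample text is arbitrary):
    {
      "analyzer": "standard",
      "text": "The QUICK brown fox"
    }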
    
    • From these building blocks we can define simple custom analyzers, for example one that splits on commas (a request for testing it follows the settings):
    {
     "settings":{
      "analysis":{
        "analyzer":{
          "comma":{
            "type":"pattern",
            "pattern":","
          }
        }
      }
     }
    }
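    • Assuming the settings above were applied when creating an index, a request like the following to that index's _analyze endpoint should return the three tokens apple, banana and orange:
    {
      "analyzer": "comma",
      "text": "apple,banana,orange"
    }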
    
    • Or we can pick a tokenizer and token filters ourselves and assemble a new analyzer (an example of applying it to a field follows the settings):
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "std_folded": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase",
                            "asciifolding"
                        ]
                    }
                }
            }
        }
    }
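    • To actually use std_folded, reference it from a field mapping in the same create-index request. A minimal sketch, where the _doc type matches the 6.x mapping format assumed by this article and the field name title is made up:
    {
      "mappings": {
        "_doc": {
          "properties": {
            "title": {
              "type": "text",
              "analyzer": "std_folded"
            }
          }
        }
      }
    }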
    

    Custom analyzers

    • Not every requirement can be met by assembling the built-in components, and for some special needs the built-in tokenizers are hard to bend into shape; in that case we can try writing our own analyzer. As a running example, take "consecutive substring" tokenization: given a string, the output should contain every consecutive run of 3 letters, 4 letters, 5 letters, and so on.
      Well... this particular case can in fact still be covered by Elasticsearch's built-in ngram tokenizer, as below (the example uses min_gram 4; lowering it to 3 would also cover the 3-letter runs). A test request follows the settings:
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "ngram",
              "min_gram": 4,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          }
        }
        }
      }
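    • With these settings applied to an index, an _analyze request against that index such as the one below should return every 4- to 7-character substring of the input (elas, last, asti, stic, elast, ..., elastic):
    {
      "analyzer": "my_analyzer",
      "text": "elastic"
    }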
    

    Custom plugin implementation

    Here we use a whitespace tokenizer as the example.

    The pom file
      <properties>
        <elasticsearch.version>6.5.4</elasticsearch.version>
        <lucene.version>7.5.0</lucene.version>
        <maven.compiler.target>1.8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>
    
      <dependencies>
        <dependency>
          <groupId>org.elasticsearch</groupId>
          <artifactId>elasticsearch</artifactId>
          <version>${elasticsearch.version}</version>
          <scope>provided</scope>
        </dependency>
      </dependencies>
    
      <build>
        <resources>
          <resource>
            <directory>src/main/resources</directory>
            <filtering>false</filtering>
            <excludes>
              <exclude>*.properties</exclude>
            </excludes>
          </resource>
        </resources>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.6</version>
            <configuration>
              <appendAssemblyId>false</appendAssemblyId>
              <outputDirectory>${project.build.directory}/releases/</outputDirectory>
              <descriptors>
                <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
              </descriptors>
            </configuration>
            <executions>
              <execution>
                <phase>package</phase>
                <goals>
                  <goal>single</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.5.1</version>
            <configuration>
              <source>${maven.compiler.target}</source>
              <target>${maven.compiler.target}</target>
            </configuration>
          </plugin>
        </plugins>
      </build>
    
    • Note that plugin.xml is referenced here and the static resource files are configured.
    plugin.xml (note the file location: src/main/assemblies/plugin.xml, as referenced in the pom)
    <?xml version="1.0"?>
    <assembly>
      <id>my-analysis</id>
      <formats>
        <format>zip</format>
      </formats>
      <includeBaseDirectory>false</includeBaseDirectory>
      <files>
        <file>
          <source>${project.basedir}/src/main/resources/my.properties</source>
          <outputDirectory/>
          <filtered>true</filtered>
        </file>
      </files>
      <dependencySets>
        <dependencySet>
          <outputDirectory/>
          <useProjectArtifact>true</useProjectArtifact>
          <useTransitiveFiltering>true</useTransitiveFiltering>
          <excludes>
            <exclude>org.elasticsearch:elasticsearch</exclude>
          </excludes>
        </dependencySet>
      </dependencySets>
    </assembly>
    
    • The assembly bundles my.properties (note that Elasticsearch expects the descriptor inside the plugin zip to be named plugin-descriptor.properties, so the file usually has to be renamed accordingly when packaging):
    my.properties
    description=${project.description}
    version=${project.version}
    name=${project.name}
    classname=com.test.plugin.MyPlugin
    java.version=${maven.compiler.target}
    elasticsearch.version=${elasticsearch.version}
    
    • The classname property points to our plugin class.
    The code
    • The analyzer
    package com.test.index.analysis;
    
    import org.apache.lucene.analysis.Analyzer;
    
    /**
     * @author phil.zhang
     * @date 2021/2/21
     */
    public class MyAnalyzer extends Analyzer {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        MyTokenizer myTokenizer = new MyTokenizer();
        return new TokenStreamComponents(myTokenizer);
      }
    }
    
    • The analyzer provider
    package com.test.index.analysis;
    
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.env.Environment;
    import org.elasticsearch.index.IndexSettings;
    import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
    
    /**
     * @author phil.zhang
     * @date 2021/2/21
     */
    public class MyAnalyzerProvider extends AbstractIndexAnalyzerProvider<MyAnalyzer> {
      private MyAnalyzer myAnalyzer;
      public MyAnalyzerProvider(IndexSettings indexSettings,Environment environment, String name, Settings settings) {
        super(indexSettings,name,settings);
        myAnalyzer = new MyAnalyzer();
      }
      @Override
      public MyAnalyzer get() {
        return myAnalyzer;
      }
    }
    
    • The tokenizer (the core logic)
    package com.test.index.analysis;
    
    import java.io.IOException;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    
    /**
     * @author phil.zhang
     * @date 2021/2/21
     */
    public class MyTokenizer extends Tokenizer {
      private final StringBuilder buffer = new StringBuilder();
      /** start offset of the token currently being built **/
      private int tokenStart = 0;
      /** end offset of the token currently being built **/
      private int tokenEnd = 0;
      /** register the attributes: every token emitted here carries its term text and its offsets **/
      private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
      private final OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
    
      @Override
      public boolean incrementToken() throws IOException {
        clearAttributes();
        buffer.setLength(0); // clear the term buffer
        int ci;
        char ch;
        tokenStart = tokenEnd;
        // read the first character for this call
        ci = input.read();
        ch = (char)ci;
        while (true) {
          if (ci == -1) {
            // end of input
            if (buffer.length() == 0) {
              // nothing buffered: tokenization is finished
              return false;
            } else {
              // emit the last buffered token
              termAttribute.setEmpty().append(buffer);
              offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
              return true;
            }
          } else if (ch == ' ') {
            // hit a space
            if (buffer.length() > 0) {
              // emit the buffered token; its end offset is the position of the space
              termAttribute.setEmpty().append(buffer);
              offsetAttribute.setOffset(correctOffset(tokenStart), correctOffset(tokenEnd));
              tokenEnd++; // step over the space so the next token starts after it
              return true;
            } else {
              // leading space: skip it and move the token start forward
              tokenEnd++;
              tokenStart = tokenEnd;
              ci = input.read();
              ch = (char) ci;
            }
          } else {
            // not a space: keep appending to the current token
            buffer.append(ch);
            tokenEnd++;
            ci = input.read();
            ch = (char) ci;
          }
        }
      }
    
      @Override
      public void end() throws IOException {
        super.end();
        // the final offset is the position just past the last character consumed
        int finalOffset = correctOffset(tokenEnd);
        offsetAttribute.setOffset(finalOffset, finalOffset);
      }
    
      @Override
      public void reset() throws IOException {
        super.reset();
        tokenStart = tokenEnd = 0;
      }
    }
    
    • The tokenizer factory
    package com.test.index.analysis;
    
    import org.apache.lucene.analysis.Tokenizer;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.env.Environment;
    import org.elasticsearch.index.IndexSettings;
    import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
    
    /**
     * @author phil.zhang
     * @date 2021/2/21
     */
    public class MyTokenizerFactory extends AbstractTokenizerFactory {
    
      public MyTokenizerFactory(IndexSettings indexSettings,Environment environment,String ignored, Settings settings) {
        super(indexSettings,ignored,settings);
      }
    
      @Override
      public Tokenizer create() {
        return new MyTokenizer();
      }
    }
    
    • The plugin class
    package com.test.plugin;
    
    import com.test.index.analysis.MyAnalyzerProvider;
    import com.test.index.analysis.MyTokenizerFactory;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.elasticsearch.index.analysis.AnalyzerProvider;
    import org.elasticsearch.index.analysis.TokenizerFactory;
    import org.elasticsearch.indices.analysis.AnalysisModule;
    import org.elasticsearch.plugins.AnalysisPlugin;
    import org.elasticsearch.plugins.Plugin;
    
    /**
     * @author phil.zhang
     * @date 2021/2/21
     */
    public class MyPlugin extends Plugin implements AnalysisPlugin {
    
      @Override
      public Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisModule.AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
        extra.put("my-word", MyTokenizerFactory::new);
        return extra;
      }
      @Override
      public Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
    
        Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> extra = new HashMap<>();
        extra.put("my-word", MyAnalyzerProvider::new);
        return extra;
      }
    }
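    • Once the plugin is packaged and installed, the names registered above can be referenced like any built-in analyzer. A minimal sketch of a field mapping that relies on it (the _doc type and the field name content are made up for the example):
    {
      "mappings": {
        "_doc": {
          "properties": {
            "content": {
              "type": "text",
              "analyzer": "my-word"
            }
          }
        }
      }
    }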
    
    Next steps

    That completes the code. You can run a quick self-test to check the result, package the plugin with the Maven command, and then follow the usual procedure for installing an analysis plugin, which is not covered further here.
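    For the self-test, one simple check (assuming the plugin has been installed under the analyzer name my-word registered above) is to post a body like the following to the _analyze endpoint and confirm that it returns the tokens hello, elastic and world:
    {
      "analyzer": "my-word",
      "text": "hello elastic world"
    }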
