Elasticsearch - Analyzers

Author: _吱吱呀呀 | Published: 2018-04-14 13:59

    1. Registering analyzers

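    Analyzers, tokenizers, and filters can be configured in elasticsearch.yml. (This applies to older releases: since Elasticsearch 5.0, index-level settings such as these are set per index rather than in elasticsearch.yml.)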

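    index :
        analysis :
            analyzer :
                # redefine the built-in standard analyzer with custom stop words
                standard :
                    type : standard
                    stopwords : [stop1, stop2]
                # standard analyzer variant with a longer maximum token length
                myAnalyzer1 :
                    type : standard
                    stopwords : [stop1, stop2, stop3]
                    max_token_length : 500
                # custom analyzer: one tokenizer plus a chain of token filters
                myAnalyzer2 :
                    tokenizer : standard
                    filter : [standard, lowercase, stop]
            tokenizer :
                myTokenizer1 :
                    type : standard
                    max_token_length : 900
                myTokenizer2 :
                    type : keyword
                    buffer_size : 512
            filter :
                # stop filter with its own stop word list
                myTokenFilter1 :
                    type : stop
                    stopwords : [stop1, stop2, stop3, stop4]
                # keeps only tokens whose length lies between min and max
                myTokenFilter2 :
                    type : length
                    min : 0
                    max : 2000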
    

    analyzer: ES ships with a number of built-in analyzers. You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters.

    index :
        analysis :
            analyzer :
                myAnalyzer :
                    tokenizer : standard
                    filter : [standard, lowercase, stop]
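
To make the composition concrete, here is a minimal Python sketch of how a custom analyzer chains its parts: character filters run on the raw text, one tokenizer splits it, and token filters then run in order. This is a toy model, not Elasticsearch code; `make_analyzer`, `word_tokenizer`, and the tiny `STOP_WORDS` list are hypothetical names chosen for illustration.

```python
import re
from typing import Callable, List, Sequence

# Tiny illustrative stop word list (an assumption, not the Elasticsearch default).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def word_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Token filter: drop tokens found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def make_analyzer(
    tokenizer: Callable[[str], List[str]],
    filters: Sequence[Callable[[List[str]], List[str]]] = (),
    char_filters: Sequence[Callable[[str], str]] = (),
) -> Callable[[str], List[str]]:
    """Build an analyzer: char filters, then one tokenizer, then token filters."""

    def analyze(text: str) -> List[str]:
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens

    return analyze


# Roughly mirrors "tokenizer : standard" with "filter : [lowercase, stop]".
my_analyzer = make_analyzer(word_tokenizer, filters=[lowercase_filter, stop_filter])
print(my_analyzer("The Quick Brown Fox"))  # ['quick', 'brown', 'fox']
```

The order of the filters matters: lowercasing before the stop filter is what lets "The" match the lowercase stop word "the".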
    

    To use a third-party analyzer plugin, you first need to register it in the configuration file elasticsearch.yml. The following example configures IkAnalyzer:

    index:
        analysis:
            analyzer:
                ik:
                    alias: [ik_analyzer]
                    type: org.elasticsearch.index.analysis.IkAnalyzerProvider
    

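    Once an analyzer has been registered in the configuration file under a name (its logical name), that name can be used to refer to the analyzer in mapping definitions and in some APIs.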
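
As a rough illustration of referring to an analyzer by its logical name, the Python sketch below only builds and prints two request bodies; it makes no HTTP call. The shapes assume a recent Elasticsearch release, where analysis settings are supplied when the index is created and mappings are typeless (older releases differ). The index name `my_index` and the field `title` are hypothetical.

```python
import json

# Body for creating an index: the analyzer is registered under the logical
# name "myAnalyzer1", and the mapping refers to it by that same name.
create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "myAnalyzer1": {
                    "type": "standard",
                    "stopwords": ["stop1", "stop2", "stop3"],
                    "max_token_length": 500,
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # "title" is a hypothetical field name.
            "title": {"type": "text", "analyzer": "myAnalyzer1"}
        }
    },
}

# Body for the _analyze API: the same logical name selects the analyzer.
analyze_body = {"analyzer": "myAnalyzer1", "text": "some sample text"}

print("PUT /my_index")
print(json.dumps(create_index_body, indent=2))
print("GET /my_index/_analyze")
print(json.dumps(analyze_body, indent=2))
```

The point is that the name under `settings.analysis.analyzer` and the name used in the mapping or the `_analyze` request must match.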

    2. Built-in analyzers, tokenizers, and filters in ES

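    Some of the analyzers built into ES

    analyzer | logical name | description
    --- | --- | ---
    standard analyzer | standard | standard tokenizer, standard filter, lower case filter, stop filter
    simple analyzer | simple | lower case tokenizer
    stop analyzer | stop | lower case tokenizer, stop filter
    keyword analyzer | keyword | no tokenization; the whole input becomes a single token (like not_analyzed)
    whitespace analyzer | whitespace | whitespace tokenizer
    pattern analyzer | pattern | splits text with a regular expression; the default pattern is \W+
    language analyzers | lang | analyzers for specific languages; lang stands for a language name such as english
    snowball analyzer | snowball | standard tokenizer, standard filter, lower case filter, stop filter, snowball filter
    custom analyzer | custom | one tokenizer, zero or more token filters, zero or more char filters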
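
To compare a few of these analyzers on the same input, the following Python sketch approximates the keyword, whitespace, simple, and stop analyzers. These are simplified stand-ins written for illustration, with a small hypothetical stop word list, and they do not reproduce every detail of the real implementations.

```python
import re

# Small illustrative stop word list (an assumption, not the Elasticsearch default).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def keyword_analyzer(text):
    """The whole input becomes one token."""
    return [text]


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Lower case tokenizer: runs of letters, lowercased."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text):
    """Lower case tokenizer followed by a stop filter."""
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]


sample = "The 2 QUICK Brown-Foxes"
print(keyword_analyzer(sample))     # ['The 2 QUICK Brown-Foxes']
print(whitespace_analyzer(sample))  # ['The', '2', 'QUICK', 'Brown-Foxes']
print(simple_analyzer(sample))      # ['the', 'quick', 'brown', 'foxes']
print(stop_analyzer(sample))        # ['quick', 'brown', 'foxes']
```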

    tokenizer: the tokenizers built into ES

    tokenizer | logical name | description
    --- | --- | ---
    standard tokenizer | standard |
    edge ngram tokenizer | edgeNGram |
    keyword tokenizer | keyword | no tokenization
    letter tokenizer | letter | splits into words at non-letter characters
    lowercase tokenizer | lowercase | letter tokenizer, lower case filter
    ngram tokenizer | nGram |
    whitespace tokenizer | whitespace | splits on whitespace
    pattern tokenizer | pattern | splits on a separator defined by a regular expression
    uax email url tokenizer | uax_url_email | does not split URLs and email addresses
    path hierarchy tokenizer | path_hierarchy | handles strings of the form /path/to/something
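
Three of these tokenizers have no description, or only a brief one, so a small Python sketch may help: n-grams, edge n-grams, and path hierarchy. The functions are toy versions that work on a single string; the default gram sizes of 1 and 2 are an assumption made for this sketch, and options such as `token_chars` are not modelled.

```python
def ngram(text, min_gram=1, max_gram=2):
    """All substrings of length min_gram..max_gram, scanning left to right."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    ]


def edge_ngram(text, min_gram=1, max_gram=2):
    """Only the n-grams anchored at the start of the text."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def path_hierarchy(path, delimiter="/"):
    """Emit every ancestor prefix of a delimited path."""
    parts = path.split(delimiter)
    tokens = []
    for i in range(1, len(parts) + 1):
        prefix = delimiter.join(parts[:i])
        if prefix:  # skip the empty prefix before a leading delimiter
            tokens.append(prefix)
    return tokens


print(ngram("fox"))                          # ['f', 'fo', 'o', 'ox', 'x']
print(edge_ngram("fox"))                     # ['f', 'fo']
print(path_hierarchy("/path/to/something"))  # ['/path', '/path/to', '/path/to/something']
```

Edge n-grams are the usual choice for prefix matching, while the path hierarchy tokens let a query for `/path/to` match everything stored beneath it.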

    token filter: the token filters built into ES

    token filter | logical name | description
    --- | --- | ---
    standard filter | standard |
    ascii folding filter | asciifolding |
    length filter | length | removes tokens that are too long or too short
    lowercase filter | lowercase | converts tokens to lower case
    ngram filter | nGram |
    edge ngram filter | edgeNGram |
    porter stem filter | porterStem | Porter stemming algorithm
    shingle filter | shingle | builds shingles (word n-grams) from adjacent tokens
    stop filter | stop | removes stop words
    word delimiter filter | word_delimiter | splits a word further into sub-tokens
    stemmer token filter | stemmer |
    stemmer override filter | stemmer_override |
    keyword marker filter | keyword_marker |
    keyword repeat filter | keyword_repeat |
    kstem filter | kstem |
    snowball filter | snowball |
    phonetic filter | phonetic | provided by a plugin
    synonym filter | synonym | handles synonyms
    compound word filter | dictionary_decompounder, hyphenation_decompounder | decomposes compound words
    reverse filter | reverse | reverses each token
    elision filter | elision | removes elisions
    truncate filter | truncate | truncates tokens
    unique filter | unique |
    pattern capture filter | pattern_capture |
    pattern replace filter | pattern_replace | replaces using a regular expression
    trim filter | trim | trims whitespace
    limit token count filter | limit | limits the number of tokens
    hunspell filter | hunspell | dictionary-based stemming with Hunspell (spell checker) dictionaries
    common grams filter | common_grams |
    normalization filter | arabic_normalization, persian_normalization |
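
A token filter takes the token stream produced by the tokenizer and returns a modified stream. The Python sketch below gives toy versions of five of the filters listed here (shingle, length, unique, reverse, truncate). The parameter names and defaults are assumptions made for illustration, not the real option names.

```python
def shingle_filter(tokens, size=2, output_unigrams=True):
    """Emit word n-grams (shingles) of `size` adjacent tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if output_unigrams:
            out.append(tok)
        if i + size <= len(tokens):
            out.append(" ".join(tokens[i:i + size]))
    return out


def length_filter(tokens, min_len=0, max_len=2000):
    """Keep tokens whose length lies within [min_len, max_len]."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


def unique_filter(tokens):
    """Drop repeated tokens, keeping the first occurrence."""
    seen, out = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out


def reverse_filter(tokens):
    """Reverse the characters of each token."""
    return [t[::-1] for t in tokens]


def truncate_filter(tokens, length=10):
    """Cut each token down to at most `length` characters."""
    return [t[:length] for t in tokens]


tokens = ["please", "divide", "this", "please"]
print(shingle_filter(tokens[:3]))
# ['please', 'please divide', 'divide', 'divide this', 'this']
print(length_filter(tokens, 5, 6))            # ['please', 'divide', 'please']
print(unique_filter(tokens))                  # ['please', 'divide', 'this']
print(reverse_filter(["fox"]))                # ['xof']
print(truncate_filter(["elasticsearch"], 7))  # ['elastic']
```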

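    character filter: the character filters built into ES

    character filter | logical name | description
    --- | --- | ---
    mapping char filter | mapping | replaces characters according to a configured mapping
    html strip char filter | html_strip | strips HTML elements
    pattern replace char filter | pattern_replace | rewrites the string using a regular expression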
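
Character filters work on the raw string before it reaches the tokenizer. The Python sketch below gives rough equivalents of the three filters, under simplifying assumptions: the HTML stripping is a naive regular expression, not a real HTML parser, and Python writes back-references as `\1` where Java-style patterns use `$1`. All function names are hypothetical.

```python
import re
from html import unescape


def mapping_char_filter(text, mappings):
    """Replace each configured key with its value in a single pass."""
    if not mappings:
        return text
    # Longest keys first so that overlapping keys resolve predictably.
    keys = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mappings[m.group(0)], text)


def html_strip_char_filter(text):
    """Naively drop tags, then decode HTML entities."""
    return unescape(re.sub(r"<[^>]+>", "", text))


def pattern_replace_char_filter(text, pattern, replacement):
    """Regular-expression search and replace on the raw text."""
    return re.sub(pattern, replacement, text)


print(mapping_char_filter("I feel :)", {":)": "_happy_", ":(": "_sad_"}))
# I feel _happy_
print(html_strip_char_filter("<p>Hello &amp; <b>bye</b></p>"))
# Hello & bye
print(pattern_replace_char_filter("123-456-789", r"(\d+)-(?=\d)", r"\1_"))
# 123_456_789
```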
