ES7 Analysis

Author: 逸章 | Published 2020-05-01 15:24

Examples

1. The whitespace analyzer ("analyzer": "whitespace")

POST _analyze 
{
    "analyzer": "whitespace",
    "text": "This is my first program and these are my first 5 lines."
}
Tokens: [This, is, my, first, program, and, these, are, my, first, 5, lines.] (the whitespace analyzer splits on whitespace only, so case and punctuation are preserved)

2. The standard tokenizer ("tokenizer": "standard")

The lowercase filter below lowercases the resulting tokens, and asciifolding folds accented characters such as é into their ASCII equivalents:

POST _analyze 
{
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"],
    "text": "Is this déja vu?"
}
Tokens: [is, this, deja, vu]

3. Custom analyzer ("type": "custom")

The "type": "custom" below is the fixed way to declare a custom analyzer assembled from a tokenizer and filters; the alternative is to set type to the name of a built-in analyzer such as "standard" in order to configure it (see example 4).

PUT my_index_name 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "custom_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                }
            }
        }
    },
    "mappings": {
          "properties": {
                "my_sentence": {
                    "type": "text",
                    "analyzer": "custom_analyzer"
                }
            }
    }
}

It can be used like this:

GET my_index_name/_analyze
{
    "analyzer": "custom_analyzer",
    "text": "Is this déjà vu?"
}

It can also be used like this:

GET my_index_name/_analyze
{
    "field": "my_sentence",
    "text": "Is this déjà vu?"
}
Tokens: [is, this, deja, vu]

4. Configured standard analyzer ("type": "standard") with stopwords

PUT my_index2 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_personalized_analyzer": {
                    "type": "standard",
                    "max_token_length": 5,
                    "stopwords": "_english_"
                }
            }
        }
    }
}

Use it like this:

POST my_index2/_analyze
{
    "analyzer": "my_personalized_analyzer",
    "text": "This is my first program and these are my first 5 lines"
}

The tokens are: [my, first, progr, am, my, first, 5, lines]
Note that max_token_length: 5 splits program into progr and am. The standard analyzer does not stem, so lines keeps its trailing s; only a language analyzer such as english would stem it to line.
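As a quick check of that last point, here is a minimal sketch using the built-in english language analyzer, which does stem:

POST _analyze
{
    "analyzer": "english",
    "text": "This is my first program and these are my first 5 lines"
}

This should produce [my, first, program, my, first, 5, line]: the English stop words are removed and lines is stemmed to line.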

5. Configured standard analyzer ("type": "standard") without stopwords

Since my_index2 already exists from example 4, delete it first (DELETE my_index2); a PUT against an existing index fails.

PUT my_index2 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_personalized_analyzer": {
                    "type": "standard",
                    "max_token_length": 5
                }
            }
        }
    }
}
POST my_index2/_analyze
{
    "analyzer": "my_personalized_analyzer",
    "text": "This is my first program and these are my first 5 lines"
}
Tokens: [this, is, my, first, progr, am, and, these, are, my, first, 5, lines]

Note that this and is are both kept, since no stop words are configured.

6. No type=xxx needed (it defaults to custom)

Define an analyzer that removes duplicate tokens, as shown below:

PUT standard_example 
{
    "settings": {
      "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": [
                        "remove_duplicates"
                    ]
                }
            }
        }
    }
}
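A quick usage check; note that remove_duplicates only drops tokens that are identical and in the same position (such as those emitted by synonym or keyword_repeat filters), so ordinary repeated words, which occupy different positions, are kept:

POST standard_example/_analyze
{
    "analyzer": "rebuilt_standard",
    "text": "one one two"
}

Here both one tokens survive, because they sit at positions 0 and 1.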

7. The simple analyzer ("analyzer": "simple")

The simple analyzer breaks text into tokens whenever it encounters a non-letter character. It cannot be configured and consists of only a lowercase tokenizer.
Here's an example:

POST _analyze 
{
    "analyzer": "simple",
    "text": "This is my first program and these are my first 5 lines."
}

The 5 and the trailing period are both discarded, and every token is lowercased.

Now rebuild it as a custom analyzer:

PUT simple_example 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_simple": {
                    "tokenizer": "lowercase",
                    "filter": []
                }
            }
        }
    }
}

Usage:

POST simple_example/_analyze
{
    "analyzer": "rebuilt_simple",
    "text": "one 2 three Five 5"
}
Tokens: [one, three, five]

8. The stop analyzer ("analyzer": "stop")

By default it uses the _english_ stop word list.

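A minimal sketch of the default behavior:

POST _analyze
{
    "analyzer": "stop",
    "text": "This is my first program and these are my first lines."
}

This should yield [my, first, program, my, first, lines], with the _english_ stop words removed.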
Of course, we can also customize it (if my_index_name still exists from example 3, delete it first with DELETE my_index_name):
PUT my_index_name 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "the_stop_analyzer": {
                    "type": "stop",
                    "stopwords": ["first", "and", "_english_"]
                }
            }
        }
    }
}

Test:

POST my_index_name/_analyze
{
    "analyzer": "the_stop_analyzer",
    "text": "This is my first program and these are my first lines."
}
Tokens: [my, program, my, lines]

The stop analyzer consists of a lowercase tokenizer and a stop token filter; both can be configured. A custom rebuild:

PUT stop_example 
{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type": "stop",
                    "stopwords": "_english_" // can be overridden with the stopwords or stopwords_path parameters
                }
            },
            "analyzer": {
                "rebuilt_stop": {
                    "tokenizer": "lowercase",
                    "filter": [
                        "english_stop"
                    ]
                }
            }
        }
    }
}
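Usage, as a sketch:

POST stop_example/_analyze
{
    "analyzer": "rebuilt_stop",
    "text": "This is my first program"
}

Expected tokens: [my, first, program].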

9. The keyword analyzer ("analyzer": "keyword")

POST _analyze 
{
    "analyzer": "keyword",
    "text": "This is my first program and these are my first lines."
}
Tokens: [This is my first program and these are my first lines.] (the entire input is returned as a single token)

A custom rebuild:

PUT keyword_example 
{
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_keyword": {
                    "tokenizer": "keyword",
                    "filter": []
                }
            }
        }
    }
}
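A quick check that the rebuilt version behaves the same:

POST keyword_example/_analyze
{
    "analyzer": "rebuilt_keyword",
    "text": "This is my first program"
}

The entire text should come back as a single token.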

10. The pattern analyzer ("analyzer": "pattern")

The pattern analyzer splits text into terms using a regular expression. The regular expression must match the token separators, not the tokens themselves. It defaults to \W+, meaning non-word characters act as the separators:

POST _analyze 
{
    "analyzer": "pattern",
    "text": "This is my first program and these are my first line."
}
Tokens: [this, is, my, first, program, and, these, are, my, first, line] (the pattern analyzer lowercases by default)

Configuring other parameters (delete my_index_name first if it already exists):

"analysis": {
            "analyzer": {
                "email_analyzer": {
                    "type": "pattern",
                    "pattern": "\\W|_",
                    "Lowercase": true
                }
            }
        }
    }
}
POST my_index_name/_analyze
{
 "analyzer": "email_analyzer",
 "text": "Jane_Smith@foo-bar.com"
}
Tokens: [jane, smith, foo, bar, com]

The custom rebuild:

PUT pattern_example 
{
    "settings": {
        "analysis": {
            "tokenizer": {
                "split_on_non_word": {
                    "type": "pattern",
                    "pattern": "\\W+"
                }
            },
            "analyzer": {
              "rebuilt_pattern": {
                    "tokenizer": "split_on_non_word",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    }
}
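Usage, as a sketch:

POST pattern_example/_analyze
{
    "analyzer": "rebuilt_pattern",
    "text": "This is my first line."
}

Expected tokens: [this, is, my, first, line].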

11. Language analyzers

Language analyzers can be set to analyze the text of a specific language. They support custom stopwords, and the stem_exclusion parameter lets the user specify an array of lowercase words that should not be stemmed. The examples below use ik_smart and ik_max_word, which come from the IK Chinese analysis plugin (analysis-ik) rather than from Elasticsearch itself.
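stem_exclusion is not demonstrated elsewhere in this article, so here is a minimal sketch; the index name stem_exclusion_example is made up for illustration:

PUT stem_exclusion_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type": "english",
                    "stem_exclusion": ["lines"]
                }
            }
        }
    }
}

With this analyzer, lines is kept as-is instead of being stemmed to line. Back to the IK examples: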

POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["我是中国人"]
}

ik_smart performs the coarsest-grained segmentation; here it should produce [我, 是, 中国人].

The following gives a different result:

POST _analyze
{
    "analyzer": "ik_max_word",
    "text": ["我是中国人"]
}

ik_max_word performs the finest-grained segmentation; here it should produce [我, 是, 中国人, 中国, 国人].

Now a fuller configuration (again, delete my_index2 first if it still exists):

PUT my_index2
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ik": {
                    "tokenizer": "ik_max_word"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text"
            },
            "content": {
                "type": "text",
                "analyzer": "ik"
            }
        }
    }
}

Test:

POST my_index2/_analyze
{
  "analyzer": "ik",
  "text": ["我是中国人"]
}
Tokens: [我, 是, 中国人, 中国, 国人] (the ik analyzer defined above uses the ik_max_word tokenizer)

It can also be used like this:

POST my_index2/_analyze
{
    "field": "content",
    "text": ["我是中国人"]
}

An example of the mapping character filter (char_filter):

PUT char_filter_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{
          "tokenizer":"keyword",
          "char_filter":["my_char_filter"]
        }
      },
      "char_filter":{
          "my_char_filter":{
            "type":"mapping",
            "mappings":["孙悟空 => 齐天大圣","猪八戒 => 天蓬元帅"]
          }
        }
    }
  }
}

Test:

POST char_filter_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "孙悟空打妖怪,猪八戒吃西瓜"
}
Result: a single token, 齐天大圣打妖怪,天蓬元帅吃西瓜 (the mapping replaced 孙悟空 and 猪八戒, and the keyword tokenizer keeps the whole text as one token)

12. Normalizers

Normalizers are similar to analyzers:

  • but instead of producing multiple tokens, they only produce one
  • they do not contain tokenizers and accept only some character filters and token filters
  • there is no built-in normalizer

Defining a normalizer is similar to defining an analyzer, except that it uses the normalizer keyword instead of analyzer:
PUT index 
{
    "settings": {
        "analysis": {
            "char_filter": {
                "quote": {
                    "type": "mapping",
                    "mappings": [
                        "« => \"",
                        "» => \""
                    ]
                }
            },
            "normalizer": {
                "my_normalizer_name": {
                    "type": "custom",
                    "char_filter": ["quote"],
                    "filter": ["lowercase", "asciifolding"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "foo": {
              "type": "keyword",
                "normalizer": "my_normalizer_name"
            }
        }
    }
}
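The _analyze API also accepts a normalizer parameter, which is a quick way to inspect the single token a normalizer produces. A sketch:

GET index/_analyze
{
    "normalizer": "my_normalizer_name",
    "text": "« BÀR »"
}

This should return one token, with the guillemets mapped to double quotes and the text lowercased and ascii-folded.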

Insert some data:

PUT index/_doc/1
{
  "foo": "BÀR"
}

PUT index/_doc/2
{
  "foo": "bar"
}

PUT index/_doc/3
{
  "foo": "baz"
}

PUT index/_doc/4
{
  "foo": "« will be changed"
}

POST index/_refresh

Query:

GET index/_search
{
  "query": {
    "term": {
      "foo": "BAR"
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

Explanation: the above queries match documents 1 and 2, since BÀR is converted to bar at both index and query time.

