I. Analysis and Analyzer

Most built-in analyzers convert terms to lowercase; the whitespace and keyword analyzers do not.

II. Analyzer

III. Use the _analyze API to see how an analyzer processes text
Test by specifying an analyzer directly

1. Test against a field of a specific index
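
A minimal sketch of this case, assuming Elasticsearch 7.x or later and a hypothetical index my_index whose title field is mapped with the english analyzer (both names are illustrative and not from the original notes). When field is given, _analyze uses whatever analyzer the mapping assigns to that field.

```
# Hypothetical index and field, created only for this example
PUT my_index
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" }
    }
  }
}

# Analyze the text with the analyzer configured on my_index.title
POST my_index/_analyze
{
  "field": "title",
  "text": "Mastering Elasticsearch"
}
```

If analyzer is also supplied in the request body, it takes precedence over the analyzer derived from field.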

2. Test with a custom analyzer
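
A small sketch of an ad hoc analyzer, assuming the built-in standard tokenizer and lowercase token filter; the sample text is made up for illustration. Passing tokenizer and filter instead of analyzer lets you try a combination without creating an index first.

```
# Ad hoc analyzer: standard tokenizer followed by the lowercase token filter
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch"
}
```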

https://www.cnblogs.com/shoufeng/p/10562746.html
3. Standard Analyzer -- the default analyzer

4. Simple Analyzer

5. Whitespace Analyzer

6. Stop Analyzer

7. Keyword Analyzer -- the input is not tokenized

8. Pattern Analyzer

9. Language Analyzers -- analyzers for individual languages

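IV. Challenges of Chinese word segmentation

1. ICU Analyzer -- requires manually installing the analysis-icu plugin

2. More Chinese analyzers
IK
- Supports custom dictionaries and hot reloading of the dictionary
https://github.com/medcl/elasticsearch-analysis-ik
THULAC
- A Chinese word segmenter from the Natural Language Processing and Computational Social Science Lab at Tsinghua University
https://github.com/microbun/elasticsearch-thulac-plugin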
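
As a rough sketch, assuming the IK plugin from the repository above is installed in a version matching your Elasticsearch release, its ik_smart and ik_max_word analyzers can be tried through the same _analyze API. The sample sentence is the one used for the ICU test in these notes.

```
# Coarse-grained segmentation
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "这个苹果不大好吃"
}

# Fine-grained segmentation (emits more, overlapping terms)
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "这个苹果不大好吃"
}
```
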
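#Simple Analyzer -- splits on non-letter characters (symbols are dropped), lowercases
#Stop Analyzer -- lowercases, removes stop words (the, a, is)
#Whitespace Analyzer -- splits on whitespace, does not lowercase
#Keyword Analyzer -- no tokenization, the input is emitted as a single term
#Pattern Analyzer -- splits with a regular expression, default \W+ (non-word characters)
#Language -- analyzers for more than 30 common languages
#Sample text: 2 running Quick brown-foxes leap over lazy dogs in the summer evening
#Compare the output of the different analyzers
#standard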
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
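#icu_analyzer (requires the analysis-icu plugin)
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}
#standard, for comparison on the same Chinese text
POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}
#icu_analyzer on a second sentence
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}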