I. Analysis and Analyzer
Most analyzers convert every term to lowercase; a few, such as the Whitespace Analyzer, keep the original case.
II. Analyzer
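In Elasticsearch an analyzer is usually described as three stages applied in order: character filters, one tokenizer, and token filters. The Python sketch below only imitates that flow with plain functions; every function name in it is made up for illustration and none of them is an Elasticsearch API.

```python
import re

def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the three stages in the order an analyzer applies them."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>Quick</b> Brown Foxes",
              [strip_html], whitespace_tokenizer, [lowercase_filter]))
```

Swapping the tokenizer or the filter list changes the resulting terms, which is what distinguishes one built-in analyzer from another.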
III. Using the _analyze API to inspect how an analyzer processes text
Test by specifying an analyzer directly
1. Test against a field of a specific index
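When the request names a field, the `_analyze` API applies the analyzer configured for that field in the index mapping. The following is a minimal Python sketch that only builds such a request and does not contact a cluster. The index name `my_index`, the field name `title`, and the helper `field_analyze_request` are placeholders, not names from the original post.

```python
import json

def field_analyze_request(index, field, text):
    """Build the method, path and JSON body of a field-based _analyze call."""
    body = {"field": field, "text": text}
    return "POST", f"/{index}/_analyze", json.dumps(body, ensure_ascii=False)

method, path, body = field_analyze_request("my_index", "title", "Mastering Elasticsearch")
print(method, path)
print(body)
```

The printed method, path and body can be pasted into the Kibana console or sent with any HTTP client.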
2. Test with a custom analyzer
https://www.cnblogs.com/shoufeng/p/10562746.html
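A custom combination can be tested without creating an index by passing a `tokenizer` and a list of token `filter`s (and optionally `char_filter`s) to `_analyze`. The Python sketch below just assembles that request body. The helper name `custom_analyze_request` and the sample text are assumptions made for illustration.

```python
import json

def custom_analyze_request(tokenizer, token_filters, text, char_filters=()):
    """Build an _analyze request that combines a tokenizer with filters."""
    body = {"tokenizer": tokenizer, "filter": list(token_filters), "text": text}
    if char_filters:
        # Character filters run before the tokenizer, so they are optional here.
        body["char_filter"] = list(char_filters)
    return "POST", "/_analyze", json.dumps(body, ensure_ascii=False)

method, path, body = custom_analyze_request("standard", ["lowercase"], "Mastering Elasticsearch")
print(method, path)
print(body)
```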
3. Standard Analyzer (the default analyzer)
4. Simple Analyzer
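The rule summarized in these notes for the Simple Analyzer (split on anything that is not a letter, then lowercase) can be approximated in a few lines of Python. This is only a sketch of the behavior, not how Elasticsearch implements it, and `simple_analyze` is a made-up name.

```python
import re

def simple_analyze(text):
    """Keep runs of letters only and lowercase them."""
    # [^\W\d_] matches word characters that are neither digits nor "_",
    # which is close to "letters only".
    return re.findall(r"[^\W\d_]+", text.lower())

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(simple_analyze(text))
```

The digit, the hyphen and the trailing period are dropped, so "brown-foxes" becomes two terms.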
5. Whitespace Analyzer
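The Whitespace Analyzer only splits on whitespace and leaves case and punctuation alone. In Python that is roughly `str.split()`. The function name below is hypothetical, and the sketch is an approximation rather than the actual implementation.

```python
def whitespace_analyze(text):
    """Split on runs of whitespace; keep case and punctuation as they are."""
    return text.split()

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(whitespace_analyze(text))
```

"Quick" keeps its capital letter, "brown-foxes" stays one term, and the final term still carries the period.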
6. Stop Analyzer
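The Stop Analyzer behaves like the Simple Analyzer with stop words removed afterwards. The Python sketch below approximates that. `STOP_WORDS` is a small illustrative subset chosen for this example (not the full default list), and `stop_analyze` is a made-up name.

```python
import re

# Illustrative subset only; the real default English list is longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def stop_analyze(text):
    """Lowercase, keep runs of letters, then drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(stop_analyze(text))
```

Compared with the Simple Analyzer output, "in" and "the" are gone.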
7. Keyword Analyzer: the input is not tokenized
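Because the Keyword Analyzer emits the whole input as one term, a Python stand-in is trivial. It is shown here only for comparison with the other analyzers, and the function name is hypothetical.

```python
def keyword_analyze(text):
    """Return the input unchanged as a single term."""
    return [text]

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
tokens = keyword_analyze(text)
print(len(tokens), tokens[0] == text)
```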
8. Pattern Analyzer
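The Pattern Analyzer splits on a regular expression, `\W+` by default, and lowercases the result. The following is a rough Python equivalent. `re.ASCII` is used on the assumption that the default pattern treats only ASCII letters, digits and "_" as word characters. The function name is made up.

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    """Lowercase the text and split it wherever the pattern matches."""
    parts = re.split(pattern, text.lower(), flags=re.ASCII)
    # Splitting can leave empty strings at the edges; discard them.
    return [p for p in parts if p]

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(pattern_analyze(text))
```

Unlike the Simple Analyzer sketch, the digit 2 survives because digits count as word characters.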
9. Language Analyzers: analyzers for individual languages
IV. Challenges of Chinese word segmentation
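One source of difficulty is that Chinese text has no spaces, so the same sentence can often be cut into words in more than one valid-looking way. The Python sketch below enumerates every segmentation of one of the sample sentences used in these notes against a toy dictionary. The dictionary `vocab` and the function `segmentations` are invented for this illustration and have nothing to do with how any real analyzer works.

```python
def segmentations(text, vocab):
    """Return every way to cut text into words that all appear in vocab."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocab:
            # Try each dictionary word that matches the front of the text,
            # then segment whatever is left over.
            for rest in segmentations(text[end:], vocab):
                results.append([word] + rest)
    return results

# Toy dictionary, assumed for the example.
vocab = {"他", "说", "的", "的确", "确实", "实在", "在理", "理"}
for seg in segmentations("他说的确实在理", vocab):
    print("/".join(seg))
```

Both segmentations use only dictionary words, so picking the right one requires context rather than a simple lookup.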
1. ICU Analyzer: requires installing the analysis-icu plugin manually
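2. More Chinese analyzers
IK
- Supports custom dictionaries and hot reloading of the segmentation dictionary
https://github.com/medcl/elasticsearch-analysis-ik
THULAC
- A Chinese word segmenter from the Natural Language Processing and Computational Social Science Lab at Tsinghua University
https://github.com/microbun/elasticsearch-thulac-plugin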
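#Simple Analyzer – splits on non-letters (symbols are dropped), lowercases
#Stop Analyzer – lowercases, removes stop words (the, a, is)
#Whitespace Analyzer – splits on whitespace, does not lowercase
#Keyword Analyzer – no tokenization, the input is emitted as a single term
#Pattern Analyzer – splits by regular expression, default \W+ (non-word characters)
#Language – analyzers for more than 30 common languages
#2 running Quick brown-foxes leap over lazy dogs in the summer evening
#Compare the output of different analyzers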
#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}
#icu_analyzer
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}
#standard, for comparison
POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}
#icu_analyzer
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}









