Implementing Pinyin and Polyphonic-Character Search with ElasticSearch

Author: marvinxu | Source: published 2017-02-27 13:46, read 766 times

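Elasticsearch's default analyzer handles Chinese poorly. Take tokenizing 中华人民共和国 as an example:

  1. With the default analyzer, the text is split into single characters, so the search runs on 中, 华, 人, 民, 共, 和, 国 separately.
  2. The well-known IK analyzer is the recommended alternative: https://github.com/medcl/elasticsearch-analysis-ik/
  3. Install it as described in the earlier article. ik_smart is usually sufficient; with it, the text above is tokenized as 中华, 人民, 共和国.
  4. Even with IK, one problem remains: many Chinese users do not switch to a Chinese input method when searching, so what they type is pinyin. IK does not support pinyin search, so such queries return English-language or inaccurate results. This calls for the pinyin analysis plugin: https://github.com/medcl/elasticsearch-analysis-pinyin
  5. Install it the same way as described earlier.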
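To compare how two analyzers split the same text, the `_analyze` endpoint can be called once per analyzer. The sketch below only builds the request (method, path, JSON body) and sends nothing; the helper name `build_analyze_request` is made up for illustration, and the `ik_smart` analyzer is only available if the IK plugin is installed on the cluster.

```python
import json


def build_analyze_request(analyzer, text):
    """Return (method, path, body) for an _analyze call; nothing is sent."""
    # ensure_ascii=False keeps the Chinese text readable in the JSON body.
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return "POST", "/_analyze", body


# Build one request per analyzer so the resulting tokens can be compared.
for name in ("standard", "ik_smart"):
    method, path, body = build_analyze_request(name, "中华人民共和国")
    print(method, path, body)
```

Sending each request to a running cluster (with any HTTP client) returns the token list produced by that analyzer.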

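After installing the plugins, the index settings and mapping need to be updated to get search results that cover pinyin and polyphonic characters: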

# Index definition: analysis settings plus the mapping for the "topic" type.
topic = \
{
    "settings": {
        "analysis": {
            "analyzer": {
                # Tokenize with IK first, then convert the tokens to pinyin.
                "ik_pinyin_analyzer": {
                    "type":"custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin","word_delimiter"]
                }
            },
            "filter": {
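                "my_pinyin": {
                    "type": "pinyin",
                    # Do not emit first-letter abbreviations (e.g. "ldh" for 刘德华).
                    "keep_first_letter": False,
                    # Emit the full pinyin of each character (e.g. liu, de, hua).
                    "keep_full_pinyin": True,
                    # Keep non-Chinese letters and digits in the output.
                    "keep_none_chinese": True,
                    "keep_none_chinese_in_first_letter": True,
                    # Do not keep the original Chinese text as a token.
                    "keep_original": False,
                    "limit_first_letter_length": 16,
                    "lowercase": True,
                    "trim_whitespace": True,
                }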
            }
        }
    },
    "mappings" : {
        "topic" : {
            "properties" : {
                "creator" : {
                    # Exact-match string: "keyword" replaces the legacy
                    # "string" + "not_analyzed" combination.
                    "type" : "keyword"
                },
                "postCount" : {
                    # Numeric and date fields are never analyzed, so the
                    # legacy "index": "not_analyzed" setting is dropped.
                    "type" : "integer"
                },
                "followNum" : {
                    "type" : "integer"
                },
                "creatTime" : {
                    "type" : "date"
                },
                "tagName": {
                    # "text" fields are analyzed by default, so the legacy
                    # "index": "analyzed" setting is dropped.
                    "type": "text",
                    "store": False,
                    "analyzer": "ik_pinyin_analyzer",
                    "term_vector": "with_positions_offsets",
                    "boost": 10,
                    "fields" : {
                        "untouch": {
                            "type": "keyword"
                        }
                    }
                }
            }
        }
    }
}
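A definition like the one above is typically sent as the JSON body of a `PUT /<index>` request when the index is created. The following is a minimal sketch using only the standard library; the helper name, the index name `topics`, and the `http://localhost:9200` address are assumptions, and the tiny `example_definition` stands in for the full dict. Note that `json.dumps` turns Python's `True`/`False` into JSON `true`/`false`.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, definition):
    """Build (but do not send) a PUT request that creates an index."""
    data = json.dumps(definition).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_name}",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder body; in practice this would be the full settings + mappings dict.
example_definition = {"settings": {"analysis": {}}, "mappings": {}}

# "topics" and the localhost address are hypothetical values.
request = build_create_index_request(
    "http://localhost:9200", "topics", example_definition
)
print(request.get_method(), request.full_url)

# To actually create the index against a running cluster:
# urllib.request.urlopen(request)
```

Because the analyzer of an existing field cannot be changed in place, an index that already holds data generally has to be recreated and reindexed for the new analyzer to take effect.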

The resulting search behavior is shown in the screenshot below:

[Image: Paste_Image.png]
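For reference, a query against this mapping could look like the sketch below. It assumes the field names from the mapping (`tagName` and its `untouch` sub-field); the helper name `build_tag_search` is made up. A `match` query runs the user's input through the field's analyzer, while a `term` query on the keyword sub-field matches the stored value exactly. How a given pinyin input is tokenized depends on the plugin version and filter options, so the actual matches should be checked against a real index.

```python
def build_tag_search(user_input, exact=False):
    """Build a search body for the tagName field; nothing is sent."""
    if exact:
        # The keyword sub-field holds the untouched original value.
        query = {"term": {"tagName.untouch": user_input}}
    else:
        # The match query analyzes the input with the field's analyzer.
        query = {"match": {"tagName": user_input}}
    return {"query": query}


# Pinyin typed without switching to a Chinese input method.
print(build_tag_search("zhonghua"))
# Exact lookup of the original tag text.
print(build_tag_search("中华人民共和国", exact=True))
```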

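Reader comments

现代诗体的程序员: Thanks, very useful.
4b3bf50cbb09: Hello! How should polyphonic characters be handled? Following the example in the official GitHub repo, "银行" is converted to "yinxing". How can this be solved?
perfect_jimmy: Hi, how do I add this to the custom dictionary? Any guidance would be appreciated.
marvinxu: @Patrick_0b27 I don't know of other open-source options. Two ideas: the simple one is to add entries to a custom dictionary; the other is to consider a machine-learning approach.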
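
The dictionary idea from the reply above can be illustrated with a toy converter: phrase-level entries are tried first (longest match wins), and a per-character default reading is used only as a fallback. This is an illustration of the idea only, not how elasticsearch-analysis-pinyin is implemented, and both tables are made-up samples.

```python
# Toy illustration only; both tables below are made-up samples.

# Default reading per character (a polyphonic character gets one reading).
CHAR_PINYIN = {"银": "yin", "行": "xing", "人": "ren"}

# Phrase-level overrides: the "custom dictionary" that fixes polyphones.
PHRASE_PINYIN = {"银行": ["yin", "hang"]}


def to_pinyin(text, phrases=PHRASE_PINYIN, chars=CHAR_PINYIN):
    """Convert text to pinyin syllables, preferring the longest phrase match."""
    max_len = max((len(p) for p in phrases), default=1)
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase starting at position i first.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in phrases:
                result.extend(phrases[chunk])
                i += size
                break
        else:
            # No phrase matched: fall back to the per-character reading,
            # keeping unknown characters unchanged.
            result.append(chars.get(text[i], text[i]))
            i += 1
    return result


print(to_pinyin("银行"))               # phrase entry applies
print(to_pinyin("银行", phrases={}))   # per-character fallback only
```

Without the phrase entry, 行 falls back to its default reading, which is the "yinxing" symptom described in the question.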

Article link: https://www.haomeiwen.com/subject/xymawttx.html
