20. Mixing match and match_ in Elasticsearch


Author: 编程界的小学生 | Source: published 2017-07-17 11:53, read 2341 times

    1. What is recall?

    Suppose you search for java spark in an index that holds 100 docs in total. Recall is about how many of those docs come back as results.

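    2. What is precision?

    Suppose you search for java spark. Precision is about whether the docs that contain the phrase java spark, or that contain java and spark close to each other, can be ranked at the very top.

    Using a match_phrase query on its own means that every term must appear in the doc field, and the terms must lie within the distance allowed by slop, before the doc matches.

    In other words, match_phrase (proximity match) requires a doc to contain all of the terms before it can be returned. If a doc is missing even one term, it is not returned.

    For example:
    java spark -> hello world java : does not match
    java spark -> hello world, java spark : matches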
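The two requirements can be sketched in a few lines of Python. This is a simplified illustration, not how Elasticsearch is implemented: `tokenize`, `matches_any_term`, and `contains_all_terms` are hypothetical helpers, the tokenizer is only a crude stand-in for a real analyzer, and the position and slop checks that match_phrase also performs are left out entirely.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for a real analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def matches_any_term(query, doc):
    """Roughly what a default match query needs: at least one query term."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))


def contains_all_terms(query, doc):
    """The precondition for match_phrase: every query term is present.
    Position and slop checks are deliberately omitted from this sketch."""
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))


# The two example docs from the text above.
docs = ["hello world java", "hello world, java spark"]
for doc in docs:
    print(doc, matches_any_term("java spark", doc), contains_all_terms("java spark", doc))
```

The first doc passes the any-term check but fails the all-terms check, which is why a phrase-style query cannot return it.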

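    3. The problem
    With proximity matching, recall is low and precision is very high. Sometimes, though, we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want the match_phrase behavior of boosting the score by distance, so that the closer the terms are to each other, the higher the score and the earlier the doc is returned.

    That is, recall comes first. For example:
    java spark -> docs containing java are returned, docs containing spark are returned, and docs containing both java and spark are returned. Precision is taken care of as well: among the docs that contain both java and spark, the closer the two terms are, the higher the doc ranks.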
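The trade-off described above can be made concrete with the standard information-retrieval definitions of recall and precision, which are somewhat stricter than the informal descriptions in sections 1 and 2. The sketch below is only an illustration: the doc ids and the set of docs judged relevant are hypothetical, not taken from the forum index.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant docs that were actually returned."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the returned docs that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical relevance judgments for the query "java spark".
relevant = {"doc_spark", "doc_both"}

# Hypothetical result sets: any-term matching returns every doc that
# contains java or spark; phrase-style matching returns only the doc
# that contains both terms close together.
match_hits = {"doc_java", "doc_spark", "doc_both"}
phrase_hits = {"doc_both"}

for name, hits in [("match", match_hits), ("match_phrase", phrase_hits)]:
    print(name, "recall:", recall(hits, relevant), "precision:", precision(hits, relevant))
```

Under these assumed judgments, the any-term result set finds every relevant doc but also returns one that is not relevant, while the phrase-style result set returns only relevant docs but misses one of them.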

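    4. The solution
    Use a bool query to combine a match query with a match_phrase query.

    match raises recall: docs with java and docs with spark are all returned.
    match_phrase raises precision: docs that contain both java and spark are ranked at the top.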
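As a rough sketch of that combination, the request body can be assembled as a plain Python dict. The helper name `recall_plus_proximity_query` is hypothetical, and sending the body to a cluster is left out, so no client library is assumed.

```python
import json


def recall_plus_proximity_query(field, text, slop):
    """Build a bool query body: the must clause (match) decides which docs
    are returned, the should clause (match_phrase) only adds to the score."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {field: text}}
                ],
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


body = recall_plus_proximity_query("content", "java spark", 50)
print(json.dumps(body, indent=2))
```

Putting match_phrase under should rather than must is the key design choice: a doc that fails the phrase clause is still returned, it just receives no extra score from it.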

    Effect 1: a bool query with match only

    GET /forum/article/_search 
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "content": "java spark"
              }
            }
          ]
        }
      }
    }
    

    Result:

    {
      "took": 54,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 0.68640786,
        "hits": [
          {
            "_index": "forum",
            "_type": "article",
            "_id": "2",
            "_score": 0.68640786,
            "_source": {
              "articleID": "KDKE-B-9947-#kL5",
              "userID": 1,
              "hidden": false,
              "postDate": "2017-01-02",
              "tag": [
                "java"
              ],
              "tag_cnt": 1,
              "view_cnt": 50,
              "title": "this is java blog",
              "content": "i think java is the best programming language",
              "sub_title": "learned a lot of course",
              "author_first_name": "Smith",
              "author_last_name": "Williams",
              "new_author_last_name": "Williams",
              "new_author_first_name": "Smith"
            }
          },
          {
            "_index": "forum",
            "_type": "article",
            "_id": "5",
            "_score": 0.68324494,
            "_source": {
              "articleID": "DHJK-B-1395-#Ky5",
              "userID": 3,
              "hidden": false,
              "postDate": "2017-03-01",
              "tag": [
                "elasticsearch"
              ],
              "tag_cnt": 1,
              "view_cnt": 10,
              "title": "this is spark blog",
              "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
              "sub_title": "haha, hello world",
              "author_first_name": "Tonny",
              "author_last_name": "Peter Smith",
              "new_author_last_name": "Peter Smith",
              "new_author_first_name": "Tonny"
            }
          }
        ]
      }
    }
    

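    The result shows that a doc containing only one of the terms is returned as well. The doc that contains only java even ranks first, while the doc that contains both java and spark ranks last.

    Effect 2: match_phrase only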

    GET /forum/article/_search 
    {
      "query": {
        "match_phrase": {
          "content": {
            "query": "java spark",
            "slop" : 50
          }
        }
      }
    }
    

    Result:

    {
      "took": 3,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.5753642,
        "hits": [
          {
            "_index": "forum",
            "_type": "article",
            "_id": "5",
            "_score": 0.5753642,
            "_source": {
              "articleID": "DHJK-B-1395-#Ky5",
              "userID": 3,
              "hidden": false,
              "postDate": "2017-03-01",
              "tag": [
                "elasticsearch"
              ],
              "tag_cnt": 1,
              "view_cnt": 10,
              "title": "this is spark blog",
              "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
              "sub_title": "haha, hello world",
              "author_first_name": "Tonny",
              "author_last_name": "Peter Smith",
              "new_author_last_name": "Peter Smith",
              "new_author_first_name": "Tonny"
            }
          }
        ]
      }
    }
    

    The result shows that only the doc containing both java and spark is returned, so recall has dropped.

    Final effect: combine the two, using both the bool match query and match_phrase

    GET /forum/article/_search 
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "content": "java spark"
              }
            }
          ],
          "should": [
            {
              "match_phrase": {
                "content": {
                  "query": "java spark",
                  "slop" : 50
                }
              }
            }
          ]
        }
      }
    }
    

    Result:

    {
      "took": 4,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 2,
        "max_score": 1.258609,
        "hits": [
          {
            "_index": "forum",
            "_type": "article",
            "_id": "5",
            "_score": 1.258609,
            "_source": {
              "articleID": "DHJK-B-1395-#Ky5",
              "userID": 3,
              "hidden": false,
              "postDate": "2017-03-01",
              "tag": [
                "elasticsearch"
              ],
              "tag_cnt": 1,
              "view_cnt": 10,
              "title": "this is spark blog",
              "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
              "sub_title": "haha, hello world",
              "author_first_name": "Tonny",
              "author_last_name": "Peter Smith",
              "new_author_last_name": "Peter Smith",
              "new_author_first_name": "Tonny"
            }
          },
          {
            "_index": "forum",
            "_type": "article",
            "_id": "2",
            "_score": 0.68640786,
            "_source": {
              "articleID": "KDKE-B-9947-#kL5",
              "userID": 1,
              "hidden": false,
              "postDate": "2017-01-02",
              "tag": [
                "java"
              ],
              "tag_cnt": 1,
              "view_cnt": 50,
              "title": "this is java blog",
              "content": "i think java is the best programming language",
              "sub_title": "learned a lot of course",
              "author_first_name": "Smith",
              "author_last_name": "Williams",
              "new_author_last_name": "Williams",
              "new_author_first_name": "Smith"
            }
          }
        ]
      }
    }
    

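    The result is exactly what we wanted: the doc that contains both terms ranks first, with a score well above the second doc, and recall is still high because the doc that contains only java is returned too.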
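The scores in the three responses above are consistent with the bool query adding up the scores of the clauses a doc matches. The short check below uses only the numbers reported in those responses; the variable names are made up for illustration, and the small tolerance allows for rounding in the reported values.

```python
# Scores copied from the three responses above.
match_score_doc5 = 0.68324494      # doc 5, match only (effect 1)
phrase_score_doc5 = 0.5753642      # doc 5, match_phrase only (effect 2)
combined_reported_doc5 = 1.258609  # doc 5, combined query

match_score_doc2 = 0.68640786      # doc 2, match only (effect 1)
combined_reported_doc2 = 0.68640786  # doc 2, combined query

# Doc 5 matches both the must and the should clause.
combined_doc5 = match_score_doc5 + phrase_score_doc5
print(round(combined_doc5, 6), combined_reported_doc5)

# Doc 2 matches only the must clause, so its score is unchanged.
print(match_score_doc2, combined_reported_doc2)
```

Doc 2 wins by a small margin when only match is used, but the extra match_phrase score is enough to move doc 5 clearly ahead in the combined query.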


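      Reader comments

      • 7baf9571335d:"fields": {
        "server_type": "server"
        },
        Or rather, I want to match docs where server_type under fields equals server...
      • 7baf9571335d:"tag": [
        "java"
        ],
        When I want to match java under tag, how should the match be written? I have only just started looking into this.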


      Article link: https://www.haomeiwen.com/subject/bhufkxtx.html