51. Getting Started with Search Engines: Relevance Scoring and the TF/IDF Algorithm Explained

Author: 拉提娜的爸爸 | Published: 2020-01-09 10:43

    1. Algorithm overview

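    The relevance score algorithm, put simply, computes how closely a piece of text in an index matches the search text.
    Elasticsearch has traditionally used the term frequency/inverse document frequency algorithm, TF/IDF for short. Since version 5.0 the default similarity is BM25, which builds on the same three factors described below.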

    (1) Term frequency: how many times each term of the search text occurs in the field text. The more occurrences, the more relevant the document.

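    Example: search request: hello world
    doc1: hello you, and world is very good
    doc2: hello, how are you
    doc1 contains more of the query terms (both hello and world, versus only hello in doc2), so doc1 is more relevant.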
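The term-frequency comparison above can be sketched in a few lines of Python. This is only an illustration: the `tokenize` helper is a hypothetical, very rough stand-in for an Elasticsearch analyzer, and raw counts are used instead of the dampened term frequency a real similarity applies.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(query, field_text):
    """Count how often each query term occurs in the field text."""
    counts = Counter(tokenize(field_text))
    return {term: counts[term] for term in tokenize(query)}


query = "hello world"
doc1 = "hello you, and world is very good"
doc2 = "hello, how are you"

tf_doc1 = term_frequencies(query, doc1)
tf_doc2 = term_frequencies(query, doc2)

print(tf_doc1)  # {'hello': 1, 'world': 1}
print(tf_doc2)  # {'hello': 1, 'world': 0}
```

doc1 matches both query terms while doc2 matches only one, which is why doc1 ranks higher on this factor.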

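    (2) Inverse document frequency: how often each term of the search text occurs across all documents in the index. The more common the term, the less a match on it contributes to relevance.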

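    Example: search request: hello world
    doc1: hello, today is very good
    doc2: hi world, how are you
    Suppose the index holds 10,000 documents; the word hello occurs 1,000 times across all of them, while the word world occurs 100 times.
    doc2 comes out as more relevant. Each document matches exactly one of the two query terms, so the tie is broken by rarity: world is rarer than hello, and a match on it counts for more.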
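To put numbers on this example, here is a minimal sketch that plugs the figures into the idf formula quoted in the explain output in section 2. One assumption is made loudly: the "1,000 times" and "100 times" are treated as the number of documents containing each term (docFreq), which is what the formula actually expects.

```python
import math


def idf(doc_freq, doc_count):
    # Formula as printed in the explain output:
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


DOC_COUNT = 10_000  # documents in the index, from the example

# Assumption: occurrence counts are read as document frequencies.
idf_hello = idf(doc_freq=1_000, doc_count=DOC_COUNT)
idf_world = idf(doc_freq=100, doc_count=DOC_COUNT)

print(f"idf(hello) = {idf_hello:.3f}")
print(f"idf(world) = {idf_world:.3f}")
print("world weighs more than hello:", idf_world > idf_hello)
```

Because world is the rarer term, its idf is roughly twice that of hello, so the document that matches world (doc2) gets the higher score.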

    (3) Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match in it.

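    Search request: hello world
    doc1: { "title": "hello article", "content": "babaaba 10,000 words" }
    doc2: { "title": "my article", "content": "blablabala 10,000 words, hi world" }
    If hello and world occur equally often across the whole index, doc1 is more relevant, because its match is in the short title field while doc2's match is in the much longer content field.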
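As a rough illustration of how strongly field length matters, the sketch below uses the classic TF/IDF length norm, 1 / sqrt(number of terms in the field). This is an assumption about which similarity is in play: BM25 handles length differently, through the ratio fieldLength / avgFieldLength. The term counts are approximations taken from the example (2 terms in the title, about 10,002 in the content).

```python
import math


def field_length_norm(num_terms):
    # Classic TF/IDF length norm: shorter fields get a larger norm.
    return 1.0 / math.sqrt(num_terms)


# doc1 matches "hello" in its title: "hello article" has 2 terms.
title_norm = field_length_norm(2)

# doc2 matches "world" in its content: roughly 10,000 words plus "hi world".
content_norm = field_length_norm(10_002)

print(f"title norm   = {title_norm:.4f}")
print(f"content norm = {content_norm:.4f}")
```

The match in the two-term title is weighted far more heavily than the match buried in a field of about ten thousand terms.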

    2. How _score is calculated

    The following request shows how _score was calculated:

    GET /test_index/test_type/_search?explain
    {
      "query": {
        "match": {
          "test_field": "test hello"
        }
      }
    }
    -------------------------------------Result-------------------------------------
    {
      "took": 29,
      "timed_out": false,
      "_shards": {
        "total": 3,
        "successful": 3,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 0.25316024,
        "hits": [
          {
            "_shard": "[test_index][0]",
            "_node": "S2PtarqKSIqzdxIdXKKYWg",
            "_index": "test_index",
            "_type": "test_type",
            "_id": "7",
            "_score": 0.25316024,
            "_source": {
              "test_field": "test client 2"
            },
            "_explanation": {
              "value": 0.25316024,
              "description": "sum of:",
              "details": [
                {
                  "value": 0.25316024,
                  "description": "sum of:",
                  "details": [
                    {
                      "value": 0.25316024,
                      "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
                      "details": [
                        {
                          "value": 0.25316024,
                          "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                          "details": [
                            {
                              "value": 0.2876821,
                              "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "docFreq",
                                  "details": []
                                },
                                {
                                  "value": 1,
                                  "description": "docCount",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 0.88,
                              "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "termFreq=1.0",
                                  "details": []
                                },
                                {
                                  "value": 1.2,
                                  "description": "parameter k1",
                                  "details": []
                                },
                                {
                                  "value": 0.75,
                                  "description": "parameter b",
                                  "details": []
                                },
                                {
                                  "value": 3,
                                  "description": "avgFieldLength",
                                  "details": []
                                },
                                {
                                  "value": 4,
                                  "description": "fieldLength",
                                  "details": []
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 0,
                  "description": "match on required clause, product of:",
                  "details": [
                    {
                      "value": 0,
                      "description": "# clause",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "*:*, product of:",
                      "details": [
                        {
                          "value": 1,
                          "description": "boost",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "queryNorm",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          },
          {
            "_shard": "[test_index][1]",
            "_node": "S2PtarqKSIqzdxIdXKKYWg",
            "_index": "test_index",
            "_type": "test_type",
            "_id": "8",
            "_score": 0.25316024,
            "_source": {
              "test_field": "test client 2"
            },
            "_explanation": {
              "value": 0.25316024,
              "description": "sum of:",
              "details": [
                {
                  "value": 0.25316024,
                  "description": "sum of:",
                  "details": [
                    {
                      "value": 0.25316024,
                      "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
                      "details": [
                        {
                          "value": 0.25316024,
                          "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                          "details": [
                            {
                              "value": 0.2876821,
                              "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "docFreq",
                                  "details": []
                                },
                                {
                                  "value": 1,
                                  "description": "docCount",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 0.88,
                              "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "termFreq=1.0",
                                  "details": []
                                },
                                {
                                  "value": 1.2,
                                  "description": "parameter k1",
                                  "details": []
                                },
                                {
                                  "value": 0.75,
                                  "description": "parameter b",
                                  "details": []
                                },
                                {
                                  "value": 3,
                                  "description": "avgFieldLength",
                                  "details": []
                                },
                                {
                                  "value": 4,
                                  "description": "fieldLength",
                                  "details": []
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 0,
                  "description": "match on required clause, product of:",
                  "details": [
                    {
                      "value": 0,
                      "description": "# clause",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "*:*, product of:",
                      "details": [
                        {
                          "value": 1,
                          "description": "boost",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "queryNorm",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          },
          {
            "_shard": "[test_index][2]",
            "_node": "S2PtarqKSIqzdxIdXKKYWg",
            "_index": "test_index",
            "_type": "test_type",
            "_id": "1",
            "_score": 0.25316024,
            "_source": {
              "test_field": "test service 1"
            },
            "_explanation": {
              "value": 0.25316024,
              "description": "sum of:",
              "details": [
                {
                  "value": 0.25316024,
                  "description": "sum of:",
                  "details": [
                    {
                      "value": 0.25316024,
                      "description": "weight(test_field:test in 0) [PerFieldSimilarity], result of:",
                      "details": [
                        {
                          "value": 0.25316024,
                          "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                          "details": [
                            {
                              "value": 0.2876821,
                              "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "docFreq",
                                  "details": []
                                },
                                {
                                  "value": 1,
                                  "description": "docCount",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 0.88,
                              "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "termFreq=1.0",
                                  "details": []
                                },
                                {
                                  "value": 1.2,
                                  "description": "parameter k1",
                                  "details": []
                                },
                                {
                                  "value": 0.75,
                                  "description": "parameter b",
                                  "details": []
                                },
                                {
                                  "value": 3,
                                  "description": "avgFieldLength",
                                  "details": []
                                },
                                {
                                  "value": 4,
                                  "description": "fieldLength",
                                  "details": []
                                }
                              ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 0,
                  "description": "match on required clause, product of:",
                  "details": [
                    {
                      "value": 0,
                      "description": "# clause",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "*:*, product of:",
                      "details": [
                        {
                          "value": 1,
                          "description": "boost",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "queryNorm",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          }
        ]
      }
    }
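The numbers in this output can be checked by hand. The sketch below copies the two formulas from the `description` fields (they are the BM25 formulas) and feeds them the leaf values shown for the first hit. The function names are made up for illustration. Note that docCount is 1 even though the search returned 3 hits, most likely because each of the 3 shards holds a single document and these statistics are computed per shard.

```python
import math


def bm25_idf(doc_freq, doc_count):
    # "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # "tfNorm, computed as (freq * (k1 + 1)) /
    #  (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))"
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# Leaf values copied from the explanation of the first hit.
idf = bm25_idf(doc_freq=1, doc_count=1)
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75, field_length=4, avg_field_length=3)

# The term score is the product of the two factors.
score = idf * tf_norm

print(f"idf     = {idf:.7f}")
print(f"tfNorm  = {tf_norm:.2f}")
print(f"score   = {score:.5f}")
```

Up to floating-point rounding, the results agree with the reported idf of 0.2876821, tfNorm of 0.88, and _score of 0.25316024. Only the term test contributes; hello does not occur in any of the three documents.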
    

    3. Analyzing how a document was matched

    A request of the following form shows how a specific document was matched:

    GET /test_index/test_type/2/_explain
    {
      "query": {
        "match": {
          "test_field": "test hello"
        }
      }
    }
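The article does not show the response to this request. Assuming it contains an explanation tree with the same `value` / `description` / `details` shape as the output in section 2, a small recursive helper makes such a tree easier to read. The helper name is hypothetical, and the sample node is copied from the idf part of the section 2 output.

```python
def flatten_explanation(node, depth=0):
    """Yield one indented line per node of an explanation tree."""
    yield "{}{} <- {}".format("  " * depth, node["value"], node["description"])
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)


# Sample node copied from the idf section of the explain output above.
sample = {
    "value": 0.2876821,
    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
    "details": [
        {"value": 1, "description": "docFreq", "details": []},
        {"value": 1, "description": "docCount", "details": []},
    ],
}

lines = list(flatten_explanation(sample))
print("\n".join(lines))
```

Each level of nesting is indented by two spaces, so the factors that feed into a value appear directly beneath it.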
    

