023. Learning Elasticsearch Through an IT Forum Case Study (Part 2): Q

Author: CoderJed | Published: 2020-08-02 14:40

1. Preparing the test data

PUT article
{
  "mappings": {
    "_doc": {
      "properties": {
        "articleId": {
          "type": "keyword"
        }
      }
    }
  }
}

POST /article/_doc/_bulk
{"index": {"_id": 1}}
{"articleId": "XHDK-A-1293-#fJ3", "userId": 1, "hidden": false, "postDate": "2017-01-01", "tag": ["java", "elasticsearch"], "tag_cnt": 2, "view_cnt": 30, "title" : "this is java and elasticsearch blog"}
{"index": {"_id": 2}}
{"articleId": "KDKE-B-9947-#kL5", "userId": 1, "hidden": false, "postDate": "2017-01-02", "tag": ["java"], "tag_cnt": 1, "view_cnt": 50, "title" : "this is java blog"}
{"index": {"_id": 3}}
{"articleId": "JODL-X-1937-#pV7", "userId": 2, "hidden": false, "postDate": "2017-01-01", "tag": ["elasticsearch"], "tag_cnt": 1, "view_cnt": 100, "title": "this is elasticsearch blog"}
{"index": {"_id": 4}}
{"articleId": "QQPX-R-3956-#aD8", "userId": 2, "hidden": true, "postDate": "2017-01-02", "tag": ["java", "elasticsearch", "hadoop"], "tag_cnt": 3, "view_cnt": 80, "title": "this is java, elasticsearch, hadoop blog"}
{"index": {"_id": 5}}
{"articleId": "DHJK-B-1395-#Ky5", "userId": 3, "hidden": false, "postDate": "2017-03-01", "tag": ["spark"], "tag_cnt": 1, "view_cnt": 10, "title": "this is spark blog"}

2. Manually controlling search precision

# Search for blogs whose title contains java or elasticsearch
# 4 hits
GET /article/_doc/_search
{
  "query": {
    "match": {
      "title": "java elasticsearch"
    }
  }
}

# Search for blogs whose title contains both java and elasticsearch
# 2 hits
# To require every search keyword to match, use "operator": "and"
GET /article/_doc/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch",
        "operator": "and"
      }
    }
  }
}

# Search for blogs whose title contains at least 3 of the 4 keywords java, elasticsearch, spark, hadoop
# 1 hit
GET /article/_doc/_search
{
  "query": {
    "match": {
      "title": {
        "query": "java elasticsearch spark hadoop",
        "minimum_should_match": "75%"
      }
    }
  }
}

# Combine several search conditions with bool
# Find posts whose title must contain "java", must not contain "spark", and may or may not contain "hadoop" and "elasticsearch"
GET /article/_doc/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {"title": "java"}
      },
      "must_not": {
        "match": {"title": "spark"}
      },
      "should": [
        {"match": {"title": "hadoop"}},
        {"match": {"title": "elasticsearch"}}
      ]
    }
  }
}

# Result
{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.449981,
    "hits" : [
      {
        "_index" : "article",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.449981,
        "_source" : {
          "articleId" : "QQPX-R-3956-#aD8",
          "userId" : 2,
          "hidden" : true,
          "postDate" : "2017-01-02",
          "tag" : [
            "java",
            "elasticsearch",
            "hadoop"
          ],
          "tag_cnt" : 3,
          "view_cnt" : 80,
          "title" : "this is java, elasticsearch, hadoop blog"
        }
      },
      {
        "_index" : "article",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "articleId" : "XHDK-A-1293-#fJ3",
          "userId" : 1,
          "hidden" : false,
          "postDate" : "2017-01-01",
          "tag" : [
            "java",
            "elasticsearch"
          ],
          "tag_cnt" : 2,
          "view_cnt" : 30,
          "title" : "this is java and elasticsearch blog"
        }
      },
      {
        "_index" : "article",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.19856805,
        "_source" : {
          "articleId" : "KDKE-B-9947-#kL5",
          "userId" : 1,
          "hidden" : false,
          "postDate" : "2017-01-02",
          "tag" : [
            "java"
          ],
          "tag_cnt" : 1,
          "view_cnt" : 50,
          "title" : "this is java blog"
        }
      }
    ]
  }
}

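How is the relevance score computed when bool combines several search conditions?

relevance score = the sum of the scores of every must and should clause that matches

Older releases that scored with the classic TF/IDF similarity also multiplied this sum by a coordination factor (number of matching clauses / total number of must and should clauses). Either way, a document that matches more clauses scores higher.

Rank 1: the title contains "java" and both should keywords, "hadoop" and "elasticsearch"
Rank 2: the title contains "java" and one of the should keywords
Rank 3: the title contains "java" and none of the should keywords

should clauses do influence the relevance score. The must clause decides whether a document matches and contributes its own score. On top of that, the should clauses are optional, but the more of them a document matches, the higher its relevance score.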
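
The ranking above can be reproduced with a minimal Python sketch. It assumes, purely for illustration, that every matching clause is worth 1.0; real clause scores come from the similarity model and will differ. The function name `bool_score` and its arguments are hypothetical, not part of any Elasticsearch API.

```python
from typing import Optional, Sequence


def bool_score(
    must: Sequence[Optional[float]],
    should: Sequence[Optional[float]],
    must_not_hit: bool = False,
) -> Optional[float]:
    """Combine clause scores the way a bool query does (simplified).

    Each entry is the score of one clause, or None when the clause
    did not match. Returns None when the document is excluded.
    """
    # A must_not match, or any unmatched must clause, excludes the document.
    if must_not_hit or any(score is None for score in must):
        return None
    # Matching must and should clauses all add to the final score.
    return sum(must) + sum(score for score in should if score is not None)


# Hypothetical clause scores: 1.0 per matching clause.
doc4 = bool_score(must=[1.0], should=[1.0, 1.0])    # java + hadoop + elasticsearch
doc1 = bool_score(must=[1.0], should=[None, 1.0])   # java + elasticsearch
doc2 = bool_score(must=[1.0], should=[None, None])  # java only
doc5 = bool_score(must=[None], should=[None, None], must_not_hit=True)  # spark blog
```

With these assumed inputs doc4 outranks doc1, which outranks doc2, and doc5 is excluded, which is the same order as the result shown earlier.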

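# Use bool to combine several conditions and control full-text search precision
# Search for posts whose title contains at least 3 of the keywords "java", "hadoop", "spark", "elasticsearch"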
GET /article/_doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "java"}},
        {"match": {"title": "elasticsearch"}},
        {"match": {"title": "hadoop"}},
        {"match": {"title": "spark"}}
      ],
      "minimum_should_match": 3
    }
  }
}

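By default, when a must clause is present the should clauses do not have to match at all; when there is no must clause, at least one should clause has to match. The "minimum_should_match" parameter gives precise control over this behavior.

minimum_should_match:

Positive integer, e.g. 3: at least 3 of the should clauses must match
Negative integer, e.g. -2: up to 2 clauses may fail to match; all the others must match
Positive percentage: that percentage of the should clauses must match, rounded down. For example, with 10 clauses and 30%, at least 3 must match
Negative percentage: that percentage of the clauses (rounded down) may fail to match; all the others must match
Combination of a number and a percentage, e.g. 3<90%: with 3 or fewer clauses, all must match; otherwise 90% (rounded down) must match
Multiple combinations separated by spaces, e.g. 2<-25% 9<-3: with 2 or fewer clauses, all must match; with 3 to 9 clauses, 25% of them may fail to match and the rest must match; with more than 9 clauses, 3 may fail to match and the rest must match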
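
The rules in this list can be checked with a small Python sketch. It follows the rules as stated here and is not taken from the Elasticsearch source; the function name `min_should_match` is hypothetical, and the sketch assumes the spec is written without spaces around `<`.

```python
def min_should_match(clause_count: int, spec: str) -> int:
    """Return how many should clauses must match for a given spec."""
    result = clause_count
    spec = spec.strip()

    # Conditional specs such as "3<90%" or "2<-25% 9<-3".
    if "<" in spec:
        for part in spec.split():
            bound, sub_spec = part.split("<", 1)
            if clause_count <= int(bound):
                # At or below the bound: keep what has been decided so far
                # (all clauses, if this is the first condition).
                return result
            result = min_should_match(clause_count, sub_spec)
        return result

    if spec.endswith("%"):
        percent = int(spec[:-1])
        share = clause_count * abs(percent) // 100  # rounded down
        # Positive: that share must match. Negative: that share may be missing.
        result = share if percent >= 0 else clause_count - share
    else:
        value = int(spec)
        result = value if value >= 0 else clause_count + value

    return max(0, result)
```

For the query in this section, `min_should_match(4, "75%")` gives 3, which is why only the document containing three of the four keywords is returned.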

3. Automatic conversion of match queries

# When a match query searches for several terms, ES rewrites it internally into bool syntax
{
  "match": {"title": "java elasticsearch"}
}
# is converted to
{
  "bool": {
    "should": [
      {"term": {"title": "java"}},
      {"term": {"title": "elasticsearch"}}
    ]
  }
}

{
  "match": {
    "title": {
      "query": "java elasticsearch",
      "operator": "and"
    }
  }
}
# is converted to
{
  "bool": {
    "must": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"}}
    ]
  }
}

{
  "match": {
    "title": {
      "query":  "java elasticsearch hadoop spark",
      "minimum_should_match": "75%"
    }
  }
}
# is converted to
{
  "bool": {
    "should": [
      { "term": { "title": "java" }},
      { "term": { "title": "elasticsearch"}},
      { "term": { "title": "hadoop" }},
      { "term": { "title": "spark" }}
    ],
    "minimum_should_match": 3 
  }
}
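
The three conversions can be summarized in a short Python sketch that builds the equivalent bool body as a dict. This is only an illustration of the mapping described above: `rewrite_match` is a hypothetical helper, lowercasing and splitting on whitespace stands in for the real analyzer, and `minimum_should_match` is passed through unchanged rather than resolved to a number.

```python
from typing import Any, Dict, Optional


def rewrite_match(
    field: str,
    query: str,
    operator: str = "or",
    minimum_should_match: Optional[Any] = None,
) -> Dict[str, Any]:
    """Build the bool query that a multi-term match query corresponds to."""
    # Assumption: a trivial analyzer (lowercase + whitespace split).
    terms = query.lower().split()
    clauses = [{"term": {field: term}} for term in terms]

    # "and" requires every term (must); "or" makes each term optional (should).
    occur = "must" if operator == "and" else "should"
    body: Dict[str, Any] = {occur: clauses}

    if minimum_should_match is not None and occur == "should":
        body["minimum_should_match"] = minimum_should_match
    return {"bool": body}


or_query = rewrite_match("title", "java elasticsearch")
and_query = rewrite_match("title", "java elasticsearch", operator="and")
msm_query = rewrite_match(
    "title", "java elasticsearch hadoop spark", minimum_should_match="75%"
)
```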

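4. boost: weighting search conditions

Requirement: search for posts whose title contains "blog"; titles that also contain "java", "hadoop", "elasticsearch" or "spark" are welcome, and posts containing "spark" should be returned first

Key point: boost sets the weight of a search condition. When the weight of one condition is raised, a document matching that condition gets a higher relevance score than a document matching a lower-weighted condition, and is therefore returned first

By default every search condition has the same weight, 1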

GET /article/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"title": "blog"}}
      ],
      "should": [
        {"match": {"title": "java"}},
        {"match": {"title": "hadoop"}},
        {"match": {"title": "elasticsearch"}},
        {
          "match": {
            "title": {
              "query": "spark",
              "boost": 5.0
            } 
          }
        }
      ]
    }
  }
}

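5. Inaccurate relevance scores with multiple shards

If an index has several shards, search results may be ranked inaccurately, for the following reason.

When a search request is forwarded to a shard, the shard scores each document roughly like this:

Count how many times the keyword appears in the given field of this document (term frequency, TF)
Count how many times the keyword appears in the given field across all documents on this shard (document frequency, DF; IDF is its inverse)
Note: a shard holds only part of the index's documents, and by default these statistics are computed locally on each shard

The relevance score grows with TF and shrinks as DF grows. Ignoring every other factor (the score depends on more than these two), we can treat score = TF / DF.

Suppose that on shard A, "java" appears 10 times in the "title" field of one document, and 100 times in the "title" field across all documents on shard A. On shard A that document gets score = 10/100 = 0.1.

Suppose that on shard B, "java" appears once in the "title" field of one document, and also only once in the "title" field across all documents on shard B. On shard B that document gets score = 1/1 = 1.

The ranking is therefore wrong: the document on shard A should score higher than the one on shard B. The cause is that the frequency statistics are computed locally on each shard; computing them across all shards would avoid the problem.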
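
The two-shard example can be worked through in a few lines of Python. The sketch uses the same deliberately simplified model as the text, score = TF / DF, and the same numbers; `simple_score` is a hypothetical helper and real scoring involves more factors.

```python
def simple_score(tf: int, df: int) -> float:
    """Simplified relevance: term frequency divided by document frequency."""
    return tf / df


# Statistics for "java" in the "title" field, taken from the example above.
shards = {
    "A": {"tf": 10, "df": 100},
    "B": {"tf": 1, "df": 1},
}

# Default behavior: each shard only sees its own document frequency.
local_scores = {
    name: simple_score(stats["tf"], stats["df"]) for name, stats in shards.items()
}

# Global statistics: the document frequency is summed over every shard first.
global_df = sum(stats["df"] for stats in shards.values())
global_scores = {
    name: simple_score(stats["tf"], global_df) for name, stats in shards.items()
}

print(local_scores)   # shard B's document wins with shard-local statistics
print(global_scores)  # shard A's document wins with global statistics
```

With shard-local statistics the document on shard B ranks first; once the document frequency is shared across shards, the document on shard A ranks first, as expected.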

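Solutions:

In production, with a large data volume, distribute the data as evenly as possible

With a lot of data, ES generally routes documents evenly across shards, balancing the load by _id. For example, if 10 documents contain "java" in "title" and the index has 5 shards, an even distribution puts 2 of them on each shard, and the computed scores are then consistent
In a test environment, create the index with a single primary shard; with only one shard the problem cannot occur
In a test environment, add the search_type=dfs_query_then_fetch parameter to the search. It makes ES compute the statistics globally rather than per shard, which solves the problem but costs performance, so it is not recommended in production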

6. dis_max: implementing the best_fields strategy

6.1 dis_max

# Add a "content" field to the posts
POST /article/_doc/_bulk
{"update": {"_id": "1"}}
{"doc": {"content": "i like to write best elasticsearch article"}}
{"update": {"_id": "2"}}
{"doc": {"content": "i think java is the best programming language"}}
{"update": {"_id": "3"}}
{"doc": {"content": "i am only an elasticsearch beginner"}}
{"update": { "_id": "4"}}
{"doc": {"content": "elasticsearch and hadoop are all very good solution, i am a beginner"}}
{"update": { "_id": "5"}}
{"doc": {"content": "spark is best big data solution based on scala ,an programming language similar to java"}}

# Search for posts whose title or content contains java or solution
GET /article/_doc/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  },
  "_source": ["title", "content"]
}

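# Result
# 1. doc2(score=0.95): title matches "java", content matches "java"
# 2. doc4(score=0.81): title matches "java", content matches "solution"
# 3. doc5(score=0.58): title has no match, content matches "java" and "solution"
# 4. doc1(score=0.29): title matches "java", content has no match
# We would expect doc5 to rank near the top, because its content matches both "java" and "solution"
# In practice doc2 and doc4 rank higher

# A simplified look at the scores of doc4 and doc5
# (a rough model in the style of classic TF/IDF scoring with a coordination factor; real scores depend on more factors)
# score = sum of the query scores * number of matching queries / total number of queries
# Assume each matched word in a query is worth 1 point
# doc4 then scores (1+1)*2/2=2
# and doc5 scores (0+2)*1/2=1
# so doc4 ranks ahead of doc5

dis_max query: a document in which a single field matches as many keywords as possible should score higher than a document in which many fields each match only a few keywords
How dis_max works: of the several queries, the score of the best-scoring query becomes the final score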

GET /article/_doc/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java solution"}},
        {"match": {"content": "java solution"}}
      ]
    }
  },
  "_source": ["title", "content"]
}

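# Result
# 1. doc2(score=0.75): title matches "java", content matches "java"
# 2. doc4(score=0.64): title matches "java", content matches "solution"
# 3. doc5(score=0.58): title has no match, content matches "java" and "solution"
# 4. doc1(score=0.29): title matches "java", content has no match
# Notes
# This did not produce the result I expected, which was for doc5 to rank higher; other scoring factors are probably involved
# The scores of doc2 and doc4 did drop, though, so dis_max does have an effect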

6.2 tie_breaker

# Try a different search condition
GET /article/_doc/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java beginner"}},
        {"match": {"content": "java beginner"}}
      ]
    }
  },
  "_source": ["title", "content"]
}

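# Result
# 1. doc2(score=0.75): title matches java, content matches java
# 2. doc4(score=0.64): title matches java, content matches beginner
# 3. doc5(score=0.29): title has no match, content matches java
# 4. doc1(score=0.29): title matches java, content has no match
# 5. doc3(score=0.29): title has no match, content matches beginner

# tie_breaker
# The result above is fairly close to what we expect. Here is a hypothetical result that is not:
# 1. docA: title matches java, content has no match
# 1. docB: title has no match, content matches beginner
# 2. docC: title matches java, content matches beginner
# Here we would probably want docC to rank ahead of the other two, and that is what tie_breaker is for
# dis_max takes only the highest score among the queries and ignores the scores of the others
# tie_breaker brings the other queries' scores into the calculation
# What the tie_breaker parameter does:
# multiply the scores of the other queries by tie_breaker
# then add them to the highest score
# so besides the best score, the other queries' scores also count
# tie_breaker is a decimal between 0 and 1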

GET /article/_doc/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {"match": {"title": "java beginner"}},
        {"match": {"content": "java beginner"}}
      ],
      "tie_breaker": 0.3
    }
  },
  "_source": ["title", "content"]
}

# Result
# 1. doc2(score=0.81): title matches java, content matches java
# 2. doc4(score=0.69): title matches java, content matches beginner
# 3. doc5(score=0.29): title has no match, content matches java
# 4. doc1(score=0.29): title matches java, content has no match
# 5. doc3(score=0.29): title has no match, content matches beginner
# The scores are clearly affected
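
The effect of tie_breaker can be sketched in Python using the docA/docB/docC scenario described earlier in this section. The per-query scores of 1.0 are assumed values for illustration only, and `dis_max_score` is a hypothetical helper that follows the description given here: the best score, plus tie_breaker times the scores of the other matching queries.

```python
from typing import Sequence


def dis_max_score(query_scores: Sequence[float], tie_breaker: float = 0.0) -> float:
    """Best query score plus tie_breaker times the other query scores."""
    matched = [score for score in query_scores if score > 0]
    if not matched:
        return 0.0
    best = max(matched)
    others = sum(matched) - best
    return best + tie_breaker * others


# Assumed scores per query: [title query, content query].
docs = {
    "docA": [1.0, 0.0],  # title matches java, content has no match
    "docB": [0.0, 1.0],  # title has no match, content matches beginner
    "docC": [1.0, 1.0],  # both fields match
}

plain = {name: dis_max_score(scores) for name, scores in docs.items()}
with_tie = {name: dis_max_score(scores, tie_breaker=0.3) for name, scores in docs.items()}
```

Without tie_breaker all three documents tie under these assumed scores; with tie_breaker set to 0.3, docC moves ahead because its second matching field now contributes.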

6.3 Using multi_match, dis_max and tie_breaker together

# "type": "best_fields" is the dis_max strategy: a document scores higher when one field matches as many keywords as possible
# "title^2" multiplies the weight of the title field by 2
GET /article/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "java solution",
      "type": "best_fields",
      "fields": ["title^2", "content"],
      "tie_breaker": 0.3,
      "minimum_should_match": "50%"
    }
  },
  "_source": ["title", "content"]
}

# The query above gives the same results as the query below
GET /article/_doc/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "match": {
            "title": {
              "query": "java solution",
              "minimum_should_match": "50%",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "java solution",
              "minimum_should_match": "50%"
            }
          }
        }
      ],
      "tie_breaker": 0.3
    }
  },
  "_source": ["title", "content"]
}

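# What minimum_should_match is for: trimming the long tail
# For example, a search has 5 keywords but many hits match only 1 of them; those hits are far from what was wanted, and they form the long tail
# minimum_should_match controls the precision of the results: only documents that match a certain number of keywords are returned

7. The most-fields strategy

best-fields strategy: a doc scores higher when a single field matches as many keywords as possible
most-fields strategy: a doc scores higher when as many fields as possible match the keywords

# Add test data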
POST /article/_mapping/_doc
{
  "properties": {
      "sub_title": { 
          "type": "text",
          "analyzer": "english",
          "fields": {
              "std": { 
                "type": "text",
                "analyzer": "standard"
              }
          }
      }
  }
}

POST /article/_doc/_bulk
{"update": {"_id": "1"}}
{"doc": {"sub_title": "learning more courses"}}
{"update": {"_id": "2"}}
{"doc": {"sub_title": "learned a lot of course"}}
{"update": {"_id": "3"}}
{"doc": {"sub_title" : "we have a lot of fun"}}
{"update": {"_id": "4"}}
{"doc": {"sub_title": "both of them are good"}}
{"update": {"_id": "5"}}
{"doc": {"sub_title": "haha, hello world"}}

GET /article/_doc/_search
{
  "query": {
    "match": {
      "sub_title": "learning courses"
    }
  }
}
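# Result: two hits, doc1 and doc2
# The english analyzer stems the words: learning and learned both become learn, and courses and course share one stem
# so both documents are found

# Only one hit: doc1
# sub_title.std uses the standard analyzer, which splits text on non-letter, non-digit characters and lowercases the words, without stemming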
GET /article/_doc/_search
{
  "query": {
    "match": {
      "sub_title.std": "learning courses"
    }
  }
}

# most_fields strategy
GET /article/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "learning courses",
      "type": "most_fields",
      "fields": ["sub_title", "sub_title.std"]
    }
  },
  "_source": ["sub_title"]
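}
# 1. doc2(score=1.39): sub_title matches learn and course, sub_title.std has no match
# 2. doc1(score=1.15): sub_title matches learn and course, sub_title.std matches learning and courses
# This result is also hard to explain, since we expected doc1 to be returned ahead of doc2
# Scoring is complicated. It is not only TF/IDF; different queries and different syntax each have their own scoring details, so we will not dig deeper here

Differences between the best_fields and most_fields strategies:

best_fields searches several fields and takes the score of the best-matching field; through tie_breaker it can also take the other queries' scores into account to some degree, which separates documents whose best scores are equal. Put simply, when searching several fields, a doc scores higher if one of its fields contains as many of the keywords as possible

Pros: with the best_fields strategy, some consideration of the other fields, and minimum_should_match support, the most precise matches can be pushed to the top

Cons: apart from those precise matches, the remaining results score about the same, so their ordering is uneven and shows little differentiation

Example: Baidu, where the best match comes first

most_fields searches several fields together and lets as many field queries as possible contribute to the total score; the more fields match the keywords, the higher the doc scores

Pros: results that match more fields are pushed to the top, and the overall ordering is fairly even

Cons: the most precise matches may not reach the top

Example: wiki, a clear case of most_fields; the results are fairly even, but you do have to page through several pages to find the best match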
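
The difference between the two strategies can be seen in a minimal Python sketch that ranks the same two documents both ways. The per-field scores are invented for illustration, the function names are hypothetical, and most_fields is modelled simply as adding the field scores together.

```python
from typing import Dict, List, Sequence


def best_fields(field_scores: Sequence[float], tie_breaker: float = 0.0) -> float:
    """dis_max style: the best field wins, others count only via tie_breaker."""
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


def most_fields(field_scores: Sequence[float]) -> float:
    """Every matching field adds to the total."""
    return sum(field_scores)


# Invented per-field scores for two documents.
docs: Dict[str, List[float]] = {
    "one_strong_field": [2.0, 0.0],  # one field matches very well
    "two_weak_fields": [1.2, 1.2],   # two fields match moderately
}

best_ranking = sorted(docs, key=lambda name: best_fields(docs[name]), reverse=True)
most_ranking = sorted(docs, key=lambda name: most_fields(docs[name]), reverse=True)
```

Under best_fields the document with one strong field comes first; under most_fields the document matched by two fields overtakes it, which is the trade-off described above.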

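8. cross-fields search

cross-fields search:

The text being searched for is spread across several fields. For example, when searching for "James Bob", "James" is stored in the "first_name" field and "Bob" in the "last_name" field; or when searching for an address, the pieces are scattered over fields such as "country", "province" and "city". Searches like these are called cross-fields searches.

The most_fields strategy is a natural first choice for cross-fields search, because a cross-fields search has to look in several fields anyway, and most_fields tries to match the keywords in as many fields as possible

# Add test data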
POST /article/_doc/_bulk
{"update": {"_id": "1"}}
{"doc" : {"author_first_name": "Peter", "author_last_name": "Smith"}}
{"update": {"_id": "2"}}
{"doc" : {"author_first_name": "Smith", "author_last_name": "Williams"}}
{"update": {"_id": "3"}}
{"doc" : {"author_first_name": "Jack", "author_last_name": "Ma"}}
{"update": {"_id": "4"}}
{"doc" : {"author_first_name": "Robbin", "author_last_name": "Li"}}
{"update": {"_id": "5"}}
{"doc" : {"author_first_name": "Tonny", "author_last_name": "Peter Smith"}}

GET /article/_doc/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "most_fields",
      "fields": ["author_first_name", "author_last_name"]
    }
  },
  "_source": ["author_first_name", "author_last_name"]
}

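# Result
# 1. doc2(score=0.69): author_first_name matches Smith, author_last_name has no match
# 2. doc5(score=0.58): author_first_name has no match, author_last_name matches Peter and Smith
# 3. doc1(score=0.58): author_first_name matches Peter, author_last_name matches Smith
# Analysis: why does the doc that matches only "Smith" rank highest?
# The score is driven by TF/IDF: the more documents a term appears in within a field, the lower its IDF and therefore its score
# Across all documents, in the "author_last_name" field Smith appears 2 times and Peter 1 time
# Across all documents, in the "author_first_name" field Smith appears 1 time and Peter 1 time
# So a match on Smith in author_last_name is worth less than a match on Smith in author_first_name
# Many other factors affect the score, of course; this is only the general rule

Problems with using most_fields for cross-fields search:

Problem 1: a doc in which more fields match a keyword scores higher than a doc in which fewer fields match several keywords
Problem 2: minimum_should_match cannot be used to trim the long tail, that is, the docs that match very little
Problem 3: TF/IDF can produce results that do not meet expectations, as in the example above

Solution 1: copy_to, which combines several fields into one. With copy_to, the values of several fields are copied into a single field and indexed there. The copied values are searchable but do not appear in _source, so the combined field behaves like a hidden field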

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "first_name": "Peter",
  "last_name": "Smith"
}

PUT my_index/_doc/2
{
  "first_name": "Smith",
  "last_name": "Williams"
}

PUT my_index/_doc/3
{
  "first_name": "Jack",
  "last_name": "Ma"
}

PUT my_index/_doc/4
{
  "first_name": "Robbin",
  "last_name": "Li"
}

PUT my_index/_doc/5
{
  "first_name": "Tonny",
  "last_name": "Peter Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": "Peter Smith"
    }
  }
}

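# Result
# 1. doc2(score=0.69): first_name matches Smith, last_name has no match
# 2. doc5(score=0.58): first_name has no match, last_name matches Peter and Smith
# 3. doc1(score=0.58): first_name matches Peter, last_name matches Smith
# This result does not meet expectations either; it comes down to how ES computes scores, which depends on many factors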

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "Peter Smith",
        "operator": "and"
      }
    }
  }
}

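# Result
# 1. doc5(score=0.58): first_name has no match, last_name matches Peter and Smith
# 2. doc1(score=0.58): first_name matches Peter, last_name matches Smith

Solution 2: the native cross_fields syntax

# With most_fields, a doc matches as soon as any one field contains any one keyword; cross_fields with "operator": "and" is stricter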
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Peter Smith",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["first_name", "last_name"]
    }
  }
}

# Result
[
    {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 0.5753642,
        "_source" : {
            "first_name" : "Tonny",
            "last_name" : "Peter Smith"
        }
    },
    {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
            "first_name" : "Peter",
            "last_name" : "Smith"
        }
    }
]

# How type=cross_fields works
# Peter must appear in first_name or last_name
# Smith must appear in first_name or last_name
# doc2(first_name=Smith,last_name=Williams) therefore no longer qualifies
# whereas the earlier most_fields only needed one keyword to appear in any one field
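
The matching rule just described can be sketched in Python over the five my_index documents. This only models which documents match, not how they are scored; `cross_fields_and` is a hypothetical helper, and lowercasing plus whitespace splitting stands in for the real analyzer.

```python
from typing import Dict, List, Sequence


def cross_fields_and(doc: Dict[str, str], query: str, fields: Sequence[str]) -> bool:
    """True when every query term appears in at least one of the fields."""
    # Assumption: a trivial analyzer (lowercase + whitespace split).
    field_tokens = {name: set(doc.get(name, "").lower().split()) for name in fields}
    terms = query.lower().split()
    return all(
        any(term in tokens for tokens in field_tokens.values()) for term in terms
    )


# The documents indexed into my_index above.
my_index: Dict[int, Dict[str, str]] = {
    1: {"first_name": "Peter", "last_name": "Smith"},
    2: {"first_name": "Smith", "last_name": "Williams"},
    3: {"first_name": "Jack", "last_name": "Ma"},
    4: {"first_name": "Robbin", "last_name": "Li"},
    5: {"first_name": "Tonny", "last_name": "Peter Smith"},
}

matching_ids: List[int] = [
    doc_id
    for doc_id, doc in my_index.items()
    if cross_fields_and(doc, "Peter Smith", ["first_name", "last_name"])
]
```

Only documents 1 and 5 satisfy the rule, which agrees with the two hits returned by the cross_fields query; document 2 contains Smith but not Peter and is filtered out.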
