Elasticsearch Search in Depth: Multi-Field Search

Author: 觉释 | Source: published 2020-08-28 08:29

Multiple Query Strings

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }}
      ]
    }
  }
}

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":  "War and Peace" }},
        { "match": { "author": "Leo Tolstoy"   }},
        { "bool":  {
          "should": [
            { "match": { "translator": "Constance Garnett" }},
            { "match": { "translator": "Louise Maude"      }}
          ]
        }}
      ]
    }
  }
}

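Why put the translator clauses inside a separate bool query? All four match queries are should clauses, so why not place the translator clauses at the same level as the title and author clauses?

The answer lies in how the score is calculated. The bool query runs each match query, adds their scores together, then multiplies the sum by the number of matching clauses and divides by the total number of clauses. Each clause at the same level has the same weight. In the preceding example, the bool query containing the translator clauses counts for only one-third of the total score. If the translator clauses were placed at the same level as title and author, the title and author clauses would each contribute only one-quarter of the score.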
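
The equal-weight-per-level rule described above can be illustrated with a small Python sketch. It is a simplified model, not Elasticsearch's scoring code: it ignores the actual relevance scores and only splits a notional share of 1 equally among the should clauses at each level. The function name clause_shares is hypothetical.

```python
from fractions import Fraction


def clause_shares(bool_query, share=Fraction(1)):
    """Split `share` equally among the should clauses of a bool query.

    Returns a list of (field, query_text, share) tuples for the leaf match
    clauses. A nested bool clause passes its share down to its own children.
    This models only the "same level, same weight" rule, not real scoring.
    """
    clauses = bool_query["bool"]["should"]
    per_clause = share / len(clauses)
    leaves = []
    for clause in clauses:
        if "bool" in clause:
            leaves.extend(clause_shares(clause, per_clause))
        else:
            # A simple match clause has exactly one field -> query text entry.
            (field, text), = clause["match"].items()
            leaves.append((field, text, per_clause))
    return leaves


# Nested form: the translator clauses sit inside their own bool query.
nested_query = {"bool": {"should": [
    {"match": {"title": "War and Peace"}},
    {"match": {"author": "Leo Tolstoy"}},
    {"bool": {"should": [
        {"match": {"translator": "Constance Garnett"}},
        {"match": {"translator": "Louise Maude"}},
    ]}},
]}}

# Flat form: all four match clauses at the same level.
flat_query = {"bool": {"should": [
    {"match": {"title": "War and Peace"}},
    {"match": {"author": "Leo Tolstoy"}},
    {"match": {"translator": "Constance Garnett"}},
    {"match": {"translator": "Louise Maude"}},
]}}

for label, query in (("nested", nested_query), ("flat", flat_query)):
    for field, text, share in clause_shares(query):
        print(f"{label:<7} {field:<11} {text:<18} {share}")
```

In the nested form title and author each keep one-third and the two translator clauses split the remaining third; in the flat form every clause drops to one-quarter.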

Prioritizing Clauses

To give the title and author clauses more weight, assign them a boost value greater than 1:

GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { 
            "title":  {
              "query": "War and Peace",
              "boost": 2
        }}},
        { "match": { 
            "author":  {
              "query": "Leo Tolstoy",
              "boost": 2
        }}},
        { "bool":  { 
            "should": [
              { "match": { "translator": "Constance Garnett" }},
              { "match": { "translator": "Louise Maude"      }}
            ]
        }}
      ]
    }
  }
}

The title and author clauses have a boost value of 2.
The nested bool clause has the default boost value of 1.

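Single Query String

Best Fields

Suppose a website lets users search the content of blog posts. Take the following two blog post documents as examples: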

PUT /my_index/my_type/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}

PUT /my_index/my_type/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}

{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

Tuning Best Fields Queries

As shown below, a simple dis_max query uses the single best-matching field and ignores the other matches:

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

The tie_breaker Parameter

Specifying the tie_breaker parameter takes the scores of the other matching clauses into account as well:

{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.3
        }
    }
}

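The tie_breaker parameter offers a middle ground between dis_max and bool. The score is calculated as follows:

1. Take the _score of the best-matching clause.
2. Multiply the score of each of the other matching clauses by tie_breaker.
3. Add up all of these scores and normalize the result.

With tie_breaker, all matching clauses are considered, but the best-matching clause still accounts for most of the final result.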
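
The steps above can be sketched in a few lines of Python. This is an illustration only: the per-clause scores are made-up inputs, the function name dis_max_score is hypothetical, and the final normalization step is left out.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine per-clause scores the way the tie_breaker steps describe.

    clause_scores holds one entry per clause: a float if the clause matched,
    or None if it did not. Normalization is intentionally omitted.
    """
    matched = sorted((s for s in clause_scores if s is not None), reverse=True)
    if not matched:
        return 0.0
    best = matched[0]            # step 1: score of the best-matching clause
    others = sum(matched[1:])    # scores of the remaining matching clauses
    return best + tie_breaker * others  # steps 2 and 3, without normalization


# Hypothetical scores for the title and body clauses of one document.
scores = [0.5, 0.2]
print(dis_max_score(scores))                    # plain dis_max: best field only
print(dis_max_score(scores, tie_breaker=0.3))   # other matches count a little
print(dis_max_score(scores, tie_breaker=1.0))   # plain sum of all matches
```

With tie_breaker at 0 only the best clause counts, and at 1 every matching clause counts in full, which is why values in between behave like a compromise between dis_max and bool.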

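The multi_match Query

The multi_match query provides a convenient way to run the same query against multiple fields.
By default the query type is best_fields, which means it generates a match query for each field and wraps them all in a dis_max query, like this: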

{
  "dis_max": {
    "queries":  [
      {
        "match": {
          "title": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      {
        "match": {
          "body": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      }
    ],
    "tie_breaker": 0.3
  }
}

The query above can be rewritten more concisely with multi_match:

{
    "multi_match": {
        "query":                "Quick brown fox",
        "type":                 "best_fields", 
        "fields":               [ "title", "body" ],
        "tie_breaker":          0.3,
        "minimum_should_match": "30%" 
    }
}

The best_fields type is the default and can be omitted.
Parameters such as minimum_should_match or operator are passed through to the generated match queries.
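
To make the equivalence concrete, here is a rough Python sketch of the expansion described above. It only rebuilds the query body as a dictionary; the function name expand_best_fields is hypothetical, and the sketch handles just the parameters that appear in this article.

```python
def expand_best_fields(multi_match):
    """Rewrite a best_fields multi_match body as the equivalent dis_max body."""
    # Parameters that the article says are passed through to each match query.
    passthrough = {
        key: multi_match[key]
        for key in ("minimum_should_match", "operator")
        if key in multi_match
    }
    queries = [
        {"match": {field: {"query": multi_match["query"], **passthrough}}}
        for field in multi_match["fields"]
    ]
    dis_max = {"queries": queries}
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}


multi_match = {
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": ["title", "body"],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%",
}
expanded = expand_best_fields(multi_match)
print(expanded)
```

The resulting dictionary has the same shape as the longer dis_max query shown earlier: one match clause per field, each carrying minimum_should_match, with tie_breaker kept at the dis_max level.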

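Using Wildcards in Field Names

Field names can be specified with wildcards: any field that matches the wildcard pattern is included in the search. For example, the following matches the book_title, chapter_title, and section_title fields at once: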

{
    "multi_match": {
        "query":  "Quick brown fox",
        "fields": "*_title"
    }
}

Boosting Individual Fields

Individual fields can be boosted with the caret (^) syntax: append ^boost to the field name, where boost is a floating-point number:

{
    "multi_match": {
        "query":  "Quick brown fox",
        "fields": [ "*_title", "chapter_title^2" ] 
    }
}

The chapter_title field has a boost value of 2, while the other two fields, book_title and section_title, have the default boost value of 1.
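
The field syntax can be explored with a short Python sketch. It is not how Elasticsearch resolves fields internally: Python's fnmatch module stands in for the wildcard matching, and the helper names and the list of mapped fields are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def parse_field_spec(spec):
    """Split 'name^boost' into (name, boost); the boost defaults to 1.0."""
    name, caret, boost = spec.partition("^")
    return name, float(boost) if caret else 1.0


def matching_fields(pattern, mapped_fields):
    """Return the mapped field names that match a wildcard pattern."""
    return [field for field in mapped_fields if fnmatchcase(field, pattern)]


# Hypothetical field names assumed to exist in the index mapping.
mapped_fields = ["book_title", "chapter_title", "section_title", "body"]

print(parse_field_spec("chapter_title^2"))
print(parse_field_spec("*_title"))
print(matching_fields("*_title", mapped_fields))
```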

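Most Fields

Multifield Mapping

The first thing to do is index the field twice: once in stemmed form and once in unstemmed form. This is done with multifields, which were introduced earlier in the discussion of multifields: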

DELETE /my_index

PUT /my_index
{
    "settings": { "number_of_shards": 1 }, 
    "mappings": {
        "my_type": {
            "properties": {
                "title": { 
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "std":   { 
                            "type":     "string",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}

Next, index some documents:

PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }

PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }

Here a simple match query checks whether the title field contains jumping rabbits:

GET /my_index/_search
{
   "query": {
        "match": {
            "title": "jumping rabbits"
        }
    }
}

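Because of the english analyzer, this query looks for documents containing the two stemmed terms jump and rabbit. The title field of both documents contains both terms, so both documents receive the same score: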

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.42039964,
        "_source": {
           "title": "My rabbit jumps"
        }
     },
     {
        "_id": "2",
        "_score": 0.42039964,
        "_source": {
           "title": "Jumping jack rabbits"
        }
     }
  ]
}

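If we queried only the title.std field, only document 2 would match. However, if we query both fields and combine their scores with a bool query, both documents match (thanks to the title field) and document 2 gets a higher relevance score (thanks to the title.std field):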

GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":  "jumping rabbits",
            "type":   "most_fields", 
            "fields": [ "title", "title.std" ]
        }
    }
}

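We want to combine the scores from all matching fields, so we use the most_fields type. This makes the multi_match query wrap the two field clauses in a bool query instead of a dis_max query.

The contribution of each field to the final score can be controlled with a custom boost value. For example, boosting the title field makes it more important and, at the same time, reduces the effect of the other signal fields: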

GET /my_index/_search
{
   "query": {
        "multi_match": {
            "query":       "jumping rabbits",
            "type":        "most_fields",
            "fields":      [ "title^10", "title.std" ] 
        }
    }
}

The boost value of 10 on the title field makes it more important than title.std.
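
As a rough illustration of the wrapping described above, the following Python sketch rewrites a most_fields multi_match body into a bool query with one should clause per field, carrying over any ^boost suffix. The function name expand_most_fields is hypothetical, and the sketch covers only the parameters used in this article.

```python
def expand_most_fields(multi_match):
    """Rewrite a most_fields multi_match body as a bool/should query body."""
    should = []
    for spec in multi_match["fields"]:
        # 'title^10' -> field name 'title' with boost 10.0
        name, caret, boost = spec.partition("^")
        clause = {"query": multi_match["query"]}
        if caret:
            clause["boost"] = float(boost)
        should.append({"match": {name: clause}})
    return {"bool": {"should": should}}


multi_match = {
    "query": "jumping rabbits",
    "type": "most_fields",
    "fields": ["title^10", "title.std"],
}
expanded = expand_most_fields(multi_match)
print(expanded)
```

Because the clauses end up in a bool query, the scores of title and title.std are added together rather than reduced to the single best field.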

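Cross-Fields Entity Search

A Naive Approach

Querying each field in turn and adding up the scores of the matching fields sounds like a job for the bool query: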

{
  "query": {
    "bool": {
      "should": [
        { "match": { "street":    "Poland Street W1V" }},
        { "match": { "city":      "Poland Street W1V" }},
        { "match": { "country":   "Poland Street W1V" }},
        { "match": { "postcode":  "Poland Street W1V" }}
      ]
    }
  }
}

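Repeating the query string for every field quickly makes the query verbose. Instead, use a multi_match query with type set to most_fields, which tells Elasticsearch to combine the scores of all matching fields: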

{
  "query": {
    "multi_match": {
      "query":       "Poland Street W1V",
      "type":        "most_fields",
      "fields":      [ "street", "city", "country", "postcode" ]
    }
  }
}

Field-Centric Queries

Matching the Same Word in Multiple Fields

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":   "Poland Street W1V",
      "type":    "most_fields",
      "fields":  [ "street", "city", "country", "postcode" ]
    }
  }
}

Solution

{
    "first_name":  "Peter",
    "last_name":   "Smith",
    "full_name":   "Peter Smith"
}

Custom _all Fields

PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name" 
                },
                "last_name": {
                    "type":     "string",
                    "copy_to":  "full_name" 
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}

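The values in the first_name and last_name fields are copied to the full_name field.
With this mapping, we can query the first name with first_name, the last name with last_name, or the whole name directly with full_name.
The mappings of first_name and last_name have no effect on how full_name is indexed: full_name receives its own copy of the contents of both fields and indexes them according to its own mapping.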
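
The copying behaviour can be pictured with a small Python simulation. It is only a model of which values each field would receive at index time, under the mapping shown above; the function name apply_copy_to is hypothetical and no Elasticsearch client is involved.

```python
def apply_copy_to(document, properties):
    """Return {field: [values]} after applying copy_to rules from a mapping.

    `properties` is the "properties" section of a mapping. The input document
    is not modified; only the per-field value lists are built.
    """
    indexed = {field: [value] for field, value in document.items()}
    for field, value in document.items():
        targets = properties.get(field, {}).get("copy_to")
        if targets is None:
            continue
        if isinstance(targets, str):
            targets = [targets]  # copy_to may name one field or a list
        for target in targets:
            indexed.setdefault(target, []).append(value)
    return indexed


properties = {
    "first_name": {"type": "string", "copy_to": "full_name"},
    "last_name": {"type": "string", "copy_to": "full_name"},
    "full_name": {"type": "string"},
}
document = {"first_name": "Peter", "last_name": "Smith"}

indexed = apply_copy_to(document, properties)
print(indexed["full_name"])
```

The full_name field ends up with both name parts even though the document itself never contained a full_name value.
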
The copy_to setting belongs on the "main" field rather than on a multifield; placing it there achieves the desired effect easily:

PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name", 
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}

copy_to is set on the "main" field, not on the multifield.

cross_fields Queries

Per-Field Boosting

GET /books/_search
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title^2", "description" ] 
        }
    }
}

The title field has a boost value of 2, and the description field has the default boost value of 1.

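Exact-Value Fields

Before leaving the topic of multifield queries, the last thing to discuss is exact-value not_analyzed fields. Mixing not_analyzed fields with analyzed fields in a multi_match query is of little use.

The reason becomes clear from the query explanation. Suppose the title field is set to not_analyzed: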

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields",
            "fields":      [ "title", "first_name", "last_name" ]
        }
    }
}

Because the title field is not analyzed, Elasticsearch searches it for the complete string "peter smith" as a single term:

title:peter smith
(
    blended("peter", fields: [first_name, last_name])
    blended("smith", fields: [first_name, last_name])
)

That term clearly does not exist in the inverted index of the title field, so not_analyzed fields should be avoided in multi_match queries.
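
The shape of the explanation above can be reproduced with a simplified Python model. It assumes that fields are grouped by the terms their analysis produces, and it uses lowercasing plus whitespace splitting as a stand-in for a real analyzer; both helper names are hypothetical.

```python
def terms_for_field(query_text, analyzed):
    """Return the terms a field would be searched for.

    An analyzed field gets one term per word (crudely: lowercase + split);
    a not_analyzed field gets the whole query string as a single term.
    """
    if analyzed:
        return query_text.lower().split()
    return [query_text]


def group_fields_by_terms(query_text, fields):
    """Group field names by the terms produced for them."""
    groups = {}
    for name, analyzed in fields.items():
        key = tuple(terms_for_field(query_text, analyzed))
        groups.setdefault(key, []).append(name)
    return groups


# title is not_analyzed; first_name and last_name are analyzed.
fields = {"title": False, "first_name": True, "last_name": True}
groups = group_fields_by_terms("peter smith", fields)
for terms, names in groups.items():
    print(terms, "->", names)
```

The title field ends up alone in a group whose only term is the full string, while first_name and last_name share the per-word terms that can be blended across fields.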