4.1 - Term-Based and Full-Text Search

Author: 落日彼岸 | Published 2020-03-30 15:16

Term-Based Queries

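Why terms matter
- A term is the smallest unit that carries meaning. Both search and natural language processing based on statistical language models operate on terms.
Characteristics
- Term-level queries: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
- In Elasticsearch, a term query does not analyze its input. The input is treated as a single unit and looked up as an exact term in the inverted index, and every document containing that term is scored with the relevance formula. Example: "Apple Store".
- Constant Score can turn the query into a filter, which skips scoring and makes use of caching to improve performance.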
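
The "no analysis on input" behaviour can be illustrated with a toy inverted index in Python. This is only a sketch: analyze is a rough stand-in for the standard analyzer (split on non-alphanumeric characters, then lowercase), and build_index, term_query and the two sample documents are hypothetical names and data, not Elasticsearch APIs.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """Look the value up as-is; the input is NOT analyzed."""
    return sorted(index.get(value, set()))


docs = {1: "Apple Store", 2: "apple pie"}  # hypothetical sample data
index = build_index(docs)

print(term_query(index, "apple"))        # the lowercase term exists in the index
print(term_query(index, "Apple Store"))  # the whole string was never indexed as one term
```

The indexed terms are apple, store and pie, so only an input that is already one of those exact terms can match.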

A Term Query Example

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

What does each of the following queries return?
If a query finds nothing, why?
How can it be fixed?

GET /products

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        //"value": "iPhone" // no hits: the indexed term is lowercase
        "value":"iphone" // returns a hit
      }
    }
  }
}

POST /products/_search
{
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone" // returns a hit: keyword stores the original value
        //"value":"iphone" // no hits
      }
    }
  }
}


POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3" // no hits
        //"value": "xhdk" // returns a hit: it is one of the analyzed tokens
        //"value": "xhdk-a-1293-#fJ3" // no hits
      }
    }
  }
}

POST /products/_search
{
  //"explain": true,
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3" // returns a hit
      }
    }
  }
}

// Inspect the analysis result
POST /_analyze
{
 "analyzer": "standard",
 "text": ["XHDK-A-1293-#fJ3"]
}

//res
{
  "tokens" : [
    {
      "token" : "xhdk",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "1293",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "fj3",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
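
To see why only "xhdk" matched the productID field, the tokenization above can be approximated in Python. This is a sketch under a simplifying assumption: a regex that keeps runs of ASCII letters and digits and lowercases them. The real standard analyzer follows Unicode text segmentation rules, so this only reproduces its output for inputs like this one. The function name standard_like_tokens is hypothetical.

```python
import re


def standard_like_tokens(text):
    """Approximate the standard analyzer for simple ASCII input."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


tokens = standard_like_tokens("XHDK-A-1293-#fJ3")
indexed_terms = {t["token"] for t in tokens}

# A term query succeeds only if its value is exactly one of the indexed terms.
for candidate in ["XHDK-A-1293-#fJ3", "xhdk", "xhdk-a-1293-#fJ3"]:
    print(candidate, candidate in indexed_terms)
```

Only the second candidate is an indexed term, which agrees with the results noted in the term queries on productID.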

Multi-Field Mapping and Term Queries

GET products/_mapping

//res
{
  "products" : {
    "mappings" : {
      "properties" : {
        "desc" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "productID" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

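Querying the keyword sub-field gives a strict match against the original, unanalyzed value.
A term query still returns a relevance score.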

Compound Query: Using Constant Score as a Filter

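Converting the query to a filter skips the relevance calculation (BM25 is the default similarity in 7.x) and avoids its overhead.
Filters can make effective use of caching.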

POST /products/_search
{
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

//res
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[products][0]",
        "_node" : "BsfHcVuGT8-7CROZ1odZUg",
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "productID" : "XHDK-A-1293-#fJ3",
          "desc" : "iPhone"
        },
        "_explanation" : {
          "value" : 1.0,
          "description" : "ConstantScore(productID.keyword:XHDK-A-1293-#fJ3)",
          "details" : [ ]
        }
      }
    ]
  }
}
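
When requests are built in application code, the same wrapping can be done with a small helper. The sketch below only constructs the request body as a Python dict; the helper name as_constant_score is hypothetical, and sending the body to a cluster is left out. The filter and boost keys are the parameters of the constant_score query.

```python
import json


def as_constant_score(query, boost=1.0):
    """Wrap a query clause so it runs in filter context with a fixed score."""
    return {"query": {"constant_score": {"filter": query, "boost": boost}}}


body = as_constant_score({"term": {"productID.keyword": "XHDK-A-1293-#fJ3"}})
print(json.dumps(body, indent=2))
```

Every matching document receives the boost value as its score, which is why the response above reports 1.0.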

Full-Text Queries

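Full-text search
- Match Query / Match Phrase Query / Query String Query
Characteristics
- Text is analyzed at both index time and search time. The query string is first passed to an appropriate analyzer, which produces the list of terms to search for.
- At query time the input is analyzed first, a low-level query is run for each term, and the results are merged. A score is computed for each document.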

For example, searching for "Matrix reload" returns every document that contains either matrix or reload.
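
The analyze, look up, merge sequence can be sketched in Python. The assumptions are loud ones: analyze is a regex approximation of the standard analyzer, the score is simply the number of matching query terms (Elasticsearch uses BM25, not a count), and the titles are hypothetical sample data.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def match_query(docs, query):
    """Analyze the query, check each term per document, merge hits with OR."""
    terms = analyze(query)
    hits = {}
    for doc_id, title in docs.items():
        doc_terms = set(analyze(title))
        matched = [t for t in terms if t in doc_terms]
        if matched:  # OR semantics: one matching term is enough
            hits[doc_id] = len(matched)  # toy score, not BM25
    # Highest toy score first, ties broken by document id.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


titles = {  # hypothetical sample data
    1: "Matrix, The",
    2: "Matrix Reloaded, The",
    3: "Reload Matrix",
    4: "Toy Story",
}
result = match_query(titles, "Matrix reload")
print(result)
```

In this sketch "Matrix Reloaded, The" matches only through matrix: without stemming, reloaded and reload are different terms.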

Match Query Result

POST /movies/_search
{
  "profile": "true",
  "query": {
    "match": {
      "title": {
        "query": "Matrix reload" // terms are combined with OR by default
      }
    }
  }
}

//res
"hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "2571",
        "_score" : 9.095142, // the relevance score is returned
        "_source" : {
          "genre" : [
            "Action",
            "Sci-Fi",
            "Thriller"
          ],
          "title" : "Matrix, The",
          "year" : 1999,
          "@version" : "1",
          "id" : "2571"
        }
      }
    ]

Operator

POST /movies/_search
{
  "profile": "true",
  "query": {
    "match": {
      "title": {
        "query": "Matrix reload",
        "operator": "and" // every term must match
      }
    }
  }
}

//res
"profile" : {
    "shards" : [
      {
        "id" : "[QG8Co41UQGKuwzGrkvpzOA][movies][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "+title:matrix +title:reload", // both terms are required (+)
                "time_in_nanos" : 2900408,

Minimum_should_match

POST /movies/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": {
        "query": "Matrix reload",
        "minimum_should_match": 2
      }
    }
  }
}
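
How operator and minimum_should_match narrow the result set can be sketched in Python. This is a simplified model, assuming a regex stand-in for the standard analyzer and an integer-only minimum_should_match; the function name matches and the sample titles are hypothetical.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def matches(title, query, operator="or", minimum_should_match=1):
    """Decide whether a title satisfies a match query under the given settings."""
    terms = analyze(query)
    doc_terms = set(analyze(title))
    matched = sum(1 for t in terms if t in doc_terms)
    # "and" requires every query term; otherwise at least N terms must match.
    required = len(terms) if operator == "and" else minimum_should_match
    return matched >= required


print(matches("Matrix, The", "Matrix reload"))                          # default OR
print(matches("Matrix, The", "Matrix reload", operator="and"))          # all terms required
print(matches("Matrix, The", "Matrix reload", minimum_should_match=2))  # 2 of 2 required
print(matches("Reload Matrix", "Matrix reload", operator="and"))
```

For a two-term query, minimum_should_match set to 2 selects the same documents as operator set to and, even though the two requests are profiled as different Boolean queries.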

//res
"profile" : {
    "shards" : [
      {
        "id" : "[BsfHcVuGT8-7CROZ1odZUg][movies][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "(title:matrix title:reload)~2",
                "time_in_nanos" : 5050509,

Match Phrase Query

POST /movies/_search
{
  "profile": "true", 
  "query": {
    "match_phrase": {
      "title": {
        "query": "Matrix reload",
        "slop": 1
      }
    }
  }
}
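
The role of slop can be sketched in Python. This is a simplified model of sloppy phrase matching, not Lucene's actual implementation: each matched position is shifted back by the term's offset in the phrase, and the spread of the shifted values is taken as the slop required. The analyzer is a regex stand-in and the sample titles are hypothetical.

```python
import re
from itertools import product


def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def match_phrase(text, phrase, slop=0):
    """Return True if the phrase occurs in text within the allowed slop."""
    positions = {}
    for pos, tok in enumerate(analyze(text)):
        positions.setdefault(tok, []).append(pos)
    query_terms = analyze(phrase)
    if not query_terms or any(t not in positions for t in query_terms):
        return False
    # Try every way of picking one occurrence per query term.
    for combo in product(*(positions[t] for t in query_terms)):
        if len(set(combo)) < len(combo):
            continue  # the same token cannot serve two query terms
        # An exact phrase makes all shifted positions equal (spread 0).
        shifted = [p - i for i, p in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


print(match_phrase("The Matrix Reload", "Matrix reload"))          # adjacent, in order
print(match_phrase("Matrix the reload", "Matrix reload"))          # one word in between
print(match_phrase("Matrix the reload", "Matrix reload", slop=1))  # allowed by slop 1
print(match_phrase("Reload Matrix", "Matrix reload", slop=2))      # swapped order costs 2
```

Under this model one intervening word costs a slop of 1, while two swapped adjacent words cost 2.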

//res
"profile" : {
    "shards" : [
      {
        "id" : "[BsfHcVuGT8-7CROZ1odZUg][movies][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "PhraseQuery",
                "description" : """title:"matrix reload"~1""",

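How a Match Query Is Processed

Full-text search
- Match Query / Match Phrase Query / Query String Query
Characteristics of full-text queries
- Text is analyzed at both index time and search time. The query string is first passed to an appropriate analyzer, which produces the list of terms to search for.
- A low-level query is run for each term and the results are merged. A score is computed for each document.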

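Section Recap

Term-based search vs. full-text search
Controlling analysis through the field mapping
- "Text" vs "Keyword"
Controlling precision and recall through query parameters
Compound queries: the Constant Score query
- Even a term query on a keyword field is scored
- The query can be converted to a filter, which drops the relevance scoring step and improves performance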

Course Demo

DELETE products
PUT products
{
  "settings": {
    "number_of_shards": 1
  }
}


POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

GET /products

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        //"value": "iPhone"
        "value":"iphone"
      }
    }
  }
}

POST /products/_search
{
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone"
        //"value":"iphone"
      }
    }
  }
}


POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

POST /products/_search
{
  //"explain": true,
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}




POST /products/_search
{
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }

    }
  }
}


# Set position_increment_gap
DELETE groups
PUT groups
{
  "mappings": {
    "properties": {
      "names":{
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}

GET groups/_mapping

POST groups/_doc
{
  "names": [ "John Water", "Water Smith"]
}

POST groups/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "Water Water",
        "slop": 100
      }
    }
  }
}


POST groups/_search
{
  "query": {
    "match_phrase": {
      "names": "Water Smith"
    }
  }
}
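
What position_increment_gap changes can be sketched in Python by assigning token positions across the values of a multi-valued field. This is a simplified model: the analyzer is a regex stand-in, and index_positions is a hypothetical helper. The default gap of 100 for text fields comes from the Elasticsearch mapping documentation.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def index_positions(values, position_increment_gap=100):
    """Assign token positions across the values of a multi-valued text field."""
    positions = []
    position = -1
    for i, value in enumerate(values):
        if i > 0:
            # The gap is added between consecutive values of the array.
            position += position_increment_gap
        for tok in analyze(value):
            position += 1
            positions.append((tok, position))
    return positions


names = ["John Water", "Water Smith"]
print(index_positions(names, position_increment_gap=0))
print(index_positions(names))  # default gap of 100
```

With a gap of 0 the two water tokens sit at adjacent positions, so a phrase can match across the boundary between array values. With the default gap the values are far apart, and a phrase spanning both would need a large slop.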

Further Reading

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/term-level-queries.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.1/full-text-queries.html
