美文网首页Elaticsearch进阶ElasticTalk
Elasticsearch如何实现 SQL语句中 Group B

Elasticsearch如何实现 SQL语句中 Group B

作者: rockybean | 来源:发表于2018-05-21 08:13 被阅读270次

    有 SQL 背景的同学在学习 Elasticsearch 时,面对一个查询需求,不由自主地会先思考如何用 SQL 来实现,然后再去想 Elasticsearch 的 Query DSL 如何实现。那么本篇就给大家讲一条常见的 SQL 语句如何用 Elasticsearch 的查询语言实现。

    1. SQL语句

    假设我们有一个汽车的数据集,每个汽车都有车型、颜色等字段,我希望获取颜色种类大于1个的前2车型。假设汽车的数据模型如下:

    {
        "model":"modelA",
        "color":"red"
    }
    

    假设我们有一个 cars 表,通过如下语句创建测试数据。

    INSERT INTO cars (model,color) VALUES ('A','red'); 
    INSERT INTO cars (model,color) VALUES ('A','white'); 
    INSERT INTO cars (model,color) VALUES ('A','black'); 
    INSERT INTO cars (model,color) VALUES ('A','yellow'); 
    INSERT INTO cars (model,color) VALUES ('B','red'); 
    INSERT INTO cars (model,color) VALUES ('B','white'); 
    INSERT INTO cars (model,color) VALUES ('C','black'); 
    INSERT INTO cars (model,color) VALUES ('C','red'); 
    INSERT INTO cars (model,color) VALUES ('C','white'); 
    INSERT INTO cars (model,color) VALUES ('C','yellow'); 
    INSERT INTO cars (model,color) VALUES ('C','blue'); 
    INSERT INTO cars (model,color) VALUES ('D','red');
    INSERT INTO cars (model,color) VALUES ('A','red'); 
    

    那么实现我们需求的 SQL 语句也比较简单,实现如下:

    SELECT model,COUNT(DISTINCT color) color_count FROM cars GROUP BY model HAVING color_count > 1 ORDER BY color_count desc LIMIT 2;
    

    这条查询语句中 Group By 是按照 model 做分组, Having color_count>1 限定了车型颜色种类大于1,**ORDER BY color_count desc ** 限定结果按照颜色种类倒序排列,而 LIMIT 2 限定只返回前3条数据。

    那么在 Elasticsearch 中如何实现这个需求呢?

    2. 在 Elasticsearch 模拟测试数据

    首先我们需要先在 elasticsearch 中插入测试的数据,这里我们使用 bulk 接口 ,如下所示:

    POST _bulk
    {"index":{"_index":"cars","_type":"doc","_id":"1"}}
    {"model":"A","color":"red"}
    {"index":{"_index":"cars","_type":"doc","_id":"2"}}
    {"model":"A","color":"white"}
    {"index":{"_index":"cars","_type":"doc","_id":"3"}}
    {"model":"A","color":"black"}
    {"index":{"_index":"cars","_type":"doc","_id":"4"}}
    {"model":"A","color":"yellow"}
    {"index":{"_index":"cars","_type":"doc","_id":"5"}}
    {"model":"B","color":"red"}
    {"index":{"_index":"cars","_type":"doc","_id":"6"}}
    {"model":"B","color":"white"}
    {"index":{"_index":"cars","_type":"doc","_id":"7"}}
    {"model":"C","color":"black"}
    {"index":{"_index":"cars","_type":"doc","_id":"8"}}
    {"model":"C","color":"red"}
    {"index":{"_index":"cars","_type":"doc","_id":"9"}}
    {"model":"C","color":"white"}
    {"index":{"_index":"cars","_type":"doc","_id":"10"}}
    {"model":"C","color":"yellow"}
    {"index":{"_index":"cars","_type":"doc","_id":"11"}}
    {"model":"C","color":"blue"}
    {"index":{"_index":"cars","_type":"doc","_id":"12"}}
    {"model":"D","color":"red"}
    {"index":{"_index":"cars","_type":"doc","_id":"13"}}
    {"model":"A","color":"red"}
    

    其中 index 为 cars,type 为 doc,所有数据与mysql 数据保持一致。大家可以在 Kibana 的 Dev Tools 中执行上面的命令,然后执行下面的查询语句验证数据是否已经成功存入。

    GET cars/_search
    

    3. Group By VS Terms/Metric Aggregation

    SQL 中 Group By 语句在 Elasticsearch 中对应的是 Terms Aggregation,即分桶聚合,对应 Group By color 的语句如下所示:

    GET cars/_search
    {
      "size":0,
      "aggs":{
        "models":{
          "terms":{
            "field":"model.keyword"
          }
        }
      }
    }
    

    结果如下:

    {
      "took": 161,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 13,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "models": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "A",
              "doc_count": 5
            },
            {
              "key": "C",
              "doc_count": 5
            },
            {
              "key": "B",
              "doc_count": 2
            },
            {
              "key": "D",
              "doc_count": 1
            }
          ]
        }
      }
    }
    

    我们看 aggregations 这个 key 下面的即为返回结果。

    SQL 语句中还有一项是 COUNT(DISTINCT color) color_count 用于计算每个 model 的颜色数,在 Elasticsearch 中我们需要使用一个指标类聚合 Cardinality ,进行不同值计数。语句如下:

    GET cars/_search
    {
      "size": 0,
      "aggs": {
        "models": {
          "terms": {
            "field": "model.keyword"
          },
          "aggs": {
            "color_count": {
              "cardinality": {
                "field": "color.keyword"
              }
            }
          }
        }
      }
    }
    

    其返回结果如下:

    {
      "took": 74,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 13,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "models": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "A",
              "doc_count": 5,
              "color_count": {
                "value": 4
              }
            },
            {
              "key": "C",
              "doc_count": 5,
              "color_count": {
                "value": 5
              }
            },
            {
              "key": "B",
              "doc_count": 2,
              "color_count": {
                "value": 2
              }
            },
            {
              "key": "D",
              "doc_count": 1,
              "color_count": {
                "value": 1
              }
            }
          ]
        }
      }
    }
    

    结果中 color_count 即为每个 model 的颜色数,但这里所有的模型都返回了,我们只想要颜色数大于1的模型,因此这里还要加一个过滤条件。

    4. Having Condition VS Bucket Filter Aggregation

    Having color_count > 1 在 Elasticsearch 中对应的是 Bucket Filter 聚合,语句如下所示:

    GET cars/_search
    {
      "size": 0,
      "aggs": {
        "models": {
          "terms": {
            "field": "model.keyword"
          },
          "aggs": {
            "color_count": {
              "cardinality": {
                "field": "color.keyword"
              }
            },
            "color_count_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "colorCount": "color_count"
                },
                "script": "params.colorCount>1"
              }
            }
          }
        }
      }
    }
    

    返回结果如下:

    {
      "took": 39,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 13,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "models": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "A",
              "doc_count": 5,
              "color_count": {
                "value": 4
              }
            },
            {
              "key": "C",
              "doc_count": 5,
              "color_count": {
                "value": 5
              }
            },
            {
              "key": "B",
              "doc_count": 2,
              "color_count": {
                "value": 2
              }
            }
          ]
        }
      }
    }
    

    此时返回结果只包含颜色数大于1的模型,但大家会发现颜色数多的 C 不是在第一个位置,我们还需要做排序处理。

    5. Order By Limit VS Bucket Sort Aggregation

    ORDER BY color_count desc LIMIT 3 在 Elasticsearch 中可以使用 Bucket Sort 聚合实现,语句如下所示:

    GET cars/_search
    {
      "size": 0,
      "aggs": {
        "models": {
          "terms": {
            "field": "model.keyword"
          },
          "aggs": {
            "color_count": {
              "cardinality": {
                "field": "color.keyword"
              }
            },
            "color_count_filter": {
              "bucket_selector": {
                "buckets_path": {
                  "colorCount": "color_count"
                },
                "script": "params.colorCount>1"
              }
            },
            "color_count_sort": {
              "bucket_sort": {
                "sort": {
                  "color_count": "desc"
                },
                "size": 2
              }
            }
          }
        }
      }
    }
    

    返回结果如下:

    {
      "took": 32,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 13,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "models": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "C",
              "doc_count": 5,
              "color_count": {
                "value": 5
              }
            },
            {
              "key": "A",
              "doc_count": 5,
              "color_count": {
                "value": 4
              }
            }
          ]
        }
      }
    }
    

    至此我们便将 SQL 语句实现的功能用 Elasticsearch 查询语句实现了。对比 SQL 语句与 Elasticsearch 的查询语句,大家会发现后者复杂了很多,但并非无章可循,随着大家对常见语法越来越熟悉,相信一定会越写越得心应手!

    相关文章

      网友评论

        本文标题:Elasticsearch如何实现 SQL语句中 Group B

        本文链接:https://www.haomeiwen.com/subject/nrnxjftx.html