6.4 - How Aggregations Work and Their Accuracy Problems

Author: 落日彼岸 | Published: 2020-04-08 00:41

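    Approximate statistical algorithms in distributed systems

    [Figure: approximate statistical algorithms in distributed systems]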

    Execution flow of a Min aggregation

    [Figure: execution flow of a Min aggregation]
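Min is a case where the distributed result stays exact: each shard can compute its own local minimum, and the coordinating node only has to take the minimum of those partial results. The following is a minimal Python sketch of that flow, not Elasticsearch code; the shard contents and the function names `shard_min` and `coordinate_min` are made up for illustration.

```python
def shard_min(values):
    """Local step: each shard reduces its own documents to a single minimum."""
    return min(values) if values else None


def coordinate_min(shards):
    """Coordinating step: merge the per-shard minima into the global minimum."""
    partials = [m for m in (shard_min(values) for values in shards) if m is not None]
    return min(partials) if partials else None


# Hypothetical field values held by three shards (one shard is empty).
shards = [[5, 3, 9], [7, 2], []]

global_min = coordinate_min(shards)
print(global_min)  # 2, the same as min() over all values together
```

Because the minimum of the minima equals the minimum of the whole data set, no accuracy is lost here, unlike the Terms aggregation discussed next.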

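    Return values of a Terms Aggregation

    [Figure: Terms Aggregation response]
    • A Terms Aggregation response contains two special values:

    • doc_count_error_upper_bound: an upper bound on the number of documents that could belong to a term bucket omitted from the result

    • sum_other_doc_count: the total number of documents in all terms other than those in the returned buckets (the total count minus the count returned)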

    GET kibana_sample_data_flights/_search
    {
      "size": 0,
      "aggs": {
        "weather": {
          "terms": {
            "field":"OriginWeather",
            "size":5,
            "show_term_doc_count_error":true
          }
        }
      }
    }
    
    Response:
    {
      "took" : 33,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 10000,
          "relation" : "gte"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "weather" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 2932,
          "buckets" : [
            {
              "key" : "Clear",
              "doc_count" : 2324,
              "doc_count_error_upper_bound" : 0
            },
            {
              "key" : "Cloudy",
              "doc_count" : 2319,
              "doc_count_error_upper_bound" : 0
            },
            {
              "key" : "Rain",
              "doc_count" : 2214,
              "doc_count_error_upper_bound" : 0
            },
            {
              "key" : "Sunny",
              "doc_count" : 2209,
              "doc_count_error_upper_bound" : 0
            },
            {
              "key" : "Thunder & Lightning",
              "doc_count" : 1061,
              "doc_count_error_upper_bound" : 0
            }
          ]
        }
      }
    }
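The response above reports `hits.total` only as 10000 with relation `gte`, so the number of documents the aggregation counted is easier to read from the buckets themselves. A small Python check of the arithmetic, using the numbers copied from that response and assuming every document carries exactly one `OriginWeather` value:

```python
# Bucket counts copied from the response above.
buckets = {
    "Clear": 2324,
    "Cloudy": 2319,
    "Rain": 2214,
    "Sunny": 2209,
    "Thunder & Lightning": 1061,
}
sum_other_doc_count = 2932

# Documents covered by the five returned buckets.
returned = sum(buckets.values())

# Returned buckets plus everything else gives the number of documents counted.
total_counted = returned + sum_other_doc_count

print(returned)       # 10127
print(total_counted)  # 13059
```

Put differently, `sum_other_doc_count` is what remains after subtracting the returned buckets from the documents the aggregation counted.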
    
    
    

    Execution flow of a Terms aggregation

    [Figure: execution flow of a Terms aggregation]

    An example of incorrect Terms results

    [Figure: example of incorrect Terms results]

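    Fixing inaccurate Terms results: raise the shard_size parameter

    • Why Terms aggregations are inaccurate: the data is spread across multiple shards, so the Coordinating Node never sees the full picture

    • Solution 1: when the data volume is small, set the number of Primary Shards to 1, which makes the result exact

    • Solution 2: on distributed data, set the shard_size parameter to improve accuracy

      • How it works: each shard returns more candidate terms than the final size, which improves accuracy
    [Figure: shard_size example]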
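The effect of `shard_size` can be reproduced with a small simulation. The Python sketch below is a simplified model, not Elasticsearch's implementation: the two shards and their term counts are invented, the helper names `top_terms` and `terms_aggregation` are hypothetical, and the error bound is modelled as the sum, over every shard that truncated its list, of the count of the last term that shard returned.

```python
from collections import Counter


def top_terms(term_counts, n):
    """Return the n most frequent (term, count) pairs, ties broken by term."""
    ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def terms_aggregation(shards, size, shard_size):
    """Simplified model: each shard returns its top shard_size terms and the
    coordinating node merges them and keeps the top size buckets."""
    merged = Counter()
    error_upper_bound = 0
    for term_counts in shards:
        returned = top_terms(term_counts, shard_size)
        for term, count in returned:
            merged[term] += count
        # A term this shard left out can have at most the count of the
        # last term it did return.
        if returned and len(term_counts) > len(returned):
            error_upper_bound += returned[-1][1]
    buckets = top_terms(merged, size)
    total_docs = sum(sum(tc.values()) for tc in shards)
    sum_other = total_docs - sum(count for _, count in buckets)
    return buckets, error_upper_bound, sum_other


# Invented per-shard term counts.
shards = [
    {"A": 6, "B": 4, "C": 4, "D": 3},
    {"A": 6, "B": 2, "C": 1, "D": 3},
]

# Ground truth computed with full knowledge of both shards.
exact = top_terms(sum((Counter(s) for s in shards), Counter()), 3)

print(exact)                            # [('A', 12), ('B', 6), ('D', 6)]
print(terms_aggregation(shards, 3, 3))  # ([('A', 12), ('B', 6), ('C', 4)], 6, 7)
print(terms_aggregation(shards, 3, 4))  # ([('A', 12), ('B', 6), ('D', 6)], 0, 5)
```

With `shard_size` equal to `size`, term D is dropped by the first shard and C is wrongly ranked third; asking each shard for one more term restores the exact answer and brings the error bound to 0. When `shard_size` is not set, Elasticsearch defaults it to `size * 1.5 + 10`.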

    Enabling show_term_doc_count_error

    POST my_flights/_search
    {
      "size": 0,
      "aggs": {
        "weather": {
          "terms": {
            "field":"OriginWeather",
            "size":1,
            "shard_size":1,
            "show_term_doc_count_error":true
          }
        }
      }
    }
    

    Setting shard_size

    Course demo

    # Recreate my_flights with 20 primary shards so the data is spread across shards
    DELETE my_flights
    PUT my_flights
    {
      "settings": {
        "number_of_shards": 20
      },
      "mappings" : {
          "properties" : {
            "AvgTicketPrice" : {
              "type" : "float"
            },
            "Cancelled" : {
              "type" : "boolean"
            },
            "Carrier" : {
              "type" : "keyword"
            },
            "Dest" : {
              "type" : "keyword"
            },
            "DestAirportID" : {
              "type" : "keyword"
            },
            "DestCityName" : {
              "type" : "keyword"
            },
            "DestCountry" : {
              "type" : "keyword"
            },
            "DestLocation" : {
              "type" : "geo_point"
            },
            "DestRegion" : {
              "type" : "keyword"
            },
            "DestWeather" : {
              "type" : "keyword"
            },
            "DistanceKilometers" : {
              "type" : "float"
            },
            "DistanceMiles" : {
              "type" : "float"
            },
            "FlightDelay" : {
              "type" : "boolean"
            },
            "FlightDelayMin" : {
              "type" : "integer"
            },
            "FlightDelayType" : {
              "type" : "keyword"
            },
            "FlightNum" : {
              "type" : "keyword"
            },
            "FlightTimeHour" : {
              "type" : "keyword"
            },
            "FlightTimeMin" : {
              "type" : "float"
            },
            "Origin" : {
              "type" : "keyword"
            },
            "OriginAirportID" : {
              "type" : "keyword"
            },
            "OriginCityName" : {
              "type" : "keyword"
            },
            "OriginCountry" : {
              "type" : "keyword"
            },
            "OriginLocation" : {
              "type" : "geo_point"
            },
            "OriginRegion" : {
              "type" : "keyword"
            },
            "OriginWeather" : {
              "type" : "keyword"
            },
            "dayOfWeek" : {
              "type" : "integer"
            },
            "timestamp" : {
              "type" : "date"
            }
          }
        }
    }
    
    
    # Copy the sample flight data into the 20-shard index
    POST _reindex
    {
      "source": {
        "index": "kibana_sample_data_flights"
      },
      "dest": {
        "index": "my_flights"
      }
    }
    
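    # Both indices should report the same document count once the reindex has finished
    GET kibana_sample_data_flights/_count
    GET my_flights/_count
    
    # Inspect a few raw documents
    GET kibana_sample_data_flights/_search
    
    
    # Terms aggregation on the original index (a single shard in the response shown earlier)
    GET kibana_sample_data_flights/_search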
    {
      "size": 0,
      "aggs": {
        "weather": {
          "terms": {
            "field":"OriginWeather",
            "size":5,
            "show_term_doc_count_error":true
          }
        }
      }
    }
    
    
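    # Same aggregation on the 20-shard index; with size and shard_size both set to 1,
    # doc_count_error_upper_bound may be non-zero
    GET my_flights/_search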
    {
      "size": 0,
      "aggs": {
        "weather": {
          "terms": {
            "field":"OriginWeather",
            "size":1,
            "shard_size":1,
            "show_term_doc_count_error":true
          }
        }
      }
    }
    
