Deduplicating an Elasticsearch field and saving the result to a file


Author: 幸运猪x | Published: 2018-06-11 18:05

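    Background

    The task: query the ip field from an ES cluster, deduplicate its values, and save the clean IPs to a file.
    Deduplicating on a single field is essentially a word-count problem: group the documents by the field's value and keep the distinct keys.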
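
To make the word-count framing concrete, here is a minimal sketch in plain Python, independent of Elasticsearch. The `sample_addresses` list and the `count_values` helper are hypothetical and exist only for illustration.

```python
from collections import Counter


def count_values(values):
    """Return a Counter mapping each distinct value to how often it occurs."""
    return Counter(values)


# Hypothetical sample: the same IP shows up in several documents.
sample_addresses = [
    "192.168.100.1",
    "192.168.100.2",
    "192.168.100.1",
    "192.168.100.3",
    "192.168.100.2",
]

counts = count_values(sample_addresses)
# The keys of the counter are exactly the de-duplicated values.
unique_ips = sorted(counts)
```

A terms aggregation does the same grouping on the server side: each bucket key is a distinct value and `doc_count` is its frequency.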

    1. Generate sample data with Python

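    # -*- coding: utf-8 -*-
    # Build the list of IP values: 192.168.100.1 through 192.168.100.49
    ip = []
    for i in range(1, 50):
        # str() works on Python 2 and 3; bytes(i) only yields "1", "2", ... on Python 2
        ip.append("192.168.100." + str(i))
    
    # Write the sample data to a JSON file in bulk format, 100 documents per IP
    with open("data.json", "w") as f:
        i = 1  # first document id for the current IP
        for ipp in ip:
            for j in range(i, i + 100):
                # One action line followed by one source line, each terminated by a newline
                line = '{"index":{"_index":"data","_type":"log","_id":'+str(j)+'}}\n{"color":"green","state":"open","address":"'+ipp+'","time":"2018-06-11"}\n'
                f.write(line)
            i = i + 100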
    

    Part of the sample data:

    {"index":{"_index":"data","_type":"log","_id":1}}
    {"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
    {"index":{"_index":"data","_type":"log","_id":2}}
    {"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
    {"index":{"_index":"data","_type":"log","_id":3}}
    {"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
    {"index":{"_index":"data","_type":"log","_id":4}}
    {"color":"green","state":"open","address":"192.168.100.1","time":"2018-06-11"}
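
The sample above follows the bulk format: an action line followed by a source line, each ending in a newline. As an alternative to string concatenation, the sketch below builds the same pairs with `json.dumps`, which handles quoting automatically. The helper names `bulk_lines` and `build_bulk_body` and their parameters are assumptions made for this example, not part of the original script.

```python
import json


def bulk_lines(doc_id, address, index="data", doc_type="log"):
    """Return one action line plus one source line in bulk (NDJSON) format."""
    action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
    source = {
        "color": "green",
        "state": "open",
        "address": address,
        "time": "2018-06-11",
    }
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"


def build_bulk_body(ip_count=49, docs_per_ip=100):
    """Build the whole bulk body: docs_per_ip documents for each IP."""
    parts = []
    doc_id = 1
    for n in range(1, ip_count + 1):
        address = "192.168.100." + str(n)
        for _ in range(docs_per_ip):
            parts.append(bulk_lines(doc_id, address))
            doc_id += 1
    return "".join(parts)
```

Writing `build_bulk_body()` to `data.json` should give a file equivalent to the one produced by the script in step 1, apart from whitespace inside the JSON objects.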
    

    2. Bulk-import the sample data into ES

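    # Import the data. --data-binary keeps the newlines that the bulk format requires.
    # On ES 6.0 and later, also pass: -H 'Content-Type: application/x-ndjson'
    curl -X POST localhost:9200/_bulk --data-binary @data.json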
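
For readers who would rather stay in Python, the following is a rough sketch of the same bulk request using only the standard library. The function names `make_bulk_request` and `send_bulk` are hypothetical, and the default host is assumed to be the same `localhost:9200` used in the curl command.

```python
import urllib.request


def make_bulk_request(body, host="http://localhost:9200"):
    """Build (but do not send) a POST request for the _bulk endpoint."""
    return urllib.request.Request(
        url=host + "/_bulk",
        data=body.encode("utf-8"),
        # Newer ES versions require an explicit NDJSON content type.
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


def send_bulk(body, host="http://localhost:9200"):
    """Send the bulk body and return the raw response text."""
    with urllib.request.urlopen(make_bulk_request(body, host)) as resp:
        return resp.read().decode("utf-8")
```

Calling `send_bulk(open("data.json").read())` would need a reachable cluster; `make_bulk_request` can be inspected without one.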
    

    The sample data is now in ES:

    curl -X GET localhost:9200/data/log/101
    ###
    {"_index":"data","_type":"log","_id":"101","_version":1,"found":true,"_source":{"color":"green","state":"open","address":"192.168.100.2","time":"2018-06-11"}}
    ###
    

    3. Deduplicate in ES and save the result to a file

    The result can be processed in two ways:

    • Use the jq tool to save the result as a CSV file
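    # wordcount and save results as csv
    # "field" selects the address field; "size": 0 returns all buckets on ES 2.x
    # (ES 5.0 and later reject size 0, so set an explicit upper bound there).
    # JSON has no comment syntax, so these notes stay outside the request body.
    curl -X GET 'http://localhost:9200/data/log/_search' -d '
    {
        "size": 0,
        "aggs": {
            "group_by_state": {
                "terms": {
                    "field": "address",
                    "size": 0
                }
            }
        }
    }' | jq -r '.aggregations|.group_by_state|.buckets[]|[.key, .doc_count]|@csv' >> result.csv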
    

    Sample result

    "192.168.100.1",100
    "192.168.100.10",100
    "192.168.100.11",100
    etc ...
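
If jq is not available, the same transformation can be sketched in Python. The helper `buckets_to_csv` and the trimmed `sample_response` string are hypothetical; the response is assumed to have the `aggregations.group_by_state.buckets` shape that the jq filter above relies on.

```python
import csv
import io
import json


def buckets_to_csv(response_text, agg_name="group_by_state"):
    """Turn the buckets of a terms aggregation into CSV rows of key,doc_count."""
    response = json.loads(response_text)
    buckets = response["aggregations"][agg_name]["buckets"]
    out = io.StringIO()
    # QUOTE_NONNUMERIC quotes the string keys and leaves the counts bare,
    # which matches the row format shown in the sample result.
    writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for bucket in buckets:
        writer.writerow([bucket["key"], bucket["doc_count"]])
    return out.getvalue()


# Hypothetical, heavily trimmed search response used only for illustration.
sample_response = (
    '{"aggregations":{"group_by_state":{"buckets":['
    '{"key":"192.168.100.1","doc_count":100},'
    '{"key":"192.168.100.10","doc_count":100}]}}}'
)
csv_text = buckets_to_csv(sample_response)
```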
    
    • Use a grep regular expression to parse the result
    # wordcount and save results as txt
    curl -X GET 'http://localhost:9200/data/log/_search' -d '
    {
        "size": 0,
        "aggs": {
            "group_by_state": {
                "terms": {
                    "field": "address",
                    "size": 0
                }
            }
        }
    }' | grep -Po 'key[" :]+\K[^"]+' >> result
    

    Sample result

    192.168.100.1
    192.168.100.10
    192.168.100.11
    192.168.100.12
    192.168.100.13
    192.168.100.14
    192.168.100.15
    192.168.100.16
    192.168.100.17
    192.168.100.18
    192.168.100.19
    etc ...
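
The grep pattern can be mirrored in Python as a rough equivalent. Python's `re` module has no `\K`, so this sketch uses a capture group instead; `extract_keys` and the trimmed `sample_text` are hypothetical names for illustration.

```python
import re

# Same idea as grep -Po 'key[" :]+\K[^"]+': skip the word key and the
# quotes, colons, and spaces after it, then capture up to the next quote.
KEY_PATTERN = re.compile(r'key[" :]+([^"]+)')


def extract_keys(response_text):
    """Return every bucket key found in the raw response text."""
    return KEY_PATTERN.findall(response_text)


# Hypothetical, heavily trimmed search response used only for illustration.
sample_text = (
    '{"aggregations":{"group_by_state":{"buckets":['
    '{"key":"192.168.100.1","doc_count":100},'
    '{"key":"192.168.100.10","doc_count":100}]}}}'
)
keys = extract_keys(sample_text)
```

Like the grep version, this treats the response as plain text, so it assumes the keys are quoted strings and that no other field name contains the word key. Parsing the JSON is the more robust choice when that is not guaranteed.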
    
