2-Elasticsearch Cluster Bulk Data Import


Author: 唐影若凡 | Published: 2017-03-14 10:23

Note: this is an original article; please credit the source when reposting. http://www.jianshu.com/u/e02df63eaa87

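1. Data format

We use a Person class as the sample data: Person objects that have been serialized to JSON in a file are imported into an Elasticsearch cluster.
The code for this article is available at: https://github.com/hawkingfoo/es-batch-import

1.1 Data type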
public class Person {
    private int pid;            // person id
    private int age;
    private boolean sex;
    private String name;
    private String addr;
}
1.2 File format after JSON serialization

Person.dat: on each line, the id and the JSON string are separated by \t (a tab).

0   {"pid":0,"age":41,"sex":true,"name":"Lucy","addr":"Shanghai"}
1   {"pid":1,"age":9,"sex":true,"name":"Jenny","addr":"Shenzhen"}
2   {"pid":2,"age":9,"sex":true,"name":"Lily","addr":"Tianjin"}
3   {"pid":3,"age":42,"sex":false,"name":"David","addr":"Guangzhou"}
4   {"pid":4,"age":40,"sex":true,"name":"Mary","addr":"Chongqing"}
5   {"pid":5,"age":3,"sex":true,"name":"Jenny","addr":"Guangzhou"}
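As a minimal sketch of how one such line could be produced, the helper below formats a record as `<pid>\t<json>`. The class name `PersonLine` is hypothetical and not part of the original project, and the sketch assumes that name and addr contain no characters that need JSON escaping; a real project would more likely use a JSON library.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original project.
final class PersonLine {
    // Builds one Person.dat line: "<pid>\t<json>".
    // Assumes name and addr need no JSON escaping.
    static String format(int pid, int age, boolean sex, String name, String addr) {
        String json = String.format(Locale.ROOT,
                "{\"pid\":%d,\"age\":%d,\"sex\":%b,\"name\":\"%s\",\"addr\":\"%s\"}",
                pid, age, sex, name, addr);
        return pid + "\t" + json;
    }

    public static void main(String[] args) {
        // Prints the first sample line, with a real tab between id and JSON.
        System.out.println(format(0, 41, true, "Lucy", "Shanghai"));
    }
}
```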

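2. Creating the ES index and mapping

Create an index with 5 shards and 1 replica. The ES type is infos; the index settings and the corresponding mapping are as follows: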

{
  "settings": {
    "index": {
      "creation_date": "1470300617555",
      "legacy": {
        "routing": {
          "hash": {
            "type": "org.elasticsearch.cluster.routing.DjbHashFunction"
          },
          "use_type": "false"
        }
      },
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "uuid": "mJXGBmnYS12mXBo0aGrR3Q",
      "version": {
        "created": "1070099",
        "upgraded": "2030499"
      }
    }
  },
  "mappings": {
    "infos": {
      "_timestamp": {},
      "properties": {
        "sex": {
          "type": "boolean"
        },
        "name": {
          "index": "not_analyzed",
          "type": "string"
        },
        "pid": {
          "type": "integer"
        },
        "addr": {
          "index": "not_analyzed",
          "type": "string"
        },
        "age": {
          "type": "integer"
        }
      }
    }
  }
}
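The listing above carries fields such as creation_date, uuid and version, which suggests it is the cluster's own report of an existing index rather than a body to send. Below is a sketch of a trimmed create-index body, assuming Elasticsearch 2.x mapping syntax. The class name `PersonIndexBody` is hypothetical, and leaving out the `legacy` routing block and the empty `_timestamp` entry is an assumption of this sketch, not something the article states.

```java
// Hypothetical holder class; field names and types are copied from the listing above.
final class PersonIndexBody {
    // Create-index request body, assuming Elasticsearch 2.x mapping syntax.
    // Fields that look server-generated (creation_date, uuid, version) are omitted.
    static final String JSON =
            "{\n"
          + "  \"settings\": {\n"
          + "    \"number_of_shards\": 5,\n"
          + "    \"number_of_replicas\": 1\n"
          + "  },\n"
          + "  \"mappings\": {\n"
          + "    \"infos\": {\n"
          + "      \"properties\": {\n"
          + "        \"pid\":  {\"type\": \"integer\"},\n"
          + "        \"age\":  {\"type\": \"integer\"},\n"
          + "        \"sex\":  {\"type\": \"boolean\"},\n"
          + "        \"name\": {\"type\": \"string\", \"index\": \"not_analyzed\"},\n"
          + "        \"addr\": {\"type\": \"string\", \"index\": \"not_analyzed\"}\n"
          + "      }\n"
          + "    }\n"
          + "  }\n"
          + "}\n";

    public static void main(String[] args) {
        System.out.print(JSON);
    }
}
```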

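3. Import program modules

3.1 Flowchart
[Figure: import module]

The figure above shows the overall flow of the import module. Main creates the ESClient and its BulkProcessor, reads the JSON strings from Person.dat, wraps each one in an UpdateRequest, and adds it to the BulkProcessor. Once the BulkProcessor meets one of its flush conditions, it sends the buffered requests to the ES cluster as a batch.

3.2 Creating the ESClient

Add the Maven dependency: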

<dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch</artifactId>
        <version>2.3.4</version>
</dependency>
// ESConfig
public class ESConfig {
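    private String esClusterName;    // cluster name
    private String esClusterAddress; // cluster address (host:port)
    private String esIndex;          // ES index (the "database")
    private String esType;           // ES type (the "table")
    private int batchSize;           // bulk import batch size
    private String filePath;         // path of the file to import
    private int esThreadNum;         // number of concurrent bulk requests to ES
    private String localClientIP;    // IP address of the local machine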

    public String getEsClusterName() {
        return esClusterName;
    }

    public ESConfig setEsClusterName(String esClusterName) {
        this.esClusterName = esClusterName;
        return this;
    }

    public String getEsClusterAddress() {
        return esClusterAddress;
    }

    public ESConfig setEsClusterAddress(String esClusterAddress) {
        this.esClusterAddress = esClusterAddress;
        return this;
    }

    public String getEsIndex() {
        return esIndex;
    }

    public ESConfig setEsIndex(String esIndex) {
        this.esIndex = esIndex;
        return this;
    }

    public String getEsType() {
        return esType;
    }

    public ESConfig setEsType(String esType) {
        this.esType = esType;
        return this;
    }

    public int getBatchSize() {
        return batchSize;
    }

    public ESConfig setBatchSize(int batchSize) {
        this.batchSize = batchSize;
        return this;
    }

    public String getFilePath() {
        return filePath;
    }

    public ESConfig setFilePath(String filePath) {
        this.filePath = filePath;
        return this;
    }

    public int getEsThreadNum() {
        return esThreadNum;
    }

    public ESConfig setEsThreadNum(int esThreadNum) {
        this.esThreadNum = esThreadNum;
        return this;
    }

    public String getLocalClientIP() {
        return localClientIP;
    }

    public ESConfig setLocalClientIP(String localClientIP) {
        this.localClientIP = localClientIP;
        return this;
    }
}

ESClient:

public class ESClient {
    private static final Logger logger = LogManager.getLogger(ESClient.class);

    public BulkProcessor createBulkProcessor(ESConfig esConfig) {
        String clusterName = esConfig.getEsClusterName();
        String clusterAddr = esConfig.getEsClusterAddress();

        if (clusterName == null || clusterName.isEmpty()) {
            logger.error("invalid cluster name.");
            return null;
        }
        if (clusterAddr == null || clusterAddr.isEmpty()) {
            logger.error("invalid cluster address.");
            return null;
        }
        String[] addr = clusterAddr.split(":");
        if (addr.length != 2) {
            logger.error("invalid cluster address.");
            return null;
        }
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", clusterName)
                .put("client.transport.sniff", true)   // let the client discover the other nodes
                .put("index.refresh_interval", "60s")
                .build();
        // Create the TransportClient
        TransportClient transportClient = new TransportClient.Builder()
                .settings(settings).build();

        List<InetSocketTransportAddress> addrList = new ArrayList<>();
        try {
            addrList.add(new InetSocketTransportAddress(InetAddress.getByName(addr[0]),
                    Integer.parseInt(addr[1])));
        } catch (Exception e) {
            logger.error("exception:", e);
            return null;
        }

        for (InetSocketTransportAddress address : addrList) {
            transportClient.addTransportAddress(address);
        }
        Client client = transportClient;

        // Initialize the bulk processor
        BulkProcessor bulkProcessor = BulkProcessor.builder(
                client,
                new BulkProcessor.Listener() {
                    long begin;
                    long cost;
                    int count = 0;

                    @Override
                    public void beforeBulk(long executionId, BulkRequest bulkRequest) {
                        begin = System.currentTimeMillis();
                    }

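                    @Override
                    public void afterBulk(long executionId, BulkRequest bulkRequest, BulkResponse bulkResponse) {
                        cost = (System.currentTimeMillis() - begin) / 1000;
                        count += bulkRequest.numberOfActions();
                        if (bulkResponse.hasFailures()) {
                            // Individual items can fail even though the bulk call itself returned
                            logger.error("bulk has item failures: {}", bulkResponse.buildFailureMessage());
                        } else {
                            logger.info("bulk success. total:[{}] cost:[{}s]", count, cost);
                        }
                    }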

                    @Override
                    public void afterBulk(long executionId, BulkRequest bulkRequest, Throwable throwable) {
                        logger.error("bulk update has failures, will retry:" + throwable);
                    }
                })
                .setBulkActions(esConfig.getBatchSize())                    // flush after this many actions
                .setBulkSize(new ByteSizeValue(1, ByteSizeUnit.MB))    // flush once 1 MB is buffered
                .setConcurrentRequests(esConfig.getEsThreadNum())           // number of concurrent bulk requests
                .setFlushInterval(TimeValue.timeValueSeconds(5))            // flush every 5 s
                .setBackoffPolicy(BackoffPolicy.constantBackoff(TimeValue.timeValueSeconds(1), 3)) // retry up to 3 times, 1 s apart
                .build();
        return bulkProcessor;
    }
}

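Section 3.1 mentioned that the BulkProcessor sends a batch once certain conditions are met. These conditions correspond to three of the set methods on the BulkProcessor builder above:

  • when the number of buffered requests (UpdateRequest) reaches the configured batch size (setBulkActions);
  • when the buffered data reaches 1 MB (setBulkSize);
  • when the flush interval elapses, which the code above sets to 5 seconds (setFlushInterval).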
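
The three triggers listed above can be illustrated with a small model in plain Java. This is a simplified sketch and not the Elasticsearch BulkProcessor itself: the class `FlushingBuffer` and its methods are hypothetical, a flush only counts and clears the buffer instead of sending anything, and `tick()` stands in for the periodic timer that the real processor schedules internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the three flush triggers.
final class FlushingBuffer {
    private final int maxActions;   // plays the role of setBulkActions
    private final long maxBytes;    // plays the role of setBulkSize
    private final List<String> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;

    FlushingBuffer(int maxActions, long maxBytes) {
        this.maxActions = maxActions;
        this.maxBytes = maxBytes;
    }

    // Buffers one document, then flushes if the count or size threshold is reached.
    void add(String doc) {
        pending.add(doc);
        pendingBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        if (pending.size() >= maxActions || pendingBytes >= maxBytes) {
            flush();
        }
    }

    // Stands in for the flush-interval timer: flushes whatever is pending.
    void tick() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    // A real implementation would send the batch here; the model only counts.
    private void flush() {
        flushCount++;
        pending.clear();
        pendingBytes = 0;
    }

    int flushCount() {
        return flushCount;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

With `new FlushingBuffer(3, 1024 * 1024)`, the third `add` triggers a flush by count, and a later single `add` stays buffered until `tick()` is called.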
3.3 Reading the file and building UpdateRequests

ESImporter:

public class ESImporter {
    private static final Logger logger = LogManager.getLogger(ESImporter.class);
    
    public void importer(ESConfig esConfig) {

        File file = new File(esConfig.getFilePath());
        BufferedReader reader = null;
        // Create the BulkProcessor
        BulkProcessor bulkProcessor = new ESClient().createBulkProcessor(esConfig);
        if (bulkProcessor == null) {
            logger.error("create bulk processor failed.");
            return;
        }
        UpdateRequest updateRequest;
        String[] arrStr;
        try {
            reader = new BufferedReader(new FileReader(file));
            String tempString;
            // Read one line at a time; readLine() returns null at end of file
            while ((tempString = reader.readLine()) != null) {
                arrStr = tempString.split("\t");
                if (arrStr.length != 2) {
                    continue;
                }
                updateRequest = new UpdateRequest(esConfig.getEsIndex(), esConfig.getEsType(), arrStr[0])
                        .doc(arrStr[1]).docAsUpsert(true);
                bulkProcessor.add(updateRequest);
            }
        } catch (Exception e) {
            // The reader is closed in the finally block below
            logger.error("import failed:", e);
        } finally {
            try {
                if (reader != null) {
                    reader.close();
                }
                if (bulkProcessor != null) {
                    bulkProcessor.awaitClose(1, TimeUnit.MINUTES);
                }
            } catch (Exception e) {
                // do nothing
            }
        }
    }
}

This module reads the JSON lines from the file, wraps each one in an UpdateRequest, and adds it to the bulkProcessor.

3.4 Service startup module
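
The reading step can also be tried out without a cluster. The sketch below is one possible variant under stated assumptions: the class `PersonFileParser` is hypothetical, it uses try-with-resources so the reader is always closed, and it splits with a limit of 2 so that only the first tab separates the id from the JSON (the original code uses a plain `split("\t")`).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original project.
final class PersonFileParser {
    // Reads "<id>\t<json>" lines and returns {id, json} pairs.
    // Lines without a tab, or with an empty id, are skipped.
    static List<String[]> parse(Reader source) throws IOException {
        List<String[]> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Limit 2: everything after the first tab is treated as the JSON document.
                String[] parts = line.split("\t", 2);
                if (parts.length != 2 || parts[0].isEmpty()) {
                    continue;
                }
                records.add(parts);
            }
        }
        return records;
    }
}
```

Each returned pair could then be passed to `new UpdateRequest(index, type, pair[0]).doc(pair[1]).docAsUpsert(true)` as in the ESImporter listing above.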

ImportMain:

public class ImportMain {
    private static final Logger logger = LogManager.getLogger(ImportMain.class);

    public static void main(String[] args) {
        try {
            if (args.length < 1) {
                System.err.println("usage: <file_path>");
                System.exit(1);
            }
            ESConfig esConfig = new ESConfig()
                    .setEsClusterName("elasticsearch")
                    .setEsClusterAddress("127.0.0.1:9300")
                    .setEsIndex("person")
                    .setEsType("infos")
                    .setBatchSize(100)
                    .setFilePath(args[0])
                    .setEsThreadNum(1);
           
            long begin = System.currentTimeMillis();
            ESImporter esImporter = new ESImporter();
            esImporter.importer(esConfig);
            long cost = System.currentTimeMillis() - begin;
            logger.info("import end. cost:[{}ms]", cost);
        } catch (Exception e) {
            logger.error("exception:", e);
        }
    }
}
3.5 Code directory
[Figure: code directory]
3.6 Checking the ES cluster

After the import finishes, the imported docs are visible on the ES cluster.


docs data
