美文网首页
接入es数据到hdfs

接入es数据到hdfs

作者: paopaodaxia | 来源:发表于2017-12-29 10:34 被阅读0次

    最近接到一个需求,需要接入es日志数据到hdfs,进行分析,网上查找了一下资料,总结一下方法大致有如下几种

    1. hive本身直接支持连接es
      可直接参考链接 http://lxw1234.com/archives/2015/12/585.htm
      说一下这种方式的弊端:

      • (a)、es集群通常会为了安全考虑加入用户认证和证书认证,上述方式不支持
      • (b)、hive定义表结构的时候字段类型映射必须与es匹配,而当es文档type有字段类型变更之后,hive无法很好的识别,这就会hive报类似类型转换的错
    2. es提供了两种java api用来操作es
      es的官方api地址:https://www.elastic.co/guide/en/elasticsearch/client/index.html

      • (a)、transport接口即为TCP连接
        因为集群做了用户认证和证书认证,采用如下方式连接es,遗憾的是一直连不上
        因为时间问题,暂时没解决这个问题,希望有同学有空能帮忙解决,谢谢了

    Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available: [{#transport#-1}{3HUrRF8JQGCz_TlwhQOFiA}{10.17.2.79}{10.17.2.79:9305}]]

    Settings settings = Settings.builder()
                    .put("cluster.name", esDataToText.cluster)
                    .put("xpack.security.user", esDataToText.userPw)
                    .put("xpack.ssl.key", esDataToText.keyPath)
                    .put("xpack.ssl.certificate", esDataToText.crtPath)
                    .put("xpack.ssl.certificate_authorities", esDataToText.cacrtPath)
                    .put("xpack.security.transport.ssl.enabled", true)
                    .put("client.transport.ping_timeout", "100s")
                    .build();
            try {
                TransportClient client = new PreBuiltXPackTransportClient(settings)
                        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(esDataToText.urls), esDataToText.port));
                SearchResponse response = client.prepareSearch("ndf.dlp")
                        .setQuery(QueryBuilders.matchAllQuery())
                        .execute().actionGet();
                SearchHits resultHits = response.getHits();
                Long result_cnt = resultHits.totalHits;
                logger.info("数据量为:" + result_cnt);
            } catch (UnknownHostException e) {
                e.printStackTrace();
            }
    
    • (b)、rest接口访问es即为http接口
      这种方式以http接口的形式访问,因为es集群是采用ssl认证,所以我们先进行认证
      • (1) 将证书文件合成jks文件,es官网API是操作KeyStore
        keytool -import -v -trustcacerts -file niudingfeng.crt -keystore my_keystore.jks -keypass password -storepass password
      • (2) 用户密码验证以及https认证
            //用户密码验证
            final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
            credentialsProvider.setCredentials(AuthScope.ANY,
                    new UsernamePasswordCredentials("bigdata", "123456qwerty"));
    
            //ssl证书验证
            SSLContextBuilder sslBuilder = null;
            try {
                sslBuilder = SSLContexts.custom().loadTrustMaterial(truststore, null);
            } catch (KeyStoreException e) {
                e.printStackTrace();
            }
    

    以上为认证代码

    • (3) 连接es获取数据
      注意:http接口默认返回十条数据,如需要返回更多则需要制定from size
      因为es版本问题,无法用到官方java high level rest client,最低版本要求为5.6,故不推荐使用这种方式
    RestClient restClient = RestClient.builder(new HttpHost("testelk002.niudingfeng.com", 9205, "https"))
                    .setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
                        @Override
                        public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpClientBuilder) {
                            return httpClientBuilder.setSSLContext(sslContext).setDefaultCredentialsProvider(credentialsProvider);
                        }
                    })
                    .build();
            Response response = null;
    
    
            try {
                String method = "GET";
                String endpoint = "/ndf.dlp/_search";
                String queryStr = "{\n" +
                        "\t\t\"query\":{ \"range\": {\n" +
                        "      \t\t\t\t\t\"@timestamp\": {\n" +
                        "        \t\t\t\t\t\"gte\": \"2017-12-27\",\n" +
                        "        \t\t\t\t\t\"lte\": \"2017-12-28\"\n" +
                        "      \t\t\t\t\t\t\t}\n" +
                        "    \t\t\t\t\t\t}\n" +
                        "\t\t\t\t}\n" +
                        "}";
    //            String queryStr = "{\"query\":{\"match_all\":{}}}";
                HttpEntity entity = new NStringEntity(queryStr, ContentType.APPLICATION_JSON);
    
                response = restClient.performRequest(method,endpoint,Collections.<String, String>emptyMap(),entity);
                String res = EntityUtils.toString(response.getEntity());
                String resFile = "D:\\java\\es\\res.txt";
                File file = new File(resFile);
                if(file.exists()){
                    file.delete();
                }
                BufferedWriter bw = new BufferedWriter(new FileWriter(resFile));
                bw.write(res);
                bw.close();
                restClient.close();
    
    
                } catch (IOException e) {
                    e.printStackTrace();
                }
    
    1. 最后我们采用Python api来实现

      Python查询es也有两种方式

    • (a)、search
    res = es.search(index='index_name', 
    doc_type=’type_name’, body=es_query, request_timeout=999999,params={“search_type”:”query_and_fetch”}) 
     说明:search返回的结果为字典不是生成器,和在sense上查询返回的结果相同,信息比较全,
    如果数据量大,分页用from size控制,但是会排序,性能比较差
    
    • (b)、helps.scan
    es_client = es.Elasticsearch(
        [host],
        http_auth=(user, pswd),
        port=port,
        use_ssl=True,
        verify_certs=False,
        timeout=300)
    res = helpers.scan(es_client, index=index, query=query, scroll='1m',request_timeout=999999,preserve_order=False)
    说明:scan是对满足语句的结果进行扫描,全部返回下来,结果为一个生成器需要解析,scroll为滚屏时间参数,不会进行排序,建议使用这种方式
    

    相关文章

      网友评论

          本文标题:接入es数据到hdfs

          本文链接:https://www.haomeiwen.com/subject/kwxigxtx.html