Big Data Development, Hive Optimization Part 6: Hive on Spark

Author: 只是甲 | Published 2021-06-15 16:27

    Note:
    Hive version 2.1.1

    1. Introduction to Hive on Spark

    Hive is a data warehouse built on the Hadoop platform. Originally developed at Facebook, it has over the years become the de facto standard SQL engine on Hadoop. Compared with engines such as Impala and Shark (the predecessor of Spark SQL), Hive has a much broader user base and more complete SQL support. Hive's original execution engine was MapReduce, but constrained by MapReduce's rigid map-plus-reduce computation model and its limited use of memory, its performance is hard to improve further.

    In 2013, Hortonworks proposed Tez as an additional execution engine to improve Hive's performance. Spark, a distributed computing engine originally developed at UC Berkeley, drew wide attention in the Hadoop community thanks to its flexible DAG execution model, its full use of memory, and the rich semantics expressible with RDDs. After becoming an Apache top-level project, Spark added stream processing, graph computation, and machine learning, and is widely regarded as the most promising next-generation general-purpose computing framework. Against this background, the Hive community launched the Hive on Spark project (HIVE-7292) in 2014, making Spark Hive's third execution engine after MapReduce and Tez. The project was developed jointly by Cloudera, Intel, MapR, and other companies, and drew attention from both the Hive and Spark communities. At the time of the original design document, which this section follows, Hive on Spark development was largely complete; it was merged back to trunk in early January 2015 and expected to ship in the next Hive release. This article describes the design of Hive on Spark, including how Hive queries execute on Spark and how Spark is used to improve Hive's performance, as well as the project's progress and plans and preliminary performance test results.

    We propose modifying Hive to add Spark as a third execution backend (HIVE-7292), in parallel with MapReduce and Tez.

    Spark is an open-source cluster computing framework for data analytics. It departs from Hadoop's two-stage MapReduce paradigm but builds on top of HDFS. Spark's central abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (for example, HDFS files) or by transforming other RDDs. Applying a chain of transformations (such as groupBy and filter) followed by actions that Spark provides (such as count and save) lets you process and analyze data, accomplishing what a MapReduce job would, without the intermediate stages.
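    A minimal sketch of this model in Scala (the Spark RDD API): an RDD is created from an HDFS file, reshaped through lazy transformations, and only materialized when an action runs. The path and record layout are illustrative assumptions, not taken from the test environment below.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

        // An RDD created from a Hadoop InputFormat (an HDFS text file here);
        // the path is an illustrative assumption.
        val lines = sc.textFile("hdfs:///tmp/sales.txt")

        // Transformations are lazy; they only describe the computation.
        val grouped = lines
          .map(_.split(","))            // parse each record
          .filter(_.length >= 2)        // drop malformed rows
          .groupBy(fields => fields(0)) // group records by their first column

        // Actions trigger execution, with no intermediate MapReduce stages.
        println(grouped.count())
        grouped.mapValues(_.size).saveAsTextFile("hdfs:///tmp/sales_counts")

        sc.stop()
      }
    }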

    SQL queries can be translated easily into Spark transformations and actions, as Shark and Spark SQL have demonstrated. In fact, many of the primitive transformations and actions are SQL-oriented, such as join and count.
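    As a sketch of that mapping (Scala, with small in-memory collections standing in for real tables), a SQL inner join followed by count(*) corresponds directly to an RDD join transformation plus a count action:

    import org.apache.spark.{SparkConf, SparkContext}

    object SqlAsRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-as-rdd"))

        // Key/value RDDs standing in for two tables keyed by customer id.
        val orders    = sc.parallelize(Seq((1, "order-a"), (2, "order-b"), (1, "order-c")))
        val customers = sc.parallelize(Seq((1, "alice"), (2, "bob")))

        // SELECT count(*) FROM orders o JOIN customers c ON o.cust_id = c.id
        val joined = orders.join(customers) // transformation: SQL-style inner join
        println(joined.count())             // action: the SQL count(*)

        sc.stop()
      }
    }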

    1.1 Motivation for Hive on Spark

    The main motivations for running Hive on Spark are:

    Benefit to Spark users: this feature is very valuable for users who already use Spark for their other data processing and machine-learning needs. Standardizing on one execution backend is convenient for operations management and makes it easier to develop expertise to debug issues and make improvements.

    Broader adoption of Hive: following the previous point, this brings Hive, as a SQL-on-Hadoop option, to the Spark user base, further increasing Hive adoption.

    Performance: Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience as Tez does.

    The Spark execution backend is not meant to replace Tez or MapReduce. It is normal for the Hive project to have multiple backends coexist. Users can choose Tez, Spark, or MapReduce; each has different strengths depending on the use case, and Hive's success does not depend entirely on the success of either Tez or Spark.

    1.2 Design Principles

    The main design principle is to have no impact on, or restriction of, Hive's existing code paths, and therefore no impact on functionality or performance: a user who chooses to run Hive on MapReduce or Tez keeps exactly the functionality and code paths that exist today. In addition, plugging Spark in at the execution layer maximizes code sharing and keeps maintenance costs down, so the Hive community does not need to make a special investment in Spark.

    At the same time, users who choose Spark as the execution engine automatically get all of Hive's rich feature set. Features added to Hive in the future (such as new data types, UDFs, logical optimizations, and so on) should become available to those users automatically, without any customization work in Hive's Spark execution engine.

    1.3 Comparison with Shark and Spark SQL

    Two related projects in the Spark ecosystem provide Hive QL support on Spark: Shark and Spark SQL.

    The Shark project translates query plans generated by Hive into its own representation and executes them on Spark.

    Spark SQL is a feature in Spark. It uses Hive's parser as the frontend to provide Hive QL support. Spark application developers can easily express their data processing logic in SQL, alongside other Spark operators, in their code. Spark SQL supports a different use case than Hive does.
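    A minimal sketch of that use case (Scala, using the Spark 2.x SparkSession API; the database, table, and column names are illustrative assumptions):

    import org.apache.spark.sql.SparkSession

    object SparkSqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("spark-sql-sketch")
          .enableHiveSupport() // resolve Hive QL against the Hive metastore
          .getOrCreate()

        // SQL and other Spark operators compose inside one application.
        val df = spark.sql("SELECT * FROM some_db.some_table")
        println(df.filter(df("id").isNotNull).count())

        spark.stop()
      }
    }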

    Compared with Shark and Spark SQL, our design approach supports all existing Hive features, including Hive QL (and any future extensions), and Hive's integration with authorization, monitoring, auditing, and other operational tools.

    1.4 Other Considerations

    We recognize that a new execution backend is a significant undertaking. It inevitably adds complexity and maintenance cost, even though the design avoids touching existing code paths. Hive can now run its unit tests on MapReduce, Tez, and Spark. We believe the benefits outweigh the costs; from an infrastructure standpoint, we can sponsor more hardware for continuous integration.

    Finally, Hive on Tez has laid important groundwork that is very helpful for supporting a new execution engine such as Spark, and this project will certainly benefit from it. On the other hand, Spark is a framework quite different from MapReduce or Tez, so gaps and minor issues will likely be discovered during integration. The Hive community is expected to work closely with the Spark community to ensure the integration succeeds.

    2. Hive on Spark Performance Test

    Code:

    set hive.execution.engine=mr;
    select count(*) from ods_fact_sale;
    set hive.execution.engine=spark;
    select count(*) from ods_fact_sale;
    

    Test log:

    hive> 
        > set hive.execution.engine=mr;
    
    hive> select count(*) from ods_fact_sale;
    Query ID = root_20210106155340_8c89f5f6-c599-49e6-9cec-d73d278a1df6
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    21/01/06 15:53:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1609141291605_0037, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0037/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1609141291605_0037
    Hadoop job information for Stage-1: number of mappers: 117; number of reducers: 1
    2021-01-06 15:53:48,454 Stage-1 map = 0%,  reduce = 0%
    2021-01-06 15:53:57,802 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 12.98 sec
    2021-01-06 15:54:03,965 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 24.23 sec
    2021-01-06 15:54:10,130 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 30.39 sec
    2021-01-06 15:54:11,158 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 36.31 sec
    2021-01-06 15:54:17,313 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 48.68 sec
    2021-01-06 15:54:23,460 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 53.97 sec
    2021-01-06 15:54:24,492 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 59.82 sec
    2021-01-06 15:54:30,630 Stage-1 map = 10%,  reduce = 0%, Cumulative CPU 71.02 sec
    2021-01-06 15:54:36,770 Stage-1 map = 12%,  reduce = 0%, Cumulative CPU 82.71 sec
    2021-01-06 15:54:42,903 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 94.78 sec
    2021-01-06 15:54:48,025 Stage-1 map = 15%,  reduce = 0%, Cumulative CPU 99.97 sec
    2021-01-06 15:54:54,167 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 112.03 sec
    2021-01-06 15:54:56,203 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 118.08 sec
    2021-01-06 15:55:00,300 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 124.0 sec
    2021-01-06 15:55:03,370 Stage-1 map = 19%,  reduce = 0%, Cumulative CPU 130.14 sec
    2021-01-06 15:55:06,440 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 136.01 sec
    2021-01-06 15:55:10,531 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 141.94 sec
    2021-01-06 15:55:16,665 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 153.96 sec
    2021-01-06 15:55:18,711 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 159.88 sec
    2021-01-06 15:55:23,829 Stage-1 map = 24%,  reduce = 0%, Cumulative CPU 165.26 sec
    2021-01-06 15:55:24,853 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 170.8 sec
    2021-01-06 15:55:29,968 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 176.84 sec
    2021-01-06 15:55:36,112 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 188.49 sec
    2021-01-06 15:55:38,155 Stage-1 map = 28%,  reduce = 0%, Cumulative CPU 194.5 sec
    2021-01-06 15:55:42,244 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 200.54 sec
    2021-01-06 15:55:44,293 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 206.5 sec
    2021-01-06 15:55:48,381 Stage-1 map = 31%,  reduce = 0%, Cumulative CPU 212.52 sec
    2021-01-06 15:55:50,417 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 218.44 sec
    2021-01-06 15:55:57,574 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 229.58 sec
    2021-01-06 15:56:01,657 Stage-1 map = 34%,  reduce = 0%, Cumulative CPU 235.5 sec
    2021-01-06 15:56:03,697 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 241.52 sec
    2021-01-06 15:56:07,790 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 247.48 sec
    2021-01-06 15:56:09,834 Stage-1 map = 37%,  reduce = 0%, Cumulative CPU 253.49 sec
    2021-01-06 15:56:13,926 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU 259.41 sec
    2021-01-06 15:56:20,069 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 271.13 sec
    2021-01-06 15:56:23,140 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 277.05 sec
    2021-01-06 15:56:25,191 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 282.81 sec
    2021-01-06 15:56:29,272 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU 288.89 sec
    2021-01-06 15:56:31,321 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 294.76 sec
    2021-01-06 15:56:34,386 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 300.7 sec
    2021-01-06 15:56:40,529 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 312.71 sec
    2021-01-06 15:56:44,608 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU 318.59 sec
    2021-01-06 15:56:46,644 Stage-1 map = 47%,  reduce = 0%, Cumulative CPU 324.63 sec
    2021-01-06 15:56:50,726 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 330.7 sec
    2021-01-06 15:56:52,775 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 336.71 sec
    2021-01-06 15:56:57,890 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 347.89 sec
    2021-01-06 15:57:05,042 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 360.31 sec
    2021-01-06 15:57:12,199 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 372.25 sec
    2021-01-06 15:57:18,334 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 378.35 sec
    2021-01-06 15:57:19,353 Stage-1 map = 56%,  reduce = 0%, Cumulative CPU 384.4 sec
    2021-01-06 15:57:26,511 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 396.7 sec
    2021-01-06 15:57:31,623 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 408.79 sec
    2021-01-06 15:57:37,757 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 414.69 sec
    2021-01-06 15:57:38,775 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 420.57 sec
    2021-01-06 15:57:43,890 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 426.57 sec
    2021-01-06 15:57:50,023 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 438.61 sec
    2021-01-06 15:57:51,045 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 444.8 sec
    2021-01-06 15:57:57,193 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 450.01 sec
    2021-01-06 15:57:58,213 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 456.07 sec
    2021-01-06 15:58:03,320 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 462.04 sec
    2021-01-06 15:58:04,341 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 467.93 sec
    2021-01-06 15:58:10,477 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 479.64 sec
    2021-01-06 15:58:14,561 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 485.57 sec
    2021-01-06 15:58:16,601 Stage-1 map = 71%,  reduce = 0%, Cumulative CPU 491.55 sec
    2021-01-06 15:58:20,691 Stage-1 map = 72%,  reduce = 0%, Cumulative CPU 497.1 sec
    2021-01-06 15:58:23,759 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 503.28 sec
    2021-01-06 15:58:26,822 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 509.13 sec
    2021-01-06 15:58:32,955 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 520.92 sec
    2021-01-06 15:58:36,022 Stage-1 map = 76%,  reduce = 0%, Cumulative CPU 526.84 sec
    2021-01-06 15:58:39,093 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 532.84 sec
    2021-01-06 15:58:42,165 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 538.87 sec
    2021-01-06 15:58:45,234 Stage-1 map = 79%,  reduce = 0%, Cumulative CPU 544.83 sec
    2021-01-06 15:58:51,355 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 556.59 sec
    2021-01-06 15:58:56,464 Stage-1 map = 81%,  reduce = 0%, Cumulative CPU 562.72 sec
    2021-01-06 15:58:57,495 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 568.08 sec
    2021-01-06 15:59:03,628 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU 574.07 sec
    2021-01-06 15:59:09,756 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 579.96 sec
    2021-01-06 15:59:11,807 Stage-1 map = 84%,  reduce = 28%, Cumulative CPU 580.79 sec
    2021-01-06 15:59:14,876 Stage-1 map = 85%,  reduce = 28%, Cumulative CPU 586.69 sec
    2021-01-06 15:59:27,168 Stage-1 map = 86%,  reduce = 28%, Cumulative CPU 598.43 sec
    2021-01-06 15:59:30,236 Stage-1 map = 86%,  reduce = 29%, Cumulative CPU 598.54 sec
    2021-01-06 15:59:33,305 Stage-1 map = 87%,  reduce = 29%, Cumulative CPU 604.45 sec
    2021-01-06 15:59:39,442 Stage-1 map = 88%,  reduce = 29%, Cumulative CPU 610.43 sec
    2021-01-06 15:59:45,578 Stage-1 map = 89%,  reduce = 29%, Cumulative CPU 616.54 sec
    2021-01-06 15:59:47,626 Stage-1 map = 89%,  reduce = 30%, Cumulative CPU 616.59 sec
    2021-01-06 15:59:51,719 Stage-1 map = 90%,  reduce = 30%, Cumulative CPU 622.46 sec
    2021-01-06 15:59:56,826 Stage-1 map = 91%,  reduce = 30%, Cumulative CPU 628.32 sec
    2021-01-06 16:00:09,096 Stage-1 map = 92%,  reduce = 30%, Cumulative CPU 640.11 sec
    2021-01-06 16:00:12,179 Stage-1 map = 92%,  reduce = 31%, Cumulative CPU 640.19 sec
    2021-01-06 16:00:15,240 Stage-1 map = 93%,  reduce = 31%, Cumulative CPU 646.25 sec
    2021-01-06 16:00:21,368 Stage-1 map = 94%,  reduce = 31%, Cumulative CPU 652.31 sec
    2021-01-06 16:00:27,502 Stage-1 map = 95%,  reduce = 31%, Cumulative CPU 658.42 sec
    2021-01-06 16:00:30,569 Stage-1 map = 95%,  reduce = 32%, Cumulative CPU 658.47 sec
    2021-01-06 16:00:34,655 Stage-1 map = 96%,  reduce = 32%, Cumulative CPU 664.26 sec
    2021-01-06 16:00:39,774 Stage-1 map = 97%,  reduce = 32%, Cumulative CPU 670.08 sec
    2021-01-06 16:00:52,043 Stage-1 map = 98%,  reduce = 32%, Cumulative CPU 682.02 sec
    2021-01-06 16:00:54,090 Stage-1 map = 98%,  reduce = 33%, Cumulative CPU 682.07 sec
    2021-01-06 16:00:58,174 Stage-1 map = 99%,  reduce = 33%, Cumulative CPU 688.07 sec
    2021-01-06 16:01:04,310 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 693.43 sec
    2021-01-06 16:01:06,358 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 695.5 sec
    MapReduce Total cumulative CPU time: 11 minutes 35 seconds 500 msec
    Ended Job = job_1609141291605_0037
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 117  Reduce: 1   Cumulative CPU: 695.5 sec   HDFS Read: 31436910990 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 11 minutes 35 seconds 500 msec
    OK
    767830000
    Time taken: 447.145 seconds, Fetched: 1 row(s)
    hive> 
        > set hive.execution.engine=spark;
    hive> select count(*) from ods_fact_sale;
    Query ID = root_20210106160132_8d81e192-ceb7-46a3-bc60-70a5eeabce87
    Total jobs = 1
    Launching Job 1 out of 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Running with YARN Application = application_1609141291605_0038
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/yarn application -kill application_1609141291605_0038
    Hive on Spark Session Web UI URL: http://hp3:44667
    
    Query Hive on Spark job[0] stages: [0, 1]
    Spark job[0] status = RUNNING
    --------------------------------------------------------------------------------------
              STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
    --------------------------------------------------------------------------------------
    Stage-0 ........         0      FINISHED    117        117        0        0       0  
    Stage-1 ........         0      FINISHED      1          1        0        0       0  
    --------------------------------------------------------------------------------------
    STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 51.30 s    
    --------------------------------------------------------------------------------------
    Spark job[0] finished successfully in 51.30 second(s)
    Spark Job[0] Metrics: TaskDurationTime: 316285, ExecutorCpuTime: 247267, JvmGCTime: 5415, BytesRead / RecordsRead: 31436921640 / 767830000, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 6669 / 117, ShuffleBytesWritten / ShuffleRecordsWritten: 6669 / 117
    OK
    767830000
    Time taken: 71.384 seconds, Fetched: 1 row(s)
    hive> 
    

    As the test log shows, the same count(*) query dropped from 447 seconds (about 7.5 minutes) on MapReduce to 71 seconds on Spark, a substantial performance improvement.
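    The size of the gap also depends on the resources granted to the Hive on Spark session. As a sketch, these are knobs commonly tuned from within Hive; the values below are illustrative assumptions, not the settings used in the test above:

    -- Illustrative values only; size these to the cluster at hand
    set hive.execution.engine=spark;
    set spark.executor.memory=4g;    -- memory per Spark executor
    set spark.executor.cores=2;      -- cores per executor
    set spark.executor.instances=8;  -- executor count (with dynamic allocation off)
    set spark.serializer=org.apache.spark.serializer.KryoSerializer;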

    References

    1. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
    2. http://lxw1234.com/archives/2015/05/200.htm
