Note:
Hive version 2.1.1
1. Introduction to Hive on Spark
Hive is a data warehouse built on the Hadoop platform. Originally developed at Facebook, it has, after years of development, become the de facto SQL engine standard on Hadoop. Compared with engines such as Impala and Shark (the predecessor of Spark SQL), Hive has a much broader user base and more complete SQL syntax support. Hive's original compute engine was MapReduce; constrained by its Map+Reduce execution model and its limited use of memory, MapReduce performance was difficult to improve further.
In 2013, Hortonworks proposed Tez as an alternative compute engine to improve Hive's performance. Spark, a distributed compute engine originally developed at UC Berkeley, attracted wide attention in the Hadoop community thanks to its flexible DAG execution model, effective use of memory, and the rich semantics expressible with RDDs. After becoming an Apache top-level project, Spark also integrated stream processing, graph computation, and machine learning, and is widely regarded as the most promising next-generation general-purpose compute framework. Against this background, the Hive community launched the Hive on Spark project (HIVE-7292) in 2014, making Spark Hive's third compute engine after MapReduce and Tez. The project was jointly developed by Cloudera, Intel, MapR, and others, and drew attention from both the Hive and Spark communities. At the time of the original design document, feature development of Hive on Spark was essentially complete; it was merged back to trunk in early January 2015 and was expected to ship in the next Hive release. This article introduces the design and architecture of Hive on Spark, including how Hive queries are executed on Spark and how Spark is used to improve Hive's performance, along with the project's status, plans, and preliminary performance test data.
The proposal is to modify Hive to add Spark as a third execution backend (HIVE-7292), in parallel with MapReduce and Tez.
Spark is an open-source cluster computing framework for data analytics. It is built outside Hadoop's two-stage MapReduce paradigm but on top of HDFS. Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. RDDs can be processed and analyzed through a series of transformations (such as groupBy and filter) and actions that Spark provides (such as count and save), achieving the functionality of MapReduce jobs without the intermediate stages.
SQL queries can be translated into Spark transformations and actions quite easily, as demonstrated in Shark and Spark SQL. In fact, many of the primitive transformations and actions are SQL-oriented, such as join and count.
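To make this mapping concrete, here is a minimal sketch, in plain Python over ordinary lists rather than the actual Spark RDD API, of how a SQL aggregation corresponds to a filter → groupBy → count chain of transformations and actions. The sample table and column values are purely illustrative.

```python
from itertools import groupby

# Illustrative rows: (dept, salary) pairs standing in for table records.
rows = [("sales", 100), ("eng", 200), ("sales", 150), ("eng", 300), ("hr", 90)]

# SELECT dept, COUNT(*) FROM t WHERE salary > 95 GROUP BY dept
# expressed as a transformation chain mirroring RDD-style operators:
filtered = [r for r in rows if r[1] > 95]                 # filter (WHERE)
ordered = sorted(filtered, key=lambda r: r[0])            # shuffle-by-key stand-in
counts = {dept: sum(1 for _ in group)                     # groupBy + count
          for dept, group in groupby(ordered, key=lambda r: r[0])}

print(counts)  # -> {'eng': 2, 'sales': 2}
```

In real Spark the same shape would be an `rdd.filter(...).groupBy(...).count()`-style pipeline, with the sort replaced by a distributed shuffle.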
1.1 Motivation for Hive on Spark
The main motivations for running Hive on Spark are:
Benefit to Spark users: this feature is very valuable to users who are already using Spark for other data processing and machine-learning needs. Standardizing on one execution backend is convenient for operational management, and makes it easier to develop expertise to debug issues and make enhancements.
Greater Hive adoption: following the previous point, this brings Hive into the Spark user base as the SQL-on-Hadoop option, further increasing Hive's adoption.
Performance: Hive queries, especially those involving multiple reducer stages, will run faster, improving the user experience just as Tez does.
The Spark execution backend is not intended to replace Tez or MapReduce. It is normal for the Hive project to have multiple backends coexist. Users can choose Tez, Spark, or MapReduce; each has different strengths depending on the use case, and Hive's success does not depend entirely on the success of either Tez or Spark.
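In practice the choice of backend is just a configuration property: it can be switched per session with `set hive.execution.engine=...`, or set cluster-wide in hive-site.xml. A minimal sketch of the cluster-wide setting (the chosen value here is illustrative):

```xml
<!-- hive-site.xml: default execution backend for Hive queries. -->
<!-- Valid values are mr, tez, and spark (mr is deprecated in later Hive 2.x releases). -->
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
```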
1.2 Design Principles
The main design principle is to avoid affecting or restricting Hive's existing code paths, and therefore its functionality or performance. That is, users who choose to run Hive on MapReduce or Tez keep the existing functionality and code paths exactly as they are today. In addition, plugging Spark in at the execution layer maximizes code sharing and keeps maintenance costs low, so the Hive community does not need to make a dedicated investment in Spark.
At the same time, users who choose Spark as the execution engine automatically get all of the rich functionality Hive provides. Features added to Hive in the future (new data types, UDFs, logical optimizations, and so on) should become available to those users automatically, without any customization work in Hive's Spark execution engine.
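As a sketch of what this means in practice, the same HiveQL, including built-in functions or UDFs, runs unchanged whichever engine is selected; only the session setting differs. The table and column names below are hypothetical:

```sql
set hive.execution.engine=spark;
-- Identical query text as on MapReduce or Tez: Hive plans the query
-- (including upper() and the GROUP BY); only the physical execution differs.
select upper(region), count(*) from sales group by upper(region);
```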
1.3 Comparison with Shark and Spark SQL
Two related projects in the Spark ecosystem provide Hive QL support on Spark: Shark and Spark SQL.
The Shark project translates query plans generated by Hive into its own representation and executes them on Spark.
Spark SQL is a feature of Spark. It uses Hive's parser as the frontend to provide Hive QL support. Spark application developers can easily express data processing logic in SQL, along with other Spark operators, in their code. Spark SQL supports a different use case than Hive does.
Compared with Shark and Spark SQL, this design supports all existing Hive features, including Hive QL (and any future extensions), as well as Hive's integration with authorization, monitoring, auditing, and other operational tools.
1.4 Other Considerations
A new execution backend is understood to be a significant undertaking. It inevitably adds complexity and maintenance cost, even though the design avoids touching existing code paths. Hive can now run unit tests on MapReduce, Tez, and Spark. The benefits are considered to outweigh the costs, and from an infrastructure perspective, more hardware can be sponsored for continuous integration.
Finally, Hive on Tez has laid some important groundwork that is very helpful for supporting a new execution engine such as Spark, and this project will certainly benefit from it. On the other hand, Spark is a framework very different from MapReduce or Tez, so gaps and minor issues are likely to surface during integration. The Hive community is expected to work closely with the Spark community to ensure the integration succeeds.
2. Hive on Spark Performance Test
Code:
set hive.execution.engine=mr;
select count(*) from ods_fact_sale;
set hive.execution.engine=spark;
select count(*) from ods_fact_sale;
Test log:
hive>
> set hive.execution.engine=mr;
hive> select count(*) from ods_fact_sale;
Query ID = root_20210106155340_8c89f5f6-c599-49e6-9cec-d73d278a1df6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
21/01/06 15:53:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
Starting Job = job_1609141291605_0037, Tracking URL = http://hp3:8088/proxy/application_1609141291605_0037/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1609141291605_0037
Hadoop job information for Stage-1: number of mappers: 117; number of reducers: 1
2021-01-06 15:53:48,454 Stage-1 map = 0%, reduce = 0%
2021-01-06 15:53:57,802 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 12.98 sec
2021-01-06 15:54:03,965 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 24.23 sec
2021-01-06 15:54:10,130 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 30.39 sec
2021-01-06 15:54:11,158 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 36.31 sec
2021-01-06 15:54:17,313 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 48.68 sec
2021-01-06 15:54:23,460 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 53.97 sec
2021-01-06 15:54:24,492 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 59.82 sec
2021-01-06 15:54:30,630 Stage-1 map = 10%, reduce = 0%, Cumulative CPU 71.02 sec
2021-01-06 15:54:36,770 Stage-1 map = 12%, reduce = 0%, Cumulative CPU 82.71 sec
2021-01-06 15:54:42,903 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 94.78 sec
2021-01-06 15:54:48,025 Stage-1 map = 15%, reduce = 0%, Cumulative CPU 99.97 sec
2021-01-06 15:54:54,167 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 112.03 sec
2021-01-06 15:54:56,203 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 118.08 sec
2021-01-06 15:55:00,300 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 124.0 sec
2021-01-06 15:55:03,370 Stage-1 map = 19%, reduce = 0%, Cumulative CPU 130.14 sec
2021-01-06 15:55:06,440 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 136.01 sec
2021-01-06 15:55:10,531 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 141.94 sec
2021-01-06 15:55:16,665 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 153.96 sec
2021-01-06 15:55:18,711 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 159.88 sec
2021-01-06 15:55:23,829 Stage-1 map = 24%, reduce = 0%, Cumulative CPU 165.26 sec
2021-01-06 15:55:24,853 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 170.8 sec
2021-01-06 15:55:29,968 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 176.84 sec
2021-01-06 15:55:36,112 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 188.49 sec
2021-01-06 15:55:38,155 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 194.5 sec
2021-01-06 15:55:42,244 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 200.54 sec
2021-01-06 15:55:44,293 Stage-1 map = 30%, reduce = 0%, Cumulative CPU 206.5 sec
2021-01-06 15:55:48,381 Stage-1 map = 31%, reduce = 0%, Cumulative CPU 212.52 sec
2021-01-06 15:55:50,417 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 218.44 sec
2021-01-06 15:55:57,574 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 229.58 sec
2021-01-06 15:56:01,657 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 235.5 sec
2021-01-06 15:56:03,697 Stage-1 map = 35%, reduce = 0%, Cumulative CPU 241.52 sec
2021-01-06 15:56:07,790 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 247.48 sec
2021-01-06 15:56:09,834 Stage-1 map = 37%, reduce = 0%, Cumulative CPU 253.49 sec
2021-01-06 15:56:13,926 Stage-1 map = 38%, reduce = 0%, Cumulative CPU 259.41 sec
2021-01-06 15:56:20,069 Stage-1 map = 39%, reduce = 0%, Cumulative CPU 271.13 sec
2021-01-06 15:56:23,140 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 277.05 sec
2021-01-06 15:56:25,191 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 282.81 sec
2021-01-06 15:56:29,272 Stage-1 map = 42%, reduce = 0%, Cumulative CPU 288.89 sec
2021-01-06 15:56:31,321 Stage-1 map = 43%, reduce = 0%, Cumulative CPU 294.76 sec
2021-01-06 15:56:34,386 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 300.7 sec
2021-01-06 15:56:40,529 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 312.71 sec
2021-01-06 15:56:44,608 Stage-1 map = 46%, reduce = 0%, Cumulative CPU 318.59 sec
2021-01-06 15:56:46,644 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 324.63 sec
2021-01-06 15:56:50,726 Stage-1 map = 48%, reduce = 0%, Cumulative CPU 330.7 sec
2021-01-06 15:56:52,775 Stage-1 map = 49%, reduce = 0%, Cumulative CPU 336.71 sec
2021-01-06 15:56:57,890 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 347.89 sec
2021-01-06 15:57:05,042 Stage-1 map = 52%, reduce = 0%, Cumulative CPU 360.31 sec
2021-01-06 15:57:12,199 Stage-1 map = 54%, reduce = 0%, Cumulative CPU 372.25 sec
2021-01-06 15:57:18,334 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 378.35 sec
2021-01-06 15:57:19,353 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 384.4 sec
2021-01-06 15:57:26,511 Stage-1 map = 57%, reduce = 0%, Cumulative CPU 396.7 sec
2021-01-06 15:57:31,623 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 408.79 sec
2021-01-06 15:57:37,757 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 414.69 sec
2021-01-06 15:57:38,775 Stage-1 map = 61%, reduce = 0%, Cumulative CPU 420.57 sec
2021-01-06 15:57:43,890 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 426.57 sec
2021-01-06 15:57:50,023 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 438.61 sec
2021-01-06 15:57:51,045 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 444.8 sec
2021-01-06 15:57:57,193 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 450.01 sec
2021-01-06 15:57:58,213 Stage-1 map = 66%, reduce = 0%, Cumulative CPU 456.07 sec
2021-01-06 15:58:03,320 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 462.04 sec
2021-01-06 15:58:04,341 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 467.93 sec
2021-01-06 15:58:10,477 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 479.64 sec
2021-01-06 15:58:14,561 Stage-1 map = 70%, reduce = 0%, Cumulative CPU 485.57 sec
2021-01-06 15:58:16,601 Stage-1 map = 71%, reduce = 0%, Cumulative CPU 491.55 sec
2021-01-06 15:58:20,691 Stage-1 map = 72%, reduce = 0%, Cumulative CPU 497.1 sec
2021-01-06 15:58:23,759 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 503.28 sec
2021-01-06 15:58:26,822 Stage-1 map = 74%, reduce = 0%, Cumulative CPU 509.13 sec
2021-01-06 15:58:32,955 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 520.92 sec
2021-01-06 15:58:36,022 Stage-1 map = 76%, reduce = 0%, Cumulative CPU 526.84 sec
2021-01-06 15:58:39,093 Stage-1 map = 77%, reduce = 0%, Cumulative CPU 532.84 sec
2021-01-06 15:58:42,165 Stage-1 map = 78%, reduce = 0%, Cumulative CPU 538.87 sec
2021-01-06 15:58:45,234 Stage-1 map = 79%, reduce = 0%, Cumulative CPU 544.83 sec
2021-01-06 15:58:51,355 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 556.59 sec
2021-01-06 15:58:56,464 Stage-1 map = 81%, reduce = 0%, Cumulative CPU 562.72 sec
2021-01-06 15:58:57,495 Stage-1 map = 82%, reduce = 0%, Cumulative CPU 568.08 sec
2021-01-06 15:59:03,628 Stage-1 map = 83%, reduce = 0%, Cumulative CPU 574.07 sec
2021-01-06 15:59:09,756 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 579.96 sec
2021-01-06 15:59:11,807 Stage-1 map = 84%, reduce = 28%, Cumulative CPU 580.79 sec
2021-01-06 15:59:14,876 Stage-1 map = 85%, reduce = 28%, Cumulative CPU 586.69 sec
2021-01-06 15:59:27,168 Stage-1 map = 86%, reduce = 28%, Cumulative CPU 598.43 sec
2021-01-06 15:59:30,236 Stage-1 map = 86%, reduce = 29%, Cumulative CPU 598.54 sec
2021-01-06 15:59:33,305 Stage-1 map = 87%, reduce = 29%, Cumulative CPU 604.45 sec
2021-01-06 15:59:39,442 Stage-1 map = 88%, reduce = 29%, Cumulative CPU 610.43 sec
2021-01-06 15:59:45,578 Stage-1 map = 89%, reduce = 29%, Cumulative CPU 616.54 sec
2021-01-06 15:59:47,626 Stage-1 map = 89%, reduce = 30%, Cumulative CPU 616.59 sec
2021-01-06 15:59:51,719 Stage-1 map = 90%, reduce = 30%, Cumulative CPU 622.46 sec
2021-01-06 15:59:56,826 Stage-1 map = 91%, reduce = 30%, Cumulative CPU 628.32 sec
2021-01-06 16:00:09,096 Stage-1 map = 92%, reduce = 30%, Cumulative CPU 640.11 sec
2021-01-06 16:00:12,179 Stage-1 map = 92%, reduce = 31%, Cumulative CPU 640.19 sec
2021-01-06 16:00:15,240 Stage-1 map = 93%, reduce = 31%, Cumulative CPU 646.25 sec
2021-01-06 16:00:21,368 Stage-1 map = 94%, reduce = 31%, Cumulative CPU 652.31 sec
2021-01-06 16:00:27,502 Stage-1 map = 95%, reduce = 31%, Cumulative CPU 658.42 sec
2021-01-06 16:00:30,569 Stage-1 map = 95%, reduce = 32%, Cumulative CPU 658.47 sec
2021-01-06 16:00:34,655 Stage-1 map = 96%, reduce = 32%, Cumulative CPU 664.26 sec
2021-01-06 16:00:39,774 Stage-1 map = 97%, reduce = 32%, Cumulative CPU 670.08 sec
2021-01-06 16:00:52,043 Stage-1 map = 98%, reduce = 32%, Cumulative CPU 682.02 sec
2021-01-06 16:00:54,090 Stage-1 map = 98%, reduce = 33%, Cumulative CPU 682.07 sec
2021-01-06 16:00:58,174 Stage-1 map = 99%, reduce = 33%, Cumulative CPU 688.07 sec
2021-01-06 16:01:04,310 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 693.43 sec
2021-01-06 16:01:06,358 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 695.5 sec
MapReduce Total cumulative CPU time: 11 minutes 35 seconds 500 msec
Ended Job = job_1609141291605_0037
MapReduce Jobs Launched:
Stage-Stage-1: Map: 117 Reduce: 1 Cumulative CPU: 695.5 sec HDFS Read: 31436910990 HDFS Write: 109 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 minutes 35 seconds 500 msec
OK
767830000
Time taken: 447.145 seconds, Fetched: 1 row(s)
hive>
> set hive.execution.engine=spark;
hive> select count(*) from ods_fact_sale;
Query ID = root_20210106160132_8d81e192-ceb7-46a3-bc60-70a5eeabce87
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Running with YARN Application = application_1609141291605_0038
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/yarn application -kill application_1609141291605_0038
Hive on Spark Session Web UI URL: http://hp3:44667
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 117 117 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 51.30 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 51.30 second(s)
Spark Job[0] Metrics: TaskDurationTime: 316285, ExecutorCpuTime: 247267, JvmGCTime: 5415, BytesRead / RecordsRead: 31436921640 / 767830000, BytesReadEC: 0, ShuffleTotalBytesRead / ShuffleRecordsRead: 6669 / 117, ShuffleBytesWritten / ShuffleRecordsWritten: 6669 / 117
OK
767830000
Time taken: 71.384 seconds, Fetched: 1 row(s)
hive>
The test log shows the execution time dropping from 447 seconds (about 7.5 minutes) on the MR engine to 71 seconds on Spark, roughly a 6x improvement for the same count query over the same 117-split table.
References
1.https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
2.http://lxw1234.com/archives/2015/05/200.htm