2020-10-22 Using Hive Partitions: Taking Partition Values from Another Table

Author: 春生阁 | Published: 2020-10-22 13:16

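Background

Partitioning in Hive can greatly improve query performance. For flexibility, some scenarios need the partition range to be controlled dynamically. The usual way to restrict partitions is a literal predicate such as day between '2020-10-20' and '2020-10-21'. When the two dates instead come from columns of another table, the partitioning is no longer used and the query becomes a full table scan.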
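To make the contrast concrete, the two forms can be sketched as below. The first query uses the literal range from the text; the second takes the range from the dimension table described in the next section, using the column names that appear in the solution query. `EXPLAIN DEPENDENCY` is one way to see which partitions Hive lists as inputs; the exact output format depends on the Hive version, so no expected output is shown here.

```sql
-- Literal range: the partitions to read are known at compile time.
EXPLAIN DEPENDENCY
SELECT count(*)
FROM myPartitionTable
WHERE day BETWEEN '2020-10-20' AND '2020-10-21';

-- Range taken from another table: the bounds are only known at run time,
-- so the planner cannot narrow the partition list from literals.
EXPLAIN DEPENDENCY
SELECT count(*)
FROM currrentPartitionTable t1
INNER JOIN myPartitionTable t2
  ON t1.col_nam = t2.col_nam
WHERE t2.day BETWEEN t1.str_dte AND t1.end_dte;
```

Comparing the partition lists of the two plans is a quick way to confirm whether a given query is limited to the intended days.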

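Example tables

myPartitionTable: the base data table; large, partitioned by the day key
currrentPartitionTable: the partition dimension table; a very small table that records the start and end dates for day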
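A minimal sketch of what these two tables might look like is given below. Only the names day, col_nam, str_dte and end_dte come from the article's query; the STRING types, the payload column and the sample row are assumptions made purely for illustration.

```sql
-- Hypothetical layout of the large, partitioned base table.
CREATE TABLE IF NOT EXISTS myPartitionTable (
  col_nam STRING,
  payload STRING                 -- placeholder for the real data columns (assumed)
)
PARTITIONED BY (day STRING);     -- e.g. '2020-10-20'

-- Hypothetical layout of the small dimension table holding the date range.
CREATE TABLE IF NOT EXISTS currrentPartitionTable (
  col_nam STRING,
  str_dte STRING,                -- first day of the range
  end_dte STRING                 -- last day of the range
);

-- Illustrative row: select the two days used in the background example.
INSERT INTO TABLE currrentPartitionTable
VALUES ('demo', '2020-10-20', '2020-10-21');
```

Storing the dates as strings in the same yyyy-MM-dd format as the partition key keeps the BETWEEN comparison a plain string comparison.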

Solution

Use Hive's map join (MAPJOIN) feature, enabled by setting the hive.auto.convert.join parameter.

-- Configuration parameters
set hive.auto.convert.join=true;
-- When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:
set hive.auto.convert.join.noconditionaltask=true;
-- Size threshold in bytes (about 10 MB) below which the small table is loaded into memory
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Query
select count(*)
from currrentPartitionTable as t1
inner join myPartitionTable as t2
on t2.day between t1.str_dte and t1.end_dte
and t1.col_nam=t2.col_nam
;
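The comment in the configuration block mentions map-join hints. For reference, a hint-based variant of the same query could look like the sketch below. It assumes hive.ignore.mapjoin.hint is set to false, because Hive ignores the hint by default since version 0.11. It also moves the range condition into WHERE, which gives the same result for an inner join and avoids non-equality conditions in ON, which older Hive releases reject. This is an alternative sketch, not the approach the article measured.

```sql
-- Assumption: let Hive honor the explicit hint instead of ignoring it.
set hive.ignore.mapjoin.hint=false;

-- t1 (the small dimension table) is the one to be loaded into memory.
select /*+ MAPJOIN(t1) */ count(*)
from currrentPartitionTable t1
inner join myPartitionTable t2
  on t1.col_nam = t2.col_nam
where t2.day between t1.str_dte and t1.end_dte
;
```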

Result: execution time dropped from 90 minutes to 15 minutes.
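One way to check that the join was actually converted is to inspect the plan before running the query. The sketch below simply prefixes the query with EXPLAIN. In the output, a "Map Join Operator" entry suggests the conversion took place, while a plain "Join Operator" in a reduce stage suggests a regular shuffle join; the exact labels may vary with the Hive version and execution engine.

```sql
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Print the execution plan without running the query.
explain
select count(*)
from currrentPartitionTable as t1
inner join myPartitionTable as t2
on t2.day between t1.str_dte and t1.end_dte
and t1.col_nam=t2.col_nam
;
```

If the plan still shows a shuffle join, the dimension table may be larger than the configured size threshold.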

References

select a partitioned table and specify partition through a join
