美文网首页
hive sql优化实例

hive sql优化实例

作者: 大数据技术进阶 | 来源:发表于2018-04-05 13:47 被阅读0次

    Hive中SQL的优化技巧,核心思想是避免数据倾斜。

    1、避免在同一个查询中同时出现count, distinct,group by

    2、left join 时把小数据量的表放在前面

    3、尽量使用子查询

    参数配置

    SET mapred.reduce.tasks=50;
    SET mapreduce.reduce.memory.mb=6000;
    SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;

    涉及数据倾斜的话,主要是reduce中数据倾斜的问题,可能通过设置hive中reduce的并行数,reduce的内存大小单位为m,reduce中 shuffle的刷磁盘的比例,来解决。

    实例一

    --分月
    select substr(a.day,1,6)month,count(distinct a.userid)
    from dms.tracklog_5min a
    join default.site_activeuser_tmp c
    on a.userid=c.id
    where a.day>='201505' and a.day<'201506'
    group by substr(a.day,1,6) ;

    --优化后
    select '201505',count(*) from
    (
    select distinct c.userid
    from
    (select userid from default.site_activeuser_tmp where month='201505') c
    left join
    (
    select userid from
    dms.tracklog_5min
    where day>='201505' and day<'201506'
    ) tmp
    on tmp.userid=c.userid
    ) t;

    实例二

    --分事业部
    select substr(a.day,1,6)month,count(distinct a.userid) ,b.dept_name
    from dms.tracklog_5min a join default.d_channel b
    on a.host=b.host
    join default.site_activeuser_tmp c
    on a.userid=c.id
    where a.day>='201505' and a.day<'201506'
    group by substr(a.day,1,6),b.dept_name;

    --优化后
    SET mapred.reduce.tasks=50;
    SET mapreduce.reduce.memory.mb=6000;
    SET mapreduce.reduce.shuffle.memory.limit.percent=0.06;

    select "201505" month,count(t.userid),t.dept_name
    from
    (select userid from default.site_activeuser_tmp where month='201505') c
    left join
    (
    select distinct a.userid userid,b.dept_name dept_name from default.d_channel b
    left join
    (select host,userid from dms.tracklog_5min where day>='201505' and day<'201506' ) a
    on a.host=b.host
    )t
    on t.userid=c.userid
    group by t.dept_name ;

    实例三

    --分产品
    select substr(a.day,1,6)month,count(distinct a.userid) ,b.dept_name,b.prod_name
    from dms.tracklog_5min a join default.d_channel b
    on a.host=b.host
    join default.site_activeuser_tmp c
    on a.userid=c.id
    where a.day>='201505' and a.day<'201506'
    group by substr(a.day,1,6),b.dept_name,b.prod_name;

    --优化后
    select "201505" month,count(t.userid) cnt,t.dept_name dept_name,t.prod_name prod_name
    from
    (select userid from default.site_activeuser_tmp where month='201505') c
    left join
    (
    select distinct a.userid userid,b.dept_name dept_name,b.prod_name prod_name from default.d_channel b
    left join
    (select host,userid from dms.tracklog_5min where day>='201505' and day<'201506' ) a
    on a.host=b.host
    )t
    on t.userid=c.userid
    group by t.prod_name,t.dept_name ;

    相关文章

      网友评论

          本文标题:hive sql优化实例

          本文链接:https://www.haomeiwen.com/subject/jjarhftx.html