美文网首页
Hive的三种去重方式对比

Hive的三种去重方式对比

作者: Mr_WangZz | 来源:发表于2019-12-06 15:19 被阅读0次

    数据量级:从 415816 到 221436

    第1种 distinct

    SELECT  COUNT(DISTINCT uid)
    FROM    tableA
    WHERE   dt = '20191205'
    AND     event = 'start'
    ;
    --29.379s
    

    第2种 group by

    SELECT  COUNT(*)
    FROM    (
                SELECT  uid
                        ,collect_set(model)[0]
                        ,collect_set(os)[0]
                FROM    tableA
                WHERE   dt = '20191205'
                AND     event = 'start'
                GROUP BY uid
            ) a
    ;
    --25.239s
    

    第3种 row_number()

    SELECT  COUNT(*)
    FROM    (
                SELECT  uid
                        ,row_number() OVER(PARTITION BY uid ORDER BY uid) rn
                FROM    tableA
                WHERE   dt = '20191205'
                AND     event = 'start'
            ) c
    WHERE   rn = 1
    ;
    -- 25.162s
    -- 25.154s
    

    加不加 DESC 除了结果顺序不同,对查询效率的影响

    SELECT  COUNT(*)
    FROM    (
                SELECT  uid
                        ,row_number() OVER(PARTITION BY uid ORDER BY uid DESC) rn
                FROM    tableA
                WHERE   dt = '20191205'
                AND     event = 'start'
            ) c
    WHERE   rn = 1
    ;
    -- 24.436s
    -- 25.435s
    -- 24.114s
    

    总结:

    • distinct使用起来虽然简单,但是效率不及 group by 和 row_number()。
    • 当数据量非常大时,尤其达到百万级及以上,应优先使用后两种去重方式。

    参考:

    1. 当数据量达到亿级时,group by和distinct区别以及性能比较
    2. Hive 三种去重方式

    相关文章

      网友评论

          本文标题:Hive的三种去重方式对比

          本文链接:https://www.haomeiwen.com/subject/klregctx.html