Impala与Hive语法差异之一count()函数使用

作者: 晨冉1688 | 来源:发表于2020-04-12 19:14 被阅读0次

Impala与Hive语法差异之一count()函数使用
浅谈交互式查询⼯工具Impala(二)
Hive分析函数的使用
Hive窗口函数详解—及3套案例练习
Impala从入门到放弃-V1.0
Impala数据加载与查询
浅谈交互式查询⼯工具Impala(一)
Impala 适配 Hadoop生态圈
002 Impala | 安装要求
聚合函数和排序

报错信息：AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT user_id)
deviating function: count(DISTINCT CASE WHEN status = 1 THEN user_id
ELSE NULL END) Consider using NDV() instead of COUNT(DISTINCT)
if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT
query option to perform this rewrite automatically.

报错代码如下

select 
    grad_id,
    count(distinct user_id) as total_users,
    count(distinct case when status = 1 then user_id else null end) as distinct_total_users
from 
    student_center_info
group by 
    grad_id;

在impala中运行代码会报如下错误
运行报错信息如下：
AnalysisException: all DISTINCT aggregate functions need to have the
same set of parameters as count(DISTINCT user_id)
deviating function: count(DISTINCT CASE WHEN status = 1 THEN user_id
ELSE NULL END) Consider using NDV() instead of COUNT(DISTINCT)
if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT
query option to perform this rewrite automatically.

出现这个个报错的原因是因为count（）按某个列统计时，只能使用一次，如上面的代码，使用两次count（）对user_id统计时，就会报错，解决办法如下：

select 
    a.grad_id,
    a.total_user,
    b.total_users
from
    (select 
        grad_id,
        count(distinct user_id) as total_users
    from 
        student_center_info
    group by 
        grad_id) as a
 left join
     (select 
            grad_id,
            count(distinct case when status = 1 then user_id else null end) as total_users
        from 
            student_center_info
    group by 
        grad_id) as b
 on 
     a.grad_id = b.grad_id

如果用impala的话，可以用ndv()函数来代替，但是ndv()这个去重不是准确的，如果对数据要求精度比较高的，不建议使用ndv()。

在hive中可以正常运行下面代码，并且输出正确结果

select 
    grad_id,
    count(distinct user_id) as total_users,
    count(distinct case when status = 1 then user_id else null end) as distinct_total_users
from 
    student_center_info
group by 
    grad_id;