distribute by, sort by, clu in Hive

Author: 叫我不矜持 | Published: 2019-05-29 19:46

order by
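A single reducer sorts all of the data, so on a large data set the sort can take a long time. order by is best used on small data sets.
If hive.mapred.mode is set to strict, order by must be accompanied by limit (and a query on a partitioned table must also specify which partition to read); in nonstrict mode, order by behaves much as it does in a relational database.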
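As a minimal sketch of the strict-mode rule, the session below reuses the store table from the cluster by example in this post; the amount column is hypothetical and only there for illustration.

```sql
-- Assumption: table `store` exists; column `amount` is hypothetical.
SET hive.mapred.mode=strict;

-- In strict mode an ORDER BY without LIMIT is expected to be rejected,
-- so the result size is bounded explicitly.
SELECT *
FROM store
ORDER BY amount DESC
LIMIT 100;
```

With the limit in place, the single reducer that performs the global sort only has to return a bounded number of rows.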

distribute by
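In Hive, the distribute by keyword (followed by a column of the table) controls how map output is distributed. It uses a hash algorithm: on the map side, rows of the query result whose distribute by column has the same hash value are sent to the same reducer for processing.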
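A short sketch of distribute by on its own, again assuming the store table and its merid column from the example later in this post:

```sql
-- Rows with the same merid are sent to the same reducer.
-- No ordering is implied: rows within each reducer's output
-- are not sorted unless SORT BY is added.
SELECT *
FROM store
DISTRIBUTE BY merid;
```

Different merid values can still land on the same reducer when their hash values map to it; the guarantee is only that one merid is never split across reducers.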

sort by
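sort by produces one sorted file per reducer and is usually combined with distribute by. The number of reducers can be set with mapred.reduce.tasks; the rows returned by the query are distributed to those reducers. sort by does not guarantee that the data is globally ordered.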
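One way to observe the per-reducer ordering is to write the result to a directory and inspect the files. This is only a sketch: the output path and the amount column are hypothetical.

```sql
-- Assumption: `/tmp/store_sorted` and column `amount` are illustrative only.
SET mapred.reduce.tasks=3;

-- Each reducer writes its own output file. Every file is sorted by
-- amount, but the files taken together are not globally ordered.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/store_sorted'
SELECT *
FROM store
SORT BY amount DESC;
```

With three reducers, the directory should contain up to three files, each one sorted independently of the others.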

cluster by
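cluster by is equivalent to combining distribute by and sort by on the same column. It can only sort in ascending order. The following two queries return the same result: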

-- cluster by
hive> select * from store cluster by merid;
-- distribute by + sort by
hive> select * from store distribute by merid sort by merid asc;
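When the distribution column and the sort order need to differ, the two clauses have to be written out separately, since cluster by cannot express that. A sketch, where the amount column is hypothetical:

```sql
-- Assumption: column `amount` is hypothetical.
-- Rows for one merid go to one reducer; within each reducer they are
-- ordered by merid ascending, then by amount descending.
SELECT *
FROM store
DISTRIBUTE BY merid
SORT BY merid ASC, amount DESC;
```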

Article link: https://www.haomeiwen.com/subject/wksutctx.html