Hive Notes, Part 1

Author: 利伊奥克儿 | Published: 2019-07-15 21:27

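Reference: related blog posts by 黑泽君.
This post summarizes pitfalls I have hit in day-to-day work; drawing on 黑泽君's related blog posts, I have selected and supplemented some of the material.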

Uploading data

Upload the data, then run the msck repair command

Upload the data
hive> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201904/day=14;
hive> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201904/day=14;

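Query the data (the newly uploaded data is not returned, because the new partition is not yet registered in the metastore)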
hive> select * from dept_partition2 where month='201904' and day='14';
OK

Run the repair command
hive> msck repair table dept_partition2;

Query the data again
hive> select * from dept_partition2 where month='201904' and day='14';
OK
dept_partition2.deptno    dept_partition2.dname   dept_partition2.loc dept_partition2.month   dept_partition2.day
10    ACCOUNTING  1700    201904  14
20    RESEARCH    1800    201904  14
30    SALES   1900    201904  14
40    OPERATIONS  1700    201904  14

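Note: if the data is already sitting under the target path and you then run the load data command, it seems to fail. I ran into this before but have not re-tested it here; I only vaguely remember the pitfall.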
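
The first query above returned nothing because Hive only sees partitions that are registered in its metastore, not directories that merely exist on HDFS. The Python sketch below models that gap under simplifying assumptions: the helper names (`parse_partition_spec`, `unregistered_partitions`) and the list standing in for the metastore are hypothetical, and this is not how msck is actually implemented.

```python
from typing import Dict, List


def parse_partition_spec(table_root: str, directory: str) -> Dict[str, str]:
    """Turn '<table_root>/month=201904/day=14' into {'month': '201904', 'day': '14'}."""
    if not directory.startswith(table_root):
        raise ValueError(f"{directory} is not under {table_root}")
    relative = directory[len(table_root):].strip("/")
    spec: Dict[str, str] = {}
    for segment in relative.split("/"):
        key, _, value = segment.partition("=")
        spec[key] = value
    return spec


def unregistered_partitions(
    table_root: str, directories: List[str], registered: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Partition directories that exist on disk but are missing from the catalog."""
    found = [parse_partition_spec(table_root, d) for d in directories]
    return [spec for spec in found if spec not in registered]


table_root = "/user/hive/warehouse/dept_partition2"
on_disk = [table_root + "/month=201904/day=14"]
registered: List[Dict[str, str]] = []  # simulated metastore: nothing registered yet

# These are the partitions a repair step would need to add.
to_register = unregistered_partitions(table_root, on_disk, registered)
print(to_register)
```

Once the partition is in the catalog (the second call in a test would pass `to_register` as `registered`), nothing is left to repair, which mirrors why the second query returns rows.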

Upload the data, then add the partition

Upload the data
hive> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201905/day=15;
hive> dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201905/day=15;

Add the partition
hive> alter table dept_partition2 add partition(month='201905', day='15');

Query the data
hive> select * from dept_partition2 where month='201905' and day='15';

Create the directory, then load data into the partition (most common)

Create the directory
hive> dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201906/day=16;

Load the data
hive> load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201906',day='16');

Query the data
hive> select * from dept_partition2 where month='201906' and day='16';

Exporting data

Export query results to the local filesystem
hive> insert overwrite local directory '/opt/module/datas/export/student'
select * from student;

Export formatted query results to the local filesystem
hive> insert overwrite local directory '/opt/module/datas/export/student1'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
select * from student;

Export query results to HDFS (no local keyword)
hive> insert overwrite directory '/user/atguigu/student2'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
select * from student;

Export to the local filesystem with a Hadoop command
hive> dfs -get /user/hive/warehouse/student/month=201909/000000_0 /opt/module/datas/export/student3.txt;
This simply fetches the underlying data file directly.

Export with the Hive shell command
bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt;

Export to HDFS with the export command
hive> export table default.student to '/user/hive/warehouse/export/student';

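like and rlike

1) Use the LIKE operator to select similar values.
2) The pattern can contain characters or digits:
% matches zero or more characters (any number of characters).
_ matches exactly one character.
3) The RLIKE clause is a Hive extension of this feature; it lets you specify the match condition with Java regular expressions, a more powerful language.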

Find employees whose salary starts with 2
hive> select * from emp where sal LIKE '2%';
7698    BLAKE   MANAGER 7839    1981-5-1    2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9    2450.0  NULL    10

Find employees whose salary has 2 as its second digit
hive> select * from emp where sal LIKE '_2%';
7521    WARD    SALESMAN    7698    1981-2-22   1250.0  500.0   30
7654    MARTIN  SALESMAN    7698    1981-9-28   1250.0  1400.0  30

Find employees whose salary contains a 2
hive> select sal from emp where sal RLIKE '[2]';
1250.0
1250.0
2850.0
2450.0
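
The difference between the two operators can be modelled in a few lines of Python. This is only a sketch: Python's `re` module stands in for Java regular expressions, the function names are made up for illustration, and the salary list is a hypothetical sample rather than the real emp table. The key point it shows is that LIKE must match the whole value, while RLIKE succeeds if the pattern matches anywhere inside it.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into an equivalent regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # zero or more characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(value: str, pattern: str) -> bool:
    # LIKE has to match the entire string.
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None


def rlike(value: str, regex: str) -> bool:
    # RLIKE is true when the regex matches any substring.
    return re.search(regex, value) is not None


salaries = ["1250.0", "2850.0", "2450.0", "3000.0"]  # hypothetical sample values

starts_with_2 = [s for s in salaries if like(s, "2%")]
second_is_2 = [s for s in salaries if like(s, "_2%")]
contains_2 = [s for s in salaries if rlike(s, "[2]")]
print(starts_with_2, second_is_2, contains_2)
```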

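The having clause

How having differs from where
(1) where acts on columns of the table to select rows; having acts on columns of the query result to filter it.
(2) Aggregate functions cannot be used after where, but they can be used after having.
(3) having is only used in group by aggregation statements.

Compute the average salary of each department in the emp table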
hive> select deptno, avg(sal) avg_sal from emp 
group by deptno;
10    2916.6666666666665
20    1975.0
30    1566.6666666666667

Find the departments in the emp table whose average salary is greater than 2000
hive> select deptno, avg(sal) avg_sal from emp 
group by deptno 
having avg_sal>2000;
10    2916.6666666666665
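
To make the where/having distinction concrete, here is a small Python model of a grouped average. The rows are hypothetical sample data (not the emp table used above) and `avg_sal_by_dept` is an illustrative name. The row filter plays the role of where and runs before grouping; the group filter plays the role of having and runs after aggregation.

```python
from collections import defaultdict

# Hypothetical (deptno, sal) rows, for illustration only.
rows = [
    (10, 2450.0), (10, 5000.0), (10, 1300.0),
    (20, 800.0), (20, 3000.0),
    (30, 1250.0), (30, 2850.0),
]


def avg_sal_by_dept(rows, min_sal=None, min_avg=None):
    # WHERE: filter individual rows before they are grouped.
    if min_sal is not None:
        rows = [r for r in rows if r[1] > min_sal]

    # GROUP BY deptno, then AVG(sal) per group.
    groups = defaultdict(list)
    for deptno, sal in rows:
        groups[deptno].append(sal)
    result = {d: sum(s) / len(s) for d, s in groups.items()}

    # HAVING: filter whole groups using the aggregated value.
    if min_avg is not None:
        result = {d: a for d, a in result.items() if a > min_avg}
    return result


print(avg_sal_by_dept(rows, min_avg=2000))  # like: having avg_sal > 2000
print(avg_sal_by_dept(rows, min_sal=2000))  # like: where sal > 2000
```

The two calls give different answers on the same data: filtering rows first changes each department's average, while filtering groups only drops departments.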

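Table aliases

Benefits of using aliases
(1) Aliases simplify queries.
(2) Prefixing columns with the table name can improve execution efficiency.
If you were implementing the parser yourself, a wildcard column (*) or a column without a table prefix would need a fair amount of preprocessing to work out which table each column comes from. When the table is given explicitly, those checks are unnecessary, which saves that work.
A database's SQL parser is, after all, just a program, so it helps to think about questions like this from the implementation side.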
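
As a thought experiment along those lines, the sketch below shows what a toy column resolver might do. It is not Hive's parser; the `schemas` dictionary, the aliases `e` and `d`, and the `resolve` function are all hypothetical. A qualified name needs one lookup, while an unqualified name forces a scan over every table in the query and an ambiguity check.

```python
# Hypothetical table aliases and their columns.
schemas = {
    "e": ["empno", "ename", "deptno"],
    "d": ["deptno", "dname"],
}


def resolve(column: str, schemas: dict) -> tuple:
    """Return (alias, column_name) for a column reference."""
    if "." in column:
        # Qualified: a single direct lookup.
        alias, name = column.split(".", 1)
        if name not in schemas[alias]:
            raise KeyError(f"{name} not found in {alias}")
        return alias, name

    # Unqualified: scan every table and check for ambiguity.
    owners = [alias for alias, cols in schemas.items() if column in cols]
    if len(owners) != 1:
        raise ValueError(f"column {column!r} is ambiguous or unknown: {owners}")
    return owners[0], column


print(resolve("e.deptno", schemas))
print(resolve("dname", schemas))
```

Calling `resolve("deptno", schemas)` raises `ValueError` here, because both tables have that column; that is the kind of case the table prefix removes.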

Sorting

Global sort (order by)

List employees by salary in descending order
hive> select * from emp order by sal desc;

Sort by twice the employee's salary (sorting by an alias)
hive> select ename, sal*2 twosal from emp order by twosal;

Sort by department and salary in ascending order
hive> select ename, deptno, sal from emp order by deptno, sal;

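Sorting within each reducer (sort by)
sort by: sorts rows within each reducer. When there are multiple reducers, the overall result set is not globally sorted.

Set the number of reducers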
hive> set mapreduce.job.reduces=3;

Check the configured number of reducers
hive> set mapreduce.job.reduces;
mapreduce.job.reduces=3

List employees by department number in descending order
hive> select * from emp sort by deptno desc;

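Partitioned sort (distribute by)

distribute by: similar to the partitioner in MapReduce; it decides which reducer each row is sent to, and is usually combined with sort by.
Note: Hive requires the DISTRIBUTE BY clause to come before the SORT BY clause.
When testing distribute by, be sure to allocate multiple reducers; otherwise its effect is not visible.

Distribute by department number first, then sort by employee number in descending order.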
hive> set mapreduce.job.reduces=3;
hive> insert overwrite local directory '/opt/module/datas/distributeby-result' 
select * from emp distribute by deptno sort by empno desc;

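cluster by
When distribute by and sort by use the same column, you can use cluster by instead.
cluster by combines the behavior of distribute by and sort by. However, it can only sort in ascending order; you cannot specify ASC or DESC.
1) The following two forms are equivalent
hive> select * from emp cluster by deptno;
hive> select * from emp distribute by deptno sort by deptno;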
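
How distribute by, sort by, and cluster by relate can be sketched in Python. The model assumes that an integer key is assigned to a reducer by taking the key modulo the reducer count; that is a simplification chosen for illustration, not a statement about Hive's exact hashing. `distribute_and_sort` is a hypothetical helper, and the employee with empno 7369 in department 20 is a made-up row added so that every reducer receives data.

```python
def distribute_and_sort(rows, num_reducers, distribute_key, sort_key, descending=False):
    """Send each row to a reducer by key, then sort inside each reducer only."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: integer keys map to reducers by simple modulo.
        reducers[row[distribute_key] % num_reducers].append(row)
    return [
        sorted(part, key=lambda r: r[sort_key], reverse=descending)
        for part in reducers
    ]


rows = [
    {"empno": 7698, "deptno": 30},
    {"empno": 7782, "deptno": 10},
    {"empno": 7521, "deptno": 30},
    {"empno": 7654, "deptno": 30},
    {"empno": 7369, "deptno": 20},  # hypothetical row
]

# distribute by deptno sort by empno desc, with 3 reducers
parts = distribute_and_sort(rows, 3, "deptno", "empno", descending=True)
empnos_per_reducer = [[r["empno"] for r in part] for part in parts]
print(empnos_per_reducer)

# cluster by deptno == distribute by deptno sort by deptno (ascending only)
clustered = distribute_and_sort(rows, 3, "deptno", "deptno")
```

Each inner list is sorted, but concatenating them does not give a globally sorted result, which is the difference from order by.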

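Bucketing

Partitioning applies to the data's storage path (directories); bucketing applies to the data files themselves.
Partitioning provides a convenient way to isolate data and optimize queries.
However, not every data set can be partitioned sensibly, especially given the need to settle on a suitable partition size.
Bucketing is another technique for breaking a data set into more manageable parts. It suits cases where a single file is very large.

Create a bucketed table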
hive> create table stu_buck(id int, name string)
clustered by(id) 
into 4 buckets
row format delimited 
fields terminated by '\t';

View the table structure
hive> desc formatted stu_buck;
Num Buckets:            4    

Load data into the bucketed table
hive> load data local inpath '/opt/module/datas/stu_buck.txt' into table stu_buck;

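After the steps above, the table turns out not to be split into 4 buckets,
because a bucketed table cannot be populated directly with load; the data has to be inserted from another table.
This makes sense on reflection: load only moves the file to a new location and does not split it.

First create an ordinary stu table
hive> create table stu(id int, name string)
row format delimited fields terminated by '\t';

Load data into the ordinary stu table
hive> load data local inpath '/opt/module/datas/stu_buck.txt' into table stu;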

Empty the stu_buck table
hive> truncate table stu_buck;

Load data into the bucketed table with an insert from a select query
hive> insert into table stu_buck select id, name from stu;

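After the steps above, the table is still not split into 4 buckets,
because some properties have not been set.

hive> set hive.enforce.bucketing=true;
hive> set mapreduce.job.reduces=-1; -- -1 means the number of reducers is not fixed in advance; it is worked out automatically when the HQL statement runs.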
hive> truncate table stu_buck;
hive> insert into table stu_buck
select id, name from stu;

After the steps above, the table is split into 4 buckets.
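
Why insert produces 4 files while load does not can be illustrated with a short Python sketch: bucketing has to look at every row and hash its id, whereas load just moves one file. The sketch assumes that an int column hashes to its own value and that the bucket is that hash, made non-negative, modulo the bucket count. `bucket_for` and the id values are hypothetical.

```python
from collections import defaultdict


def bucket_for(id_value: int, num_buckets: int) -> int:
    # Assumption: an int hashes to itself; mask keeps the hash non-negative.
    return (id_value & 0x7FFFFFFF) % num_buckets


ids = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008]  # hypothetical stu ids

buckets = defaultdict(list)
for id_value in ids:
    buckets[bucket_for(id_value, 4)].append(id_value)

# One entry per bucket file, keyed by bucket index.
for index in sorted(buckets):
    print(index, buckets[index])
```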

Change the number of buckets of a bucketed table
hive> alter table stu_buck clustered by(id,name) sorted by(id) into 10 buckets;

A complete example
hive> create table bkt(name string,id string,phone string,card_num bigint,email string,addr string) clustered by(card_num) into 30 buckets;
hive> create table bak(name string,id string,phone string,card_num bigint,email string,addr string) row format delimited fields terminated by ',';
hive> load data local inpath '/home/xfvm/bak' into table bak;
hive> insert into table bkt select * from bak;

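Bucket sampling queries

tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y).
x is the bucket to start sampling from.
y is the interval, in buckets, between sampled buckets.

hive> select * from bkt tablesample(bucket 2 out of 6 on card_num);
This samples 5 (30/6) buckets of data starting from bucket 2; the number of rows returned depends on how much data each bucket holds.
Every 6th bucket after that is sampled, so the sampled buckets are, in order: 2, 8, 14, 20, 26.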

Note: x must be less than or equal to y; otherwise the following error is reported:
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table bkt
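
The bucket arithmetic above is easy to check with a small Python function. This is a sketch of the arithmetic only, under the assumption that y divides the table's bucket count evenly (as 6 divides 30 here); `sampled_buckets` is a hypothetical name and does not query Hive.

```python
def sampled_buckets(num_buckets: int, x: int, y: int) -> list:
    """1-based bucket numbers picked by TABLESAMPLE(BUCKET x OUT OF y)."""
    if x < 1:
        raise ValueError("x must be at least 1")
    if x > y:
        # Mirrors the condition behind Hive's error 10061.
        raise ValueError("numerator should not be bigger than denominator")
    if num_buckets % y != 0:
        raise ValueError("this sketch only handles y that divides num_buckets")
    # Start at bucket x and step forward y buckets at a time.
    return list(range(x, num_buckets + 1, y))


picked = sampled_buckets(30, 2, 6)
print(picked, len(picked))
```

For the bkt table, the call returns the same five buckets listed above, and asking for bucket 7 out of 6 raises an error, as in the Hive message.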

Article link: https://www.haomeiwen.com/subject/hjwokctx.html
