Good Programmer Big Data Training Tutorial: Hive Partitioning and Bucketing

Author: ab6973df9221 | Published: 2019-08-20 14:19


Hive partitioning

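1. Why partition?

As a single table grows, a Hive query typically scans the whole table, so much of the run time is spent reading data the query does not care about. Hive introduced partitions (partition) to avoid this.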

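2. How to partition?

It depends on the business: any field that splits one pile of data into several sensible piles will do. Common choices are id, year, month, day, region and province.
How do Hive partitions differ from MySQL partitions? MySQL partitions on a column inside the table. Hive partitions on a column outside the table data: it is declared in partitioned by, not in the column list.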

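3. Hive partition details

1. A partition is, in essence, a directory created under the table's directory.
2. Partition column names are case-insensitive; avoid Chinese characters in them.
3. Partition information can be queried. The partition column, however, is a pseudo-column: it exists in the metadata but is not physically stored in the data files.
4. The partition must be specified when loading data.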

4. Partition operations

Create a single-level partitioned table:

create table if not exists day_part(uid int,uname string)
partitioned by(year int)
row format delimited fields terminated by '\t';

load data local inpath '/root/Desktop/student.txt' into table day_part partition(year=2017);
load data local inpath '/root/Desktop/score.txt' into table day_part partition(year=2016);

show partitions day_part;
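To connect this table back to sections 1 and 3, the sketch below queries a single partition and then lists the table directory. It assumes both load statements above succeeded. The warehouse path in the dfs command is an assumption borrowed from the qf1603.db paths used later in this article and will differ on other installations.

```sql
-- Only the year=2017 directory is read, not the whole table.
-- year can be selected like a normal column although it is not stored in the files.
select uid, uname, year from day_part where year = 2017;

-- Each partition is a subdirectory of the table directory.
-- NOTE: assumed path; check the real one with `describe formatted day_part;`.
dfs -ls /user/hive/warehouse/qf1603.db/day_part;
```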

Two-level partitions:

create table if not exists day_part1(uid int,uname string)
partitioned by(year int,month int)
row format delimited fields terminated by '\t';

load data local inpath '/root/Desktop/student.txt' into table day_part1 partition(year=2017,month=4);
load data local inpath '/root/Desktop/score.txt' into table day_part1 partition(year=2017,month=3);

Three-level partitions:

create table if not exists day_part2(uid int,uname string)
partitioned by(year int,month int,day int)
row format delimited fields terminated by '\t';
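Loading into a three-level partition follows the same pattern as the earlier tables: every partition column gets a value. A minimal sketch, in which the day value 21 is an arbitrary illustrative choice and the input file is the one used earlier:

```sql
load data local inpath '/root/Desktop/student.txt'
into table day_part2 partition(year=2017, month=4, day=21);

-- The new partition should be listed as one nested path, one level per column.
show partitions day_part2;
```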

Operating on partitions. Show partitions:

show partitions day_part;

Add partitions (empty):

alter table day_part1 add partition(year=2017,month=2);
alter table day_part1 add partition(year=2017,month=1) partition(year=2016,month=12);

Add a partition and attach data to it by pointing it at an existing location:

alter table day_part1 add partition(year=2016,month=11) location "/user/hive/warehouse/qf1603.db/day_part1/year=2017/month=2";

Change the storage path of a partition:

-- the path must be written out in full, starting from hdfs://
alter table day_part1 partition(year=2016,month=11) set location "hdfs://linux1:9000/user/hive/warehouse/qf1603.db/day_part1/year=2017/month=3";

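Drop partitions: dropping a partition also deletes the corresponding partition directory (the data).

-- drop a single partition
alter table day_part1 drop partition(year=2017,month=2);
-- drop several partitions
alter table day_part1 drop partition(year=2017,month=3),partition(year=2017,month=4);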

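Static, dynamic and mixed partitions
Static partition: the partition name is already specified when the partition is added or when data is loaded into it.
Dynamic partition: the partition name is not known when the partition is added or when data is loaded into it.
Mixed partition: static and dynamic partition columns are used together.

Properties related to dynamic partitioning:
hive.exec.dynamic.partition=true: whether dynamic partitioning is allowed
hive.exec.dynamic.partition.mode=strict: the partition mode, either strict or nonstrict
strict: at least one partition column must be static
nonstrict: all partition columns may be dynamic
hive.exec.max.dynamic.partitions=1000: the maximum number of dynamic partitions allowed
hive.exec.max.dynamic.partitions.pernode=100: the maximum number of partitions the mappers/reducers on a single node may create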
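In a Hive session these properties are applied with set statements. The fragment below is a sketch that simply uses the values quoted above; note that the documented spelling of the non-strict value is nonstrict. Suitable limits depend on the cluster, so treat the numbers as placeholders.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

-- Running set with only the property name prints its current value.
set hive.exec.dynamic.partition.mode;
```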

Create a staging table:

-- create the staging table
create table if not exists tmp(uid int,commentid bigint,recommentid bigint,year int,month int,day int)
row format delimited fields terminated by '\t';
-- load data
load data local inpath '/root/Desktop/comm' into table tmp;

Create a dynamic-partition table:

-- create the dynamic-partition table
create table if not exists dyp1(uid int,commentid bigint,recommentid bigint)
partitioned by(year int,month int,day int)
row format delimited fields terminated by '\t';

Load data into the dynamic partitions:

-- strict mode: year is static, month and day are dynamic
insert into table dyp1 partition(year=2016,month,day)
select uid,commentid,recommentid,month,day from tmp;

-- non-strict mode
-- switch dynamic partitioning to non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
-- create the dynamic-partition table
create table if not exists dyp2(uid int,commentid bigint,recommentid bigint)
partitioned by(year int,month int,day int)
row format delimited fields terminated by '\t';
-- load data with every partition column dynamic
insert into table dyp2 partition(year,month,day)
select uid,commentid,recommentid,year,month,day from tmp;
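After the insert, one way to check what the dynamic load produced is to list the partitions and count the rows in each. This is a sketch: which partitions appear depends entirely on the year, month and day values present in tmp, and the second query assumes hive.mapred.mode is not strict, because it reads a partitioned table without a partition filter.

```sql
show partitions dyp2;

-- Row count per partition; the distribution should match the one in tmp.
select year, month, day, count(*) as cnt
from dyp2
group by year, month, day;
```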

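Hive also provides a strict mode that keeps users from accidentally submitting harmful HQL: hive.mapred.mode=nonstrict, with strict as the alternative. When the value is strict, the following three kinds of query are rejected:
1. A query on a partitioned table whose where clause does not filter on a partition column.
2. A Cartesian-product join: a join query without an on condition or a where condition.

select stu.id,stu.name,score.grade
from student stu
join score;

Allowed:

select stu.id,stu.name,score.grade
from student stu
join score
where stu.id = score.uid;

3. An order by query without a limit clause.

select student.*
from student
order by student.id desc;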
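The first and third restrictions can be sketched in the same way. The example assumes the day_part1 table created earlier and the student table used in this section; the filter value and the limit of 10 are arbitrary illustrative choices.

```sql
set hive.mapred.mode=strict;

-- Restriction 1: rejected, the where clause does not touch a partition column.
select * from day_part1 where uid = 1;
-- Allowed: the where clause filters on the partition column year.
select * from day_part1 where year = 2017 and uid = 1;

-- Restriction 3: allowed once a limit is added.
select student.*
from student
order by student.id desc
limit 10;
```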

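Notes:
1. Avoid dynamic partitioning where you can. With dynamic partitioning, reducers are allocated for every partition, so the number of reducers grows with the number of partitions, which can overwhelm the servers.
2. Dynamic and static partitions differ in when they are created: a static partition is created whether or not it holds any data, while a dynamic partition is created only if the result set contains rows for it.
3. Do not confuse the strict mode of dynamic partitioning (hive.exec.dynamic.partition.mode) with the strict mode of hive.mapred.mode.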

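Bucketing

1. Why bucket?

The data inside a partition can still be very large; bucketing manages partition data or table data at a finer granularity.
Bucketing keywords: clustered by(uid) into n buckets, bucket.
Bucketing uses a column inside the table.
How are rows assigned to buckets? The bucketing column is hashed, the hash value is taken modulo the total number of buckets, and the result is the bucket number.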
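The hash-then-modulo rule can be previewed with Hive's built-in hash and pmod functions. This is a sketch of the arithmetic rather than the exact internal code path. It assumes the student table (columns id, name) used later in this article and 4 buckets; bucket_index is a hypothetical alias and is zero-based.

```sql
-- For an int column the hash is the value itself, so with 4 buckets
-- the ids 1, 5 and 9 all get bucket_index 1.
select id, pmod(hash(id), 4) as bucket_index
from student;
```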

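2. What bucketing is for:

1. Fast sampling queries with tablesample.
2. Less data scanned per query, and therefore better query efficiency.

-- create a bucketed table with 4 buckets
create table if not exists bucket1(uid int,uname string)
clustered by(uid) into 4 buckets
row format delimited fields terminated by '\t';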

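3. Bucketing operations:

Loading data into a bucketed table: a bucketed table cannot be filled with load; use insert into instead, and set this property first:

-- enable bucketing enforcement
set hive.enforce.bucketing=true;
-- wrong way to load data: load only copies the file and does not distribute rows into buckets
load data local inpath '/root/Desktop/student' into table bucket1;
-- create a bucketed table with 4 buckets
create table if not exists bucket7(uid int,uname string)
clustered by(uid) into 4 buckets
row format delimited fields terminated by '\t';
-- load data into the bucketed table
insert into table bucket7
select id,name from student;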
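With bucketing enforced, the insert should leave one file per bucket in the table directory, so four files for bucket7. A quick way to check is to list that directory. The path below is an assumption based on the qf1603.db warehouse paths used in the partition examples; substitute the location reported by describe formatted.

```sql
-- NOTE: assumed path; look up the real one first.
describe formatted bucket7;
dfs -ls /user/hive/warehouse/qf1603.db/bucket7;
```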

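Bucket queries: tablesample(bucket x out of y on uid)
Note: x must not be greater than y.
x: the bucket at which sampling starts.
y: the number of groups the buckets are divided into; y should be a factor (or a multiple) of the table's bucket count. A y larger than the table's bucket count stretches the buckets (each is split further); a y smaller than it compresses them (several buckets are read together).
For the 4-bucket table:
1 out of 2: buckets 1 and 1+4/2 = 3
2 out of 2: buckets 2 and 2+4/2 = 4
1 out of 4: bucket 1 only (the next candidate, 1+4 = 5, does not exist)

select * from bucket7;
select * from bucket7 tablesample(bucket 1 out of 4 on uid);
select * from bucket7 tablesample(bucket 2 out of 4 on uid);
select * from bucket7 tablesample(bucket 1 out of 2 on uid);
select * from bucket7 tablesample(bucket 2 out of 2 on uid);
-- fails: x (3) is greater than y (2)
select * from bucket7 tablesample(bucket 3 out of 2 on uid);
select * from bucket7 tablesample(bucket 1 out of 8 on uid);
select * from bucket7 tablesample(bucket 5 out of 8 on uid);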
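One way to reason about bucket x out of y is as a filter on the hashed column: a row is kept when its hash modulo y equals x-1. The sketch below spells out that filter with the built-in hash and pmod functions for three of the samples above. It assumes uid values are non-negative integers. The plain where form is for understanding only; tablesample on the clustering column can additionally skip bucket files it does not need.

```sql
-- same rows as tablesample(bucket 1 out of 2 on uid)
select * from bucket7 where pmod(hash(uid), 2) = 0;

-- same rows as tablesample(bucket 2 out of 2 on uid)
select * from bucket7 where pmod(hash(uid), 2) = 1;

-- same rows as tablesample(bucket 1 out of 8 on uid): part of the first bucket
select * from bucket7 where pmod(hash(uid), 8) = 0;
```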

Partitioning + bucketing: (qfstu) uid, uname, class, master. Partition by gender, bucket by uid into odd and even. Goal: find the female students whose student number is odd.

-- create the source table
create table if not exists qftmp(uid int,uname string,class int,gender int)
row format delimited fields terminated by '\t';
-- load data
load data local inpath '/home/qf' into table qftmp;
-- create the dynamically partitioned, bucketed table
create table if not exists qf(uid int,uname string,class int)
partitioned by(gender int)
clustered by(uid) into 2 buckets
row format delimited fields terminated by '\t';
-- load data into the partitioned, bucketed table
insert into table qf partition(gender)
select uid,uname,class,gender from qftmp;

Query the female students whose student number is odd:

select * from qf where gender = 2 and uid%2 != 0;
select * from qf tablesample(bucket 2 out of 2 on uid) where gender = 2;

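Bucketing uses a column inside the table; partitioning uses a column outside it. Both are optimizations for Hive. The numbers of partitions and buckets should be chosen sensibly: more is not always better.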

Sampling:

select * from student order by rand() limit 3;
select * from student limit 3;
select * from student tablesample(3 rows);
-- the smallest unit is B (bytes)
select * from student tablesample(20B);
-- percentage
select * from student tablesample(20 percent);

Good Programmer big data training official site: http://www.goodprogrammer.org/
