Main contents of this section:
Basic usage of Hive
I. Hive Storage Modes
Tables created in Hive are all metastore tables. They do not store the data themselves; they define the mapping between the real data and Hive, much like the meta information of a table in a traditional database, hence the name metastore.
For the actual storage, four table types can be defined:
--- Internal tables (the default)
--- Partitioned tables
--- Bucketed tables
--- External tables
II. Internal Tables
Create an internal table:
CREATE TABLE worker(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
This creates an internal table named worker. Internal is the default table type, so no storage mode needs to be declared; a comma is used as the field delimiter.
# hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> show tables;
OK
Time taken: 0.948 seconds
hive> CREATE TABLE worker(id INT, name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
OK
Time taken: 34.248 seconds
\054 is the octal ASCII code for a comma. Hive has no dedicated storage format of its own and builds no indexes on the data, so you are free to organize Hive tables however you like: simply tell Hive the column delimiter and row delimiter when creating the table, and Hive can parse the data.
For example:
create table user_info (user_id int, cid string, ckid string, username string)
row format delimited
fields terminated by '\t'
lines terminated by '\n';
The data to be loaded into this table is formatted with tab-separated fields and newline-separated rows.
The file content looks like this:
100636 100890 c5c86f4cddc15eb7 yyyvybtvt
100612 100865 97cc70d411c18b6f gyvcycy
100078 100087 ecd6026a15ffddf5 qa000100
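Loading such a file into user_info would look something like this (a sketch; '/user_info.txt' is a hypothetical local path, not part of the original walkthrough):
hive> LOAD DATA LOCAL INPATH '/user_info.txt' INTO TABLE user_info;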
hive> show tables;
OK
worker
Time taken: 0.28 seconds, Fetched: 1 row(s)
hive>
Where the table is stored:
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse
Found 1 items
drwxrwxrwt - root supergroup 0 2020-07-03 03:34 /user/hive/warehouse/worker
Inserting data:
Hive is not meant for single-row inserts; data is loaded in batches rather than with insert into worker values (1,'zhangsan') as in a traditional SQL database.
There are two ways to insert data:
First: load the data from a file.
Second: insert data read from another table (insert ... select), as in the sketch below.
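A minimal sketch of the second method, assuming a table worker_src with the same columns already exists (worker_src is hypothetical and not created in this article):
hive> INSERT INTO TABLE worker SELECT id, name FROM worker_src;
This runs a job that reads worker_src and appends its rows to worker.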
Loading data from a file
# cat /worker.txt
1,zhangsan
2,lisi
3,wangwu
4,zhaoliu
Load the data into the table:
hive> LOAD DATA LOCAL INPATH '/worker.txt' INTO TABLE worker;
Loading data to table default.worker
Table default.worker stats: [numFiles=1, totalSize=37]
OK
Time taken: 4.477 seconds
hive> select * from worker;
OK
1 zhangsan
2 lisi
3 wangwu
4 zhaoliu
Time taken: 1.293 seconds, Fetched: 4 row(s)
Check how the data is stored:
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/worker
Found 1 items
-rwxrwxrwt 3 root supergroup 37 2020-07-03 03:34 /user/hive/warehouse/worker/worker.txt
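To confirm that the internal table simply keeps the loaded file under its warehouse directory, the file can be read back directly from HDFS (an optional check; the contents are the same as /worker.txt above):
# sudo -u hdfs hadoop fs -cat /user/hive/warehouse/worker/worker.txt
1,zhangsan
2,lisi
3,wangwu
4,zhaoliu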
Load more data
# cat /worker.abc
5,tianqi
6,wangba
7,pijiu
Note that Hive does not care about the file extension; it can be anything.
hive> LOAD DATA LOCAL INPATH '/worker.abc' INTO TABLE worker;
Loading data to table default.worker
Table default.worker stats: [numFiles=1, totalSize=37]
OK
Time taken: 34.11 seconds
hive> select * from worker;
OK
1 zhangsan
2 lisi
3 wangwu
4 zhaoliu
Time taken: 28.489 seconds, Fetched: 4 row(s)
hive>
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/worker
Found 1 items
-rwxrwxrwt 3 root supergroup 37 2020-07-03 03:48 /user/hive/warehouse/worker/worker.txt
LOAD DATA LOCAL INPATH '/worker.txt' INTO TABLE worker;
The difference between LOAD DATA LOCAL INPATH and LOAD DATA INPATH is that the former reads the source file from your local disk, while the latter reads it from HDFS.
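A sketch of the non-LOCAL variant, assuming the file is first uploaded to HDFS (the /tmp/worker.txt path is illustrative):
# hadoop fs -put /worker.txt /tmp/worker.txt
hive> LOAD DATA INPATH '/tmp/worker.txt' INTO TABLE worker;
Note that without LOCAL, the source file is moved from its HDFS location into the table's warehouse directory rather than copied.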
III. Partitioned Tables
Partitioned tables are used to speed up queries: with a partition predicate, Hive only scans the matching partitions.
For example, if we query the data by date, we partition the table by date.
Create a partitioned table:
hive> create table partition_student(id int,name string)
> partitioned by(daytime string)
> row format delimited fields TERMINATED BY '\054';
OK
Time taken: 2.492 seconds
hive> show tables;
OK
partition_student
worker
Time taken: 8.952 seconds, Fetched: 2 row(s)
hive>
Create the data files:
# vi 2020070301
1,zhangsan
2,lisi
3,wangwu
4,zhaoliu
# vi 2020070302
33,tianqi
44,xiongda
55,xionger
The data files use a comma between fields and a newline between rows, matching the table definition above.
hive> LOAD DATA LOCAL INPATH '/2020070301' INTO TABLE partition_student partition(daytime='2020070301');
Loading data to table default.partition_student partition (daytime=2020070301)
Partition default.partition_student{daytime=2020070301} stats: [numFiles=1, numRows=0, totalSize=37, rawDataSize=0]
OK
Time taken: 56.454 seconds
hive> LOAD DATA LOCAL INPATH '/2020070302' INTO TABLE partition_student partition(daytime='2020070302');
Loading data to table default.partition_student partition (daytime=2020070302)
Partition default.partition_student{daytime=2020070302} stats: [numFiles=1, numRows=0, totalSize=32, rawDataSize=0]
OK
Time taken: 9.389 seconds
hive>
Note: every time a file is loaded, the partition value (daytime=...) must be specified manually; it is what later partition-filtered queries rely on.
hive> select * from partition_student where daytime='2020070301';
OK
1 zhangsan 2020070301
2 lisi 2020070301
3 wangwu 2020070301
4 zhaoliu 2020070301
Time taken: 15.493 seconds, Fetched: 4 row(s)
hive> select * from partition_student where daytime='2020070302';
OK
33 tianqi 2020070302
44 xiongda 2020070302
55 xionger 2020070302
Time taken: 2.219 seconds, Fetched: 3 row(s)
hive>
hive> select * from partition_student;
OK
1 zhangsan 2020070301
2 lisi 2020070301
3 wangwu 2020070301
4 zhaoliu 2020070301
33 tianqi 2020070302
44 xiongda 2020070302
55 xionger 2020070302
Time taken: 4.861 seconds, Fetched: 7 row(s)
hive>
Note:
The WHERE clause also supports AND (extending a query such as select * from partition_student where daytime='2020070302';). This is meant for tables with multiple partition columns, which must be declared when the table is created.
For example:
create table student(id int, name string)
partitioned by(daytime string, telnum string)
row format delimited fields TERMINATED BY '\054';
Here two partition columns are declared, so AND can be used after WHERE.
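With two partition columns, a load must specify both values, and a query can then filter on both with AND (a sketch; the path and telnum value are illustrative):
hive> LOAD DATA LOCAL INPATH '/2020070301' INTO TABLE student partition(daytime='2020070301', telnum='13800000000');
hive> select * from student where daytime='2020070301' and telnum='13800000000';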
Storage layout
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/partition_student
Found 2 items
drwxrwxrwt - root supergroup 0 2020-07-03 04:02 /user/hive/warehouse/partition_student/daytime=2020070301
drwxrwxrwt - root supergroup 0 2020-07-03 04:03 /user/hive/warehouse/partition_student/daytime=2020070302
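Each partition shows up as its own subdirectory. The partitions can also be listed from inside Hive; the command below should return the two daytime values loaded above:
hive> show partitions partition_student;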
IV. Bucketed Tables
Bucketing splits the data at a finer granularity than partitioning.
Bucketing distributes the whole data set according to the hash value of a chosen column.
For example, to split a table into 3 buckets by the name column, take the hash of each name value modulo 3 and place each row into a bucket according to the result: rows with result 0 go into one file, result 1 into a second file, and result 2 into a third file.
Unlike partitioning, which is based not on a real column in the table's data files but on a pseudo column that we specify, bucketing is based on a real, existing column. That is why the type of a partition column must be declared when it is specified: the column does not exist in the data files, so it is effectively a new column. A bucket column, by contrast, already exists in the table and its type is known, so no type needs to be declared.
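Roughly speaking, Hive assigns a row to bucket pmod(hash(column_value), bucket_num). The built-in hash() and pmod() functions give a feel for this assignment (a sketch; the exact internal formula can vary slightly between versions):
hive> select pmod(hash(7), 3);
For integer values the hash is the value itself, so an id of 7 would land in bucket 1.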
1. Create the table
Bucketing is declared with clustered by(column) into bucket_num buckets, which splits the table into bucket_num buckets by that column.
> create table test_bucket (
> id int comment 'ID',
> name string comment 'name'
> )
> comment 'test bucketed table'
> clustered by(id) into 3 buckets
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Test data
# for i in {1..10}; do echo $i,name$i >> bucket_data.txt;done
# cat bucket_data.txt
1,name1
2,name2
3,name3
4,name4
5,name5
6,name6
7,name7
8,name8
9,name9
10,name10
Load the data
Running LOAD DATA directly gives no bucketing effect; the result is the same as an unbucketed table, with only a single file on HDFS.
load data local inpath '/bucket_data.txt' into table test_bucket;
A staging (intermediate) table is needed instead.
> create table test (
> id int comment 'ID',
> name string comment 'name'
> )
> comment 'staging table for the bucketing test'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
OK
Time taken: 0.483 seconds
hive> load data local inpath '/bucket_data.txt' into table test;
Loading data to table default.test
Table default.test stats: [numFiles=1, totalSize=82]
OK
Time taken: 2.077 seconds
Then use the statements below to insert the data from the staging table into the bucketed table; this produces three files.
hive> set hive.enforce.bucketing = true;
This setting enforces bucketing when inserting.
hive> insert into test_bucket select * from test;
Query ID = root_20200703042727_918a62fa-bdce-4ca8-8a07-e1db2d085076
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1593650776473_0001, Tracking URL = http://node1.hadoop.com:8088/proxy/application_1593650776473_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1593650776473_0001
The distributed job can be seen at http://172.26.37.245:19888/jobhistory/app.
Check the file layout
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/test_bucket
Found 3 items
-rwxrwxrwt 3 hdfs supergroup 24 2020-07-03 23:06 /user/hive/warehouse/test_bucket/000000_0
-rwxrwxrwt 3 hdfs supergroup 34 2020-07-03 23:06 /user/hive/warehouse/test_bucket/000001_0
-rwxrwxrwt 3 hdfs supergroup 24 2020-07-03 23:06 /user/hive/warehouse/test_bucket/000002_0
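Each of the three files corresponds to one bucket. A bucketed table can also be sampled bucket by bucket with the standard TABLESAMPLE clause (which rows come back depends on how the ids were hashed):
hive> select * from test_bucket tablesample(bucket 1 out of 3 on id);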
V. External Tables
The data of an external table is not stored by Hive itself; it can be stored in HBase, for example, with Hive providing only a mapping onto it.
1. Create the HBase table
# hbase shell
hbase(main):011:0> create 'student','info'
0 row(s) in 2.2390 seconds
=> Hbase::Table - student
hbase(main):012:0> put 'student',1,'info:id',1
0 row(s) in 0.2780 seconds
hbase(main):014:0> put 'student',1,'info:name','zhangsan'
0 row(s) in 0.0200 seconds
hbase(main):016:0> put 'student',2,'info:id',2
0 row(s) in 0.0100 seconds
hbase(main):018:0> put 'student',2,'info:name','lisi'
0 row(s) in 0.0100 seconds
hbase(main):019:0> scan 'student'
ROW COLUMN+CELL
1 column=info:id, timestamp=1556032332227, value=1
1 column=info:name, timestamp=1556032361655, value=zhangsan
2 column=info:id, timestamp=1556032380941, value=2
2 column=info:name, timestamp=1556032405776, value=lisi
2 row(s) in 0.0490 seconds
Create the mapping between HBase and Hive
hive> CREATE EXTERNAL TABLE ex_student(key int, id int, name string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")
> TBLPROPERTIES ("hbase.table.name" = "student");
OK
Time taken: 5.019 seconds
hive> select * from ex_student;
OK
1 1 zhangsan
2 2 lisi
Time taken: 1.072 seconds, Fetched: 2 row(s)
hive>
File layout
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/
Found 6 items
drwxrwxrwt - root supergroup 0 2020-07-03 23:19 /user/hive/warehouse/ex_student
drwxrwxrwt - root supergroup 0 2020-07-03 23:18 /user/hive/warehouse/h_employee
drwxrwxrwt - root supergroup 0 2020-07-03 22:03 /user/hive/warehouse/partition_student
drwxrwxrwt - root supergroup 0 2020-07-03 22:49 /user/hive/warehouse/test
drwxrwxrwt - root supergroup 0 2020-07-03 23:06 /user/hive/warehouse/test_bucket
drwxrwxrwt - root supergroup 0 2020-07-03 21:42 /user/hive/warehouse/worker
# sudo -u hdfs hadoop fs -ls /user/hive/warehouse/ex_student/
Because the table is only a mapping, no stored data files are visible here.
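Since ex_student is an EXTERNAL table, dropping it in Hive only removes the mapping; the underlying HBase table student and its rows are left untouched (a sketch, not shown in the original walkthrough):
hive> drop table ex_student;
Running scan 'student' in the hbase shell afterwards should still return the two rows.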