1. The concept of a data warehouse
1.1 ETL & ELT
- ETL: 1. extract 2. transform 3. load
- ELT: 1. extract 2. load 3. transform
Which method to use depends on the context.
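A minimal sketch of the ELT idea in Hive-style SQL (the table and column names are made up for illustration): the raw extract is loaded as-is first, and the transformation happens afterwards inside the warehouse.
-- hypothetical staging table that receives the raw extract (extract + load)
create table stg_orders (id int, amount string)
row format delimited fields terminated by ',';
-- the transform step runs later, inside the warehouse (the T of ELT)
create table dw_orders as
select id, cast(amount as double) as amount
from stg_orders
where amount is not null;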
1.2 MT data warehouse
2. Apache Hive
2.1 What is Hive
- Hive is an open-source data warehouse built on top of Hadoop.
- It can store structured and semi-structured data files and map them into tables.
- Hive translates HQL into MapReduce jobs and submits them to Hadoop.
- Hive stores data in HDFS and analyses it with MapReduce.
The architecture of Hive
2.2 Tables in Hive
2.2.1 The storage path
- If a table is in the default database, its storage path in HDFS is /user/hive/warehouse/t_order
- If a table is in a custom database, the path in HDFS is /user/hive/warehouse/itcast.db/address
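One way to confirm a table's actual storage path (not in the original notes, but standard Hive) is the LOCATION shown by show create table:
-- the output contains a LOCATION clause, e.g.
--   hdfs://.../user/hive/warehouse/t_order           (default database)
--   hdfs://.../user/hive/warehouse/itcast.db/address (custom database)
show create table t_order;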
2.2.2 What are partitions
- Partitions are a means of optimizing table queries in Hive.
2.2.3 What are buckets
- Buckets are a means of optimizing joins in Hive.
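The notes do not show bucketing DDL; as a hedged sketch with illustrative names, a bucketed table is declared with clustered by ... into N buckets:
-- rows are hashed on id into 4 bucket files; joining two tables bucketed
-- the same way on the join key enables bucket map joins
create table t_user_bucketed (id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';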
2.2.4 Hive metadata
- The metadata records databases, tables, table locations, properties, column types, and field order.
- The metadata must be stored in a relational database such as Derby.
2.2.5 Hive metastore
- The metastore service can be deployed in three modes: embedded, local, and remote.
2.3 Hive installation
pass
3. Hive SQL
3.1 DDL: create table (IMPORTANT)
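The full CREATE TABLE syntax was only in the screenshot; a simplified sketch of the skeleton (optional clauses in brackets), pieces of which all the examples below use:
create [external] table [if not exists] table_name (
col_name data_type [comment 'col_comment'], ...
)
[partitioned by (col_name data_type, ...)]
[clustered by (col_name, ...) into N buckets]
[row format delimited
fields terminated by '...'
collection items terminated by '...'
map keys terminated by '...']
[location 'hdfs_path'];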
3.2 Data types
- primitive data types: numeric, string, date/time, miscellaneous
- complex data types: arrays, maps, structs, union
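A hedged sketch that puts both families into one DDL (the table and columns are illustrative, not from the notes):
create table t_type_demo (
id int, -- numeric
name string, -- string
birthday date, -- date/time
tags array<string>, -- complex: array
scores map<string,int>, -- complex: map
addr struct<city:string,zip:string> -- complex: struct
);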
3.3 Example: creating a table with primitive types
- upload archer.txt and build the mapping; the default warehouse path is /user/hive/warehouse
hadoop fs -put archer.txt /user/hive/warehouse/ithemia.db/t_archer
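The CREATE TABLE statement for t_archer is only in the screenshot; a sketch of what it likely looks like, assuming archer.txt is tab-separated and has the same columns as t_all_hero later in these notes:
create table t_archer(
id int,
name string,
hp_max int,
mp_max int,
attack_max int,
defense_max int,
attack_range string,
role_main string,
role_assist string
)
row format delimited
fields terminated by '\t'; -- assumed: tab-separated, same layout as t_all_hero below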
3.4 Example: creating a table with complex types
create table t_hot_hero_skin_price(
id int,
name string,
win_rate int,
skin_price map<string,int>
)
row format delimited
fields terminated by ',' -- delimiter between fields
collection items terminated by '-' -- delimiter between collection items
map keys terminated by ':'; -- delimiter between a map key and its value
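As a hedged illustration of what these delimiters imply (the sample values are invented, not from the original data), a matching input line and a lookup into the map column:
-- a matching data line would look like: 1,hero_a,55,skin_x:288-skin_y:888
-- ',' separates fields, '-' separates map entries, ':' separates key and value
select name, skin_price['skin_x'] as price
from t_hot_hero_skin_price;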
3.5 Creating a table with the default delimiter
create table t_team_ace_player(
id int,
team_name string,
ace_player_name string
); -- no row format clause is given, so the default field delimiter \001 is used
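For comparison (an added note, not from the original), the default could also be spelled out explicitly; this hypothetical DDL is equivalent to relying on the default:
create table t_team_ace_player_explicit( -- hypothetical table name for illustration
id int,
team_name string,
ace_player_name string
)
row format delimited
fields terminated by '\001'; -- \001 is Hive's default field delimiter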
3.6 Creating a table at a specified path
- 1. hadoop fs -mkdir /data
- 2. create the table
create table t_team_ace_player_location(
id int,
team_name string,
ace_player_name string)
location '/data'; -- the location keyword specifies where this table's data lives on HDFS
- 3. hadoop fs -put t_team_ace_player_location.txt /data
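As a quick check (my own workflow assumption, not shown in the original), querying the table should return the rows from the file just uploaded to /data:
-- verifies that the file under location '/data' is mapped to the table
select * from t_team_ace_player_location;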
3.7 Creating internal and external tables
3.7.1 Internal tables
- Internal (managed) tables are fully managed by Hive: dropping the table deletes both the metadata in Hive and the data in HDFS. Tables are created as internal by default.
create table student(
num int,
name string,
sex string,
age int,
dept string)
row format delimited
fields terminated by ',';
3.7.2 External tables
- Creating an external table is a way to protect the data: dropping an external table removes only the metadata in Hive, while the data files remain in HDFS.
create external table student_ext(
num int,
name string,
sex string,
age int,
dept string)
row format delimited
fields terminated by ','
location '/stu';
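A standard way to tell the two kinds of table apart (not shown in the original notes) is the Table Type field of describe formatted:
-- the output shows "Table Type: EXTERNAL_TABLE" here,
-- versus "Table Type: MANAGED_TABLE" for the internal student table above
describe formatted student_ext;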
3.8 Hive partitioned tables
3.8.1 Before partitioning
- upload the 6 files to Linux
- upload the 6 files to HDFS
hadoop fs -put archer.txt assassin.txt mage.txt support.txt tank.txt warrior.txt /user/hive/warehouse/ithemia.db/t_all_hero
- create a table that maps to the 6 files
create table t_all_hero(
id int,
name string,
hp_max int,
mp_max int,
attack_max int,
defense_max int,
attack_range string,
role_main string,
role_assist string
)
row format delimited
fields terminated by "\t";
- execute Hive SQL
--non-partitioned table, full table scan
select count(*) from t_all_hero where role_main="archer" and hp_max >6000;
- it took 8 seconds
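To confirm the full scan (my own verification suggestion, not part of the original), the query plan can be inspected with explain; without a partition column there is nothing to prune:
-- the plan's TableScan covers the entire t_all_hero data set
explain select count(*) from t_all_hero where role_main="archer" and hp_max >6000;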
3.8.2 After partitioning
- create a partitioned table
-- note the syntax rules for partitioned tables
-- create the partitioned table
create table t_all_hero_part(
id int,
name string,
hp_max int,
mp_max int,
attack_max int,
defense_max int,
attack_range string,
role_main string,
role_assist string
) partitioned by (role string) -- note: the partition column must not duplicate any column defined in the table
row format delimited
fields terminated by "\t";
- load data into the partitions
load data local inpath '/root/hivedata/archer.txt' into table t_all_hero_part partition(role='sheshou');
load data local inpath '/root/hivedata/assassin.txt' into table t_all_hero_part partition(role='cike');
load data local inpath '/root/hivedata/mage.txt' into table t_all_hero_part partition(role='fashi');
load data local inpath '/root/hivedata/support.txt' into table t_all_hero_part partition(role='fuzhu');
load data local inpath '/root/hivedata/tank.txt' into table t_all_hero_part partition(role='tanke');
load data local inpath '/root/hivedata/warrior.txt' into table t_all_hero_part partition(role='zhanshi');
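After these loads, each partition value becomes its own subdirectory under the table directory (the exact paths below are assumed from the warehouse layout used earlier in these notes), and the partitions can be listed from Hive:
-- e.g. /user/hive/warehouse/ithemia.db/t_all_hero_part/role=sheshou/archer.txt
--      /user/hive/warehouse/ithemia.db/t_all_hero_part/role=cike/assassin.txt
show partitions t_all_hero_part;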
- execute Hive SQL
-- partitioned table: only the role='sheshou' partition directory is scanned
select count(*) from t_all_hero_part where role="sheshou" and hp_max >6000;
3.8.3 Multi-level partitioned tables
A multi-level partitioned table is normally created with two partition fields.
-- there is a hierarchical (progressive) relationship between the two partition fields
create table t_user_province_city (id int, name string,age int) partitioned by (province string, city string);
- load data (static partition loading)
-- loading a two-level partitioned table: static partition loading
load data local inpath '/root/hivedata/user.txt' into table t_user_province_city
partition(province='zhejiang',city='hangzhou');
load data local inpath '/root/hivedata/user.txt' into table t_user_province_city
partition(province='zhejiang',city='ningbo');
load data local inpath '/root/hivedata/user.txt' into table t_user_province_city
partition(province='shanghai',city='pudong');
- execute a query
select * from t_user_province_city where province= "zhejiang" and city ="hangzhou";
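With two partition fields the directories nest one level deeper, and filtering on just the leading field still prunes at the province level (a hedged note; the query below is illustrative):
-- layout: .../t_user_province_city/province=zhejiang/city=hangzhou/user.txt
-- filtering on province alone scans only the province=zhejiang subtree
select * from t_user_province_city where province = 'zhejiang';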