一.需求
1.1 数据准备
-- 表
drop table if exists gcable_look_offline;
CREATE TABLE gcable_look_offline(
client_type string COMMENT '终端类型',
client_id string COMMENT '终端号',
region_code string COMMENT '终端区域码',
network_id string COMMENT '网络Id',
service_param array<string> COMMENT '业务服务参数')
PARTITIONED BY (stadate int)
row format delimited fields terminated by '|'
collection items terminated by "\^"
-- 数据
08|109001212201363151|3060||2021-09-13T08:58:45Z^04^3001^茂名公共^3001^3001^3001^0^1
08|109001212201363151|3060||2021-09-13T08:58:45Z^04^1715^卫生健康^1715^1715^1715^0^1
08|109001210401297979|3201||2021-09-13T08:58:45Z^01^com.suma.midware.dmp.play.PlayActivity^直播全屏^com.suma.midware.ihome.third.Main3DActivity^首页
08|279001205200190933|3201||2021-09-13T08:58:47Z^01^com.suma.midware.ihome.third.Main3DActivity^首页^^
08|299001204200473398|3201||2021-09-13T08:58:45Z^03^GDZX2220210107121454^中国兄弟连12^11003^GDZX0920190517000004_CPBQ00006^07^2194^1^2813^1^0^0
08|299001204200468045|3220||2021-09-13T08:58:45Z^09^00700100^01^01^file:///data/adv/pic/1602815006348.jpg^null
-- shell 录入数据
v_date=20211122
filename=/root/20211123.txt
hive -e "load data local inpath '${filename}' overwrite into table test.gcable_look_offline partition(stadate = ${v_date})"
查看数据:
> select * from gcable_look_offline;
OK
08 109001212201363151 3060 ["2021-09-13T08:58:45Z","04","3001","茂名公共","3001","3001","3001","0","1"] 20211122
08 109001212201363151 3060 ["2021-09-13T08:58:45Z","04","1715","卫生健康","1715","1715","1715","0","1"] 20211122
08 109001210401297979 3201 ["2021-09-13T08:58:45Z","01","com.suma.midware.dmp.play.PlayActivity","直播全屏","com.suma.midware.ihome.third.Main3DActivity","首页"] 20211122
08 279001205200190933 3201 ["2021-09-13T08:58:47Z","01","com.suma.midware.ihome.third.Main3DActivity","首页","",""] 20211122
08 299001204200473398 3201 ["2021-09-13T08:58:45Z","03","GDZX2220210107121454","中国兄弟连12","11003","GDZX0920190517000004_CPBQ00006","07","2194","1","2813","1","0","0"] 20211122
08 299001204200468045 3220 ["2021-09-13T08:58:45Z","09","00700100","01","01","file:///data/adv/pic/1602815006348.jpg","null"] 20211122
Time taken: 0.841 seconds, Fetched: 6 row(s)
hive>
1.2 需求
需要将service_param这个复合类型的array第一列和第二列单独拿出来作为表新的列。
二.解决方案
Hive提供了explode、posexplode两个UDTF来协助处理array这样的复合类型。
-- 不包含pos
select region_code,new_param from gcable_look_offline lateral view explode(service_param) gcable_look_offline as new_param;
-- 包含pos
select region_code,pos,new_param from gcable_look_offline lateral view posexplode(service_param) gcable_look_offline as pos,new_param;
测试记录:
hive> use test;
OK
Time taken: 1.17 seconds
hive> select region_code,new_param from gcable_look_offline lateral view explode(service_param) gcable_look_offline as new_param;
Query ID = root_20211123165539_5b007f70-0826-495a-8b84-0702ae9c0ead
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1637634703570_0005, Tracking URL = http://hp1:8088/proxy/application_1637634703570_0005/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1637634703570_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-11-23 16:55:49,502 Stage-1 map = 0%, reduce = 0%
2021-11-23 16:55:54,726 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.83 sec
MapReduce Total cumulative CPU time: 1 seconds 830 msec
Ended Job = job_1637634703570_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.83 sec HDFS Read: 6923 HDFS Write: 1530 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
3060 2021-09-13T08:58:45Z
3060 04
3060 3001
3060 茂名公共
3060 3001
3060 3001
3060 3001
3060 0
3060 1
3060 2021-09-13T08:58:45Z
3060 04
3060 1715
3060 卫生健康
3060 1715
3060 1715
3060 1715
3060 0
3060 1
3201 2021-09-13T08:58:45Z
3201 01
3201 com.suma.midware.dmp.play.PlayActivity
3201 直播全屏
3201 com.suma.midware.ihome.third.Main3DActivity
3201 首页
3201 2021-09-13T08:58:47Z
3201 01
3201 com.suma.midware.ihome.third.Main3DActivity
3201 首页
3201
3201
3201 2021-09-13T08:58:45Z
3201 03
3201 GDZX2220210107121454
3201 中国兄弟连12
3201 11003
3201 GDZX0920190517000004_CPBQ00006
3201 07
3201 2194
3201 1
3201 2813
3201 1
3201 0
3201 0
3220 2021-09-13T08:58:45Z
3220 09
3220 00700100
3220 01
3220 01
3220 file:///data/adv/pic/1602815006348.jpg
3220 null
Time taken: 17.069 seconds, Fetched: 50 row(s)
hive> select region_code,pos,new_param from gcable_look_offline lateral view posexplode(service_param) gcable_look_offline as pos,new_param;
Query ID = root_20211123165607_ab382c18-7409-4527-a0b4-2eab00e0f348
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1637634703570_0006, Tracking URL = http://hp1:8088/proxy/application_1637634703570_0006/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1637634703570_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-11-23 16:56:14,368 Stage-1 map = 0%, reduce = 0%
2021-11-23 16:56:20,564 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.84 sec
MapReduce Total cumulative CPU time: 1 seconds 840 msec
Ended Job = job_1637634703570_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.84 sec HDFS Read: 7075 HDFS Write: 1633 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 840 msec
OK
3060 0 2021-09-13T08:58:45Z
3060 1 04
3060 2 3001
3060 3 茂名公共
3060 4 3001
3060 5 3001
3060 6 3001
3060 7 0
3060 8 1
3060 0 2021-09-13T08:58:45Z
3060 1 04
3060 2 1715
3060 3 卫生健康
3060 4 1715
3060 5 1715
3060 6 1715
3060 7 0
3060 8 1
3201 0 2021-09-13T08:58:45Z
3201 1 01
3201 2 com.suma.midware.dmp.play.PlayActivity
3201 3 直播全屏
3201 4 com.suma.midware.ihome.third.Main3DActivity
3201 5 首页
3201 0 2021-09-13T08:58:47Z
3201 1 01
3201 2 com.suma.midware.ihome.third.Main3DActivity
3201 3 首页
3201 4
3201 5
3201 0 2021-09-13T08:58:45Z
3201 1 03
3201 2 GDZX2220210107121454
3201 3 中国兄弟连12
3201 4 11003
3201 5 GDZX0920190517000004_CPBQ00006
3201 6 07
3201 7 2194
3201 8 1
3201 9 2813
3201 10 1
3201 11 0
3201 12 0
3220 0 2021-09-13T08:58:45Z
3220 1 09
3220 2 00700100
3220 3 01
3220 4 01
3220 5 file:///data/adv/pic/1602815006348.jpg
3220 6 null
Time taken: 14.502 seconds, Fetched: 50 row(s)
hive>
网友评论