一实现wordcount
1.装入数据.pnghello world hive
hadoop hello hive
hello
world
2.把数据装入hdfs.png
创建表
因为先有数据再有表
所以创建外部表
不知一行有几行单词,把整行当成一个列来存
create external table wc(
line string
);
location '/usr/'
3.创建外部表.png
创建结果表
create table wc_result(
word string,
ct int
);
4查看数据结构1.png
查看是数组结构
5查看数据结构2.png
再切割
from (select explode(split(line,' ')) word(这里给别名) from wc) t1
insert into wc_result
select t1.word,count(t1.word) group by t1.word;
6执行MR
7.结果.png
基站掉话率
基站掉话率,找出掉线率最高的前10名
1.数据格式record_time,imei,cell,ph_num,call_num,drop_num,duration,drop_rate,net_type,erl
2011-07-13 00:00:00+08,356966,29448-37062,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,352024,29448-51331,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,353736,29448-51331,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,353736,29448-51333,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,351545,29448-51333,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,353736,29448-51343,1,0,0,8,0,G,0
2011-07-13 00:00:00+08,359681,29448-51462,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,354707,29448-51462,0,0,0,0,0,G,0
2011-07-13 00:00:00+08,356137,29448-51470,0,0,0,0,0,G,0
排名需求
record time:通话时间
imei:基站编号
cell:手机编号
drop_num:掉话的描述
duration:通话持续总秒数
建表
create table call_monitor (
record_time string,
imei string,
cell string,
ph_num string,
call_num string,
drop_num int,
duration int,
drop_rate double,
net_type string,
erl string
)
row format delimited fields terminated by ',';
2.装上数据文件
3.创建表并加载数据
4查看前几行数据,结果正常(出现null是因为结果类型不匹配,设置的是int)
create table call_result(
imei string,
drop_num int,
duration int,
drop_rate double
)
查询
5执行却报错.png出错原因在于sdura是别名,却在此写sdrop/sdura。
from call_monitor cm
insert into call_result
select cm.imei,sum(cm.drop_num) sdrop,
sum(cm.duration) sdura,
sum(cm.drop_num)/sum(cm.duration) s_rate group by cm.imei order by s_rate desc;
6运行.png
7,结果.png
网友评论