I. Solving top-N
1. Load the data
hive (default)> create table rating_json(json string);
hive (default)> load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json;
hive (default)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
2. Parse the JSON with json_tuple
hive (default)> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json limit 10;
OK
movie_id rate time user_id
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
Time taken: 0.041 seconds, Fetched: 10 row(s)
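json_tuple parses each JSON string once and returns all requested fields; the same result can be produced with get_json_object, which is invoked once per field and therefore parses the JSON repeatedly. An equivalent sketch for comparison:

-- Equivalent query with get_json_object: one UDF call (and one JSON parse) per field,
-- which is why json_tuple is preferred when several fields are extracted
select
  get_json_object(json, '$.movie')  as movie_id,
  get_json_object(json, '$.rate')   as rate,
  get_json_object(json, '$.time')   as time,
  get_json_object(json, '$.userid') as user_id
from rating_json
limit 10;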
3. Build a wide table
In production, the time field is typically expanded further into timestamp/year/month/day/etc. columns, producing a wide table that downstream queries can reuse.
hive (default)> create table rating_width as
> select
> movie_id,rate,time,user_id,
> year(from_unixtime(cast(time as bigint))) as year,
> month(from_unixtime(cast(time as bigint))) as month,
> day(from_unixtime(cast(time as bigint))) as day,
> hour(from_unixtime(cast(time as bigint))) as hour,
> minute(from_unixtime(cast(time as bigint))) as minute,
> from_unixtime(cast(time as bigint)) as ts
> from
> (
> select
> json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)
> from rating_json
> ) tmp
> ;
hive (default)> select * from rating_width limit 10;
OK
rating_width.movie_id rating_width.rate rating_width.time rating_width.user_id rating_width.year rating_width.month rating_width.day rating_width.hour rating_width.minute rating_width.ts
1193 5 978300760 1 2001 1 1 6 12 2001-01-01 06:12:40
661 3 978302109 1 2001 1 1 6 35 2001-01-01 06:35:09
914 3 978301968 1 2001 1 1 6 32 2001-01-01 06:32:48
3408 4 978300275 1 2001 1 1 6 4 2001-01-01 06:04:35
2355 5 978824291 1 2001 1 7 7 38 2001-01-07 07:38:11
1197 3 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
1287 5 978302039 1 2001 1 1 6 33 2001-01-01 06:33:59
2804 5 978300719 1 2001 1 1 6 11 2001-01-01 06:11:59
594 4 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
919 4 978301368 1 2001 1 1 6 22 2001-01-01 06:22:48
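The CTAS above repeats from_unixtime(cast(time as bigint)) for every derived column. A variant that converts the timestamp once in an inner query reads more cleanly; a minimal sketch (the table name rating_width2 is hypothetical, used only to avoid clashing with the table already created):

-- Compute ts once in the inner query, then derive the date parts from it
create table rating_width2 as
select
  movie_id, rate, time, user_id,
  year(ts) as year, month(ts) as month, day(ts) as day,
  hour(ts) as hour, minute(ts) as minute, ts
from
(
  select t.*, from_unixtime(cast(time as bigint)) as ts
  from
  (
    select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)
    from rating_json
  ) t
) tmp;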
4. Compute the top-N
hive (default)> select movie_id,rate,time,user_id,rank
> from
> (
> select movie_id,rate,time,user_id,row_number() over(partition by user_id order by rate desc) as rank from rating_width
> ) t
> where rank<=3
> limit 10;
movie_id rate time user_id rank
1035 5 978301753 1 1
1 5 978824268 1 2
2028 5 978301619 1 3
3591 5 978228525 10 1
3809 5 978228451 10 2
954 5 978225922 10 3
800 5 977593915 100 1
527 5 977594839 100 2
919 5 977594947 100 3
2571 5 975041210 1000 1
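Note that rate came out of json_tuple as a string; the single-digit ratings in this data set happen to sort correctly as strings, but an explicit cast is safer. A hedged variant (rn replaces the rank alias to avoid shadowing the rank() function name):

-- Top 3 per user with a numeric sort on rate;
-- row_number() breaks ties arbitrarily, dense_rank() would let tied ratings share a position
select movie_id, rate, time, user_id, rn
from
(
  select movie_id, rate, time, user_id,
         row_number() over(partition by user_id order by cast(rate as int) desc) as rn
  from rating_width
) t
where rn <= 3;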
II. Change the HS2 port and connect with Beeline
[hadoop@Hadoop001 bin]$ pwd
/home/hadoop/app/hive/bin
# Start HiveServer2
[hadoop@Hadoop001 bin]$ nohup hiveserver2 >> /home/hadoop/app/hive/logs/hs2.log 2>&1 &
# Connect with Beeline; the default port is 10000
[hadoop@Hadoop001 bin]$ ./beeline -u jdbc:hive2://Hadoop001:10000/default -n hadoop
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/home/hadoop/app/protobuf/bin:/home/hadoop/app/maven/bin:/usr/java/jdk1.8.0_45/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
scan complete in 1ms
Connecting to jdbc:hive2://Hadoop001:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
0: jdbc:hive2://Hadoop001:10000/default> show tables;
INFO : Compiling command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3): show tables
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3); Time taken: 0.398 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3): show tables
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3); Time taken: 0.037 seconds
INFO : OK
+--------------------------+--+
| tab_name |
+--------------------------+--+
| genome_scores |
| hello |
| hello2 |
| hive_array |
| hive_map |
| hive_rownumber |
| hive_struct |
| order_external_partiton |
| order_mulit_partiton |
| order_partiton |
| rating_json |
| rating_width |
| word |
+--------------------------+--+
13 rows selected (0.594 seconds)
To change the default HS2 port, add the following property to hive-site.xml and set <value> to the desired port (10000, shown below, is the default):
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
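Restart HS2 after editing hive-site.xml for the new port to take effect. The port can also be overridden at startup without touching the config file; a sketch using 10010 as an example port:

# Override the Thrift port at startup (10010 is just an example value)
[hadoop@Hadoop001 bin]$ nohup hiveserver2 --hiveconf hive.server2.thrift.port=10010 >> /home/hadoop/app/hive/logs/hs2.log 2>&1 &
# Point Beeline at the new port
[hadoop@Hadoop001 bin]$ ./beeline -u jdbc:hive2://Hadoop001:10010/default -n hadoop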
III. Refreshing partition data efficiently
Partitions can be refreshed with the command MSCK REPAIR TABLE table_name;. When many partitions are missing from the metastore, run the repair in batches to avoid an OOM: the hive.msck.repair.batch.size setting makes MSCK REPAIR TABLE process partitions in internal batches. Its default value is 0, meaning all partitions are handled at once.
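A minimal sketch, assuming a Hive version that supports this setting (order_partiton is one of the partitioned tables listed above; 100 is an arbitrary batch size):

-- Repair partitions in batches of 100 to avoid OOM on a table with many missing partitions
set hive.msck.repair.batch.size=100;
msck repair table order_partiton;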