Hive: Solving top-N / Changing the HS2 Port / Efficiently Refreshing Partition Data

Author: 吃货大米饭 | Published 2019-07-21 17:39

I. Solving top-N

1. Loading the data

hive (default)> create table rating_json(json string);
hive (default)> load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json;
hive (default)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
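
Each line in rating.json is one flat JSON object. As a quick sanity check outside Hive, a single record can be inspected with Python's standard json module (an illustrative sketch, not part of the Hive workflow):

```python
import json

# One raw line copied from the sample output above
line = '{"movie":"1193","rate":"5","time":"978300760","userid":"1"}'

record = json.loads(line)
print(record["movie"], record["rate"], record["time"], record["userid"])
# → 1193 5 978300760 1
```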

2. Parsing the JSON with the json_tuple function

hive (default)> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json limit 10;
OK
movie_id        rate    time    user_id
1193    5       978300760       1
661     3       978302109       1
914     3       978301968       1
3408    4       978300275       1
2355    5       978824291       1
1197    3       978302268       1
1287    5       978302039       1
2804    5       978300719       1
594     4       978302268       1
919     4       978301368       1
Time taken: 0.041 seconds, Fetched: 10 row(s)
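
json_tuple extracts several fields from a JSON string in a single parse, which is why it is preferred over calling get_json_object once per field. The same "one parse, many fields" idea can be sketched in Python (the json_tuple helper below is a hypothetical analogue, not Hive's implementation):

```python
import json
from operator import itemgetter

def json_tuple(s, *keys):
    # Parse the JSON string once and pull out all requested fields,
    # mirroring the behavior of Hive's json_tuple UDTF (illustrative only).
    return itemgetter(*keys)(json.loads(s))

row = '{"movie":"661","rate":"3","time":"978302109","userid":"1"}'
print(json_tuple(row, "movie", "rate", "time", "userid"))
# → ('661', '3', '978302109', '1')
```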

3. Building a wide table

In production, the time field is usually processed further into a timestamp plus year/month/day/hour/minute columns, producing a wide table for later use.

hive (default)> create table rating_width as
              > select
              > movie_id,rate,time,user_id,
              > year(from_unixtime(cast(time as bigint))) as year,
              > month(from_unixtime(cast(time as bigint))) as month,
              > day(from_unixtime(cast(time as bigint))) as day,
              > hour(from_unixtime(cast(time as bigint))) as hour,
              > minute(from_unixtime(cast(time as bigint))) as minute,
              > from_unixtime(cast(time as bigint)) as ts
              > from
              > (
              > select 
              > json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) 
              > from rating_json
              > ) tmp
              > ;
hive (default)> select * from rating_width limit 10;
OK
rating_width.movie_id   rating_width.rate       rating_width.time       rating_width.user_id    rating_width.year       rating_width.month   rating_width.day        rating_width.hour       rating_width.minute     rating_width.ts
1193    5       978300760       1       2001    1       1       6       12      2001-01-01 06:12:40
661     3       978302109       1       2001    1       1       6       35      2001-01-01 06:35:09
914     3       978301968       1       2001    1       1       6       32      2001-01-01 06:32:48
3408    4       978300275       1       2001    1       1       6       4       2001-01-01 06:04:35
2355    5       978824291       1       2001    1       7       7       38      2001-01-07 07:38:11
1197    3       978302268       1       2001    1       1       6       37      2001-01-01 06:37:48
1287    5       978302039       1       2001    1       1       6       33      2001-01-01 06:33:59
2804    5       978300719       1       2001    1       1       6       11      2001-01-01 06:11:59
594     4       978302268       1       2001    1       1       6       37      2001-01-01 06:37:48
919     4       978301368       1       2001    1       1       6       22      2001-01-01 06:22:48
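
The derived columns come from from_unixtime, which renders epoch seconds in the server's local time zone; the output above is consistent with UTC+8 (an assumption based on the rendered timestamps). The first row's conversion can be reproduced in Python as a sketch:

```python
from datetime import datetime, timezone, timedelta

# Assumption: the Hive server's local zone is UTC+8, which matches
# the ts column shown above for epoch second 978300760.
tz = timezone(timedelta(hours=8))
ts = datetime.fromtimestamp(978300760, tz=tz)
print(ts.strftime("%Y-%m-%d %H:%M:%S"))              # → 2001-01-01 06:12:40
print(ts.year, ts.month, ts.day, ts.hour, ts.minute)  # → 2001 1 1 6 12
```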

4. Computing the top-N

hive (default)> select movie_id,rate,time,user_id,rank 
              > from
              > (
              > select movie_id,rate,time,user_id,row_number() over(partition by user_id order by rate desc)  as rank  from rating_width 
              > ) t
              > where rank<=3
              > limit 10;
movie_id        rate    time    user_id rank
1035    5       978301753       1       1
1       5       978824268       1       2
2028    5       978301619       1       3
3591    5       978228525       10      1
3809    5       978228451       10      2
954     5       978225922       10      3
800     5       977593915       100     1
527     5       977594839       100     2
919     5       977594947       100     3
2571    5       975041210       1000    1
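
row_number() over (partition by user_id order by rate desc) numbers each user's ratings from highest to lowest, and the outer filter rank<=3 keeps the per-user top 3. The same logic can be sketched in Python with toy data (illustrative rows, not the real dataset):

```python
from itertools import groupby

# Toy rows: (movie_id, rate, user_id) -- hypothetical sample
rows = [
    ("1035", 5, "1"), ("1", 5, "1"), ("2028", 5, "1"), ("661", 3, "1"),
    ("3591", 5, "10"), ("3809", 4, "10"),
]

def top_n(rows, n=3):
    # Mimic row_number() over (partition by user_id order by rate desc):
    # sort by (user, -rate), then number rows within each user group.
    out = []
    keyed = sorted(rows, key=lambda r: (r[2], -r[1]))
    for user, grp in groupby(keyed, key=lambda r: r[2]):
        for rank, (movie, rate, uid) in enumerate(grp, start=1):
            if rank <= n:
                out.append((movie, rate, uid, rank))
    return out

print(top_n(rows))
# → [('1035', 5, '1', 1), ('1', 5, '1', 2), ('2028', 5, '1', 3),
#    ('3591', 5, '10', 1), ('3809', 4, '10', 2)]
```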

II. Changing the HS2 port and connecting with Beeline

[hadoop@Hadoop001 bin]$ pwd
/home/hadoop/app/hive/bin
# start HiveServer2
[hadoop@Hadoop001 bin]$ nohup hiveserver2 >> /home/hadoop/app/hive/logs/hs2.log 2>&1 &
# connect with Beeline; the default port is 10000
[hadoop@Hadoop001 bin]$ ./beeline -u jdbc:hive2://Hadoop001:10000/default -n hadoop  
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/home/hadoop/app/protobuf/bin:/home/hadoop/app/maven/bin:/usr/java/jdk1.8.0_45/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
scan complete in 1ms
Connecting to jdbc:hive2://Hadoop001:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
0: jdbc:hive2://Hadoop001:10000/default> show tables;
INFO  : Compiling command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3): show tables
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3); Time taken: 0.398 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3): show tables
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3); Time taken: 0.037 seconds
INFO  : OK
+--------------------------+--+
|         tab_name         |
+--------------------------+--+
| genome_scores            |
| hello                    |
| hello2                   |
| hive_array               |
| hive_map                 |
| hive_rownumber           |
| hive_struct              |
| order_external_partiton  |
| order_mulit_partiton     |
| order_partiton           |
| rating_json              |
| rating_width             |
| word                     |
+--------------------------+--+
13 rows selected (0.594 seconds)

To change the default HS2 port (10000), add the following property to hive-site.xml (10001 here is just an example value), then restart HiveServer2 and reconnect with the new port in the JDBC URL:

<property>
  <name>hive.server2.thrift.port</name>
  <value>10001</value>
</property>

III. Efficiently refreshing partition data

Partitions can be refreshed with MSCK REPAIR TABLE table_name;.
If many partitions are missing, run the repair in batches to avoid an OOM: the setting hive.msck.repair.batch.size makes MSCK add partitions internally in batches of that size. The default is 0, which processes all missing partitions in a single operation.
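
The batching behind hive.msck.repair.batch.size is plain chunking of the missing-partition list. A sketch of that idea (the partition names are hypothetical):

```python
def batches(items, batch_size):
    # batch_size == 0 mirrors hive.msck.repair.batch.size=0:
    # everything is processed in one shot.
    if batch_size <= 0:
        yield items
        return
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

missing = [f"dt=2019-07-{d:02d}" for d in range(1, 8)]  # hypothetical partitions
print([len(b) for b in batches(missing, 3)])  # → [3, 3, 1]
print([len(b) for b in batches(missing, 0)])  # → [7]
```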


Original article: https://www.haomeiwen.com/subject/qpcvlctx.html