
Hive: Solving topN / Changing the HS2 Port / Efficiently Refreshing Partition Data

Author: 吃货大米饭 | Published 2019-07-21 17:39

    I. Solving topN

    1. Loading the data

    hive (default)> create table rating_json(json string);
    hive (default)> load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json;
    hive (default)> select * from rating_json limit 10;
    OK
    rating_json.json
    {"movie":"1193","rate":"5","time":"978300760","userid":"1"}
    {"movie":"661","rate":"3","time":"978302109","userid":"1"}
    {"movie":"914","rate":"3","time":"978301968","userid":"1"}
    {"movie":"3408","rate":"4","time":"978300275","userid":"1"}
    {"movie":"2355","rate":"5","time":"978824291","userid":"1"}
    {"movie":"1197","rate":"3","time":"978302268","userid":"1"}
    {"movie":"1287","rate":"5","time":"978302039","userid":"1"}
    {"movie":"2804","rate":"5","time":"978300719","userid":"1"}
    {"movie":"594","rate":"4","time":"978302268","userid":"1"}
    {"movie":"919","rate":"4","time":"978301368","userid":"1"}
    

    2. Parsing the JSON with the json_tuple function

    hive (default)> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json limit 10;
    OK
    movie_id        rate    time    user_id
    1193    5       978300760       1
    661     3       978302109       1
    914     3       978301968       1
    3408    4       978300275       1
    2355    5       978824291       1
    1197    3       978302268       1
    1287    5       978302039       1
    2804    5       978300719       1
    594     4       978302268       1
    919     4       978301368       1
    Time taken: 0.041 seconds, Fetched: 10 row(s)
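
    For comparison, get_json_object extracts one field per call using a JSONPath expression such as '$.movie'; json_tuple is preferred here because it parses each JSON string only once per row. A minimal sketch against the same rating_json table:

    hive (default)> select
                  > get_json_object(json,'$.movie') as movie_id,
                  > get_json_object(json,'$.rate') as rate
                  > from rating_json limit 3;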
    

    3. Building a wide table
    In production, the time field is usually processed further into a readable timestamp plus year/month/day/hour/minute fields, producing a wide table for downstream use.

    hive (default)> create table rating_width as
                  > select
                  > movie_id,rate,time,user_id,
                  > year(from_unixtime(cast(time as bigint))) as year,
                  > month(from_unixtime(cast(time as bigint))) as month,
                  > day(from_unixtime(cast(time as bigint))) as day,
                  > hour(from_unixtime(cast(time as bigint))) as hour,
                  > minute(from_unixtime(cast(time as bigint))) as minute,
                  > from_unixtime(cast(time as bigint)) as ts
                  > from
                  > (
                  > select 
                  > json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) 
                  > from rating_json
                  > ) tmp
                  > ;
    hive (default)> select * from rating_width limit 10;
    OK
    rating_width.movie_id   rating_width.rate       rating_width.time       rating_width.user_id    rating_width.year       rating_width.month   rating_width.day        rating_width.hour       rating_width.minute     rating_width.ts
    1193    5       978300760       1       2001    1       1       6       12      2001-01-01 06:12:40
    661     3       978302109       1       2001    1       1       6       35      2001-01-01 06:35:09
    914     3       978301968       1       2001    1       1       6       32      2001-01-01 06:32:48
    3408    4       978300275       1       2001    1       1       6       4       2001-01-01 06:04:35
    2355    5       978824291       1       2001    1       7       7       38      2001-01-07 07:38:11
    1197    3       978302268       1       2001    1       1       6       37      2001-01-01 06:37:48
    1287    5       978302039       1       2001    1       1       6       33      2001-01-01 06:33:59
    2804    5       978300719       1       2001    1       1       6       11      2001-01-01 06:11:59
    594     4       978302268       1       2001    1       1       6       37      2001-01-01 06:37:48
    919     4       978301368       1       2001    1       1       6       22      2001-01-01 06:22:48
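
    Note that json_tuple returns every field as a string, so rate in rating_width is a string too. The single-digit ratings here happen to sort correctly even as strings, but an explicit cast is safer if the rating scale could ever exceed one digit; a minimal check against the wide table:

    hive (default)> -- cast rate to int so ORDER BY compares numerically, not lexicographically
                  > select movie_id, cast(rate as int) as rate
                  > from rating_width
                  > order by rate desc
                  > limit 3;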
    

    4. Computing the topN

    hive (default)> select movie_id,rate,time,user_id,rank 
                  > from
                  > (
                  > select movie_id,rate,time,user_id,row_number() over(partition by user_id order by rate desc)  as rank  from rating_width 
                  > ) t
                  > where rank<=3
                  > limit 10;
    movie_id        rate    time    user_id rank
    1035    5       978301753       1       1
    1       5       978824268       1       2
    2028    5       978301619       1       3
    3591    5       978228525       10      1
    3809    5       978228451       10      2
    954     5       978225922       10      3
    800     5       977593915       100     1
    527     5       977594839       100     2
    919     5       977594947       100     3
    2571    5       975041210       1000    1
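
    row_number() breaks ties arbitrarily (many movies share a rating of 5), so which three rows each user gets is not deterministic. If tied ratings should share a rank, rank() or dense_rank() can be swapped in; a sketch with dense_rank() (note it can return more than three rows per user when ties exist):

    hive (default)> select movie_id,rate,user_id,rk
                  > from
                  > (
                  > select movie_id,rate,user_id,
                  > dense_rank() over(partition by user_id order by rate desc) as rk
                  > from rating_width
                  > ) t
                  > where rk<=3
                  > limit 10;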
    

    II. Changing the HS2 port and connecting with beeline

    [hadoop@Hadoop001 bin]$ pwd
    /home/hadoop/app/hive/bin
    # start hs2
    [hadoop@Hadoop001 bin]$ nohup hiveserver2 >> /home/hadoop/app/hive/logs/hs2.log 2>&1 &
    # connect with beeline; the default port is 10000
    [hadoop@Hadoop001 bin]$ ./beeline -u jdbc:hive2://Hadoop001:10000/default -n hadoop  
    which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/home/hadoop/app/protobuf/bin:/home/hadoop/app/maven/bin:/usr/java/jdk1.8.0_45/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)
    scan complete in 1ms
    Connecting to jdbc:hive2://Hadoop001:10000/default
    Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
    Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    Beeline version 1.1.0-cdh5.7.0 by Apache Hive
    0: jdbc:hive2://Hadoop001:10000/default> show tables;
    INFO  : Compiling command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3): show tables
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
    INFO  : Completed compiling command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3); Time taken: 0.398 seconds
    INFO  : Concurrency mode is disabled, not creating a lock manager
    INFO  : Executing command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3): show tables
    INFO  : Starting task [Stage-0:DDL] in serial mode
    INFO  : Completed executing command(queryId=hadoop_20190721020707_334c85ad-2e18-4c3e-80c1-86623a6c6ef3); Time taken: 0.037 seconds
    INFO  : OK
    +--------------------------+--+
    |         tab_name         |
    +--------------------------+--+
    | genome_scores            |
    | hello                    |
    | hello2                   |
    | hive_array               |
    | hive_map                 |
    | hive_rownumber           |
    | hive_struct              |
    | order_external_partiton  |
    | order_mulit_partiton     |
    | order_partiton           |
    | rating_json              |
    | rating_width             |
    | word                     |
    +--------------------------+--+
    13 rows selected (0.594 seconds)
    

    To change the default HS2 port, add the following property to hive-site.xml, setting the value to the port you want (10000, shown below, is the default):

    <property>
      <name>hive.server2.thrift.port</name>
      <value>10000</value>
    </property>
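
    HS2 must be restarted for the new port to take effect; the port can also be overridden at startup with --hiveconf hive.server2.thrift.port=<port>. A sketch of reconnecting, assuming the port was changed to 10086 (an arbitrary example value):

    # restart hs2, then point beeline at the new port
    [hadoop@Hadoop001 bin]$ nohup hiveserver2 >> /home/hadoop/app/hive/logs/hs2.log 2>&1 &
    [hadoop@Hadoop001 bin]$ ./beeline -u jdbc:hive2://Hadoop001:10086/default -n hadoop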
    

    III. Refreshing partition data efficiently

    When partition directories are written to HDFS directly (bypassing Hive), the metastore does not know about them; running MSCK REPAIR TABLE table_name; scans the table location and registers the missing partitions.
    If many partitions are missing, it is recommended to run the repair in batches to avoid OOM. The configuration hive.msck.repair.batch.size controls the internal batch size; its default value is 0, which means all missing partitions are added in one go.
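
    A minimal sketch, assuming a Hive version that supports this setting and that order_partiton (one of the tables listed above) is a partitioned table whose partition directories were written straight to HDFS:

    hive (default)> -- repair in batches of 100 partitions instead of all at once, to limit memory use
    hive (default)> set hive.msck.repair.batch.size=100;
    hive (default)> msck repair table order_partiton;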
