Hive Bucketing Explained


Author: 数据萌新 | Published 2018-09-26 17:06

    In plain terms, Hive bucketing splits a table's data (or a partition's — a partition is an HDFS directory, and the actual data lives in files under that directory) across a fixed number of files. Suppose the table buck (a directory holding a file such as sz.data) contains 1,000,000 rows. When working with a large dataset, it is very convenient during query development and debugging to trial-run queries against a small slice of the data, so we can store the table as, say, 4 bucket files.
    What follows records the whole procedure from start to finish, including the problems encountered along the way.

    Connect, create the database myhive2, and switch to it:

    [root@mini1 ~]# cd apps/hive/bin
    [root@mini1 bin]# ./beeline 
    Beeline version 1.2.1 by Apache Hive
    beeline> !connect jdbc:hive2://localhost:10000
    Connecting to jdbc:hive2://localhost:10000
    Enter username for jdbc:hive2://localhost:10000: root
    Enter password for jdbc:hive2://localhost:10000: ******
    Connected to: Apache Hive (version 1.2.1)
    Driver: Hive JDBC (version 1.2.1)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://localhost:10000> show databases;
    +----------------+--+
    | database_name  |
    +----------------+--+
    | default        |
    | myhive         |
    +----------------+--+
    2 rows selected (1.795 seconds)
    0: jdbc:hive2://localhost:10000> create database myhive2;
    No rows affected (0.525 seconds)
    0: jdbc:hive2://localhost:10000> use myhive2;
    No rows affected (0.204 seconds)
    
    

    Create the bucketed table, load data, and view the table contents:

    0: jdbc:hive2://localhost:10000> create table buck(id string,name string)
    0: jdbc:hive2://localhost:10000> clustered by (id) sorted by (id) into 4 buckets
    0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
    No rows affected (0.34 seconds)
    0: jdbc:hive2://localhost:10000> desc buck;
    +-----------+------------+----------+--+
    | col_name  | data_type  | comment  |
    +-----------+------------+----------+--+
    | id        | string     |          |
    | name      | string     |          |
    +-----------+------------+----------+--+
    2 rows selected (0.55 seconds)
    0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table buck;
    INFO  : Loading data to table myhive2.buck from file:/root/sz.data
    INFO  : Table myhive2.buck stats: [numFiles=1, totalSize=91]
    No rows affected (1.411 seconds)
    0: jdbc:hive2://localhost:10000> select * from buck;
    +----------+------------+--+
    | buck.id  | buck.name  |
    +----------+------------+--+
    | 1        | zhangsan   |
    | 2        | lisi       |
    | 3        | wangwu     |
    | 4        | furong     |
    | 5        | fengjie    |
    | 6        | aaa        |
    | 7        | bbb        |
    | 8        | ccc        |
    | 9        | ddd        |
    | 10       | eee        |
    | 11       | fff        |
    | 12       | ggg        |
    +----------+------------+--+
    
    

    If the table were bucketed, the buck directory should now contain 4 files. Checking in the HDFS web UI:


    [Screenshot: HDFS web UI listing of the buck table directory]

    But it does not — the directory still holds only the single file we loaded.
    That is because bucketing is not something Hive or Hadoop does for us by automatically splitting existing files; the data has to be written out already bucketed, i.e. bucketed at load time.
    We need to enable enforced bucketing and set the number of reduce tasks to match the number of buckets (in Hive 2.x this property was removed and bucketing is always enforced):

    0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing = true;
    No rows affected (0.063 seconds)
    0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing ;
    +------------------------------+--+
    |             set              |
    +------------------------------+--+
    | hive.enforce.bucketing=true  |
    +------------------------------+--+
    1 row selected (0.067 seconds)
    0: jdbc:hive2://localhost:10000> set mapreduce.job.reduces=4;
    

    So we create another table tp and move its data into buck (select it out and insert it in), specifying bucketing on the insert. The rows will be split into 4 buckets, each sorted internally, and buck ends up bucketed — bucketing happens as the data is written, not by splitting the files afterwards.
    Next, truncate buck, create the table tp, and insert the rows queried from tp into buck:

    0: jdbc:hive2://localhost:10000> truncate table buck;
    No rows affected (0.316 seconds)
    0: jdbc:hive2://localhost:10000> create table tp(id string,name string)
    0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
    No rows affected (0.112 seconds)
    0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table tp;
    INFO  : Loading data to table myhive2.tp from file:/root/sz.data
    INFO  : Table myhive2.tp stats: [numFiles=1, totalSize=91]
    No rows affected (0.419 seconds)
    0: jdbc:hive2://localhost:10000> show tables;
    +-----------+--+
    | tab_name  |
    +-----------+--+
    | buck      |
    | tp        |
    +-----------+--+
    2 rows selected (0.128 seconds)
    0: jdbc:hive2://localhost:10000> select * from tp;
    +--------+-----------+--+
    | tp.id  |  tp.name  |
    +--------+-----------+--+
    | 1      | zhangsan  |
    | 2      | lisi      |
    | 3      | wangwu    |
    | 4      | furong    |
    | 5      | fengjie   |
    | 6      | aaa       |
    | 7      | bbb       |
    | 8      | ccc       |
    | 9      | ddd       |
    | 10     | eee       |
    | 11     | fff       |
    | 12     | ggg       |
    +--------+-----------+--+
    12 rows selected (0.243 seconds)
    0: jdbc:hive2://localhost:10000> insert into buck 
    0: jdbc:hive2://localhost:10000> select id,name from tp distribute by (id) sort by (id);
    INFO  : Number of reduce tasks determined at compile time: 4
    INFO  : In order to change the average load for a reducer (in bytes):
    INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
    INFO  : In order to limit the maximum number of reducers:
    INFO  :   set hive.exec.reducers.max=<number>
    INFO  : In order to set a constant number of reducers:
    INFO  :   set mapreduce.job.reduces=<number>
    INFO  : number of splits:1
    INFO  : Submitting tokens for job: job_1508216103995_0028
    INFO  : The url to track the job: http://mini1:8088/proxy/application_1508216103995_0028/
    INFO  : Starting Job = job_1508216103995_0028, Tracking URL = http://mini1:8088/proxy/application_1508216103995_0028/
    INFO  : Kill Command = /root/apps/hadoop-2.6.4/bin/hadoop job  -kill job_1508216103995_0028
    INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 4
    INFO  : 2017-10-19 03:57:23,631 Stage-1 map = 0%,  reduce = 0%
    INFO  : 2017-10-19 03:57:29,349 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.18 sec
    INFO  : 2017-10-19 03:57:40,096 Stage-1 map = 100%,  reduce = 25%, Cumulative CPU 2.55 sec
    INFO  : 2017-10-19 03:57:41,152 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 5.29 sec
    INFO  : 2017-10-19 03:57:42,375 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.61 sec
    INFO  : MapReduce Total cumulative CPU time: 6 seconds 610 msec
    INFO  : Ended Job = job_1508216103995_0028
    INFO  : Loading data to table myhive2.buck from hdfs://192.168.25.127:9000/user/hive/warehouse/myhive2.db/buck/.hive-staging_hive_2017-10-19_03-57-14_624_1985499545258899177-1/-ext-10000
    INFO  : Table myhive2.buck stats: [numFiles=4, numRows=12, totalSize=91, rawDataSize=79]
    No rows affected (29.238 seconds)
    0: jdbc:hive2://localhost:10000> select * from buck;
    +----------+------------+--+
    | buck.id  | buck.name  |
    +----------+------------+--+
    | 11       | fff        |
    | 4        | furong     |
    | 8        | ccc        |
    | 1        | zhangsan   |
    | 12       | ggg        |
    | 5        | fengjie    |
    | 9        | ddd        |
    | 2        | lisi       |
    | 6        | aaa        |
    | 10       | eee        |
    | 3        | wangwu     |
    | 7        | bbb        |
    +----------+------------+--+
    
    

    At this point it is clear the table has been bucketed; otherwise the ids would have come out in order 1–12. Instead, each of the 4 buckets was sorted individually, rather than globally the way an order by would sort. Let's check the web UI:


    [Screenshot: HDFS web UI showing 4 bucket files under the buck directory]

    We can see the table was indeed split into 4 buckets. Let's inspect their contents from the client (beeline can run HDFS commands directly):

    0: jdbc:hive2://localhost:10000> dfs -ls /user/hive/warehouse/myhive2.db/buck;
    +-----------------------------------------------------------------------------------------------------------+--+
    |                                                DFS Output                                                 |
    +-----------------------------------------------------------------------------------------------------------+--+
    | Found 4 items                                                                                             |
    | -rwxr-xr-x   2 root supergroup         22 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000000_0  |
    | -rwxr-xr-x   2 root supergroup         34 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000001_0  |
    | -rwxr-xr-x   2 root supergroup         13 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000002_0  |
    | -rwxr-xr-x   2 root supergroup         22 2017-10-19 03:57 /user/hive/warehouse/myhive2.db/buck/000003_0  |
    +-----------------------------------------------------------------------------------------------------------+--+
    5 rows selected (0.028 seconds)
    0: jdbc:hive2://localhost:10000> dfs -cat  /user/hive/warehouse/myhive2.db/buck/000000_0;
    +-------------+--+
    | DFS Output  |
    +-------------+--+
    | 11,fff      |
    | 4,furong    |
    | 8,ccc       |
    +-------------+--+
    3 rows selected (0.02 seconds)
    0: jdbc:hive2://localhost:10000> dfs -cat  /user/hive/warehouse/myhive2.db/buck/000001_0;
    +-------------+--+
    | DFS Output  |
    +-------------+--+
    | 1,zhangsan  |
    | 12,ggg      |
    | 5,fengjie   |
    | 9,ddd       |
    +-------------+--+
    4 rows selected (0.08 seconds)
    0: jdbc:hive2://localhost:10000> dfs -cat  /user/hive/warehouse/myhive2.db/buck/000002_0;
    +-------------+--+
    | DFS Output  |
    +-------------+--+
    | 2,lisi      |
    | 6,aaa       |
    +-------------+--+
    2 rows selected (0.088 seconds)
    0: jdbc:hive2://localhost:10000> dfs -cat  /user/hive/warehouse/myhive2.db/buck/000003_0;
    +-------------+--+
    | DFS Output  |
    +-------------+--+
    | 10,eee      |
    | 3,wangwu    |
    | 7,bbb       |
    +-------------+--+
    3 rows selected (0.062 seconds)
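The bucket assignment and per-bucket ordering seen above can be reproduced outside Hive. Below is a minimal Python sketch, assuming Hive's default behavior for string keys — Java's String.hashCode, then (hash & Integer.MAX_VALUE) % numBuckets, as in Hive's ObjectInspectorUtils:

```python
def java_string_hashcode(s):
    """Java's String.hashCode: h = s[0]*31^(n-1) + ... + s[n-1], in 32-bit signed arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_of(key, num_buckets=4):
    # Hive: (hashCode & Integer.MAX_VALUE) % numberOfBuckets
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_buckets

rows = [("1", "zhangsan"), ("2", "lisi"), ("3", "wangwu"), ("4", "furong"),
        ("5", "fengjie"), ("6", "aaa"), ("7", "bbb"), ("8", "ccc"),
        ("9", "ddd"), ("10", "eee"), ("11", "fff"), ("12", "ggg")]

# distribute by (id): route each row to its bucket
buckets = [[] for _ in range(4)]
for row in rows:
    buckets[bucket_of(row[0])].append(row)

# sort by (id): sort within each bucket; ids are strings, so "11" < "4"
for b in buckets:
    b.sort(key=lambda r: r[0])

for i, b in enumerate(buckets):
    print(f"00000{i}_0:", [f"{id_},{name}" for id_, name in b])
```

Running this yields 000000_0 = [11,fff, 4,furong, 8,ccc] and so on, matching the dfs -cat output above, including the seemingly odd string ordering within each file.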
    

    Note: in select id,name from tp distribute by (id) sort by (id), the clause distribute by (id) tells Hive to bucket the rows by id (hash-partitioning on id), and sort by (id) sorts within each bucket, ascending by default. When both clauses use the same column, cluster by (id) can be used instead; that is, the statement can be written as

    insert into buck select id,name from tp cluster by (id);
    

    with the same effect.

    What bucketing is for

    Consider the following statement:

    select a.id,a.name,b.addr from a join b on a.id = b.id;
    

    If tables a and b are both already bucketed, and both are bucketed on the id column, then this join no longer requires a full cross product over the two tables: matching ids can only appear in corresponding bucket pairs.
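Why this helps can be sketched in Python: with both tables bucketed on id into the same number of buckets, equal ids are guaranteed to land in same-numbered bucket files, so the join only compares rows within corresponding bucket pairs. The sample rows for a and b below are hypothetical:

```python
NUM_BUCKETS = 4

def java_string_hashcode(s):
    # Java's String.hashCode in 32-bit signed arithmetic
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_of(key):
    return (java_string_hashcode(key) & 0x7FFFFFFF) % NUM_BUCKETS

def bucketize(rows):
    # Route each row to its bucket file by the hash of the first column (id)
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    return buckets

def bucket_join(a_rows, b_rows):
    # Compare only same-numbered buckets, never the full cross product
    out = []
    for a_bucket, b_bucket in zip(bucketize(a_rows), bucketize(b_rows)):
        for a_id, a_name in a_bucket:
            for b_id, b_addr in b_bucket:
                if a_id == b_id:
                    out.append((a_id, a_name, b_addr))
    return out

a = [("1", "zhangsan"), ("2", "lisi"), ("3", "wangwu")]  # hypothetical a(id, name)
b = [("1", "beijing"), ("3", "shanghai")]                # hypothetical b(id, addr)
print(bucket_join(a, b))
```

In Hive this optimization is the bucket map join, enabled with set hive.optimize.bucketmapjoin=true; it requires the two tables' bucket counts to be equal or multiples of each other.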

    Original article: https://www.haomeiwen.com/subject/vpljoftx.html