Installing Hive on CentOS 7.2, with Hive Usage Notes

Author: 溯水心生 | Published 2018-01-16 13:57

    Hive, open-sourced by Facebook, is a data warehouse built on top of Hadoop. It maps structured data onto tables and supports SQL-like queries, known in Hive as HQL. Hive cannot update data in real time; it only supports appending data to existing tables.
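
    As a minimal sketch of what that looks like (the tables and columns here are hypothetical, not from this article), HQL reads like ordinary SQL, and new data can only be appended:

    -- Hypothetical tables, for illustration only:
    select page, count(*) as views
    from page_visits
    group by page;

    -- There is no row-level update; new data is appended to the existing table:
    insert into table page_visits
    select page, visit_time from staging_visits;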


    [Figure: Hive architecture diagram (Hive原理图.png)]

    Common Hive file formats

    TEXTFILE

    • The default file format; when creating a table, the user must explicitly specify the delimiters
    • Storage layout: row-oriented

    SequenceFile

    • A binary serialized file format of key-value pairs
    • Storage layout: row-oriented

    Columnar storage formats

    • RCFILE/ORC
    • Storage layout: column-oriented (a sketch of choosing a format at table creation follows below)
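
    A hedged sketch of choosing a format (the logs_orc/logs_text tables are invented for illustration): the file format is set with stored as at table-creation time, and because ORC files are written by Hive itself, a columnar table is typically populated from an existing row-oriented table with insert ... select rather than load data:

    create table logs_orc(
    log_id string,
    payload string
    )
    stored as orc;

    insert into table logs_orc
    select log_id, payload from logs_text;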

    Common data types

    1. Integer types

    • TINYINT, SMALLINT, INT, BIGINT
    • They occupy 1, 2, 4, and 8 bytes respectively

    2. Floating-point types

    • FLOAT and DOUBLE
    • 32-bit and 64-bit floating-point numbers respectively

    3. Boolean type: BOOLEAN

    • Stores true and false

    4. String type: STRING

    • Stores variable-length strings; the type has no declared length limit

    5. Timestamp type: TIMESTAMP

    • Stores timestamps with nanosecond precision

    Complex data types

    1. ARRAY

    • Stores elements of a single type; elements are retrieved by index
    • Definition: ARRAY<STRING>
    • Query: array[index]

    2. MAP

    • Stores key-value pairs; all keys share one type and all values share another. Values are retrieved by key.
    • Definition: MAP<STRING,INT>
    • Query: map['key']

    3. STRUCT

    • Can hold fields of different types; once the struct is declared, the field positions cannot change.
    • Definition: STRUCT<city:STRING, address:STRING, door_num:STRING>
    • Query: struct.fieldname (see the combined sketch just below)
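
    A minimal combined sketch (the contacts table and its columns are invented for illustration) showing all three complex types in one table; note that the separators for collection items and map keys have to be declared when the table is created:

    create table contacts(
    name string,
    phones array<string>,
    scores map<string,int>,
    address struct<city:string,street:string,door_num:string>
    )
    row format delimited fields terminated by '\t'
    collection items terminated by ','
    map keys terminated by ':'
    lines terminated by '\n'
    stored as textfile;

    -- Index, key, and field access respectively:
    select phones[0], scores['math'], address.city from contacts;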

    I. Installing Hive

    1. Download the Hive package and unpack it

    [hadoop@hadoop01 apps]$ tar -zxvf apache-hive-1.2.2-bin.tar.gz 
    
    

    2. As root, create a symbolic link

    [root@hadoop01 apps]# ln -s /home/hadoop/apps/apache-hive-1.2.2-bin /usr/local/hive
    
    

    3. Assign user and group ownership of the Hive directory

    [root@hadoop01 apps]# chown -R hadoop:hadoop /usr/local/hive
    

    4. Add Hive to the system environment variables and apply them

    [root@hadoop01 apps]# vim /etc/profile
    
    

    Append the following:

    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:${JAVA_HOME}/bin:${ZOOKEEPER_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${HIVE_HOME}/bin
    
    

    Apply the environment variables:

    [root@hadoop01 apps]# source /etc/profile
    

    5. Configure the Hive metastore

    Edit the hive-site.xml file in Hive's configuration directory as follows:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
       Licensed to the Apache Software Foundation (ASF) under one or more
       contributor license agreements.  See the NOTICE file distributed with
       this work for additional information regarding copyright ownership.
       The ASF licenses this file to You under the Apache License, Version 2.0
       (the "License"); you may not use this file except in compliance with
       the License.  You may obtain a copy of the License at
    
           http://www.apache.org/licenses/LICENSE-2.0
    
       Unless required by applicable law or agreed to in writing, software
       distributed under the License is distributed on an "AS IS" BASIS,
       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
       See the License for the specific language governing permissions and
       limitations under the License.
    -->
    <configuration>
        <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://192.168.43.50:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>
            <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionDriverName</name>
            <value>com.mysql.jdbc.Driver</value>
            <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionUserName</name>
            <value>hadoop</value>
            <description>username to use against metastore database</description>
        </property>
        <property>
            <name>javax.jdo.option.ConnectionPassword</name>
            <value>xxxx</value>
            <description>password to use against metastore database</description>
        </property>
    </configuration>
    
    

    Note: the connection URL points at a MySQL instance; the configured user must have permission to access the database from other hosts, and ConnectionPassword should be set to your own MySQL password. See the sketch below for the JDBC driver and schema setup.
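
    One step the configuration above does not cover: Hive needs the MySQL JDBC driver jar on its classpath, and the metastore schema can be created up front with the schematool utility that ships with Hive. A hedged sketch (the connector jar name and version are assumptions; use the one you actually downloaded):

    # Assumed jar name; substitute the version you downloaded
    [root@hadoop01 apps]# cp mysql-connector-java-5.1.44-bin.jar /usr/local/hive/lib/

    # Optionally pre-create the metastore tables in MySQL
    [hadoop@hadoop01 ~]$ /usr/local/hive/bin/schematool -dbType mysql -initSchema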

    II. Using Hive

    1. Start hive

    [hadoop@hadoop01 ~]$ hive
    
    Logging initialized using configuration in jar:file:/home/hadoop/apps/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties
    

    2. List the databases

    hive> show databases;
    OK
    default
    Time taken: 0.99 seconds, Fetched: 1 row(s)
    

    3. Create a user table: user_info

    Field info:

    Field        Type
    user ID      string
    region ID    string
    age          int
    occupation   string

    create table user_info(
    user_id string,
    area_id string,
    age int,
    occupation string
    )
    row format delimited fields terminated by '\t' 
    lines terminated by '\n'
    stored as textfile; 
    

    4. Show tables

    List the tables in the default database; the newly created user_info table is there:

    hive> show tables;
    OK
    user_info
    Time taken: 0.04 seconds, Fetched: 1 row(s)
    
    

    Check the corresponding warehouse directory in HDFS:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:48 /user/hive/warehouse/user_info
    

    5. Drop a table in Hive

    Drop the user_info table; because it is a managed table, its directory in HDFS is deleted along with it:

    hive> drop table user_info;
    OK
    Time taken: 0.935 seconds
    hive> show tables;
    OK
    Time taken: 0.041 seconds
    

    Check the warehouse directory again:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse
    [hadoop@hadoop01 root]$ 
    
    

    6. Create a database for storing the dimension data

    hive> create database rel;
    OK
    Time taken: 0.098 seconds
    hive> show databases;
    OK
    default
    rel
    Time taken: 0.025 seconds, Fetched: 2 row(s)
    
    

    Check the corresponding directory:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:55 /user/hive/warehouse/rel.db
    
    

    7. Create a managed (internal) table

    In the rel database, create a student information table with fields: student ID, name, age, and region. First switch to the rel database:

    use rel;
    create table student_info(
    student_id string comment 'student ID',
    name string comment 'name',
    age int comment 'age',
    origin string comment 'region'
    )
    comment 'student information table'
    row format delimited 
    fields terminated by '\t' 
    lines terminated by '\n'
    stored as textfile;
    

    Check the corresponding directory:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/rel.db
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:59 /user/hive/warehouse/rel.db/student_info
    

    8. Load data into a table from the local filesystem

    Use load to load a local data file into student_info:

    hive> load data local inpath '/home/hadoop/apps/hive_test_data/student_info_data.txt' into table student_info;
    Loading data to table rel.student_info
    Table rel.student_info stats: [numFiles=1, totalSize=341]
    OK
    Time taken: 1.144 seconds
    
    

    Inspect the student_info table and its corresponding file path:

    hive> select * from student_info;
    OK
    1   xiaoming    20  11
    2   xiaobai 21  31
    3   zhangfei    22  44
    4   likui   19  44
    5   zhaoyun 21  13
    6   zhangsan    20  11
    7   lisi    19  11
    8   wangwu  23  31
    9   zhaofei 19  21
    10  zhangyan    20  21
    11  lihe    20  22
    12  caoyang 17  32
    13  lihao   19  32
    14  zhaoming    21  50
    15  zhouhong    18  51
    16  yangshuo    23  33
    17  xiaofei 24  13
    18  liman   23  13
    19  qianbao 20  13
    20  sunce   21  41
    Time taken: 0.767 seconds, Fetched: 20 row(s)
    

    Check the corresponding directory:

    [hadoop@hadoop01 hive_test_data]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_info
    Found 1 items
    -rwxr-xr-x   3 hadoop supergroup        341 2018-01-14 20:09 /user/hive/warehouse/rel.db/student_info/student_info_data.txt
    

    9. Load data into student_info from HDFS

    First delete the existing data file:

    [hadoop@hadoop01 hive_test_data]$ hadoop fs -rm -f /user/hive/warehouse/rel.db/student_info/student_info_data.txt
    18/01/14 20:15:31 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    Deleted /user/hive/warehouse/rel.db/student_info/student_info_data.txt
    

    Upload the local file to the HDFS root directory:

    [hadoop@hadoop01 hive_test_data]$ hadoop fs -put /home/hadoop/apps/hive_test_data/student_info_data.txt /
    [hadoop@hadoop01 hive_test_data]$ hadoop fs -ls /
    Found 6 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-14 16:23 /addata
    drwxr-xr-x   - hadoop supergroup          0 2017-12-23 20:20 /data
    -rw-r--r--   3 hadoop supergroup        341 2018-01-14 20:16 /student_info_data.txt
    drwxrwx---   - hadoop supergroup          0 2018-01-14 17:26 /tmp
    drwxr-xr-x   - hadoop supergroup          0 2018-01-14 19:48 /user
    drwxr-xr-x   - hadoop supergroup          0 2018-01-13 16:26 /wordcount
    

    Use load to move the HDFS file into the student_info table:

    hive> load data inpath '/student_info_data.txt' into table student_info;
    Loading data to table rel.student_info
    Table rel.student_info stats: [numFiles=1, totalSize=341]
    OK
    Time taken: 0.602 seconds
    hive> select * from student_info;
    OK
    1   xiaoming    20  11
    2   xiaobai 21  31
    3   zhangfei    22  44
    4   likui   19  44
    5   zhaoyun 21  13
    6   zhangsan    20  11
    7   lisi    19  11
    8   wangwu  23  31
    9   zhaofei 19  21
    10  zhangyan    20  21
    11  lihe    20  22
    12  caoyang 17  32
    13  lihao   19  32
    14  zhaoming    21  50
    15  zhouhong    18  51
    16  yangshuo    23  33
    17  xiaofei 24  13
    18  liman   23  13
    19  qianbao 20  13
    20  sunce   21  41
    Time taken: 0.143 seconds, Fetched: 20 row(s)
    

    Load the file into student_info with overwrite:

    • Note that load data inpath moves (rather than copies) the file: the original student_info_data.txt in the HDFS root has already been relocated into the table's path /user/hive/warehouse/rel.db/student_info
    hive> load data inpath '/student_info_data.txt' overwrite into table student_info;
    Loading data to table rel.student_info
    Table rel.student_info stats: [numFiles=1, numRows=0, totalSize=341, rawDataSize=0]
    OK
    Time taken: 0.41 seconds
    

    10. Hive's data types in a table

    Field            Type                                                                         Comment
    user_id          string                                                                       user ID
    salary           int                                                                          salary
    worked_citys     array<string>                                                                cities worked in
    social_security  map<string,float>                                                            social-insurance contributions (pension, medical)
    welfare          struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>   benefits: meal allowance (float), regular employee (boolean), commercial insurance (float)

    Create the employee table:

    hive> create table rel.employee(
        > user_id string,
        > salary int,
        > worked_citys array<string>,
        > social_security map<string,float>,
        > welfare struct<meal_allowance:float,if_regular:boolean,commercial_insurance:float>
        > )
        > row format delimited fields terminated by '\t' 
        > collection items terminated by ','
        > map keys terminated by ':'
        > lines terminated by '\n'
        > stored as textfile;
    OK
    Time taken: 0.212 seconds
    hive> show tables;
    OK
    employee
    student_info
    Time taken: 0.057 seconds, Fetched: 2 row(s)
    

    Load local data into the employee table:

    hive> load data local inpath '/home/hadoop/apps/hive_test_data/employee_data.txt' into table employee;
    Loading data to table rel.employee
    Table rel.employee stats: [numFiles=1, totalSize=206]
    OK
    Time taken: 0.388 seconds
    hive> select * from employee;
    OK
    zhangsan    10800   ["beijing","shanghai"]  {"养老":1000.0,"医疗":600.0}    {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
    lisi    20000   ["beijing","nanjing"]   {"养老":2000.0,"医疗":1200.0}   {"meal_allowance":2000.0,"if_regular":false,"commercial_insurance":500.0}
    wangwu  17000   ["shanghai","nanjing"]  {"养老":1800.0,"医疗":1100.0}   {"meal_allowance":2000.0,"if_regular":true,"commercial_insurance":500.0}
    Time taken: 0.127 seconds, Fetched: 3 row(s)
    

    Query the employee ID, salary, first city worked in, pension contribution, and meal allowance of employees who have been made regular:

    select user_id,
    salary,
    worked_citys[0],
    social_security['养老'],
    welfare.meal_allowance 
    from rel.employee
    where welfare.if_regular=true;
    

    11. Create an external table [commonly used]

    Field         Type    Comment
    student_id    string  student ID
    name          string  student name
    institute_id  string  institute ID
    major_id      string  major ID
    school_year   string  enrollment year

    You can create the HDFS path ahead of time:
    hadoop fs -mkdir -p /user/hive/warehouse/data/student_school_info
    If it is not created in advance, it is created automatically from the path given in the location clause when the external table is created.

    create external table rel.student_school_info(
    student_id string,
    name string,
    institute_id string,
    major_id string,
    school_year string
    )
    row format delimited 
    fields terminated by '\t' 
    lines terminated by '\n'
    stored as textfile
    location '/user/hive/warehouse/data/student_school_info';
    

    Check the corresponding directory:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/data/
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-15 14:08 /user/hive/warehouse/data/student_school_info
    
    

    Upload the local data file to HDFS:

    [hadoop@hadoop01 root]$ hadoop fs -put /home/hadoop/apps/hive_test_data/student_school_info_external_data.txt /user/hive/warehouse/data/student_school_info/
    

    12. Create a managed (internal) partitioned table

    Field         Type    Comment
    student_id    string  student ID
    name          string  student name
    institute_id  string  institute ID
    major_id      string  major ID
    create table student_school_info_partition(
    student_id string,
    name string,
    institute_id string,
    major_id string
    )
    partitioned by(school_year string) 
    row format delimited
    fields terminated by '\t' 
    lines terminated by '\n'
    stored as textfile; 
    

    Use insert into to load the 2017 enrollment records from student_school_info into the partitioned table student_school_info_partition:

    insert into table student_school_info_partition partition(school_year='2017')
    select t1.student_id,t1.name,t1.institute_id,t1.major_id
    from student_school_info t1
    where t1.school_year='2017';
    

    13. View partitions

    hive> show partitions student_school_info_partition;
    OK
    school_year=2017
    Time taken: 0.191 seconds, Fetched: 1 row(s)
    

    Check the HDFS path:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-16 13:20 /user/hive/warehouse/rel.db/student_school_info_partition/school_year=2017
    

    Query student_school_info_partition:

    hive> select * from student_school_info_partition where school_year='2017';
    OK
    1   xiaoming    information software    2017
    2   xiaobai information computer    2017
    3   zhangfei    information computer    2017
    4   likui   information bigdata 2017
    5   zhaoyun information bigdata 2017
    6   zhangsan    information software    2017
    7   lisi    information bigdata 2017
    8   wangwu  information computer    2017
    Time taken: 0.226 seconds, Fetched: 8 row(s)
    

    14. Drop a partition

    hive> alter table student_school_info_partition drop partition (school_year='2017');
    Dropped the partition school_year=2017
    OK
    Time taken: 0.71 seconds
    

    15. Insert data using dynamic partitioning

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    insert overwrite table student_school_info_partition partition(school_year)
    select t1.student_id,t1.name,t1.institute_id,t1.major_id,t1.school_year
    from student_school_info t1;
    

    View the partitions:

    hive> show partitions student_school_info_partition;
    OK
    school_year=2017
    Time taken: 0.12 seconds, Fetched: 1 row(s)
    

    Check the HDFS path:

    [hadoop@hadoop01 root]$ hadoop fs -ls /user/hive/warehouse/rel.db/student_school_info_partition/
    Found 1 items
    drwxr-xr-x   - hadoop supergroup          0 2018-01-16 13:27 /user/hive/warehouse/rel.db/student_school_info_partition/school_year=2017
    

    16. Create an external partitioned table [commonly used]

    create external table rel.student_school_info_external_partition(
    student_id string,
    name string,
    institute_id string,
    major_id string
    )
    partitioned by(school_year string) 
    row format delimited 
    fields terminated by '\t' 
    lines terminated by '\n'
    stored as textfile
    location '/user/hive/warehouse/data/student_school_info_external_partition';
    

    Create a school_year=2017 directory under the partitioned table's HDFS path:

    hadoop fs -mkdir /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
    

    Upload the student_school_external_partition_data.txt file into the school_year=2017 directory:

    hadoop fs -put student_school_external_partition_data.txt /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
    

    Although the data now sits in the partition's HDFS path, the table itself has no partition registered yet, so queries return nothing.

    Add the partition manually:

    hive> alter table student_school_info_external_partition add partition(school_year='2017');
    OK
    Time taken: 0.111 seconds
    
    hive> select * from student_school_info_external_partition;
    OK
    1   xiaoming    information software    2017
    2   xiaobai information computer    2017
    3   zhangfei    information computer    2017
    4   likui   information bigdata 2017
    5   zhaoyun information bigdata 2017
    6   zhangsan    information software    2017
    7   lisi    information bigdata 2017
    8   wangwu  information computer    2017
    Time taken: 0.127 seconds, Fetched: 8 row(s)
    

    Drop the partition:

    hive> alter table student_school_info_external_partition drop partition(school_year='2017');
    Dropped the partition school_year=2017
    OK
    Time taken: 0.19 seconds
    

    View the partitions; the partition has been removed:

    hive> show partitions student_school_info_external_partition;
    OK
    Time taken: 0.168 seconds
    

    Check the partition data in HDFS; because the table is external, the data file is still there:

    [hadoop@hadoop01 hive_test_data]$ hadoop fs -ls /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017
    Found 1 items
    -rw-r--r--   3 hadoop supergroup        250 2018-01-16 13:33 /user/hive/warehouse/data/student_school_info_external_partition/school_year=2017/student_school_external_partition_data.txt
    
