Big Data Development, Hive Part 1: Introduction to Hive


Author: 只是甲 | Published 2020-12-17 13:40

    1. Hive Overview

    1. Hive is a data-warehouse tool built on top of Hadoop.
    2. It maps structured data files to database tables and provides a simple SQL-like query language (HQL); HQL statements are translated into MapReduce jobs for execution.
    3. The learning curve is low: simple MapReduce-style aggregations can be expressed as SQL-like statements, with no need to develop a dedicated MapReduce application.
    4. It is well suited to data-warehouse ETL and statistical analysis.
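    As a hedged illustration of point 2, the sketch below maps a tab-delimited HDFS file onto a table and queries it with HQL (the table name, columns, and path are hypothetical placeholders, not from the original article); Hive compiles the query into a MapReduce job:

    ```sql
    -- Hypothetical example: map a structured file on HDFS to a table.
    CREATE EXTERNAL TABLE web_log (
      ip  STRING,
      url STRING,
      ts  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_log';

    -- This HQL statement is compiled into a MapReduce job at execution time.
    SELECT url, count(*) AS hits
    FROM web_log
    GROUP BY url;
    ```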

    2. Features of Hive

    1. Easy to use
      SQL-based expression syntax, compatible with most SQL-92 semantics and some SQL:2003 extensions
    2. Scalable
      Hive is built on Hadoop, so the cluster can be scaled out freely, usually without restarting the service
    3. Extensible
      Hive supports user-defined functions, so users can implement functions tailored to their own needs
    4. Fault-tolerant
      Thanks to Hadoop's fault tolerance, SQL can still complete even when a node fails
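    Point 3 (user-defined functions) typically looks like the following sketch; the jar path and class name are placeholders, assumed for illustration only:

    ```sql
    -- Hypothetical: register a UDF packaged in a jar (path and class are placeholders).
    ADD JAR /tmp/my-udfs.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

    -- Use the custom function like any built-in.
    SELECT my_lower(ename) FROM test.emp;
    ```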

    3. When to Use Hive

    1. Best fit
      Batch jobs over large datasets, e.g. web-log analysis
    2. Not suitable for
      Low-latency queries: even on a dataset of a few hundred MB, a Hive query typically has minute-level latency.
      Online transaction processing (OLTP)
      Row-level data updates (newer releases add limited UPDATE support on transactional tables)
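    Where row-level UPDATE is supported at all, it requires a transactional (ACID) table; a minimal sketch, assuming an ORC-backed bucketed table (the table name and columns are hypothetical):

    ```sql
    -- Hypothetical ACID table; UPDATE only works on transactional tables.
    CREATE TABLE emp_txn (empno INT, sal DOUBLE)
    CLUSTERED BY (empno) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Row-level update, rewritten internally as delta files rather than in-place edits.
    UPDATE emp_txn SET sal = sal * 1.1 WHERE empno = 7369;
    ```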

    4. Hive Commands

    Common interactive hive commands (the original overview figure is omitted):

    Test session:

      hive> set mapred.reduce.tasks=32;
      hive> set;
      hive> select a.* from tab1 a;
      hive> !ls;
      hive> dfs -ls;
    

    Using the Hive CLI

    [root@hp1 libext]# hive -H
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    usage: hive
     -d,--define <key=value>          Variable subsitution to apply to hive
                                      commands. e.g. -d A=B or --define A=B
        --database <databasename>     Specify the database to use
     -e <quoted-query-string>         SQL from command line
     -f <filename>                    SQL from files
     -H,--help                        Print help information
        --hiveconf <property=value>   Use value for given property
        --hivevar <key=value>         Variable subsitution to apply to hive
                                      commands. e.g. --hivevar A=B
     -i <filename>                    Initialization SQL file
     -S,--silent                      Silent mode in interactive shell
     -v,--verbose                     Verbose mode (echo executed SQL to the
                                      console)
    

    Running SQL scripts with Hive

    hive -e 'select count(*) from test.emp'
    hive -f a.sql
    

    Test session:

    [root@hp1 ~]# more a.sql 
    select count(*) from test.emp;
    [root@hp1 ~]# 
    [root@hp1 ~]# hive -f a.sql
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    Query ID = root_20201127172235_292c9382-f5e4-46fd-ab9b-5660eeba4f44
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    20/11/27 17:22:38 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0003, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0003/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0003
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    2020-11-27 17:22:46,594 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:22:51,815 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.99 sec
    2020-11-27 17:22:52,850 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.91 sec
    2020-11-27 17:22:58,031 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.93 sec
    MapReduce Total cumulative CPU time: 5 seconds 930 msec
    Ended Job = job_1606444720955_0003
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.93 sec   HDFS Read: 14178 HDFS Write: 102 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 930 msec
    OK
    14
    Time taken: 23.234 seconds, Fetched: 1 row(s)
    

    5. Hive Parameters

    Hive variables fall into several namespaces, covered in turn below: hivevar, hiveconf, system, and env (the original overview figure is omitted).

    5.1 hivevar example

    Code:

    hive -d foo=bar
    set foo;
    set hivevar:empno=7369;
    select * from test.emp where empno=${hivevar:empno};
    

    Test session:

    [root@hp1 ~]# hive -d foo=bar
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> set foo;
    foo=bar
        > set hivevar:empno=7369;
    hive> select * from test.emp where empno=${hivevar:empno};
    Query ID = root_20201127173211_9f830148-5427-4b15-b356-32d2ab9b3c74
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 17:32:12 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0004, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0004/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0004
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 17:32:20,448 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:32:26,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.61 sec
    MapReduce Total cumulative CPU time: 6 seconds 610 msec
    Ended Job = job_1606444720955_0004
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.61 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 610 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 17.64 seconds, Fetched: 1 row(s)
    

    5.2 hiveconf example

    Code:

    hive --hiveconf hive.cli.print.header=true
    select * from test.emp where empno=7369;
    set hive.cli.print.header;
    set hive.cli.print.header=false;
    set hive.cli.print.header;
    select * from test.emp where empno=7369;

    Test session:

    [root@hp1 ~]# hive --hiveconf hive.cli.print.header=true
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> 
        > select * from test.emp where empno=7369;
    Query ID = root_20201127173938_ca672a76-b8d5-4281-86a1-f8305b202ad8
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 17:39:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0005, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0005/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0005
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 17:39:48,730 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:39:54,961 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.4 sec
    MapReduce Total cumulative CPU time: 6 seconds 400 msec
    Ended Job = job_1606444720955_0005
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.4 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 400 msec
    OK
    emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 17.515 seconds, Fetched: 1 row(s)
    hive> 
        > set hive.cli.print.header;
    hive.cli.print.header=true
    hive> set hive.cli.print.header=false;
    hive> set hive.cli.print.header;
    hive.cli.print.header=false
    hive> select * from test.emp where empno=7369;
    Query ID = root_20201127174246_26ec846e-1d16-4cf2-a9d4-6436632dc587
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 17:42:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0006, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0006/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0006
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 17:42:53,878 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:43:01,096 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.51 sec
    MapReduce Total cumulative CPU time: 6 seconds 510 msec
    Ended Job = job_1606444720955_0006
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.51 sec   HDFS Read: 11969 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 510 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 15.72 seconds, Fetched: 1 row(s)
    hive> 
    

    5.3 system variables

    Code:

    set system:user.name;
    set system:user.name=7369;
    set system:user.name;
    select * from test.emp where empno=${system:user.name};
    

    Test session:

    [root@hp1 ~]# hive
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> set system:user.name;
    system:user.name=root
    hive> set system:user.name=7369;
    hive> set system:user.name;
    system:user.name=7369
    hive> select * from test.emp where empno=${system:user.name};
    Query ID = 7369_20201127180045_9f902d3a-070d-4e12-8641-38063197eafc
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 18:00:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0010, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0010/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0010
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 18:00:55,955 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 18:01:02,281 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.11 sec
    2020-11-27 18:01:03,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.29 sec
    MapReduce Total cumulative CPU time: 6 seconds 290 msec
    Ended Job = job_1606444720955_0010
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.29 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 290 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 19.033 seconds, Fetched: 1 row(s)
    hive> 
    

    5.4 env variables

    Code:

    export empno=7369
    hive
    set env:empno;
    select * from test.emp where empno=${env:empno};
    

    Test session:

    [root@hp1 ~]# export empno=7369
    [root@hp1 ~]# hive
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> set env:empno;
    env:empno=7369
    hive> select * from test.emp where empno=${env:empno};
    Query ID = root_20201127180306_e2344df7-49f4-4746-9c60-6f62f81d36bb
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 18:03:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0011, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0011/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0011
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 18:03:16,404 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 18:03:22,675 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.41 sec
    MapReduce Total cumulative CPU time: 6 seconds 410 msec
    Ended Job = job_1606444720955_0011
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.41 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 410 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 17.48 seconds, Fetched: 1 row(s)
    hive> 
    

    6. Hive System Architecture

    Hive architecture diagram (the original figure is omitted).

    6.1 Hive interfaces

    There are three main user interfaces: the CLI, ThriftServer, and HWI.
    The CLI is the most commonly used; when the CLI starts, it also starts a Hive Driver.
    ThriftServer wraps Hive as a Thrift-based service, providing cross-language access (e.g. from Python or C++) and implementing the JDBC/ODBC protocols.
    HWI provides a way to access Hive from a browser.

    6.2 元数据存储

    Hive 将元数据存储在数据库中,如mysql、oracle、derby。
    Hive 中的元数据包括表的名字,表的列和分区及其属性,表的属性(是否为外部表等),表的数据所在目录等
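    The metadata described above can be inspected from HQL itself; for example (the partitioned table name below is a hypothetical placeholder):

    ```sql
    -- Show the metadata Hive keeps for a table: columns, location, table type, properties.
    DESCRIBE FORMATTED test.emp;

    -- List the partitions recorded in the metastore (hypothetical partitioned table).
    SHOW PARTITIONS some_partitioned_table;
    ```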

    6.3 Driver

    1. Compiler
      Performs lexical and syntax analysis, parsing the HQL query into an AST
      Generates a logical execution plan from the AST
      Generates a physical MapReduce execution plan from the logical plan
    2. Optimizer
      Optimizes the logical execution plan
      Optimizes the physical execution plan
    3. Executor
      Turns the physical execution plan into MapReduce jobs
      Submits them to Hadoop for execution
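    The compile-and-optimize pipeline above can be observed directly: EXPLAIN prints the stage plan Hive generates for a statement before anything is submitted to Hadoop.

    ```sql
    -- Print the plan (stages, operators) produced by the compiler and optimizer.
    EXPLAIN SELECT count(*) FROM test.emp;
    ```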
