Big Data Development, Hive Part 1: Introduction to Hive


Author: 只是甲 | Published 2020-12-17 13:40

    1. Hive Overview

    1. Hive is a data-warehouse tool built on top of Hadoop.
    2. It maps structured data files to database tables and provides a simple SQL-like query language (HQL); HQL statements are translated into MapReduce jobs for execution.
    3. The learning curve is low: simple MapReduce-style aggregations can be expressed as SQL-like statements, with no need to develop a dedicated MapReduce application.
    4. It is well suited to data-warehouse ETL and statistical analysis.
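    As a hedged illustration of point 2, the sketch below maps a tab-delimited HDFS file onto a table and queries it with HQL (the table name, columns, and path are hypothetical placeholders, not from the original article); Hive compiles the query into a MapReduce job:

    ```sql
    -- Hypothetical example: map a structured file on HDFS to a table.
    CREATE EXTERNAL TABLE web_log (
      ip  STRING,
      url STRING,
      ts  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_log';

    -- This HQL statement is compiled into a MapReduce job at execution time.
    SELECT url, count(*) AS hits
    FROM web_log
    GROUP BY url;
    ```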

    2. Features of Hive

    1. Easy to use
      SQL-based expression syntax, compatible with most SQL-92 semantics and some SQL:2003 extensions
    2. Scalable
      Hive is built on Hadoop, so the cluster can be scaled out freely, usually without restarting the service
    3. Extensible
      Hive supports user-defined functions, so users can implement functions tailored to their own needs
    4. Fault-tolerant
      Thanks to Hadoop's fault tolerance, SQL can still complete even when a node fails
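    Point 3 (user-defined functions) typically looks like the following sketch; the jar path and class name are placeholders, assumed for illustration only:

    ```sql
    -- Hypothetical: register a UDF packaged in a jar (path and class are placeholders).
    ADD JAR /tmp/my-udfs.jar;
    CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

    -- Use the custom function like any built-in.
    SELECT my_lower(ename) FROM test.emp;
    ```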

    3. When to Use Hive

    1. Best fit
      Batch jobs over large datasets, e.g. web-log analysis
    2. Not suitable for
      Low-latency queries: even on a dataset of a few hundred MB, a Hive query typically has minute-level latency.
      Online transaction processing (OLTP)
      Row-level data updates (newer releases add limited UPDATE support on transactional tables)
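    Where row-level UPDATE is supported at all, it requires a transactional (ACID) table; a minimal sketch, assuming an ORC-backed bucketed table (the table name and columns are hypothetical):

    ```sql
    -- Hypothetical ACID table; UPDATE only works on transactional tables.
    CREATE TABLE emp_txn (empno INT, sal DOUBLE)
    CLUSTERED BY (empno) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Row-level update, rewritten internally as delta files rather than in-place edits.
    UPDATE emp_txn SET sal = sal * 1.1 WHERE empno = 7369;
    ```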

    4. Hive Commands

    Common interactive hive commands (the original overview figure is omitted):

    Test session:

      hive> set mapred.reduce.tasks=32;
      hive> set;
      hive> select a.* from tab1 a;
      hive> !ls;
      hive> dfs -ls;
    

    Using the Hive CLI

    [root@hp1 libext]# hive -H
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    usage: hive
     -d,--define <key=value>          Variable subsitution to apply to hive
                                      commands. e.g. -d A=B or --define A=B
        --database <databasename>     Specify the database to use
     -e <quoted-query-string>         SQL from command line
     -f <filename>                    SQL from files
     -H,--help                        Print help information
        --hiveconf <property=value>   Use value for given property
        --hivevar <key=value>         Variable subsitution to apply to hive
                                      commands. e.g. --hivevar A=B
     -i <filename>                    Initialization SQL file
     -S,--silent                      Silent mode in interactive shell
     -v,--verbose                     Verbose mode (echo executed SQL to the
                                      console)
    

    Running SQL scripts with Hive

    hive -e 'select count(*) from test.emp'
    hive -f a.sql
    

    Test session:

    [root@hp1 ~]# more a.sql 
    select count(*) from test.emp;
    [root@hp1 ~]# 
    [root@hp1 ~]# hive -f a.sql
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    Query ID = root_20201127172235_292c9382-f5e4-46fd-ab9b-5660eeba4f44
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    20/11/27 17:22:38 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0003, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0003/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0003
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
    2020-11-27 17:22:46,594 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:22:51,815 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1.99 sec
    2020-11-27 17:22:52,850 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.91 sec
    2020-11-27 17:22:58,031 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 5.93 sec
    MapReduce Total cumulative CPU time: 5 seconds 930 msec
    Ended Job = job_1606444720955_0003
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 5.93 sec   HDFS Read: 14178 HDFS Write: 102 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 930 msec
    OK
    14
    Time taken: 23.234 seconds, Fetched: 1 row(s)
    

    5. Hive Parameters

    Hive variables fall into several namespaces, covered in turn below: hivevar, hiveconf, system, and env (the original overview figure is omitted).

    5.1 hivevar example

    Code:

    hive -d foo=bar
    set foo;
    set hivevar:empno=7369;
    select * from test.emp where empno=${hivevar:empno};
    

    Test session:

    [root@hp1 ~]# hive -d foo=bar
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> set foo;
    foo=bar
        > set hivevar:empno=7369;
    hive> select * from test.emp where empno=${hivevar:empno};
    Query ID = root_20201127173211_9f830148-5427-4b15-b356-32d2ab9b3c74
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 17:32:12 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0004, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0004/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0004
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 17:32:20,448 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:32:26,683 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.61 sec
    MapReduce Total cumulative CPU time: 6 seconds 610 msec
    Ended Job = job_1606444720955_0004
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.61 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 610 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 17.64 seconds, Fetched: 1 row(s)
    

    5.2 hiveconf example

    Code:

    hive --hiveconf hive.cli.print.header=true
    select * from test.emp where empno=7369;
    set hive.cli.print.header;
    set hive.cli.print.header=false;
    set hive.cli.print.header;
    select * from test.emp where empno=7369;

    Test session:

    [root@hp1 ~]# hive --hiveconf hive.cli.print.header=true
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> 
        > select * from test.emp where empno=7369;
    Query ID = root_20201127173938_ca672a76-b8d5-4281-86a1-f8305b202ad8
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 17:39:40 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0005, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0005/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0005
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 17:39:48,730 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:39:54,961 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.4 sec
    MapReduce Total cumulative CPU time: 6 seconds 400 msec
    Ended Job = job_1606444720955_0005
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.4 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 400 msec
    OK
    emp.empno       emp.ename       emp.job emp.mgr emp.hiredate    emp.sal emp.comm        emp.deptno
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 17.515 seconds, Fetched: 1 row(s)
    hive> 
        > set hive.cli.print.header;
    hive.cli.print.header=true
    hive> set hive.cli.print.header=false;
    hive> set hive.cli.print.header;
    hive.cli.print.header=false
    hive> select * from test.emp where empno=7369;
    Query ID = root_20201127174246_26ec846e-1d16-4cf2-a9d4-6436632dc587
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 17:42:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0006, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0006/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0006
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 17:42:53,878 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 17:43:01,096 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.51 sec
    MapReduce Total cumulative CPU time: 6 seconds 510 msec
    Ended Job = job_1606444720955_0006
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.51 sec   HDFS Read: 11969 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 510 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 15.72 seconds, Fetched: 1 row(s)
    hive> 
    

    5.3 system variables

    Code:

    set system:user.name;
    set system:user.name=7369;
    set system:user.name;
    select * from test.emp where empno=${system:user.name};
    

    Test session:

    [root@hp1 ~]# hive
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> set system:user.name;
    system:user.name=root
    hive> set system:user.name=7369;
    hive> set system:user.name;
    system:user.name=7369
    hive> select * from test.emp where empno=${system:user.name};
    Query ID = 7369_20201127180045_9f902d3a-070d-4e12-8641-38063197eafc
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 18:00:47 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0010, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0010/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0010
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 18:00:55,955 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 18:01:02,281 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.11 sec
    2020-11-27 18:01:03,313 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.29 sec
    MapReduce Total cumulative CPU time: 6 seconds 290 msec
    Ended Job = job_1606444720955_0010
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.29 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 290 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 19.033 seconds, Fetched: 1 row(s)
    hive> 
    

    5.4 env variables

    Code:

    export empno=7369
    hive
    set env:empno;
    select * from test.emp where empno=${env:empno};
    

    Test session:

    [root@hp1 ~]# export empno=7369
    [root@hp1 ~]# hive
    WARNING: Use "yarn jar" to launch YARN applications.
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    
    Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/jars/hive-common-2.1.1-cdh6.3.1.jar!/hive-log4j2.properties Async: false
    
    WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
    hive> set env:empno;
    env:empno=7369
    hive> select * from test.emp where empno=${env:empno};
    Query ID = root_20201127180306_e2344df7-49f4-4746-9c60-6f62f81d36bb
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    20/11/27 18:03:08 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm69
    Starting Job = job_1606444720955_0011, Tracking URL = http://hp3:8088/proxy/application_1606444720955_0011/
    Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job  -kill job_1606444720955_0011
    Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
    2020-11-27 18:03:16,404 Stage-1 map = 0%,  reduce = 0%
    2020-11-27 18:03:22,675 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.41 sec
    MapReduce Total cumulative CPU time: 6 seconds 410 msec
    Ended Job = job_1606444720955_0011
    MapReduce Jobs Launched: 
    Stage-Stage-1: Map: 2   Cumulative CPU: 6.41 sec   HDFS Read: 11789 HDFS Write: 231 HDFS EC Read: 0 SUCCESS
    Total MapReduce CPU Time Spent: 6 seconds 410 msec
    OK
    7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
    Time taken: 17.48 seconds, Fetched: 1 row(s)
    hive> 
    

    6. Hive System Architecture

    Hive architecture diagram (the original figure is omitted).

    6.1 Hive interfaces

    There are three main user interfaces: the CLI, ThriftServer, and HWI.
    The CLI is the most commonly used; when the CLI starts, it also starts a Hive Driver.
    ThriftServer wraps Hive as a Thrift-based service, providing cross-language access (e.g. from Python or C++) and implementing the JDBC/ODBC protocols.
    HWI provides a way to access Hive from a browser.

    6.2 元数据存储

    Hive 将元数据存储在数据库中,如mysql、oracle、derby。
    Hive 中的元数据包括表的名字,表的列和分区及其属性,表的属性(是否为外部表等),表的数据所在目录等
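    The metadata described above can be inspected from HQL itself; for example (the partitioned table name below is a hypothetical placeholder):

    ```sql
    -- Show the metadata Hive keeps for a table: columns, location, table type, properties.
    DESCRIBE FORMATTED test.emp;

    -- List the partitions recorded in the metastore (hypothetical partitioned table).
    SHOW PARTITIONS some_partitioned_table;
    ```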

    6.3 Driver

    1. Compiler
      Performs lexical and syntax analysis, parsing the HQL query into an AST
      Generates a logical execution plan from the AST
      Generates a physical MapReduce execution plan from the logical plan
    2. Optimizer
      Optimizes the logical execution plan
      Optimizes the physical execution plan
    3. Executor
      Turns the physical execution plan into MapReduce jobs
      Submits them to Hadoop for execution
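    The compile-and-optimize pipeline above can be observed directly: EXPLAIN prints the stage plan Hive generates for a statement before anything is submitted to Hadoop.

    ```sql
    -- Print the plan (stages, operators) produced by the compiler and optimizer.
    EXPLAIN SELECT count(*) FROM test.emp;
    ```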
