Hadoop——Hive

作者: _羊羽_ | 来源:发表于2019-08-14 23:03 被阅读0次

大数据学习教程
hive 相关
Hive 安装
hadoop以及hive的日常使用
Hive学习
HIVE安装指南
大数据开发实战：离线大数据处理的主要技术--Hive,概念，SQ
数仓实战03：数仓搭建-ODS 层
Hive安装教程
Hive 基础搭建教程

概述

官方地址http://hive.apache.org/
官方文档地址:https://cwiki.apache.org/confluence/display/Hive/GettingStarted
简单，容易上手提供了类似SQL查询语言HQL
为超大数据集设计的计算、存储拓展能力
统一的元数据管理

安装

官方下载地址 http://hive.apache.org/downloads.html

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
tar -zxvf ~/Downloads/apache-hive-1.2.2-bin.tar.gz -C /opt/module/
cd /opt/module/
mv apache-hive-1.2.2-bin/ hive

cdh版本安装地址

#wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.7.0.tar.gz
#tar -zxvf hive-1.1.0-cdh5.7.0.tar.gz -C app/
#mv hive-1.1.0-cdh5.7.0/ hive

配置

设置环境变量

export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH

修改hive中config目录的配置

cp hive-env.sh.template hive-env.sh

设置 hive-env.sh

HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0

拷贝MySQL驱动到lib目录下

# wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-java-5.1.48.tar.gz
# mkdir lib
# tar -zxvf mysql-connector-java-5.1.48.tar.gz -C lib
# cp ~/lib/mysql-connector-java-5.1.48/mysql-connector-java-5.1.48-bin.jar ~/app/hive/lib

Hadoop集群配置

hadoop fs -mkdir /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

设置数据库权限

cd /root/app/hive/conf
vim hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?useSSL=false</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>

在hive-site.xml文件中添加如下配置信息,显示当前数据库，以及查询表的头信息配置。

        <property>
                <name>hive.cli.print.header</name>
                <value>true</value>
        </property>
        <property>
                <name>hive.cli.print.current.db</name>
                <value>true</value>
        </property>

创建Mysql数据库和权限

# mycli -uroot -hlocalhost
mysql root@localhost:(none)> create database hive_metastore
mysql root@localhost:(none)> grant all privileges on hive.* to 'hive'@'localhost' identified by 'hive'; 
mysql root@localhost:(none)> flush privileges;

修改日志存储路径

$ cp hive-log4j.properties.template hive-log4j.properties
$ vim hive-log4j.properties
hive.log.dir=/opt/module/hive/logs

hive启动

启动hive之前，请先启动hadoop集群。

    start-all.sh #启动hadoop
    hive  #启动hive

spark2.0问题

ls: 无法访问/root/app/spark/lib/spark-assembly-.jar: 没有那个文件或目录,
spark升级到spark2以后，原有lib目录下的大JAR包被分散成多个小JAR包，原来的spark-assembly-.jar已经不存在，所以hive没有办法找到这个JAR包。

111 # add Spark assembly jar to the classpath
112 if [[ -n "$SPARK_HOME" ]]
113 then
114   sparkAssemblyPath=`ls ${SPARK_HOME}/jars/*.jar`
115   CLASSPATH="${CLASSPATH}:${sparkAssemblyPath}"
116 fi

基本操作

创建数据库

hive> create database if not exists hive;
OK
Time taken: 0.215 seconds

查看数据库

hive> show databases;
OK
default
hive
Time taken: 0.043 seconds, Fetched: 2 row(s)

切换数据库

hive> use hive;

创建数据表

hive>create table wordcount (count string);
hive> create table student(id int,name string);

查看表

show tables;

查看表结构

hive> desc student;
OK
id                      int
name                    string
Time taken: 0.236 seconds, Fetched: 2 row(s)

插入数据

hive> insert into student values(1,"xiaoming");

image.png

查询数据

hive> select * from student;
OK
1   xiaoming
Time taken: 0.099 seconds, Fetched: 1 row(s)

加载本地数据

创建测试数据vim student.txt

1       zhangsan
2       wangwu
3       zhaosi
4       liliu

创建student表

hive> use default;
OK
Time taken: 1.018 seconds
hive> show tables;
OK
Time taken: 0.296 seconds
hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.384 seconds

加载本地数据到表中

hive> load data local inpath '/opt/module/data/student.txt' into table student;
Loading data to table default.student
Table default.student stats: [numFiles=2, numRows=0, totalSize=48, rawDataSize=0]
OK
Time taken: 0.997 seconds

查看加载数据结果

hive> select *from student;
OK
1   zhangsan
2   wangwu
3   zhaosi
4   liliu
Time taken: 0.355 seconds, Fetched: 4 row(s)

wordcount

hive> load data local inpath '/home/hadoop/data/hello.txt' into table wordcount;

执行job

select word,count(1) from wordcount lateral view explode(split(count,'\t')) wc as word group by word;

MapReduce 加载

[root@aliyun conf]# cd ~/app/hadoop/
[root@aliyun hadoop]# mkdir input
[root@aliyun hadoop]# cd input
[root@aliyun input]# vim test.txt
 hello world
 hello hadoop
# hadoop fs -put input/ /input
19/08/21 20:59:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# hadoop fs -ls /input
19/08/21 20:59:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 root supergroup         25 2019-08-21 20:59 /input/test.txt

# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /input /output

查看结果

[root@aliyun hadoop]# hadoop fs -ls /output
19/08/21 21:02:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 root supergroup          0 2019-08-21 21:01 /output/_SUCCESS
-rw-r--r--   1 root supergroup         25 2019-08-21 21:01 /output/part-r-00000
[root@aliyun hadoop]# hadoop fs -cat /output/part-r-00000
19/08/21 21:03:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hadoop  1
hello   2
world   1

hive

hive> use hive;
hive> create table docs(line string);
hive> load data inpath '/input/' overwrite into table docs;
hive> create table word_count as
    > select word, count(1) as count from
    > (select explode(split(line,' '))as word from docs) w
    > group by word
    > order by word;
hive> select *from word_count;
OK
hadoop  1
hello   2
world   1
Time taken: 0.178 seconds, Fetched: 3 row(s)

Hive数据仓库位置配置

1）Default数据仓库的最原始位置是在hdfs上的：/user/hive/warehouse路径下。
2）在仓库目录下，没有对默认的数据库default创建文件夹。如果某张表属于default数据库，直接在数据仓库目录下创建一个文件夹。
3）修改default数据仓库原始位置（将hive-default.xml.template如下配置信息拷贝到hive-site.xml文件中）。

配置同组用户有执行权限

hdfs dfs -chmod g+w /user/hive/warehouse

查看student数据存储

$ hadoop fs -ls /user/hive/warehouse/student
19/08/26 00:56:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rwxrwxr-x   1 baxiang supergroup         11 2019-08-26 00:18 /user/hive/warehouse/student/000000_0
-rwxrwxr-x   1 baxiang supergroup         37 2019-08-26 00:50 /user/hive/warehouse/student/student.txt
baxiangs-Mac-mini:data baxiang$ hadoop fs -text /user/hive/warehouse/student/student.txt
19/08/26 00:57:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1   zhangsan
2   wangwu
3   zhaosi
4   liliu

脚本执行

$ vim student.sql
SELECT * FROM student;
$ hive -f student.sql
$ hive -f student.sql >student_all.txt //执行文件中的sql语句并将结果写入文件中

数据查看

查看hdfs数据

hive (default)> dfs -ls /user/hive/;
Found 1 items
drwxrwxr-x   - baxiang supergroup          0 2019-08-26 00:12 /user/hive/warehouse

查看本地数据

hive (default)> ! cat /opt/module/data/student.sql;
SELECT * FROM student;

HiveJDBC

启动hiveserver2

$ hiveserver2

启动beeline

beeline
Beeline version 1.2.2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: root
Enter password for jdbc:hive2://localhost:10000: ****

数据类型

图片.png

管理表

默认创建的表都是所谓的管理表，有时也被称为内部表。因为这种表，Hive会（或多或少地）控制着数据的生命周期。Hive默认情况下会将这些表的数据存储在由配置项hive.metastore.warehouse.dir(例如，/user/hive/warehouse)所定义的目录的子目录下。当我们删除一个管理表时，Hive也会删除这个表中数据。管理表不适合和其他工具共享数据。

hadoop  dfs -mkdir /student;
hadoop dfs -put /opt/module/data/student.txt /student;
 hadoop fs -ls /student
Found 1 items
-rw-r--r--   1 baxiang supergroup         42 2019-08-27 15:43 /student/student.txt

创建表

hive (default)> create external table stu(
              > id int,
              > name string)
              > row format delimited fields terminated by '\t'
              > location '/student';
OK
Time taken: 0.599 seconds

hive (default)> select *from stu;
OK
stu.id  stu.name
1   xiaoming
2   xiaowang
3   xiaohong
Time taken: 0.07 seconds, Fetched: 3 row(s)

删除表
外部表删除后，hdfs中的数据还在，但是metadata中stu_external的元数据已被删除

hive (default)> drop table stu;
OK
Time taken: 0.514 seconds
hive (default)> dfs -cat /student/student.txt;
1   xiaoming
2   xiaowang
3   xiaohong

相互转换

hive (default)> desc formatted test;
Table Type:             MANAGED_TABLE
hive (default)> alter table test set tblproperties('EXTERNAL'= 'TRUE');
hive (default)> desc formatted test;
Table Type:             EXTERNAL_TABLE

分区表

分区表实际上就是对应一个HDFS文件系统上的独立的文件夹，该文件夹下是该分区所有的数据文件。Hive中的分区就是分目录，把一个大的数据集根据业务需要分割成小的数据集。在查询时通过WHERE子句中的表达式选择查询所需要的指定的分区，这样的查询效率会提高很多。

hive (default)> create table part_month(
              > id int,name string,code int
              > )
              > partitioned by (month string)
              > row format delimited fields terminated by '\t';
OK

hive (default)> load data local inpath '/opt/module/data/part.txt' into table part_month partition(month='201908');
Loading data to table default.part_month partition (month=201908)
Partition default.part_month{month=201908} stats: [numFiles=1, numRows=0, totalSize=69, rawDataSize=0]
OK
Time taken: 0.727 seconds
hive (default)> load data local inpath '/opt/module/data/part.txt' into table part_month partition(month='201907');
Loading data to table default.part_month partition (month=201907)
Partition default.part_month{month=201907} stats: [numFiles=1, numRows=0, totalSize=69, rawDataSize=0]
OK
Time taken: 0.619 seconds

查询结果

hive (default)> select *from part_month where month='201907';
OK
part_month.id   part_month.name part_month.code part_month.month
10  ACCOUNTING  1700    201907
20  RESEARCH    1800    201907
30  SALES   1900    201907
40  OPERATIONS  1700    201907
Time taken: 0.209 seconds, Fetched: 4 row(s)

创建分区

Time taken: 0.052 seconds, Fetched: 3 row(s)
hive (default)> alter table part_month add partition(month='201706');
OK
Time taken: 0.57 seconds
hive (default)> alter table part_month add partition(month='201705')  partition(month='201704');
OK
Time taken: 0.623 seconds

删除分区

hive (default)> alter table part_month drop  partition(month='201704');
Dropped the partition month=201704
OK
Time taken: 0.383 seconds

查看分区数量

hive (default)> show partitions part_month;
OK
partition
month=201907
month=201908
month=201909
Time taken: 0.052 seconds, Fetched: 3 row(s)

修改

重命名表
ALTER TABLE table_name RENAME TO new_table_name

hive (default)> alter table test rename to tb_test;
OK
Time taken: 0.213 seconds

增加字段

hive (default)> alter table tb_test add columns(create_at timestamp);
OK
Time taken: 0.245 seconds
hive (default)> desc tb_test;

修改类型

hive (default)> alter table tb_test change column create_at create_time string;
OK
Time taken: 0.203 seconds

向表中装载数据（Load）

hive> load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];
（1）load data:表示加载数据
（2）local:表示从本地加载数据到hive表；否则从HDFS加载数据到hive表
（3）inpath:表示加载数据的路径
（4）overwrite:表示覆盖表中已有数据，否则表示追加
（5）into table:表示加载到哪张表
（6）student:表示具体的表
（7）partition:表示上传到指定分区
创建测试数据

1       xiaoming
2       xiaowang
3       xiaohong

创建表

create table student(id string, name string) row format delimited fields terminated by '\t';

加载测试数据

hive (default)> load data local inpath '/opt/module/data/student.txt' into table student;
Loading data to table default.student
Table default.student stats: [numFiles=1, totalSize=33]
OK
Time taken: 0.32 seconds

加载hdfs数据

➜  ~ hadoop fs -mkdir /test                  
➜  ~ hadoop fs -put /opt/module/data/student.txt /test

覆盖原先的数据

hive (default)> load data inpath '/test/student.txt' overwrite into table student;
Loading data to table default.student
Table default.student stats: [numFiles=1, numRows=0, totalSize=33, rawDataSize=0]
OK
Time taken: 0.509 second

清除表数据

Truncate只能删除管理表，不能删除外部表中数据

hive (default)> truncate table student;

大数据学习教程
Hadoop生态 Hadoop相关内容 Spark Spark系列教程 Hive Hive快速入门 Elastic...
hive 相关
hive 相关搭建hadoop和hive，mysql的环境，过程截图 1.hadoop install 2.m...
Hive 安装
安装 Hadoop 集群参考 Docker Hadoop 集群搭建下载 Hive Hive下载地址本实例下载版...
hadoop以及hive的日常使用
Attention 1. 使用java操作hadoop、hive时，所需要的jar包尽量与hadoop、hive的...
Hive学习
Hive学习标签（空格分隔）： Hadoop Hive 官网地址：https://hive.apache.org...
HIVE安装指南
for my honey! 1、前言安装hive前提是要先安装hadoop集群，并且hive只需要再hadoop...
大数据开发实战：离线大数据处理的主要技术--Hive,概念，SQ
1、Hive出现背景 Hive是Facebook开发并贡献给Hadoop开源社区的。它是建立在Hadoop体系架构...
数仓实战03：数仓搭建-ODS 层
1.创建数据库 1）启动hive[atguigu@hadoop102 hive]$ nohup bin/hive ...
Hive安装教程
先安装好hadoop,再安装Hive 下载地址Index of /dist/hive/hive-3.1.2 (ap...
Hive 基础搭建教程
需要安装Hadoop，教程：Hadoop 基础搭建教程需要了解Hive基本概念：Hive 基础知识 1. 相关依...

Hadoop——Hive

概述

安装

配置

Hadoop集群配置

设置数据库权限

创建Mysql数据库和权限

修改日志存储路径

hive启动

spark2.0问题

基本操作

加载本地数据

wordcount

Hive数据仓库位置配置

脚本执行

数据查看

HiveJDBC

数据类型

管理表

分区表

修改

向表中装载数据（Load）

清除表数据

相关文章

大数据学习教程

hive 相关

Hive 安装

hadoop以及hive的日常使用

Hive学习

HIVE安装指南

大数据开发实战：离线大数据处理的主要技术--Hive,概念，SQ

数仓实战03：数仓搭建-ODS 层

Hive安装教程

Hive 基础搭建教程

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读