
Deploying Hive and Integrating It with HBase and PySpark

Author: 大锤强无敌 | Published 2020-12-11 10:41

Deploying Hive and Integrating It with HBase

1. Software versions

hadoop: 2.8.3
hbase: 1.0.0
hive: 2.3.4
mysql: 5.7.24

2. Prepare MySQL

# MySQL stores only the metadata of Hive tables, not the table data itself

2.1 Create a database named hive

2.2 Set the database character set to latin1 and the collation to latin1_bin
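
In the MySQL client this amounts to the following (a minimal sketch; the zabbix user and the grant are assumptions matching the credentials used in hive-site.xml in section 3.2, and the GRANT ... IDENTIFIED BY form is MySQL 5.7 syntax):

-- Create the metastore database with latin1 encoding and binary collation
CREATE DATABASE hive CHARACTER SET latin1 COLLATE latin1_bin;
-- Grant access to the user Hive will connect as
GRANT ALL PRIVILEGES ON hive.* TO 'zabbix'@'localhost' IDENTIFIED BY 'root';
FLUSH PRIVILEGES;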

3. Hive configuration

I extracted the downloaded tar.gz package into /home/etc/, giving /home/etc/apache-hive-2.3.4/.

3.1 In /home/etc/apache-hive-2.3.4/conf/, copy hive-default.xml.template to hive-site.xml

cp hive-default.xml.template hive-site.xml

3.2 Edit the configuration

<!-- Find each of the following properties by name and change its value -->

<!-- MySQL connection URL -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
</property>

<!-- MySQL JDBC driver class name -->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>

<!-- MySQL username -->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>zabbix</value>
    <description>username to use against metastore database</description>
</property>

<!-- MySQL password -->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password to use against metastore database</description>
</property>

<!-- Local scratch directory for IO; must be created manually (see 3.3) -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/etc/apache-hive-2.3.4/iotmp</value>
    <description>Local scratch space for Hive jobs</description>
</property>

<!-- Query log location; same directory as above -->
<property>
    <name>hive.querylog.location</name>
    <value>/home/etc/apache-hive-2.3.4/iotmp</value>
    <description>Location of Hive run time structured log file</description>
</property>

<!-- Directory for downloaded resources; same directory as above -->
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/home/etc/apache-hive-2.3.4/iotmp</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
</property>

3.3 Create the iotmp directory

Create the iotmp directory inside the apache-hive-2.3.4 directory.
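
For example:

mkdir -p /home/etc/apache-hive-2.3.4/iotmp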

3.4 Configure environment variables

Edit ~/.bash_profile (the per-user environment variables):

# 1. Set HIVE_HOME
export HIVE_HOME=/home/etc/apache-hive-2.3.4
# 2. Add HIVE_HOME to the PATH environment variable
export PATH=$PATH:$HIVE_HOME/bin
# 3. Reload the environment variables
source ~/.bash_profile

3.5 Add the HBase jars and the MySQL driver jar

# 1. Put the MySQL driver jar into /home/etc/apache-hive-2.3.4/lib/
mysql-connector-java-5.1.9.jar
# 2. Copy the HBase jars into /home/etc/apache-hive-2.3.4/lib/
# From HBase's lib directory:
cp hbase-protocol-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-server-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-client-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-common-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-common-1.0.0-tests.jar /home/etc/apache-hive-2.3.4/lib/

3.6 Add the HBase settings to Hive

Edit hive-site.xml under /home/etc/apache-hive-2.3.4/conf/:

<!-- Add the jar paths (file:// URLs) to the auxiliary jars path -->
<property>
    <name>hive.reloadable.aux.jars.path</name>
    <value>
        file:///home/etc/apache-hive-2.3.4/lib/hive-hbase-handler-2.3.4.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-protocol-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-server-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-client-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-common-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-common-1.0.0-tests.jar,
        file:///home/etc/apache-hive-2.3.4/lib/zookeeper-3.4.6.jar,
        file:///home/etc/apache-hive-2.3.4/lib/guava-14.0.1.jar
    </value>
    <description>
      The locations of the plugin jars, which can be a comma-separated folders or jars. Jars can be renewed
      by executing reload command. And these jars can be used as the auxiliary classes like creating a UDF or SerDe.
    </description>
</property>
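
As the description above notes, jars on this path can be refreshed without restarting Hive; in the Hive shell this should just be:

reload;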

<!-- Prevents this startup error: MissingTableException: Required table missing : "VERSION" in Catalog ""
Schema "". DataNucleus requires this table to perform its persistence operations -->
<property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
    <description>Auto creates necessary schema on a startup if one doesn't exist</description>
</property>
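
Alternatively, instead of relying on auto-creation, the metastore schema can be initialized once by hand with the schematool that ships with Hive (using the MySQL connection settings configured above):

schematool -dbType mysql -initSchema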

4. Start Hive

4.1 Interactive shell

Start an interactive Hive shell by typing hive anywhere (HIVE_HOME/bin is on the PATH). Two useful settings:

set hive.cli.print.current.db=true;    -- show the current database in the prompt
set hive.cli.print.header=true;        -- print column names with query results

4.2 Start the Hive service

From the /home/etc/apache-hive-2.3.4/bin/ directory:

hiveserver2 -hiveconf hive.root.logger=DEBUG,console

Or start it in the background:

hiveserver2 1>/dev/null 2>&1 &
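
Once hiveserver2 is up, a quick way to check it is beeline, the JDBC client bundled with Hive (a sketch: 10000 is the default hiveserver2 port, and the user matches the one in hive-site.xml):

beeline -u jdbc:hive2://localhost:10000 -n zabbix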

5. Creating Hive tables

5.1 Create the HBase table together with the Hive table

Run this in the interactive Hive shell:

# tbl_name is the table name in Hive
# STORED BY specifies the storage handler
# hbase.columns.mapping declares the column family and column names
# hbase.table.name declares the HBase table name; it is optional and defaults to the Hive table name
# hbase.mapred.output.outputtable names the table written to on insert; set it if you will insert into this table later
# key is the rowkey, cf1 the column family, val the column
CREATE TABLE tbl_name(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "tbl_name", "hbase.mapred.output.outputtable" = "tbl_name"); 
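
To confirm that writes reach HBase, one can insert a row from Hive and look it up in the hbase shell (a sketch with made-up sample values):

-- in the Hive shell
INSERT INTO TABLE tbl_name VALUES (1, 'hello');
-- then, in the hbase shell, the row should appear under rowkey 1, column cf1:val:
-- scan 'tbl_name'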

5.2 Create a Hive table over an existing HBase table

Run this in the interactive Hive shell:

# An HBase table named JIAYUAN already exists, with one column family, body
# Its columns are userId, url, deviceId, type, platId, channelId, citycode, req, res, time
# EXTERNAL marks this as an external table backed by HBase
# jiayuan_table1 is the table name in Hive

CREATE EXTERNAL TABLE jiayuan_table1 (
rowkey string,
userId string,
url string,
deviceId string,
type string,
platId string,
channelId string,
citycode string,
req string,
res string,
time string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,body:userId,body:url,body:deviceId,body:type,body:platId,body:channelId,body:citycode,body:req,body:res,body:time")
TBLPROPERTIES ("hbase.table.name" = "JIAYUAN");

Inspect the data

show tables;    -- list all tables

select * from jiayuan_table1 limit 1;    -- fetch one row from the table

hive> select * from jiayuan_table1 limit 1;
OK
8c64a436-afb7-4d91-8e24-726e7437dc791604887326065       8c64a436-afb7-4d91-8e24-726e7437dc79    /house/search-keyword           GET     KM      ios     NULL    {"searchType":"forsale","cityCode":"530100","keyword":"j"}   {"code":"hlsp-err-02","data":[],"errMsg":"关键词字数不少于2个字"}       1604887326065
Time taken: 3.282 seconds, Fetched: 1 row(s)

6. Connecting PySpark to Hive

6.1 Copy the handler jar

Find hive-hbase-handler-2.3.4.jar in /home/etc/apache-hive-2.3.4/lib/
and put it into the jars directory under the Spark installation.
Otherwise, reading Hive table data fails with:
# java.lang.ClassNotFoundException: org.apache.hadoop.hive.hbase.HBaseStorageHandler

6.2 Edit the configuration file

Edit /home/etc/apache-hive-2.3.4/conf/hive-site.xml:

<!-- host is the address of the Hive server; 9083 is the default metastore port -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://host:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>

6.3 Copy the configuration file

Copy /home/etc/apache-hive-2.3.4/conf/hive-site.xml
into the conf folder under the Spark installation.

6.4 Start the services

# 1. Start the metastore service
hive --service metastore
# 2. Start the Hive service
hiveserver2 -hiveconf hive.root.logger=DEBUG,console
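
Both can also be pushed to the background, mirroring the pattern from 4.2:

nohup hive --service metastore 1>/dev/null 2>&1 &
nohup hiveserver2 1>/dev/null 2>&1 &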

6.5 Access from PySpark

from pyspark.sql import SparkSession

# config sets the metastore URI configured in 6.2
# enableHiveSupport turns on Hive support
spark = SparkSession\
    .builder\
    .appName('map-search-list')\
    .config("hive.metastore.uris", "thrift://128.196.216.16:9083")\
    .enableHiveSupport()\
    .getOrCreate()
 
# List all tables
spark.sql('show tables').show()
hive_database = "default"             # database to operate on
hive_table = "jiayuan_table1"         # table to operate on
hive_read_sql = "select * from {}.{}".format(hive_database, hive_table)
df = spark.sql(hive_read_sql) #default.jiayuan_table1
df.show()
print('Finished reading from Hive')
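
A script like the above can be run with spark-submit (the file name hive_read.py is hypothetical):

spark-submit hive_read.py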
