
Deploying Hive and Integrating It with HBase and PySpark

Author: 大锤强无敌 | Published 2020-12-11 10:41

Deploying Hive and Integrating It with HBase

1. Software versions

hadoop: 2.8.3
hbase: 1.0.0
hive: 2.3.4
mysql: 5.7.24

2. Prepare MySQL

# MySQL stores only the metadata of Hive tables, not the table data itself

2.1 Create a database named hive

2.2 Set the database character set to latin1 and the collation to latin1_bin
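
In the MySQL client this amounts to the following (a minimal sketch; the zabbix user and the grant are assumptions matching the credentials used in hive-site.xml in section 3.2, and the GRANT ... IDENTIFIED BY form is MySQL 5.7 syntax):

-- Create the metastore database with latin1 encoding and binary collation
CREATE DATABASE hive CHARACTER SET latin1 COLLATE latin1_bin;
-- Grant access to the user Hive will connect as
GRANT ALL PRIVILEGES ON hive.* TO 'zabbix'@'localhost' IDENTIFIED BY 'root';
FLUSH PRIVILEGES;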

3. Hive configuration

I extracted the downloaded tar.gz package into /home/etc/, giving /home/etc/apache-hive-2.3.4/.

3.1 In /home/etc/apache-hive-2.3.4/conf/, copy hive-default.xml.template to hive-site.xml

cp hive-default.xml.template hive-site.xml

3.2 Edit the configuration

<!-- Find each of the following properties by name and change its value -->

<!-- MySQL connection URL -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
</property>

<!-- MySQL JDBC driver class name -->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>

<!-- MySQL username -->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>zabbix</value>
    <description>username to use against metastore database</description>
</property>

<!-- MySQL password -->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password to use against metastore database</description>
</property>

<!-- Local scratch directory for IO; must be created manually (see 3.3) -->
<property>
    <name>hive.exec.local.scratchdir</name>
    <value>/home/etc/apache-hive-2.3.4/iotmp</value>
    <description>Local scratch space for Hive jobs</description>
</property>

<!-- Query log location; same directory as above -->
<property>
    <name>hive.querylog.location</name>
    <value>/home/etc/apache-hive-2.3.4/iotmp</value>
    <description>Location of Hive run time structured log file</description>
</property>

<!-- Directory for downloaded resources; same directory as above -->
<property>
    <name>hive.downloaded.resources.dir</name>
    <value>/home/etc/apache-hive-2.3.4/iotmp</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
</property>

3.3 Create the iotmp directory

Create the iotmp directory inside the apache-hive-2.3.4 directory.
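
For example:

mkdir -p /home/etc/apache-hive-2.3.4/iotmp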

3.4 Configure environment variables

Edit ~/.bash_profile (the per-user environment variables):

# 1. Set HIVE_HOME
export HIVE_HOME=/home/etc/apache-hive-2.3.4
# 2. Add HIVE_HOME to the PATH environment variable
export PATH=$PATH:$HIVE_HOME/bin
# 3. Reload the environment variables
source ~/.bash_profile

3.5 Add the HBase jars and the MySQL driver jar

# 1. Put the MySQL driver jar into /home/etc/apache-hive-2.3.4/lib/
mysql-connector-java-5.1.9.jar
# 2. Copy the HBase jars into /home/etc/apache-hive-2.3.4/lib/
# From HBase's lib directory:
cp hbase-protocol-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-server-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-client-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-common-1.0.0.jar /home/etc/apache-hive-2.3.4/lib/
cp hbase-common-1.0.0-tests.jar /home/etc/apache-hive-2.3.4/lib/

3.6 Add the HBase settings to Hive

Edit hive-site.xml under /home/etc/apache-hive-2.3.4/conf/:

<!-- Add the jar paths (file:// URLs) to the auxiliary jars path -->
<property>
    <name>hive.reloadable.aux.jars.path</name>
    <value>
        file:///home/etc/apache-hive-2.3.4/lib/hive-hbase-handler-2.3.4.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-protocol-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-server-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-client-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-common-1.0.0.jar,
        file:///home/etc/apache-hive-2.3.4/lib/hbase-common-1.0.0-tests.jar,
        file:///home/etc/apache-hive-2.3.4/lib/zookeeper-3.4.6.jar,
        file:///home/etc/apache-hive-2.3.4/lib/guava-14.0.1.jar
    </value>
    <description>
      The locations of the plugin jars, which can be a comma-separated folders or jars. Jars can be renewed
      by executing reload command. And these jars can be used as the auxiliary classes like creating a UDF or SerDe.
    </description>
</property>
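
As the description above notes, jars on this path can be refreshed without restarting Hive; in the Hive shell this should just be:

reload;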

<!-- Prevents this startup error: MissingTableException: Required table missing : "VERSION" in Catalog ""
Schema "". DataNucleus requires this table to perform its persistence operations -->
<property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
    <description>Auto creates necessary schema on a startup if one doesn't exist</description>
</property>
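
Alternatively, instead of relying on auto-creation, the metastore schema can be initialized once by hand with the schematool that ships with Hive (using the MySQL connection settings configured above):

schematool -dbType mysql -initSchema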

4. Start Hive

4.1 Interactive shell

Start an interactive Hive shell by typing hive anywhere (HIVE_HOME/bin is on the PATH). Two useful settings:

set hive.cli.print.current.db=true;    -- show the current database in the prompt
set hive.cli.print.header=true;        -- print column names with query results

4.2 Start the Hive service

From the /home/etc/apache-hive-2.3.4/bin/ directory:

hiveserver2 -hiveconf hive.root.logger=DEBUG,console

Or start it in the background:

hiveserver2 1>/dev/null 2>&1 &
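
Once hiveserver2 is up, a quick way to check it is beeline, the JDBC client bundled with Hive (a sketch: 10000 is the default hiveserver2 port, and the user matches the one in hive-site.xml):

beeline -u jdbc:hive2://localhost:10000 -n zabbix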

5. Creating Hive tables

5.1 Create the HBase table together with the Hive table

Run this in the interactive Hive shell:

# tbl_name is the table name in Hive
# STORED BY specifies the storage handler
# hbase.columns.mapping declares the column family and column names
# hbase.table.name declares the HBase table name; it is optional and defaults to the Hive table name
# hbase.mapred.output.outputtable names the table written to on insert; set it if you will insert into this table later
# key is the rowkey, cf1 the column family, val the column
CREATE TABLE tbl_name(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'   
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "tbl_name", "hbase.mapred.output.outputtable" = "tbl_name"); 
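
To confirm that writes reach HBase, one can insert a row from Hive and look it up in the hbase shell (a sketch with made-up sample values):

-- in the Hive shell
INSERT INTO TABLE tbl_name VALUES (1, 'hello');
-- then, in the hbase shell, the row should appear under rowkey 1, column cf1:val:
-- scan 'tbl_name'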

5.2 Create a Hive table over an existing HBase table

Run this in the interactive Hive shell:

# An HBase table named JIAYUAN already exists, with one column family, body
# Its columns are userId, url, deviceId, type, platId, channelId, citycode, req, res, time
# EXTERNAL marks this as an external table backed by HBase
# jiayuan_table1 is the table name in Hive

CREATE EXTERNAL TABLE jiayuan_table1 (
rowkey string,
userId string,
url string,
deviceId string,
type string,
platId string,
channelId string,
citycode string,
req string,
res string,
time string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,body:userId,body:url,body:deviceId,body:type,body:platId,body:channelId,body:citycode,body:req,body:res,body:time")
TBLPROPERTIES ("hbase.table.name" = "JIAYUAN");

Inspect the data

show tables;    -- list all tables

select * from jiayuan_table1 limit 1;    -- fetch one row from the table

hive> select * from jiayuan_table1 limit 1;
OK
8c64a436-afb7-4d91-8e24-726e7437dc791604887326065       8c64a436-afb7-4d91-8e24-726e7437dc79    /house/search-keyword           GET     KM      ios     NULL    {"searchType":"forsale","cityCode":"530100","keyword":"j"}   {"code":"hlsp-err-02","data":[],"errMsg":"关键词字数不少于2个字"}       1604887326065
Time taken: 3.282 seconds, Fetched: 1 row(s)

6. Connecting PySpark to Hive

6.1 Copy the handler jar

Find hive-hbase-handler-2.3.4.jar in /home/etc/apache-hive-2.3.4/lib/
and put it into the jars directory under the Spark installation.
Otherwise, reading Hive table data fails with:
# java.lang.ClassNotFoundException: org.apache.hadoop.hive.hbase.HBaseStorageHandler

6.2 Edit the configuration file

Edit /home/etc/apache-hive-2.3.4/conf/hive-site.xml:

<!-- host is the address of the Hive server; 9083 is the default metastore port -->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://host:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>

6.3 Copy the configuration file

Copy /home/etc/apache-hive-2.3.4/conf/hive-site.xml
into the conf folder under the Spark installation.

6.4 Start the services

# 1. Start the metastore service
hive --service metastore
# 2. Start the Hive service
hiveserver2 -hiveconf hive.root.logger=DEBUG,console
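
Both can also be pushed to the background, mirroring the pattern from 4.2:

nohup hive --service metastore 1>/dev/null 2>&1 &
nohup hiveserver2 1>/dev/null 2>&1 &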

6.5 Access from PySpark

from pyspark.sql import SparkSession

# config sets the metastore URI configured in 6.2
# enableHiveSupport turns on Hive support
spark = SparkSession\
    .builder\
    .appName('map-search-list')\
    .config("hive.metastore.uris", "thrift://128.196.216.16:9083")\
    .enableHiveSupport()\
    .getOrCreate()
 
# List all tables
spark.sql('show tables').show()
hive_database = "default"             # database to operate on
hive_table = "jiayuan_table1"         # table to operate on
hive_read_sql = "select * from {}.{}".format(hive_database, hive_table)
df = spark.sql(hive_read_sql) #default.jiayuan_table1
df.show()
print('Finished reading from Hive')
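
A script like the above can be run with spark-submit (the file name hive_read.py is hypothetical):

spark-submit hive_read.py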
