Using Spark SQL to Connect to Hive
Note: while setting up the Hive data warehouse you will see the warning below. Roughly, it means that Hive 1.x uses the Hive-on-MR execution mode by default; since MapReduce is a disk-I/O-based computation framework, it may not satisfy certain workloads, so from Hive 2 onward Hive-on-MR is deprecated and a memory-based execution engine such as Spark or Tez is recommended instead.
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
This chapter focuses on how to use Spark SQL to connect to Hive and manage the data warehouse.
Preparing the Basic Spark Environment
Spark also follows a master/slave layout. By default the Spark master is installed on master1, and the Spark workers (slaves) on slave2 and slave3.
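The Ansible commands in this section reference an inventory file test.txt. As a rough, assumed sketch (hostnames only, adjust to your environment), it could look like this:
# test.txt -- assumed Ansible inventory layout for this walkthrough
[hadoop]
master1
slave1
slave2
slave3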
# Copy the Spark package
$ ansible -i test.txt 'hadoop*' -m copy -a "src=pkgs/spark-2.2.1-bin-hadoop2.7.tgz dest=/export/hdfs/" -s
$ ansible -i test.txt slave1 -m shell -a "cd /export/hdfs/;tar -zxf spark-2.2.1-bin-hadoop2.7.tgz -C servers/ ;ln -s /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7 /export/hdfs/servers/spark" -s
# Set up the Spark environment
$ export SPARK_HOME=/export/hdfs/servers/spark-2.2.1-bin-hadoop2.7
$ export PATH=${SPARK_HOME}/bin:$PATH
$ cd ${SPARK_HOME}/conf
$ cp spark-env.sh.template spark-env.sh
$ vim spark-env.sh
export JAVA_HOME=/export/hdfs/servers/jdk1.8.0_60
export HADOOP_HOME=/export/hdfs/servers/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_HOME=/export/hdfs/servers/spark-2.2.1-bin-hadoop2.7
SPARK_MASTER_IP=master1
SPARK_LOCAL_DIRS=/export/hdfs/servers/spark
SPARK_DRIVER_MEMORY=1G
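A side note: the export SPARK_HOME / export PATH commands run at the top of this step only last for the current shell session. If you want them to persist across logins, one option (a sketch only; the exact profile file is up to you) is:
# Persist the Spark environment variables (optional; file path is just one common choice)
$ cat >> /etc/profile.d/spark.sh <<'EOF'
export SPARK_HOME=/export/hdfs/servers/spark-2.2.1-bin-hadoop2.7
export PATH=${SPARK_HOME}/bin:$PATH
EOF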
$ cat /export/hdfs/servers/spark/conf/slaves
slave2
slave3
# Copy the Spark directory configured above to the slave2 and slave3 nodes
scp -rp spark-2.2.1-bin-hadoop2.7 slave2:/export/hdfs/servers/
scp -rp spark-2.2.1-bin-hadoop2.7 slave3:/export/hdfs/servers/
# Configure YARN memory (set in yarn-site.xml)
yarn.nodemanager.resource.memory-mb
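For reference, this property goes into yarn-site.xml on each NodeManager as shown below; the 4096 MB value is only an assumed example, size it according to the node's physical memory:
<!-- yarn-site.xml: total memory YARN may hand out to containers on this node (example value) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>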
# Start Spark
# This starts the Master process on the master node and a Worker process on each slave
$ /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
slave3: starting org.apache.spark.deploy.worker.Worker, logging to /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave3.out
slave2: starting org.apache.spark.deploy.worker.Worker, logging to /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-slave2.out
# Check the processes (the master node now runs a Master process, each slave a Worker process)
[root@master1 servers]# jps
32049 SecondaryNameNode
31847 NameNode
1160 Master
16648 ResourceManager
1245 Jps
[root@slave2 ~]# jps
25557 Worker
25654 Jps
24873 DataNode
16299 NodeManager
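Besides jps, you can check the standalone master's web UI (port 8080 by default) to confirm that both workers have registered, for example with a quick grep (assuming master1 resolves from where you run it):
# Count registered workers via the master web UI (expect 2 here)
$ curl -s http://master1:8080/ | grep -c "worker-"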
Testing the Spark Environment
# Log in to the Spark master node
/export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/bin/run-example SparkPi 10
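Depending on your spark-defaults.conf, run-example may execute locally; to exercise the standalone cluster explicitly, you can submit the same example through spark-submit (the examples jar path below matches the 2.2.1 binary layout, adjust if yours differs):
# Submit SparkPi to the standalone cluster (jar name per the 2.2.1 distribution)
$ ${SPARK_HOME}/bin/spark-submit --master spark://master1:7077 \
    --class org.apache.spark.examples.SparkPi \
    ${SPARK_HOME}/examples/jars/spark-examples_2.11-2.2.1.jar 10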
Testing Spark SQL
Note: the Hive metastore service must be running.
Starting the Hive metastore service
# Hive needs its metastore service started
[root@slave1 hive]# nohup ./bin/hive --service metastore > hive_metastore.log &
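A simple way to confirm the metastore came up is to check that it is listening on its default Thrift port 9083:
# Verify the metastore Thrift service is listening (default port 9083)
[root@slave1 hive]# netstat -ntlp | grep 9083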
Modifying the Spark configuration
$ cat /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/conf/hive-site.xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.0.2:9083</value>
  </property>
</configuration>
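With hive-site.xml in place you can do a quick sanity check from spark-shell before moving on; the pre-built spark-2.2.1-bin-hadoop2.7 package ships with Hive support, so spark.sql should see the metastore directly:
# Quick check that Spark can reach the Hive metastore
$ ${SPARK_HOME}/bin/spark-shell --master spark://master1:7077
scala> spark.sql("show databases").show()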
# Run spark-sql to connect to Hive
$ /export/hdfs/servers/spark-2.2.1-bin-hadoop2.7/bin/spark-sql --master spark://master1:7077 --executor-memory 1g
......
.......
spark-sql> show databases;
18/11/09 17:40:12 INFO execution.SparkSqlParser: Parsing command: show databases
default
test
Time taken: 0.03 seconds, Fetched 2 row(s)
18/11/09 17:40:12 INFO CliDriver: Time taken: 0.03 seconds, Fetched 2 row(s)
spark-sql> select * from test.appinfo;
.......
.......
data-web p0 bgbiao ops1 ["10.0.0.1","10.0.0.2"]
data-api p0 biaoge sre1 ["192.168.0.1","192.168.0.2"]
data-models p1 xxbandy sre1 ["10.0.0.3","192.168.0.3"]
....
## While spark-sql is still running (not exited), you can visit the Spark UI to view the job details
http://192.168.0.1:4040/jobs/
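Besides the interactive shell, spark-sql can also run a statement non-interactively with -e (or a script file with -f), which is convenient for scheduled jobs; the query below is just an illustration against the table shown above:
# Run a single statement non-interactively (illustrative query)
$ ${SPARK_HOME}/bin/spark-sql --master spark://master1:7077 --executor-memory 1g \
    -e "select count(*) from test.appinfo"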