- Install Hive + Sqoop + MySQL + Spark
- Import system_logs.sql into MySQL, then use Sqoop to import the data from MySQL into Hive
- Use Spark to read the data from Hive and complete the following requirements:
- Use Spark to compute the TOP 5 IPs among records whose params field is null and among those whose params field is not null
- Use Spark to find the date with the most visits in each month
1. Install Hive + Sqoop + MySQL + Spark
- The installation of MySQL is not covered here.
- Install Hive
wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.1/apache-hive-3.1.1-bin.tar.gz
tar zxvf apache-hive-3.1.1-bin.tar.gz
mv apache-hive-3.1.1-bin hive
# modify the configuration
cp hive-default.xml.template hive-site.xml
# download the MySQL JDBC driver jar
cp mysql-connector-java-5.1.46-bin.jar hive/lib
Edit hive-site.xml:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://39.96.19.70:8806/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>jie</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>1</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>
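Depending on the setup, the metastore schema in MySQL may also need to be initialized once before Hive is started for the first time. Hive ships a schematool for this; a minimal sketch, run from the hive directory:
# initialize the metastore schema in MySQL (only needed once)
./bin/schematool -dbType mysql -initSchema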
- Install Sqoop:
wget http://mirror.bit.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
tar zxf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
mv sqoop-1.4.7.bin__hadoop-2.6.0 sqoop
Edit sqoop-env.sh (entries for components that are not installed can stay commented out):
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/opt/bigdata/hadoop/hadoop-2.7.3
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/opt/bigdata/hadoop/hadoop-2.7.3
#set the path to where bin/hbase is available
#export HBASE_HOME=
#Set the path to where bin/hive is available
export HIVE_HOME=/opt/bigdata/hadoop/hive
#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
Test:
sqoop list-databases --connect jdbc:mysql://172.17.41.83:8806/bdp --username jie --password 1
- Install Spark
wget http://mirror.bit.edu.cn/apache/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar zxf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 spark
vim spark/conf/spark-env.sh
Edit spark-env.sh:
JAVA_HOME=/opt/bigdata/hadoop/jdk1.8.0_191
SCALA_HOME=/opt/bigdata/hadoop/scala
HADOOP_HOME=/opt/bigdata/hadoop/hadoop-2.7.3
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=master
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8180
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8181
SPARK_WORKER_INSTANCES=1
#export SPARK_DIST_CLASSPATH=$(/opt/bigdata/hadoop/hadoop-2.7.3/bin/hadoop classpath)
vim spark-defaults.conf:
spark.master spark://master:7077
vim slaves:
slave1
slave2
# copy the Hive configuration into spark/conf
cp hive/conf/hive-site.xml spark/conf/hive-site.xml
# copy the MySQL JDBC jar into spark/jars
cp hive/lib/mysql-connector-java-5.1.46-bin.jar spark/jars
Run a test: ./bin/run-example SparkPi
![](https://img.haomeiwen.com/i4180122/8bb8a1721c644aa1.png)
![](https://img.haomeiwen.com/i4180122/5d5a7ff415e0dbf6.png)
- Export environment variables
export JAVA_HOME=/opt/bigdata/hadoop/jdk1.8.0_191
export HIVE_HOME=/opt/bigdata/hadoop/hive
export SQOOP_HOME=/opt/bigdata/hadoop/sqoop
export SPARK_HOME=/opt/bigdata/hadoop/spark
export HADOOP_HOME=/opt/bigdata/hadoop/hadoop-2.7.3
export HADOOP_COMMON_HOME=/opt/bigdata/hadoop/hadoop-2.7.3
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$SQOOP_HOME/bin:$SPARK_HOME/bin:$PATH
- Start the environment
./hadoop-2.7.3/sbin/start-all.sh
./spark/sbin/start-all.sh
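To confirm that the daemons came up, jps on each node lists the running Java processes; the exact set depends on the configuration, roughly:
# on the master: NameNode, SecondaryNameNode, ResourceManager, Master
# on the workers: DataNode, NodeManager, Worker
jps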
2. Import system_logs.sql into MySQL, then use Sqoop to import the data from MySQL into Hive
- Upload the SQL file to the cloud host
rz -E
- Connect to the database: mysql -h172.17.41.83 -P8806 -ujie -p1
-- create the bdp database
create database bdp;
use bdp;
-- import the SQL dump
source system_logs.sql
- Use Sqoop to import the data into Hive (the data lands in HDFS first and is loaded into a Hive table below)
sqoop import --connect jdbc:mysql://39.96.19.70:8806/bdp --username jie --password 1 --table system_logs \
--target-dir /bigdata/data/sqoop/system_logs \
--delete-target-dir \
--fields-terminated-by '\t' \
--direct
![](https://img.haomeiwen.com/i4180122/5d64775197f2046f.png)
Inspect the imported data: hadoop fs -cat /bigdata/data/sqoop/system_logs/* | head
![](https://img.haomeiwen.com/i4180122/9206bc0b1a4e3fd6.png)
Open Hive and load the data:
create database bdp;
drop table if exists bdp.system_logs;
CREATE TABLE bdp.system_logs (
id int,
ip string,
username string,
visitrecord string,
visittime TIMESTAMP,
method string,
params string,
clusterid int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;
load data inpath '/bigdata/data/sqoop/system_logs' into table bdp.system_logs ;
select * from bdp.system_logs limit 10;
![](https://img.haomeiwen.com/i4180122/c40ec580b14940de.png)
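As a side note, the HDFS import plus CREATE TABLE / LOAD DATA sequence above can usually be collapsed into a single Sqoop command using its Hive options. A sketch, not the commands used in this walkthrough:
sqoop import --connect jdbc:mysql://39.96.19.70:8806/bdp --username jie --password 1 \
--table system_logs \
--hive-import --hive-table bdp.system_logs \
--fields-terminated-by '\t'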
3. Use Spark for the computation
- Open spark-shell. The first attempt failed to start; the cause was insufficient memory.
Reopen it after increasing the memory:
![](https://img.haomeiwen.com/i4180122/78519758f1f7ad60.png)
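The fix here was to give the machine more memory. If the limit is on the Spark side instead, memory can also be raised when launching the shell; the sizes below are illustrative, not the values used in this run:
./spark/bin/spark-shell --driver-memory 2g --executor-memory 1g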
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
// Use Spark to compute the TOP 5 IPs where the params field is null and where it is not null
hiveCtx.sql("select distinct ip from bdp.system_logs where params = 'NULL' limit 5").collect().foreach(println)
hiveCtx.sql("select distinct ip from bdp.system_logs where params != 'NULL' limit 5").collect().foreach(println)
// Use Spark to find the date with the most visits in each month
hiveCtx.sql("select date_format(visittime, 'y-MM') as m, day(visittime) as d, count(ip) as c from bdp.system_logs group by m,d order by m,c desc limit 100").dropDuplicates(Seq("m")).show()
An error was reported:
![](https://img.haomeiwen.com/i4180122/1dc720e4616a7d87.png)
Step 1: run the command to leave safe mode: hadoop dfsadmin -safemode leave
Step 2: run a health check and delete the corrupted blocks: hdfs fsck / -delete
![](https://img.haomeiwen.com/i4180122/06632f7a03a447e0.png)
![](https://img.haomeiwen.com/i4180122/9e5ff52134fdefd9.png)
Appendix: how to bring up Hadoop & Spark after a reboot
Start the three containers:
![](https://img.haomeiwen.com/i4180122/295bfa836f20cf35.png)
Enter each of the three containers and start the sshd service, for example:
![](https://img.haomeiwen.com/i4180122/f81cecb881bd2038.png)
Enter the master container and start Hadoop:
![](https://img.haomeiwen.com/i4180122/f270f651ed20591e.png)
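For reference, a minimal sketch of those steps as commands, assuming the three nodes are Docker containers named master, slave1 and slave2 and that sshd lives at /usr/sbin/sshd inside each image (both are assumptions, not taken from the screenshots):
# start the three containers (names assumed)
docker start master slave1 slave2
# start sshd inside each container (path to sshd assumed)
docker exec master /usr/sbin/sshd
docker exec slave1 /usr/sbin/sshd
docker exec slave2 /usr/sbin/sshd
# enter the master container and start Hadoop and Spark
docker exec -it master bash
cd /opt/bigdata/hadoop
./hadoop-2.7.3/sbin/start-all.sh
./spark/sbin/start-all.sh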