Installing and deploying Spark
1. Edit spark-env.sh
My configuration:
export SPARK_HOME=/home/hadoop/spark-2.4.3-bin-hadoop2.7
export SCALA_HOME=/usr/local/scala-2.11.8
export JAVA_HOME=/usr/local/java
export HADOOP_HOME=/home/hadoop/hadoop-2.7.4
export SPARK_CLASSPATH=$SPARK_CLASSPATH:hdfs://leo/jar:hdfs://leo/datas
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_CORES=2
export SPARK_LOCAL_DIRS=/home/hadoop/spark-2.4.3-bin-hadoop2.7
export SPARK_DRIVER_MEMORY=4G
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
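spark-env.sh does not exist in a freshly unpacked distribution; it is created from the template shipped in the conf directory. A minimal sketch, assuming Spark was unpacked to /home/hadoop/spark-2.4.3-bin-hadoop2.7:
cd /home/hadoop/spark-2.4.3-bin-hadoop2.7/conf
# copy the shipped template, then add the exports above
cp spark-env.sh.template spark-env.sh
vi spark-env.sh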
2. Edit the slaves file
node1
node2
node3
node4
node5
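The slaves file is likewise created from its template; each line holds the hostname of one worker. A sketch, assuming the same install directory:
cd /home/hadoop/spark-2.4.3-bin-hadoop2.7/conf
cp slaves.template slaves
# one worker hostname per line (node1 through node5 here)
vi slaves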
3. Distribute Spark to the other nodes
scp -r spark-2.4.3-bin-hadoop2.7/ hadoop@node2:/home/hadoop/spark-2.4.3-bin-hadoop2.7/
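To push the same directory to every node in the slaves file in one pass, a simple loop works; a sketch assuming passwordless SSH for the hadoop user and an identical /home/hadoop layout on every node:
# copy the unpacked Spark directory to the remaining nodes
for host in node2 node3 node4 node5; do
  scp -r /home/hadoop/spark-2.4.3-bin-hadoop2.7 hadoop@${host}:/home/hadoop/
done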
4. Add the Spark environment variables
I configured the Spark environment variables only on node1:
export SPARK_HOME=/home/hadoop/spark-2.4.3-bin-hadoop2.7/
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
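These two lines go into the hadoop user's shell profile on node1 and take effect after the profile is reloaded; a sketch assuming ~/.bashrc is used:
# append the Spark variables to the login shell and reload it
echo 'export SPARK_HOME=/home/hadoop/spark-2.4.3-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc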
5. Run the official Spark example to test submitting a Spark job to YARN
cd /home/hadoop/spark-2.4.3-bin-hadoop2.7
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 1G --executor-memory 1G --executor-cores 1 examples/jars/spark-examples_2.11-2.4.3.jar 40
- Check the job status in the YARN web UI
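In cluster deploy mode the "Pi is roughly ..." line is printed by the driver inside its YARN container, so besides the web UI it can be pulled from the aggregated logs after the application finishes; a sketch assuming YARN log aggregation is enabled, with <application_id> being the ID printed by spark-submit:
# fetch the aggregated container logs and look for the SparkPi result
yarn logs -applicationId <application_id> | grep "Pi is roughly"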

6. Using spark-sql
spark-sql
spark-sql --master yarn --deploy-mode client --executor-memory 1G --num-executors 10
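A quick sanity check is to run a single statement non-interactively with the -e option; a sketch with arbitrary small resource settings:
# run one SQL statement on YARN and exit
spark-sql --master yarn --deploy-mode client --executor-memory 1G --num-executors 2 -e "show databases;"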
7. Configure pyspark
- For compiling and installing Python 3.6, see the earlier post
- Edit spark-env.sh
# export the path of the Python installation
export PYSPARK_DRIVER_PYTHON=/usr/bin/python
export PYSPARK_PYTHON=/usr/bin/python
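If pyspark should use the Python 3.6 built earlier rather than the system /usr/bin/python, point both variables at that interpreter instead; a sketch assuming a hypothetical install prefix of /usr/local/python3:
# hypothetical path to the compiled Python 3.6, used by both the driver and the executors
export PYSPARK_DRIVER_PYTHON=/usr/local/python3/bin/python3.6
export PYSPARK_PYTHON=/usr/local/python3/bin/python3.6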
- Create a Python file test.py
from pyspark import SparkConf, SparkContext

# name the application and create the context
conf = SparkConf().setAppName("First_App")
sc = SparkContext(conf=conf)

# count the lines of the input file on HDFS and print the result
count = sc.textFile("/test/test.txt").count()
print(count)
sc.stop()
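test.py reads /test/test.txt from HDFS, so the file has to exist there before the job is submitted; a sketch assuming a local sample file named test.txt:
# create the HDFS directory and upload a sample input file
hdfs dfs -mkdir -p /test
hdfs dfs -put test.txt /test/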
- Submit with spark-submit
spark-submit --master yarn --deploy-mode cluster --executor-memory 2G --num-executors 5 --executor-cores 2 --driver-memory 2G test.py
- Screenshot of a successful run

Summary
Deploying Spark in spark-on-yarn mode means the Spark standalone cluster does not need to be started; once a job is submitted with spark-submit, YARN handles resource scheduling. If there are any mistakes in this article, please point them out.