Problem
I want to run a Spark job on a schedule to do data synchronization.
I set up a crontab entry to launch spark-submit at a fixed time. When the time came, only the wrapper script's own log appeared and the Spark job never ran. I tried replacing every path in the script with an absolute path, but it still did not work.
Cause
cron runs jobs with its own, much smaller set of environment variables, so commands in crontab should use absolute paths wherever possible.
The environment variables the job depends on also need to be set explicitly: they have to be exported again inside the script itself.
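A quick way to confirm this is to dump the environment cron actually provides and compare it with an interactive shell. This is only a diagnostic sketch; the output path /tmp/cron_env.txt is an illustrative choice, not from the original post:
# hypothetical one-off crontab entry: write cron's environment to a file every minute
* * * * * env > /tmp/cron_env.txt
# then, from a login shell, compare the two environments
diff <(sort /tmp/cron_env.txt) <(env | sort)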
Solution
As the script below shows, every environment variable the Spark deployment depends on must be exported before crontab runs the spark-submit command; exporting PATH alone is not enough.
Script:
#!/bin/bash
# export the environment variables that Spark and its dependencies need
export JAVA_HOME=/opt/jdk1.8.0_161
export SCALA_HOME=/opt/scala-2.11.11
export HADOOP_HOME=/opt/hadoop-2.7.6
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_YARN_USER_ENV=${HADOOP_CONF_DIR}
export SPARK_HOME=/opt/spark-2.3.0-bin-hadoop2.7
export HIVE_HOME=/opt/hive-2.3.3-bin
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export PYTHON_HOME=/usr/local/python-3.6.5
export PATH=/opt/jdk1.8.0_161/bin:/opt/scala-2.11.11/bin:/opt/hadoop-2.7.6/bin:/opt/spark-2.3.0-bin-hadoop2.7/bin:/opt/hive-2.3.3-bin/bin:/usr/local/python-3.6.5/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin
# run the Spark application
/bin/bash /opt/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
    --queue root.dev2 \
    --driver-memory 1800m \
    --executor-memory 1500m \
    --py-files /opt/spark-2.3.0-bin-hadoop2.7/apps/loggerFactory.py \
    /opt/spark-2.3.0-bin-hadoop2.7/apps/cron_daily_save_accesslog_to_hive.py
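With all the exports inside the script, the crontab entry only needs to invoke the wrapper script with an absolute path. The schedule, script name (cron_daily_sync.sh), and log path below are assumptions for illustration, not taken from the original post:
# hypothetical crontab entry: run the wrapper script daily at 02:00,
# appending stdout and stderr to a log file for troubleshooting
0 2 * * * /bin/bash /opt/spark-2.3.0-bin-hadoop2.7/apps/cron_daily_sync.sh >> /opt/spark-2.3.0-bin-hadoop2.7/apps/cron_daily_sync.log 2>&1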