(14) Basic Usage of Spark on YARN and Common Errors

Author: 白面葫芦娃92 | Published 2018-09-19 11:44

    Submit the Spark job to YARN for execution.
    Spark itself acts only as a client.

    ./spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
     /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
    3
    

    --master yarn by itself is equivalent to adding --deploy-mode client, i.e. yarn-client mode, so in that case --deploy-mode client is optional.
    For yarn-cluster mode, --deploy-mode cluster must be specified explicitly (see the sketch below).
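    For example, a minimal yarn-cluster submission of the same example job might look like the sketch below (same jar as above; in cluster mode the SparkPi output ends up in the driver's YARN container log rather than on the local console):

    ./spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
    3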

    Launching directly with the command above produces an error:

    Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
            at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:288)
            at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:248)
            at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
            at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:130)
            at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    

    You need to set HADOOP_CONF_DIR or YARN_CONF_DIR in the environment:

    [hadoop@hadoop001 ~]$ cd $SPARK_HOME/conf
    [hadoop@hadoop001 conf]$ vi spark-env.sh
    export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop
    
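    Per the error message, setting YARN_CONF_DIR instead works just as well; in this layout the Hadoop client configuration (core-site.xml, yarn-site.xml) lives in the same etc/hadoop directory, so the sketch below simply points the other variable at it:

    [hadoop@hadoop001 conf]$ vi spark-env.sh
    export YARN_CONF_DIR=/home/hadoop/app/hadoop-2.6.0/etc/hadoop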

    Checking the logs, one step takes a noticeably long time:
    Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME

    18/09/19 17:30:32 INFO yarn.Client: Preparing resources for our AM container
    18/09/19 17:30:35 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    18/09/19 17:30:44 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_libs__2104928720237052389.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_libs__2104928720237052389.zip
    18/09/19 17:30:54 INFO yarn.Client: Uploading resource file:/tmp/spark-8152492d-487e-4d35-962a-42344edea033/__spark_conf__1822648312505136721.zip -> hdfs://192.168.137.251:9000/user/hadoop/.sparkStaging/application_1537349385350_0001/__spark_conf__.zip
    

    The official documentation also covers this:
    To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
    You can configure this as follows:

    [hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /system/spark-lib
    [hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/jars/* /system/spark-lib
    [hadoop@hadoop000 ~]$ hadoop fs -chmod -R 755 /system/spark-lib
    [hadoop@hadoop000 ~]$ cd $SPARK_HOME/conf
    [hadoop@hadoop000 conf]$ cp spark-defaults.conf.template spark-defaults.conf
    [hadoop@hadoop000 conf]$ vi spark-defaults.conf
    spark.yarn.jars    hdfs://192.168.137.251:9000/system/spark-lib/*
    (Without the trailing *, you get: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher)
    
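    As the documentation quoted above notes, spark.yarn.archive is an alternative to spark.yarn.jars: the jars are packaged once into a single archive on HDFS and only that one file is referenced. A possible setup is sketched below; the archive name spark-libs.zip and its location are just an illustration, and note that the jars have to sit at the root of the archive:

    [hadoop@hadoop000 ~]$ cd $SPARK_HOME/jars
    [hadoop@hadoop000 jars]$ zip -q ../spark-libs.zip *.jar
    [hadoop@hadoop000 jars]$ hadoop fs -put ../spark-libs.zip /system/spark-lib/
    [hadoop@hadoop000 jars]$ vi $SPARK_HOME/conf/spark-defaults.conf
    spark.yarn.archive    hdfs://192.168.137.251:9000/system/spark-lib/spark-libs.zip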

    The run log then changes from the earlier "Uploading resource ..." lines to:

    18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-asn1-api-1.0.0-M20.jar
    18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/api-util-1.0.0-M20.jar
    18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arpack_combined_all-0.1.jar
    18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-format-0.8.0.jar
    18/09/19 19:13:20 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://192.168.137.251:9000/system/spark-lib/arrow-memory-0.8.0.jar
    .............
    .............
    

    spark.yarn.jars now points at the shared lib directory of jars on HDFS. With this setting, submitting a job no longer uploads the jars from the local machine; instead they are copied from one HDFS directory to another, which on the whole saves a little time. (Some articles online claim this configuration removes the jar-upload step entirely; that is not accurate. It merely turns the local upload into a copy operation on HDFS.)
    This jar distribution is the root cause of the tens of seconds spent on resource preparation for every submission: the jars must be accessible throughout the YARN environment, meaning every node and every container has to be able to reach them. For offline/batch workloads this is acceptable, but when requirements are strict, spending tens of seconds on every Spark job launch is not. Spark can be combined with microservices, e.g. using Spring Boot to turn Spark into a long-running service that stays up 7x24, so that resources do not have to be re-requested again and again for each job submission.

    Other commonly used options (a combined example follows the list):

    --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                                or all available cores on the worker in standalone mode)
    --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
    --num-executors NUM         Number of executors to launch (Default: 2).
                                If dynamic allocation is enabled, the initial number of
                                executors will be at least NUM.
    --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
    
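    Putting these together, a submission that pins the queue and the executor resources might look like the sketch below (the queue name and resource sizes are placeholders to adjust for your own cluster):

    ./spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --queue default \
    --num-executors 2 \
    --executor-cores 1 \
    --executor-memory 1G \
    /home/hadoop/app/spark-2.3.1-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.3.1.jar \
    3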
