Configuring spark.yarn.jars

Author: 喵星人ZC | Published 2019-05-11 12:12

    Symptom: when you run spark-shell --master yarn-client, it takes a long time before sc and spark finish initializing, especially on a low-spec machine.

    [hadoop@hadoop000 ~]$ spark-shell --master yarn-client
    Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
    19/05/11 11:21:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    19/05/11 11:21:32 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    Spark context Web UI available at http://hadoop000:4040
    Spark context available as 'sc' (master = yarn, app id = application_1557541510136_0002).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
          /_/
             
    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> 
    

    This is exactly the delay we hit when submitting a job with spark-submit, since spark-shell calls spark-submit under the hood. So how can we cut down the time a Spark job spends on submission?
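
    To see that this is not specific to the shell, you can reproduce the same wait with a plain batch submit. A minimal sketch using the SparkPi example bundled with the distribution (the exact examples jar name is an assumption and may differ in your build):

    spark-submit \
      --master yarn-client \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.2.jar \
      10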

    Looking at the spark-shell --master yarn-client log above, we find:

    WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
    

    每次在spark运行时都会把yarn所需的spark jar打包上传至HDFS,然后分发到每个NM,为了节省时间我们可以将jar包提前上传至HDFS,那么spark在运行时就少了一步上传,可以直接从HDFS读取了。
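
    You can watch this happen while an application without the setting is running: its staging directory on HDFS holds the uploaded archive. A sketch, assuming the default staging location under the submitting user's HDFS home directory (the app id below is the one from the first run above):

    hadoop fs -ls /user/hadoop/.sparkStaging/application_1557541510136_0002/
    # expect an entry like __spark_libs__<random>.zip next to __spark_conf__.zip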

    The steps are as follows:
    1. Create a directory on HDFS to hold the Spark jars

    hadoop fs -mkdir -p  /spark-yarn/jars
    

    2. Upload the jars under $SPARK_HOME/jars to the HDFS path we just created

    [hadoop@hadoop000 jars]$ cd /home/hadoop/soul/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/jars/
    [hadoop@hadoop000 jars]$ hadoop fs -put * /spark-yarn/jars/
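
    As an optional sanity check, the number of jars on HDFS should match the local directory:

    ls $SPARK_HOME/jars | wc -l
    hadoop fs -ls /spark-yarn/jars | grep -c '\.jar$'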
    

    3. Add the following to spark-defaults.conf

    spark.yarn.jars=hdfs://hadoop000:8020/spark-yarn/jars/*.jar
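
    If you want to test the effect without editing spark-defaults.conf, the same property can be passed on the command line via --conf, for example:

    spark-shell --master yarn-client \
      --conf spark.yarn.jars="hdfs://hadoop000:8020/spark-yarn/jars/*.jar"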
    

    Re-run spark-shell --master yarn-client:

    [hadoop@hadoop000 ~]$ spark-shell --master yarn-client
    Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
    19/05/11 12:03:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://hadoop000:4040
    Spark context available as 'sc' (master = yarn, app id = application_1557547254373_0002).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
          /_/
             
    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> 
    
    

    The following warning no longer appears:

    WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
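
    The warning also mentions spark.yarn.archive as an alternative: instead of listing individual jars, you point it at a single archive that contains them all. A sketch of that variant, assuming the jars are packed at the top level of the archive (paths reuse the ones above); set only one of the two properties:

    jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
    hadoop fs -put spark-libs.jar /spark-yarn/
    # then in spark-defaults.conf:
    # spark.yarn.archive=hdfs://hadoop000:8020/spark-yarn/spark-libs.jar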
    
