Spark on YARN: Remote Submission from IDEA (Not Debugging)


Author: Vbias | Published 2019-01-21 17:51


    1. Issues to Watch Out For

    1.1 On a CentOS-based cluster, containers may be killed with an "is running beyond virtual memory limits" error

    Current usage: xx MB of xx GB physical memory used; xx GB of xx GB virtual memory used.
    

    Solution: disable YARN's virtual-memory check (alternatively, you can raise yarn.nodemanager.vmem-pmem-ratio):

    <!-- Add the following property to yarn-site.xml -->
            <property>
                    <name>yarn.nodemanager.vmem-check-enabled</name>
                    <value>false</value>
            </property>
    

    1.2 On Linux, when IDEA connects to a Docker-based cluster, the host and containers may ping each other, yet the firewall can still prevent the cluster from reaching the host

    19/01/21 16:44:16 INFO Client: Application report for application_1548058747747_0006 (state: ACCEPTED)
    

    The application hangs, printing this report over and over. Solution: turn off the firewall. The sketch below can help confirm that the firewall really is the problem.
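
    A plain TCP probe (a minimal sketch; the host and port are placeholders for your own setup) can be run on a cluster node against the driver host to verify connectivity:

    import java.net.{InetSocketAddress, Socket}

    // TCP reachability probe: run on a cluster node against the driver host
    object PortProbe {
      def main(args: Array[String]): Unit = {
        val host = "192.168.1.9" // the driver (workstation) IP -- adjust to yours
        val port = 4040          // any driver port, e.g. the Spark UI -- adjust to yours
        val socket = new Socket()
        try {
          socket.connect(new InetSocketAddress(host, port), 3000) // 3-second timeout
          println(s"$host:$port is reachable")
        } catch {
          case e: Exception => println(s"$host:$port is NOT reachable: ${e.getMessage}")
        } finally socket.close()
      }
    }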

    1.3 The host cannot reach the cluster and keeps using 0.0.0.0:8032 (this setting is very important)

    This happens because the resource directory has not been marked as a resources root, so the Hadoop configuration files never make it onto the classpath. Solution:
    right-click the resource directory and choose Mark Directory as >> Resources Root.
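
    A quick way to check that the configuration files are actually being picked up is to print the ResourceManager address with Hadoop's YarnConfiguration (a sanity-check sketch; if it prints 0.0.0.0:8032, the XML files are not on the classpath):

    import org.apache.hadoop.yarn.conf.YarnConfiguration

    object CheckYarnConf {
      def main(args: Array[String]): Unit = {
        // YarnConfiguration picks up yarn-site.xml from the classpath automatically
        val conf = new YarnConfiguration()
        // prints 0.0.0.0:8032 when yarn-site.xml was NOT found
        println(conf.get(YarnConfiguration.RM_ADDRESS))
      }
    }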

    2. Final Project Layout (src)

    Create a new project in IDEA and build it with sbt. Any sbt version will do; pick Scala 2.11.8, because my cluster has no standalone Scala installation and relies on the Scala bundled with spark-2.3.1-bin-hadoop2.7, which is version 2.11.8. The src directory looks like this:

    # Right-click resource and choose Mark Directory as >> Resources Root, or set it in Project Structure
    src
    ├── main
    │   ├── resource
    │   │   ├── core-site.xml
    │   │   ├── hdfs-site.xml
    │   │   └── yarn-site.xml
    │   └── scala
    │       ├── SparkPI.scala
    │       └── WordCount.scala
    └── test
        └── scala
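
    The tree lists SparkPI.scala alongside WordCount.scala; only WordCount is walked through below, but a minimal SparkPI using the same conf setup might look like this (a sketch, not the exact file from the project):

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.math.random

    object SparkPI {
      def main(args: Array[String]): Unit = {
        System.setProperty("HADOOP_USER_NAME", "root")

        val conf = new SparkConf().setAppName("SparkPI").setMaster("yarn")
          .set("spark.submit.deployMode", "client")
          .set("spark.yarn.jars", "hdfs:/user/root/jars/*")
          .setJars(List("/home/lee/IdeaProjects/test/target/scala-2.11/test_2.11-0.1.jar"))
          .setIfMissing("spark.driver.host", "192.168.1.9")
        val sc = new SparkContext(conf)

        val n = 100000
        // Monte Carlo: the fraction of random points inside the unit circle approaches pi/4
        val inside = sc.parallelize(1 to n, 2).filter { _ =>
          val x = random * 2 - 1
          val y = random * 2 - 1
          x * x + y * y <= 1
        }.count()
        println(s"Pi is roughly ${4.0 * inside / n}")
        sc.stop()
      }
    }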
    

    2.1 Submitting WordCount as an Example

    This code alone cannot run; you still need to 1) add the cluster jars (section 3.1), and 2) package with sbt (section 3.3)

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
    
    object WordCount {
    
      def main(args: Array[String]): Unit = {
        System.setProperty("HADOOP_USER_NAME", "root")
        System.setProperty("user.name", "root")
    
    val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
      .set("spark.submit.deployMode", "client") // note: "deploy-mode" is not a valid conf key
      .set("spark.yarn.jars", "hdfs:/user/root/jars/*")  // the cluster jars you uploaded yourself (section 3.1)
      .setJars(List("/home/lee/IdeaProjects/test/target/scala-2.11/test_2.11-0.1.jar")) // the jar produced by sbt package
      .setIfMissing("spark.driver.host", "192.168.1.9") // set this to your own IP
    
        val sc = new SparkContext(conf)
    
        val rdd = sc.textFile("hdfs:/input/README.txt")
        val count = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
        count.collect().foreach(println)
      }
    }
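
    Note that collect() pulls the entire result back to the driver, which is fine for a README-sized file; for larger data you would typically write back to HDFS instead (the output path here is just an example):

    count.saveAsTextFile("hdfs:/output/wordcount") // instead of collect().foreach(println)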
    

    2.2 Dependencies

    # Add the following to build.sbt
    // https://mvnrepository.com/artifact/org.apache.spark/spark-yarn
    libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
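
    For reference, a complete minimal build.sbt consistent with the versions above might look like this (name and version are placeholders):

    name := "test"
    version := "0.1"
    scalaVersion := "2.11.8" // matches the Scala bundled with spark-2.3.1-bin-hadoop2.7

    // spark-yarn pulls in spark-core transitively
    libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"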
    

    3. Steps

    3.1 Setting Up the jars

    Note the .set("spark.yarn.jars", "hdfs:/user/root/jars/*") in the WordCount conf: since Spark's jars were not added locally, the job uses the jars already on the cluster, and those have to be uploaded from inside the cluster first

    # In a Docker environment, you can use the following commands
    docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /input
    docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user
    docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root
    docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root/jars
    docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/spark/jars/* /user/root/jars
    docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/hadoop/README.txt /input
    # /opt/module/hadoop/ is your own Hadoop directory
    # /opt/module/spark/ is your own Spark directory
    
    # On the cluster itself, assuming the environment is already set up, you can simply run
    hdfs dfs -mkdir /input
    hdfs dfs -mkdir /user
    hdfs dfs -mkdir /user/root
    hdfs dfs -mkdir /user/root/jars
    hdfs dfs -put your_spark_path/jars/* /user/root/jars
    hdfs dfs -put /opt/module/hadoop/README.txt /input
    

    Of course, if you would rather not keep the jars under /user/root, you can use a custom directory; just make the corresponding change in WordCount, as shown below.
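
    For example, if the jars were uploaded to hdfs:/spark/jars instead (a hypothetical path), the conf line becomes:

    .set("spark.yarn.jars", "hdfs:/spark/jars/*") // must match the hdfs dfs -put target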

    3.2 Using Local jars Instead (alternative to 3.1)

    If you do not want to upload Spark's jars to the cluster, you can copy them into the project instead:

    ls /opt/module/spark
    bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  work  yarn
    

    Yes, it is the jars folder under SPARK_HOME. Copy it into the project; in the end your_project/jars should contain the following:

    activation-1.1.1.jar                         hadoop-yarn-client-2.7.3.jar               metrics-graphite-3.1.5.jar 
    ......
    zstd-jni-1.3.2-2.jar
    hadoop-yarn-api-2.7.3.jar                    metrics-core-3.1.5.jar
    

    Go to File >> Project Structure >> Modules, select the Dependencies tab under the Name box, click the + at the top right of that panel, choose 1. JARs or Directories, and in the dialog that pops up select your_project/jars.

    3.3 Packaging

    Open the sbt shell at the bottom of IDEA:
    first enter clean,
    then enter package.
    If you choose a different packaging method, you need to update conf.setJars accordingly, as shown below.
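
    For instance, with sbt-assembly (hypothetical; this article uses plain package), the jar name changes and setJars must follow:

    .setJars(List("/home/lee/IdeaProjects/test/target/scala-2.11/test-assembly-0.1.jar")) // hypothetical assembly jar path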

    4. Running

    19/01/21 16:44:41 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 
    19/01/21 16:44:41 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:20) finished in 0.827 s
    19/01/21 16:44:41 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:20, took 6.945556 s
    (under,1)
    (this,3)
    (distribution,2)
    (Technology,1)
    (country,1)
    (is,1)
    (Jetty,1)
    (currently,1)
    (permitted.,1)
    (check,1)
    (have,1)
    (Security,1)
    (U.S.,1)
    (with,1)
    (BIS,1)
    (This,1)
    (mortbay.org.,1)
    ((ECCN),1)
    (using,2)
    (security,1)
    (Department,1)
    (export,1)
    (reside,1)
    (any,1)
    (algorithms.,1)
    (from,1)
    (re-export,2)
    (has,1)
    (SSL,1)
    (Industry,1)
    (Administration,1)
    (details,1)
    (provides,1)
    (http://hadoop.apache.org/core/,1)
    (country's,1)
    (Unrestricted,1)
    (740.13),1)
    (policies,1)
    (country,,1)
    (concerning,1)
    (uses,1)
    (Apache,1)
    (possession,,2)
    (information,2)
    (our,2)
    (as,1)
    (,18)
    (Bureau,1)
    (wiki,,1)
    (please,2)
    (form,1)
    (information.,1)
    (ENC,1)
    (Export,2)
    (included,1)
    (asymmetric,1)
    (Commodity,1)
    (Software,2)
    (For,1)
    (it,1)
    (The,4)
    (about,1)
    (visit,1)
    (website,1)
    (<http://www.wassenaar.org/>,1)
    (performing,1)
    (Section,1)
    (on,2)
    ((see,1)
    (http://wiki.apache.org/hadoop/,1)
    (classified,1)
    (following,1)
    (in,1)
    (object,1)
    (cryptographic,3)
    (which,2)
    (See,1)
    (encryption,3)
    (Number,1)
    (and/or,1)
    (software,2)
    (for,3)
    ((BIS),,1)
    (makes,1)
    (at:,2)
    (manner,1)
    (Core,1)
    (latest,1)
    (your,1)
    (may,1)
    (the,8)
    (Exception,1)
    (includes,2)
    (restrictions,1)
    (import,,2)
    (project,1)
    (you,1)
    (use,,2)
    (another,1)
    (if,1)
    (or,2)
    (Commerce,,1)
    (source,1)
    (software.,2)
    (laws,,1)
    (BEFORE,1)
    (Hadoop,,1)
    (License,1)
    (written,1)
    (code,1)
    (Regulations,,1)
    (software,,2)
    (more,2)
    (software:,1)
    (see,1)
    (regulations,1)
    (of,5)
    (libraries,1)
    (by,1)
    (exception,1)
    (Control,1)
    (code.,1)
    (eligible,1)
    (both,1)
    (to,2)
    (Foundation,1)
    (Government,1)
    (functions,1)
    (and,6)
    (5D002.C.1,,1)
    ((TSU),1)
    (Hadoop,1)
    19/01/21 16:44:42 INFO SparkContext: Invoking stop() from shutdown hook
    19/01/21 16:44:42 INFO SparkUI: Stopped Spark web UI at http://192.168.1.9:4040
    19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Interrupting monitor thread
    19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Shutting down all executors
    19/01/21 16:44:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
    19/01/21 16:44:42 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
    (serviceOption=None,
     services=List(),
     started=false)
    19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Stopped
    19/01/21 16:44:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    19/01/21 16:44:42 INFO MemoryStore: MemoryStore cleared
    19/01/21 16:44:42 INFO BlockManager: BlockManager stopped
    19/01/21 16:44:42 INFO BlockManagerMaster: BlockManagerMaster stopped
    19/01/21 16:44:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    19/01/21 16:44:42 INFO SparkContext: Successfully stopped SparkContext
    19/01/21 16:44:42 INFO ShutdownHookManager: Shutdown hook called
    19/01/21 16:44:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-88c6c289-4d49-4035-96d7-19ba6410ef8a
    
