My First Spark Program: WordCount with Maven

Author: symsimmy | Published 2018-03-07 20:16

    1. Create a new Maven project

    2. Fill in the GroupId and ArtifactId, then click Next

    3. Enable Auto-Import

    4. Edit pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.symsimmy</groupId>
        <artifactId>sparklearning</artifactId>
        <version>1.0-SNAPSHOT</version>
    
        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
            <spark.version>2.2.0</spark.version>
            <scala.version>2.11</scala.version>
            <hadoop.version>2.9.0</hadoop.version>
        </properties>
    
        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_${scala.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_${scala.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-streaming_${scala.version}</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.12</version>
            </dependency>
        </dependencies>
    
        <build>
            <sourceDirectory>src/main/scala</sourceDirectory>
            <testSourceDirectory>src/test/scala</testSourceDirectory>
    
            <plugins>
                <plugin>
                    <!-- JDK version used by the Maven compiler -->
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.7.0</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    
    </project>
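
    Note that this pom points the source directories at src/main/scala and src/test/scala but only configures the Java compiler plugin, so a plain mvn package will not compile the Scala sources (IntelliJ compiles them with its own Scala plugin). As a minimal sketch, assuming the commonly used net.alchim31.maven scala-maven-plugin and a scala-library version matching the recommendation in step 5 below, you could add the following: the dependency goes under <dependencies>, the plugin under <build><plugins>.

    <!-- Scala standard library, matching the Scala 2.11 Spark artifacts above -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
    </dependency>

    <!-- Compiles src/main/scala and src/test/scala during the Maven build -->
    <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
            <execution>
                <goals>
                    <goal>compile</goal>
                    <goal>testCompile</goal>
                </goals>
            </execution>
        </executions>
    </plugin>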
    

    5. Add the Scala SDK

    I have verified that under Spark 2.2.0, Scala 2.12.4 fails; this is a pitfall. I therefore recommend Scala 2.11.8, the version shown in the official 2.2.0 documentation, which works in my tests.
    Go to Project Structure -> Libraries, click the + button, and choose Scala SDK.


    6. Configure the project layout

    • Under the src directory, add a scala directory and mark it as a Source Folder
    • Under the test directory, add a scala directory and a resources directory, and mark them as Test Source Folders and Test Resource Folders respectively

    7. Run the Spark example SparkPi

    In the src/main/scala directory, right-click and choose New -> Scala Class, name it SparkPi, and select Object as the kind

    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *    http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    
    // scalastyle:off println
    
    import scala.math.random
    
    import org.apache.spark.sql.SparkSession
    
    /** Computes an approximation to pi */
    object SparkPi {
      def main(args: Array[String]) {
        val spark = SparkSession
          .builder()
          .appName("Spark Pi")
          .getOrCreate() 
        val slices = if (args.length > 0) args(0).toInt else 2
        val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
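        // Monte Carlo estimate: sample n points uniformly in the square [-1, 1] x [-1, 1];
        // the fraction that falls inside the unit circle approximates pi / 4.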
        val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
          val x = random * 2 - 1
          val y = random * 2 - 1
          if (x*x + y*y <= 1) 1 else 0
        }.reduce(_ + _)
        println("Pi is roughly " + 4.0 * count / (n - 1))
        spark.stop()
      }
    }
    // scalastyle:on println
    

    The run fails with an error; here is how to fix it

    17/12/06 10:10:14 ERROR SparkContext: Error initializing SparkContext.
    org.apache.spark.SparkException: A master URL must be set in your configuration
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
        at SparkPi$.main(SparkPi.scala:31)
        at SparkPi.main(SparkPi.scala)
    17/12/06 10:10:14 INFO SparkContext: Successfully stopped SparkContext
    Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
        at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
        at SparkPi$.main(SparkPi.scala:31)
        at SparkPi.main(SparkPi.scala)
    
    Process finished with exit code 1
    

    First, it has to be said that running it directly is guaranteed to fail. Spark has several run modes; to run locally, the master must be set to local.
    In Edit Configurations, add the VM option -Dspark.master=local (a code-level alternative is sketched after the next bullet).


    • Click OK, and it now runs correctly
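
    Alternatively, instead of passing a VM option, you can set the master directly when building the SparkSession. A minimal sketch, changing only the builder call in SparkPi (hard-coding local[*] is convenient for local runs, but you would normally remove it when submitting to a cluster):

    val spark = SparkSession
      .builder()
      .appName("Spark Pi")
      .master("local[*]") // run locally using all available cores
      .getOrCreate()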


    8. Configure the artifact and build the jar

    • Set up Artifacts and choose From modules with dependencies...

    • Select SparkPi as the Main Class

    • The final setup looks like this; you can customize where the generated jar is written, then click OK


    • Click Build -> Build Artifacts

    • The menu appears in the middle of the screen; click Build

    • After a successful build, a jar is generated under the out directory

    Use spark-submit to run the generated jar
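
    For example, assuming the jar was generated under the out directory as above (the exact path and jar name depend on your artifact settings, so the path below is only an illustration), a local run could look like:

    $SPARK_HOME/bin/spark-submit \
      --class SparkPi \
      --master local[*] \
      out/artifacts/sparklearning_jar/sparklearning.jar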
