Two Ways to Implement WordCount in Spark


Author: 小猪Harry | Published: 2018-10-17 09:42

    1. Using spark-shell

    val lines = sc.textFile("hdfs://spark1:9000/spark.txt")
    val words = lines.flatMap(line => line.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.foreach(wordcount => println(wordcount._1 + " appeared " + wordcount._2 + " times"))
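
To see what each step of the pipeline computes without a cluster, here is a minimal plain-Scala sketch that mirrors the flatMap, map, and reduceByKey steps on an in-memory `Seq`. The object name `LocalWordCount` and the sample lines are hypothetical, and `groupBy` followed by a sum only stands in for `reduceByKey`; it is not how Spark executes the step.

```scala
// Plain-Scala sketch (no Spark needed): mirrors flatMap -> map -> reduceByKey
object LocalWordCount {
  // groupBy + sum stands in for reduceByKey on an RDD
  def countWords(lines: Seq[String]): Map[String, Int] = {
    // Split every line on single spaces and flatten into one sequence of words
    val words = lines.flatMap(line => line.split(" "))
    // Pair each word with an initial count of 1
    val pairs = words.map(word => (word, 1))
    // Group the pairs by word and sum the 1s in each group
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input, not taken from spark.txt
    val sample = Seq("hello spark", "hello scala")
    countWords(sample).foreach { case (word, n) =>
      println(word + " appeared " + n + " times")
    }
  }
}
```

Note that in spark-shell connected to a cluster, `foreach(println)` on an RDD runs on the executors, so the printed lines may appear in executor logs rather than in the shell.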
    

    2. Scala application in IntelliJ IDEA

        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.2.0</version>
        </dependency>
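
For readers who build with sbt instead of Maven, the same dependency could be declared roughly as follows. This is a sketch under assumptions: the original only shows the Maven form, and the exact `scalaVersion` is a guess chosen to match the `_2.11` artifact suffix.

```scala
// build.sbt (sketch; assumes an sbt project rather than the Maven setup above)
scalaVersion := "2.11.8" // assumed; any 2.11.x release matches the _2.11 suffix

// %% appends the Scala binary version, so this resolves to spark-core_2.11
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0"
```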
    
    package cn.spark.study.core
     
    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext
     
    object WordCount {
       
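      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCount")
          .setMaster("spark://hadoop:7077")
          // .setMaster("local[2]") // run locally (Windows)
        val sc = new SparkContext(conf)

        // Read the input file (path given as the first argument), at least 1 partition
        val lines = sc.textFile(args(0), 1)
        // Split each line into words on single spaces
        val words = lines.flatMap(line => line.split(" "))
        // Pair each word with an initial count of 1
        val pairs = words.map(word => (word, 1))
        // Sum the counts for each word
        val wordCounts = pairs.reduceByKey(_ + _)
        wordCounts.foreach(wc => println(wc._1 + " appeared " + wc._2 + " times"))

        // Release cluster resources once the job is done
        sc.stop()
      }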
    }
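
One possible variant of the application, offered as a sketch rather than the author's method: it omits `setMaster` so the master URL can be supplied at submit time, and it calls `collect()` so the results are printed by the driver instead of on the executors. The object name `WordCountOnDriver`, the empty-string filter, and the sort by count are additions for illustration. `collect()` is only reasonable when the result is small enough to fit in driver memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical variant of WordCount: prints sorted results on the driver
object WordCountOnDriver {
  // Same pipeline as above, plus a filter that drops empty strings
  // produced by consecutive spaces (an addition, not in the original)
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    require(args.nonEmpty, "usage: WordCountOnDriver <input path>")
    // No setMaster here: this sketch assumes the master is passed
    // to spark-submit with --master
    val sc = new SparkContext(new SparkConf().setAppName("WordCountOnDriver"))
    try {
      countWords(sc.textFile(args(0)))
        .collect()                      // bring the (small) result to the driver
        .sortBy { case (_, n) => -n }   // most frequent words first
        .foreach { case (word, n) => println(s"$word appeared $n times") }
    } finally {
      sc.stop()                         // always release cluster resources
    }
  }
}
```

With this variant, the master would be given on the command line, for example `--master spark://hadoop:7077` for the cluster or `--master local[2]` for a local run.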
    

    Finally, submit the application to the Spark cluster with spark-submit. The script is as follows:

    /usr/local/spark/bin/spark-submit \
    --class cn.spark.study.core.WordCount \
    /usr/local/spark-study/scala/wordcount.jar \
    /root/test.txt
    

    Note: stop spark-shell before submitting. Otherwise the job may not get enough resources (such as memory) and may report: "Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources".
