Spark Word Count and Sorting

Author: 喵星人ZC | Published 2019-11-06 22:47

    1. Read the file

    scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").collect
    res1: Array[String] = Array(spark       hadoop  hadoop, hive    hbase   hbase, hive     hadoop  hadoop, hive    hadoop  hadoop)
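
    The original post never shows wordcount.txt itself, but the output above implies four lines of tab-separated words, along these lines (a reconstruction, not the author's actual file):

    spark	hadoop	hadoop
    hive	hbase	hbase
    hive	hadoop	hadoop
    hive	hadoop	hadoop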
    

    2. Flatten the data, splitting each line on tabs

    scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).collect
    res2: Array[String] = Array(spark, hadoop, hadoop, hive, hbase, hbase, hive, hadoop, hadoop, hive, hadoop, hadoop)
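
    Why flatMap rather than map: map would keep the per-line nesting and yield an RDD of arrays, while flatMap concatenates the split results into a single RDD[String]. For comparison (illustrative, not from the original session):

    scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").map(x => x.split("\t")).collect
    // => Array[Array[String]] = Array(Array(spark, hadoop, hadoop), Array(hive, hbase, hbase), ...)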
    

    3. Pair each word with a count of 1

    scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).map((_,1)).collect
    res3: Array[(String, Int)] = Array((spark,1), (hadoop,1), (hadoop,1), (hive,1), (hbase,1), (hbase,1), (hive,1), (hadoop,1), (hadoop,1), (hive,1), (hadoop,1), (hadoop,1))
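
    Here (_,1) is Scala placeholder syntax for x => (x, 1): every word becomes a key paired with an initial count of 1. Written out explicitly, the same step would read:

    scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => x.split("\t")).map(word => (word, 1)).collect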
    

    4. Aggregate values that share the same key

    scala> val result = sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).map((_,1)).reduceByKey(_+_)
    result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at reduceByKey at <console>:24
    
    scala> result.collect
    res8: Array[(String, Int)] = Array((hive,3), (spark,1), (hadoop,6), (hbase,2))
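
    reduceByKey(_+_) merges the values of each key with the given function, pre-aggregating on each partition before the shuffle. A groupByKey-based version would produce the same counts but ship every individual (word, 1) pair across the network (shown only for comparison, not from the original post):

    scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => x.split("\t")).map((_,1)).groupByKey().mapValues(_.sum).collect
    // same counts as reduceByKey, but with a more expensive shuffle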
    

    5. Sort by word count, descending
    Step 1: swap each word with its count

    scala> result.map(x => (x._2,x._1)).collect
    res5: Array[(Int, String)] = Array((3,hive), (1,spark), (6,hadoop), (2,hbase))
    

    Step 2: sort by key in descending order (sortByKey's boolean argument is ascending, so false gives descending)

    scala>  result.map(x => (x._2,x._1)).sortByKey(false).collect
    res11: Array[(Int, String)] = Array((6,hadoop), (3,hive), (2,hbase), (1,spark))
    

    Step 3: swap the key and value back to get the (word, count) format we want

    scala>  result.map(x => (x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect
    res12: Array[(String, Int)] = Array((hadoop,6), (hive,3), (hbase,2), (spark,1))
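
    The swap-sort-swap sequence above works, but sortBy can sort by an arbitrary key function directly, avoiding the two extra maps (an equivalent alternative, not from the original post):

    scala> result.sortBy(_._2, ascending = false).collect
    // => Array((hadoop,6), (hive,3), (hbase,2), (spark,1))

    For just the top N words, result.top(n)(Ordering.by(_._2)) avoids a full sort as well.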
    
