1. Read the file
scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").collect
res1: Array[String] = Array(spark hadoop hadoop, hive hbase hbase, hive hadoop hadoop, hive hadoop hadoop)
2. Flatten the lines, splitting on the tab character
scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).collect
res2: Array[String] = Array(spark, hadoop, hadoop, hive, hbase, hbase, hive, hadoop, hadoop, hive, hadoop, hadoop)
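The flatMap here matters: a plain map would produce one array per line rather than a single flat sequence of words. A minimal sketch on plain Scala collections, using hypothetical sample lines matching the output above:

```scala
// Two sample tab-separated lines (hypothetical stand-ins for wordcount.txt)
val lines = Seq("spark\thadoop\thadoop", "hive\thbase\thbase")

// map keeps the nesting: one Array[String] per line
val nested = lines.map(_.split("\t"))     // Seq[Array[String]]

// flatMap flattens everything into one sequence of words
val words = lines.flatMap(_.split("\t"))  // Seq[String]
```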
3. Map each word to (word, 1)
scala> sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).map((_,1)).collect
res3: Array[(String, Int)] = Array((spark,1), (hadoop,1), (hadoop,1), (hive,1), (hbase,1), (hbase,1), (hive,1), (hadoop,1), (hadoop,1), (hive,1), (hadoop,1), (hadoop,1))
4. Sum the values for identical keys
scala> val result = sc.textFile("file:///home/hadoop/soul/data/wordcount.txt").flatMap(x => (x.split("\t"))).map((_,1)).reduceByKey(_+_)
result: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[20] at reduceByKey at <console>:24
scala> result.collect
res8: Array[(String, Int)] = Array((hive,3), (spark,1), (hadoop,6), (hbase,2))
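Since `reduceByKey(_+_)` simply sums the 1s attached to each word, the whole pipeline can be sanity-checked on plain Scala collections without a cluster. A sketch, assuming the file contents implied by the collect output in step 1:

```scala
// Sample lines assumed from the collect output in step 1
val lines = Seq(
  "spark\thadoop\thadoop",
  "hive\thbase\thbase",
  "hive\thadoop\thadoop",
  "hive\thadoop\thadoop"
)

val counts: Map[String, Int] =
  lines.flatMap(_.split("\t"))   // flatten into individual words
       .map((_, 1))              // pair each word with 1
       .groupBy(_._1)            // local stand-in for the reduceByKey shuffle
       .map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // sum the 1s
```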
5. Sort by word count in descending order
Step 1: swap each word with its count, so the count becomes the key
scala> result.map(x => (x._2,x._1)).collect
res5: Array[(Int, String)] = Array((3,hive), (1,spark), (6,hadoop), (2,hbase))
Step 2: sort by key in descending order
scala> result.map(x => (x._2,x._1)).sortByKey(false).collect
res11: Array[(Int, String)] = Array((6,hadoop), (3,hive), (2,hbase), (1,spark))
Step 3: swap K and V back again to get the (word, count) format we want
scala> result.map(x => (x._2,x._1)).sortByKey(false).map(x=>(x._2,x._1)).collect
res12: Array[(String, Int)] = Array((hadoop,6), (hive,3), (hbase,2), (spark,1))
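The swap-sort-swap pattern works, but Spark's `RDD.sortBy` can sort on the value directly, e.g. `result.sortBy(_._2, false).collect`. The same idea on plain collections, negating the sort key instead of swapping the pair twice:

```scala
// Counts as produced in step 4
val counts = Seq(("hive", 3), ("spark", 1), ("hadoop", 6), ("hbase", 2))

// Sort descending by count by negating it
val sorted = counts.sortBy(-_._2)
```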