Requirement: find the top N IPs by access count
1) Map each log record to (ip, 1)
2) reduceByKey(_ + _)
3) Sort with sortBy
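As a rough illustration of these three steps, here is a minimal sketch that uses plain Scala collections instead of an RDD, so it runs without Spark. The `TopNSketch` object, the `topN` helper, and the sample lines are hypothetical; the IP is assumed to sit in the sixth tab-separated column (index 5). `groupBy` followed by a sum stands in for `reduceByKey(_ + _)`.

```scala
object TopNSketch {
  // Hypothetical helper: counts IPs in tab-separated lines and returns the n most frequent.
  def topN(lines: Seq[String], n: Int): Seq[(String, Int)] =
    lines
      .map(line => (line.split("\t")(5), 1))               // step 1: ip => (ip, 1)
      .groupBy(_._1)                                        // step 2: group the pairs by IP ...
      .map { case (ip, ones) => (ip, ones.map(_._2).sum) }  // ... and sum the 1s per IP
      .toSeq
      .sortBy { case (ip, count) => (-count, ip) }          // step 3: descending by count, ties by IP
      .take(n)

  // Made-up sample data: five placeholder columns followed by the IP column.
  val sample: Seq[String] = Seq(
    "a\tb\tc\td\te\t10.0.0.1",
    "a\tb\tc\td\te\t10.0.0.2",
    "a\tb\tc\td\te\t10.0.0.1",
    "a\tb\tc\td\te\t10.0.0.3",
    "a\tb\tc\td\te\t10.0.0.2",
    "a\tb\tc\td\te\t10.0.0.1"
  )

  def main(args: Array[String]): Unit =
    topN(sample, 2).foreach(println)
}
```

Sorting on the pair (-count, ip) is only one way to get a deterministic descending order on a local collection; the Spark version expresses the same intent with the `ascending` flag of `sortBy`.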
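import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val N = 10 // how many top IPs to return (example value, adjust as needed)
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("file:///E:/BigDataSoftware/data/baidu.log")
    val topN = lines.map(x => {
      val tmp = x.split("\t")
      (tmp(5), 1) // take the IP column (index 5) and turn it into an (IP, 1) tuple
    }).reduceByKey(_ + _).sortBy(_._2, false).take(N)
    topN.foreach(println) // take(N) returns an Array to the driver, so print it here
    sc.stop()
  }
}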
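sortBy sorts in ascending order by default; sortBy(_._2, false) sorts by the second tuple element (the count) in descending order. Here is the source of sortBy: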
/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
      .sortByKey(ascending, numPartitions)
      .values
}
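To make that decomposition concrete, the sketch below mimics the same chain of keyBy, sortByKey, and values on a plain Scala `Seq`, without Spark. `SortBySketch` and `sortByLike` are hypothetical names, and partitioning (`numPartitions`) is ignored entirely; the sketch only shows how an `ascending` flag can choose between the implicit ordering and its reverse.

```scala
object SortBySketch {
  // Hypothetical local analogue of RDD.sortBy: key each element, sort by key, drop the key.
  def sortByLike[T, K](xs: Seq[T])(f: T => K, ascending: Boolean = true)
                      (implicit ord: Ordering[K]): Seq[T] = {
    val keyed = xs.map(x => (f(x), x))                 // like keyBy(f): (key, element) pairs
    val keyOrd = if (ascending) ord else ord.reverse   // flip the ordering for descending sorts
    keyed.sortBy(_._1)(keyOrd)                         // like sortByKey(ascending)
      .map(_._2)                                       // like .values: keep only the elements
  }

  def main(args: Array[String]): Unit = {
    // Made-up (ip, count) pairs.
    val counts = Seq(("10.0.0.2", 2), ("10.0.0.1", 3), ("10.0.0.3", 1))
    println(sortByLike(counts)(_._2))        // ascending by count (the default)
    println(sortByLike(counts)(_._2, false)) // descending by count
  }
}
```

On a real RDD the sort also involves a shuffle across partitions, which this local version does not model.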