Spark RDD/Dataframe/Dataset: A Boring Performance Test

Author: 大猪大猪 | Published 2019-01-01 23:14

    Spark has three data abstractions (RDD, Dataframe, Dataset), but it is not obvious which one performs best (some articles claim Dataset < Dataframe < RDD). So a bored person, namely me, decided to benchmark them.

    Test code

    import java.util.UUID

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    class App10 {

      System.setProperty("java.security.krb5.conf", "/etc/krb5.conf")
      System.setProperty("sun.security.krb5.debug", "false")
    
      val sparkConf = new SparkConf()
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.initialExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "6")
        .set("spark.dynamicAllocation.executorIdleTimeout", "60")
        .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "60")
        .set("spark.executor.cores", "4")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        //    .setMaster("local[12]")
        .setAppName("Boring Dataset/Dataframe/RDD test")
    
      val spark = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    
    
      def run(typ: Int): Unit = {

        import spark.implicits._
        spark.sparkContext.setLogLevel("ERROR")

        // Build the 4-million-row test collection once on the driver
        val data = (0 to 4000000).map(num => Log10(UUID.randomUUID().toString, num))

        typ match {
          case 0 => spark.sparkContext.parallelize(data).count()         // RDD
          case 1 => spark.sparkContext.parallelize(data).toDF().count()  // Dataframe
          case 2 => spark.sparkContext.parallelize(data).toDS().count()  // Dataset
          case _ => ()
        }
      }
    
    }
    
    case class Log10(uid: String, age: Int)
    
    object App10 {
      def main(args: Array[String]): Unit = {
        new App10().run(args(0).toInt)
      }
    }
    
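Note that the `time` measurements below include JVM and YARN startup overhead, not just the Spark action. A minimal sketch of an in-process alternative (the `timed` helper and its output format are my own, not from the original post):

```scala
object Timing {
  // Hypothetical helper: measures only the wrapped block, excluding the
  // JVM/YARN startup cost that the external `time` command also counts.
  def timed[T](label: String)(block: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = block // force the action here (e.g. rdd.count())
    val seconds = (System.nanoTime() - start) / 1e9
    println(f"$label took $seconds%.3f s")
    (result, seconds)
  }
}
```

Wrapping each `count()` in `timed("rdd")(...)` would isolate the action itself from submission overhead.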

    Test groups

    PS: The cluster is two 12-core / 24 GB machines with no other jobs running, i.e. idle hosts, so the results should be reasonably clean.

    Group 1 (RDD, argument 0)

    time spark-submit --master yarn --jars "hdfs:///tmp/jars/*" --class com.dounine.hbase.App10 --driver-memory 3g --executor-memory 2G build/libs/hdfs-token-1.0.0-SNAPSHOT.jar 0
    

    Results of three runs

    real    0m34.242s
    user    0m54.498s
    sys 0m3.584s
    -----------------------
    real    0m34.009s
    user    0m45.385s
    sys 0m3.520s
    ----------------------
    real    0m34.948s
    user    0m49.349s
    sys 0m3.407s
    

    Group 2 (Dataframe, argument 1)

    time spark-submit --master yarn --jars "hdfs:///tmp/jars/*" --class com.dounine.hbase.App10 --driver-memory 3g --executor-memory 2G build/libs/hdfs-token-1.0.0-SNAPSHOT.jar 1
    

    Results of three runs

    real    0m37.738s
    user    0m52.649s
    sys 0m3.684s
    ------------------
    real    0m37.471s
    user    0m50.647s
    sys 0m3.557s
    -------------------
    real    0m37.248s
    user    0m46.946s
    sys 0m3.471s
    

    Group 3 (Dataset, argument 2)

    time spark-submit --master yarn --jars "hdfs:///tmp/jars/*" --class com.dounine.hbase.App10 --driver-memory 3g --executor-memory 2G build/libs/hdfs-token-1.0.0-SNAPSHOT.jar 2
    

    Results of three runs

    real    0m36.179s
    user    0m59.250s
    sys 0m3.674s
    ---------------------
    real    0m35.090s
    user    0m54.178s
    sys 0m3.476s
    --------------------
    real    0m35.181s
    user    0m50.917s
    sys 0m3.599s
    
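Averaging the `real` times above makes the ranking easier to see (the arithmetic is mine, not from the original post):

```scala
object Averages {
  // `real` times in seconds, copied from the three groups above
  val rdd       = Seq(34.242, 34.009, 34.948)
  val dataframe = Seq(37.738, 37.471, 37.248)
  val dataset   = Seq(36.179, 35.090, 35.181)

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  def main(args: Array[String]): Unit = {
    println(f"RDD       avg: ${mean(rdd)}%.2f s")       // ≈ 34.40 s
    println(f"Dataframe avg: ${mean(dataframe)}%.2f s") // ≈ 37.49 s
    println(f"Dataset   avg: ${mean(dataset)}%.2f s")   // ≈ 35.48 s
  }
}
```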

    Conclusion

    RDD still comes out a bit ahead. It may just be that I am measuring it the wrong way; I will test again when I think of a better benchmark.
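One possible confound: `parallelize` builds the 4-million-element collection on the driver first, so part of the measured time is driver-side data generation rather than the RDD/Dataframe/Dataset machinery itself. A sketch of an alternative that generates rows on the executors instead (untested here; it assumes the same `Log10` case class and `spark` session from the code above):

```scala
import java.util.UUID
import spark.implicits._

// spark.range produces a Dataset[Long] partitioned across executors,
// so the per-row work happens in parallel rather than on the driver.
val ds = spark.range(0L, 4000000L)
  .map(id => Log10(UUID.randomUUID().toString, id.toInt))
val count = ds.count()
```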
