Spark DataFrame Basic Operations

Author: sparkle123 | Published 2018-03-05 14:59

    The DataFrame concept comes from R/Pandas, but an R/Pandas DataFrame only runs on one machine. Spark's DataFrame is distributed, with a simple, easy-to-use API.

    • Lower barrier to entry: the Spark RDD API vs. the MapReduce API
    • One machine: R/Pandas

    From the official documentation:
    http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#datasets-and-dataframes
    Excerpts:

    • A Dataset is a distributed collection of data.
    • A DataFrame is a Dataset organized into named columns (an RDD with a schema):
      a distributed dataset arranged by columns (column name, column type, column values), with each column given a name.
    • An abstraction for selecting, filtering, aggregating and plotting structured data.
    • It is conceptually equivalent to a table in a relational database
      or a data frame in R/Python.
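
    In the Scala API, `DataFrame` is literally a type alias for `Dataset[Row]`. A minimal sketch of the relationship (assuming Spark is on the classpath; the `Person` class and sample values are mine, not from the article):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DatasetVsDataFrame {
  // A typed record for the Dataset side of the comparison (hypothetical example data)
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetVsDataFrame")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // A Dataset: a distributed collection of typed JVM objects
    val ds: Dataset[Person] = Seq(Person("Michael", 29), Person("Andy", 30)).toDS()

    // A DataFrame: the same data organized into named columns, i.e. Dataset[Row]
    val df: DataFrame = ds.toDF()
    df.printSchema()

    spark.stop()
  }
}
```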

    RDD vs. DataFrame:

    • An RDD's execution speed depends on the language it is written in:
    java/scala ==> JVM
    python ==> Python runtime

    • A DataFrame runs at the same speed whatever the language, because every
      language API compiles to the same logical plan:
    java/scala/python ==> Logical Plan
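
    One way to observe this shared execution path is `Dataset.explain()`, which prints the plans Catalyst produces; an equivalent query written against the Python API yields the same optimized logical plan. A sketch, assuming a DataFrame `df` with an `age` column as in the walkthrough below:

```scala
// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the physical plan; the logical plan does not depend on which
// language API was used to build the query.
df.filter(df.col("age") > 19).explain(true)
```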
    

    Let's walk through the basic DataFrame operations, following the example from the official docs:
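
    The walkthrough reads a local copy of people.json. Assuming it matches the sample file shipped with Spark (examples/src/main/resources/people.json), it contains one JSON object per line:

```json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```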

    import org.apache.spark.sql.SparkSession

    /**
      * Basic DataFrame API operations
      */
    object DataFrameApp {
      def main(args: Array[String]): Unit = {

        val spark = SparkSession
          .builder()
          .appName("DataFrameApp")
          .master("local[2]")
          .getOrCreate()

        // Load a JSON file into a DataFrame
        val peopleDF = spark.read.json("C:\\Users\\Administrator\\IdeaProjects\\SparkSQLProject\\spark-warehouse\\people.json")

        // Print the schema to the console in a tree format
        peopleDF.printSchema()

        // Show the first 20 rows of the dataset
        peopleDF.show()

        // Select one column: SELECT name FROM table
        peopleDF.select("name").show()

        // Select several columns, computing on one: SELECT name, age + 10 AS age2 FROM table
        peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show()

        // Filter rows on a column value: SELECT * FROM table WHERE age > 19
        peopleDF.filter(peopleDF.col("age") > 19).show()

        // Group by a column, then aggregate: SELECT age, COUNT(1) FROM table GROUP BY age
        peopleDF.groupBy("age").count().show()

        spark.stop()
      }
    }
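
    Each of the DataFrame API calls above has a direct SQL equivalent; registering the DataFrame as a temporary view lets you run them as SQL. A sketch to add inside `main` before `spark.stop()` (the view name `people` is my choice, not from the article):

```scala
// Expose peopleDF to the SQL engine under the name "people"
peopleDF.createOrReplaceTempView("people")

// Equivalent to peopleDF.groupBy("age").count().show()
spark.sql("SELECT age, COUNT(1) AS count FROM people GROUP BY age").show()
```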
    

    Original link: https://www.haomeiwen.com/subject/unhhfftx.html