21-SparkSQL02

Author: CrUelAnGElPG | Published 2018-09-06 09:36

    DataFrame

    Same idea as a pandas DataFrame in Python

    or a data.frame in R

    RDD MapReduce

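    DataFrame vs Dataset (Dataset was added in Spark 1.6)

    Dataset API: Java and Scala only

    DataFrame API: 4 languages (Scala, Java, Python, R)

    SchemaRDD (before Spark 1.3)

    ==> renamed in 1.3 to

    DataFrame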

    A Dataset is a distributed collection of data.

    Peel the definition like an onion, one layer at a time:

    A DataFrame is a distributed collection of data

    organized into named columns

    conceptually equivalent to a table in a relational database

    DataFrame = Dataset[Row]
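To make the alias concrete, here is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master. `Person`, the object name, and the sample rows are made up for illustration. It assigns a DataFrame to a `Dataset[Row]` without any conversion, then uses `as[Person]` to get a typed Dataset.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object DataFrameIsDatasetOfRow {
  // Builds a two-row DataFrame from made-up sample data.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Person("Alice", 30), Person("Bob", 25)).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameIsDatasetOfRow")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df: DataFrame = sample(spark)

    // Compiles with no conversion because DataFrame is a type alias for Dataset[Row].
    val rows: Dataset[Row] = df
    // Row fields are looked up by name at runtime, not checked by the compiler.
    println(rows.first().getAs[String]("name"))

    // as[Person] gives a typed Dataset, so field access is checked at compile time.
    val people: Dataset[Person] = df.as[Person]
    people.map(p => p.name.toUpperCase).show()

    spark.stop()
  }
}
```

The untyped `Row` view is what all four language APIs share; the typed `Dataset[Person]` view is only available from Scala and Java.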

    DataFrame vs RDD vs Dataset

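    Concept: all three are distributed collections

    API: map, filter, flatMap, ...

    Data structure: this is where they differ

    textFile(path)

    RDD[Person]: Spark only sees opaque Person objects

    DataFrame: Spark also knows the columns (name, age, height)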

    spark.sql("").show()
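The call above leaves the SQL text empty. As a hedged sketch of one way to fill it in, the example below registers a small DataFrame as a temporary view and queries it. The `Person` class, the view name `people`, and the rows are assumptions for illustration; the columns follow the name / age / height example from the notes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record with the name / age / height columns from the notes.
case class Person(name: String, age: Int, height: Double)

object SqlOverTempView {
  // Registers made-up data as a temp view and queries it with SQL.
  def adults(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val people = Seq(
      Person("Alice", 30, 165.0),
      Person("Bob", 17, 180.0)
    ).toDF()
    // The view name "people" is an assumption for this sketch.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age >= 18")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlOverTempView")
      .master("local[*]")
      .getOrCreate()
    adults(spark).show()
    spark.stop()
  }
}
```

`spark.sql` returns a DataFrame, so the result can be chained with `show()`, `filter`, or any other DataFrame operation.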

    Spark SQL entry point

    Spark < 2.0: SQLContext, HiveContext

    Spark >= 2.0: SparkSession
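A minimal sketch of the Spark 2.x entry point, assuming a local master; the app name and object name are placeholders. It also shows that the older `SparkContext` and `SQLContext` objects remain reachable from the session.

```scala
import org.apache.spark.sql.SparkSession

object EntryPoint {
  // Builds (or reuses) the single entry point used from Spark 2.0 onward.
  def build(): SparkSession =
    SparkSession.builder()
      .appName("EntryPoint")   // placeholder name
      .master("local[*]")      // assumption: run locally, not on a cluster
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = build()

    // The pre-2.0 entry points are still available through the session.
    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext

    println(s"Spark version: ${spark.version}")
    println(s"Application name: ${sc.appName}")
    println(s"SQLContext wraps this session: ${sqlContext.sparkSession eq spark}")

    spark.stop()
  }
}
```

In `spark-shell` the session is already created and bound to `spark`, with the `SparkContext` bound to `sc`, so the builder call is only needed in standalone applications.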

    spark.read.format("json").load(path)

    spark.read.format("text").load(path)

    spark.read.format("parquet").load(path)

    spark.read.format("orc").load(path)
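The four calls above differ only in the format name. The sketch below illustrates that with a round trip: it writes a tiny made-up dataset to a temporary directory in a given format, then loads it back with `spark.read.format(...).load(...)`. The object name, column names, and temp-directory prefix are assumptions; only `json` and `parquet` are exercised here, since `text` requires a single string column when writing and `orc` support depends on the Spark build.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadFormats {
  // Writes a tiny dataset as `format` under a fresh temp dir, then loads it back.
  def roundTrip(spark: SparkSession, format: String): DataFrame = {
    import spark.implicits._
    // save() needs a path that does not exist yet, so use a new subdirectory.
    val dir = Files.createTempDirectory("spark-read-demo").resolve(format).toString
    Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
      .write.format(format).save(dir)
    spark.read.format(format).load(dir)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReadFormats")
      .master("local[*]")
      .getOrCreate()

    // JSON carries no schema, so Spark infers one by scanning the data.
    roundTrip(spark, "json").printSchema()
    // Parquet stores the schema in the file, so the types come back as written.
    roundTrip(spark, "parquet").printSchema()

    spark.stop()
  }
}
```

Comparing the two printed schemas is a quick way to see which formats are self-describing and which rely on inference.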

    With the source code in front of you, there are no secrets

    infos.txt ==> DataFrame

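    // spark-shell: needs a case class Student with four String fields and import spark.implicits._
    val students = sc.textFile("file:///home/hadoop/data/student.data").map(_.split("\\|")).map(x => Student(x(0), x(1), x(2), x(3))).toDF()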
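The one-liner above depends on two things the notes do not show: a `Student` case class and the implicits that provide `toDF()`. Here is a self-contained sketch under stated assumptions: the field names `id`, `name`, `phone`, and `email` are guesses (the notes only show four positional fields), and the sample rows are written to a temporary file that stands in for `student.data`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Field names are assumptions; the notes only show four positional fields.
case class Student(id: String, name: String, phone: String, email: String)

object StudentsToDF {
  // Splits one pipe-delimited line into a Student.
  // The pipe must be escaped because split() takes a regular expression.
  def parse(line: String): Student = {
    val x = line.split("\\|")
    Student(x(0), x(1), x(2), x(3))
  }

  // Reads a text file as an RDD, parses each line, and converts it to a DataFrame.
  def load(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._ // brings toDF() into scope for RDDs of case classes
    spark.sparkContext.textFile(path).map(parse).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StudentsToDF")
      .master("local[*]")
      .getOrCreate()

    // Made-up sample rows in a temp file, standing in for student.data.
    val file = Files.createTempFile("student", ".data")
    val sample = "1|Alice|111|alice@example.com\n2|Bob|222|bob@example.com"
    Files.write(file, sample.getBytes("UTF-8"))

    val students = load(spark, file.toUri.toString)
    students.printSchema()
    students.show()

    spark.stop()
  }
}
```

Note that `split` drops trailing empty fields, so a line ending in `|` would have fewer than four parts and `x(3)` would throw; real data may need a length check.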

    show()

    => show(20, true): first 20 rows, strings longer than 20 characters are truncated

    show(5): first 5 rows
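A short sketch of the `show` variants, assuming a local master; the object name and the 30 made-up rows with deliberately long text are only there so that both the row limit and the truncation are visible.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShowVariants {
  // 30 made-up rows whose text column is longer than 20 characters.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (1 to 30).map(i => (i, s"a-rather-long-description-for-row-$i")).toDF("id", "text")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShowVariants")
      .master("local[*]")
      .getOrCreate()
    val df = sample(spark)

    df.show()         // same as df.show(20, true): 20 rows, long cells cut off
    df.show(5)        // first 5 rows, long cells still cut off
    df.show(5, false) // first 5 rows, full cell contents

    spark.stop()
  }
}
```

Passing `false` as the second argument is the usual way to inspect long string values that the default output would cut short.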
