DataFrame
Python pandas
R
RDD MapReduce
DataFrame vs Dataset (Dataset added in Spark 1.6)
DS: Java, Scala
DF: 4 languages (Java, Scala, Python, R)
SchemaRDD (< 1.3) ==> DataFrame (>= 1.3)
A Dataset is a distributed collection of data.
Peel-the-onion analysis (layer by layer)
A DataFrame is a distributed collection of data
organized into named columns,
conceptually equivalent to a table in a relational database
DataFrame = Dataset[Row]
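The alias above can be checked with a minimal sketch. It assumes Spark 2.x or later running in local mode; the object name and the sample rows are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object DataFrameIsDatasetOfRow {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration
    val spark = SparkSession.builder()
      .appName("DataFrameIsDatasetOfRow")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows
    val df: DataFrame = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // No conversion is needed: in the Scala API, DataFrame is a type alias for Dataset[Row]
    val ds: Dataset[Row] = df
    ds.printSchema()

    spark.stop()
  }
}
```

The assignment `val ds: Dataset[Row] = df` compiles only because the two types are the same, which is the point of the line above.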
DataFrame vs RDD vs Dataset
Concept: collection
API: map, filter, flatMap, ...
Data structure
textFile(path)
RDD[Person]
name age height
spark.sql("").show()
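The difference in data structure can be sketched as follows. The field types of Person (String, Int, Double), the sample values, and the view name `people` are assumptions for illustration; the notes only give the field names name, age, height.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Assumption: field types are guessed; the notes only list the names
case class Person(name: String, age: Int, height: Double)

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddVsDataFrame")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val people: RDD[Person] = spark.sparkContext.parallelize(Seq(
      Person("alice", 30, 165.0),
      Person("bob", 25, 180.0)
    ))

    // RDD[Person]: Spark only sees opaque Person objects and an arbitrary Scala function
    val olderRdd = people.filter(_.age > 26)
    println(olderRdd.count())

    // DataFrame: the same data plus a schema (name, age, height) that Spark can inspect
    val peopleDF = people.toDF()
    peopleDF.printSchema()
    peopleDF.filter($"age" > 26).select("name").show()

    // With a schema in place, the data can also be queried with SQL
    peopleDF.createOrReplaceTempView("people")
    spark.sql("SELECT name, height FROM people WHERE age > 26").show()

    spark.stop()
  }
}
```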
Spark SQL entry point
< 2.0: SQLContext, HiveContext
>= 2.0: SparkSession
spark.read.format("json").load(path)
spark.read.format("text").load(path)
spark.read.format("parquet").load(path)
spark.read.format("orc").load(path)
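The entry point and the reader calls above can be sketched together. This is a minimal sketch assuming Spark 2.x or later in local mode; the temp directory, file names, and sample rows are hypothetical and exist only so there is something to read back.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ReadFormatsSketch {
  def main(args: Array[String]): Unit = {
    // Spark >= 2.0: SparkSession is the single entry point
    val spark = SparkSession.builder()
      .appName("ReadFormatsSketch")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data written to a temp directory
    val dir = Files.createTempDirectory("read-formats-sketch").toString
    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
    people.write.json(s"$dir/people.json")
    people.write.parquet(s"$dir/people.parquet")

    // Same reader API, different format names
    val fromJson = spark.read.format("json").load(s"$dir/people.json")
    val fromParquet = spark.read.format("parquet").load(s"$dir/people.parquet")
    // The text source yields a single string column named "value"
    val asText = spark.read.format("text").load(s"$dir/people.json")

    fromJson.printSchema()
    fromParquet.printSchema()
    asText.printSchema()

    // Shorthand methods exist for the built-in formats
    val sameJson = spark.read.json(s"$dir/people.json")
    println(sameJson.count() == fromJson.count())

    spark.stop()
  }
}
```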
With the source code in front of you, there are no secrets
infos.txt ==> DataFrame
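// Assumes a case class Student with four String fields; outside spark-shell, toDF() also needs import spark.implicits._
val students = sc.textFile("file:///home/hadoop/data/student.data")
  .map(_.split("\\|"))
  .map(x => Student(x(0), x(1), x(2), x(3)))
  .toDF()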
show()
=> show(20, true)   (first 20 rows; cell values longer than 20 characters are truncated)
show(5)   (first 5 rows)
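The text-file-to-DataFrame conversion and the show() variants can be tried end to end with a minimal sketch. The Student field names (id, name, phone, email) and the in-memory sample lines are assumptions; they stand in for the pipe-delimited file so the example runs without it.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: field names are hypothetical; the notes only show four String fields
case class Student(id: String, name: String, phone: String, email: String)

object ShowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShowSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // brings toDF() into scope

    // Hypothetical pipe-delimited records standing in for the data file
    val lines = Seq(
      "1|alice|13800000000|alice@example.com",
      "2|bob|13900000000|a-deliberately-long-address@example.com"
    )

    val students = spark.sparkContext.parallelize(lines)
      .map(_.split("\\|")) // "|" is a regex metacharacter, so it must be escaped
      .map(x => Student(x(0), x(1), x(2), x(3)))
      .toDF()

    students.show()          // same as show(20, true)
    students.show(5)         // at most 5 rows, long values still truncated
    students.show(20, false) // up to 20 rows, full cell values

    spark.stop()
  }
}
```

Passing `false` as the second argument is the usual way to see long values such as the second email address in full.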