前言
在使用spark进行数据处理时,之前RDD使用较多,目前随着任务复杂度增加,有schema的DataFrame更加简洁便利。现在总结一下df的基本操作,作为备忘。
DataFrame是什么
DataFrame是一种以RDD为基础的分布式数据集,每一列都有自己的名称和数据类型。 Spark中的DataFrame属于比较基本的数据结构,在进行数据处理时,用DataFrame进行处理较RDD相比,速度更快,是带有schema的RDD数据结构。反观RDD,在进行使用时,只有内部分区,无法知道数据元素内部的数据结构。
下面介绍一下在使用DataFrame(df)时的基本操作:
1.在服务器上启动spark-shell
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
2.生成DF格式数据
首先我们创造一个map,用字母指向该字母为首的水果英文单词,并转换为array形式。再将该array转换为rdd,并转换为名称为"alphabet"和"fruit"的df
同理生成名称为"alphabet"和"fruit1"的df1,为之后两个df做join操作作准备
scala> val test = Map("a" -> "apple", "b" -> "banana", "c" -> "cherry", "d" -> "durian", "g" -> "grape").toArray
test: Array[(String, String)] = Array((a,apple), (b,banana), (g,grape), (c,cherry), (d,durian))
scala> val df = sc.parallelize(test).toDF("alphabet", "fruit")
df: org.apache.spark.sql.DataFrame = [alphabet: string, fruit: string]
scala> val test1 = Map("a" -> "apple1", "b" -> "banana1", "c" -> "cherry1", "d" -> "durian1", "g" -> "grape1").toArray
test1: Array[(String, String)] = Array((a,apple1), (b,banana1), (g,grape1), (c,cherry1), (d,durian1))
scala> val df1 = sc.parallelize(test1).toDF("alphabet", "fruit1")
df1: org.apache.spark.sql.DataFrame = [alphabet: string, fruit1: string]
3.利用show()查看当前的df数据
scala> df.show()
+--------+------+
|alphabet| fruit|
+--------+------+
| a| apple|
| b|banana|
| g| grape|
| c|cherry|
| d|durian|
+--------+------+
4.增加一列fruit_copy,值和fruit一样
scala> df.withColumn("fruit_copy", df("fruit")).show()
+--------+------+----------+
|alphabet| fruit|fruit_copy|
+--------+------+----------+
| a| apple| apple|
| b|banana| banana|
| g| grape| grape|
| c|cherry| cherry|
| d|durian| durian|
+--------+------+----------+
5.将其中一列重命名,将fruit_copy重命名为fruit_double
scala> df.withColumn("fruit_copy", df("fruit")).withColumnRenamed("fruit_copy", "fruit_douuble").show()
+--------+------+-------------+
|alphabet| fruit|fruit_douuble|
+--------+------+-------------+
| a| apple| apple|
| b|banana| banana|
| g| grape| grape|
| c|cherry| cherry|
| d|durian| durian|
+--------+------+-------------+
6.将df和df1两个DataFrame做join left的操作,相当于rdd的leftOuterJoin,根据"alphabet"这列进行join
scala> df.join(df1, df("alphabet") === df1("alphabet"), "left").show()
+--------+------+--------+-------+
|alphabet| fruit|alphabet| fruit1|
+--------+------+--------+-------+
| g| grape| g| grape1|
| d|durian| d|durian1|
| c|cherry| c|cherry1|
| b|banana| b|banana1|
| a| apple| a| apple1|
+--------+------+--------+-------+
7.explode操作, 将"fruit"列按照"a"分割。两种写法:
方法一:
scala> df.withColumn("split", split(col("fruit"), "a")).withColumn("split", explode(col("split"))).show()
+--------+------+------+
|alphabet| fruit| split|
+--------+------+------+
| a| apple| |
| a| apple| pple|
| b|banana| b|
| b|banana| n|
| b|banana| n|
| b|banana| |
| g| grape| gr|
| g| grape| pe|
| c|cherry|cherry|
| d|durian| duri|
| d|durian| n|
+--------+------+------+
方法二:
scala> df.explode("fruit", "split") { it: String => it.split("a")}.show()
+--------+------+------+
|alphabet| fruit| split|
+--------+------+------+
| a| apple| |
| a| apple| pple|
| b|banana| b|
| b|banana| n|
| b|banana| n|
| b|banana| |
| g| grape| gr|
| g| grape| pe|
| c|cherry|cherry|
| d|durian| duri|
| d|durian| n|
+--------+------+------+
网友评论