数据准备,先构建两个DataFrame
scala> val df1 = spark.createDataset(Seq(("a", 1,2), ("b",2,3) )).toDF("k1","k2","k3")
df1: org.apache.spark.sql.DataFrame = [k1: string, k2: int ... 1 more field]
scala> val df2 = spark.createDataset(Seq(("a", 2,2), ("b",3,3), ("b", 2,1), ("c", 1,1)) ).toDF("k1","k2","k4")
df2: org.apache.spark.sql.DataFrame = [k1: string, k2: int ... 1 more field]
scala> df1.show
+---+---+---+
| k1| k2| k3|
+---+---+---+
| a| 1| 2|
| b| 2| 3|
+---+---+---+
scala> df2.show
+---+---+---+
| k1| k2| k4|
+---+---+---+
| a| 2| 2|
| b| 3| 3|
| b| 2| 1|
| c| 1| 1|
+---+---+---+
Join算子说明:
join比较通用两种调用方式,注意在usingColumns里的字段必须在两个DF中都存在
joinType:默认是 `inner`. 必须是以下类型的一种:`inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,`right`, `right_outer`, `left_semi`, `left_anti`.
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
(1)join内连接
//select * from df1 join df2 on df1.key1=df2.key1
//方法一
scala> df1.join(df2,"k1").show
+---+---+---+---+---+
| k1| k2| k3| k2| k4|
+---+---+---+---+---+
| a| 1| 2| 2| 2|
| b| 2| 3| 2| 1|
| b| 2| 3| 3| 3|
+---+---+---+---+---+
//方法二
scala> df1.join(df2,df1("k1") === df2("k1")).show
+---+---+---+---+---+---+
| k1| k2| k3| k1| k2| k4|
+---+---+---+---+---+---+
| a| 1| 2| a| 2| 2|
| b| 2| 3| b| 2| 1|
| b| 2| 3| b| 3| 3|
+---+---+---+---+---+---+
//方法三
scala> df1.join(df2,df1("k1") === df2("k1"),"inner").show
+---+---+---+---+---+---+
| k1| k2| k3| k1| k2| k4|
+---+---+---+---+---+---+
| a| 1| 2| a| 2| 2|
| b| 2| 3| b| 2| 1|
| b| 2| 3| b| 3| 3|
+---+---+---+---+---+---+
//不同字段比较
//select * from df1 join df2 on df1.key2=df2.key4
scala> df1.join(df2,df1("k2") === df2("k4"),"inner").show
+---+---+---+---+---+---+
| k1| k2| k3| k1| k2| k4|
+---+---+---+---+---+---+
| a| 1| 2| c| 1| 1|
| a| 1| 2| b| 2| 1|
| b| 2| 3| a| 2| 2|
+---+---+---+---+---+---+
//多个字段比较
//select * from df1 join df2 on df1.key1=df2.key1 and df1.key2=df2.key2
scala> df1.join(df2,Seq("k1","k2"),"inner").show
+---+---+---+---+
| k1| k2| k3| k4|
+---+---+---+---+
| b| 2| 3| 1|
+---+---+---+---+
(2)其他join类型,只需把inner改成你需要的类型即可
scala> df1.join(df2,Seq("k1"),"left").show
+---+---+---+---+---+
| k1| k2| k3| k2| k4|
+---+---+---+---+---+
| a| 1| 2| 2| 2|
| b| 2| 3| 2| 1|
| b| 2| 3| 3| 3|
+---+---+---+---+---+
//左外连接 left_outer可简写为left
scala> df1.join(df2,Seq("k1"),"left_outer").show
+---+---+---+---+---+
| k1| k2| k3| k2| k4|
+---+---+---+---+---+
| a| 1| 2| 2| 2|
| b| 2| 3| 2| 1|
| b| 2| 3| 3| 3|
+---+---+---+---+---+
//左半连接
scala> df1.join(df2,Seq("k1"),"leftsemi").show
+---+---+---+
| k1| k2| k3|
+---+---+---+
| a| 1| 2|
| b| 2| 3|
+---+---+---+
scala> df1.join(df2,Seq("k1","k2"),"left").show
+---+---+---+----+
| k1| k2| k3| k4|
+---+---+---+----+
| a| 1| 2|null|
| b| 2| 3| 1|
+---+---+---+----+
scala> df1.join(df2,Seq("k1"),"right").show
+---+----+----+---+---+
| k1| k2| k3| k2| k4|
+---+----+----+---+---+
| a| 1| 2| 2| 2|
| b| 2| 3| 3| 3|
| b| 2| 3| 2| 1|
| c|null|null| 1| 1|
+---+----+----+---+---+
left是left_outer的简写
转自:https://blog.csdn.net/lingbo229/article/details/82464466
网友评论