美文网首页
Spark SQL/DataFrame/DataSet Joi

Spark SQL/DataFrame/DataSet Joi

作者: DuLaGong | 来源:发表于2019-04-30 14:10 被阅读0次

数据准备,先构建两个DataFrame

scala>  val df1 = spark.createDataset(Seq(("a", 1,2), ("b",2,3) )).toDF("k1","k2","k3")

df1: org.apache.spark.sql.DataFrame = [k1: string, k2: int ... 1 more field]

scala>  val df2 = spark.createDataset(Seq(("a", 2,2), ("b",3,3), ("b", 2,1), ("c", 1,1)) ).toDF("k1","k2","k4")

df2: org.apache.spark.sql.DataFrame = [k1: string, k2: int ... 1 more field]

scala> df1.show

+---+---+---+

| k1| k2| k3|

+---+---+---+

|  a|  1|  2|

|  b|  2|  3|

+---+---+---+

scala> df2.show

+---+---+---+

| k1| k2| k4|

+---+---+---+

|  a|  2|  2|

|  b|  3|  3|

|  b|  2|  1|

|  c|  1|  1|

+---+---+---+

Join算子说明:

join比较通用两种调用方式,注意在usingColumns里的字段必须在两个DF中都存在

joinType:默认是 `inner`. 必须是以下类型的一种:`inner`, `cross`, `outer`, `full`, `full_outer`, `left`, `left_outer`,`right`, `right_outer`, `left_semi`, `left_anti`.

def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

(1)join内连接

//select * from df1 join df2 on df1.key1=df2.key1

//方法一

scala> df1.join(df2,"k1").show

+---+---+---+---+---+

| k1| k2| k3| k2| k4|

+---+---+---+---+---+

|  a|  1|  2|  2|  2|

|  b|  2|  3|  2|  1|

|  b|  2|  3|  3|  3|

+---+---+---+---+---+

//方法二

scala> df1.join(df2,df1("k1") === df2("k1")).show

+---+---+---+---+---+---+

| k1| k2| k3| k1| k2| k4|

+---+---+---+---+---+---+

|  a|  1|  2|  a|  2|  2|

|  b|  2|  3|  b|  2|  1|

|  b|  2|  3|  b|  3|  3|

+---+---+---+---+---+---+

//方法三

scala> df1.join(df2,df1("k1") === df2("k1"),"inner").show

+---+---+---+---+---+---+

| k1| k2| k3| k1| k2| k4|

+---+---+---+---+---+---+

|  a|  1|  2|  a|  2|  2|

|  b|  2|  3|  b|  2|  1|

|  b|  2|  3|  b|  3|  3|

+---+---+---+---+---+---+

//不同字段比较

//select * from df1 join df2 on df1.key2=df2.key4

scala> df1.join(df2,df1("k2") === df2("k4"),"inner").show

+---+---+---+---+---+---+

| k1| k2| k3| k1| k2| k4|

+---+---+---+---+---+---+

|  a|  1|  2|  c|  1|  1|

|  a|  1|  2|  b|  2|  1|

|  b|  2|  3|  a|  2|  2|

+---+---+---+---+---+---+

//多个字段比较

//select * from df1 join df2 on df1.key1=df2.key1 and df1.key2=df2.key2

scala> df1.join(df2,Seq("k1","k2"),"inner").show

+---+---+---+---+

| k1| k2| k3| k4|

+---+---+---+---+

|  b|  2|  3|  1|

+---+---+---+---+

(2)其他join类型,只需把inner改成你需要的类型即可

scala> df1.join(df2,Seq("k1"),"left").show

+---+---+---+---+---+

| k1| k2| k3| k2| k4|

+---+---+---+---+---+

|  a|  1|  2|  2|  2|

|  b|  2|  3|  2|  1|

|  b|  2|  3|  3|  3|

+---+---+---+---+---+

//左外连接 left_outer可简写为left

scala> df1.join(df2,Seq("k1"),"left_outer").show

+---+---+---+---+---+

| k1| k2| k3| k2| k4|

+---+---+---+---+---+

|  a|  1|  2|  2|  2|

|  b|  2|  3|  2|  1|

|  b|  2|  3|  3|  3|

+---+---+---+---+---+

//左半连接

scala> df1.join(df2,Seq("k1"),"leftsemi").show

+---+---+---+

| k1| k2| k3|

+---+---+---+

|  a|  1|  2|

|  b|  2|  3|

+---+---+---+

scala> df1.join(df2,Seq("k1","k2"),"left").show

+---+---+---+----+

| k1| k2| k3|  k4|

+---+---+---+----+

|  a|  1|  2|null|

|  b|  2|  3|  1|

+---+---+---+----+

scala> df1.join(df2,Seq("k1"),"right").show

+---+----+----+---+---+

| k1|  k2|  k3| k2| k4|

+---+----+----+---+---+

|  a|  1|  2|  2|  2|

|  b|  2|  3|  3|  3|

|  b|  2|  3|  2|  1|

|  c|null|null|  1|  1|

+---+----+----+---+---+

left是left_outer的简写

转自:https://blog.csdn.net/lingbo229/article/details/82464466

相关文章

网友评论

      本文标题:Spark SQL/DataFrame/DataSet Joi

      本文链接:https://www.haomeiwen.com/subject/ixfvnqtx.html