1. Requirement
When working with PySpark DataFrames, we frequently need the difference, intersection, or union of two DataFrames.
These operations can also be written in Spark SQL, but when the tables have many columns, spelling them out in SQL becomes cumbersome; the DataFrame API is more convenient.
2. Solution
2.1 Data preparation
Code:
from pyspark.sql import SparkSession
# SparkSession.builder is an attribute, not a Builder() call
spark = SparkSession.builder \
    .appName('local') \
    .master('local') \
    .getOrCreate()
sentenceDataFrame = spark.createDataFrame([
    (1, "asf"),
    (2, "2143"),
    (3, "rfds")
], ["label", "sentence"])
sentenceDataFrame.show()
sentenceDataFrame1 = spark.createDataFrame([
    (1, "asf"),
    (2, "2143"),
    (4, "f8934y")
], ["label", "sentence"])
sentenceDataFrame1.show()
Test output:
2.2 Difference
Code:
# difference: subtract
newDF = sentenceDataFrame.select("sentence").subtract(sentenceDataFrame1.select("sentence"))
newDF.show()
newDF = sentenceDataFrame.subtract(sentenceDataFrame1)
newDF.show()
Test output:
2.3 Intersection
Code:
# intersection: intersect
newDF_intersect = sentenceDataFrame1.select("sentence").intersect(sentenceDataFrame.select("sentence"))
newDF_intersect.show()
newDF_intersect = sentenceDataFrame1.intersect(sentenceDataFrame)
newDF_intersect.show()
Test output:
2.4 Union
Code:
# union
newDF_union = sentenceDataFrame.select("sentence").union(sentenceDataFrame1.select("sentence"))
newDF_union.show()
newDF_union = sentenceDataFrame.union(sentenceDataFrame1)
newDF_union.show()
Test output:
2.5 Union + deduplication
Code:
# union + deduplication
newDF_union = sentenceDataFrame.select("sentence").union(sentenceDataFrame1.select("sentence")).distinct()
newDF_union.show()
newDF_union = sentenceDataFrame.union(sentenceDataFrame1).distinct()
newDF_union.show()
Test output: