PySpark Tips

Author: vincentxia | Published: 2018-09-27 14:08


1. Adding a column in PySpark and passing multiple arguments to a UDF

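Scenario: there is a list of keywords, and for every input row we need to check whether the value of the token field is in that list and store the result in a new column. For example, the token "union" yields 1, while the token "union1" yields 0.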

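from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf, col

keyword_list = ['union', 'workers', 'strike', 'pay', 'rally', 'free', 'immigration']

def label_maker_topic(tokens, topic_words):
    # Return 1 if the token is one of the topic words, otherwise 0.
    if tokens in topic_words:
        return 1
    else:
        return 0

def make_topic_word(topic_words):
    # Capture the list in a closure so the UDF itself only receives the column value.
    return udf(lambda c: label_maker_topic(c, topic_words), IntegerType())

# df is an existing DataFrame with a "token" column.
df = df.withColumn("topics", make_topic_word(keyword_list)(col("token")))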
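For a plain membership test like this one, the same column can probably be built without a UDF at all. The sketch below swaps in a different technique, the built-in Column.isin, rather than the closure-based UDF shown above; add_topic_flag and its token_col and out_col parameters are hypothetical names chosen for illustration.

```python
keyword_list = ['union', 'workers', 'strike', 'pay', 'rally', 'free', 'immigration']


def add_topic_flag(df, keywords, token_col="token", out_col="topics"):
    """Add out_col holding 1 where token_col is one of keywords, else 0."""
    # Column.isin builds a SQL IN expression; casting the boolean result
    # to int turns true/false into 1/0.
    return df.withColumn(out_col, df[token_col].isin(list(keywords)).cast("int"))
```

With a real DataFrame this would be called as df = add_topic_flag(df, keyword_list). One difference to keep in mind: for a null token the IN expression evaluates to null, so the new column is null rather than 0, whereas the UDF version returns 0.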

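Note: keyword_list above is a single variable. Several variables can be passed the same way, by calling df.withColumn("topics", make_topic_word(keyword_list, param2, param3, param4, ...)(col("token"))); the signatures of make_topic_word and label_maker_topic have to be extended to accept the extra parameters.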
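As a minimal sketch of what the extended signatures might look like, the version below passes two extra parameters alongside the keyword list. The names label_token, make_label_udf, case_sensitive and default are assumptions made up for this illustration; they are not part of PySpark or of the original code.

```python
def label_token(token, topic_words, case_sensitive=True, default=0):
    """Return 1 if token is one of topic_words, otherwise default."""
    if token is None:
        return default
    if not case_sensitive:
        token = token.lower()
    return 1 if token in topic_words else default


def make_label_udf(topic_words, case_sensitive=True, default=0):
    """Build a one-argument UDF that carries the extra parameters in a closure."""
    # Imported here so label_token stays usable without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # Normalise the keywords once on the driver, not once per row.
    if case_sensitive:
        words = frozenset(topic_words)
    else:
        words = frozenset(w.lower() for w in topic_words)
    return udf(lambda token: label_token(token, words, case_sensitive, default),
               IntegerType())
```

It would be applied the same way as before, for example df.withColumn("topics", make_label_udf(keyword_list, case_sensitive=False)(col("token"))). Only the column value travels through the UDF call; everything else is fixed when the UDF is created.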

2. Fuzzy filtering with filter

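Scenario: the rows of df need to be filtered with a condition similar to WHERE ... LIKE ... in SQL. The following filters the name column for values that contain the keyword jack.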

df.filter(df.name.like("%jack%")).show()
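The position of the % wildcard decides what kind of match is made. The sketch below builds the pattern for three common cases; like_pattern, filter_like and the mode parameter are hypothetical helpers introduced here for illustration, and only Column.like and DataFrame.filter come from PySpark.

```python
def like_pattern(keyword, mode="contains"):
    """Build a SQL LIKE pattern for keyword; % matches any run of characters."""
    patterns = {
        "contains": f"%{keyword}%",    # keyword anywhere in the value
        "startswith": f"{keyword}%",   # value begins with keyword
        "endswith": f"%{keyword}",     # value ends with keyword
    }
    return patterns[mode]


def filter_like(df, column, keyword, mode="contains"):
    """Keep the rows of df whose column matches the LIKE pattern."""
    return df.filter(df[column].like(like_pattern(keyword, mode)))
```

Calling filter_like(df, "name", "jack") is equivalent to the filter above. Note that % and _ inside the keyword itself are also treated as wildcards, and that the match is case-sensitive.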

References:
Passing a data frame column and external list to udf under withColumn
PySpark function reference
PySpark - Pass list as parameter to UDF

Article link: https://www.haomeiwen.com/subject/iileoftx.html
