最近用到dataframe的groupBy有点多,所以做个小总结,主要是一些与groupBy一起使用的一些聚合函数,如mean、sum、collect_list等;聚合后对新列重命名。
大纲
- groupBy以及列名重命名
- 相关聚合函数
1. groupBy
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', level='a', age=5, height=80),Row(name='Bob', level='a', age=5, height=80),Row(name='Cycy', level='b', age=10, height=80),Row(name='Didi', level='b', age=12, height=75),Row(name='EiEi', level='b', age=10, height=70)])
df = rdd.toDF()
#df.select(['name','level']).show()
print df.columns
#['age', 'height', 'level', 'name']
print df.dtypes
#[('age', 'bigint'), ('height', 'bigint'), ('level', 'string'), ('name', 'string')]
1.1 使用{}进行聚合操作
单列聚合
# 按level分组,计算age的平均值
_df1 = df.groupBy('level').agg({"age":"mean", "height": "mean"})
print _df1.show()
"""
+-----+------------------+-----------+
|level| avg(age)|avg(height)|
+-----+------------------+-----------+
| b|10.666666666666666| 75.0|
| a| 5.0| 80.0|
+-----+------------------+-----------+
"""
多列聚合
_df2 = df.groupBy('level','age').agg({"height":"mean"})
#下面用[]的写法是对的
#_df2 = df.groupBy(['level','age']).agg({"height":"mean"})
#用()的写法报错
#_df2 = df.groupBy(('level','age')).agg({"height":"mean"})
print _df2.show()
"""
+-----+---+-----------+
|level|age|avg(height)|
+-----+---+-----------+
| a| 5| 80.0|
| b| 10| 75.0|
| b| 12| 75.0|
+-----+---+-----------+
"""
重命名
这种方式下聚合后重命名,可以使用dataframe的withColumnRenamed函数。以多列聚合的case为例。
_df = _df2.withColumnRenamed('avg(height)', 'mean_height')
print _df.show()
"""
+-----+---+-----------+
|level|age|mean_height|
+-----+---+-----------+
| a| 5| 80.0|
| b| 10| 75.0|
| b| 12| 75.0|
+-----+---+-----------+
"""
1.2 使用pyspark.sql.functions的函数聚合、重命名
这种方式使用更简洁。
from pyspark.sql import functions as sf
_df3 = df.groupBy('level').agg(sf.mean(df.age).alias('mean_age'),sf.mean(df.height).alias('mean_height'))
#_df3 = df.groupBy('level').agg(sf.mean(df["age"]).alias('mean_age'),sf.mean(df["height"]).alias('mean_height'))
print _df3.show()
"""
+-----+------------------+-----------+
|level| mean_age|mean_height|
+-----+------------------+-----------+
| b|10.666666666666666| 75.0|
| a| 5.0| 80.0|
+-----+------------------+-----------+
"""
2.相关聚合函数
具体函数可见pyspark.sql.functions;我看了一遍,比较齐全,基本hive的用法都可以支持。下面列举一些我最近常用的函数。
'max': 'Aggregate function: returns the maximum value of the expression in a group.',
'min': 'Aggregate function: returns the minimum value of the expression in a group.',
'count': 'Aggregate function: returns the number of items in a group.',
'sum': 'Aggregate function: returns the sum of all values in the expression.',
'avg': 'Aggregate function: returns the average of the values in a group.',
'mean': 'Aggregate function: returns the average of the values in a group.',
'sumDistinct': 'Aggregate function: returns the sum of distinct values in the expression.'
'collect_list': returns a list of objects with duplicates.
'collect_set': returns a set of objects with duplicate elements eliminated.
'stddev': 'Aggregate function: alias for stddev_samp.',
'variance': 'Aggregate function: alias for var_samp.'
下面介绍sumDistinct、collect_list、collect_set
sumDistinct 去重后求和
df.groupBy('level').agg(sf.mean(df.age).alias('mean_age'),sf.sumDistinct(df.height).alias('sum_height'))
"""
+-----+------------------+----------+
|level| mean_age|sum_height|
+-----+------------------+----------+
| b|10.666666666666666| 225|
| a| 5.0| 80|
+-----+------------------+----------+
"""
collect_list
pyspark.sql.functions 还可以嵌套使用, collect_set返回去重后的结果,用法同collect_list
print df.groupBy('level').agg(sf.collect_list(df.name)).show()
"""
+-----+------------------+
|level|collect_list(name)|
+-----+------------------+
| b|[Didi, EiEi, Cycy]|
| a| [Alice, Bob]|
+-----+------------------+
"""
# concat_ws拼接字符串类型的array
print df.groupBy('level').agg(sf.concat_ws(',', sf.collect_list(df.name))).show()
"""
+-----+--------------------------------+
|level|concat_ws(,, collect_list(name))|
+-----+--------------------------------+
| b| Cycy,Didi,EiEi|
| a| Alice,Bob|
+-----+--------------------------------+
"""
print df.groupBy('level').agg(sf.concat_ws(',', sf.collect_list(df.name)).alias("names")).show()
"""
+-----+--------------+
|level| names|
+-----+--------------+
| b|Cycy,Didi,EiEi|
| a| Alice,Bob|
+-----+--------------+
"""
网友评论