Summary
GroupBy works much like SQL's GROUP BY; the advantage is that everything stays inside Python, instead of having to do the grouping separately in a database as before.
- Groupby: grouping and aggregation
1.1. Basic operations
1.2. Iterating over groups
1.3. Selecting a column or subset of columns
- Data aggregation
2.1. Group-level transformation: transform
2.2. apply: split-apply-combine
1. Groupby: grouping and aggregation
(figure: Groupby.png)
1.1. Basic operations
Import the relevant libraries:
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
Basic format
DataFrame['column to process'].groupby(key).operation()
Example:
The grouping key below is a Series
In [96]: df = DataFrame({'key1':['a','a','b','b','a'],
...: 'key2':['one','two','one','two','one'],
...: 'data1':np.random.randn(5),
...: 'data2':np.random.randn(5)})
In [97]:
In [97]: df
Out[97]:
key1 key2 data1 data2
0 a one -1.390006 0.277334
1 a two -1.309464 1.245893
2 b one 1.293608 1.206705
3 b two -0.549139 0.140945
4 a one 2.292085 0.271638
In [98]: grouped = df['data1'].groupby(df['key1'])
# the result is a GroupBy object; nothing has been computed yet
In [99]: grouped
Out[99]: <pandas.core.groupby.groupby.SeriesGroupBy object at 0x000000EA33F9B438>
# mean() computes the per-group average
In [100]: grouped.mean()
Out[100]:
key1
a -0.135795
b 0.372235
Name: data1, dtype: float64
In [101]: means = df['data1'].groupby([df['key1'],df['key2']]).mean()
In [102]: means
Out[102]:
key1 key2
a one 0.451039
two -1.309464
b one 1.293608
two -0.549139
Name: data1, dtype: float64
In [103]: means.unstack()
Out[103]:
key2 one two
key1
a 0.451039 -1.309464
b 1.293608 -0.549139
The grouping keys below are arbitrary arrays (of the same length as the data)
# np.array objects as grouping keys
In [5]: states = np.array(['O','C','C','O','O'])
In [6]: years = np.array([2005,2005,2006,2005,2006])
In [7]: df['data1'].groupby([states,years]).mean()
Out[7]:
C 2005 1.325095
2006 0.111973
O 2005 -0.428585
2006 0.479145
Name: data1, dtype: float64
# pass only the key column name(s); all numeric columns are grouped and aggregated
In [9]: df.groupby('key1').mean()
Out[9]:
data1 data2
key1
a 0.543673 0.807876
b -0.285988 1.684493
In [10]: df.groupby(['key1','key2']).mean()
Out[10]:
data1 data2
key1 key2
a one 0.152961 1.385948
two 1.325095 -0.348269
b one 0.111973 1.977837
two -0.683948 1.391149
# inspect group sizes
In [11]: df.groupby('key1').size()
Out[11]:
key1
a 3
b 2
dtype: int64
In [12]: df.groupby(['key1','key2']).size()
Out[12]:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
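A related point not shown above: `size()` counts rows per group including NaN, while `count()` counts only the non-NA values in each column. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# one missing value in group 'a'
df = pd.DataFrame({'key': ['a', 'a', 'b'],
                   'val': [1.0, np.nan, 2.0]})

sizes = df.groupby('key').size()           # rows per group, NaN included
counts = df.groupby('key')['val'].count()  # non-NA values only

print(sizes['a'], counts['a'])  # 2 1
```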
1.2. Iterating over groups
- Inspect the detailed grouping for a key or key pair
# the DataFrame to process
In [14]: df
Out[14]:
key1 key2 data1 data2
0 a one -0.173222 1.621207
1 a two 1.325095 -0.348269
2 b one 0.111973 1.977837
3 b two -0.683948 1.391149
4 a one 0.479145 1.150689
# groups for a single key
In [16]: for name,group in df.groupby('key1'):
...: print(name)
...: print(group)
...:
a
key1 key2 data1 data2
0 a one -0.173222 1.621207
1 a two 1.325095 -0.348269
4 a one 0.479145 1.150689
b
key1 key2 data1 data2
2 b one 0.111973 1.977837
3 b two -0.683948 1.391149
# groups for a key pair
In [19]: for (k1,k2),group in df.groupby(['key1','key2']):
...: print(k1,k2)
...: print(group)
...:
a one
key1 key2 data1 data2
0 a one -0.173222 1.621207
4 a one 0.479145 1.150689
a two
key1 key2 data1 data2
1 a two 1.325095 -0.348269
b one
key1 key2 data1 data2
2 b one 0.111973 1.977837
b two
key1 key2 data1 data2
3 b two -0.683948 1.391149
- Turn the groups above into a dict
In [20]: pieces = dict(list(df.groupby('key1')))
In [21]: pieces
Out[21]:
{'a': key1 key2 data1 data2
0 a one -0.173222 1.621207
1 a two 1.325095 -0.348269
4 a one 0.479145 1.150689, 'b': key1 key2 data1 data2
2 b one 0.111973 1.977837
3 b two -0.683948 1.391149}
In [23]: pieces['a']
Out[23]:
key1 key2 data1 data2
0 a one -0.173222 1.621207
1 a two 1.325095 -0.348269
4 a one 0.479145 1.150689
In [24]: pieces['b']
Out[24]:
key1 key2 data1 data2
2 b one 0.111973 1.977837
3 b two -0.683948 1.391149
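Besides building a dict of all pieces, a single group can be pulled out directly with `get_group`; a small sketch (data made up here):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': range(5)})

grouped = df.groupby('key1')
piece_a = grouped.get_group('a')  # same rows as dict(list(grouped))['a']
print(piece_a)
```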
1.3. Selecting a column or subset of columns
- Basic format
# long-hand form
In [29]: df['data1'].groupby(df['key1'])
In [30]: df[['data2']].groupby(df['key1'])
# syntactic sugar for the above
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
# example
In [33]: df.groupby(['key1','key2'])['data2'].mean()
Out[33]:
key1 key2
a one 1.385948
two -0.348269
b one 1.977837
two 1.391149
Name: data2, dtype: float64
2. Data aggregation
agg is simply an alias for aggregate; it applies a function (built-in or user-defined) to each group.
Example:
In [48]: df
Out[48]:
key1 key2 data1 data2
0 a one -0.173222 1.621207
1 a two 1.325095 -0.348269
2 b one 0.111973 1.977837
3 b two -0.683948 1.391149
4 a one 0.479145 1.150689
In [49]: def PeakToPeak(arr):
...: return arr.max() - arr.min()
In [50]: df.groupby('key1').agg(PeakToPeak)
Out[50]:
data1 data2
key1
a 1.498317 1.969476
b 0.795921 0.586688
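agg also accepts a list of functions, yielding one result column per function; user-defined functions are labeled by their `__name__`. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 3.0, 2.0, 5.0, 4.0]})

def peak_to_peak(arr):
    # range of each group
    return arr.max() - arr.min()

# one column per function: 'mean', 'max', 'peak_to_peak'
result = df.groupby('key1')['data1'].agg(['mean', 'max', peak_to_peak])
print(result)
```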
2.1. Group-level transformation: transform
- Add columns holding each group's mean, aligned back to the original rows
# previous method: aggregate, then merge
In [68]: df
Out[68]:
key1 key2 data1 data2
0 a one -0.173222 1.621207
1 a two 1.325095 -0.348269
2 b one 0.111973 1.977837
3 b two -0.683948 1.391149
4 a one 0.479145 1.150689
In [69]: k1_means = df.groupby('key1').mean().add_prefix('mean_')
In [70]: k1_means
Out[70]:
mean_data1 mean_data2
key1
a 0.543673 0.807876
b -0.285988 1.684493
In [71]:
In [71]: pd.merge(df,k1_means,left_on = 'key1' , right_index = True)
Out[71]:
key1 key2 data1 data2 mean_data1 mean_data2
0 a one -0.173222 1.621207 0.543673 0.807876
1 a two 1.325095 -0.348269 0.543673 0.807876
4 a one 0.479145 1.150689 0.543673 0.807876
2 b one 0.111973 1.977837 -0.285988 1.684493
3 b two -0.683948 1.391149 -0.285988 1.684493
# transform method
In [83]: meanData = df.groupby('key1').transform(np.mean).add_prefix('mean_')
In [84]: meanData
Out[84]:
mean_data1 mean_data2
0 0.543673 0.807876
1 0.543673 0.807876
2 -0.285988 1.684493
3 -0.285988 1.684493
4 0.543673 0.807876
In [85]: pd.concat([df,meanData], axis = 1)
Out[85]:
key1 key2 data1 data2 mean_data1 mean_data2
0 a one -0.173222 1.621207 0.543673 0.807876
1 a two 1.325095 -0.348269 0.543673 0.807876
2 b one 0.111973 1.977837 -0.285988 1.684493
3 b two -0.683948 1.391149 -0.285988 1.684493
4 a one 0.479145 1.150689 0.543673 0.807876
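transform also accepts the function name as a string, which sidesteps passing np.mean and works per selected column; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# same length as df; each row carries its group's mean
df['mean_data1'] = df.groupby('key1')['data1'].transform('mean')
print(df)
```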
- transform: replace each value with its group's mean.
In [90]: people
Out[90]:
a b c d e
Joe 0.498185 0.460470 -0.892633 -1.561500 0.279949
Steve -0.885170 -1.490421 -0.787302 1.559050 1.183115
Wes -0.237464 NaN NaN -0.043788 -1.091813
Jim -1.547607 -0.121682 -0.355623 -1.703322 -0.733741
Travis 0.638562 0.486515 -0.233517 0.023372 0.366325
In [94]: key = list('ototo')
# compute per-group means using key
In [95]: people.groupby(key).mean()
Out[95]:
a b c d e
o 0.299761 0.473492 -0.563075 -0.527305 -0.148513
t -1.216388 -0.806052 -0.571462 -0.072136 0.224687
# broadcast the means above back onto the original shape
In [96]: people.groupby(key).transform(np.mean)
Out[96]:
a b c d e
Joe 0.299761 0.473492 -0.563075 -0.527305 -0.148513
Steve -1.216388 -0.806052 -0.571462 -0.072136 0.224687
Wes 0.299761 0.473492 -0.563075 -0.527305 -0.148513
Jim -1.216388 -0.806052 -0.571462 -0.072136 0.224687
Travis 0.299761 0.473492 -0.563075 -0.527305 -0.148513
- Compute each value's deviation from its group mean (commonly used to normalize data)
In [105]: people
Out[105]:
a b c d e
Joe 0.498185 0.460470 -0.892633 -1.561500 0.279949
Steve -0.885170 -1.490421 -0.787302 1.559050 1.183115
Wes -0.237464 NaN NaN -0.043788 -1.091813
Jim -1.547607 -0.121682 -0.355623 -1.703322 -0.733741
Travis 0.638562 0.486515 -0.233517 0.023372 0.366325
# demeaning function: subtract the group mean from the data
In [106]: def demean(arr):
...: return arr - arr.mean()
In [107]: key
Out[107]: ['o', 't', 'o', 't', 'o']
# transform the data into demeaned values
In [108]: demeaned = people.groupby(key).transform(demean)
In [109]: demeaned
Out[109]:
a b c d e
Joe 0.198424 -0.013023 -0.329558 -1.034194 0.428462
Steve 0.331218 -0.684370 -0.215840 1.631186 0.958428
Wes -0.537225 NaN NaN 0.483517 -0.943300
Jim -0.331218 0.684370 0.215840 -1.631186 -0.958428
Travis 0.338801 0.013023 0.329558 0.550677 0.514838
# in theory the result is 0; floating-point limits leave tiny values that only approach zero
In [110]: demeaned.groupby(key).mean()
Out[110]:
a b ... d e
o 1.850372e-17 2.775558e-17 ... -3.700743e-17 -3.700743e-17
t 5.551115e-17 0.000000e+00 ... 0.000000e+00 0.000000e+00
[2 rows x 5 columns]
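The same pattern extends from demeaning to full within-group standardization (z-scores); a sketch with made-up data (note that pandas' std uses ddof=1):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(10).reshape(5, 2).astype(float),
                      columns=['a', 'b'])
key = ['o', 't', 'o', 't', 'o']

def zscore(arr):
    # standardize within the group: mean 0, std 1
    return (arr - arr.mean()) / arr.std()

normalized = people.groupby(key).transform(zscore)
print(normalized.groupby(key).mean())  # ~0 in every cell
```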
2.2. apply: split-apply-combine
Split the object into pieces --> apply a function to each piece --> combine the results.
- Basic usage
# define top: return the n rows with the highest 'tip_pct'
In [114]: def top(df, n=5, column = 'tip_pct'):
     ...:     return df.sort_values(by = column)[-n:]
# apply the function directly
In [115]: top(tips)
Out[115]:
total_bill tip sex smoker day time size tip_pct
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [117]: top(tips,n=6)
Out[117]:
total_bill tip sex smoker day time size tip_pct
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
232 11.61 3.39 Male No Sat Dinner 2 0.291990
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
# apply the function to each group after groupby
In [118]: tips.groupby('smoker').apply(top)
Out[118]:
total_bill tip sex ... time size tip_pct
smoker ...
No 88 24.71 5.85 Male ... Lunch 2 0.236746
185 20.69 5.00 Male ... Dinner 5 0.241663
51 10.29 2.60 Female ... Dinner 2 0.252672
149 7.51 2.00 Male ... Lunch 2 0.266312
232 11.61 3.39 Male ... Dinner 2 0.291990
Yes 109 14.31 4.00 Female ... Dinner 2 0.279525
183 23.17 6.50 Male ... Dinner 4 0.280535
67 3.07 1.00 Female ... Dinner 1 0.325733
178 9.60 4.00 Female ... Dinner 2 0.416667
172 7.25 5.15 Male ... Dinner 2 0.710345
[10 rows x 8 columns]
In [119]: tips.groupby(['smoker', 'day']).apply(top, n = 1, column = 'total_bill')
Out[119]:
total_bill tip sex ... time size tip_pct
smoker day ...
No Fri 94 22.75 3.25 Female ... Dinner 2 0.142857
Sat 212 48.33 9.00 Male ... Dinner 4 0.186220
Sun 156 48.17 5.00 Male ... Dinner 6 0.103799
Thur 142 41.19 5.00 Male ... Lunch 5 0.121389
Yes Fri 95 40.17 4.73 Male ... Dinner 4 0.117750
Sat 170 50.81 10.00 Male ... Dinner 3 0.196812
Sun 182 45.35 3.50 Male ... Dinner 3 0.077178
Thur 197 43.11 5.00 Female ... Lunch 4 0.115982
[8 rows x 8 columns]
In [123]: result = tips.groupby('smoker')['tip_pct'].describe()
In [124]: result
Out[124]:
count mean std ... 50% 75% max
smoker ...
No 151.0 0.159328 0.039910 ... 0.155625 0.185014 0.291990
Yes 93.0 0.163196 0.085119 ... 0.153846 0.195059 0.710345
[2 rows x 8 columns]
In [125]:
In [125]: result.T
Out[125]:
smoker No Yes
count 151.000000 93.000000
mean 0.159328 0.163196
std 0.039910 0.085119
min 0.056797 0.035638
25% 0.136906 0.106771
50% 0.155625 0.153846
75% 0.185014 0.195059
max 0.291990 0.710345
- group_keys=True/False: whether to prefix the result index with the group keys
In [127]: tips.groupby('smoker', group_keys= False).apply(top)
Out[127]:
total_bill tip sex smoker day time size tip_pct
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
51 10.29 2.60 Female No Sun Dinner 2 0.252672
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
In [128]: tips.groupby('smoker').apply(top)
Out[128]:
total_bill tip sex ... time size tip_pct
smoker ...
No 88 24.71 5.85 Male ... Lunch 2 0.236746
185 20.69 5.00 Male ... Dinner 5 0.241663
51 10.29 2.60 Female ... Dinner 2 0.252672
149 7.51 2.00 Male ... Lunch 2 0.266312
232 11.61 3.39 Male ... Dinner 2 0.291990
Yes 109 14.31 4.00 Female ... Dinner 2 0.279525
183 23.17 6.50 Male ... Dinner 4 0.280535
67 3.07 1.00 Female ... Dinner 1 0.325733
178 9.60 4.00 Female ... Dinner 2 0.416667
172 7.25 5.15 Male ... Dinner 2 0.710345
[10 rows x 8 columns]
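The effect of group_keys is easier to see on a tiny frame; a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# group_keys=False: the result keeps the original flat index
without = df.groupby('key', group_keys=False).apply(lambda g: g['val'] * 10)
print(without)
```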
- Bucket and quantile analysis of a dataset
First split the data into buckets with cut / qcut, then run a quantile analysis on them with groupby/apply.
In [129]: frame = DataFrame({'data1':np.random.randn(1000),
...: 'data2':np.random.randn(1000)})
# using cut
In [130]: factor = pd.cut(frame.data1, 4)
In [131]: factor[:10]
Out[131]:
0 (-0.286, 1.462]
1 (-0.286, 1.462]
2 (-2.034, -0.286]
3 (1.462, 3.209]
4 (-0.286, 1.462]
5 (-0.286, 1.462]
6 (-3.788, -2.034]
7 (-3.788, -2.034]
8 (-0.286, 1.462]
9 (-0.286, 1.462]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.788, -2.034] < (-2.034, -0.286] < (-0.286, 1.462] < (1.462, 3.209]]
In [132]: len(factor)
Out[132]: 1000
# return a dict of { stat name: value } for each group
In [134]: def get_stats(group):
...: return { 'min': group.min(), 'max' : group.max(),
...: 'count': group.count(), 'mean' : group.mean()}
In [135]: grouped = frame.data2.groupby(factor)
In [136]: grouped.apply(get_stats).unstack()
Out[136]:
count max mean min
data1
(-3.788, -2.034] 18.0 1.701853 0.276175 -1.210724
(-2.034, -0.286] 383.0 2.860290 0.006860 -2.509304
(-0.286, 1.462] 513.0 3.147908 0.081546 -2.712499
(1.462, 3.209] 86.0 2.066749 -0.001550 -2.043683
# use qcut
In [137]: grouping = pd.qcut(frame.data1, 10, labels=False)
In [138]: grouped = frame.data2.groupby(grouping)
In [139]: grouped.apply(get_stats).unstack()
Out[139]:
count max mean min
data1
0 100.0 2.585796 0.094647 -2.299329
1 100.0 2.601896 0.110122 -2.396772
2 100.0 2.860290 -0.025375 -2.177280
3 100.0 2.139157 -0.104499 -2.509304
4 100.0 2.826224 0.143575 -2.290512
5 100.0 3.147908 0.106912 -1.686569
6 100.0 2.173290 0.056089 -1.773193
7 100.0 1.974363 -0.036664 -2.352925
8 100.0 2.182190 0.161370 -2.712499
9 100.0 2.066749 -0.013192 -2.043683
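qcut can also take explicit quantile edges instead of an equal-sized bucket count, producing custom-sized buckets; a sketch with seeded random data:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
frame = pd.DataFrame({'data1': rng.randn(1000),
                      'data2': rng.randn(1000)})

# 10% / 40% / 40% / 10% buckets by quantile edges
grouping = pd.qcut(frame.data1, [0, 0.1, 0.5, 0.9, 1.0], labels=False)
sizes = frame.data2.groupby(grouping).size()
print(sizes)
```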
- Filling missing values with group-specific values
4.1. Filling a single Series
In [140]: s = Series(np.random.randn(6))
In [141]: s[::2] = np.nan
In [142]: s
Out[142]:
0 NaN
1 -0.504470
2 NaN
3 -0.358606
4 NaN
5 -0.257657
dtype: float64
In [143]: s.fillna(s.mean())
Out[143]:
0 -0.373577
1 -0.504470
2 -0.373577
3 -0.358606
4 -0.373577
5 -0.257657
dtype: float64
4.2. Filling missing values by group
Because apply operates on each group, a lambda (or an equivalent def) is the usual companion.
# ContinueLetter is a user-defined helper returning n consecutive letters starting from the given one
In [178]: numbers = ContinueLetter('a',8)
In [179]: numbers
Out[179]: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
In [180]: group_key = ['first'] * 4 + ['second'] * 4
In [181]: data = Series(np.random.randn(8) , index = numbers)
In [183]: data[['a','f','h']] = np.nan
In [184]: data
Out[184]:
a NaN
b 0.670058
c -0.931242
d -0.512491
e 0.150320
f NaN
g 0.266838
h NaN
dtype: float64
In [187]: fill_mean = lambda g : g.fillna(g.mean())
In [188]: data.groupby(group_key).apply(fill_mean)
Out[188]:
a -0.257892
b 0.670058
c -0.931242
d -0.512491
e 0.150320
f 0.208579
g 0.266838
h 0.208579
dtype: float64
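Instead of the group mean, each group can be filled with a predefined value: inside apply, each piece carries its group label in the name attribute, which can be used for the lookup. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

data = pd.Series([np.nan, 1.0, np.nan, 2.0],
                 index=['a', 'b', 'c', 'd'])
group_key = ['first', 'first', 'second', 'second']

# one fill value per group, looked up via g.name
fill_values = {'first': 0.5, 'second': -1.0}
fill_func = lambda g: g.fillna(fill_values[g.name])

filled = data.groupby(group_key).apply(fill_func)
print(filled)
```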
2018.8.24, notes from Python for Data Analysis