2019-06-02
1. Create a very simple tabular dataset
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],'key2':['one', 'two', 'one', 'two', 'one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> df
key1 key2 data1 data2
0 a one -0.051260 -1.444195
1 a two -1.594938 0.778813
2 b one -1.726857 -1.709059
3 b two -0.997476 -0.825530
4 a one 1.868982 -0.535930
Group by key1 and compute the mean of data1: select the data1 column and call groupby on it, passing key1 as the grouping key.
>>> grouped = df['data1'].groupby(df['key1'])
>>> grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000149A6A023C8>
grouped is an intermediate GroupBy object that holds the grouping key df['key1']; it has not computed anything yet. Call the GroupBy mean() method to compute the per-group means.
>>> grouped.mean()
key1
a 0.074261
b -1.362166
Name: data1, dtype: float64
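As a side note, the two steps above can be collapsed into one expression; a minimal sketch that gives the same result as grouped.mean():
>>> df.groupby('key1')['data1'].mean()   # select the column on the GroupBy object, then aggregate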
2. Grouping by multiple keys: pass several arrays to group by key1 and key2
>>> mean1 = df['data1'].groupby([df['key1'],df['key2']]).mean()
>>> mean1
key1 key2
a one 0.908861
two -1.594938
b one -1.726857
two -0.997476
Name: data1, dtype: float64
>>> mean1.unstack()
key2 one two
key1
a 0.908861 -1.594938
b -1.726857 -0.997476
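For reference, the groupby-then-unstack pattern above can also be written with pivot_table; a sketch (aggfunc='mean' is in fact the default):
>>> df.pivot_table('data1', index='key1', columns='key2', aggfunc='mean')   # same table as mean1.unstack()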
In the examples above, the grouping keys are all Series. In fact, the grouping keys can be any arrays of suitable length.
Here are two other arrays with the same length as key1/key2; grouping by them likewise yields the means of data1.
>>> states = np.array(['Shanghai','Beijing','Beijing','Shanghai','Shanghai'])
>>> years = np.array([2005,2005,2006,2005,2006])
>>> df['data1'].groupby([states,years]).mean()
Beijing 2005 -1.594938
2006 -1.726857
Shanghai 2005 -0.524368
2006 1.868982
Name: data1, dtype: float64
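An alternative, assuming you are free to add helper columns (df2, 'state' and 'year' are names made up here for illustration): store the arrays in the DataFrame and group by column names instead.
>>> df2 = df.copy()                  # work on a copy so df stays untouched
>>> df2['state'] = states
>>> df2['year'] = years
>>> df2.groupby(['state', 'year'])['data1'].mean()   # same means as above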
3. If you do not specify which columns to aggregate, the mean of every numeric column is computed automatically
>>> df.groupby('key1').mean()
data1 data2
key1
a 0.087512 0.366339
b 0.534393 -0.142072
(1) After grouping by key1, to apply the same aggregation to every column you can use the apply function:
>>> df.groupby('key1').apply(np.mean)
data1 data2
key1
a 0.087512 0.366339
b 0.534393 -0.142072
(2) Group by key1 and compute the mean and median of each group using the agg function:
>>> df.groupby('key1').agg([np.mean,np.median])
data1 data2
mean median mean median
key1
a 0.087512 0.053482 0.366339 0.097251
b 0.534393 0.534393 -0.142072 -0.142072
(3) Group by key1 and compute the mean and median of the data1 column only:
>>> df.groupby('key1')['data1'].agg([np.mean,np.median])
mean median
key1
a 0.087512 0.053482
b 0.534393 0.534393
(4) Group by key1, compute the mean and median of data1 only, and customize the displayed labels like this:
>>> df.groupby('key1')['data1'].agg({'MEAN':np.mean,'MEDIAN':np.median})
MEAN MEDIAN
key1
a 0.087512 0.053482
b 0.534393 0.534393
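Note that passing a dict of functions to SeriesGroupBy.agg for renaming, as above, has since been deprecated and removed. In pandas 0.25+ the same result can be obtained with named aggregation; a sketch:
>>> df.groupby('key1')['data1'].agg(MEAN='mean', MEDIAN='median')   # keyword names become the column labels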
A once-puzzling error, explained: when grouping by key1 and trying to compute the mean and median of all columns with custom labels,
df.groupby('key1').agg({'MEAN':np.mean,'MEDIAN':np.median})
raises an error, because on a DataFrameGroupBy the keys of a dict passed to agg are interpreted as column names, and df has no columns called 'MEAN' or 'MEDIAN'.
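One way around it, sketched here (result is just a throwaway name, and this is not the only option): aggregate with a list of functions and relabel the inner level of the resulting columns afterwards.
>>> result = df.groupby('key1').agg([np.mean, np.median])
>>> result.rename(columns={'mean': 'MEAN', 'median': 'MEDIAN'})   # relabels matching inner-level labels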
(5) Group by key1, computing the mean of data1 and the median of data2, again with agg:
>>> df.groupby('key1').agg({'data1':'mean','data2':'median'})
data1 data2
key1
a 0.087512 0.097251
b 0.534393 -0.142072
More on agg:
Special computations can be expressed with a lambda (anonymous function).
For example, the mean of the absolute values within each group:
>>> df.groupby('key1')['data1'].agg({'lambda':lambda x:np.mean(abs(x))})
lambda
key1
a 0.146380
b 0.474747
https://jingyan.baidu.com/article/d45ad148947fd369552b80f6.html
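The dict form with a 'lambda' key above is also deprecated in newer pandas; a plain named function, sketched below, computes the same number with a readable label (mean_abs is a name made up here):
>>> def mean_abs(x):
...     return np.mean(np.abs(x))
...
>>> df.groupby('key1')['data1'].agg([mean_abs])   # result column is labelled 'mean_abs'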
If you do not want a column to be aggregated, drop it first:
>>> df.drop('data1',axis=1).groupby(['key1','key2']).mean()
data2
key1 key2
a one 0.626655
two -0.154292
b one -0.044297
two -0.239848
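Equivalently, instead of dropping data1 you can select only the columns you want to aggregate; a sketch:
>>> df.groupby(['key1', 'key2'])[['data2']].mean()   # same table as the drop() version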
The size method of groupby: it shows how many rows fall into each group, like the count in an Excel pivot table.
>>> df.groupby(['key1','key2']).size()
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
Any missing values in the grouping keys are excluded from the result.
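For comparison, count() is similar to size() but works per column and skips NA values; a sketch (with no missing data the numbers match size()):
>>> df.groupby(['key1', 'key2'])['data1'].count()   # non-NA values per group; size() counts all rows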
4. Iterating over groups
>>> for name, group in df.groupby('key1'):
...     print(name)
...     print(group)
a
key1 key2 data1 data2
0 a one -0.051260 -1.444195
1 a two -1.594938 0.778813
4 a one 1.868982 -0.535930
b
key1 key2 data1 data2
2 b one -1.726857 -1.709059
3 b two -0.997476 -0.825530
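A handy by-product of iteration, assuming the groups fit in memory (pieces is a made-up name): materialise the groups into a dict keyed by group name.
>>> pieces = dict(list(df.groupby('key1')))
>>> pieces['b']                      # the sub-DataFrame for group 'b'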
With multiple grouping keys, each iteration yields a tuple of key values:
>>> for (k1, k2), group in df.groupby(['key1','key2']):
...     print(k1, k2)
...     print(group)
a one
key1 key2 data1 data2
0 a one -0.051260 -1.444195
4 a one 1.868982 -0.535930
a two
key1 key2 data1 data2
1 a two -1.594938 0.778813
b one
key1 key2 data1 data2
2 b one -1.726857 -1.709059
b two
key1 key2 data1 data2
3 b two -0.997476 -0.825530
df can also be grouped by dtype along the columns, putting the numeric columns in one group and the string (object) columns in another:
>>> df.dtypes
key1 object
key2 object
data1 float64
data2 float64
dtype: object
>>> grouped = df.groupby(df.dtypes,axis=1)
>>> dict(list(grouped))
{dtype('float64'): data1 data2
0 -0.051260 -1.444195
1 -1.594938 0.778813
2 -1.726857 -1.709059
3 -0.997476 -0.825530
4 1.868982 -0.535930, dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
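Grouping along axis=1 has since been deprecated in recent pandas; if all you need is the numeric/object split, select_dtypes is a possible alternative (a sketch):
>>> df.select_dtypes(include='number')   # just the float columns
>>> df.select_dtypes(include='object')   # just the string columns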
5. Selecting a column or a subset of columns
For a GroupBy object created from a DataFrame, indexing it with a single column name (a string) or with a list of column names selects just those columns for aggregation.
(1)
>>> df.groupby([df['key1']])[['data2']].mean()
data2
key1
a -0.400437
b -1.267294
(2)
>>> df.groupby([df['key1']])['data2'].mean()   # same as df.groupby('key1').data2.mean()
key1
a -0.400437
b -1.267294
Name: data2, dtype: float64
(3)
>>> df[['data2']].groupby([df['key1']]).mean()
data2
key1
a -0.400437
b -1.267294
(1) and (3) are completely equivalent. The difference between (1) and (2) is ['data2'] versus [['data2']]: indexing the GroupBy with a single column name, as in ['data2'], returns a grouped Series, while indexing with a list or array of names, as in [['data2']], returns a grouped DataFrame.
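A quick check of that distinction (types only, no data involved):
>>> type(df.groupby('key1')['data2'])
<class 'pandas.core.groupby.generic.SeriesGroupBy'>
>>> type(df.groupby('key1')[['data2']])
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>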
6. Grouping with dicts and Series
Besides arrays, grouping information can also come in other forms.
Dict:
>>> people = pd.DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
>>> people
a b c d e
Joe -0.678913 1.980195 -0.643765 0.388631 -0.025347
Steve 0.950373 1.771783 1.261615 -0.650729 -0.077973
Wes -1.588147 -1.223409 0.218183 -0.182874 1.149757
Jim -0.167653 0.023572 -0.272739 -1.196497 0.217304
Travis -0.525785 -0.502944 0.321604 -0.132109 -0.868102
Suppose the column-to-group mapping is known, and you want to sum the columns within each group:
>>> mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
Then pass this dict to groupby:
>>> by_column = people.groupby(mapping,axis=1)
>>> by_column.sum()
blue red
Joe -0.255134 1.275936
Steve 0.610886 2.644182
Wes -0.182874 -0.438390
Jim -1.469236 0.073223
Travis 0.189495 -1.896830
Series:
When a Series is used as the grouping key, pandas checks the Series to make sure its index is aligned with the axis being grouped.
>>> map_series = pd.Series(mapping)
>>> map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
>>> people.groupby(map_series,axis=1).sum()
blue red
Joe -0.255134 1.275936
Steve 0.610886 2.644182
Wes -0.182874 -0.438390
Jim -1.469236 0.073223
Travis 0.189495 -1.896830
Grouping by the Series gives the same result here as grouping by the dict; the Series is just mapping converted with pd.Series.
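A small follow-up, counting instead of summing: since people has no missing values, every row should show the 2 'blue' columns (c, d) and the 3 'red' columns (a, b, e).
>>> people.groupby(map_series, axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        2    3
Jim        2    3
Travis     2    3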
7. Grouping with functions
Any function passed as a grouping key is called once per index value, and its return values are used as the group names.
In the people DataFrame above, the index values are names; use the built-in len function as the grouping key:
>>> people.groupby(len).sum()
a b c d e
3 -2.917995 -0.954225 -2.642262 0.484488 -3.649103
5 0.504809 -0.644040 -1.536119 2.070732 -0.835017
6 -0.770610 0.759477 -0.257607 -0.081076 -0.241806
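To see where the group labels 3/5/6 come from, len is simply applied to each index value (each name):
>>> [len(name) for name in people.index]   # Joe, Steve, Wes, Jim, Travis
[3, 5, 3, 3, 6]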
Functions can be mixed with arrays, lists, dicts, and Series:
>>> key_list = ['one','one','one','two','two']
>>> people.groupby([len,key_list]).min()
a b c d e
3 one -1.580011 -0.738648 -1.031669 -0.598777 -2.249030
two -0.464435 0.089984 -1.150972 0.817304 0.089607
5 one 0.504809 -0.644040 -1.536119 2.070732 -0.835017
6 two -0.770610 0.759477 -0.257607 -0.081076 -0.241806
8. Grouping by index level  # not entirely clear to me yet!
>>> columms = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,2,5,1,3]],names=['city','tenor'])
>>> columms
MultiIndex(levels=[['JP', 'US'], [1, 2, 3, 5]],
codes=[[1, 1, 1, 0, 0], [0, 1, 3, 0, 2]],
names=['city', 'tenor'])
>>> hire_df = pd.DataFrame(np.random.randn(4,5),columns=columms)
>>> hire_df
city US JP
tenor 1 2 5 1 3
0 0.554358 0.301741 -0.861113 -0.977792 -0.404298
1 1.430362 0.856493 -0.418466 0.519151 -0.879406
2 -0.149399 -1.044713 0.030997 -0.964583 0.917320
3 0.179831 0.638233 0.572928 -1.151039 -0.496245
>>> hire_df.groupby(level='city',axis=1).count()
city JP US
0 2 3
1 2 3
2 2 3
3 2 3
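The same idea works for the other level; a sketch counting by the 'tenor' level (deterministic here because hire_df has no missing values):
>>> hire_df.groupby(level='tenor', axis=1).count()
tenor  1  2  3  5
0      2  1  1  1
1      2  1  1  1
2      2  1  1  1
3      2  1  1  1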