美文网首页
groupby()聚合和分组运算

groupby()聚合和分组运算

作者: 酸甜柠檬26 | 来源:发表于2019-06-07 22:59 被阅读0次

2019-06-02
1、创建一个非常简单的表格型数据集

>>> df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],'key2':['one', 'two', 'one', 'two', 'one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
>>> df
  key1 key2     data1     data2
0    a  one -0.051260 -1.444195
1    a  two -1.594938  0.778813
2    b  one -1.726857 -1.709059
3    b  two -0.997476 -0.825530
4    a  one  1.868982 -0.535930

按照key1进行分组,并计算data1的平均值,可用访问data1,并根据key1调用groupby

>>> grouped = df['data1'].groupby(df['key1'])
>>> grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000149A6A023C8>

grouped是含分组键df['key1']的中间变量,并无任何数据值,此刻需要调用groupby的mean()函数来计算分组平均值。

>>> grouped.mean()
key1
a    0.074261
b   -1.362166
Name: data1, dtype: float64

2、传入多个数据进行分组,按照key1/key2进行分组

>>> mean1 = df['data1'].groupby([df['key1'],df['key2']]).mean()
>>> mean1
key1  key2
a     one     0.908861
      two    -1.594938
b     one    -1.726857
      two    -0.997476
Name: data1, dtype: float64
>>> mean1.unstack()
key2       one       two
key1                    
a     0.908861 -1.594938
b    -1.726857 -0.997476

在上面这些示例中,分组键均为Series。实际上,分组键可以是任何长度适当的数组:
现给出另外两个长度和key1/key2一样的array。对他们分组同样可以求得data1的平均值。

>>> states = np.array(['Shanghai','Beijing','Beijing','Shanghai','Shanghai'])
>>> years = np.array([2005,2005,2006,2005,2006])
>>> df['data1'].groupby([states,years]).mean()
Beijing   2005   -1.594938
          2006   -1.726857
Shanghai  2005   -0.524368
          2006    1.868982
Name: data1, dtype: float64

3、不规定求哪一些的平均值,则会自动计算所有数值列的平均值

>>> df.groupby('key1').mean()
         data1     data2
key1                    
a     0.087512  0.366339
b     0.534393 -0.142072

(1)按照key1进行分组后,对多列采用相同的聚合方法,可以借助apply函数:

>>> df.groupby('key1').apply(np.mean)
         data1     data2
key1                    
a     0.087512  0.366339
b     0.534393 -0.142072

(2)按key1进行分组,计算各组数据的均值和中值,借助agg函数:

>>> df.groupby('key1').agg([np.mean,np.median])
         data1               data2          
          mean    median      mean    median
key1                                        
a     0.087512  0.053482  0.366339  0.097251
b     0.534393  0.534393 -0.142072 -0.142072

(3)按key1进行分组,只计算data1列的均值和中值:

>>> df.groupby('key1')['data1'].agg([np.mean,np.median])
          mean    median
key1                    
a     0.087512  0.053482
b     0.534393  0.534393

(4)按key1进行分组,只计算data1列的均值和中值,需要定制显示标题,可以这样设置:

>>> df.groupby('key1')['data1'].agg({'MEAN':np.mean,'MEDIAN':np.median})
          MEAN    MEDIAN
key1                    
a     0.087512  0.053482
b     0.534393  0.534393

不知所以的错误:当按照key1进行分组,计算所有列的均值和中值时,需要定制显示标题,就会报错。

df.groupby('key1').agg({'MEAN':np.mean,'MEDIAN':np.median})
报错。。。

(5)按key1进行分组,data1列计算均值,data2列计算中值,同样借助agg函数:

>>> df.groupby('key1').agg({'data1':'mean','data2':'median'})
         data1     data2
key1                    
a     0.087512  0.097251
b     0.534393 -0.142072

关于agg的拓展:
通过lambda匿名函数来进行特殊的计算:
计算各组数据的绝对值的平均数:

>>> df.groupby('key1')['data1'].agg({'lambda':lambda x:np.mean(abs(x))})
        lambda
key1          
a     0.146380
b     0.474747

https://jingyan.baidu.com/article/d45ad148947fd369552b80f6.html
如果不想求某一列的均值,可以将其剔除:

>>> df.drop('data1',axis=1).groupby(['key1','key2']).mean()
              data2
key1 key2          
a    one   0.626655
     two  -0.154292
b    one  -0.044297
     two  -0.239848

groupby中的size的用法:显示该分组下有几个数值含在其中,就像Excel中数据透视表中的的计数

>>> df.groupby(['key1','key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

分组键中的任何缺失值都被排除在外。
4、对分组进行迭代

>>> for name,group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -0.051260 -1.444195
1    a  two -1.594938  0.778813
4    a  one  1.868982 -0.535930
b
  key1 key2     data1     data2
2    b  one -1.726857 -1.709059
3    b  two -0.997476 -0.825530

对于多重键的情况,即有多个分组键,会多次分组:

>>> for (k1,k2),group in df.groupby(['key1','key2']):
    print(k1,k2)
    print(group)

    
a one
  key1 key2     data1     data2
0    a  one -0.051260 -1.444195
4    a  one  1.868982 -0.535930
a two
  key1 key2     data1     data2
1    a  two -1.594938  0.778813
b one
  key1 key2     data1     data2
2    b  one -1.726857 -1.709059
b two
  key1 key2     data1    data2
3    b  two -0.997476 -0.82553

也可以将df根据dtype进行分组,即将数值分为一组,将字符串分成一组

>>> df.dtypes
key1      object
key2      object
data1    float64
data2    float64
dtype: object
>>> grouped = df.groupby(df.dtypes,axis=1)
>>> dict(list(grouped))
{dtype('float64'):       data1     data2
0 -0.051260 -1.444195
1 -1.594938  0.778813
2 -1.726857 -1.709059
3 -0.997476 -0.825530
4  1.868982 -0.535930, dtype('O'):   key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one}

5、选取一组或一列
对于由DataFrame产生的GroupBy对象,如果用一个(单个字符串)或一组(字符串数组)列名对其进行索引,就能实现选取部分列进行聚合的目的。
(1)

>>> df.groupby([df['key1']])[['data2']].mean()
         data2
key1          
a    -0.400437
b    -1.267294

(2)

>>> df.groupby([df['key1']])['data2'].mean()#df.groupby(['key1']).data2.mean()一样
key1
a   -0.400437
b   -1.267294
Name: data2, dtype: float64

(3)

>>> df[['data2']].groupby([df['key1']]).mean()
         data2
key1         
a    -0.400437
b    -1.267294

(1)和(3)是完全等效的,(1)和(2)的区别在于['data2']和[['data2']]:如果传入的是单个,如['data2'],则返回的是已分组的Series;如果传入的是列表或数组,如[['data2']],则返回的是已分组的DataFrame。
6、通过字典或Series分组
除数组以外,分组形式还可以其他形式存在
字典:

2039742751512
>>> people = pd.DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
>>> people
               a         b         c         d         e
Joe    -0.678913  1.980195 -0.643765  0.388631 -0.025347
Steve   0.950373  1.771783  1.261615 -0.650729 -0.077973
Wes    -1.588147 -1.223409  0.218183 -0.182874  1.149757
Jim    -0.167653  0.023572 -0.272739 -1.196497  0.217304
Travis -0.525785 -0.502944  0.321604 -0.132109 -0.868102

假设已知列的分组关系,并希望根据分组计算列的总计:

>>> mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
然后将此字典传给groupby
>>> by_column = people.groupby(mapping,axis=1)
>>> by_column.sum()
            blue       red
Joe    -0.255134  1.275936
Steve   0.610886  2.644182
Wes    -0.182874 -0.438390
Jim    -1.469236  0.073223
Travis  0.189495 -1.896830

Series:
用Series作为分组键,pandas会检查Series以确保索引和分组值是对齐的

>>> map_series = pd.Series(mapping)
>>> map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
>>> people.groupby(map_series,axis=1).sum()
            blue       red
Joe    -0.255134  1.275936
Steve   0.610886  2.644182
Wes    -0.182874 -0.438390
Jim    -1.469236  0.073223
Travis  0.189495 -1.896830

这里表现的按照Series分组和字典分组结果一样,按照Series分组时将mapping转化成Series.
7、通过函数进行分组
任何被当做分组键的函数都会在各个索引值上被调用一次,其返回值就会被用作分组名称。
在上面的people这个DataFrame中,人名为索引值,将函数len作为分组键,如下:

>>> people.groupby(len).sum()
          a         b         c         d         e
3 -2.917995 -0.954225 -2.642262  0.484488 -3.649103
5  0.504809 -0.644040 -1.536119  2.070732 -0.835017
6 -0.770610  0.759477 -0.257607 -0.081076 -0.241806

将函数跟数组、列表、字典、Series混合使用:

>>> key_list = ['one','one','one','two','two']
>>> people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one -1.580011 -0.738648 -1.031669 -0.598777 -2.249030
  two -0.464435  0.089984 -1.150972  0.817304  0.089607
5 one  0.504809 -0.644040 -1.536119  2.070732 -0.835017
6 two -0.770610  0.759477 -0.257607 -0.081076 -0.241806

8、根据索引级别分组 #不是特别懂!

>>> columms = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,2,5,1,3]],names=['city','tenor'])
>>> columms
MultiIndex(levels=[['JP', 'US'], [1, 2, 3, 5]],
           codes=[[1, 1, 1, 0, 0], [0, 1, 3, 0, 2]],
           names=['city', 'tenor'])
>>> hire_df = pd.DataFrame(np.random.randn(4,5),columns=columms)
>>> hire_df
city         US                            JP          
tenor         1         2         5         1         3
0      0.554358  0.301741 -0.861113 -0.977792 -0.404298
1      1.430362  0.856493 -0.418466  0.519151 -0.879406
2     -0.149399 -1.044713  0.030997 -0.964583  0.917320
3      0.179831  0.638233  0.572928 -1.151039 -0.496245
>>> hire_df.groupby(level='city',axis=1).count()
city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3

相关文章

网友评论

      本文标题:groupby()聚合和分组运算

      本文链接:https://www.haomeiwen.com/subject/zdxfxctx.html