groupby(): Aggregation and Group Operations

Author: 酸甜柠檬26 | Published: 2019-06-07 22:59

    2019-06-02
    1. Create a very simple tabular dataset

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],'key2':['one', 'two', 'one', 'two', 'one'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
    >>> df
      key1 key2     data1     data2
    0    a  one -0.051260 -1.444195
    1    a  two -1.594938  0.778813
    2    b  one -1.726857 -1.709059
    3    b  two -0.997476 -0.825530
    4    a  one  1.868982 -0.535930
    

    To group by key1 and compute the mean of data1, select the data1 column and call groupby with key1 as the group key:

    >>> grouped = df['data1'].groupby(df['key1'])
    >>> grouped
    <pandas.core.groupby.generic.SeriesGroupBy object at 0x00000149A6A023C8>
    

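    grouped is an intermediate GroupBy object that holds the grouping information for the key df['key1']; nothing has been computed yet. Call its mean() method to compute the group means.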

    >>> grouped.mean()
    key1
    a    0.074261
    b   -1.362166
    Name: data1, dtype: float64
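
As a rough illustration of what the GroupBy object does when mean() is called, the sketch below compares it with a manual split-then-average. The frame df_demo and its fixed values are hypothetical, chosen only so the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed values (the post uses random numbers instead)
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Nothing is computed here; this only records how rows map to groups
grouped_demo = df_demo['data1'].groupby(df_demo['key1'])

# The aggregation happens when mean() is called
result = grouped_demo.mean()

# Manual equivalent: select the rows of each key, then average them
manual = {key: df_demo.loc[df_demo['key1'] == key, 'data1'].mean()
          for key in df_demo['key1'].unique()}

print(result)
print(manual)
```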
    

    2. Group by several keys at once: key1 and key2

    >>> mean1 = df['data1'].groupby([df['key1'],df['key2']]).mean()
    >>> mean1
    key1  key2
    a     one     0.908861
          two    -1.594938
    b     one    -1.726857
          two    -0.997476
    Name: data1, dtype: float64
    >>> mean1.unstack()
    key2       one       two
    key1                    
    a     0.908861 -1.594938
    b    -1.726857 -0.997476
    

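    In the examples above the group keys are all Series. In fact, a group key can be any array of the right length.
    Below are two more arrays with the same length as key1/key2. Grouping by them also gives the mean of data1.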

    >>> states = np.array(['Shanghai','Beijing','Beijing','Shanghai','Shanghai'])
    >>> years = np.array([2005,2005,2006,2005,2006])
    >>> df['data1'].groupby([states,years]).mean()
    Beijing   2005   -1.594938
              2006   -1.726857
    Shanghai  2005   -0.524368
              2006    1.868982
    Name: data1, dtype: float64
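
A small reproducible sketch of the same idea, assuming hypothetical fixed values in place of the random data1 column: the keys are plain NumPy arrays that are not part of the Series, and unstack() pivots the inner key into columns.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for df['data1']
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
states = np.array(['Shanghai', 'Beijing', 'Beijing', 'Shanghai', 'Shanghai'])
years = np.array([2005, 2005, 2006, 2005, 2006])

# The arrays only need to have one entry per row of the Series
means = values.groupby([states, years]).mean()

# Pivot the second key (year) into columns
table = means.unstack()

print(means)
print(table)
```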
    

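    3. If no column is selected, the mean of every numeric column is computed
    The numeric outputs in this section come from a different random draw of df than the table in section 1. In pandas 2.0 and later, mean() no longer drops non-numeric columns such as key2 silently; pass numeric_only=True or select the numeric columns first.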

    >>> df.groupby('key1').mean()
             data1     data2
    key1                    
    a     0.087512  0.366339
    b     0.534393 -0.142072
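
A minimal sketch of two ways to restrict the aggregation to numeric columns on recent pandas versions. The frame df_demo and its values are hypothetical.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})

# Option 1: ask mean() to skip non-numeric columns (key2 here)
means = df_demo.groupby('key1').mean(numeric_only=True)

# Option 2: name the columns to aggregate explicitly
means_explicit = df_demo.groupby('key1')[['data1', 'data2']].mean()

print(means)
```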
    

    (1) After grouping by key1, the same aggregation can be applied to several columns with apply:

    >>> df.groupby('key1').apply(np.mean)
             data1     data2
    key1                    
    a     0.087512  0.366339
    b     0.534393 -0.142072
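
On recent pandas versions, passing np.mean straight to apply on the whole frame may fail because the string column key2 is included in each group. One possible variant, sketched here with a hypothetical df_demo, selects the numeric columns first and applies a function to each group.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})

# Each group g is a DataFrame holding only data1 and data2;
# g.mean() returns one value per column
applied = df_demo.groupby('key1')[['data1', 'data2']].apply(lambda g: g.mean())

print(applied)
```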
    

    (2) Group by key1 and compute the mean and median of each group with agg:

    >>> df.groupby('key1').agg([np.mean,np.median])
             data1               data2          
              mean    median      mean    median
    key1                                        
    a     0.087512  0.053482  0.366339  0.097251
    b     0.534393  0.534393 -0.142072 -0.142072
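
The same aggregation can also be written with string names instead of NumPy functions. The sketch below uses a hypothetical df_demo and shows how to read a cell from the resulting two-level column index.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})

# A list of function names yields one sub-column per function
stats = df_demo.groupby('key1')[['data1', 'data2']].agg(['mean', 'median'])

# Columns are (original column, function name) pairs
print(stats.loc['a', ('data1', 'median')])
print(stats)
```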
    

    (3) Group by key1 and compute the mean and median of the data1 column only:

    >>> df.groupby('key1')['data1'].agg([np.mean,np.median])
              mean    median
    key1                    
    a     0.087512  0.053482
    b     0.534393  0.534393
    

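    (4) Group by key1 and compute the mean and median of data1 only, with custom column titles. The original post passed a renaming dict, agg({'MEAN':np.mean,'MEDIAN':np.median}); that form was removed in pandas 1.0, so the keyword form is used here:

    >>> df.groupby('key1')['data1'].agg(MEAN='mean',MEDIAN='median')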
              MEAN    MEDIAN
    key1                    
    a     0.087512  0.053482
    b     0.534393  0.534393
    

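    An error the author could not explain: grouping by key1 and computing the mean and median of all columns with custom titles raises an error. The likely cause is that, on a DataFrame GroupBy, the keys of a dict passed to agg are interpreted as names of columns to aggregate, and df has no columns named MEAN or MEDIAN.

    df.groupby('key1').agg({'MEAN':np.mean,'MEDIAN':np.median})
    (raises an error)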
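
One way to get custom titles across several columns is named aggregation, where each keyword is the output title and its value is a (column, function) pair. This is a sketch with a hypothetical df_demo; the output names such as data1_MEAN are made up for illustration.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})

# keyword = output title, value = (input column, aggregation)
summary = df_demo.groupby('key1').agg(
    data1_MEAN=('data1', 'mean'),
    data1_MEDIAN=('data1', 'median'),
    data2_MEAN=('data2', 'mean'),
    data2_MEDIAN=('data2', 'median'),
)

print(summary)
```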
    

    (5) Group by key1, computing the mean of data1 and the median of data2, again with agg:

    >>> df.groupby('key1').agg({'data1':'mean','data2':'median'})
             data1     data2
    key1                    
    a     0.087512  0.097251
    b     0.534393 -0.142072
    

    More on agg:
    A lambda (anonymous function) can be passed for custom calculations.
    Compute the mean of the absolute values in each group:

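    >>> # a list of (title, function) pairs sets the column title; the renaming dict used originally was removed in pandas 1.0
    >>> df.groupby('key1')['data1'].agg([('lambda',lambda x:np.mean(abs(x)))])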
            lambda
    key1          
    a     0.146380
    b     0.474747
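
A reproducible sketch of a custom aggregation, assuming a hypothetical df_demo with mixed signs. The keyword name mean_abs is an illustrative choice, not something from the original post.

```python
import pandas as pd

# Hypothetical values with mixed signs
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [-1.0, 2.0, -3.0, 5.0, -6.0],
})

# A bare callable returns a Series named after the column
mean_abs = df_demo.groupby('key1')['data1'].agg(lambda x: x.abs().mean())

# A keyword gives the result column a title and returns a DataFrame
named = df_demo.groupby('key1')['data1'].agg(mean_abs=lambda x: x.abs().mean())

print(mean_abs)
print(named)
```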
    

    https://jingyan.baidu.com/article/d45ad148947fd369552b80f6.html
    To leave a column out of the mean, drop it first:

    >>> df.drop('data1',axis=1).groupby(['key1','key2']).mean()
                  data2
    key1 key2          
    a    one   0.626655
         two  -0.154292
    b    one  -0.044297
         two  -0.239848
    

    The size method of groupby returns the number of rows in each group, much like Count in an Excel pivot table:

    >>> df.groupby(['key1','key2']).size()
    key1  key2
    a     one     2
          two     1
    b     one     1
          two     1
    dtype: int64
    

    By default, any missing values in a group key are excluded from the result.
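
The sketch below illustrates that exclusion with a hypothetical frame containing one missing key, and shows the dropna=False option (available since pandas 1.1) for keeping the missing key as its own group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the third row has no key
df_demo = pd.DataFrame({
    'key1': ['a', 'a', np.nan, 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Default: the row with the missing key is left out
sizes = df_demo.groupby('key1').size()

# dropna=False keeps a separate group for the missing key
sizes_with_nan = df_demo.groupby('key1', dropna=False).size()

print(sizes)
print(sizes_with_nan)
```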
    4. Iterating over groups

    >>> for name,group in df.groupby('key1'):
    ...     print(name)
    ...     print(group)
    
    a
      key1 key2     data1     data2
    0    a  one -0.051260 -1.444195
    1    a  two -1.594938  0.778813
    4    a  one  1.868982 -0.535930
    b
      key1 key2     data1     data2
    2    b  one -1.726857 -1.709059
    3    b  two -0.997476 -0.825530
    

    With several group keys, each iteration yields a tuple of key values together with the rows of that group:

    >>> for (k1,k2),group in df.groupby(['key1','key2']):
    ...     print(k1,k2)
    ...     print(group)
    ...
    a one
      key1 key2     data1     data2
    0    a  one -0.051260 -1.444195
    4    a  one  1.868982 -0.535930
    a two
      key1 key2     data1     data2
    1    a  two -1.594938  0.778813
    b one
      key1 key2     data1     data2
    2    b  one -1.726857 -1.709059
    b two
      key1 key2     data1    data2
    3    b  two -0.997476 -0.82553
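
Because iteration yields (name, group) pairs, the pieces can be collected into a dict for later lookup. A small sketch with a hypothetical df_demo:

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
})

# Single key: names are the key values themselves
pieces = {name: group for name, group in df_demo.groupby('key1')}
print(pieces['b'])

# Two keys: names are (key1, key2) tuples
counts = {keys: len(group)
          for keys, group in df_demo.groupby(['key1', 'key2'])}
print(counts)
```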
    

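    df can also be grouped by dtype, so that the numeric columns form one group and the string columns another. (This relies on axis=1, which is deprecated in groupby since pandas 2.1.)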

    >>> df.dtypes
    key1      object
    key2      object
    data1    float64
    data2    float64
    dtype: object
    
    >>> grouped = df.groupby(df.dtypes,axis=1)
    >>> dict(list(grouped))
    {dtype('float64'):       data1     data2
    0 -0.051260 -1.444195
    1 -1.594938  0.778813
    2 -1.726857 -1.709059
    3 -0.997476 -0.825530
    4  1.868982 -0.535930, dtype('O'):   key1 key2
    0    a  one
    1    a  two
    2    b  one
    3    b  two
    4    a  one}
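
Where groupby(..., axis=1) is not available, a plain-Python sketch such as the following gives a similar split of columns by dtype. The names df_demo, cols_by_dtype and by_dtype are hypothetical, and the dtype names are used as dict keys rather than the dtype objects.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 5.0, 6.0],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})

# Collect the column names that share each dtype
cols_by_dtype = {}
for col, dtype in df_demo.dtypes.items():
    cols_by_dtype.setdefault(str(dtype), []).append(col)

# Build one sub-frame per dtype
by_dtype = {name: df_demo[cols] for name, cols in cols_by_dtype.items()}

print(by_dtype['float64'])
```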
    

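    5. Selecting a column or a subset of columns
    For a GroupBy object created from a DataFrame, indexing it with a single column name (a string) or several column names (a list of strings) selects just those columns for aggregation.
    (1)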

    >>> df.groupby([df['key1']])[['data2']].mean()
             data2
    key1          
    a    -0.400437
    b    -1.267294
    

    (2)

    >>> df.groupby([df['key1']])['data2'].mean()  # same as df.groupby(['key1']).data2.mean()
    key1
    a   -0.400437
    b   -1.267294
    Name: data2, dtype: float64
    

    (3)

    >>> df[['data2']].groupby([df['key1']]).mean()
             data2
    key1         
    a    -0.400437
    b    -1.267294
    

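    (1) and (3) are fully equivalent. (1) and (2) differ in ['data2'] versus [['data2']]: a single column name, as in ['data2'], returns a grouped Series, while a list or array of names, as in [['data2']], returns a grouped DataFrame.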
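
The difference in return types can be checked directly. A short sketch with a hypothetical df_demo:

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
df_demo = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data2': [10.0, 20.0, 30.0, 50.0, 60.0],
})

# List of names: the result is a DataFrame
as_frame = df_demo.groupby('key1')[['data2']].mean()

# Single name: the result is a Series
as_series = df_demo.groupby('key1')['data2'].mean()

print(type(as_frame).__name__, type(as_series).__name__)
```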
    6. Grouping with a dict or a Series
    Besides arrays, grouping information can be supplied in other forms.
    Dict:

    >>> people = pd.DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],index=['Joe','Steve','Wes','Jim','Travis'])
    >>> people
                   a         b         c         d         e
    Joe    -0.678913  1.980195 -0.643765  0.388631 -0.025347
    Steve   0.950373  1.771783  1.261615 -0.650729 -0.077973
    Wes    -1.588147 -1.223409  0.218183 -0.182874  1.149757
    Jim    -0.167653  0.023572 -0.272739 -1.196497  0.217304
    Travis -0.525785 -0.502944  0.321604 -0.132109 -0.868102
    

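    Suppose the mapping from columns to groups is known and you want to sum the columns within each group. (In the output below, the Wes row is consistent with its b and c values having been set to NaN in a step not shown; sum skips NaN.)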

    >>> mapping = {'a':'red','b':'red','c':'blue','d':'blue','e':'red','f':'orange'}
    >>> # then pass this dict to groupby
    >>> by_column = people.groupby(mapping,axis=1)
    >>> by_column.sum()
                blue       red
    Joe    -0.255134  1.275936
    Steve   0.610886  2.644182
    Wes    -0.182874 -0.438390
    Jim    -1.469236  0.073223
    Travis  0.189495 -1.896830
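
Since axis=1 is deprecated in recent pandas, one alternative is to transpose, group the rows with the same dict, and transpose back. This is a sketch with a hypothetical two-row people_demo; the key 'f' is kept to show that unused keys in the mapping are harmless.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
people_demo = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0],
     [10.0, 20.0, 30.0, 40.0, 50.0]],
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve'],
)

mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the columns become rows, group them, then transpose back
by_color = people_demo.T.groupby(mapping).sum().T

print(by_color)
```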
    

    Series:
    When a Series is used as the group key, pandas aligns the index of the Series with the labels being grouped.

    >>> map_series = pd.Series(mapping)
    >>> map_series
    a       red
    b       red
    c      blue
    d      blue
    e       red
    f    orange
    dtype: object
    >>> people.groupby(map_series,axis=1).sum()
                blue       red
    Joe    -0.255134  1.275936
    Steve   0.610886  2.644182
    Wes    -0.182874 -0.438390
    Jim    -1.469236  0.073223
    Travis  0.189495 -1.896830
    

    Here grouping by the Series gives the same result as grouping by the dict, because the Series was built from mapping.
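    7. Grouping with functions
    Any function passed as a group key is called once per index value, and its return values are used as the group names.
    In the people DataFrame above, the index holds names. Using the function len as the group key gives the following (the output comes from a different random draw of people than the table above):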

    >>> people.groupby(len).sum()
              a         b         c         d         e
    3 -2.917995 -0.954225 -2.642262  0.484488 -3.649103
    5  0.504809 -0.644040 -1.536119  2.070732 -0.835017
    6 -0.770610  0.759477 -0.257607 -0.081076 -0.241806
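
To make the mechanism concrete, the sketch below groups a hypothetical one-column people_demo by name length and compares it with computing the lengths explicitly first.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
people_demo = pd.DataFrame(
    {'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)

# len is called on each index label: 3, 5, 3, 3, 6
by_len = people_demo.groupby(len).sum()

# Explicit equivalent: compute the lengths, then group by them
lengths = people_demo.index.map(len)
by_len_explicit = people_demo.groupby(lengths).sum()

print(by_len)
```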
    

    Functions can be mixed with arrays, lists, dicts, and Series as group keys:

    >>> key_list = ['one','one','one','two','two']
    >>> people.groupby([len,key_list]).min()
                  a         b         c         d         e
    3 one -1.580011 -0.738648 -1.031669 -0.598777 -2.249030
      two -0.464435  0.089984 -1.150972  0.817304  0.089607
    5 one  0.504809 -0.644040 -1.536119  2.070732 -0.835017
    6 two -0.770610  0.759477 -0.257607 -0.081076 -0.241806
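
A reproducible version of the mixed-key case, with a hypothetical people_demo: the function supplies the first level of the result index and the list supplies the second.

```python
import pandas as pd

# Hypothetical fixed values for reproducibility
people_demo = pd.DataFrame(
    {'a': [1.0, 2.0, 3.0, 4.0, 5.0]},
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
)
key_list = ['one', 'one', 'one', 'two', 'two']

# Groups are (name length, list entry) pairs, e.g. (3, 'one') for Joe and Wes
mins = people_demo.groupby([len, key_list]).min()

print(mins)
```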
    

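    8. Grouping by index level (author's note: I don't fully understand this yet!)
    With a hierarchical index, the level keyword selects which index level to group by. Here the columns are grouped by their city level, and count gives the number of non-null values per row in each group.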

    >>> columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,2,5,1,3]],names=['city','tenor'])
    >>> columns
    MultiIndex(levels=[['JP', 'US'], [1, 2, 3, 5]],
               codes=[[1, 1, 1, 0, 0], [0, 1, 3, 0, 2]],
               names=['city', 'tenor'])
    >>> hire_df = pd.DataFrame(np.random.randn(4,5),columns=columns)
    >>> hire_df
    city         US                            JP          
    tenor         1         2         5         1         3
    0      0.554358  0.301741 -0.861113 -0.977792 -0.404298
    1      1.430362  0.856493 -0.418466  0.519151 -0.879406
    2     -0.149399 -1.044713  0.030997 -0.964583  0.917320
    3      0.179831  0.638233  0.572928 -1.151039 -0.496245
    >>> hire_df.groupby(level='city',axis=1).count()
    city  JP  US
    0      2   3
    1      2   3
    2      2   3
    3      2   3
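
A sketch of level-based grouping that avoids the deprecated axis=1 by transposing first. The frame hire_demo holds hypothetical values 0 to 19 so the sums can be checked by hand.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 2, 5, 1, 3]],
    names=['city', 'tenor'],
)

# Hypothetical values: row 0 is 0..4, row 1 is 5..9, and so on
hire_demo = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# After transposing, the column levels become row index levels,
# so level='city' groups the former columns by city
counts = hire_demo.T.groupby(level='city').count().T
sums = hire_demo.T.groupby(level='city').sum().T

print(counts)
print(sums)
```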
    

          Article link: https://www.haomeiwen.com/subject/zdxfxctx.html