Python data aggregation (groupby) and group-wise operations (transform)


By LeeMin_Z | published 2018-08-25

    Summary

    This works much like SQL's GROUP BY; the advantage is that everything happens right inside Python, instead of having to do the grouping separately in a database as before.

    1. Groupby: split and aggregate
      1.1. Basic usage
      1.2. Iterating over groups
      1.3. Selecting a column or subset of columns
    2. Data aggregation
      2.1. Group-wise transformation: transform
      2.2. apply: split-apply-combine

    1. Groupby: split and aggregate

    (Figure: Groupby.png)

    1.1. Basic usage

    Import the relevant libraries:

    import pandas as pd
    import numpy as np
    from pandas import DataFrame,Series
    

    Basic pattern

    DataFrame['column to process'].groupby(key).operation()

    Example:
    Here the grouping key is a Series:

    In [96]: df = DataFrame({'key1':['a','a','b','b','a'],
        ...: 'key2':['one','two','one','two','one'],
        ...: 'data1':np.random.randn(5),
        ...: 'data2':np.random.randn(5)})
    
    In [97]: df
    Out[97]:
      key1 key2     data1     data2
    0    a  one -1.390006  0.277334
    1    a  two -1.309464  1.245893
    2    b  one  1.293608  1.206705
    3    b  two -0.549139  0.140945
    4    a  one  2.292085  0.271638
    
    In [98]: grouped = df['data1'].groupby(df['key1'])
    
    # The resulting GroupBy object
    
    In [99]: grouped
    Out[99]: <pandas.core.groupby.groupby.SeriesGroupBy object at 0x000000EA33F9B438>
    
    # mean(): the mean of each group
    In [100]: grouped.mean()
    Out[100]:
    key1
    a   -0.135795
    b    0.372235
    Name: data1, dtype: float64
    
    In [101]: means = df['data1'].groupby([df['key1'],df['key2']]).mean()
    
    In [102]: means
    Out[102]:
    key1  key2
    a     one     0.451039
          two    -1.309464
    b     one     1.293608
          two    -0.549139
    Name: data1, dtype: float64
    
    In [103]: means.unstack()
    Out[103]:
    key2       one       two
    key1
    a     0.451039 -1.309464
    b     1.293608 -0.549139
    

    Here the grouping keys are arbitrary arrays:

    # np.array arrays as grouping keys
    
    In [5]: states = np.array(['O','C','C','O','O'])
    
    In [6]: years = np.array([2005,2005,2006,2005,2006])
    
    In [7]: df['data1'].groupby([states,years]).mean()
    Out[7]:
    C  2005    1.325095
       2006    0.111973
    O  2005   -0.428585
       2006    0.479145
    Name: data1, dtype: float64
    
    # Passing only the key column name(s) groups every data column at once
    # (non-numeric columns are dropped here; recent pandas may need numeric_only=True)
    
    In [9]: df.groupby('key1').mean()
    Out[9]:
             data1     data2
    key1
    a     0.543673  0.807876
    b    -0.285988  1.684493
    
    In [10]: df.groupby(['key1','key2']).mean()
    Out[10]:
                  data1     data2
    key1 key2
    a    one   0.152961  1.385948
         two   1.325095 -0.348269
    b    one   0.111973  1.977837
         two  -0.683948  1.391149
    
    # Group sizes
    
    In [11]: df.groupby('key1').size()
    Out[11]:
    key1
    a    3
    b    2
    dtype: int64
    
    In [12]: df.groupby(['key1','key2']).size()
    Out[12]:
    key1  key2
    a     one     2
          two     1
    b     one     1
          two     1
    dtype: int64
    
    

    1.2. Iterating over groups

    1. Inspect the groups for a single key or a key pair
    # The DataFrame to work with (re-created here, so the random values differ from above)
    
    In [14]: df
    Out[14]:
      key1 key2     data1     data2
    0    a  one -0.173222  1.621207
    1    a  two  1.325095 -0.348269
    2    b  one  0.111973  1.977837
    3    b  two -0.683948  1.391149
    4    a  one  0.479145  1.150689
    
    # Groups keyed on a single column
    
    In [16]: for name,group in df.groupby('key1'):
        ...:     print(name)
        ...:     print(group)
        ...:
    a
      key1 key2     data1     data2
    0    a  one -0.173222  1.621207
    1    a  two  1.325095 -0.348269
    4    a  one  0.479145  1.150689
    b
      key1 key2     data1     data2
    2    b  one  0.111973  1.977837
    3    b  two -0.683948  1.391149
    
    # Groups keyed on a pair of columns
    
    In [19]: for (k1,k2),group in df.groupby(['key1','key2']):
        ...:     print(k1,k2)
        ...:     print(group)
        ...:
    a one
      key1 key2     data1     data2
    0    a  one -0.173222  1.621207
    4    a  one  0.479145  1.150689
    a two
      key1 key2     data1     data2
    1    a  two  1.325095 -0.348269
    b one
      key1 key2     data1     data2
    2    b  one  0.111973  1.977837
    b two
      key1 key2     data1     data2
    3    b  two -0.683948  1.391149
    
    
    2. Turn the groups into a dict
    In [20]: pieces = dict(list(df.groupby('key1')))
    
    In [21]: pieces
    Out[21]:
    {'a':   key1 key2     data1     data2
     0    a  one -0.173222  1.621207
     1    a  two  1.325095 -0.348269
     4    a  one  0.479145  1.150689, 'b':   key1 key2     data1     data2
     2    b  one  0.111973  1.977837
     3    b  two -0.683948  1.391149}
    
    In [23]: pieces['a']
    Out[23]:
      key1 key2     data1     data2
    0    a  one -0.173222  1.621207
    1    a  two  1.325095 -0.348269
    4    a  one  0.479145  1.150689
    
    In [24]: pieces['b']
    Out[24]:
      key1 key2     data1     data2
    2    b  one  0.111973  1.977837
    3    b  two -0.683948  1.391149
    

    1.3. Selecting a column or subset of columns

    1. Basic pattern
    # verbose form
    
    In [29]: df['data1'].groupby(df['key1'])
    
    In [30]: df[['data2']].groupby(df['key1'])
    
    # syntactic sugar for the two lines above
    # (a single column name gives a SeriesGroupBy, a list of columns a DataFrameGroupBy)
    
    df.groupby('key1')['data1']
    df.groupby('key1')[['data2']]
    
    # example 
    
    In [33]: df.groupby(['key1','key2'])['data2'].mean()
    Out[33]:
    key1  key2
    a     one     1.385948
          two    -0.348269
    b     one     1.977837
          two     1.391149
    Name: data2, dtype: float64
    
    

    2. Data aggregation

    A simple agg/aggregate example:

    In [48]: df
    Out[48]:
      key1 key2     data1     data2
    0    a  one -0.173222  1.621207
    1    a  two  1.325095 -0.348269
    2    b  one  0.111973  1.977837
    3    b  two -0.683948  1.391149
    4    a  one  0.479145  1.150689
    
    In [49]: def PeakToPeak(arr):
        ...:     return arr.max() - arr.min()
    
    In [50]: df.groupby('key1').agg(PeakToPeak)
    Out[50]:
             data1     data2
    key1
    a     1.498317  1.969476
    b     0.795921  0.586688
    
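    agg also accepts a list of functions or function names, producing one result column per function; a small sketch on the same df (output omitted):

    # One output column per aggregation: the built-in 'mean' and 'std'
    # plus the custom PeakToPeak defined above
    df.groupby('key1')['data1'].agg(['mean', 'std', PeakToPeak])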

    2.1. Group-wise transformation: transform

    1. Add columns holding each row's group mean
    # the older approach: aggregate, then merge the result back
    
    In [68]: df
    Out[68]:
      key1 key2     data1     data2
    0    a  one -0.173222  1.621207
    1    a  two  1.325095 -0.348269
    2    b  one  0.111973  1.977837
    3    b  two -0.683948  1.391149
    4    a  one  0.479145  1.150689
    
    In [69]: k1_means = df.groupby('key1').mean().add_prefix('mean_')
    
    In [70]: k1_means
    Out[70]:
          mean_data1  mean_data2
    key1
    a       0.543673    0.807876
    b      -0.285988    1.684493
    
    In [71]: pd.merge(df,k1_means,left_on = 'key1' ,  right_index = True)
    Out[71]:
      key1 key2     data1     data2  mean_data1  mean_data2
    0    a  one -0.173222  1.621207    0.543673    0.807876
    1    a  two  1.325095 -0.348269    0.543673    0.807876
    4    a  one  0.479145  1.150689    0.543673    0.807876
    2    b  one  0.111973  1.977837   -0.285988    1.684493
    3    b  two -0.683948  1.391149   -0.285988    1.684493
    
    # transform method 
    
    In [83]: meanData = df.groupby('key1').transform(np.mean).add_prefix('mean_')
    
    In [84]: meanData
    Out[84]:
       mean_data1  mean_data2
    0    0.543673    0.807876
    1    0.543673    0.807876
    2   -0.285988    1.684493
    3   -0.285988    1.684493
    4    0.543673    0.807876
    
    In [85]: pd.concat([df,meanData], axis = 1)
    Out[85]:
      key1 key2     data1     data2  mean_data1  mean_data2
    0    a  one -0.173222  1.621207    0.543673    0.807876
    1    a  two  1.325095 -0.348269    0.543673    0.807876
    2    b  one  0.111973  1.977837   -0.285988    1.684493
    3    b  two -0.683948  1.391149   -0.285988    1.684493
    4    a  one  0.479145  1.150689    0.543673    0.807876
    
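    Note that recent pandas versions no longer silently drop non-numeric "nuisance" columns such as key2, so it is safer to select the numeric columns explicitly; a hedged equivalent of the transform call above:

    # Equivalent transform with explicit column selection; passing 'mean' as a
    # string lets pandas use its optimized group-mean implementation
    meanData = (df.groupby('key1')[['data1', 'data2']]
                  .transform('mean')
                  .add_prefix('mean_'))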
    2. transform: convert the original data to group means.
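    The people frame is not constructed in this excerpt; a minimal sketch that reproduces its shape (random values, so the numbers will differ from the output shown):

    people = DataFrame(np.random.randn(5, 5),
                       columns=list('abcde'),
                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
    people.iloc[2, [1, 2]] = np.nan   # a couple of NaNs, as in the 'Wes' row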
    In [90]: people
    Out[90]:
                   a         b         c         d         e
    Joe     0.498185  0.460470 -0.892633 -1.561500  0.279949
    Steve  -0.885170 -1.490421 -0.787302  1.559050  1.183115
    Wes    -0.237464       NaN       NaN -0.043788 -1.091813
    Jim    -1.547607 -0.121682 -0.355623 -1.703322 -0.733741
    Travis  0.638562  0.486515 -0.233517  0.023372  0.366325
    
    In [94]: key = list('ototo')
    
    # Group means keyed on key
    
    In [95]: people.groupby(key).mean()
    Out[95]:
              a         b         c         d         e
    o  0.299761  0.473492 -0.563075 -0.527305 -0.148513
    t -1.216388 -0.806052 -0.571462 -0.072136  0.224687
    
    # Broadcast those means back onto the original rows
    
    In [96]: people.groupby(key).transform(np.mean)
    Out[96]:
                   a         b         c         d         e
    Joe     0.299761  0.473492 -0.563075 -0.527305 -0.148513
    Steve  -1.216388 -0.806052 -0.571462 -0.072136  0.224687
    Wes     0.299761  0.473492 -0.563075 -0.527305 -0.148513
    Jim    -1.216388 -0.806052 -0.571462 -0.072136  0.224687
    Travis  0.299761  0.473492 -0.563075 -0.527305 -0.148513
    
    
    3. Compute each value's difference from its group mean (commonly used to normalize data)
    In [105]: people
    Out[105]:
                   a         b         c         d         e
    Joe     0.498185  0.460470 -0.892633 -1.561500  0.279949
    Steve  -0.885170 -1.490421 -0.787302  1.559050  1.183115
    Wes    -0.237464       NaN       NaN -0.043788 -1.091813
    Jim    -1.547607 -0.121682 -0.355623 -1.703322 -0.733741
    Travis  0.638562  0.486515 -0.233517  0.023372  0.366325
    
    # The demeaning function: each value minus its group mean
    
    In [106]: def demean(arr):
         ...:     return arr - arr.mean()
    
    In [107]: key
    Out[107]: ['o', 't', 'o', 't', 'o']
    
    # Transform the original data into demeaned values
    
    In [108]: demeaned = people.groupby(key).transform(demean)
    
    In [109]: demeaned
    Out[109]:
                   a         b         c         d         e
    Joe     0.198424 -0.013023 -0.329558 -1.034194  0.428462
    Steve   0.331218 -0.684370 -0.215840  1.631186  0.958428
    Wes    -0.537225       NaN       NaN  0.483517 -0.943300
    Jim    -0.331218  0.684370  0.215840 -1.631186 -0.958428
    Travis  0.338801  0.013023  0.329558  0.550677  0.514838
    
    # In theory the group means of the demeaned data are 0; floating-point limits leave tiny values close to zero
    
    In [110]: demeaned.groupby(key).mean()
    Out[110]:
                  a             b      ...                  d             e
    o  1.850372e-17  2.775558e-17      ...      -3.700743e-17 -3.700743e-17
    t  5.551115e-17  0.000000e+00      ...       0.000000e+00  0.000000e+00
    
    [2 rows x 5 columns]
    

    2.2. apply: split-apply-combine

    Split the object into pieces --> apply a function to each piece --> combine the results.
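    The examples below use the restaurant tips dataset from the book, which is not loaded in this excerpt; a minimal way to reproduce it, assuming seaborn's bundled copy (it lacks the tip_pct column, so that is added here):

    import seaborn as sns   # assumed source; the book loads the same table from a CSV

    tips = sns.load_dataset('tips')
    tips['tip_pct'] = tips['tip'] / tips['total_bill']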

    1. Basic usage
    # Define top(): return the n rows with the largest 'tip_pct'
    
    In [114]: def top(df, n=5, column = 'tip_pct'):
         ...:     return df.sort_values(by = column)[-n:]
    
    # Call the function directly
    
    In [115]: top(tips)
    Out[115]:
         total_bill   tip     sex smoker  day    time  size   tip_pct
    183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
    232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
    67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
    178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
    172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
    
    In [117]: top(tips,n=6)
    Out[117]:
         total_bill   tip     sex smoker  day    time  size   tip_pct
    109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
    183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
    232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
    67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
    178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
    172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
    
    # Apply top() to each group produced by groupby
    
    In [118]: tips.groupby('smoker').apply(top)
    Out[118]:
                total_bill   tip     sex    ...       time size   tip_pct
    smoker                                  ...
    No     88        24.71  5.85    Male    ...      Lunch    2  0.236746
           185       20.69  5.00    Male    ...     Dinner    5  0.241663
           51        10.29  2.60  Female    ...     Dinner    2  0.252672
           149        7.51  2.00    Male    ...      Lunch    2  0.266312
           232       11.61  3.39    Male    ...     Dinner    2  0.291990
    Yes    109       14.31  4.00  Female    ...     Dinner    2  0.279525
           183       23.17  6.50    Male    ...     Dinner    4  0.280535
           67         3.07  1.00  Female    ...     Dinner    1  0.325733
           178        9.60  4.00  Female    ...     Dinner    2  0.416667
           172        7.25  5.15    Male    ...     Dinner    2  0.710345
    
    [10 rows x 8 columns]
    
    
    In [119]: tips.groupby(['smoker', 'day']).apply(top, n = 1 , column = 'total_bill')
    Out[119]:
                     total_bill    tip     sex    ...       time size   tip_pct
    smoker day                                    ...
    No     Fri  94        22.75   3.25  Female    ...     Dinner    2  0.142857
           Sat  212       48.33   9.00    Male    ...     Dinner    4  0.186220
           Sun  156       48.17   5.00    Male    ...     Dinner    6  0.103799
           Thur 142       41.19   5.00    Male    ...      Lunch    5  0.121389
    Yes    Fri  95        40.17   4.73    Male    ...     Dinner    4  0.117750
           Sat  170       50.81  10.00    Male    ...     Dinner    3  0.196812
           Sun  182       45.35   3.50    Male    ...     Dinner    3  0.077178
           Thur 197       43.11   5.00  Female    ...      Lunch    4  0.115982
    
    [8 rows x 8 columns]
    
    
    In [123]: result = tips.groupby('smoker')['tip_pct'].describe()
    
    In [124]: result
    Out[124]:
            count      mean       std    ...          50%       75%       max
    smoker                               ...
    No      151.0  0.159328  0.039910    ...     0.155625  0.185014  0.291990
    Yes      93.0  0.163196  0.085119    ...     0.153846  0.195059  0.710345
    
    [2 rows x 8 columns]
    
    In [125]: result.T
    Out[125]:
    smoker          No        Yes
    count   151.000000  93.000000
    mean      0.159328   0.163196
    std       0.039910   0.085119
    min       0.056797   0.035638
    25%       0.136906   0.106771
    50%       0.155625   0.153846
    75%       0.185014   0.195059
    max       0.291990   0.710345
    
    2. Keeping or suppressing the group keys: group_keys=True/False
    In [127]: tips.groupby('smoker', group_keys= False).apply(top)
    Out[127]:
         total_bill   tip     sex smoker   day    time  size   tip_pct
    88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
    185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
    51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
    149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
    232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
    183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
    67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
    178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
    172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
    
    In [128]: tips.groupby('smoker').apply(top)
    Out[128]:
                total_bill   tip     sex    ...       time size   tip_pct
    smoker                                  ...
    No     88        24.71  5.85    Male    ...      Lunch    2  0.236746
           185       20.69  5.00    Male    ...     Dinner    5  0.241663
           51        10.29  2.60  Female    ...     Dinner    2  0.252672
           149        7.51  2.00    Male    ...      Lunch    2  0.266312
           232       11.61  3.39    Male    ...     Dinner    2  0.291990
    Yes    109       14.31  4.00  Female    ...     Dinner    2  0.279525
           183       23.17  6.50    Male    ...     Dinner    4  0.280535
           67         3.07  1.00  Female    ...     Dinner    1  0.325733
           178        9.60  4.00  Female    ...     Dinner    2  0.416667
           172        7.25  5.15    Male    ...     Dinner    2  0.710345
    
    [10 rows x 8 columns]
    
    3. Bucket and quantile analysis

    First split the data into buckets with cut / qcut, then use groupby/apply for per-bucket or quantile analysis.

    In [129]: frame = DataFrame({'data1':np.random.randn(1000),
         ...:                    'data2':np.random.randn(1000)})
    
    
    # using cut 
    
    In [130]: factor = pd.cut(frame.data1, 4)
    
    In [131]: factor[:10]
    Out[131]:
    0     (-0.286, 1.462]
    1     (-0.286, 1.462]
    2    (-2.034, -0.286]
    3      (1.462, 3.209]
    4     (-0.286, 1.462]
    5     (-0.286, 1.462]
    6    (-3.788, -2.034]
    7    (-3.788, -2.034]
    8     (-0.286, 1.462]
    9     (-0.286, 1.462]
    Name: data1, dtype: category
    Categories (4, interval[float64]): [(-3.788, -2.034] < (-2.034, -0.286] < (-0.286, 1.462] < (1.462, 3.209]]
    
    In [132]: len(factor)
    Out[132]: 1000
    
    # Define a function that returns a {stat name: value} dict
    In [134]: def get_stats(group):
         ...:     return { 'min': group.min(), 'max' : group.max(),
         ...:             'count': group.count(), 'mean' : group.mean()}
    
    
    In [135]: grouped = frame.data2.groupby(factor)
    
    In [136]: grouped.apply(get_stats).unstack()
    Out[136]:
                      count       max      mean       min
    data1
    (-3.788, -2.034]   18.0  1.701853  0.276175 -1.210724
    (-2.034, -0.286]  383.0  2.860290  0.006860 -2.509304
    (-0.286, 1.462]   513.0  3.147908  0.081546 -2.712499
    (1.462, 3.209]     86.0  2.066749 -0.001550 -2.043683
    
    
    # use qcut
    
    In [137]: grouping = pd.qcut(frame.data1, 10, labels=False)
    
    In [138]: grouped = frame.data2.groupby(grouping)
    
    In [139]: grouped.apply(get_stats).unstack()
    Out[139]:
           count       max      mean       min
    data1
    0      100.0  2.585796  0.094647 -2.299329
    1      100.0  2.601896  0.110122 -2.396772
    2      100.0  2.860290 -0.025375 -2.177280
    3      100.0  2.139157 -0.104499 -2.509304
    4      100.0  2.826224  0.143575 -2.290512
    5      100.0  3.147908  0.106912 -1.686569
    6      100.0  2.173290  0.056089 -1.773193
    7      100.0  1.974363 -0.036664 -2.352925
    8      100.0  2.182190  0.161370 -2.712499
    9      100.0  2.066749 -0.013192 -2.043683
    
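    For comparison, describe() on the same grouped Series produces a similar per-bucket summary without a custom stats function (output omitted):

    # Built-in alternative to the get_stats / unstack combination
    frame.data2.groupby(grouping).describe()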
    
    4. Filling missing values with group-specific values

    4.1. Filling a single Series

    
    In [140]: s = Series(np.random.randn(6))
    
    In [141]: s[::2] = np.nan
    
    In [142]: s
    Out[142]:
    0         NaN
    1   -0.504470
    2         NaN
    3   -0.358606
    4         NaN
    5   -0.257657
    dtype: float64
    
    In [143]: s.fillna(s.mean())
    Out[143]:
    0   -0.373577
    1   -0.504470
    2   -0.373577
    3   -0.358606
    4   -0.373577
    5   -0.257657
    dtype: float64
    

    4.2. Filling missing values group by group

    Because apply operates on each group separately, it is usually paired with a lambda (or a small def) that does the per-group work.

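    ContinueLetter is a helper the author defined earlier in the session and is not shown in this excerpt; a hypothetical reconstruction that would produce the list below, assuming it returns n consecutive letters starting from the given one:

    # Hypothetical stand-in for the author's ContinueLetter helper
    def ContinueLetter(start, n):
        return [chr(ord(start) + i) for i in range(n)]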
    
    In [178]: numbers = ContinueLetter('a',8)
    
    In [179]: numbers
    Out[179]: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    
    In [180]: group_key = ['first'] * 4 + ['second'] * 4
    
    In [181]: data = Series(np.random.randn(8) , index = numbers)
    
    In [183]: data[['a','f','h']] = np.nan
    
    In [184]: data
    Out[184]:
    a         NaN
    b    0.670058
    c   -0.931242
    d   -0.512491
    e    0.150320
    f         NaN
    g    0.266838
    h         NaN
    dtype: float64
    
    In [187]: fill_mean = lambda g : g.fillna(g.mean())
    
    In [188]: data.groupby(group_key).apply(fill_mean)
    Out[188]:
    a   -0.257892
    b    0.670058
    c   -0.931242
    d   -0.512491
    e    0.150320
    f    0.208579
    g    0.266838
    h    0.208579
    dtype: float64
    
    
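    A hedged variation on the same idea: predefined per-group fill values can be looked up through each group's .name attribute (the values below are made up for illustration):

    # Fill each group's NaNs with a preset value instead of the group mean
    fill_values = {'first': 0.5, 'second': -1}
    fill_func = lambda g: g.fillna(fill_values[g.name])
    data.groupby(group_key).apply(fill_func)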

    2018-08-24 — notes on Python for Data Analysis (《用Python进行数据分析》)
