美文网首页
Pandas 分组 groupby

Pandas 分组 groupby

作者: 陈天睡懒觉 | 来源:发表于2022-05-21 13:55 被阅读0次
    import pandas as pd
    df = pd.read_csv('data/gapminder.tsv',sep='\t')
    
    print(df.head())
    
    # 输出结果
           country continent  year  lifeExp       pop   gdpPercap
    0  Afghanistan      Asia  1952   28.801   8425333  779.445314
    1  Afghanistan      Asia  1957   30.332   9240934  820.853030
    2  Afghanistan      Asia  1962   31.997  10267083  853.100710
    3  Afghanistan      Asia  1967   34.020  11537966  836.197138
    4  Afghanistan      Asia  1972   36.088  13079460  739.981106
    
    print(df.dtypes)
    
    # 输出结果
    country       object
    continent     object
    year           int64
    lifeExp      float64
    pop            int64
    gdpPercap    float64
    dtype: object
    

    按照单变量分组和计算

    按照year分组并求每组的lifeExp的均值,得到一个Series

    mean_lifeExp_by_year = df.groupby('year')['lifeExp'].mean()
    print(mean_lifeExp_by_year)
    print(mean_lifeExp_by_year.iloc[0])
    print(mean_lifeExp_by_year.loc[1952])
    
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64
    49.05761971830987
    49.05761971830987
    

    取出值或者索引

    print(mean_lifeExp_by_year.values)
    print(mean_lifeExp_by_year.index)
    
    [49.05761972 51.50740113 53.60924901 55.67828958 57.64738648 59.57015746
     61.53319718 63.21261268 64.16033803 65.01467606 65.69492254 67.00742254]
    Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
                2007],
               dtype='int64', name='year')
    

    转成DataFrame,to_frame()方法,仍然只有一列

    mean_lifeExp_by_year_df = mean_lifeExp_by_year.to_frame()
    print(mean_lifeExp_by_year_df.shape) # (12, 1)
    

    按照多变量分组和计算多个值

    按照year和continent分组并求每组的lifeExp,GDP的均值,得到一个DataFrame

    multi_group_var = df.groupby(['year','continent'])[['lifeExp','gdpPercap']].mean()
    print(type(multi_group_var))
    print(multi_group_var.head(10))
    
    # 输出结果
    <class 'pandas.core.frame.DataFrame'>
                      lifeExp     gdpPercap
    year continent                         
    1952 Africa     39.135500   1252.572466
         Americas   53.279840   4079.062552
         Asia       46.314394   5195.484004
         Europe     64.408500   5661.057435
         Oceania    69.255000  10298.085650
    1957 Africa     41.266346   1385.236062
         Americas   55.960280   4616.043733
         Asia       49.318544   5787.732940
         Europe     66.703067   6963.012816
         Oceania    70.295000  11598.522455
    

    观察到year不全,需要平铺DataFrame,使用reset_index方法。

    flat = multi_group_var.reset_index()
    print(flat.head(10))
    
    # 输出结果
       year continent    lifeExp     gdpPercap
    0  1952    Africa  39.135500   1252.572466
    1  1952  Americas  53.279840   4079.062552
    2  1952      Asia  46.314394   5195.484004
    3  1952    Europe  64.408500   5661.057435
    4  1952   Oceania  69.255000  10298.085650
    5  1957    Africa  41.266346   1385.236062
    6  1957  Americas  55.960280   4616.043733
    7  1957      Asia  49.318544   5787.732940
    8  1957    Europe  66.703067   6963.012816
    9  1957   Oceania  70.295000  11598.522455
    

    分组计数

    • 唯一值计数 nunique()
    • 频数计数 value_counts()

    唯一值计数 nunique()

    country_nunique_by_continent = df.groupby('continent')['country'].nunique()
    print(type(country_nunique_by_continent))
    print(country_nunique_by_continent)
    
    # 输出结果
    <class 'pandas.core.series.Series'>
    continent
    Africa      52
    Americas    25
    Asia        33
    Europe      30
    Oceania      2
    Name: country, dtype: int64
    

    频数计数 value_counts(),频数计数的结果无法铺平

    country_count_by_continent = df.groupby('continent')['country'].value_counts()
    print(type(country_count_by_continent))
    print(country_count_by_continent.head(20))
    
    # 输出结果
    <class 'pandas.core.series.Series'>
    continent  country                 
    Africa     Algeria                     12
               Angola                      12
               Benin                       12
               Botswana                    12
               Burkina Faso                12
               Burundi                     12
               Cameroon                    12
               Central African Republic    12
               Chad                        12
               Comoros                     12
               Congo, Dem. Rep.            12
               Congo, Rep.                 12
               Cote d'Ivoire               12
               Djibouti                    12
               Egypt                       12
               Equatorial Guinea           12
               Eritrea                     12
               Ethiopia                    12
               Gabon                       12
               Gambia                      12
    Name: country, dtype: int64
    

    获取分组结果

    单变量分组结果获取

    groupby_year = df.groupby('year')
    year_1952 = groupby_year.get_group(1952)
    print(year_1952.head(10))
    
    # 输出结果
             country continent  year  lifeExp       pop     gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333    779.445314
    12       Albania    Europe  1952   55.230   1282697   1601.056136
    24       Algeria    Africa  1952   43.077   9279525   2449.008185
    36        Angola    Africa  1952   30.015   4232095   3520.610273
    48     Argentina  Americas  1952   62.485  17876956   5911.315053
    60     Australia   Oceania  1952   69.120   8691212  10039.595640
    72       Austria    Europe  1952   66.800   6927772   6137.076492
    84       Bahrain      Asia  1952   50.939    120447   9867.084765
    96    Bangladesh      Asia  1952   37.484  46886859    684.244172
    108      Belgium    Europe  1952   68.000   8730405   8343.105127
    

    多变量分组结果获取,元组形式

    groupby_year_continent = df.groupby(['year','continent'])
    year_1952_Asia = groupby_year_continent.get_group((1952, 'Asia'))
    print(year_1952.head(10))
    
    # 输出结果
             country continent  year  lifeExp       pop     gdpPercap
    0    Afghanistan      Asia  1952   28.801   8425333    779.445314
    12       Albania    Europe  1952   55.230   1282697   1601.056136
    24       Algeria    Africa  1952   43.077   9279525   2449.008185
    36        Angola    Africa  1952   30.015   4232095   3520.610273
    48     Argentina  Americas  1952   62.485  17876956   5911.315053
    60     Australia   Oceania  1952   69.120   8691212  10039.595640
    72       Austria    Europe  1952   66.800   6927772   6137.076492
    84       Bahrain      Asia  1952   50.939    120447   9867.084765
    96    Bangladesh      Asia  1952   37.484  46886859    684.244172
    108      Belgium    Europe  1952   68.000   8730405   8343.105127
    

    相关文章

      网友评论

          本文标题:Pandas 分组 groupby

          本文链接:https://www.haomeiwen.com/subject/spkcprtx.html