Pandas 1.x Cookbook (Second Edition), Chapter 02


Author: SeanCheney | Published 2021-02-03 18:29

    Chapter 01: Pandas Foundations
    Chapter 02: Essential DataFrame Operations
    Chapter 03: Creating and Persisting DataFrames
    Chapter 04: Beginning Data Analysis
    Chapter 05: Exploratory Data Analysis
    Chapter 06: Selecting Subsets of Data
    Chapter 07: Filtering Rows
    Chapter 08: Index Alignment


    2.1 Selecting multiple DataFrame columns

    Select multiple columns of a DataFrame by passing a list of column names:

    >>> import pandas as pd
    >>> import numpy as np
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movie_actor_director = movies[
    ...     [
    ...         "actor_1_name",
    ...         "actor_2_name",
    ...         "actor_3_name",
    ...         "director_name",
    ...     ]
    ... ]
    >>> movie_actor_director.head()
      actor_1_name actor_2_name actor_3_name director_name
    0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
    1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
    2  Christop...  Rory Kin...  Stephani...   Sam Mendes
    3    Tom Hardy  Christia...  Joseph G...  Christop...
    4  Doug Walker   Rob Walker          NaN  Doug Walker
    
    # Selecting a single column with a list and with a bare key returns different types.
    >>> type(movies[["director_name"]])
    <class 'pandas.core.frame.DataFrame'>   # a DataFrame
    >>> type(movies["director_name"])
    <class 'pandas.core.series.Series'>   # a Series
    

    Columns can also be selected with .loc:

    >>> type(movies.loc[:, ["director_name"]])
    <class 'pandas.core.frame.DataFrame'>
    >>> type(movies.loc[:, "director_name"])
    <class 'pandas.core.series.Series'>
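    The Series-vs-DataFrame distinction works the same on any frame; a minimal sketch with a made-up two-column DataFrame (not the movie dataset):

```python
import pandas as pd

# A tiny stand-in DataFrame (made-up data)
df = pd.DataFrame({"director_name": ["A", "B"], "gross": [1.0, 2.0]})

# A bare label returns a Series; a one-element list returns a DataFrame
series = df["director_name"]
frame = df[["director_name"]]
print(type(series))  # Series
print(type(frame))   # DataFrame; shape (2, 1)
```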
    

    Storing the column names in a list ahead of time improves readability:

    >>> cols = [
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ...     "director_name",
    ... ]
    >>> movie_actor_director = movies[cols]
    

    Without the list, pandas raises a KeyError:

    >>> movies[
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ...     "director_name",
    ... ]
    Traceback (most recent call last):
      ...
    KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')
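    The reason is that comma-separated labels inside one pair of brackets form a Python tuple, which pandas looks up as a single (nonexistent) key. A sketch on a toy frame with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})  # made-up toy frame

# "a", "b" inside one pair of brackets is the single tuple key ("a", "b"),
# which does not exist as a column label, hence the KeyError
try:
    df["a", "b"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```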
    

    2.2 Selecting columns with methods

    Shorten the column names, then count the columns of each data type:

    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return (
    ...         str(col)
    ...         .replace("facebook_likes", "fb")
    ...         .replace("_for_reviews", "")
    ...     )
    >>> movies = movies.rename(columns=shorten)
    >>> movies.dtypes.value_counts()
    float64    13
    int64       3
    object     12
    dtype: int64
    

    Use the .select_dtypes method to select only the integer columns:

    >>> movies.select_dtypes(include="int").head()
       num_voted_users  cast_total_fb  movie_fb
    0           886204           4834     33000
    1           471220          48350         0
    2           275868          11700     85000
    3          1144337         106759    164000
    4                8            143         0
    

    Select all numeric columns:

    >>> movies.select_dtypes(include="number").head()
       num_critics  duration  ...  aspect_ratio  movie_fb
    0        723.0     178.0  ...         1.78      33000
    1        302.0     169.0  ...         2.35          0
    2        602.0     148.0  ...         2.35      85000
    3        813.0     164.0  ...         2.35     164000
    4          NaN       NaN  ...          NaN          0
    

    Select the integer and string (object) columns:

    >>> movies.select_dtypes(include=["int", "object"]).head()
       color        direc/_name  ... conte/ating movie_fb
    0  Color      James Cameron  ...       PG-13    33000
    1  Color     Gore Verbinski  ...       PG-13        0
    2  Color         Sam Mendes  ...       PG-13    85000
    3  Color  Christopher Nolan  ...       PG-13   164000
    4    NaN        Doug Walker  ...         NaN        0
    

    Exclude the floating-point columns:

    >>> movies.select_dtypes(exclude="float").head()
       color director_name  ... content_rating movie_fb
    0  Color  James Ca...   ...        PG-13      33000
    1  Color  Gore Ver...   ...        PG-13          0
    2  Color   Sam Mendes   ...        PG-13      85000
    3  Color  Christop...   ...        PG-13     164000
    4    NaN  Doug Walker   ...          NaN          0
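    The include/exclude behavior can be reproduced on a small hand-built frame; a sketch with made-up columns, one per dtype family:

```python
import pandas as pd

# Made-up frame: one int column, one float column, one object column
df = pd.DataFrame({
    "votes": [100, 200],   # int64
    "score": [7.5, 8.1],   # float64
    "title": ["x", "y"],   # object
})

ints = df.select_dtypes(include="int")        # integer columns only
nums = df.select_dtypes(include="number")     # int and float
no_float = df.select_dtypes(exclude="float")  # everything but floats
print(list(ints.columns), list(nums.columns), list(no_float.columns))
```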
    

    Use the .filter method to select all columns whose names contain fb:

    >>> movies.filter(like="fb").head()
       director_fb  actor_3_fb  ...  actor_2_fb  movie_fb
    0          0.0       855.0  ...       936.0     33000
    1        563.0      1000.0  ...      5000.0         0
    2          0.0       161.0  ...       393.0     85000
    3      22000.0     23000.0  ...     23000.0    164000
    4        131.0         NaN  ...        12.0         0
    

    The items parameter selects columns by exact name:

    >>> cols = [
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ...     "director_name",
    ... ]
    >>> movies.filter(items=cols).head()
          actor_1_name  ...      director_name
    0      CCH Pounder  ...      James Cameron
    1      Johnny Depp  ...     Gore Verbinski
    2  Christoph Waltz  ...         Sam Mendes
    3        Tom Hardy  ...  Christopher Nolan
    4      Doug Walker  ...        Doug Walker
    

    The regex parameter filters column names with a regular expression; the following selects columns whose names contain a digit:

    >>> movies.filter(regex=r"\d").head()
       actor_3_fb actor_2_name  ...  actor_3_name actor_2_fb
    0       855.0  Joel Dav...  ...    Wes Studi       936.0
    1      1000.0  Orlando ...  ...  Jack Dav...      5000.0
    2       161.0  Rory Kin...  ...  Stephani...       393.0
    3     23000.0  Christia...  ...  Joseph G...     23000.0
    4         NaN   Rob Walker  ...          NaN        12.0
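    All three .filter modes (like, items, regex) can be exercised on an empty toy frame, since only the column labels matter; the column names below are made up:

```python
import pandas as pd

# Only the labels matter here, so an empty frame is enough
df = pd.DataFrame(columns=["actor_1_fb", "actor_2_name", "movie_fb", "color"])

fb_cols = df.filter(like="fb").columns       # substring match
item_cols = df.filter(items=["color", "nope"]).columns  # exact names; absent ones are ignored
digit_cols = df.filter(regex=r"\d").columns  # regular-expression match
print(list(fb_cols), list(item_cols), list(digit_cols))
```

    Note that items silently drops labels that do not exist, unlike bracket indexing, which raises a KeyError.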
    

    2.3 Ordering column names

    Guidelines for ordering columns:

    • Classify each column as either categorical or continuous;
    • Group the columns by those classes;
    • Place the categorical columns before the continuous ones;

    Here is an example. First read the data and shorten the column names:

    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return col.replace("facebook_likes", "fb").replace(
    ...         "_for_reviews", ""
    ...     )
    >>> movies = movies.rename(columns=shorten)
    

    Group the following column names:

    >>> movies.columns
    Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
           'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
           'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
           'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
           'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
           'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
           'movie_fb'],
          dtype='object')
    
    >>> cat_core = [
    ...     "movie_title",
    ...     "title_year",
    ...     "content_rating",
    ...     "genres",
    ... ]
    >>> cat_people = [
    ...     "director_name",
    ...     "actor_1_name",
    ...     "actor_2_name",
    ...     "actor_3_name",
    ... ]
    >>> cat_other = [
    ...     "color",
    ...     "country",
    ...     "language",
    ...     "plot_keywords",
    ...     "movie_imdb_link",
    ... ]
    >>> cont_fb = [
    ...     "director_fb",
    ...     "actor_1_fb",
    ...     "actor_2_fb",
    ...     "actor_3_fb",
    ...     "cast_total_fb",
    ...     "movie_fb",
    ... ]
    >>> cont_finance = ["budget", "gross"]
    >>> cont_num_reviews = [
    ...     "num_voted_users",
    ...     "num_user",
    ...     "num_critic",
    ... ]
    >>> cont_other = [
    ...     "imdb_score",
    ...     "duration",
    ...     "aspect_ratio",
    ...     "facenumber_in_poster",
    ... ]
    

    Concatenate all the lists above into the final column order, and confirm that no column was left out:

    >>> new_col_order = (
    ...     cat_core
    ...     + cat_people
    ...     + cat_other
    ...     + cont_fb
    ...     + cont_finance
    ...     + cont_num_reviews
    ...     + cont_other
    ... )
    >>> set(movies.columns) == set(new_col_order)
    True
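    The same reorder-and-verify pattern in miniature, with made-up column groups:

```python
import pandas as pd

df = pd.DataFrame({"budget": [1], "title": ["x"], "score": [9.0]})  # made-up

categorical = ["title"]
continuous = ["budget", "score"]
new_order = categorical + continuous

# Confirm nothing was forgotten before reindexing the columns
assert set(df.columns) == set(new_order)
reordered = df[new_order]
print(list(reordered.columns))  # ['title', 'budget', 'score']
```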
    

    Pass the new column order to movies to get the reordered DataFrame:

    >>> movies[new_col_order].head()
       movie_title  title_year  ... aspect_ratio facenumber_in_poster
    0       Avatar      2009.0  ...         1.78          0.0
    1  Pirates ...      2007.0  ...         2.35          0.0
    2      Spectre      2015.0  ...         2.35          1.0
    3  The Dark...      2012.0  ...         2.35          0.0
    4  Star War...         NaN  ...          NaN          0.0
    

    2.4 Summarizing a DataFrame

    Inspect the dataset's attributes shape, size, and ndim:

    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies.shape
    (4916, 28)
    >>> movies.size
    137648
    >>> movies.ndim
    2
    

    The .count method tallies the non-missing values in each column:

    >>> movies.count()
    color                      4897
    director_name              4814
    num_critic_for_reviews     4867
    duration                   4901
    director_facebook_likes    4814
                               ... 
    title_year                 4810
    actor_2_facebook_likes     4903
    imdb_score                 4916
    aspect_ratio               4590
    movie_facebook_likes       4916
    Length: 28, dtype: int64
    

    The .min, .max, .mean, .median, and .std methods return statistics for the numeric columns:

    >>> movies.min()
    num_critic_for_reviews        1.00
    duration                      7.00
    director_facebook_likes       0.00
    actor_3_facebook_likes        0.00
    actor_1_facebook_likes        0.00
                                ...   
    title_year                 1916.00
    actor_2_facebook_likes        0.00
    imdb_score                    1.60
    aspect_ratio                  1.18
    movie_facebook_likes          0.00
    Length: 16, dtype: float64
    

    .describe is a very powerful method that returns descriptive statistics and quantiles; to fit more output on screen, transpose the result with .T:

    >>> movies.describe().T
                   count         mean  ...       75%       max
    num_criti...  4867.0   137.988905  ...    191.00     813.0
    duration      4901.0   107.090798  ...    118.00     511.0
    director_...  4814.0   691.014541  ...    189.75   23000.0
    actor_3_f...  4893.0   631.276313  ...    633.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  11000.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2011.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...    912.00  137000.0
    imdb_score    4916.0     6.437429  ...      7.20       9.5
    aspect_ratio  4590.0     2.222349  ...      2.35      16.0
    movie_fac...  4916.0  7348.294142  ...   2000.00  349000.0
    

    The percentiles parameter of .describe yields arbitrary quantiles:

    >>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
                   count         mean  ...       99%       max
    num_criti...  4867.0   137.988905  ...    546.68     813.0
    duration      4901.0   107.090798  ...    189.00     511.0
    director_...  4814.0   691.014541  ...  16000.00   23000.0
    actor_3_f...  4893.0   631.276313  ...  11000.00   23000.0
    actor_1_f...  4909.0  6494.488491  ...  44920.00  640000.0
    ...              ...          ...  ...       ...       ...
    title_year    4810.0  2002.447609  ...   2016.00    2016.0
    actor_2_f...  4903.0  1621.923516  ...  17000.00  137000.0
    imdb_score    4916.0     6.437429  ...      8.50       9.5
    aspect_ratio  4590.0     2.222349  ...      4.00      16.0
    movie_fac...  4916.0  7348.294142  ...  93850.00  349000.0
    

    If you set skipna=False in these summary methods, missing values are no longer skipped, so any column containing a NaN returns NaN:

    >>> movies.min(skipna=False)
    num_critic_for_reviews     NaN
    duration                   NaN
    director_facebook_likes    NaN
    actor_3_facebook_likes     NaN
    actor_1_facebook_likes     NaN
                              ... 
    title_year                 NaN
    actor_2_facebook_likes     NaN
    imdb_score                 1.6
    aspect_ratio               NaN
    movie_facebook_likes       0.0
    Length: 16, dtype: float64
    

    2.5 Chaining DataFrame methods

    Use the .isnull method to test whether each value is missing:

    >>> movies = pd.read_csv("data/movie.csv")
    >>> def shorten(col):
    ...     return col.replace("facebook_likes", "fb").replace(
    ...         "_for_reviews", ""
    ...     )
    >>> movies = movies.rename(columns=shorten)
    >>> movies.isnull().head()
       color  director_name  ...  aspect_ratio  movie_fb
    0  False        False    ...        False      False
    1  False        False    ...        False      False
    2  False        False    ...        False      False
    3  False        False    ...        False      False
    4   True        False    ...         True      False
    

    The .sum method adds up the True/False values (True counts as 1, False as 0), which shows how many missing values each column has:

    >>> (movies.isnull().sum().head())
    color             19
    director_name    102
    num_critic        49
    duration          15
    director_fb      102
    dtype: int64
    

    Going one step further, chaining a second .sum gives the total number of missing values:

    >>> movies.isnull().sum().sum()
    2654
    

    If you only need to know whether any value is missing, .any() is more convenient:

    >>> movies.isnull().any().any()
    True
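    The whole counting chain can be verified end-to-end on a toy frame with a known number of missing values (made-up data):

```python
import pandas as pd
import numpy as np

# Made-up frame with exactly three missing values
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, "x"]})

per_column = df.isnull().sum()    # missing values per column: a -> 1, b -> 2
total = df.isnull().sum().sum()   # grand total: 3
has_any = df.isnull().any().any()
print(per_column.tolist(), total, has_any)
```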
    

    How it works

    The .isnull method converts the original DataFrame into a Boolean DataFrame of the same shape:

    >>> movies.isnull().dtypes.value_counts()
    bool    28
    dtype: int64
    

    There's more

    If object columns contain missing values, aggregations such as .min, .max, and .sum return an empty result:

    >>> movies[["color", "movie_title", "color"]].max()
    Series([], dtype: float64)
    

    To get a non-empty result, fill the missing values first:

    >>> movies.select_dtypes(["object"]).fillna("").max()
    color                            Color
    director_name            Étienne Faure
    actor_2_name             Zubaida Sahar
    genres                         Western
    actor_1_name             Óscar Jaenada
                              ...         
    plot_keywords      zombie|zombie spoof
    movie_imdb_link    http://www.imdb....
    language                          Zulu
    country                   West Germany
    content_rating                       X
    Length: 12, dtype: object
    

    For readability, method chains are usually wrapped in parentheses, which makes it easy to comment on or debug each step:

    >>> (movies.select_dtypes(["object"]).fillna("").max())
    color                            Color
    director_name            Étienne Faure
    actor_2_name             Zubaida Sahar
    genres                         Western
    actor_1_name             Óscar Jaenada
                              ...         
    plot_keywords      zombie|zombie spoof
    movie_imdb_link    http://www.imdb....
    language                          Zulu
    country                   West Germany
    content_rating                       X
    Length: 12, dtype: object
    

    2.6 DataFrame operations

    A DataFrame's columns may hold numbers or objects; adding 5 to the whole DataFrame raises a TypeError:

    >>> colleges = pd.read_csv("data/college.csv")
    >>> colleges + 5
    Traceback (most recent call last):
      ...
    TypeError: can only concatenate str (not "int") to str
    

    Use .filter to select the columns whose names start with 'UGDS_'; these hold the undergraduate population broken down by race:

    >>> colleges = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = colleges.filter(like="UGDS_")
    >>> college_ugds.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    Alabama S...      0.0158      0.9208  ...    0.0243     0.0137
    

    Pandas uses banker's rounding (round half to even: a 5 followed by further digits rounds up, while a trailing 5 rounds to the nearest even digit). Watch how UGDS_BLACK changes:

    >>> name = "Northwest-Shoals Community College"
    >>> college_ugds.loc[name]
    UGDS_WHITE    0.7912
    UGDS_BLACK    0.1250
    UGDS_HISP     0.0339
    UGDS_ASIAN    0.0036
    UGDS_AIAN     0.0088
    UGDS_NHPI     0.0006
    UGDS_2MOR     0.0012
    UGDS_NRA      0.0033
    UGDS_UNKN     0.0324
    Name: Northwest-Shoals Community College, dtype: float64
    >>> college_ugds.loc[name].round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.12
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
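    Banker's rounding is also what Python's built-in round and NumPy's np.round use, which you can check without any dataset:

```python
import numpy as np

# Round-half-to-even: a bare .5 goes to the nearest even digit
print(round(0.5), round(1.5), round(2.5))   # 0 2 2
print(np.round([0.5, 1.5, 2.5]).tolist())   # [0.0, 2.0, 2.0]

# 0.125 (exactly representable in binary) sits on a half,
# so it rounds down to the even 0.12
print(round(0.125, 2))  # 0.12
```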
    

    Add 0.0001 before rounding and watch the change:

    >>> (college_ugds.loc[name] + 0.0001).round(2)
    UGDS_WHITE    0.79
    UGDS_BLACK    0.13
    UGDS_HISP     0.03
    UGDS_ASIAN    0.00
    UGDS_AIAN     0.01
    UGDS_NHPI     0.00
    UGDS_2MOR     0.00
    UGDS_NRA      0.00
    UGDS_UNKN     0.03
    Name: Northwest-Shoals Community College, dtype: float64
    

    Before starting this rounding adventure, add 0.00501 to every value in college_ugds:

    >>> college_ugds + 0.00501
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...     0.03831     0.94031  ...   0.01091    0.01881
    Universit...     0.59721     0.26501  ...   0.02291    0.01501
    Amridge U...     0.30401     0.42421  ...   0.00501    0.27651
    Universit...     0.70381     0.13051  ...   0.03821    0.04001
    Alabama S...     0.02081     0.92581  ...   0.02931    0.01871
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    

    Use floor division, //, to round the values down to whole percentages:

    >>> (college_ugds + 0.00501) // 0.01
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...         3.0        94.0  ...       1.0        1.0
    Universit...        59.0        26.0  ...       2.0        1.0
    Amridge U...        30.0        42.0  ...       0.0       27.0
    Universit...        70.0        13.0  ...       3.0        4.0
    Alabama S...         2.0        92.0  ...       2.0        1.0
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    

    Dividing by 100 converts back to proportions:

    >>> college_ugds_op_round = (
    ...     (college_ugds + 0.00501) // 0.01 / 100
    ... )
    >>> college_ugds_op_round.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...        0.03        0.94  ...      0.01       0.01
    Universit...        0.59        0.26  ...      0.02       0.01
    Amridge U...        0.30        0.42  ...      0.00       0.27
    Universit...        0.70        0.13  ...      0.03       0.04
    Alabama S...        0.02        0.92  ...      0.02       0.01
    

    Now use the .round method; because of banker's rounding, add a tiny 0.00001 first:

    >>> college_ugds_round = (college_ugds + 0.00001).round(2)
    

    Check whether the two results are identical:

    >>> college_ugds_op_round.equals(college_ugds_round)
    True
    

    How it works

    Floating-point arithmetic introduces error:

    >>> 0.045 + 0.005
    0.049999999999999996
    

    After adding 0.00001, it becomes:

    >>> 0.045 + 0.005 + 0.00001
    0.05001
    

    There's more

    Like Series, DataFrames have method equivalents of the operators:

    >>> college2 = (
    ...     college_ugds.add(0.00501).floordiv(0.01).div(100)
    ... )
    >>> college2.equals(college_ugds_op_round)
    True 
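    The operator and method spellings produce identical results; a quick check of the add/floordiv/div chain on a toy column of made-up proportions:

```python
import pandas as pd

df = pd.DataFrame({"p": [0.045, 0.125]})  # made-up proportions

via_operators = (df + 0.00501) // 0.01 / 100
via_methods = df.add(0.00501).floordiv(0.01).div(100)
print(via_operators.equals(via_methods))  # True
```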
    

    2.7 Comparing missing values

    Pandas uses np.nan to represent missing values; this object behaves unusually:

    >>> np.nan == np.nan
    False
    >>> None == None
    True
    >>> np.nan > 5
    False
    >>> 5 > np.nan
    False
    >>> np.nan != 5
    True
    

    First load the data:

    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    

    == compares every element of the DataFrame:

    >>> college_ugds == 0.0019
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    

    But if the DataFrame contains missing values, == gives misleading results:

    >>> college_self_compare = college_ugds == college_ugds
    >>> college_self_compare.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Alabama A...        True        True  ...      True       True
    Universit...        True        True  ...      True       True
    Amridge U...        True        True  ...      True       True
    Universit...        True        True  ...      True       True
    Alabama S...        True        True  ...      True       True
    

    It looks fine at first, but the .all method exposes the problem:

    >>> college_self_compare.all()
    UGDS_WHITE    False
    UGDS_BLACK    False
    UGDS_HISP     False
    UGDS_ASIAN    False
    UGDS_AIAN     False
    UGDS_NHPI     False
    UGDS_2MOR     False
    UGDS_NRA      False
    UGDS_UNKN     False
    dtype: bool
    

    This is because missing values never compare equal to each other. Testing for missing values with == np.nan, as below, always yields 0:

    >>> (college_ugds == np.nan).sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    

    The right way to count missing values is .isna:

    >>> college_ugds.isna().sum()
    UGDS_WHITE    661
    UGDS_BLACK    661
    UGDS_HISP     661
    UGDS_ASIAN    661
    UGDS_AIAN     661
    UGDS_NHPI     661
    UGDS_2MOR     661
    UGDS_NRA      661
    UGDS_UNKN     661
    dtype: int64
    

    The correct way to compare two DataFrames is the .equals method:

    >>> college_ugds.equals(college_ugds)
    True
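    The pitfall and its fix side by side, on a toy frame with a known number of missing values (made-up data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})  # made-up

wrong = (df == np.nan).sum().sum()  # NaN never equals anything, so 0
right = df.isna().sum().sum()       # the 3 actual missing values
print(wrong, right)
print(df.equals(df))  # True: .equals treats NaNs in the same position as equal
```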
    

    There's more

    The .eq method is equivalent to ==:

    >>> college_ugds.eq(0.0019)  # same as college_ugds == .0019
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    Amridge U...       False       False  ...     False      False
    Universit...       False       False  ...     False      False
    Alabama S...       False       False  ...     False      False
    ...                  ...         ...  ...       ...        ...
    SAE Insti...       False       False  ...     False      False
    Rasmussen...       False       False  ...     False      False
    National ...       False       False  ...     False      False
    Bay Area ...       False       False  ...     False      False
    Excel Lea...       False       False  ...     False      False
    

    The pandas.testing module has an assertion function, assert_frame_equal, for checking whether two DataFrames are the same; it raises an AssertionError if they differ and returns None if they are equal:

    >>> from pandas.testing import assert_frame_equal
    >>> assert_frame_equal(college_ugds, college_ugds) is None
    True
    

    2.8 Transposing the direction of a DataFrame operation

    Many DataFrame methods take an axis parameter, which controls the direction of the operation. axis may be 'index' (or 0) or 'columns' (or 1). The string forms are clearer, so prefer them.

    Read and filter the data:

    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    >>> college_ugds.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    

    The .count() method defaults to axis=0; it is equivalent to college_ugds.count(axis=0) and college_ugds.count(axis='index'):

    >>> college_ugds.count()
    UGDS_WHITE    6874
    UGDS_BLACK    6874
    UGDS_HISP     6874
    UGDS_ASIAN    6874
    UGDS_AIAN     6874
    UGDS_NHPI     6874
    UGDS_2MOR     6874
    UGDS_NRA      6874
    UGDS_UNKN     6874
    dtype: int64
    

    Changing axis to 'columns' gives the number of non-missing values in each row:

    >>> college_ugds.count(axis="columns").head()
    INSTNM
    Alabama A & M University               9
    University of Alabama at Birmingham    9
    Amridge University                     9
    University of Alabama in Huntsville    9
    Alabama State University               9
    dtype: int64
    

    Check whether each row adds up to 100 percent:

    >>> college_ugds.sum(axis="columns").head()
    INSTNM
    Alabama A & M University               1.0000
    University of Alabama at Birmingham    0.9999
    Amridge University                     1.0000
    University of Alabama in Huntsville    1.0000
    Alabama State University               1.0000
    dtype: float64
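    The axis behavior in miniature: column-wise versus row-wise on a 2×3 toy frame with made-up proportions and one missing value:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"w": [0.5, 0.2], "b": [0.5, np.nan], "h": [0.0, 0.8]})

per_column = df.count(axis="index")  # down the rows: non-missing per column
per_row = df.count(axis="columns")   # across the columns: non-missing per row
row_sums = df.sum(axis="columns")    # each row's total proportion (NaN skipped)
print(per_column.tolist(), per_row.tolist(), row_sums.tolist())
```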
    

    Compute the median of each column:

    >>> college_ugds.median(axis="index")
    UGDS_WHITE    0.55570
    UGDS_BLACK    0.10005
    UGDS_HISP     0.07140
    UGDS_ASIAN    0.01290
    UGDS_AIAN     0.00260
    UGDS_NHPI     0.00000
    UGDS_2MOR     0.01750
    UGDS_NRA      0.00000
    UGDS_UNKN     0.01430
    dtype: float64
    

    There's more

    Using cumsum along the columns accumulates the proportions, offering another view of the white and Black shares:

    >>> college_ugds_cumsum = college_ugds.cumsum(axis=1)
    >>> college_ugds_cumsum.head()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...      0.0333      0.9686  ...    0.9862     1.0000
    Universit...      0.5922      0.8522  ...    0.9899     0.9999
    Amridge U...      0.2990      0.7182  ...    0.7285     1.0000
    Universit...      0.6988      0.8243  ...    0.9650     1.0000
    Alabama S...      0.0158      0.9366  ...    0.9863     1.0000
    

    2.9 Determining college campus diversity

    The top 10 schools by the US News diversity index are:

    >>> pd.read_csv(
    ...     "data/college_diversity.csv", index_col="School"
    ... )
                                                       Diversity Index
    School
    Rutgers University--Newark  Newark, NJ                        0.76
    Andrews University  Berrien Springs, MI                       0.74
    Stanford University  Stanford, CA                             0.74
    University of Houston  Houston, TX                            0.74
    University of Nevada--Las Vegas  Las Vegas, NV                0.74
    University of San Francisco  San Francisco, CA                0.74
    San Francisco State University  San Francisco, CA             0.73
    University of Illinois--Chicago  Chicago, IL                  0.73
    New Jersey Institute of Technology  Newark, NJ                0.72
    Texas Woman's University  Denton, TX                          0.72
    

    For our own version of the question, first read the data:

    >>> college = pd.read_csv(
    ...     "data/college.csv", index_col="INSTNM"
    ... )
    >>> college_ugds = college.filter(like="UGDS_")
    

    Count the missing values in each row and sort in descending order:

    >>> (
    ...     college_ugds.isnull()
    ...     .sum(axis="columns")
    ...     .sort_values(ascending=False)
    ...     .head()
    ... )
    INSTNM
    Excel Learning Center-San Antonio South         9
    Philadelphia College of Osteopathic Medicine    9
    Assemblies of God Theological Seminary          9
    Episcopal Divinity School                       9
    Phillips Graduate Institute                     9
    dtype: int64
    

    If all nine columns in a row are missing, drop that row with the .dropna method:

    >>> college_ugds = college_ugds.dropna(how="all")
    >>> college_ugds.isnull().sum()
    UGDS_WHITE    0
    UGDS_BLACK    0
    UGDS_HISP     0
    UGDS_ASIAN    0
    UGDS_AIAN     0
    UGDS_NHPI     0
    UGDS_2MOR     0
    UGDS_NRA      0
    UGDS_UNKN     0
    dtype: int64
    

    Flag every race whose proportion is at least 15%:

    >>> college_ugds.ge(0.15)
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...                     
    Alabama A...       False        True  ...     False      False
    Universit...        True        True  ...     False      False
    Amridge U...        True        True  ...     False       True
    Universit...        True       False  ...     False      False
    Alabama S...       False        True  ...     False      False
    ...                  ...         ...  ...       ...        ...
    Hollywood...        True        True  ...     False      False
    Hollywood...       False        True  ...     False      False
    Coachella...        True       False  ...     False      False
    Dewey Uni...       False       False  ...     False      False
    Coastal P...        True        True  ...     False      False
    

    Use .sum to tally the flags in each row:

    >>> diversity_metric = college_ugds.ge(0.15).sum(
    ...     axis="columns"
    ... )
    >>> diversity_metric.head()
    INSTNM
    Alabama A & M University               1
    University of Alabama at Birmingham    2
    Amridge University                     3
    University of Alabama in Huntsville    1
    Alabama State University               1
    dtype: int64
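    The metric boils down to a Boolean threshold followed by a row sum; a sketch with made-up proportions for two fictional schools:

```python
import pandas as pd

df = pd.DataFrame(
    {"white": [0.70, 0.20], "black": [0.20, 0.30], "hisp": [0.10, 0.50]},
    index=["School A", "School B"],  # made-up schools
)

# True where a race makes up at least 15% of the student body
flags = df.ge(0.15)
diversity = flags.sum(axis="columns")  # races above threshold per school
print(diversity.tolist())  # School A: 2, School B: 3
```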
    

    Use .value_counts to see how this Series is distributed:

    >>> diversity_metric.value_counts()
    1    3042
    2    2884
    3     876
    4      63
    0       7
    5       2
    dtype: int64
    

    Surprisingly, two schools have five races that each exceed 15%. Sort diversity_metric:

    >>> diversity_metric.sort_values(ascending=False).head()
    INSTNM
    Regency Beauty Institute-Austin          5
    Central Texas Beauty College-Temple      5
    Sullivan and Cogliano Training Center    4
    Ambria College of Nursing                4
    Berkeley College-New York                4
    dtype: int64
    

    Look at the two top-scoring schools:

    >>> college_ugds.loc[
    ...     [
    ...         "Regency Beauty Institute-Austin",
    ...         "Central Texas Beauty College-Temple",
    ...     ]
    ... ]
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Regency B...      0.1867      0.2133  ...       0.0     0.2667
    Central T...      0.1616      0.2323  ...       0.0     0.1515
    

    Now see how the top five US News schools fare on our metric:

    >>> us_news_top = [
    ...     "Rutgers University-Newark",
    ...     "Andrews University",
    ...     "Stanford University",
    ...     "University of Houston",
    ...     "University of Nevada-Las Vegas",
    ... ]
    >>> diversity_metric.loc[us_news_top]
    INSTNM
    Rutgers University-Newark         4
    Andrews University                3
    Stanford University               3
    University of Houston             3
    University of Nevada-Las Vegas    3
    dtype: int64
    

    There's more

    Look at the ten least diverse schools:

    >>> (
    ...     college_ugds.max(axis=1)
    ...     .sort_values(ascending=False)
    ...     .head(10)
    ... )
    INSTNM
    Dewey University-Manati                               1.0
    Yeshiva and Kollel Harbotzas Torah                    1.0
    Mr Leon's School of Hair Design-Lewiston              1.0
    Dewey University-Bayamon                              1.0
    Shepherds Theological Seminary                        1.0
    Yeshiva Gedolah Kesser Torah                          1.0
    Monteclaro Escuela de Hoteleria y Artes Culinarias    1.0
    Yeshiva Shaar Hatorah                                 1.0
    Bais Medrash Elyon                                    1.0
    Yeshiva of Nitra Rabbinical College                   1.0
    dtype: float64
    

    Is there any school where every race exceeds 1%?

    >>> (college_ugds > 0.01).all(axis=1).any()
    True
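    The final one-liner combines .all across the columns with .any down the resulting Series; the same pattern on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"w": [0.50, 0.99], "b": [0.30, 0.005], "h": [0.20, 0.005]})

# Per row: are ALL proportions above 1%?  Then: does ANY such row exist?
all_above = (df > 0.01).all(axis="columns")
exists = all_above.any()
print(all_above.tolist(), exists)  # [True, False] True
```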
    

