Getting Started with pandas

Author: python测试开发 | Published: 2018-10-27 12:03

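    Introduction

    pandas provides data structures and manipulation tools that make cleaning and analyzing data fast and easy.

    pandas is often used alongside numerical computing tools such as NumPy and SciPy, analysis libraries such as statsmodels and scikit-learn, and visualization libraries such as matplotlib. It builds on NumPy arrays, so large amounts of data can often be processed without explicit loops.

    pandas is designed for tabular or heterogeneous data, whereas NumPy is best suited to homogeneous numerical array data.

    The import convention used here: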

    #!python
    
    import pandas as pd
    

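    The two main data structures are Series and DataFrame.

    Series

    A Series is a one-dimensional array-like object made up of a sequence of values (of NumPy-like dtypes) and an associated array of data labels, called its index. The simplest Series is built from just an array of data: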

    #!python
    
    In [2]: import pandas as pd
    
    In [3]: obj = pd.Series([4, 7, -5, 3])
    
    In [4]: obj
    Out[4]: 
    0    4
    1    7
    2   -5
    3    3
    dtype: int64
    
    In [5]: obj.values
    Out[5]: array([ 4,  7, -5,  3])
    
    In [6]: obj.index
    Out[6]: RangeIndex(start=0, stop=4, step=1)
    
    

    Specifying an index:

    #!python
    
    In [2]: obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
    
    In [3]: obj2
    Out[3]: 
    d    4
    b    7
    a   -5
    c    3
    dtype: int64
    
    In [4]: obj2.index
    Out[4]: Index(['d', 'b', 'a', 'c'], dtype='object')
    
    In [10]: obj2['a']
    Out[10]: -5
    
    In [11]: obj2['d'] = 6
    
    In [12]: obj2[['c', 'a', 'd']]
    Out[12]: 
    c    3
    a   -5
    d    6
    dtype: int64
    
    

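    Unlike a plain NumPy array, a Series also lets you select values by index label.

    NumPy functions and NumPy-like operations (filtering with a boolean array, scalar multiplication, applying math functions, and so on) preserve the link between index and value: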

    #!python
    
    In [13]: obj2[obj2 > 0]
    Out[13]: 
    d    6
    b    7
    c    3
    dtype: int64
    
    In [14]: obj2 * 2
    Out[14]: 
    d    12
    b    14
    a   -10
    c     6
    dtype: int64
    
    In [15]: obj2
    Out[15]: 
    d    6
    b    7
    a   -5
    c    3
    dtype: int64
    
    In [17]: import numpy as np
    
    In [18]: np.exp(obj2)
    Out[18]: 
    d     403.428793
    b    1096.633158
    a       0.006738
    c      20.085537
    dtype: float64
    
    In [19]: 'b' in obj2
    Out[19]: True
    
    In [20]: 'e' in obj2
    Out[20]: False
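
As a rough sketch of the point above (the variable names are illustrative and not taken from the article), the following checks that filtering, scalar arithmetic, and a NumPy ufunc each return a Series that still carries the original labels:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

positive = s[s > 0]   # boolean mask: only rows where the mask is True survive
doubled = s * 2       # scalar multiplication: every label is kept
exped = np.exp(s)     # NumPy ufunc: the result is a Series, not a bare ndarray

print(positive.index.tolist())
print(doubled["a"])
print(exped.index.equals(s.index))
```

The membership test `"b" in s` looks at the labels, not the values, which is what makes the dict analogy work.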
    
    

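    A Series can therefore be thought of as a fixed-length, ordered dict. It can also be created from a dict; labels in a passed index that are missing from the dict get NaN, which isnull and notnull detect: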

    #!python
    
    In [21]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    
    In [22]: obj3 = pd.Series(sdata)
    
    In [23]: obj3
    Out[23]: 
    Ohio      35000
    Oregon    16000
    Texas     71000
    Utah       5000
    dtype: int64
    
    In [24]: states = ['California', 'Ohio', 'Oregon', 'Texas']
    
    In [25]: obj4 = pd.Series(sdata, index=states)
    
    In [26]: obj4
    Out[26]: 
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64
    
    In [27]: pd.isnull(obj4)
    Out[27]: 
    California     True
    Ohio          False
    Oregon        False
    Texas         False
    dtype: bool
    
    In [28]: pd.notnull(obj4)
    Out[28]: 
    California    False
    Ohio           True
    Oregon         True
    Texas          True
    dtype: bool
    
    In [29]: obj4.isnull()
    Out[29]: 
    California     True
    Ohio          False
    Oregon        False
    Texas         False
    dtype: bool
    
    In [32]: obj4.notnull()
    Out[32]: 
    California    False
    Ohio           True
    Oregon         True
    Texas          True
    dtype: bool
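
A minimal sketch of the dict-plus-index behaviour, assuming a reasonably recent pandas where `isna`/`notna` are available as aliases of `isnull`/`notnull` (the names `missing` and `present` are illustrative):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# Only labels listed in `states` are kept; "California" has no entry in sdata,
# so it becomes NaN, and "Utah" is dropped because it was not requested.
s = pd.Series(sdata, index=states)

missing = s[s.isna()].index.tolist()   # labels whose value is NaN
present = s[s.notna()]                 # the rows that do have data

print(missing)
print(present.sum())
```

Because NaN is a floating-point value, the integer data is converted to float64 as soon as one label is missing.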
    
    

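    Arithmetic between Series aligns on index labels. A Series and its index also have a name attribute, and the index can be replaced by assignment: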

    #!python
    
    In [33]: obj3
    Out[33]: 
    Ohio      35000
    Oregon    16000
    Texas     71000
    Utah       5000
    dtype: int64
    
    In [34]: obj4
    Out[34]: 
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64
    
    In [35]: obj3 + obj4
    Out[35]: 
    California         NaN
    Ohio           70000.0
    Oregon         32000.0
    Texas         142000.0
    Utah               NaN
    dtype: float64
    
    In [36]: obj4.name = 'population'
    
    In [37]: obj4.index.name = 'state'
    
    In [38]: obj4
    Out[38]: 
    state
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    Name: population, dtype: float64
    
    In [40]: obj = pd.Series([4, 7, -5, 3])
    
    In [41]: obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
    
    In [42]: obj
    Out[42]: 
    Bob      4
    Steve    7
    Jeff    -5
    Ryan     3
    dtype: int64
    
    

    Code for this article: https://github.com/china-testing/python-api-tesing/

    Latest version of this article: http://t.cn/R8tJ9JH

    DataFrame

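    A DataFrame is a rectangular table of data containing an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on). It has both a row index and a column index, and can be thought of as a dict of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks.

    There are many ways to construct a DataFrame. The most common is to pass a dict of equal-length lists or NumPy arrays. As with Series, an index is assigned automatically. In the output below the columns appear in sorted order; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order instead.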

    #!python
    
    In [1]: import pandas as pd
    
    In [2]: import numpy as np
    
    In [3]: 
    
    In [3]: data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       ...: 'year': [2000, 2001, 2002, 2001, 2002, 2003],
       ...: 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    
    In [4]: 
    
    In [4]: frame = pd.DataFrame(data)
    
    In [5]: frame
    Out[5]: 
       pop   state  year
    0  1.5    Ohio  2000
    1  1.7    Ohio  2001
    2  3.6    Ohio  2002
    3  2.4  Nevada  2001
    4  2.9  Nevada  2002
    5  3.2  Nevada  2003
    
    In [6]: frame.head()
    Out[6]: 
       pop   state  year
    0  1.5    Ohio  2000
    1  1.7    Ohio  2001
    2  3.6    Ohio  2002
    3  2.4  Nevada  2001
    4  2.9  Nevada  2002
    
    In [7]: 
    
    In [7]: pd.DataFrame(data, columns=['year', 'state', 'pop'])
    Out[7]: 
       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9
    5  2003  Nevada  3.2
    
    In [8]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
       ...: index=['one', 'two', 'three', 'four', 'five', 'six'])
    
    In [9]: frame2
    Out[9]: 
           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN
    six    2003  Nevada  3.2  NaN
    
    In [10]: frame2['state']
    Out[10]: 
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    six      Nevada
    Name: state, dtype: object
    
    

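    As shown, columns sets the column order and index sets the row labels. As with Series, a requested column that is not in the data is filled with NaN. A column can be retrieved as a Series either by dict-like notation (frame2['state']) or by attribute (frame2.state); the returned Series has the same index as the DataFrame, and its name attribute is set accordingly.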
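
A small sketch of these constructor options, using a shortened version of the article's data (the three-row table and the variable names are assumptions made for brevity):

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# "debt" is not a key of data, so the whole column is filled with NaN.
frame = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],
    index=["one", "two", "three"],
)

col = frame["state"]   # dict-style access returns a Series
same = frame.state     # attribute access returns the same column

print(col.name, col.index.tolist())
```

Attribute access only works when the column name is a valid Python identifier that does not clash with a DataFrame method; `frame.pop`, for instance, is the `pop` method rather than the column, so `frame["pop"]` is the reliable form.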

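    Rows can be retrieved by label with the loc attribute (and by integer position with iloc). Columns can be modified by assignment.
    When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, its labels are aligned exactly to the DataFrame's index, and any index labels missing from the Series get NaN: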

    #!python
    
    In [11]: frame2.loc['three']
    Out[11]: 
    year     2002
    state    Ohio
    pop       3.6
    debt      NaN
    Name: three, dtype: object
    
    In [12]: frame2['debt'] = 16.5
    
    In [13]: frame2
    Out[13]: 
           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5
    six    2003  Nevada  3.2  16.5
    
    In [14]: frame2['debt'] = np.arange(6.)
    
    In [15]: frame2
    Out[15]: 
           year   state  pop  debt
    one    2000    Ohio  1.5   0.0
    two    2001    Ohio  1.7   1.0
    three  2002    Ohio  3.6   2.0
    four   2001  Nevada  2.4   3.0
    five   2002  Nevada  2.9   4.0
    six    2003  Nevada  3.2   5.0
    
    In [16]: val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
    
    In [17]: frame2['debt'] = val
    
    In [18]: frame2
    Out[18]: 
           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2002  Nevada  2.9  -1.7
    six    2003  Nevada  3.2   NaN
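
The three assignment cases can be sketched side by side. This is an illustrative example on a made-up three-row frame, not the article's session:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

frame["debt"] = 16.5             # a scalar is broadcast to every row
frame["debt"] = np.arange(3.0)   # an array must have exactly len(frame) items

# A Series is matched by label, not by position: "one" and "three" get NaN,
# and the label "four" (absent from frame) is ignored.
val = pd.Series([-1.2, -1.5], index=["two", "four"])
frame["debt"] = val

try:
    frame["bad"] = [1, 2]        # wrong length for a 3-row frame
except ValueError as exc:
    length_error = str(exc)

print(frame)
```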
    
    

    Assigning to a column that does not exist creates a new column. The del keyword deletes a column:

    #!python
    
    In [19]: frame2['eastern'] = frame2['state'] == 'Ohio'
    
    In [20]: frame2
    Out[20]: 
           year   state  pop  debt  eastern
    one    2000    Ohio  1.5   NaN     True
    two    2001    Ohio  1.7  -1.2     True
    three  2002    Ohio  3.6   NaN     True
    four   2001  Nevada  2.4  -1.5    False
    five   2002  Nevada  2.9  -1.7    False
    six    2003  Nevada  3.2   NaN    False
    
    In [21]: del frame2['eastern']
    
    In [22]: frame2.columns
    Out[22]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
    
    

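    The column returned by indexing a DataFrame can be a view on the underlying data rather than a copy, so in-place modifications to that Series may be reflected in the source DataFrame (newer pandas versions with Copy-on-Write enabled no longer propagate such changes).
    Use the Series's copy method to copy a column explicitly.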
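
Since the view-or-copy behaviour varies between pandas versions, the explicit copy is the only part that is safe to rely on. A minimal sketch (frame contents and names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

backup = frame["pop"].copy()   # explicit, independent copy of the column
backup["one"] = 99.0           # modifies only the copy

print(frame.loc["one", "pop"])   # the source frame is unchanged
print(backup["one"])
```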

    Another common form of data is a nested dict. The outer keys become the columns and the inner keys become the row index:

    #!python
    
    In [23]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       ....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    
    In [24]: frame3 = pd.DataFrame(pop)
    
    In [25]: frame3
    Out[25]: 
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6
    
    In [26]: frame3.T
    Out[26]: 
            2000  2001  2002
    Nevada   NaN   2.4   2.9
    Ohio     1.5   1.7   3.6
    
    In [27]: pd.DataFrame(pop, index=[2001, 2002, 2003])
    Out[27]: 
          Nevada  Ohio
    2001     2.4   1.7
    2002     2.9   3.6
    2003     NaN   NaN
    
    In [28]: pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}
    
    In [29]: pdata
    Out[29]: 
    {'Ohio': 2000    1.5
     2001    1.7
     Name: Ohio, dtype: float64, 'Nevada': 2000    NaN
     2001    2.4
     Name: Nevada, dtype: float64}
    
    In [30]: pd.DataFrame(pdata)
    Out[30]: 
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    
    In [31]: frame3.index.name = 'year'; frame3.columns.name = 'state'
    
    In [32]: frame3
    Out[32]: 
    state  Nevada  Ohio
    year               
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6
    
    In [33]: frame3.values
    Out[33]: 
    array([[ nan,  1.5],
           [ 2.4,  1.7],
           [ 2.9,  3.6]])
    
    In [34]: frame2.values
    Out[34]: 
    array([[2000, 'Ohio', 1.5, nan],
           [2001, 'Ohio', 1.7, -1.2],
           [2002, 'Ohio', 3.6, nan],
           [2001, 'Nevada', 2.4, -1.5],
           [2002, 'Nevada', 2.9, -1.7],
           [2003, 'Nevada', 3.2, nan]], dtype=object)
    
    

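    As shown, a DataFrame can be transposed with T, and a dict of Series is treated much like a dict of dicts. If the index and columns of a DataFrame have their name attributes set, these are displayed as well. As with Series, the values attribute returns the data as a two-dimensional ndarray. If the columns have different dtypes, the dtype of the resulting array is chosen to accommodate all of them (object in the frame2 example above).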
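
The following sketch puts the nested-dict constructor, transposition, and the dtype of `values` in one place. The `mixed` frame is a hypothetical example added to show the object fallback:

```python
import numpy as np
import pandas as pd

pop = {
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
    "Nevada": {2001: 2.4, 2002: 2.9},
}

frame3 = pd.DataFrame(pop)    # outer keys -> columns, inner keys -> rows
frame3.index.name = "year"
frame3.columns.name = "state"

transposed = frame3.T         # rows and columns swapped
numeric = frame3.values       # all-float columns -> float64 ndarray

mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})
mixed_values = mixed.values   # int and string columns -> object ndarray

print(frame3)
print(numeric.dtype, mixed_values.dtype)
```

The row order of `frame3` can differ between pandas versions (older releases sort the inner keys), so code that depends on it is safer calling `sort_index()` explicitly.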

    The DataFrame constructor accepts: a 2D ndarray; a dict of arrays, lists, or tuples; a NumPy structured/record array; a dict of Series; a dict of dicts; a list of dicts or Series; a list of lists or tuples; another DataFrame; a NumPy MaskedArray.

    Further reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

    Index Objects

    pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.

    #!python
    
    In [35]: obj = pd.Series(range(3), index=['a', 'b', 'c'])
    
    In [36]: index = obj.index
    
    In [37]: index
    Out[37]: Index(['a', 'b', 'c'], dtype='object')
    
    In [38]: index[1:]
    Out[38]: Index(['b', 'c'], dtype='object')
    
    In [39]: index[1] = 'd'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-39-676fdeb26a68> in <module>()
    ----> 1 index[1] = 'd'
    
    /usr/local/lib/python3.5/dist-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
       1722 
       1723     def __setitem__(self, key, value):
    -> 1724         raise TypeError("Index does not support mutable operations")
       1725 
       1726     def __getitem__(self, key):
    
    TypeError: Index does not support mutable operations
    
    In [40]: labels = pd.Index(np.arange(3))
    
    In [41]: labels
    Out[41]: Int64Index([0, 1, 2], dtype='int64')
    
    In [42]: obj2 = pd.Series([1.5, -2.5, 0], index=labels)
    
    In [43]: obj2
    Out[43]: 
    0    1.5
    1   -2.5
    2    0.0
    dtype: float64
    
    In [44]: obj2.index is labels
    Out[44]: True
    
    In [45]: frame3
    Out[45]: 
    state  Nevada  Ohio
    year               
    2000      NaN   1.5
    2001      2.4   1.7
    2002      2.9   3.6
    
    In [46]: frame3.columns
    Out[46]: Index(['Nevada', 'Ohio'], dtype='object', name='state')
    
    In [47]: 'Ohio' in frame3.columns
    Out[47]: True
    
    In [48]: 2003 in frame3.index
    Out[48]: False
    
    In [49]: dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
    
    In [50]: dup_labels
    Out[50]: Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
    
    

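    Index objects are immutable, so users cannot modify them; this makes it safe to share an Index among several data structures. Besides being array-like, an Index also behaves like a fixed-size set, although unlike a Python set it may contain duplicate labels.

    Index methods and properties include append, difference, intersection, union, isin, delete, drop, insert, is_monotonic (is_monotonic_increasing in pandas 2.0 and later), and unique.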
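
A short sketch of the immutability and set-like behaviour described above (the two small indexes are made up for illustration):

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

try:
    left[1] = "z"                      # item assignment is not supported
except TypeError as exc:
    error = str(exc)

common = left.intersection(right)      # labels present in both
either = left.union(right)             # labels present in at least one
only_left = left.difference(right)     # labels in left but not in right

dup = pd.Index(["foo", "foo", "bar"])  # duplicates are allowed, unlike a set

print(error)
print(common.tolist(), either.tolist(), only_left.tolist())
print(dup.is_unique, dup.unique().tolist())
```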

    Further reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html

    Essential Functionality

    This section covers the basic mechanics of working with the data in a Series or DataFrame.

    Reindexing

    #!python
    
    In [51]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
    
    In [52]: obj
    Out[52]: 
    d    4.5
    b    7.2
    a   -5.3
    c    3.6
    dtype: float64
    
    # reindex rearranges the data according to the new index; labels not already present get NaN
    
    In [53]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
    
    In [54]: obj2
    Out[54]: 
    a   -5.3
    b    7.2
    c    3.6
    d    4.5
    e    NaN
    dtype: float64
    
    In [55]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
    
    In [56]: obj3
    Out[56]: 
    0      blue
    2    purple
    4    yellow
    dtype: object
    
    # For ordered data such as time series, you may want to fill values when reindexing. The method option does this; for example, ffill forward-fills:
    
    In [57]: obj3.reindex(range(6), method='ffill')
    Out[57]: 
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
    
    # On a DataFrame, reindex can alter the rows, the columns, or both
    In [58]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
       ....: index=['a', 'c', 'd'],
       ....: columns=['Ohio', 'Texas', 'California'])
    
    In [59]: frame
    Out[59]: 
       Ohio  Texas  California
    a     0      1           2
    c     3      4           5
    d     6      7           8
    
    In [60]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])
    
    In [61]: frame2
    Out[61]: 
       Ohio  Texas  California
    a   0.0    1.0         2.0
    b   NaN    NaN         NaN
    c   3.0    4.0         5.0
    d   6.0    7.0         8.0
    
    In [62]: states = ['Texas', 'Utah', 'California']
    
    In [63]: frame.reindex(columns=states)
    Out[63]: 
       Texas  Utah  California
    a      1   NaN           2
    c      4   NaN           5
    d      7   NaN           8
    
    In [69]:  frame2 = frame.reindex(['a', 'b', 'c', 'd'],columns=states)
    
    In [70]: frame2
    Out[70]: 
       Texas  Utah  California
    a    1.0   NaN         2.0
    b    NaN   NaN         NaN
    c    4.0   NaN         5.0
    d    7.0   NaN         8.0
    
    

    Parameters of reindex include index, method, fill_value, limit, tolerance, level, and copy.

    Further reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
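
To make two of these parameters concrete, here is a small sketch contrasting the default NaN fill with `fill_value` and `method="ffill"`. The numeric Series is an assumption chosen so the results are easy to check by eye:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

with_nan = s.reindex(range(6))                   # new labels get NaN
with_zero = s.reindex(range(6), fill_value=0.0)  # new labels get 0.0 instead
forward = s.reindex(range(6), method="ffill")    # new labels copy the previous value

frame = pd.DataFrame({"Ohio": [0, 3], "Texas": [1, 4]}, index=["a", "c"])
# Rows and columns can be reindexed in a single call.
both = frame.reindex(index=["a", "b", "c"], columns=["Texas", "Utah"])

print(forward.tolist())
print(both)
```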

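    Dropping Entries from an Axis

    Dropping one or more entries from an axis is easy if you have an array or list of the labels to drop. Because it involves some data munging and set logic, the drop method returns a new object with the indicated labels removed from the given axis (pass inplace=True to modify the object itself instead):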

    #!python
    
    In [71]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
    
    In [72]: obj
    Out[72]: 
    a    0.0
    b    1.0
    c    2.0
    d    3.0
    e    4.0
    dtype: float64
    
    In [73]: new_obj = obj.drop('c')
    
    In [74]: new_obj
    Out[74]: 
    a    0.0
    b    1.0
    d    3.0
    e    4.0
    dtype: float64
    
    In [75]: obj
    Out[75]: 
    a    0.0
    b    1.0
    c    2.0
    d    3.0
    e    4.0
    dtype: float64
    
    In [76]: obj.drop(['d', 'c'])
    Out[76]: 
    a    0.0
    b    1.0
    e    4.0
    dtype: float64
    
    In [77]: obj
    Out[77]: 
    a    0.0
    b    1.0
    c    2.0
    d    3.0
    e    4.0
    dtype: float64
    
    In [78]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
       ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
       ....: columns=['one', 'two', 'three', 'four'])
    
    In [79]: data
    Out[79]: 
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    In [80]: data.drop(['Colorado', 'Ohio'])
    Out[80]: 
              one  two  three  four
    Utah        8    9     10    11
    New York   12   13     14    15
    
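    # Passing axis positionally, as here, is rejected by pandas 2.0 and later; prefer the keyword forms that follow.
    In []: data.drop('two',1)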
    Out[57]: 
              one  three  four
    Ohio        0      2     3
    Colorado    4      6     7
    Utah        8     10    11
    New York   12     14    15
    
    In []: data.drop('two', axis=1)
    Out[58]: 
              one  three  four
    Ohio        0      2     3
    Colorado    4      6     7
    Utah        8     10    11
    New York   12     14    15
    
    In []: data.drop(['two', 'four'], axis='columns')
    Out[59]: 
              one  three
    Ohio        0      2
    Colorado    4      6
    Utah        8     10
    New York   12     14
    
    In []: obj.drop('c', inplace=True)
    
    In []: obj
    Out[61]: 
    a    0.0
    b    1.0
    d    3.0
    e    4.0
    dtype: float64
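
A compact sketch of the same ideas with keyword arguments only, which should work on both older and current pandas (variable names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

fewer_rows = data.drop(["Colorado", "Ohio"])             # axis=0 (rows) is the default
fewer_cols = data.drop(["two", "four"], axis="columns")  # same as axis=1

s = pd.Series(np.arange(5.0), index=list("abcde"))
result = s.drop("c", inplace=True)   # modifies s itself and returns None

print(data.shape, fewer_rows.shape, fewer_cols.shape)
print(s.index.tolist(), result)
```

Note that `data` is untouched by the first two calls; only the `inplace=True` call changes its object.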
    
    

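    Indexing, Selection, and Filtering

    Series indexing (obj[...]) works much like NumPy array indexing, except that you can use the Series's index labels instead of only integers. Integer keys such as obj[1] fall back to positions here only because the index holds strings; recent pandas versions deprecate this fallback, so obj.iloc[1] is the safer spelling. Some examples: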

    #!python
    obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
    
    obj
    Out[63]: 
    a    0.0
    b    1.0
    c    2.0
    d    3.0
    dtype: float64
    
    obj['b']
    Out[64]: 1.0
    
    obj[1]
    Out[65]: 1.0
    
    obj[2:4]
    Out[66]: 
    c    2.0
    d    3.0
    dtype: float64
    
    obj[['b', 'a', 'd']]
    Out[67]: 
    b    1.0
    a    0.0
    d    3.0
    dtype: float64
    
    obj[[1, 3]]
    Out[68]: 
    b    1.0
    d    3.0
    dtype: float64
    
    obj[obj < 2]
    Out[69]: 
    a    0.0
    b    1.0
    dtype: float64
    
    obj['b':'c']
    Out[70]: 
    b    1.0
    c    2.0
    dtype: float64
    
    obj['b':'c'] = 5
    
    obj
    Out[72]: 
    a    0.0
    b    5.0
    c    5.0
    d    3.0
    dtype: float64
    
    

    Note that slicing with labels behaves differently from normal Python slicing: the end label is included in the result.
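
The difference is easiest to see with the two slice styles next to a plain Python list. This is a sketch with illustrative names:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = obj["b":"c"]       # label slice: both "b" and "c" are included
by_position = obj.iloc[1:2]   # positional slice: stop position 2 is excluded
as_list = list("abcd")[1:2]   # plain Python slicing, for comparison

print(by_label.index.tolist())
print(by_position.index.tolist())
print(as_list)
```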

    #!python
    
    data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four'])
    
    data
    Out[74]: 
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    data['two']
    Out[75]: 
    Ohio         1
    Colorado     5
    Utah         9
    New York    13
    Name: two, dtype: int32
    
    data[['three', 'one']]
    Out[76]: 
              three  one
    Ohio          2    0
    Colorado      6    4
    Utah         10    8
    New York     14   12
    
    data[:2]
    Out[77]: 
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    
    data[data['three'] > 5]
    Out[78]: 
              one  two  three  four
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    data < 5
    Out[79]: 
                one    two  three   four
    Ohio       True   True   True   True
    Colorado   True  False  False  False
    Utah      False  False  False  False
    New York  False  False  False  False
    
    data[data < 5] = 0
    
    data
    Out[81]: 
              one  two  three  four
    Ohio        0    0      0     0
    Colorado    0    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
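    • loc and iloc

    For label-based selection on the rows and columns of a DataFrame there are two special indexing operators: loc (axis labels) and iloc (integer positions). Both select a subset of the DataFrame.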

    #!python
    
    data.loc['Colorado', ['two', 'three']]
    Out[82]: 
    two      5
    three    6
    Name: Colorado, dtype: int32
    
    data.iloc[2, [3, 0, 1]]
    Out[83]: 
    four    11
    one      8
    two      9
    Name: Utah, dtype: int32
    
    data.iloc[2]
    Out[84]: 
    one       8
    two       9
    three    10
    four     11
    Name: Utah, dtype: int32
    
    data.iloc[[1, 2], [3, 0, 1]]
    Out[85]: 
              four  one  two
    Colorado     7    0    5
    Utah        11    8    9
    
    data.loc[:'Utah', 'two']
    Out[86]: 
    Ohio        0
    Colorado    5
    Utah        9
    Name: two, dtype: int32
    
    data.iloc[:, :3][data.three > 5]
    Out[87]: 
              one  two  three
    Colorado    0    5      6
    Utah        8    9     10
    New York   12   13     14
    

    Note that ix is deprecated and was removed in pandas 1.0; use loc or iloc instead.
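
As a self-contained sketch of loc versus iloc, the example below rebuilds the 4x4 frame without the earlier `data[data < 5] = 0` assignment, so the values differ slightly from the transcript:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

row_by_label = data.loc["Colorado", ["two", "three"]]   # one row, two columns, by label
row_by_pos = data.iloc[2, [3, 0, 1]]                    # third row, columns by position
block = data.iloc[[1, 2], [3, 0, 1]]                    # 2x3 sub-frame by position
upto_utah = data.loc[:"Utah", "two"]                    # label slice, "Utah" included
filtered = data.iloc[:, :3][data["three"] > 5]          # first 3 columns, then a row filter

print(row_by_label.tolist(), row_by_pos.tolist())
print(upto_utah.tolist())
print(filtered.index.tolist())
```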

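    Integer Indexes

    Integer indexing on pandas objects differs in some ways from indexing on built-in Python data structures such as lists and tuples. With an integer index, pandas cannot tell whether -1 is meant as a label or a position, so the first lookup in the following code raises an error; with a non-integer index there is no ambiguity. Use loc (labels) or iloc (positions) to be explicit: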

    #!python
    
    ser = pd.Series(np.arange(3.))
    
    ser[-1]
    Traceback (most recent call last):
    
      File "<ipython-input-20-3cbe0b873a9e>", line 1, in <module>
        ser[-1]
    
      File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
        result = self.index.get_value(self, key)
    
      File "C:\Users\andrew\AppData\Local\conda\conda\envs\my_root\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
        tz=getattr(series.dtype, 'tz', None))
    
      File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value
    
      File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value
    
      File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc
    
      File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item
    
      File "pandas\_libs\hashtable_class_helper.pxi", line 765, in pandas._libs.hashtable.Int64HashTable.get_item
    
    KeyError: -1
    
    ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
    
    ser2[-1]
    Out[22]: 2.0
    
    ser[:1]
    Out[23]: 
    0    0.0
    dtype: float64
    
    ser.loc[:1]
    Out[24]: 
    0    0.0
    1    1.0
    dtype: float64
    
    ser.iloc[:1]
    Out[25]: 
    0    0.0
    dtype: float64
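
A minimal sketch of the unambiguous spellings, using the same kind of default-indexed Series (variable names are illustrative):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))   # default integer index 0, 1, 2

by_label = ser.loc[:1]    # labels 0 and 1: the end label is included
by_pos = ser.iloc[:1]     # position 0 only: the stop position is excluded
last = ser.iloc[-1]       # negative positions are fine with iloc

print(len(by_label), len(by_pos), last)
```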
    

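    Arithmetic and Data Alignment

    pandas can perform arithmetic between objects with different indexes. The result's index is the union of the two indexes, similar to an automatic outer join on the labels in a database; labels present in only one object yield NaN: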

    #!python
    
    s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
    
    s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
    
    s1
    Out[28]: 
    a    7.3
    c   -2.5
    d    3.4
    e    1.5
    dtype: float64
    
    s2
    Out[29]: 
    a   -2.1
    c    3.6
    e   -1.5
    f    4.0
    g    3.1
    dtype: float64
    
    s1 + s2
    Out[30]: 
    a    5.2
    c    1.1
    d    NaN
    e    0.0
    f    NaN
    g    NaN
    dtype: float64
    
    df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
    index=['Ohio', 'Texas', 'Colorado'])
    
    df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    
    df1
    Out[33]: 
                b    c    d
    Ohio      0.0  1.0  2.0
    Texas     3.0  4.0  5.0
    Colorado  6.0  7.0  8.0
    
    df2
    Out[34]: 
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    df1 + df2
    Out[35]: 
                b   c     d   e
    Colorado  NaN NaN   NaN NaN
    Ohio      3.0 NaN   6.0 NaN
    Oregon    NaN NaN   NaN NaN
    Texas     9.0 NaN  12.0 NaN
    Utah      NaN NaN   NaN NaN
    
    df1 = pd.DataFrame({'A': [1, 2]})
    
    df2 = pd.DataFrame({'B': [3, 4]})
    
    df1
    Out[38]: 
       A
    0  1
    1  2
    
    df2
    Out[39]: 
       B
    0  3
    1  4
    
    df1 - df2
    Out[40]: 
        A   B
    0 NaN NaN
    1 NaN NaN
    

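    Arithmetic methods with fill values: passing fill_value substitutes that value for entries that are missing in one of the objects: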

    #!python
    
    df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
    columns=list('abcd'))
    
    df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
    columns=list('abcde'))
    
    df1
    Out[43]: 
         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0
    
    df2
    Out[44]: 
          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   6.0   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    df2.loc[1, 'b'] = np.nan
    
    df2
    Out[46]: 
          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   NaN   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    df1 + df2
    Out[47]: 
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0   NaN  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
    
    df1.add(df2, fill_value=0)
    Out[48]: 
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0   5.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    1 / df1
    Out[49]: 
              a         b         c         d
    0       inf  1.000000  0.500000  0.333333
    1  0.250000  0.200000  0.166667  0.142857
    2  0.125000  0.111111  0.100000  0.090909
    
    df1.rdiv(1)
    Out[50]: 
              a         b         c         d
    0       inf  1.000000  0.500000  0.333333
    1  0.250000  0.200000  0.166667  0.142857
    2  0.125000  0.111111  0.100000  0.090909
    
    
    df1.reindex(columns=df2.columns, fill_value=0)
    Out[53]: 
         a    b     c     d  e
    0  0.0  1.0   2.0   3.0  0
    1  4.0  5.0   6.0   7.0  0
    2  8.0  9.0  10.0  11.0  0
    
    
    Method               Description
    add, radd            addition (+)
    sub, rsub            subtraction (-)
    div, rdiv            division (/)
    floordiv, rfloordiv  floor division (//)
    mul, rmul            multiplication (*)
    pow, rpow            exponentiation (**)
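
The sketch below contrasts the plain operator with the method form on two tiny frames (sizes and names are assumptions chosen to keep the arithmetic checkable by hand):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.0).reshape((2, 2)), columns=list("ab"))
df2 = pd.DataFrame(np.arange(9.0).reshape((3, 3)), columns=list("abc"))

plain = df1 + df2                     # cells missing on either side become NaN
filled = df1.add(df2, fill_value=0)   # a cell missing on one side is treated as 0
recip = df1.rdiv(1)                   # reversed division, same as 1 / df1

print(plain)
print(filled)
```

If a cell is missing in both operands, `fill_value` does not help and the result stays NaN.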
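
    • Operations between DataFrame and Series

    By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and broadcasts down the rows. To match on the row index instead and broadcast across the columns, use one of the arithmetic methods with axis='index' (or axis=0).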

    #!python
    
    arr = np.arange(12.).reshape((3, 4))
    
    arr
    Out[55]: 
    array([[  0.,   1.,   2.,   3.],
           [  4.,   5.,   6.,   7.],
           [  8.,   9.,  10.,  11.]])
    
    arr[0]
    Out[56]: array([ 0.,  1.,  2.,  3.])
    
    arr - arr[0]
    Out[57]: 
    array([[ 0.,  0.,  0.,  0.],
           [ 4.,  4.,  4.,  4.],
           [ 8.,  8.,  8.,  8.]])
    
    arr
    Out[58]: 
    array([[  0.,   1.,   2.,   3.],
           [  4.,   5.,   6.,   7.],
           [  8.,   9.,  10.,  11.]])
    
    frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
    columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    
    series = frame.iloc[0]
    
    frame
    Out[61]: 
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    series
    Out[62]: 
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    
    frame - series
    Out[63]: 
              b    d    e
    Utah    0.0  0.0  0.0
    Ohio    3.0  3.0  3.0
    Texas   6.0  6.0  6.0
    Oregon  9.0  9.0  9.0
    
    series2 = pd.Series(range(3), index=['b', 'e', 'f'])
    
    series2
    Out[65]: 
    b    0
    e    1
    f    2
    dtype: int32
    
    frame + series2
    Out[66]: 
              b   d     e   f
    Utah    0.0 NaN   3.0 NaN
    Ohio    3.0 NaN   6.0 NaN
    Texas   6.0 NaN   9.0 NaN
    Oregon  9.0 NaN  12.0 NaN
    
    
    series3 = frame['d']
    
    frame
    Out[69]: 
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    Oregon  9.0  10.0  11.0
    
    series3
    Out[70]: 
    Utah       1.0
    Ohio       4.0
    Texas      7.0
    Oregon    10.0
    Name: d, dtype: float64
    
    frame.sub(series3, axis='index')
    Out[71]: 
              b    d    e
    Utah   -1.0  0.0  1.0
    Ohio   -1.0  0.0  1.0
    Texas  -1.0  0.0  1.0
    Oregon -1.0  0.0  1.0
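
Both broadcasting directions can be sketched on the same frame as follows (the variable names `row`, `col`, `by_row`, and `by_col` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

row = frame.iloc[0]   # Series indexed by the column labels b, d, e
col = frame["d"]      # Series indexed by the row labels

by_row = frame - row                    # row is subtracted from every row
by_col = frame.sub(col, axis="index")   # col is subtracted from every column

print(by_row)
print(by_col)
```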
    

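    Function Application and Mapping

    NumPy ufuncs (element-wise array methods) also work on pandas objects.

    Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. The DataFrame apply method does this; by default the function receives each column, and with axis='columns' it receives each row.

    Many of the most common array statistics (such as sum and mean) are DataFrame methods, so apply is not needed for them. The function passed to apply does not have to return a scalar; it can also return a Series holding several values.

    Element-wise Python functions can be used as well. To get a formatted string for each floating-point value in frame, use applymap (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated).

    The name applymap comes from the fact that Series has a map method for applying an element-wise function: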

    #!python
    
    frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    
    frame
    Out[77]: 
                   b         d         e
    Utah    0.255395  1.983985  0.936326
    Ohio    0.319394  2.231544 -0.051256
    Texas  -0.041388 -0.026032 -0.446722
    Oregon  1.099475 -1.432638 -0.919189
    
    np.abs(frame)
    Out[78]: 
                   b         d         e
    Utah    0.255395  1.983985  0.936326
    Ohio    0.319394  2.231544  0.051256
    Texas   0.041388  0.026032  0.446722
    Oregon  1.099475  1.432638  0.919189
    
    f = lambda x: x.max() - x.min()
    
    frame.apply(f)
    Out[80]: 
    b    1.140863
    d    3.664181
    e    1.855515
    dtype: float64
    
    frame.apply(f, axis='columns')
    Out[81]: 
    Utah      1.728590
    Ohio      2.282800
    Texas     0.420690
    Oregon    2.532113
    dtype: float64
    
    def f(x):
        return pd.Series([x.min(), x.max()], index=['min', 'max'])
    
    
    frame.apply(f)
    Out[83]: 
                b         d         e
    min -0.041388 -1.432638 -0.919189
    max  1.099475  2.231544  0.936326
    
    format = lambda x: '%.2f' % x
    
    frame.applymap(format)
    Out[85]: 
                b      d      e
    Utah     0.26   1.98   0.94
    Ohio     0.32   2.23  -0.05
    Texas   -0.04  -0.03  -0.45
    Oregon   1.10  -1.43  -0.92
    
    frame['e'].map(format)
    Out[86]: 
    Utah       0.94
    Ohio      -0.05
    Texas     -0.45
    Oregon    -0.92
    Name: e, dtype: object
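
Because the transcript above uses random data, here is a deterministic sketch of the same calls. The frame values and the helper names `spread` and `min_max` are made up for illustration. It uses Series.map for the element-wise step, since that method exists in both old and new pandas versions:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.0, -2.0], [3.5, 0.25], [-4.0, 6.0]],
    columns=["b", "d"],
    index=["Utah", "Ohio", "Texas"],
)

def spread(x):
    """Range of a one-dimensional Series: max minus min."""
    return x.max() - x.min()

def min_max(x):
    """Return two statistics at once as a labelled Series."""
    return pd.Series([x.min(), x.max()], index=["min", "max"])

per_column = frame.apply(spread)                   # x is each column
per_row = frame.apply(spread, axis="columns")      # x is each row
summary = frame.apply(min_max)                     # DataFrame with rows "min" and "max"
formatted = frame["d"].map(lambda v: "%.2f" % v)   # element-wise on a Series
absolute = np.abs(frame)                           # ufunc applied element-wise

print(per_column.tolist(), per_row.tolist())
print(summary)
print(formatted.tolist())
```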
    
    

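    Sorting and Ranking

    Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.

    With a DataFrame, you can sort by the index on either axis: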

    #!python
    
    obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
    
    obj.sort_index()
    Out[88]: 
    a    1
    b    2
    c    3
    d    0
    dtype: int32
    
    frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
    index=['three', 'one'],columns=['d', 'a', 'b', 'c'])
    
    frame
    Out[90]: 
           d  a  b  c
    three  0  1  2  3
    one    4  5  6  7
    
    frame.sort_index()
    Out[91]: 
           d  a  b  c
    one    4  5  6  7
    three  0  1  2  3
    
    frame.sort_index(axis='columns')
    Out[94]: 
           a  b  c  d
    three  1  2  3  0
    one    5  6  7  4
    
    

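    Data is sorted in ascending order by default, but it can be sorted in descending order too. To sort a Series by its values, use its sort_values method. Any missing values are sorted to the end of the Series by default.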

    #!python
    
    frame.sort_index(axis='columns', ascending=False)
    Out[95]: 
           d  c  b  a
    three  0  3  2  1
    one    4  7  6  5
    
    obj = pd.Series([4, 7, -3, 2])
    
    obj.sort_values()
    Out[97]: 
    2   -3
    3    2
    0    4
    1    7
    dtype: int64
    
    obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
    
    obj.sort_values()
    Out[99]: 
    4   -3.0
    5    2.0
    0    4.0
    2    7.0
    1    NaN
    3    NaN
    dtype: float64
    
    obj.sort_values(ascending=False)
    Out[100]: 
    2    7.0
    0    4.0
    5    2.0
    4   -3.0
    1    NaN
    3    NaN
    dtype: float64
    

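    With a DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option of sort_values. To sort by multiple columns, pass a list of names: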

    #!python
    
    frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
    
    frame
    Out[102]: 
       a  b
    0  0  4
    1  1  7
    2  0 -3
    3  1  2
    
    frame.sort_values(by='b')
    Out[103]: 
       a  b
    2  0 -3
    3  1  2
    0  0  4
    1  1  7
    
    frame.sort_values(by=['a', 'b'])
    Out[104]: 
       a  b
    2  0 -3
    0  0  4
    3  1  2
    1  1  7
    

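    Ranking is closely related to sorting: it assigns ranks from 1 up to the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank methods of Series and DataFrame are shown next. By default, rank breaks ties by assigning each group the mean rank.

    Ranks can also be assigned according to the order in which the values appear in the data (method='first').

    You can rank in descending order, too: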

    #!python
    
    obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
    
    obj.rank()
    Out[106]: 
    0    6.5
    1    1.0
    2    6.5
    3    4.5
    4    3.0
    5    2.0
    6    4.5
    dtype: float64
    
    obj.rank(method='first')
    Out[107]: 
    0    6.0
    1    1.0
    2    7.0
    3    4.0
    4    3.0
    5    2.0
    6    5.0
    dtype: float64
    
    obj.rank(ascending=False, method='max')
    Out[108]: 
    0    2.0
    1    7.0
    2    2.0
    3    4.0
    4    5.0
    5    6.0
    6    4.0
    dtype: float64
    
    frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
    'c': [-2, 5, 8, -2.5]})
    
    frame
    Out[110]: 
       a    b    c
    0  0  4.3 -2.0
    1  1  7.0  5.0
    2  0 -3.0  8.0
    3  1  2.0 -2.5
    
    frame.rank(axis='columns')
    Out[111]: 
         a    b    c
    0  2.0  3.0  1.0
    1  1.0  3.0  2.0
    2  2.0  1.0  3.0
    3  2.0  3.0  1.0
    
    
    Method      Description
    'average'   Default: assign the average rank to each entry in the equal group
    'min'       Use the minimum rank for the whole group
    'max'       Use the maximum rank for the whole group
    'first'     Assign ranks in the order the values appear in the data
    'dense'     Like method='min', but ranks always increase by 1 between groups rather than by the number of equal elements in a group
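
    The session above does not show 'min' or 'dense'. The following is a minimal sketch comparing them on the same Series; the variable names min_ranks and dense_ranks are only illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min' gives every tied value the lowest rank of its group, so the
# rank after a tie skips ahead by the size of the tied group.
min_ranks = obj.rank(method='min')

# 'dense' also uses the lowest rank, but the next group always gets
# the next integer, so no rank is skipped.
dense_ranks = obj.rank(method='dense')

print(pd.DataFrame({'value': obj, 'min': min_ranks, 'dense': dense_ranks}))
```

    With 'min' the two 7s share rank 6 and rank 5 is never used; with 'dense' they share rank 5.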

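    Axis Indexes with Duplicate Labels

    Up to now, all of the examples I have shown have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it is not mandatory.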

    #!python
    
    import pandas as pd
    
    obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
    
    obj
    Out[3]: 
    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int32
    
    obj.index.is_unique
    Out[4]: False
    
    obj['a']
    Out[5]: 
    a    0
    a    1
    dtype: int32
    
    obj['c']
    Out[6]: 4
    
    import numpy as np
    
    df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
    
    df
    Out[10]: 
              0         1         2
    a  0.835470  0.465657 -0.068212
    a -1.067020  1.148283  1.722324
    b  0.057184 -0.441111 -0.388286
    b -0.363911 -0.599963  0.126594
    
    df.loc['b']
    Out[11]: 
              0         1         2
    b  0.057184 -0.441111 -0.388286
    b -0.363911 -0.599963  0.126594
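
    Because selecting a label may return either a scalar or a Series, code that consumes such data often normalizes the index first. The following is a small sketch of two common approaches, not the only ones; the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# Wrapping the label in a list always returns a Series,
# no matter how many rows carry that label.
always_series = obj.loc[['c']]

# Option 1: collapse duplicates by aggregating rows that share a label.
collapsed = obj.groupby(level=0).sum()

# Option 2: keep only the first row seen for each label.
first_only = obj[~obj.index.duplicated(keep='first')]

print(collapsed)
print(first_only)
```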
    

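    Summarizing and Computing Descriptive Statistics

    pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Compared with the equivalent NumPy array methods, they have built-in handling for missing data: NA values are excluded by default, and passing skipna=False disables this. Consider a small DataFrame: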

    #!python
    
    df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], 
    index=['a', 'b', 'c', 'd'],columns=['one', 'two'])
    
    df
    Out[14]: 
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3
    
    df.sum()
    Out[15]: 
    one    9.25
    two   -5.80
    dtype: float64
    
    df.sum(axis='columns')
    Out[16]: 
    a    1.40
    b    2.60
    c    0.00
    d   -0.55
    dtype: float64
    
    df.mean(axis='columns', skipna=False)
    Out[17]: 
    a      NaN
    b    1.300
    c      NaN
    d   -0.275
    dtype: float64
    
    df.mean(axis='columns')
    Out[18]: 
    a    1.400
    b    1.300
    c      NaN
    d   -0.275
    dtype: float64
    
    
    Argument   Description
    axis       Axis to reduce over; 0 for DataFrame's rows and 1 for columns
    skipna     Exclude missing values; True by default
    level      Reduce grouped by level if the axis is hierarchically indexed (MultiIndex)

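    Some methods, like idxmin and idxmax, return indirect statistics, such as the index label where the minimum or maximum value is attained. cumsum computes cumulative sums, and describe produces multiple summary statistics in one shot.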

    #!python
    
    df
    Out[19]: 
        one  two
    a  1.40  NaN
    b  7.10 -4.5
    c   NaN  NaN
    d  0.75 -1.3
    
    df.idxmax()
    Out[20]: 
    one    b
    two    d
    dtype: object
    
    df.cumsum()
    Out[21]: 
        one  two
    a  1.40  NaN
    b  8.50 -4.5
    c   NaN  NaN
    d  9.25 -5.8
    
    df.describe()
    Out[22]: 
                one       two
    count  3.000000  2.000000
    mean   3.083333 -2.900000
    std    3.493685  2.262742
    min    0.750000 -4.500000
    25%    1.075000 -3.700000
    50%    1.400000 -2.900000
    75%    4.250000 -2.100000
    max    7.100000 -1.300000
    
    obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
    
    obj.describe()
    Out[24]: 
    count     16
    unique     3
    top        a
    freq       8
    dtype: object
    
    
    Method            Description
    count             Number of non-NA values
    describe          Compute a set of summary statistics for a Series or each DataFrame column
    min, max          Compute minimum and maximum values
    argmin, argmax    Compute index locations (integers) at which the minimum or maximum value is obtained, respectively
    idxmin, idxmax    Compute index labels at which the minimum or maximum value is obtained, respectively
    quantile          Compute sample quantile ranging from 0 to 1
    sum               Sum of values
    mean              Mean of values
    median            Arithmetic median (50% quantile) of values
    mad               Mean absolute deviation from mean value
    prod              Product of all values
    var               Sample variance of values
    std               Sample standard deviation of values
    skew              Sample skewness (third moment) of values
    kurt              Sample kurtosis (fourth moment) of values
    cumsum            Cumulative sum of values
    cummin, cummax    Cumulative minimum or maximum of values, respectively
    cumprod           Cumulative product of values
    diff              Compute first arithmetic difference (useful for time series)
    pct_change        Compute percent changes
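
    A few of the methods in this table are not exercised in the sessions above. The sketch below tries some of them on a small made-up Series (the name prices and its values are hypothetical) to show how NaN is treated.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
prices = pd.Series([10.0, 11.0, np.nan, 12.1, 11.0], index=list('abcde'))

n_valid = prices.count()        # NaN is not counted
median = prices.median()        # NaN is skipped by default
q90 = prices.quantile(0.9)      # sample quantile of the non-NA values
where_max = prices.idxmax()     # index label of the maximum
running_max = prices.cummax()   # NaN stays NaN at its own position
step = prices.diff()            # difference from the previous row

print(n_valid, median, q90, where_max)
print(running_max)
print(step)
```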

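    Correlation and Covariance

    Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Consider some DataFrames of stock prices and volumes from Yahoo! Finance, obtained with the add-on pandas-datareader package.

    Omitted for now.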
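
    Since the stock-data example is omitted here, the following is only a sketch on synthetic data. The column names AAA, BBB, and CCC and the random "returns" are made up for illustration and do not come from Yahoo! Finance.

```python
import numpy as np
import pandas as pd

# Synthetic daily "returns"; BBB is partly driven by AAA, CCC is pure noise.
rng = np.random.default_rng(0)
base = rng.standard_normal(250)
returns = pd.DataFrame({
    'AAA': base,
    'BBB': 0.5 * base + rng.standard_normal(250),
    'CCC': rng.standard_normal(250),
})

# Pairwise statistics between two Series.
pair_corr = returns['AAA'].corr(returns['BBB'])
pair_cov = returns['AAA'].cov(returns['BBB'])

# Full correlation and covariance matrices for a DataFrame.
corr_matrix = returns.corr()
cov_matrix = returns.cov()

# Correlation of every column with one chosen Series.
vs_aaa = returns.corrwith(returns['AAA'])

print(corr_matrix)
print(vs_aaa)
```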

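    Unique Values, Value Counts, and Membership

    Another class of related methods extracts information about the values contained in a one-dimensional Series. Consider this example: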

    #!python
    
    import pandas as pd
    
    obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
    
    uniques = obj.unique()
    
    uniques
    Out[9]: array(['c', 'a', 'd', 'b'], dtype=object)
    
    obj.value_counts()
    Out[10]: 
    a    3
    c    3
    b    2
    d    1
    dtype: int64
    
    pd.value_counts(obj.values, sort=False)
    Out[11]: 
    c    3
    d    1
    b    2
    a    3
    dtype: int64
    
    obj
    Out[12]: 
    0    c
    1    a
    2    d
    3    a
    4    a
    5    b
    6    b
    7    c
    8    c
    dtype: object
    
    mask = obj.isin(['b', 'c'])
    
    mask
    Out[14]: 
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6     True
    7     True
    8     True
    dtype: bool
    
    obj[mask]
    Out[15]: 
    0    c
    5    b
    6    b
    7    c
    8    c
    dtype: object
    
    to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])
    
    unique_vals = pd.Series(['c', 'b', 'a'])
    
    pd.Index(unique_vals).get_indexer(to_match)
    Out[18]: array([0, 2, 1, 1, 0, 2], dtype=int64)
    
    data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
    
    data
    Out[20]: 
       Qu1  Qu2  Qu3
    0    1    2    1
    1    3    3    5
    2    4    1    2
    3    3    2    4
    4    4    3    4
    
    result = data.apply(pd.value_counts).fillna(0)
    
    result
    Out[22]: 
       Qu1  Qu2  Qu3
    1  1.0  1.0  1.0
    2  0.0  2.0  1.0
    3  2.0  2.0  0.0
    4  2.0  0.0  2.0
    5  0.0  0.0  1.0
    
    Method         Description
    isin           Compute a boolean array indicating whether each Series value is contained in the passed sequence of values
    match          Compute integer indices for each value in an array into another array of distinct values; helpful for data alignment and join-type operations (the example above uses Index.get_indexer for this)
    unique         Compute an array of unique values in a Series, returned in the order observed
    value_counts   Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order

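    Data Cleaning and Preparation

    During data analysis and modeling, a significant amount of time (often quoted as 80% or more) is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in files or databases is not in the right format for a particular task.

    In this chapter, I discuss missing data, duplicate data, string manipulation, and some other analytical data transformations. In the next chapter, I focus on combining and rearranging datasets in various ways.

    Handling Missing Data

    For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data.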

    #!python
    
    In [1]: import numpy as np
    
    In [2]: import pandas as pd
    
    In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
    
    In [4]: string_data
    Out[4]: 
    0     aardvark
    1    artichoke
    2          NaN
    3      avocado
    dtype: object
    
    In [5]: string_data.isnull()
    Out[5]: 
    0    False
    1    False
    2     True
    3    False
    dtype: bool
    
    In [6]: string_data[0] = None
    
    In [7]: string_data.isnull()
    Out[7]: 
    0     True
    1    False
    2     True
    3    False
    dtype: bool
    
    

    Missing data is referred to as NA (not available). The built-in Python None value is also treated as NA, as the example above shows.

    NA handling methods

    Method    Description
    dropna    Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
    fillna    Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
    isnull    Return boolean values indicating which values are missing/NA.
    notnull   Negation of isnull.

    #!python
    
    In [8]: from numpy import nan as NA
    
    In [9]: data = pd.Series([1, NA, 3.5, NA, 7])
    
    In [10]: data.dropna()
    Out[10]: 
    0    1.0
    2    3.5
    4    7.0
    dtype: float64
    
    In [11]: data[data.notnull()]
    Out[11]: 
    0    1.0
    2    3.5
    4    7.0
    dtype: float64
    
    In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
       ....: [NA, NA, NA], [NA, 6.5, 3.]])
    
    In [13]: cleaned = data.dropna()
    
    In [14]: data
    Out[14]: 
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
    
    In [15]: cleaned
    Out[15]: 
         0    1    2
    0  1.0  6.5  3.0
    
    In [16]: data.dropna(how='all')
    Out[16]: 
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    3  NaN  6.5  3.0
    
    In [17]: data[4] = NA
    
    In [18]: data
    Out[18]: 
         0    1    2   4
    0  1.0  6.5  3.0 NaN
    1  1.0  NaN  NaN NaN
    2  NaN  NaN  NaN NaN
    3  NaN  6.5  3.0 NaN
    
    In [19]: data.dropna(axis='columns', how='all')
    Out[19]: 
         0    1    2
    0  1.0  6.5  3.0
    1  1.0  NaN  NaN
    2  NaN  NaN  NaN
    3  NaN  6.5  3.0
    
    

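    Passing how='all' drops only the rows (or, with axis='columns', the columns) in which every value is NaN. The thresh argument sets the minimum number of non-NA values a row must contain to be kept.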

    #!python
    
    In [21]: df = pd.DataFrame(np.random.randn(7, 3))
    
    In [22]: df.iloc[:4, 1] = NA
    
    In [23]: df.iloc[:2, 2] = NA
    
    In [24]: df
    Out[24]: 
              0         1         2
    0 -0.843340       NaN       NaN
    1 -1.305941       NaN       NaN
    2  1.026378       NaN  2.176567
    3  0.048885       NaN  0.012649
    4  0.591212 -0.739625  1.017533
    5  0.633873 -0.124162 -0.823495
    6 -1.537827  0.802565  0.359058
    
    In [25]: df.dropna()
    Out[25]: 
              0         1         2
    4  0.591212 -0.739625  1.017533
    5  0.633873 -0.124162 -0.823495
    6 -1.537827  0.802565  0.359058
    
    In [26]: df.dropna(thresh=2)
    Out[26]: 
              0         1         2
    2  1.026378       NaN  2.176567
    3  0.048885       NaN  0.012649
    4  0.591212 -0.739625  1.017533
    5  0.633873 -0.124162 -0.823495
    6 -1.537827  0.802565  0.359058
    
    

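    fillna fills in missing values. You can fill with a different value for each column, propagate the previous row's value forward, fill with the mean, and so on.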

    #!python
    
    In [27]: df.fillna(0)
    Out[27]: 
              0         1         2
    0 -0.843340  0.000000  0.000000
    1 -1.305941  0.000000  0.000000
    2  1.026378  0.000000  2.176567
    3  0.048885  0.000000  0.012649
    4  0.591212 -0.739625  1.017533
    5  0.633873 -0.124162 -0.823495
    6 -1.537827  0.802565  0.359058
    
    In [28]: df.fillna({1: 0.5, 2: 0})
    Out[28]: 
              0         1         2
    0 -0.843340  0.500000  0.000000
    1 -1.305941  0.500000  0.000000
    2  1.026378  0.500000  2.176567
    3  0.048885  0.500000  0.012649
    4  0.591212 -0.739625  1.017533
    5  0.633873 -0.124162 -0.823495
    6 -1.537827  0.802565  0.359058
    
    In [29]: _ = df.fillna(0, inplace=True)
    
    In [30]: df
    Out[30]: 
              0         1         2
    0 -0.843340  0.000000  0.000000
    1 -1.305941  0.000000  0.000000
    2  1.026378  0.000000  2.176567
    3  0.048885  0.000000  0.012649
    4  0.591212 -0.739625  1.017533
    5  0.633873 -0.124162 -0.823495
    6 -1.537827  0.802565  0.359058
    
    In [31]: df = pd.DataFrame(np.random.randn(6, 3))
    
    In [32]: df.iloc[2:, 1] = NA
    
    In [33]: df.iloc[4:, 2] = NA
    
    In [34]: df
    Out[34]: 
              0         1         2
    0 -0.081265 -0.820770 -0.746845
    1  1.150648  0.977842  0.861825
    2  1.823679       NaN  1.272047
    3  0.293133       NaN  0.273399
    4  0.235116       NaN       NaN
    5  1.365186       NaN       NaN
    
    In [35]: df.fillna(method='ffill')
    Out[35]: 
              0         1         2
    0 -0.081265 -0.820770 -0.746845
    1  1.150648  0.977842  0.861825
    2  1.823679  0.977842  1.272047
    3  0.293133  0.977842  0.273399
    4  0.235116  0.977842  0.273399
    5  1.365186  0.977842  0.273399
    
    In [36]: df.fillna(method='ffill', limit=2)
    Out[36]: 
              0         1         2
    0 -0.081265 -0.820770 -0.746845
    1  1.150648  0.977842  0.861825
    2  1.823679  0.977842  1.272047
    3  0.293133  0.977842  0.273399
    4  0.235116       NaN  0.273399
    5  1.365186       NaN  0.273399
    
    In [37]: data = pd.Series([1., NA, 3.5, NA, 7])
    
    In [38]: data.fillna(data.mean())
    Out[38]: 
    0    1.000000
    1    3.833333
    2    3.500000
    3    3.833333
    4    7.000000
    dtype: float64
    
    
    Argument   Description
    value      Scalar value or dict-like object to use to fill missing values
    method     Interpolation method; by default 'ffill' if the function is called with no other arguments
    axis       Axis to fill on; default axis=0
    inplace    Modify the calling object without producing a copy
    limit      For forward and backward filling, maximum number of consecutive periods to fill
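
    Two variations that the session above does not show are sketched here on a small made-up DataFrame: filling each column with its own mean, and forward filling through the ffill method. Recent pandas releases emit a deprecation warning for fillna(method=...), so ffill is used as the assumed replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan],
                   'y': [np.nan, 2.0, np.nan, 6.0]})

# Passing a Series fills each column with the value stored under the
# matching label, so df.mean() fills every column with its own mean.
by_mean = df.fillna(df.mean())

# Forward fill, at most one consecutive NaN per gap.
# The leading NaN in 'y' has nothing before it and stays NaN.
forward = df.ffill(limit=1)

print(by_mean)
print(forward)
```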

    Data Transformation

    Removing Duplicates

    #!python
    
    In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
       ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})
    
    In [40]: data
    Out[40]: 
        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
    6  two   4
    
    In [41]: data.duplicated()
    Out[41]: 
    0    False
    1    False
    2    False
    3    False
    4    False
    5    False
    6     True
    dtype: bool
    
    In [42]: data.drop_duplicates()
    Out[42]: 
        k1  k2
    0  one   1
    1  two   1
    2  one   2
    3  two   3
    4  one   3
    5  two   4
    
    In [43]: data['v1'] = range(7)
    
    In [44]: data.drop_duplicates(['k1'])
    Out[44]: 
        k1  k2  v1
    0  one   1   0
    1  two   1   1
    
    In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
    Out[45]: 
        k1  k2  v1
    0  one   1   0
    1  two   1   1
    2  one   2   2
    3  two   3   3
    4  one   3   4
    6  two   4   6
    
    

    Transforming Data Using a Function or Mapping

    #!python
    
    import numpy as np
    
    import pandas as pd
    
    data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
    'Pastrami', 'corned beef', 'Bacon',
    'pastrami', 'honey ham', 'nova lox'],
    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
    
    data
    Out[5]: 
              food  ounces
    0        bacon     4.0
    1  pulled pork     3.0
    2        bacon    12.0
    3     Pastrami     6.0
    4  corned beef     7.5
    5        Bacon     8.0
    6     pastrami     3.0
    7    honey ham     5.0
    8     nova lox     6.0
    
    meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
    }
    
    lowercased = data['food'].str.lower()
    
    lowercased
    Out[8]: 
    0          bacon
    1    pulled pork
    2          bacon
    3       pastrami
    4    corned beef
    5          bacon
    6       pastrami
    7      honey ham
    8       nova lox
    Name: food, dtype: object
    
    data['animal'] = lowercased.map(meat_to_animal)
    
    data
    Out[10]: 
              food  ounces  animal
    0        bacon     4.0     pig
    1  pulled pork     3.0     pig
    2        bacon    12.0     pig
    3     Pastrami     6.0     cow
    4  corned beef     7.5     cow
    5        Bacon     8.0     pig
    6     pastrami     3.0     cow
    7    honey ham     5.0     pig
    8     nova lox     6.0  salmon
    
    data['food'].map(lambda x: meat_to_animal[x.lower()])
    Out[11]: 
    0       pig
    1       pig
    2       pig
    3       cow
    4       cow
    5       pig
    6       cow
    7       pig
    8    salmon
    Name: food, dtype: object
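
    The dict form and the function form of map behave differently when a value has no mapping. The sketch below uses a made-up value ('tofu') that is missing from a shortened version of the dict to show the difference.

```python
import pandas as pd

food = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}

lowercased = food.str.lower()

# With a dict, values without a mapping become NaN instead of raising.
via_dict = lowercased.map(meat_to_animal)

# A function that indexes the dict directly would raise KeyError for
# 'tofu'; dict.get with a default value avoids that.
via_func = lowercased.map(lambda x: meat_to_animal.get(x, 'unknown'))

print(via_dict)
print(via_func)
```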
    

    Replacing Values

    #!python
    
    In [2]: import pandas as pd
    
    In [3]: import numpy as np
    
    In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])
    
    In [5]: data
    Out[5]: 
    0       1.0
    1    -999.0
    2       2.0
    3    -999.0
    4   -1000.0
    5       3.0
    dtype: float64
    
    In [6]: data.replace(-999, np.nan)
    Out[6]: 
    0       1.0
    1       NaN
    2       2.0
    3       NaN
    4   -1000.0
    5       3.0
    dtype: float64
    
    In [7]: data.replace([-999, -1000], np.nan)
    Out[7]: 
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    NaN
    5    3.0
    dtype: float64
    
    In [8]: data.replace([-999, -1000], [np.nan, 0])
    Out[8]: 
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    0.0
    5    3.0
    dtype: float64
    
    In [9]: data.replace({-999: np.nan, -1000: 0})
    Out[9]: 
    0    1.0
    1    NaN
    2    2.0
    3    NaN
    4    0.0
    5    3.0
    dtype: float64
    
    

    Renaming Indexes and Columns

    #!python
    
    In [2]: import pandas as pd
    
    In [3]: import numpy as np
    
    
    
    In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
       ....: index=['Ohio', 'Colorado', 'New York'],
       ....: columns=['one', 'two', 'three', 'four'])
    
    In [11]: data
    Out[11]: 
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    New York    8    9     10    11
    
    In [5]: data.replace(4, 40)
    Out[5]: 
              one  two  three  four
    Ohio        0    1      2     3
    Colorado   40    5      6     7
    New York    8    9     10    11
    
    
    In [12]: transform = lambda x: x[:4].upper()
    
    In [13]: data.index.map(transform)
    Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')
    
    In [14]: data
    Out[14]: 
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    New York    8    9     10    11
    
    In [15]: data.index = data.index.map(transform)
    
    In [16]: data
    Out[16]: 
          one  two  three  four
    OHIO    0    1      2     3
    COLO    4    5      6     7
    NEW     8    9     10    11
    
    In [17]: data.rename(index=str.title, columns=str.upper)
    Out[17]: 
          ONE  TWO  THREE  FOUR
    Ohio    0    1      2     3
    Colo    4    5      6     7
    New     8    9     10    11
    
    In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
    Out[18]: 
             one  two  peekaboo  four
    INDIANA    0    1         2     3
    COLO       4    5         6     7
    NEW        8    9        10    11
    
    In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
    
    In [20]: data
    Out[20]: 
             one  two  three  four
    INDIANA    0    1      2     3
    COLO       4    5      6     7
    NEW        8    9     10    11
    
    

    Discretization and Binning

    Omitted for now.
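
    Although this part is omitted, a minimal sketch of the idea may still help. It assumes pandas.cut for fixed bin edges and pandas.qcut for quantile-based bins; the ages, bin edges, and labels are illustrative values, not data from this article.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# cut assigns each value to an interval; intervals are closed on the
# right by default, so 25 falls into (18, 25].
cats = pd.cut(ages, bins)
counts = pd.Series(cats).value_counts().sort_index()

# qcut chooses the edges from sample quantiles, giving bins of
# roughly equal size.
quartiles = pd.qcut(ages, 4, labels=['q1', 'q2', 'q3', 'q4'])
quartile_counts = pd.Series(quartiles).value_counts().sort_index()

print(counts)
print(quartile_counts)
```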

    String Manipulation

    #!python
    
    In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
       ...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}
    
    In [8]: data = pd.Series(data)
    
    In [9]: data
    Out[9]: 
    Dave     dave@google.com
    Rob        rob@gmail.com
    Steve    steve@gmail.com
    Wes                  NaN
    dtype: object
    
    In [10]: data.isnull()
    Out[10]: 
    Dave     False
    Rob      False
    Steve    False
    Wes       True
    dtype: bool
    
    In [11]: data.str.contains('gmail')
    Out[11]: 
    Dave     False
    Rob       True
    Steve     True
    Wes        NaN
    dtype: object
    
    In [13]: import re
    
    In [14]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
    
    In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
    Out[15]: 
    Dave     [(dave, google, com)]
    Rob        [(rob, gmail, com)]
    Steve    [(steve, gmail, com)]
    Wes                        NaN
    dtype: object
    
    In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)
    
    In [17]: matches
    Out[17]: 
    Dave     True
    Rob      True
    Steve    True
    Wes       NaN
    dtype: object
    
    In [18]: first = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
    
    In [19]: first.str.get(1)
    Out[19]: 
    Dave     google
    Rob       gmail
    Steve     gmail
    Wes         NaN
    dtype: object
    
    In [20]: data.str[:5]
    Out[20]: 
    Dave     dave@
    Rob      rob@g
    Steve    steve
    Wes        NaN
    dtype: object
    
    
    Method         Description
    cat            Concatenate strings element-wise with optional delimiter
    contains       Return boolean array if each string contains pattern/regex
    count          Count occurrences of pattern
    extract        Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group
    endswith       Equivalent to x.endswith(pattern) for each element
    startswith     Equivalent to x.startswith(pattern) for each element
    findall        Compute list of all occurrences of pattern/regex for each string
    get            Index into each element (retrieve i-th element)
    isalnum        Equivalent to built-in str.isalnum
    isalpha        Equivalent to built-in str.isalpha
    isdecimal      Equivalent to built-in str.isdecimal
    isdigit        Equivalent to built-in str.isdigit
    islower        Equivalent to built-in str.islower
    isnumeric      Equivalent to built-in str.isnumeric
    isupper        Equivalent to built-in str.isupper
    join           Join strings in each element of the Series with passed separator
    len            Compute length of each string
    lower, upper   Convert cases; equivalent to x.lower() or x.upper() for each element
    match          Use re.match with the passed regular expression on each element, returning True or False for whether it matches (as in the output above)
    pad            Add whitespace to left, right, or both sides of strings
    center         Equivalent to pad(side='both')
    repeat         Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)
    replace        Replace occurrences of pattern/regex with some other string
    slice          Slice each string in the Series
    split          Split strings on delimiter or regular expression
    strip          Trim whitespace from both sides, including newlines
    rstrip         Trim whitespace on right side
    lstrip         Trim whitespace on left side
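
    The extract method listed above is not demonstrated in the session. The following sketch applies it to the same e-mail data; the named groups user, domain, and suffix are names chosen here for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Same pattern as above, with named groups so that the resulting
# DataFrame gets readable column names.
pattern = (r'(?P<user>[A-Z0-9._%+-]+)@'
           r'(?P<domain>[A-Z0-9.-]+)\.'
           r'(?P<suffix>[A-Z]{2,4})')

# One column per group; rows with missing input become all-NaN rows.
parts = data.str.extract(pattern, flags=re.IGNORECASE)

print(parts)
```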

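    Data Wrangling: Join, Combine, and Reshape

    In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not easy to analyze. This chapter focuses on tools to join, combine, and reshape data.

    Hierarchical Indexing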

    #!python
    
    import pandas as pd
    
    import numpy as np
    
    data = pd.Series(np.random.randn(9), index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],[1, 2, 3, 1, 3, 1, 2, 2, 3]])
    
    data
    Out[5]: 
    a  1   -1.111004
       2   -0.451764
       3   -0.501180
    b  1    1.007739
       3    0.407470
    c  1   -0.307985
       2    0.608742
    d  2    1.432663
       3   -1.660043
    dtype: float64
    
    data['b']
    Out[6]: 
    1    1.007739
    3    0.407470
    dtype: float64
    
    data['b':'c']
    Out[7]: 
    b  1    1.007739
       3    0.407470
    c  1   -0.307985
       2    0.608742
    dtype: float64
    
    data.loc[['b', 'd']]
    Out[8]: 
    b  1    1.007739
       3    0.407470
    d  2    1.432663
       3   -1.660043
    dtype: float64
    
    data.loc[:, 2]
    Out[9]: 
    a   -0.451764
    c    0.608742
    d    1.432663
    dtype: float64
    
    data.unstack()
    Out[10]: 
              1         2         3
    a -1.111004 -0.451764 -0.501180
    b  1.007739       NaN  0.407470
    c -0.307985  0.608742       NaN
    d       NaN  1.432663 -1.660043
    
    data.unstack().stack()
    Out[11]: 
    a  1   -1.111004
       2   -0.451764
       3   -0.501180
    b  1    1.007739
       3    0.407470
    c  1   -0.307985
       2    0.608742
    d  2    1.432663
       3   -1.660043
    dtype: float64
    
    frame = pd.DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
    
    frame
    Out[13]: 
         Ohio     Colorado
        Green Red    Green
    a 1     0   1        2
      2     3   4        5
    b 1     6   7        8
      2     9  10       11
    
    frame.index.names = ['key1', 'key2']
    
    frame.columns.names = ['state', 'color']
    
    frame
    Out[16]: 
    state      Ohio     Colorado
    color     Green Red    Green
    key1 key2                   
    a    1        0   1        2
         2        3   4        5
    b    1        6   7        8
         2        9  10       11
    
    frame['Ohio']
    Out[17]: 
    color      Green  Red
    key1 key2            
    a    1         0    1
         2         3    4
    b    1         6    7
         2         9   10
    
    
    • Reordering and Sorting Levels

    Omitted for now.

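    Combining and Merging Datasets

    • pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.

    • pandas.concat concatenates or stacks objects together along an axis.

    • The combine_first instance method splices together overlapping data, filling in missing values in one object with values from another.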
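
    The sections that follow concentrate on merge, so here is a brief sketch of the other two tools in the list. The Series and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

# concat along axis 0 (the default) glues the values end to end.
stacked = pd.concat([s1, s2])

# concat along axis 1 aligns on the index and produces a DataFrame;
# keys become the column names, and missing cells are NaN.
side_by_side = pd.concat([s1, s2], axis=1, keys=['one', 'two'])

# combine_first keeps the caller's values and uses the other
# object's values only where the caller has NaN.
a = pd.Series([np.nan, 2.5, np.nan], index=['x', 'y', 'z'])
b = pd.Series([1.0, 9.9, 3.0], index=['x', 'y', 'z'])
patched = a.combine_first(b)

print(stacked)
print(side_by_side)
print(patched)
```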

    Database-Style DataFrame Joins

    #!python
    
    df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'data1': range(7)})
    
    df1
    Out[20]: 
       data1 key
    0      0   b
    1      1   b
    2      2   a
    3      3   c
    4      4   a
    5      5   a
    6      6   b
    
    df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
    
    df2
    Out[22]: 
       data2 key
    0      0   a
    1      1   b
    2      2   d
    
    pd.merge(df1, df2)
    Out[23]: 
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
    
    pd.merge(df1, df2, on='key')
    Out[24]: 
       data1 key  data2
    0      0   b      1
    1      1   b      1
    2      6   b      1
    3      2   a      0
    4      4   a      0
    5      5   a      0
    
    df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'data1': range(7)})
    
    df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
    
    pd.merge(df3, df4, left_on='lkey', right_on='rkey')
    Out[27]: 
       data1 lkey  data2 rkey
    0      0    b      1    b
    1      1    b      1    b
    2      6    b      1    b
    3      2    a      0    a
    4      4    a      0    a
    5      5    a      0    a
    
    pd.merge(df1, df2, how='outer')
    Out[28]: 
       data1 key  data2
    0    0.0   b    1.0
    1    1.0   b    1.0
    2    6.0   b    1.0
    3    2.0   a    0.0
    4    4.0   a    0.0
    5    5.0   a    0.0
    6    3.0   c    NaN
    7    NaN   d    2.0
    
    
    Option    Behavior
    'inner'   Use only the key combinations observed in both tables
    'left'    Use all key combinations found in the left table
    'right'   Use all key combinations found in the right table
    'outer'   Use all key combinations observed in both tables together

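    The examples above are many-to-one merges. In a many-to-many merge, the result contains the Cartesian product of the matching rows for each key.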
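
    A minimal sketch of a many-to-many merge, using small made-up frames (left and right are illustrative names): the key 'b' appears twice on each side, so it contributes 2 x 2 = 4 rows to the result.

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'b', 'a'], 'data1': range(3)})
right = pd.DataFrame({'key': ['a', 'b', 'b', 'd'], 'data2': range(4)})

# Every 'b' row on the left is paired with every 'b' row on the right.
merged = pd.merge(left, right, on='key', how='inner')

print(merged)
print(len(merged))
```

    The key 'd' exists only on the right, so an inner join drops it; 'a' matches once, giving five rows in total.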

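    Data Aggregation and Group Operations

    Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is a critical part of a data analysis workflow. After preparing a dataset, you typically need to compute group statistics or pivot tables.

    pandas provides a flexible and efficient groupby facility that lets you slice, dice, and summarize datasets in a natural way. One reason for the popularity of relational databases and SQL (Structured Query Language) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are rather limited in the kinds of group operations they can perform.

    As you will see in this chapter, with the expressiveness of Python and pandas, we can perform far more complex group operations by using any function that accepts a pandas object or NumPy array. In this chapter, you will learn how to:

    • Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names).
    • Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function.
    • Apply a variety of functions to the columns of a DataFrame.
    • Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection.
    • Compute pivot tables and cross-tabulations.
    • Perform quantile analysis and other group analyses.

    GroupBy Mechanics

    Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations, and I think it describes the process well. In the first stage, data contained in a pandas object (a Series, DataFrame, or otherwise) is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of the object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Next, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object usually depends on what is being done to the data. Figure 9-1 roughly illustrates a simple group aggregation.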
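
    The three stages can be spelled out by hand. The sketch below is a conceptual illustration only, not how pandas implements groupby internally; the frame and its column names key and value are made up.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                   'value': [1.0, 2.0, 3.0, 4.0, 8.0]})

# Split: one sub-Series per distinct key.
pieces = {k: df.loc[df['key'] == k, 'value'] for k in df['key'].unique()}

# Apply: reduce each piece to a single number.
applied = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(applied).sort_index()

# groupby performs the same three steps in a single call.
built_in = df.groupby('key')['value'].mean()

print(manual)
print(built_in)
```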

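    Each grouping key can take many forms, and the keys do not have to be all of the same type:

    • A list or array of values that is the same length as the axis being grouped.
    • A value indicating a column name in a DataFrame.
    • A dict or Series giving a correspondence between the values on the axis being grouped and the group names.
    • A function to be invoked on the axis index or the individual labels in the index.

    Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object. Don't worry if this all seems abstract; this chapter gives many examples of these methods. To get started, here is a small tabular dataset as a DataFrame: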

    #!python
    
    df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],'key2' : ['one', 'two', 'one', 'two', 'one'],
    'data1' : np.random.randn(5),'data2' : np.random.randn(5)})
    
    df
    Out[32]: 
          data1     data2 key1 key2
    0 -0.592555  0.537886    a  one
    1  0.286764  1.498792    a  two
    2 -0.149658  0.847675    b  one
    3  0.961803 -1.218945    b  two
    4  0.896790  1.461441    a  one
    
    

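    Suppose you want to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1: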

    #!python
    
    grouped = df['data1'].groupby(df['key1'])
    
    grouped
    Out[34]: <pandas.core.groupby.SeriesGroupBy object at 0x000001937BF46E48>
    

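    The grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. In other words, this object has all the information needed to then apply some operation to each of the groups. For example, to compute group means we can call the GroupBy's mean method: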

    #!python
    
    grouped.mean()
    Out[35]: 
    key1
    a    0.197000
    b    0.406073
    Name: data1, dtype: float64
    

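    The data (a Series) has been aggregated according to the group key, producing a new Series whose index consists of the unique values in the key1 column.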
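
    To make the role of the GroupBy object more concrete, here is a small sketch that inspects one before aggregating. It uses fixed values rather than random ones so the numbers can be checked by hand; those values are an assumption of this sketch, not the data shown above.

```python
import pandas as pd

# Fixed, made-up values so the result is reproducible.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})

grouped = df["data1"].groupby(df["key1"])

# The GroupBy object knows how rows map to groups before any aggregation runs.
group_count = grouped.ngroups
rows_per_group = {key: list(labels) for key, labels in grouped.groups.items()}

# Calling an aggregation is what produces the per-group values.
means = grouped.mean()

print(group_count)
print(rows_per_group)
print(means)
```

    The index of `means` holds each distinct value of key1 exactly once, in sorted order by default.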

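    If we pass several arrays at once, the result looks different: the data is grouped by each distinct combination of keys, and the resulting Series has a hierarchical index. The keys can also be arbitrary arrays of the right length rather than columns of the frame, as the states and years example shows.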

    #!python
    
    means = df['data1'].groupby([df['key1'], df['key2']]).mean()
    
    means
    Out[38]: 
    key1  key2
    a     one     0.152117
          two     0.286764
    b     one    -0.149658
          two     0.961803
    Name: data1, dtype: float64
    
    means.unstack()
    Out[39]: 
    key2       one       two
    key1                    
    a     0.152117  0.286764
    b    -0.149658  0.961803
    
    states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
    
    years = np.array([2005, 2005, 2006, 2005, 2006])
    
    df['data1'].groupby([states, years]).mean()
    Out[42]: 
    California  2005    0.286764
                2006   -0.149658
    Ohio        2005    0.184624
                2006    0.896790
    Name: data1, dtype: float64
    
    

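    More commonly, the grouping information is found in the same DataFrame as the data you want to work on, so column names (which can be strings, numbers, or other Python objects) are used as the group keys: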

    #!python
    
    df.groupby('key1').mean()
    Out[44]: 
             data1     data2
    key1                    
    a     0.197000  1.166040
    b     0.406073 -0.185635
    
    df.groupby(['key1', 'key2']).mean()
    Out[45]: 
                  data1     data2
    key1 key2                    
    a    one   0.152117  0.999663
         two   0.286764  1.498792
    b    one  -0.149658  0.847675
         two   0.961803 -1.218945
    
    df.groupby(['key1', 'key2']).size()
    Out[46]: 
    key1  key2
    a     one     2
          two     1
    b     one     1
          two     1
    dtype: int64
    

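    You may have noticed that the result of df.groupby('key1').mean() has no key2 column. Because df['key2'] is not numeric data (a so-called "nuisance column"), it is excluded from the result. By default, all of the numeric columns are aggregated, although it is possible to narrow this down to a subset, as discussed later. Note that this silent exclusion reflects the pandas version used here; in pandas 2.0 and later the same call raises a TypeError unless you pass numeric_only=True or select the numeric columns first. Any missing values in a group key are excluded from the result.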
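
    The sketch below shows two explicit ways to leave a non-numeric column out of an aggregation, and what happens to rows whose key is missing. It assumes a reasonably recent pandas (the `dropna` argument of groupby requires 1.1 or later), and the frame is made up for the purpose.

```python
import numpy as np
import pandas as pd

# Made-up data: one row has a missing group key.
df = pd.DataFrame({
    "key1": ["a", "a", "b", np.nan],
    "key2": ["one", "two", "one", "two"],
    "data1": [1.0, 3.0, 5.0, 7.0],
})

# Explicitly keep only numeric columns when aggregating.
numeric_means = df.groupby("key1").mean(numeric_only=True)

# Alternatively, select the columns to aggregate up front.
selected_means = df.groupby("key1")[["data1"]].mean()

# Rows whose key is missing are dropped by default; dropna=False keeps them.
sizes_default = df.groupby("key1").size()
sizes_with_nan = df.groupby("key1", dropna=False).size()

print(numeric_means)
print(sizes_default)
print(sizes_with_nan)
```

    Selecting columns up front is often the clearer choice, since it states exactly which data is being aggregated.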

    • Iterating over groups

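    A GroupBy object supports iteration, producing a sequence of 2-tuples, each containing the group name and the corresponding chunk of data. Consider the following: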

    #!python
    
    for name, group in df.groupby('key1'):
        print(name)
        print(group)
        
    a
          data1     data2 key1 key2
    0 -0.592555  0.537886    a  one
    1  0.286764  1.498792    a  two
    4  0.896790  1.461441    a  one
    b
          data1     data2 key1 key2
    2 -0.149658  0.847675    b  one
    3  0.961803 -1.218945    b  two
    

    In the case of multiple keys, the first element of each tuple is itself a tuple of key values:
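
    A minimal sketch of iterating with two keys follows. The data values are fixed and invented here so the output is reproducible; only the key1 and key2 columns mirror the frame used earlier.

```python
import pandas as pd

# Same key layout as the earlier frame, with made-up fixed values.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],
})

group_sizes = {}
for (k1, k2), group in df.groupby(["key1", "key2"]):
    # (k1, k2) is the tuple of key values identifying this group.
    group_sizes[(k1, k2)] = len(group)
    print((k1, k2))
    print(group)

# A common follow-up: collect the groups into a dict of DataFrames.
pieces = {name: group for name, group in df.groupby("key1")}
print(pieces["b"])
```

    Unpacking the key tuple directly in the for statement keeps the loop body readable when there are several keys.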


      Article title: pandas入门 (Introduction to pandas)

      Article link: https://www.haomeiwen.com/subject/uxdmaftx.html