Python for Data Analysis, Chapter 5: Basic pandas Functionality

Author: 龍猫君 | Published: 2017-12-12 23:24

    This chapter introduces the basic functionality for working with the data in Series and DataFrame objects.

    Reindexing
    An important method on pandas objects is reindex, which creates a new object conformed to a new index. Take a simple example:

    In [1]: from pandas import Series,DataFrame
    
    In [2]: import pandas as pd
    
    In [3]: import numpy as np
    
    In [4]: obj=Series([6.5,7.8,-5.9,8.6],index=['d','b','a','c'])
    
    In [5]: obj
    Out[5]: 
    d 6.5
    b 7.8
    a -5.9
    c 8.6
    dtype: float64
    

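    Calling reindex on this Series rearranges the data according to the new index. If an index value is not already present, a missing value (NaN) is introduced. The fill_value argument substitutes another value for the missing entries, and the method argument (ffill or pad to fill forward, bfill to fill backward) fills the gaps when reindexing ordered data: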

    In [6]: obj2=obj.reindex(['a', 'b', 'c', 'd', 'e'])
    
    In [7]: obj2
    Out[7]: 
    a -5.9
    b 7.8
    c 8.6
    d 6.5
    e NaN
    dtype: float64
    
    In [8]: obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
    Out[8]: 
    a -5.9
    b 7.8
    c 8.6
    d 6.5
    e 0.0
    dtype: float64
    
    In [9]: obj3=Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
    
    In [10]: obj3.reindex(range(6), method='ffill')
    Out[10]: 
    0 blue
    1 blue
    2 purple
    3 purple
    4 yellow
    5 yellow
    dtype: object
    
    In [11]: obj3.reindex(range(6), method='bfill')
    Out[11]: 
    0 blue
    1 purple
    2 purple
    3 yellow
    4 yellow
    5 NaN
    dtype: object
    
    In [12]: obj3.reindex(range(6), method='pad')
    Out[12]: 
    0 blue
    1 blue
    2 purple
    3 purple
    4 yellow
    5 yellow
    dtype: object
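
    The session above can be condensed into a short script. This is a minimal sketch that reuses the same data; the variable names (with_nan, with_zero, colors, forward, backward) are chosen here for illustration and do not come from the original session.

```python
import pandas as pd

obj = pd.Series([6.5, 7.8, -5.9, 8.6], index=["d", "b", "a", "c"])

# reindex returns a new Series; labels missing from the original become NaN.
with_nan = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value replaces the NaN that reindexing would otherwise introduce.
with_zero = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# method="ffill" carries the last valid value forward and method="bfill"
# pulls the next valid value backward; both need a monotonic index,
# which [0, 2, 4] satisfies.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
forward = colors.reindex(range(6), method="ffill")
backward = colors.reindex(range(6), method="bfill")

print(with_nan)
print(with_zero)
print(forward.tolist())
print(backward.tolist())
```

    Note that obj itself keeps its original order, because reindex never modifies the object it is called on.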
    

    For a DataFrame, reindex can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows:

    In [13]: frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
    
    In [14]: frame
    Out[14]: 
    Ohio Texas California
    a 0 1 2
    c 3 4 5
    d 6 7 8
    
    In [15]: frame2=frame.reindex(['a', 'b', 'c', 'd'])
    
    In [16]: frame2
    Out[16]: 
    Ohio Texas California
    a 0.0 1.0 2.0
    b NaN NaN NaN
    c 3.0 4.0 5.0
    d 6.0 7.0 8.0
    

    The columns keyword reindexes the columns:

    In [17]: states = ['Texas', 'Utah', 'California']
    
    In [18]: frame.reindex(columns=states)
    Out[18]: 
    Texas Utah California
    a 1 NaN 2
    c 4 NaN 5
    d 7 NaN 8
    

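    Both axes can be reindexed in one step with the label-indexing capability of ix. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the call below only runs on older versions: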

    In [28]: frame
    Out[28]: 
    Ohio Texas California
    a 0 1 2
    c 3 4 5
    d 6 7 8
    
    In [31]: states = ['Texas', 'Utah', 'California']
    
    In [32]: frame.ix[['a', 'b', 'c', 'd'], states]
    Out[32]: 
    Texas Utah California
    a 1.0 NaN 2.0
    b NaN NaN NaN
    c 4.0 NaN 5.0
    d 7.0 NaN 8.0
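
    On pandas versions where ix is no longer available, the same result can be obtained with reindex alone. The following is a sketch under that assumption; the name result is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)
states = ["Texas", "Utah", "California"]

# Reindex rows and columns in one call. Labels that do not exist yet
# ("b" and "Utah") are filled with NaN, which upcasts the data to float.
result = frame.reindex(index=["a", "b", "c", "d"], columns=states)
print(result)
```

    reindex is used here rather than loc because, in recent pandas versions, loc raises a KeyError when a list of labels contains entries that are missing from the axis.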
    

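    Dropping one or more entries from an axis is easy if you have an index array or list. Because this requires some data munging and set logic, the drop method returns a new object with the indicated values deleted from the given axis: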

    In [33]: obj=Series(np.arange(5.),index=['a', 'b', 'c', 'd', 'e'])
    
    In [34]: obj
    Out[34]: 
    a 0.0
    b 1.0
    c 2.0
    d 3.0
    e 4.0
    dtype: float64
    
    In [35]: new_obj=obj.drop('c')
    
    In [36]: new_obj
    Out[36]: 
    a 0.0
    b 1.0
    d 3.0
    e 4.0
    dtype: float64
    
    In [37]: obj.drop(['d','b'])
    Out[37]: 
    a 0.0
    c 2.0
    e 4.0
    dtype: float64
    

    For a DataFrame, index values can be deleted from either axis:

    In [41]: data = DataFrame(np.arange(16).reshape((4, 4)),
        ...: index=['Ohio', 'Colorado', 'Utah', 'New York'], 
        ...: columns=['one', 'two', 'three', 'four'])
    
    In [42]: data
    Out[42]: 
    one two three four
    Ohio 0 1 2 3
    Colorado 4 5 6 7
    Utah 8 9 10 11
    New York 12 13 14 15
    
    In [43]: data.drop(['Colorado', 'Ohio'])
    Out[43]: 
    one two three four
    Utah 8 9 10 11
    New York 12 13 14 15
    
    In [44]: data.drop('two',axis=1)
    Out[44]: 
    one three four
    Ohio 0 2 3
    Colorado 4 6 7
    Utah 8 10 11
    New York 12 14 15
    
    In [45]: data.drop(['two', 'four'], axis=1)
    Out[45]: 
    one three
    Ohio 0 2
    Colorado 4 6
    Utah 8 10
    New York 12 14
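
    As a compact recap of drop, the sketch below removes rows and columns from the same frame and checks that the original is left untouched. The columns= keyword is an alternative spelling of axis=1 that newer pandas versions accept; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label (axis=0 is the default).
rows_dropped = data.drop(["Colorado", "Ohio"])

# Drop columns: axis=1 and the columns= keyword are equivalent spellings.
cols_dropped = data.drop(["two", "four"], axis=1)
cols_dropped_kw = data.drop(columns=["two", "four"])

# drop returns a new object, so the original frame keeps its shape.
print(rows_dropped.index.tolist())
print(cols_dropped.columns.tolist())
print(data.shape)
```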
    

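    Indexing, selection, and filtering
    Series indexing (obj[...]) works much like NumPy array indexing, except that the index labels of the Series can be used instead of only integers: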

    In [47]: obj=Series(np.arange(5.),index=['a', 'b', 'c', 'd','e'])
    
    In [48]: obj
    Out[48]: 
    a 0.0
    b 1.0
    c 2.0
    d 3.0
    e 4.0
    dtype: float64
    
    In [49]: obj['b']
    Out[49]: 1.0
    
    In [50]: obj[3]
    Out[50]: 3.0
    
    In [51]: obj[3:5]
    Out[51]: 
    d 3.0
    e 4.0
    dtype: float64
    
    In [52]: obj[['b','e','d']]
    Out[52]: 
    b 1.0
    e 4.0
    d 3.0
    dtype: float64
    
    In [53]: obj[[1,4]]
    Out[53]: 
    b 1.0
    e 4.0
    dtype: float64
    
    In [54]: obj[obj<3]
    Out[54]: 
    a 0.0
    b 1.0
    c 2.0
    dtype: float64
    

    Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive:

    In [55]: obj['b':'d']
    Out[55]: 
    b 1.0
    c 2.0
    d 3.0
    dtype: float64
    
    In [56]: obj['b':'d']=6
    
    In [57]: obj
    Out[57]: 
    a 0.0
    b 6.0
    c 6.0
    d 6.0
    e 4.0
    dtype: float64
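
    The difference between the two kinds of slice is easiest to see side by side. This sketch uses the explicit loc (label) and iloc (position) accessors, which are an addition to the session above; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.0), index=["a", "b", "c", "d", "e"])

by_label = obj.loc["b":"d"]   # label slice: the endpoint "d" is included
by_position = obj.iloc[1:3]   # positional slice: position 3 is excluded

print(by_label.index.tolist())
print(by_position.index.tolist())

# Assigning through a label slice sets every element in the slice.
obj.loc["b":"d"] = 6
print(obj.tolist())
```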
    

    Indexing into a DataFrame retrieves one or more columns:

    In [60]: data = DataFrame(np.arange(16).reshape((4, 4)),
        ...: index=['Ohio', 'Colorado', 'Utah', 'New York'],
        ...: columns=['one', 'two', 'three', 'four'])
    
    In [61]: data
    Out[61]: 
    one two three four
    Ohio 0 1 2 3
    Colorado 4 5 6 7
    Utah 8 9 10 11
    New York 12 13 14 15
    
    In [62]: data['two']
    Out[62]: 
    Ohio 1
    Colorado 5
    Utah 9
    New York 13
    Name: two, dtype: int32
    
    In [63]: data[['three','one']]
    Out[63]: 
    three one
    Ohio 2 0
    Colorado 6 4
    Utah 10 8
    New York 14 12
    

    Rows can be selected by slicing or with a boolean array:

    In [64]: data[:3]
    Out[64]: 
    one two three four
    Ohio 0 1 2 3
    Colorado 4 5 6 7
    Utah 8 9 10 11
    
    In [65]: data[data['three']>5]
    Out[65]: 
    one two three four
    Colorado 4 5 6 7
    Utah 8 9 10 11
    New York 12 13 14 15
    

    A DataFrame can also be indexed with a boolean DataFrame, such as the one produced by the scalar comparison below:

    In [66]: data<6
    Out[66]: 
    one two three four
    Ohio True True True True
    Colorado True True False False
    Utah False False False False
    New York False False False False
    
    In [67]: data[data<5]=0
    
    In [68]: data
    Out[68]: 
    one two three four
    Ohio 0 0 0 0
    Colorado 0 5 6 7
    Utah 8 9 10 11
    New York 12 13 14 15
    

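    The indexing field ix selects a subset of the rows and columns of a DataFrame using NumPy-like notation plus axis labels. ix is deprecated, as the warning below shows; use loc for label-based indexing and iloc for positional indexing instead: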

    In [69]: data.ix['Colorado', ['two', 'three']]
    C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
    .ix is deprecated. Please use
    .loc for label based indexing or
    .iloc for positional indexing
    
    See the documentation here:
    http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
    """Entry point for launching an IPython kernel.
    Out[69]: 
    two 5
    three 6
    Name: Colorado, dtype: int32
    
    In [70]: data.loc['Colorado', ['two', 'three']]
    Out[70]: 
    two 5
    three 6
    Name: Colorado, dtype: int32
    
    In [71]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
    Out[71]: 
    four one two
    Colorado 7 0 5
    Utah 11 8 9
    
    In [72]: data.ix[2]
    Out[72]: 
    one 8
    two 9
    three 10
    four 11
    Name: Utah, dtype: int32
    
    In [73]: data.loc[:'Utah','two']
    Out[73]: 
    Ohio 0
    Colorado 5
    Utah 9
    Name: two, dtype: int32
    
    In [74]: data.ix[data.three>5]
    Out[74]: 
    one two three four
    Colorado 0 5 6 7
    Utah 8 9 10 11
    New York 12 13 14 15
    
    In [75]: data.ix[data.three > 5, :3]
    Out[75]: 
    one two three
    Colorado 0 5 6
    Utah 8 9 10
    New York 12 13 14
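
    For readers on a pandas version without ix, each of the calls above has a loc or iloc counterpart. The translation below is one possible sketch, not the only one; it rebuilds the frame, repeats the assignment from In [67], and uses illustrative variable names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
data[data < 5] = 0  # same in-place edit as In [67]

# Label-based selection: one row, two columns.
by_label = data.loc["Colorado", ["two", "three"]]

# ix allowed mixing row labels with column positions; with loc the
# positions are first converted to column labels.
mixed = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]

# Purely positional selection of the third row.
third_row = data.iloc[2]

# Boolean row mask combined with the first three columns.
masked = data.loc[data["three"] > 5, data.columns[:3]]

print(by_label.tolist())
print(mixed)
print(third_row.tolist())
print(masked)
```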
    

    There are many ways to select and rearrange the data contained in a pandas object.



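    A note on In [71]: the Colorado row prints 7 0 5 rather than 7 4 5 because In [67] (data[data<5]=0) had already replaced every value below 5 with 0, so the 4 in Colorado's one column had become 0 by then.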


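    Among these, the get_value method selects a single value by row and column label, and the set_value method sets one. Both methods were deprecated and later removed in newer pandas versions, where the at and iat accessors cover the same use.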
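
    A minimal sketch of single-value access with at (by label) and iat (by position), assuming a pandas version where get_value and set_value are not available. The names value and same_cell are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Read a single value by row label and column label.
value = data.at["Colorado", "two"]

# Write a single value by label, then read the same cell by position.
data.at["Utah", "four"] = 100
same_cell = data.iat[2, 3]

print(value, same_cell)
```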

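    Arithmetic and data alignment
    pandas can perform arithmetic on objects with different indexes. When objects are added, if any index pairs are not the same, the index of the result is the union of the index pairs.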

    In [77]: s1=Series([6.8,-4.5,3.6,5.6],index=['a','c','d','e'])
    
    In [78]: s2 = Series([-6.5, 3.6, -5.6, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
    
    In [79]: s1
    Out[79]: 
    a 6.8
    c -4.5
    d 3.6
    e 5.6
    dtype: float64
    
    In [80]: s2
    Out[80]: 
    a -6.5
    c 3.6
    e -5.6
    f 4.0
    g 3.1
    dtype: float64
    
    In [81]: s1+s2
    Out[81]: 
    a 0.3
    c -0.9
    d NaN
    e 0.0
    f NaN
    g NaN
    dtype: float64
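
    The alignment rule can be checked directly: the result carries the union of the two indexes, and only labels present on both sides hold a number. A small sketch with the same data follows; total and overlap_only are illustrative names.

```python
import pandas as pd

s1 = pd.Series([6.8, -4.5, 3.6, 5.6], index=["a", "c", "d", "e"])
s2 = pd.Series([-6.5, 3.6, -5.6, 4, 3.1], index=["a", "c", "e", "f", "g"])

# Values are matched by label, not by position.
total = s1 + s2

# Labels found in only one operand ("d", "f", "g") become NaN;
# dropna keeps just the labels the two Series share.
overlap_only = total.dropna()

print(total.index.tolist())
print(overlap_only.index.tolist())
```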
    

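    Automatic data alignment introduces NA values at the labels that do not overlap, and missing values then propagate through arithmetic. For a DataFrame, alignment happens on both the rows and the columns. Adding two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two operands: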

    In [85]: df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
        ...: index=['Ohio', 'Texas', 'Colorado'])
    
    In [86]: df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
        ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    
    In [87]: df1
    Out[87]: 
    b c d
    Ohio 0.0 1.0 2.0
    Texas 3.0 4.0 5.0
    Colorado 6.0 7.0 8.0
    
    In [88]: df2
    Out[88]: 
    b d e
    Utah 0.0 1.0 2.0
    Ohio 3.0 4.0 5.0
    Texas 6.0 7.0 8.0
    Oregon 9.0 10.0 11.0
    
    In [89]: df1+df2
    Out[89]: 
    b c d e
    Colorado NaN NaN NaN NaN
    Ohio 3.0 NaN 6.0 NaN
    Oregon NaN NaN NaN NaN
    Texas 9.0 NaN 12.0 NaN
    Utah NaN NaN NaN NaN
    

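    Arithmetic methods with fill values
    When doing arithmetic between differently indexed objects, you may want to fill in a special value (such as 0) when an axis label is found in one object but not in the other. With plain addition, the positions that do not overlap become NA: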

    In [95]: f1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
    
    In [96]: f2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
    
    In [97]: f1
    Out[97]: 
    a b c d
    0 0.0 1.0 2.0 3.0
    1 4.0 5.0 6.0 7.0
    2 8.0 9.0 10.0 11.0
    
    In [98]: f2
    Out[98]: 
    a b c d e
    0 0.0 1.0 2.0 3.0 4.0
    1 5.0 6.0 7.0 8.0 9.0
    2 10.0 11.0 12.0 13.0 14.0
    3 15.0 16.0 17.0 18.0 19.0
    
    In [99]: f1+f2
    Out[99]: 
    a b c d e
    0 0.0 2.0 4.0 6.0 NaN
    1 9.0 11.0 13.0 15.0 NaN
    2 18.0 20.0 22.0 24.0 NaN
    3 NaN NaN NaN NaN NaN
    

    Using the add method of f1, pass f2 together with a fill_value argument:

    In [102]: f1.add(f2, fill_value=0)
    Out[102]: 
    a b c d e
    0 0.0 2.0 4.0 6.0 4.0
    1 9.0 11.0 13.0 15.0 9.0
    2 18.0 20.0 22.0 24.0 14.0
    3 15.0 16.0 17.0 18.0 19.0
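
    The two behaviours can be compared in one script. This sketch rebuilds f1 and f2 and contrasts the + operator with add(..., fill_value=0); plain and filled are illustrative names.

```python
import numpy as np
import pandas as pd

f1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
f2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

# Operator form: a cell is NaN whenever either frame lacks it.
plain = f1 + f2

# Method form: a cell missing from one frame is treated as 0 before adding.
filled = f1.add(f2, fill_value=0)

print(plain)
print(filled)
```

    fill_value only applies when exactly one operand is missing a cell; a cell missing from both operands would still be NaN, although that case does not occur with these two frames.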
    

    Likewise, a fill value can be specified when reindexing a Series or DataFrame:

    In [103]: f1.reindex(columns=f2.columns, fill_value=0)
    Out[103]: 
    a b c d e
    0 0.0 1.0 2.0 3.0 0
    1 4.0 5.0 6.0 7.0 0
    2 8.0 9.0 10.0 11.0 0
    
    In [105]: f1*f2
    Out[105]: 
    a b c d e
    0 0.0 1.0 4.0 9.0 NaN
    1 20.0 30.0 42.0 56.0 NaN
    2 80.0 99.0 120.0 143.0 NaN
    3 NaN NaN NaN NaN NaN
    

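    Operations between DataFrame and Series
    Consider the difference between a two-dimensional array and one of its rows. The subtraction is carried out once for every row, which is called broadcasting: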

    In [106]: arr=np.arange(12.).reshape((3,4))
    
    In [107]: arr
    Out[107]: 
    array([[ 0., 1., 2., 3.],
    [ 4., 5., 6., 7.],
    [ 8., 9., 10., 11.]])
    
    In [108]: arr[0]
    Out[108]: array([ 0., 1., 2., 3.])
    
    In [109]: arr-arr[0]
    Out[109]: 
    array([[ 0., 0., 0., 0.],
    [ 4., 4., 4., 4.],
    [ 8., 8., 8., 8.]])
    

    By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows:

    In [110]: frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
         ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    
    In [111]: frame
    Out[111]: 
    b d e
    Utah 0.0 1.0 2.0
    Ohio 3.0 4.0 5.0
    Texas 6.0 7.0 8.0
    Oregon 9.0 10.0 11.0
    
    In [112]: series = frame.iloc[0]
    
    In [113]: series
    Out[113]: 
    b 0.0
    d 1.0
    e 2.0
    Name: Utah, dtype: float64
    
    In [114]: frame-series
    Out[114]: 
    b d e
    Utah 0.0 0.0 0.0
    Ohio 3.0 3.0 3.0
    Texas 6.0 6.0 6.0
    Oregon 9.0 9.0 9.0
    

    If an index value is not found in either the columns of the DataFrame or the index of the Series, the two objects are reindexed to form the union:

    In [115]: series2 = Series(range(3), index=['b', 'e', 'f'])
    
    In [116]: series2
    Out[116]: 
    b 0
    e 1
    f 2
    dtype: int32
    
    In [117]: frame + series2
    Out[117]: 
    b d e f
    Utah 0.0 NaN 3.0 NaN
    Ohio 3.0 NaN 6.0 NaN
    Texas 6.0 NaN 9.0 NaN
    Oregon 9.0 NaN 12.0 NaN
    

    To match on the rows and broadcast across the columns instead, you have to use one of the arithmetic methods:

    In [118]: series3 = frame['d']
    
    In [119]: series3
    Out[119]: 
    Utah 1.0
    Ohio 4.0
    Texas 7.0
    Oregon 10.0
    Name: d, dtype: float64
    
    In [120]: frame
    Out[120]: 
    b d e
    Utah 0.0 1.0 2.0
    Ohio 3.0 4.0 5.0
    Texas 6.0 7.0 8.0
    Oregon 9.0 10.0 11.0
    
    In [121]: frame.sub(series3, axis=0)
    Out[121]: 
    b d e
    Utah -1.0 0.0 1.0
    Ohio -1.0 0.0 1.0
    Texas -1.0 0.0 1.0
    Oregon -1.0 0.0 1.0
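
    The sketch below puts both broadcasting directions next to each other and also shows what happens when a column is subtracted with the plain operator. It uses iloc instead of ix to pick the first row; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
row = frame.iloc[0]   # indexed by column labels b, d, e
col = frame["d"]      # indexed by row labels Utah, Ohio, ...

# Default: match the Series index on the columns, broadcast down the rows.
by_columns = frame - row

# axis=0: match the Series index on the row labels, broadcast across columns.
by_rows = frame.sub(col, axis=0)

# Plain operator with a column: row labels are compared with column
# labels, nothing matches, and the result is all NaN over the union.
mismatched = frame - col

print(by_columns)
print(by_rows)
print(mismatched.shape)
```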
    

    The axis number you pass is the axis to match on.

    Function application and mapping
    NumPy ufuncs (element-wise array methods) also work on pandas objects:

    In [122]: frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
         ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    
    In [123]: frame
    Out[123]: 
    b d e
    Utah -0.976531 -1.511940 -0.018721
    Ohio 0.598117 0.047678 -0.058404
    Texas 2.469704 0.027215 1.154004
    Oregon 1.308615 -1.634739 0.096210
    
    In [124]: np.abs(frame)
    Out[124]: 
    b d e
    Utah 0.976531 1.511940 0.018721
    Ohio 0.598117 0.047678 0.058404
    Texas 2.469704 0.027215 1.154004
    Oregon 1.308615 1.634739 0.096210
    

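    Another frequent operation is applying a function to the one-dimensional arrays formed by each column or row. The apply method of DataFrame does exactly this: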

    In [125]: f = lambda x: x.max() - x.min()
    
    In [126]: frame.apply(f)
    Out[126]: 
    b 3.446234
    d 1.682417
    e 1.212408
    dtype: float64
    
    In [127]: frame.apply(f,axis=1)
    Out[127]: 
    Utah 1.493219
    Ohio 0.656521
    Texas 2.442489
    Oregon 2.943355
    dtype: float64
    

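    Many of the most common array statistics (such as sum and mean) are already DataFrame methods, so apply is not needed for them. The function passed to apply does not have to return a scalar; it can also return a Series with several values: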

    In [128]: def f(x):
         ...:     return Series([x.min(), x.max()], index=['min', 'max'])
         ...: 
    
    In [129]: frame.apply(f)
    Out[129]: 
    b d e
    min -0.976531 -1.634739 -0.058404
    max 2.469704 0.047678 1.154004
    

    To get a formatted string for each floating-point value in frame, use applymap:

    In [130]: format = lambda x: '%.2f' % x
    
    In [131]: frame.applymap(format)
    Out[131]: 
    b d e
    Utah -0.98 -1.51 -0.02
    Ohio 0.60 0.05 -0.06
    Texas 2.47 0.03 1.15
    Oregon 1.31 -1.63 0.10
    

    Series has a map method for applying an element-wise function:

    In [132]: frame['e'].map(format)
    Out[132]: 
    Utah -0.02
    Ohio -0.06
    Texas 1.15
    Oregon 0.10
    Name: e, dtype: object
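
    Recent pandas releases deprecate applymap, so the sketch below formats a whole frame by applying Series.map to each column instead, which should behave the same across versions. The frame is rebuilt from the rounded values printed above, and two_decimals is an illustrative name.

```python
import pandas as pd

# Values copied from the frame printed in the session above.
frame = pd.DataFrame(
    {
        "b": [-0.976531, 0.598117, 2.469704, 1.308615],
        "d": [-1.511940, 0.047678, 0.027215, -1.634739],
        "e": [-0.018721, -0.058404, 1.154004, 0.096210],
    },
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def two_decimals(x):
    """Format one number with two digits after the decimal point."""
    return f"{x:.2f}"

# Element-wise on one column.
formatted_column = frame["e"].map(two_decimals)

# Element-wise on the whole frame: map every column in turn.
formatted_frame = frame.apply(lambda column: column.map(two_decimals))

print(formatted_column)
print(formatted_frame)
```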
    

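    Sorting and ranking
    Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object: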

    In [133]: obj = Series(range(4), index=['d', 'a', 'b', 'c'])
    
    In [134]: obj
    Out[134]: 
    d 0
    a 1
    b 2
    c 3
    dtype: int32
    
    In [135]: obj.sort_index()
    Out[135]: 
    a 1
    b 2
    c 3
    d 0
    dtype: int32
    

    With a DataFrame, you can sort by the index on either axis:

    In [136]: frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
         ...: columns=['d', 'a', 'b', 'c'])
    
    In [137]: frame
    Out[137]: 
    d a b c
    three 0 1 2 3
    one 4 5 6 7
    
    In [138]: frame.sort_index()
    Out[138]: 
    d a b c
    one 4 5 6 7
    three 0 1 2 3
    
    In [139]: frame.sort_index(axis=1)
    Out[139]: 
    a b c d
    three 1 2 3 0
    one 5 6 7 4
    

    Data is sorted in ascending order by default, but it can be sorted in descending order as well:

    In [140]: frame.sort_index(axis=1,ascending=False)
    Out[140]: 
    d c b a
    three 0 3 2 1
    one 4 7 6 5
    

    Sorting a Series by its index:

    In [148]: obj = Series([6, 9, -8, 3])
    
    In [149]: obj.sort_index()
    Out[149]: 
    0 6
    1 9
    2 -8
    3 3
    dtype: int64
    

    Sorting a Series by its values, in ascending order:

    In [150]: obj.sort_values()
    Out[150]: 
    2 -8
    3 3
    0 6
    1 9
    dtype: int64
    

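    When sorting, any missing values are placed at the end of the Series by default. The order method no longer exists in the pandas version used here (the call below raises AttributeError), so use sort_values instead: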

    In [151]: obj = Series([4, np.nan, 6, np.nan, -3, 3])
    
    In [152]: obj.order()
    ---------------------------------------------------------------------------
    AttributeError Traceback (most recent call last)
    <ipython-input-152-4fc888977b98> in <module>()
    ----> 1 obj.order()
    
    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
    2968 if name in self._info_axis:
    2969 return self[name]
    -> 2970 return object.__getattribute__(self, name)
    2971 
    2972 def __setattr__(self, name, value):
    
    AttributeError: 'Series' object has no attribute 'order'
    
    In [153]: obj.sort_values()
    Out[153]: 
    4 -3.0
    5 3.0
    0 4.0
    2 6.0
    1 NaN
    3 NaN
    dtype: float64
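
    Where the missing values end up can be controlled with the na_position argument of sort_values. A short sketch with the same data follows; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 6, np.nan, -3, 3])

# Default: ascending order, NaN at the end.
default = obj.sort_values()

# na_position="first" moves the NaN entries to the front.
nan_first = obj.sort_values(na_position="first")

# Descending order still keeps NaN at the end by default.
descending = obj.sort_values(ascending=False)

print(default)
print(nan_first)
print(descending)
```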
    

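    To sort by the values in one or more columns, pass the column names to the by option. Passing by to sort_index is deprecated, as the warning below notes; sort_values(by=...) is the replacement: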

    In [154]: frame = DataFrame({'b': [5, 8, -6, 3], 'a': [0, 1, 0, 1]})
    
    In [155]: frame
    Out[155]: 
    a b
    0 0 5
    1 1 8
    2 0 -6
    3 1 3
    
    In [156]: frame.sort_index(by='b')
    Out[156]: 
    a b
    2 0 -6
    3 1 3
    0 0 5
    1 1 8
    
    In [157]: frame.sort_index(by=['a','b'])
    C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
    """Entry point for launching an IPython kernel.
    Out[157]: 
    a b
    2 0 -6
    0 0 5
    3 1 3
    1 1 8
    

    To sort by multiple columns, pass a list of names:

    In [158]: frame.sort_values(by=['a','b'])
    Out[158]: 
    a b
    2 0 -6
    0 0 5
    3 1 3
    1 1 8
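
    sort_values also accepts one ascending flag per column, which the session does not show. The sketch below adds that case as an illustration; by_b, by_a_then_b and mixed are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({"b": [5, 8, -6, 3], "a": [0, 1, 0, 1]})

# Sort by a single column.
by_b = frame.sort_values(by="b")

# Sort by "a" first, then by "b" within equal values of "a".
by_a_then_b = frame.sort_values(by=["a", "b"])

# One direction per column: "a" ascending, "b" descending.
mixed = frame.sort_values(by=["a", "b"], ascending=[True, False])

print(by_b.index.tolist())
print(by_a_then_b.index.tolist())
print(mixed.index.tolist())
```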
    

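    Ranking is closely related to sorting: it assigns ranks from 1 through the number of valid data points in the array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. By default, rank breaks ties by assigning each group of equal values the mean rank: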

    In [159]: obj = Series([8, -6, 5, 4, 2, 0, 4])
    
    In [160]: obj
    Out[160]: 
    0 8
    1 -6
    2 5
    3 4
    4 2
    5 0
    6 4
    dtype: int64
    
    In [161]: obj.rank()
    Out[161]: 
    0 7.0
    1 1.0
    2 6.0
    3 4.5
    4 3.0
    5 2.0
    6 4.5
    dtype: float64
    

    Ranks can also be assigned according to the order in which the values appear in the data:

    In [162]: obj.rank(method='first')
    Out[162]: 
    0 7.0
    1 1.0
    2 6.0
    3 4.0
    4 3.0
    5 2.0
    6 5.0
    dtype: float64
    

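    Ranking in descending order (here with method='max', which gives tied values the highest rank of their group):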

    In [163]: obj.rank(ascending=False, method='max')
    Out[163]: 
    0 1.0
    1 7.0
    2 2.0
    3 4.0
    4 5.0
    5 6.0
    6 4.0
    dtype: float64
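
    To see how the tie-breaking rules differ, the sketch below ranks the same Series with several values of the method argument and collects the results in one table. The method 'dense' is not used in the session and is included only for comparison.

```python
import pandas as pd

obj = pd.Series([8, -6, 5, 4, 2, 0, 4])

# The two 4s (positions 3 and 6) are tied for ranks 4 and 5.
ranks = pd.DataFrame(
    {
        "average": obj.rank(),              # mean of the tied ranks: 4.5
        "min": obj.rank(method="min"),      # lowest rank of the group: 4
        "max": obj.rank(method="max"),      # highest rank of the group: 5
        "first": obj.rank(method="first"),  # order of appearance: 4, then 5
        "dense": obj.rank(method="dense"),  # like "min", without gaps after ties
    }
)
print(ranks)
```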
    

    A DataFrame can compute ranks over the rows or over the columns:

    In [164]: frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
         ...: 'c': [-2, 5, 8, -2.5]})
    
    In [165]: frame
    Out[165]: 
    a b c
    0 0 4.3 -2.0
    1 1 7.0 5.0
    2 0 -3.0 8.0
    3 1 2.0 -2.5
    
    In [166]: frame.rank(axis=1)
    Out[166]: 
    a b c
    0 2.0 3.0 1.0
    1 1.0 3.0 2.0
    2 2.0 1.0 3.0
    3 2.0 3.0 1.0
    

    Index labels do not have to be unique. Here is a Series whose index contains duplicate values:

    In [167]: obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
    
    In [168]: obj
    Out[168]: 
    a 0
    a 1
    b 2
    b 3
    c 4
    dtype: int32
    

    The is_unique property of the index tells you whether its values are unique:

    In [169]: obj.index.is_unique
    Out[169]: False
    

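    Data selection behaves differently with a duplicated index. Indexing a label that maps to multiple entries returns a Series, while a label with a single entry returns a scalar value. The same logic extends to selecting rows of a DataFrame: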

    In [170]: obj['a']
    Out[170]: 
    a 0
    a 1
    dtype: int32
    
    In [171]: obj['c']
    Out[171]: 4
    
    In [172]: df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
    
    In [173]: df
    Out[173]: 
    0 1 2
    a -0.199619 -0.871154 -0.674903
    a 1.573516 0.558822 0.511055
    b 0.029318 -0.654353 -0.682175
    b -0.563794 1.756565 0.105016
    
    In [174]: df.loc['b']
    Out[174]: 
    0 1 2
    b 0.029318 -0.654353 -0.682175
    b -0.563794 1.756565 0.105016
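
    Since the return type depends on how often a label occurs, code that handles duplicated indexes may want to check it explicitly. A minimal sketch follows, using a deterministic DataFrame in place of the random one; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

repeated = obj["a"]   # label occurs twice: a Series comes back
single = obj["c"]     # label occurs once: a scalar comes back

# Deterministic stand-in for the random frame in the session.
df = pd.DataFrame(np.arange(12).reshape((4, 3)), index=["a", "a", "b", "b"])
rows_b = df.loc["b"]  # both rows labelled "b"

print(obj.index.is_unique)
print(type(repeated).__name__, single)
print(rows_b)
```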
    

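    Working through these exercises showed me that there are still parts I do not fully understand; I expect to get a better grasp of these functions with further study.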


          Article link: https://www.haomeiwen.com/subject/lhyoixtx.html