Getting Started with pandas

Author: 弃用中 | Published 2018-01-30 23:06

    Introduction to pandas Data Structures

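    We will import pandas as follows (NumPy is imported as well, because later examples call np.exp and np.arange):

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame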
    

    Series

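    A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data: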

    In [6]: obj = pd.Series([4,7,-5,3])
    
    In [7]: obj
    Out[7]:
    0    4
    1    7
    2   -5
    3    3
    dtype: int64
    

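    The string representation of a Series shows the index on the left and the values on the right. Since we did not specify an index for the data, a default integer index from 0 to N-1 (where N is the length of the data) is created. You can get the array representation and the index object of a Series through its values and index attributes: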

    In [8]: obj.values
    Out[8]: array([ 4,  7, -5,  3], dtype=int64)
    
    In [9]: obj.index
    Out[9]: RangeIndex(start=0, stop=4, step=1)
    

    Often you will want to create a Series with an index that labels each data point:

    In [10]: obj2 = pd.Series([4,7,-5,3],index=['d','b','a','c'])
    
    In [12]: obj2
    Out[12]:
    d    4
    b    7
    a   -5
    c    3
    dtype: int64
    

    Unlike a plain NumPy array, a Series lets you use the labels in its index to select a single value or a set of values:

    In [13]: obj2['a']
    Out[13]: -5
    
    In [15]: obj2['d'] = 7
    
    In [18]: obj2[['c','a','d']]
    Out[18]:
    c    3
    a   -5
    d    7
    dtype: int64
    

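    NumPy-style operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and value: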

    In [19]: obj2
    Out[19]:
    d    7
    b    7
    a   -5
    c    3
    dtype: int64
    
    In [21]: obj2[obj2>0]
    Out[21]:
    d    7
    b    7
    c    3
    dtype: int64
    
    In [22]: obj2*2
    Out[22]:
    d    14
    b    14
    a   -10
    c     6
    dtype: int64
    
    In [23]: np.exp(obj2)
    Out[23]:
    d    1096.633158
    b    1096.633158
    a       0.006738
    c      20.085537
    dtype: float64
    

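    A Series can also be thought of as a fixed-length, ordered dict, since it is a mapping from index values to data values. For example, the in operator tests whether a label is present: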

    In [24]: 'b' in obj2
    Out[24]: True
    
    In [25]: 'e' in obj2
    Out[25]: False
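
    The dict analogy goes a little further than membership tests. The sketch below rebuilds obj2 from the values shown above and uses three dict-like methods of Series: items, get, and to_dict. The variable names pairs, missing, and as_dict are only illustrative.

```python
import pandas as pd

# Same data as obj2 in the session above (after obj2['d'] = 7)
obj2 = pd.Series([7, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Iterate over (label, value) pairs, much like dict.items()
pairs = list(obj2.items())

# get() returns a default instead of raising KeyError for a missing label
missing = obj2.get('e', 0)

# Convert the Series back to a plain Python dict
as_dict = obj2.to_dict()

print(pairs)
print(missing)
print(as_dict)
```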
    

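    If your data is in a dict, you can create a Series from it, and the dict keys become the index. (The output below comes from an older pandas version that sorted the keys; recent versions keep the dict's insertion order.)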

    In [26]: sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
    
    In [27]: obj3 = pd.Series(sdata)
    
    In [28]: obj3
    Out[28]:
    Ohio      35000
    Oregon    16000
    Texas     71000
    Utah       5000
    dtype: int64
    

    Here is another example:

    In [30]: states = ['California','Ohio','Oregon','Texas']
    
    In [31]: obj4 = pd.Series(sdata,index=states)
    
    In [32]: obj4
    Out[32]:
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64
    

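    In this example, the three values in sdata whose keys match labels in states are found and placed in the appropriate locations. Since sdata has no value for "California", that entry is NaN ("not a number"), which pandas uses to mark missing or NA values. The pandas functions isnull and notnull detect missing data: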

    In [33]: pd.isnull(obj4)
    Out[33]:
    California     True
    Ohio          False
    Oregon        False
    Texas         False
    dtype: bool
    
    In [34]: pd.notnull(obj4)
    Out[34]:
    California    False
    Ohio           True
    Oregon         True
    Texas          True
    dtype: bool
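
    The same checks are also available as methods on the Series itself. The following sketch rebuilds obj4 by hand (the values are copied from the output above) and uses the resulting boolean mask to count and filter missing entries. The names mask, n_missing, and present are only illustrative.

```python
import numpy as np
import pandas as pd

# obj4 as shown above, written out explicitly
obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

# Method form; gives the same result as pd.isnull(obj4)
mask = obj4.isnull()

# True counts as 1, so summing the mask counts the missing entries
n_missing = int(mask.sum())

# Boolean filtering keeps only the entries that have a value
present = obj4[obj4.notnull()]

print(n_missing)
print(present)
```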
    

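    For many applications, the most important feature of a Series is that it automatically aligns differently indexed data in arithmetic operations. A label that has no value in one of the operands yields NaN in the result: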

    In [35]: obj3
    Out[35]:
    Ohio      35000
    Oregon    16000
    Texas     71000
    Utah       5000
    dtype: int64
    
    In [36]: obj4
    Out[36]:
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64
    
    In [37]: obj3 + obj4
    Out[37]:
    California         NaN
    Ohio           70000.0
    Oregon         32000.0
    Texas         142000.0
    Utah               NaN
    dtype: float64
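
    If NaN for unmatched labels is not what you want, the add method accepts a fill_value argument. This is a minimal sketch using the same numbers as above; note that a label which is missing on both sides (here "California", absent from obj3 and NaN in obj4) still ends up as NaN.

```python
import numpy as np
import pandas as pd

obj3 = pd.Series({'Ohio': 35000, 'Oregon': 16000,
                  'Texas': 71000, 'Utah': 5000})
obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

# Plain addition: labels without a value on both sides become NaN
plain = obj3 + obj4

# fill_value=0 substitutes 0 where exactly one side is missing
filled = obj3.add(obj4, fill_value=0)

print(plain)
print(filled)
```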
    

    Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

    In [38]: obj4.name = 'population'
    
    In [39]: obj4.index.name = 'state'
    
    In [40]: obj4
    Out[40]:
    state
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    Name: population, dtype: float64
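
    One place where these names show up is when a Series is converted to a DataFrame. The sketch below is only an illustration of that point, using to_frame and reset_index; the variable names frame and flat are made up for the example.

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])
obj4.name = 'population'
obj4.index.name = 'state'

# The Series name becomes the column label; the index keeps its name
frame = obj4.to_frame()

# reset_index turns the named index into an ordinary column
flat = obj4.reset_index()

print(frame)
print(flat)
```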
    

    A Series's index can be altered in place by assignment:

    In [43]: obj.index = ['Bob','Steve','Jeff','Ryan']
    
    In [44]: obj
    Out[44]:
    Bob      4
    Steve    7
    Jeff    -5
    Ryan     3
    dtype: int64
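
    The new index has to contain exactly as many labels as the Series has values. A short sketch of what happens otherwise: pandas raises a ValueError and the existing index is left as it was. The variable error is only there so the exception can be inspected.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

try:
    # Two labels for four values: the lengths do not match
    obj.index = ['Bob', 'Steve']
    error = None
except ValueError as exc:
    error = exc

print(error)
print(obj.index)
```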
    

    DataFrame

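    A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.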

    There are many ways to construct a DataFrame; one of the most common is to pass a dict of equal-length lists or NumPy arrays:

    In [49]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        ...:         'year':[2000,2001,2002,2001,2002],
        ...:         'pop':[1.5,1.7,3.6,2.4,2.9]}
    
    In [50]: frame = pd.DataFrame(data)
    

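    As with a Series, the DataFrame gets a default index automatically. In the pandas version used here the columns come out sorted by name; recent versions keep the dict's insertion order instead: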

    In [51]: frame
    Out[51]:
       pop   state  year
    0  1.5    Ohio  2000
    1  1.7    Ohio  2001
    2  3.6    Ohio  2002
    3  2.4  Nevada  2001
    4  2.9  Nevada  2002
    

    If you specify a sequence of columns, the DataFrame's columns are arranged in exactly that order:

    In [52]: pd.DataFrame(data,columns=['year','state','pop'])
    Out[52]:
       year   state  pop
    0  2000    Ohio  1.5
    1  2001    Ohio  1.7
    2  2002    Ohio  3.6
    3  2001  Nevada  2.4
    4  2002  Nevada  2.9
    

    As with Series, if you pass a column that is not contained in the data, it appears with NA values in the result:

    In [56]: frame2 = pd.DataFrame(data,columns=['year','state','pop',
        ...: 'debt'],index=['one','two','three','four','five'])
    
    In [57]: frame2
    Out[57]:
           year   state  pop debt
    one    2000    Ohio  1.5  NaN
    two    2001    Ohio  1.7  NaN
    three  2002    Ohio  3.6  NaN
    four   2001  Nevada  2.4  NaN
    five   2002  Nevada  2.9  NaN
    
    In [58]: frame2.columns
    Out[58]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
    

    A column in a DataFrame can be retrieved as a Series either with dict-like notation or by attribute:

    In [59]: frame2['state']
    Out[59]:
    one        Ohio
    two        Ohio
    three      Ohio
    four     Nevada
    five     Nevada
    Name: state, dtype: object
    
    In [60]: frame2.year
    Out[60]:
    one      2000
    two      2001
    three    2002
    four     2001
    five     2002
    Name: year, dtype: int64
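
    Attribute access is convenient, but it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The "pop" column from this very dataset is such a collision, because DataFrame already has a pop method. A small sketch, with illustrative variable names:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])

# Dict-like notation always returns the column
by_key = frame2['pop']

# Attribute access finds the DataFrame.pop method, not the column
by_attr = frame2.pop

print(type(by_key))
print(callable(by_attr))
```

    For that reason the dict-like form frame2['pop'] is the safer default in scripts.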
    

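    The returned Series has the same index as the DataFrame, and its name attribute has been set appropriately. Rows can also be retrieved by label or by position, using the loc and iloc indexers respectively: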

    In [62]: frame2.loc['three']
    Out[62]:
    year     2002
    state    Ohio
    pop       3.6
    debt      NaN
    Name: three, dtype: object
    
    In [64]: frame2.iloc[0]
    Out[64]:
    year     2000
    state    Ohio
    pop       1.5
    debt      NaN
    Name: one, dtype: object
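
    Both indexers also accept lists and slices. The sketch below rebuilds frame2 from the article's data and selects several rows at once; it also shows that label slices with loc include the end label, whereas position slices with iloc exclude the end position. Variable names are illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

# Rows and columns chosen by label
by_label = frame2.loc[['one', 'three'], ['year', 'pop']]

# Rows chosen by position; the end position 3 is excluded
by_position = frame2.iloc[1:3]

# Label slice; the end label 'four' is included
label_slice = frame2.loc['two':'four']

print(by_label)
print(by_position)
print(label_slice)
```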
    

    Columns can be modified by assignment. For example, the empty "debt" column can be assigned a scalar value or an array of values:

    In [65]: frame2['debt'] = 16.5
    
    In [66]: frame2
    Out[66]:
           year   state  pop  debt
    one    2000    Ohio  1.5  16.5
    two    2001    Ohio  1.7  16.5
    three  2002    Ohio  3.6  16.5
    four   2001  Nevada  2.4  16.5
    five   2002  Nevada  2.9  16.5
    
    In [67]: frame2['debt'] = np.arange(5)
    
    In [68]: frame2
    Out[68]:
           year   state  pop  debt
    one    2000    Ohio  1.5     0
    two    2001    Ohio  1.7     1
    three  2002    Ohio  3.6     2
    four   2001  Nevada  2.4     3
    five   2002  Nevada  2.9     4
    

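    When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's index, and missing values are inserted wherever a label has no match: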

    In [71]: val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
    
    In [72]: frame2['debt'] = val
    
    In [73]: frame2
    Out[73]:
           year   state  pop  debt
    one    2000    Ohio  1.5   NaN
    two    2001    Ohio  1.7  -1.2
    three  2002    Ohio  3.6   NaN
    four   2001  Nevada  2.4  -1.5
    five   2002  Nevada  2.9  -1.7
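
    The two rules can be checked side by side. In this sketch a list of the wrong length is rejected with a ValueError, while a Series is aligned by label: labels that the DataFrame does not have (the made-up label 'six' here) are simply ignored.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

try:
    # Three values for five rows: rejected
    frame2['debt'] = [1.0, 2.0, 3.0]
    error = None
except ValueError as exc:
    error = exc

# A Series is matched by label, not by length;
# 'six' is a hypothetical label that frame2 does not contain
val = pd.Series([-1.2, 0.5], index=['two', 'six'])
frame2['debt'] = val

print(error)
print(frame2)
```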
    

    Assigning to a column that does not exist creates a new column. The del keyword deletes a column:

    In [82]: frame2['eastern'] = frame2.state == 'Ohio'
    
    In [83]: frame2
    Out[83]:
           year   state  pop  debt  eastern
    one    2000    Ohio  1.5   NaN     True
    two    2001    Ohio  1.7  -1.2     True
    three  2002    Ohio  3.6   NaN     True
    four   2001  Nevada  2.4  -1.5    False
    five   2002  Nevada  2.9  -1.7    False
    
    In [84]: del frame2['eastern']
    
    In [85]: frame2.columns
    Out[85]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
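
    Two related details, sketched under the same data: the drop method removes a column without touching the original DataFrame, and a new column cannot be created with attribute syntax. The column name "western" is invented for this illustration, and the warning pandas emits for the attribute assignment is silenced on purpose.

```python
import warnings

import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data)
frame2['eastern'] = frame2['state'] == 'Ohio'

# drop returns a new DataFrame; frame2 still has the column
without = frame2.drop(columns=['eastern'])
kept_in_original = 'eastern' in frame2.columns

# del removes the column in place
del frame2['eastern']

# Attribute assignment sets a plain Python attribute, not a column
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame2.western = frame2['state'] == 'Nevada'
created_by_attribute = 'western' in frame2.columns

print(kept_in_original, created_by_attribute)
```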
    

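    The column returned when indexing a DataFrame is a view on the underlying data, not a copy, so in the pandas version used here any in-place modification to that Series is reflected in the DataFrame. Call the Series's copy method when you need an independent copy.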
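
    Whether a selected column shares memory with its DataFrame depends on the pandas version and its copy-on-write settings, so the sketch below only demonstrates the part that holds everywhere: an explicit copy is independent of the original. The small frame and the name pop_copy are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

# An explicit copy owns its data
pop_copy = frame['pop'].copy()
pop_copy.iloc[0] = 99.0

# The original DataFrame is unchanged
print(frame)
print(pop_copy)
```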

    Another common form of data is a nested dict of dicts:

    In [86]: pop = {'Nevada':{2001:2.4,2002:2.9},
        ...: 'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
    
    In [87]: frame3 = pd.DataFrame(pop)
    
    In [88]: frame3
    Out[88]:
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    2002     2.9   3.6
    

    The outer dict keys become the columns and the inner keys become the row index.
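
    If the opposite orientation is needed, the result can be transposed with the T attribute. Passing an explicit index to the constructor is another option: the inner keys are then matched against that index instead of being combined. A minimal sketch, where the year 2003 is added only to show that an unmatched label produces NaN:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

# Swap rows and columns: states become the row index
transposed = frame3.T

# Explicit row index; 2003 is not in the data, so its row is all NaN
subset = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(transposed)
print(subset)
```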

    Dicts of Series are treated in much the same way:

    In [95]: pdata = {'Ohio':frame3['Ohio'][:-1],
        ...: 'Nevada':frame3['Nevada'][:2]}
    
    In [96]: pd.DataFrame(pdata)
    Out[96]:
          Nevada  Ohio
    2000     NaN   1.5
    2001     2.4   1.7
    

    Data that can be passed to the DataFrame constructor:
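
    The full list is not reproduced here. As a partial illustration, the sketch below shows three input forms the constructor accepts besides the dicts used earlier: a 2D NumPy array, a list of dicts, and a list of lists. All labels and values are made up for the example.

```python
import numpy as np
import pandas as pd

# 2D ndarray, with optional row and column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['a', 'b'], columns=['x', 'y', 'z'])

# List of dicts: each dict is one row, and the union of keys forms the columns
from_records = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'z': 4}])

# List of lists: each inner list is one row
from_rows = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'letter'])

print(from_array)
print(from_records)
print(from_rows)
```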


        Article link: https://www.haomeiwen.com/subject/aaphzxtx.html