美文网首页我爱编程
十分钟快速入门pandas

十分钟快速入门pandas

作者: nummycode | 来源:发表于2017-04-29 09:55 被阅读385次

    译自10 Minutes to pandas

    pandas是一个python库,它提供了一些快速,灵活而且富于表现力的数据结构,用于处理关系型数据,为数据分析提供了强大的支持。

    下面简要介绍一下pandas,主要是针对新用户,以下操作均在ipython中执行,想学习更多内容的话请参考cookbook

    首先,我们需要导入以下几个库:

    In [1]: import pandas as pd
    In [2]: import numpy as np
    In [3]: import matplotlib.pyplot as plt
    

    创建对象

    通过传递列表来创建Series:

    In [4]: s = pd.Series([1,3,5,np.nan,6,8])
    In [5]: s
    Out[5]: 
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    

    下面的代码使用numpy矩阵来创建一个DataFrame,并指定index和column:

    In [6]: dates = pd.date_range('20130101', periods=6)
    
    In [7]: dates
    Out[7]: 
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    
    In [9]: df
    Out[9]: 
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    

    通过字典来创建DataFrame:

    In [10]: df2 = pd.DataFrame({ 'A' : 1.,
       ....:                      'B' : pd.Timestamp('20130102'),
       ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
       ....:                      'D' : np.array([3] * 4,dtype='int32'),
       ....:                      'E' : pd.Categorical(["test","train","test","train"]),
       ....:                      'F' : 'foo' })
       ....: 
    
    In [11]: df2
    Out[11]: 
         A          B    C  D      E    F
    0  1.0 2013-01-02  1.0  3   test  foo
    1  1.0 2013-01-02  1.0  3  train  foo
    2  1.0 2013-01-02  1.0  3   test  foo
    3  1.0 2013-01-02  1.0  3  train  foo
    

    输出每列的数据类型:

    In [12]: df2.dtypes
    Out[12]: 
    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object
    

    使用ipython的话,支持命令补全:

    In [13]: df2.<TAB>
    df2.A                  df2.boxplot
    df2.abs                df2.C
    df2.add                df2.clip
    df2.add_prefix         df2.clip_lower
    df2.add_suffix         df2.clip_upper
    df2.align              df2.columns
    df2.all                df2.combine
    df2.any                df2.combineAdd
    df2.append             df2.combine_first
    df2.apply              df2.combineMult
    df2.applymap           df2.compound
    df2.as_blocks          df2.consolidate
    df2.asfreq             df2.convert_objects
    df2.as_matrix          df2.copy
    df2.astype             df2.corr
    df2.at                 df2.corrwith
    df2.at_time            df2.count
    df2.axes               df2.cov
    df2.B                  df2.cummax
    df2.between_time       df2.cummin
    df2.bfill              df2.cumprod
    df2.blocks             df2.cumsum
    df2.bool               df2.D
    

    查看数据

    查看前几行数据或者后几行数据:

    In [14]: df.head()
    Out[14]: 
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    
    In [15]: df.tail(3)
    Out[15]: 
                       A         B         C         D
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    

    显示行号,列号以及取值:

    In [16]: df.index
    Out[16]: 
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [17]: df.columns
    Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')
    
    In [18]: df.values
    Out[18]: 
    array([[ 0.4691, -0.2829, -1.5091, -1.1356],
           [ 1.2121, -0.1732,  0.1192, -1.0442],
           [-0.8618, -2.1046, -0.4949,  1.0718],
           [ 0.7216, -0.7068, -1.0396,  0.2719],
           [-0.425 ,  0.567 ,  0.2762, -1.0874],
           [-0.6737,  0.1136, -1.4784,  0.525 ]])
    

    展示原始数据的统计数据:

    In [19]: df.describe()
    Out[19]: 
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.073711 -0.431125 -0.687758 -0.233103
    std    0.843157  0.922818  0.779887  0.973118
    min   -0.861849 -2.104569 -1.509059 -1.135632
    25%   -0.611510 -0.600794 -1.368714 -1.076610
    50%    0.022070 -0.228039 -0.767252 -0.386188
    75%    0.658444  0.041933 -0.034326  0.461706
    max    1.212112  0.567020  0.276232  1.071804
    

    转置数据:

    In [20]: df.T
    Out[20]: 
       2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
    A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
    B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
    C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
    D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
    

    按轴排序:

    In [21]: df.sort_index(axis=1, ascending=False)
    Out[21]: 
                       D         C         B         A
    2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
    2013-01-02 -1.044236  0.119209 -0.173215  1.212112
    2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
    2013-01-04  0.271860 -1.039575 -0.706771  0.721555
    2013-01-05 -1.087401  0.276232  0.567020 -0.424972
    2013-01-06  0.524988 -1.478427  0.113648 -0.673690
    

    按值排序:

    In [22]: df.sort_values(by='B')
    Out[22]: 
                       A         B         C         D
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    

    选择器

    获取数据
    选择单列,产生一个Series序列,等价于df.A:

    In [23]: df['A']
    Out[23]: 
    2013-01-01    0.469112
    2013-01-02    1.212112
    2013-01-03   -0.861849
    2013-01-04    0.721555
    2013-01-05   -0.424972
    2013-01-06   -0.673690
    Freq: D, Name: A, dtype: float64
    

    使用[]选取某几行的数据:

    In [24]: df[0:3]
    Out[24]: 
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    
    In [25]: df['20130102':'20130104']
    Out[25]: 
                       A         B         C         D
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    

    使用标签选取数据
    使用标签获取横截面:

    In [26]: df.loc[dates[0]]
    Out[26]: 
    A    0.469112
    B   -0.282863
    C   -1.509059
    D   -1.135632
    Name: 2013-01-01 00:00:00, dtype: float64
    

    使用标签选择多轴数据:

    In [27]: df.loc[:,['A','B']]
    Out[27]: 
                       A         B
    2013-01-01  0.469112 -0.282863
    2013-01-02  1.212112 -0.173215
    2013-01-03 -0.861849 -2.104569
    2013-01-04  0.721555 -0.706771
    2013-01-05 -0.424972  0.567020
    2013-01-06 -0.673690  0.113648
    

    显示标签切片, 包含两个端点

    In [28]: df.loc['20130102':'20130104',['A','B']]
    Out[28]: 
                       A         B
    2013-01-02  1.212112 -0.173215
    2013-01-03 -0.861849 -2.104569
    2013-01-04  0.721555 -0.706771
    

    降低返回对象维度:

    In [29]: df.loc['20130102',['A','B']]
    Out[29]: 
    A    1.212112
    B   -0.173215
    Name: 2013-01-02 00:00:00, dtype: float64
    

    获取标量值:

    In [30]: df.loc[dates[0],'A']
    Out[30]: 0.46911229990718628
    

    快速访问并获取标量数据 (等价于上面的方法):

    In [31]: df.at[dates[0],'A']
    Out[31]: 0.46911229990718628
    

    待续....

    相关文章

      网友评论

        本文标题:十分钟快速入门pandas

        本文链接:https://www.haomeiwen.com/subject/ydkzzttx.html