python3.6 pandas: Series and DataFrame

Author: LeeMin_Z | Published: 2018-07-26 23:20

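    pandas is a library built on top of NumPy; together with NumPy it is used mainly for scientific computing and data processing.

    It is also the library that let me forget about expensive MATLAB, and forced me to brush up on SQL.

    The conventional imports: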

    In [105]: from pandas import Series,DataFrame
    
    In [106]: import pandas as pd
    
    In [107]: import numpy as np 
    

    Series

    A Series is like a one-dimensional array: it consists of a sequence of values and an associated index.

    In [68]: o2 = Series([4,-7,35,99])
    
    In [69]: o2
    Out[69]:
    0     4
    1    -7
    2    35
    3    99
    dtype: int64
    
    In [70]: o2 = Series([4,-7,35,99],index=['a','b','c','d'])
    
    In [71]: o2
    Out[71]:
    a     4
    b    -7
    c    35
    d    99
    dtype: int64
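    Because every value carries a label, a Series can be read by label as well as filtered like an array. The following is a minimal sketch of that idea; the variable names (value_c, positives, from_dict) are made up for illustration and are not from the original session.

```python
import pandas as pd

# Same data as the session above, written with the pd. prefix
o2 = pd.Series([4, -7, 35, 99], index=['a', 'b', 'c', 'd'])

# Look up a single value by its label
value_c = o2['c']

# Boolean filtering keeps the labels attached to the surviving values
positives = o2[o2 > 0]

# A Series can also be built from a dict; the keys become the index
from_dict = pd.Series({'x': 1, 'y': 2})

print(value_c)
print(positives)
print(list(from_dict.index))
```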
    

    DataFrame

    A DataFrame is a tabular data structure; it can be viewed as a dict of Series that all share the same index.

    In [21]: frame = DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],
        ...: columns=['Ohio','Texas','Califor'])
    
    In [22]: frame
    Out[22]:
       Ohio  Texas  Califor
    a     0      1        2
    b     3      4        5
    c     6      7        8
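    The "dict of Series" view can be made literal. This sketch assumes two hand-written Series (ohio and texas are illustrative names and values) and shows that the dict keys become columns while the shared index becomes the row labels.

```python
import pandas as pd

# Two Series that share the same index (illustrative sample data)
ohio = pd.Series([0, 3, 6], index=['a', 'b', 'c'])
texas = pd.Series([1, 4, 7], index=['a', 'b', 'c'])

# A dict of Series becomes a DataFrame: keys -> columns, shared index -> rows
frame = pd.DataFrame({'Ohio': ohio, 'Texas': texas})

# Selecting one column gives back a Series with that same index
col = frame['Texas']

print(frame)
print(type(col).__name__, list(col.index))
```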
    
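    1. Filling values in arithmetic methods
      1.1. When two objects of different shapes are added directly, every position that exists in only one of them becomes NaN.
      1.2. Passing fill_value to an arithmetic method substitutes that value for the missing operand instead.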
    In [31]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
    
    In [32]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
    
    In [33]: df1
    Out[33]:
         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0
    
    In [34]: df2
    Out[34]:
          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   6.0   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    
    In [35]: df1 + df2
    Out[35]:
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
    
    In [36]: df1.add(df2,fill_value=0)
    Out[36]:
          a     b     c     d     e
    0   0.0   2.0   4.0   6.0   4.0
    1   9.0  11.0  13.0  15.0   9.0
    2  18.0  20.0  22.0  24.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
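    One detail worth checking for yourself: fill_value only stands in for an operand that is missing on one side. If a cell is missing in both frames, the result stays NaN. A small sketch with made-up frames x and y:

```python
import numpy as np
import pandas as pd

# x has no column 'b'; both x and y are missing the value at row 1, column 'a'
x = pd.DataFrame({'a': [1.0, np.nan]})
y = pd.DataFrame({'a': [10.0, np.nan], 'b': [5.0, 6.0]})

result = x.add(y, fill_value=0)

# Column 'b' is filled from y (0 + value); the doubly missing cell remains NaN
print(result)
```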
    
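    2. Operations between DataFrame and Series: broadcasting
      2.1. By default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows.
      2.2. To match on the row index and broadcast across the columns instead, use an arithmetic method (for example sub) with axis=0.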
    In [41]: arr = np.arange(12.).reshape(3,4)
    
    In [42]: arr
    Out[42]:
    array([[  0.,   1.,   2.,   3.],
           [  4.,   5.,   6.,   7.],
           [  8.,   9.,  10.,  11.]])
    
    In [43]: arr - arr[0]
    Out[43]:
    array([[ 0.,  0.,  0.,  0.],
           [ 4.,  4.,  4.,  4.],
           [ 8.,  8.,  8.,  8.]])
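    The NumPy session above only shows row-wise broadcasting on a plain array. Here is a sketch of both directions on a DataFrame; the frame and its labels are illustrative and not taken from the original session.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index (b, d, e) is matched against the columns,
# and the subtraction is repeated for every row
first_row = frame.iloc[0]
by_row = frame - first_row

# Matching on the row index instead needs the arithmetic method with axis=0
col_d = frame['d']
by_col = frame.sub(col_d, axis=0)

print(by_row)
print(by_col)
```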
    
    3. Function application and mapping
      Usually done by passing a lambda or a named function to apply.
    # with a lambda
    In [56]: frame
    Out[56]:
                   b         d         e
    Utah    0.073770 -0.264937  1.085603
    Ohio    1.274547  0.820050  0.056422
    Texas   1.346414  1.786314 -0.311222
    Oregon  0.571323 -0.731404  0.502011
    
    In [57]: f = lambda x : x.max() - x.min()
    
    In [58]: frame.apply(f)
    Out[58]:
    b    1.272643
    d    2.517719
    e    1.396825
    dtype: float64
    
    In [59]: frame.apply(f,axis=1)
    Out[59]:
    Utah      1.350540
    Ohio      1.218125
    Texas     2.097536
    Oregon    1.302727
    dtype: float64
    
    # with a named function
    In [60]: def f(x):
        ...:     return Series([x.min(),x.max()],index=['min','max'])
    
    In [61]: frame.apply(f)
    Out[61]:
                b         d         e
    min  0.073770 -0.731404 -0.311222
    max  1.346414  1.786314  1.085603
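    apply works on whole rows or columns. For element-wise mapping, one option is Series.map, applied either to a single column or column by column through apply. A minimal sketch, assuming a small frame with values borrowed from the output above and a hypothetical formatting helper fmt:

```python
import pandas as pd

# Small frame for illustration; values copied from two rows of the session above
frame = pd.DataFrame({'b': [0.073770, 1.274547], 'e': [1.085603, 0.056422]},
                     index=['Utah', 'Ohio'])

def fmt(x):
    """Format one number with two decimal places."""
    return '%.2f' % x

# Series.map applies the function to each element of one column
e_formatted = frame['e'].map(fmt)

# Mapping every column in turn covers the whole frame
all_formatted = frame.apply(lambda col: col.map(fmt))

print(e_formatted)
print(all_formatted)
```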
    
    4. Summarizing and computing descriptive statistics
    In [70]: df
    Out[70]:
              0         1         2
    a  1.037884  0.932937  0.480702
    a -1.453084 -1.039968  0.306588
    b  0.352103  0.083231 -0.264383
    b  0.628823 -0.454043 -0.993764
    
    In [71]: df.describe()
    Out[71]:
                  0         1         2
    count  4.000000  4.000000  4.000000
    mean   0.141432 -0.119461 -0.117714
    std    1.099703  0.838233  0.665109
    min   -1.453084 -1.039968 -0.993764
    25%   -0.099194 -0.600524 -0.446728
    50%    0.490463 -0.185406  0.021103
    75%    0.731088  0.295658  0.350117
    max    1.037884  0.932937  0.480702
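    describe bundles several statistics at once; each one is also available as its own method. The sketch below uses a made-up frame with some missing values to show the axis and skipna options.

```python
import numpy as np
import pandas as pd

# Illustrative data with a few missing values
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                      # one total per column, NaN skipped
row_means = df.mean(axis=1)              # one mean per row, NaN skipped
strict = df.mean(axis=1, skipna=False)   # any NaN in a row makes the result NaN
largest = df.idxmax()                    # row label of the maximum in each column

print(col_sums)
print(row_means)
print(strict)
print(largest)
```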
    
    5. Handling missing data
    In [89]: df1
    Out[89]:
              0         1         2
    0  1.700089       NaN       NaN
    1  0.209934       NaN       NaN
    2 -1.300037       NaN       NaN
    3 -0.044868       NaN  1.712725
    4  0.624518       NaN -0.559871
    5 -1.036317  1.075744  1.267794
    6 -0.201066  0.268681 -0.356206
    
    In [90]: df1.fillna(0)
    Out[90]:
              0         1         2
    0  1.700089  0.000000  0.000000
    1  0.209934  0.000000  0.000000
    2 -1.300037  0.000000  0.000000
    3 -0.044868  0.000000  1.712725
    4  0.624518  0.000000 -0.559871
    5 -1.036317  1.075744  1.267794
    6 -0.201066  0.268681 -0.356206
    
    In [91]: df1
    Out[91]:
              0         1         2
    0  1.700089       NaN       NaN
    1  0.209934       NaN       NaN
    2 -1.300037       NaN       NaN
    3 -0.044868       NaN  1.712725
    4  0.624518       NaN -0.559871
    5 -1.036317  1.075744  1.267794
    6 -0.201066  0.268681 -0.356206
    
    In [92]: df1.fillna({1:0.5,2:33})
    Out[92]:
              0         1          2
    0  1.700089  0.500000  33.000000
    1  0.209934  0.500000  33.000000
    2 -1.300037  0.500000  33.000000
    3 -0.044868  0.500000   1.712725
    4  0.624518  0.500000  -0.559871
    5 -1.036317  1.075744   1.267794
    6 -0.201066  0.268681  -0.356206
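    fillna is one way to deal with NaN; dropping incomplete rows or carrying values forward are common alternatives. A sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 0 has one gap, row 1 has two, row 2 is complete
df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [2.0, np.nan, np.nan],
                   [4.0, 5.0, 6.0]])

complete_rows = df.dropna()             # keep only rows without any NaN
mostly_complete = df.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
forward = df.ffill()                    # carry the last valid value down each column

print(complete_rows)
print(mostly_complete)
print(forward)
```

    Like fillna in the session above, these calls return new objects and leave df unchanged.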
    
    6. Hierarchical indexing (MultiIndex)
      6.1. The basis is an index with more than one level
    In [100]: data = Series(np.random.rand(10),index=[['a','a','a','b','b','b','c',
         ...: 'c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
    
    In [101]: data
    Out[101]:
    a  1    0.676413
       2    0.623518
       3    0.414257
    b  1    0.434586
       2    0.905924
       3    0.726079
    c  1    0.693546
       2    0.708168
    d  2    0.667362
       3    0.789808
    dtype: float64
    
    In [102]: data.index
    Out[102]:
    MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
               labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
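    A multi-level index allows partial selection: by an outer label alone, or by an inner label across all outer labels. This sketch rebuilds the same index shape with deterministic values (np.arange instead of random numbers) so the results are easy to follow.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Select everything under outer label 'b'; the result is indexed by the inner level
outer_b = data.loc['b']

# Select inner label 2 under every outer label
inner_2 = data.loc[:, 2]

print(outer_b)
print(inner_2)
```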
    

    6.2. unstack turns it from a Series into a DataFrame

    In [114]: data.unstack()
    Out[114]:
              1         2         3
    a  0.676413  0.623518  0.414257
    b  0.434586  0.905924  0.726079
    c  0.693546  0.708168       NaN
    d       NaN  0.667362  0.789808
    

    6.3. The inverse of unstack is stack

    In [115]: data.unstack().stack()
    Out[115]:
    a  1    0.676413
       2    0.623518
       3    0.414257
    b  1    0.434586
       2    0.905924
       3    0.726079
    c  1    0.693546
       2    0.708168
    d  2    0.667362
       3    0.789808
    dtype: float64
    

    6.4. Either axis of a DataFrame can carry a hierarchical index

    
    In [118]: frame =DataFrame(np.arange(12).reshape(4,3),
         ...: index = [['a','a','b','b'],[1,2,1,2]],
         ...: columns = [['city1','city1','city2'],['G','R','G']])
    
    In [120]: frame
    Out[120]:
        city1     city2
            G   R     G
    a 1     0   1     2
      2     3   4     5
    b 1     6   7     8
      2     9  10    11
    
    
    In [121]: frame.index.names = ['key1','key2']
    
    In [122]: frame.columns.names = ['citys','color']
    
    In [123]: frame
    Out[123]:
    citys     city1     city2
    color         G   R     G
    key1 key2
    a    1        0   1     2
         2        3   4     5
    b    1        6   7     8
         2        9  10    11
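    Once the levels are named, they can be used for selection, reordering and aggregation. A sketch on the same frame; the result names (city1, by_key2, swapped) are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['city1', 'city1', 'city2'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['citys', 'color']

# Partial column selection: everything under the outer column label 'city1'
city1 = frame['city1']

# Aggregate by one index level: sum the rows that share the same key2
by_key2 = frame.groupby(level='key2').sum()

# Swap the two row levels, then sort so that key2 becomes the outer level
swapped = frame.swaplevel('key1', 'key2').sort_index()

print(city1)
print(by_key2)
print(swapped)
```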
    
    
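    7. Using DataFrame columns as the index
      7.1. set_index moves one or more columns into the index; drop=False keeps the original columns as well.
      7.2. reset_index undoes 7.1, moving the index levels back into columns.
    # 7.1. set_index example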
    In [134]: f
    Out[134]:
       a  b    c  d
    0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
    3  3  4  two  0
    4  4  3  two  1
    5  5  2  two  2
    6  6  1  two  3
    
    In [135]: f.set_index(['c','d'])
    Out[135]:
           a  b
    c   d
    one 0  0  7
        1  1  6
        2  2  5
    two 0  3  4
        1  4  3
        2  5  2
        3  6  1
    
    In [136]: f.set_index(['c','d'],drop=False)
    Out[136]:
           a  b    c  d
    c   d
    one 0  0  7  one  0
        1  1  6  one  1
        2  2  5  one  2
    two 0  3  4  two  0
        1  4  3  two  1
        2  5  2  two  2
        3  6  1  two  3
    
    # 7.2. reset_index example 
    
    In [137]: frame2=  f.set_index(['c','d'])
    
    In [139]: frame2
    Out[139]:
           a  b
    c   d
    one 0  0  7
        1  1  6
        2  2  5
    two 0  3  4
        1  4  3
        2  5  2
        3  6  1
    
    In [140]: frame2.reset_index()
    Out[140]:
         c  d  a  b
    0  one  0  0  7
    1  one  1  1  6
    2  one  2  2  5
    3  two  0  3  4
    4  two  1  4  3
    5  two  2  5  2
    6  two  3  6  1
    
    
    8. Panel data / a three-dimensional DataFrame

    The book mentions that it is rarely used and can usually be reduced to two dimensions.
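    One way to reduce three-dimensional data to two is to stack several 2-D tables under an extra index level. Note that the Panel class itself was deprecated in pandas 0.20 and removed in 0.25, so on current versions a hierarchically indexed DataFrame is the usual substitute. A minimal sketch, with made-up tables jan and feb standing in for the items of a panel:

```python
import pandas as pd

# Two illustrative 2-D tables that would have been separate items of a panel
jan = pd.DataFrame({'open': [1.0, 2.0], 'close': [1.5, 2.5]}, index=['d1', 'd2'])
feb = pd.DataFrame({'open': [3.0, 4.0], 'close': [3.5, 4.5]}, index=['d1', 'd2'])

# keys adds an outer index level, so the third dimension becomes a row level
stacked = pd.concat([jan, feb], keys=['jan', 'feb'], names=['month', 'day'])

# Selecting one outer label recovers the original 2-D table
feb_only = stacked.loc['feb']

print(stacked)
print(feb_only)
```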


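    To me these pandas features feel a lot like Excel's VBA. Languages really are alike: underneath it is all matrices and logic, and I can look things up in the reference book when I need them.

    That said, data analysis also turns out to be very handy for troubleshooting, which I never expected.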

    2018.7.20
