Learning pandas

Author: 我是上帝可爱多 | Published: 2017-08-29 10:18

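    Lately I have been feeling rather lost about what to study next, so I might as well spend some time walking you through pandas, the Python data analysis library.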

    In [1]: import pandas as pd
    
    In [2]: import numpy as np
    
    In [3]: s = pd.Series([1,3,5,np.nan,6,8])
    
    In [4]: s
    Out[4]: 
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    
    In [5]: s.index   # older pandas versions printed an Int64Index here
    Out[5]: RangeIndex(start=0, stop=6, step=1)
    

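    Series is one of the core data structures in pandas: a one-dimensional array in which every value is paired with an index label. The output above makes that fairly obvious.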
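To make the idea concrete, here is a minimal sketch of a Series with explicit labels instead of the default 0..n-1 index. The labels and values are made up for illustration and are not from the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three labelled values, one of them missing.
prices = pd.Series([9.0, np.nan, 7.0], index=["yahoo", "google", "facebook"])

# Look a value up by its label, much like a dict.
yahoo_price = prices["yahoo"]

# Count the missing values, then drop them.
n_missing = int(prices.isna().sum())
clean = prices.dropna()

print(yahoo_price)         # 9.0
print(n_missing)           # 1
print(list(clean.index))   # ['yahoo', 'facebook']
```

Note that the NaN is what forces the dtype to float64, which is also why the integers in the Series shown earlier print as 1.0, 3.0 and so on.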


    The next structure, the DataFrame, will look familiar: it closely resembles a table in a database.

    In [6]: dates = pd.date_range('20130101', periods=6)
    
    In [7]: dates
    Out[7]: 
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05', '2013-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
    In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    
    In [9]: df
    Out[9]: 
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    

    Here we can see the three parts that make up a DataFrame: the columns, the index, and the data that fills the table.
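Those three parts can each be pulled back out of a frame. The following is a small sketch that uses deterministic values (my own choice, so the result is reproducible) rather than random numbers.

```python
import numpy as np
import pandas as pd

# Small deterministic frame; the values are for illustration only.
dates = pd.date_range("20130101", periods=3)
df = pd.DataFrame(np.arange(6).reshape(3, 2), index=dates, columns=list("AB"))

print(df.index)       # the row labels (a DatetimeIndex here)
print(df.columns)     # the column labels
print(df.to_numpy())  # the data as a plain NumPy array
print(df.shape)       # (3, 2)
```

`to_numpy()` is the method the pandas documentation now recommends over the older `.values` attribute, although both return the underlying array here.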

    >>> import pandas as pd 
    >>> from pandas import Series, DataFrame 
    
    >>> data = {"name":["yahoo","google","facebook"], "marks":[200,400,800], "price":[9, 3, 7]} 
    >>> f1 = DataFrame(data) 
    >>> f1 
         marks  name      price 
    0    200    yahoo     9 
    1    400    google    3 
    2    800    facebook  7 
    

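    Seeing this, you can tell what a DataFrame is for: it holds tabular data, with one column per dict key.
    In the output above the columns came out in ascending alphabetical order, which is how older pandas versions handled dict input; recent versions keep the dict's insertion order instead. Either way, the order can be set explicitly with the columns argument.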

    >>> f2 = DataFrame(data, columns=['name','price','marks']) 
    >>> f2 
           name     price  marks 
    0     yahoo     9      200 
    1    google     3      400 
    2  facebook     7      800 
    
    In [10]: df2 = pd.DataFrame({ 'A' : 1.,
       ....:                      'B' : pd.Timestamp('20130102'),
       ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
       ....:                      'D' : np.array([3] * 4,dtype='int32'),
       ....:                      'E' : pd.Categorical(["test","train","test","train"]),
       ....:                      'F' : 'foo' })
       ....: 
    
    In [11]: df2
    Out[11]: 
         A          B    C  D      E    F
    0  1.0 2013-01-02  1.0  3   test  foo
    1  1.0 2013-01-02  1.0  3  train  foo
    2  1.0 2013-01-02  1.0  3   test  foo
    3  1.0 2013-01-02  1.0  3  train  foo
    
    In [12]: df2.dtypes
    Out[12]: 
    A           float64
    B    datetime64[ns]
    C           float32
    D             int32
    E          category
    F            object
    dtype: object
    
    >>> data = {"name":["yahoo","google","facebook"], "marks":[200,400,800], "price":[9, 3, 7]} 
    >>> f3 = DataFrame(data, columns=['name', 'price', 'marks', 'debt'], index=['a','b','c']) 
    >>> f3 
           name      price  marks  debt 
    a     yahoo      9      200     NaN 
    b    google      3      400     NaN 
    c  facebook      7      800     NaN 
    
    >>> f3.columns 
    Index(['name', 'price', 'marks', 'debt'], dtype=object) 
    
    >>> f3['name'] 
    a       yahoo 
    b      google 
    c    facebook 
    Name: name 
    
    >>> f3['debt'] = 89.2 
    >>> f3 
           name     price  marks  debt 
    a     yahoo     9        200  89.2 
    b    google     3        400  89.2 
    c  facebook     7        800  89.2
    
    >>> sdebt = Series([2.2, 3.3], index=["a","c"])    # note the index: values are matched by label
    >>> f3['debt'] = sdebt 
    
    >>> f3 
           name  price  marks  debt 
    a     yahoo  9        200   2.2 
    b    google  3        400   NaN 
    c  facebook  7        800   3.3
    
    >>> f3.loc["c", "price"] = 300    # .loc avoids chained assignment, which may modify a copy
    >>> f3 
           name   price   marks  debt 
    a     yahoo   9       200    2.2 
    b    google   3       400    NaN 
    c  facebook   300     800    3.3 
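The alignment rule behind the debt assignment is easier to see in isolation. This is a small sketch with made-up values: assigning a Series to a column matches on index labels, not on position, and any row label missing from the Series ends up as NaN.

```python
import pandas as pd

frame = pd.DataFrame({"price": [9, 3, 7]}, index=["a", "b", "c"])

# The Series is deliberately out of order and has no entry for "b".
debt = pd.Series([3.3, 2.2], index=["c", "a"])
frame["debt"] = debt

# Set a single cell with .loc[row_label, column_label].
frame.loc["c", "price"] = 300

print(frame)
```

Row "a" receives 2.2 and row "c" receives 3.3 even though the Series lists them the other way round, and row "b" is left as NaN.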
    

    See the top & bottom rows of the frame

    In [14]: df.head()
    Out[14]: 
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    
    In [15]: df.tail(3)
    Out[15]: 
                       A         B         C         D
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    
    In [18]: df.values
    Out[18]: 
    array([[ 0.4691, -0.2829, -1.5091, -1.1356],
           [ 1.2121, -0.1732,  0.1192, -1.0442],
           [-0.8618, -2.1046, -0.4949,  1.0718],
           [ 0.7216, -0.7068, -1.0396,  0.2719],
           [-0.425 ,  0.567 ,  0.2762, -1.0874],
           [-0.6737,  0.1136, -1.4784,  0.525 ]])
    

    head() returns the first 5 rows by default; pass a number, as in tail(3), to ask for a different count.
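A quick way to convince yourself of the defaults is to try them on a frame whose rows are easy to recognise. A minimal sketch, using a made-up single-column frame:

```python
import pandas as pd

df = pd.DataFrame({"n": range(10)})

first_two = df.head(2)     # rows 0 and 1
last_three = df.tail(3)    # rows 7, 8 and 9
default_head = df.head()   # first 5 rows when no count is given

print(len(default_head))          # 5
print(first_two["n"].tolist())    # [0, 1]
print(last_three["n"].tolist())   # [7, 8, 9]
```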

    describe() shows a quick statistical summary of your data

    In [19]: df.describe()
    Out[19]: 
                  A         B         C         D
    count  6.000000  6.000000  6.000000  6.000000
    mean   0.073711 -0.431125 -0.687758 -0.233103
    std    0.843157  0.922818  0.779887  0.973118
    min   -0.861849 -2.104569 -1.509059 -1.135632
    25%   -0.611510 -0.600794 -1.368714 -1.076610
    50%    0.022070 -0.228039 -0.767252 -0.386188
    75%    0.658444  0.041933 -0.034326  0.461706
    max    1.212112  0.567020  0.276232  1.071804
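Because the frame above is random, its summary cannot be checked by hand. Here is a small sketch on fixed numbers (a hypothetical four-row frame) where the statistics are easy to verify, and which also shows that describe() skips non-numeric columns by default.

```python
import pandas as pd

scores = pd.DataFrame({"math": [25, 44, 13, 26], "name": list("abcd")})

summary = scores.describe()

print(list(summary.columns))         # ['math']
print(summary.loc["mean", "math"])   # 27.0
print(summary.loc["50%", "math"])    # 25.5
```

The result of describe() is itself a DataFrame, so individual statistics can be read with `.loc` like any other cell.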
    

    Transposing your data

    In [20]: df.T
    Out[20]: 
       2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
    A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
    B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
    C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
    D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988
    
    In [22]: df.sort_values(by='B')
    Out[22]: 
                       A         B         C         D
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-06 -0.673690  0.113648 -1.478427  0.524988
    2013-01-05 -0.424972  0.567020  0.276232 -1.087401
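Sorting has a couple of useful variations. The sketch below, on a tiny made-up frame, sorts by a column in descending order and by the index labels; both calls return a new frame and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"B": [0.5, -2.1, 0.1]}, index=["x", "y", "z"])

by_b_desc = df.sort_values(by="B", ascending=False)   # largest B first
by_label_desc = df.sort_index(ascending=False)        # labels in reverse order

print(by_b_desc.index.tolist())       # ['x', 'z', 'y']
print(by_label_desc.index.tolist())   # ['z', 'y', 'x']
print(df.index.tolist())              # ['x', 'y', 'z']
```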
    

    Selecting via [], which slices the rows.

    In [24]: df[0:3]
    Out[24]: 
                       A         B         C         D
    2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    
    In [25]: df['20130102':'20130104']
    Out[25]: 
                       A         B         C         D
    2013-01-02  1.212112 -0.173215  0.119209 -1.044236
    2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2013-01-04  0.721555 -0.706771 -1.039575  0.271860
    
    In [26]: df.loc[dates[0]]
    Out[26]: 
    A    0.469112
    B   -0.282863
    C   -1.509059
    D   -1.135632
    Name: 2013-01-01 00:00:00, dtype: float64
    
    In [27]: df.loc[:,['A','B']]
    Out[27]: 
                       A         B
    2013-01-01  0.469112 -0.282863
    2013-01-02  1.212112 -0.173215
    2013-01-03 -0.861849 -2.104569
    2013-01-04  0.721555 -0.706771
    2013-01-05 -0.424972  0.567020
    2013-01-06 -0.673690  0.113648
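One detail in the selections above is easy to miss: slicing by position excludes the end point, while slicing by label includes it. The following sketch, on a small deterministic frame of my own, puts the two side by side and adds a boolean filter.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=dates, columns=list("AB"))

by_position = df.iloc[0:2]                  # rows 0 and 1; the end is excluded
by_label = df.loc["20130101":"20130102"]    # both end labels are included
single = df.loc[dates[0], "A"]              # one cell, selected by label
large_a = df[df["A"] > 2]                   # rows where column A exceeds 2

print(len(by_position), len(by_label))   # 2 2
print(single)                            # 0
print(large_a["A"].tolist())             # [4, 6]
```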
    

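    About CSV files

    CSV is a general-purpose, relatively simple file format that is widely used for tabular data. Many relational databases can import and export it, and spreadsheet tools such as Excel can convert to and from CSV as well. The examples below use a small file, marks.csv, with the following contents: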

    name,physics,python,math,english
    Google,100,100,25,12
    Facebook,45,54,44,88
    Twitter,54,76,13,91
    Yahoo,54,452,26,100
    
    >>> with open("./marks.csv") as f:
    ...     for line in f:
    ...         print(line)
    ... 
    name,physics,python,math,english
    
    Google,100,100,25,12
    
    Facebook,45,54,44,88
    
    Twitter,54,76,13,91
    
    Yahoo,54,452,26,100
    
    >>> import csv 
    >>> dir(csv)    # output from Python 2; the exact names vary by Python version
    ['Dialect', 'DictReader', 'DictWriter', 'Error', 'QUOTE_ALL', 'QUOTE_MINIMAL', 'QUOTE_NONE', 'QUOTE_NONNUMERIC', 'Sniffer', 'StringIO', '_Dialect', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__version__', 'excel', 'excel_tab', 'field_size_limit', 'get_dialect', 'list_dialects', 're', 'reader', 'reduce', 'register_dialect', 'unregister_dialect', 'writer']
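Of the names listed, `reader` and `DictReader` are the ones you would reach for first. Here is a minimal sketch of `DictReader`; to keep it self-contained it reads the sample data from an in-memory string (an assumption for the example) rather than from marks.csv on disk.

```python
import csv
import io

# The sample data, inlined so the snippet needs no file on disk.
text = """name,physics,python,math,english
Google,100,100,25,12
Facebook,45,54,44,88
Twitter,54,76,13,91
Yahoo,54,452,26,100
"""

# Each row becomes a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(text)))

# The csv module returns strings, so numeric fields are converted by hand.
python_scores = {row["name"]: int(row["python"]) for row in rows}

print(python_scores)
# {'Google': 100, 'Facebook': 54, 'Twitter': 76, 'Yahoo': 452}
```

The manual `int()` conversion is the main inconvenience of the standard library approach, and it is one of the things `pd.read_csv` handles automatically.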
    
    >>> import pandas as pd
    >>> marks = pd.read_csv("./marks.csv")
    >>> marks
           name  physics  python  math  english
    0    Google      100     100    25       12
    1  Facebook       45      54    44       88
    2   Twitter       54      76    13       91
    3     Yahoo       54     452    26      100
    
    >>> marks.sort_values(by="python")    # DataFrame.sort() was removed; sort_values() replaces it
           name  physics  python  math  english
    1  Facebook       45      54    44       88
    2   Twitter       54      76    13       91
    0    Google      100     100    25       12
    3     Yahoo       54     452    26      100
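Once the data is in a DataFrame, follow-up questions take one line each. The sketch below reads the same sample data from an in-memory string (so it runs without marks.csv), uses the name column as the index, ranks by the python score, and adds a total column; the `total` column name is my own choice.

```python
import io

import pandas as pd

text = """name,physics,python,math,english
Google,100,100,25,12
Facebook,45,54,44,88
Twitter,54,76,13,91
Yahoo,54,452,26,100
"""

# index_col makes the company name the row label instead of 0..3.
marks = pd.read_csv(io.StringIO(text), index_col="name")

# Highest python score first.
ranked = marks.sort_values(by="python", ascending=False)

# Row-wise sum across the four subject columns.
subjects = ["physics", "python", "math", "english"]
marks["total"] = marks[subjects].sum(axis=1)

print(ranked.index.tolist())   # ['Yahoo', 'Google', 'Twitter', 'Facebook']
print(marks["total"].to_dict())
# {'Google': 237, 'Facebook': 231, 'Twitter': 234, 'Yahoo': 632}
```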
    

    That is all the code I have to share for now.
