pandas Basic Operations Handbook

Author: 张小张x86 | Published: 2019-04-22 12:14

    Series

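    A Series is a one-dimensional, array-like object made up of a sequence of values and an associated set of labels (its index). Create one by calling the pandas Series constructor.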

    • Creating a Series
    import pandas as pd
    s = pd.Series([2,3,4,1,5])
    print(s)
    >>>
    0    2
    1    3
    2    4
    3    1
    4    5
    dtype: int64
    
    s.values
    >>>array([2, 3, 4, 1, 5])
    s.index
    >>>RangeIndex(start=0, stop=5, step=1)
    
    s2 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
    print(s2)
    >>>
    a    3
    b    2
    c    4
    d    1
    e    5
    dtype: int64
    
    # Create a Series from a dictionary (named info here to avoid shadowing the built-in dict)
    info = {'name':'joha','sex':'male','age':'18'}
    s3 = pd.Series(info)
    print(s3)
    >>>
    name    joha
    sex     male
    age       18
    dtype: object
    
    • Selecting one or more values from a Series by index label
    s2 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
    # Select a single value
    s2['a']
    >>>3
    
    # Select several values at once
    s2[['a','c','e']]
    >>>
    a    3
    c    4
    e    5
    dtype: int64
    
    s2[s2>3]
    >>>
    c    4
    e    5
    dtype: int64
    
    s2*3
    >>>
    a     9
    b     6
    c    12
    d     3
    e    15
    dtype: int64
    
    'c' in s2
    >>>True
    'f' in s2
    >>>False
    

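    In arithmetic operations, a Series automatically aligns data by index label; a label present in only one operand produces NaN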

    s1 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
    s2 = pd.Series([3,-5,1],index = ['a','c','e'])
    print(s1+s2)
    >>>
    a    6.0
    b    NaN
    c   -1.0
    d    NaN
    e    6.0
    dtype: float64
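
If the NaN values are not wanted, one option is `Series.add` with `fill_value`. This sketch assumes that a label missing from one Series should count as 0:

```python
import pandas as pd

s1 = pd.Series([3, 2, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([3, -5, 1], index=['a', 'c', 'e'])

# Plain addition leaves NaN wherever a label exists in only one Series.
plain = s1 + s2

# With fill_value=0, a label missing from one side is treated as 0.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```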
    

    The index of a Series can be changed by assigning to it

    s2 = pd.Series([3,-5,1],index = ['a','c','e'])
    s2.index = [1,2,3]
    print(s2)
    >>>
    1    3
    2   -5
    3    1
    dtype: int64
    

    DataFrame

    • Creating a DataFrame
    test_dict = {'id':[1,2,3,4,5,6],
                 'name':['Alice','Bob','Cindy','Eric','Helen','Grace '],
                 'math':[90,89,99,78,97,93],
                 'english':[89,94,80,94,94,90]}
    # [1] Pass test_dict directly as a positional argument
    test_dict_df = pd.DataFrame(test_dict)
    print(test_dict_df)
    >>>
       id    name  math  english
    0   1   Alice    90       89
    1   2     Bob    89       94
    2   3   Cindy    99       80
    3   4    Eric    78       94
    4   5   Helen    97       94
    5   6  Grace     93       90
    # [2] Pass the dictionary through the data keyword
    test_dict_df = pd.DataFrame(data=test_dict)
    print(test_dict_df)
    >>>
       id    name  math  english
    0   1   Alice    90       89
    1   2     Bob    89       94
    2   3   Cindy    99       80
    3   4    Eric    78       94
    4   5   Helen    97       94
    5   6  Grace     93       90
    
    # The columns argument sets the column order
    test_dict_df = pd.DataFrame(test_dict,columns=['name','math','english','id'])
    print(test_dict_df)
    >>>
         name  math  english  id
    0   Alice    90       89   1
    1     Bob    89       94   2
    2   Cindy    99       80   3
    3    Eric    78       94   4
    4   Helen    97       94   5
    5  Grace     93       90   6
    
    • Selecting a column from a DataFrame
    test_dict_df['name']
    >>>
    0     Alice
    1       Bob
    2     Cindy
    3      Eric
    4     Helen
    5    Grace 
    Name: name, dtype: object
    
    test_dict_df.name
    >>>
    0     Alice
    1       Bob
    2     Cindy
    3      Eric
    4     Helen
    5    Grace 
    Name: name, dtype: object
    
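    • Assigning to a column
    # The Series is aligned on the row index; row 5 has no matching label, so it becomes NaN
    test_dict_df['id'] = pd.Series(['11','22','33','44','55'])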
    print(test_dict_df)
    >>>
         name  math  english   id
    0   Alice    90       89   11
    1     Bob    89       94   22
    2   Cindy    99       80   33
    3    Eric    78       94   44
    4   Helen    97       94   55
    5  Grace     93       90  NaN
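
The NaN in the last row comes from index alignment rather than from the length of the data alone. As a small sketch on a made-up three-row table (the `rank`, `passed`, and extra `english` columns are illustrative assumptions, not part of the example above), the three common kinds of right-hand side behave differently:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Cindy'], 'math': [90, 89, 99]})

# A Series is aligned on its index labels; rows without a matching label get NaN.
df['rank'] = pd.Series([1, 3], index=[2, 0])

# A scalar is broadcast to every row.
df['passed'] = True

# A plain list is assigned by position and must match the number of rows.
df['english'] = [89, 94, 80]

print(df)
```

Passing a list shorter than the number of rows raises an error instead of filling with NaN, which is why the Series form is the one that produces missing values.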
    
    • Dropping a column
    # drop returns a new DataFrame; test_dict_df itself is left unchanged
    test_dict_df.drop(['id'],axis=1)
    >>>
         name  math  english
    0   Alice    90       89
    1     Bob    89       94
    2   Cindy    99       80
    3    Eric    78       94
    4   Helen    97       94
    5  Grace     93       90
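
Because `drop` returns a new object, the result has to be kept explicitly. A minimal sketch on a made-up two-row table (the data here is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob'], 'math': [90, 89]})

dropped = df.drop(['id'], axis=1)   # new DataFrame without the 'id' column
print(list(df.columns))             # the original still contains 'id'
print(list(dropped.columns))

# Equivalent spelling with the columns keyword; rebind the name to keep the result.
df = df.drop(columns=['id'])
```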
    
    • Building a DataFrame from a two-dimensional (nested) list
    test_dict = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns = ['a','b','c'],index=[1,2,3])
    print(test_dict)
    >>>
       a  b  c
    1  1  2  3
    2  4  5  6
    3  7  8  9
    
    print(test_dict.values)
    >>>
    [[1 2 3]
     [4 5 6]
     [7 8 9]]
    
    • When building a Series or DataFrame, a list or array can be passed as the index
    test_dict = pd.Series([1,2,3],index = ['a','b','c'])
    print(test_dict)
    >>>
    a    1
    b    2
    c    3
    dtype: int64
    
    test_dict.index = ['c','d','e']
    print(test_dict)
    >>>
    c    1
    d    2
    e    3
    dtype: int64
    
    • Reindexing with reindex (labels that did not exist before get NaN)
    test_dict = pd.Series([1,2,3],index = ['a','b','c'])
    test_dict1 = test_dict.reindex(['a','b','c','d','e'])
    print(test_dict1)
    >>>
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    e    NaN
    dtype: float64
    
    # Fill the new entries with a value instead of NaN
    test_dict = pd.Series([1,2,3],index = ['a','b','c'])
    test_dict1 = test_dict.reindex(['a','b','c','d','e'],fill_value = 0)
    print(test_dict1)
    >>>
    a    1
    b    2
    c    3
    d    0
    e    0
    dtype: int64
    
    # method='ffill' fills each new label with the last preceding value
    obj = pd.Series(['Jim','Mike','Jhon'],index = [0,3,6])
    obj1 = obj.reindex(range(8),method = 'ffill')
    print(obj1)
    >>>
    0     Jim
    1     Jim
    2     Jim
    3    Mike
    4    Mike
    5    Mike
    6    Jhon
    7    Jhon
    dtype: object
    

    reindex can relabel the rows and the columns in one call

    import numpy as np
    df = pd.DataFrame(np.arange(1,10).reshape((3,3)),index = ['d','a','c'],columns = ['Jim','Mike','Jhon'])
    df1 = df.reindex(index = ['a','b','c','d'],columns = ['Jhon','Mike','Jim'])
    print(df1)
    >>>
       Jhon  Mike  Jim
    a   6.0   5.0  4.0
    b   NaN   NaN  NaN
    c   9.0   8.0  7.0
    d   3.0   2.0  1.0
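
Rows and columns can also be reindexed independently. The sketch below reorders and extends only the columns; the `'Tom'` label is a hypothetical new column added for illustration, and `fill_value=0` is one possible choice of default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape((3, 3)),
                  index=['d', 'a', 'c'], columns=['Jim', 'Mike', 'Jhon'])

# Reorder and extend the columns only; the row order stays d, a, c.
# The new 'Tom' column is filled with 0 instead of NaN.
cols = df.reindex(columns=['Jhon', 'Jim', 'Tom'], fill_value=0)
print(cols)
```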
    
    • Dropping entries from an axis with drop (works on both Series and DataFrame)
    test_dict = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
    print('output:\n{}'.format(test_dict))
    a = test_dict.drop(['a','c'])
    print('>>>\n{}'.format(a))
    >>>
    output:
    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64
    >>>
    b    2
    d    4
    e    5
    dtype: int64
    
    df = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
    print(df)
    a = df.drop(['2','4'])  # drop rows by label (the default axis is 0)
    print(a)
    >>>
        a   b   c   d   e
    1   0   1   2   3   4
    2   5   6   7   8   9
    3  10  11  12  13  14
    4  15  16  17  18  19
    5  20  21  22  23  24
    >>>
        a   b   c   d   e
    1   0   1   2   3   4
    3  10  11  12  13  14
    5  20  21  22  23  24
    
    df = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
    print(df)
    a = df.drop(['a','c'],axis = 1)  # axis=1 drops columns instead of rows
    print(a)
    >>>
        a   b   c   d   e
    1   0   1   2   3   4
    2   5   6   7   8   9
    3  10  11  12  13  14
    4  15  16  17  18  19
    5  20  21  22  23  24
    >>>
        b   d   e
    1   1   3   4
    2   6   8   9
    3  11  13  14
    4  16  18  19
    5  21  23  24
    
    • Indexing, selection, and filtering
    object = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
    print(object[1:3])  # an integer slice is positional and excludes the end
    >>>
    b    2
    c    4
    dtype: int64
    
    object[['a','b','c']]
    >>>
    a    3
    b    2
    c    4
    dtype: int64
    
    • Filtering by condition
    object[object<4]
    >>>
    a    3
    b    2
    d    1
    dtype: int64
    
    • A DataFrame can be indexed by column or by row
    object = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
    print(object['b'])
    >>>
    1     1
    2     6
    3    11
    4    16
    5    21
    Name: b, dtype: int64
    
    print(object[['a','c']])
    >>>
        a   c
    1   0   2
    2   5   7
    3  10  12
    4  15  17
    5  20  22
    
    • Selecting rows with a slice
    print(object[1:4])
    >>>
        a   b   c   d   e
    2   5   6   7   8   9
    3  10  11  12  13  14
    4  15  16  17  18  19
    
    • Selecting rows with a boolean condition
    object[object['b']>10]
    >>>
        a   b   c   d   e
    3  10  11  12  13  14
    4  15  16  17  18  19
    5  20  21  22  23  24
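
The bracket forms above mix positional and label-based behaviour. As a sketch of the explicit alternatives, `.iloc` always selects by position and `.loc` always selects by label; the frame is rebuilt here under the name `df` so the block runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)),
                  index=list('12345'), columns=list('abcde'))

# .iloc selects by integer position; the end of the slice is excluded.
by_position = df.iloc[1:4]

# .loc selects by label; both ends of a label slice are included.
by_label = df.loc['2':'4', ['a', 'c']]

# A boolean mask and a column list can be combined in one .loc call.
filtered = df.loc[df['b'] > 10, ['a', 'b']]

print(by_position)
print(by_label)
print(filtered)
```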
    
    • Arithmetic between objects with different indexes
    df1 = pd.DataFrame(np.arange(9).reshape(3,3),columns = list('abc'),index = [1,2,3])
    print(df1)
    >>>
       a  b  c
    1  0  1  2
    2  3  4  5
    3  6  7  8
    
    df2 = pd.DataFrame(np.arange(16).reshape(4,4),columns = list('bcde'),index = [2,3,4,5])
    print(df2)
    >>>
        b   c   d   e
    2   0   1   2   3
    3   4   5   6   7
    4   8   9  10  11
    5  12  13  14  15
    
    print(df1+df2)
    >>>
        a     b     c   d   e
    1 NaN   NaN   NaN NaN NaN
    2 NaN   4.0   6.0 NaN NaN
    3 NaN  11.0  13.0 NaN NaN
    4 NaN   NaN   NaN NaN NaN
    5 NaN   NaN   NaN NaN NaN
    
    df1.add(df2,fill_value = 0)
    # A cell missing from only one frame is treated as 0; cells missing from both df1 and df2 are still NaN
    >>>
         a     b     c     d     e
    1  0.0   1.0   2.0   NaN   NaN
    2  3.0   4.0   6.0   2.0   3.0
    3  6.0  11.0  13.0   6.0   7.0
    4  NaN   8.0   9.0  10.0  11.0
    5  NaN  12.0  13.0  14.0  15.0
    
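    • Arithmetic between a DataFrame and a Series
      Operations between a DataFrame and a Series are broadcast: by default the Series index is matched against the DataFrame columns and the operation is repeated for every row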
    df1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('abc'))
    print(df1)
    >>>
       a   b   c
    0  0   1   2
    1  3   4   5
    2  6   7   8
    3  9  10  11
    
    series1 = pd.Series([3,4,5],index = ['a','b','c'])
    print(series1)
    >>>
    a    3
    b    4
    c    5
    dtype: int64
    
    # Subtract the Series from every row
    print(df1 - series1)
    >>>
       a  b  c
    0 -3 -3 -3
    1  0  0  0
    2  3  3  3
    3  6  6  6
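
The row-wise subtraction above matches values by label, not by position. A short sketch to illustrate this, using a reordered Series (`shuffled` is a name introduced here for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))
series1 = pd.Series([3, 4, 5], index=['a', 'b', 'c'])

# The Series index is matched against the DataFrame columns,
# then the same three values are subtracted from every row.
by_label = df1 - series1

# Same values in a different order: the labels still decide which
# value is subtracted from which column.
shuffled = pd.Series([5, 3, 4], index=['c', 'a', 'b'])
same = df1 - shuffled

print(by_label)
print(same[list('abc')])
```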
    
    • Using a Series taken from the DataFrame itself
    series2 = df1['b']
    series2
    >>>
    0     1
    1     4
    2     7
    3    10
    Name: b, dtype: int64
    
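    # axis=0 matches the Series index against the row index, so column b is subtracted from every column
    df1.sub(series2,axis = 0)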
    >>>
       a  b  c
    0 -1  0  1
    1 -1  0  1
    2 -1  0  1
    3 -1  0  1
    
    • Function application and mapping
      NumPy element-wise functions can be applied directly to a DataFrame
    print(np.square(df1))
    >>>
        a    b    c
    0   0    1    4
    1   9   16   25
    2  36   49   64
    3  81  100  121
    
    • DataFrame.apply calls a function on each row or each column and collects the results into a new object
    def fun(x):
        return x.max() - x.min()
    df1.apply(fun,axis = 1)  # axis=1 passes each row to fun
    >>>
    0    2
    1    2
    2    2
    3    2
    dtype: int64
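
For comparison, here is a sketch of the same kind of function applied per column (the default `axis=0`), plus an element-wise variant using `Series.map`. The helper name `value_range` and the even/odd labelling are illustrative choices only:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

def value_range(x):
    # x is one whole column (or row), passed in as a Series
    return x.max() - x.min()

# axis=0 (the default) passes each column to the function.
per_column = df1.apply(value_range)

# Series.map applies a function to each element of a single column.
labels = df1['a'].map(lambda v: 'even' if v % 2 == 0 else 'odd')

print(per_column)
print(labels)
```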
    
    • Sorting
    # The values are random, so the numbers will differ on each run
    df1 = pd.DataFrame(np.random.randn(4,4),columns=list('bcad'),index=[2,4,3,1])
    print(df1)
    >>>
              b         c         a         d
    2  0.706356 -0.896474 -1.879608  0.322054
    4  0.666188 -0.450170  0.914737  0.691662
    3 -1.676381 -0.499211 -0.136020 -1.734251
    1 -2.111717 -0.226238  1.656514  0.146311
    
    print(df1.sort_index())  # sort rows by index label
    >>>
              b         c         a         d
    1 -2.111717 -0.226238  1.656514  0.146311
    2  0.706356 -0.896474 -1.879608  0.322054
    3 -1.676381 -0.499211 -0.136020 -1.734251
    4  0.666188 -0.450170  0.914737  0.691662
    
    print(df1.sort_values(by=['b','a']))  # sort rows by column b, then by column a
    >>>
              b         c         a         d
    1 -2.111717 -0.226238  1.656514  0.146311
    3 -1.676381 -0.499211 -0.136020 -1.734251
    4  0.666188 -0.450170  0.914737  0.691662
    2  0.706356 -0.896474 -1.879608  0.322054
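
Both sort methods take further options. A small sketch on fixed (made-up) integers, so the result is reproducible, showing column-label sorting and a mixed ascending/descending sort:

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 1, 2], 'a': [5, 7, 3]}, index=[2, 3, 1])

# axis=1 sorts the columns by label instead of the rows.
by_column_label = df.sort_index(axis=1)

# Sort rows by 'b' descending, breaking ties with 'a' ascending.
ranked = df.sort_values(by=['b', 'a'], ascending=[False, True])

print(by_column_label)
print(ranked)
```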
    

    Summary statistics

    • Sum: sum
    • Maximum: max
    • Minimum: min
    • Variance: var
    • Mean: mean
    • Overall summary: describe
    print(df1.describe())
    >>>
                  b         c         a         d
    count  4.000000  4.000000  4.000000  4.000000
    mean  -0.603889 -0.518024  0.138906 -0.143556
    std    1.500402  0.278880  1.533517  1.084545
    min   -2.111717 -0.896474 -1.879608 -1.734251
    25%   -1.785215 -0.598527 -0.571917 -0.323829
    50%   -0.505096 -0.474691  0.389358  0.234183
    75%    0.676230 -0.394187  1.100181  0.414456
    max    0.706356 -0.226238  1.656514  0.691662
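
The individual reductions listed above can also be called on their own. A minimal sketch on a small made-up score table (the `scores` data is an assumption chosen so the results are easy to check by hand):

```python
import pandas as pd

scores = pd.DataFrame({'math': [90, 89, 99, 78], 'english': [89, 94, 80, 94]})

totals = scores.sum()            # column totals
highest = scores.max()           # column maxima
lowest = scores.min()            # column minima
average = scores.mean()          # column means
variance = scores.var()          # sample variance (ddof=1 by default)
row_mean = scores.mean(axis=1)   # axis=1 reduces across columns, one value per row

print(totals, highest, lowest, average, variance, row_mean, sep='\n\n')
```

Each reduction works column by column by default and returns a Series indexed by the column names.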
    

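    Handling missing data

    • dropna: drop the entries that are NaN
    • fillna: replace NaN with a given default value
    • isnull: return an object of booleans marking where the NaN values are
      - notnull: the negation of isnull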
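
A short sketch of the four methods on a made-up Series with two missing values (filling with 0 is only one possible default):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan], index=list('abcd'))

mask = s.isnull()        # True where the value is NaN
present = s.notnull()    # the element-wise negation of isnull
dropped = s.dropna()     # keeps only the non-NaN entries
filled = s.fillna(0)     # replaces NaN with a default value

print(mask, present, dropped, filled, sep='\n\n')
```

None of these calls modifies `s`; each returns a new object.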


    Article link: https://www.haomeiwen.com/subject/lvbjgqtx.html