Pandas Quick Start (Part 2)

Author: 乔治大叔 | Published: 2019-10-08 10:59

    Continuing from Pandas Quick Start (Part 1):

    Boolean indexing
    print(df[df.A > 0])  # select the rows where column A is greater than 0
    print(df[df > 0])    # keep values greater than 0; every other cell becomes NaN
    
                       A         B         C         D
    2019-09-01  0.586356  1.969502  1.125890 -0.831724
    2019-09-04  0.886695  1.543536 -0.170274  0.867814
    2019-09-06  0.297143 -0.317093  1.125189  1.023567
    
                       A         B         C         D
    2019-09-01  0.586356  1.969502  1.125890       NaN
    2019-09-02       NaN       NaN       NaN       NaN
    2019-09-03       NaN  1.162615  0.699749  1.224788
    2019-09-04  0.886695  1.543536       NaN  0.867814
    2019-09-05       NaN  0.445182       NaN       NaN
    2019-09-06  0.297143       NaN  1.125189  1.023567
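
The two forms above behave differently: a boolean Series filters rows, while a boolean DataFrame masks individual cells. The following is a minimal sketch of that difference, assuming a small hand-written frame named `demo` (hypothetical values, not the article's random data) so the result is reproducible. It also shows how two conditions might be combined with `&`.

```python
import pandas as pd

# Hypothetical fixed values (not the article's random data).
dates = pd.date_range("20190901", periods=4)
demo = pd.DataFrame(
    {"A": [0.5, -0.6, -2.4, 0.8], "B": [1.9, -0.8, 1.1, 1.5]},
    index=dates,
)

mask = demo["A"] > 0    # boolean Series: one True/False per row
rows = demo[mask]       # row filter: keeps only the rows where mask is True
cells = demo[demo > 0]  # cell mask: same shape as demo, NaN where the test fails

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = demo[(demo["A"] > 0) & (demo["B"] > 1.6)]

print(rows)
print(cells)
print(both)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the sketch uses `&`.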
    
    Filtering

    Filter with the isin() method:

    df['E'] = ['one','two','three','four','five','six']
    print(df)
    print(df[df['E'].isin(['two','three'])])
    
                       A         B         C         D      E
    2019-09-01  0.586356  1.969502  1.125890 -0.831724    one
    2019-09-02 -0.665937 -0.897839 -1.208598 -1.226119    two
    2019-09-03 -2.418687  1.162615  0.699749  1.224788  three
    2019-09-04  0.886695  1.543536 -0.170274  0.867814   four
    2019-09-05 -0.671953  0.445182 -0.614136 -0.064305   five
    2019-09-06  0.297143 -0.317093  1.125189  1.023567    six
                       A         B         C         D      E
    2019-09-02 -0.665937 -0.897839 -1.208598 -1.226119    two
    2019-09-03 -2.418687  1.162615  0.699749  1.224788  three
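
Because `isin()` returns an ordinary boolean Series, it can also be inverted to exclude values instead of selecting them. A short sketch, assuming a hypothetical four-row frame named `demo`:

```python
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "E": ["one", "two", "three", "four"]}
)

wanted = demo["E"].isin(["two", "three"])  # True where E is one of the listed values
selected = demo[wanted]                    # rows whose E is "two" or "three"
excluded = demo[~wanted]                   # ~ inverts the mask: every other row

print(selected)
print(excluded)
```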
    

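    Assignment

    Standard Python / NumPy expressions for selecting and assigning are intuitive and convenient for interactive work, but for production code we recommend the optimized Pandas data access methods .at, .iat, .loc and .iloc.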
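
As a rough illustration of how the four accessors differ, the sketch below reads the same data by label and by position. The frame `demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
dates = pd.date_range("20190901", periods=3)
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]}, index=dates)

by_label = demo.at[dates[0], "B"]  # single cell by row label and column label: 4.0
by_position = demo.iat[0, 1]       # the same cell by integer position

# .loc slices by label and includes both endpoints.
label_slice = demo.loc[dates[0]:dates[1], ["A"]]
# .iloc slices by position and excludes the end, like a Python list.
position_slice = demo.iloc[0:2, 0:1]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

`.at` and `.iat` access a single cell, while `.loc` and `.iloc` also accept slices, lists and boolean masks.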

    Adding a new column automatically aligns the data by index. Start by building a Series with its own date index:

    s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20191001', periods=6))
    print(s1)
    
    2019-10-01    1
    2019-10-02    2
    2019-10-03    3
    2019-10-04    4
    2019-10-05    5
    2019-10-06    6
    Freq: D, dtype: int64
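
The snippet above only builds the Series. To make the alignment itself visible, here is a minimal sketch that assigns a Series to a new column. The frame `demo`, the column name `F` and the one-day offset are assumptions chosen for illustration. A Series indexed from 2019-10-01 shares no labels with a September index, so assigning it would produce a column of NaN only.

```python
import pandas as pd

# Hypothetical frame indexed 2019-09-01 .. 2019-09-06.
dates = pd.date_range("20190901", periods=6)
demo = pd.DataFrame({"A": range(6)}, index=dates)

# Series whose index starts one day later: 2019-09-02 .. 2019-09-07.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20190902", periods=6))

# Values are matched by index label, not by position:
# 2019-09-01 has no partner and becomes NaN; 2019-09-07 is not in demo and is dropped.
demo["F"] = s1
print(demo)
```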
    

    Setting a value by label:

    dates = pd.date_range('20190901', periods=6)
    print(dates)
    df.at[dates[0], 'C'] = 0  # dates[0] is 2019-09-01
    print(df)
    
    DatetimeIndex(['2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04',
                   '2019-09-05', '2019-09-06'],
                  dtype='datetime64[ns]', freq='D')
    
                       A         B         C         D
    2019-09-01 -0.397827 -1.102112  0.000000  0.161291
    2019-09-02 -0.751784 -0.759627 -1.311447 -0.919117
    2019-09-03  0.531277  0.550232 -1.253598  0.647749
    2019-09-04 -0.549671  1.000032 -0.927265  0.094845
    2019-09-05 -0.046609  0.399075  1.111344  1.722658
    2019-09-06 -1.424410 -1.328193  2.587026  0.463605
    

    Setting a value by position:

    df.iat[0, 2] = 0  # row 0, column 2 (zero-based positions)
    
                       A         B         C         D
    2019-09-01 -0.921584 -0.207005  0.000000 -0.548157
    2019-09-02 -0.899229  0.561346  0.574105 -1.558532
    2019-09-03 -1.277597 -0.583355  1.247190 -0.916555
    2019-09-04 -1.227783  0.522624 -2.151186 -0.281190
    2019-09-05  0.553149 -0.114055  0.616718  0.875897
    2019-09-06  1.140854 -0.052508  0.943119  1.269147
    

    Assigning with a NumPy array:

    df.loc[:, 'D'] = np.array([5] * len(df))  # the brackets in [5] are required: [5] * len(df) builds a list with one 5 per row
    
                       A         B         C  D
    2019-09-01  0.260309 -0.786362  0.900311  5
    2019-09-02 -1.035287  1.727411 -0.041896  5
    2019-09-03 -0.495706  0.687953 -0.121707  5
    2019-09-04 -0.365145 -0.844624 -0.764868  5
    2019-09-05  0.309504  0.465509 -0.363573  5
    2019-09-06 -0.143167 -0.405704 -1.102475  5
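
The array assigned to a column has to contain exactly one value per row. The sketch below, using a hypothetical three-row frame `demo` and made-up column names `G` and `H`, shows the matching case, scalar broadcasting as a simpler alternative, and the error raised on a length mismatch.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({"A": [0.1, 0.2, 0.3], "D": [1.0, 2.0, 3.0]})

fives = np.array([5] * len(demo))  # [5] * 3 is the list [5, 5, 5]
demo.loc[:, "D"] = fives           # one value per row, so the lengths match

demo["G"] = 5                      # a scalar is broadcast to every row

mismatch_rejected = False
try:
    demo["H"] = np.array([5, 5])   # two values for three rows
except ValueError:
    mismatch_rejected = True       # pandas refuses the wrong-length array

print(demo)
print(mismatch_rejected)
```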
    

    Assignment with a where-style condition:

    df2 = df.copy()
    df2[df2 < 0] = -df2  # flip the sign of every negative value so it becomes positive
    print(df2)
    
                       A         B         C         D
    2019-09-01  0.608456  1.503148 -0.194184  0.149963
    2019-09-02 -0.654379  1.039558 -0.321524  1.771350
    2019-09-03 -2.084704 -0.734897  0.260852 -1.163411
    2019-09-04 -0.461798  0.311986  1.860293 -1.353793
    2019-09-05  0.660783 -2.050908 -0.480054 -1.123917
    2019-09-06  0.070030 -0.405595  0.687804  0.119593
                       A         B         C         D
    2019-09-01  0.608456  1.503148  0.194184  0.149963
    2019-09-02  0.654379  1.039558  0.321524  1.771350
    2019-09-03  2.084704  0.734897  0.260852  1.163411
    2019-09-04  0.461798  0.311986  1.860293  1.353793
    2019-09-05  0.660783  2.050908  0.480054  1.123917
    2019-09-06  0.070030  0.405595  0.687804  0.119593
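
The same result can probably be reached without in-place assignment. The sketch below compares the masked assignment with `DataFrame.where` and `DataFrame.abs`, using a hypothetical frame `demo`; swapping in these methods is a suggestion, not the article's approach.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo = pd.DataFrame({"A": [0.6, -0.65, -2.08], "B": [1.5, 1.04, -0.73]})

# Masked assignment, as in the article: modifies the copy in place.
flipped = demo.copy()
flipped[flipped < 0] = -flipped

# where() keeps cells where the condition is True and takes the rest from -demo.
with_where = demo.where(demo >= 0, -demo)

# For this particular sign flip, abs() expresses the intent directly.
with_abs = demo.abs()

print(flipped)
```

`where` and `abs` return new frames and leave `demo` untouched, which can be easier to reason about than modifying a copy.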
    

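    Missing data

    Pandas primarily uses the value np.nan to represent missing data.

    df2 = df2[df2 > 0]  # keep values greater than 0; every other cell becomes NaN
    print(df2)
    print(df2.dropna(how='any'))  # drop every row that contains at least one missing value
    print(df2.fillna(value=5))    # fill missing values with 5
    print(pd.isna(df2))           # boolean mask that is True where the value is NaN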
    
                       A         B         C         D
    2019-09-01  2.504590       NaN  1.139982       NaN
    2019-09-02       NaN  0.604752  0.655428       NaN
    2019-09-03       NaN       NaN  1.086983  0.600510
    2019-09-04       NaN       NaN       NaN  0.459104
    2019-09-05       NaN       NaN       NaN  1.349749
    2019-09-06  0.803654  1.542528  0.041647  1.053980
    
                       A         B         C        D
    2019-09-06  0.803654  1.542528  0.041647  1.05398
    
                       A         B         C         D
    2019-09-01  2.504590  5.000000  1.139982  5.000000
    2019-09-02  5.000000  0.604752  0.655428  5.000000
    2019-09-03  5.000000  5.000000  1.086983  0.600510
    2019-09-04  5.000000  5.000000  5.000000  0.459104
    2019-09-05  5.000000  5.000000  5.000000  1.349749
    2019-09-06  0.803654  1.542528  0.041647  1.053980
    
                    A      B      C      D
    2019-09-01  False   True  False   True
    2019-09-02   True  False  False   True
    2019-09-03   True   True  False  False
    2019-09-04   True   True   True  False
    2019-09-05   True   True   True  False
    2019-09-06  False  False  False  False
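
To round this off, here is a small sketch contrasting `how='any'` with `how='all'` and counting missing values per column. The frame `demo` is hypothetical. `dropna` and `fillna` return new objects by default, so `demo` itself is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has one NaN, row 1 is entirely NaN, row 2 is complete.
demo = pd.DataFrame(
    {
        "A": [2.5, np.nan, 0.8],
        "B": [np.nan, np.nan, 1.5],
        "C": [1.1, np.nan, 0.04],
    }
)

any_dropped = demo.dropna(how="any")  # drops rows with at least one NaN
all_dropped = demo.dropna(how="all")  # drops only rows where every value is NaN
filled = demo.fillna(value=5)         # replaces each NaN with 5
nan_counts = demo.isna().sum()        # number of NaN values per column

print(any_dropped)
print(all_dropped)
print(nan_counts)
```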
    
    


    Article link: https://www.haomeiwen.com/subject/cjhqpctx.html