美文网首页数据分析专题
pandas 数据合并和数据转换

pandas 数据合并和数据转换

作者: 无敌的肉包 | 来源:发表于2018-05-31 09:18 被阅读53次
数据合并
  1. concat ----- 相当于SQL中的union ----- 轴向连接
In [116]: df=pd.DataFrame(np.random.randn(10,4))

In [117]: df
Out[117]:
          0         1         2         3
0 -0.090224  0.151115 -0.258844  0.635371
1  0.250083  0.393779 -0.167708  0.920488
2  1.116247 -1.799474 -1.463225  0.875507
3  0.030187  0.336028  0.544718 -0.316654
4 -0.255436 -0.173571  1.379008  0.143499
5 -0.625524  0.199277  0.374764 -2.165587
6 -0.523163  0.886603  0.527486  0.046614
7 -1.265947 -0.935611  1.169874 -1.535373
8  1.829968  0.867154 -1.973417  0.364380
9  1.623850  0.633808 -1.987559 -1.554006

In [118]: pieces = [df[:3],df[3:7],df[7:]]

In [119]: pieces
Out[119]:
[          0         1         2         3
 0 -0.090224  0.151115 -0.258844  0.635371
 1  0.250083  0.393779 -0.167708  0.920488
 2  1.116247 -1.799474 -1.463225  0.875507,
           0         1         2         3
 3  0.030187  0.336028  0.544718 -0.316654
 4 -0.255436 -0.173571  1.379008  0.143499
 5 -0.625524  0.199277  0.374764 -2.165587
 6 -0.523163  0.886603  0.527486  0.046614,
           0         1         2         3
 7 -1.265947 -0.935611  1.169874 -1.535373
 8  1.829968  0.867154 -1.973417  0.364380
 9  1.623850  0.633808 -1.987559 -1.554006]

In [120]: pd.concat(pieces)
Out[120]:
          0         1         2         3
0 -0.090224  0.151115 -0.258844  0.635371
1  0.250083  0.393779 -0.167708  0.920488
2  1.116247 -1.799474 -1.463225  0.875507
3  0.030187  0.336028  0.544718 -0.316654
4 -0.255436 -0.173571  1.379008  0.143499
5 -0.625524  0.199277  0.374764 -2.165587
6 -0.523163  0.886603  0.527486  0.046614
7 -1.265947 -0.935611  1.169874 -1.535373
8  1.829968  0.867154 -1.973417  0.364380
9  1.623850  0.633808 -1.987559 -1.554006

如果传入axis=1,则会变成一个DataFrame; 用kyes可以在连接的轴上创建一个层次化索引。


In [124]: result = pd.concat(pieces,keys=['one','two','three'])

In [125]: result
Out[125]:
                0         1         2         3
one   0 -0.090224  0.151115 -0.258844  0.635371
      1  0.250083  0.393779 -0.167708  0.920488
      2  1.116247 -1.799474 -1.463225  0.875507
two   3  0.030187  0.336028  0.544718 -0.316654
      4 -0.255436 -0.173571  1.379008  0.143499
      5 -0.625524  0.199277  0.374764 -2.165587
      6 -0.523163  0.886603  0.527486  0.046614
three 7 -1.265947 -0.935611  1.169874 -1.535373
      8  1.829968  0.867154 -1.973417  0.364380
      9  1.623850  0.633808 -1.987559 -1.554006

In [126]: result.unstack()
Out[126]:
              0                                                              \
              0         1         2         3         4         5         6
one   -0.090224  0.250083  1.116247       NaN       NaN       NaN       NaN
two         NaN       NaN       NaN  0.030187 -0.255436 -0.625524 -0.523163
three       NaN       NaN       NaN       NaN       NaN       NaN       NaN

                                     ...            3                      \
              7         8        9   ...            0         1         2
one         NaN       NaN      NaN   ...     0.635371  0.920488  0.875507
two         NaN       NaN      NaN   ...          NaN       NaN       NaN
three -1.265947  1.829968  1.62385   ...          NaN       NaN       NaN


              3         4         5         6         7        8         9
one         NaN       NaN       NaN       NaN       NaN      NaN       NaN
two   -0.316654  0.143499 -2.165587  0.046614       NaN      NaN       NaN
three       NaN       NaN       NaN       NaN -1.535373  0.36438 -1.554006

[3 rows x 40 columns]
  1. merge -----类似于SQL中的连接 join

默认情况下merge表示的是内连接 即 inner join,可以用how指定为left,rightouter
没有使用on指定连接的列的时候,merge会将重叠的列名当作键,如果列名不同,可以用left_onright_on分别指定。

In [127]: left = pd.DataFrame({'key':['foo','fool'],'lval':[1,2]})

In [128]: right = pd.DataFrame({'key':['foo','foo2'],'val':[4,5]})

In [129]: pd.merge(left,right)
Out[129]:
   key  lval  val
0  foo     1    4
In [131]: pd.merge(left,right,on='key')
Out[131]:
   key  lval  val
0  foo     1    4

In [133]: df3 = pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})

In [134]: df3
Out[134]:
   data1 lkey
0      0    b
1      1    b
2      2    a
3      3    c
4      4    a
5      5    a
6      6    b

In [135]: df4 = pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)})

In [136]: df4
Out[136]:
   data2 rkey
0      0    a
1      1    b
2      2    d

In [137]: pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[137]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a

In [138]: pd.merge(df3,df4,left_on='lkey',right_on='rkey',how='outer')
Out[138]:
   data1 lkey  data2 rkey
0      0    b      1    b
1      1    b      1    b
2      6    b      1    b
3      2    a      0    a
4      4    a      0    a
5      5    a      0    a
6      3    c    NaN  NaN
7    NaN  NaN      2    d

当连接键位于索引上的时候可以通过传入right_index=Trueleft_index=Ture的参数将索引作为连接键。

  1. append -----将数据连接到一个DataFrame上
In [141]: df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])

In [142]: df
Out[142]:
          A         B         C         D
0  0.681495  1.002195  0.098096  0.707791
1 -2.692089 -0.307324  1.157524  1.182020
2  0.362695  0.310697 -1.866646  0.926730
3  0.326436  0.076280 -0.211885 -0.785103
4 -1.700167 -0.216005  0.076874  0.908238
5  1.139810  0.375421  1.244933  1.320597
6  0.137495 -0.194466  1.188652  0.704212
7  0.166488  0.758015 -0.307003 -1.113543

In [145]: s=df.iloc[3]

In [146]: s
Out[146]:
A    0.326436
B    0.076280
C   -0.211885
D   -0.785103
Name: 3, dtype: float64

In [151]: df.append(s,ignore_index=True)
Out[151]:
          A         B         C         D
0  0.681495  1.002195  0.098096  0.707791
1 -2.692089 -0.307324  1.157524  1.182020
2  0.362695  0.310697 -1.866646  0.926730
3  0.326436  0.076280 -0.211885 -0.785103
4 -1.700167 -0.216005  0.076874  0.908238
5  1.139810  0.375421  1.244933  1.320597
6  0.137495 -0.194466  1.188652  0.704212
7  0.166488  0.758015 -0.307003 -1.113543
8  0.326436  0.076280 -0.211885 -0.785103
数据转换
  1. 移除重复数据
    DataFrame中常常会出现重复数据,通常可以用duplicated方法返回一个布尔型Series,表示各行是否是重复的。
    drop_duplicates方法用于返回一个移除了重复行的DataFrame。
In [157]: df
Out[157]:
   data key
0     1   a
1     1   a
2     2   a
3     3   b
4     3   b
5     4   b
6     4   b

In [158]: df.duplicated()
Out[158]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [160]: df.drop_duplicates()
Out[160]:
   data key
0     1   a
2     2   a
3     3   b
5     4   b
  1. 值替换
    fillna方法填充缺失数据可以看作是数值替换的一种特殊情况,replace则提供了一种实现该功能更简单,更灵活的方法。
In [163]: data = pd.Series([1.,-999.,2.,-999.,-1000.,3.])

In [164]: data
Out[164]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: float64

In [166]: data.replace(-999,np.nan)
Out[166]:
0       1
1     NaN
2       2
3     NaN
4   -1000
5       3
dtype: float64

In [167]: data.replace([-999,-1000],np.nan)
Out[167]:
0     1
1   NaN
2     2
3   NaN
4   NaN
5     3
dtype: float64

In [168]: data.replace([-999,-1000],[np.nan,0])
Out[168]:
0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

In [170]: data.replace({-999:np.nan,-1000:0})
Out[170]:
0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

相关文章

网友评论

    本文标题:pandas 数据合并和数据转换

    本文链接:https://www.haomeiwen.com/subject/jactdftx.html