数据合并
-
concat
----- 相当于SQL中的union ----- 轴向连接
In [116]: df=pd.DataFrame(np.random.randn(10,4))
In [117]: df
Out[117]:
0 1 2 3
0 -0.090224 0.151115 -0.258844 0.635371
1 0.250083 0.393779 -0.167708 0.920488
2 1.116247 -1.799474 -1.463225 0.875507
3 0.030187 0.336028 0.544718 -0.316654
4 -0.255436 -0.173571 1.379008 0.143499
5 -0.625524 0.199277 0.374764 -2.165587
6 -0.523163 0.886603 0.527486 0.046614
7 -1.265947 -0.935611 1.169874 -1.535373
8 1.829968 0.867154 -1.973417 0.364380
9 1.623850 0.633808 -1.987559 -1.554006
In [118]: pieces = [df[:3],df[3:7],df[7:]]
In [119]: pieces
Out[119]:
[ 0 1 2 3
0 -0.090224 0.151115 -0.258844 0.635371
1 0.250083 0.393779 -0.167708 0.920488
2 1.116247 -1.799474 -1.463225 0.875507,
0 1 2 3
3 0.030187 0.336028 0.544718 -0.316654
4 -0.255436 -0.173571 1.379008 0.143499
5 -0.625524 0.199277 0.374764 -2.165587
6 -0.523163 0.886603 0.527486 0.046614,
0 1 2 3
7 -1.265947 -0.935611 1.169874 -1.535373
8 1.829968 0.867154 -1.973417 0.364380
9 1.623850 0.633808 -1.987559 -1.554006]
In [120]: pd.concat(pieces)
Out[120]:
0 1 2 3
0 -0.090224 0.151115 -0.258844 0.635371
1 0.250083 0.393779 -0.167708 0.920488
2 1.116247 -1.799474 -1.463225 0.875507
3 0.030187 0.336028 0.544718 -0.316654
4 -0.255436 -0.173571 1.379008 0.143499
5 -0.625524 0.199277 0.374764 -2.165587
6 -0.523163 0.886603 0.527486 0.046614
7 -1.265947 -0.935611 1.169874 -1.535373
8 1.829968 0.867154 -1.973417 0.364380
9 1.623850 0.633808 -1.987559 -1.554006
如果传入axis=1,则会变成一个DataFrame; 用kyes可以在连接的轴上创建一个层次化索引。
In [124]: result = pd.concat(pieces,keys=['one','two','three'])
In [125]: result
Out[125]:
0 1 2 3
one 0 -0.090224 0.151115 -0.258844 0.635371
1 0.250083 0.393779 -0.167708 0.920488
2 1.116247 -1.799474 -1.463225 0.875507
two 3 0.030187 0.336028 0.544718 -0.316654
4 -0.255436 -0.173571 1.379008 0.143499
5 -0.625524 0.199277 0.374764 -2.165587
6 -0.523163 0.886603 0.527486 0.046614
three 7 -1.265947 -0.935611 1.169874 -1.535373
8 1.829968 0.867154 -1.973417 0.364380
9 1.623850 0.633808 -1.987559 -1.554006
In [126]: result.unstack()
Out[126]:
0 \
0 1 2 3 4 5 6
one -0.090224 0.250083 1.116247 NaN NaN NaN NaN
two NaN NaN NaN 0.030187 -0.255436 -0.625524 -0.523163
three NaN NaN NaN NaN NaN NaN NaN
... 3 \
7 8 9 ... 0 1 2
one NaN NaN NaN ... 0.635371 0.920488 0.875507
two NaN NaN NaN ... NaN NaN NaN
three -1.265947 1.829968 1.62385 ... NaN NaN NaN
3 4 5 6 7 8 9
one NaN NaN NaN NaN NaN NaN NaN
two -0.316654 0.143499 -2.165587 0.046614 NaN NaN NaN
three NaN NaN NaN NaN -1.535373 0.36438 -1.554006
[3 rows x 40 columns]
-
merge
-----类似于SQL中的连接 join
默认情况下merge
表示的是内连接 即 inner join,可以用how
指定为left
,right
和outer
。
没有使用on指定连接的列的时候,merge会将重叠的列名当作键,如果列名不同,可以用left_on
和right_on
分别指定。
In [127]: left = pd.DataFrame({'key':['foo','fool'],'lval':[1,2]})
In [128]: right = pd.DataFrame({'key':['foo','foo2'],'val':[4,5]})
In [129]: pd.merge(left,right)
Out[129]:
key lval val
0 foo 1 4
In [131]: pd.merge(left,right,on='key')
Out[131]:
key lval val
0 foo 1 4
In [133]: df3 = pd.DataFrame({'lkey':['b','b','a','c','a','a','b'],'data1':range(7)})
In [134]: df3
Out[134]:
data1 lkey
0 0 b
1 1 b
2 2 a
3 3 c
4 4 a
5 5 a
6 6 b
In [135]: df4 = pd.DataFrame({'rkey':['a','b','d'],'data2':range(3)})
In [136]: df4
Out[136]:
data2 rkey
0 0 a
1 1 b
2 2 d
In [137]: pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[137]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
In [138]: pd.merge(df3,df4,left_on='lkey',right_on='rkey',how='outer')
Out[138]:
data1 lkey data2 rkey
0 0 b 1 b
1 1 b 1 b
2 6 b 1 b
3 2 a 0 a
4 4 a 0 a
5 5 a 0 a
6 3 c NaN NaN
7 NaN NaN 2 d
当连接键位于索引上的时候可以通过传入right_index=True
和left_index=Ture
的参数将索引作为连接键。
-
append
-----将数据连接到一个DataFrame上
In [141]: df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
In [142]: df
Out[142]:
A B C D
0 0.681495 1.002195 0.098096 0.707791
1 -2.692089 -0.307324 1.157524 1.182020
2 0.362695 0.310697 -1.866646 0.926730
3 0.326436 0.076280 -0.211885 -0.785103
4 -1.700167 -0.216005 0.076874 0.908238
5 1.139810 0.375421 1.244933 1.320597
6 0.137495 -0.194466 1.188652 0.704212
7 0.166488 0.758015 -0.307003 -1.113543
In [145]: s=df.iloc[3]
In [146]: s
Out[146]:
A 0.326436
B 0.076280
C -0.211885
D -0.785103
Name: 3, dtype: float64
In [151]: df.append(s,ignore_index=True)
Out[151]:
A B C D
0 0.681495 1.002195 0.098096 0.707791
1 -2.692089 -0.307324 1.157524 1.182020
2 0.362695 0.310697 -1.866646 0.926730
3 0.326436 0.076280 -0.211885 -0.785103
4 -1.700167 -0.216005 0.076874 0.908238
5 1.139810 0.375421 1.244933 1.320597
6 0.137495 -0.194466 1.188652 0.704212
7 0.166488 0.758015 -0.307003 -1.113543
8 0.326436 0.076280 -0.211885 -0.785103
数据转换
- 移除重复数据
DataFrame中常常会出现重复数据,通常可以用duplicated
方法返回一个布尔型Series,表示各行是否是重复的。
drop_duplicates
方法用于返回一个移除了重复行的DataFrame。
In [157]: df
Out[157]:
data key
0 1 a
1 1 a
2 2 a
3 3 b
4 3 b
5 4 b
6 4 b
In [158]: df.duplicated()
Out[158]:
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
In [160]: df.drop_duplicates()
Out[160]:
data key
0 1 a
2 2 a
3 3 b
5 4 b
- 值替换
用fillna
方法填充缺失数据可以看作是数值替换的一种特殊情况,replace
则提供了一种实现该功能更简单,更灵活的方法。
In [163]: data = pd.Series([1.,-999.,2.,-999.,-1000.,3.])
In [164]: data
Out[164]:
0 1
1 -999
2 2
3 -999
4 -1000
5 3
dtype: float64
In [166]: data.replace(-999,np.nan)
Out[166]:
0 1
1 NaN
2 2
3 NaN
4 -1000
5 3
dtype: float64
In [167]: data.replace([-999,-1000],np.nan)
Out[167]:
0 1
1 NaN
2 2
3 NaN
4 NaN
5 3
dtype: float64
In [168]: data.replace([-999,-1000],[np.nan,0])
Out[168]:
0 1
1 NaN
2 2
3 NaN
4 0
5 3
dtype: float64
In [170]: data.replace({-999:np.nan,-1000:0})
Out[170]:
0 1
1 NaN
2 2
3 NaN
4 0
5 3
dtype: float64
网友评论