ps:在pandas中,对索引的操作默认都为行索引
层次化索引
In [61]: data=pd.Series(np.random.randn(9),index=[['a','a','a','b','b','c','c','d','d'],[1,2,3,1,2,1,3,1,5]])
In [62]: data
Out[62]:
a 1 -0.544257
2 0.129334
3 -0.169315
b 1 -0.290376
2 -0.134793
c 1 1.854275
3 -0.912308
d 1 -0.246389
5 -0.551807
dtype: float64
层次化索引在数据重塑和基于分组的操作(如透视表生成)
将多级索引的Series转为DataFrame:unstack方法:
In [63]: data.unstack()
Out[63]:
1 2 3 5
a -0.544257 0.129334 -0.169315 NaN
b -0.290376 -0.134793 NaN NaN
c 1.854275 NaN -0.912308 NaN
d -0.246389 NaN NaN -0.551807
重排与分级排序
交换索引级别
frame.swaplevel(0,1,axis='columns')
选择索引排序的级别
frame.sort_index(level='key1')
根据级别汇总统计
In [109]: frame
Out[109]:
kind1 male female
kind2 e e u u
key1 key2
a 1 5 6 2 8
2 0 9 3 5
b 1 0 3 7 7
2 2 1 7 6
In [110]: frame.sum(level='key1')
Out[110]:
kind1 male female
kind2 e e u u
key1
a 5 15 5 13
b 2 4 14 13
使用DataFrame的列进行索引
将列作为索引:
In [118]: frame
Out[118]:
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
参数drop决定是否在原对象中保留索引列。(默认为True,不保留)
In [119]: frame.set_index(['c','d'],drop=False)
Out[119]:
a b c d
c d
one 0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two 0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
与set_index相反:reset_index
8.2 合并数据集
pandas.merge:可根据一个或多个键将不同DataFrame中的行连接起来;相当于数据库的join操作
pandas.concat:可以沿着一条轴将多个对象堆叠到一起。
实例方法combine_first可以将重复数据拼接在一起,用一个对象中的值填充另一个对象中的缺失值。
多对一
连接两个表:pd.merge
参数on默认为属性有重叠的列的列名
列名相同
In [135]: pd.merge(df1,df2,on='key')
列名不同:分别指定参数left_on和参数right_on
In [145]: df3
Out[145]:
lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
In [146]: df4
Out[146]:
rkey data2
0 a 0
1 b 1
2 d 2
In [147]: pd.merge(df3,df4,left_on='lkey',right_on='rkey')
Out[147]:
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0
注意到:key值为c和的数据消失了
这是因为默认情况下,merge做的是“内连接”;结果中的键是交集。其他方式还有"left"、"right"以及"outer"。外连接求取的是键的并集,组合了左连接和右连接的效果:
修改参数how即可改变连接方式:(how的可选参数为:inner,outer,right,left)
In [149]: pd.merge(df3,df4,left_on='lkey',right_on='rkey',how='outer')
lkey data1 rkey data2
0 b 0.0 b 1.0
1 b 1.0 b 1.0
2 b 6.0 b 1.0
3 a 2.0 a 0.0
4 a 4.0 a 0.0
5 a 5.0 a 0.0
6 c 3.0 NaN NaN
7 NaN NaN d 2.0
多对多
多对多产生的是行的笛卡儿积
In [151]: df1
Out[151]:
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
In [152]: df2
Out[152]:
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
In [153]: pd.merge(df1,df2,how='inner')
Out[153]:
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2
根据多个键进行合并:给on参数传入一个键列表:
In [163]: pd.merge(left,right,on=['key1','key2'])
Out[163]:
key1 key2 lval rval
0 foo one 1 4
1 foo one 1 5
2 bar one 3 6
处理重复列名:merge的suffixes选项
In [171]: pd.merge(left,right,on='key1',suffixes=['_left','_right'])
Out[171]:
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7
image.png
image.png
连接层次化索引的数据,以索引作为连接列:
In [36]: righth
Out[36]:
event1 event2
Nevada 2001 0 1
2000 2 3
Ohio 2000 4 5
2000 6 7
2001 8 9
2002 10 11
In [37]: lefth
Out[37]:
key1 key2 data
0 Ohio 2000 0.0
1 Ohio 2001 1.0
2 Ohio 2002 2.0
3 Nevada 2001 3.0
4 Nevada 2002 4.0
要将索引作为连接列,即设置参数righ_index为True
In [39]: pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='outer')
Out[39]:
key1 key2 data event1 event2
0 Ohio 2000 0.0 4.0 5.0
0 Ohio 2000 0.0 6.0 7.0
1 Ohio 2001 1.0 8.0 9.0
2 Ohio 2002 2.0 10.0 11.0
3 Nevada 2001 3.0 0.0 1.0
4 Nevada 2002 4.0 NaN NaN
4 Nevada 2000 NaN 2.0 3.0
当然可以同时使用双方的索引作为连接列
In [46]: left2
Out[46]:
Ohio Nevada
a 1.0 2.0
c 3.0 4.0
e 5.0 6.0
In [47]: right2
Out[47]:
Missouri Alabama
b 7.0 8.0
c 9.0 10.0
d 11.0 12.0
e 13.0 14.0
In [48]: pd.merge(left2,right2,left_index=True,right_index=True)
Out[48]:
Ohio Nevada Missouri Alabama
c 3.0 4.0 9.0 10.0
e 5.0 6.0 13.0 14.0
或者直接使用DataFrame对象的实例方法:join来直接实现索引合并
left2.join(right2,how='outer')
join方法支持连接多的DataFrame对象:
In [56]: another
Out[56]:
New York Oregon
a 7.0 8.0
c 9.0 10.0
e 11.0 12.0
f 16.0 17.0
In [57]: left2.join([right2,another])
Out[57]:
Ohio Nevada Missouri Alabama New York Oregon
a 1.0 2.0 NaN NaN 7.0 8.0
c 3.0 4.0 9.0 10.0 9.0 10.0
e 5.0 6.0 13.0 14.0 11.0 12.0
轴向连接:np.concatenate 和pd.concat
numpy的concatenate方法
np.concatenate([arr,arr],axis=1)
pandas的concat方法
pd.concat([s1,s2,s3],axis=1)
Out[103]:
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
In [89]: pd.concat([s1, s4], axis=1)
Out[89]:
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
In [90]: pd.concat([s1, s4], axis=1, join='inner')
Out[90]:
0 1
a 0 0
b 1 1
在连接轴上创建一个层次化索引:使用keys参数
In [122]: pd.concat([s1,s2],keys=['key1','key2'])
Out[122]:
key1 a 0
b 1
key2 c 2
d 3
e 4
dtype: int64
该逻辑同样使用与DataFrame对象
In [136]: pd.concat([df1,df2],axis=1,keys=['level1','level2'])
Out[136]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
如果传入的是字典而不是列表,则字典的键就会作为keys的值
In [141]: pd.concat({'level1':df1,'level2':df2},axis=1)
Out[141]:
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
可以给层次索引命名
In [22]: pd.concat([df1,df2],axis=1,keys=['level1','leve12'],names=['high','low'])
Out[22]:
high level1 leve12
low one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
当待合并DataFrame的索引不包括任何东西,此时我们通常需要对行索引依次重新排序
使用参数ignore_index=True
In [28]: pd.concat([df1,df2],ignore_index=True)
Out[28]:
a b c d
0 -0.404371 0.343314 1.938941 1.078406
1 -0.270911 0.500551 0.334860 -0.833356
2 0.393762 1.439736 -0.115172 -0.085488
3 -0.200651 -0.475177 NaN 0.157968
4 0.115498 0.429316 NaN -0.717663
合并重叠数据
回顾np.where方法:相当于if-else
In [69]: a
Out[69]:
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
In [70]: b
Out[70]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
实现了功能:a中空缺的数据用b中的数据去填补
In [71]: np.where(a.isnull(),b,a)
Out[71]: array([0. , 2.5, 2. , 3.5, 4.5, nan])
Series对象也有类似的功能:Series.combine_first
In [74]: a.combine_first(b)
Out[74]:
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a NaN
dtype: float64
8.3 重塑和轴向旋转
有许多用于重新排列表格型数据的基础运算。这些函数也称作重塑(reshape)或轴向旋转(pivot)运算。
重塑层次化索引
stack:将数据的列“旋转”为行。
unstack:将数据的行“旋转”为列。
In [90]: data
Out[90]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
In [92]: data.stack()
Out[92]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int32
对于Series和DataFrame通过stack和unstack转换的理解:stack意为:堆叠。多级索引即是将索引进行了堆叠,所以用unstack即使之不堆叠,从而Series=》DataFrame。反之同理
网友评论