1. Matrix operations in pandas
import numpy as np
import pandas as pd
from pandas import DataFrame,Series
In [3]:
s1 = Series([1, 2, 3],index=['A','B','C'])
s1
Out[3]:
A 1
B 2
C 3
dtype: int64
In [4]:
s2 = Series([4, 5, 6, 7],index=['B','C','D','E'])
s2
Out[4]:
B 4
C 5
D 6
E 7
dtype: int64
In [5]:
# add two Series; labels are aligned, and labels present on only one side give NaN
s1 + s2
Out[5]:
A NaN
B 6.0
C 8.0
D NaN
E NaN
dtype: float64
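If the NaN values for unmatched labels are not wanted, one option is the `add` method with `fill_value`. The following is a minimal sketch that rebuilds the same two Series; the variable name `filled` is only illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([4, 5, 6, 7], index=['B', 'C', 'D', 'E'])

# A label missing from one side is treated as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```

The result is still float64, because the alignment step introduces missing values before they are filled.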
In [8]:
df1 = DataFrame(np.arange(4).reshape(2, 2),index=['A','B'],columns=['BJ','SH'])
df1
Out[8]:
BJ SH
A 0 1
B 2 3
In [9]:
df2 = DataFrame(np.arange(9).reshape(3, 3),index=['A','B','C'],columns=['BJ','SH','GZ'])
df2
Out[9]:
BJ SH GZ
A 0 1 2
B 3 4 5
C 6 7 8
In [10]:
df1 + df2
Out[10]:
BJ GZ SH
A 0.0 NaN 2.0
B 5.0 NaN 7.0
C NaN NaN NaN
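The same alignment rule applies to both the rows and the columns of a DataFrame. As a sketch, `DataFrame.add` with `fill_value=0` keeps the cells that exist in only one of the two frames; `total` is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape(2, 2),
                   index=['A', 'B'], columns=['BJ', 'SH'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   index=['A', 'B', 'C'], columns=['BJ', 'SH', 'GZ'])

# Cells present in only one frame are added to 0 rather than becoming NaN.
# A cell missing from both frames would still be NaN.
total = df1.add(df2, fill_value=0)
print(total)
```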
In [11]:
df3 = DataFrame([[1, 2, 3],[4, 5,np.nan],[7,8,9]],index=['A','B','C'],columns=['c1','c2','c3'])
df3
Out[11]:
c1 c2 c3
A 1 2 3.0
B 4 5 NaN
C 7 8 9.0
In [12]:
df3.sum() # column sums; NaN values are skipped automatically
Out[12]:
c1 12.0
c2 15.0
c3 12.0
dtype: float64
In [13]:
df3.sum(axis=1) # row sums; NaN values are skipped automatically
Out[13]:
A 6.0
B 9.0
C 24.0
dtype: float64
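Skipping NaN is the default behaviour, not the only one. As a small sketch, passing `skipna=False` makes a sum NaN as soon as any input is missing, and `count()` shows how many non-missing values each column has. The names `strict` and `counts` are illustrative.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]],
                   index=['A', 'B', 'C'], columns=['c1', 'c2', 'c3'])

# With skipna=False, the missing value in c3 propagates into the sum.
strict = df3.sum(skipna=False)

# count() reports the number of non-NaN entries per column.
counts = df3.count()
print(strict)
print(counts)
```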
In [14]:
df3.min()
Out[14]:
c1 1.0
c2 2.0
c3 3.0
dtype: float64
In [15]:
df3.min(axis=1)
Out[15]:
A 1.0
B 4.0
C 7.0
dtype: float64
In [16]:
df3.describe() # summary statistics for each column
Out[16]:
c1 c2 c3
count 3.0 3.0 2.000000
mean 4.0 5.0 6.000000
std 3.0 3.0 4.242641
min 1.0 2.0 3.000000
25% 2.5 3.5 4.500000
50% 4.0 5.0 6.000000
75% 5.5 6.5 7.500000
max 7.0 8.0 9.000000
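The individual numbers in the describe() table can be reproduced with the corresponding methods. The sketch below checks column c3; it relies on the pandas defaults that `std()` uses the sample standard deviation (`ddof=1`) and that `quantile()` interpolates linearly.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame([[1, 2, 3], [4, 5, np.nan], [7, 8, 9]],
                   index=['A', 'B', 'C'], columns=['c1', 'c2', 'c3'])
c3 = df3['c3']  # non-missing values: 3.0 and 9.0

sample_std = c3.std()            # ddof=1, the value shown by describe()
population_std = c3.std(ddof=0)  # divides by n instead of n - 1
q25 = c3.quantile(0.25)          # the 25% row of describe()
print(sample_std, population_std, q25)
```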
2. Sorting in pandas
Sorting is done by calling methods on the object:
s2 = s1.sort_values()
s2 = s1.sort_values(ascending=False)
s2.sort_index(ascending=False)
A Series can be sorted by its index or by its values.
A DataFrame can additionally be sorted along its rows or its columns.
import numpy as np
import pandas as pd
import string
from pandas import DataFrame,Series
s1 = Series(np.random.rand(10))
s2 = s1.sort_values() # note the plural: sort_values, not sort_value
s2 = s1.sort_values(ascending=False)
s2.sort_index(ascending=False)
df1 = DataFrame(np.random.rand(40).reshape(8,5),columns=list(string.ascii_uppercase[:5]))
df1['A'].sort_values() # select one column and sort it; the result is a Series
df1.sort_values('A') # sort the rows by one column; the result is still a DataFrame
df1.sort_values('A',ascending=False)
df2 = df1.sort_values('A',ascending=False)
df2.sort_index()
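Because the frames above contain random numbers, their sorted output differs on every run. The sketch below uses a small fixed frame (made up for illustration) to show sorting by two columns, restoring the original row order, and sorting the columns themselves.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [0.3, 0.9, 0.1, 0.5]})

# Sort by A ascending, and within equal A values by B descending.
by_two = df.sort_values(['A', 'B'], ascending=[True, False])

# sort_index() puts the rows back in their original label order.
restored = by_two.sort_index()

# axis=1 sorts the column labels instead of the row labels.
cols_desc = df.sort_index(axis=1, ascending=False)
print(by_two)
```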
3. Renaming rows and columns in pandas
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
csv_input = 'movie_metadata.csv'
data = pd.read_csv(csv_input)[['movie_title','director_name','imdb_score']] # keep only these three columns
data.columns
Out[6]:
Index(['movie_title', 'director_name', 'imdb_score'], dtype='object')
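data.columns = Series(['m_v','d_n','i_s']) # rename the columns by assigning new names directly
data.columns = data.columns.map(str.upper) # convert every column name to upper case
data = data.rename(columns=str.lower) # rename() with a function: convert every column name to lower case
data = data.rename(columns={'m_v':'mov'}) # rename() with a dict: change only the m_v column
def test_map(x):
    return x + '_ABC'
data.columns = data.columns.map(test_map) # every column name now ends with _ABC
data.rename(columns=test_map) # returns a new DataFrame with _ABC appended again; data itself is unchanged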
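The section title also mentions rows. As a sketch, `rename` accepts an `index` mapping alongside `columns`, and it returns a new DataFrame by default. The small frame here is a made-up stand-in so the example runs without movie_metadata.csv.

```python
import pandas as pd

# Hypothetical stand-in for the movie data.
data = pd.DataFrame({'movie_title': ['Film A', 'Film B'],
                     'imdb_score': [7.9, 7.1]})

# Rename one column and one row label in a single call.
renamed = data.rename(columns={'movie_title': 'title'},
                      index={0: 'first'})
print(renamed)
print(data.columns)  # the original frame keeps its old names
```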
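4. The merge operation
pd.merge(df1, df2) joins two DataFrames on the columns they have in common. Rows whose key values appear in both frames are combined; with the default inner join, rows without a match are dropped.
pd.merge(df1,df2,on='key',how='inner') lets you specify the column to join on and the join type.
Join types: left / right / inner / outer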
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
In [3]:
df1 = DataFrame({
'key':['X','Y','Z'],'data_set_1':[1,2,3]
})
In [4]:
df1
Out[4]:
key data_set_1
0 X 1
1 Y 2
2 Z 3
In [22]:
df2 = DataFrame({
'key':['X','B','C'],'data_set_2':[4,5,6]
})
In [23]:
df2
Out[23]:
key data_set_2
0 X 4
1 B 5
2 C 6
In [24]:
pd.merge(df1, df2)
Out[24]:
key data_set_1 data_set_2
0 X 1 4
In [26]:
df1 = DataFrame({
'key':['X','Y','Z','X'],'data_set_1':[1,2,3,4]
})
In [27]:
df1
Out[27]:
key data_set_1
0 X 1
1 Y 2
2 Z 3
3 X 4
In [28]:
pd.merge(df1,df2,on='key')
Out[28]:
key data_set_1 data_set_2
0 X 1 4
1 X 4 4
In [29]:
pd.merge(df1,df2,on='key',how='inner')
Out[29]:
key data_set_1 data_set_2
0 X 1 4
1 X 4 4
In [30]:
pd.merge(df1,df2,on='key',how='left')
Out[30]:
key data_set_1 data_set_2
0 X 1 4.0
1 Y 2 NaN
2 Z 3 NaN
3 X 4 4.0
In [31]:
pd.merge(df1,df2,on='key',how='right')
Out[31]:
key data_set_1 data_set_2
0 X 1.0 4
1 X 4.0 4
2 B NaN 5
3 C NaN 6
In [32]:
pd.merge(df1,df2,on='key',how='outer')
Out[32]:
key data_set_1 data_set_2
0 X 1.0 4.0
1 X 4.0 4.0
2 Y 2.0 NaN
3 Z 3.0 NaN
4 B NaN 5.0
5 C NaN 6.0
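When it is unclear where the rows of an outer join came from, the `indicator=True` argument of `pd.merge` adds a `_merge` column with the values `left_only`, `right_only`, or `both`. A minimal sketch on the same two frames follows; the row order of an outer join can differ between pandas versions, so the example only counts the rows.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['X', 'Y', 'Z', 'X'], 'data_set_1': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['X', 'B', 'C'], 'data_set_2': [4, 5, 6]})

# indicator=True records which side each output row was found in.
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

n_both = int((merged['_merge'] == 'both').sum())         # key X, twice
n_left = int((merged['_merge'] == 'left_only').sum())    # keys Y and Z
n_right = int((merged['_merge'] == 'right_only').sum())  # keys B and C
print(merged)
```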
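5. concatenate and combine
(1) pd.concat concatenates along rows by default; pass axis=1 to concatenate along columns. When two Series are concatenated along columns and their index labels differ, the positions with no value are filled with NaN.
(2) s1.combine_first(s2) fills the gaps in s1 with data from s2: where s1 is NaN, or lacks a label that s2 has, the value from s2 is used; where s1 already has a value, it is kept.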
concat
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
s1 = Series([1, 2, 3],index=['X','Y','Z'])
s2 = Series([4, 5],index=['A','B'])
pd.concat([s1, s2])
pd.concat([s1, s2], axis=1)
df1 = DataFrame(np.random.rand(4, 3),columns=['X','Y','Z'])
df2 = DataFrame(np.random.rand(3, 3),columns=['X','Y','A'])
pd.concat([df1,df2])
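The calls above do not show their output, so the sketch below inspects the results for the two Series. It also uses two optional `pd.concat` arguments that do not appear in the original: `keys` to name the resulting columns and `ignore_index` to renumber the rows.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['X', 'Y', 'Z'])
s2 = pd.Series([4, 5], index=['A', 'B'])

# Row-wise: the two Series are stacked and keep their own labels.
rows = pd.concat([s1, s2])

# Column-wise: the labels are unioned, so each column has NaN where
# the other Series supplied the label.
cols = pd.concat([s1, s2], axis=1, keys=['s1', 's2'])

# ignore_index=True discards the labels and numbers the rows 0..n-1.
renumbered = pd.concat([s1, s2], ignore_index=True)
print(cols)
```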
combine
s1 = Series([1, np.nan, 3,np.nan],index=['A','B','C','D'])
s2 = Series([1,2,3,4],index=['A','B','C','D'])
s1.combine_first(s2)
df1 =DataFrame({
'X':[1,np.nan,3,np.nan],
'Y':[5,np.nan,7,np.nan],
'Z':[9,np.nan,11,np.nan]
})
df2 =DataFrame({
'Z':[np.nan,10,np.nan,12],
'A':[1, 2, 3 ,4]
})
In [31]:
df1.combine_first(df2)
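To make the effect of combine_first concrete, the sketch below rebuilds the same objects and checks which values end up in the result. The names `patched` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan, 3, np.nan], index=['A', 'B', 'C', 'D'])
s2 = pd.Series([1, 2, 3, 4], index=['A', 'B', 'C', 'D'])

# The NaN entries of s1 (labels B and D) are taken from s2.
patched = s1.combine_first(s2)

df1 = pd.DataFrame({'X': [1, np.nan, 3, np.nan],
                    'Y': [5, np.nan, 7, np.nan],
                    'Z': [9, np.nan, 11, np.nan]})
df2 = pd.DataFrame({'Z': [np.nan, 10, np.nan, 12],
                    'A': [1, 2, 3, 4]})

# Z is patched from df2, A is added from df2, and X and Y keep their
# NaN values because df2 has nothing to offer for those columns.
result = df1.combine_first(df2)
print(patched)
print(result)
```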