Series
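A Series is a one-dimensional, array-like object made up of a sequence of values and an associated set of labels (its index). You create one by calling the pandas Series constructor.
- Creating a Series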
import pandas as pd
s = pd.Series([2,3,4,1,5])
print(s)
>>>
0 2
1 3
2 4
3 1
4 5
dtype: int64
s.values
>>>array([2, 3, 4, 1, 5])
s.index
>>>RangeIndex(start=0, stop=5, step=1)
s2 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
print(s2)
>>>
a 3
b 2
c 4
d 1
e 5
dtype: int64
# Create a Series from a dict (named info so the built-in dict is not shadowed)
info = {'name':'joha','sex':'male','age':'18'}
s3 = pd.Series(info)
print(s3)
>>>
name joha
sex male
age 18
dtype: object
- Selecting one or more values from a Series by index label
s2 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
# Select a single value
s2['a']
>>>3
# Select several values at once
s2[['a','c','e']]
>>>
a 3
c 4
e 5
dtype: int64
s2[s2>3]
>>>
c 4
e 5
dtype: int64
s2*3
>>>
a 9
b 6
c 12
d 3
e 15
dtype: int64
'c' in s2
>>>True
'f' in s2
>>>False
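In arithmetic operations, Series are aligned automatically on their index labels; a label present in only one operand yields NaN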
s1 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
s2 = pd.Series([3,-5,1],index = ['a','c','e'])
print(s1+s2)
>>>
a 6.0
b NaN
c -1.0
d NaN
e 6.0
dtype: float64
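If the NaN entries are not wanted, one option is the method form of the operator. The sketch below reuses the same two Series and assumes that a missing label should count as 0; the variable name total is only illustrative.

```python
import pandas as pd

s1 = pd.Series([3, 2, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([3, -5, 1], index=['a', 'c', 'e'])

# Series.add aligns on the index just like `+`, but a label missing
# from one operand is treated as fill_value instead of producing NaN.
total = s1.add(s2, fill_value=0)
print(total)
```

The result is still float64, because alignment goes through a floating-point intermediate even when every gap is filled.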
The index of a Series can be changed in place by assignment
s2 = pd.Series([3,-5,1],index = ['a','c','e'])
s2.index = [1,2,3]
print(s2)
>>>
1 3
2 -5
3 1
dtype: int64
DataFrame
- Creating a DataFrame
test_dict = {'id':[1,2,3,4,5,6],
'name':['Alice','Bob','Cindy','Eric','Helen','Grace'],
'math':[90,89,99,78,97,93],
'english':[89,94,80,94,94,90]}
# [1] Pass test_dict directly as the first positional argument
test_dict_df = pd.DataFrame(test_dict)
print(test_dict_df)
>>>
id name math english
0 1 Alice 90 89
1 2 Bob 89 94
2 3 Cindy 99 80
3 4 Eric 78 94
4 5 Helen 97 94
5 6 Grace 93 90
# [2] Pass the dict through the data keyword; the result is the same
test_dict_df = pd.DataFrame(data=test_dict)
>>>
id name math english
0 1 Alice 90 89
1 2 Bob 89 94
2 3 Cindy 99 80
3 4 Eric 78 94
4 5 Helen 97 94
5 6 Grace 93 90
test_dict_df = pd.DataFrame(test_dict,columns=['name','math','english','id'])
print(test_dict_df)
>>>
name math english id
0 Alice 90 89 1
1 Bob 89 94 2
2 Cindy 99 80 3
3 Eric 78 94 4
4 Helen 97 94 5
5 Grace 93 90 6
- Selecting a column from a DataFrame
test_dict_df['name']
>>>
0 Alice
1 Bob
2 Cindy
3 Eric
4 Helen
5 Grace
Name: name, dtype: object
test_dict_df.name
>>>
0 Alice
1 Bob
2 Cindy
3 Eric
4 Helen
5 Grace
Name: name, dtype: object
- Assigning to a column
test_dict_df['id'] = pd.Series(['11','22','33','44','55'])
print(test_dict_df)
>>>
name math english id
0 Alice 90 89 11
1 Bob 89 94 22
2 Cindy 99 80 33
3 Eric 78 94 44
4 Helen 97 94 55
5 Grace 93 90 NaN
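The NaN in the last row appears because a Series assigned to a column is matched to rows by index label, not by position. A minimal sketch of the difference, using a small made-up frame (the column names from_series and from_list are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Cindy']})

# A Series is aligned on the index: row 2 has no matching label and
# becomes NaN, while label 5 does not exist in df and is ignored.
df['from_series'] = pd.Series(['11', '22', '99'], index=[0, 1, 5])

# A plain list is assigned by position and must match the row count.
df['from_list'] = ['11', '22', '33']
print(df)
```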
- Dropping a column
test_dict_df.drop(['id'],axis=1)
>>>
name math english
0 Alice 90 89
1 Bob 89 94
2 Cindy 99 80
3 Eric 78 94
4 Helen 97 94
5 Grace 93 90
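Note that drop returns a new object and leaves the original untouched, so the result has to be kept in a variable if it is needed later. A small sketch with a made-up frame (the names dropped and same are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob'], 'math': [90, 89]})

dropped = df.drop(['id'], axis=1)   # returns a new DataFrame
same = df.drop(columns=['id'])      # equivalent keyword form

print(list(df.columns))       # ['id', 'name', 'math']
print(list(dropped.columns))  # ['name', 'math']
```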
- Building a DataFrame from a multi-dimensional array
test_dict = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns = ['a','b','c'],index=[1,2,3])
print(test_dict)
>>>
a b c
1 1 2 3
2 4 5 6
3 7 8 9
print(test_dict.values)
>>>
[[1 2 3]
[4 5 6]
[7 8 9]]
- When building a Series or DataFrame, a list or array can be passed as the index
test_dict = pd.Series([1,2,3],index = ['a','b','c'])
print(test_dict)
>>>
a 1
b 2
c 3
dtype: int64
test_dict.index = ['c','d','e']
print(test_dict)
>>>
c 1
d 2
e 3
dtype: int64
- Reindexing in pandas with reindex (labels that did not exist before get NaN)
test_dict = pd.Series([1,2,3],index = ['a','b','c'])
test_dict1 = test_dict.reindex(['a','b','c','d','e'])
print(test_dict1)
>>>
a 1.0
b 2.0
c 3.0
d NaN
e NaN
dtype: float64
# Fill the new labels with a value instead of NaN
test_dict = pd.Series([1,2,3],index = ['a','b','c'])
test_dict1 = test_dict.reindex(['a','b','c','d','e'],fill_value = 0)
print(test_dict1)
>>>
a 1
b 2
c 3
d 0
e 0
dtype: int64
obj = pd.Series(['Jim','Mike','Jhon'],index = [0,3,6])
obj1 = obj.reindex(range(8),method = 'ffill')
print(obj1)
>>>
0 Jim
1 Jim
2 Jim
3 Mike
4 Mike
5 Mike
6 Jhon
7 Jhon
dtype: object
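reindex can rearrange rows and columns in a single call
import numpy as np
df = pd.DataFrame(np.arange(1,10).reshape((3,3)),index = ['d','a','c'],columns = ['Jim','Mike','Jhon'])
# Naming index= and columns= explicitly is unambiguous and works across pandas versions
df1 = df.reindex(index = ['a','b','c','d'],columns = ['Jhon','Mike','Jim'])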
print(df1)
>>>
Jhon Mike Jim
a 6.0 5.0 4.0
b NaN NaN NaN
c 9.0 8.0 7.0
d 3.0 2.0 1.0
- Dropping entries from an axis with Series.drop / DataFrame.drop
test_dict = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print('output:\n{}'.format(test_dict))
a = test_dict.drop(['a','c'])
print('>>>\n{}'.format(a))
>>>
output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
>>>
b 2
d 4
e 5
dtype: int64
df = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
print(df)
a = df.drop(['2','4'])
print(a)
>>>
a b c d e
1 0 1 2 3 4
2 5 6 7 8 9
3 10 11 12 13 14
4 15 16 17 18 19
5 20 21 22 23 24
>>>
a b c d e
1 0 1 2 3 4
3 10 11 12 13 14
5 20 21 22 23 24
df = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
print(df)
a = df.drop(['a','c'],axis = 1)
print(a)
>>>
a b c d e
1 0 1 2 3 4
2 5 6 7 8 9
3 10 11 12 13 14
4 15 16 17 18 19
5 20 21 22 23 24
>>>
b d e
1 1 3 4
2 6 8 9
3 11 13 14
4 16 18 19
5 21 23 24
- Indexing, selection and filtering
object = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
print(object[1:3])
>>>
b 2
c 4
dtype: int64
object[['a','b','c']]
>>>
a 3
b 2
c 4
dtype: int64
- Filtering with a boolean condition
object[object<4]
>>>
a 3
b 2
d 1
dtype: int64
- A DataFrame can be indexed by column as well as by row
object = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
print(object['b'])
>>>
1 1
2 6
3 11
4 16
5 21
Name: b, dtype: int64
print(object[['a','c']])
>>>
a c
1 0 2
2 5 7
3 10 12
4 15 17
5 20 22
- Indexing by row (an integer slice selects rows by position)
print(object[1:4])
>>>
a b c d e
2 5 6 7 8 9
3 10 11 12 13 14
4 15 16 17 18 19
- Indexing with a condition
object[object['b']>10]
>>>
a b c d e
3 10 11 12 13 14
4 15 16 17 18 19
5 20 21 22 23 24
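The bracket forms above mix column selection and row slicing. For explicit row and column selection, pandas also provides the loc (label-based) and iloc (position-based) indexers. A short sketch on the same 5x5 frame; the variable names are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)),
                  index=list('12345'), columns=list('abcde'))

by_position = df.iloc[1:4]      # rows at positions 1, 2, 3 (end excluded)
by_label = df.loc['2':'4']      # rows labelled '2' through '4' (end included)
cell = df.loc['3', 'b']         # single value: 11

# A boolean row filter and a column list can be combined in one step.
block = df.loc[df['b'] > 10, ['a', 'c']]
print(block)
```

Here both slices return the same three rows, but only because the labels happen to follow the positions; label slices include the end point, positional slices do not.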
- Arithmetic between objects with different indexes
df1 = pd.DataFrame(np.arange(9).reshape(3,3),columns = list('abc'),index = [1,2,3])
print(df1)
>>>
a b c
1 0 1 2
2 3 4 5
3 6 7 8
df2 = pd.DataFrame(np.arange(16).reshape(4,4),columns = list('bcde'),index = [2,3,4,5])
print(df2)
>>>
b c d e
2 0 1 2 3
3 4 5 6 7
4 8 9 10 11
5 12 13 14 15
print(df1+df2)
>>>
a b c d e
1 NaN NaN NaN NaN NaN
2 NaN 4.0 6.0 NaN NaN
3 NaN 11.0 13.0 NaN NaN
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN
df1.add(df2,fill_value = 0)
# Cells that are missing from both df1 and df2 are still NaN
>>>
a b c d e
1 0.0 1.0 2.0 NaN NaN
2 3.0 4.0 6.0 2.0 3.0
3 6.0 11.0 13.0 6.0 7.0
4 NaN 8.0 9.0 10.0 11.0
5 NaN 12.0 13.0 14.0 15.0
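- Arithmetic between a DataFrame and a Series
Operations between a DataFrame and a Series are broadcast: by default the Series index is matched against the DataFrame's columns and the operation is repeated for every row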
df1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('abc'))
print(df1)
>>>
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
series1 = pd.Series([3,4,5],index = ['a','b','c'])
print(series1)
>>>
a 3
b 4
c 5
dtype: int64
# Subtract series1 from every row
print(df1 - series1)
>>>
a b c
0 -3 -3 -3
1 0 0 0
2 3 3 3
3 6 6 6
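- Using a Series taken from the DataFrame itself (axis = 0 matches on the row index, so the Series is broadcast across the columns)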
series2 = df1['b']
series2
>>>
0 1
1 4
2 7
3 10
Name: b, dtype: int64
df1.sub(series2,axis = 0)
>>>
a b c
0 -1 0 1
1 -1 0 1
2 -1 0 1
3 -1 0 1
- Function application and mapping
NumPy's element-wise functions can be applied directly to a DataFrame
print(np.square(df1))
>>>
a b c
0 0 1 4
1 9 16 25
2 36 49 64
3 81 100 121
- apply calls a function on each row or each column of a DataFrame and collects the results into a new object
def fun(x):
    return x.max() - x.min()
df1.apply(fun,axis = 1)
>>>
0 2
1 2
2 2
3 2
dtype: int64
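With axis = 1 the function receives one row at a time. As a complement, the sketch below applies the same kind of function column by column (the default axis) and uses Series.map for a per-element mapping. The helper name value_range and the 'even'/'odd' labels are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

def value_range(x):
    # x is one column (a Series) when axis is left at its default of 0
    return x.max() - x.min()

per_column = df.apply(value_range)
print(per_column)

# Series.map transforms each element of a single column independently.
labels = df['b'].map(lambda v: 'even' if v % 2 == 0 else 'odd')
print(labels)
```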
- Sorting (sort_index sorts by label, sort_values sorts by the values of one or more columns)
df1 = pd.DataFrame(np.random.randn(4,4),columns=list('bcad'),index=[2,4,3,1])
print(df1)
>>>
b c a d
2 0.706356 -0.896474 -1.879608 0.322054
4 0.666188 -0.450170 0.914737 0.691662
3 -1.676381 -0.499211 -0.136020 -1.734251
1 -2.111717 -0.226238 1.656514 0.146311
print(df1.sort_index())
>>>
b c a d
1 -2.111717 -0.226238 1.656514 0.146311
2 0.706356 -0.896474 -1.879608 0.322054
3 -1.676381 -0.499211 -0.136020 -1.734251
4 0.666188 -0.450170 0.914737 0.691662
print(df1.sort_values(by=['b','a']))
>>>
b c a d
1 -2.111717 -0.226238 1.656514 0.146311
3 -1.676381 -0.499211 -0.136020 -1.734251
4 0.666188 -0.450170 0.914737 0.691662
2 0.706356 -0.896474 -1.879608 0.322054
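Because the frame above is random, its output changes on every run. The following sketch uses small fixed integers (made up for illustration) to show a few more sorting options: sorting the column labels, sorting in descending order, and sorting by two keys.

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 1, 2, 1], 'a': [4, 3, 1, 2]}, index=[2, 4, 3, 1])

cols_sorted = df.sort_index(axis=1)              # order the column labels: a, b
desc = df.sort_values(by='a', ascending=False)   # largest 'a' first
multi = df.sort_values(by=['b', 'a'])            # by 'b', ties broken by 'a'

print(desc)
print(multi)
```

Like drop, these methods return new objects and leave df in its original order.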
Statistical computations
- Sum: sum
- Maximum: max
- Minimum: min
- Variance: var
- Mean: mean
- Summary of all of the above: describe
print(df1.describe())
>>>
b c a d
count 4.000000 4.000000 4.000000 4.000000
mean -0.603889 -0.518024 0.138906 -0.143556
std 1.500402 0.278880 1.533517 1.084545
min -2.111717 -0.896474 -1.879608 -1.734251
25% -1.785215 -0.598527 -0.571917 -0.323829
50% -0.505096 -0.474691 0.389358 0.234183
75% 0.676230 -0.394187 1.100181 0.414456
max 0.706356 -0.226238 1.656514 0.691662
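The individual methods in the list can be called on their own as well. A small sketch on fixed data, so the numbers can be checked by hand; by default each method reduces down the rows and returns one value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

print(df.sum())    # a 18, b 22, c 26
print(df.max())    # a 9, b 10, c 11
print(df.min())    # a 0, b 1, c 2
print(df.mean())   # a 4.5, b 5.5, c 6.5
print(df.var())    # sample variance (ddof=1): 15.0 for every column

row_totals = df.sum(axis=1)   # axis=1 reduces across the columns instead
print(row_totals)
```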
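Handling missing data
- dropna: removes NaN entries
- fillna: fills missing entries with a value you specify
- isnull: returns an object of booleans marking where the NaN values are
- notnull: the negation of isnull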
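A minimal sketch of the four methods on a small Series with two gaps (the data and the fill value 0 are made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = s.isnull()      # True where the value is NaN
present = s.notnull()  # the exact opposite of mask
cleaned = s.dropna()   # keeps only the rows at index 0 and 2
filled = s.fillna(0)   # replaces each NaN with 0

print(mask.tolist())    # [False, True, False, True]
print(cleaned)
print(filled.tolist())  # [1.0, 0.0, 3.0, 0.0]
```

All four return new objects, so s itself still contains its NaN values afterwards.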