A Handbook of Basic pandas Operations

Author: 张小张x86 | Published: 2019-04-22 12:14

Series

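A Series is a one-dimensional, array-like object made up of a sequence of values and an associated sequence of labels (its index). It is created by calling the pandas Series constructor.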

  • Creating a Series
import numpy as np
import pandas as pd
s = pd.Series([2,3,4,1,5])
print(s)
>>>
0    2
1    3
2    4
3    1
4    5
dtype: int64
s.values
>>>array([2, 3, 4, 1, 5])
s.index
>>>RangeIndex(start=0, stop=5, step=1)
s2 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
print(s2)
>>>
a    3
b    2
c    4
d    1
e    5
dtype: int64
# Create a Series from a dict (named info so the built-in dict is not shadowed)
info = {'name':'joha','sex':'male','age':'18'}
s3 = pd.Series(info)
print(s3)
>>>
name    joha
sex     male
age       18
dtype: object
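
When a Series is built from a dict, an explicit index can also be passed to pick and order the keys. The following is a minimal sketch; the names person and fields are made up for illustration. Labels that are not keys of the dict come back as missing values.

```python
import pandas as pd

# Hypothetical data, used only for illustration
person = {'name': 'joha', 'sex': 'male', 'age': '18'}
fields = ['age', 'name', 'city']   # 'city' is not a key of person

s = pd.Series(person, index=fields)
print(s)

# Labels that are absent from the dict are marked as missing
print(s.isnull())
```
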
  • Selecting one or more values from a Series by index label
s2 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
# Select a single value
s2['a']
>>>3

# Select several values at once
s2[['a','c','e']]
>>>
a    3
c    4
e    5
dtype: int64

# Boolean filtering: keep only the values greater than 3
s2[s2>3]
>>>
c    4
e    5
dtype: int64

# Arithmetic is applied element by element and the index is preserved
s2*3
>>>
a     9
b     6
c    12
d     3
e    15
dtype: int64

# The in operator tests index labels, not values
'c' in s2
>>>True
'f' in s2
>>>False

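In arithmetic operations, Series are aligned automatically on their index labels. A label that appears in only one operand yields NaN.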

s1 = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
s2 = pd.Series([3,-5,1],index = ['a','c','e'])
print(s1+s2)
>>>
a    6.0
b    NaN
c   -1.0
d    NaN
e    6.0
dtype: float64
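
If NaN is not wanted for labels that exist in only one operand, the add method accepts a fill_value. A minimal sketch, reusing the two Series from the example above (the name total is only illustrative):

```python
import pandas as pd

s1 = pd.Series([3, 2, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([3, -5, 1], index=['a', 'c', 'e'])

# Treat a label missing from one operand as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```
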

The index of a Series can be changed in place by assigning to it.

s2 = pd.Series([3,-5,1],index = ['a','c','e'])
s2.index = [1,2,3]
print(s2)
>>>
1    3
2   -5
3    1
dtype: int64
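
The new index has to contain exactly as many labels as the Series has values. A small sketch of what happens otherwise, under the assumption that a length mismatch is the only problem:

```python
import pandas as pd

s = pd.Series([3, -5, 1], index=['a', 'c', 'e'])

try:
    s.index = [1, 2]          # two labels for three values
except ValueError as err:
    print('rejected:', err)

# The original index is still in place after the failed assignment
print(list(s.index))
```
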

DataFrame

  • Creating a DataFrame
test_dict = {'id':[1,2,3,4,5,6],
             'name':['Alice','Bob','Cindy','Eric','Helen','Grace '],
             'math':[90,89,99,78,97,93],
             'english':[89,94,80,94,94,90]}
# [1] Pass test_dict directly as a positional argument
test_dict_df = pd.DataFrame(test_dict)
print(test_dict_df)
>>>
   id    name  math  english
0   1   Alice    90       89
1   2     Bob    89       94
2   3   Cindy    99       80
3   4    Eric    78       94
4   5   Helen    97       94
5   6  Grace     93       90
# [2] Pass the dict through the data keyword argument
test_dict_df = pd.DataFrame(data=test_dict)
print(test_dict_df)
>>>
   id    name  math  english
0   1   Alice    90       89
1   2     Bob    89       94
2   3   Cindy    99       80
3   4    Eric    78       94
4   5   Helen    97       94
5   6  Grace     93       90
# [3] Use columns to choose the column order
test_dict_df = pd.DataFrame(test_dict,columns=['name','math','english','id'])
print(test_dict_df)
>>>
     name  math  english  id
0   Alice    90       89   1
1     Bob    89       94   2
2   Cindy    99       80   3
3    Eric    78       94   4
4   Helen    97       94   5
5  Grace     93       90   6
  • Selecting a column from a DataFrame
test_dict_df['name']
>>>
0     Alice
1       Bob
2     Cindy
3      Eric
4     Helen
5    Grace 
Name: name, dtype: object

# Attribute-style access returns the same column
test_dict_df.name
>>>
0     Alice
1       Bob
2     Cindy
3      Eric
4     Helen
5    Grace 
Name: name, dtype: object
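
Attribute access is a convenience. It only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. A small sketch with made-up column names:

```python
import pandas as pd

# Hypothetical columns: 'count' collides with DataFrame.count,
# and 'math score' is not a valid identifier
df = pd.DataFrame({'count': [1, 2], 'math score': [90, 89]})

print(callable(df.count))          # the method, not the column
print(df['count'].tolist())        # bracket access always reaches the column
print(df['math score'].tolist())
```
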
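  • Assigning to a column
# The Series has only five values (index 0 to 4). Rows are matched by index,
# so the row with index 5 receives NaN.
test_dict_df['id'] = pd.Series(['11','22','33','44','55'])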
print(test_dict_df)
>>>
     name  math  english   id
0   Alice    90       89   11
1     Bob    89       94   22
2   Cindy    99       80   33
3    Eric    78       94   44
4   Helen    97       94   55
5  Grace     93       90  NaN
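
The NaN in the last row comes from index alignment. The sketch below contrasts assigning a Series (matched on labels) with assigning a plain list (matched on position); the column names by_label and by_position are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Cindy']})

# A Series is matched on index labels, not on position
df['by_label'] = pd.Series([30, 10], index=[2, 0])

# A plain list is assigned by position and must match the row count
df['by_position'] = [1, 2, 3]
print(df)
```
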
  • Dropping a column
# drop returns a new DataFrame; test_dict_df itself is not modified
test_dict_df.drop(['id'],axis=1)
>>>
     name  math  english
0   Alice    90       89
1     Bob    89       94
2   Cindy    99       80
3    Eric    78       94
4   Helen    97       94
5  Grace     93       90
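
Because drop returns a new object, the result has to be kept explicitly. A minimal sketch on a small made-up frame; the keyword columns= is equivalent to passing axis=1:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob'], 'math': [90, 89]})

dropped = df.drop(['id'], axis=1)      # new object, df is unchanged
print(list(df.columns))
print(list(dropped.columns))

# To keep the result, rebind the name
df = df.drop(columns=['id'])
print(list(df.columns))
```
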
  • Building a DataFrame from a two-dimensional array (nested list)
test_dict = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],columns = ['a','b','c'],index=[1,2,3])
print(test_dict)
>>>
   a  b  c
1  1  2  3
2  4  5  6
3  7  8  9

print(test_dict.values)
>>>
[[1 2 3]
 [4 5 6]
 [7 8 9]]
  • When constructing a Series or DataFrame, an array-like can be passed as the index
test_dict = pd.Series([1,2,3],index = ['a','b','c'])
print(test_dict)
>>>
a    1
b    2
c    3
dtype: int64

test_dict.index = ['c','d','e']
print(test_dict)
>>>
c    1
d    2
e    3
dtype: int64
  • Reindexing with reindex
test_dict = pd.Series([1,2,3],index = ['a','b','c'])
test_dict1 = test_dict.reindex(['a','b','c','d','e'])
print(test_dict1)
>>>
a    1.0
b    2.0
c    3.0
d    NaN
e    NaN
dtype: float64
# Fill the newly introduced labels with fill_value instead of NaN
test_dict = pd.Series([1,2,3],index = ['a','b','c'])
test_dict1 = test_dict.reindex(['a','b','c','d','e'],fill_value = 0)
print(test_dict1)
>>>
a    1
b    2
c    3
d    0
e    0
dtype: int64
# method='ffill' forward-fills each new label from the last known value
obj = pd.Series(['Jim','Mike','Jhon'],index = [0,3,6])
obj1 = obj.reindex(range(8),method = 'ffill')
print(obj1)
>>>
0     Jim
1     Jim
2     Jim
3    Mike
4    Mike
5    Mike
6    Jhon
7    Jhon
dtype: object
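
Forward filling requires the original index to be monotonically increasing or decreasing. reindex also takes a limit argument that caps how many consecutive new labels are filled; the sketch below assumes the same three-name Series as above, and obj2 is just an illustrative name.

```python
import pandas as pd

obj = pd.Series(['Jim', 'Mike', 'Jhon'], index=[0, 3, 6])

# limit=1 fills at most one consecutive missing label after each known value
obj2 = obj.reindex(range(8), method='ffill', limit=1)
print(obj2)

# Labels 2 and 5 are beyond the limit and stay missing
print(obj2.isnull())
```
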

reindex can act on the rows and the columns of a DataFrame

df = pd.DataFrame(np.arange(1,10).reshape((3,3)),index = ['d','a','c'],columns = ['Jim','Mike','Jhon'])
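# Name the axes explicitly; recent pandas versions no longer accept columns as a second positional argument
df1 = df.reindex(index=['a','b','c','d'],columns=['Jhon','Mike','Jim'])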
print(df1)
>>>
   Jhon  Mike  Jim
a   6.0   5.0  4.0
b   NaN   NaN  NaN
c   9.0   8.0  7.0
d   3.0   2.0  1.0
  • Dropping entries along an axis with drop
test_dict = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
print(test_dict)
a = test_dict.drop(['a','c'])
print(a)
>>>
a    1
b    2
c    3
d    4
e    5
dtype: int64
>>>
b    2
d    4
e    5
dtype: int64
df = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
print(df)
a = df.drop(['2','4'])
print(a)
>>>
    a   b   c   d   e
1   0   1   2   3   4
2   5   6   7   8   9
3  10  11  12  13  14
4  15  16  17  18  19
5  20  21  22  23  24
>>>
    a   b   c   d   e
1   0   1   2   3   4
3  10  11  12  13  14
5  20  21  22  23  24
df = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
print(df)
a = df.drop(['a','c'],axis = 1)
print(a)
>>>
    a   b   c   d   e
1   0   1   2   3   4
2   5   6   7   8   9
3  10  11  12  13  14
4  15  16  17  18  19
5  20  21  22  23  24
>>>
    b   d   e
1   1   3   4
2   6   8   9
3  11  13  14
4  16  18  19
5  21  23  24
  • Indexing, selection, and filtering
object = pd.Series([3,2,4,1,5],index = ['a','b','c','d','e'])
print(object[1:3])
>>>
b    2
c    4
dtype: int64

object[['a','b','c']]
>>>
a    3
b    2
c    4
dtype: int64
  • Filtering by condition
object[object<4]
>>>
a    3
b    2
d    1
dtype: int64
  • A DataFrame can be indexed by column or by row
object = pd.DataFrame(np.arange(25).reshape((5,5)),index = list('12345'),columns = list('abcde'))
print(object['b'])
>>>
1     1
2     6
3    11
4    16
5    21
Name: b, dtype: int64

print(object[['a','c']])
>>>
    a   c
1   0   2
2   5   7
3  10  12
4  15  17
5  20  22
  • Indexing by row (a slice selects rows by position)
print(object[1:4])
>>>
    a   b   c   d   e
2   5   6   7   8   9
3  10  11  12  13  14
4  15  16  17  18  19
  • Indexing by condition
object[object['b']>10]
>>>
    a   b   c   d   e
3  10  11  12  13  14
4  15  16  17  18  19
5  20  21  22  23  24
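
The bracket forms above mix label-based and position-based behavior. The loc and iloc accessors make the choice explicit. A minimal sketch on the same 5x5 frame (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape((5, 5)),
                  index=list('12345'), columns=list('abcde'))

by_label = df.loc['2':'4']        # label slice, both ends included
by_position = df.iloc[1:4]        # positional slice, end excluded
print(by_label.equals(by_position))

# Row condition and column selection in one step
subset = df.loc[df['b'] > 10, ['a', 'c']]
print(subset)
```
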
  • Arithmetic between objects with different indexes
df1 = pd.DataFrame(np.arange(9).reshape(3,3),columns = list('abc'),index = [1,2,3])
print(df1)
>>>
   a  b  c
1  0  1  2
2  3  4  5
3  6  7  8

df2 = pd.DataFrame(np.arange(16).reshape(4,4),columns = list('bcde'),index = [2,3,4,5])
print(df2)
>>>
    b   c   d   e
2   0   1   2   3
3   4   5   6   7
4   8   9  10  11
5  12  13  14  15

print(df1+df2)
>>>
    a     b     c   d   e
1 NaN   NaN   NaN NaN NaN
2 NaN   4.0   6.0 NaN NaN
3 NaN  11.0  13.0 NaN NaN
4 NaN   NaN   NaN NaN NaN
5 NaN   NaN   NaN NaN NaN
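# fill_value=0 stands in for a value that only one of the operands lacks;
# positions missing from both df1 and df2 are still NaN
print(df1.add(df2,fill_value = 0))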
>>>
     a     b     c     d     e
1  0.0   1.0   2.0   NaN   NaN
2  3.0   4.0   6.0   2.0   3.0
3  6.0  11.0  13.0   6.0   7.0
4  NaN   8.0   9.0  10.0  11.0
5  NaN  12.0  13.0  14.0  15.0
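
Using fill_value inside add is not the same as adding first and filling afterwards. The sketch below, which reuses df1 and df2 from above, shows the difference at one cell; filled_before and filled_after are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc'), index=[1, 2, 3])
df2 = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('bcde'), index=[2, 3, 4, 5])

filled_before = df1.add(df2, fill_value=0)   # fill one-sided gaps, then add
filled_after = (df1 + df2).fillna(0)         # add, then overwrite every NaN

# Row 2, column 'a' exists only in df1 (value 3)
print(filled_before.loc[2, 'a'], filled_after.loc[2, 'a'])
```
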
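  • Arithmetic between a DataFrame and a Series
    When a DataFrame and a Series are combined, the Series is broadcast: by default its index is matched against the DataFrame's columns and the operation is repeated for every row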
df1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns = list('abc'))
print(df1)
>>>
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

series1 = pd.Series([3,4,5],index = ['a','b','c'])
print(series1)
>>>
a    3
b    4
c    5
dtype: int64

# Subtract series1 from every row
print(df1 - series1)
>>>
   a  b  c
0 -3 -3 -3
1  0  0  0
2  3  3  3
3  6  6  6
  • Using a Series taken from the DataFrame itself
series2 = df1['b']
series2
>>>
0     1
1     4
2     7
3    10
Name: b, dtype: int64
# axis=0 matches series2 on the row index and subtracts it from every column
df1.sub(series2,axis = 0)
>>>
   a  b  c
0 -1  0  1
1 -1  0  1
2 -1  0  1
3 -1  0  1
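
A common use of axis=0 broadcasting is centering each row on its own mean. This is only a sketch of that idea on the same 4x3 frame; row_means and centered are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

row_means = df.mean(axis=1)            # one value per row, indexed like df
centered = df.sub(row_means, axis=0)   # match on the row index
print(centered)
```
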
  • Function application and mapping
    NumPy element-wise functions can be applied directly to a DataFrame
print(np.square(df1))
>>>
    a    b    c
0   0    1    4
1   9   16   25
2  36   49   64
3  81  100  121
  • apply calls a function on each row or each column of a DataFrame and gathers the results into a new Series or DataFrame
def fun(x):
    return x.max() - x.min()
df1.apply(fun,axis = 1)
>>>
0    2
1    2
2    2
3    2
dtype: int64
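
The example above works row by row (axis=1). The sketch below shows the default column-wise direction, a function that returns a Series, and element-wise mapping on one column with Series.map. The helper names value_range and min_max are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('abc'))

def value_range(x):
    return x.max() - x.min()

# Default axis=0: the function receives each column in turn
col_ranges = df.apply(value_range)
print(col_ranges)

def min_max(x):
    # Returning a Series makes apply assemble a DataFrame
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = df.apply(min_max)
print(summary)

# Element-wise mapping on a single column
scaled = df['a'].map(lambda v: v * 10)
print(scaled)
```
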
  • Sorting
df1 = pd.DataFrame(np.random.randn(4,4),columns=list('bcad'),index=[2,4,3,1])
print(df1)
>>>
          b         c         a         d
2  0.706356 -0.896474 -1.879608  0.322054
4  0.666188 -0.450170  0.914737  0.691662
3 -1.676381 -0.499211 -0.136020 -1.734251
1 -2.111717 -0.226238  1.656514  0.146311

print(df1.sort_index())
>>>
          b         c         a         d
1 -2.111717 -0.226238  1.656514  0.146311
2  0.706356 -0.896474 -1.879608  0.322054
3 -1.676381 -0.499211 -0.136020 -1.734251
4  0.666188 -0.450170  0.914737  0.691662

print(df1.sort_values(by=['b','a']))
>>>
          b         c         a         d
1 -2.111717 -0.226238  1.656514  0.146311
3 -1.676381 -0.499211 -0.136020 -1.734251
4  0.666188 -0.450170  0.914737  0.691662
2  0.706356 -0.896474 -1.879608  0.322054
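
Because the frame above is random, its numbers change on every run. The following sketch uses fixed, made-up values to show two further options: sorting the columns with axis=1 and sorting in descending order.

```python
import pandas as pd

# Fixed, made-up numbers so the result is reproducible
df = pd.DataFrame({'b': [2, 1, 2], 'a': [5, 6, 4]}, index=[3, 1, 2])

by_columns = df.sort_index(axis=1)                    # columns in label order
descending = df.sort_values(by='b', ascending=False)  # largest b first
tie_broken = df.sort_values(by=['b', 'a'])            # ties in b broken by a

print(by_columns)
print(descending)
print(tie_broken)
```
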

Descriptive statistics

  • Sum: sum
  • Maximum: max
  • Minimum: min
  • Variance: var
  • Mean: mean
  • Summary statistics: describe
print(df1.describe())
>>>
              b         c         a         d
count  4.000000  4.000000  4.000000  4.000000
mean  -0.603889 -0.518024  0.138906 -0.143556
std    1.500402  0.278880  1.533517  1.084545
min   -2.111717 -0.896474 -1.879608 -1.734251
25%   -1.785215 -0.598527 -0.571917 -0.323829
50%   -0.505096 -0.474691  0.389358  0.234183
75%    0.676230 -0.394187  1.100181  0.414456
max    0.706356 -0.226238  1.656514  0.691662
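
The individual methods in the list above can be called the same way as describe. A minimal sketch on a small made-up table whose results are easy to verify by hand:

```python
import pandas as pd

# Small made-up table so the numbers are easy to check by hand
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})

print(df.sum())          # per column
print(df.sum(axis=1))    # per row
print(df.max())
print(df.min())
print(df.mean())
print(df.var())          # sample variance (ddof=1) by default
```
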

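Handling missing data

  • dropna: remove entries that contain NaN
  • fillna: replace NaN with a given value
  • isnull: return an object of booleans marking where the NaN values are
  • notnull: the negation of isnull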
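
The four methods can be tried on a tiny example. This is a sketch with made-up values, not a full treatment of missing data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.dropna())        # drops the NaN entry
print(s.fillna(0))       # replaces NaN with 0
print(s.isnull())        # True where the value is NaN
print(s.notnull())       # the element-wise negation of isnull

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [2.0, 3.0]})
print(df.dropna())       # by default drops every row that contains a NaN
```
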

      Article link: https://www.haomeiwen.com/subject/lvbjgqtx.html