The pandas Library

Author: 文嘉达_0da8 | Published: 2020-06-17 22:17


I. pandas Data Structures

1.1 Creating a Series

A Series consists of an array of data (of any NumPy dtype) and an associated array of index labels.

import pandas as pd
obj=pd.Series([1,-2,3,-4])
Out[ ]: 
0    1
1   -2
2    3
3   -4
dtype: int64

# You can also specify the index of a Series
obj2=pd.Series([1,-2,3,-4],index=['a','b','c','d'])
Out[ ]: 
a    1
b   -2
c    3
d   -4
dtype: int64

A Series can also be created from a dict.

data={
    '张三':92,
    '李四':78,
    '王五':68,
    '小明':82,
}
obj3=pd.Series(data)
Out[ ]: 
张三    92
李四    78
王五    68
小明    82
dtype: int64
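
Once a Series exists, its values can be read by label or by position. The lines below are a minimal sketch of that, reusing the scores from the dict above; the variable names are only for illustration.

```python
import pandas as pd

# Same scores as the dict above
scores = pd.Series({'张三': 92, '李四': 78, '王五': 68, '小明': 82})

by_label = scores['李四']         # look up by index label
by_position = scores.iloc[0]      # look up by integer position
above_80 = scores[scores > 80]    # boolean filter; index labels are kept

print(by_label, by_position)
print(above_80)
```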

1.2 Series Attributes

obj2.index
Out[ ]: Index(['a', 'b', 'c', 'd'], dtype='object')

obj2.values
Out[ ]: array([ 1, -2,  3, -4], dtype=int64)

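Dicts in older Python versions were unordered, so the order of the resulting Series was not guaranteed. Since Python 3.7 dicts preserve insertion order, and recent pandas versions keep that order. In either case, the order can be set explicitly with the index parameter.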

names=['张三','李四','王五','小明']
obj4=pd.Series(data,index=names)
Out[ ]: 
张三    92
李四    78
王五    68
小明    82
dtype: int64
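
Passing index does more than reorder: it also selects which keys are kept, and a label with no matching key becomes NaN. A small sketch, assuming the same data dict; the name '小红' is a made-up label that is not in the dict.

```python
import pandas as pd

data = {'张三': 92, '李四': 78, '王五': 68, '小明': 82}

# Reorder and subset via index; '小红' is a hypothetical name missing from data
names = ['小明', '张三', '小红']
obj = pd.Series(data, index=names)

print(obj)   # the missing label holds NaN, so the dtype becomes float64
```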

Both the Series object and its index have a name attribute, so you can give the Series a name and make it more readable.

obj4.name='math'
obj4.index.name='students'
obj4
Out[ ]: 
students
张三    92
李四    78
王五    68
小明    82
Name: math, dtype: int64

2.1 Creating a DataFrame

From a dict of arrays, lists, or tuples

data={
    'name':['张三','李四','王五','小明'],
    'sex':['female','female','male','male'],
    'year':[2001,2001,2003,2002],
    'city':['北京','上海','广州','北京']
}
df=pd.DataFrame(data)
Out[ ]: 
    name  sex     year   city
0   张三  female  2001    北京
1   李四  female  2001    上海
2   王五  male    2003    广州
3   小明  male    2002    北京

From a nested dict

data2={
    'sex':{'张三':'female','李四':'female','王五':'male'},
    'city':{'张三':'北京','李四':'上海','王五':'广州'}
}
df2=pd.DataFrame(data2)
df2
Out[ ]: 
      sex     city
张三  female  北京
李四  female  上海
王五  male    广州

From a 2-D ndarray

import numpy as np
df3=pd.DataFrame(np.arange(10).reshape(2,5),index=['a','b'])
df3
Out[ ]: 
    0   1   2   3   4
a   0   1   2   3   4
b   5   6   7   8   9

From a dict of Series

data={'one':pd.Series([1,2,3],index=['a','b','c']),
'two':pd.Series([8,7,6,5],index=['a','b','c','w'])}
df4=pd.DataFrame(data)
Out[ ]: 
    one two
a   1.0 8
b   2.0 7
c   3.0 6
w   NaN 5
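
The output above shows column one as floats because the Series are aligned on the union of their index labels, and the gap at 'w' is filled with NaN. The sketch below checks the dtypes and shows one possible way to fill the gap; filling with 0 is only an illustrative choice.

```python
import pandas as pd

data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two': pd.Series([8, 7, 6, 5], index=['a', 'b', 'c', 'w'])}
df4 = pd.DataFrame(data)

print(df4.dtypes)                     # 'one' is float64 because it holds NaN

# Replace the missing value, then convert back to integers
filled = df4.fillna(0).astype(int)
print(filled)
```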

From a list of lists or tuples

data=[[12,3,4,5],
       [2,4,6,4,],
       [324,6,4,2]]
df5=pd.DataFrame(data,index=['a','b','c'])
df5
Out[ ]: 
     0  1   2   3
a   12  3   4   5
b   2   4   6   4
c   324 6   4   2

2.2 DataFrame Attributes

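columns and index can be specified when creating a DataFrame.

# data here is the dict of lists from the start of section 2.1
df=pd.DataFrame(data,columns=['name','sex','year','city'],index=['a','b','c','d'])

The index and columns each have a name attribute that can be set.

df.index.name='id'
df.columns.name='std_info'
df
Out[ ]:   
std_info name   sex   year  city
id              
a        张三  female  2001   北京
b        李四  female  2001   上海
c        王五  male    2003   广州
d        小明  male    2002   北京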

The values attribute converts the DataFrame data into a two-dimensional array.

df.values
Out[ ]:  
array([['张三', 'female', 2001, '北京'],
       ['李四', 'female', 2001, '上海'],
       ['王五', 'male', 2003, '广州'],
       ['小明', 'male', 2002, '北京']], dtype=object)
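
The array above has dtype=object because the columns mix strings and integers. As a rough sketch, the to_numpy method gives the same kind of conversion, and selecting only numeric columns first keeps a numeric dtype. The two-row frame here is a shortened, illustrative version of the student data.

```python
import pandas as pd

df = pd.DataFrame({'name': ['张三', '李四'], 'year': [2001, 2003]})

mixed = df.to_numpy()             # mixed columns fall back to dtype=object
years = df[['year']].to_numpy()   # a purely numeric selection keeps an integer dtype

print(mixed.dtype, years.dtype)
```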

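II. pandas Index Operations

1. Reindexing

Reindexing does not rename the index labels; it rearranges the data to match a new index. If a label in the new index does not exist in the original, a missing value (NaN) is introduced.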

Series

obj=pd.Series([1,-2,3,-4],index=['b','a','c','d'])
obj
Out[ ]:  
b    1
a   -2
c    3
d   -4
dtype: int64

obj2=obj.reindex(['a','b','c','d','e'])
obj2
Out[ ]:  
a   -2.0
b    1.0
c    3.0
d   -4.0
e    NaN
dtype: float64
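
If NaN is not wanted for new labels, reindex also accepts a fill_value argument. A small sketch comparing the two, using the same Series as above:

```python
import pandas as pd

obj = pd.Series([1, -2, 3, -4], index=['b', 'a', 'c', 'd'])
new_index = ['a', 'b', 'c', 'd', 'e']

with_nan = obj.reindex(new_index)                  # 'e' becomes NaN, dtype turns to float
with_zero = obj.reindex(new_index, fill_value=0)   # 'e' becomes 0, integers are kept

print(with_nan)
print(with_zero)
```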

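To fill the missing values introduced by reindexing, use the method parameter: ffill fills forward (the previous value is carried forward) and bfill fills backward (the next value is carried back).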

obj=pd.Series([1,-2,3,-4],index=[0,2,3,5])
obj
Out[ ]:  
0    1
2   -2
3    3
5   -4
dtype: int64

obj2=obj.reindex(range(6),method='ffill')
obj2
Out[ ]:  
0    1
1    1
2   -2
3    3
4    3
5   -4
dtype: int64
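
To make the difference between the two fill directions concrete, here is a short side-by-side sketch on the same Series:

```python
import pandas as pd

obj = pd.Series([1, -2, 3, -4], index=[0, 2, 3, 5])

forward = obj.reindex(range(6), method='ffill')    # labels 1 and 4 copy the value before them
backward = obj.reindex(range(6), method='bfill')   # labels 1 and 4 copy the value after them

print(forward.tolist())
print(backward.tolist())
```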

DataFrame
For a DataFrame, both the row index and the column index can be reindexed.

df=pd.DataFrame(np.arange(9).reshape(3,3),index=['a','c','d'],columns=['name','id','sex'])
df
Out[ ]:
    name id sex
a   0   1   2
c   3   4   5
d   6   7   8

df2=df.reindex(['a','b','c','d'])
df2
Out[ ]:
   name id  sex
a   0.0 1.0 2.0
b   NaN NaN NaN
c   3.0 4.0 5.0
d   6.0 7.0 8.0
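
The example above only reindexes rows. Columns work the same way through the columns argument; the sketch below assumes the same 3x3 frame, and 'age' is a made-up column name that does not exist in it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['a', 'c', 'd'], columns=['name', 'id', 'sex'])

# Reorder the columns, drop 'id', and add the hypothetical column 'age' (all NaN)
cols = df.reindex(columns=['sex', 'name', 'age'])

print(cols)
```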

2. Changing the Index

data={
    'name':['张三','李四','王五','小明'],
    'sex':['female','female','male','male'],
    'year':[2001,2001,2003,2002],
    'city':['北京','上海','广州','北京']
}
df=pd.DataFrame(data)
Out[ ]:
    name    sex   year    city
0   张三  female  2001    北京
1   李四  female  2001    上海
2   王五  male    2003    广州
3   小明  male    2002    北京


df2=df.set_index('name')
Out[ ]:
      sex   year    city
name            
张三  female  2001    北京
李四  female  2001    上海
王五  male    2003    广州
小明  male    2002    北京

df3=df2.reset_index()
Out[ ]:
    name    sex   year    city
0   张三  female  2001    北京
1   李四  female  2001    上海
2   王五  male    2003    广州
3   小明  male    2002    北京
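
One reason to move a column into the index is that rows can then be looked up by that label. A minimal sketch with a shortened version of the student data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['张三', '李四', '王五'],
                   'city': ['北京', '上海', '广州']})

df2 = df.set_index('name')          # 'name' becomes the row index
city = df2.loc['王五', 'city']      # look up a row by its name label
restored = df2.reset_index()        # move 'name' back into a regular column

print(city)
print(restored)
```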

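Another example
After a DataFrame is sorted, its rows keep their original index labels, so the labels are no longer in order. To pick out, for example, the two students with the lowest grades by row number, sort the data and then reset the index.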

data={
    'name':['张三','李四','王五','小明'],
    'grade':[68,78,63,92]
}
df=pd.DataFrame(data)
Out[ ]:
   name grade
0   张三  68
1   李四  78
2   王五  63
3   小明  92

df2=df.sort_values(by='grade')
Out[ ]:
   name  grade
2   王五  63
0   张三  68
1   李四  78
3   小明  92
df2.reset_index(drop=True)  # drop=True discards the old index instead of keeping it as a column
Out[ ]:
   name  grade
0   王五  63
1   张三  68
2   李四  78
3   小明  92
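
Putting the steps together, the two lowest grades can be taken from the top of the sorted, re-indexed frame. The sketch below also shows nsmallest, which is one alternative that skips the explicit sort.

```python
import pandas as pd

df = pd.DataFrame({'name': ['张三', '李四', '王五', '小明'],
                   'grade': [68, 78, 63, 92]})

ranked = df.sort_values(by='grade').reset_index(drop=True)
lowest_two = ranked.iloc[:2]            # first two rows after sorting

# Alternative: let pandas pick the two smallest grades directly
also_lowest = df.nsmallest(2, 'grade')

print(lowest_two)
```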

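3. Indexing and Selection

Selecting DataFrame columns

# Label-based selection
df['city']
df[['city','sex']]
# Positional selection: the first and third columns
df.iloc[:,[0,2]]
# Slice selection: the first through third columns (the end position is exclusive)
df.iloc[:,0:3]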

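DataFrame row selection

# Label-based selection ('first_row_label' and 'second_row_label' stand in for real index labels)
df.loc['first_row_label']
df.loc[['first_row_label','second_row_label']]
# Positional selection: the first and third rows
df.iloc[[0,2]]
# Slice selection: the first through third rows (the end position is exclusive)
df[0:3]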
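
An easy mistake with slices is the end point: label slices with loc include the end label, while position slices with iloc exclude the end position. A small sketch on a made-up 4x3 frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['a', 'b', 'c', 'd'], columns=['x', 'y', 'z'])

by_label = df.loc['a':'c']        # rows a, b, c (end label included)
by_position = df.iloc[0:2]        # rows a, b (end position excluded)
cell = df.loc['b', 'y']           # single value by row and column label

print(by_label)
print(by_position)
```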

DataFrame boolean indexing

# '年龄' is an age column and '唯一识别码' is a unique-ID column
df[df['年龄']<200]
df[(df['年龄']<200)&(df['唯一识别码']<102)]
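
For a version of the boolean filters that can be run directly, here is a sketch with a tiny made-up table that has the same two columns. Each condition needs its own parentheses because & binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical records: a unique-ID column and an age column
df = pd.DataFrame({'唯一识别码': [101, 102, 103],
                   '年龄': [31, 27, 240]})

plausible_age = df[df['年龄'] < 200]
both = df[(df['年龄'] < 200) & (df['唯一识别码'] < 102)]

print(plausible_age)
print(both)
```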

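4. Operating on Rows and Columns

Adding

new_data={'city':'武汉','name':'小李','sex':'male','year':2002}
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df=pd.concat([df,pd.DataFrame([new_data])],ignore_index=True) # ignore the existing index values

df['math']=[92,78,58,69,82]  # add a new column named math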
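
As a sketch of two other ways to add data that do not rely on append: a row can be assigned under a new label with loc, and insert places a column at a chosen position. The two-column frame and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['张三', '李四'], 'year': [2001, 2001]})

# Add a row under the next integer label
df.loc[len(df)] = ['小李', 2002]

# Add a column at position 1 (between 'name' and 'year')
df.insert(1, 'math', [92, 78, 58])

print(df)
```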

Deleting

# Delete rows
new_df=df.drop(2)  
df.drop([0, 1])
df.drop(index=[0, 1])

# Delete columns
new_df=df.drop('class',axis=1) 
df.drop(['B', 'C'], axis=1)
df.drop(columns=['B', 'C'])
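
Note that drop returns a new DataFrame and leaves the original untouched unless the result is assigned back. A short sketch on a made-up frame; 'missing' is a column name that deliberately does not exist.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

trimmed = df.drop(index=[0], columns=['B'])             # drop a row and a column at once
safe = df.drop(columns=['missing'], errors='ignore')    # no error for an absent label

print(trimmed)
print(df.shape)   # the original frame is unchanged
```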

Modifying

new_df.rename(index={3:2,4:3},columns={'math':'Math'},inplace=True)

5. pandas Data Operations

Arithmetic operations

df1=pd.DataFrame(np.arange(9).reshape(3,3),columns=['a','b','c'],index=['apple','tea','banana'])
Out[ ]:
        a   b   c
apple   0   1   2
tea     3   4   5
banana  6   7   8

df2=pd.DataFrame(np.arange(9).reshape(3,3),columns=['a','b','d'],index=['apple','tea','coco'])
Out[ ]:
        a   b   d
apple   0   1   2
tea     3   4   5
coco    6   7   8

df1+df2
Out[ ]:
        a   b   c   d
apple   0.0 2.0 NaN NaN
banana  NaN NaN NaN NaN
coco    NaN NaN NaN NaN
tea     6.0 8.0 NaN NaN
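
The NaN cells appear because the two frames are aligned on both row and column labels, and any cell missing from either side has no value to add. One way to soften this, sketched below, is the add method with fill_value, which treats a value missing on one side as 0. Cells missing from both frames still come out as NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'],
                   index=['apple', 'tea', 'banana'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'd'],
                   index=['apple', 'tea', 'coco'])

total = df1.add(df2, fill_value=0)

print(total)
```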

Function application and mapping
map: applies a function to every element of a Series

data={
    'fruit':['apple','orange','grape','banana'],
    'price':['25元','42元','36元','14元']
}
df1=pd.DataFrame(data)
Out[ ]:
    fruit   price
0   apple   25元
1   orange  42元
2   grape   36元
3   banana  14元
def f(x):
    return x.split('元')[0]
df1['price']=df1['price'].map(f)
df1
Out[ ]:
    fruit   price
0   apple   25
1   orange  42
2   grape   36
3   banana  14
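
After the map call the prices look like numbers but are still strings. If they are needed for arithmetic, they have to be converted; the sketch below shows one possible way using the string accessor and astype.

```python
import pandas as pd

prices = pd.Series(['25元', '42元', '36元', '14元'])

as_text = prices.map(lambda x: x.split('元')[0])                  # still strings
as_int = prices.str.replace('元', '', regex=False).astype(int)    # real integers

print(as_text.tolist())
print(as_int.sum())
```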

apply: applies a function along the rows or columns of a DataFrame

df2=pd.DataFrame(np.random.randn(3,3),columns=['a','b','c'],index=['app','win','mac'])
Out[ ]:
        a           b         c
app -0.507420   -1.139712   1.368877
win 0.087377    1.152213    0.289662
mac 0.618923    0.740873    -2.249652

f=lambda x:x.max()-x.min()
df2.apply(f)
Out[ ]:
a    1.126343
b    2.291925
c    3.618528
dtype: float64
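
By default apply passes each column to the function. With axis=1 it passes each row instead. A small sketch with fixed, made-up integers so the results are easy to check by hand:

```python
import pandas as pd

df = pd.DataFrame([[1, 5, 3], [4, 2, 9]],
                  columns=['a', 'b', 'c'], index=['app', 'win'])

spread = lambda x: x.max() - x.min()

per_column = df.apply(spread)           # one result per column
per_row = df.apply(spread, axis=1)      # one result per row

print(per_column)
print(per_row)
```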

applymap: applies a function to every element of a DataFrame. (Since pandas 2.1, applymap is deprecated in favor of DataFrame.map.)

df2.applymap(lambda x:'%.2f'%x)
Out[ ]:
      a      b       c
app -0.51   -1.14   1.37
win 0.09    1.15    0.29
mac 0.62    0.74    -2.25
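
Newer pandas releases offer DataFrame.map for element-wise functions, so the sketch below picks whichever method the installed version has. The values are made up. Formatting turns the numbers into strings; if they should stay numeric, round is a simpler choice.

```python
import pandas as pd

df = pd.DataFrame({'a': [0.087377, 0.618923], 'b': [1.152213, 0.740873]})

fmt = lambda x: '%.2f' % x
if hasattr(df, 'map'):
    formatted = df.map(fmt)         # newer pandas
else:
    formatted = df.applymap(fmt)    # older pandas

rounded = df.round(2)               # stays numeric

print(formatted)
print(rounded)
```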

6. Sorting

For a Series, sort_index sorts by index label, in ascending order by default, and sort_values sorts by value.

obj1=pd.Series([-1,3,2,1],index=['b','a','d','c'])
Out[ ]:
b   -1
a    3
d    2
c    1
dtype: int64

obj1.sort_index()
Out[ ]:
a    3
b   -1
c    1
d    2
dtype: int64

obj1.sort_index(ascending=False)
Out[ ]:
d    2
c    1
b   -1
a    3
dtype: int64

obj1.sort_values()
Out[ ]:
b   -1
c    1
d    2
a    3
dtype: int64

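For a DataFrame, sort_index sorts the row index or the column index, depending on the axis argument. sort_values sorts the rows by the values in one or more columns.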

df2.sort_values(by='b')
        a         b          c
app -0.507420   -1.139712   1.368877
mac 0.618923    0.740873    -2.249652
win 0.087377    1.152213    0.289662
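
The axis-based sorting mentioned above is not shown in the example, so here is a brief sketch on a small made-up frame. It also sorts by two columns at once with a different direction for each.

```python
import pandas as pd

df = pd.DataFrame({'b': [2, 1, 2], 'a': [9, 8, 7]},
                  index=['win', 'app', 'mac'])

by_row_label = df.sort_index()             # rows ordered by index label
by_col_label = df.sort_index(axis=1)       # columns ordered by column label

# Sort by 'b' ascending, then by 'a' descending within equal 'b'
by_values = df.sort_values(by=['b', 'a'], ascending=[True, False])

print(by_values)
```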

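----------
The content above comes from books and online sources and is kept as personal study notes. If it infringes any rights, please contact me and I will take the post down.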
