Pandas

Author: MaceJin | Published: 2018-03-09 09:22

1. The Series type

A Series consists of a set of data values and an index associated with them.

Creating from a list

import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

a
Out[13]: 
a    9
b    8
c    7
d    6
dtype: int64
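
A Series can also be seeded from a single scalar value. The sketch below is an illustration rather than part of the original session; the value 25 and the labels are arbitrary choices. Because a scalar has no length of its own, the index is what determines how many entries the Series gets.

```python
import pandas as pd

# A scalar is broadcast to every label, so an index must be supplied.
s = pd.Series(25, index=['a', 'b', 'c'])
print(s)
```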

Creating from a dict

import pandas as pd

b = pd.Series({'a':9, 'b':8, 'c':7})

b
Out[16]: 
a    9
b    8
c    7
dtype: int64

c = pd.Series({'a':9, 'b':8, 'c':7}, index=['c', 'a', 'b', 'd'])

c
Out[18]: 
c    7.0
a    9.0
b    8.0
d    NaN
dtype: float64

Creating from an ndarray

import pandas as pd

import numpy as np

n = pd.Series(np.arange(5))

n
Out[22]: 
0    0
1    1
2    2
3    3
4    4
dtype: int32

m = pd.Series(np.arange(5), index=np.arange(9, 4, -1))

m
Out[24]: 
9    0
8    1
7    2
6    3
5    4
dtype: int32

Basic operations on a Series

.index returns the index
.values returns the data
... in ... tests whether a label is in the custom index

import pandas as pd

import numpy as np

b = pd.Series([9, 8, 7, 6], ['a', 'b' ,'c', 'd'])

b
Out[27]: 
a    9
b    8
c    7
d    6
dtype: int64

b.index
Out[28]: Index(['a', 'b', 'c', 'd'], dtype='object')

b.values
Out[29]: array([9, 8, 7, 6], dtype=int64)

b['b']
Out[30]: 8

b.iloc[1]   # positional access; plain b[1] on a labelled Series is deprecated in recent pandas
Out[31]: 8

b.reindex(['c', 'd', 0])   # b[['c', 'd', 0]] raises KeyError in current pandas because label 0 is missing
Out[32]: 
c    7.0
d    6.0
0    NaN
dtype: float64

b[['c', 'd', 'a']]
Out[33]: 
c    7
d    6
a    9
dtype: int64

b[:3]
Out[34]: 
a    9
b    8
c    7
dtype: int64

b[b > b.median()]
Out[35]: 
a    9
b    8
dtype: int64

np.exp(b)
Out[36]: 
a    8103.083928
b    2980.957987
c    1096.633158
d     403.428793
dtype: float64

'c' in b
Out[37]: True

0 in b
Out[38]: False

b.get('f', 100)
Out[39]: 100
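
Since a Series carries both custom labels and implicit integer positions, it can help to state explicitly which one is meant. The following is a minimal sketch using the `.loc` and `.iloc` accessors on the same data as above; the variable names are illustrative only.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

by_label = b.loc['b']          # label-based lookup
by_position = b.iloc[1]        # position-based lookup
label_slice = b.loc['a':'c']   # label slices include the end label
pos_slice = b.iloc[:3]         # position slices exclude the end position
missing = b.get('f', 100)      # default returned when the label is absent

print(by_label, by_position, missing)
print(label_slice)
```

Both slices select the same three entries here, but for different reasons: the label slice is inclusive of 'c', while the positional slice stops before position 3.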

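Index alignment in Series

Arithmetic between two Series matches entries by index label: values with the same label are added, and a label present in only one Series yields NaN.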

import pandas as pd

a = pd.Series([1, 2, 3],['c', 'd', 'e'])

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

a + b
Out[47]: 
a    NaN
b    NaN
c    8.0
d    8.0
e    NaN
dtype: float64
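
If NaN for unmatched labels is not wanted, the method form `.add()` accepts a `fill_value` that stands in for the missing side. This is a small sketch on the same two Series; choosing 0 as the fill value is an assumption made for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

plain = a + b                    # unmatched labels become NaN
total = a.add(b, fill_value=0)   # unmatched labels are treated as 0

print(plain)
print(total)
```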

The name attribute of a Series

Both a Series object and its index can have a name, stored in the .name attribute.

import pandas as pd

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

b.name = 'Series对象'

b.index.name = '索引列'

b
Out[52]: 
索引列
a    9
b    8
c    7
d    6
Name: Series对象, dtype: int64

Modifying a Series

A Series can be modified at any time, and the change takes effect immediately.

import pandas as pd

b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

b.name = 'Series对象'

b.index.name = '索引列'

b
Out[52]: 
索引列
a    9
b    8
c    7
d    6
Name: Series对象, dtype: int64

b.name = 'aaa'

b[['b', 'c']] = 10

b
Out[55]: 
索引列
a     9
b    10
c    10
d     6
Name: aaa, dtype: int64
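
A few other common ways to modify a Series in place are sketched below. The values and the extra label 'e' are made up for the example.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

b.loc[['b', 'c']] = 10   # overwrite two existing labels
b.loc['e'] = 5           # assigning to a new label appends an entry
b[b > 9] = 0             # a boolean mask selects the entries to overwrite

print(b)
```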

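2. The DataFrame type

The DataFrame type

A DataFrame is a group of columns that share the same index (equivalent to a table).

It is a tabular data type in which each column can hold a different value type.
It has a row index (index) and a column index (columns).
It is most often used for two-dimensional data, but can also represent higher-dimensional data.

Creating from a 2-D ndarray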

import pandas as pd

import numpy as np

d = pd.DataFrame(np.arange(10).reshape(2, 5))

d
Out[4]: 
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9

Creating from a dict of 1-D Series objects

import pandas as pd

dt = {'one':pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two':pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])}

d = pd.DataFrame(dt)

d
Out[11]: 
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6

pd.DataFrame(dt, index=['b', 'c', 'd'], columns=['two', 'three'])
Out[12]: 
   two three
b    8   NaN
c    7   NaN
d    6   NaN

Creating from a dict of lists

import pandas as pd

dl = {'one':[1, 2, 3, 4], 'two':[9, 8, 7, 6]}

d = pd.DataFrame(dl, index=['a', 'b', 'c', 'd'])

d
Out[16]: 
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

d.index
Out[17]: Index(['a', 'b', 'c', 'd'], dtype='object')

d.columns
Out[18]: Index(['one', 'two'], dtype='object')

d.values
Out[19]: 
array([[1, 9],
       [2, 8],
       [3, 7],
       [4, 6]], dtype=int64)

d['one']
Out[20]: 
a    1
b    2
c    3
d    4
Name: one, dtype: int64

d.loc['b']   # .ix was removed in pandas 1.0; .loc selects a row by label
Out[21]: 
one    2
two    8
Name: b, dtype: int64

d['one']['b']
Out[22]: 2
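
The selections above can also be written with the explicit accessors. The sketch below rebuilds the same small DataFrame and shows row, cell, and sub-table selection; the variable names are illustrative.

```python
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]},
                 index=['a', 'b', 'c', 'd'])

row = d.loc['b']                   # one row, selected by label
row_pos = d.iloc[1]                # the same row, selected by position
cell = d.loc['b', 'one']           # a single cell by row and column label
cell_fast = d.at['b', 'one']       # scalar-only accessor for the same cell
sub = d.loc[['a', 'c'], ['two']]   # a sub-table from lists of labels

print(row)
print(sub)
```

Compared with chained indexing such as d['one']['b'], a single `.loc[row, column]` call states the row and column in one step.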

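3. Operating on pandas data types

Changing Series and DataFrame objects

Adding or reordering: reindexing

Note: the outputs in this part were recorded with an older pandas that sorted the dict keys to obtain the column order; pandas 0.23 and later on Python 3.6+ keep the insertion order of the dict instead.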

import pandas as pd

dl = {'城市':['北京', '上海', '广州', '深圳', '沈阳'],
      '环比':[101.5, 101.2, 101.3, 101.4, 101.5],
      '同比':[120.7, 127.3, 119.4, 140.9, 101.4],
      '定基':[121.4, 127.8, 120.2, 145.5, 101.6]}

d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])

d
Out[6]: 
       同比  城市     定基     环比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  广州  120.2  101.3
c4  140.9  深圳  145.5  101.4
c5  101.4  沈阳  101.6  101.5

d = d.reindex(index=['c5', 'c4', 'c3', 'c2', 'c1'])

d
Out[8]: 
       同比  城市     定基     环比
c5  101.4  沈阳  101.6  101.5
c4  140.9  深圳  145.5  101.4
c3  119.4  广州  120.2  101.3
c2  127.3  上海  127.8  101.2
c1  120.7  北京  121.4  101.5

d = d.reindex(columns=['城市', '同比', '环比', '定基'])

d
Out[10]: 
    城市     同比     环比     定基
c5  沈阳  101.4  101.5  101.6
c4  深圳  140.9  101.4  145.5
c3  广州  119.4  101.3  120.2
c2  上海  127.3  101.2  127.8
c1  北京  120.7  101.5  121.4

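Parameters of .reindex(index=None, columns=None, ...)

Parameter	Description
index, columns	New custom labels for the rows and columns
fill_value	Value used to fill positions that are missing after reindexing
method	Fill method: ffill carries the previous value forward, bfill fills backward from the next value
limit	Maximum number of consecutive positions to fill
copy	Defaults to True and returns a new object; with False, no copy is made when the new index equals the old one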
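
To make the fill parameters concrete, here is a small sketch on a numeric Series. The data and the target ranges are invented for the example; `method` requires the original index to be monotonic, which holds here.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# Missing labels get a constant.
filled_const = s.reindex(range(6), fill_value=0)

# Missing labels take the previous existing value.
filled_ffill = s.reindex(range(6), method='ffill')

# At most one consecutive position is filled, so label 6 stays NaN.
filled_limit = s.reindex(range(7), method='ffill', limit=1)

print(filled_const.tolist())
print(filled_ffill.tolist())
print(filled_limit)
```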

import pandas as pd

dl = {'城市':['北京', '上海', '广州', '深圳', '沈阳'],
      '环比':[101.5, 101.2, 101.3, 101.4, 101.5],
      '同比':[120.7, 127.3, 119.4, 140.9, 101.4],
      '定基':[121.4, 127.8, 120.2, 145.5, 101.6]}

d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])

d
Out[17]: 
       同比  城市     定基     环比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  广州  120.2  101.3
c4  140.9  深圳  145.5  101.4
c5  101.4  沈阳  101.6  101.5

newc = d.columns.insert(4,'新增')

newb = d.reindex(columns=newc, fill_value=200)

newb
Out[20]: 
       同比  城市     定基     环比   新增
c1  120.7  北京  121.4  101.5  200
c2  127.3  上海  127.8  101.2  200
c3  119.4  广州  120.2  101.3  200
c4  140.9  深圳  145.5  101.4  200
c5  101.4  沈阳  101.6  101.5  200

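Common methods of the Index type

Method	Description
.append(idx)	Concatenate another Index object, producing a new Index
.difference(idx)	Compute the set difference, producing a new Index
.intersection(idx)	Compute the intersection
.union(idx)	Compute the union
.delete(loc)	Delete the element at position loc
.insert(loc, e)	Insert element e at position loc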
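
The methods in the table can be tried on two tiny Index objects. This is only a sketch with made-up labels; note that an Index is immutable, so every call returns a new object and leaves the original untouched.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['c', 'd'])

appended = idx.append(other)        # plain concatenation, duplicates kept
diff = idx.difference(other)        # labels in idx but not in other
common = idx.intersection(other)    # labels present in both
both = idx.union(other)             # labels present in either
removed = idx.delete(0)             # drop the element at position 0
inserted = idx.insert(1, 'x')       # put 'x' at position 1

print(list(appended), list(diff), list(common))
print(list(both), list(removed), list(inserted))
```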

d
Out[21]: 
       同比  城市     定基     环比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  广州  120.2  101.3
c4  140.9  深圳  145.5  101.4
c5  101.4  沈阳  101.6  101.5

nc = d.columns.delete(2)

ni = d.index.insert(5, 'c6')

nd = d.reindex(index=ni, columns=nc, method='ffill')

nd
Out[25]: 
       同比  城市     环比
c1  120.7  北京  101.5
c2  127.3  上海  101.2
c3  119.4  广州  101.3
c4  140.9  深圳  101.4
c5  101.4  沈阳  101.5
c6  101.4  沈阳  101.5

Deleting specified index entries

.drop() removes the specified row or column labels from a Series or DataFrame and returns a new object.

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

a
Out[27]: 
a    9
b    8
c    7
d    6
dtype: int64

a.drop(['b', 'c'])
Out[28]: 
a    9
d    6
dtype: int64

d
Out[29]: 
       同比  城市     定基     环比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  广州  120.2  101.3
c4  140.9  深圳  145.5  101.4
c5  101.4  沈阳  101.6  101.5

d.drop('c5')
Out[30]: 
       同比  城市     定基     环比
c1  120.7  北京  121.4  101.5
c2  127.3  上海  127.8  101.2
c3  119.4  广州  120.2  101.3
c4  140.9  深圳  145.5  101.4

d.drop('定基', axis=1)
Out[31]: 
       同比  城市     环比
c1  120.7  北京  101.5
c2  127.3  上海  101.2
c3  119.4  广州  101.3
c4  140.9  深圳  101.4
c5  101.4  沈阳  101.5
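
The same deletions can be written with the `index=` and `columns=` keywords instead of `axis`, which some readers find easier to follow. The sketch below uses a small stand-in DataFrame rather than the city table, and the label 'zzz' is deliberately nonexistent.

```python
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3], 'two': [9, 8, 7]},
                 index=['a', 'b', 'c'])

no_row = d.drop(index='c')        # same as d.drop('c')
no_col = d.drop(columns='two')    # same as d.drop('two', axis=1)

# A missing label normally raises KeyError; errors='ignore' skips it.
ignored = d.drop(index='zzz', errors='ignore')

print(no_row)
print(no_col)
print(d.shape)   # the original is unchanged
```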

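4. Arithmetic on pandas data types

Rules of arithmetic

Arithmetic first aligns the operands on their row and column labels and then computes; the result is floating point whenever alignment introduces missing values.
Positions missing from either operand are filled with NaN.
Operations between 2-D and 1-D objects, and between 1-D objects and scalars (0-D), are broadcast.
Binary operations written with + - * / produce a new object.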

import pandas as pd

import numpy as np

a = pd.DataFrame(np.arange(12).reshape(3,4))

a
Out[37]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

b = pd.DataFrame(np.arange(20).reshape(4,5))

b
Out[39]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

a + b
Out[40]: 
      0     1     2     3   4
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN

a * b
Out[41]: 
      0     1      2      3   4
0   0.0   1.0    4.0    9.0 NaN
1  20.0  30.0   42.0   56.0 NaN
2  80.0  99.0  120.0  143.0 NaN
3   NaN   NaN    NaN    NaN NaN

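Method forms of the arithmetic operations

Method	Description
.add(d, **kwargs)	Addition between objects, with optional parameters
.sub(d, **kwargs)	Subtraction between objects, with optional parameters
.mul(d, **kwargs)	Multiplication between objects, with optional parameters
.div(d, **kwargs)	Division between objects, with optional parameters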

b.add(a, fill_value=100)
Out[42]: 
       0      1      2      3      4
0    0.0    2.0    4.0    6.0  104.0
1    9.0   11.0   13.0   15.0  109.0
2   18.0   20.0   22.0   24.0  114.0
3  115.0  116.0  117.0  118.0  119.0

a.mul(b, fill_value=0)
Out[43]: 
      0     1      2      3    4
0   0.0   1.0    4.0    9.0  0.0
1  20.0  30.0   42.0   56.0  0.0
2  80.0  99.0  120.0  143.0  0.0
3   0.0   0.0    0.0    0.0  0.0

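Operations between objects of different dimensions are broadcast; by default a 1-D Series takes part along axis 1, so its index is matched against the column labels.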

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5))

b
Out[47]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

c = pd.Series(np.arange(4))

c
Out[50]: 
0    0
1    1
2    2
3    3
dtype: int32

c - 10
Out[51]: 
0   -10
1    -9
2    -8
3    -7
dtype: int32

b - c
Out[52]: 
      0     1     2     3   4
0   0.0   0.0   0.0   0.0 NaN
1   5.0   5.0   5.0   5.0 NaN
2  10.0  10.0  10.0  10.0 NaN
3  15.0  15.0  15.0  15.0 NaN

The method form below lets a 1-D Series take part along axis 0 instead.

b
Out[53]: 
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

c
Out[54]: 
0    0
1    1
2    2
3    3
dtype: int32

b.sub(c, axis=0)
Out[55]: 
    0   1   2   3   4
0   0   1   2   3   4
1   4   5   6   7   8
2   8   9  10  11  12
3  12  13  14  15  16

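Comparison operations

Comparisons only compare elements that carry the same index labels; no alignment fill is performed.
Operations between 2-D and 1-D objects, and between 1-D objects and scalars (0-D), are broadcast.
Binary operations written with > < >= <= == != produce a Boolean object.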
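
The first rule can be seen directly with two Series whose labels differ. In the sketch below (data invented for illustration), the operator form refuses to compare them, while the method form `.gt()` aligns the labels first; this reflects the behaviour of recent pandas versions as far as I am aware.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
z = pd.Series([0, 5], index=['a', 'b'])

# The operator form requires identical labels on both sides.
try:
    _ = x > z
    raised = False
except ValueError:
    raised = True

# The method form aligns on labels; 'c' has no partner and compares as False.
flex = x.gt(z)

print(raised)
print(flex)
```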

import pandas as pd

import numpy as np

a = pd.DataFrame(np.arange(12).reshape(3,4))

a
Out[59]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

d = pd.DataFrame(np.arange(12, 0, -1).reshape(3,4))

d
Out[62]: 
    0   1   2  3
0  12  11  10  9
1   8   7   6  5
2   4   3   2  1

a > d
Out[63]: 
       0      1      2      3
0  False  False  False  False
1  False  False  False   True
2   True   True   True   True

Different dimensions: the comparison is broadcast, along axis 1 by default.

import numpy as np

import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3,4))

a
Out[67]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

c = pd.Series(np.arange(4))

c
Out[69]: 
0    0
1    1
2    2
3    3
dtype: int32

a>c
Out[70]: 
       0      1      2      3
0  False  False  False  False
1   True   True   True   True
2   True   True   True   True

c>0
Out[71]: 
0    False
1     True
2     True
3     True
dtype: bool

5. Sorting data

Sorting by index

The .sort_index() method sorts along the given axis by index label, ascending by default.

.sort_index(axis=0, ascending=True), axis 0 by default

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c', 'a', 'd', 'b'])

b
Out[75]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

b.sort_index()
Out[77]: 
    0   1   2   3   4
a   5   6   7   8   9
b  15  16  17  18  19
c   0   1   2   3   4
d  10  11  12  13  14

b.sort_index(ascending=False)
Out[78]: 
    0   1   2   3   4
d  10  11  12  13  14
c   0   1   2   3   4
b  15  16  17  18  19
a   5   6   7   8   9

b.sort_index(axis=1,ascending=False)
Out[79]: 
    4   3   2   1   0
c   4   3   2   1   0
a   9   8   7   6   5
d  14  13  12  11  10
b  19  18  17  16  15

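Sorting by value

The .sort_values() method sorts along the given axis by value, ascending by default.
NaN values are placed at the end by default.
Series.sort_values(axis=0, ascending=True)
DataFrame.sort_values(by, axis=0, ascending=True)
- by: a label, or list of labels, to sort by (a column label for axis=0, a row label for axis=1)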
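
The NaN rule and the list form of `by` are easy to check on small data. The following sketch uses invented values; `na_position` is the keyword that moves NaN to the front.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=['a', 'b', 'c', 'd'])

asc = s.sort_values()                         # NaN last
desc = s.sort_values(ascending=False)         # NaN still last
first = s.sort_values(na_position='first')    # NaN moved to the front

df = pd.DataFrame({'k': [2, 1, 2], 'v': [1, 9, 0]})
# Sort by k ascending, then by v descending within equal k.
multi = df.sort_values(by=['k', 'v'], ascending=[True, False])

print(list(asc.index), list(desc.index), list(first.index))
print(multi)
```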

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c', 'a', 'd', 'b'])

b
Out[75]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

c = b.sort_values(2, ascending=False)

c
Out[82]: 
    0   1   2   3   4
b  15  16  17  18  19
d  10  11  12  13  14
a   5   6   7   8   9
c   0   1   2   3   4

c = c.sort_values('a', axis=1, ascending=False)

c
Out[84]: 
    4   3   2   1   0
b  19  18  17  16  15
d  14  13  12  11  10
a   9   8   7   6   5
c   4   3   2   1   0

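6. Basic statistical analysis

Basic statistical functions

Applicable to both Series and DataFrame

Method	Description
.sum()	Sum of the data, computed along axis 0 by default (the same holds for the methods below)
.count()	Number of non-NaN values
.mean() .median()	Arithmetic mean and median of the data
.var() .std()	Variance and standard deviation of the data
.min() .max()	Minimum and maximum of the data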

Applicable to Series

Method	Description
.argmin() .argmax()	Integer position (automatic index) of the minimum and maximum
.idxmin() .idxmax()	Index label (custom index) of the minimum and maximum
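
The difference between position and label is easiest to see on a Series with a custom index. This sketch assumes pandas 1.0 or later, where `.argmin()` and `.argmax()` return integer positions.

```python
import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

pos_max = a.argmax()     # integer position of the largest value
label_max = a.idxmax()   # index label of the largest value
pos_min = a.argmin()     # integer position of the smallest value
label_min = a.idxmin()   # index label of the smallest value

print(pos_max, label_max, pos_min, label_min)
```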

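Applicable to both Series and DataFrame (summarizes the statistics above)

Method	Description
.describe()	Summary statistics along axis 0 (one summary per column)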

import pandas as pd

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

a
Out[87]: 
a    9
b    8
c    7
d    6
dtype: int64

a.describe()
Out[88]: 
count    4.000000
mean     7.500000
std      1.290994
min      6.000000
25%      6.750000
50%      7.500000
75%      8.250000
max      9.000000
dtype: float64

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c', 'a', 'd', 'b'])

b.describe()
Out[92]: 
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

b.describe().loc['max']   # .ix was removed in pandas 1.0; use .loc for label-based row access
Out[93]: 
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

b.describe()[2]
Out[94]: 
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64

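7. Cumulative statistical analysis

Cumulative statistical functions

Applicable to both Series and DataFrame

Method	Description
.cumsum()	Running sum of the first 1, 2, ..., n values
.cumprod()	Running product of the first 1, 2, ..., n values
.cummax()	Running maximum of the first 1, 2, ..., n values
.cummin()	Running minimum of the first 1, 2, ..., n values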

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c', 'a', 'd', 'b'])

b
Out[98]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

b.cumsum()
Out[99]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   7   9  11  13
d  15  18  21  24  27
b  30  34  38  42  46
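
The other cumulative functions behave the same way as .cumsum(). Here is a short sketch on an invented Series, plus one DataFrame call with axis=1 to accumulate across columns instead of down rows.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5])

running_sum = s.cumsum()
running_prod = s.cumprod()
running_max = s.cummax()
running_min = s.cummin()

df = pd.DataFrame([[1, 2], [3, 4]])
across = df.cumsum(axis=1)   # accumulate left to right within each row

print(running_sum.tolist(), running_prod.tolist())
print(running_max.tolist(), running_min.tolist())
print(across)
```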

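Applicable to both Series and DataFrame: rolling (window) calculations

Method	Description
.rolling(w).sum()	Sum of each window of w adjacent elements
.rolling(w).mean()	Arithmetic mean of each window of w adjacent elements
.rolling(w).var()	Variance of each window of w adjacent elements
.rolling(w).std()	Standard deviation of each window of w adjacent elements
.rolling(w).min() .max()	Minimum or maximum of each window of w adjacent elements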

import pandas as pd

import numpy as np

b = pd.DataFrame(np.arange(20).reshape(4,5), index=['c', 'a', 'd', 'b'])

b
Out[98]: 
    0   1   2   3   4
c   0   1   2   3   4
a   5   6   7   8   9
d  10  11  12  13  14
b  15  16  17  18  19

b.rolling(2).sum()
Out[100]: 
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   5.0   7.0   9.0  11.0  13.0
d  15.0  17.0  19.0  21.0  23.0
b  25.0  27.0  29.0  31.0  33.0

b.rolling(3).sum()
Out[101]: 
      0     1     2     3     4
c   NaN   NaN   NaN   NaN   NaN
a   NaN   NaN   NaN   NaN   NaN
d  15.0  18.0  21.0  24.0  27.0
b  30.0  33.0  36.0  39.0  42.0
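
The leading NaN rows appear because a window needs w values before it produces a result. The sketch below, on invented data, shows the rolling mean and the `min_periods` parameter, which allows partial windows at the start.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Full windows only: the first two positions have fewer than 3 values.
mean3 = s.rolling(3).mean()

# Partial windows allowed: a result is produced from the first value on.
partial_sum = s.rolling(3, min_periods=1).sum()

print(mean3)
print(partial_sum.tolist())
```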

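8. Correlation analysis

Positive correlation
Negative correlation
No correlation

Covariance

Covariance > 0: x and y are positively correlated
Covariance < 0: x and y are negatively correlated
Covariance = 0: x and y are uncorrelated

Pearson correlation coefficient (r ranges over [-1, 1])

|r| 0.8-1.0: very strong correlation
|r| 0.6-0.8: strong correlation
|r| 0.4-0.6: moderate correlation
|r| 0.2-0.4: weak correlation
|r| 0.0-0.2: very weak correlation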
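
For reference, the two quantities above are conventionally defined as follows. These are the standard textbook forms, not formulas taken from the original article: \bar{x} and \bar{y} denote sample means and n the number of observations, and the n-1 denominator is the sample version of the covariance.

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Dividing by the two spread terms is what confines r to [-1, 1], whereas the covariance itself depends on the units of x and y.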

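Correlation analysis functions

Applicable to both Series and DataFrame

Method	Description
.cov()	Compute the covariance matrix
.corr()	Compute the correlation matrix, using the Pearson, Spearman, or Kendall coefficient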
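
A quick way to get a feel for these functions is to feed them perfectly linear data. The sketch below uses invented values: y rises with x and z falls with x, so the Pearson coefficients should come out at 1 and -1. The manual line assumes the sample (n-1) form of the covariance.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1      # rises with x
z = 6 - x          # falls with x

cov_xy = x.cov(y)  # covariance between two Series
r_xy = x.corr(y)   # Pearson coefficient by default
r_xz = x.corr(z)

# The same covariance computed by hand, with an n-1 denominator.
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# On a DataFrame, both methods return a matrix over all column pairs.
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
cov_matrix = df.cov()
corr_matrix = df.corr()

print(cov_xy, manual, r_xy, r_xz)
print(corr_matrix)
```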
