pandas Quick Start
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Creating objects
Create a Series by passing a list of values
s = pd.Series([1,3,5,np.nan,6,8])
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
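Create a DataFrame from a NumPy array, with a datetime index and labeled columns
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))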

Viewing data
View the first rows of the data
df.head()

Inspect the DataFrame's index, column names, and underlying values
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[-0.46936354, -1.38929068, 0.84403157, 0.04286594],
[ 0.98657633, -0.68954348, -0.38326456, -1.10493201],
[-0.19242554, 1.74076522, 0.73047859, -1.32078058],
[ 0.04734752, -1.95230265, -0.6915437 , -1.40388308],
[ 0.23302102, 0.61911183, 0.628579 , -0.80258543],
[ 0.49394583, 0.84824737, 1.633055 , -0.74056229]])
Quick summary statistics
df.describe()

Selecting data
Selection
Select rows by slicing
df[0:3]

Select by label
df.loc['20130102':'20130104',['A','B']]

Select by position
df.iloc[3:5, 0:2]
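One difference between the two selectors is easy to miss: a label slice with loc includes both endpoints, while a position slice with iloc excludes the stop position. A minimal sketch, assuming a small hypothetical frame named demo with predictable values in place of the random df:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a 6x4 frame holding 0..23, indexed by six consecutive dates.
dates = pd.date_range("20130101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based slice: both 2013-01-02 and 2013-01-04 are included (3 rows).
by_label = demo.loc["20130102":"20130104", ["A", "B"]]

# Position-based slice: stop positions 5 and 2 are excluded (2 rows, 2 columns).
by_position = demo.iloc[3:5, 0:2]

print(by_label.shape)
print(by_position.shape)
```

The same three-row result by position would need demo.iloc[1:4, 0:2].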

Boolean indexing
df[df.A>0]
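The expression inside the brackets is itself a boolean Series, and pandas keeps the rows where it is True. A small sketch of how that works, using a hypothetical frame named demo with fixed values:

```python
import pandas as pd

# Hypothetical data with known signs in column A.
demo = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -0.3], "B": [10, 20, 30, 40]})

mask = demo["A"] > 0      # boolean Series aligned with demo's index
positive = demo[mask]     # keeps only the rows where mask is True

# Conditions combine with & (and) or | (or); each one needs its own parentheses.
both = demo[(demo["A"] > 0) & (demo["B"] > 20)]

print(positive)
print(both)
```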

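Filling missing data
# df1 is assumed to be a DataFrame that contains NaN values; it is not defined in this section
df1.fillna(value=5)

Get a boolean mask of the missing values
pd.isnull(df1)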
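Since df1 is not constructed anywhere in this section, the following sketch builds a hypothetical stand-in with two missing values so that both calls can be run end to end. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1: column E is missing its last two values.
df1 = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4],
                    "E": [1.0, 1.0, np.nan, np.nan]})

filled = df1.fillna(value=5)   # returns a new frame; df1 itself is unchanged
mask = pd.isnull(df1)          # True wherever a value is missing

print(filled)
print(mask)
```

Because fillna returns a copy by default, the result has to be assigned to a name to be kept.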

Operations
Statistics
Mean of each column (computed down the rows)
df.mean()
A 0.261411
B 0.094380
C 0.460223
D 5.000000
F 3.000000
dtype: float64
Mean of each row (computed across the columns)
df.mean(axis=1)
2013-01-01 1.461008
2013-01-02 1.182754
2013-01-03 1.855764
2013-01-04 1.080700
2013-01-05 2.096142
2013-01-06 2.595050
Freq: D, dtype: float64
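The axis argument decides the direction of the reduction. A minimal sketch with a hypothetical frame named demo, small enough to check the arithmetic by hand:

```python
import pandas as pd

# Hypothetical data chosen so the means are easy to verify.
demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

col_means = demo.mean()         # axis=0 (default): one value per column
row_means = demo.mean(axis=1)   # axis=1: one value per row

print(col_means)
print(row_means)
```

Here col_means is indexed by the column names and row_means by the row labels.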
Histogramming (counting values)
s = pd.Series(np.random.randint(0, 7, size=10))
s
0 6
1 1
2 4
3 6
4 3
5 2
6 3
7 5
8 2
9 2
dtype: int64
s.value_counts()
2 3
6 2
3 2
5 1
4 1
1 1
dtype: int64
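Because the Series above is random, the counts change on every run. The sketch below hard-codes the ten values printed above so the result is reproducible; value_counts returns the distinct values as the index and their frequencies as the data, most frequent first.

```python
import pandas as pd

# The ten values from the printed Series, fixed instead of random.
s = pd.Series([6, 1, 4, 6, 3, 2, 3, 5, 2, 2])

counts = s.value_counts()   # index: distinct values, data: how often each occurs

print(counts)
```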
Merging data
SQL-style joins with merge
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
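Both frames carry the key foo twice, so the join pairs every matching left row with every matching right row. A short sketch that runs the same merge and inspects the result:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

merged = pd.merge(left, right, on="key")

# Each left row pairs with each right row that shares its key: 2 x 2 = 4 rows.
print(merged)
```

With unique keys on both sides, the result would have one row per key instead.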

Grouping
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df.groupby(['A', 'B']).sum()
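Grouping by two columns yields one row per distinct (A, B) pair, with the pairs forming a hierarchical index. A minimal sketch, assuming a hypothetical frame named demo with integer values so the sums can be checked by hand:

```python
import pandas as pd

# Hypothetical data: the pair (foo, one) occurs twice, the other pairs once.
demo = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "one", "two"],
    "C": [1, 2, 3, 4],
})

totals = demo.groupby(["A", "B"]).sum()   # sums C within each (A, B) group

print(totals)
```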

Reshaping
Pivot tables
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
'B' : ['A', 'B', 'C'] * 4,
'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D' : np.random.randn(12),
'E' : np.random.randn(12)})

df.pivot_table(values='D', index=['A', 'B'], columns='C')
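A pivot table turns the distinct values of one column into new columns and aggregates the cells, using the mean by default. The sketch below uses a hypothetical frame named demo with fixed numbers to show both the averaging and the NaN that appears for a combination with no data:

```python
import pandas as pd

# Hypothetical data: (two, foo) occurs twice, (two, bar) never occurs.
demo = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
})

table = demo.pivot_table(values="D", index="A", columns="C")

# (two, foo) is the mean of 3.0 and 5.0; (two, bar) has no rows and becomes NaN.
print(table)
```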

Categorical data
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df['grade'] = df['raw_grade'].astype("category")
df['grade']
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
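The two category operations do different things: renaming relabels the existing categories in order, while setting categories fixes the complete list and its order, including labels that no row uses yet. A self-contained sketch, assuming a standalone Series named grades in place of the df column:

```python
import pandas as pd

# Same raw grades as above, held in a standalone Series for illustration.
grades = pd.Series(["a", "b", "b", "a", "a", "e"]).astype("category")

# Relabel the existing categories (a, b, e) in order.
grades = grades.cat.rename_categories(["very good", "good", "very bad"])

# Fix the full category list and its order; "bad" and "medium" are unused so far.
grades = grades.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])

# Sorting follows the category order, not alphabetical order.
ordered = grades.sort_values()

print(ordered)
```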
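Plotting
# ts is not defined in this section; it is assumed here to be a Series of 1000 values on a daily date index (the start date is arbitrary)
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')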

Importing and exporting data
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']).head()
