目录:
1 0.24.1版本新特性
2 安装
3马上开始
3.1 pandas概况
3.2 十分钟上手pandas
3.3 基础功能(一)
3.3 基础功能(二)
3.3 基础功能(三)
3.4 数据结构
3.5 对比其他工具
R、SAS、Stata这三个笔者暂时没有学,下文只翻译SQL部分。
3.5 对比其他工具
3.5.1 对比R/R库
暂略,可查官网。
3.5.2 对比SQL
由于许多潜在的pandas用户对SQL都有一定的了解,本页面旨在提供一些示例来说如何用pandas来执行各种SQL操作。
如果你是pandas的新手,请先看10分钟上手pandas来加深对这个库的了解。
按照惯例,我们这样引用pandas和NumPy:
In [1]: import pandas as pd
In [2]: import numpy as np
大多数示例将使用pandas测试集中的tips数据集。我们将原数据读取到一个叫tips的DataFrame中,假定我们都有相同名字和结构的数据表。
In [3]: url = ('https://raw.github.com/pandas-dev'
...: '/pandas/master/pandas/tests/data/tips.csv')
...:
In [4]: tips = pd.read_csv(url)
In [5]: tips.head()
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
3.5.2.1 SELECT
在SQL中,用户是通过逗号分隔的列名串来选择想要的列(或者通过*号来选择所有列):
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
在pandas中,选择列是通过给DataFrame传递列名构成的列表:
In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
Out[6]:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
如果调用DataFrame而不传递列名构成的列表,将展示所有的列,就像SQL的*号选取所有列一样。
3.5.2.2 WHERE
在SQL中筛选是通过WHERE语句。
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrame可以通过多种方法过滤。其中最直观的方法是通过布尔索引。
In [7]: tips[tips['time'] == 'Dinner'].head(5)
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
上面的语句仅仅给DataFrame传递了True/False对象构成的Series,最后返回结果为True的所有行。
In [8]: is_dinner = tips['time'] == 'Dinner'
In [9]: is_dinner.value_counts()
Out[9]:
True 176
False 68
Name: time, dtype: int64
In [10]: tips[is_dinner].head(5)
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
和SQL的OR和AND一样,DataFrame也可以使用‘’|(OR)‘’和‘’&(AND)‘’做复杂条件的筛选。
-- tips of more than $5.00 at Dinner meals
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
# tips of more than $5.00 at Dinner meals
In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
Out[11]:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
116 29.93 5.07 Male No Sun Dinner 4
155 29.85 5.14 Female No Sun Dinner 5
170 50.81 10.00 Male Yes Sat Dinner 3
172 7.25 5.15 Male Yes Sun Dinner 2
181 23.33 5.65 Male Yes Sun Dinner 2
183 23.17 6.50 Male Yes Sun Dinner 4
211 25.89 5.16 Male Yes Sat Dinner 4
212 48.33 9.00 Male No Sat Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)]
Out[12]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
检查空值(NULL)需要使用notna()和isna()方法。
In [13]: frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
....: 'col2': ['F', np.NaN, 'G', 'H', 'I']})
....:
In [14]: frame
Out[14]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
假定我们有和上面的DataFrame同样结构的表,通过下面的查询语句,我们能够得到col2为空的记录。
SELECT *
FROM frame
WHERE col2 IS NULL;
In [15]: frame[frame['col2'].isna()]
Out[15]:
col1 col2
1 B NaN
选取col1不为空的项,可以通过notna()方法。
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [16]: frame[frame['col1'].notna()]
Out[16]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
3.5.2.3 GROUP BY
在pandas中,SQL的GROUP BY操作是通过有着类似名字的groupby()方法完成的。当我们需要将数据集拆分成组、应用函数(尤其是聚合函数)然后将各组组合起来时,groupby()方法尤其适用。
一个常见的SQL操作是对整个数据集的每个组计算记录条数。比如,这条查询语句选取了每个性别的tips数目。
SELECT sex, count(*)
FROM tips
GROUP BY sex;
/*
Female 87
Male 157
*/
在pandas中等价的操作是:
In [17]: tips.groupby('sex').size()
Out[17]:
sex
Female 87
Male 157
dtype: int64
注意在pandas的代码中我们使用了size()函数,而不是count()。这是因为count()函数会作用于数据集的每一列,返回每一列的非空记录数。
In [18]: tips.groupby('sex').count()
Out[18]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
或者,我们可以将count()方法应用到单个列中:
In [19]: tips.groupby('sex')['total_bill'].count()
Out[19]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
我们也可以同时应用多个函数。比如,我们想知道一周中每天的小费额的差别——agg()函数可以允许你对分好组的DataFrame传入一个字典,指定特定的列运用特定的函数。
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/
In [20]: tips.groupby('day').agg({'tip': np.mean, 'day': np.size})
Out[20]:
tip day
day
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
以多个列为条件进行分组时,可以将列的列表传入groupby()方法。
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/
In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
Out[21]:
tip
size mean
smoker day
No Fri 4.0 2.812500
Sat 45.0 3.102889
Sun 57.0 3.167895
Thur 45.0 2.673778
Yes Fri 15.0 2.714000
Sat 42.0 2.875476
Sun 19.0 3.516842
Thur 17.0 3.030000
3.5.2.4 JOIN
连接操作要使用join()或merge()函数。join()函数默认按照DataFrame的索引连接。每个方法都有几项参数,允许你指定连接的种类(LEFT,RIGHT,INNER,FULL)或者要连接的列(通过列名或者索引)。
[22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
....: 'value': np.random.randn(4)})
In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
....: 'value': np.random.randn(4)})
假设我们有和上面的DataFrame一样的名字和结构的数据表。
现在我们来浏览一下各种连接。
3.5.2.4.1 内连接(INNER JOIN)
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
# merge函数默认进行内连接
In [24]: pd.merge(df1, df2, on='key')
Out[24]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
如果你想将一个DataFrame的列与另一个DataFrame的索引连接,merge()也提供了参数。
In [25]: indexed_df2 = df2.set_index('key')
In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True)
Out[26]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
3 D -1.135632 0.119209
3.5.2.4.2 左连接(LEFT JOIN)
-- 展示df1所有的记录
SELECT *
FROM df1
LEFT OUTER JOIN df2
ON df1.key = df2.key;
# 展示df1所有的记录
In [27]: pd.merge(df1, df2, on='key', how='left')
Out[27]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
3.5.2.4.3 右连接(TIGHT JOIN)
-- 展示df2所有的记录
SELECT *
FROM df1
RIGHT OUTER JOIN df2
ON df1.key = df2.key;
# 展示df2所有的记录
In [28]: pd.merge(df1, df2, on='key', how='right')
Out[28]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236
3.5.2.4.4 全连接(FULL JOIN)
pandas也支持全连接,无论要连接的列是否匹配,被连接的两个数据集的所有记录都会展示出来。在编写上,并非所有的RDMBS(MySQL)都支持全连接。
-- 展示两个表的所有记录
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;
# 展示两个表的所有记录
In [29]: pd.merge(df1, df2, on='key', how='outer')
Out[29]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
3.5.2.2 UNION
使用concat()进行SQL的UNION ALL操作。
In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
....: 'rank': range(1, 4)})
In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
....: 'rank': [1, 4, 5]})
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/
In [32]: pd.concat([df1, df2])
Out[32]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
SQL的UNION和UNION ALL很像,不过UNION会移除重复的行。
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- 注意这次只有一条Chicago的记录
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
在pandas中,可以先concat(),再用drop_duplicates()函数。
In [33]: pd.concat([df1, df2]).drop_duplicates()
Out[33]:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
3.5.2.2 一些SQL分析聚合函数的pandas实现
3.5.2.2.1 跳过一些行再选取头N行
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
In [34]: tips.nlargest(10 + 5, columns='tip').tail(10)
Out[34]:
total_bill tip sex smoker day time size
183 23.17 6.50 Male Yes Sun Dinner 4
214 28.17 6.50 Female Yes Sat Dinner 3
47 32.40 6.00 Male No Sun Dinner 4
239 29.03 5.92 Male No Sat Dinner 3
88 24.71 5.85 Male No Thur Lunch 2
181 23.33 5.65 Male Yes Sun Dinner 2
44 30.40 5.60 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
85 34.83 5.17 Female No Thur Lunch 4
211 25.89 5.16 Male Yes Sat Dinner 4
3.5.2.2.2 每组选取头N行
-- Oracle's ROW_NUMBER()分析函数
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;
In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False)
....: .groupby(['day'])
....: .cumcount() + 1)
....: .query('rn < 3')
....: .sort_values(['day', 'rn']))
....:
Out[35]:
total_bill tip sex smoker day time size rn
95 40.17 4.73 Male Yes Fri Dinner 4 1
90 28.97 3.00 Male Yes Fri Dinner 2 2
170 50.81 10.00 Male Yes Sat Dinner 3 1
212 48.33 9.00 Male No Sat Dinner 4 2
156 48.17 5.00 Male No Sun Dinner 6 1
182 45.35 3.50 Male Yes Sun Dinner 3 2
197 43.11 5.00 Female Yes Thur Lunch 4 1
142 41.19 5.00 Male No Thur Lunch 5 2
这和使用rank(method='first')函数是一样的:
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill']
....: .rank(method='first', ascending=False))
....: .query('rnk < 3')
....: .sort_values(['day', 'rnk']))
....:
Out[36]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0
-- Oracle's RANK()分析函数
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;
我们在每个年龄组tips<2的部分中找出rank<3的tips。注意,当我们使用rank(method='min')函数时,对于同一条tip,rnk_min是一样的。(这点像Oracle的RANK()函数)
3.5.2.2 UPDATE
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
In [38]: tips.loc[tips['tip'] < 2, 'tip'] *= 2
3.5.2.2 DELETE
DELETE FROM tips
WHERE tip > 9;
在pandas中我们选择需要保留的项,而不是删除不需保留的项。
In [39]: tips = tips.loc[tips['tip'] <= 9]
3.5.3 对比SAS
暂略,可查官网。
3.5.4 对比Stata
暂略,可查官网。
网友评论