Pandas作为数据分析三剑客之一,在数据挖掘和数据分析领域有着极其广泛的应用,而熟悉数据分析的人,往往也十分熟悉SQL,因为一开始的数据分析基本上都是从使用SQL开始。Pandas作为数据分析利器,自然也提供了很多SQL相关的功能,这里简单的从几个方面对Pandas和SQL进行比较。
在进行比较之前,我们先导入一个tips.csv的小费数据集。
import numpy as np
import pandas as pd
tips = pd.read_csv('D:\\Workspace\\dataset\\tips.csv')
del tips['Unnamed: 0']
tips
结果如下:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
244 rows × 7 columns
SELECT
SQL中的SELECT
是用来返回指定列的信息,不指定列名的情况下,如SELECT *
返回所有列的数据:
SQL:
SELECT *
FROM tips
LIMIT 5;
Pandas:
tips.head(5)
结果如下:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
如果需要返回若干指定的列,则需要在列索引中指定所要返回的列名:
SQL:
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
Pandas:
tips[['total_bill', 'tip', 'smoker','time']].head(5)
结果如下:
total_bill tip smoker time
0 16.99 1.01 No Dinner
1 10.34 1.66 No Dinner
2 21.01 3.50 No Dinner
3 23.68 3.31 No Dinner
4 24.59 3.61 No Dinner
WHERE
SQL中的WHERE
主要是用来做条件过滤的,Pandas中可以通过多种方式进行条件过滤,其中最直观的方式是使用布尔索引。
SQL:
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
Pandas:
tips[tips['time'] == 'Dinner'].head(5)
结果如下:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
类似于SQL中where
子句中的and
和or
,Pandas的布尔索引也支持and
和or
组合的条件。其中and
使用&
来实现,or
使用|
来实现。
SQL:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00
LIMIT 5;
Pandas:
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].head(5)
结果如下:
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
44 30.40 5.60 Male No Sun Dinner 4
47 32.40 6.00 Male No Sun Dinner 4
52 34.81 5.20 Female No Sun Dinner 4
59 48.27 6.73 Male No Sat Dinner 4
SQL:
SELECT *
FROM tips
WHERE time = 'Dinner' or tip > 3.00
LIMIT 5;
Pandas:
tips[(tips['time'] == 'Dinner') | (tips['tip'] > 3.00)].head(5)
结果如下:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
GROUP BY
SQL中的GROUP BY
通常用来将数据集按照某些列进行分组,然后基于每个分组进行某些聚合计算,然后将结果组合在一起的过程,Pandas提供了groupby()
方法来实现SQL中的GROUP BY
功能。
例如通过计算每天的小费数量和总金额
SQL:
SELECT day, count(*), sum(tip),
FROM tips
GROUP BY day;
Pandas:
tips.groupby('day').agg({'day':np.size,'tip':np.sum})
结果如下:
day tip
day
Fri 19 51.96
Sat 87 260.40
Sun 76 247.39
Thur 62 171.83
也可以通过对groupby方法传入多个列来实现多列分组:
SQL:
SELECT day, sex, count(*), sum(tip),
FROM tips
GROUP BY day, sex;
Pandas:
tips.groupby(['day','sex']).agg({'day':np.size,'tip':np.sum})
结果如下:
day tip
day sex
Fri Female 9 25.03
Male 10 26.93
Sat Female 28 78.45
Male 59 181.95
Sun Female 18 60.61
Male 58 186.78
Thur Female 32 82.42
Male 30 89.41
UNION
SQL中的UNION
主要用于对具有相同列的数据进行组合,其中UNION ALL
不去重,可以使用Pandas中concat方法来实现。假设我们存在两组数据:
df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],'rank': [1, 4, 5]})
print(df1)
print(df2)
结果如下:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
city rank
0 Chicago 1
1 Boston 4
2 Los Angeles 5
使用concat来实现union all操作:
SQL:
SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
Pandas:
pd.concat([df1, df2])
结果如下:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
0 Chicago 1
1 Boston 4
2 Los Angeles 5
由于concat
默认不去重,因此如果需要执行去重的union
操作,则需要在concat
的基础上使用drop_duplicates()
来实现:
SQL:
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
Pandas:
pd.concat([df1, df2]).drop_duplicates()
结果如下:
city rank
0 Chicago 1
1 San Francisco 2
2 New York City 3
1 Boston 4
2 Los Angeles 5
JOIN
JOIN
操作用于在某列上对两个表的数据进行连接,连接的类型包括:INNER JOIN、LEFT JOIN、RIGHT JOIN、FULL JOIN。下面我们通过两个不同的DataFrame来展示以上不同的JOIN操作在Pandas中是如何实现的。两个DataFrame的数据如下:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D','E'],'value': np.random.randn(5)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E','F'],'value': np.random.randn(5)})
print(df1)
print(df2)
结果如下:
key value
0 A -2.345263
1 B -0.626052
2 C 1.322350
3 D 0.447053
4 E -0.478527
key value
0 B -0.720448
1 D 1.515793
2 D -0.335380
3 E -0.411750
4 F 0.817647
INNER JOIN
SQL:
SELECT *
FROM df1
INNER JOIN df2 ON df1.key = df2.key;
Pandas:
pd.merge(df1, df2, on='key')
结果如下:
key value_x value_y
0 B -0.626052 -0.720448
1 D 0.447053 1.515793
2 D 0.447053 -0.335380
3 E -0.478527 -0.411750
LEFT JOIN
SQL:
SELECT *
FROM df1
LEFT JOIN df2 ON df1.key = df2.key;
Pandas:
pd.merge(df1, df2, on='key', how='left')
结果如下:
key value_x value_y
0 A -2.345263 NaN
1 B -0.626052 -0.720448
2 C 1.322350 NaN
3 D 0.447053 1.515793
4 D 0.447053 -0.335380
5 E -0.478527 -0.411750
RIGHT JOIN
SQL:
SELECT *
FROM df1
RIGHT JOIN df2 ON df1.key = df2.key;
Pandas:
pd.merge(df1, df2, on='key', how='right')
结果如下:
key value_x value_y
0 B -0.626052 -0.720448
1 D 0.447053 1.515793
2 D 0.447053 -0.335380
3 E -0.478527 -0.411750
4 F NaN 0.817647
FULL JOIN
SQL:
SELECT *
FROM df1
FULL OUTER JOIN df2 ON df1.key = df2.key;
Pandas:
pd.merge(df1, df2, on='key', how='outer')
结果如下:
key value_x value_y
0 A -2.345263 NaN
1 B -0.626052 -0.720448
2 C 1.322350 NaN
3 D 0.447053 1.515793
4 D 0.447053 -0.335380
5 E -0.478527 -0.411750
6 F NaN 0.817647
关于Pandas和SQL的简单比较就介绍到这里,其实Pandas还提供了许多SQL在窗口函数下的实现,具体的以后有时间再去慢慢写下来。
网友评论