美文网首页
Pandas与SQL的简单比较

Pandas与SQL的简单比较

作者: 老羊_肖恩 | 来源:发表于2023-01-03 17:27 被阅读0次

  Pandas作为数据分析三剑客之一,在数据挖掘和数据分析领域有着极其广泛的应用,而熟悉数据分析的人,往往也十分熟悉SQL,因为一开始的数据分析基本上都是从使用SQL开始。Pandas作为数据分析利器,自然也提供了很多SQL相关的功能,这里简单的从几个方面对Pandas和SQL进行比较。
在进行比较之前,我们先导入一个tips.csv的小费数据集。

import numpy as np 
import pandas as pd
tips = pd.read_csv('D:\\Workspace\\dataset\\tips.csv')
del tips['Unnamed: 0']
tips

结果如下:

total_bill  tip sex smoker  day time    size
0   16.99   1.01    Female  No  Sun Dinner  2
1   10.34   1.66    Male    No  Sun Dinner  3
2   21.01   3.50    Male    No  Sun Dinner  3
3   23.68   3.31    Male    No  Sun Dinner  2
4   24.59   3.61    Female  No  Sun Dinner  4
... ... ... ... ... ... ... ...
239 29.03   5.92    Male    No  Sat Dinner  3
240 27.18   2.00    Female  Yes Sat Dinner  2
241 22.67   2.00    Male    Yes Sat Dinner  2
242 17.82   1.75    Male    No  Sat Dinner  2
243 18.78   3.00    Female  No  Thur    Dinner  2
244 rows × 7 columns
SELECT

  SQL中的SELECT是用来返回指定列的信息,不指定列名的情况下,如SELECT *返回所有列的数据:
SQL:

SELECT *
FROM tips
LIMIT 5;

Pandas:

tips.head(5)

结果如下:

total_bill  tip sex smoker  day time    size
0   16.99   1.01    Female  No  Sun Dinner  2
1   10.34   1.66    Male    No  Sun Dinner  3
2   21.01   3.50    Male    No  Sun Dinner  3
3   23.68   3.31    Male    No  Sun Dinner  2
4   24.59   3.61    Female  No  Sun Dinner  4

如果需要返回若干指定的列,则需要在列索引中指定所要返回的列名:
SQL:

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

Pandas:

tips[['total_bill', 'tip', 'smoker','time']].head(5)

结果如下:

total_bill  tip smoker  time
0   16.99   1.01    No  Dinner
1   10.34   1.66    No  Dinner
2   21.01   3.50    No  Dinner
3   23.68   3.31    No  Dinner
4   24.59   3.61    No  Dinner
WHERE

  SQL中的WHERE主要是用来做条件过滤的,Pandas中可以通过多种方式进行条件过滤,其中最直观的方式是使用布尔索引。
SQL:

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

Pandas:

tips[tips['time'] == 'Dinner'].head(5)

结果如下:

    total_bill  tip sex smoker  day time    size
0   16.99   1.01    Female  No  Sun Dinner  2
1   10.34   1.66    Male    No  Sun Dinner  3
2   21.01   3.50    Male    No  Sun Dinner  3
3   23.68   3.31    Male    No  Sun Dinner  2
4   24.59   3.61    Female  No  Sun Dinner  4

类似于SQL中where子句中的andor,Pandas的布尔索引也支持andor组合的条件。其中and使用&来实现,or使用|来实现。

SQL:

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00
LIMIT 5;

Pandas:

tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)].head(5)

结果如下:

total_bill  tip sex smoker  day time    size
23  39.42   7.58    Male    No  Sat Dinner  4
44  30.40   5.60    Male    No  Sun Dinner  4
47  32.40   6.00    Male    No  Sun Dinner  4
52  34.81   5.20    Female  No  Sun Dinner  4
59  48.27   6.73    Male    No  Sat Dinner  4

SQL:

SELECT *
FROM tips
WHERE time = 'Dinner' or tip > 3.00
LIMIT 5;

Pandas:

tips[(tips['time'] == 'Dinner') | (tips['tip'] > 3.00)].head(5)

结果如下:

total_bill  tip sex smoker  day time    size
0   16.99   1.01    Female  No  Sun Dinner  2
1   10.34   1.66    Male    No  Sun Dinner  3
2   21.01   3.50    Male    No  Sun Dinner  3
3   23.68   3.31    Male    No  Sun Dinner  2
4   24.59   3.61    Female  No  Sun Dinner  4
GROUP BY

  SQL中的GROUP BY通常用来将数据集按照某些列进行分组,然后基于每个分组进行某些聚合计算,然后将结果组合在一起的过程,Pandas提供了groupby()方法来实现SQL中的GROUP BY功能。
例如通过计算每天的小费数量和总金额
SQL:

SELECT day, count(*), sum(tip),
FROM tips
GROUP BY day;

Pandas:

tips.groupby('day').agg({'day':np.size,'tip':np.sum})

结果如下:

    day tip
day     
Fri 19  51.96
Sat 87  260.40
Sun 76  247.39
Thur    62  171.83

也可以通过对groupby方法传入多个列来实现多列分组:
SQL:

SELECT day, sex, count(*), sum(tip),
FROM tips
GROUP BY day, sex;

Pandas:

tips.groupby(['day','sex']).agg({'day':np.size,'tip':np.sum})

结果如下:


            day tip
day sex     
Fri Female  9   25.03
     Male   10  26.93
Sat Female  28  78.45
     Male   59  181.95
Sun Female  18  60.61
    Male    58  186.78
Thur Female 32  82.42
    Male    30  89.41
UNION

SQL中的UNION 主要用于对具有相同列的数据进行组合,其中UNION ALL不去重,可以使用Pandas中concat方法来实现。假设我们存在两组数据:

df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],'rank': range(1, 4)})
df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],'rank': [1, 4, 5]})
print(df1)
print(df2)

结果如下:

            city  rank
0        Chicago     1
1  San Francisco     2
2  New York City     3

          city  rank
0      Chicago     1
1       Boston     4
2  Los Angeles     5

使用concat来实现union all操作:
SQL:

SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;

Pandas:

pd.concat([df1, df2])

结果如下:

    city    rank
0   Chicago 1
1   San Francisco   2
2   New York City   3
0   Chicago 1
1   Boston  4
2   Los Angeles 5

由于concat默认不去重,因此如果需要执行去重的union操作,则需要在concat的基础上使用drop_duplicates()来实现:
SQL:

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;

Pandas:

pd.concat([df1, df2]).drop_duplicates()

结果如下:

    city    rank
0   Chicago 1
1   San Francisco   2
2   New York City   3
1   Boston  4
2   Los Angeles 5
JOIN

  JOIN操作用于在某列上对两个表的数据进行连接,连接的类型包括:INNER JOIN、LEFT JOIN、RIGHT JOIN、FULL JOIN。下面我们通过两个不同的DataFrame来展示以上不同的JOIN操作在Pandas中是如何实现的。两个DataFrame的数据如下:

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D','E'],'value': np.random.randn(5)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E','F'],'value': np.random.randn(5)})
print(df1)
print(df2)

结果如下:

  key     value
0   A -2.345263
1   B -0.626052
2   C  1.322350
3   D  0.447053
4   E -0.478527

  key     value
0   B -0.720448
1   D  1.515793
2   D -0.335380
3   E -0.411750
4   F  0.817647
INNER JOIN

SQL:

SELECT *
FROM df1
INNER JOIN df2 ON df1.key = df2.key;

Pandas:

pd.merge(df1, df2, on='key')

结果如下:

    key value_x value_y
0   B   -0.626052   -0.720448
1   D   0.447053    1.515793
2   D   0.447053    -0.335380
3   E   -0.478527   -0.411750
LEFT JOIN

SQL:

SELECT *
FROM df1
LEFT JOIN df2 ON df1.key = df2.key;

Pandas:

pd.merge(df1, df2, on='key', how='left')

结果如下:

    key value_x value_y
0   A   -2.345263   NaN
1   B   -0.626052   -0.720448
2   C   1.322350    NaN
3   D   0.447053    1.515793
4   D   0.447053    -0.335380
5   E   -0.478527   -0.411750
RIGHT JOIN

SQL:

SELECT *
FROM df1
RIGHT JOIN df2 ON df1.key = df2.key;

Pandas:

pd.merge(df1, df2, on='key', how='right')

结果如下:

    key value_x value_y
0   B   -0.626052   -0.720448
1   D   0.447053    1.515793
2   D   0.447053    -0.335380
3   E   -0.478527   -0.411750
4   F   NaN 0.817647
FULL JOIN

SQL:

SELECT *
FROM df1
FULL OUTER JOIN df2 ON df1.key = df2.key;

Pandas:

pd.merge(df1, df2, on='key', how='outer')

结果如下:

    key value_x value_y
0   A   -2.345263   NaN
1   B   -0.626052   -0.720448
2   C   1.322350    NaN
3   D   0.447053    1.515793
4   D   0.447053    -0.335380
5   E   -0.478527   -0.411750
6   F   NaN 0.817647

关于Pandas和SQL的简单比较就介绍到这里,其实Pandas还提供了许多SQL在窗口函数下的实现,具体的以后有时间再去慢慢写下来。

相关文章

网友评论

      本文标题:Pandas与SQL的简单比较

      本文链接:https://www.haomeiwen.com/subject/mwkxcdtx.html