美文网首页python学习
Pandas数据分析 vs SQL统计实践

Pandas数据分析 vs SQL统计实践

作者: 数据民工来取经儿 | 来源:发表于2019-01-10 23:29 被阅读17次

    1. 目的:SQL语句的Pandas实现

    数据分析师最基础的技能之一:写SQL统计数据;
    然..不够使,又须多一个能力,
    数据分析师最基础的技能之一:Pandas数据处理

    2. 环境

    • python2.7
    • python须安装ipython, pandas, numpy, mysqldb
    • mysql, 并下载练习包employees database
      如果不想花时间搭建环境, 我已经在云端搭建了数据学习环境,按照下面的文章最后的说明,申请远程访问权限:
      2019年,数据民工的大白话规划

    2.1 练习环境

    • ipython
    import MySQLdb
    import pandas as pd
    conn = MySQLdb.connect(host='127.0.0.1', user='usertest', passwd='usertest',port=3306, db='employees', charset="utf8")
    cur = conn.cursor()
    df_employees = pd.read_sql('select * from employees', con=conn)
    
    • 备注:

    python MySQLdb连接数据库,为防止查询结果中文乱码,connect参数注意增加charset="utf8"(因为一般表创建字符charset为utf8)

    3. SQL实例

    下面开始一步一步对照SQL的统计语句和Pandas实现方法:

    3.1 单表查询

    3.1.1 查询前几行

    • sql
    select * from emloyees limit 10
    
    • pandas
    # 方法一:
    df_employees.iloc[0:10]
    # 方法二:
    df_employees.head(10)
    

    3.1.2 查询某个字段

    • sql
    select emp_no, birth_date from emloyees
    
    • pandas
    df_employees[['emp_no', 'birth_date']]
    

    3.1.3 查询某字段前几行

    • sql
    select emp_no, birth_date from emloyees limit 10
    
    • pandas
    df_employees[['emp_no', 'birth_date']].iloc[0:10]
    df_employees[['emp_no', 'birth_date']].head(10)
    

    3.1.4 单字段筛选特殊值

    • sql
    select * from emloyees where hire_date='1986-06-26'
    select * from emloyees where hire_date <> '1986-06-26'
    
    • pandas
    df_employees.loc[df_employees['hire_date'].astype(str)=='1986-06-26']
    df_employees.loc[df_employees['hire_date'].astype(str) != '1986-06-26']
    df_employees.loc[df_employees['hire_date'].astype(str) <> '1986-06-26']
    

    3.1.5 and 和 or

    • sql
    select * from employees where hire_date='1986-06-26' and gender='F'
    select * from employees where hire_date='1986-06-26' or gender='F'
    
    • pandas 方法一
    df_employees.loc[(df_employees['hire_date'].astype(str)=='1986-06-26') & (df_employees['gender'].astype(str)=='M')]
    df_employees.loc[(df_employees['hire_date'].astype(str)=='1986-06-26') | (df_employees['gender'].astype(str)=='M')]
    
    • pandas 方法二
    df1 = df_employees.loc[df_employees['hire_date'].astype(str)=='1986-06-26']
    df2 = df1.loc[df1['gender'].astype(str)=='F']
    print df2
    

    3.1.6 is null 和 is not null

    • sql
    select * from employees where first_name is null 
    
    • pandas
    # 当然数据可能没有空值,所以结果为空
    df_employees[df_employees['first_name'].isnull()]
    df_employees[df_employees['first_name'].notnull()].head(10)
    

    3.1.6 in操作

    • sql
    select * from employees where first_name in ('Parto', 'Anneke', 'Saniya')
    
    • pandas
    df_employees.loc[df_employees['first_name'].isin(['Parto', 'Anneke', 'Saniya'])]
    

    3.1.7 not in操作

    • sql
    select * from employees where first_name not in ('Parto', 'Anneke', 'Saniya')
    
    • pandas
    df_employees.loc[~df_employees['first_name'].isin(['Parto', 'Anneke', 'Saniya'])]
    # 希望通过某列转化的list作为in的输出参数
    df_employees.loc[~df_employees['first_name'].isin(df.colexample.tolist())]
    

    3.1.8 聚合操作:count()

    • sql
    select first_name, count(*) as tt from employees group by first_name order by tt
    
    • pandas
    df_employees.groupby('first_name').size()
    # 注意比较size()与count()的区别,count()把每个字段都汇总,所以用size()即可
    df_employees.groupby('first_name').count()
    

    3.1.9 order by

    • sql
    select emp_no from employees order by emp_no desc
    
    • pandas
    df_employees.sort_values(by='emp_no', ascending=False)
    

    3.1.10 聚合操作:count()和count(distinct)

    • sql
    select first_name, count(*) as tt from employees group by first_name order by tt
    select gender,count(DISTINCT first_name) from employees group by gender
    
    • pandas
    df_employees.groupby('gender').count().sort_values('first_name', ascending=False)
    df_employees.groupby('gender').first_name.nunique()
    

    3.1.11 聚合操作:sum()

    • sql
    select gender, sum(emp_no) as tt from employees group by gender order by tt
    
    • pandas
    df_employees.[['emp_no', 'gender']].groupby('gender').sum()
    # pandas对非int字段,不会进行sum操作,如下结果与上面代码运行结果一样:
    df_employees.groupby('gender').sum() 
    

    3.1.12 多种聚合操作:count()与sum()

    • sql
    select gender, sum(emp_no) as tt1, count(*) as tt2 from employees group by gender order by tt1 desc
    select gender,first_name, sum(emp_no) as tt1, count(*) as tt2 
    from employees group by gender,first_name
    
    • pandas
    df_employees.groupby('gender').agg({'emp_no':np.sum, 'first_name':np.size}).sort_values(by='emp_no', ascending=False)
    df_employees.groupby(['gender','first_name']).agg({'emp_no':[np.sum, np.size]})
    

    3.1.13 待更新

    3.2 单表查询

    3.2.1 连接:Left join,Rigth join,Full join

    • 新读入薪水表
    df_salaries = pd.read_sql('select * from salaries', con=conn)
    
    • sql
    select * from employees t1 left join salaries t2 on t1.emp_no=t2.emp_no
    select * from employees t1 right join salaries t2 on t1.emp_no=t2.emp_no
    select * from employees t1 full join salaries t2 on t1.emp_no=t2.emp_no
    
    • pandas
    pd.merge(df_employees, df_salaries , on='emp_no', how='left')
    pd.merge(df_employees, df_salaries , on='emp_no', how='right')
    pd.merge(df_employees, df_salaries , on='emp_no', how='outer')
    # 如果是多列作为连接条件:
    pd.merge(df_employees, df_salaries , left_on=['emp_no', 'hire_date'], right_on=['emp_no', 'to_date'], how='inner')
    
    • join 函数
    # 如果两个dataframe待连接字段名字不同
    # 如下: 通过df_employees的columnName 与 df_salaries的newColumn进行连接
    df_employees.set_index('columnName').join(df_salaries.set_index('newColumn'))
    
    df_employees.join(df_salaries.set_index("birth_date"), on='hire_date', lsuffix='_one', rsuffix='_another', how='left')
    

    3.2.2 union all

    • sql
    (select * from employees t1 limit 10)
    union all 
    (select * from employees t1 limit 20,10)
    
    • pandas
    pd.concat([df_employees.iloc[0:10], df_employees.iloc[10:21]])
    

    3.2.2 union all 和 union

    • sql
    (select * from employees t1 limit 10)
    union all 
    (select * from employees t1 limit 20,10)
    
    • pandas
    pd.concat([df_employees.iloc[0:10], df_employees.iloc[10:21]])
    pd.concat([df_employees.iloc[0:10], df_employees.iloc[10:21]])
    

    3.2.3 row_number()

    • sql
    # mysql没有row_number()功能,oracle有, 如下模拟oracle写法(实际mysql无法运行哈,意会下精神,咳咳)
    SELECT * FROM 
    ( SELECT t.*, ROW_NUMBER() OVER(PARTITION BY gender ORDER BY emp_no DESC) AS rn FROM employees t)
    WHERE rn < 5
    ORDER BY gender ,rn;
    
    • pandas
    (df_employees.assign(rn=df_employees.sort_values(['emp_no'], ascending=False).groupby(['gender']).cumcount()+1).query('rn<5').sort_values(['gender','rn']))
    

    3.2.4 update

    • sql
    update employees
    set emp_no=100
    where emp_no=10001
    
    • pandas
    df_employees.loc[df_employees['emp_no']==10001, 'emp_no'] = 99
    

    3.2.5 delete

    • sql
    delete employees
    where emp_no=10003
    
    • pandas
    df1 = df_employees.loc[df_employees['emp_no'] != 10003]
    

    待更新..

    相关文章

      网友评论

        本文标题:Pandas数据分析 vs SQL统计实践

        本文链接:https://www.haomeiwen.com/subject/gpzzrqtx.html