DataFrame Operations in Pandas


Author: RandyLou | Published 2020-09-20 15:10

    Pandas has two core concepts: Series and DataFrame. A Series is similar to a Python list and can be thought of as one column of a table. A DataFrame is composed of Series and represents a whole table. The core value lies in filtering, transforming, aggregating, and analyzing tabular data, and producing visualizations from it.
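The relationship can be seen directly in code: a column pulled out of a DataFrame is a Series (the tiny frame below is a made-up example):

```python
import pandas as pd

# a DataFrame is a table; each of its columns is a Series
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
col = df["a"]

print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
print(col.tolist())        # [1, 2, 3]
```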

    All of the examples below are based on the following test data:

    PV conversion data, variable name: pv_conv. The Chinese column names are kept as-is because they are used as identifiers throughout the code: 首页PV (home page PV), 搜索页PV (search page PV), 注册数 (registrations), 下单用户数 (ordering users), 订单数 (orders).

    首页PV 搜索页PV 注册数 下单用户数 订单数
    985 290 98 46 40
    211 200 21 43 50
    688 201 19 68 70
    766 228 71 72 80

    All code samples assume you have already imported the following Python packages:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt  # note: the original 'import matplotlib as plt' would break plt.* calls
    import random                    # used by the apply() example in section 2.8
    import sys
    import MySQLdb
    
    pv_conv = {
        '首页PV': [985, 211, 688, 766],
        '搜索页PV': [290, 200, 201, 228],
        '注册数': [98, 21, 19, 71],
        '下单用户数': [46, 43, 68, 72],
        '订单数': [40, 50, 70, 80],
    }
    

    1. Creating a DataFrame

    1.1 From a dict

    Creating a DataFrame in code is a common need, since the data can then come from anywhere and take any format. Many DataFrame operations revolve around the index column.

    The DataFrame constructor itself is quite flexible: it natively accepts data shaped like the orient='split', orient='records', orient='index' and orient='columns' layouts described in 1.3 (creating from JSON).

    1. split-style layout

    We pass index, columns and data as three separate arguments:

    index = ["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"]
    columns = ["首页PV", "搜索页PV", "注册数", "下单用户数", "订单数"]
    data = [
        [1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10],
        [11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20]
    ]
    
    record_json = pd.DataFrame(index=index, columns=columns, data=data)
    print(record_json)
    
    2. records-style layout

    A list of dicts, matching the orient='records' layout, also works:

    data = [
        {"日期": "2020-09-01", "首页PV":"1",  "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
        {"日期": "2020-09-02", "首页PV":"2",  "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
        {"日期": "2020-09-03", "首页PV":"3",  "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
        {"日期": "2020-09-04", "首页PV":"4",  "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
    ]
    
    record_json = pd.DataFrame(data)
    print(record_json)
    
    3. columns-style layout
    pv_conv = {
        '首页PV': [985, 211, 688, 766],
        '搜索页PV': [290, 200, 201, 228],
        '注册数': [98, 21, 19, 71],
        '下单用户数': [46, 43, 68, 72],
        '订单数': [40, 50, 70, 80],
    }
    
    record_json = pd.DataFrame(pv_conv)
    print(record_json)
    
    4. index-style layout is not supported

    This layout uses the row position as the key and the row's data as the value; passed as a dict to the DataFrame constructor, it cannot be distinguished from the orient='columns' layout.
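If you do have index-style data, pandas handles it explicitly through pd.DataFrame.from_dict with orient='index'; a minimal sketch (the row keys and values here are made up):

```python
import pandas as pd

# index-style dict: row key -> row data
data = {
    "0": {"首页PV": 1, "搜索页PV": 2},
    "1": {"首页PV": 6, "搜索页PV": 7},
}

# from_dict(orient='index') reads each dict key as a row label
df = pd.DataFrame.from_dict(data, orient="index")
print(df.shape)  # (2, 2)
```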

    5. values-style layout is not supported

    This layout conflicts with the split-style one: a bare two-dimensional array is taken as the data argument, so rows holding your index and columns would be treated as data.

    6. Setting the index
    • Auto-generated index

    Suitable when the index has no real meaning; the default index starts at 0, like a positional array subscript.

    df = pd.DataFrame(pv_conv)
    

    Output: (screenshot)
    • Switching the index manually

    Note that set_index is not in-place (it returns a new DataFrame), and the column must exist; the pv_conv dict above has no '日期' column, so one is added first:

    df = pd.DataFrame(pv_conv)
    df['日期'] = ["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"]
    df = df.set_index('日期')
    
    • Supplying the index explicitly

    Pull the index out as a separate argument when building the DataFrame:

    df = pd.DataFrame(pv_conv, index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"])
    

    Output: (screenshot)
    1.2 From a CSV file

    CSV file contents:

    日期,首页PV,搜索页PV,注册数,下单用户数,订单数
    2020-09-01,985,290,98,46,40
    2020-09-02,211,200,21,43,50
    2020-09-03,688,201,19,68,70
    2020-09-04,766,228,71,72,80
    
    1. By default the index is auto-generated
    csv = pd.read_csv('d:/pv_conv.csv')
    print(csv)
    

    Output: (screenshot)
    2. Specifying the index by column position
    csv = pd.read_csv('d:/pv_conv.csv', index_col=0)
    print(csv)
    

    Output: (screenshot)
    3. A pandas pitfall on Python 2.7

    If the CSV is read with the default index_col behaviour, the header row and the data rows end up shifted by one column; specifying index_col=False avoids the problem.

    csv = pd.read_csv("e:/wht_mobile.csv", index_col=False)
    
    1.3 From JSON
    1. From a string, split format

    The orient parameter specifies the JSON layout. With orient='split', index and columns are separate fields and data is a two-dimensional array in which each element holds one row:

    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15],
                [16, 17, 18, 19, 20]
         ]
    }
    

    Code (assuming the JSON above is stored in the string variable tj):

    split_json = pd.read_json(tj, orient='split')
    print(split_json)
    

    DataFrames created from JSON default to a 0-based positional index (orient='split' is the exception, since it carries its own index field); if the index lives in a data column, you have to set it yourself.

    Output: (screenshot)
    2. From a string, records format

    With orient='records' the data is an array of JSON objects, each corresponding to one row:

    [
        {"日期": "2020-09-01", "首页PV":"1",  "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
        {"日期": "2020-09-02", "首页PV":"2",  "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
        {"日期": "2020-09-03", "首页PV":"3",  "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
        {"日期": "2020-09-04", "首页PV":"4",  "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
    ]
    

    ==A very annoying gotcha here: the trailing "," after a data element. The last element must not have one; JSON itself forbids it, but if you leave one in, the error pandas reports is cryptic. This cost me several hours.==

    Code (rj holds the JSON string above; note that the keys of the dtype mapping should be real column names):

    record_json = pd.read_json(rj, orient='records', encoding="utf-8", dtype={"a": str, "b": str})
    print(record_json)
    

    An index column is generated automatically by default; to choose your own, switch it with set_index, which works the same way for every DataFrame.

    Output: (screenshot)
    3. From a string, index format

    With orient='index' the JSON becomes an object keyed by the positional index:

    {
        "0":{"日期": "2020-09-01", "首页PV":"1",  "搜索页PV": "1", "注册数" : "1" , "下单用户数" : "1", "订单数" : "1"},
        "1":{"日期": "2020-09-02", "首页PV":"2",  "搜索页PV": "2", "注册数" : "2" , "下单用户数" : "2", "订单数" : "2"},
        "2":{"日期": "2020-09-03", "首页PV":"3",  "搜索页PV": "3", "注册数" : "3" , "下单用户数" : "3", "订单数" : "3"},
        "3":{"日期": "2020-09-04", "首页PV":"4",  "搜索页PV": "4", "注册数" : "4" , "下单用户数" : "4", "订单数" : "4"}
    }
    

    Code (ridx holds the JSON string above):

    record_json = pd.read_json(ridx, orient='index', encoding="utf-8")
    print(record_json)
    
    4. From a string, columns format

    columns is the default import layout; the input JSON is an object in which each key maps to one column of data:

    {
        "首页PV": [985, 211, 688, 766],
        "搜索页PV": [290, 200, 201, 228],
        "注册数": [98, 21, 19, 71],
        "下单用户数": [46, 43, 68, 72],
        "订单数": [40, 50, 70, 80],
        "日期":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"]
    }
    

    Code (the snippet reuses the variable name ridx for the columns-format JSON above):

    record_json = pd.read_json(ridx, orient='columns', encoding="utf-8")
    print(record_json)
    
    5. From a string, values format

    With orient='values' the data is a two-dimensional JSON array, each element representing one row:

    [
        ["首页PV","搜索页PV","注册数","下单用户数","订单数","日期"],
        ["985","290","98","46","40","2020-09-01"],
        ["211","200","21","43","50","2020-09-02"],
        ["688","201","19","68","70","2020-09-03"],
        ["766","228","71","72","80","2020-09-04"]
    ]
    

    Code (again reusing ridx for the values-format JSON above):

    record_json = pd.read_json(ridx, orient='values', encoding="utf-8")
    print(record_json)
    
    6. From a file

    A file differs from a string only in where the data comes from; the same orient values apply, with orient='columns' as the default.

    JSON file contents:

    {
        "首页PV": [985, 211, 688, 766],
        "搜索页PV": [290, 200, 201, 228],
        "注册数": [98, 21, 19, 71],
        "下单用户数": [46, 43, 68, 72],
        "订单数": [40, 50, 70, 80],
        "日期":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"]
    }
    

    Code:

    json = pd.read_json("d:/pv_conv.json")
    json = json.set_index("日期")
    print(json)
    

    Choose the index column to suit your needs; if you don't, the default is a 0-based positional index, like an array subscript.

    1.4 From a database table
    # load from a database table
    mysql_conn = MySQLdb.connect(host='192.168.36.92', user='hadoop', passwd='aa.123', port=3306, db='dp', charset="utf8")
    db = pd.read_sql_query("select * from client", mysql_conn)
    # print(db)
    
    # load with an explicit index column; using a positional index here raises an error
    db = pd.read_sql_query("select * from client", mysql_conn, index_col='id')
    # print(db)
    
    # switch the index column
    db = db.set_index("name")
    print(db)
    

    2. Data operations

    2.1 Filtering on the index column
    1. Filtering with an auto-generated index
    df = pd.DataFrame(pv_conv)
    row = df.loc[0]
    

    Output: (screenshot)
    2. Filtering with an explicitly supplied index
    df = pd.DataFrame(pv_conv, index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"])
    row = df.loc['2020-09-01']
    

    Output: (screenshot)
    3. Slicing by index range

    Give a start and an end index; the returned slice includes both endpoints.

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.loc['2020-09-02':'2020-09-03'])
    
    

    Output: (screenshot)
    2.2 Positional (row-number) indexing
    1. Selecting by row position

    This works exactly like array subscripting.

    Example

    
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.iloc[0])
    

    Output: (screenshot)
    2. Selecting by a range of row positions

    The slightly surprising part: loc includes both endpoints of the range, while iloc excludes the end position.

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.iloc[1:3])
    

    Output: (screenshot)
    2.3 First N rows, last N rows
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15],
                [16, 17, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv)
    
    print(pv_conv.head(2))
    print(pv_conv.tail(1))
    
    2.4 Removing duplicate rows

    append concatenates two DataFrames (note it was deprecated in pandas 1.4 and removed in 2.0 in favour of pd.concat). Here pv_conv is appended to itself so that every row appears twice, and drop_duplicates then removes the duplicates:

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15],
                [16, 17, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.shape)
    
    pv_conv = pv_conv.append(pv_conv)
    print(pv_conv.shape)
    
    print(pv_conv)
    
    # drop the duplicated rows
    pv_conv = pv_conv.drop_duplicates()
    print(pv_conv)
    

    Output: (screenshot)

    The keep parameter controls which copy survives:

    Value Meaning
    keep='first' keep the first occurrence (the default)
    keep='last' keep the last occurrence
    keep=False drop every row that has a duplicate
    pv_conv = pv_conv.drop_duplicates(keep=False)
    print(pv_conv)
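Since DataFrame.append was removed in pandas 2.0, the same duplicate-then-dedup round trip can be sketched with pd.concat (the tiny frame below is made up):

```python
import pandas as pd

pv_conv = pd.DataFrame({"首页PV": [1, 6], "订单数": [5, 10]},
                       index=["2020-09-01", "2020-09-02"])

# pd.concat replaces the removed DataFrame.append
doubled = pd.concat([pv_conv, pv_conv])
print(doubled.shape)   # (4, 2)

deduped = doubled.drop_duplicates()
print(deduped.shape)   # (2, 2)
```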
    
    2.5 Handling missing values
    1. Testing for nulls
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.isnull())
    

    Output: (screenshot)

    The bare boolean table is of limited use on its own; its value comes from using it as a mask.
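For example, the mask can select the rows that contain a missing value; a small sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"搜索页PV": [2, 7, np.nan],
                   "订单数": [5, 10, 15]},
                  index=["2020-09-01", "2020-09-02", "2020-09-03"])

# keep only the rows where any column is null
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)
```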

    2. Counting nulls
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.isnull().sum())
    

    Output: (screenshot)
    3. Filling nulls
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv)
    
    pv_conv = pv_conv.fillna('HI')
    print(pv_conv)
    

    Output: (screenshot)
    2.6 Data cleaning: dropping missing data
    1. Dropping rows that contain nulls

    A row is dropped if it contains even a single null value.

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv)
    print(pv_conv.dropna())
    

    Output: (screenshot)
    2. Dropping columns that contain nulls

    The axis parameter of dropna chooses between rows and columns: the default axis=0 drops rows, while axis=1 drops columns.

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv)
    print(pv_conv.dropna(axis=1))
    

    Output: (screenshot)
    2.7 Condition-based filtering
    1. Filter conditions

    DataFrame['column name'] returns a Series; combining a Series with a comparison operator produces a filter condition, whose output looks much like that of isnull().

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    condition = (pv_conv['首页PV'] > 6)
    print(condition)
    

    Output: (screenshot)

    DataFrame[condition] returns the rows for which the condition is True.
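Applying the mask directly, a self-contained sketch (the values are made up):

```python
import pandas as pd

pv_conv = pd.DataFrame({"首页PV": [1, 6, 11, 16]},
                       index=["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-04"])

condition = pv_conv["首页PV"] > 6
print(pv_conv[condition])  # only the 2020-09-03 and 2020-09-04 rows
```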

    2. Logical operators

    Filter conditions support logical operators: & for AND and | for OR. Wrap each condition in parentheses and keep & and | outside them, because & and | bind more tightly than the comparison operators.

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    condition = (pv_conv['首页PV'] > 6) & (pv_conv['下单用户数'] < 18)
    print(condition)
    
    print(pv_conv[condition])
    print(pv_conv[(pv_conv['首页PV'] > 6) & (pv_conv['下单用户数'] < 18)]) # the condition can also be written inline
    
    3. Membership in a list, like SQL's IN

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    print(pv_conv[pv_conv['首页PV'].isin([1, 11])])
    

    Output: (screenshot)
    4. Selecting rows with a filter condition

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv)
    
    home_page_pv = pv_conv[pv_conv['首页PV'] > 10]
    print(home_page_pv)
    

    Output: (screenshot)
    2.8 Transformation functions

    Get a Series from the DataFrame by column name, then call its apply method with a transformation function to convert that column's values. The result must be assigned back to DataFrame['column name'] for the change to take effect. (The example below uses the random module, so import random is needed.)

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [3, 7, 8, null, 10],
                [5, null, 13, 14, 15],
                [7, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    def multiply_pv(pv):
        return pv * 100 + random.randint(0, 100)
    
    temp = pv_conv['首页PV'].apply(multiply_pv)
    print(temp)
    
    pv_conv['首页PV'] = temp
    print(pv_conv)
    

    Output: (screenshot)

    3. Reflection and DML

    3.1 info: column names, types, and non-null counts
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.info())
    

    Output: (screenshot)
    3.2 shape: number of rows and columns
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.shape)
    

    Output: the current DataFrame has 4 rows and 5 columns.

    3.3 Getting the column names
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.columns)
    

    Output: (screenshot)
    3.4 Renaming columns
    1. Renaming a specific column

    Rename the 订单数 column of the DataFrame to order_count:

    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.columns)
    
    pv_conv.rename(columns={
        '订单数': 'order_count'
    }, inplace=True)
    
    print(pv_conv.columns)
    
    2. Renaming in bulk

    Assigning to columns changes several names at once:

    pv_conv.columns = ['首页PV', 'search_pv', 'register_count', '下单用户数', '订单数']
    print(pv_conv.columns)
    
    3. Rule-based renaming with a list comprehension

    Useful when the new names follow a rule, e.g. upper-casing every column name:

    pv_conv.columns = [c.upper() for c in pv_conv.columns]
    print(pv_conv)
    

    Output: (screenshot)

    4. Table operations

    4.1 union

    Concatenate the contents of two DataFrames into one:

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, 9, 10],
                [11, 12, 13, 14, 15],
                [16, 17, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.shape)
    
    pv_conv = pv_conv.append(pv_conv)
    print(pv_conv.shape)
    
    4.2 join

    DataFrame.join relates one DataFrame to one or more other Series or DataFrames; pass a list to join several at once.

    Signature

    DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
    

    Parameters:

    Parameter Meaning
    other the Series or DataFrame(s) to join; pass a list for several
    on column(s) of the calling DataFrame to match against other's index; by default the two indexes are joined
    how join type, like SQL's left join / right join; one of left, right, outer, inner
    lsuffix suffix appended to overlapping column names from the left side, to avoid clashes
    rsuffix suffix appended to overlapping column names from the right side, to avoid clashes
    sort whether to sort the result by the join key

    Example:

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    js2 = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["APP下载","APP活跃"],
        "data":[
                [111,112],
                [121,122],
                [131,132],
                [141,142]
         ]
    }
    '''
    
    print(pv_conv)
    
    pv_conv2 = pd.read_json(js2, orient='split')
    print(pv_conv2)
    
    pv_conv3 = pv_conv.join(pv_conv2)
    print(pv_conv3)
    

    Output: (screenshot)
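The example above has no overlapping column names. When both frames do share a column, the join raises an error unless lsuffix/rsuffix are given; a sketch of the overlap case with made-up frames (note these are suffixes appended to the names, not prefixes):

```python
import pandas as pd

left = pd.DataFrame({"PV": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"PV": [10, 20]}, index=["a", "b"])

# the overlapping column 'PV' needs suffixes to disambiguate
joined = left.join(right, lsuffix="_left", rsuffix="_right")
print(joined.columns.tolist())  # ['PV_left', 'PV_right']
```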
    4.3 Set difference

    DataFrame has no built-in set-difference method, so we build one with drop_duplicates: append pv_conv2 to pv_conv twice, so that every row of pv_conv2 is guaranteed to occur more than once, then drop_duplicates(keep=False) removes all of those rows, leaving only the rows unique to pv_conv.

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    js2 = '''
    {
        "index":["2020-09-01","2020-09-02"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10]
         ]
    }
    '''
    
    pv_conv2 = pd.read_json(js2, orient='split')
    
    pv_diff = pv_conv.append(pv_conv2)
    pv_diff = pv_diff.append(pv_conv2)
    pv_diff = pv_diff.drop_duplicates(keep=False)
    
    print(pv_diff)
    

    Output: (screenshot)
    4.4 Statistics
    1. describe: count, mean, min/max, percentiles
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.describe())
    

    Output: (screenshot)

    For non-numeric columns, describe returns the row count, the number of distinct values, the most frequent value, and how often it occurs.
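For example, describe() on a string Series (a made-up sketch):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "a"])
print(s.describe())
# count 4, unique 2, top 'a', freq 3
```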

    2. value_counts: frequency statistics
    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.value_counts())
    print(pv_conv['首页PV'].value_counts())
    

    On a DataFrame, value_counts (added in pandas 1.1) counts occurrences of each distinct row; on a Series it returns how often each distinct value occurs.

    Output: (screenshot)
    3. corr: correlation between columns

    Computes the pairwise correlation between columns: a positive value means positive correlation (when one rises, the other rises too), a negative value means negative correlation (one rises while the other falls), and the closer the absolute value is to 1, the stronger the relationship.

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(pv_conv.corr())
    

    Output: (screenshot)
    4. Percentiles

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [3, 7, 8, null, 10],
                [5, null, 13, 14, 15],
                [7, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    
    print(pv_conv['首页PV'].quantile(0.25))
    

    Output: (screenshot)
    5. Mean

    Example (this one reuses a csv DataFrame, loaded earlier from a file with a 用户ID column):

    print(csv[csv['用户ID'] < 18000].head(3)['用户ID'].mean())
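A self-contained version on the pv_conv data (values from the test table at the top):

```python
import pandas as pd

pv_conv = pd.DataFrame({"首页PV": [985, 211, 688, 766]})

# mean of a single column
print(pv_conv["首页PV"].mean())  # 662.5
```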
    
    4.5 Column slicing

    For row slicing, see sections 2.1 and 2.2 on index-based and positional slicing.

    1. Selecting specific columns

    pv_conv['订单数'] returns a Series; to get a DataFrame instead, pass a list of column names: pv_conv[['订单数', '首页PV']]

    Example

    js = '''
    {
        "index":["2020-09-01","2020-09-02","2020-09-03","2020-09-04"],
        "columns":["首页PV","搜索页PV","注册数","下单用户数","订单数"],
        "data":[
                [1, 2, 3, 4, 5],
                [6, 7, 8, null, 10],
                [11, null, 13, 14, 15],
                [16, null, 18, 19, 20]
         ]
    }
    '''
    
    pv_conv = pd.read_json(js, orient='split')
    print(type(pv_conv['订单数']))
    print(pv_conv['订单数'])
    
    print(type(pv_conv[['订单数', '首页PV']]))
    print(pv_conv[['订单数', '首页PV']])
    

    Output: (screenshot)

    5. Export

    5.1 to_csv

    If the output contains Chinese text, you must specify the encoding:

    db.to_csv('e:/test_csv.csv', encoding='utf-8')
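A round trip through an in-memory buffer shows how the index is written out and read back (encoding only matters for on-disk files; the frame here is made up):

```python
import io
import pandas as pd

df = pd.DataFrame({"首页PV": [985, 211]}, index=["2020-09-01", "2020-09-02"])

buf = io.StringIO()
df.to_csv(buf, index_label="日期")  # write the index under an explicit column name
buf.seek(0)

back = pd.read_csv(buf, index_col="日期")
print(back)
```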
    
    5.2 to_json

    The orient parameter selects the output layout; the default is orient='columns'.

    db.to_json("e:/test_json.json")
    
    5.3 to_sql
    db.to_sql('table_name',mysql_conn)
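Note that pandas accepts a raw DBAPI connection only for sqlite3; for MySQL, to_sql needs an SQLAlchemy engine. A self-contained sqlite3 sketch:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"pv": [985, 211]})

# sqlite3 connections are supported directly by to_sql / read_sql_query
conn = sqlite3.connect(":memory:")
df.to_sql("pv_table", conn, index=False)

back = pd.read_sql_query("select * from pv_table", conn)
print(back)
```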
    


    Original: https://www.haomeiwen.com/subject/mbueyktx.html