Data Science and Artificial Intelligence Technical Notes, 19: Data Wrangling (1)

Author: 布客飞龙 | Published 2019-01-01 17:19

    19. Data Wrangling (1)

    Author: Chris Albon

    Translator: 飞龙

    License: CC BY-NC-SA 4.0

    Apply a function by group in Pandas

    import pandas as pd
    
    # Create an example dataframe
    data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
           'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
    df = pd.DataFrame(data)
    df
    
        Casualties Platoon
    0            1       A
    1            4       A
    2            5       A
    3            7       A
    4            5       A
    5            5       A
    6            6       B
    7            1       B
    8            4       B
    9            5       B
    10           6       B
    11           7       C
    12           4       C
    13           6       C
    14           4       C
    15           6       C
    # Group df by df.Platoon, then apply a rolling-mean
    # lambda function to df.Casualties
    df.groupby('Platoon')['Casualties'].apply(lambda x:x.rolling(center=False,window=2).mean())
    
    '''
    0     NaN
    1     2.5
    2     4.5
    3     6.0
    4     6.0
    5     5.0
    6     NaN
    7     3.5
    8     2.5
    9     4.5
    10    5.5
    11    NaN
    12    5.5
    13    5.0
    14    5.0
    15    5.0
    dtype: float64
    ''' 
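
    As an aside that is not in the original notes: newer pandas versions (0.18.1 and later) also expose rolling windows directly on the GroupBy object, which should give the same per-platoon values, just returned with a (Platoon, row) MultiIndex. A minimal sketch:

    # Hedged alternative: the rolling window is applied within each Platoon
    # group, so it never crosses a group boundary
    df.groupby('Platoon')['Casualties'].rolling(window=2).mean()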
    

    Apply operations to groups in Pandas

    # Import modules
    import pandas as pd
    
    # Create a dataframe
    raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
            'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
            'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
    df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
    df
    
          regiment company      name  preTestScore  postTestScore
    0   Nighthawks     1st    Miller             4             25
    1   Nighthawks     1st  Jacobson            24             94
    2   Nighthawks     2nd       Ali            31             57
    3   Nighthawks     2nd    Milner             2             62
    4     Dragoons     1st     Cooze             3             70
    5     Dragoons     1st     Jacon             4             25
    6     Dragoons     2nd    Ryaner            24             94
    7     Dragoons     2nd      Sone            31             57
    8       Scouts     1st     Sloan             2             62
    9       Scouts     1st     Piger             3             70
    10      Scouts     2nd     Riani             2             62
    11      Scouts     2nd       Ali             3             70
    # Create a groupby variable that groups preTestScores by regiment
    groupby_regiment = df['preTestScore'].groupby(df['regiment'])
    groupby_regiment
    
    # <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550> 
    

    "This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups." -- PyDA
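
    To see that the GroupBy object really only stores grouping information and has not computed any result yet, you can inspect it directly. A minimal sketch (an aside, not part of the original notes):

    # The mapping from each regiment to its row labels is already known...
    groupby_regiment.groups
    
    # ...and the number of groups can be queried, but no aggregation happens
    # until a method such as .mean() or .sum() is called on the object
    len(groupby_regiment)   # 3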

    Use list() to show what the grouping looks like.

    list(df['preTestScore'].groupby(df['regiment']))
    
    '''
    [('Dragoons', 4     3
      5     4
      6    24
      7    31
      Name: preTestScore, dtype: int64), ('Nighthawks', 0     4
      1    24
      2    31
      3     2
      Name: preTestScore, dtype: int64), ('Scouts', 8     2
      9     3
      10    2
      11    3
      Name: preTestScore, dtype: int64)] 
    '''
    
    df['preTestScore'].groupby(df['regiment']).describe()
    
                count   mean        std  min   25%   50%    75%   max
    regiment
    Dragoons      4.0  15.50  14.153916  3.0  3.75  14.0  25.75  31.0
    Nighthawks    4.0  15.25  14.453950  2.0  3.50  14.0  25.75  31.0
    Scouts        4.0   2.50   0.577350  2.0  2.00   2.5   3.00   3.0
    # Mean preTestScore for each regiment
    groupby_regiment.mean()
    
    '''
    regiment
    Dragoons      15.50
    Nighthawks    15.25
    Scouts         2.50
    Name: preTestScore, dtype: float64 
    '''
    
    df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
    
    '''
    regiment    company
    Dragoons    1st         3.5
                2nd        27.5
    Nighthawks  1st        14.0
                2nd        16.5
    Scouts      1st         2.5
                2nd         2.5
    Name: preTestScore, dtype: float64 
    '''
    
    df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
    
    company      1st   2nd
    regiment
    Dragoons     3.5  27.5
    Nighthawks  14.0  16.5
    Scouts       2.5   2.5
    # Group the entire dataframe by regiment and company
    df.groupby(['regiment', 'company']).mean()
    
                         preTestScore  postTestScore
    regiment   company
    Dragoons   1st                3.5           47.5
               2nd               27.5           75.5
    Nighthawks 1st               14.0           59.5
               2nd               16.5           59.5
    Scouts     1st                2.5           66.0
               2nd                2.5           66.0
    # Number of observations in each regiment and company
    df.groupby(['regiment', 'company']).size()
    
    '''
    regiment    company
    Dragoons    1st        2
                2nd        2
    Nighthawks  1st        2
                2nd        2
    Scouts      1st        2
                2nd        2
    dtype: int64 
    '''
    
    # Group the dataframe by regiment, and for each regiment,
    for name, group in df.groupby('regiment'): 
        # print the name of the regiment
        print(name)
        # print the data of that regiment
        print(group)
    
    
    '''
    Dragoons
       regiment company    name  preTestScore  postTestScore
    4  Dragoons     1st   Cooze             3             70
    5  Dragoons     1st   Jacon             4             25
    6  Dragoons     2nd  Ryaner            24             94
    7  Dragoons     2nd    Sone            31             57
    Nighthawks
         regiment company      name  preTestScore  postTestScore
    0  Nighthawks     1st    Miller             4             25
    1  Nighthawks     1st  Jacobson            24             94
    2  Nighthawks     2nd       Ali            31             57
    3  Nighthawks     2nd    Milner             2             62
    Scouts
       regiment company   name  preTestScore  postTestScore
    8    Scouts     1st  Sloan             2             62
    9    Scouts     1st  Piger             3             70
    10   Scouts     2nd  Riani             2             62
    11   Scouts     2nd    Ali             3             70 
    '''
    

    Group by columns:

    Specifically in this case: group the columns by their data type (that is, axis=1), then use list() to see what that grouping looks like.

    list(df.groupby(df.dtypes, axis=1))
    
    '''
    [(dtype('int64'),     preTestScore  postTestScore
      0              4             25
      1             24             94
      2             31             57
      3              2             62
      4              3             70
      5              4             25
      6             24             94
      7             31             57
      8              2             62
      9              3             70
      10             2             62
      11             3             70),
     (dtype('O'),       regiment company      name
      0   Nighthawks     1st    Miller
      1   Nighthawks     1st  Jacobson
      2   Nighthawks     2nd       Ali
      3   Nighthawks     2nd    Milner
      4     Dragoons     1st     Cooze
      5     Dragoons     1st     Jacon
      6     Dragoons     2nd    Ryaner
      7     Dragoons     2nd      Sone
      8       Scouts     1st     Sloan
      9       Scouts     1st     Piger
      10      Scouts     2nd     Riani
      11      Scouts     2nd       Ali)] 
    '''
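
    If the goal is simply to pull out the numeric columns or the string columns, rather than to iterate over dtype groups, a hedged alternative sketch using select_dtypes (available since pandas 0.14.1) looks like this:

    # Keep only the numeric columns (preTestScore, postTestScore)
    df.select_dtypes(include=['number'])
    
    # Keep only the object (string) columns (regiment, company, name)
    df.select_dtypes(include=['object'])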
    
    df.groupby('regiment').mean().add_prefix('mean_')
    
                mean_preTestScore  mean_postTestScore
    regiment
    Dragoons                15.50                61.5
    Nighthawks              15.25                59.5
    Scouts                   2.50                66.0
    # Create a function that gets the statistics of a group
    def get_stats(group):
        return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}
    
    bins = [0, 25, 50, 75, 100]
    group_names = ['Low', 'Okay', 'Good', 'Great']
    df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)
    
    df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
    
                count   max   mean   min
    categories
    Good          8.0  70.0  63.75  57.0
    Great         2.0  94.0  94.00  94.0
    Low           2.0  25.0  25.00  25.0
    Okay          0.0   NaN    NaN   NaN
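
    The same per-category summary can also be produced with the built-in aggregation machinery instead of a hand-written get_stats; a minimal sketch, assuming the categories column created above:

    # agg() with a list of function names returns one column per statistic
    df['postTestScore'].groupby(df['categories']).agg(['min', 'max', 'count', 'mean'])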

    Apply operations over a Pandas dataframe

    # Import modules
    import pandas as pd
    import numpy as np
    
    data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
            'year': [2012, 2012, 2013, 2014, 2014], 
            'reports': [4, 24, 31, 2, 3],
            'coverage': [25, 94, 57, 62, 70]}
    df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
    df
    
                coverage   name  reports  year
    Cochice           25  Jason        4  2012
    Pima              94  Molly       24  2012
    Santa Cruz        57   Tina       31  2013
    Maricopa          62   Jake        2  2014
    Yuma              70    Amy        3  2014
    # Create an upper-casing lambda function
    capitalizer = lambda x: x.upper()
    

    Apply the capitalizer function to the name column.

    apply() can apply a function along any axis of the dataframe.

    df['name'].apply(capitalizer)
    
    '''
    Cochice       JASON
    Pima          MOLLY
    Santa Cruz     TINA
    Maricopa       JAKE
    Yuma            AMY
    Name: name, dtype: object 
    '''
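
    Because apply() on a dataframe works along either axis, a row-wise computation can be sketched as follows (a hedged aside, not part of the original example):

    # axis=1 passes each row in as a Series, so a new value can be
    # derived from several columns at once
    df.apply(lambda row: row['coverage'] - row['reports'], axis=1)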
    

    Map the capitalizer lambda function to each element of the series name.

    map() applies an operation to every element of a series.

    df['name'].map(capitalizer)
    
    '''
    Cochice       JASON
    Pima          MOLLY
    Santa Cruz     TINA
    Maricopa       JAKE
    Yuma            AMY
    Name: name, dtype: object 
    '''
    

    Apply a square root function to every single cell in the whole dataframe.

    applymap() applies a function to every element of the dataframe.

    # Drop the string variable so that applymap() can run
    df = df.drop('name', axis=1)
    
    # Return the square root of every cell in the dataframe
    df.applymap(np.sqrt)
    
                coverage   reports       year
    Cochice     5.000000  2.000000  44.855323
    Pima        9.695360  4.898979  44.855323
    Santa Cruz  7.549834  5.567764  44.866469
    Maricopa    7.874008  1.414214  44.877611
    Yuma        8.366600  1.732051  44.877611

    Apply a function over the dataframe.

    # Create a function called times100
    def times100(x):
        # if x is a string,
        if type(x) is str:
            # return it unchanged
            return x
        # if x is not a string but is truthy, return it multiplied by 100
        elif x:
            return 100 * x
        # otherwise (e.g. zero or None), return None
        else:
            return
    
    df.applymap(times100)
    
                coverage  reports    year
    Cochice         2500      400  201200
    Pima            9400     2400  201200
    Santa Cruz      5700     3100  201300
    Maricopa        6200      200  201400
    Yuma            7000      300  201400

    Assign a new column to a Pandas dataframe

    import pandas as pd
    
    # Create an empty dataframe
    df = pd.DataFrame()
    
    # Create a column
    df['name'] = ['John', 'Steve', 'Sarah']
    
    # View the dataframe
    df
    
        name
    0   John
    1  Steve
    2  Sarah
    # Assign a new column to df called age, containing a list of ages
    df.assign(age = [31, 32, 19])
    
        name  age
    0   John   31
    1  Steve   32
    2  Sarah   19
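
    Note that assign() returns a new dataframe and leaves df itself unchanged, so the result has to be captured if the column should be kept. A minimal sketch:

    # assign() returns a copy; save it back to keep the new column
    df = df.assign(age = [31, 32, 19])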

    Break a list into chunks of size N

    In this snippet we take a list and break it into chunks of size n. This is a very common practice when working with APIs that have a maximum request size.

    This nifty function comes courtesy of Ned Batchelder, who posted it on StackOverflow.

    # Create a list of first names
    first_names = ['Steve', 'Jane', 'Sara', 'Mary','Jack','Bob', 'Bily', 'Boni', 'Chris','Sori', 'Will', 'Won','Li']
    
    # Create a function called chunks with two arguments, l and n
    def chunks(l, n):
        # For every index i from 0 to the length of l, in steps of n,
        for i in range(0, len(l), n):
            # yield the slice of l from i up to (but not including) i + n
            yield l[i:i+n]
    
    # Create a list from the results of the chunks function
    list(chunks(first_names, 5))
    
    '''
    [['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],
     ['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],
     ['Will', 'Won', 'Li']] 
    '''
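
    Because chunks() is written with yield, it is a generator, so the batches can also be consumed lazily, for example one API request per chunk. A minimal sketch:

    # Iterate over the generator without materialising the whole list;
    # each batch holds at most 5 names
    for batch in chunks(first_names, 5):
        print(len(batch), batch)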
    

    Break up a string into columns using regex in Pandas

    # Import modules
    import re
    import pandas as pd
    
    # Create a dataframe with a single column of strings
    data = {'raw': ['Arizona 1 2014-12-23       3242.0',
                    'Iowa 1 2010-02-23       3453.7',
                    'Oregon 0 2014-06-20       2123.0',
                    'Maryland 0 2014-03-14       1123.6',
                    'Florida 1 2013-01-15       2134.0',
                    'Georgia 0 2012-07-14       2345.6']}
    df = pd.DataFrame(data, columns = ['raw'])
    df
    
    raw
    0 Arizona 1 2014-12-23 3242.0
    1 Iowa 1 2010-02-23 3453.7
    2 Oregon 0 2014-06-20 2123.0
    3 Maryland 0 2014-03-14 1123.6
    4 Florida 1 2013-01-15 2134.0
    5 Georgia 0 2012-07-14 2345.6
    # Which rows of df['raw'] contain 'xxxx-xx-xx'?
    df['raw'].str.contains('....-..-..', regex=True)
    
    '''
    0    True
    1    True
    2    True
    3    True
    4    True
    5    True
    Name: raw, dtype: bool 
    '''
    
    # In the raw column, extract the single digit in the string
    df['female'] = df['raw'].str.extract(r'(\d)', expand=True)
    df['female']
    
    '''
    0    1
    1    1
    2    0
    3    0
    4    1
    5    0
    Name: female, dtype: object 
    '''
    
    # In the raw column, extract the xxxx-xx-xx date from the string
    df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
    df['date']
    
    '''
    0    2014-12-23
    1    2010-02-23
    2    2014-06-20
    3    2014-03-14
    4    2013-01-15
    5    2012-07-14
    Name: date, dtype: object 
    '''
    
    # In the raw column, extract the ####.# number from the string
    df['score'] = df['raw'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
    df['score']
    
    '''
    0    3242.0
    1    3453.7
    2    2123.0
    3    1123.6
    4    2134.0
    5    2345.6
    Name: score, dtype: object 
    '''
    
    # In the raw column, extract the word from the string
    df['state'] = df['raw'].str.extract(r'([A-Z]\w{0,})', expand=True)
    df['state']
    
    '''
    0     Arizona
    1        Iowa
    2      Oregon
    3    Maryland
    4     Florida
    5     Georgia
    Name: state, dtype: object 
    '''
    
    df
    
                                  raw  female        date   score     state
    0  Arizona 1 2014-12-23 3242.0          1  2014-12-23  3242.0   Arizona
    1  Iowa 1 2010-02-23 3453.7             1  2010-02-23  3453.7      Iowa
    2  Oregon 0 2014-06-20 2123.0           0  2014-06-20  2123.0    Oregon
    3  Maryland 0 2014-03-14 1123.6         0  2014-03-14  1123.6  Maryland
    4  Florida 1 2013-01-15 2134.0          1  2013-01-15  2134.0   Florida
    5  Georgia 0 2012-07-14 2345.6          0  2012-07-14  2345.6   Georgia
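
    As a hedged variation that is not in the original: all the fields can be pulled out in one pass by giving str.extract a single pattern with several named capture groups, which returns one column per group:

    # One regex with named groups yields a dataframe with one column per group
    df['raw'].str.extract(r'(?P<state>[A-Z]\w*) (?P<female>\d) (?P<date>\d{4}-\d{2}-\d{2})\s+(?P<score>\d+\.\d)', expand=True)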

    Find the columns shared by two dataframes

    # Import library
    import pandas as pd
    
    # Create a dataframe
    dataframe_one = pd.DataFrame()
    dataframe_one['1'] = ['1', '1', '1']
    dataframe_one['B'] = ['b', 'b', 'b']
    
    # Create a second dataframe
    dataframe_two = pd.DataFrame()
    dataframe_two['2'] = ['2', '2', '2']
    dataframe_two['B'] = ['b', 'b', 'b']
    
    # Convert the columns of each dataframe into sets,
    # then find the intersection of those two sets.
    # This is the set of columns shared by both dataframes.
    set.intersection(set(dataframe_one), set(dataframe_two))
    
    # {'B'} 
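
    A hedged alternative sketch: pandas Index objects have their own intersection method, so the shared columns can also be found without converting to Python sets, and the result stays an Index:

    # Intersect the two column indexes directly
    dataframe_one.columns.intersection(dataframe_two.columns)
    
    # Index(['B'], dtype='object')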
    

    Construct a dictionary from multiple lists

    # Create a list of officer names
    officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']
    
    # Create a list of officer armies
    officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']
    
    # Create a dictionary that is the zip of the two lists
    dict(zip(officer_names, officer_armies))
    
    '''
    {'Chris Jefferson': 'Orange Army',
     'Jessica Billars': 'Green Army',
     'Michael Mulligan': 'Red Army',
     'Sodoni Dogla': 'Purple Army',
     'Steven Johnson': 'Blue Army'} 
    '''
    

    Convert a CSV into Python code to recreate it

    # Import the pandas package
    import pandas as pd
    
    # Load the csv file as a dataframe
    df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
    df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
    
    # Print out code that recreates the dataframe
    print('==============================')
    print('RUN THE CODE BELOW THIS LINE')
    print('==============================')
    print('raw_data =', df.to_dict(orient='list'))
    print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')
    
    '''
    ==============================
    RUN THE CODE BELOW THIS LINE
    ==============================
    raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 
0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 
3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}
    df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
    '''
    
    
    # If you want to check the result:
    # 1. paste the code generated above into this cell
    raw_data = {'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 
2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], 'Sepal.Length': 
[5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 
4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996]}
    df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
    
    # View the first few rows of the dataframe recreated from the generated code
    df.head()
    
       Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0           1           5.1          3.5           1.4          0.2  setosa
    1           2           4.9          3.0           1.4          0.2  setosa
    2           3           4.7          3.2           1.3          0.2  setosa
    3           4           4.6          3.1           1.5          0.2  setosa
    4           5           5.0          3.6           1.4          0.2  setosa
    # View the first few rows of the original dataframe, to check that they match
    df_original.head()
    
       Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0           1           5.1          3.5           1.4          0.2  setosa
    1           2           4.9          3.0           1.4          0.2  setosa
    2           3           4.7          3.2           1.3          0.2  setosa
    3           4           4.6          3.1           1.5          0.2  setosa
    4           5           5.0          3.6           1.4          0.2  setosa

    Convert a categorical variable into dummy variables

    # Import modules
    import pandas as pd
    
    # Create a dataframe
    raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
            'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
            'sex': ['male', 'female', 'male', 'female', 'female']}
    df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'sex'])
    df
    
      first_name last_name     sex
    0      Jason    Miller    male
    1      Molly  Jacobson  female
    2       Tina       Ali    male
    3       Jake    Milner  female
    4        Amy     Cooze  female
    # Create a set of dummy variables from the sex variable
    df_sex = pd.get_dummies(df['sex'])
    
    # Join the dummy variables to the main dataframe
    df_new = pd.concat([df, df_sex], axis=1)
    df_new
    
      first_name last_name     sex  female  male
    0      Jason    Miller    male     0.0   1.0
    1      Molly  Jacobson  female     1.0   0.0
    2       Tina       Ali    male     0.0   1.0
    3       Jake    Milner  female     1.0   0.0
    4        Amy     Cooze  female     1.0   0.0
    # An alternative way to join the new columns
    df_new = df.join(df_sex)
    df_new
    
      first_name last_name     sex  female  male
    0      Jason    Miller    male     0.0   1.0
    1      Molly  Jacobson  female     1.0   0.0
    2       Tina       Ali    male     0.0   1.0
    3       Jake    Milner  female     1.0   0.0
    4        Amy     Cooze  female     1.0   0.0
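
    For a binary variable such as sex, one of the two dummy columns is redundant (each is always one minus the other). A hedged sketch, assuming a pandas version where get_dummies supports drop_first (0.18.0 and later):

    # Keep only one dummy column to avoid perfectly collinear predictors
    pd.get_dummies(df['sex'], drop_first=True)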

    Convert a categorical variable into dummy variables using patsy

    # Import modules
    import pandas as pd
    import patsy
    
    # Create a dataframe
    raw_data = {'countrycode': [1, 2, 3, 2, 1]} 
    df = pd.DataFrame(raw_data, columns = ['countrycode'])
    df
    
       countrycode
    0            1
    1            2
    2            3
    3            2
    4            1
    # Convert the countrycode variable into three binary variables
    patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
    
       C(countrycode)[1]  C(countrycode)[2]  C(countrycode)[3]
    0                1.0                0.0                0.0
    1                0.0                1.0                0.0
    2                0.0                0.0                1.0
    3                0.0                1.0                0.0
    4                1.0                0.0                0.0

    Convert a string categorical variable into a numeric variable

    # Import modules
    import pandas as pd
    
    raw_data = {'patient': [1, 1, 1, 2, 2], 
            'obs': [1, 2, 3, 1, 2], 
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong']} 
    df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
    df
    
       patient  obs  treatment   score
    0        1    1          0  strong
    1        1    2          1    weak
    2        1    3          0  normal
    3        2    1          1    weak
    4        2    2          0  strong
    # Create a function that converts every value of df['score'] into a number
    def score_to_numeric(x):
        if x=='strong':
            return 3
        if x=='normal':
            return 2
        if x=='weak':
            return 1
    
    df['score_num'] = df['score'].apply(score_to_numeric)
    df
    
       patient  obs  treatment   score  score_num
    0        1    1          0  strong          3
    1        1    2          1    weak          1
    2        1    3          0  normal          2
    3        2    1          1    weak          1
    4        2    2          0  strong          3
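
    The same conversion can be written more compactly by mapping a dictionary over the column; a minimal sketch (labels missing from the dict would become NaN):

    # map() with a dict replaces each label by its numeric code
    df['score_num'] = df['score'].map({'strong': 3, 'normal': 2, 'weak': 1})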
