数据分析工具pandas快速入门教程5-处理缺失数据

作者: python测试开发 | 来源:发表于2018-08-24 13:17 被阅读3次

    第5章 缺失数据

    介绍

    很少没有任何缺失值的数据集。 有许多缺失数据的表示。 在数据库中是NULL值,一些编程语言使用NA。缺失值可以是空字符串:''或者甚至是数值88或99等。Pandas显示缺失值为NaN。

    本章将涵盖:

    • 什么是缺失值
    • 如何创建缺失值
    • 如何重新编码并使用缺失值进行计算

    什么是缺失值

    可以从numpy中获得NaN值,在Python中看到缺失值使用几种方式显示:NaN,NAN或nan,他们都是相等的。

    NaN不等于0或空字符串''。

    
    In [1]: from numpy import NaN, NAN, nan
    
    In [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|')
    False|False|False|False
    
    In [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|')
    False|False|False|False
    
    In [4]: import pandas as pd
    
    In [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|')
    True|True|True
    
    In [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|')
    False|True|True
    
    

    缺失值的来源

    来自加载数据或数据处理

    • 加载数据

    当我们加载数据时,pandas会自动找到该缺少数据的单元格,并填充NaN值。在read_csv函数中,参数na_values, keep_default_na, na_filter用于处理缺失值。比如:na_values=[99]。na_filter设置为False,在读大文件时会提升性能。

    5-1.py

    
    import pandas as pd
    
    visited_file = 'data/survey_visited.csv'
    print(pd.read_csv(visited_file))
    print(pd.read_csv(visited_file, keep_default_na=False))
    print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False))
    

    执行结果

    
    $ python3 5-1.py 
       ident   site       dated
    0    619   DR-1  1927-02-08
    1    622   DR-1  1927-02-10
    2    734   DR-3  1939-01-07
    3    735   DR-3  1930-01-12
    4    751   DR-3  1930-02-26
    5    752   DR-3         NaN
    6    837  MSK-4  1932-01-14
    7    844   DR-1  1932-03-22
       ident   site       dated
    0    619   DR-1  1927-02-08
    1    622   DR-1  1927-02-10
    2    734   DR-3  1939-01-07
    3    735   DR-3  1930-01-12
    4    751   DR-3  1930-02-26
    5    752   DR-3            
    6    837  MSK-4  1932-01-14
    7    844   DR-1  1932-03-22
       ident   site       dated
    0    619   DR-1  1927-02-08
    1    622   DR-1  1927-02-10
    2    734   DR-3  1939-01-07
    3    735   DR-3  1930-01-12
    4    751   DR-3  1930-02-26
    5    752   DR-3         NaN
    6    837  MSK-4  1932-01-14
    7    844   DR-1  1932-03-22
    
    
    • 合并数据
    
    import pandas as pd
    
    visited = pd.read_csv('data/survey_visited.csv')
    survey = pd.read_csv('data/survey_survey.csv')
    print(visited)
    print(survey)
    vs = visited.merge(survey, left_on='ident', right_on='taken')
    print(vs)
    

    执行结果

    
    $ python3 5-2.py 
       ident   site       dated
    0    619   DR-1  1927-02-08
    1    622   DR-1  1927-02-10
    2    734   DR-3  1939-01-07
    3    735   DR-3  1930-01-12
    4    751   DR-3  1930-02-26
    5    752   DR-3         NaN
    6    837  MSK-4  1932-01-14
    7    844   DR-1  1932-03-22
        taken person quant  reading
    0     619   dyer   rad     9.82
    1     619   dyer   sal     0.13
    2     622   dyer   rad     7.80
    3     622   dyer   sal     0.09
    4     734     pb   rad     8.41
    5     734   lake   sal     0.05
    6     734     pb  temp   -21.50
    7     735     pb   rad     7.22
    8     735    NaN   sal     0.06
    9     735    NaN  temp   -26.00
    10    751     pb   rad     4.35
    11    751     pb  temp   -18.50
    12    751   lake   sal     0.10
    13    752   lake   rad     2.19
    14    752   lake   sal     0.09
    15    752   lake  temp   -16.00
    16    752    roe   sal    41.60
    17    837   lake   rad     1.46
    18    837   lake   sal     0.21
    19    837    roe   sal    22.50
    20    844    roe   rad    11.25
        ident   site       dated  taken person quant  reading
    0     619   DR-1  1927-02-08    619   dyer   rad     9.82
    1     619   DR-1  1927-02-08    619   dyer   sal     0.13
    2     622   DR-1  1927-02-10    622   dyer   rad     7.80
    3     622   DR-1  1927-02-10    622   dyer   sal     0.09
    4     734   DR-3  1939-01-07    734     pb   rad     8.41
    5     734   DR-3  1939-01-07    734   lake   sal     0.05
    6     734   DR-3  1939-01-07    734     pb  temp   -21.50
    7     735   DR-3  1930-01-12    735     pb   rad     7.22
    8     735   DR-3  1930-01-12    735    NaN   sal     0.06
    9     735   DR-3  1930-01-12    735    NaN  temp   -26.00
    10    751   DR-3  1930-02-26    751     pb   rad     4.35
    11    751   DR-3  1930-02-26    751     pb  temp   -18.50
    12    751   DR-3  1930-02-26    751   lake   sal     0.10
    13    752   DR-3         NaN    752   lake   rad     2.19
    14    752   DR-3         NaN    752   lake   sal     0.09
    15    752   DR-3         NaN    752   lake  temp   -16.00
    16    752   DR-3         NaN    752    roe   sal    41.60
    17    837  MSK-4  1932-01-14    837   lake   rad     1.46
    18    837  MSK-4  1932-01-14    837   lake   sal     0.21
    19    837  MSK-4  1932-01-14    837    roe   sal    22.50
    20    844   DR-1  1932-03-22    844    roe   rad    11.25
    
    
    • 用户输入
    
    import pandas as pd
    from numpy import NaN, NAN, nan
    
    num_legs = pd.Series({'goat': 4, 'amoeba': nan})
    print(num_legs)
    scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
                               'Occupation': ['Chemist', 'Statistician'],
                               'Born': ['1920-07-25', '1876-06-13'],
                               'Died': ['1958-04-16', '1937-10-16'],
                               'missing': [NaN, nan]})
    print(scientists)
    scientists['missing'] = nan
    print(scientists)
    

    执行结果

    
    $ python3 5-3.py 
    amoeba    NaN
    goat      4.0
    dtype: float64
             Born        Died               Name    Occupation  missing
    0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN
    1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN
             Born        Died               Name    Occupation  missing
    0  1920-07-25  1958-04-16  Rosaline Franklin       Chemist      NaN
    1  1876-06-13  1937-10-16     William Gosset  Statistician      NaN
    
    
    • 重新索引

    5-4.py

    
    import pandas as pd
    from numpy import NaN, NAN, nan
    
    gapminder = pd.read_csv('data/gapminder.tsv', sep='\t')
    life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
    print(life_exp)
    print(life_exp.reindex(range(2000, 2010)))
    

    执行结果

    
    year
    1952    49.057620
    1957    51.507401
    1962    53.609249
    1967    55.678290
    1972    57.647386
    1977    59.570157
    1982    61.533197
    1987    63.212613
    1992    64.160338
    1997    65.014676
    2002    65.694923
    2007    67.007423
    Name: lifeExp, dtype: float64
    year
    2000          NaN
    2001          NaN
    2002    65.694923
    2003          NaN
    2004          NaN
    2005          NaN
    2006          NaN
    2007    67.007423
    2008          NaN
    2009          NaN
    Name: lifeExp, dtype: float64
    
    

    处理缺失数据

    • 统计缺失数据

    5-5.py

    
    import pandas as pd
    from numpy import NaN, NAN, nan
    import numpy as np
    
    ebola = pd.read_csv('data/country_timeseries.csv')
    print(ebola.head())
    print(ebola.count())
    num_rows = ebola.shape[0]
    print("num_rows")
    print(num_rows)
    num_missing = num_rows - ebola.count()
    print("num_missing:")
    print(num_missing)
    print(np.count_nonzero(ebola.isnull()))
    print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))
    print(ebola.Cases_Guinea.value_counts(dropna=False).head())
    

    执行结果

    
     Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
    0    1/5/2015  289        2776.0            NaN            10030.0   
    1    1/4/2015  288        2775.0            NaN             9780.0   
    2    1/3/2015  287        2769.0         8166.0             9722.0   
    3    1/2/2015  286           NaN         8157.0                NaN   
    4  12/31/2014  284        2730.0         8115.0             9633.0   
    
       Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
    0            NaN            NaN                 NaN          NaN         NaN   
    1            NaN            NaN                 NaN          NaN         NaN   
    2            NaN            NaN                 NaN          NaN         NaN   
    3            NaN            NaN                 NaN          NaN         NaN   
    4            NaN            NaN                 NaN          NaN         NaN   
    
       Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
    0         1786.0             NaN              2977.0             NaN   
    1         1781.0             NaN              2943.0             NaN   
    2         1767.0          3496.0              2915.0             NaN   
    3            NaN          3496.0                 NaN             NaN   
    4         1739.0          3471.0              2827.0             NaN   
    
       Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
    0             NaN                  NaN           NaN          NaN  
    1             NaN                  NaN           NaN          NaN  
    2             NaN                  NaN           NaN          NaN  
    3             NaN                  NaN           NaN          NaN  
    4             NaN                  NaN           NaN          NaN  
    Date                   122
    Day                    122
    Cases_Guinea            93
    Cases_Liberia           83
    Cases_SierraLeone       87
    Cases_Nigeria           38
    Cases_Senegal           25
    Cases_UnitedStates      18
    Cases_Spain             16
    Cases_Mali              12
    Deaths_Guinea           92
    Deaths_Liberia          81
    Deaths_SierraLeone      87
    Deaths_Nigeria          38
    Deaths_Senegal          22
    Deaths_UnitedStates     18
    Deaths_Spain            16
    Deaths_Mali             12
    dtype: int64
    num_rows
    122
    num_missing:
    Date                     0
    Day                      0
    Cases_Guinea            29
    Cases_Liberia           39
    Cases_SierraLeone       35
    Cases_Nigeria           84
    Cases_Senegal           97
    Cases_UnitedStates     104
    Cases_Spain            106
    Cases_Mali             110
    Deaths_Guinea           30
    Deaths_Liberia          41
    Deaths_SierraLeone      35
    Deaths_Nigeria          84
    Deaths_Senegal         100
    Deaths_UnitedStates    104
    Deaths_Spain           106
    Deaths_Mali            110
    dtype: int64
    1214
    29
    NaN       29
     86.0      3
     495.0     2
     112.0     2
     390.0     2
    Name: Cases_Guinea, dtype: int64
    
    
    • 处理缺失数据

    5-6.py

    
    import pandas as pd
    from numpy import NaN, NAN, nan
    import numpy as np
    
    ebola = pd.read_csv('data/country_timeseries.csv')
    print(ebola.iloc[0:10, 0:5])
    print(ebola.fillna(0).iloc[0:10, 0:5])
    # 前向填充
    print(ebola.fillna(method='ffill').iloc[0:10, 0:5])
    # 后向填充
    print(ebola.fillna(method='bfill').iloc[0:10, 0:5])
    
    print(ebola.interpolate().iloc[0:10, 0:5])
    
    print(ebola.shape)
    ebola_dropna = ebola.dropna()
    print(ebola_dropna.shape)
    print(ebola_dropna)
    
    ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \
    ebola['Cases_SierraLeone']
    
    ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',
                                 'Cases_SierraLeone', 'Cases_multiple']]
    print(ebola_subset.head(n=10))
    print(ebola.Cases_Guinea.sum(skipna = True))
    print(ebola.Cases_Guinea.sum(skipna = False))
    
    

    执行结果

    
             Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
    0    1/5/2015  289        2776.0            NaN            10030.0
    1    1/4/2015  288        2775.0            NaN             9780.0
    2    1/3/2015  287        2769.0         8166.0             9722.0
    3    1/2/2015  286           NaN         8157.0                NaN
    4  12/31/2014  284        2730.0         8115.0             9633.0
    5  12/28/2014  281        2706.0         8018.0             9446.0
    6  12/27/2014  280        2695.0            NaN             9409.0
    7  12/24/2014  277        2630.0         7977.0             9203.0
    8  12/21/2014  273        2597.0            NaN             9004.0
    9  12/20/2014  272        2571.0         7862.0             8939.0
             Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
    0    1/5/2015  289        2776.0            0.0            10030.0
    1    1/4/2015  288        2775.0            0.0             9780.0
    2    1/3/2015  287        2769.0         8166.0             9722.0
    3    1/2/2015  286           0.0         8157.0                0.0
    4  12/31/2014  284        2730.0         8115.0             9633.0
    5  12/28/2014  281        2706.0         8018.0             9446.0
    6  12/27/2014  280        2695.0            0.0             9409.0
    7  12/24/2014  277        2630.0         7977.0             9203.0
    8  12/21/2014  273        2597.0            0.0             9004.0
    9  12/20/2014  272        2571.0         7862.0             8939.0
             Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
    0    1/5/2015  289        2776.0            NaN            10030.0
    1    1/4/2015  288        2775.0            NaN             9780.0
    2    1/3/2015  287        2769.0         8166.0             9722.0
    3    1/2/2015  286        2769.0         8157.0             9722.0
    4  12/31/2014  284        2730.0         8115.0             9633.0
    5  12/28/2014  281        2706.0         8018.0             9446.0
    6  12/27/2014  280        2695.0         8018.0             9409.0
    7  12/24/2014  277        2630.0         7977.0             9203.0
    8  12/21/2014  273        2597.0         7977.0             9004.0
    9  12/20/2014  272        2571.0         7862.0             8939.0
             Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
    0    1/5/2015  289        2776.0         8166.0            10030.0
    1    1/4/2015  288        2775.0         8166.0             9780.0
    2    1/3/2015  287        2769.0         8166.0             9722.0
    3    1/2/2015  286        2730.0         8157.0             9633.0
    4  12/31/2014  284        2730.0         8115.0             9633.0
    5  12/28/2014  281        2706.0         8018.0             9446.0
    6  12/27/2014  280        2695.0         7977.0             9409.0
    7  12/24/2014  277        2630.0         7977.0             9203.0
    8  12/21/2014  273        2597.0         7862.0             9004.0
    9  12/20/2014  272        2571.0         7862.0             8939.0
             Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone
    0    1/5/2015  289        2776.0            NaN            10030.0
    1    1/4/2015  288        2775.0            NaN             9780.0
    2    1/3/2015  287        2769.0         8166.0             9722.0
    3    1/2/2015  286        2749.5         8157.0             9677.5
    4  12/31/2014  284        2730.0         8115.0             9633.0
    5  12/28/2014  281        2706.0         8018.0             9446.0
    6  12/27/2014  280        2695.0         7997.5             9409.0
    7  12/24/2014  277        2630.0         7977.0             9203.0
    8  12/21/2014  273        2597.0         7919.5             9004.0
    9  12/20/2014  272        2571.0         7862.0             8939.0
    (122, 18)
    (1, 18)
              Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
    19  11/18/2014  241        2047.0         7082.0             6190.0   
    
        Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
    19           20.0            1.0                 4.0          1.0         6.0   
    
        Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
    19         1214.0          2963.0              1267.0             8.0   
    
        Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
    19             0.0                  1.0           0.0          6.0  
       Cases_Guinea  Cases_Liberia  Cases_SierraLeone  Cases_multiple
    0        2776.0            NaN            10030.0             NaN
    1        2775.0            NaN             9780.0             NaN
    2        2769.0         8166.0             9722.0         20657.0
    3           NaN         8157.0                NaN             NaN
    4        2730.0         8115.0             9633.0         20478.0
    5        2706.0         8018.0             9446.0         20170.0
    6        2695.0            NaN             9409.0             NaN
    7        2630.0         7977.0             9203.0         19810.0
    8        2597.0            NaN             9004.0             NaN
    9        2571.0         7862.0             8939.0         19372.0
    84729.0
    nan
    
    

    参考资料

    相关文章

      网友评论

        本文标题:数据分析工具pandas快速入门教程5-处理缺失数据

        本文链接:https://www.haomeiwen.com/subject/gopbiftx.html