Pandas - 5. Handling Missing Values


Author: 陈天睡懒觉 | Published: 2022-05-22 17:49

Detecting missing values

  • isnull()
  • notnull()
import numpy as np
import pandas as pd
from numpy import nan
# NumPy 2.0 removed the NaN and NAN aliases, so define them locally
NaN = NAN = nan
print(pd.isnull(NaN))
print(pd.isnull(NAN))
print(pd.isnull(nan))
print(pd.isnull(True))
True
True
True
False
print(pd.notnull(NaN))
print(pd.notnull(NAN))
print(pd.notnull(nan))
print(pd.notnull(True))
False
False
False
True
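The same checks apply element-wise to a Series or DataFrame. The following is a minimal sketch on a made-up Series (the values are illustrative and not taken from the article's data); it also shows why an equality test cannot stand in for isnull().

```python
import numpy as np
import pandas as pd

# Illustrative data: None is converted to NaN in a float Series
s = pd.Series([1.0, np.nan, None, 3.0])

null_mask = s.isnull()      # True where the value is missing
valid_mask = s.notnull()    # element-wise opposite of isnull()
print(null_mask.tolist())
print(valid_mask.tolist())

# NaN never compares equal to anything, including itself,
# so `s == np.nan` cannot be used to find missing values
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```

Because of that comparison rule, isnull() and notnull() are the dependable way to build a missing-value mask.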

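Missing values produced when reading files

Three parameters of pd.read_csv() relate to missing values:

  • na_values: additional values to treat as missing. For example, na_values=[99] treats 99 as a missing value.
  • keep_default_na: boolean, default True. When True, the values given in na_values are added to the default set of missing-value markers. When False, only the values given in na_values are treated as missing.
  • na_filter: boolean, default True, meaning missing-value markers are detected and encoded as NaN. When False, no value is encoded as NaN, which can speed up reading data that is known to contain no missing values.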
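To make the three parameters concrete, here is a small sketch that reads an in-memory CSV instead of a file. The CSV text and the column names id and score are invented for illustration; only the read_csv parameters come from the text above.

```python
import io
import pandas as pd

# Hypothetical CSV: the last row has an empty score field
csv_text = "id,score\n1,99\n2,85\n3,\n"

# Default behaviour: the empty field becomes NaN
default = pd.read_csv(io.StringIO(csv_text))

# Also treat 99 as missing, in addition to the default markers
custom = pd.read_csv(io.StringIO(csv_text), na_values=[99])

# Disable missing-value detection: the column stays as text
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(default["score"].isnull().sum())
print(custom["score"].isnull().sum())
print(raw["score"].tolist())
```

With na_filter=False the empty field is kept as an empty string, so the column is no longer numeric.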
print(pd.read_csv('data/survey_visited.csv'))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# Load the data without the default missing-value markers
print(pd.read_csv('data/survey_visited.csv',
                 keep_default_na=False))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3            
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# Specify the missing-value markers manually
print(pd.read_csv('data/survey_visited.csv',
                  na_values=[''],
                 keep_default_na=False))
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22

Counting missing values

ebola = pd.read_csv('data/country_timeseries.csv')
# Count the non-missing values in each column
print(ebola.count())
Date                   122
Day                    122
Cases_Guinea            93
Cases_Liberia           83
Cases_SierraLeone       87
Cases_Nigeria           38
Cases_Senegal           25
Cases_UnitedStates      18
Cases_Spain             16
Cases_Mali              12
Deaths_Guinea           92
Deaths_Liberia          81
Deaths_SierraLeone      87
Deaths_Nigeria          38
Deaths_Senegal          22
Deaths_UnitedStates     18
Deaths_Spain            16
Deaths_Mali             12
dtype: int64
# Show the number of missing values in each column
num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count()  # the scalar is broadcast across the Series
print(num_missing)
Date                     0
Day                      0
Cases_Guinea            29
Cases_Liberia           39
Cases_SierraLeone       35
Cases_Nigeria           84
Cases_Senegal           97
Cases_UnitedStates     104
Cases_Spain            106
Cases_Mali             110
Deaths_Guinea           30
Deaths_Liberia          41
Deaths_SierraLeone      35
Deaths_Nigeria          84
Deaths_Senegal         100
Deaths_UnitedStates    104
Deaths_Spain           106
Deaths_Mali            110
dtype: int64
# Count the total number of missing values
print(np.count_nonzero(ebola.isnull()))
# Count the missing values in a single column
print(np.count_nonzero(ebola['Cases_Liberia'].isnull()))
1214
39

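Ways to handle missing values

Replacing: fillna()

  • ebola.fillna(0): recode/replace
    Replaces every missing value with 0.
  • ebola.fillna(method='ffill'): forward fill (uses existing, earlier data)
    Fills each missing value with the previous value. If a column starts with a missing value, that value stays missing. Recent pandas releases deprecate the method argument; ebola.ffill() is the equivalent call.
  • ebola.fillna(method='bfill'): backward fill (uses later data)
    Fills each missing value with the next value. If a column ends with a missing value, that value stays missing. The equivalent call is ebola.bfill().

Note: these methods accept an inplace parameter that modifies the original object instead of returning a new one. This can reduce memory use on large datasets, although a speedup is not guaranteed.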
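The three replacement strategies can be compared side by side on a short Series. This is a sketch on invented values; it uses the ffill() and bfill() methods rather than fillna(method=...), on the assumption that the reader's pandas version may no longer accept the method argument.

```python
import numpy as np
import pandas as pd

# Illustrative Series that starts and ends with a missing value
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

zero_filled = s.fillna(0)   # replace every NaN with 0
forward = s.ffill()         # copy the previous valid value forward
backward = s.bfill()        # copy the next valid value backward

print(zero_filled.tolist())
print(forward.tolist())     # the leading NaN has no previous value
print(backward.tolist())    # the trailing NaN has no next value
```

The leading missing value survives the forward fill and the trailing one survives the backward fill, which matches the behaviour described in the list above.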

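Interpolating: interpolate()

  • ebola.interpolate(): fills missing values by linear interpolation by default. The method parameter selects a different interpolation method. Interpolation does not necessarily fill every missing value.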
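A short sketch of the default linear interpolation, on an invented Series with a leading missing value and a two-element gap. It illustrates why some missing values can remain: there is no earlier point to interpolate from at the start.

```python
import numpy as np
import pandas as pd

# Illustrative data: one leading NaN and a gap between 1.0 and 4.0
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Default method is linear: the gap is filled with evenly spaced values
filled = s.interpolate()

print(filled.tolist())
print(int(filled.isnull().sum()))  # the leading NaN is still missing
```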

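Dropping: dropna()

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

  • axis: axis=0 or 'index' drops rows that contain missing values; axis=1 or 'columns' drops columns that contain missing values.
  • how: selection rule. 'any' drops a row/column if it contains at least one missing value; 'all' drops a row/column only if all of its values are missing.
  • thresh: int, the minimum number of non-missing values. A row/column with fewer non-missing values than this is dropped.
  • subset: a list of row or column labels that restricts where missing values are looked for, so only part of the data decides what is dropped.
  • inplace: boolean, whether to modify the original data.

subset works crosswise: when dropping rows, the subset is a set of columns; when dropping columns, the subset is a set of rows.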
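Before the subset example, the how and thresh parameters can be illustrated on a small frame. This is a sketch with invented data; the column names x, y, z are placeholders.

```python
import numpy as np
import pandas as pd

# Illustrative data: rows have 3, 2, 0 and 1 non-missing values
df_small = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, np.nan],
    "y": [2.0, 5.0, np.nan, np.nan],
    "z": [3.0, 6.0, np.nan, 9.0],
})

any_dropped = df_small.dropna()            # how='any': keep only complete rows
all_dropped = df_small.dropna(how="all")   # drop only rows that are entirely NaN
thresh_kept = df_small.dropna(thresh=2)    # keep rows with at least 2 non-missing values

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(thresh_kept.index.tolist())
```

Each call sets either how or thresh, not both, since the two express competing rules for the same decision.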

# Build the dataset
import numpy as np
import pandas as pd
 
m = np.ones((11, 10))
for i in range(len(m)):
    m[i,:i] = np.nan
df = pd.DataFrame(data=m,
                 columns=['A','B','C','D','E','F','G','H','I','J'],
                index=['a','b','c','d','e','f','g','h','i','j','k'])
print(df)
     A    B    C    D    E    F    G    H    I    J
a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
b  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
c  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
d  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0
e  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0
f  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0
g  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
h  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0
i  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0
j  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0
k  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# Drop the columns that have missing values in rows 'a', 'b', 'c'
print(df.dropna(axis=1, subset=['a','b','c']))
     C    D    E    F    G    H    I    J
a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
b  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
c  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
d  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0
e  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0
f  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0
g  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
h  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0
i  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0
j  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0
k  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
# Drop the rows that have missing values in columns 'H', 'I', 'J'
print(df.dropna(axis=0, subset=['H','I','J']))
     A    B    C    D    E    F    G    H    I    J
a  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
b  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
c  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
d  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0
e  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0  1.0
f  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  1.0
g  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
h  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0

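Calculations with missing values

A calculation that involves a missing value usually returns a missing value. Reductions such as sum and mean can skip missing values, and the skipna parameter controls whether they do.

  • df.sum(skipna=True): skipna defaults to True, which ignores missing values.
  • df.mean(skipna=False): returns NaN for any column that contains a missing value.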
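The effect of skipna is easy to see on a three-element Series. The values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

propagated = 1 + np.nan               # plain arithmetic propagates NaN
sum_skip = s.sum()                    # skipna=True by default: 1.0 + 3.0
mean_skip = s.mean()                  # mean of the two valid values
sum_strict = s.sum(skipna=False)      # NaN because a value is missing
mean_strict = s.mean(skipna=False)    # NaN because a value is missing

print(propagated, sum_skip, mean_skip, sum_strict, mean_strict)
```

Note that the default mean divides by the number of valid values (2 here), not by the length of the Series.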

Viewing the rows that contain missing values

visited = pd.read_csv('data/survey_visited.csv',
                  na_values=[''],
                 keep_default_na=False)
print(visited)
   ident   site       dated
0    619   DR-1  1927-02-08
1    622   DR-1  1927-02-10
2    734   DR-3  1939-01-07
3    735   DR-3  1930-01-12
4    751   DR-3  1930-02-26
5    752   DR-3         NaN
6    837  MSK-4  1932-01-14
7    844   DR-1  1932-03-22
# isnull() checks every element of the DataFrame: True if the element is missing, False otherwise. The result is a DataFrame of the same shape.
miss = visited.isnull()
print(miss)
   ident   site  dated
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False  False
5  False  False   True
6  False  False  False
7  False  False  False
# any(axis=1) returns True for each row that contains at least one missing value; to find rows in which every value is missing, use all(axis=1) instead.
print(miss.any(axis=1))
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
dtype: bool
print(visited[miss.any(axis=1)])
   ident  site dated
5    752  DR-3   NaN

In one step

print(visited[visited.isnull().any(axis=1)])
   ident  site dated
5    752  DR-3   NaN
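The survey data has no row that is entirely missing, so the difference between any() and all() does not show up there. The sketch below uses a made-up frame (the values only imitate the survey columns) to contrast the two selections.

```python
import pandas as pd

# Illustrative data: row 1 is entirely missing, row 2 is partly missing
records = pd.DataFrame({
    "site": ["DR-1", None, None],
    "dated": ["1927-02-08", None, "1930-01-12"],
})

mask = records.isnull()
rows_with_any = records[mask.any(axis=1)]   # at least one missing value
rows_with_all = records[mask.all(axis=1)]   # every value missing

print(rows_with_any.index.tolist())
print(rows_with_all.index.tolist())
```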


Article link: https://www.haomeiwen.com/subject/neokprtx.html