Detecting missing values
- isnull()
- notnull()
import pandas as pd
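# NaN, NAN, and nan are aliases for the same value in NumPy 1.x.
# NumPy 2.0 dropped some of these aliases (NaN, for one), so prefer nan / np.nan in new code.
from numpy import NaN,NAN,nan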
import numpy as np
print(pd.isnull(NaN))
print(pd.isnull(NAN))
print(pd.isnull(nan))
print(pd.isnull(True))
True
True
True
False
print(pd.notnull(NaN))
print(pd.notnull(NAN))
print(pd.notnull(nan))
print(pd.notnull(True))
False
False
False
True
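The same checks also work element-wise on a whole Series. The following is a minimal sketch, assuming a small made-up Series `s` that is not part of the original data; it shows that `None` and `np.nan` are both reported as missing and that `notnull` is the exact complement of `isnull`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: None is stored as NaN in a float Series
s = pd.Series([1.0, np.nan, None, 4.0])

null_mask = pd.isnull(s)       # element-wise check, returns a boolean Series
not_null_mask = pd.notnull(s)  # the complement of null_mask

print(null_mask.tolist())
print(not_null_mask.tolist())
# True counts as 1, so summing the mask gives the number of missing values
print(int(null_mask.sum()))
```

Summing the boolean mask is a convenient way to count missing entries without an extra helper.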
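Missing values produced when reading files
Three parameters of pd.read_csv() relate to missing values:
- na_values: additional values to treat as missing. For example, na_values=[99] treats 99 as a missing value.
- keep_default_na: Boolean, True by default, which means the values given in na_values are added to the default set of missing-value markers. If set to False, only the values listed in na_values are treated as missing.
- na_filter: Boolean, True by default, which means missing-value markers are detected and converted to NaN. If set to False, no value is converted to NaN. This can speed up reading data that is known to contain no missing values.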
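To see how the three parameters interact, here is a small sketch. It assumes a made-up CSV held in a string (`csv_text`, read through `io.StringIO`) instead of a file on disk; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content: one 'NA' marker and one empty field in the note column
csv_text = "id,score,note\n1,99,ok\n2,85,NA\n3,70,\n"

def count_missing(**kwargs):
    """Read the sample CSV with the given options and count the NaN cells."""
    frame = pd.read_csv(io.StringIO(csv_text), **kwargs)
    return int(frame.isnull().sum().sum())

# Defaults: 'NA' and the empty field are both recognized as missing
n_default = count_missing()
# na_values=[99] adds 99 to the default markers
n_extra = count_missing(na_values=[99])
# keep_default_na=False: only the values listed in na_values count as missing
n_only_99 = count_missing(na_values=[99], keep_default_na=False)
# na_filter=False: nothing is converted to NaN
n_no_filter = count_missing(na_filter=False)

print(n_default, n_extra, n_only_99, n_no_filter)
```

Comparing the four counts side by side makes it easy to check which markers a given combination of options actually recognizes.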
print(pd.read_csv('data/survey_visited.csv'))
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
# Load the data without the default missing-value markers
print(pd.read_csv('data/survey_visited.csv',
keep_default_na=False))
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
# Specify the missing-value markers manually
print(pd.read_csv('data/survey_visited.csv',
na_values=[''],
keep_default_na=False))
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
Counting missing values
ebola = pd.read_csv('data/country_timeseries.csv')
# Count the non-missing values in each column
print(ebola.count())
Date 122
Day 122
Cases_Guinea 93
Cases_Liberia 83
Cases_SierraLeone 87
Cases_Nigeria 38
Cases_Senegal 25
Cases_UnitedStates 18
Cases_Spain 16
Cases_Mali 12
Deaths_Guinea 92
Deaths_Liberia 81
Deaths_SierraLeone 87
Deaths_Nigeria 38
Deaths_Senegal 22
Deaths_UnitedStates 18
Deaths_Spain 16
Deaths_Mali 12
dtype: int64
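# Show the number of missing values in each column
num_rows = ebola.shape[0]
num_missing = num_rows - ebola.count() # broadcasting: the Series is subtracted from the scalar element by element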
print(num_missing)
Date 0
Day 0
Cases_Guinea 29
Cases_Liberia 39
Cases_SierraLeone 35
Cases_Nigeria 84
Cases_Senegal 97
Cases_UnitedStates 104
Cases_Spain 106
Cases_Mali 110
Deaths_Guinea 30
Deaths_Liberia 41
Deaths_SierraLeone 35
Deaths_Nigeria 84
Deaths_Senegal 100
Deaths_UnitedStates 104
Deaths_Spain 106
Deaths_Mali 110
dtype: int64
# Count the total number of missing values
print(np.count_nonzero(ebola.isnull()))
# Count the missing values in a single column
print(np.count_nonzero(ebola['Cases_Liberia'].isnull()))
1214
39
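An alternative that stays within pandas is to sum the boolean mask returned by isnull(). The sketch below assumes a small made-up DataFrame `df` as a stand-in for the ebola data (the column names are hypothetical) and computes the per-column and total counts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ebola DataFrame
df = pd.DataFrame({
    "Day": [0, 1, 2, 3],
    "Cases_A": [10.0, np.nan, 12.0, np.nan],
    "Cases_B": [np.nan, np.nan, 5.0, 6.0],
})

# One count per column: True is summed as 1
missing_per_column = df.isnull().sum()
# Summing the per-column counts gives a single total
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

Under these assumptions the per-column result agrees with `len(df) - df.count()`, and the total agrees with `np.count_nonzero(df.isnull())`.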
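Handling missing values
Replacing: fillna()
- ebola.fillna(0): recode/replace
Replaces every missing value with 0.
- ebola.fillna(method='ffill'): forward fill (reuses earlier data)
Fills each missing value with the previous value. If a column starts with missing values, those values stay missing.
- ebola.fillna(method='bfill'): backward fill (uses later data)
Fills each missing value with the next value. If a column ends with missing values, those values stay missing.
Note: fillna accepts an inplace parameter that modifies the original data directly, which can help efficiency on large datasets.
Note: in pandas 2.1 and later, the method argument of fillna is deprecated; ebola.ffill() and ebola.bfill() are the equivalent calls.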
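The three replacement strategies can be compared on a tiny example. This is a sketch assuming a made-up Series `s` that starts and ends with a missing value; it uses the `ffill()` and `bfill()` methods, which perform forward and backward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a leading and a trailing gap
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

filled_zero = s.fillna(0)    # every gap becomes 0
filled_forward = s.ffill()   # the leading gap has no previous value, so it stays NaN
filled_backward = s.bfill()  # the trailing gap has no next value, so it stays NaN

print(filled_zero.tolist())
print(filled_forward.tolist())
print(filled_backward.tolist())
```

Chaining the two, as in `s.ffill().bfill()`, is one way to close the gaps that a single direction leaves open.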
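Interpolating: interpolate()
- ebola.interpolate(): fills linearly by default; the method parameter selects the interpolation method. It does not necessarily fill every missing value: for example, missing values at the start of a column are left as they are.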
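A short sketch of the default behavior, assuming a made-up Series `s` with a leading gap and a two-value gap in the middle. The `limit_direction='both'` call is included only to illustrate one way the leading gap can be filled as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a leading gap, then two missing values between 1.0 and 4.0
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Default: linear interpolation, filling forward only
linear = s.interpolate()
# Filling in both directions also covers the leading gap
both = s.interpolate(limit_direction="both")

print(linear.tolist())
print(both.tolist())
```

With the default settings the interior gap is filled on a straight line between its neighbors, while the leading NaN remains.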
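Dropping: dropna()
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
- axis: axis=0/'index' drops rows that contain missing values; axis=1/'columns' drops columns that contain missing values.
- how: the selection rule. 'any' drops a row/column if it contains at least one missing value; 'all' drops a row/column only if all of its values are missing.
- thresh: int, the minimum number of non-missing values. A row/column with fewer non-missing values than this is dropped.
- subset: a list of labels that restricts where missing values are looked for, so that only part of the rows or columns is considered.
- inplace: Boolean, whether to modify the original data.
subset works crosswise: when dropping rows, the subset is a list of column labels; when dropping columns, the subset is a list of row labels.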
# Build the dataset: row i has NaN in its first i columns
import numpy as np
import pandas as pd
m = np.ones((11, 10))
for i in range(len(m)):
    m[i,:i] = np.nan
df = pd.DataFrame(data=m,
                  columns=['A','B','C','D','E','F','G','H','I','J'],
                  index=['a','b','c','d','e','f','g','h','i','j','k'])
print(df)
A B C D E F G H I J
a 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
b NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
c NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
d NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
e NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0
f NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0
g NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0
h NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0
i NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0
j NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
k NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# Drop the columns that have missing values in the rows labeled 'a', 'b', 'c'
print(df.dropna(axis=1, subset=['a','b','c']))
C D E F G H I J
a 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
b 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
c 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
d NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
e NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0
f NaN NaN NaN 1.0 1.0 1.0 1.0 1.0
g NaN NaN NaN NaN 1.0 1.0 1.0 1.0
h NaN NaN NaN NaN NaN 1.0 1.0 1.0
i NaN NaN NaN NaN NaN NaN 1.0 1.0
j NaN NaN NaN NaN NaN NaN NaN 1.0
k NaN NaN NaN NaN NaN NaN NaN NaN
# Drop the rows that have missing values in the columns labeled 'H', 'I', 'J'
print(df.dropna(axis=0, subset=['H','I','J']))
A B C D E F G H I J
a 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
b NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
c NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
d NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0 1.0
e NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0 1.0
f NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0 1.0
g NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0 1.0
h NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0
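The how and thresh parameters of dropna can be illustrated the same way. This is a minimal sketch that assumes a smaller made-up frame `small` (4 rows, 3 columns, hypothetical labels) so that the effect of each option is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 'a' is complete, row 'd' is entirely missing
small = pd.DataFrame(
    {
        "A": [1.0, np.nan, np.nan, np.nan],
        "B": [1.0, 1.0, np.nan, np.nan],
        "C": [1.0, 1.0, 1.0, np.nan],
    },
    index=["a", "b", "c", "d"],
)

any_rows = small.dropna(how="any")    # drop rows with at least one missing value
all_rows = small.dropna(how="all")    # drop rows only if every value is missing
thresh_rows = small.dropna(thresh=2)  # keep rows with at least 2 non-missing values

print(list(any_rows.index))
print(list(all_rows.index))
print(list(thresh_rows.index))
```

Each call passes only one of how and thresh, since the two express alternative selection rules.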
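Calculations with missing values
If a calculation involves a missing value, the result is usually a missing value as well. Methods such as sum and mean skip missing values by default, and the skipna parameter controls whether they do.
- df.sum(skipna=True): skipna defaults to True, which ignores missing values
- df.mean(skipna=False): returns nan for any column that contains a missing value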
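The effect of skipna is easiest to see on a single Series. The sketch below assumes a made-up Series `s` with one missing value and contrasts the default behavior with skipna=False.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
s = pd.Series([1.0, 2.0, np.nan])

total_skip = s.sum()               # NaN ignored: 1.0 + 2.0
mean_skip = s.mean()               # NaN ignored: mean of the two valid values
total_keep = s.sum(skipna=False)   # NaN propagates into the result
mean_keep = s.mean(skipna=False)   # NaN propagates into the result

print(total_skip, mean_skip, total_keep, mean_keep)
# Plain arithmetic always propagates NaN
print(np.nan + 1)
```

Note that with skipna=True the mean is taken over the valid values only, so the denominator here is 2, not 3.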
Viewing the rows that contain missing values
visited = pd.read_csv('data/survey_visited.csv',
na_values=[''],
keep_default_na=False)
print(visited)
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
# isnull() checks every element of the DataFrame: True if the value is missing, False otherwise. The result is a DataFrame of the same shape.
miss = visited.isnull()
print(miss)
ident site dated
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False True
6 False False False
7 False False False
# any with axis=1 returns True for each row that contains at least one missing value; use all instead to find the rows in which every value is missing.
print(miss.any(axis=1))
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
print(visited[miss.any(axis=1)])
ident site dated
5 752 DR-3 NaN
In one step
print(visited[visited.isnull().any(axis=1)])
ident site dated
5 752 DR-3 NaN