2020-03-28

Author: 酸菜鱼_02a6 | Published: 2020-03-28 21:44

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column, using the box-plot rule (scale=3) by default.
    :param data: pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiplier applied to the interquartile range
    :return: a copy of data with the outlier rows dropped and the index reset
    """
    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot rule.
        :param data_ser: pandas Series
        :param box_scale: multiplier applied to the interquartile range
        :return: (rule_low, rule_up) boolean masks, (val_low, val_up) bounds
        """
        # iqr holds the already scaled interquartile range (the whisker length)
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Select index labels (not positions), because drop() works on labels
    index = data_series.index[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    data_n = data_n.drop(index)
    data_n.reset_index(drop=True, inplace=True)
    print("Now row number is: {}".format(data_n.shape[0]))
    # Describe the values below the lower bound
    outliers = data_series[rule[0]]
    print("Description of data less than the lower bound is:")
    print(outliers.describe())
    # Describe the values above the upper bound
    outliers = data_series[rule[1]]
    print("Description of data larger than the upper bound is:")
    print(outliers.describe())

    # Box plots before (left) and after (right) cleaning
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n
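
To make the bounds concrete, here is a minimal sketch of the same box-plot rule applied to a toy Series. The data values and the variable names (s, whisker, mask, cleaned) are made up for illustration and are not part of the original function.

```python
import pandas as pd

# Toy data (made up): nine ordinary values and one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 200])
scale = 3

q1, q3 = s.quantile(0.25), s.quantile(0.75)
whisker = scale * (q3 - q1)            # scaled interquartile range
val_low, val_up = q1 - whisker, q3 + whisker

mask = (s < val_low) | (s > val_up)    # True where the value is an outlier
cleaned = s[~mask].reset_index(drop=True)

print("bounds:", val_low, val_up)
print("outliers:", s[mask].tolist())
print("rows kept:", len(cleaned))
```

With scale=3 only very extreme values are removed; a smaller scale such as 1.5 (the usual box-plot whisker) would cut more aggressively.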

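Missing-value handling

The IRIS dataset has no missing values, so one sample is added to the dataset with all 4 features set to NaN to represent missing data;
Fill with the mean, mode, or median;
Fill with values drawn from a normal distribution;
from sklearn.impute import SimpleImputer: this is the scikit-learn class for handling missing feature values (it replaces the older sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22);
If too much is missing, merge the feature with others or drop it.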
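
The filling strategies above could be sketched roughly as follows. This is an illustration under assumptions, not the author's code: it uses SimpleImputer from current scikit-learn, adds the all-NaN sample with np.vstack, and the normal-distribution fill (mean and standard deviation estimated from the observed values of one column) is only one possible reading of that item.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

iris = load_iris()
# Prepend one sample whose 4 features are all NaN
X = np.vstack((np.full((1, 4), np.nan), iris.data))

# strategy can be "mean", "median" or "most_frequent" (the mode)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Normal-distribution fill for a single column (assumed interpretation):
# draw replacements from N(mean, std) of the observed values
rng = np.random.default_rng(0)
col = X[:, 0].copy()
missing = np.isnan(col)
col[missing] = rng.normal(np.nanmean(col), np.nanstd(col), size=missing.sum())

print(X_mean[0], X_median[0], X_mode[0], col[0])
```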

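Binarization (applied to column vectors)
Binarization is mainly used to turn a fuzzy variable into a numeric one. It addresses information redundancy: for some quantitative features, the only useful information is which interval a value falls into. An exam score, for instance, can be reduced to 1 or 0 when only pass or fail matters.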

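from sklearn.preprocessing import Binarizer

# df is assumed to be the Bunch returned by sklearn.datasets.load_iris()
# Binarize with threshold 3: values > 3 become 1, the rest 0; returns the binarized data
Binarizer(threshold=3).fit_transform(df.data)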
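
As a small worked case of the interval idea, the sketch below binarizes a few made-up exam scores. The scores and the pass mark of 60 are assumptions for illustration; Binarizer maps values strictly greater than the threshold to 1, so the threshold is set to 59.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up exam scores, one feature column
scores = np.array([[45.0], [59.0], [60.0], [88.0]])

# Values strictly greater than the threshold become 1, all others 0
passed = Binarizer(threshold=59).fit_transform(scores)

# The same result written directly with NumPy
passed_manual = (scores > 59).astype(float)

print(passed.ravel())
```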

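Dummy encoding (applied to column vectors)
Qualitative features often cannot be used directly, so dummy encoding is commonly used to turn them into quantitative ones. If a feature has N qualitative values, it is expanded into N features: when the original value is the i-th qualitative value, the i-th expanded feature is set to 1 and all the others to 0. Compared with assigning numeric codes by hand, dummy encoding adds no extra tuning work, and for a linear model the dummy-encoded features can produce a nonlinear effect.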

from sklearn.preprocessing import OneHotEncoder

# Dummy-encode the dataset's target values; returns the encoded data (a sparse matrix by default)
OneHotEncoder().fit_transform(df.target.reshape((-1,1)))
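
To show what the N expanded features look like, here is a minimal sketch on a made-up colour column with 3 distinct values. The data and variable names are hypothetical; .toarray() is called only to print the sparse result as a dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up qualitative feature with 3 distinct values
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors)   # sparse matrix, one column per category

# Categories are sorted, so the columns are: blue, green, red
print(encoder.categories_[0])
print(encoded.toarray())
```

Each row contains exactly one 1, in the column of its own category.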

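When is normalization (not) needed?
Needed: parameter-based models and distance-based models both require feature normalization.
Not needed: tree-based methods do not require feature normalization, for example random forests and bagging or boosting ensembles built on trees.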
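
For the models that do need it, normalization might look like the sketch below. The text does not name a specific scaler, so MinMaxScaler and StandardScaler are assumptions here, chosen as two common options in scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = load_iris().data

# Rescale every feature to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Rescale every feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```

In practice the scaler would be fitted on the training split only and then applied to the test split with transform().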

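Outlier handling: reducing dirty data
a) Simple statistics: for example the summary from describe(), scatter plots, and so on;
b) The 3σ rule (for normally distributed data) or box-plot truncation;
c) Model-based outlier detection: clustering, k-nearest neighbors, One-Class SVM, Isolation Forest;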
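
Options b) and c) could be sketched as follows on synthetic data. Everything here is an assumption for illustration: the data is 1000 draws from a standard normal plus one injected value of 50, and IsolationForest is used with its default settings apart from a fixed random_state.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 1000 normal draws plus one injected extreme value
rng = np.random.default_rng(0)
x = np.append(rng.normal(0.0, 1.0, size=1000), 50.0)

# b) 3-sigma rule: flag points more than 3 standard deviations from the mean
z = np.abs(x - x.mean()) / x.std()
sigma_mask = z > 3

# c) Isolation Forest: fit_predict returns -1 for outliers and 1 for inliers
labels = IsolationForest(random_state=0).fit_predict(x.reshape(-1, 1))

print("3-sigma flags the injected point:", sigma_mask[-1])
print("Isolation Forest label of the injected point:", labels[-1])
```

Note that the mean and standard deviation in the 3σ rule are themselves pulled by extreme values, which is one reason the quantile-based box-plot rule is often preferred.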

Article link: https://www.haomeiwen.com/subject/jmqeuhtx.html
