Data preprocessing methods
- scikit-learn modules
Dimensionality reduction: decomposition
Data preprocessing: preprocessing
Missing-value imputation: impute
Feature selection: feature_selection
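Feature scaling (making features dimensionless)
- Normalization (preprocessing.MinMaxScaler)
MinMaxScaler shifts each feature by its minimum and divides by its range (max - min). With the default feature_range=(0, 1), every feature is mapped onto the closed interval [0, 1].
Because the minimum and maximum set the scale, the result is highly sensitive to outliers.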
from sklearn.preprocessing import MinMaxScaler
data = [[-1,2],[-0.5,6],[0,10],[1,18]]
# Normalize each column to [0, 1]
scaler = MinMaxScaler(feature_range=(0,1))
result = scaler.fit_transform(data)
# Recover the original data
scaler.inverse_transform(result)
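To make the formula and the outlier problem concrete, here is a minimal sketch that redoes the scaling by hand with NumPy and compares it with MinMaxScaler. The extra row `[100, 1000]` is an invented outlier, added only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Manual min-max scaling, column by column: (x - min) / (max - min)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)

# The scaler should give the same numbers
sk = MinMaxScaler().fit_transform(data)
matches = np.allclose(manual, sk)

# Append one made-up outlier row and rescale:
# the four original rows get squeezed into a tiny part of [0, 1]
with_outlier = np.vstack([data, [100.0, 1000.0]])
squeezed = MinMaxScaler().fit_transform(with_outlier)
largest_original = squeezed[:4].max()
```

With the outlier present, `largest_original` ends up far below 1, which is why min-max scaling is usually a poor choice for data with extreme values.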
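- Standardization (preprocessing.StandardScaler)
After standardization each feature has mean 0 and variance 1. The shape of the distribution does not change, so the result is standard normal only if the feature was normally distributed to begin with.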
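from sklearn.preprocessing import StandardScaler
data = [[-1,2],[-0.5,6],[0,10],[1,18]]
# Standardize each column
scaler = StandardScaler(copy=True,with_mean=True,with_std=True)
x_std = scaler.fit_transform(data)
# After scaling: per-column mean is ~0 and standard deviation is 1
x_std.mean(axis=0)
x_std.std(axis=0)
# Before scaling: per-column mean and variance learned during fit
scaler.mean_
scaler.var_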
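As a quick check of what StandardScaler computes, the sketch below standardizes the same toy data by hand. It assumes the population standard deviation (NumPy's default, `ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Manual standardization, column by column: (x - mean) / std
mean = data.mean(axis=0)
std = data.std(axis=0)  # population std (ddof=0)
manual = (data - mean) / std

scaler = StandardScaler().fit(data)
sk = scaler.transform(data)

same_result = np.allclose(manual, sk)
learned_mean = scaler.mean_  # per-column mean of the original data
learned_var = scaler.var_    # per-column variance of the original data
```

The fitted attributes `mean_` and `var_` describe the data before scaling; the transformed columns are the ones with mean 0 and standard deviation 1.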
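Handling missing values
- Imputation class (impute.SimpleImputer)
# Parameters:
missing_values: the placeholder that marks a missing entry (default np.nan)
strategy: "mean" (default), "median", "most_frequent", or "constant"
fill_value: the value used when strategy="constant"
copy: whether to work on a copy of the input (default True)
- Code walkthrough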
import pandas as pd
data = pd.read_csv(r"./train.csv",index_col=0)
# Inspect the data
data.head()
data.info()
# Import the imputer
from sklearn.impute import SimpleImputer
# SimpleImputer expects 2-D input, hence reshape(-1,1); ravel() turns the result back into a column
Age = data.loc[:,"Age"].values.reshape(-1,1)
imp_median = SimpleImputer(strategy="median") # fill with the median
data.loc[:,"Age"] = imp_median.fit_transform(Age).ravel()
Embarked = data.loc[:,"Embarked"].values.reshape(-1,1)
imp_most = SimpleImputer(strategy="most_frequent") # fill with the mode
data.loc[:,"Embarked"] = imp_most.fit_transform(Embarked).ravel()
data.info()
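To see what each `strategy` value does without needing train.csv, here is a small sketch on a made-up single-column array with one missing entry. The numbers are invented for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (row 1)
X = np.array([[1.0], [np.nan], [3.0], [3.0], [10.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
median_filled = SimpleImputer(strategy="median").fit_transform(X)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)
# fill_value is only used together with strategy="constant"
const_filled = SimpleImputer(strategy="constant", fill_value=0.0).fit_transform(X)

# The value that replaced the NaN under each strategy
filled = {
    "mean": mean_filled[1, 0],
    "median": median_filled[1, 0],
    "most_frequent": mode_filled[1, 0],
    "constant": const_filled[1, 0],
}
```

The mean is pulled upward by the value 10, while the median and mode are not, which is one reason the median is used for Age above.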
Encoding features and labels as numbers
- Label encoding: LabelEncoder (classes become the integers 0 to n_classes-1)
- Feature encoding: OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
y = data.iloc[:,-1] # treat the last column as the label
le = LabelEncoder()
data.iloc[:,-1] = le.fit_transform(y) # encode the label as integers
le.classes_ # the classes seen during fit
data.head(10)
# For features, use preprocessing.OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder
data_ = data.copy()
OrdinalEncoder().fit(data_.iloc[:,3:4]).categories_
data_.iloc[:,3:4] = OrdinalEncoder().fit_transform(data_.iloc[:,3:4])
data_.head()
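The difference between the two encoders is mostly the input shape: LabelEncoder takes a 1-D label vector, OrdinalEncoder takes a 2-D feature matrix. A minimal sketch on invented values (not read from train.csv):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: 1-D input, classes are sorted and numbered from 0
labels = ["S", "C", "Q", "S", "C"]
le = LabelEncoder()
codes = le.fit_transform(labels)
classes = list(le.classes_)
restored = list(le.inverse_transform(codes))  # back to the original strings

# OrdinalEncoder: 2-D input (n_samples, n_features), one code table per column
features = np.array([["male"], ["female"], ["female"], ["male"]])
oe = OrdinalEncoder()
encoded = oe.fit_transform(features)
categories = list(oe.categories_[0])
```

Both encoders assign codes in sorted order of the category values, so the integers carry no meaning beyond identity unless the categories are passed in a chosen order.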
- One-hot (dummy) encoding of features: OneHotEncoder (0/1 indicator columns)
from sklearn.preprocessing import OneHotEncoder
X = data.iloc[:,3:4]
enc = OneHotEncoder(categories='auto').fit(X)
result = enc.transform(X).toarray() # reuse the fitted encoder
# Inspect the result
pd.DataFrame(result)
enc.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
# Reuse data's index so that concat aligns the rows
newdata = pd.concat([data,pd.DataFrame(result,index=data.index)],axis=1)
newdata.drop(["Sex"],axis=1,inplace=True)
newdata.columns = [ "Survived","Pclass","Name","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked","x0_female", "x0_male"]
- Feature binarization: Binarizer (example: binarizing age)
from sklearn.preprocessing import Binarizer
data_2 = data.copy()
X = data_2.iloc[:,4].values.reshape(-1,1)
# Ages above 30 become 1, all others become 0
binarized = Binarizer(threshold=30).fit_transform(X)
binarized
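One detail worth checking is what happens exactly at the threshold. A minimal sketch on made-up ages:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy ages, including one that sits exactly on the threshold
ages = np.array([[22.0], [30.0], [31.0], [54.0]])

# Values strictly greater than the threshold map to 1, the rest to 0
flags = Binarizer(threshold=30).fit_transform(ages).ravel()
```

The age 30 itself maps to 0, because only values strictly above the threshold are set to 1.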
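- Binning continuous variables: preprocessing.KBinsDiscretizer
Parameters: n_bins (number of bins), encode ('onehot', 'onehot-dense', or 'ordinal'), strategy ('uniform', 'quantile', or 'kmeans')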
from sklearn.preprocessing import KBinsDiscretizer
X = data.iloc[:,4].values.reshape(-1,1)
est = KBinsDiscretizer(n_bins=3,encode='ordinal',strategy='uniform')
est.fit_transform(X)
# Check which bin codes occur
set(est.fit_transform(X).ravel())
# {0.0, 1.0, 2.0}
est = KBinsDiscretizer(n_bins=3,encode='onehot',strategy='uniform')
est.fit_transform(X).toarray()
# The remaining steps are the same as for OneHotEncoder above
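To see where the bin boundaries fall, the fitted estimator exposes `bin_edges_`. A small sketch on invented ages, using the 'uniform' strategy so the bins have equal width:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy ages spanning 0 to 60
ages = np.array([[0.0], [10.0], [25.0], [39.0], [55.0], [60.0]])

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = est.fit_transform(ages).ravel()

# One array of edges per feature; 'uniform' splits [min, max] into equal widths
edges = est.bin_edges_[0]
```

Here the edges are 0, 20, 40 and 60, so each age is replaced by the index of the 20-year-wide interval it falls into.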