
Feature Engineering

Author: 陈文瑜 | Published 2019-10-13 20:22

    Data preprocessing methods

    • scikit-learn modules (representative imports are sketched after the list)

    Dimensionality reduction: decomposition
    Data preprocessing: preprocessing
    Missing value imputation: impute
    Feature selection: feature_selection
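
    As a quick orientation, the sketch below imports one representative class from each of the four modules listed above; the particular classes (PCA, MinMaxScaler, SimpleImputer, SelectKBest) are common examples chosen here for illustration, not the only options in each module.

    # one representative class from each scikit-learn module listed above
    from sklearn.decomposition import PCA               # dimensionality reduction
    from sklearn.preprocessing import MinMaxScaler      # data preprocessing
    from sklearn.impute import SimpleImputer            # missing value imputation
    from sklearn.feature_selection import SelectKBest   # feature selection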

    Feature scaling (making data dimensionless)

    • Min-max normalization (preprocessing.MinMaxScaler)
      x^* = \frac{x-\min(x)}{\max(x)-\min(x)}

    Min-max scaling centers (shifts) and rescales the data so that it falls within the range given by the feature_range parameter, which defaults to [0, 1].
    It is highly sensitive to outliers.

    from sklearn.preprocessing import MinMaxScaler
    import pandas as pd
    data = [[-1,2],[-0.5,6],[0,10],[1,18]]
    # apply min-max normalization
    scaler = MinMaxScaler(feature_range=[0,1])
    result = scaler.fit_transform(data)
    # recover the original data
    scaler.inverse_transform(result)
    
    • Standardization (preprocessing.StandardScaler)

    After standardization, each feature has mean 0 and standard deviation 1. Note that this only recenters and rescales the data; it does not turn a non-normal distribution into a normal one.

    from sklearn.preprocessing import StandardScaler
    data = [[-1,2],[-0.5,6],[0,10],[1,18]]
    # standardize the data
    scaler = StandardScaler(copy=True,with_mean=True,with_std=True)
    x_std = scaler.fit_transform(data)
    # mean and std of the standardized data (approximately 0 and 1)
    x_std.mean()
    x_std.std()
    # per-column mean and variance of the original data, stored by the fitted scaler
    scaler.mean_
    scaler.var_
    

    Handling missing values

    • Missing value imputation (impute.SimpleImputer); a small constant-fill sketch follows the parameter list
    # Parameters:
    missing_values
    strategy (mean, median, most_frequent, constant)
    fill_value
    copy
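
    The walkthrough below uses the median and most_frequent strategies. As a minimal sketch of the remaining option, a constant fill might look like the following; the toy array and the fill value 0 are illustrative assumptions, not part of the original example.

    import numpy as np
    from sklearn.impute import SimpleImputer
    X = np.array([[1.0], [np.nan], [3.0]])
    # replace every missing value with the constant given by fill_value
    imp_const = SimpleImputer(strategy="constant", fill_value=0)
    imp_const.fit_transform(X)  # array([[1.], [0.], [3.]])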
    
    • Code walkthrough
    import pandas as pd
    data = pd.read_csv(r"./train.csv",index_col=0)
    # inspect the data
    data.head()
    data.info()
    # import the imputer
    from sklearn.impute import SimpleImputer
    # extract each column and choose a fill strategy for it
    Age = data.loc[:,"Age"].values.reshape(-1,1)
    imp_median = SimpleImputer(strategy="median")  # fill with the median
    data.loc[:,"Age"] = imp_median.fit_transform(Age)

    Embarked = data.loc[:,"Embarked"].values.reshape(-1,1)
    imp_most = SimpleImputer(strategy="most_frequent") # fill with the mode
    data.loc[:,"Embarked"] = imp_most.fit_transform(Embarked)

    data.info()
    

    Converting features and labels to numbers

    • Label encoding: LabelEncoder (labels become integers 0, 1, 2, ...)
    • Ordinal feature encoding: OrdinalEncoder
    from sklearn.preprocessing import LabelEncoder
    y = data.iloc[:,-1]
    le = LabelEncoder()
    data.iloc[:,-1] = le.fit_transform(y) # encode the label column as integers
    le.classes_ # view the discovered classes
    data.head(10)

    # feature-only encoder: preprocessing.OrdinalEncoder
    from sklearn.preprocessing import OrdinalEncoder
    data_ = data.copy()
    OrdinalEncoder().fit(data_.iloc[:,3:4]).categories_
    data_.iloc[:,3:4] = OrdinalEncoder().fit_transform(data_.iloc[:,3:4])
    data_.head()
    
    • One-hot (dummy) encoding of features: OneHotEncoder (0/1 indicator columns)
    from sklearn.preprocessing import OneHotEncoder
    X = data.iloc[:,3:4]
    enc = OneHotEncoder(categories='auto').fit(X)

    result = enc.transform(X).toarray()
    # inspect the result
    pd.DataFrame(result)
    enc.get_feature_names()  # names of the generated columns (get_feature_names_out in newer sklearn)

    # align indices before concatenating, then drop the original Sex column
    newdata = pd.concat([data,pd.DataFrame(result,index=data.index)],axis=1)
    newdata.drop(["Sex"],axis=1,inplace=True)
    newdata.columns = [ "Survived","Pclass","Name","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked","x0_female", "x0_male"]
    
    • Feature binarization: Binarizer (example: binarizing age by a threshold)
      x^{'}= \begin{cases}1, \quad & x > threshold,\\ 0, \quad & x \le threshold \end{cases}
    from sklearn.preprocessing import Binarizer
    data_2 = data.copy()
    X = data_2.iloc[:,4].values.reshape(-1,1)   # the Age column
    age_binary = Binarizer(threshold=30).fit_transform(X)  # ages above 30 become 1, the rest 0
    age_binary
    
    • Binning continuous variables: preprocessing.KBinsDiscretizer

    Parameters: n_bins, encode, strategy

    from sklearn.preprocessing import KBinsDiscretizer
    X = data.iloc[:,4].values.reshape(-1,1)   # the Age column

    # three equal-width bins, encoded as ordinal integers
    est = KBinsDiscretizer(n_bins=3,encode='ordinal',strategy='uniform')
    est.fit_transform(X)
    # inspect the distinct bin labels
    set(est.fit_transform(X).ravel())
    # {0.0, 1.0, 2.0}

    # the same bins, one-hot encoded (sparse output, hence toarray)
    est = KBinsDiscretizer(n_bins=3,encode='onehot',strategy='uniform')
    est.fit_transform(X).toarray()
    # subsequent steps are the same as for OneHotEncoder above
    
