This section covers how to handle categorical variables; let's jump straight into the code.
# Get list of categorical variables
s = (X_train.dtypes == 'object')  # boolean Series: True where a column's dtype is object (i.e. categorical)
object_cols = list(s[s].index)    # s[s] keeps only the True entries; their index gives the categorical column names
print("Categorical variables:")
print(object_cols)
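To make the s[s] trick concrete, here is a minimal, self-contained sketch on a made-up two-column frame (the column names color and price are purely illustrative, not from the competition data):
import pandas as pd

# Toy frame: one object-dtype column, one numeric column (illustrative only)
toy = pd.DataFrame({'color': ['red', 'blue'], 'price': [10, 20]})
s = (toy.dtypes == 'object')   # color -> True, price -> False
print(list(s[s].index))        # ['color']  (only the object-dtype column survives the boolean filter)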
Next, define a scoring function that uses mean absolute error (MAE) to compare the different approaches.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Score from Approach 1 (Drop Categorical Variables)
We drop the object columns with the select_dtypes() method.
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
Score from Approach 2 (Label Encoding)
Scikit-learn has a LabelEncoder class that can be used to get label encodings. We loop over the categorical variables and apply the label encoder separately to each column.
from sklearn.preprocessing import LabelEncoder

# Make copies to avoid changing the original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()  # instantiate the encoder
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
In the code block above, each unique category in a column is assigned a different integer more or less arbitrarily. This is a common approach that is simpler than providing custom labels; however, if we supply better-informed labels for the ordinal variables (those whose categories have a natural order), we can expect a further boost in performance.
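As a minimal sketch of what better-informed labels could look like, suppose (hypothetically) the data had an ordinal column named Quality whose categories run 'Po' < 'Fa' < 'Gd' < 'Ex'; we could then encode that order explicitly instead of letting LabelEncoder assign integers alphabetically:
# Hypothetical ordinal column "Quality" -- illustrative only, not necessarily present in this dataset
quality_order = {'Po': 0, 'Fa': 1, 'Gd': 2, 'Ex': 3}   # worst -> best

ordinal_X_train = X_train.copy()
ordinal_X_valid = X_valid.copy()
ordinal_X_train['Quality'] = ordinal_X_train['Quality'].map(quality_order)
ordinal_X_valid['Quality'] = ordinal_X_valid['Quality'].map(quality_order)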
Score from Approach 3 (One-Hot Encoding)
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.
- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data (a small demo of this behavior follows the code below), and
- setting sparse=False ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix).
To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode the training data, we supply X_train[object_cols]. (object_cols in the code cell below is a list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical data in the training set.)
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
# handle_unknown='ignore': avoid errors when the validation set contains categories not seen in training
# sparse=False: return the encoded columns as a numpy array rather than a sparse matrix
# (note: in scikit-learn >= 1.2 this parameter is named sparse_output)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (they will be replaced with the one-hot encoding);
# what remains in num_X_* are the numerical features
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features; OH_X_* now holds the
# numerical features plus the one-hot encoded categorical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
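For a concrete look at the handle_unknown='ignore' behavior mentioned above, here is a tiny standalone sketch (toy data, not from this dataset): a category that was never seen during fit is encoded as an all-zeros row instead of raising an error.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)   # sparse_output=False on scikit-learn >= 1.2
demo_encoder.fit(pd.DataFrame({'Color': ['red', 'blue', 'blue']}))    # training categories: blue, red
print(demo_encoder.transform(pd.DataFrame({'Color': ['green']})))     # [[0. 0.]] -- unseen category, no error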
Which approach is best?
In this case, dropping the categorical columns (Approach 1) performed worst, since it had the highest MAE score. As for the other two approaches, since the returned MAE scores are so close in value, there doesn't appear to be any meaningful benefit to one over the other.
In general, one-hot encoding (Approach 3) will typically perform best, and dropping the categorical columns (Approach 1) typically performs worst, but it varies on a case-by-case basis.
Conclusion
The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!