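Using the Titanic dataset from Kaggle's introductory competition as an example:
I. Handling missing values
1. Print the number of missing values per column: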
# training and test are assumed to be the already-loaded train and test DataFrames
print("Training columns with null values:\n", training.isnull().sum())
print("-" * 20)
print("Test columns with null values:\n", test.isnull().sum())
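2. Fill with the median (the code uses median(), not the mean; the median is less sensitive to outliers)
# assign back instead of calling fillna(inplace=True) on a column selection
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())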
3. Fill with the mode
# mode() returns a Series, so take its first entry
dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])
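The three steps above can be tried end to end on a small frame. This is a minimal sketch, not the original notebook: the toy values below are made up for illustration, and only the column names Age and Embarked come from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training data (values are illustrative only).
training = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Step 1: count missing values per column before filling.
null_counts_before = training.isnull().sum()

# Step 2: fill the numeric column with its median.
age_median = training["Age"].median()
training["Age"] = training["Age"].fillna(age_median)

# Step 3: fill the categorical column with its most frequent value.
embarked_mode = training["Embarked"].mode()[0]
training["Embarked"] = training["Embarked"].fillna(embarked_mode)

null_counts_after = training.isnull().sum()
print(null_counts_before)
print(null_counts_after)
```

Assigning the filled column back to the frame avoids relying on inplace=True applied to a column selection, which newer pandas versions warn about.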
II. Encoding categorical data with LabelEncoder and generating dummy variables
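from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
# datasets is assumed to hold both DataFrames; the name all would shadow the built-in
datasets = [training, test]
for dataset in datasets:
    # 'Title' must already exist (see section III)
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])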
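# FamilySize and IsAlone are engineered features assumed to have been created already
data1_x = ['Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
# index with the list itself, not the string 'data1_x'
training_dummy = pd.get_dummies(training[data1_x])
training_dummy.head()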
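To see what the two encodings produce, here is a small sketch on made-up rows. The frame and the features list are hypothetical; LabelEncoder and pd.get_dummies are the same calls used above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows (illustrative values, not real passengers).
training = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Pclass": [3, 1, 3, 2],
})

# LabelEncoder maps each class to an integer, in sorted order of the labels.
label = LabelEncoder()
training["Sex_Code"] = label.fit_transform(training["Sex"])

# get_dummies one-hot encodes the string columns and passes numeric ones through.
features = ["Sex", "Embarked", "Pclass"]
training_dummy = pd.get_dummies(training[features])

print(training["Sex_Code"].tolist())
print(sorted(training_dummy.columns))
```

Integer codes imply an ordering between categories, which is why one-hot dummy columns are often preferred for nominal features such as Embarked.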
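III. Processing passenger names
1. Extract the title from the name
# "Braund, Mr. Owen Harris" -> "Mr": take the part after ", ", then the part before the first "."
dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]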
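A quick check of the split logic on a few names in the dataset's "Surname, Title. Given names" format. This sketch assumes every name contains exactly one ", " separator; names that break that pattern would need extra handling.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Keep the text after ", ", then keep what precedes the first ".".
after_comma = names.str.split(", ", expand=True)[1]
titles = after_comma.str.split(".", expand=True)[0]

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```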