Kaggle link
WeChat article
https://zhuanlan.zhihu.com/p/29086614
Random Forest Solution
- The training set has 891 rows and 12 columns. What each column holds:
· PassengerId: a numeric id labeling each passenger
· Survived: whether the passenger survived, 1 for survived and 0 for died. This is the column we will predict.
· Pclass: the passenger's ticket class: first (1), second (2), third (3).
· Name: the passenger's name.
· Sex: the passenger's sex, male or female.
· Age: the passenger's age. Partially missing.
· SibSp: number of siblings and spouses aboard.
· Parch: number of parents and children aboard.
· Ticket: the passenger's ticket number.
· Fare: how much the passenger paid for the ticket.
· Cabin: which cabin the passenger stayed in.
· Embarked: where the passenger boarded the Titanic.
Some thoughts:
(1) Some data is missing, e.g. Age and Cabin.
(2) Any binary classifier could be used: perceptron, logistic regression, decision tree, SVM, random forest, and so on.
(3) In data analysis, understanding the business background is very important.
Recall that as the Titanic sank, the captain's order was "women and children first, men stay behind." Knowing this background, we should expect the Sex and Age fields to be key when processing the data.
Hands-on:
1. Preliminary data analysis (statistics and plotting)
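The snippets below assume the setup used in the full code listing at the end; a minimal sketch (train.csv and test.csv are assumed to be in the working directory):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# the feature-engineering snippets operate on the combined data
all_data = pd.concat([train, test], ignore_index=True)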
1) Sex Feature: the female survival rate is far higher than the male rate
sns.barplot(x="Sex", y="Survived", data=train)
2) Pclass Feature: the higher the passenger's class, the higher the survival rate
sns.barplot(x="Pclass", y="Survived", data=train)
3) SibSp Feature: passengers with a moderate number of siblings/spouses had higher survival rates
sns.barplot(x="SibSp", y="Survived", data=train)
4) Parch Feature: passengers with a moderate number of parents/children had higher survival rates
sns.barplot(x="Parch", y="Survived", data=train)
5) The survival-density plots show a clear difference to the left of age 15: the non-overlapping area between the two density curves is large there, while at other ages the gap is small and plausibly just noise. So it is worth splitting off this low-age region (a sketch follows the figure).
facet = sns.FacetGrid(train, hue="Survived", aspect=2)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.xlabel('Age')
plt.ylabel('density')
[Figure: Age density curves by Survived]
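One way to act on this observation is a child indicator (an illustrative sketch, not part of the original pipeline; note the group-identification step later uses 12 as its child cutoff):
# flag the low-age region where the survival densities clearly diverge;
# rows with missing Age compare as False and get 0 here
train['IsChild'] = (train['Age'] <= 15).astype(int)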
6) Embarked Feature: port of embarkation vs. survival
Analysis: passengers who embarked at C had a higher survival rate, so this should also be kept as a model feature.
sns.countplot(x='Embarked', hue='Survived', data=train)
7) Title Feature (new): survival rate differs by title
Add a Title feature: extract each passenger's title from Name and group the titles into six categories.
all_data['Title'] = all_data['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
#### group the extracted titles into six categories
Title_Dict = {}
# dict.fromkeys maps every key in the list to the single value given after it
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
all_data['Title'] = all_data['Title'].map(Title_Dict)
# plot survival rate for each of the six title categories
sns.barplot(x="Title", y="Survived", data=all_data)
[Figure: survival rate by Title]
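As a concrete example, the extraction line behaves like this on the first name in the training set:
name = "Braund, Mr. Owen Harris"
# split on ',' -> ' Mr. Owen Harris'; split on '.' -> ' Mr'; strip() -> 'Mr'
print(name.split(',')[1].split('.')[0].strip())  # Mr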
8) FamilyLabel Feature (new): passengers in families of 2 to 4 had higher survival rates
Add a FamilyLabel feature: first compute FamilySize = SibSp + Parch + 1, then bin FamilySize into three classes (the binning function is shown after the figure).
all_data['FamilySize']=all_data['SibSp']+all_data['Parch']+1
sns.barplot(x="FamilySize", y="Survived", data=all_data)
[Figure: survival rate by FamilySize]
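The three-way binning itself is reproduced here for reference; it matches the Fam_label function in the full code listing at the end:
# bin by observed survival rate: families of 2-4 fare best,
# singletons and 5-7 are intermediate, 8+ fare worst
def Fam_label(s):
    if 2 <= s <= 4:
        return 2
    elif (4 < s <= 7) or s == 1:
        return 1
    elif s > 7:
        return 0
all_data['FamilyLabel'] = all_data['FamilySize'].apply(Fam_label)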
9) Deck Feature (new): survival rate differs by deck
Add a Deck feature: fill missing Cabin values with 'Unknown', then take the first letter of Cabin as the passenger's deck.
all_data['Cabin'] = all_data['Cabin'].fillna('Unknown')
all_data['Deck']=all_data['Cabin'].str.get(0)
sns.barplot(x="Deck", y="Survived", data=all_data)
[Figure: survival rate by Deck]
10) TicketGroup Feature (new): passengers sharing a ticket number with 2 to 4 people had higher survival rates
Add a TicketGroup feature: count how many passengers share each passenger's ticket number.
Ticket_Count = dict(all_data['Ticket'].value_counts())
# value_counts() is a quick way to list the distinct values in a column
# and count how many times each one appears.
all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x:Ticket_Count[x])
sns.barplot(x='TicketGroup', y='Survived', data=all_data)
[Figure: survival rate by TicketGroup size]
Bin TicketGroup into three classes by survival rate.
def Ticket_Label(s):
    if 2 <= s <= 4:
        return 2
    elif (4 < s <= 8) or s == 1:
        return 1
    elif s > 8:
        return 0
all_data['TicketGroup'] = all_data['TicketGroup'].apply(Ticket_Label)
sns.barplot(x='TicketGroup', y='Survived', data=all_data)
[Figure: survival rate by TicketGroup label]
2. Data Cleaning
import pandas as pd
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
print(train.isnull())
print(train.isnull().sum())
print(len(train))
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 False False False False False False False False False False True False
1 False False False False False False False False False False False False
.. ... ... ... ... ... ... ... ... ... ... ... ...
889 False False False False False False False False False False False False
890 False False False False False False False False False False True False
[891 rows x 12 columns]
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
891
First look at the missing data. The output above shows missing values in Age, Cabin, and Embarked in the training set (Survived is the target, so it is only missing in the test set), and the combined data also has one missing Fare. Embarked and Fare have few missing values and could be imputed directly with the mode and the median, but Age and Cabin are missing a great deal.
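As a baseline, direct mode/median imputation of the two low-count columns would look like this (a generic sketch; the solution below instead chooses fill values by inspecting related columns):
# generic fallback, not what the solution below actually does
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])
test['Fare'] = test['Fare'].fillna(test['Fare'].median())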
1) Filling missing values
Age Feature: Age has 263 missing values in the combined data, a large amount, so build a random forest regressor on the Sex, Title, and Pclass features to fill in the missing ages.
from sklearn.ensemble import RandomForestRegressor
age_df = all_data[['Age', 'Pclass','Sex','Title']]
# one-hot encode the categorical columns (Sex, Title)
age_df=pd.get_dummies(age_df)
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
# arr[:, n] selects column n from every row
y = known_age[:, 0]
# y is Age (column 0)
X = known_age[:, 1:]
# X is everything else: Pclass plus the Sex and Title dummy columns
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
# fit the regressor to predict Age from those features
predictedAges = rfr.predict(unknown_age[:, 1:])
# predict Age for the passengers whose age is unknown, then fill the gaps
all_data.loc[ (all_data.Age.isnull()), 'Age' ] = predictedAges
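A quick sanity check after the imputation (illustrative):
# every Age should now be filled
assert all_data['Age'].isnull().sum() == 0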
Embarked Feature: Embarked has 2 missing values. Both passengers with missing Embarked have Pclass 1 and Fare 80; among Pclass 1 passengers, the median Fare for Embarked C (78.27) is the closest to 80, so fill the missing values with C.
>>> all_data[all_data['Embarked'].isnull()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN Miss
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN Mrs
>>> all_data.groupby(by=["Pclass","Embarked"]).Fare.median()
Pclass Embarked
1 C 78.2667
Q 90.0000
S 52.0000
2 C 24.0000
Q 12.3500
S 13.5000
3 C 7.8958
Q 7.7500
S 8.0500
Name: Fare, dtype: float64
all_data['Embarked'] = all_data['Embarked'].fillna('C')
Fare Feature: Fare has 1 missing value. The passenger with missing Fare has Embarked S and Pclass 3, so fill with the median Fare of passengers with Embarked S and Pclass 3.
>>> all_data[all_data['Fare'].isnull()]
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
152 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S
fare=all_data[(all_data['Embarked'] == "S") & (all_data['Pclass'] == 3)].Fare.median()
>>> fare
8.05
all_data['Fare']=all_data['Fare'].fillna(fare)
2) Group identification
Group passengers with the same surname into families; from families with more than one member, extract the women and children, and the adult men, separately.
all_data['Surname']=all_data['Name'].apply(lambda x:x.split(',')[0].strip())
# strip() removes leading/trailing characters (whitespace and newlines by default)
Surname_Count = dict(all_data['Surname'].value_counts())
# value_counts() as a dict: surname -> number of occurrences
all_data['FamilyGroup'] = all_data['Surname'].apply(lambda x:Surname_Count[x])
# map each surname to its occurrence count
Female_Child_Group=all_data.loc[(all_data['FamilyGroup']>=2) & ((all_data['Age']<=12) | (all_data['Sex']=='female'))]
# women and children from groups with more than one member
Male_Adult_Group=all_data.loc[(all_data['FamilyGroup']>=2) & (all_data['Age']>12) & (all_data['Sex']=='male')]
# adult men from groups with more than one member
>>> all_data['Surname']
0 Braund
1 Cumings
2 Heikkinen
...
890 Dooley
>>> all_data['FamilyGroup']
0 2
1 1
2 1
..
890 1
It turns out that almost every women-and-children group has an average survival rate of exactly 1 or 0: the women and children of a family either all survived or all died.
Female_Child=pd.DataFrame(Female_Child_Group.groupby('Surname')['Survived'].mean().value_counts())
# group by Surname, average Survived per group, then count the distinct means
>>> Female_Child_Group.groupby('Surname')['Survived'].mean()
Surname
Abbott 1.000000
...
Yasbeck 1.000000
Zabour 0.000000
>>> Female_Child
Survived
1.000000 78
0.000000 26
0.750000 2
0.333333 1
0.142857 1
Female_Child.columns=['GroupCount']
>>> Female_Child
GroupCount
1.000000 78
0.000000 26
0.750000 2
0.333333 1
0.142857 1
sns.barplot(x=Female_Child.index, y=Female_Child["GroupCount"]).set_xlabel('AverageSurvived')
[Figure: count of women-and-children groups by average survival rate]
The vast majority of adult-male groups likewise have an average survival rate of 1 or 0.
Male_Adult=pd.DataFrame(Male_Adult_Group.groupby('Surname')['Survived'].mean().value_counts())
Male_Adult.columns=['GroupCount']
>>> Male_Adult
GroupCount
0.000000 75
1.000000 14
0.500000 4
0.333333 1
The general pattern is that women and children survived at high rates while adult men mostly did not, so we single out the groups that break this pattern for special handling. Women-and-children groups with a survival rate of 0 form the dead list, and adult-male groups with a survival rate of 1 form the survived list: women and children in a dead-list family are presumed unlikely to have survived, and adult men in a survived-list family are presumed likely to have survived.
Female_Child_Group=Female_Child_Group.groupby('Surname')['Survived'].mean()
Dead_List=set(Female_Child_Group[Female_Child_Group.apply(lambda x:x==0)].index)
>>> print(Dead_List)
{'Arnold-Franchi', 'Ford', 'Barbara', 'Rosblom', 'Panula', 'Turpin', 'Vander Planke', 'Oreskovic', 'Jussila', 'Johnston', 'Goodwin', 'Bourke', 'Boulos', 'Lobb', 'Strom', 'Sage', 'Palsson', 'Van Impe', 'Danbom', 'Olsson', 'Skoog', 'Zabour', 'Attalah', 'Lefebre', 'Cacic', 'Rice'}
Male_Adult_List=Male_Adult_Group.groupby('Surname')['Survived'].mean()
Survived_List=set(Male_Adult_List[Male_Adult_List.apply(lambda x:x==1)].index)
>>> print(Survived_List)
{'Bishop', 'Beane', 'Flynn', 'Hoyt', 'Jussila', 'Beckwith', 'Goldenberg', 'Frauenthal', 'Nakid', 'Daly', 'Duff Gordon', 'Chambers', 'Taylor', 'Dick'}
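To see why a surname lands on one of these lists, inspect the family directly; for example Sage, which appears in Dead_List above:
# illustrative: all members of the Sage family in the combined data
print(all_data.loc[all_data['Surname'] == 'Sage',
                   ['Name', 'Sex', 'Age', 'Survived']])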
To get the samples in these two kinds of anomalous groups classified correctly, we deliberately overwrite ("punish") the Age, Title, and Sex of the test-set samples that belong to them.
train=all_data.loc[all_data['Survived'].notnull()]
test=all_data.loc[all_data['Survived'].isnull()].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Sex'] = 'male'
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Age'] = 60
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Title'] = 'Mr'
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Sex'] = 'female'
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Age'] = 5
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Title'] = 'Miss'
3) Feature conversion
Select the features, convert them to numeric variables, and split back into training and test sets.
all_data=pd.concat([train, test])
all_data=all_data[['Survived','Pclass','Sex','Age','Fare','Embarked','Title','FamilyLabel','Deck','TicketGroup']]
all_data=pd.get_dummies(all_data)
train=all_data[all_data['Survived'].notnull()]
test=all_data[all_data['Survived'].isnull()].drop('Survived',axis=1)
X = train.values[:,1:]
y = train.values[:,0]
3. Modeling and tuning
1) Hyperparameter tuning
Use grid search to pick the hyperparameters automatically. My grid search actually returned n_estimators = 28, max_depth = 6 as optimal, but following another Kernel and changing to n_estimators = 26, max_depth = 6 slightly improved both the cross-validation score and the Kaggle score.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
pipe=Pipeline([('select',SelectKBest(k=20)),
('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])
# SelectKBest: you supply the scoring function (f_classif by default) and the k, the number of top features to keep
param_test = {'classify__n_estimators':list(range(20,50,2)),
'classify__max_depth':list(range(3,60,3))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='roc_auc', cv=10)
gsearch.fit(X,y)
print(gsearch.best_params_, gsearch.best_score_)
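GridSearchCV refits the best pipeline on the full data by default (refit=True), so the tuned model is also available directly:
best_model = gsearch.best_estimator_  # pipeline refit with the best parameters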
2) Train the model
from sklearn.pipeline import make_pipeline
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10, warm_start = True,
n_estimators = 26,
max_depth = 6,
max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
3) Cross-validation
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(pipeline, X, y, cv=10)
>>> print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))
CV Score : Mean - 0.8451402 | Std - 0.03276752
4. Prediction
predictions = pipeline.predict(test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv(r"h:\kaggle\submission1.csv", index=False)
Full code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')
PassengerId=test['PassengerId']
all_data = pd.concat([train, test], ignore_index = True)
#### Add a Title feature: extract each passenger's title from Name and group the titles into six categories.
all_data['Title'] = all_data['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
Title_Dict = {}
# dict.fromkeys maps every key in the list to the single value given after it
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
all_data['Title'] = all_data['Title'].map(Title_Dict)
sns.barplot(x="Title", y="Survived", data=all_data)
#### Add a FamilyLabel feature: compute FamilySize = SibSp + Parch + 1, then bin FamilySize into three classes
all_data['FamilySize']=all_data['SibSp']+all_data['Parch']+1
# bin FamilySize into three classes by survival rate, forming FamilyLabel
def Fam_label(s):
    if 2 <= s <= 4:
        return 2
    elif (4 < s <= 7) or s == 1:
        return 1
    elif s > 7:
        return 0
all_data['FamilyLabel']=all_data['FamilySize'].apply(Fam_label)
#### Add a Deck feature: fill missing Cabin values with 'Unknown', then take the first letter of Cabin as the deck
all_data['Cabin'] = all_data['Cabin'].fillna('Unknown')
all_data['Deck']=all_data['Cabin'].str.get(0)
#### TicketGroup Feature (new): passengers sharing a ticket number with 2 to 4 people had higher survival rates
# Add a TicketGroup feature: count how many passengers share each passenger's ticket number.
Ticket_Count = dict(all_data['Ticket'].value_counts())
all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x:Ticket_Count[x])
# bin TicketGroup into three classes by survival rate
def Ticket_Label(s):
    if 2 <= s <= 4:
        return 2
    elif (4 < s <= 8) or s == 1:
        return 1
    elif s > 8:
        return 0
all_data['TicketGroup'] = all_data['TicketGroup'].apply(Ticket_Label)
#### Filling missing values
# Age Feature: 263 missing values, a large amount; build a random forest on Sex, Title, Pclass to fill them.
from sklearn.ensemble import RandomForestRegressor
age_df = all_data[['Age', 'Pclass','Sex','Title']]
age_df=pd.get_dummies(age_df)
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
y = known_age[:, 0]
X = known_age[:, 1:]
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
predictedAges = rfr.predict(unknown_age[:, 1:])
all_data.loc[ (all_data.Age.isnull()), 'Age' ] = predictedAges
#### Embarked Feature: 2 missing values; both passengers have Pclass 1 and Fare 80,
# and among Pclass 1 the median Fare for Embarked C (~78.27) is closest to 80, so fill with C.
all_data['Embarked'] = all_data['Embarked'].fillna('C')
#### Fare Feature: 1 missing value; that passenger has Embarked S and Pclass 3,
# so fill with the median Fare of passengers with Embarked S and Pclass 3.
fare=all_data[(all_data['Embarked'] == "S") & (all_data['Pclass'] == 3)].Fare.median()
all_data['Fare']=all_data['Fare'].fillna(fare)
#### Group identification
# group passengers by surname; from groups with more than one member, extract the women and children, and the adult men
all_data['Surname']=all_data['Name'].apply(lambda x:x.split(',')[0].strip())
Surname_Count = dict(all_data['Surname'].value_counts())
all_data['FamilyGroup'] = all_data['Surname'].apply(lambda x:Surname_Count[x])
Female_Child_Group=all_data.loc[(all_data['FamilyGroup']>=2) & ((all_data['Age']<=12) | (all_data['Sex']=='female'))]
Male_Adult_Group=all_data.loc[(all_data['FamilyGroup']>=2) & (all_data['Age']>12) & (all_data['Sex']=='male')]
#### The general pattern is that women and children survived at high rates and adult men at low rates, so single out the groups that break it for special handling.
# Women-and-children groups with survival rate 0 form the dead list; adult-male groups with survival rate 1 form the survived list.
# Women and children in dead-list families are presumed unlikely to have survived; adult men in survived-list families, likely to have.
Female_Child_Group=Female_Child_Group.groupby('Surname')['Survived'].mean()
Dead_List=set(Female_Child_Group[Female_Child_Group.apply(lambda x:x==0)].index)
Male_Adult_List=Male_Adult_Group.groupby('Surname')['Survived'].mean()
Survived_List=set(Male_Adult_List[Male_Adult_List.apply(lambda x:x==1)].index)
#### So that samples in these anomalous groups get classified correctly, overwrite the Age, Title, and Sex of the test-set samples belonging to them.
train=all_data.loc[all_data['Survived'].notnull()]
test=all_data.loc[all_data['Survived'].isnull()].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Sex'] = 'male'
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Age'] = 60
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Title'] = 'Mr'
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Sex'] = 'female'
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Age'] = 5
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Title'] = 'Miss'
#### Feature conversion
# select the features, convert to numeric variables, split back into train and test.
all_data=pd.concat([train, test])
all_data=all_data[['Survived','Pclass','Sex','Age','Fare','Embarked','Title','FamilyLabel','Deck','TicketGroup']]
all_data=pd.get_dummies(all_data)
train=all_data[all_data['Survived'].notnull()]
test=all_data[all_data['Survived'].isnull()].drop('Survived',axis=1)
X = train.values[:,1:]
y = train.values[:,0]
#### Hyperparameter tuning
# Grid search actually returned n_estimators = 28, max_depth = 6 as optimal, but following
# another Kernel, n_estimators = 26, max_depth = 6 slightly improved both the CV score and the Kaggle score.
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest
pipe=Pipeline([('select',SelectKBest(k=20)),
('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])
# SelectKBest: filter-style feature selection as a preprocessing step
param_test = {'classify__n_estimators':list(range(20,50,2)),
'classify__max_depth':list(range(3,60,3))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='roc_auc', cv=10)
grid_result=gsearch.fit(X,y)
#### Print the search results:
print("Best: %f using %s" % (grid_result.best_score_,grid_result.best_params_))
# cv_results_ holds the scores for every parameter combination tried
# best_params_: the combination that achieved the best score; best_score_: that best score
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean,param in zip(means,params):
    print("%f with: %r" % (mean,param))
#### Train the model
from sklearn.pipeline import make_pipeline
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10, warm_start = True,
n_estimators = 26,
max_depth = 6,
max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
#### Cross-validation
from sklearn import model_selection
from sklearn import metrics
cv_score = model_selection.cross_val_score(pipeline, X, y, cv=10)
print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))
#### Prediction
predictions = pipeline.predict(test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
print(submission)
submission.to_csv(r"submission1.csv", index=False)