Titanic Survival Prediction 2

Author: 章光辉_数据 | Published 2018-02-06 18:41

    Background music: 保留 - 郭顶

    Previous post: Titanic Survival Prediction 1, which covered the feature engineering.

    This post covers how to train models to make the predictions.

    %matplotlib inline
    from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
    from xgboost import XGBClassifier
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    from sklearn import feature_selection
    from sklearn import model_selection
    from sklearn import metrics
    import pandas as pd
    import time
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    

    1. Load the data

    path_data = '../../data/titanic/'
    df = pd.read_csv(path_data + 'fe_data.csv')
    
    df_data_y = df['Survived']
    df_data_x = df.drop(['Survived', 'PassengerId'], axis=1)
    
    df_train_x = df_data_x.iloc[:891, :]  # the first 891 rows are the training set
    df_train_y = df_data_y[:891]
    

    2. Feature selection

    I chose GBDT for feature selection. This follows from how decision trees work: at every split they pick a feature by computing information gain (or another criterion), so while fitting the model they also "measure" each feature's contribution, which makes the result fairly easy to visualize.

    # recursive feature elimination with cross-validation, using GBDT as the base estimator
    cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
    gbdt_rfe = feature_selection.RFECV(ensemble.GradientBoostingClassifier(random_state=2018), step = 1, scoring = 'accuracy', cv = cv_split)
    gbdt_rfe.fit(df_train_x, df_train_y)
    columns_rfe = df_train_x.columns.values[gbdt_rfe.get_support()]
    print('Picked columns: {}'.format(columns_rfe))
    print("Optimal number of features : {}/{}".format(gbdt_rfe.n_features_, len(df_train_x.columns)))
    
    # plot the CV accuracy against the number of selected features
    plt.figure()
    plt.xlabel("Number of features selected")
    plt.ylabel("Cross validation score (accuracy)")
    plt.plot(range(1, len(gbdt_rfe.grid_scores_) + 1), gbdt_rfe.grid_scores_)
    plt.show()
    

    The output:

    Picked columns: ['Age' 'Fare' 'Pclass' 'SibSp' 'FamilySize' 'Family_Survival' 'Sex_Code' 'Title_Master' 'Title_Mr' 'Cabin_C' 'Cabin_E' 'Cabin_X']
    Optimal number of features : 12/24
    

    With roughly 5 or more features, the cross-validation score has already leveled off, which suggests that only a handful of the existing features contribute much...

    The best score occurs at 12 features. Keep in mind, though, that the competition score is not decided by your cross-validation set, so there is some luck involved. Since performance is similar over a fairly wide range of feature counts, I think every choice from 5 to 24 features is worth a try if you have the time.

    Personally I compared 24 features against 12, and keeping all 24 performed best... I didn't try the others (a sketch of how to run such a comparison follows below).
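
    For reference, a minimal sketch of how such a comparison could be run, reusing columns_rfe and cv_split from above (my actual comparison code is not shown in this post, so treat this purely as an illustration):

    # compare CV accuracy using all features vs. only the RFE-picked columns
    gbdt = ensemble.GradientBoostingClassifier(random_state=2018)
    
    score_all = model_selection.cross_val_score(gbdt, df_train_x, df_train_y, cv=cv_split, scoring='accuracy').mean()
    score_rfe = model_selection.cross_val_score(gbdt, df_train_x[columns_rfe], df_train_y, cv=cv_split, scoring='accuracy').mean()
    
    print('CV accuracy, all {} features: {:.4f}'.format(df_train_x.shape[1], score_all))
    print('CV accuracy, {} RFE-picked features: {:.4f}'.format(len(columns_rfe), score_rfe))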

    Then standardize the features for training:

    # standardize all features (fit on train + test together); the result is a numpy array
    stsc = StandardScaler()
    df_data_x = stsc.fit_transform(df_data_x)
    print('mean:\n', stsc.mean_)
    print('var:\n', stsc.var_)
    
    df_train_x = df_data_x[:891]
    df_train_y = df_data_y[:891]
    
    df_test_x = df_data_x[891:]
    # keep PassengerId for the submission file; Survived will be filled with predictions later
    df_test_output = df.iloc[891:, :][['PassengerId','Survived']]
    

    3. Model ensembling

    The usual machine learning routine is:

    1. Pick a basic model first, train it and predict with it, to get a working pipeline up as quickly as possible.
    2. On top of that, tune the model with cross-validation and grid search, and check how it performs.
    3. Combine several models with model ensembling, and predict by voting (or some other scheme).

    In general, an ensemble gives better results than any single model.

    Here I skip steps 1 and 2 and go straight to step 3 (for completeness, a minimal sketch of a step-1 baseline is given below).
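
    As a rough sketch only: the logistic-regression choice here is my illustration, not something this post actually trains. It reuses the standardized df_train_x / df_train_y and the cv_split defined earlier:

    # step 1: a quick single-model baseline to get a working pipeline
    baseline = linear_model.LogisticRegression()
    baseline_cv = model_selection.cross_validate(baseline, df_train_x, df_train_y, cv=cv_split, scoring='accuracy')
    print('Baseline CV accuracy: {:.2f} +/- {:.2f}'.format(baseline_cv['test_score'].mean()*100, baseline_cv['test_score'].std()*100))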

    3.1 Set up the base models and parameter grids

    vote_est = [
        ('ada', ensemble.AdaBoostClassifier()),
        ('bc', ensemble.BaggingClassifier()),
        ('etc', ensemble.ExtraTreesClassifier()),
        ('gbc', ensemble.GradientBoostingClassifier()),
        ('rfc', ensemble.RandomForestClassifier()),
        ('gpc', gaussian_process.GaussianProcessClassifier()),
        ('lr', linear_model.LogisticRegressionCV()),
        ('bnb', naive_bayes.BernoulliNB()),
        ('gnb', naive_bayes.GaussianNB()),
        ('knn', neighbors.KNeighborsClassifier()),
        ('svc', svm.SVC(probability=True)),
        ('xgb', XGBClassifier())
    ]
    
    grid_n_estimator = [10, 50, 100, 300, 500]
    grid_ratio = [.5, .8, 1.0]
    grid_learn = [.001, .005, .01, .05, .1]
    grid_max_depth = [2, 4, 6, 8, 10]
    grid_criterion = ['gini', 'entropy']
    grid_bool = [True, False]
    grid_seed = [0]
    
    grid_param = [
        # AdaBoostClassifier
        {
            'n_estimators':grid_n_estimator,
            'learning_rate':grid_learn,
            'random_state':grid_seed
        },
        # BaggingClassifier
        {
            'n_estimators':grid_n_estimator,
            'max_samples':grid_ratio,
            'random_state':grid_seed
        },
        # ExtraTreesClassifier
        {
            'n_estimators':grid_n_estimator,
            'criterion':grid_criterion,
            'max_depth':grid_max_depth,
            'random_state':grid_seed
        },
        # GradientBoostingClassifier
        {
            'learning_rate':grid_learn,
            'n_estimators':grid_n_estimator,
            'max_depth':grid_max_depth,
            'random_state':grid_seed
        },
        # RandomForestClassifier
        {
            'n_estimators':grid_n_estimator,
            'criterion':grid_criterion,
            'max_depth':grid_max_depth,
            'oob_score':[True],
            'random_state':grid_seed
        },
        # GaussianProcessClassifier
        {
            'max_iter_predict':grid_n_estimator,
            'random_state':grid_seed
        },
        # LogisticRegressionCV
        {
            'fit_intercept':grid_bool,  # default: True
            'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
            'random_state':grid_seed
        },
        # BernoulliNB
        {
            'alpha':grid_ratio,
        },
        # GaussianNB
        {},
        # KNeighborsClassifier
        {
            'n_neighbors':range(6, 25),
            'weights':['uniform', 'distance'],
            'algorithm':['auto', 'ball_tree', 'kd_tree', 'brute']
        },
        # SVC
        {
            'C':[1, 2, 3, 4, 5],
            'gamma':grid_ratio,
            'decision_function_shape':['ovo', 'ovr'],
            'probability':[True],
            'random_state':grid_seed
        },
        # XGBClassifier
        {
            'learning_rate':grid_learn,
            'max_depth':[1, 2, 4, 6, 8, 10],
            'n_estimators':grid_n_estimator,
            'seed':grid_seed
        }
    ]
    

    3.2 Training

    Each model is tuned before being combined. Some of the searches involve many iterations, so to save time I used RandomizedSearchCV for those (I haven't had time to try a full GridSearchCV for everything).

    start_total = time.perf_counter()
    for clf, param in zip(vote_est, grid_param):
        start = time.perf_counter()
        cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0)
        # small grids are searched exhaustively; grids that tune n_estimators are larger,
        # so RandomizedSearchCV is used for those to save time
        if 'n_estimators' not in param.keys():
            print(clf[1].__class__.__name__, 'GridSearchCV')
            best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'accuracy')
        else:
            print(clf[1].__class__.__name__, 'RandomizedSearchCV')
            best_search = model_selection.RandomizedSearchCV(estimator = clf[1], param_distributions = param, cv = cv_split, scoring = 'accuracy')
        best_search.fit(df_train_x, df_train_y)
        best_param = best_search.best_params_
        run = time.perf_counter() - start
    
        print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
        clf[1].set_params(**best_param)  # write the tuned parameters back into the voting estimators
    
    run_total = time.perf_counter() - start_total
    print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
    

    4. Prediction

    There are two voting schemes: hard voting and soft voting.

    • Hard voting: simple majority rule over the predicted class labels.
    • Soft voting: I haven't studied it in depth, but as other write-ups describe it, the class probabilities are averaged (optionally with weights) and the class with the highest average probability wins.

    If you have no prior experience to go on, it's best to compute both voting schemes and see how they compare (a toy illustration of the difference follows below).

    For Titanic survival prediction, I found that hard voting gave the better result every time.
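
    A toy numeric illustration (the numbers are made up, not taken from the actual models): suppose three classifiers give one passenger survival probabilities of 0.9, 0.4 and 0.4. Hard voting sees the labels 1, 0, 0 and predicts 0, while soft voting averages the probabilities to about 0.57 and predicts 1.

    import numpy as np
    
    # hypothetical per-classifier probabilities of survival for one passenger
    proba = np.array([0.9, 0.4, 0.4])
    
    hard_vote = int(np.round(proba).sum() > len(proba) / 2)  # labels 1, 0, 0 -> majority is 0
    soft_vote = int(proba.mean() > 0.5)                      # mean probability 0.57 -> 1
    print(hard_vote, soft_vote)  # prints: 0 1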

    grid_hard = ensemble.VotingClassifier(estimators = vote_est , voting = 'hard')
    grid_hard_cv = model_selection.cross_validate(grid_hard, df_train_x, df_train_y, cv = cv_split, scoring = 'accuracy')
    grid_hard.fit(df_train_x, df_train_y)
    
    print("Hard Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_hard_cv['train_score'].mean()*100)) 
    print("Hard Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_hard_cv['test_score'].mean()*100))
    print("Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_hard_cv['test_score'].std()*100*3))
    print('-'*10)
    
    grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
    grid_soft_cv = model_selection.cross_validate(grid_soft, df_train_x, df_train_y, cv = cv_split, scoring = 'accuracy')
    grid_soft.fit(df_train_x, df_train_y)
    
    print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100)) 
    print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
    print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))
    

    The results:

    Hard Voting w/Tuned Hyperparameters Training w/bin score mean: 89.70
    Hard Voting w/Tuned Hyperparameters Test w/bin score mean: 85.97
    Hard Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 5.95
    ----------
    Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 90.02
    Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 85.52
    Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 6.07
    

    Hard voting scores higher on the cross-validation test folds and has a smaller standard deviation, so hard voting is the one to go with.

    5. Submit the results

    Use the hard-voting ensemble to generate the predictions and submit them.

    df_test_output['Survived'] = grid_hard.predict(df_test_x)
    df_test_output.to_csv('../../data/titanic/hardvote.csv', index = False)
    

    Submitting the result on the Kaggle site gives a score of 0.81339.


    Afterword

    The Titanic project is well worth trying. Along the way I referred to several kernels shared by other participants on Kaggle and learned a great deal.

    As an introductory project, though, the point is to take part; when I have time I'll do it again and see whether I can improve.

    Next, I plan to try the Dogs vs. Cats competition (猫狗大战): writing an algorithm to classify whether an image contains a dog or a cat. That's easy for humans (and for dogs and cats), but how do you do it with an algorithm? Stay tuned.
