Model Fusion

Author: CodingFish | Published 2017-04-10 14:52

    1.blending

    Take blending with data split into train and test. For each base model model_i (e.g. xgboost), first train model_i on all of the training data and predict the test set, producing a prediction vector v_i. Then run 5-fold CV on the training set: for each fold, train on the other four folds with the remaining fold held out as validation, giving model_i_j, and predict the held-out fold to get a vector t_i_j. Concatenating the five fold vectors yields t_i, which is aligned with v_i. Each base model produces such a pair of vectors; a top-level model, e.g. LR or another linear model, is then trained on the t vectors, and the resulting blender model makes the final predictions from the v vectors.
    In other words, we need to generate a table like the one below: the training-set columns come from the cross-validated (out-of-fold) predictions on the split training data, and the test-set columns come from models trained on all of the training data predicting the test set.

    id  model_1  model_2  model_3  model_4  label
    1   0.1      0.2      0.14     0.15     0
    2   0.2      0.22     0.18     0.3      1
    3   0.8      0.7      0.88     0.6      1
    4   0.3      0.3      0.2      0.22     0
    5   0.5      0.3      0.6      0.5      1

    Advantages of blending: it is simpler than stacking; it avoids data leakage, since the generalizers and the stackers use different data; and other models can be added to the blender at any time.

    Differences from stacking:
    stacking predicts the test set with a model trained directly on the full training data;
    blending predicts the test set once per CV fold and averages the n fold predictions.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold  # sklearn.cross_validation was removed in modern scikit-learn
    from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    # load_data, utility and evaluator are project-local modules from the original post;
    # they provide preprocess_train_input, preprocess_val_input and evaluate2.
    import load_data
    from utility import *
    from evaluator import *
    
    
    
    def logloss(attempt, actual, epsilon=1.0e-15):
        """Logloss, i.e. the score of the bioresponse competition.
        """
        attempt = np.clip(attempt, epsilon, 1.0-epsilon)
        return - np.mean(actual * np.log(attempt) + (1.0 - actual) * np.log(1.0 - attempt))
    
    
    if __name__ == '__main__':
    
        np.random.seed(0) # seed to shuffle the train set
    
        # n_folds = 10
        n_folds = 5
        verbose = True
        shuffle = False
    
    
        # X, y, X_submission = load_data.load()
    
        train_x_id, train_x, train_y = preprocess_train_input()
        val_x_id, val_x, val_y = preprocess_val_input()
    
        X = train_x
        y = train_y
        X_submission = val_x
        X_submission_y = val_y
    
        if shuffle:
            idx = np.random.permutation(y.size)
            X = X[idx]
            y = y[idx]
    
    
        skf = list(StratifiedKFold(n_splits=n_folds).split(X, y))  # modern API: fold indices come from .split(X, y)
    
        clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
                RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
                ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
                ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
                GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]
    
        print "Creating train and test sets for blending."
        
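        # Meta-feature matrices: out-of-fold predictions on the train set, one column
        # per base model, and the fold-averaged counterpart on the test (submission) set.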
        dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
        dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))
        
        for j, clf in enumerate(clfs):
            print(j, clf)
            dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
            for i, (train, test) in enumerate(skf):
                print "Fold", i
                X_train = X[train]
                y_train = y[train]
                X_test = X[test]
                y_test = y[test]
                clf.fit(X_train, y_train)
                y_submission = clf.predict_proba(X_test)[:,1]
                dataset_blend_train[test, j] = y_submission
                dataset_blend_test_j[:, i] = clf.predict_proba(X_submission)[:,1]
            dataset_blend_test[:,j] = dataset_blend_test_j.mean(1)
            print("val auc Score: %0.5f" % (evaluate2(dataset_blend_test[:,j], X_submission_y)))
    
        print()
        print("Blending.")
        # clf = LogisticRegression()
        clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=100)
        clf.fit(dataset_blend_train, y)
        y_submission = clf.predict_proba(dataset_blend_test)[:,1]
    
        print "Linear stretch of predictions to [0,1]"
        y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())
        print "blend result"
        print("val auc Score: %0.5f" % (evaluate2(y_submission, X_submission_y)))
        print "Saving Results."
        np.savetxt(fname='blend_result.csv', X=y_submission, fmt='%0.9f')
    

    2.rank_avg

    This fusion method suits rank-based evaluation metrics such as AUC.


    The fused score of a sample is, roughly, the weighted average of its per-model ranks: score_j = Σ_i weight_i · rank_i(j) / Σ_i weight_i,
    where weight_i is the weight of model i (all weights equal to 1 means plain averaging)
    and rank_i(j) is the ascending rank of sample j under model i, so samples ranked near the top by the base models also end up near the top after fusion.
    Fusing on ranks is a quick way to exploit the differences between models without having to weight and combine their raw probability values.
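
    A minimal Python sketch, assuming scipy is available (the helper name rank_average and the final rescaling into (0, 1] are illustrative choices, not from the original post):

    import numpy as np
    from scipy.stats import rankdata

    def rank_average(predictions, weights=None):
        """Fuse model scores by weight-averaging their ascending ranks."""
        if weights is None:
            weights = np.ones(len(predictions))               # all 1s = plain average
        ranks = np.array([rankdata(p) for p in predictions])  # one row of ranks per model
        fused = np.average(ranks, axis=0, weights=weights)    # weighted rank average
        return fused / len(fused)                             # rescale into (0, 1]

    Since AUC depends only on the ordering of the scores, the rescaling step does not change the metric; it merely keeps the output in a probability-like range.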

    3.weighted

    Weighted fusion: give each model a weight, then take the weighted combination as the final result (e.g. result = Σ_i weight_i · result_i / Σ_i weight_i).
    With all weights equal to 1 this is mean fusion; result_i is the output of model i.
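
    A minimal sketch with numpy (the helper name weighted_average is illustrative):

    import numpy as np

    def weighted_average(results, weights):
        """Weighted fusion of per-model prediction arrays; np.average normalizes by the weight sum."""
        return np.average(np.asarray(results), axis=0, weights=weights)

    # e.g. weighted_average([pred_a, pred_b], weights=[2, 1]) trusts the first model twice as much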

    4.bagging

    Build multiple diverse models by varying features, hyperparameters, and samples, then fuse them; random forests are the reference example.
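
    As a quick sketch, scikit-learn's BaggingClassifier expresses the sample/feature diversity idea directly (the parameter values below are illustrative; the default base estimator is a decision tree):

    from sklearn.ensemble import BaggingClassifier

    bagger = BaggingClassifier(
        n_estimators=50,    # number of models in the ensemble
        max_samples=0.8,    # each model sees a random 80% of the samples
        max_features=0.8,   # ...and a random 80% of the features
        random_state=0,
    )
    # usage: bagger.fit(X_train, y_train); bagger.predict_proba(X_test)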


    Reposted from: http://blog.csdn.net/bryan__/article/details/51229032
