GBDT and LR Fusion for CTR Prediction

Author: dingtom | Published 2020-05-08 13:27

    GBDT+LR is most widely used for CTR (click-through-rate) prediction, i.e., predicting whether a user will click on an ad shown to them.

    Training sets are typically on the order of hundreds of millions of samples. At that scale, the fast-to-train LR is the usual model choice, but LR is a linear model with limited learning capacity. GBDT is well suited to discovering discriminative features and feature combinations, which reduces the manual cost of feature engineering.

    In the earlier trees of a GBDT, the splits mainly reflect features that discriminate well for the majority of samples; the later trees mainly fit the minority of samples whose residuals remain large after the first N trees. Preferring globally discriminative features first, and only then features that discriminate for a few samples, is the more sensible ordering.

    As an example, consider a GBDT+LR model in which the GBDT has two weak learners, one blue and one red; the blue learner has 3 leaf nodes and the red learner has 2. Suppose a sample falls into the second leaf of the blue learner and also into the second leaf of the red learner. We then record the blue learner's output as [0 1 0] and the red learner's output as [0 1]; concatenating them, the GBDT output for this sample is [0 1 0 0 1], a sparse vector.

    The idea here is the same as one-hot encoding; in fact, one-hot encoding is exactly what is used when the GBDT constructs the new training data. Because each weak learner routes a sample to exactly one leaf, in a GBDT with n weak learners and m leaf nodes in total, every training sample is transformed into a 1×m sparse vector containing exactly n ones and m−n zeros.
    Once the new training data has been constructed, it is fed, together with the labels from the original training data, into a Logistic Regression classifier to train the final model. Note that after the GBDT transformation the data is not only sparse but, depending on the number of weak learners and leaf nodes, can also be very high-dimensional. Regularization in the Logistic Regression layer therefore helps reduce the risk of overfitting; the Facebook paper uses L1 regularization.
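
    To make this concrete, here is a minimal sketch of the leaf-index → one-hot → L1-regularized LR pipeline just described (a sketch on a toy dataset; the data and hyperparameters are illustrative, not from this article):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # two small trees, mirroring the blue/red example above
    gbdt = GradientBoostingClassifier(n_estimators=2, max_depth=2).fit(X, y)
    leaves = gbdt.apply(X)[:, :, 0]     # (n_samples, n_trees): leaf index per tree
    enc = OneHotEncoder(categories='auto')
    X_leaf = enc.fit_transform(leaves)  # sparse (n_samples, m), exactly n_trees ones per row
    # L1 regularization, as in the Facebook paper, keeps the sparse model compact
    lr = LogisticRegression(penalty='l1', solver='liblinear').fit(X_leaf, y)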

    Logistic Regression is a linear classifier, so it ignores interactions between features. Could we improve the model by constructing new cross features, i.e., explicit feature combinations?
    Second, the new training data the GBDT constructs is very likely a high-dimensional sparse matrix, and training Logistic Regression directly on a high-dimensional sparse matrix leads to heavy computation and slow updates of the feature weights.

    Using the FM algorithm in place of LR addresses both the limited expressiveness of Logistic Regression and the high cost of training on high-dimensional sparse matrices. Does that mean we are done? Not quite: applying FM feature crossing to an already high-dimensional sparse matrix yields even more feature dimensions, and since the elements are all 0 or 1, the new feature data may be even sparser. What then?
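
    For reference, here is a minimal sketch of an FM scoring function, using the standard O(kn) reformulation of the pairwise term (fm_score and its argument names are hypothetical choices, not from the article or a particular library):

    import numpy as np

    def fm_score(x, w0, w, V):
        """FM prediction for one sample.
        x: (n,) features; w0: bias; w: (n,) linear weights; V: (n, k) latent factors.
        Pairwise term: sum_{i<j} <v_i, v_j> x_i x_j
                     = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]
        """
        xv = x @ V                                    # (k,)
        pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
        return w0 + w @ x + pairwise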

    So we return to the step where the GBDT constructs the new training data. After the GBDT has built the new samples, we score each feature's importance with respect to the output and keep only the features whose importance is high. This prunes the GBDT's high-dimensional sparse matrix, so the result is no longer so high-dimensional (the GBDT+LR listing below does this with SelectFromModel). We then run FM on the selected features to construct cross terms, introducing nonlinear features, and finally build the training data for the last classifier and train the model. (Code for GBDT+FFM follows below.)

    GBDT+LR

    # encoding=utf-8
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.utils import shuffle
    
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.feature_selection import SelectFromModel
    
    from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    
    from sklearn.metrics import make_scorer, roc_curve, accuracy_score, f1_score, confusion_matrix, roc_auc_score
    
    data = load_breast_cancer(return_X_y=True)  # data[0]: features, data[1]: labels
    
    data_p = []  # positive data
    data_n = []  # negative data
    
    for i in range(len(data[1])):
        if data[1][i] == 1:
            data_p.append(data[0][i])
        else:
            data_n.append(data[0][i])
    
    data_p = pd.DataFrame(np.array(data_p))
    # print(data_p.values.shape)  # (357, 30)
    
    data_n = np.array(data_n)
    data_p = data_p.sample(data_n.shape[0], axis=0)
    # print(data_p.values.shape)  # (212, 30)
    data_comb = np.vstack((data_p, data_n))
    
    label_p = [1] * data_p.shape[0]
    label_n = [0] * data_n.shape[0]
    label_p.extend(label_n)
    label = np.array(label_p)
    
    x, y = shuffle(data_comb, label, random_state=10)
    # print(np.unique(y))  # [0 1]
    x = StandardScaler().fit_transform(x)
    
    # print(label.shape)  # (424,)
    # print(data_comb.shape)  # (424, 30)
    # ------------------------------- grid search for the best parameters
    # parameter for LogisticRegression
    # parameter = {'C': [1, 10, 100, 1000],'solver':['liblinear','newton-cg','lbfgs','sag','saga']}
    # parameter for SVM
    # parameter = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],'C': [1, 10, 100, 1000]},{'kernel': ['linear'], 'C': [1, 10, 100, 1000]},{'kernel':['poly'], 'gamma': [1e-3, 1e-4],'C': [1, 10, 100, 1000]},{'kernel':['sigmoid'], 'gamma': [1e-3, 1e-4],'C': [1, 10, 100, 1000]}]
    # parameter for DecisionTree
    # parameter = {'criterion':['gini','entropy'],'max_features':['sqrt','log2']}
    # parameter for RandomForest
    # parameter = {'n_estimators':range(5,100,5),'criterion':['gini','entropy'],'max_depth':[1,2,3,4,5,6,7,8,9,10],'max_features':['sqrt','log2']}
    # parameter for xgboost
    # parameter = {'learning_rate': np.linspace(0.01,0.3,num=30),'n_estimators':range(5,300,5),'max_depth': [1,2,3, 4, 5, 6, 7, 8, 9, 10]}
    # parameter for adaboost
    # parameter = {'n_estimators': range(5,300,5),"learning_rate":[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}
    # parameter for bagging
    # parameter = {'n_estimators': range(5,100,5)}
    # parameter for MLP
    # parameter = {
    #     'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    #     'activation': ['tanh', 'relu'],
    #     'solver': ['sgd', 'adam'],
    #     'alpha': [0.0001, 0.05],
    #     'learning_rate': ['constant','adaptive']}
    
    # clf = LogisticRegression(max_iter=1000,solver='liblinear', C=1)
    # clf = SVC(C=1, kernel='rbf', probability=True)
    # clf = DecisionTreeClassifier(criterion='entropy',max_features='log2')
    # clf = RandomForestClassifier()
    # clf = xgb.XGBClassifier()
    # clf = AdaBoostClassifier()
    # clf = BaggingClassifier(LogisticRegression())
    # clf = GradientBoostingClassifier(learning_rate=0.268,  max_depth=3,n_estimators=85)
    # clf = MLPClassifier(activation='relu', alpha=0.05, hidden_layer_sizes=(50,50), learning_rate='constant',solver='adam')
    
    # # parameter for GBDT
    # parameter = {'learning_rate': np.linspace(0.10, 0.3, num=20),
    #              'n_estimators': range(10, 100, 5),
    #              'max_depth': [3, 4, 5, 6, 7, 8]}
    
    # # make_scorer wraps a metric function into a scoring object usable during model
    # # evaluation, e.g. to wrap an existing metric with custom parameters.
    # scoring = {'roc_auc': 'roc_auc', 'accuracy': make_scorer(accuracy_score),
    #            'f1': make_scorer(f1_score)}
    # clf = GradientBoostingClassifier()
    # model = GridSearchCV(clf, parameter, cv=5, scoring=scoring,
    #                      n_jobs=-1, verbose=2, return_train_score=False, refit='roc_auc')
    
    # his = model.fit(x, y)
    # print("best estimator : ", his.best_estimator_)
    # print("best score : ", his.best_score_)
    # print("best parameters : ", his.best_params_)
    # The grid search above was run once; its best values are reused here. (If you rerun
    # it, extract by key, e.g. his.best_params_['learning_rate'] -- unpacking
    # his.best_params_.values() depends on dict ordering and is fragile.)
    lr, md, ne = 0.268, 3, 85  # learning_rate, max_depth, n_estimators from the search
    
    # ---------------------------------- feature selection
    # lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(x, y)
    # lsvc = LogisticRegression().fit(x,y)
    lsvc = GradientBoostingClassifier(random_state=0, learning_rate=lr, max_depth=md, n_estimators=ne).fit(x, y)
    model = SelectFromModel(lsvc, threshold=0.0001, prefit=True)
    # print(lsvc.feature_importances_, lsvc.coef_, model.threshold_)
    x_new = model.transform(x)
    # print(x_new.shape)  # (424, 27)
    
    # ------------------------------ GBDT+LR feature fusion
    # With large sample sizes, the fast-to-train LR is the usual model choice, but LR
    # is linear with limited capacity, so feature engineering is critical. Hand-crafted
    # searches for discriminative features and combinations may not pay off;
    # GBDT discovers such features and combinations automatically, cutting the manual cost.
    
    skf = StratifiedKFold(n_splits=5)  # 5 folds: each split uses train:test = 4:1
    AUC = []
    Accuracy = []
    Sensitivity = []
    Specificity = []
    F1 = []
    for train_index, test_index in skf.split(x_new, y):
        # print(len(train_index), len(test_index))  # train 339       test 85
        X_train, X_test = x_new[train_index], x_new[test_index]
        y_train, y_test = y[train_index], y[test_index]
        X_train, X_train_lr, y_train, y_train_lr = train_test_split(
            X_train, y_train, test_size=0.5, random_state=10)  # original features
        # print(X_train.shape, X_train_lr.shape)  # X_train (169, 25)  X_train_lr  (170, 24)
        # X_train fits the GBDT (feature extraction); X_train_lr trains the final classifier
        grd = GradientBoostingClassifier(learning_rate=lr, max_depth=md, n_estimators=ne)
        grd.fit(X_train, y_train)
    
        grd_enc = OneHotEncoder(categories='auto')
        # print(grd.apply(X_train).shape)  # leaf indices of X_train: (169, 85, 1)
        grd_enc.fit(grd.apply(X_train)[:, :, 0])  # apply() gives each sample's leaf index in every tree of the trained model
        # Build features with the trained GBDT, one-hot encode them, and feed them to the LR as new features.
        
        onehot_train_lr = grd_enc.transform(grd.apply(X_train_lr)[:, :, 0]).toarray()
        # print(onehot_train_lr.shape)  # one-hot GBDT features of X_train_lr: (170, 245)
        total_train_lr = np.hstack((onehot_train_lr, X_train_lr))
        # print(total_train_lr.shape)  # combined features (170, 269): one-hot GBDT features + original features
    
        onehot_test = grd_enc.transform(grd.apply(X_test)[:, :, 0]).toarray()
        total_test = np.hstack((onehot_test, X_test))
    
        grd_lm = LogisticRegression(solver='liblinear', max_iter=1000)
        grd_lm.fit(total_train_lr, y_train_lr)
    
        # predict returns hard class labels after training.
        # predict_proba returns an n-by-k array (k = number of classes); entry (i, j) is
        # the predicted probability that sample i has label j, and each row sums to 1.
        y_pred_grd_lm_proba = grd_lm.predict_proba(total_test)
        y_pred_grd_lm = grd_lm.predict(total_test)
        # print(y_pred_grd_lm_proba.shape,y_pred_grd_lm.shape)  # (85, 2) (85,)
    
        fpr_grd_lm, tpr_grd_lm, _ = roc_curve(y_test, y_pred_grd_lm_proba[:, 1])  # ROC needs scores, not hard labels
        confusion = confusion_matrix(y_test, y_pred_grd_lm)
        # print(confusion)
        # [[43  0]
        # [ 0 42]]
        tn, fp, fn, tp = confusion.ravel()
        # print(tn, fp, fn, tp)  # 43 0 0 42
        F1.append(f1_score(y_test, y_pred_grd_lm))
        AUC.append(roc_auc_score(y_test, y_pred_grd_lm_proba[:, 1]))
        Accuracy.append(accuracy_score(y_test, y_pred_grd_lm))
        Sensitivity.append(tp / (tp + fn))
        Specificity.append(tn / (tn + fp))
    
    AUC.append(np.mean(AUC))  # append the mean across folds as the last element of each list
    F1.append(np.mean(F1))
    Accuracy.append(np.mean(Accuracy))
    Sensitivity.append(np.mean(Sensitivity))
    Specificity.append(np.mean(Specificity))
    
    print("Sensitivity : ", Sensitivity)
    print("Specificity : ", Specificity)
    print("Accuracy : ", Accuracy)
    print("F1 Score : ", F1)
    print("AUC : ", AUC)
    

    GBDT+FFM

    https://github.com/wangru8080/gbdt-lr

    # encoding=utf-8
    
    import pandas as pd
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    import lightgbm as lgb
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
    from sklearn.metrics import log_loss
    import warnings
    
    warnings.filterwarnings('ignore')
    import gc
    from scipy import sparse
    
    
    def preProcess():
        path = 'data/'  # adjusted from the author's local absolute path so it matches the 'data/...' reads below
        print('Reading data...')
        df_train = pd.read_csv(path + 'train.csv')
        df_test = pd.read_csv(path + 'test.csv')
        print('Done reading')
        df_train.drop(['Id'], axis=1, inplace=True)
        df_test.drop(['Id'], axis=1, inplace=True)
    
        df_test['Label'] = -1
    
        data = pd.concat([df_train, df_test])
        data = data.fillna(-1)
        data.to_csv('data/data.csv', index=False)
        return data
    
    
    def lr_predict(data, category_feature, continuous_feature):  # 0.47181
        # min-max scale the continuous features
        print('Scaling continuous features...')
        scaler = MinMaxScaler()
        for col in continuous_feature:
            data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
        print('Scaling done')
    
        # one-hot encode the categorical features
        print('One-hot encoding...')
        for col in category_feature:
            onehot_feats = pd.get_dummies(data[col], prefix=col)
            data.drop([col], axis=1, inplace=True)
            data = pd.concat([data, onehot_feats], axis=1)
        print('One-hot done')
    
        train = data[data['Label'] != -1]
        target = train.pop('Label')
        test = data[data['Label'] == -1]
        test.drop(['Label'], axis=1, inplace=True)
    
        # train/validation split
        print('Splitting the dataset...')
        x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=2018)
        print('Training...')
        lr = LogisticRegression()
        lr.fit(x_train, y_train)
        tr_logloss = log_loss(y_train, lr.predict_proba(x_train)[:, 1])
        print('tr-logloss: ', tr_logloss)
        val_logloss = log_loss(y_val, lr.predict_proba(x_val)[:, 1])
        print('val-logloss: ', val_logloss)
        print('Predicting...')
        y_pred = lr.predict_proba(test)[:, 1]
        print('Writing results...')
        res = pd.read_csv('data/test.csv')
        submission = pd.DataFrame({'Id': res['Id'], 'Label': y_pred})
        submission.to_csv('submission/submission_lr_trlogloss_%s_vallogloss_%s.csv' % (tr_logloss, val_logloss),
                          index=False)
        print('Done')
    
    
    def gbdt_predict(data, category_feature, continuous_feature):  # 0.44548
        # one-hot encode the categorical features
        print('One-hot encoding...')
        for col in category_feature:
            onehot_feats = pd.get_dummies(data[col], prefix=col)
            data.drop([col], axis=1, inplace=True)
            data = pd.concat([data, onehot_feats], axis=1)
        print('One-hot done')
    
        train = data[data['Label'] != -1]
        target = train.pop('Label')
        test = data[data['Label'] == -1]
        test.drop(['Label'], axis=1, inplace=True)
    
        # train/validation split
        print('Splitting the dataset...')
        x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=2018)
    
        print('Training...')
        gbm = lgb.LGBMClassifier(objective='binary',
                                 subsample=0.8,
                                 min_child_weight=0.5,
                                 colsample_bytree=0.7,
                                 num_leaves=100,
                                 max_depth=12,
                                 learning_rate=0.01,
                                 n_estimators=10000,
                                 )
    
        gbm.fit(x_train, y_train,
                eval_set=[(x_train, y_train), (x_val, y_val)],
                eval_names=['train', 'val'],
                eval_metric='binary_logloss',
                early_stopping_rounds=100,
                )
        tr_logloss = log_loss(y_train, gbm.predict_proba(x_train)[:, 1])
        val_logloss = log_loss(y_val, gbm.predict_proba(x_val)[:, 1])
        y_pred = gbm.predict_proba(test)[:, 1]
        print('Writing results...')
        res = pd.read_csv('data/test.csv')
        submission = pd.DataFrame({'Id': res['Id'], 'Label': y_pred})
        submission.to_csv('submission/submission_gbdt_trlogloss_%s_vallogloss_%s.csv' % (tr_logloss, val_logloss),
                          index=False)
        print('Done')
    
    
    def gbdt_lr_predict(data, category_feature, continuous_feature):  # 0.43616
        # one-hot encode the categorical features
        print('One-hot encoding...')
        for col in category_feature:
            onehot_feats = pd.get_dummies(data[col], prefix=col)
            data.drop([col], axis=1, inplace=True)
            data = pd.concat([data, onehot_feats], axis=1)
        print('One-hot done')
    
        train = data[data['Label'] != -1]
        target = train.pop('Label')
        test = data[data['Label'] == -1]
        test.drop(['Label'], axis=1, inplace=True)
    
        # train/validation split
        print('Splitting the dataset...')
        x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=2018)
    
        print('Training the GBDT...')
        # LGBMRegressor with objective='binary' (as in the source repo); only its
        # booster_ is used below, to extract per-tree leaf indices.
        gbm = lgb.LGBMRegressor(objective='binary',
                                subsample=0.8,
                                min_child_weight=0.5,
                                colsample_bytree=0.7,
                                num_leaves=100,
                                max_depth=12,
                                learning_rate=0.05,
                                n_estimators=10,
                                )
    
        gbm.fit(x_train, y_train,
                eval_set=[(x_train, y_train), (x_val, y_val)],
                eval_names=['train', 'val'],
                eval_metric='binary_logloss',
                # early_stopping_rounds = 100,
                )
        model = gbm.booster_
        print('Extracting leaf indices...')
        gbdt_feats_train = model.predict(train, pred_leaf=True)
        gbdt_feats_test = model.predict(test, pred_leaf=True)
        gbdt_feats_name = ['gbdt_leaf_' + str(i) for i in range(gbdt_feats_train.shape[1])]
        df_train_gbdt_feats = pd.DataFrame(gbdt_feats_train, columns=gbdt_feats_name)
        df_test_gbdt_feats = pd.DataFrame(gbdt_feats_test, columns=gbdt_feats_name)
    
        print('Building the new dataset...')
        train = pd.concat([train, df_train_gbdt_feats], axis=1)
        test = pd.concat([test, df_test_gbdt_feats], axis=1)
        train_len = train.shape[0]
        data = pd.concat([train, test])
        del train
        del test
        gc.collect()
    
        # # min-max scale the continuous features
        # print('Scaling continuous features...')
        # scaler = MinMaxScaler()
        # for col in continuous_feature:
        #     data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
        # print('Scaling done')
    
        # one-hot encode the leaf indices
        print('One-hot encoding leaves...')
        for col in gbdt_feats_name:
            print('encoding feature:', col)
            onehot_feats = pd.get_dummies(data[col], prefix=col)
            data.drop([col], axis=1, inplace=True)
            data = pd.concat([data, onehot_feats], axis=1)
        print('One-hot done')
    
        train = data[: train_len]
        test = data[train_len:]
        del data
        gc.collect()
    
        x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.3, random_state=2018)
        # LR on the combined original + leaf one-hot features
        print('Training the LR...')
        lr = LogisticRegression()
        lr.fit(x_train, y_train)
        tr_logloss = log_loss(y_train, lr.predict_proba(x_train)[:, 1])
        print('tr-logloss: ', tr_logloss)
        val_logloss = log_loss(y_val, lr.predict_proba(x_val)[:, 1])
        print('val-logloss: ', val_logloss)
        print('Predicting...')
        y_pred = lr.predict_proba(test)[:, 1]
        print('Writing results...')
        res = pd.read_csv('data/test.csv')
        submission = pd.DataFrame({'Id': res['Id'], 'Label': y_pred})
        submission.to_csv('submission/submission_gbdt+lr_trlogloss_%s_vallogloss_%s.csv' % (tr_logloss, val_logloss),
                          index=False)
        print('Done')
    
    
    def gbdt_ffm_predict(data, category_feature, continuous_feature):
        # one-hot encode the categorical features (the original columns are kept here,
        # since FFMFormat needs them later)
        print('One-hot encoding...')
        for col in category_feature:
            onehot_feats = pd.get_dummies(data[col], prefix=col)
            data = pd.concat([data, onehot_feats], axis=1)
        print('One-hot done')
    
        feats = [col for col in data if col not in category_feature]  # onehot_feats + continuous_feature
        tmp = data[feats]
        train = tmp[tmp['Label'] != -1]
        target = train.pop('Label')
        test = tmp[tmp['Label'] == -1]
        test.drop(['Label'], axis=1, inplace=True)
    
        # train/validation split
        print('Splitting the dataset...')
        x_train, x_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=2018)
    
        print('Training the GBDT...')
        gbm = lgb.LGBMRegressor(objective='binary',
                                subsample=0.8,
                                min_child_weight=0.5,
                                colsample_bytree=0.7,
                                num_leaves=100,
                                max_depth=12,
                                learning_rate=0.05,
                                n_estimators=10,
                                )
    
        gbm.fit(x_train, y_train,
                eval_set=[(x_train, y_train), (x_val, y_val)],
                eval_names=['train', 'val'],
                eval_metric='binary_logloss',
                # early_stopping_rounds = 100,
                )
        model = gbm.booster_
        print('Extracting leaf indices...')
        gbdt_feats_train = model.predict(train, pred_leaf=True)
        gbdt_feats_test = model.predict(test, pred_leaf=True)
        gbdt_feats_name = ['gbdt_leaf_' + str(i) for i in range(gbdt_feats_train.shape[1])]
        df_train_gbdt_feats = pd.DataFrame(gbdt_feats_train, columns=gbdt_feats_name)
        df_test_gbdt_feats = pd.DataFrame(gbdt_feats_test, columns=gbdt_feats_name)
    
        print('Building the new dataset...')
        tmp = data[category_feature + continuous_feature + ['Label']]
        train = tmp[tmp['Label'] != -1]
        test = tmp[tmp['Label'] == -1]
        train = pd.concat([train, df_train_gbdt_feats], axis=1)
        test = pd.concat([test, df_test_gbdt_feats], axis=1)
        data = pd.concat([train, test])
        del train
        del test
        gc.collect()
    
        # min-max scale the continuous features
        print('Scaling continuous features...')
        scaler = MinMaxScaler()
        for col in continuous_feature:
            data[col] = scaler.fit_transform(data[col].values.reshape(-1, 1))
        print('Scaling done')
    
        data.to_csv('data/data.csv', index=False)
        return category_feature + gbdt_feats_name
    
    
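    # A note on the output (summary of the libffm text format): FFMFormat writes one
    # sample per line as
    #     <label> <field>:<feature_index>:<value> <field>:<feature_index>:<value> ...
    # Each original column is a field. Every distinct (column, value) pair of a
    # categorical column gets its own feature index with value 1, while a continuous
    # column keeps a single feature index and carries its raw value.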
    def FFMFormat(df, label, path, train_len, category_feature=[], continuous_feature=[]):
        index = df.shape[0]
        train = open(path + 'train.ffm', 'w')
        test = open(path + 'test.ffm', 'w')
        feature_index = 0
        feat_index = {}
        for i in range(index):
            feats = []
            field_index = 0
            for j, feat in enumerate(category_feature):
                t = feat + '_' + str(df[feat][i])
                if t not in feat_index.keys():
                    feat_index[t] = feature_index
                    feature_index = feature_index + 1
                feats.append('%s:%s:%s' % (field_index, feat_index[t], 1))
                field_index = field_index + 1
    
            for j, feat in enumerate(continuous_feature):
                # give each continuous feature one fixed feature index (the original
                # code incremented feature_index on every row, inflating the index space)
                if feat not in feat_index.keys():
                    feat_index[feat] = feature_index
                    feature_index = feature_index + 1
                feats.append('%s:%s:%s' % (field_index, feat_index[feat], df[feat][i]))
                field_index = field_index + 1
    
            if i < train_len:
                train.write('%s %s\n' % (df[label][i], ' '.join(feats)))
            else:
                test.write('%s\n' % (' '.join(feats)))
        train.close()
        test.close()
    
    
    if __name__ == '__main__':
        data = preProcess()
        continuous_feature = ['I'] * 13
        continuous_feature = [col + str(i + 1) for i, col in enumerate(continuous_feature)]
        category_feature = ['C'] * 26
        category_feature = [col + str(i + 1) for i, col in enumerate(category_feature)]
        # lr_predict(data, category_feature, continuous_feature)
        # gbdt_predict(data, category_feature, continuous_feature)
        # gbdt_lr_predict(data, category_feature, continuous_feature)
        category_feature = gbdt_ffm_predict(data, category_feature, continuous_feature)
    
        data = pd.read_csv('data/data.csv')
        df_train = pd.read_csv('data/train.csv')
        FFMFormat(data, 'Label', 'data/', df_train.shape[0], category_feature, continuous_feature)
    
