Customer Churn Prediction (用户流失预警)

By ForgetThatNight | Published 2018-07-07 10:53 · 43 reads
import pandas as pd
import numpy as np

churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()

print("Column names:")
print(col_names)

# Show the first six and last six columns of a few rows
to_show = col_names[:6] + col_names[-6:]

print("\nSample data:")
print(churn_df[to_show].head(6))

Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']

Sample data:
(table of the first six rows not reproduced in the original page)

# Convert the object-typed target column into a numeric label
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.', 1, 0)

# Drop features we don't need (identifiers and the target itself)
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)

# 'yes'/'no' has to be converted to boolean values
# (scaling comes later; without it sklearn would treat large-valued
# features as important and small-valued ones as meaningless)
yes_no_cols = ["Int'l Plan", "VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

# Pull out features for future use
features = churn_feat_space.columns

X = churn_feat_space.to_numpy().astype(np.float64)

# Standardize each feature to zero mean and unit variance;
# this matters for scale-sensitive models such as SVM and KNN
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("Unique target labels:", np.unique(y))
print(X[0])
# Print the number of users who did NOT churn
print(len(y[y == 0]))

Output:
Feature space holds 3333 observations and 17 features
Unique target labels: [0 1]
[ 0.67648946 -0.32758048 1.6170861 1.23488274 1.56676695 0.47664315
1.56703625 -0.07060962 -0.05594035 -0.07042665 0.86674322 -0.46549436
0.86602851 -0.08500823 -0.60119509 -0.0856905 -0.42793202]
2850
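`StandardScaler` transforms each column to zero mean and unit variance, z = (x − mean) / std, so that large-valued features no longer dominate scale-sensitive models. A minimal check on a made-up matrix (illustrative, not the churn data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative 4x2 feature matrix (not the churn data); the second
# column is on a much larger scale than the first
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0],
                   [4.0, 800.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```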

from sklearn.model_selection import KFold

def run_cv(X, y, clf_class, **kwargs):
    # 5-fold cross-validation
    kf = KFold(n_splits=5, shuffle=True)
    y_pred = y.copy()

    # Iterate through folds
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)                 # train on this fold
        y_pred[test_index] = clf.predict(X_test)  # predict the held-out fold
    return y_pred
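For reference, recent scikit-learn versions ship `cross_val_predict`, which does the same job as `run_cv` above: every sample is predicted by a model that never saw it during training. A sketch on synthetic data (the data here is a stand-in, not the churn set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the churn feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

# Out-of-fold predictions for all 300 samples in one call
y_pred = cross_val_predict(RandomForestClassifier(random_state=0),
                           X_demo, y_demo, cv=5)
print(y_pred.shape)  # (300,)
```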

Compare the predictive performance of three classifiers:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN

def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X, y, SVC)))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X, y, RF)))
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X, y, KNN)))

Output (accuracy alone is misleading here):
Support vector machines:
0.916
Random forest:
0.944
K-nearest-neighbors:
0.893
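Why accuracy is misleading: the classes are imbalanced (2850 of the 3333 users did not churn), so a trivial model that predicts "no churn" for everyone already scores about 0.855, not far below the classifiers above:

```python
# Class counts taken from the output above
n_total = 3333   # all users
n_stay = 2850    # users who did not churn

# A "classifier" that always predicts the majority class (no churn)
# gets every staying user right and every churner wrong
baseline_accuracy = n_stay / n_total
print(round(baseline_accuracy, 3))  # 0.855
```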

For churn early warning we care most about the churn class itself, i.e. the samples that actually leave.

Of the users who actually churned, how many did we catch? With churn as the positive class, recall = TP / (TP + FN).
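A minimal sketch of computing recall from a confusion matrix, using toy labels (made up for illustration, not the real data; 1 = churned, 0 = stayed):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 4 actual churners, 6 who stayed
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# Rows = actual class, columns = predicted class; ravel() gives
# the cells in the order tn, fp, fn, tp for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)               # caught 2 of the 4 actual churners
print(recall)                         # 0.5
print(recall_score(y_true, y_pred))   # same value via sklearn
```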

def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)
    y_prob = np.zeros((len(y), 2))
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        # Predict class probabilities, not hard labels
        y_prob[test_index] = clf.predict_proba(X_test)
    return y_prob
import warnings
warnings.filterwarnings('ignore')

# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
#print pred_prob[0]
pred_churn = pred_prob[:,1]
is_churn = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.value_counts(pred_churn)
#print counts

# For each predicted probability, the observed fraction that actually churned
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)

# pandas-fu: combine the two series into one table
counts = pd.concat([counts, true_prob], axis=1).reset_index()
# e.g. 733 users received a predicted churn probability of 0.1,
# and only about 2.8% of them actually churned
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
Reading the table, a predicted probability of about 0.7 looks like a reliable threshold for flagging likely churners.
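Applying such a cutoff is a one-liner: flag a user only when the predicted churn probability exceeds the threshold. A sketch with made-up probabilities (the 0.7 value follows the observation above):

```python
import numpy as np

# Illustrative predicted churn probabilities for six users
pred_churn = np.array([0.1, 0.3, 0.5, 0.7, 0.8, 1.0])

# Flag only high-confidence churn candidates; raising the threshold
# trades recall for precision on the churn class
threshold = 0.7
flagged = pred_churn > threshold
print(flagged)  # [False False False False  True  True]
```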

Original post: https://www.haomeiwen.com/subject/rifiuftx.html