Customer Churn Prediction (用户流失预警)

By ForgetThatNight | Published 2018-07-07 10:53 · 43 reads
import pandas as pd
import numpy as np

churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()

print("Column names:")
print(col_names)

# Show the first six and last six columns of a few rows
to_show = col_names[:6] + col_names[-6:]

print("\nSample data:")
print(churn_df[to_show].head(6))

Column names:
['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']

Sample data:
(table of the first six rows not reproduced in the original page)

# Convert the object-typed target column into a numeric label
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.', 1, 0)

# Drop features we don't need (identifiers and the target itself)
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)

# 'yes'/'no' has to be converted to boolean values
# (scaling comes later; without it sklearn would treat large-valued
# features as important and small-valued ones as meaningless)
yes_no_cols = ["Int'l Plan", "VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'

# Pull out features for future use
features = churn_feat_space.columns

X = churn_feat_space.to_numpy().astype(np.float64)

# Standardize each feature to zero mean and unit variance;
# this matters for scale-sensitive models such as SVM and KNN
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("Unique target labels:", np.unique(y))
print(X[0])
# Print the number of users who did NOT churn
print(len(y[y == 0]))

Output:
Feature space holds 3333 observations and 17 features
Unique target labels: [0 1]
[ 0.67648946 -0.32758048 1.6170861 1.23488274 1.56676695 0.47664315
1.56703625 -0.07060962 -0.05594035 -0.07042665 0.86674322 -0.46549436
0.86602851 -0.08500823 -0.60119509 -0.0856905 -0.42793202]
2850
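`StandardScaler` transforms each column to zero mean and unit variance, z = (x − mean) / std, so that large-valued features no longer dominate scale-sensitive models. A minimal check on a made-up matrix (illustrative, not the churn data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative 4x2 feature matrix (not the churn data); the second
# column is on a much larger scale than the first
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0],
                   [4.0, 800.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]
```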

from sklearn.model_selection import KFold

def run_cv(X, y, clf_class, **kwargs):
    # 5-fold cross-validation
    kf = KFold(n_splits=5, shuffle=True)
    y_pred = y.copy()

    # Iterate through folds
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)                 # train on this fold
        y_pred[test_index] = clf.predict(X_test)  # predict the held-out fold
    return y_pred
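For reference, recent scikit-learn versions ship `cross_val_predict`, which does the same job as `run_cv` above: every sample is predicted by a model that never saw it during training. A sketch on synthetic data (the data here is a stand-in, not the churn set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the churn feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

# Out-of-fold predictions for all 300 samples in one call
y_pred = cross_val_predict(RandomForestClassifier(random_state=0),
                           X_demo, y_demo, cv=5)
print(y_pred.shape)  # (300,)
```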

Compare the predictive performance of three classifiers:

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN

def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X, y, SVC)))
print("Random forest:")
print("%.3f" % accuracy(y, run_cv(X, y, RF)))
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X, y, KNN)))

Output (accuracy alone is misleading here):
Support vector machines:
0.916
Random forest:
0.944
K-nearest-neighbors:
0.893
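Why accuracy is misleading: the classes are imbalanced (2850 of the 3333 users did not churn), so a trivial model that predicts "no churn" for everyone already scores about 0.855, not far below the classifiers above:

```python
# Class counts taken from the output above
n_total = 3333   # all users
n_stay = 2850    # users who did not churn

# A "classifier" that always predicts the majority class (no churn)
# gets every staying user right and every churner wrong
baseline_accuracy = n_stay / n_total
print(round(baseline_accuracy, 3))  # 0.855
```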

For churn early warning we care most about the churn class itself, i.e. the samples that actually leave.

Of the users who actually churned, how many did we catch? With churn as the positive class, recall = TP / (TP + FN).
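A minimal sketch of computing recall from a confusion matrix, using toy labels (made up for illustration, not the real data; 1 = churned, 0 = stayed):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 4 actual churners, 6 who stayed
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# Rows = actual class, columns = predicted class; ravel() gives
# the cells in the order tn, fp, fn, tp for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)               # caught 2 of the 4 actual churners
print(recall)                         # 0.5
print(recall_score(y_true, y_pred))   # same value via sklearn
```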

def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)
    y_prob = np.zeros((len(y), 2))
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        # Predict class probabilities, not hard labels
        y_prob[test_index] = clf.predict_proba(X_test)
    return y_prob
import warnings
warnings.filterwarnings('ignore')

# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
#print pred_prob[0]
pred_churn = pred_prob[:,1]
is_churn = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.value_counts(pred_churn)
#print counts

# For each predicted probability, the observed fraction that actually churned
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)

# pandas-fu: combine the two series into one table
counts = pd.concat([counts, true_prob], axis=1).reset_index()
# e.g. 733 users received a predicted churn probability of 0.1,
# and only about 2.8% of them actually churned
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
Reading the table, a predicted probability of about 0.7 looks like a reliable threshold for flagging likely churners.
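Applying such a cutoff is a one-liner: flag a user only when the predicted churn probability exceeds the threshold. A sketch with made-up probabilities (the 0.7 value follows the observation above):

```python
import numpy as np

# Illustrative predicted churn probabilities for six users
pred_churn = np.array([0.1, 0.3, 0.5, 0.7, 0.8, 1.0])

# Flag only high-confidence churn candidates; raising the threshold
# trades recall for precision on the churn class
threshold = 0.7
flagged = pred_churn > threshold
print(flagged)  # [False False False False  True  True]
```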

Original post: https://www.haomeiwen.com/subject/rifiuftx.html