Tianchi O2O Newcomer Challenge (1): Data Preprocessing

Author: 刘爱玛 | Published: 2019-03-29 15:29

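This post lays some groundwork for entering the challenge and covers a few basics. It is aimed at complete beginners; experienced readers can skip it.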

Competition page: https://tianchi.aliyun.com/competition/entrance/231593/information

1. ROC and AUC concepts
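
As a minimal sketch of the AUC concept: AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name pairwise_auc and the toy labels and scores below are made up for illustration and are not part of the competition code.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (len(pos) * len(neg)))

# Toy labels and scores (illustrative only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)
```

The pairwise form is quadratic in the number of samples, so it is only meant to show the idea on small inputs.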

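2. Loading the data and a first look

First, import numpy and pandas.
(1) Read the data with pandas' read_csv(). Use the DataFrame's .head() method to look at the first 5 rows, and .info() to get an overview, such as how many fields there are and the type of each.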

import numpy as np
import pandas as pd

dftest = pd.read_csv('data/ccf_offline_stage1_test_revised.csv')
dfoff = pd.read_csv('data/ccf_offline_stage1_train.csv')
dfon = pd.read_csv('data/ccf_online_stage1_train.csv')

dfoff.head()

Result:
User_id Merchant_id Coupon_id Discount_rate Distance Date_received Date
0 1439408 2632 NaN NaN 0.0 NaN 20160217.0
1 1439408 4663 11002.0 150:20 1.0 20160528.0 NaN
2 1439408 2632 8591.0 20:1 0.0 20160217.0 NaN
3 1439408 2632 1078.0 20:1 0.0 20160319.0 NaN
4 1439408 2632 8591.0 20:1 0.0 20160613.0 NaN

dfoff.info()

Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id float64
Discount_rate object
Distance float64
Date_received float64
Date float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB

NOTE: These two methods are the most general ones. Nearly every time you load a new dataset, .head() and .info() give a quick first look at it.

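(2) Next, combine the data with an understanding of the business to dig a little deeper. With this problem and dataset in hand, a few questions come up: first, do the online and offline behavior data cover the same group of users; second, do the training and test sets cover the same group of users; third, how many users are useful for training, which here means how many users both received and used a coupon. Python's indexing and slicing tools can answer questions like these.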

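print('Users in the offline training set:', len(set(dfoff['User_id'])))
print('Offline training users who also appear in the online data:', len(set(dfoff['User_id']) & set(dfon['User_id'])))
# Missing values are read as NaN, so test with notnull() rather than comparing against the string 'null'
print('Test-set users who never received a coupon:', len(set(dftest['User_id']) - set(dftest[dftest['Date_received'].notnull()].User_id)))

Result:
Users in the offline training set: 539438
Offline training users who also appear in the online data: 267448
Test-set users who never received a coupon: 0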

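Note: About 270,000 of the offline users also have online purchase behavior, so online purchase preference can serve as a feature. Every user in the test set has received a coupon, so the data quality is fairly good.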
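
The second question above (whether the test users also appear in the training data) can be checked with the same set operations. Below is a minimal sketch on toy frames; toy_off, toy_on and toy_test are made-up stand-ins for dfoff, dfon and dftest, and the IDs are invented.

```python
import pandas as pd

# Toy stand-ins for dfoff, dfon and dftest (invented IDs).
toy_off = pd.DataFrame({'User_id': [1, 1, 2, 3]})
toy_on = pd.DataFrame({'User_id': [2, 3, 3, 4]})
toy_test = pd.DataFrame({'User_id': [1, 5]})

off_users = set(toy_off['User_id'])
on_users = set(toy_on['User_id'])
test_users = set(toy_test['User_id'])

both = off_users & on_users        # users seen both offline and online
unseen = test_users - off_users    # test users absent from the offline training data
print(len(off_users), len(both), len(unseen))
```

On the real data, replacing the toy frames with dfoff, dfon and dftest should give the corresponding counts.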

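Next, there is a very simple piece of business logic to understand: a customer can take three actions (receive a coupon, use a coupon, make a purchase), and each record in the table combines these three actions.
The problem statement includes this passage:
Purchase date: if Date=null & Coupon_id != null, the record means a coupon was received but not used, i.e. a negative sample; if Date!=null & Coupon_id = null, it is the date of an ordinary purchase; if Date!=null & Coupon_id != null, it is the date of a purchase made with a coupon, i.e. a positive sample.
Let's count the records of each kind.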

print('Coupon received but not used (negative samples):', len(dfoff[dfoff['Date'].isnull() & dfoff['Coupon_id'].notnull()]))
print('Ordinary purchases without a coupon:', len(dfoff[dfoff['Date'].notnull() & dfoff['Coupon_id'].isnull()]))
print('Coupon received and used (positive samples):', len(dfoff[dfoff['Date'].notnull() & dfoff['Coupon_id'].notnull()]))
print('Records with neither coupon nor purchase:', len(dfoff[dfoff['Date'].isnull() & dfoff['Coupon_id'].isnull()]))

Output:
Coupon received but not used (negative samples): 977900
Ordinary purchases without a coupon: 701602
Coupon received and used (positive samples): 75382
Records with neither coupon nor purchase: 0
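
The four counts can also be produced in one pass by tagging each record with its category. This is a sketch on a toy frame, assuming the same Coupon_id and Date columns; the category names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy records with the two relevant columns (invented values).
toy = pd.DataFrame({
    'Coupon_id': [np.nan, 11002.0, 8591.0, 1078.0],
    'Date': [20160217.0, np.nan, 20160301.0, np.nan],
})
has_coupon = toy['Coupon_id'].notnull().to_numpy()
has_purchase = toy['Date'].notnull().to_numpy()

# Each record falls into exactly one of the four combinations.
kind = np.select(
    [has_coupon & ~has_purchase, ~has_coupon & has_purchase, has_coupon & has_purchase],
    ['coupon_unused', 'plain_purchase', 'coupon_used'],
    default='neither',
)
counts = pd.Series(kind).value_counts()
print(counts)
```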

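3. Processing the Discount_rate and Distance fields

Continuing from the previous part, this part carries on the initial analysis of the O2O competition data and adds some necessary data processing on top of the observations so far.

First, use .info() to look at the data type of each field in the training sets. From here on the code uses the names offline_train, offline_test and online_train, which presumably refer to the DataFrames loaded above as dfoff, dftest and dfon.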
offline_train:

offline_train.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id float64
Discount_rate object
Distance float64
Date_received float64
Date float64
dtypes: float64(4), int64(2), object(1)
memory usage: 93.7+ MB

online_train:

online_train.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11429826 entries, 0 to 11429825
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Action int64
Coupon_id object
Discount_rate object
Date_received float64
Date float64
dtypes: float64(2), int64(3), object(2)
memory usage: 610.4+ MB

So the first step is to convert the object-typed fields into numeric form.
1. Processing discount_rate
Current values of discount_rate:

offline_train['Discount_rate'].unique()

Output:
array([nan, '150:20', '20:1', '200:20', '30:5', '50:10', '10:5', '100:10',
'200:30', '20:5', '30:10', '50:5', '150:10', '100:30', '200:50',
'100:50', '300:30', '50:20', '0.9', '10:1', '30:1', '0.95',
'100:5', '5:1', '100:20', '0.8', '50:1', '200:10', '300:20',
'100:1', '150:30', '300:50', '20:10', '0.85', '0.6', '150:50',
'0.75', '0.5', '200:5', '0.7', '30:20', '300:10', '0.2', '50:30',
'200:100', '150:5'], dtype=object)

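The values include decimals, ratios, and nan. Checking with type() shows that the decimals and ratios are str while nan is a float. They need to be converted to numbers without losing information, so we add three fields: discount_man, discount_jian, and discount_new_rate. Here man is the spending threshold and jian is the amount taken off, so '150:20' means 20 off a purchase of 150 or more.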

import math
from numpy import nan as NaN
from pandas import DataFrame
from pandas import Series
import numpy as np

def get_discount_man(row):
    # 'man:jian' coupons: return the spending threshold (man)
    if isinstance(row, str) and ':' in row:
        return int(row.split(':')[0])
    # plain discount rates such as '0.9' have no threshold
    elif isinstance(row, str) and '.' in row:
        return 0
    # NaN means there is no coupon
    elif not isinstance(row, str) and math.isnan(row):
        return 0
    else:
        print("something unexpected", row, type(row))
        return 0

def get_discount_jian(row):
    # 'man:jian' coupons: return the amount taken off (jian)
    if isinstance(row, str) and ':' in row:
        return int(row.split(':')[1])
    # plain discount rates such as '0.9' have no fixed reduction
    elif isinstance(row, str) and '.' in row:
        return 0
    # NaN means there is no coupon
    elif not isinstance(row, str) and math.isnan(row):
        return 0
    else:
        print("something unexpected", row, type(row))
        return 0

def get_discount_rate(row):
    # 'man:jian' coupons: convert to an equivalent rate, e.g. '150:20' -> 1 - 20/150
    if isinstance(row, str) and ':' in row:
        man, jian = row.split(':')
        return 1 - float(jian) / float(man)
    # plain discount rates are used as they are
    elif isinstance(row, str) and '.' in row:
        return float(row)
    # NaN means there is no coupon, i.e. full price
    elif not isinstance(row, str) and math.isnan(row):
        return 1
    else:
        print("something unexpected", row, type(row))
        return 0

def processData(offline):
    offline['discount_man'] = offline['Discount_rate'].apply(get_discount_man)
    offline['discount_jian'] = offline['Discount_rate'].apply(get_discount_jian)
    offline['discount_new_rate'] = offline['Discount_rate'].apply(get_discount_rate)
    return offline

offline_train = processData(offline_train)
offline_test = processData(offline_test)
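
The row-by-row functions above can be slow on 1.7 million rows. As a sketch of a vectorized alternative that aims to produce the same three columns, one could use the pandas string methods. The helper name parse_discount is hypothetical, and the sample values are taken from the unique() output above.

```python
import numpy as np
import pandas as pd

def parse_discount(col):
    """Hypothetical vectorized counterpart of the three get_discount_* functions."""
    # Two capture groups; both are NaN where the value is not of the form 'man:jian'.
    full = col.str.extract(r'^(\d+):(\d+)$').astype(float)
    is_full = full[0].notnull()
    # Plain rates such as '0.9' are parsed directly; everything else becomes NaN here.
    direct = pd.to_numeric(col.where(~is_full), errors='coerce')
    rate = (1 - full[1] / full[0]).where(is_full, direct).fillna(1.0)
    return pd.DataFrame({
        'discount_man': full[0].fillna(0).astype(int),
        'discount_jian': full[1].fillna(0).astype(int),
        'discount_new_rate': rate,
    })

sample = pd.Series([np.nan, '150:20', '20:1', '0.9'])
parsed = parse_discount(sample)
print(parsed)
```

Unlike the original functions, this sketch does not print a warning for unexpected values; they silently end up with a rate of 1.0.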

With Discount_rate done, Distance also needs handling: although its dtype is float throughout, it contains NaN values, which need to be replaced with -1.

offline_train['Distance'] = offline_train['Distance'].replace(NaN, -1.0)

That completes the processing of Discount_rate and Distance.
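
A small sketch of the same replacement using fillna(), shown on a toy column. The extra distance_missing flag is a hypothetical addition, not part of the original pipeline; it keeps the "missing" information explicit alongside the -1 sentinel. The same step would presumably also be applied to the test set.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Distance': [0.0, 1.0, np.nan, 10.0]})

# Hypothetical helper column: 1 where the distance was missing, else 0.
toy['distance_missing'] = toy['Distance'].isnull().astype(int)
# Replace missing distances with the sentinel value -1.
toy['Distance'] = toy['Distance'].fillna(-1.0)
print(toy)
```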

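4. Parsing the date fields

The data has two date fields: the coupon receipt date Date_received and the purchase date Date. First look at the distribution and format of these dates.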

date_received = offline_train['Date_received'].unique()
date_received = Series(date_received)
print(date_received)

Output:
0 NaN
1 20160528.0
2 20160217.0
3 20160319.0
4 20160613.0
5 20160516.0
6 20160429.0
7 20160129.0
8 20160530.0
9 20160519.0
10 20160606.0
11 20160207.0
12 20160421.0
13 20160130.0
14 20160412.0
15 20160518.0
16 20160327.0
17 20160127.0
18 20160215.0
19 20160524.0
20 20160523.0
21 20160515.0
22 20160521.0
23 20160114.0
24 20160321.0
25 20160426.0
26 20160409.0
27 20160326.0
28 20160322.0
29 20160131.0
...
138 20160104.0
139 20160113.0
140 20160108.0
141 20160115.0
142 20160513.0
143 20160208.0
144 20160612.0
145 20160419.0
146 20160103.0
147 20160312.0
148 20160209.0
149 20160529.0
150 20160119.0
151 20160227.0
152 20160315.0
153 20160304.0
154 20160216.0
155 20160507.0
156 20160311.0
157 20160320.0
158 20160102.0
159 20160106.0
160 20160224.0
161 20160219.0
162 20160111.0
163 20160310.0
164 20160307.0
165 20160221.0
166 20160226.0
167 20160309.0
Length: 168, dtype: float64

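Besides NaN there are 167 distinct dates. NaN means the customer did not receive a coupon. The date features are built in two steps: first convert Date_received to a date type plus NaN, then add weekday features.

5. Converting the date types

Convert Date_received and Date to a date type plus NaN.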

from datetime import date

def getDateType(row):
    if math.isnan(row):
        return row
    else:
        str_row = str(row)
        return date(int(str_row[0 : 4]), int(str_row[4:6]), int(str_row[6:8]))

offline_train['date_received_new'] = offline_train['Date_received'].apply(getDateType)
offline_train['date_new'] = offline_train['Date'].apply(getDateType)
offline_test['date_received_new'] = offline_test['Date_received'].apply(getDateType)
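
As a sketch of an alternative, the same conversion can be done without a row-wise apply by going through pd.to_datetime. Note the assumption: the result is a datetime64 column with NaT for missing values rather than datetime.date objects plus NaN, so later steps that test for float would need adapting.

```python
import numpy as np
import pandas as pd

raw = pd.Series([np.nan, 20160528.0, 20160217.0])

# Drop missing values, turn 20160528.0 into '20160528', then parse.
text = raw.dropna().astype(int).astype(str)
parsed = pd.to_datetime(text, format='%Y%m%d')
# Reindexing restores the dropped rows as NaT.
parsed = parsed.reindex(raw.index)
print(parsed)
```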

6. Adding weekday features

Add a weekday-or-weekend feature (weekday_type: {0, 1}) and a day-of-week feature (weekday: {1~7}).
Start with the day-of-week feature:

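def getWeekday(row):
    # NaN (a float) marks a missing date and is passed through unchanged
    if isinstance(row, float):
        return row
    # date.weekday() is 0 for Monday, so add 1 to get 1 (Monday) to 7 (Sunday)
    return row.weekday() + 1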
offline_train['weekday_received'] = offline_train['date_received_new'].apply(getWeekday)
offline_test['weekday_received'] = offline_test['date_received_new'].apply(getWeekday)
offline_train['weekday_buy'] = offline_train['date_new'].apply(getWeekday)

Then add the weekday-or-weekend feature:

offline_train['weekday_type_received'] = offline_train['weekday_received'].apply(lambda x : 1 if x == 6 or x == 7 else 0)
offline_train['weekday_type_buy'] = offline_train['weekday_buy'].apply(lambda x : 1 if x == 6 or x == 7 else 0)
offline_test['weekday_type_received'] = offline_test['weekday_received'].apply(lambda x : 1 if x == 6 or x == 7 else 0)
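
If the dates are held as a datetime64 column (an assumption; the code above stores datetime.date objects), both features can be sketched with the .dt accessor instead of apply. The sample dates are taken from the Date_received values shown earlier.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-05-28', None, '2016-02-17']))

# dayofweek is 0 for Monday, so add 1 to get 1 (Monday) to 7 (Sunday); missing dates give NaN.
weekday = dates.dt.dayofweek + 1
# 1 for Saturday or Sunday, 0 otherwise (missing dates also map to 0, as in the lambda above).
is_weekend = weekday.isin([6, 7]).astype(int)
print(weekday.tolist(), is_weekend.tolist())
```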

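Convert the day-of-week feature to one-hot format. The following explanation of one-hot is very concise.

The basic idea of one-hot: treat each value of a discrete feature as a state. If the feature has N distinct values, it can be abstracted into N different states, and one-hot encoding guarantees that each value activates exactly one state; that is, of the N state bits only one is 1 and all the others are 0. As an example, take education level with five categories: primary school, secondary school, university, master's, and PhD. One-hot encoding gives each category a five-position vector with a single 1, e.g. primary school = [1, 0, 0, 0, 0] and PhD = [0, 0, 0, 0, 1].

Author: 古怪地区
Link: https://www.jianshu.com/p/5f8782bf15b1
Source: Jianshu (简书)
Copyright belongs to the author. For any form of reproduction, contact the author for permission and credit the source.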

Converting to one-hot format uses the pandas function get_dummies(); just pass in the field to convert. To keep the data readable, rename the resulting columns after calling get_dummies.

weekdaycols = [ 'weekday_' + str(i) for i in [1, 2, 3, 4, 5, 6, 7]]

data_weekday = pd.get_dummies(offline_train['weekday_received'])
data_weekday.columns = weekdaycols
offline_train[weekdaycols] = data_weekday

data_weekday = pd.get_dummies(offline_test['weekday_received'])
data_weekday.columns = weekdaycols
offline_test[weekdaycols] = data_weekday
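
Renaming the get_dummies() columns by position only works when all seven weekdays occur in the data. As a sketch of a more defensive variant (not the author's method), the seven columns can be built explicitly by comparison, so the names always match the values:

```python
import numpy as np
import pandas as pd

# Toy weekday column; NaN stands for a record with no coupon.
weekday = pd.Series([6.0, np.nan, 3.0, 6.0])

# One column per weekday; NaN compares unequal to everything, so it yields an all-zero row.
onehot = pd.DataFrame({
    'weekday_' + str(d): (weekday == d).astype(int) for d in range(1, 8)
})
print(onehot)
```

With this form, a dataset that happens to contain no Sundays still gets a weekday_7 column filled with zeros.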

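7. Labeling the data

Split the data into three classes:

The first is coupons received and used within 15 days, i.e. Date_received != null and Date - Date_received <= 15: y = 1
The second is records with no coupon, i.e. Date_received == null: y = -1
The third is everything else, i.e. coupons received but not used within 15 days: y = 0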

def getLabel(row):
    if type(row['date_received_new']) == float and math.isnan(row['date_received_new']):
        return -1
    elif type(row['date_new']) == date and row['date_new'] - row['date_received_new']  <= pd.Timedelta(15, 'D'):
        return 1
    else:
        return 0
label = offline_train.apply(getLabel, axis = 1)
offline_train['label'] = label
print(offline_train['label'].value_counts())

Output:
0 988887
-1 701602
1 64395
Name: label, dtype: int64
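
The same three-way rule can be sketched without a row-wise apply, assuming the two date columns are datetime64 with NaT for missing values (the code above uses datetime.date objects instead). The toy dates are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Date_received': pd.to_datetime([None, '2016-05-01', '2016-05-01', '2016-05-01']),
    'Date': pd.to_datetime(['2016-02-17', '2016-05-10', '2016-06-01', None]),
})

# The gap is NaT when either date is missing, and NaT never satisfies <=.
gap = toy['Date'] - toy['Date_received']
no_coupon = toy['Date_received'].isnull().to_numpy()
used_in_time = (gap <= pd.Timedelta(days=15)).to_numpy()

# Conditions are checked in order: no coupon -> -1, used within 15 days -> 1, else 0.
label = np.select([no_coupon, used_in_time], [-1, 1], default=0)
print(label.tolist())
```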

That completes the data processing. Here is what the training data looks like:

print('Existing columns', offline_train.columns.tolist())

Existing columns ['User_id', 'Merchant_id', 'Coupon_id', 'Discount_rate', 'Distance', 'Date_received', 'Date', 'discount_man', 'discount_jian', 'discount_new_rate', 'date_received_new', 'date_new', 'weekday', 'weekday_received', 'weekday_buy', 'weekday_type_received', 'weekday_type_buy', 'weekday_1', 'weekday_2', 'weekday_3', 'weekday_4', 'weekday_5', 'weekday_6', 'weekday_7', 'label']

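Python statements from this part worth remembering:

1. !ls data # list the files under this path
2. pd.read_csv(path) # read the csv file at the given path
3. info() # show information about a pandas object
4. set(x) - set(y) # convert two fields to sets and take the difference (in x but not in y)
5. isinstance(data, type) # check whether data is of the given type; returns a bool
6. math.isnan(data) # check whether a float is NaN; returns a bool
7. data.apply(function) # apply function to each element of data; returns the results as a Series
8. data.replace(a, b) # replace a with b in data
9. date.weekday()+1 # day of the week for a date, 1 (Monday) to 7 (Sunday)
10. pd.get_dummies(data) # return data in one-hot format
11. date1-date2 <= pd.Timedelta(15, 'D') # check whether two dates are at most 15 days apart
12. Series.value_counts() # count the occurrences of each value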

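Lessons from this part

1. After getting the data, first study and observe it to learn the meaning and type of each field.
2. Compare the test set against the training set to make sure the entities in the test set exist (or mostly exist) in the training set.
3. Always analyze and clean every column, converting values that cannot be compared (such as nulls or inconsistently formatted values).
4. When analyzing purchase data, consider day-of-week and weekday-or-weekend features.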

Reference blog: https://tianchi.aliyun.com/notebook-ai/detail?spm=5176.12586969.1002.3.292844f2tqQhXQ&postId=4796
