Predicting Daily Bike-Share Counts for 2012
1. Task Description
Perform a regression analysis on the bicycle data provided by Capital Bikeshare (a bike-sharing company in Washington, D.C., USA). The training data cover 2011; the task is to predict the number of daily bike shares for each day of 2012.
Original dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
File descriptions
day.csv: daily bike-share counts (only this file is needed)
hour.csv: hourly bike-share counts (not used)
readme: data description file
Field descriptions
instant: record index
dteday: date
season: season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
yr: year (0 = 2011, 1 = 2012)
mnth: month (1 to 12)
hr: hour (0 to 23; only present in hour.csv, ignore this field)
holiday: whether the day is a holiday
weekday: day of the week, 0 to 6
workingday: whether the day is a working day (1 = working day, 0 = weekend or holiday)
weathersit: weather (1 = clear or partly cloudy; 2 = mist or overcast; 3 = light snow or light rain; 4 = heavy rain, heavy snow, or thick fog)
temp: temperature in Celsius (the values in day.csv are scaled to [0, 1])
atemp: apparent ("feels like") temperature (also scaled)
hum: humidity
windspeed: wind speed
casual: count of casual (unregistered) users
registered: count of registered users
cnt: total rentals for the given day (or hour, in hour.csv); the response variable y
cnt is the target to predict.
Import required packages
# Import required packages
# Data loading and basic processing
import numpy as np
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Date feature processing
import time
import datetime
# Model
from sklearn.linear_model import LinearRegression
# Model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score  # scores the fit of a regression model
%matplotlib inline
Read the data
data = pd.read_csv("day.csv")
Train/test split
Using the yr field, take the 2011 rows (yr == 0) as the training set trainData and the 2012 rows (yr == 1) as the test set testData.
trainData = data[data.loc[:,'yr'] == 0].copy()
testData = data[data.loc[:,'yr']==1].copy()
trainData.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 365 entries, 0 to 364
Data columns (total 16 columns):
instant 365 non-null int64
dteday 365 non-null object
season 365 non-null int64
yr 365 non-null int64
mnth 365 non-null int64
holiday 365 non-null int64
weekday 365 non-null int64
workingday 365 non-null int64
weathersit 365 non-null int64
temp 365 non-null float64
atemp 365 non-null float64
hum 365 non-null float64
windspeed 365 non-null float64
casual 365 non-null int64
registered 365 non-null int64
cnt 365 non-null int64
dtypes: float64(4), int64(11), object(1)
memory usage: 48.5+ KB
Feature engineering (and data exploration)
Convert dteday into a day-of-year counter dayCount, turning the date string into an integer the model can use, then drop dteday.
def getInterval_train(df):
    date = time.strptime(df['dteday'], "%Y-%m-%d")
    date1 = datetime.datetime(2011, 1, 1)
    date = datetime.datetime(date[0], date[1], date[2])
    return date - date1

Interval = trainData.apply(lambda r: getInterval_train(r), axis=1).dt.days.copy()
trainData['dayCount'] = Interval
trainData = trainData.drop(['dteday', 'instant', 'yr'], axis=1)

def getInterval_test(df):
    date = time.strptime(df['dteday'], "%Y-%m-%d")
    date1 = datetime.datetime(2012, 1, 1)
    date = datetime.datetime(date[0], date[1], date[2])
    return date - date1

Interval = testData.apply(lambda r: getInterval_test(r), axis=1).dt.days.copy()
testData['dayCount'] = Interval
testData = testData.drop(['dteday', 'instant', 'yr'], axis=1)
# mydate = pd.to_numeric(data["dteday"].str.replace('-',''))
# year = mydate//10000
# month = (mydate-year*10000)//100
# date = mydate%100
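The commented-out digit-splitting idea above can be done more directly with pandas' datetime tools. A minimal sketch on a synthetic frame (only the column name `dteday` comes from the dataset):

```python
import pandas as pd

# Vectorized day-count: parse the date strings once, subtract the base date.
df = pd.DataFrame({'dteday': ['2011-01-01', '2011-02-01', '2011-12-31']})
df['dayCount'] = (pd.to_datetime(df['dteday'])
                  - pd.Timestamp('2011-01-01')).dt.days
print(df['dayCount'].tolist())  # [0, 31, 364]
```

This avoids the per-row apply and the duplicated helper functions, at the cost of still hard-coding the base date per split.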
Summary statistics of each attribute
trainData.describe()
 | season | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | dayCount
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 | 365.000000 |
mean | 2.498630 | 6.526027 | 0.027397 | 3.008219 | 0.684932 | 1.421918 | 0.486665 | 0.466835 | 0.643665 | 0.191403 | 677.402740 | 2728.358904 | 3405.761644 | 182.000000 |
std | 1.110946 | 3.452584 | 0.163462 | 2.006155 | 0.465181 | 0.571831 | 0.189596 | 0.168836 | 0.148744 | 0.076890 | 556.269121 | 1060.110413 | 1378.753666 | 105.510663 |
min | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.059130 | 0.079070 | 0.000000 | 0.022392 | 9.000000 | 416.000000 | 431.000000 | 0.000000 |
25% | 2.000000 | 4.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.325000 | 0.321954 | 0.538333 | 0.135583 | 222.000000 | 1730.000000 | 2132.000000 | 91.000000 |
50% | 3.000000 | 7.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 0.479167 | 0.472846 | 0.647500 | 0.186900 | 614.000000 | 2915.000000 | 3740.000000 | 182.000000 |
75% | 3.000000 | 10.000000 | 0.000000 | 5.000000 | 1.000000 | 2.000000 | 0.656667 | 0.612379 | 0.742083 | 0.235075 | 871.000000 | 3632.000000 | 4586.000000 | 273.000000 |
max | 4.000000 | 12.000000 | 1.000000 | 6.000000 | 1.000000 | 3.000000 | 0.849167 | 0.840896 | 0.972500 | 0.507463 | 3065.000000 | 4614.000000 | 6043.000000 | 364.000000 |
Min-max normalize the numeric attributes
numerical_features = ['temp', 'atemp', 'hum', 'windspeed']
numerical_features_nor = ['temp_nor', 'atemp_nor', 'hum_nor', 'windspeed_nor']
for col in numerical_features:
    temp = trainData[col].copy()
    temp = (temp - temp.min()) / (temp.max() - temp.min())
    trainData[col + '_nor'] = temp
for col in numerical_features:
    temp = testData[col].copy()
    temp = (temp - temp.min()) / (temp.max() - temp.min())
    testData[col + '_nor'] = temp
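Note that the second loop scales the test set with the test set's own min and max, so the two sets end up on slightly different scales. A common alternative is to reuse the training statistics for both sets. A sketch with synthetic data (the column name is illustrative):

```python
import pandas as pd

# Min-max scaling that reuses the *training* min/max for the test set.
train = pd.DataFrame({'temp': [0.0, 0.5, 1.0]})
test = pd.DataFrame({'temp': [0.25, 1.5]})
lo, hi = train['temp'].min(), train['temp'].max()
train['temp_nor'] = (train['temp'] - lo) / (hi - lo)
test['temp_nor'] = (test['temp'] - lo) / (hi - lo)  # same lo/hi as training
print(test['temp_nor'].tolist())  # [0.25, 1.5]
```

Test values may then fall outside [0, 1], which is expected and harmless for a linear model.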
trainData.head()
 | season | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | dayCount | temp_nor | atemp_nor | hum_nor | windspeed_nor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 | 0 | 0.360789 | 0.373517 | 0.828620 | 0.284606 |
1 | 1 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 | 1 | 0.385232 | 0.360541 | 0.715771 | 0.466215 |
2 | 1 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 | 2 | 0.173705 | 0.144830 | 0.449638 | 0.465740 |
3 | 1 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 | 3 | 0.178308 | 0.174649 | 0.607131 | 0.284297 |
4 | 1 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 | 4 | 0.212429 | 0.197158 | 0.449313 | 0.339143 |
Histograms and scatter plots to check the normalization
myShow = numerical_features
for col in myShow:
    plt.figure(figsize=(12, 12))
    plt.subplot(2, 2, 1)
    sns.distplot(trainData[col], bins=30, kde=False)  # histplot in newer seaborn
    plt.title("Distribution of %s" % col)
    plt.subplot(2, 2, 2)
    plt.scatter(range(trainData.shape[0]), trainData[col].values, color='purple')
    plt.title("Scatter of %s" % col)
(four figures: distribution histogram and scatter plot for each raw feature)
myShow = numerical_features_nor
for col in myShow:
    plt.figure(figsize=(12, 12))
    plt.subplot(2, 2, 1)
    sns.distplot(trainData[col], bins=30, kde=False)
    plt.title("Distribution of %s" % col)
    plt.subplot(2, 2, 2)
    plt.scatter(range(trainData.shape[0]), trainData[col].values, color='purple')
    plt.title("Scatter of %s" % col)
(four figures: distribution histogram and scatter plot for each normalized feature)
Drop the un-normalized features
trainData = trainData.drop(numerical_features,axis=1)
testData = testData.drop(numerical_features,axis=1)
trainData.head()
 | season | mnth | holiday | weekday | workingday | weathersit | casual | registered | cnt | dayCount | temp_nor | atemp_nor | hum_nor | windspeed_nor
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 1 | 0 | 6 | 0 | 2 | 331 | 654 | 985 | 0 | 0.360789 | 0.373517 | 0.828620 | 0.284606 |
1 | 1 | 1 | 0 | 0 | 0 | 2 | 131 | 670 | 801 | 1 | 0.385232 | 0.360541 | 0.715771 | 0.466215 |
2 | 1 | 1 | 0 | 1 | 1 | 1 | 120 | 1229 | 1349 | 2 | 0.173705 | 0.144830 | 0.449638 | 0.465740 |
3 | 1 | 1 | 0 | 2 | 1 | 1 | 108 | 1454 | 1562 | 3 | 0.178308 | 0.174649 | 0.607131 | 0.284297 |
4 | 1 | 1 | 0 | 3 | 1 | 1 | 82 | 1518 | 1600 | 4 | 0.212429 | 0.197158 | 0.449313 | 0.339143 |
Heatmap of feature correlations
data_corr = trainData.corr().abs()
plt.figure(figsize=(13, 9))
sns.heatmap(data_corr,annot=True)
plt.show()
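Beyond the heatmap, sorting features by absolute correlation with cnt gives a quick ranking of candidate predictors. A sketch on a small synthetic frame (column names mirror the dataset; the values are made up):

```python
import pandas as pd

# Rank features by |correlation| with the target; complements the heatmap.
df = pd.DataFrame({'temp_nor': [0.1, 0.4, 0.7, 0.9],
                   'hum_nor': [0.9, 0.6, 0.8, 0.5],
                   'cnt': [1000, 2500, 4200, 5500]})
rank = df.corr()['cnt'].abs().drop('cnt').sort_values(ascending=False)
print(rank.index[0])  # temp_nor
```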
(correlation heatmap)
Handle the categorical features
categorical_features = ['season', 'mnth', 'weathersit', 'weekday']
for col in categorical_features:
    print("\nValue counts of %s" % col)
    print(trainData[col].value_counts())
    trainData[col] = trainData[col].astype('object')
    testData[col] = testData[col].astype('object')
Value counts of season
3 94
2 92
1 90
4 89
Name: season, dtype: int64
Value counts of mnth
12 31
10 31
8 31
7 31
5 31
3 31
1 31
11 30
9 30
6 30
4 30
2 28
Name: mnth, dtype: int64
Value counts of weathersit
1 226
2 124
3 15
Name: weathersit, dtype: int64
Value counts of weekday
6 53
5 52
4 52
3 52
2 52
1 52
0 52
Name: weekday, dtype: int64
categorical_features = ['season','mnth','weathersit','weekday']
x_train_cat = trainData[categorical_features]
x_train_cat = pd.get_dummies(x_train_cat)
x_train_rest = trainData.drop(categorical_features,axis=1)
trainData = pd.concat([x_train_cat, x_train_rest], axis = 1, ignore_index=False)
trainData.columns
Index(['season_1', 'season_2', 'season_3', 'season_4', 'mnth_1', 'mnth_2',
'mnth_3', 'mnth_4', 'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8', 'mnth_9',
'mnth_10', 'mnth_11', 'mnth_12', 'weathersit_1', 'weathersit_2',
'weathersit_3', 'weekday_0', 'weekday_1', 'weekday_2', 'weekday_3',
'weekday_4', 'weekday_5', 'weekday_6', 'holiday', 'workingday',
'casual', 'registered', 'cnt', 'dayCount', 'temp_nor', 'atemp_nor',
'hum_nor', 'windspeed_nor'],
dtype='object')
x_test_cat = testData[categorical_features]
x_test_cat = pd.get_dummies(x_test_cat)
x_test_rest = testData.drop(categorical_features,axis=1)
testData = pd.concat([x_test_cat, x_test_rest], axis = 1, ignore_index=False)
testData.columns
Index(['season_1', 'season_2', 'season_3', 'season_4', 'mnth_1', 'mnth_2',
'mnth_3', 'mnth_4', 'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8', 'mnth_9',
'mnth_10', 'mnth_11', 'mnth_12', 'weathersit_1', 'weathersit_2',
'weathersit_3', 'weekday_0', 'weekday_1', 'weekday_2', 'weekday_3',
'weekday_4', 'weekday_5', 'weekday_6', 'holiday', 'workingday',
'casual', 'registered', 'cnt', 'dayCount', 'temp_nor', 'atemp_nor',
'hum_nor', 'windspeed_nor'],
dtype='object')
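The train and test columns happen to match here because every category occurs in both years, but pd.get_dummies applied to each set separately can produce different columns when a category is missing from one of them. A defensive sketch with synthetic data (the column name mirrors the dataset):

```python
import pandas as pd

# get_dummies on train and test separately can disagree on columns when a
# category is missing from one set; reindex the test frame to the training
# columns, filling absent dummies with 0.
train_cat = pd.get_dummies(pd.DataFrame({'weathersit': [1, 2, 3]}, dtype='object'))
test_cat = pd.get_dummies(pd.DataFrame({'weathersit': [1, 2]}, dtype='object'))
test_cat = test_cat.reindex(columns=train_cat.columns, fill_value=0)
print(list(test_cat.columns))  # ['weathersit_1', 'weathersit_2', 'weathersit_3']
```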
Regression
Prepare the training data: separate the features from the target.
y_train = trainData["cnt"]
X_train = trainData.drop(['cnt','casual','registered','dayCount'], axis = 1)
y_test = testData["cnt"]
X_test = testData.drop(['cnt','casual','registered','dayCount'], axis = 1)
y_test.head()
365 2294
366 1951
367 2236
368 2368
369 3272
Name: cnt, dtype: int64
# Standardize the target
# Standardizing y is not required, but it keeps the scale of the weights w
# comparable across problems. Keep mean_y and std_y: the test-set predictions
# must be inverse-transformed with them afterwards.
mean_y = y_train.mean()
std_y = y_train.std()
y_train = (y_train - mean_y)/std_y
y_test = (y_test - mean_y)/std_y
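The round trip is easy to sanity-check: standardizing with the training mean/std and inverting with the same constants must recover the original values. A sketch with synthetic counts:

```python
import numpy as np

# Standardize with training mean/std, then invert with the same constants.
y = np.array([431.0, 3740.0, 6043.0])    # synthetic "cnt" values
mean_y, std_y = y.mean(), y.std(ddof=1)  # ddof=1 matches pandas Series.std()
z = (y - mean_y) / std_y
back = z * std_y + mean_y
print(np.allclose(back, y))  # True
```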
Ordinary least squares linear regression
# Linear Regression
# 1. Create the estimator
lr = LinearRegression()
# 2. Fit it on the training set
lr.fit(X_train, y_train)
# 3. Predict on the training set to measure the training error
#    (not needed in a real prediction task)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
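mean_squared_error and r2_score are imported at the top but never called. A sketch of how the predictions above would be scored (synthetic arrays stand in for y_train and y_train_pred):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Score predictions with RMSE and R^2 (toy values for illustration).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE: %.3f, R^2: %.3f" % (rmse, r2_score(y_true, y_pred)))
# RMSE: 0.612, R^2: 0.949
```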
Generate the submission file
y_test_pred = lr.predict(X_test)
y_test_pred = y_test_pred * std_y + mean_y  # back to the original count scale
# Write the submission file
df = pd.DataFrame({'instant': np.arange(len(y_test_pred)) + 1, 'cnt': y_test_pred})
#df.reindex(columns=['instant'])
#y = pd.Series(data = y_test_pred, name = 'cnt')
#df = pd.concat([testID, y], axis = 1, ignore_index=True)
df.to_csv('submission.csv')
# Inspect the predictions
df.drop(['instant'], axis=1, inplace=True)
df.head(20)