First, import the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
# import data
filename= "data.csv"
raw = pd.read_csv(filename)
print (raw.shape)
raw.head()
The shot_made_flag column of the raw data has 5,000 missing values. We will treat those rows as the test set later; this column is our label.
# 5000 for test
kobe = raw[pd.notnull(raw['shot_made_flag'])]
print (kobe.shape)
(25697, 25)
# plt.subplot(211): the first digit is the number of rows, the second the number of columns
alpha = 0.02
plt.figure(figsize=(10,10))
# loc_x and loc_y
plt.subplot(121)
plt.scatter(kobe.loc_x, kobe.loc_y, color='r', alpha=alpha)
plt.title('loc_x and loc_y')
# lat and lon
plt.subplot(122)
plt.scatter(kobe.lon, kobe.lat, color='b', alpha=alpha)
plt.title('lat and lon')
Plotting the shot locations with the court coordinates (loc_x, loc_y) and with longitude/latitude gives almost identical pictures.
Next, we convert the coordinates to a polar representation:
raw['dist'] = np.sqrt(raw['loc_x']**2 + raw['loc_y']**2)

loc_x_zero = raw['loc_x'] == 0
raw['angle'] = np.zeros(len(raw))
# use .loc indexing to avoid pandas chained-assignment warnings
raw.loc[~loc_x_zero, 'angle'] = np.arctan(raw['loc_y'][~loc_x_zero] / raw['loc_x'][~loc_x_zero])
raw.loc[loc_x_zero, 'angle'] = np.pi / 2
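As an aside, np.arctan2 avoids special-casing loc_x == 0 entirely. It is not byte-for-byte equivalent to the code above: arctan2 returns angles over the full (-π, π] range and so also distinguishes the left and right halves of the court, whereas arctan(y/x) folds them together. A minimal sketch (the column name angle2 is just for illustration):

# alternative angle feature: no special case needed for loc_x == 0,
# and left/right half-court are kept distinct
raw['angle2'] = np.arctan2(raw['loc_y'], raw['loc_x'])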
raw['remaining_time'] = raw['minutes_remaining'] * 60 + raw['seconds_remaining']
print(kobe.action_type.unique())
print(kobe.combined_shot_type.unique())
print(kobe.shot_type.unique())
print(kobe.shot_type.value_counts())
After this processing, let's look at what the more important categorical features contain:
['Jump Shot' 'Driving Dunk Shot' 'Layup Shot' 'Running Jump Shot'
'Reverse Dunk Shot' 'Slam Dunk Shot' 'Driving Layup Shot'
'Turnaround Jump Shot' 'Reverse Layup Shot' 'Tip Shot' 'Running Hook Shot'
'Alley Oop Dunk Shot' 'Dunk Shot' 'Alley Oop Layup shot'
'Running Dunk Shot' 'Driving Finger Roll Shot' 'Running Layup Shot'
'Finger Roll Shot' 'Fadeaway Jump Shot' 'Follow Up Dunk Shot' 'Hook Shot'
'Turnaround Hook Shot' 'Jump Hook Shot' 'Running Finger Roll Shot'
'Jump Bank Shot' 'Turnaround Finger Roll Shot' 'Hook Bank Shot'
'Driving Hook Shot' 'Running Tip Shot' 'Running Reverse Layup Shot'
'Driving Finger Roll Layup Shot' 'Fadeaway Bank shot' 'Pullup Jump shot'
'Finger Roll Layup Shot' 'Turnaround Fadeaway shot'
'Driving Reverse Layup Shot' 'Driving Slam Dunk Shot'
'Step Back Jump shot' 'Turnaround Bank shot' 'Reverse Slam Dunk Shot'
'Floating Jump shot' 'Putback Slam Dunk Shot' 'Running Bank shot'
'Driving Bank shot' 'Driving Jump shot' 'Putback Layup Shot'
'Putback Dunk Shot' 'Running Finger Roll Layup Shot' 'Pullup Bank shot'
'Running Slam Dunk Shot' 'Cutting Layup Shot' 'Driving Floating Jump Shot'
'Running Pull-Up Jump Shot' 'Tip Layup Shot'
'Driving Floating Bank Jump Shot']
['Jump Shot' 'Dunk' 'Layup' 'Tip Shot' 'Hook Shot' 'Bank Shot']
['2PT Field Goal' '3PT Field Goal']
2PT Field Goal 20285
3PT Field Goal 5412
Name: shot_type, dtype: int64
gs = kobe.groupby('shot_zone_area')
print (kobe['shot_zone_area'].value_counts())
print (len(gs))
Below are the counts of Kobe's shot attempts in each court zone, plus the number of distinct zones:
Center(C) 11289
Right Side Center(RC) 3981
Right Side(R) 3859
Left Side Center(LC) 3364
Left Side(L) 3132
Back Court(BC) 72
Name: shot_zone_area, dtype: int64
6
Group by zone and plot each group:
import matplotlib.cm as cm
plt.figure(figsize=(20,10))
def scatter_plot_by_category(feat):
    alpha = 0.1
    gs = kobe.groupby(feat)
    cs = cm.rainbow(np.linspace(0, 1, len(gs)))
    for g, c in zip(gs, cs):
        # g is a (group_name, group_frame) tuple
        plt.scatter(g[1].loc_x, g[1].loc_y, color=c, alpha=alpha)
# shot_zone_area
plt.subplot(131)
scatter_plot_by_category('shot_zone_area')
plt.title('shot_zone_area')
# shot_zone_basic
plt.subplot(132)
scatter_plot_by_category('shot_zone_basic')
plt.title('shot_zone_basic')
# shot_zone_range
plt.subplot(133)
scatter_plot_by_category('shot_zone_range')
plt.title('shot_zone_range')
Drop the columns that are not useful for modeling, and one-hot encode the categorical (string) features.
drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic', \
         'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining', \
         'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
for drop in drops:
    raw = raw.drop(drop, axis=1)

# one-hot encode the categorical features
categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period', 'season']
for var in categorical_vars:
    raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], axis=1)
    raw = raw.drop(var, axis=1)
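For readers unfamiliar with pd.get_dummies, here is a toy example (hypothetical data, not from this dataset) of what the expansion produces; on recent pandas versions the dummy columns print as True/False rather than 0/1:

import pandas as pd

toy = pd.DataFrame({'shot_type': ['2PT Field Goal', '3PT Field Goal', '2PT Field Goal']})
print(pd.get_dummies(toy['shot_type'], prefix='shot_type'))
#    shot_type_2PT Field Goal  shot_type_3PT Field Goal
# 0                         1                         0
# 1                         0                         1
# 2                         1                         0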
Splitting the dataset
# the 5,000 rows with a missing 'shot_made_flag' serve as the test set
train_kobe = raw[pd.notnull(raw['shot_made_flag'])]
train_label = train_kobe['shot_made_flag']  # extract the label before dropping the column
train_kobe = train_kobe.drop('shot_made_flag', axis=1)
test_kobe = raw[pd.isnull(raw['shot_made_flag'])]
test_kobe = test_kobe.drop('shot_made_flag', axis=1)
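A quick sanity check on the split; the shapes below are what we expect given the 25,697 labelled and 5,000 unlabelled rows, with D the post-one-hot feature count:

print(train_kobe.shape, train_label.shape, test_kobe.shape)
# expected roughly: (25697, D) (25697,) (5000, D)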
Start training
The code below uses cross-validation to compare different numbers of trees and different tree depths, and picks the best of each. Since this is only a simple example, just three values (1, 10, 100) are tried for the number of trees, and likewise 1, 10, 100 for the depth.
# all preprocessing is done; now we train
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss
import time

# find the best n_estimators for RandomForestClassifier
print('Finding best n_estimators for RandomForestClassifier...')
min_score = 100000
best_n = 0
scores_n = []
range_n = np.logspace(0, 2, num=3).astype(int)  # array([1, 10, 100])
for n in range_n:
    print("the number of trees : {0}".format(n))
    t1 = time.time()
    rfc_score = 0.
    rfc = RandomForestClassifier(n_estimators=n)
    # 10-fold cross-validation, averaging the log loss over the folds
    for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        pred = rfc.predict(train_kobe.iloc[test_k])
        rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
    scores_n.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_n = n
    t2 = time.time()
    print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2 - t1))
print(best_n, min_score)
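Note that each fold is scored with log_loss on hard 0/1 predictions, which is why the scores come out so large; the more standard choice is to score the predicted probabilities. A sketch of that variant, reusing rfc, train_kobe, and train_label from the cell above:

# variant: average log loss over folds using predicted probabilities
rfc_score = 0.
for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
    rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
    proba = rfc.predict_proba(train_kobe.iloc[test_k])[:, 1]  # P(shot made)
    rfc_score += log_loss(train_label.iloc[test_k], proba) / 10
print(rfc_score)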
# find the best max_depth for RandomForestClassifier
print('Finding best max_depth for RandomForestClassifier...')
min_score = 100000
best_m = 0
scores_m = []
range_m = np.logspace(0, 2, num=3).astype(int)  # array([1, 10, 100])
for m in range_m:
    print("the max depth : {0}".format(m))
    t1 = time.time()
    rfc_score = 0.
    rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
    for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        pred = rfc.predict(train_kobe.iloc[test_k])
        rfc_score += log_loss(train_label.iloc[test_k], pred) / 10
    scores_m.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_m = m
    t2 = time.time()
    print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2 - t1))
print(best_m, min_score)
The output:
Finding best n_estimators for RandomForestClassifier...
the number of trees : 1
Done processing 1 trees (1.407sec)
the number of trees : 10
Done processing 10 trees (7.093sec)
the number of trees : 100
Done processing 100 trees (67.297sec)
100 11.8669680428
Finding best max_depth for RandomForestClassifier...
the max depth : 1
Done processing max depth 1 (6.658sec)
the max depth : 10
Done processing max depth 10 (23.687sec)
the max depth : 100
Done processing max depth 100 (70.740sec)
10 11.0039977617
The comparison across parameter values is plotted below:
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.plot(range_n, scores_n)
plt.ylabel('score')
plt.xlabel('number of trees')
plt.subplot(122)
plt.plot(range_m, scores_m)
plt.ylabel('score')
plt.xlabel('max depth')
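The walkthrough stops at parameter selection. To actually use the chosen parameters, a final model can be fit on the full training set and applied to the 5,000 held-out rows; a minimal sketch using the best_n and best_m found above:

# fit the final model with the selected parameters and predict the held-out rows
model = RandomForestClassifier(n_estimators=best_n, max_depth=best_m)
model.fit(train_kobe, train_label)
pred_proba = model.predict_proba(test_kobe)[:, 1]  # P(shot made) for each test row
print(pred_proba[:10])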