[Learning Data Mining with Python] - Chapter 3: Predicting Winning Teams with Decision Trees

Author: 六千宛 | Published 2020-11-11 17:21



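Cleaning the dataset
Inspecting the raw output reveals a few problems:
(1) The Date attribute is a string rather than a date object.
(2) The header row is incomplete, or some columns have incorrect names.
Both problems can be fixed through the arguments of pd.read_csv.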

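NOTES
# Don't read the first row, as it is blank, and parse the date column as a date
# usecols: the columns of the table to keep
# parse_dates: list of column indices to convert to datetime
# dayfirst: a boolean, not a column index; True reads ambiguous dates as DD/MM (left at its default here)
# results.columns: assign new column names

import pandas as pd

# data_filename is the path to the CSV file, set earlier in the notebook
results = pd.read_csv(data_filename, usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8], parse_dates=[0], skiprows=[0])
# Fix the name of the columns
results.columns = ["Date", "Start", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT", "Notes", "Score Type"]
# .ix was removed in pandas 1.0; .loc[:5] selects the same rows by label
results.loc[:5]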
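The effect of these read_csv arguments can be checked on a tiny in-memory file. This is only a sketch: the rows below are made up, and the five-column layout is an assumption chosen to mimic a blank first row followed by a header with a repeated "PTS" name.

```python
import io

import pandas as pd

# Made-up sample data: a blank first row, then a header with a duplicated name.
raw = io.StringIO(
    ",,,,\n"
    "Date,Visitor,PTS,Home,PTS\n"
    "2013-10-29,Team A,87,Team B,97\n"
    "2013-10-30,Team C,95,Team A,107\n"
)

# Skip the blank row, keep the first five columns, parse column 0 as a date.
sample = pd.read_csv(raw, usecols=[0, 1, 2, 3, 4], parse_dates=[0], skiprows=[0])

# Replace the ambiguous header names with descriptive ones.
sample.columns = ["Date", "Visitor Team", "VisitorPts", "Home Team", "HomePts"]
print(sample.dtypes)
```

After loading, the Date column has a datetime dtype instead of being a plain string.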

NOTES
These notes cover four members of a NumPy array and how they differ: ndim, shape, dtype and astype.
##### 1.ndim
![image](https://img.haomeiwen.com/i24215864/a0b2219229cd94b5?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
ndim returns the number of dimensions of the array as a single integer.
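##### 2.shape
![image](https://img.haomeiwen.com/i24215864/c8a6e4f365bd046b?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
shape is a tuple giving the size of each dimension.
For a 1-D array: one might expect (1, 6), but arr1.ndim is 1, so the tuple contains a single number.
For a 2-D array: the first number is the row count and the second is the column count; ndim is 2, so two numbers are returned.
For a 3-D array the result is harder to read off, so print arr3 to see how it is structured.
![image](https://img.haomeiwen.com/i24215864/224f3d0c45afa6e9?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Start with the outermost brackets, which contain [[1,2,3],[4,5,6]] and [[7,8,9],[10,11,12]]. Call these A and B, so the array is [A, B]. The outer level therefore has length 2, which is the first number of the shape. A and B are themselves (2, 3) arrays, so combining the two gives the shape of arr3: (2, 2, 3).
The same reasoning extends to the shapes of 4-D and 5-D arrays.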
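The ndim and shape values described above can be reproduced directly. In this sketch arr3 uses the values quoted in the text, while the contents of arr1 and arr2 are assumptions, since the original screenshots are not available.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5, 6])                 # assumed 1-D example
arr2 = np.array([[1, 2, 3], [4, 5, 6]])             # assumed 2-D example
arr3 = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[7, 8, 9], [10, 11, 12]]])        # values from the text

for name, arr in [("arr1", arr1), ("arr2", arr2), ("arr3", arr3)]:
    # ndim is always the length of the shape tuple.
    print(name, arr.ndim, arr.shape)
```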
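##### 3.dtype
![image](https://img.haomeiwen.com/i24215864/97b1d88e27731659?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
dtype is an object that describes the data type of the array's elements. All the values in the figure are integers, so int32 is returned in every case (the default integer width is platform dependent and is int64 on many systems). If any element has a decimal point, float64 is returned instead.
A natural question: shouldn't integer data be int, and floating-point data be float?
Answer: int32 and float64 belong to NumPy's own set of data types.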
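##### 4.astype
![image](https://img.haomeiwen.com/i24215864/2e8e3017a26445c3?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
astype converts the array to another data type and returns the result as a new array.
int32 --> float64        works without any loss
float64 --> int32        the fractional part is truncated
string_ --> float64        an array of strings that all represent numbers can also be converted to a numeric type with astype
![image](https://img.haomeiwen.com/i24215864/63cd20e84ff740fd?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
Note the use of float here: it is a built-in Python type, but NumPy accepts it and maps Python types to the equivalent dtype.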
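The three conversions listed above can be tried on small arrays. The values here are made up for illustration; the explicit np.int32 and np.float64 targets are used so the result does not depend on the platform's default integer width.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.5, 2.7, -3.9])
strings = np.array(["1.25", "3", "-4.5"])

as_float = ints.astype(np.float64)   # integers widen to floats without loss
truncated = floats.astype(np.int32)  # fractional part is dropped (toward zero)
parsed = strings.astype(float)       # Python float maps to NumPy float64

print(as_float, truncated, parsed)
```

Note that truncation is not rounding: 2.7 becomes 2 and -3.9 becomes -3.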

NOTES
df.dtypes # data type of every column
df.team.dtype # data type of a single column
s.dtype # data type of the Series s
df.dtypes.value_counts() # number of columns of each type


NOTES - data type checks
pd.api.types.is_bool_dtype(s)
pd.api.types.is_categorical_dtype(s)
pd.api.types.is_datetime64_any_dtype(s)
pd.api.types.is_datetime64_ns_dtype(s)
pd.api.types.is_datetime64_dtype(s)
pd.api.types.is_float_dtype(s)
pd.api.types.is_int64_dtype(s)
pd.api.types.is_numeric_dtype(s)
pd.api.types.is_object_dtype(s)
pd.api.types.is_string_dtype(s)
pd.api.types.is_timedelta64_dtype(s)
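A short sketch of how a few of these checks might be used to pick out columns by type. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical frame with one column of each kind of interest.
df = pd.DataFrame({
    "team": ["A", "B"],
    "points": [97, 107],
    "home_win": [True, False],
    "date": pd.to_datetime(["2013-10-29", "2013-10-30"]),
})

# Each check accepts a Series (or a dtype) and returns a plain bool.
datetime_cols = [c for c in df.columns if ptypes.is_datetime64_any_dtype(df[c])]
bool_cols = [c for c in df.columns if ptypes.is_bool_dtype(df[c])]

print(datetime_cols, bool_cols)
print(ptypes.is_numeric_dtype(df["points"]), ptypes.is_numeric_dtype(df["team"]))
```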

NOTES
1-type():
Returns the type of the data structure itself (list, dict, numpy.ndarray)
>>> k = [1, 2]
>>> type(k)
<class 'list'>
>>> import numpy as np
>>> p = np.array(k)
>>> type(p)
<class 'numpy.ndarray'>

2-dtype:
Returns the type of the elements (int, float); it is an attribute, not a function
>>> k = [1, 2]
>>> k.dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'dtype'
# list, dict, etc. can hold elements of different types, so they have no dtype attribute
>>> import numpy as np
>>> p = np.array(k)
>>> p.dtype
dtype('int32')
# np.array requires all elements to share one data type, so dtype is available (the integer width shown is platform dependent)

3-astype():
Returns a copy of the np.array with every element converted to the given data type
>>> import numpy as np
>>> p = np.array(k)
>>> p
array([1, 2])
>>> p.astype(float)
array([1., 2.])

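NOTES
1-loc: selects rows and columns by label
2-iloc: selects rows and columns by integer position
3-ix: mixed label/position indexer; deprecated in pandas 0.20 and removed in 1.0, so use loc or iloc instead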
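A small sketch of the difference between the two indexers that remain available in current pandas. The frame and its index labels are made up; non-default labels are used so that label and position cannot be confused.

```python
import pandas as pd

df = pd.DataFrame({"HomePts": [97, 107, 116]}, index=[10, 20, 30])

by_label = df.loc[10:20]      # label slice: both end points are included
by_position = df.iloc[0:1]    # position slice: the end point is excluded

cell_by_label = df.loc[20, "HomePts"]   # row label 20
cell_by_position = df.iloc[1, 0]        # second row, first column

print(len(by_label), len(by_position), cell_by_label, cell_by_position)
```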

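This part of the code raised an error because r was not defined, so r is initialized to 0 before the loop
results["HomeWin"] = results["VisitorPts"] < results["HomePts"]
# Our "class values"
y_true = results["HomeWin"].values
# Count the home wins
r = 0
for i in range(len(results)):
    if results["HomeWin"][i]:
        r += 1
print(r)
print("Home Win percentage: {0:.1f}%".format(100 * r / results["HomeWin"].count()))
The whole loop above can be replaced with a single expression, which gives the home-win rate as a fraction between 0 and 1
results["HomeWin"].mean()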
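To see why the mean of a boolean column equals the win rate, here is a sketch on four made-up games. True counts as 1 and False as 0, so the mean is wins divided by games.

```python
import pandas as pd

# Made-up scores for four games.
toy = pd.DataFrame({
    "VisitorPts": [87, 95, 103, 90],
    "HomePts": [97, 107, 99, 101],
})
toy["HomeWin"] = toy["VisitorPts"] < toy["HomePts"]

home_wins = int(toy["HomeWin"].sum())   # number of True values
rate = toy["HomeWin"].mean()            # same as home_wins / len(toy)

print("Home Win percentage: {0:.1f}%".format(100 * rate))
```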

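NOTES
1-iterrows()
iterrows() yields tuples of the form (index, row).
The for loop in the code below unpacks each tuple into its two variables, index and row.
2-
from collections import defaultdict
won_last = defaultdict(bool)
# Assumes the HomeLastWin and VisitorLastWin columns already exist in results
for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Set the values on row; results["HomeLastWin"] = ... would overwrite the whole column
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    results.loc[index] = row  # .ix was removed in pandas 1.0
    # Set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
Why can't row be replaced with results in the home_team and visitor_team lines? Doing so raises the error "'Series' objects are mutable, thus they cannot be hashed".
results["Home Team"] is a whole column, i.e. a Series. A Series is mutable and therefore unhashable, so it cannot be used as the key in won_last[...]; row["Home Team"] is a single team name, which can.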
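Both points can be checked on a two-row frame. This is a sketch with made-up data; the exact wording of the TypeError differs between pandas versions, so only the exception type is relied on here.

```python
import pandas as pd

toy = pd.DataFrame({"Home Team": ["A", "B"], "HomeWin": [True, False]})

# iterrows() yields (index, row) pairs; row is a Series holding one game.
first_index, first_row = next(toy.iterrows())

lookup = {}
lookup[first_row["Home Team"]] = True   # a single string is hashable

raised = False
try:
    lookup[toy["Home Team"]] = True     # a whole column is a Series: unhashable
except TypeError:
    raised = True

print(first_index, first_row["Home Team"], raised)
```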

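NOTES
The order of the statements matters here: each team's previous result has to be read from won_last before won_last is updated with the current game. Updating first would make HomeLastWin equal to the current game's HomeWin, which leaks the label into the features.
# Now compute the actual values for these
# Did the home and visitor teams win their last game?
from collections import defaultdict
won_last = defaultdict(bool)
results["HomeLastWin"] = False
results["VisitorLastWin"] = False

for index, row in results.iterrows():  # Note that this is not efficient
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Read the previous results first
    results.loc[index, "HomeLastWin"] = won_last[home_team]
    results.loc[index, "VisitorLastWin"] = won_last[visitor_team]
    # Then set current win
    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
results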
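The "did each team win its last game" feature can be traced by hand on a few games. This sketch uses four made-up games between hypothetical teams A, B and C, and collects the feature values in lists rather than writing back row by row; each team's previous result is read before the current game is recorded.

```python
from collections import defaultdict

import pandas as pd

# Made-up games in chronological order.
games = pd.DataFrame({
    "Home Team": ["A", "B", "A", "C"],
    "Visitor Team": ["B", "C", "C", "B"],
    "HomeWin": [True, False, False, True],
})

won_last = defaultdict(bool)  # teams with no earlier game default to False
home_last, visitor_last = [], []

for home, visitor, home_win in zip(
    games["Home Team"], games["Visitor Team"], games["HomeWin"]
):
    # Read the previous results before touching won_last.
    home_last.append(won_last[home])
    visitor_last.append(won_last[visitor])
    # Record the outcome of the current game for later rows.
    won_last[home] = bool(home_win)
    won_last[visitor] = not home_win

games["HomeLastWin"] = home_last
games["VisitorLastWin"] = visitor_last
print(games)
```

In the third game, for example, team A is at home after winning the first game, so HomeLastWin is True even though A loses the current game.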

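NOTES
Predicting with a decision tree
scikit-learn implements an optimized version of the Classification and Regression Trees (CART) algorithm as its default decision tree. It works on numeric features, so continuous features can be used directly, while categorical features need to be encoded as numbers first.

Decision tree parameters
One of the most important parameters of a decision tree is the stopping criterion. Towards the end of tree construction, the last few decisions rely on only a handful of samples and are largely random, so a tree that is grown all the way down overfits the training data. A stopping criterion prevents this by halting growth before the tree fits the training data too closely.
Instead of using a stopping criterion, a tree can also be grown in full from the available samples and then pruned to obtain a more general model. Pruning removes the nodes that contribute little information to the tree as a whole.
The decision tree in scikit-learn provides the following two options as stopping criteria:
(1) min_samples_split: the minimum number of samples a node must contain before it may be split.
(2) min_samples_leaf: the minimum number of samples every leaf must contain; a split is only kept if both resulting children meet it.
The first parameter controls whether a node is split at all; the second decides whether the resulting nodes are kept.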
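A sketch of the two stopping options in use. The data set is synthetic (generated with make_classification) and the thresholds 40 and 20 are arbitrary values chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for demonstrating the parameters.
X, y = make_classification(n_samples=300, n_features=5, random_state=14)

full_tree = DecisionTreeClassifier(random_state=14).fit(X, y)
small_tree = DecisionTreeClassifier(
    min_samples_split=40,  # a node needs at least 40 samples to be split
    min_samples_leaf=20,   # every leaf must keep at least 20 samples
    random_state=14,
).fit(X, y)

print("leaves without stopping criteria:", full_tree.get_n_leaves())
print("leaves with stopping criteria:   ", small_tree.get_n_leaves())
```

With 300 training samples and min_samples_leaf=20, the restricted tree cannot have more than 15 leaves.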

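The other key parameter of a decision tree is the criterion used to choose a split, most commonly Gini impurity or information gain.

(1) Gini impurity: a measure of how often a decision node would predict the wrong class for a new sample.
(2) information gain: uses entropy from information theory to measure how much new information a decision node provides.
Both criteria serve roughly the same purpose: deciding which rule or value to use when splitting a node into child nodes. Since the value is itself the measure that determines the split, the choice of criterion can have a significant effect on the final model.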
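Both measures can be computed by hand from the class counts in a node. The sketch below uses the standard definitions (Gini impurity as 1 minus the sum of squared class proportions, entropy in bits); the helper function names are hypothetical and not part of scikit-learn.

```python
import math


def gini_impurity(counts):
    """Gini impurity of a node given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) of a node given the samples per class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# A node with 5 home wins and 5 losses, split perfectly into two pure children.
print(gini_impurity([5, 5]), entropy([5, 5]))
print(information_gain([5, 5], [[5, 0], [0, 5]]))
```

A perfectly mixed two-class node has the highest impurity under both measures, and a pure node has impurity zero.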

scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
cross_val_score takes a scoring parameter. Its API reference does not make clear how to select different evaluation metrics; the following page explains it well:
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
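A sketch of switching the metric through the scoring string. The data is synthetic and stands in for X_previouswins and y_true; "accuracy" and "f1" are two of the predefined scorer names listed on the page linked above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and class values.
X, y = make_classification(n_samples=200, n_features=4, random_state=14)
clf = DecisionTreeClassifier(random_state=14)

# The same estimator and folds, scored with two different metrics.
accuracy_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)
f1_scores = cross_val_score(clf, X, y, scoring="f1", cv=5)

print("Accuracy: {0:.1f}%".format(accuracy_scores.mean() * 100))
print("F1: {0:.3f}".format(f1_scores.mean()))
```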

Using DataFrame.set_value()
https://vimsky.com/examples/usage/python-pandas-dataframe-set_value.html
DataFrame.set_value(index, col, value, takeable=False)
Parameters:
index: row label
col: column label
value: scalar value
takeable: interpret the index/col as indexers, default False
set_value was deprecated in pandas 0.21 and removed in 1.0; use the .at (label) or .iat (position) accessors instead.
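A minimal sketch of the same scalar assignment with the .at and .iat accessors, which take the place of set_value in pandas 1.0 and later. The frame and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"HomeLastWin": [False, False]}, index=["game1", "game2"])

# Label-based: corresponds to set_value(index, col, value)
df.at["game2", "HomeLastWin"] = True

# Position-based: corresponds to set_value(index, col, value, takeable=True)
df.iat[0, 0] = True

print(df)
```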

[Machine learning: decision trees, DecisionTreeClassifier parameters explained, visualizing the tree structure](https://www.cnblogs.com/baby-lily/p/10646226.html)
