Python 机器学习的利器（5）

作者: zidea | 来源:发表于2020-07-13 20:25 被阅读0次

Python 机器学习的利器（5）
环境依赖相关包
Python 机器学习的利器（F）
Python 机器学习的利器（2）
Python 机器学习的利器（3）
Python 机器学习的利器（1）
使用python机器学习（四）
python可视化利器：pyecharts
上手实践《Python机器学习第2版》PDF中文+PDF英文+代
搭建python机器学习环境以及一个机器学习例子

图

决策树会选择最大化信息增益来对结点进行划分。

ID3 算法

计算出根节点的信息熵，然后根据属性依次划分并计算其节点的信息熵，信息增益=根节点信息熵-属性节点信息熵。根据信息增益进行降序排列，排列前面的就是第一个划分属性，依次类推。

在开始之前我们一定分析数据结构，这是往往被初学忽视，我们更注重算法以及如何构建模型，切实数据才是关键，就像做 web 应用一样，只有数据库定义好坏是决定应用的关键。

training_data = [
    ['Green',3,'Apple'],
    ['Yellow',3,'Apple'],
    ['Red',1,'Grape'],
    ['Red',1,'Grape'],
    ['Yellow',3,'Lemon']
]

为了更直观地观察数据，可以用 pandas 以表格形式输出数据。

import pandas as pd
labels = {'color','diameter','label'}
data = pd.DataFrame.from_records(training_data,columns=labels)

data.head()

数据

在数据每条数据都被标记两个属性颜色(color)和直径(diameter)，

def unique_vals(rows, col):
    S = set()
    for row in rows:
        S.add(row[col])
    """在数据集中每一个列找到去重来获取值"""
    return S

这个方法 unique_vals 方法是对数据的列中出现值进行去重，而从得到该列（属性）有几种可能值。

unique_vals(training_data,0)

使用这种方法可以得出该 0 列，可能出现的值，

{'Green', 'Red', 'Yellow'}

def class_counts(rows):
    """合计每一种类水果的数量"""
    counts = {}
    for row in rows:
        label = row[-1]
        if label not in counts:
            counts[label] = 0
        counts[label] += 1
    return counts

class_counts(training_data)

{'Apple': 2, 'Grape': 2, 'Lemon': 1}

使用 class_counts 方法可以计算出每种水果种类的数量。

我们知道所谓决策树，就是我们设计出一些列问题帮助我们根据问题答案来进行对物体进行分类。那么选择那些问题，以及问题先后顺序就是关键了。好的问题将集合分类 unmixed 和 mixed 两类，这样就可以逐步地将 unmixed 从 mixed 去除直到通过答案得到结果。

设计问题会根据属性来设计，有关颜色我们可以设计出

是否为红色
是否为绿色
是否为黄色

我们通过 Gini impurity 来评价单个结点的不确定性，通过问题我们可以减少多少不确定为 Information Gain 。当对于结点无需进行问题后该结点为叶节点。

def is_numeric(value):
    """判断一个值是否为数据类型"""
    return isinstance(value,int) or isinstance(value,float)

class Question:
    def __init__(self, column, value):
        self.column = column
        self.value = value
    def match(self, example):
        val = example[self.column]
        if is_numeric(val):
            return val >= self.value
        else:
            return val == self.value
    def __repr__(self):
        condition = "=="
        if is_numeric(self.value):
            condition = ">="
        return "Is %s %s %s?" %(header[self.column],condition,str(self.value))

在这个 Question 接收两个参数第一个参数 column 也就是列（数据特征）的值和一个 value 标准值进行比较

Question(1,3)

Is diameter >= 3?

这里定义 partition 函数对数据根据条件将数据分类符合条件和不符合条件的数据

def partition(rows, question):
    """数据集部分数据"""
    true_rows, false_rows = [],[]
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows

true_rows, flase_rows = partition(training_data,Question(0,'Red'))
true_rows

[['Red', 1, 'Grape'], ['Red', 1, 'Grape']]

bandicam 2019-07-27 12-53-56-551.jpg

bandicam 2019-07-27 12-54-08-927.jpg

bandicam 2019-07-27 12-54-19-801.jpg

bandicam 2019-07-27 12-54-24-202.jpg

bandicam 2019-07-27 12-54-29-088.jpg

Impurity

是我们随机拿出一个物品与预期不符的概率
下面定义一个计算 Impurity 的公式，

def gini(rows):
    """计算 Gini Impurity """
    counts = class_counts(rows)
    impurity = 1
    for lbl in counts:
        prob_of_lbl = counts[lbl] / float(len(rows))
        impurity -= prob_of_lbl**2
    return impurity

测试一下 gini 这个计算 Impurity 的方法，通过[['Apple'],['Apple']]来测试，因为集合只有一种类别所以 Impurity 为 0

no_mixing= [['Apple'],['Apple']]
gini(no_mixing)

0.0

我们来看看一个 Impurity 的集合[['Apple'],['Orange']]输出的结果为 0.5

some_mixing = [['Apple'],['Orange']]
gini(some_mixing)

0.5

fruit_mixing = []
for rec in training_data:
    fruit_mixing.append([rec[2]])
fruit_mixing
gini(fruit_mixing)

0.64

true_rows, false_rows = partition(training_data,Question(0,'Green'))
"""计算按 Green 分类后 True 集合的 Impurity"""
print(cal_gini(false_rows))
"""计算按 Green 分类后 False 集合的 Impurity"""
print(cal_gini(true_rows))

0.625
0.0

根据问题对集合进行分类，分类后通过计算每一个分类的 Impurity 然后对其进行去和来看到进行分类后 Impurity 降低了多少

"""计算 Avg Impurity = 4/5 * 0.62 + 1/5 * 0"""
print(float(len(false_rows))/float(len(training_data)))
float(len(false_rows))/float(len(training_data)) * cal_gini(false_rows) + \
float(len(true_rows))/float(len(training_data))* cal_gini(true_rows)

0.5

0.64 是在未根据问题进行分类前数据的 Impurity 而是 0.5 计算的结果。

0.64 - 0.5

def info_gain(left, right, current_uncertainty):
    p = float(len(left))/(len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1-p) * gini(right)

def find_best_split(rows):
    """"""
    best_gain = 0
    best_question = None
    current_uncertainty = gini(rows)
    n_features = len(rows[0]) - 1
    for col in range(n_features):
        values = set([row[col] for row in rows])
        for val in values:
            question = Question(col, val)
            true_rows, false_rows = partition(rows, question)
            if len(true_rows, false_rows, current_uncertainty)
            
            if gain >= best_gain:
                best_gain, best_question = gain, question
    return best_gain, best_question

实例

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
balance_data = pd.read_csv( 
'https://archive.ics.uci.edu/ml/machine-learning-'+
'databases/balance-scale/balance-scale.data', 
    sep= ',', header = None) 
print(balance_data)

     0  1  2  3  4
0    B  1  1  1  1
1    R  1  1  1  2
2    R  1  1  1  3
3    R  1  1  1  4
4    R  1  1  1  5
..  .. .. .. .. ..
620  L  5  5  5  1
621  L  5  5  5  2
622  L  5  5  5  3
623  L  5  5  5  4
624  B  5  5  5  5

获取数据

def importData():
    balance_data = pd.read_csv( 
    'https://archive.ics.uci.edu/ml/machine-learning-'+
    'databases/balance-scale/balance-scale.data', 
        sep= ',', header = None)
    return balance_data

输出数据集

     0  1  2  3  4
0    B  1  1  1  1
1    R  1  1  1  2
2    R  1  1  1  3
3    R  1  1  1  4
4    R  1  1  1  5
..  .. .. .. .. ..
620  L  5  5  5  1
621  L  5  5  5  2
622  L  5  5  5  3
623  L  5  5  5  4
624  B  5  5  5  5

使用 sklearn 提供数据作为学习数据，这样方便很多。
对于初学者可能重视训练过程而一些机器学习的算法而忽视了对数据处理和数据分析。我们需要了解一下数据类型

print(type(balance_data))

数据集合为 pandas 类型，我们可以一起用 pandas 以表格形式查看

<class 'pandas.core.frame.DataFrame'>

print(balance_data.head())

以表格形式输出前 5 行数据，0 列表代表结果

   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5

1 列为类名称，类别为以下 3 类
L : 表示天平左侧沉
B : 表示天平平衡
R : 表示天平右侧沉
2 列 Left-Weight: 5 (1, 2, 3, 4, 5)
3 列 Left-Distance: 5 (1, 2, 3, 4, 5)
4 列 Right-Weight: 5 (1, 2, 3, 4, 5)
5 列 Right-Distance: 5 (1, 2, 3, 4, 5)

处理数据

有了数据，也了解数据结构我们还需要对数据进行拆分为训练用数据和测试用数据集。

def splitdataset(balance_data): 
  
    # 将期望值从数据里分离出来
    X = balance_data.values[:, 1:5] 
    Y = balance_data.values[:, 0] 
  
    # 数据集合拆分为训练数据集合测试
    # 0.3 测试数据集大小 random_st
    X_train, X_test, y_train, y_test = train_test_split(  
    X, Y, test_size = 0.3, random_state = 100) 
      
    return X, Y, X_train, X_test, y_train, y_test

定义训练方法

# 使用 giniIndex 训练数据 
def train_using_gini(X_train, X_test, y_train): 
  
    # 创建分类器
    clf_gini = DecisionTreeClassifier(criterion = "gini", 
            random_state = 100,max_depth=3, min_samples_leaf=5) 
  
    # 进行训练
    clf_gini.fit(X_train, y_train) 
    return clf_gini

# 使用 entropy(信息熵) 训练数据
def tarin_using_entropy(X_train, X_test, y_train): 
  
    # 构建信息熵决策树 
    clf_entropy = DecisionTreeClassifier( 
            criterion = "entropy", random_state = 100, 
            max_depth = 3, min_samples_leaf = 5) 
  
    # 进行训练 
    clf_entropy.fit(X_train, y_train) 
    return clf_entropy

Report :                precision    recall  f1-score   support

           B       0.00      0.00      0.00        13
           L       0.71      0.74      0.72        85
           R       0.71      0.78      0.74        90

    accuracy                           0.71       188
   macro avg       0.47      0.51      0.49       188
weighted avg       0.66      0.71      0.68       188

Python 机器学习的利器（5）
决策树会选择最大化信息增益来对结点进行划分。 ID3 算法计算出根节点的信息熵，然后根据属性依次划分并计算其节点...
环境依赖相关包
工欲善事，必先利器：python相关环境搭建的依赖包。数据分析爬虫 Django 机器学习
Python 机器学习的利器（F）
在生活少不了决策，大大小小的决策充满了我们的生活，也决定了我们成长的轨迹。我们花在选择和决策上时间和精力也不少。 ...
Python 机器学习的利器（2）
在机器学习中，对数据进行分析和处理是必不可少，而且也是决定我们训练效果的关键。参考《Python机器学习基础教程...
Python 机器学习的利器（3）
K 邻近算法(KNN) 可以将房源想象为实例，将房源的位置和价格将房源分类为推荐房源和非推荐房源。最近苹果贵了，...
Python 机器学习的利器（1）
参考《Python机器学习基础教程》参考国外有关 pandas 库的教程为什么是 Python 今天 Pyth...
使用python机器学习（四）
前面三篇文章《使用python机器学习（一）》、《使用python机器学习（二）》、《使用python机器学习（三...
python可视化利器：pyecharts
前言前面我们提及ggplot在R和Python中都是数据可视化的利器，在机器学习和数据分析领域得到了广泛的应用。...
上手实践《Python机器学习第2版》PDF中文+PDF英文+代
学习机器学习，推荐学习《Python机器学习(第二版)》。《Python机器学习(第2版)》，图文并茂，代码详实...
搭建python机器学习环境以及一个机器学习例子
搭建python机器学习环境以及一个机器学习例子这篇文章介绍了Python机器学习环境的搭建，我用的机器学习开源...

Python 机器学习的利器（5）

ID3 算法

Impurity

实例

获取数据

处理数据

定义训练方法

相关文章

Python 机器学习的利器（5）

环境依赖相关包

Python 机器学习的利器（F）

Python 机器学习的利器（2）

Python 机器学习的利器（3）

Python 机器学习的利器（1）

使用python机器学习（四）

python可视化利器：pyecharts

上手实践《Python机器学习第2版》PDF中文+PDF英文+代

搭建python机器学习环境以及一个机器学习例子

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读