[Machine Learning] Study Notes

Author: Du1in9 | Published: 2020-07-20 13:19


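Theory

What is machine learning

Using computers to find patterns in historical data and applying those patterns to decisions about uncertain future scenarios
Humans: data analysis (patterns are worked out by hand)
Computers: machine learning (patterns are generated automatically)
As a rule, the more historical data there is, the more accurate the learned patterns are

Finding patterns in data

Foundations: probability theory and statistics
Fitting patterns with a model: a function (which can take many forms) -> the function's curve -> fitting it to the data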
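
The "function -> curve -> fitting" idea can be tried in a few lines. This is only a minimal sketch: the data points are made up for illustration, and a straight line is just one of the many function forms mentioned above.

```python
import numpy as np

# Hypothetical "historical data": points that roughly follow y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Apply the learned rule to an input that was not in the data.
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)
```

The fitted slope and intercept play the role of the "pattern" and the last line uses it for a new case.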

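What drives the development of machine learning

Replacing experts with data (which avoids subjectivity)
Economic incentives: monetizing data (the boom of recent years is attributed to big data)

How business systems have evolved

Based on expert experience -> based on statistics (statistics per dimension) -> machine learning (online learning)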

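Typical applications

Association rules: market basket analysis, the "beer and diapers" example

Clustering: user segmentation for targeted marketing, e.g. China Mobile's 神州行 (Easyown), 动感地带 (M-Zone), and 全球通 (GoTone) brands

Risk identification: spam email (naive Bayes), credit card fraud (decision trees)

Click prediction: internet advertising (CTR prediction), purchase recommendation systems (collaborative filtering)

Sentiment analysis, entity recognition (natural language processing), deep learning (image recognition)

More applications

Speech recognition, face recognition, autonomous driving, virtual assistants, real-time translation, gesture control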

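Differences between data analysis and machine learning

Different data characteristics:
Transaction data vs behavioral data
Small amounts of data vs massive amounts of data
Sampled analysis vs analysis of the full data set

Different participants:
The analyst's skill determines the result vs the quality of the data determines the result

Different target users:
Company executives vs individuals

Different problems solved:

Different techniques: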

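Common algorithms and how they are categorized

1) Categorization 1

Supervised learning: Y values (labels) are given

Unsupervised learning: clustering, no Y values
Semi-supervised learning: only some samples have Y values (reinforcement learning, which learns from reward signals, is usually treated as a separate category)

2) Categorization 2

Clustering
Classification and regression
Tagging: attaching labels to elements

3) Categorization 3 (important)

Generative model: given a function and data -> the result is probabilistic (A: 30%, B: 70%)
Discriminative model: given a function and data -> the result is definite (A: yes, B: no)
The two answer the question in different ways and rest on different ideas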

Solving problems with machine learning

Image recognition demo

Clustering by color
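
The demo itself is not reproduced in these notes, so the following is only a rough sketch of what clustering by color could look like. It assumes plain k-means on RGB values; the function name `kmeans_colors`, the pixel values, and the choice of initial centers are all hypothetical.

```python
import numpy as np

def kmeans_colors(pixels, init_centers, n_iter=10):
    """Cluster RGB pixels with plain k-means (a sketch, not the demo's code)."""
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every pixel to every center, shape (n_pixels, k).
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean color of the pixels assigned to it.
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = pixels[labels == k].mean(axis=0)
    return labels, centers

# Hypothetical pixels: three reddish and three bluish RGB values.
pixels = np.array([[250, 10, 10], [240, 20, 15], [255, 0, 5],
                   [10, 10, 250], [20, 15, 240], [0, 5, 255]], dtype=float)
labels, centers = kmeans_colors(pixels, init_centers=[pixels[0], pixels[3]])
print(labels)
print(centers)
```

No Y values are involved here: the groups come only from how close the colors are to each other, which is what makes this an unsupervised method.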

Practice

Mathematical representation of a simulated neuron

Perceptron classification algorithm

a) Weight update example
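
The worked example for this heading is not included in the notes. As a stand-in, here is a single update step that follows the rule used by the Perceptron class later in these notes, update = eta * (target - predicted). The sample, its label, and the learning rate are made-up values.

```python
import numpy as np

eta = 0.1                        # learning rate (assumed value)
w = np.array([0.0, 0.0, 0.0])    # w[0] is the threshold weight w0
x = np.array([2.0, 3.0])         # one hypothetical sample
target = -1                      # its true class

# Net input z = w0 + w1*x1 + w2*x2, followed by the step function.
z = w[0] + np.dot(w[1:], x)
predicted = 1 if z >= 0.0 else -1    # z is 0 here, so the prediction is 1 (wrong)

# Perceptron rule: the update is non-zero only when the prediction is wrong.
update = eta * (target - predicted)  # 0.1 * (-1 - 1)
w[1:] += update * x
w[0] += update
print(w)                             # approximately [-0.2 -0.4 -0.6]
```

Had the prediction been correct, `target - predicted` would be 0 and the weights would stay unchanged.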

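b) Scope of application (first kind)

c) Summary of the algorithm steps

Implementing the perceptron object

Set up the environment with Anaconda Navigator: https://docs.anaconda.com/anaconda/install/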

import numpy as np

class Perceptron(object):
    # Note 1
    def __init__(self, eta = 0.01, n_iter = 10):
        self.eta = eta
        self.n_iter = n_iter
    def fit(self, X, y):
        # Note 2
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []

        for _ in range(self.n_iter):
            errors = 0
            # Note 3
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                # Note 4
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            # Record the misclassification count once per epoch
            self.errors_.append(errors)
        return self
    def net_input(self, X):
        # Note 5
        return np.dot(X, self.w_[1:]) + self.w_[0]
    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)

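Note 1:
    eta: learning rate
    n_iter: number of training passes over the data
    w_: weight vector of the neuron's input branches
    errors_: records how many times the neuron misclassifies
Note 2:
    Takes the training data and trains the neuron; X holds the input sample vectors and y the corresponding class labels
    X: shape [n_samples, n_features]
    Example: X: [[1, 2, 3], [4, 5, 6]]
    Then: n_samples: 2, n_features: 3, y: [1, -1]

    The weight vector is initialized to zeros; the extra 1 makes room for w0, the threshold of the step function
Note 3:
    Example: X: [[1, 2, 3], [4, 5, 6]]
    With y: [1, -1], zip(X, y) yields the pairs ([1, 2, 3], 1) and ([4, 5, 6], -1)

    update = eta * (y - y'), where y' is the predicted class
Note 4:
    xi is a vector
    update * xi is equivalent to [Δw1 = x1*update, Δw2 = x2*update, ...]
Note 5:
    z = w0*1 + w1*x1 + w2*x2 + ...
    np.dot() computes the dot product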
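
To see the perceptron converge before moving on to real data, it can be trained on a tiny data set. The sketch below repeats a compact copy of the class so that it runs on its own; the four samples, their labels, and eta=0.5 are made up for illustration, and the error count is recorded once per training pass.

```python
import numpy as np

class Perceptron:
    """Compact copy of the perceptron above so this sketch runs on its own."""
    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)  # one entry per training pass
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)

# Hypothetical, linearly separable toy data.
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([-1, -1, 1, 1])

toy = Perceptron(eta=0.5, n_iter=10).fit(X_toy, y_toy)
print(toy.errors_)          # misclassifications in each pass
print(toy.predict(X_toy))   # predictions for the training samples
```

Once a pass finishes with zero errors the weights stop changing, so every later pass also reports zero.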

Parsing and visualizing the data

Data file (iris.data.csv): https://graph-bed-1256708472.cos.ap-chengdu.myqcloud.com/pythondata%2Firis.data.csv

import pandas as pd 

file = "C:/Users/YYDL/Desktop/data.csv"
# header=None: the first row of the file is data, not a header
df = pd.read_csv(file, header = None)
# Show the first ten rows
df.head(10)

import matplotlib.pyplot as plt
import numpy as np

# 1)
y = df.iloc[0:100, 4].values
print(y)
y = np.where(y == 'Iris-setosa', -1, 1)
X = df.iloc[0:100, [0, 2]].values
print(X)
# 2)
plt.scatter(X[:50, 0], X[:50, 1], color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1], color='blue', marker='x', label='versicolor')
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.legend(loc='upper left')
plt.show()
# 3)
ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)
plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, marker='o')
plt.xlabel("Epochs")
plt.ylabel("error count")
plt.show()

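 1) Preparing the data for visualization
 Take the fifth column of the first 100 rows
 Convert the class strings into the numbers -1 and 1
 Extract columns 0 and 2 of the first 100 rows
 2) Scatter plot
 First 50 rows of X: column 0 as the x coordinate, column 1 as the y coordinate, drawn as red circles
 Next 50 rows of X: column 0 as the x coordinate, column 1 as the y coordinate, drawn as blue crosses
 3) Training the perceptron
 Plot the number of misclassifications made by the model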

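The plots show that the two classes are linearly separable, which is what the perceptron classification algorithm requires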
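
The label conversion and column selection from step 1) can be checked without the CSV file. The sketch below builds a four-row stand-in with the same column layout; the name `df_demo` and its rows are illustrative only, and `to_numpy()` is used here in place of `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row stand-in for the iris file (same column layout, no header).
df_demo = pd.DataFrame([
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'],
    [6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'],
])

# Column 4 holds the class names; map setosa to -1 and everything else to 1.
labels = df_demo.iloc[0:4, 4].to_numpy()
y_demo = np.where(labels == 'Iris-setosa', -1, 1)

# Columns 0 and 2 are the two features used for the plots.
X_demo = df_demo.iloc[0:4, [0, 2]].to_numpy()

print(y_demo)         # [-1 -1  1  1]
print(X_demo.shape)   # (4, 2)
```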

Classifying the data with the trained neuron

from matplotlib.colors import ListedColormap

def plot_decision_region(X, y, classifier, resolution=0.02):
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    # len(np.unique(y)) = 2, so only the first two colors are used
    cmap = ListedColormap(colors[:len(np.unique(y))])
    # Range of the sepal length
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max()
    # Range of the petal length
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max()
    print(x1_min, x1_max)
    print(x2_min, x2_max)
    # (see the remark below)
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
    # Print the grid vectors and the matrices built from them
    print(np.arange(x1_min, x1_max, resolution).shape)
    print(np.arange(x1_min, x1_max, resolution))
    print(xx1.shape)
    print(xx1)
    print(np.arange(x2_min, x2_max, resolution).shape)
    print(np.arange(x2_min, x2_max, resolution))
    print(xx2.shape)
    print(xx2)
# Run it (the call must use the same name as the definition above)
plot_decision_region(X, y, ppn, resolution=0.02)

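Remark:
    np.meshgrid expands the vectors produced by np.arange() into matrices

    a = np.arange(x1_min, x1_max, resolution) has 185 elements
    xx1 has shape (255, 185): a is used as one row and repeated for 255 rows
    b = np.arange(x2_min, x2_max, resolution) has 255 elements
    xx2 has shape (255, 185): b is used as one column and repeated for 185 columns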
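
The same expansion is easier to see on a grid small enough to print. A minimal sketch with made-up vectors of 3 and 2 elements:

```python
import numpy as np

a = np.arange(0, 3)   # 3 values along the x1 axis: [0 1 2]
b = np.arange(0, 2)   # 2 values along the x2 axis: [0 1]
xx1, xx2 = np.meshgrid(a, b)

print(xx1.shape, xx2.shape)   # (2, 3) (2, 3)
print(xx1)                    # every row is a copy of a
print(xx2)                    # every column is a copy of b

# Flattening and stacking gives one (x1, x2) pair per grid point,
# which is the input format that predict() expects.
grid_points = np.array([xx1.ravel(), xx2.ravel()]).T
print(grid_points.shape)      # (6, 2)
```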

from matplotlib.colors import ListedColormap

def plot_decision_region(X, y, classifier, resolution=0.02):
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max()
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max()

    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
    z = classifier.predict(np.array([xx1.ravel(),xx2.ravel()]).T)
    z = z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    plt.scatter(X[:50,0],X[:50,1],color='red',marker='o',label='setosa')         
    plt.scatter(X[50:100,0],X[50:100,1],color='blue',marker='x',label='versicolor')
# 执行语句
plot_decision_region(X,y,ppn,resolution=0.02)
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.legend(loc='upper left')
plt.show()

Adaptive linear neuron (Adaline)

1) Definition of distance

2) Gradient descent
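
The formulas for these two headings are not included in the notes. Judging from the AdalineGD code further down, the distance is the sum of squared errors J(w) = 1/2 * sum((y - z)^2) with z = w0 + w1*x1 + w2*x2 + ..., and each step moves the weights against its gradient. The sketch below spells that out under this assumption; the helper names `cost` and `gradient` and the toy data are hypothetical.

```python
import numpy as np

def cost(w, X, y):
    """Sum of squared errors J(w) = 1/2 * sum((y - z)^2), z = w0 + X.dot(w[1:])."""
    errors = y - (X.dot(w[1:]) + w[0])
    return 0.5 * (errors ** 2).sum()

def gradient(w, X, y):
    """Gradient of J: dJ/dw0 = -sum(errors), dJ/dwj = -sum(errors * x_j)."""
    errors = y - (X.dot(w[1:]) + w[0])
    return np.concatenate(([-errors.sum()], -X.T.dot(errors)))

# Hypothetical toy data.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0])
w = np.zeros(3)

# Gradient descent: step against the gradient so that the cost shrinks.
eta = 0.01
costs = []
for _ in range(20):
    costs.append(cost(w, X, y))
    w -= eta * gradient(w, X, y)
print(costs[0], costs[-1])
```

Subtracting `eta * gradient` is the same as adding `eta * X.T.dot(errors)` to the feature weights and `eta * errors.sum()` to w0, which is how the update is written in AdalineGD. If eta is chosen too large, the cost can grow instead of shrinking.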

Adaline implementation

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

class Perceptron(object):
    def __init__(self, eta = 0.01, n_iter = 10):
        self.eta = eta
        self.n_iter = n_iter
    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.errors_ = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                update = self.eta * (target - self.predict(xi))
                self.w_[1:] += update * xi
                self.w_[0] += update
                errors += int(update != 0.0)
            self.errors_.append(errors)
        return self
    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]
    def predict(self, X):
        return np.where(self.net_input(X) >= 0.0, 1, -1)

file = "C:/Users/YYDL/Desktop/data.csv"
df = pd.read_csv(file, header = None)
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)
X = df.iloc[0:100, [0, 2]].values
ppn = Perceptron(eta=0.1, n_iter=10)
ppn.fit(X, y)

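def plot_decision_region(X, y, classifier, resolution=0.02):
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max()
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max()
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
    # Classify every grid point and shade the two decision regions
    z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    z = z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())
    plt.scatter(X[:50, 0], X[:50, 1], color='red', marker='o', label='setosa')
    plt.scatter(X[50:100, 0], X[50:100, 1], color='blue', marker='x', label='versicolor')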
    
class AdalineGD(object):
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter
    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            # Step along the partial derivatives of the sum of squared errors
            self.w_[1:] += self.eta * X.T.dot(errors)
            # Update the threshold weight w0
            self.w_[0] += self.eta * errors.sum()
            cost = (errors ** 2).sum() / 2.0
            self.cost_.append(cost)
        return self
    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]
    def activation(self, X):
        return self.net_input(X)
    def predict(self, X):
        return np.where(self.activation(X) >= 0, 1, -1)

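# Create the Adaline object ada: learning rate 0.0001, 50 training passes
ada = AdalineGD(eta = 0.0001, n_iter = 50)
# Train the neuron iteratively
ada.fit(X, y)
# Build the prediction grid and classify it with the model
plot_decision_region(X, y, classifier = ada)
# Show the decision regions before starting the next figure
plt.show()
# Plot the cost after each training pass
plt.plot(range(1, len(ada.cost_)+1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')
plt.show()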
