Exploring the Geometry of Machine Learning Algorithms: Using Distance, Angle, and Convexity to Optimize SVM, PC

Author: iCloudEnd | Published: 2023-03-07 15:08

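Machine learning algorithms are used across many fields and have changed how we approach data analysis. They are built on mathematical models and usually operate in high-dimensional spaces, which makes their behavior hard to interpret. Understanding the geometry of these models, however, gives useful insight into how they work and how to tune them for better performance.

In this article we look at the role geometry plays in machine learning algorithms. We discuss key geometric concepts such as distance, angle, and convexity, and how they relate to various algorithms. We also give examples of how geometric intuition helps to optimize and understand algorithms such as support vector machines (SVM) and principal component analysis (PCA).

Overall, this article gives an overview of the role of geometry in machine learning algorithms. By the end, readers should better understand how geometric concepts are built into these algorithms and how they can be used to improve performance.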

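Geometric concepts in machine learning:

Machine learning algorithms are rooted in mathematical models and rely heavily on geometric concepts to interpret and analyze data. Here we outline the key geometric concepts relevant to these algorithms.

One important concept is distance, which measures how far apart two points in a space are. In machine learning, distance is commonly used to measure how similar two data points are, for example in the k-nearest neighbors (KNN) algorithm. Similarly, hierarchical clustering uses distance to group data points into clusters.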
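
As a rough illustration of how distance drives KNN, the sketch below classifies a query point by majority vote among its nearest neighbors under Euclidean distance. The helper names (`euclidean_distances`, `knn_predict`) and the toy data are made up for this example and do not come from any library.

```python
import numpy as np

def euclidean_distances(points, query):
    """Return the Euclidean distance from `query` to each row of `points`."""
    return np.sqrt(((points - query) ** 2).sum(axis=1))

def knn_predict(points, labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = euclidean_distances(points, query)
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Toy data (hypothetical): two small, well-separated clusters
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([0.5, 0.5])))  # query near the first cluster
print(knn_predict(points, labels, np.array([5.5, 5.5])))  # query near the second cluster
```

Swapping `euclidean_distances` for another metric (for instance Manhattan distance) changes which points count as "near", and so can change the prediction.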

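Another important geometric concept is angle, which measures the relative orientation of two vectors. In machine learning, angles appear in similarity measures such as cosine similarity, and direction also matters in optimization: gradient descent updates the model parameters in the direction opposite to the gradient of the loss, which is the direction of steepest descent.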
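
A minimal sketch of measuring the angle between two vectors through cosine similarity follows. The function names and the example vectors are chosen here for illustration only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def angle_degrees(u, v):
    """Angle between u and v in degrees."""
    cos = np.clip(cosine_similarity(u, v), -1.0, 1.0)  # guard against rounding error
    return float(np.degrees(np.arccos(cos)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 2.0])

print(angle_degrees(a, b))       # about 45 degrees
print(angle_degrees(a, c))       # about 90 degrees: orthogonal vectors
print(cosine_similarity(a, 3 * a))  # scaling a vector does not change its direction
```

Because the measure depends only on direction, it is often preferred over Euclidean distance when the magnitude of a feature vector is not meaningful.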

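Convexity is another key concept in machine learning algorithms. For a convex function every local minimum is also a global minimum (and a strictly convex function has at most one minimizer), so optimization techniques such as gradient descent can find it reliably. Convexity is also relevant to SVMs, whose goal is to find the hyperplane that maximizes the margin between two classes of data. This goal can be expressed as a convex optimization problem.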
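
To make this concrete, here is a small sketch of gradient descent on a strictly convex quadratic. The function, step size, and starting points are assumptions picked for the example; the point is only that different starting points end up at the same minimizer.

```python
import numpy as np

def f(p):
    """A strictly convex function with its minimum at (3, -1)."""
    x, y = p
    return (x - 3.0) ** 2 + 2.0 * (y + 1.0) ** 2

def grad_f(p):
    """Gradient of f."""
    x, y = p
    return np.array([2.0 * (x - 3.0), 4.0 * (y + 1.0)])

def gradient_descent(start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient."""
    p = np.array(start, dtype=float)
    for _ in range(steps):
        p = p - learning_rate * grad_f(p)
    return p

result_a = gradient_descent([10.0, 10.0])
result_b = gradient_descent([-7.0, 4.0])
print(result_a, result_b)  # both runs approach the same point
```

On a non-convex function the two runs could settle in different local minima, which is why convexity makes the optimization problem so much easier to reason about.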

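Understanding these geometric concepts is essential to understanding how machine learning algorithms work and how to optimize them for better performance. The following sections look at concrete examples of geometric concepts in specific algorithms.

Support vector machines (SVM):

The support vector machine (SVM) is a popular machine learning algorithm for classification tasks. It takes a geometric approach: it classifies data by finding the hyperplane that maximizes the margin between two classes. The margin is the distance between the hyperplane and the nearest data point of each class, and the SVM aims to make that distance as large as possible.

The hyperplane in an SVM is the geometric object that separates the two classes. It can be written as a linear equation in a high-dimensional space in which each dimension corresponds to one feature of the data. The margin is geometric too, since it measures the distance from the hyperplane to the nearest data points of each class.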
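
The distance from a point to a hyperplane can be computed directly. The sketch below assumes the hyperplane is written as w·x + b = 0 with labels in {-1, +1}; the particular `w`, `b`, and points are hypothetical values chosen for illustration.

```python
import numpy as np

def signed_distance(X, w, b):
    """Signed perpendicular distance of each row of X to the hyperplane w.x + b = 0."""
    return (X @ w + b) / np.linalg.norm(w)

def geometric_margin(X, y, w, b):
    """Smallest distance from the hyperplane to any point, counted as negative
    for points on the wrong side. Labels y must be -1 or +1."""
    return float(np.min(y * signed_distance(X, w, b)))

# Hypothetical hyperplane x1 + x2 = 3 and four labeled points
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

print(signed_distance(X, w, b))
print(geometric_margin(X, y, w, b))  # distance of the closest point, here (1, 1)
```

An SVM searches over `w` and `b` for the hyperplane that makes this smallest distance as large as possible.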

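SVMs can be implemented and optimized using geometric ideas such as convex optimization. The goal of an SVM, finding the hyperplane that maximizes the margin, can be formulated as a convex optimization problem, which can be solved with techniques such as quadratic programming or (sub)gradient descent.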
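
For reference, the standard textbook soft-margin form of this problem can be written as below. The symbols $w$, $b$, $\xi_i$, $x_i$, $y_i \in \{-1, +1\}$, and $C$ are notation introduced here for illustration and are not used in the text above.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,n.
```

The objective is a convex quadratic and the constraints are linear, which is why the problem is a quadratic program. Since the width of the margin is $2/\lVert w\rVert$, minimizing $\lVert w\rVert^{2}$ maximizes the margin, while $C$ controls how heavily margin violations $\xi_i$ are penalized.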

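For example, consider a dataset with two classes of points (red and blue) that are not linearly separable in two dimensions. By mapping the points into a higher-dimensional space, which a kernel function does implicitly, we can find a hyperplane there that separates the two classes. The margin is again the distance between that hyperplane and the nearest point of each class, and optimization techniques are used to find the hyperplane that maximizes it.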
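
The idea can be sketched with an explicit feature map. Note the assumptions: the two rings of points, the map (x1, x2) -> (x1, x2, x1^2 + x2^2), and the separating plane are all chosen by hand for this example, whereas a kernel SVM performs a comparable mapping implicitly and learns the hyperplane itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """n random points on a circle of the given radius around the origin."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack((radius * np.cos(angles), radius * np.sin(angles)))

inner = ring(1.0, 50)   # "red" class: no straight line in 2D separates it
outer = ring(3.0, 50)   # from the "blue" class that surrounds it

def feature_map(X):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    return np.column_stack((X, (X ** 2).sum(axis=1)))

# Hand-picked hyperplane in the 3D feature space: third coordinate = 5
w = np.array([0.0, 0.0, 1.0])
b = -5.0

inner_side = feature_map(inner) @ w + b
outer_side = feature_map(outer) @ w + b
print((inner_side < 0).all(), (outer_side > 0).all())  # each class lies on one side
```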

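Overall, support vector machines show how geometric concepts can be used to classify data and to optimize a machine learning algorithm. Understanding the geometry behind SVMs gives insight into how they work and how to improve them for better performance.

Code:

This code creates an SVM classifier with a linear kernel, trains it on the iris dataset (using only the first two features), and plots the decision regions together with the training points. The decision boundaries are the lines that separate the classes, and the support vectors are the training points that lie closest to a decision boundary.

The resulting plot shows how the SVM classifier separates the classes of the iris dataset with linear decision boundaries. You can experiment with different SVM kernels and datasets to see how the decision boundary changes.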

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Import iris dataset
iris = datasets.load_iris()

# We will only use the first two features of the iris dataset
X = iris.data[:, :2]
y = iris.target

# Create an instance of SVM with a linear kernel
C = 1.0  # SVM regularization parameter
clf = svm.SVC(kernel='linear', C=C)

# Train the SVM on the iris dataset
clf.fit(X, y)

# Plot the decision boundary
# Code adapted from: https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html
# Create a meshgrid of points to plot
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Plot the decision regions and the training points
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title('SVM Decision Boundary')
plt.show()

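This code creates SVM classifiers with a linear kernel and different values of the regularization parameter C. It trains each one on the iris dataset and plots the decision regions for each value of C. A larger C penalizes margin violations more heavily, which typically gives a narrower margin and fewer support vectors, while a smaller C tolerates more violations, which typically gives a wider margin and more support vectors.

By visualizing the effect of C on the SVM decision boundary, we can better understand how the regularization parameter affects the classifier's behavior.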

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Import iris dataset
iris = datasets.load_iris()

# We will only use the first two features of the iris dataset
X = iris.data[:, :2]
y = iris.target

# Create a range of values for the regularization parameter C.
# Very large values (e.g. 1e10) can make the solver extremely slow on
# data that is not linearly separable, so the range stops at 1e3.
C_range = np.logspace(-2, 3, 6)

# Plot the decision regions for different values of C
plt.figure(figsize=(12, 8))
for i, C in enumerate(C_range):
    clf = svm.SVC(kernel='linear', C=C)
    clf.fit(X, y)
    plt.subplot(2, 3, i + 1)

    # Create a meshgrid of points to plot
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))

    # Plot the decision boundary and the support vectors
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title('C = %0.2f' % C)

plt.suptitle('Effect of regularization parameter C on SVM decision boundary')
plt.tight_layout()
plt.show()

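This code creates an SVM classifier with a radial basis function (RBF) kernel and trains it on a dataset that is not linearly separable, built by appending 100 extra random points, all labeled as class 1, to the iris data. It then plots the decision regions and the support vectors. Because the RBF kernel is non-linear, the resulting decision boundary is non-linear too and can capture more complex patterns in the data.

Visualizing the decision boundary of a non-linear SVM that uses the kernel trick shows how SVMs can be applied to non-linear classification problems.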

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Import iris dataset
iris = datasets.load_iris()

# We will only use the first two features of the iris dataset
X = iris.data[:, :2]
y = iris.target

# Create a non-linearly separable dataset by adding noise
np.random.seed(0)
X = np.concatenate((X, np.random.randn(100, 2) * 0.5 + np.array([2, 2])))
y = np.concatenate((y, np.array([1] * 100)))

# Create an SVM classifier with a radial basis function (RBF) kernel
clf = svm.SVC(kernel='rbf', gamma='auto')
clf.fit(X, y)

# Plot the decision boundary and the support vectors
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)

# Create a meshgrid of points to plot
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 500),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 500))

# Predict the class for each point in the meshgrid and plot the decision boundary
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

# Plot the support vectors
sv = clf.support_vectors_
plt.scatter(sv[:, 0], sv[:, 1], s=100, facecolors='none', edgecolors='k')

plt.title('Non-linear SVM decision boundary using RBF kernel')
plt.show()

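This code creates an SVM classifier with a linear kernel and trains it on a two-class subset of the iris dataset. It then plots the decision boundary, the margin, and the support vectors. Because a linear kernel yields a linear decision boundary, the SVM looks for the hyperplane that separates the two classes with the largest margin.

Visualizing the decision boundary of a linear SVM shows how SVMs solve linear classification problems: by finding the hyperplane that maximizes the margin between the classes.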

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Import iris dataset
iris = datasets.load_iris()

# We will only use the first two features of the iris dataset
X = iris.data[:, :2]
y = iris.target

# Keep two classes (setosa and versicolor), which are close to linearly
# separable in these two features; versicolor and virginica overlap heavily
mask = y != 2
X = X[mask]
y = y[mask]

# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# Plot the decision boundary and the support vectors
plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
plt.ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)

# Get the coefficients and intercept of the hyperplane
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
yy = a * xx - (clf.intercept_[0]) / w[1]

# Plot the hyperplane
plt.plot(xx, yy, 'k-')

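# Plot the margin: the dashed lines are parallel to the hyperplane at a
# perpendicular distance of 1 / ||w||, which corresponds to a vertical
# offset of sqrt(1 + a^2) * margin
margin = 1 / np.sqrt(np.sum(clf.coef_ ** 2))
yy_down = yy - np.sqrt(1 + a ** 2) * margin
yy_up = yy + np.sqrt(1 + a ** 2) * margin
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')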

# Plot the support vectors
sv = clf.support_vectors_
plt.scatter(sv[:, 0], sv[:, 1], s=100, facecolors='none', edgecolors='k')

plt.title('Linear SVM decision boundary')
plt.show()

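This code creates an SVM classifier with a radial basis function (RBF) kernel and trains it on a version of the iris dataset that has been made non-linearly separable by adding noise. It then builds a mesh grid to plot the SVM decision surface, which shows how the SVM classifies points in feature space according to their distance from the support vectors.

Visualizing the decision boundary of a non-linear SVM shows how SVMs solve non-linear classification problems by mapping the input data into a higher-dimensional feature space in which it is more likely to be linearly separable.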
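
The role of distance in the RBF kernel can be sketched directly from its usual definition, K(x, x') = exp(-gamma * ||x - x'||^2). The helper `rbf_kernel` below is written for this example (it is not the scikit-learn function), and the three points are arbitrary.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.7):
    """Matrix of exp(-gamma * squared Euclidean distance) for all pairs of rows."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel(X, X)
print(np.round(K, 4))
# Diagonal entries are 1 (zero distance); entries shrink toward 0 as points get farther apart.
```

A larger `gamma` makes the similarity fall off faster with distance, so each support vector influences a smaller neighborhood and the decision boundary becomes more irregular.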

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# Import iris dataset
iris = datasets.load_iris()

# We will only use the first two features of the iris dataset
X = iris.data[:, :2]
y = iris.target

# Create a non-linearly separable dataset by adding some noise
np.random.seed(0)
X = np.vstack((X[y == 0][:50] + np.random.randn(50, 2),
               X[y == 1][:50] + np.random.randn(50, 2),
               X[y == 2][:50] + np.random.randn(50, 2)))
y = np.hstack((np.zeros(50), np.ones(50), np.ones(50) * 2))

# Create an SVM classifier with a radial basis function (RBF) kernel
clf = svm.SVC(kernel='rbf', gamma=0.7, C=1.0)
clf.fit(X, y)

# Create a meshgrid to plot the decision surface
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 500),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 500))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision surface
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.title('Non-linear SVM decision boundary')
plt.show()

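This code creates a binary SVM classifier with a linear kernel and trains it on the breast cancer dataset, using only its first two features. It then computes predicted probabilities for the test set and uses them to compute the false positive rate and true positive rate at various thresholds. Finally, it plots the classifier's ROC curve and computes the area under the curve (AUC).

Visualizing the ROC curve and AUC score of a binary SVM classifier lets us evaluate its performance and compare it with other classifiers using the same evaluation metric.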
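
Each point on an ROC curve is one (false positive rate, true positive rate) pair obtained at a single threshold. The sketch below computes such pairs by hand; the function `tpr_fpr`, the labels, and the scores are made up for illustration.

```python
import numpy as np

def tpr_fpr(y_true, scores, threshold):
    """True positive rate and false positive rate when predicting
    positive for every score >= threshold."""
    predicted_positive = scores >= threshold
    positives = y_true == 1
    negatives = ~positives
    tpr = (predicted_positive & positives).sum() / positives.sum()
    fpr = (predicted_positive & negatives).sum() / negatives.sum()
    return float(tpr), float(fpr)

# Hypothetical labels and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

for threshold in (0.3, 0.5, 0.75):
    print(threshold, tpr_fpr(y_true, scores, threshold))
```

Lowering the threshold can only increase both rates, which is why the ROC curve runs from (0, 0) to (1, 1).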

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Import breast cancer dataset
cancer = datasets.load_breast_cancer()

# We will only use the first two features of the breast cancer dataset
X = cancer.data[:, :2]
y = cancer.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Create a linear SVM classifier
clf = svm.SVC(kernel='linear', probability=True)
clf.fit(X_train, y_train)

# Compute the predicted probabilities for the test set
y_prob = clf.predict_proba(X_test)[:, 1]

# Compute the false positive rate and true positive rate for various threshold values
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Compute the area under the ROC curve (AUC)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(10, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()

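Principal component analysis (PCA):

How PCA uses geometric concepts to find the directions of greatest variance in the data:

Principal component analysis (PCA) is a widely used technique for reducing the dimensionality of high-dimensional data while retaining the important information. This section explains in more detail how PCA works and how it uses geometric concepts to find the directions of greatest variance in the data.

PCA starts by finding the directions along which the data varies the most; these directions are called principal components. The first principal component is the direction that captures the greatest variance in the data. The second principal component is the direction orthogonal to the first that captures the greatest remaining variance, and so on.

Eigenvectors and the covariance matrix as geometric concepts in PCA:

This section explains how eigenvectors and eigenvalues are used to compute the principal components. The eigenvectors give the directions of greatest variance in the data, and the eigenvalues give the amount of variance along each of those directions.

It also explains why the covariance matrix is another important geometric concept in PCA. The covariance matrix is a symmetric matrix that describes the pairwise relationships between the features of the data. Its diagonal entries are the variances of the individual features, and its off-diagonal entries are the covariances between pairs of features.

Implementing and tuning PCA using geometric concepts:

This section gives a practical example of PCA using Python's scikit-learn library, and shows how to compute and visualize the principal components to gain insight into the structure of the data.

It also covers how to choose the number of principal components so as to balance information retention against computational cost. This involves computing the explained variance ratio of each principal component and selecting the number of components based on a predetermined threshold or by cross-validation.

The section closes with a discussion of how to interpret PCA results, including how the original features relate to the principal components and how to visualize the data in the reduced space.

PCA is a statistical method that transforms a set of high-dimensional data points into a lower-dimensional space while preserving as much of the original variation in the data as possible. It does this by finding the directions of greatest variance in the data, called principal components. The first principal component is the direction that captures the greatest variance. Each subsequent component captures the greatest remaining variance in a direction orthogonal to the previous components.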

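PCA works by computing the eigenvectors and eigenvalues of the covariance matrix of the data. The eigenvectors give the directions of greatest variance in the data, and the corresponding eigenvalues give the amount of variance along those directions. The eigenvectors with the largest eigenvalues identify the principal components that capture the most variation in the data.

The geometric idea behind PCA is that data points can be viewed as vectors in a high-dimensional space. The principal components are the directions along which the data points vary the most, and together they form an orthogonal basis of the space. The size of the eigenvalue associated with each principal component indicates how much of the total variation in the data that component captures.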
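
A minimal from-scratch sketch of these steps using NumPy only is shown below. The synthetic data (a noisy line y = 2x) and all variable names are assumptions made for the example; it is not meant to reproduce how any particular library implements PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack((x, 2.0 * x + rng.normal(size=200)))  # synthetic 2D data

# 1. Center the data and compute its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# 2. Eigen-decomposition (eigh is meant for symmetric matrices), sorted by
#    decreasing eigenvalue so that the first column is the first principal component
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]

# 3. Express every point in the basis of principal components
scores = centered @ components

print(np.allclose(components.T @ components, np.eye(2)))     # the basis is orthonormal
print(np.allclose(scores.var(axis=0, ddof=1), eigenvalues))  # variance along each axis = eigenvalue
```

Keeping only the first column of `scores` would give the one-dimensional representation that preserves the most variance.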

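PCA is a powerful tool for analyzing high-dimensional data and identifying its most important patterns and trends. Reducing the dimensionality of the data also helps to make subsequent analysis more efficient and easier to interpret.

To understand how eigenvectors and eigenvalues are used in PCA, let us first define what they are.

An eigenvector of a matrix A is a non-zero vector v that, when multiplied by A, yields a scalar multiple of v. In other words, Av = λv, where λ is the eigenvalue corresponding to the eigenvector v. Eigenvectors and eigenvalues describe how a linear transformation acts on vectors.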
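
The definition Av = λv can be checked numerically. The matrix below is an arbitrary symmetric example chosen for illustration.

```python
import numpy as np

# An arbitrary symmetric 2x2 matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # Multiplying by A only rescales v by lam; its direction is unchanged
    print(lam, np.allclose(A @ v, lam * v))
```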

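In PCA, we compute the eigenvectors and eigenvalues of the covariance matrix of the data. The covariance matrix is a symmetric matrix that describes the pairwise relationships between the features. Its diagonal entries are the variances of the individual features, and its off-diagonal entries are the covariances between pairs of features.

The eigenvectors of the covariance matrix give the directions of greatest variance in the data. The eigenvector with the largest eigenvalue is the direction of greatest variance, that is, the first principal component. The second principal component is the eigenvector with the second-largest eigenvalue, and so on.

The eigenvalues themselves are the amounts of variance explained by each principal component. The sum of all eigenvalues equals the total variance of the data. We can use this information to decide how many principal components to keep in the reduced representation of the data.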
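
One way to apply this is sketched below: compute the fraction of variance explained by each eigenvalue and keep the smallest number of components whose cumulative fraction reaches a threshold. The synthetic data and the 95% threshold are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 500 samples, 4 features with very different spreads
data = rng.normal(size=(500, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

cov = np.cov(data, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]      # sorted from largest to smallest

total_variance = np.trace(cov)                   # sum of the per-feature variances
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

threshold = 0.95                                 # assumed target for retained variance
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(np.isclose(eigenvalues.sum(), total_variance))  # eigenvalues sum to the total variance
print(np.round(explained_ratio, 3), n_components)
```

With scikit-learn the same ratios are available as `explained_variance_ratio_` on a fitted `PCA` object.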

import numpy as np
import matplotlib.pyplot as plt

# generate 1000 random 2-dimensional points
points = np.random.randn(2, 1000)

# compute the covariance matrix
covariance_matrix = np.cov(points)

# compute the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

# plot the points
plt.scatter(points[0], points[1], alpha=0.2)

# plot the eigenvectors
plt.plot([0, eigenvectors[0, 0]], [0, eigenvectors[1, 0]], 'r', label='Eigenvector 1')
plt.plot([0, eigenvectors[0, 1]], [0, eigenvectors[1, 1]], 'b', label='Eigenvector 2')

# mark the tip of each eigenvector after scaling it by its eigenvalue
plt.scatter(eigenvalues[0]*eigenvectors[0, 0], eigenvalues[0]*eigenvectors[1, 0], s=100, marker='o', label='Eigenvalue 1')
plt.scatter(eigenvalues[1]*eigenvectors[0, 1], eigenvalues[1]*eigenvectors[1, 1], s=100, marker='o', label='Eigenvalue 2')

# set axis limits and labels
plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.xlabel('x')
plt.ylabel('y')
plt.legend()

# show the plot
plt.show()

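The covariance matrix is an important geometric concept in PCA because it describes the pairwise relationships between the features of the data. By computing its eigenvectors and eigenvalues, we can identify the directions of greatest variance and the amount of variance explained by each principal component. This lets us reduce the dimensionality of the data while retaining as much information as possible.

This code generates a scatter plot of the original data using matplotlib.pyplot. The np.random.seed function is used to make the data reproducible. The scatter plot is drawn with ax.scatter, and its x label, y label, and title are set with ax.set_xlabel, ax.set_ylabel, and ax.set_title respectively. Finally, plt.show() displays the plot.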

import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(0)
x = np.random.normal(size=100)
y = 2 * x + np.random.normal(size=100)

# Plot scatter plot
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_title('Scatter Plot of Original Data')
plt.show()

from sklearn.decomposition import PCA

# Fit PCA model to data
data = np.array([x, y]).T
pca = PCA(n_components=2)
pca.fit(data)

# Project data onto first principal component
data_transformed = pca.transform(data)
first_pc = pca.components_[0]

# Plot scatter plot of data projected onto first principal component
fig, ax = plt.subplots()
ax.scatter(data_transformed[:, 0], np.zeros_like(data_transformed[:, 0]))
# In the projected coordinates the first principal component is the horizontal axis
ax.arrow(0, 0, 1, 0, head_width=0.1, head_length=0.1, fc='k', ec='k')
ax.set_xlim([-4, 4])
ax.set_xlabel('First Principal Component')
ax.set_title('Data Projected onto First Principal Component')
plt.show()

# Generate sample data
np.random.seed(0)
data = np.random.normal(size=(100, 4))

# Fit PCA model to data
pca = PCA(n_components=4)
pca.fit(data)

# Plot scree plot of PCA model
fig, ax = plt.subplots()
ax.plot(range(1, 5), pca.explained_variance_ratio_, 'o-')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Explained Variance Ratio')
ax.set_title('Scree Plot')
plt.show()

# Generate sample data
np.random.seed(0)
data = np.random.normal(size=(100, 4))

# Fit PCA model to data
pca = PCA(n_components=2)
pca.fit(data)

# Plot biplot of PCA model: samples are drawn as scores on the first two
# principal components, features as loading arrows in the same plane
data_pca = pca.transform(data)
fig, ax = plt.subplots()
ax.scatter(data_pca[:, 0], data_pca[:, 1])
for i in range(pca.components_.shape[1]):
    ax.arrow(0, 0, pca.components_[0, i], pca.components_[1, i], head_width=0.1, head_length=0.1, fc='k', ec='k')
    ax.text(pca.components_[0, i], pca.components_[1, i], f'Feature {i+1}', ha='center', va='center', fontsize=12)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('Biplot')
plt.show()

# Reconstruct data from first two principal components
data_pca = pca.transform(data)
data_reconstructed = pca.inverse_transform(data_pca)

# Plot the first three features of the reconstructed data
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data_reconstructed[:, 0], data_reconstructed[:, 1], data_reconstructed[:, 2])
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.show()

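In this article we discussed why it is important to understand the geometry of machine learning algorithms. We started with what geometry means in the context of machine learning and how it relates to the optimization of these algorithms. We then looked at geometric concepts such as distance, angle, and convexity, and at how they affect the performance and behavior of machine learning models.

We also used visualization to gain insight into the geometry of high-dimensional spaces, which can be particularly hard to grasp. By visualizing the decision boundaries and feature spaces of machine learning models, we can better understand how they make predictions and spot potential problems or biases.

Understanding the geometry of machine learning algorithms is especially useful for improving their performance. For example, knowing the curvature of the loss function helps in choosing an optimization method suited to the problem at hand. Likewise, visualizing the feature space can reveal regions where a model may be overfitting or underfitting the data.

In summary, the geometry of machine learning algorithms is an important area of study that can help improve both the performance and the interpretability of these models. A deeper understanding of the geometry of high-dimensional spaces lets us develop more effective machine learning techniques and better understand how these algorithms behave.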
