Learning Curves: sklearn.model_selection.lea

Author: 微笑life | Published: 2019-12-25 09:12


Part 1: Learning Curves

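    A learning curve is a way to diagnose a trained model: the number of training samples is increased step by step according to a predefined rule, and the model's accuracy is plotted for each training-set size.

    We can plot Jtrain(theta) and Jtest(theta) on the vertical axis against the training-set size m; the result is the learning curve. It shows directly how model accuracy relates to the amount of training data, and makes it easy to see what state the model is in, for example overfitting or underfitting.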

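    For a dataset of size m, the learning curve can be drawn with the following procedure:

1. Split the dataset into a training set and a cross-validation set (which can be treated as the test set).

2. Take 20% of the training set as training samples and fit the model parameters.

3. Use the cross-validation set to compute the accuracy of the fitted model.

4. With the training accuracy and the cross-validation accuracy on the vertical axis and the number of training samples on the horizontal axis, plot the accuracies computed in the steps above.

5. Increase the training samples by 10% of the training set and go back to step 2; repeat until 100% of the training set is in use.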
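
The steps above can be sketched by hand, without `learning_curve`. This is only an illustration under assumptions that are not in the original post: the digits dataset, a shallow `DecisionTreeClassifier` (chosen because it fits quickly), and an 80/20 split are all arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Step 1: split into a training set and a cross-validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Steps 2 and 5: start at 20% of the training set and grow by 10% up to 100%.
fractions = np.linspace(0.2, 1.0, 9)

sizes, train_acc, val_acc = [], [], []
for frac in fractions:
    m = int(round(frac * len(X_train)))
    model = DecisionTreeClassifier(max_depth=8, random_state=0)
    model.fit(X_train[:m], y_train[:m])
    sizes.append(m)
    # Step 3: accuracy on the samples used for training and on the validation set.
    train_acc.append(model.score(X_train[:m], y_train[:m]))
    val_acc.append(model.score(X_val, y_val))

# Step 4: these three lists are the x values and the two y series of the plot.
for m, tr, va in zip(sizes, train_acc, val_acc):
    print(f"m={m:4d}  train={tr:.3f}  validation={va:.3f}")
```

Unlike this single split, `learning_curve` repeats the procedure over several cross-validation splits and returns one score per split.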

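Part 2: Comparison

Reference: https://blog.csdn.net/u012328159/article/details/79255433

learning_curve(): this function is mainly used to check (visualize) whether a model is overfitting. Overfitting itself is not covered again here; for details see the earlier post "Model Selection and Improvement" (模型选择和改进).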

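from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

# Load the digits dataset as a feature matrix X and a label vector y
(X,y) = datasets.load_digits(return_X_y=True)
# Score a random forest at six training-set sizes with 10-fold cross-validation
train_sizes,train_score,test_score = learning_curve(RandomForestClassifier(),X,y,train_sizes=[0.1,0.2,0.4,0.6,0.8,1.0],cv=10,scoring='accuracy')
# Average over the folds and convert accuracy to error
train_error =  1- np.mean(train_score,axis=1)
test_error = 1- np.mean(test_score,axis=1)
plt.plot(train_sizes,train_error,'o-',color = 'r',label = 'training')
plt.plot(train_sizes,test_error,'o-',color = 'g',label = 'testing')
plt.legend(loc='best')
plt.xlabel('training examples')
plt.ylabel('error')
plt.show()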

validation_curve(): this function is mainly used to see how the model performs for different values of a hyperparameter.

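from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
import matplotlib.pyplot as plt
import numpy as np

(X,y) = datasets.load_digits(return_X_y=True)
# print(X[:2,:])
# Candidate values for the number of trees in the forest
param_range = [10,20,40,80,160,250]
# One row of scores per candidate value, one column per cross-validation fold
train_score,test_score = validation_curve(RandomForestClassifier(),X,y,param_name='n_estimators',param_range=param_range,cv=10,scoring='accuracy')
train_score =  np.mean(train_score,axis=1)
test_score = np.mean(test_score,axis=1)
plt.plot(param_range,train_score,'o-',color = 'r',label = 'training')
plt.plot(param_range,test_score,'o-',color = 'g',label = 'testing')
plt.legend(loc='best')
plt.xlabel('number of trees')
plt.ylabel('accuracy')
plt.show()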
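
Besides plotting, the scores returned by `validation_curve` can be used to pick a parameter value directly. The sketch below is a reduced variant of the example above: a shorter `param_range`, `cv=3`, and a fixed `random_state` are assumptions made only to keep it quick and repeatable, and `best_n` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_digits(return_X_y=True)
param_range = [5, 10, 20, 40]

train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="n_estimators", param_range=param_range,
    cv=3, scoring="accuracy",
)

# Rows follow param_range, columns follow the CV folds.
mean_cv = test_scores.mean(axis=1)
best_n = param_range[int(np.argmax(mean_cv))]

for n, score in zip(param_range, mean_cv):
    print(f"n_estimators={n:3d}  mean CV accuracy={score:.3f}")
print("best n_estimators by mean CV accuracy:", best_n)
```

Taking the maximum of the mean is the simplest rule; when several values score almost the same, the smaller and cheaper one may be preferable.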

Part 3: Parameters

from sklearn.model_selection import learning_curve

Parameter descriptions (reference: https://blog.csdn.net/gracejpw/article/details/102370364)


X : array-like, shape (n_samples, n_features) Training vector, where n_samples is the number of samples and n_features is the number of features.

An m×n matrix, where m is the number of samples and n is the number of features.

y : array-like, shape (n_samples) or (n_samples, n_features), optional Target relative to X for classification or regression; None for unsupervised learning.

An m×1 vector, where m is the number of samples: the targets corresponding to X, for classification or regression.

groups : array-like, with shape (n_samples,), optional Group labels for the samples used while splitting the dataset into train/test set.

Group labels for the samples, used when splitting the dataset into train/test sets. **[Optional]**

**train_sizes**: array-like, shape (n_ticks,), dtype float or int Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually have to be big enough to contain at least one sample from each class. (default: np.linspace(0.1, 1.0, 5))

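Specifies how the number of training samples varies. For example, np.linspace(0.1, 1.0, 5) produces five evenly spaced values from 0.1 to 1, i.e. the sequence [0.1, 0.325, 0.55, 0.775, 1.0]. Each value is taken in turn as the fraction of training samples to use, and the accuracy of the model trained on that many samples is computed.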
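
A short check of how fractional `train_sizes` turn into absolute sample counts. The dataset (digits), the estimator (`GaussianNB`, picked only because it is fast), and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Five evenly spaced fractions: 0.1, 0.325, 0.55, 0.775, 1.0
fractions = np.linspace(0.1, 1.0, 5)
print(fractions)

X, y = load_digits(return_X_y=True)
train_sizes_abs, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=fractions, cv=5
)

# The first return value holds absolute counts. The fraction 1.0 maps to the
# largest training set the CV scheme allows, which is smaller than len(X)
# because one fold is always held out for scoring.
print(train_sizes_abs, "out of", len(X), "samples in total")
```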

**cv**: int, cross-validation generator or an iterable, optional Determines the cross-validation splitting strategy.

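The cross-validation splitting strategy; sklearn.model_selection.ShuffleSplit can be used, for example. Possible inputs are:

    None: use the default 3-fold cross-validation (changed to 5-fold in v0.22).

    An integer: the number of folds in a (Stratified)KFold.

    A CV splitter.

    An iterable yielding (train, test) splits as arrays of indices.

    For integer/None inputs, StratifiedKFold is used if the estimator is a classifier and y is binary or multiclass. In all other cases, KFold is used.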
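
The three kinds of `cv` input can be compared side by side. In this sketch the breast cancer dataset, the shallow decision tree, and the names `sizes`, `splitter` and `custom` are illustrative assumptions; only the number of columns in the returned scores is of interest.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
est = DecisionTreeClassifier(max_depth=3, random_state=0)
sizes = [0.2, 0.6, 1.0]

# Integer: 5 folds (stratified here, since est is a classifier and y is binary).
_, _, scores_int = learning_curve(est, X, y, train_sizes=sizes, cv=5)

# CV splitter object: 4 random 80/20 splits.
splitter = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)
_, _, scores_split = learning_curve(est, X, y, train_sizes=sizes, cv=splitter)

# Iterable of (train_indices, test_indices) pairs: here the first 2 splits.
custom = list(splitter.split(X))[:2]
_, _, scores_custom = learning_curve(est, X, y, train_sizes=sizes, cv=custom)

# One row per training size, one column per split.
print(scores_int.shape, scores_split.shape, scores_custom.shape)
```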

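scoring: string, callable or None, optional, default: None. The metric used to evaluate model performance (e.g. 'accuracy', 'f1', 'neg_mean_squared_error').

exploit_incremental_learning: boolean, optional, default: False

If the estimator supports incremental learning, this is used to speed up fitting for the different training-set sizes.

n_jobs: int or None, optional (default=None)

Number of jobs to run in parallel. None means 1. -1 means using all processors.

pre_dispatch: integer or string, optional

Number of jobs pre-dispatched for parallel execution (default is all). This option can reduce the allocated memory. The string can be an expression such as '2*n_jobs'.

shuffle: boolean, optional

Whether to shuffle the training data before taking prefixes of it based on ``train_sizes``.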

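random_state: int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random. Used when shuffle is True.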
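
A small sketch of how `shuffle` and `random_state` interact: with `shuffle=True`, fixing the seed makes the shuffled prefixes, and therefore the scores, repeatable. The dataset, the estimator and the helper name `cv_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
est = DecisionTreeClassifier(max_depth=3, random_state=0)

def cv_scores(seed):
    """Cross-validation scores when training prefixes are shuffled with `seed`."""
    _, _, test_scores = learning_curve(
        est, X, y, train_sizes=[0.1, 0.5, 1.0], cv=5,
        shuffle=True, random_state=seed,
    )
    return test_scores

# Two runs with the same seed select the same samples at every training size.
same = np.array_equal(cv_scores(0), cv_scores(0))
print("same seed gives identical scores:", same)
```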

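error_score: 'raise' | 'raise-deprecating' or numeric

The value assigned to the score if an error occurs while fitting the estimator. If set to 'raise', the error is raised. If set to 'raise-deprecating', a FutureWarning is printed before the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error. The default is 'raise-deprecating', but from version 0.22 it will change to np.nan.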

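Return values: train_sizes (the training-set sizes actually used), train_scores and test_scores (the scores on the training sets and on the test sets, with one row per training-set size and one column per cross-validation split).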

Part 4: Usage


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
import time

# Breast cancer dataset: X holds the features, y the binary target
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

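def polynomial_model(degree = 1, **kwargs):
    # Pipeline: polynomial feature expansion followed by logistic regression;
    # extra keyword arguments are passed on to LogisticRegression
    polynomial_features = PolynomialFeatures(degree = degree, include_bias = False)
    logistic_regression = LogisticRegression(**kwargs)
    pipeline            = Pipeline([("pf", polynomial_features),
                                    ("lr", logistic_regression)])
    return pipeline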

def plot_learning_curve(plt, estimator, title, X, y, ylim = None, cv = None, n_jobs = 1, train_size = np.linspace(0.1,1,5)):
    # Plot mean training and cross-validation scores against training-set size,
    # with a shaded band of one standard deviation around each curve
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv = cv, n_jobs = n_jobs, train_sizes = train_size)
    print("train_sizes:\n",train_sizes, "\ntrain_scores:\n",train_scores, "\ntest_scores:\n",test_scores)
    # Each row is one training-set size; average and spread are taken over the CV splits
    train_scores_mean = np.mean(train_scores, axis = 1)
    test_scores_mean  = np.mean(test_scores, axis = 1)
    train_scores_std  = np.std(train_scores, axis = 1)
    test_scores_std   = np.std(test_scores,  axis = 1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha = 0.1,color = "r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha = 0.1,color = "g")
    plt.plot(train_sizes, train_scores_mean, "o-", color = "r", label = "Training score")
    plt.plot(train_sizes, test_scores_mean,"o-", color = "g", label = "Cross-validation score")
    plt.legend(loc = "best")
    return plt

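# 10 random splits, each holding out 20% of the data for validation
cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0)
title = "Learning Curves (degree={0}, penalty={1})"
degrees = [1,2]
penalty = ["l1", "l2"]
# time.clock() was removed in Python 3.8; perf_counter() is its replacement
start = time.perf_counter()
plt.figure(figsize = (12,4), dpi = 144)
j = 0
for p in penalty:
    for i in range(len(degrees)):
        plt.subplot(len(penalty), len(degrees), j + 1)
        # solver="liblinear" supports both "l1" and "l2"; the default lbfgs solver
        # in newer scikit-learn versions rejects penalty="l1"
        plot_learning_curve(plt, polynomial_model(degree = degrees[i], penalty = p, solver = "liblinear"), title.format(degrees[i], p), X, y, ylim = (0.8,1.01), cv = cv)
        j += 1
print("elapsed: {0:.6f}".format(time.perf_counter() - start))
plt.tight_layout()
plt.savefig("1.png")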

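The values returned by learning_curve look like this:

[Figure: learning_curve return values]

Five training-set sizes were used with 10 cross-validation splits, so train_sizes is an ndarray of 5 elements, and train_scores and test_scores are 5×10 matrices. Each row holds the scores of every split for one training-set size; the mean of a row is taken as the final accuracy for that size.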
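
The shapes described here can be checked directly, and the gap between the two row means gives a rough overfitting signal. This sketch reuses the same `ShuffleSplit` settings but swaps in a shallow decision tree (an assumption, chosen for speed) in place of the polynomial logistic regression pipeline.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y,
    cv=cv, train_sizes=np.linspace(0.1, 1.0, 5),
)
print(train_sizes.shape, train_scores.shape, test_scores.shape)
# (5,) (5, 10) (5, 10)

# Average over the 10 splits (axis=1) to get one score per training-set size.
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

# A large, persistent gap suggests overfitting; two low, close curves suggest underfitting.
for m, tr, te in zip(train_sizes, train_mean, test_mean):
    print(f"m={m:3d}  train={tr:.3f}  cv={te:.3f}  gap={tr - te:.3f}")
```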

Part 5: Performance Evaluation
