There remains the problem of choosing the order M of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection. In Figure 1.4, we show four examples of the results of fitting polynomials having orders M = 0, 1, 3, and 9 to the data set shown in Figure 1.2.
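The experiment is easy to reproduce. Below is a minimal sketch in Python (not from the book) that fits polynomials of the four orders by least squares; the data-generation details (10 uniformly spaced points, Gaussian noise of standard deviation 0.3) are assumptions standing in for the setup of Figure 1.2.

```python
import numpy as np

# Assumed stand-in for the Figure 1.2 data set: 10 inputs in [0, 1],
# targets sin(2*pi*x) plus Gaussian noise (noise level 0.3 is a guess).
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

for M in (0, 1, 3, 9):
    # Least-squares fit of a degree-M polynomial; w holds the
    # coefficients w0, ..., wM in increasing order.
    w = np.polynomial.polynomial.polyfit(x, t, deg=M)
    y = np.polynomial.polynomial.polyval(x, w)
    print(f"M={M}: training residuals {np.round(y - t, 3)}")
```

The printed residuals shrink as M grows, reaching (numerical) zero at M = 9, which previews the discussion below.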
We notice that the constant (M = 0) and first order (M = 1) polynomials give rather poor fits to the data and consequently rather poor representations of the function sin(2πx). The third order (M = 3) polynomial seems to give the best fit to the function sin(2πx) of the examples shown in Figure 1.4. When we go to a much higher order polynomial (M = 9), we obtain an excellent fit to the training data. In fact, the polynomial passes exactly through each data point and E(w*) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx). This latter behaviour is known as over-fitting.
As we have noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of the generalization performance on M by considering a separate test set comprising 100 data points generated using exactly the same procedure used to generate the training set points but with new choices for the random noise values included in the target values. For each choice of M, we can then evaluate the residual value of E(w*) given by (1.2) for the training data, and we can also evaluate E(w*) for the test data set. It is sometimes more convenient to use the root-mean-square (RMS) error, defined by E_RMS = √(2E(w*)/N), in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t. Graphs of the training and test set RMS errors are shown, for various values of M, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x. We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations in the function sin(2πx). Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx), as can be seen, for the case of M = 3, from Figure 1.4.
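As a sketch of the computation behind Figure 1.5, under the same assumed data setup as above, the following evaluates E_RMS = √(2E(w*)/N) on the training set and on a 100-point test set generated by the same procedure with fresh noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Same generating procedure for training and test data; only the
    # random noise differs (noise level 0.3 is an assumption).
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def e_rms(w, x, t):
    # Sum-of-squares error (1.2), converted to the RMS error
    # E_RMS = sqrt(2 E(w*) / N).
    e = 0.5 * np.sum((np.polynomial.polynomial.polyval(x, w) - t) ** 2)
    return np.sqrt(2 * e / len(x))

for M in range(10):
    w = np.polynomial.polynomial.polyfit(x_train, t_train, deg=M)
    print(f"M={M}: train {e_rms(w, x_train, t_train):.3f}, "
          f"test {e_rms(w, x_test, t_test):.3f}")
```

The training error decreases monotonically with M, while the test error falls and then climbs again at high M, the U-shape seen in Figure 1.5.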
For M = 9, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients w0, ..., w9, and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and, as we saw in Figure 1.4, the corresponding function y(x, w*) exhibits wild oscillations.
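The zero training error is just linear algebra: a degree-9 polynomial has 10 coefficients, and forcing it through 10 distinct points gives a square, solvable system. A quick check (same tools as above, arbitrary target values):

```python
import numpy as np

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x)        # any 10 target values would do
w = np.polynomial.polynomial.polyfit(x, t, deg=9)   # 10 coefficients
residual = np.polynomial.polynomial.polyval(x, w) - t
print(np.max(np.abs(residual)))  # numerically zero: exact interpolation
```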
This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases. The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial. Furthermore, we might suppose that the best predictor of new data would be the function sin(2πx) from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.
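For reference, the expansion in question is the Taylor series (only odd powers appear, but they run to arbitrarily high order):

```latex
\[
  \sin(2\pi x) \;=\; 2\pi x \;-\; \frac{(2\pi x)^{3}}{3!} \;+\; \frac{(2\pi x)^{5}}{5!} \;-\;\cdots
  \;=\; \sum_{k=0}^{\infty} \frac{(-1)^{k}\,(2\pi x)^{2k+1}}{(2k+1)!}
\]
```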
We can gain some insight into the problem by examining the values of the coefficients w* obtained from polynomials of various order, as shown in Table 1.1. We see that, as M increases, the magnitude of the coefficients typically gets larger. In particular for the M = 9 polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the corresponding polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4. Intuitively, what is happening is that the more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.
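The pattern of Table 1.1 is easy to reproduce with the sketch from above: printing the largest coefficient magnitude for each order shows the explosive growth at M = 9 (the data setup is the same assumed one, so the exact numbers will differ from the book's).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for M in (0, 1, 3, 9):
    w = np.polynomial.polynomial.polyfit(x, t, deg=M)
    # The largest |w_j| grows by orders of magnitude as M increases.
    print(f"M={M}: max |w| = {np.max(np.abs(w)):.3g}")
```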
It is also interesting to examine the behaviour of a given model as the size of the data set is varied, as shown in Figure 1.6. We see that, for a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words more flexible) the model that we can afford to fit to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity.
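Here is a miniature of the Figure 1.6 experiment, under the same assumptions as the earlier sketches: with the model fixed at M = 9, enlarging the training set from 15 to 100 points brings the test error down.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = np.linspace(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_test, t_test = make_data(100)  # held-out data, fresh noise

for n_train in (15, 100):
    x_tr, t_tr = make_data(n_train)
    w = np.polynomial.polynomial.polyfit(x_tr, t_tr, deg=9)
    y = np.polynomial.polynomial.polyval(x_test, w)
    print(f"N={n_train}: test RMS = {np.sqrt(np.mean((y - t_test) ** 2)):.3f}")
```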
Also, there is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved. We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood. By adopting a Bayesian approach, the over-fitting problem can be avoided (Section 3.4). We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.