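When building a model, we often derive additional features from the known ones and use them to build a more complex model. But the more complex the model, the more it tends to fall into a loop of "self-hypnosis and reinforced bias", which leads to overfitting. If a completely irrelevant variable is added to the model, it still receives a parameter estimate, and that estimate is almost never exactly 0. This produces the so-called "model illusion". Model illusion makes the model's parameters unreliable and, more seriously, can distort an estimate that might otherwise have been roughly correct into a wrong one, for example by turning a variable's originally positive effect into a negative one (a variable is said to have a positive effect when its parameter is positive, and a negative effect otherwise).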
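The claim that an irrelevant variable still receives a nonzero estimate can be checked in a few lines. The sketch below is only an illustration under assumed settings: the sample size, the true coefficient 2.0, the seed and the name `noise_var` are all made up for this example, and it uses `np.linalg.lstsq` rather than a full regression package.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed (assumption)
n = 200                               # arbitrary sample size (assumption)

x = rng.normal(size=n)
noise_var = rng.normal(size=n)        # unrelated to y by construction
y = 2.0 * x + rng.normal(size=n)      # the true model uses x only

# Least-squares fit of y on both columns, without an intercept
X = np.column_stack([x, noise_var])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("coefficient of x:        ", coef[0])
print("coefficient of noise_var:", coef[1])
```

The coefficient of `noise_var` is small but not exactly zero, even though the variable plays no role in generating `y`.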
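!pip install statsmodels

import statsmodels.api as sm
import numpy as np
import pandas as pd


def generateData():
    """
    Generate the data for the model
    """
    np.random.seed(5320)
    x = np.array(range(0, 20))/2
    error = np.round(np.random.randn(20), 2)
    # The true model contains x only
    y = 0.05*x + error
    # Add an irrelevant variable z that is identically 1
    z = np.zeros(20) + 1
    return pd.DataFrame({"x": x, "z": z, "y": y})


def wrongCoef():
    """
    Adding the new variable turns the estimated positive effect into a negative one
    """
    features = ["x", "z"]
    labels = ["y"]
    data = generateData()
    X = data[features]
    Y = data[labels]
    # Without the extra variable, the sign of the x coefficient is estimated correctly (positive)
    model = sm.OLS(Y, X["x"])
    res = model.fit()
    print("Without the new variable")
    print(res.summary())
    # With the extra variable, the sign of the x coefficient is estimated incorrectly (negative)
    model1 = sm.OLS(Y, X)
    res1 = model1.fit()
    print("With the new variable added")
    print(res1.summary())


wrongCoef()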
Output:
Without the new variable
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.204
Model:                            OLS   Adj. R-squared (uncentered):              0.162
Method:                 Least Squares   F-statistic:                              4.878
Date:                Sat, 05 Dec 2020   Prob (F-statistic):                      0.0397
Time:                        09:24:08   Log-Likelihood:                         -29.583
No. Observations:                  20   AIC:                                      61.17
Df Residuals:                      19   BIC:                                      62.16
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x              0.0969      0.044      2.209      0.040       0.005       0.189
==============================================================================
Omnibus:                        0.871   Durbin-Watson:                   2.037
Prob(Omnibus):                  0.647   Jarque-Bera (JB):                0.815
Skew:                           0.275   Prob(JB):                        0.665
Kurtosis:                       2.179   Cond. No.                         1.00
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
With the new variable added
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                 -0.050
Method:                 Least Squares   F-statistic:                   0.09171
Date:                Sat, 05 Dec 2020   Prob (F-statistic):              0.765
Time:                        09:24:08   Log-Likelihood:                -27.982
No. Observations:                  20   AIC:                             59.96
Df Residuals:                      18   BIC:                             61.96
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x             -0.0243      0.080     -0.303      0.765      -0.193       0.144
z              0.7873      0.445      1.768      0.094      -0.148       1.723
==============================================================================
Omnibus:                        0.939   Durbin-Watson:                   2.375
Prob(Omnibus):                  0.625   Jarque-Bera (JB):                0.886
Skew:                           0.338   Prob(JB):                        0.642
Kurtosis:                       2.221   Cond. No.                         11.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
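A single seed cannot tell us whether the sign flip is typical or a lucky draw. The sketch below repeats the experiment over many random error vectors and counts how often the estimated coefficient of x comes out negative, with and without the constant column z. It is a minimal illustration and not the author's code: it uses `np.linalg.lstsq` instead of statsmodels, skips the rounding of the error term, and the function name `sign_flip_rates`, the number of trials and the seed are arbitrary choices.

```python
import numpy as np


def sign_flip_rates(n_trials=2000, seed=0):
    """Fraction of trials in which the estimated x coefficient is negative,
    for the model with x only and for the model with x and the constant z."""
    rng = np.random.default_rng(seed)
    x = np.arange(20) / 2                        # same x grid as generateData()
    X_without_z = x[:, None]                     # design matrix: x only
    X_with_z = np.column_stack([x, np.ones(20)]) # design matrix: x and z = 1
    neg_without_z = 0
    neg_with_z = 0
    for _ in range(n_trials):
        # True model: y = 0.05 * x + standard normal error
        y = 0.05 * x + rng.standard_normal(20)
        b_without_z = np.linalg.lstsq(X_without_z, y, rcond=None)[0][0]
        b_with_z = np.linalg.lstsq(X_with_z, y, rcond=None)[0][0]
        neg_without_z += int(b_without_z < 0)
        neg_with_z += int(b_with_z < 0)
    return neg_without_z / n_trials, neg_with_z / n_trials


rate_without_z, rate_with_z = sign_flip_rates()
print("negative x coefficient, without z:", rate_without_z)
print("negative x coefficient, with z:   ", rate_with_z)
```

Adding z roughly doubles the standard error of the x coefficient (0.044 versus 0.080 in the two summaries above), so with a true slope as small as 0.05 we should expect a wrong sign more often in the model that includes z.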
This may not be very intuitive, so let's look at a parabola example.
"""
This script demonstrates the model illusion caused by random variables
"""
import numpy as np
import matplotlib.pyplot as plt


def generate_data(seed, num):
    """
    Generate a random walk of length num: each value adds a standard normal step to the previous one
    """
    x = 0
    np.random.seed(seed)
    data = []
    for i in range(num):
        x += np.random.normal()
        data.append(x)
    return data


def visualize_data(series1, series2):
    """
    Plot the two given series side by side
    """
    # Set a font that can display Chinese characters in Matplotlib
    plt.rcParams["font.sans-serif"] = ["SimHei"]
    # Display the minus sign correctly in Matplotlib
    plt.rcParams['axes.unicode_minus'] = False
    # Create a figure
    fig = plt.figure(figsize=(12, 6), dpi=80)
    # Draw two subplots in the figure, one per series
    ax = fig.add_subplot(1, 2, 1)
    ax.plot(series1)
    ax1 = fig.add_subplot(1, 2, 2)
    ax1.plot(series2)
    plt.show()


if __name__ == "__main__":
    series1 = generate_data(4096, 200)
    series2 = generate_data(2046, 200)
    visualize_data(series1, series2)
Resulting figure:

As we can see, different observation windows (ranges of x values) yield two completely different models.
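The effect of the observation window can also be sketched with a noise-free parabola. The code below is an illustration under assumptions, not the script that produced the figure: the curve y = x², the two windows and the helper name `fit_line` are all chosen for this example. It fits a straight line to the same parabola on two different ranges of x.

```python
import numpy as np


def fit_line(x_min, x_max, num=50):
    """Fit y = slope * x + intercept to the parabola y = x**2,
    using only the points with x in [x_min, x_max]."""
    x = np.linspace(x_min, x_max, num)
    y = x ** 2
    # np.polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


slope_left, intercept_left = fit_line(-3, -1)   # window on the falling branch
slope_right, intercept_right = fit_line(1, 3)   # window on the rising branch

print("slope on [-3, -1]:", slope_left)
print("slope on [1, 3]:  ", slope_right)
```

On the window [-3, -1] the fitted slope is about -4, and on [1, 3] it is about +4. The same underlying curve looks like a negative effect in one window and a positive effect in the other.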