Model Illusion

Author: 水之心 | Published: 2020-12-05 09:40

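When building a model, we often derive new features from the ones we already have and use them to build a more complex model. The more complex the model, however, the more easily it slides into a loop of "self-hypnosis" that reinforces its own biases, which leads to overfitting. If a completely irrelevant variable is added to the model, it still receives a parameter estimate, and that estimate is almost never exactly 0. This is the so-called "model illusion". Model illusion makes parameter estimates unreliable. Worse, it can distort an estimate that was roughly correct into a wrong one, for example by turning a variable's positive effect into an estimated negative effect (a variable whose parameter is positive is said to have a positive effect; otherwise it has a negative effect).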
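To make the claim about irrelevant variables concrete, here is a minimal NumPy-only sketch. The seed, the sample size, the true coefficient 0.5, and the names `fit_ols` and `noise_var` are assumptions chosen for illustration, not part of the original article. It fits a no-intercept least-squares model with and without a pure-noise regressor and reports the uncentered R².

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, assumed for this sketch
n = 200
x = rng.normal(size=n)                # the only variable that drives y
noise_var = rng.normal(size=n)        # irrelevant: plays no role in generating y
y = 0.5 * x + rng.normal(size=n)      # assumed true model: y = 0.5 * x + noise


def fit_ols(columns, target):
    """Least-squares fit without intercept.

    Returns the coefficient vector and the uncentered R-squared.
    """
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r2 = 1.0 - (residual @ residual) / (target @ target)
    return coef, r2


coef_small, r2_small = fit_ols([x], y)
coef_big, r2_big = fit_ols([x, noise_var], y)

print("x only:         coef =", coef_small, " R2 =", r2_small)
print("x + noise_var:  coef =", coef_big, " R2 =", r2_big)
```

The coefficient on `noise_var` comes out small but not exactly zero, and the in-sample R² of the larger model is never lower than that of the smaller one, so the fit alone gives no warning that the extra variable is meaningless. The statsmodels example that follows shows the more serious case, where the sign of an estimate flips.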

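    # Notebook shell command; in a terminal, run `pip install statsmodels` instead
    !pip install statsmodels

    import statsmodels.api as sm
    import numpy as np
    import pandas as pd


    def generateData():
        """
        Generate the data for the model
        """
        np.random.seed(5320)
        x = np.arange(20) / 2
        error = np.round(np.random.randn(20), 2)
        # True relationship: y = 0.05 * x + noise, with no intercept
        y = 0.05 * x + error
        # Add an irrelevant variable z that is identically 1
        # (it acts as an intercept term, although the true intercept is 0)
        z = np.ones(20)
        return pd.DataFrame({"x": x, "z": z, "y": y})


    def wrongCoef():
        """
        Adding the new variable turns the estimated positive effect into a negative one
        """
        features = ["x", "z"]
        labels = ["y"]
        data = generateData()
        X = data[features]
        Y = data[labels]
        # Without the extra variable, the sign of x's coefficient is estimated correctly (positive)
        model = sm.OLS(Y, X["x"])
        res = model.fit()
        print("Without the new variable")
        print(res.summary())
        # With the extra variable, the sign of x's coefficient is estimated incorrectly (negative)
        model1 = sm.OLS(Y, X)
        res1 = model1.fit()
        print("After adding the new variable")
        print(res1.summary())


    wrongCoef()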
    

Output:

    Without the new variable
                                     OLS Regression Results                                
    =======================================================================================
    Dep. Variable:                      y   R-squared (uncentered):                   0.204
    Model:                            OLS   Adj. R-squared (uncentered):              0.162
    Method:                 Least Squares   F-statistic:                              4.878
    Date:                Sat, 05 Dec 2020   Prob (F-statistic):                      0.0397
    Time:                        09:24:08   Log-Likelihood:                         -29.583
    No. Observations:                  20   AIC:                                      61.17
    Df Residuals:                      19   BIC:                                      62.16
    Df Model:                           1                                                  
    Covariance Type:            nonrobust                                                  
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    x              0.0969      0.044      2.209      0.040       0.005       0.189
    ==============================================================================
    Omnibus:                        0.871   Durbin-Watson:                   2.037
    Prob(Omnibus):                  0.647   Jarque-Bera (JB):                0.815
    Skew:                           0.275   Prob(JB):                        0.665
    Kurtosis:                       2.179   Cond. No.                         1.00
    ==============================================================================
    
    Notes:
    [1] R² is computed without centering (uncentered) since the model does not contain a constant.
    [2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    After adding the new variable
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.005
    Model:                            OLS   Adj. R-squared:                 -0.050
    Method:                 Least Squares   F-statistic:                   0.09171
    Date:                Sat, 05 Dec 2020   Prob (F-statistic):              0.765
    Time:                        09:24:08   Log-Likelihood:                -27.982
    No. Observations:                  20   AIC:                             59.96
    Df Residuals:                      18   BIC:                             61.96
    Df Model:                           1                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    x             -0.0243      0.080     -0.303      0.765      -0.193       0.144
    z              0.7873      0.445      1.768      0.094      -0.148       1.723
    ==============================================================================
    Omnibus:                        0.939   Durbin-Watson:                   2.375
    Prob(Omnibus):                  0.625   Jarque-Bera (JB):                0.886
    Skew:                           0.338   Prob(JB):                        0.642
    Kurtosis:                       2.221   Cond. No.                         11.0
    ==============================================================================
    
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
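In the two summaries above, the standard error of `x` grows from 0.044 to 0.080 once `z` is added, and the 95% interval [-0.193, 0.144] still contains the true value 0.05, so the negative sign is consistent with sampling noise. The sketch below checks how often this happens by redrawing the noise many times for the same `x` grid and the same true slope. It is a rough simulation under stated assumptions: the function name `negative_slope_rate`, the number of trials, and the seed are made up for illustration, and unlike the original code the noise is not rounded to two decimals.

```python
import numpy as np


def negative_slope_rate(n_trials=2000, seed=0):
    """Fraction of simulated datasets whose estimated slope on x is negative,
    without and with a constant column, when the true model is y = 0.05 * x + noise."""
    rng = np.random.default_rng(seed)
    x = np.arange(20) / 2
    with_const = np.column_stack([x, np.ones_like(x)])
    neg_without = 0
    neg_with = 0
    for _ in range(n_trials):
        y = 0.05 * x + rng.standard_normal(x.size)
        # No-intercept OLS slope has the closed form sum(x*y) / sum(x*x)
        slope_without = (x @ y) / (x @ x)
        # OLS with the constant column; the first coefficient is the slope on x
        slope_with = np.linalg.lstsq(with_const, y, rcond=None)[0][0]
        neg_without += int(slope_without < 0)
        neg_with += int(slope_with < 0)
    return neg_without / n_trials, neg_with / n_trials


rate_without, rate_with = negative_slope_rate()
print("negative slope rate without z:", rate_without)
print("negative slope rate with z:   ", rate_with)
```

Under these assumptions the wrong sign shows up noticeably more often once the constant column is included, because the extra parameter uses up information that would otherwise pin down the slope.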
    

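This may not be very intuitive, so let's look at a second example: two random walks, in which pure noise accumulates into curves that can look like deliberate trends.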

    """
    This script demonstrates the model illusion caused by random variables
    """
    import numpy as np
    import matplotlib.pyplot as plt


    def generate_data(seed, num):
        """
        Generate a random walk: the running sum of num standard normal draws
        """
        x = 0
        np.random.seed(seed)
        data = []
        for _ in range(num):
            x += np.random.normal()
            data.append(x)
        return data


    def visualize_data(series1, series2):
        """
        Plot the two series side by side
        """
        # Use a font with Chinese glyphs (SimHei must be installed) so Chinese text renders in Matplotlib
        plt.rcParams["font.sans-serif"] = ["SimHei"]
        # Render the minus sign correctly in Matplotlib
        plt.rcParams['axes.unicode_minus'] = False
        # Create the figure
        fig = plt.figure(figsize=(12, 6), dpi=80)
        # Draw two plots in the figure, one per series
        ax = fig.add_subplot(1, 2, 1)
        ax.plot(series1)
        ax1 = fig.add_subplot(1, 2, 2)
        ax1.plot(series2)
        plt.show()


    if __name__ == "__main__":
        series1 = generate_data(4096, 200)
        series2 = generate_data(2046, 200)
        visualize_data(series1, series2)
    

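Resulting figure: two side-by-side line plots, series1 on the left and series2 on the right.

As the plots show, different observation windows (ranges of x) lead to two completely different models.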
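The dependence on the observation window can also be checked numerically. The following sketch fits a straight line to consecutive 50-step windows of a single random walk and prints the fitted slopes. The helper names `random_walk` and `window_slope` and the window length are assumptions for illustration. It also uses `np.random.default_rng` rather than `np.random.seed`, so the series it generates is not the same as `series1` in the script above.

```python
import numpy as np


def random_walk(seed, num):
    """Running sum of num standard normal steps."""
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.standard_normal(num))


def window_slope(series, start, stop):
    """Slope of the least-squares line fitted to series[start:stop]."""
    t = np.arange(start, stop)
    slope, _intercept = np.polyfit(t, series[start:stop], deg=1)
    return slope


walk = random_walk(seed=4096, num=200)
# Fit a separate linear "model" on each 50-step observation window
slopes = [window_slope(walk, s, s + 50) for s in range(0, 200, 50)]
for start, slope in zip(range(0, 200, 50), slopes):
    print(f"window [{start}, {start + 50}): slope = {slope:.4f}")
```

Every window is produced by the same noise process, yet each one yields its own slope. A model fitted on any single window is therefore describing the noise in that window, not a stable relationship.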

Article link: https://www.haomeiwen.com/subject/dgnwwktx.html