美文网首页
【R<-练习】多元线性回归与逻辑回归

【R<-练习】多元线性回归与逻辑回归

作者: 莫讠 | 来源:发表于2019-05-21 22:07 被阅读0次
    Q1

    一、某地29名13岁儿童身高(cm),体重(kg)和肺活量(L)数据见data,求:
    (1)由身高,体重推算肺活量的回归方程;
    (2)求出的方程是否有意义;
    (3)剩余标准差

    (1).
    
    > setwd("C:/Users/My/Desktop/hw6/")
    >  data6_1 <- read.csv("homework-6.1-data.csv",header = T, sep = ",")
    >  data6_1
       number Height weight vital.capacity
    1       1  135.1   32.0           1.75
    2       2  139.9   30.4           2.00
    3       3  163.6   46.2           2.75
    4       4  146.5   33.5           2.50
    5       5  156.2   37.1           2.75
    6       6  156.4   35.5           2.00
    7       7  167.8   41.5           2.75
    8       8  149.7   31.0           1.50
    9       9  145.0   33.0           2.50
    10     10  148.5   37.2           2.25
    11     11  165.5   49.5           3.00
    12     12  135.0   27.6           1.25
    13     13  153.3   41.0           2.75
    14     14  152.0   32.0           1.75
    15     15  160.5   47.2           2.25
    16     16  153.0   32.0           1.75
    17     17  147.6   40.5           2.00
    18     18  157.5   43.3           2.25
    19     19  155.1   44.7           2.75
    20     20  160.5   37.5           2.00
    21     21  143.0   31.5           1.75
    22     22  149.4   33.9           2.25
    23     23  160.8   40.4           2.75
    24     24  159.0   38.5           2.50
    25     25  158.2   37.5           2.00
    26     26  150.0   36.0           1.75
    27     27  144.5   34.7           2.25
    28     28  154.6   39.5           2.50
    29     29  156.5   32.0           1.75
    > lm.reg <- lm(data6_1$vital.capacity~data6_1$Height+data6_1$weight)
    > summary(lm.reg)
    
    Call:
    lm(formula = data6_1$vital.capacity ~ data6_1$Height + data6_1$weight)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.54117 -0.25524 -0.00266  0.22039  0.55425 
    
    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)   
    (Intercept)    -0.565664   1.240127  -0.456  0.65208   
    data6_1$Height  0.005017   0.010575   0.474  0.63920   
    data6_1$weight  0.054061   0.015984   3.382  0.00228 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.3137 on 26 degrees of freedom
    Multiple R-squared:  0.546, Adjusted R-squared:  0.511 
    F-statistic: 15.63 on 2 and 26 DF,  p-value: 3.485e-05
    
    

    所以由身高(x1),体重(x2)推算的肺活量(y)的回归方程为

    y= -0.5657 + 0.0050x1 + 0.0541

    (2)因为F=15.63,p=3.485e-05<0.01,所以方程有意义。

    (3)剩余标准差: 0.3137

    Q2

    二、某农场通过试验取得早稻收获量与春季降雨量和春季温度的数据如下:
    收获量y(kg/mm2) 降雨量x1(mm) 温度x2(℃)

    建立早稻收获量对春季降雨量和春季温度的二元线性回归方程,计算各回归系数的置信区间,并对回归模型的线性关系和回归系数进行检验(α=0.05)。

    > data_2 <- read.table("homework-6.2-data.txt",header = T)
    > fit2 <- lm(y~x1+x2,data = data_2)
    > summary(fit2)
    
    Call:
    lm(formula = y ~ x1 + x2, data = data_2)
    
    Residuals:
           1        2        3        4        5        6        7 
    -275.101   90.464  216.483  140.280  150.676 -316.599   -6.203 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)   -0.591    505.004  -0.001   0.9991  
    x1            22.387      9.601   2.332   0.0801 .
    x2           327.672     98.798   3.317   0.0295 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 261.4 on 4 degrees of freedom
    Multiple R-squared:  0.9913,    Adjusted R-squared:  0.987 
    F-statistic: 228.4 on 2 and 4 DF,  p-value: 7.532e-05
    
    

    所以回归方程: y = -0.591 + 22.387x1 + 327.672x2

    > confint(fit2, level = 0.95)
                       2.5 %     97.5 %
    (Intercept) -1402.707516 1401.52552
    x1             -4.268921   49.04184
    x2             53.364699  601.97873
    
    

    降雨量x1的置信区间为(-4.268921,49.04184),含义是在温度不变的条件下,降雨量每变动1mm,收获量的平均变动在-4.268921到49.04184 kg/mm2之间。
    温度x2的置信区间为(53.364699,601.97873),含义是在降雨量不变的条件下,温度每变动1℃,收获量的平均变动在53.364699到601.97873 kg/mm2之间。

    线性关系检验是检验因变量y与k个自变量之间的关系是否显著,也称总体显著性检验。根据以上R输出结果,检验统计量F= 228.4,显著水平P= 7.532e-05< 0.05,拒绝H0,即收获量y与降雨量x1和温度x2之间的线性关系显著。

    要判断每个自变量对因变量的影响是否都显著,需要对各回归系数βi分别进行t检验。根据R输出结果,降雨量x1和温度x2的回归系数相应的显著水平分别为0.0801和0.0295,只有温度对应的显著性水平小于0.05通过检验,这表明影响收获量的自变量中,只有温度对收获量的影响显著,而降雨量对收获量的影响不显著。

    Q3_1

    三、某葡萄酒爱好者想探索葡萄酒的品质与哪些因素相关。他有一个数据集包含了(1 -固定酸度,2 -挥发性酸度,3 -柠檬酸,4 -残余糖,5 -氯化物,6 -自由二氧化硫量,7 -二氧化硫总量,8 -密度,9 - pH值,10 -硫酸盐,11 -酒精浓度,和12 -品质(0 - 10分)。
    1.查看数据集的前五行和数据集的总结
    2.通过直方图展示固定酸度的分布和展示挥发性酸度与品质的散点图
    3.计算这些变量与品质的相关性
    4.通过方差分析不同品质的葡萄酒的酒精浓度是否有差异
    5.通过多元线性回归建立一个品质预测模型,并说明哪些变量与品质显著相关。

    数据从此处下载

    > read.csv("homework-6.3-winequality-red.csv",header = T)
    > head(data_3,n = 5)
    
      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
    1           7.4             0.70        0.00            1.9     0.076
    2           7.8             0.88        0.00            2.6     0.098
    3           7.8             0.76        0.04            2.3     0.092
    4          11.2             0.28        0.56            1.9     0.075
    5           7.4             0.70        0.00            1.9     0.076
      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
    1                  11                   34  0.9978 3.51      0.56     9.4
    2                  25                   67  0.9968 3.20      0.68     9.8
    3                  15                   54  0.9970 3.26      0.65     9.8
    4                  17                   60  0.9980 3.16      0.58     9.8
    5                  11                   34  0.9978 3.51      0.56     9.4
      quality
    1       5
    2       5
    3       5
    4       6
    5       5
    
    # 对数据集的汇总
    > summary(data_3)
     fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
     Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
     1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
     Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
     Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
     3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
     Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
       chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
     Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
     1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
     Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
     Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
     3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
     Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
           pH          sulphates         alcohol         quality     
     Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
     1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
     Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
     Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
     3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
     Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000 
    
    

    Q3_2

    # 通过直方图展示固定酸度的分布
    > hist(data_3$fixed.acidity,main = "the distribution of fixed acidity",xlab = "fixed.acidity")
    
    直方图
    # 展示挥发性酸度与品质的散点图
    > plot(data_3$quality,data_3$volatile.acidity,main = "quality vs volatile acidity", ylab = "volatile.acidity", xlab = "quality" )
    
    散点图

    Q3-3

    #计算变量与品质的相关性
    > apply(data_3,2,function(x)cor(x, data_3$quality))
           fixed.acidity     volatile.acidity          citric.acid       residual.sugar            chlorides 
              0.12405165          -0.39055778           0.22637251           0.01373164          -0.12890656 
     free.sulfur.dioxide total.sulfur.dioxide              density                   pH            sulphates 
             -0.05065606          -0.18510029          -0.17491923          -0.05773139           0.25139708 
                 alcohol              quality 
              0.47616632           1.00000000 
    

    Q3_4

    > summary(aov(alcohol~quality, data = data_3))
                  Df Sum Sq Mean Sq F value Pr(>F)    
    quality        1  411.5   411.5   468.3 <2e-16 ***
    Residuals   1597 1403.3     0.9                   
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    # p < 0.05, 不同品质的葡萄酒的酒精浓度有较显著的差异
    
    
    

    Q3_5

    > summary(lm(quality~fixed.acidity+volatile.acidity+citric.acid+residual.sugar+chlorides+free.sulfur.dioxide+total.sulfur.dioxide+density+pH+sulphates+alcohol,data = data_3))
    
    Call:
    lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
        residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
        density + pH + sulphates + alcohol, data = data_3)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -2.68911 -0.36652 -0.04699  0.45202  2.02498 
    
    Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
    (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
    fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
    volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
    citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
    residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
    chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
    free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
    total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
    density              -1.788e+01  2.163e+01  -0.827   0.4086    
    pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
    sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
    alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.648 on 1587 degrees of freedom
    Multiple R-squared:  0.3606,    Adjusted R-squared:  0.3561 
    F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16
    
    

    与红酒品质显著相关的变量有:2 -挥发性酸度,5 -氯化物,6 -自由二氧化硫量,7 -二氧化硫总量,9 - pH值,10 -硫酸盐,11 -酒精浓度。

    Q4

    > data_4 <- read.table("homework-6.4-data.txt",header = T, sep = ",")
    > data_4
       Income Age y
    1   45000   2 0
    2   40000   4 0
    3   60000   3 1
    4   50000   2 1
    5   55000   2 0
    6   50000   5 1
    7   35000   7 1
    8   65000   2 1
    9   53000   2 0
    10  48000   1 0
    11  37000   5 1
    12  31000   7 1
    13  40000   4 1
    14  75000   2 0
    15  43000   9 1
    16  49000   2 0
    17  37500   4 1
    18  71000   1 0
    19  34000   5 0
    20  27000   6 0
    > fit <- glm(y~Income+Age, family = binomial(), data = data_4)
    > summary(fit)
    
    Call:
    glm(formula = y ~ Income + Age, family = binomial(), data = data_4)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.5635  -0.8045  -0.1397   0.9535   1.7915  
    
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)  
    (Intercept) -7.047e+00  4.674e+00  -1.508    0.132  
    Income       7.382e-05  6.371e-05   1.159    0.247  
    Age          9.879e-01  5.274e-01   1.873    0.061 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 27.726  on 19  degrees of freedom
    Residual deviance: 21.082  on 17  degrees of freedom
    AIC: 27.082
    
    Number of Fisher Scoring iterations: 5
    
    
    > coef(fit)
      (Intercept)        Income           Age 
    -7.047061e+00  7.381679e-05  9.878861e-01
    

    所以回归系数: β0=−7.047,β1=0.00007382,β2=0.9879
    回归方程为:


    > predictdata <- data.frame(Income=c(45000), Age = c(5))
    > predictdata$prob <- predict(fit,newdata = predictdata,type = "response") 
    > predictdata
      Income Age      prob
    1  45000   5 0.7710279
    
    

    所以预测的概率为 0.7710279

    Q5_1

    > data_5 <- read.csv("homework-6.5-Drivers.csv",header = T, sep = ",")
    > head(data_5)
      x1 x2 x3 y
    1  1 17  1 1
    2  1 44  0 0
    3  1 48  1 0
    4  1 55  0 0
    5  1 75  1 1
    6  0 35  0 1
    > log.glm <- glm(y~x1+x2+x3,data = data_5, family = "binomial")
    > summary(log.glm)
    
    Call:
    glm(formula = y ~ x1 + x2 + x3, family = "binomial", data = data_5)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.5636  -0.9131  -0.7892   0.9637   1.6000  
    
    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)  
    (Intercept)  0.597610   0.894831   0.668   0.5042  
    x1          -1.496084   0.704861  -2.123   0.0338 *
    x2          -0.001595   0.016758  -0.095   0.9242  
    x3           0.315865   0.701093   0.451   0.6523  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 62.183  on 44  degrees of freedom
    Residual deviance: 57.026  on 41  degrees of freedom
    AIC: 65.026
    
    Number of Fisher Scoring iterations: 4
    

    由此得到初步的logistic回归模型:


    image.png

    Q5_2

    在上述模型中, 由于参数β2,β3没有通过检验, 即只有视力状况对是否发生事故有显著影响
    去除非影响因素,可类似于线性模型, 用step( )做变量筛选.

    # 通过step()函数进行变量筛选
    > log.step <- step(log.glm)
    Start:  AIC=65.03
    y ~ x1 + x2 + x3
    
           Df Deviance    AIC
    - x2    1   57.035 63.035
    - x3    1   57.232 63.232
    <none>      57.026 65.026
    - x1    1   61.936 67.936
    
    Step:  AIC=63.03
    y ~ x1 + x3
    
           Df Deviance    AIC
    - x3    1   57.241 61.241
    <none>      57.035 63.035
    - x1    1   61.991 65.991
    
    Step:  AIC=61.24
    y ~ x1
    
           Df Deviance    AIC
    <none>      57.241 61.241
    - x1    1   62.183 64.183
    
    # 变量筛选后进行汇总
    > summary(log.step)
    
    Call:
    glm(formula = y ~ x1, family = "binomial", data = data_5)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.4490  -0.8782  -0.8782   0.9282   1.5096  
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
    (Intercept)   0.6190     0.4688   1.320   0.1867  
    x1           -1.3728     0.6353  -2.161   0.0307 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 62.183  on 44  degrees of freedom
    Residual deviance: 57.241  on 43  degrees of freedom
    AIC: 61.241
    
    Number of Fisher Scoring iterations: 4
    

    得到的新的回归方程为:


    Q5_3

    # 输入两位需要投保的司机的数据
    > predictdata <- data.frame(person=c("A","B"),x1=c(0,1),x2=c(50,20),x3=c(1,0))
    > predictdata
      person x1 x2 x3
    1      A  0 50  1
    2      B  1 20  0
    # 应用上面拟合出的回归方程进行预测
    > predictdata$prob <- predict(object = log.step, newdata = predictdata, type = "response")
    > predictdata
      person x1 x2 x3 prob
    1      A  0 50  1 0.65
    2      B  1 20  0 0.32
    
    

    所以A、B两者明年出事故的概率分别为0.65和0.32
    只有B出事故的概率低于40 %, 所以只有B可以投保

    Q6

    > data_6 <- read.csv("homework-6.6-data.csv",header = T, sep = ",")
    > head(data_6)
            id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
    1   842302         M       17.99        10.38         122.80    1001.0         0.11840
    2   842517         M       20.57        17.77         132.90    1326.0         0.08474
    3 84300903         M       19.69        21.25         130.00    1203.0         0.10960
    4 84348301         M       11.42        20.38          77.58     386.1         0.14250
    5 84358402         M       20.29        14.34         135.10    1297.0         0.10030
    6   843786         M       12.45        15.70          82.57     477.1         0.12780
      compactness_mean concavity_mean concave.points_mean symmetry_mean
    1          0.27760         0.3001             0.14710        0.2419
    2          0.07864         0.0869             0.07017        0.1812
    3          0.15990         0.1974             0.12790        0.2069
    4          0.28390         0.2414             0.10520        0.2597
    5          0.13280         0.1980             0.10430        0.1809
    6          0.17000         0.1578             0.08089        0.2087
      fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se
    1                0.07871    1.0950     0.9053        8.589  153.40      0.006399
    2                0.05667    0.5435     0.7339        3.398   74.08      0.005225
    3                0.05999    0.7456     0.7869        4.585   94.03      0.006150
    4                0.09744    0.4956     1.1560        3.445   27.23      0.009110
    5                0.05883    0.7572     0.7813        5.438   94.44      0.011490
    6                0.07613    0.3345     0.8902        2.217   27.19      0.007510
      compactness_se concavity_se concave.points_se symmetry_se fractal_dimension_se
    1        0.04904      0.05373           0.01587     0.03003             0.006193
    2        0.01308      0.01860           0.01340     0.01389             0.003532
    3        0.04006      0.03832           0.02058     0.02250             0.004571
    4        0.07458      0.05661           0.01867     0.05963             0.009208
    5        0.02461      0.05688           0.01885     0.01756             0.005115
    6        0.03345      0.03672           0.01137     0.02165             0.005082
      radius_worst texture_worst perimeter_worst area_worst smoothness_worst
    1        25.38         17.33          184.60     2019.0           0.1622
    2        24.99         23.41          158.80     1956.0           0.1238
    3        23.57         25.53          152.50     1709.0           0.1444
    4        14.91         26.50           98.87      567.7           0.2098
    5        22.54         16.67          152.20     1575.0           0.1374
    6        15.47         23.75          103.40      741.6           0.1791
      compactness_worst concavity_worst concave.points_worst symmetry_worst
    1            0.6656          0.7119               0.2654         0.4601
    2            0.1866          0.2416               0.1860         0.2750
    3            0.4245          0.4504               0.2430         0.3613
    4            0.8663          0.6869               0.2575         0.6638
    5            0.2050          0.4000               0.1625         0.2364
    6            0.5249          0.5355               0.1741         0.3985
      fractal_dimension_worst
    1                 0.11890
    2                 0.08902
    3                 0.08758
    4                 0.17300
    5                 0.07678
    6                 0.12440
    
    > fit <- glm(data_6$diagnosis ~ data_6$radius_mean + data_6$texture_mean + data_6$perimeter_mean + data_6$area_mean + data_6$smoothness_mean + data_6$compactness_mean + data_6$concavity_mean + data_6$concave.points_mean + data_6$symmetry_mean + data_6$fractal_dimension_mean, family = "binomial")
    Warning message:
    glm.fit:拟合機率算出来是数值零或一 
    > summary(fit)
    
    Call:
    glm(formula = data_6$diagnosis ~ data_6$radius_mean + data_6$texture_mean + 
        data_6$perimeter_mean + data_6$area_mean + data_6$smoothness_mean + 
        data_6$compactness_mean + data_6$concavity_mean + data_6$concave.points_mean + 
        data_6$symmetry_mean + data_6$fractal_dimension_mean, family = "binomial")
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -1.95590  -0.14839  -0.03943   0.00429   2.91690  
    
    Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                    -7.35952   12.85259  -0.573   0.5669    
    data_6$radius_mean             -2.04930    3.71588  -0.551   0.5813    
    data_6$texture_mean             0.38473    0.06454   5.961  2.5e-09 ***
    data_6$perimeter_mean          -0.07151    0.50516  -0.142   0.8874    
    data_6$area_mean                0.03980    0.01674   2.377   0.0174 *  
    data_6$smoothness_mean         76.43227   31.95492   2.392   0.0168 *  
    data_6$compactness_mean        -1.46242   20.34249  -0.072   0.9427    
    data_6$concavity_mean           8.46870    8.12003   1.043   0.2970    
    data_6$concave.points_mean     66.82176   28.52910   2.342   0.0192 *  
    data_6$symmetry_mean           16.27824   10.63059   1.531   0.1257    
    data_6$fractal_dimension_mean -68.33703   85.55666  -0.799   0.4244    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 751.44  on 568  degrees of freedom
    Residual deviance: 146.13  on 558  degrees of freedom
    AIC: 168.13
    
    Number of Fisher Scoring iterations: 9
    
    
    

    Q6_2

    > reduced_feature <- c("texture_mean","area_mean","smoothness_mean","concave.points_mean")
    > formula <- paste("diagnosis",paste(reduced_feature,collapse = "+") ,sep = "~")
    > model <- glm(formula, data_6, family = "binomial")
    Warning message:
    glm.fit:拟合機率算出来是数值零或一 
    > summary(model)
    
    Call:
    glm(formula = formula, family = "binomial", data = data_6)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -2.31798  -0.15623  -0.04212   0.01662   2.84201  
    
    Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
    (Intercept)         -23.677816   3.882774  -6.098 1.07e-09 ***
    texture_mean          0.362687   0.060544   5.990 2.09e-09 ***
    area_mean             0.010342   0.002002   5.165 2.40e-07 ***
    smoothness_mean      59.471304  25.965153   2.290    0.022 *  
    concave.points_mean  76.571210  16.427864   4.661 3.15e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 751.44  on 568  degrees of freedom
    Residual deviance: 156.44  on 564  degrees of freedom
    AIC: 166.44
    
    Number of Fisher Scoring iterations: 8
    
    

    Q6_3

    > anova(fit, model, test = "Chisq")
    Analysis of Deviance Table
    
    Model: binomial, link: logit
    
    Response: data_6$diagnosis
    
    Terms added sequentially (first to last)
    
    
                                  Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
    NULL                                            568     751.44              
    data_6$radius_mean             1   421.43       567     330.01 < 2.2e-16 ***
    data_6$texture_mean            1    38.89       566     291.12 4.489e-10 ***
    data_6$perimeter_mean          1    72.23       565     218.90 < 2.2e-16 ***
    data_6$area_mean               1     7.53       564     211.37 0.0060655 ** 
    data_6$smoothness_mean         1    42.14       563     169.22 8.481e-11 ***
    data_6$compactness_mean        1     0.04       562     169.18 0.8359415    
    data_6$concavity_mean          1    13.22       561     155.96 0.0002771 ***
    data_6$concave.points_mean     1     6.72       560     149.24 0.0095192 ** 
    data_6$symmetry_mean           1     2.46       559     146.78 0.1166476    
    data_6$fractal_dimension_mean  1     0.65       558     146.13 0.4213577    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    

    Q6_4

    > trainData <- data_6[1:398,]
    > testData <- data_6[399:569,]
    > reduced_feature <- c("texture_mean","area_mean","smoothness_mean","concave.points_mean")
    
    > formula <- paste("diagnosis",paste(reduced_feature,collapse = "+"),sep = "~")
    > model <- glm(formula, trainData, family = "binomial")
    > summary(model)
    
    Call:
    glm(formula = formula, family = "binomial", data = trainData)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -2.39278  -0.14454  -0.02447   0.03635   2.60665  
    
    Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
    (Intercept)         -27.47397    4.74798  -5.786 7.19e-09 ***
    texture_mean          0.46244    0.08434   5.483 4.19e-08 ***
    area_mean             0.01082    0.00235   4.606 4.11e-06 ***
    smoothness_mean      90.11221   30.96961   2.910 0.003618 ** 
    concave.points_mean  59.01212   17.51779   3.369 0.000755 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 544.93  on 397  degrees of freedom
    Residual deviance: 108.30  on 393  degrees of freedom
    AIC: 118.3
    
    Number of Fisher Scoring iterations: 8
    
    > predictdata <- predict(model, testData,type = "response")
    > pred_num <- ifelse(predictdata > 0.5,1,0)
    > y_pred <- factor(pred_num,levels = c(0,1))
    > y_act <- factor(ifelse(testData$diagnosis=="B",0,1))
    > sum(y_pred==y_act)
    [1] 155
    > 155/171
    [1] 0.9064327
    
    

    相关文章

      网友评论

          本文标题:【R<-练习】多元线性回归与逻辑回归

          本文链接:https://www.haomeiwen.com/subject/kxuszqtx.html