Introduction
Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method designed for the analysis of collinear data. It is essentially a modified least squares estimator: by giving up the unbiasedness of ordinary least squares it sacrifices some information and precision, but in return obtains regression coefficients that are more realistic and more reliable, and it handles ill-conditioned data considerably better than ordinary least squares.
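As a quick illustration of the idea, ridge regression stabilises the normal equations by adding a penalty lambda to the diagonal of X'X, which shrinks the coefficients towards zero. A minimal base-R sketch on simulated data (illustration only, not part of the ridge package):
# Ridge coefficients via the closed form: beta = (X'X + lambda*I)^(-1) X'y  (illustration)
set.seed(1)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))          # centred and scaled predictors
y <- drop(X %*% c(2, 0, 1, 0, 0)) + rnorm(n)    # simulated response
lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)
cbind(beta_ols, beta_ridge)                     # ridge estimates are shrunk towards zero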
To date, many genetic variants have been identified as associated with different phenotypic traits. However, the identified associations typically explain only a small fraction of trait heritability, and models containing only the known associated variants have little predictive power. Multiple regression is a popular framework for considering the joint effect of many genetic variants simultaneously, but because of the high dimensionality of genetic data and the correlation structure among the predictors, ordinary multiple regression is seldom appropriate. This has renewed interest in penalised regression techniques that circumvent these difficulties. Here the focus is on ridge regression, a penalised regression method that has been shown to perform well in multivariate prediction problems. One challenge in applying ridge regression is the choice of the ridge parameter, which controls the amount of shrinkage applied to the regression coefficients. The ridge package implements a semi-automatic, data-driven method for choosing this parameter.
Aim: good performance in high-dimensional prediction problems. The authors establish a theoretical justification for the method and demonstrate its performance on simulated genetic data and on a real-data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously also poses computational challenges, which the package addresses.
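A classical, fully data-driven choice of the ridge parameter is the Hoerl-Kennard-Baldwin estimator, lambda = p * sigma^2 / ||beta_OLS||^2; the semi-automatic method implemented in the ridge package builds on this idea using principal components (see the references at the end). A rough sketch of the classical estimator only, continuing the simulated example above (an illustration, not the exact procedure used by the package):
# Hoerl-Kennard-Baldwin style estimate of the ridge parameter (illustration only),
# reusing the simulated X, y, n, p from the sketch above
fit_ols    <- lm(y ~ X - 1)                      # OLS fit on the scaled predictors
sigma2     <- sum(residuals(fit_ols)^2) / (n - p)
lambda_hkb <- p * sigma2 / sum(coef(fit_ols)^2)  # lambda = p * sigma^2 / ||beta_OLS||^2
lambda_hkb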
Package installation
Installing the ridge package is straightforward: a plain install.packages() call is enough. The InformationValue package used later for the ROC curve is installed from GitHub.
install.packages("ridge")
devtools::install_github("selva86/InformationValue")
Data loading
Input: at least one quantitative or binary categorical independent variable X; the dependent variable Y should be quantitative (for a categorical Y, use logistic regression instead). Output: goodness-of-fit diagnostics for the model, the linear relationship between the independent variables and the dependent variable, and so on.
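In ridge-package terms this means linearRidge() for a numeric phenotype and logisticRidge() for a binary one; a quick way to check which applies (illustrative only, the data sets are introduced properly in the next subsections):
# Inspect the outcome to decide between linearRidge() and logisticRidge()
library(ridge)
data(GenCont); data(GenBin)
length(unique(GenCont[, "Phenotypes"]))  # many distinct values -> continuous -> linearRidge()
length(unique(GenBin[, "Phenotypes"]))   # two values (0/1)     -> binary     -> logisticRidge()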
Continuous phenotype
library(ridge)
data(GenCont)
head(GenCont[, 1:10])
## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
## [1,] 1.4356316 1 1 1 0 0 1 0 0 0
## [2,] 2.9226960 1 0 0 0 1 0 0 1 0
## [3,] 0.5669319 0 0 0 0 0 0 0 0 0
## [4,] 4.8515051 1 0 0 0 1 0 0 0 0
## [5,] 0.1525582 1 1 1 0 0 1 0 0 0
## [6,] 2.9522701 1 0 0 0 1 0 0 0 0
Binary phenotype
data(GenBin)
head(GenBin)
## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## [1,] 0 1 0 0 0 0 0 0 1 0 1 0 0
## [2,] 0 0 0 0 0 0 0 1 0 0 0 0 0
## [3,] 1 0 0 0 0 0 0 2 0 1 0 0 1
## [4,] 0 0 0 0 0 0 1 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## SNP13 SNP14
## [1,] 0 1
## [2,] 1 0
## [3,] 1 0
## [4,] 0 0
## [5,] 0 0
## [6,] 0 0
Worked examples
A logistic ridge regression model for breast cancer classification
For this practical problem we use functions and arguments from glmnet (the ridge package could also be used). With ordinary tabular data the workflow mirrors that of a lasso model; the only difference is the mixing parameter alpha, as the sketch below shows.
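In glmnet the ridge and lasso fits differ only in the elastic-net mixing parameter alpha: alpha = 1 gives the lasso and alpha = 0 gives ridge. A minimal sketch on simulated data, independent of the breast cancer example that follows:
# alpha controls the elastic-net penalty: alpha = 0 is ridge, alpha = 1 is the lasso
library(glmnet)
set.seed(42)
x_demo <- matrix(rnorm(100 * 10), 100, 10)
y_demo <- rbinom(100, 1, plogis(x_demo[, 1] - x_demo[, 2]))
fit_ridge <- glmnet(x_demo, y_demo, family = "binomial", alpha = 0)  # all coefficients kept, shrunk
fit_lasso <- glmnet(x_demo, y_demo, family = "binomial", alpha = 1)  # some coefficients set to zero
# plot(fit_ridge, xvar = "lambda"); plot(fit_lasso, xvar = "lambda")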
Data loading
This data set is used frequently and appears in many machine learning examples; the usual preprocessing steps are applied below:
library(glmnet)
library(caret)
BreastCancer <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
BreastCancer = BreastCancer[, -1]
dim(BreastCancer)
## [1] 568 31
str(BreastCancer)
## 'data.frame': 568 obs. of 31 variables:
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 20.6 19.7 11.4 20.3 12.4 ...
## $ texture_mean : num 17.8 21.2 20.4 14.3 15.7 ...
## $ perimeter_mean : num 132.9 130 77.6 135.1 82.6 ...
## $ area_mean : num 1326 1203 386 1297 477 ...
## $ smoothness_mean : num 0.0847 0.1096 0.1425 0.1003 0.1278 ...
## $ compactne_mean : num 0.0786 0.1599 0.2839 0.1328 0.17 ...
## $ concavity_mean : num 0.0869 0.1974 0.2414 0.198 0.1578 ...
## $ concave_points_mean : num 0.0702 0.1279 0.1052 0.1043 0.0809 ...
## $ symmetry_mean : num 0.181 0.207 0.26 0.181 0.209 ...
## $ fractal_dimension_mean : num 0.0567 0.06 0.0974 0.0588 0.0761 ...
## $ radius_se : num 0.543 0.746 0.496 0.757 0.335 ...
## $ texture_se : num 0.734 0.787 1.156 0.781 0.89 ...
## $ perimeter_se : num 3.4 4.58 3.44 5.44 2.22 ...
## $ area_se : num 74.1 94 27.2 94.4 27.2 ...
## $ smoothness_se : num 0.00522 0.00615 0.00911 0.01149 0.00751 ...
## $ compactne_se : num 0.0131 0.0401 0.0746 0.0246 0.0335 ...
## $ concavity_se : num 0.0186 0.0383 0.0566 0.0569 0.0367 ...
## $ concave_points_se : num 0.0134 0.0206 0.0187 0.0188 0.0114 ...
## $ symmetry_se : num 0.0139 0.0225 0.0596 0.0176 0.0216 ...
## $ fractal_dimension_se : num 0.00353 0.00457 0.00921 0.00511 0.00508 ...
## $ radius_worst : num 25 23.6 14.9 22.5 15.5 ...
## $ texture_worst : num 23.4 25.5 26.5 16.7 23.8 ...
## $ perimeter_worst : num 158.8 152.5 98.9 152.2 103.4 ...
## $ area_worst : num 1956 1709 568 1575 742 ...
## $ smoothness_worst : num 0.124 0.144 0.21 0.137 0.179 ...
## $ compactne_worst : num 0.187 0.424 0.866 0.205 0.525 ...
## $ concavity_worst : num 0.242 0.45 0.687 0.4 0.535 ...
## $ concave_points_worst : num 0.186 0.243 0.258 0.163 0.174 ...
## $ symmetry_worst : num 0.275 0.361 0.664 0.236 0.399 ...
## $ fractal_dimension_worst: num 0.089 0.0876 0.173 0.0768 0.1244 ...
table(BreastCancer$diagnosis)
##
## B M
## 357 211
# Remove near-zero-variance variables
zerovar = nearZeroVar(BreastCancer[, -1])
zerovar
## integer(0)
# First, identify and remove highly correlated variables
descrCorr = cor(BreastCancer[, -1])
descrCorr[1:5, 1:5]
## radius_mean texture_mean perimeter_mean area_mean
## radius_mean 1.0000000 0.32938305 0.9978764 0.9873442
## texture_mean 0.3293830 1.00000000 0.3359176 0.3261929
## perimeter_mean 0.9978764 0.33591759 1.0000000 0.9865482
## area_mean 0.9873442 0.32619289 0.9865482 1.0000000
## smoothness_mean 0.1680940 -0.01776898 0.2045046 0.1748380
## smoothness_mean
## radius_mean 0.16809398
## texture_mean -0.01776898
## perimeter_mean 0.20450464
## area_mean 0.17483805
## smoothness_mean 1.00000000
highCorr = findCorrelation(descrCorr, 0.9)
BreastCancer = BreastCancer[, -(highCorr + 1)]
# Next, check for exact linear combinations (multicollinearity); this data set has none
comboInfo = findLinearCombos(BreastCancer[, -1])
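# findLinearCombos() returns, in comboInfo$remove, the indices of columns that would
# have to be dropped to resolve exact linear dependencies; it is empty for this data
# set, but in general the flagged columns would be removed as sketched here:
if (length(comboInfo$remove) > 0) {
    BreastCancer <- BreastCancer[, -(comboInfo$remove + 1)]  # +1 offsets the diagnosis column
}
# Finally, centre and scale the numeric predictors (preProcess() defaults to
# method = c("center", "scale")):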
Process = preProcess(BreastCancer)
BreastCancer = predict(Process, BreastCancer)
Data splitting
library(tidyverse)
library(sampling)
set.seed(123)
# Draw 70% of the observations within each stratum
train_id <- strata(BreastCancer, "diagnosis", size = rev(round(table(BreastCancer$diagnosis) *
0.7)))$ID_unit
# Training set
trainData <- BreastCancer[train_id, ]
# Test set
testData <- BreastCancer[-train_id, ]
# Check the class proportions in the training set, the test set and the full data
prop.table(table(trainData$diagnosis))
##
## B M
## 0.6281407 0.3718593
prop.table(table(testData$diagnosis))
##
## B M
## 0.6294118 0.3705882
prop.table(table(BreastCancer$diagnosis))
##
## B M
## 0.6285211 0.3714789
Visualising important variables
# Visualise the distribution of each predictor by diagnosis class using featurePlot()
featurePlot(x = trainData[, 2:21],
y = as.factor(trainData$diagnosis),
plot = "box", #For classification:box, strip, density, pairs or ellipse. For regression, pairs or scatter
strip=strip.custom(par.strip.text=list(cex=.7)),
scales = list(x = list(relation="free"),
y = list(relation="free"))
)
Constructing the model inputs
x = as.matrix(trainData[, -1])
y = ifelse(trainData$diagnosis == "M", 1, 0)
Fitting the logistic ridge regression model
ridge <- glmnet(x, y, family = "binomial", alpha = 0)
plot(ridge, xvar = "lambda", label = TRUE)
Cross-validation
ridge_model <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
plot(ridge_model)
plot(ridge_model$glmnet.fit, "lambda", label = T)
Choosing the value of lambda that minimises the cross-validated error
The output of cv.glmnet() includes lambda.min, the value of lambda at which the cross-validated error is smallest; here ridge_model$lambda.min is about 0.15. This value can be used to build the ridge regression model, so below we refit the model on the full training set with it (a more conservative alternative, lambda.1se, is sketched after the coefficient listing).
ridge_min <- ridge_model$lambda.min
## Fit the ridge model using ridge_min
ridge_best <- glmnet(x, y, alpha = 0, lambda = ridge_min)
coef(ridge_best)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.3729332067
## area_mean 0.0885358564
## smoothness_mean 0.0052045287
## compactne_mean -0.0003435624
## symmetry_mean 0.0075012310
## fractal_dimension_mean -0.0839607648
## radius_se 0.0636875678
## texture_se 0.0027998059
## smoothness_se 0.0388690882
## compactne_se -0.0551225591
## concavity_se -0.0524564964
## concave_points_se 0.0598000796
## symmetry_se -0.0177510728
## fractal_dimension_se 0.0119540616
## texture_worst 0.0557146338
## smoothness_worst 0.0310525197
## compactne_worst -0.0004504186
## concavity_worst 0.0892093379
## concave_points_worst 0.1296407274
## symmetry_worst 0.0614660535
## fractal_dimension_worst 0.0430986166
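cv.glmnet() also reports lambda.1se, the largest lambda whose cross-validated error lies within one standard error of the minimum; it gives a more strongly shrunken and often more stable model. A brief sketch using the objects defined above (note that the cv.glmnet() call earlier omitted family = "binomial"; adding it would keep the cross-validation consistent with the binomial fit):
# Alternative: refit with the more conservative lambda.1se (names ridge_1se/ridge_cons are ours)
ridge_1se  <- ridge_model$lambda.1se
ridge_cons <- glmnet(x, y, alpha = 0, lambda = ridge_1se)
coef(ridge_cons)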
Evaluation on the test set
ridge.coef <- predict(ridge_model, s = 0.05, type = "coefficients")
ridge.coef
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.3729040858
## area_mean 0.0894265533
## smoothness_mean 0.0055832621
## compactne_mean 0.0009547588
## symmetry_mean 0.0076406106
## fractal_dimension_mean -0.0794539760
## radius_se 0.0632058362
## texture_se 0.0022559684
## smoothness_se 0.0356625085
## compactne_se -0.0512845944
## concavity_se -0.0465064056
## concave_points_se 0.0570207458
## symmetry_se -0.0175198162
## fractal_dimension_se 0.0076460181
## texture_worst 0.0559848075
## smoothness_worst 0.0332178942
## compactne_worst 0.0054483603
## concavity_worst 0.0835264079
## concave_points_worst 0.1249131201
## symmetry_worst 0.0592973698
## fractal_dimension_worst 0.0381440680
ridge.y <- predict(ridge_model, newx = as.matrix(testData[, -1]), type = "response",
s = 0.05)
Plotting the ROC curve
library(InformationValue)
actuals <- ifelse(testData$diagnosis == "M", 1, 0)
misClassError(actuals, ridge.y)
## [1] 0.0588
plotROC(actuals, ridge.y)
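InformationValue is installed from GitHub above rather than from CRAN; if it is unavailable, the pROC package provides an equivalent ROC curve and AUC. A minimal alternative sketch:
# Equivalent ROC curve and AUC with pROC (alternative to InformationValue)
library(pROC)
roc_obj <- roc(actuals, as.numeric(ridge.y))
auc(roc_obj)                       # area under the ROC curve
plot(roc_obj, legacy.axes = TRUE)  # plots 1 - specificity on the x axis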
Ridge regression models from genotype and phenotype data
For data of this kind we use the dedicated ridge package, which is very convenient: it ships with functions for fitting and predicting directly from genotype files, making genotype and phenotype data easy to handle.
Fitting a logistic ridge regression model
data(GenBin)
head(GenBin)
## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## [1,] 0 1 0 0 0 0 0 0 1 0 1 0 0
## [2,] 0 0 0 0 0 0 0 1 0 0 0 0 0
## [3,] 1 0 0 0 0 0 0 2 0 1 0 0 1
## [4,] 0 0 0 0 0 0 1 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 0 0 0 0 0 0 0 0 0
## SNP13 SNP14
## [1,] 0 1
## [2,] 1 0
## [3,] 1 0
## [4,] 0 0
## [5,] 0 0
## [6,] 0 0
table(GenBin[, 1])
##
## 0 1
## 250 250
beta_logisticRidge <- logisticRidge(Phenotypes ~ ., data = as.data.frame(GenBin))
summary(beta_logisticRidge)
##
## Call:
## logisticRidge(formula = Phenotypes ~ ., data = as.data.frame(GenBin))
##
##
## Coefficients:
## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 0.002782 NA NA NA
## SNP1 0.166067 1.914369 0.666082 2.874
## SNP2 0.110745 0.269635 0.963936 0.280
## SNP3 -0.284802 -0.491809 0.999027 -0.492
## SNP4 -1.007615 -1.422131 0.990998 -1.435
## SNP5 0.911082 0.910170 0.986458 0.923
## SNP6 -0.026877 -0.355500 0.959424 -0.371
## SNP7 -0.068540 -1.022005 0.922686 -1.108
## SNP8 0.149821 1.729257 0.667675 2.590
## SNP9 -0.159520 -0.795170 0.948859 -0.838
## SNP10 -0.207470 -1.214400 0.893062 -1.360
## SNP11 0.027842 0.067788 0.999575 0.068
## SNP12 0.090563 0.651787 0.958376 0.680
## SNP13 -0.051310 -0.705257 0.947386 -0.744
## SNP14 -0.196030 -0.990720 0.904846 -1.095
## Pr(>|t|)
## (Intercept) NA
## SNP1 0.00405 **
## SNP2 0.77969
## SNP3 0.62252
## SNP4 0.15127
## SNP5 0.35618
## SNP6 0.71098
## SNP7 0.26802
## SNP8 0.00960 **
## SNP9 0.40202
## SNP10 0.17389
## SNP11 0.94593
## SNP12 0.49644
## SNP13 0.45662
## SNP14 0.27356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 0.1244006, chosen automatically, computed using 3 PCs
##
## Degrees of freedom: model 7.02 , variance 4.066
pvalsMod <- pvals(beta_logisticRidge)
print(pvalsMod)
##
## lambda 0.124401 chosen automatically using 3 PCs
##
## Estimate (scaled) Std. Error (scaled) t value (scaled) Pr(>|t|)
## SNP1 1.91437 0.66608 2.874 0.00405 **
## SNP2 0.26964 0.96394 0.280 0.77969
## SNP3 -0.49181 0.99903 -0.492 0.62252
## SNP4 -1.42213 0.99100 -1.435 0.15127
## SNP5 0.91017 0.98646 0.923 0.35618
## SNP6 -0.35550 0.95942 -0.371 0.71098
## SNP7 -1.02201 0.92269 -1.108 0.26802
## SNP8 1.72926 0.66767 2.590 0.00960 **
## SNP9 -0.79517 0.94886 -0.838 0.40202
## SNP10 -1.21440 0.89306 -1.360 0.17389
## SNP11 0.06779 0.99958 0.068 0.94593
## SNP12 0.65179 0.95838 0.680 0.49644
## SNP13 -0.70526 0.94739 -0.744 0.45662
## SNP14 -0.99072 0.90485 -1.095 0.27356
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pvalsMod)
pred_phen <- predict(beta_logisticRidge, type = "response")
res <- cbind(GenBin, pred_phen)
head(res)
## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## 1 0 1 0 0 0 0 0 0 1 0 1 0 0
## 2 0 0 0 0 0 0 0 1 0 0 0 0 0
## 3 1 0 0 0 0 0 0 2 0 1 0 0 1
## 4 0 0 0 0 0 0 1 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 1 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0
## SNP13 SNP14 pred_phen
## 1 0 1 0.4788050
## 2 1 0 0.4707666
## 3 1 0 0.4367005
## 4 0 0 0.4939766
## 5 0 0 0.4835665
## 6 0 0 0.5006956
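The fitted values in pred_phen are probabilities that Phenotypes = 1; thresholding them at 0.5 gives class predictions and an in-sample confusion table (in-sample accuracy is optimistic, so a held-out set would be needed for an honest estimate):
# Threshold the fitted probabilities at 0.5 to obtain in-sample class predictions
pred_class <- ifelse(pred_phen > 0.5, 1, 0)
table(observed = GenBin[, 1], predicted = pred_class)  # in-sample confusion table
mean(pred_class == GenBin[, 1])                        # in-sample accuracy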
Fitting directly from genotype files
genotypesfile <- system.file("extdata", "GenBin_genotypes.txt", package = "ridge")
phenotypesfile <- system.file("extdata", "GenBin_phenotypes.txt", package = "ridge")
beta_logisticRidgeGenotypes <- logisticRidgeGenotypes(genotypesfilename = genotypesfile,
phenotypesfilename = phenotypesfile)
beta_logisticRidgeGenotypes
## B
## Intercept 0.002782
## SNP1 0.166067
## SNP2 0.110745
## SNP3 -0.284802
## SNP4 -1.007615
## SNP5 0.911082
## SNP6 -0.026877
## SNP7 -0.068540
## SNP8 0.149821
## SNP9 -0.159520
## SNP10 -0.207470
## SNP11 0.027842
## SNP12 0.090563
## SNP13 -0.051310
## SNP14 -0.196030
## compare to output of logisticRidge
results <- as.data.frame(cbind(round(coef(beta_logisticRidge), 6), beta_logisticRidgeGenotypes))
results
## round(coef(beta_logisticRidge), 6) B
## (Intercept) 0.002782 0.002782
## SNP1 0.166067 0.166067
## SNP2 0.110745 0.110745
## SNP3 -0.284802 -0.284802
## SNP4 -1.007615 -1.007615
## SNP5 0.911082 0.911082
## SNP6 -0.026877 -0.026877
## SNP7 -0.068540 -0.068540
## SNP8 0.149821 0.149821
## SNP9 -0.159520 -0.159520
## SNP10 -0.207470 -0.207470
## SNP11 0.027842 0.027842
## SNP12 0.090563 0.090563
## SNP13 -0.051310 -0.051310
## SNP14 -0.196030 -0.196030
Fitting a linear ridge regression model
data(GenCont)
head(GenCont)
## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## [1,] 1.4356316 1 1 1 0 0 1 0 0 0 0 1 0
## [2,] 2.9226960 1 0 0 0 1 0 0 1 0 0 1 0
## [3,] 0.5669319 0 0 0 0 0 0 0 0 0 0 2 0
## [4,] 4.8515051 1 0 0 0 1 0 0 0 0 0 1 0
## [5,] 0.1525582 1 1 1 0 0 1 0 0 0 0 1 0
## [6,] 2.9522701 1 0 0 0 1 0 0 0 0 0 0 0
beta_linearRidge <- linearRidge(Phenotypes ~ ., data = as.data.frame(GenCont))
summary(beta_linearRidge)
##
## Call:
## linearRidge(formula = Phenotypes ~ ., data = as.data.frame(GenCont))
##
##
## Coefficients:
## Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept) 1.533386 NA NA NA
## SNP1 0.277296 4.045409 0.266120 15.201
## SNP2 -0.110458 -1.256154 0.216332 5.807
## SNP3 -0.110458 -1.256154 0.216332 5.807
## SNP4 0.005230 0.011635 0.371693 0.031
## SNP5 0.531173 6.323006 0.315368 20.050
## SNP6 -0.119164 -1.373227 0.223047 6.157
## SNP7 0.113844 0.113730 0.372181 0.306
## SNP8 -0.099149 -1.028581 0.355807 2.891
## SNP9 -0.008321 -0.008312 0.372386 0.022
## SNP10 0.058562 0.101128 0.371567 0.272
## SNP11 -0.096526 -1.495699 0.329250 4.543
## SNP12 -0.334279 -0.333945 0.372248 0.897
## Pr(>|t|)
## (Intercept) NA
## SNP1 < 2e-16 ***
## SNP2 6.38e-09 ***
## SNP3 6.38e-09 ***
## SNP4 0.97503
## SNP5 < 2e-16 ***
## SNP6 7.43e-10 ***
## SNP7 0.75993
## SNP8 0.00384 **
## SNP9 0.98219
## SNP10 0.78549
## SNP11 5.55e-06 ***
## SNP12 0.36966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Ridge parameter: 2.206848, chosen automatically, computed using 1 PCs
##
## Degrees of freedom: model 3.121 , variance 1.205 , residual 5.036
pvalsMod <- pvals(beta_linearRidge)
print(pvalsMod)
##
## lambda 2.206848 chosen automatically using 1 PCs
##
## Estimate (scaled) Std. Error (scaled) t value (scaled) Pr(>|t|)
## SNP1 4.045409 0.266120 15.201 < 2e-16 ***
## SNP2 -1.256154 0.216332 5.807 6.38e-09 ***
## SNP3 -1.256154 0.216332 5.807 6.38e-09 ***
## SNP4 0.011635 0.371693 0.031 0.97503
## SNP5 6.323006 0.315368 20.050 < 2e-16 ***
## SNP6 -1.373227 0.223047 6.157 7.43e-10 ***
## SNP7 0.113730 0.372181 0.306 0.75993
## SNP8 -1.028581 0.355807 2.891 0.00384 **
## SNP9 -0.008312 0.372386 0.022 0.98219
## SNP10 0.101128 0.371567 0.272 0.78549
## SNP11 -1.495699 0.329250 4.543 5.55e-06 ***
## SNP12 -0.333945 0.372248 0.897 0.36966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pvalsMod)
pred_phen <- predict(beta_linearRidge)
res <- cbind(GenCont, pred_phen)
head(res)
## Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## 1 1.4356316 1 1 1 0 0 1 0 0 0 0 1 0
## 2 2.9226960 1 0 0 0 1 0 0 1 0 0 1 0
## 3 0.5669319 0 0 0 0 0 0 0 0 0 0 2 0
## 4 4.8515051 1 0 0 0 1 0 0 0 0 0 1 0
## 5 0.1525582 1 1 1 0 0 1 0 0 0 0 1 0
## 6 2.9522701 1 0 0 0 1 0 0 0 0 0 0 0
## pred_phen
## 1 1.374077
## 2 2.146179
## 3 1.340333
## 4 2.245328
## 5 1.374077
## 6 2.341855
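For a continuous phenotype the natural in-sample checks are the correlation between observed and fitted values and the root mean squared error:
# In-sample agreement between observed and fitted phenotypes
cor(GenCont[, 1], pred_phen)              # correlation of observed vs fitted
sqrt(mean((GenCont[, 1] - pred_phen)^2))  # root mean squared error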
Fitting and prediction directly from genotype files
genotypesfile <- system.file("extdata", "GenCont_genotypes.txt", package = "ridge")
phenotypesfile <- system.file("extdata", "GenCont_phenotypes.txt", package = "ridge")
betafile <- tempfile(pattern = "beta", fileext = ".dat")
beta_linearRidgeGenotypes <- linearRidgeGenotypes(genotypesfilename = genotypesfile,
phenotypesfilename = phenotypesfile, betafilename = betafile)
pred_phen_geno <- linearRidgeGenotypesPredict(genotypesfilename = genotypesfile,
betafilename = betafile)
## compare to output of linearRidge
results <- cbind(pred_phen_geno, pred_phen)
head(results)
## PredictedPhenotypes pred_phen
## 1 1.374076 1.374077
## 2 2.146180 2.146179
## 3 1.340334 1.340333
## 4 2.245329 2.245328
## 5 1.374076 1.374077
## 6 2.341855 2.341855
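The file-based predictions match the formula-based ones up to small rounding differences, which can be confirmed programmatically:
# Confirm the two prediction paths agree to within rounding error
all.equal(as.numeric(pred_phen_geno[, "PredictedPhenotypes"]), as.numeric(pred_phen),
    tolerance = 1e-5)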
References
- Cule, E. et al. (2011). Significance testing in ridge regression for genetic data. BMC Bioinformatics, 12:372.
- Cule, E. and De Iorio, M. (2012). A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv:1205.0686v1.
- Cule, E. and De Iorio, M. (2013). Ridge regression in prediction problems: automatic choice of the ridge parameter. Genetic Epidemiology, 37(7):704-714.