R for data science ||使用modelr实现基

作者: 周运来就是我 | 来源:发表于2019-08-08 08:09 被阅读25次

R for data science ||使用modelr实现基
R函数学习-tibble()
R for data science || 使用purrr实现
R for data science ||使用tibble实现简
R for data science || 使用magritt
【R for Data Science】(2) tibble
R for Data Science
[R语言] tidyr包数据整理《R for data sci
[R语言] tibble包《R for data science
[R语言] Functions 函数《R for data sc

模型的作用是提供一个简单的，低纬度的数据集摘要。

模型的目的不是发现真理，而是获得简单但是有价值的近似。

构建模型两个阶段：

定义模型簇
拟合模型

library(tidyverse)

library(modelr)
options(na.action = na.warn)
ggplot(sim1, aes(x, y)) + 
  geom_point()

models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()

model1 <- function(a, data) {
  a[1] + data$x * a[2]
}
model1(c(7, 1.5), sim1)
#>  [1]  8.5  8.5  8.5 10.0 10.0 10.0 11.5 11.5 11.5 13.0 13.0 13.0 14.5 14.5
#> [15] 14.5 16.0 16.0 16.0 17.5 17.5 17.5 19.0 19.0 19.0 20.5 20.5 20.5 22.0
#> [29] 22.0 22.0

measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  sqrt(mean(diff ^ 2))
}
measure_distance(c(7, 1.5), sim1)
#> [1] 2.67

sim1_dist <- function(a1, a2) {
  measure_distance(c(a1, a2), sim1)
}

models <- models %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
#> # A tibble: 250 x 3
#>       a1      a2  dist
#>    <dbl>   <dbl> <dbl>
#> 1 -15.2   0.0889  30.8
#> 2  30.1  -0.827   13.2
#> 3  16.0   2.27    13.2
#> 4 -10.6   1.38    18.7
#> 5 -19.6  -1.04    41.8
#> 6   7.98  4.59    19.3
#> # … with 244 more rows

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(models, rank(dist) <= 10)
  )

ggplot(models, aes(a1, a2)) +
  geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))

grid <- expand.grid(
  a1 = seq(-5, 20, length = 25),
  a2 = seq(1, 3, length = 25)
  ) %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))

grid %>% 
  ggplot(aes(a1, a2)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(grid, rank(dist) <= 10)
  )

best <- optim(c(0, 0), measure_distance, data = sim1)
best$par
#> [1] 4.22 2.05

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(intercept = best$par[1], slope = best$par[2])

其他模型族
上面的学习都是基于线性模型，其实还有很多模型，下面列出名称及其R实现：

广义线性模型，如stats::glm()函数
广义可加模型，如mgcv::gam()函数
带有惩罚项的线性模型，如glmnet::glmnet()函数
健壮线性模型，如MASS:rlm()

树模型，如rpart::rpart()。本身效率不高，但如果使用随机森林randomForest::randomForest()或梯度提升机xgboost::xgboost()这样的模型将非常强大。

r4ds

R for data science ||使用modelr实现基
模型的作用是提供一个简单的，低纬度的数据集摘要。模型的目的不是发现真理，而是获得简单但是有价值的近似。构建模型...
R函数学习-tibble()
R for data science ||使用tibble实现简单数据框 tibble是R语言中一个用来替换dat...
R for data science || 使用purrr实现
永远不要重复复制黏贴超过两次减少重复，节约空间，主要有三个好处：更容易看清代码的意图更容易对变化做出反应更...
R for data science ||使用tibble实现简
tibble是R语言中一个用来替换data.frame类型的扩展的数据框，tibble继承了data.frame，...
R for data science || 使用magritt
magrittr包被定义为一个高效的管道操作工具包，通过管道的连接方式，让数据或表达式的传递更高效，使用操作符%>...
【R for Data Science】(2) tibble
参考:tibble in tibble in R tibble in ti...
R for Data Science
由于作业是在腾讯文档在线做的，转载过来很麻烦，就发布成网页，点开链接直接就看了，以后直接就这样更新了。有没有人看就...
[R语言] tidyr包数据整理《R for data sci
《R for Data Science》第十二章 Tidy data 啃书知识点积累参考链接：R for Data...
[R语言] tibble包《R for data science
《R for Data Science》第九、十章 Tibbles 啃书知识点积累参考链接：R for Data ...
[R语言] Functions 函数《R for data sc
《R for Data Science》第十九章 Functions 啃书知识点积累参考链接：R for Data...