96. Transformations in formulas

Author: 心惊梦醒 | Published: 2022-04-15 20:30

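Transformations can be applied directly inside a model formula, for example log(y) ~ sqrt(x1) + x2. If the transformation uses +, *, ^ or -, wrap it in I() so that R treats it as ordinary arithmetic rather than as part of the model specification, for example y ~ x + I(x ^ 2).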
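To see why I() matters, the following minimal sketch compares the model matrix columns with and without it. The data frame df_demo and its values are made up for illustration and do not come from the post; only base R's model.matrix() is used.

```r
# Toy data, made up for illustration (not from the post).
df_demo <- data.frame(x = c(1, 2, 3, 4), y = c(1.1, 3.9, 9.2, 15.8))

# Without I(), ^ is a formula operator: x^2 means "x crossed with itself",
# which collapses to plain x, so no squared column is created.
cols_plain <- colnames(model.matrix(y ~ x + x^2, data = df_demo))

# With I(), x^2 is evaluated as ordinary arithmetic and gets its own column.
cols_I <- colnames(model.matrix(y ~ x + I(x^2), data = df_demo))

print(cols_plain)
print(cols_I)
```

The first call yields only the intercept and x columns, while the second adds an I(x^2) column.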
Transformations are useful because they let you approximate a non-linear function: a smooth function can be approximated as closely as you like by a polynomial.

you can approximate any smooth function with an infinite sum of polynomials

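Polynomial terms can be generated with poly(); splines::ns() (a natural spline) is a safer alternative. The drawback of poly() is that outside the range of the data, polynomials rapidly shoot off to positive or negative infinity.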

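library(modelr)
# df is not defined in the post; this definition is assumed, and it is
# consistent with the three-row output shown below (x = 1, 2, 3).
df <- tibble::tribble(~y, ~x, 1, 1, 2, 2, 3, 3)
model_matrix(df, y ~ poly(x, 2))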
#> # A tibble: 3 x 3
#>   `(Intercept)` `poly(x, 2)1` `poly(x, 2)2`
#>           <dbl>         <dbl>         <dbl>
#> 1             1     -7.07e- 1         0.408
#> 2             1     -7.85e-17        -0.816
#> 3             1      7.07e- 1         0.408

library(splines)
model_matrix(df, y ~ ns(x, 2))
#> # A tibble: 3 x 3
#>   `(Intercept)` `ns(x, 2)1` `ns(x, 2)2`
#>           <dbl>       <dbl>       <dbl>
#> 1             1       0           0    
#> 2             1       0.566      -0.211
#> 3             1       0.344       0.771

The book goes on to fit a non-linear function. The plots are nice, but I won't copy them here.
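The claim that poly() diverges outside the data range while ns() does not can be checked numerically. This is a minimal sketch, not the book's code: the noise-free sine data, the degree 5 and the ten extrapolation points are all assumptions chosen for illustration. It relies on the fact that a natural spline is linear beyond its boundary knots, so the second differences of its predictions at equally spaced points vanish there.

```r
library(splines)

# Noise-free toy data on [0, 3.5 * pi]; made up for this sketch.
train <- data.frame(x = seq(0, 3.5 * pi, length.out = 50))
train$y <- 4 * sin(train$x)

mod_poly <- lm(y ~ poly(x, 5), data = train)
mod_ns   <- lm(y ~ ns(x, 5), data = train)

# Equally spaced points beyond the right edge of the training range.
outside <- data.frame(x = max(train$x) + 1:10)
pred_poly <- predict(mod_poly, newdata = outside)
pred_ns   <- predict(mod_ns, newdata = outside)

# Second differences measure curvature: near zero means a straight line.
curv_poly <- max(abs(diff(pred_poly, differences = 2)))
curv_ns   <- max(abs(diff(pred_ns, differences = 2)))
print(c(poly = curv_poly, ns = curv_ns))
```

The spline's curvature outside the data is zero up to rounding error, whereas the polynomial keeps bending away from the data.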

Exercises

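1. Repeat the analysis of sim2 using a model without an intercept. What happens to the model equation? What happens to the predictions?
Answer: Add -1 to the formula to drop the intercept. The predictions do not change. This is not because the intercept plays no role when the predictor is categorical. Without an intercept, R creates one indicator column for every level of x instead of dropping the first level, so each coefficient is simply that group's mean. The two formulas are different parameterisations of the same model, which is why the predictions are identical.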

library(ggplot2)
library(modelr)
# Model sim2 with the default intercept
mod2 <- lm(y ~ x, data = sim2)
grid <- sim2 %>%
    data_grid(x) %>%
    add_predictions(mod2)
ggplot(sim2, aes(x)) +
    geom_point(aes(y = y)) +
    geom_point(data = grid, aes(y = pred), colour = "red", size = 4)

# Repeat the sim2 analysis with a model that has no intercept
mod3 <- lm(y ~ x - 1, data = sim2)
grid3 <- sim2 %>%
    data_grid(x) %>%
    add_predictions(mod3)
ggplot(sim2, aes(x)) +
    geom_point(aes(y = y)) +
    geom_point(data = grid3, aes(y = pred), colour = "red", size = 4)
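The equivalence of the two parameterisations can also be confirmed without plotting. The sketch below uses a small made-up data frame (toy) rather than sim2, so that it runs in base R alone. It checks that the fitted values agree and that the no-intercept coefficients equal the group means.

```r
# Toy data with a categorical predictor; values are made up.
toy <- data.frame(
  x = factor(rep(c("a", "b", "c"), each = 3)),
  y = c(1.0, 1.4, 0.9, 8.2, 7.9, 8.5, 6.1, 5.8, 6.4)
)

with_int <- lm(y ~ x, data = toy)      # intercept + offsets from level "a"
no_int   <- lm(y ~ x - 1, data = toy)  # one coefficient per level

group_means <- tapply(toy$y, toy$x, mean)

# Same fitted values, different coefficients.
same_pred <- isTRUE(all.equal(unname(fitted(with_int)), unname(fitted(no_int))))
print(same_pred)
print(coef(with_int))
print(coef(no_int))
print(group_means)
```

With the intercept, the first coefficient is the mean of level "a" and the others are offsets from it. Without the intercept, every coefficient is a group mean.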

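2. Use model_matrix() to explore the equations generated for the models fitted to sim3 and sim4. Why is * a good shorthand for interaction?
Answer: I don't understand model_matrix() well enough yet, so this one is hard for me. Skipping it for now.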

library(modelr)
model_matrix(sim3, y ~ x1 + x2)
model_matrix(sim3, y ~ x1 * x2)
model_matrix(sim4, y ~ x1 + x2)
model_matrix(sim4, y ~ x1 * x2)
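One way to approach the question is to compare the columns that each formula produces. The sketch below is only a starting point, not a full answer. It uses base R's model.matrix() on a made-up data frame shaped like sim3 (numeric x1, categorical x2) and shows that x1 * x2 is shorthand for x1 + x2 + x1:x2.

```r
# Toy data shaped like sim3 (numeric x1, categorical x2); values are made up.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = factor(c("a", "b", "c", "a", "b", "c")),
  y  = c(2.1, 3.3, 1.8, 4.0, 6.2, 2.9)
)

mm_add  <- model.matrix(y ~ x1 + x2, data = toy)
mm_star <- model.matrix(y ~ x1 * x2, data = toy)
mm_long <- model.matrix(y ~ x1 + x2 + x1:x2, data = toy)

print(colnames(mm_add))
print(colnames(mm_star))

# The shorthand and the spelled-out formula give the same matrix.
same_matrix <- isTRUE(all.equal(mm_star, mm_long))
print(same_matrix)
```

Writing * once gives both main effects and every interaction column, which gets tedious to spell out by hand as the number of factor levels grows.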

3. Using the basic principles, convert the formulas in the following two models into functions.

mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)

Answer: Another one left for later.
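One possible reading of the exercise is to build each model matrix by hand, starting by turning the categorical variable into 0-1 columns. The sketch below does that on a made-up data frame shaped like sim3. The helper names x2_dummies, matrix_mod1 and matrix_mod2 are hypothetical, and the result is checked against base R's model.matrix().

```r
# Toy data shaped like sim3 (numeric x1, factor x2); values are made up.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = factor(c("a", "b", "c", "a", "b", "c"))
)

# Hypothetical helper: 0-1 columns for every level of x2 except the first.
x2_dummies <- function(.data) {
  lvls <- levels(.data$x2)[-1]
  out <- sapply(lvls, function(l) as.numeric(.data$x2 == l))
  colnames(out) <- paste0("x2", lvls)
  out
}

# y ~ x1 + x2: intercept, x1, and the dummy columns.
matrix_mod1 <- function(.data) {
  cbind(`(Intercept)` = 1, x1 = .data$x1, x2_dummies(.data))
}

# y ~ x1 * x2: the columns above plus x1 multiplied by each dummy.
matrix_mod2 <- function(.data) {
  d <- x2_dummies(.data)
  inter <- d * .data$x1
  colnames(inter) <- paste0("x1:", colnames(d))
  cbind(matrix_mod1(.data), inter)
}

ok1 <- isTRUE(all.equal(matrix_mod1(toy), model.matrix(~ x1 + x2, data = toy),
                        check.attributes = FALSE))
ok2 <- isTRUE(all.equal(matrix_mod2(toy), model.matrix(~ x1 * x2, data = toy),
                        check.attributes = FALSE))
print(c(mod1 = ok1, mod2 = ok2))
```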
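4. For sim4, which of mod1 and mod2 is better? I think mod2 does a slightly better job at removing patterns, but it's pretty subtle. Can you come up with a plot to support my claim?
Answer: How would one even answer this?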
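One candidate plot, offered as a sketch rather than a definitive answer, compares the distribution of absolute residuals from the two models. It assumes the modelr and ggplot2 packages are installed. The binwidth of 0.25 is an arbitrary choice.

```r
library(modelr)
library(ggplot2)

mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)

# Stack the residuals of both models into one data frame with a `model` column.
resids <- gather_residuals(sim4, mod1, mod2)

# If mod2 removes more pattern, its absolute residuals should sit closer to 0.
p <- ggplot(resids, aes(x = abs(resid), colour = model)) +
  geom_freqpoly(binwidth = 0.25)
print(p)

# A numeric companion to the plot: residual spread per model.
print(tapply(resids$resid, resids$model, sd))
```

Because mod1 is nested in mod2, the residual sum of squares of mod2 can never be larger than that of mod1. The plot and the standard deviations only indicate how large the difference is.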

Missing values

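Modelling functions drop any row that contains a missing value. By default R does this silently. After setting the global option options(na.action = na.warn) (na.warn is provided by modelr), a warning is printed whenever rows are dropped. To suppress the warning for a particular fit, pass na.action = na.exclude to the model function.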
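The effect of dropping rows can be inspected with nobs() and residuals(). The sketch below uses a made-up data frame (df_na) and base R only. It also shows a side effect of na.exclude that na.omit does not have: residuals are padded with NA so that they line up with the original rows.

```r
# Toy data with one missing y and one missing x; values are made up.
df_na <- data.frame(
  x = c(1, 2, 3, 4, NA),
  y = c(2.2, NA, 3.5, 8.3, 10)
)

mod_omit <- lm(y ~ x, data = df_na, na.action = na.omit)
mod_excl <- lm(y ~ x, data = df_na, na.action = na.exclude)

# Both fits use only the three complete rows.
print(nobs(mod_omit))
print(nobs(mod_excl))

# na.omit returns residuals for the used rows only;
# na.exclude returns one value per original row, with NA where a row was dropped.
print(length(residuals(mod_omit)))
print(residuals(mod_excl))
```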

Original link: https://www.haomeiwen.com/subject/dckxertx.html