Learning R, Day 2 (Part 1)

Author: Peng_001 | Published: 2020-04-30 16:21

Code source: DataCamp's R course

Data types in R

Decimal values like 4.5 are called numerics.
Natural numbers like 4 are called integers. Integers are also numerics.
Boolean values (TRUE or FALSE) are called logical.
Text (or string) values are called characters.
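
As a quick check of the four types above, class() reports the type of a value. This is a minimal sketch; the variable names are illustrative and not taken from the course.

```r
my_numeric <- 4.5
my_integer <- 4L            # the L suffix makes an integer; a bare 4 is stored as numeric
my_logical <- TRUE
my_character <- "some text"

class(my_numeric)           # "numeric"
class(my_integer)           # "integer"
class(my_logical)           # "logical"
class(my_character)         # "character"
is.numeric(my_integer)      # TRUE: integers are also numerics
```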

Vectors

Create a vector with c() and assign it with <-.
Use names() to name its elements.

names(vectors) <- c('a','b','c')

sum()
Computes the sum of a vector's elements.

sum(vector)

vector[n]
Selects one element of the vector,
or several elements at once.

poker_midweek <- poker_vector[c(2,3,4)]
# Select elements 2, 3, and 4 of the vector
roulette_selection_vector <- roulette_vector[2:4]
# Select elements 2 to 4 of the vector, same as above
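
The steps so far (create, name, select, sum) can be tried together in one short run. The winnings below are assumed sample values chosen for illustration; the notes do not list the actual numbers.

```r
# Assumed sample data: poker winnings for five days
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

poker_vector[2:4]                     # elements 2 to 4; the names are kept
poker_vector[c("Monday", "Friday")]   # selection by name also works
sum(poker_vector)                     # 230
mean(poker_vector[2:4])               # -50
```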

mean()
Computes the average.

poker_start = mean(vector)

Comparisons with vectors

selection_vector <- poker_vector>0
# The result is a logical vector: one TRUE or FALSE per element of the original vector.
'''
   Monday   Tuesday Wednesday  Thursday    Friday 
     TRUE     FALSE      TRUE     FALSE      TRUE
'''

The resulting logical vector, selection_vector, can then be used to select elements from poker_vector.

R knows what to do when you pass a logical vector in square brackets: it will only select the elements that correspond to TRUE in selection_vector.

So the expression below returns the elements that correspond to TRUE.

poker_winning_days <- poker_vector[selection_vector]
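
A small sketch of logical selection from start to finish. The winnings are assumed sample values, picked so that the comparison gives the TRUE/FALSE pattern printed above.

```r
# Assumed sample data, named at creation time
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

selection_vector <- poker_vector > 0
poker_winning_days <- poker_vector[selection_vector]

poker_winning_days               # Monday 140, Wednesday 20, Friday 240
sum(selection_vector)            # 3: each TRUE counts as 1
poker_vector[poker_vector > 0]   # the same selection without the helper vector
```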

Combining several vectors

box_office <- c(new_hope, empire_strikes, return_jedi)

Vectors can be combined directly with c().

Matrices

Creating a matrix

matrix(1:9, byrow = TRUE, nrow = 3)

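1) 1:9 supplies the elements of the matrix, here the numbers 1 to 9;
2) byrow sets the fill direction: TRUE fills row by row, FALSE fills column by column;
3) nrow is the number of rows.
Result: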

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Filled column by column instead:

matrix(1:12, byrow = FALSE, ncol = 3)
'''
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
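
To see the effect of byrow in isolation, the same six numbers can be filled both ways. A minimal sketch; m_row and m_col are illustrative names.

```r
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)    # fill row by row
m_col <- matrix(1:6, nrow = 2, byrow = FALSE)   # fill column by column (the default)

m_row[1, ]   # 1 2 3
m_col[1, ]   # 1 3 5
dim(m_row)   # 2 3: the number of columns is inferred from the data length
```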
'''

Naming rows and columns
Define a matrix:

star_wars_matrix <- matrix(1:6, nrow = 3, byrow = TRUE)
# Filled by row: 6 numbers in 3 rows, so 2 columns.

Define the row and column names:

region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

Name the rows and columns:

colnames(star_wars_matrix) <- region
# Name the columns
rownames(star_wars_matrix) <- titles
# Name the rows

Print the result:

star_wars_matrix
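
As an alternative sketch, the names can also be supplied when the matrix is created, through the dimnames argument of matrix(). Once a matrix has names, elements can be selected by name as well as by position.

```r
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")

# dimnames takes a list: row names first, then column names
star_wars_matrix <- matrix(1:6, nrow = 3, byrow = TRUE,
                           dimnames = list(titles, region))

star_wars_matrix["A New Hope", "US"]   # 1
star_wars_matrix[, "non-US"]           # 2 4 6, labelled with the titles
```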

Adding columns to a matrix
Use cbind().
An added vector must have as many elements as the matrix has rows.

big_matrix <- cbind(matrix1, matrix2, vector1 ...)

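Adding rows to a matrix
Use rbind(); it works like cbind(), except that an added vector needs as many elements as the matrix has columns.

Sums
colSums() or rowSums()

Selecting elements of a matrix
matrix[x, y], where x is the row and y is the column

matrix[1:2,2:3]
# Select columns 2 and 3 of rows 1 and 2.
# The result is itself a matrix, not a vector.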
'''
                        non-US worldwide_vector
A New Hope               314.4          775.398
The Empire Strikes Back  247.9          538.375
'''
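
The matrix operations above (cbind(), rbind(), the sum functions, and selection) fit into one short sketch. The box-office figures are assumed values, chosen to agree with the output printed above; with_total is a hypothetical name.

```r
# Assumed box-office figures (US, non-US) per film
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi),
                           nrow = 3, byrow = TRUE,
                           dimnames = list(c("A New Hope", "The Empire Strikes Back",
                                             "Return of the Jedi"),
                                           c("US", "non-US")))

worldwide_vector <- rowSums(star_wars_matrix)                 # one total per film
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)  # add it as a column
with_total <- rbind(all_wars_matrix, Total = colSums(all_wars_matrix))  # add a row

all_wars_matrix[1:2, 2:3]            # a 2 x 2 matrix
all_wars_matrix[1, ]                 # a single row is dropped to a named vector
all_wars_matrix[1, , drop = FALSE]   # drop = FALSE keeps it a 1 x 3 matrix
```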

Factors

What is a factor?

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.

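A factor stores a categorical variable, which can take only a limited number of categories; a continuous variable can take any of an infinite number of values.
A simple example:
Sex: male, female.
There is always a finite number of categories.

Converting a vector to a factor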

factor_sex_vector <- factor(sex_vector)

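Factors for two kinds of variables
1) A nominal variable has no inherent order. Example: kinds of animal, such as monkey, rabbit, mouse. There is no ranking among the animals.
2) An ordinal variable has a natural ordering. Example: words that describe degree, such as high, medium, low. These clearly have an inherent order.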

# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector

# Temperature
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector

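The ordered and levels arguments set the ranking of an ordinal variable in a factor (order = TRUE above works because R matches argument names partially).

If the levels and their order were not set when the factor was created,
they can be assigned afterwards:

survey_vector <- c("M", "F", "F", "M", "M")
# Define the vector
factor_survey_vector <- factor(survey_vector)
# Define the factor
levels(factor_survey_vector) <- c("Female", "Male")
# Set the levels to Female and Male. By default the levels are sorted alphabetically,
# and F comes before M, so Female is assigned to F.
# The advantage is that the vector itself can use short codes;
# the full names are supplied once, on the factor.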
factor_survey_vector
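
Because the renaming relies on the default alphabetical order, it is worth checking levels() before assigning new names. The sketch below also shows an alternative that states the mapping explicitly through the levels and labels arguments of factor(); the variable name explicit is illustrative.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

levels(factor_survey_vector)   # "F" "M": confirm the order before renaming
levels(factor_survey_vector) <- c("Female", "Male")

# Alternative: pair each code with its label when the factor is created
explicit <- factor(survey_vector, levels = c("F", "M"),
                   labels = c("Female", "Male"))

as.character(explicit)   # "Male" "Female" "Female" "Male" "Male"
```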

Summarizing a factor
summary() reports how many observations fall into each level of a factor.

survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
# Define the factor
summary(factor_survey_vector)
# Summarize and return the counts
'''
Female   Male 
     2      3
'''

PS: summary() also works on numbers, strings, and matrices.
Numbers

> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2       2       2       2       2       2

Strings

> summary(x2)
   Length     Class      Mode 
        5 character character

Matrices

> summary(all_wars_matrix)
       US            non-US      worldwide_vector
 Min.   :290.5   Min.   :165.8   Min.   :475.1   
 1st Qu.:299.9   1st Qu.:206.8   1st Qu.:506.7   
 Median :309.3   Median :247.9   Median :538.4   
 Mean   :353.6   Mean   :242.7   Mean   :596.3   
 3rd Qu.:385.2   3rd Qu.:281.1   3rd Qu.:656.9   
 Max.   :461.0   Max.   :314.4   Max.   :775.4

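Comparing elements of a factor
For ordinal variables, whose levels are ordered, a comparison returns a logical value. PS: the class of such a factor is reported as ordered (at least that is how RStudio displays it).
For example: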

temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
high <- factor_temperature_vector[1]
# Assign the first element of the factor to high
low <- factor_temperature_vector[2]
high > low

high > low
[1] TRUE
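
A slightly longer sketch of what an ordered factor allows. It spells the argument out as ordered = TRUE, compares the whole factor against a level name, and uses max() and is.ordered(), which are not part of the original notes.

```r
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))

factor_temperature_vector[1] > factor_temperature_vector[2]   # TRUE: High > Low
factor_temperature_vector >= "Medium"   # TRUE FALSE TRUE FALSE TRUE
max(factor_temperature_vector)          # High
is.ordered(factor_temperature_vector)   # TRUE
class(factor_temperature_vector)        # "ordered" "factor"
```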

For nominal variables, which have no order, the comparison produces a warning and returns NA.

Warning message:
In Ops.factor(high, medium) : ‘>’ not meaningful for factors

Sex has no order.

# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Male
male <- factor_survey_vector[1]

# Female
female <- factor_survey_vector[2]

Comparing the two with male > female returns
NA
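
The NA result can be checked programmatically. In this sketch the warning is silenced with suppressWarnings() only to keep the output clean; equality tests remain meaningful for unordered factors.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

result <- suppressWarnings(male > female)   # warning silenced on purpose
is.na(result)                               # TRUE
male == female                              # FALSE: equality still works
is.ordered(factor_survey_vector)            # FALSE: this factor is nominal
```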

Data frames

A data frame, like a questionnaire, contains data of different types.
For example:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Just as a questionnaire does:

'Are you married?' or 'yes/no' questions (logical)
'How old are you?' (numeric)
'What is your opinion on this product?' or other 'open-ended' questions (character)

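Inspecting a dataset
Use head() and tail().

Inspecting the structure of a data frame
str() gives a quick overview of a dataset's structure:

1) The number of observations, usually the number of rows.
2) The number of variables, usually the number of columns.
3) The name and data type of each variable.
4) The first few values of each variable.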
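
The table printed earlier is R's built-in mtcars dataset, so these functions can be tried on it directly. A minimal sketch:

```r
head(mtcars, 3)         # the first three observations
tail(mtcars, 3)         # the last three observations
str(mtcars)             # starts with: 'data.frame': 32 obs. of 11 variables:
nrow(mtcars)            # 32
ncol(mtcars)            # 11
sapply(mtcars, class)   # every column of mtcars is numeric
```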

Building a data frame
First build the vectors.

name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

Then combine the vectors into a data frame.

planets_df <- data.frame(name, type, diameter, rotation, rings)

Print it:

> planets_df
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
5 Jupiter          Gas giant   11.209     0.41  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
8 Neptune          Gas giant    3.883     0.67  TRUE

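Selecting from a data frame
Selection works as it does for a matrix.
PS: a data frame can be thought of as a matrix whose columns may hold different types of data.

The difference is that the columns of a data frame are named variables,
so they can also be looked up by variable name.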

planets_df[1:5,"diameter"]

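$ retrieves an entire column (variable) directly.

rings_vector <- planets_df$rings
# Assign every element of the rings column to a new vector

A logical column can be used to filter the data frame for the rows of interest.
For example, to find every planet that has rings:

planets_df[rings_vector,]
# List every planet for which rings is TRUE
planets_df[rings_vector,"name"]
# List the names of the planets for which rings is TRUE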
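
The selection forms above can be compared side by side. This sketch rebuilds planets_df with only three of its columns for brevity and passes stringsAsFactors = FALSE so that name stays a character vector on any R version; both choices are mine, not the notes'.

```r
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
planets_df <- data.frame(name, diameter, rings, stringsAsFactors = FALSE)

planets_df[1:3, "diameter"]            # a plain numeric vector: 0.382 0.949 1.000
planets_df[["diameter"]]               # the same column as planets_df$diameter
planets_df[planets_df$rings, "name"]   # "Jupiter" "Saturn" "Uranus" "Neptune"
planets_df[planets_df$rings, c("name", "diameter")]   # several columns: a data frame
```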

Selecting with a condition
subset(frame_name, condition)

subset(planets_df, subset = diameter < 1)
# Get the observations (planets) whose diameter is less than 1
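
subset() is shorthand for indexing with a logical vector. The sketch below uses the built-in mtcars data printed earlier, so it runs without any setup; the names efficient and same_rows are illustrative.

```r
# Cars with more than 30 miles per gallon
efficient <- subset(mtcars, subset = mpg > 30)
rownames(efficient)   # "Fiat 128" "Honda Civic" "Toyota Corolla" "Lotus Europa"

# The same rows through logical indexing
same_rows <- mtcars[mtcars$mpg > 30, ]
rownames(same_rows)

# Conditions can be combined, and select keeps only some columns
subset(mtcars, mpg > 30 & hp > 100, select = c(mpg, hp))   # only Lotus Europa
```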

Sorting
order() does not sort the vector itself; it returns the positions of the elements, ranked from smallest to largest value.

> a <- c(100, 10, 1000)
> order(a)
[1] 2 1 3
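
Because order() returns positions, using them as an index gives the sorted values, and the same idea sorts the rows of a data frame. A sketch with the vector above and the built-in mtcars data; sorted_cars is an illustrative name.

```r
a <- c(100, 10, 1000)
order(a)                      # 2 1 3
a[order(a)]                   # 10 100 1000
order(a, decreasing = TRUE)   # 3 1 2

# Sort the rows of a data frame by one of its columns
sorted_cars <- mtcars[order(mtcars$mpg), ]
head(sorted_cars$mpg, 3)      # 10.4 10.4 13.3
```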
