R_Datacamp1 (2018.7.7 to 2018.7.17)


Author: 一条很闲的咸鱼 | Published: 2018-07-17 09:51

    Introduction to R

    • Arithmetic

    • Modulo
      5 %% 4
      # result: 1
    • Exponentiation
      2 ^ 5
      # result: 32
    • Vectors

    c("hello", "hi", "hola")
    c(12, 23, 44, 53)
    poker_vector <- c(140, -50, 20)
    names(poker_vector) <- c("Monday", "Tuesday", "Wednesday")
    poker_vector    # prints the vector with its element names
       Monday   Tuesday Wednesday
          140       -50        20
    Poker1_vector <- poker_vector[2:3]    # unlike Python slicing, indexing starts at 1 and 2:3 includes both ends
    Poker1_vector
      Tuesday Wednesday
          -50        20
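
The selection above can also be done by name or by a logical condition. A minimal sketch, reusing the poker_vector values from these notes (the other variable names are made up for illustration):

```r
poker_vector <- c(140, -50, 20)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday")

# Select by position: 2:3 includes both endpoints (R indexing is 1-based)
by_position <- poker_vector[2:3]

# Select the same elements by name
by_name <- poker_vector[c("Tuesday", "Wednesday")]

# Select with a logical condition: keep only the days with a gain
winning_days <- poker_vector[poker_vector > 0]

total <- sum(poker_vector)

print(by_position)
print(names(winning_days))
print(total)
```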

    • Matrices

    matrix(1:9, byrow = TRUE, nrow = 3)    # basic matrix construction
    The argument byrow indicates that the matrix is filled by rows. To fill it by columns instead, use byrow = FALSE (the default).
    rownames(my_matrix) <- row_names_vector    # name the rows
    colnames(my_matrix) <- col_names_vector    # name the columns
    star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                               dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
                                               c("US", "non-US")))
    worldwide_vector <- rowSums(star_wars_matrix)    # sum of each row
    total_revenue_vector <- colSums(all_wars_matrix)    # sum of each column
    all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)    # bind as additional column(s)
    all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2)    # bind as additional row(s)
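
A small runnable sketch of the matrix functions above. The revenue figures and movie titles are made up for illustration and are not real box-office data:

```r
# Made-up revenue figures, not real box-office data
box_office <- c(10, 20, 30, 40, 50, 60)
movies <- c("Movie A", "Movie B", "Movie C")  # hypothetical titles
regions <- c("US", "non-US")

# byrow = TRUE fills the matrix one row at a time
revenue <- matrix(box_office, nrow = 3, byrow = TRUE,
                  dimnames = list(movies, regions))

worldwide <- rowSums(revenue)            # one total per movie
with_total <- cbind(revenue, worldwide)  # add the totals as a third column
per_region <- colSums(with_total)        # one total per column

print(with_total)
print(per_region)
```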

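    • Factors (vectors of categories, optionally ordered)

    factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
    levels(factor_survey_vector) <- c("Female", "Male")    # renames the levels; the new names must follow the order of the existing levels
    summary(survey_vector)    # for a character vector: length, class, and mode
    summary(factor_speed_vector)    # for a factor: number of observations per level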
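
To make the difference between ordered and unordered factors concrete, here is a short sketch with made-up sample data:

```r
temperature_vector <- c("High", "Low", "High", "Low", "Medium")  # sample data
factor_temperature <- factor(temperature_vector, ordered = TRUE,
                             levels = c("Low", "Medium", "High"))

counts <- summary(factor_temperature)  # observations per level
# Elements of an ordered factor can be compared with < and >
is_warmer <- factor_temperature[1] > factor_temperature[2]

# Renaming levels: new names replace the existing levels in order
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey <- factor(survey_vector)        # levels are "F", "M" (alphabetical)
levels(factor_survey) <- c("Female", "Male")  # "F" becomes "Female", "M" becomes "Male"

print(counts)
print(is_warmer)
print(factor_survey)
```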

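    • Data frames

    head(mtcars, 2)    # first 2 rows
    tail(mtcars, 3)    # last 3 rows
    str(mtcars)    # show the structure
    rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)    # a logical vector
    planets_df <- data.frame(name, type, diameter, rotation, rings)    # combine vectors into a data frame
    my_df[1:3, 2:4]    # select rows 1 to 3 and columns 2 to 4
    planets_df[1:5, "diameter"]    # rows 1 to 5 of the column named "diameter"
    planets_df$rings    # select a whole column by name
    subset(my_df, subset = some_condition)    # filter rows by a condition
    order(a)    # for a vector a: the indices that would sort it in ascending order
    a[order(a)]    # the vector a, sorted
    positions <- order(planets_df$diameter)
    planets_df[positions, ]    # rows sorted by the values of one column
    The subset function selects the rows of a data frame that meet a condition and, optionally, only certain columns:
    selectresult <- subset(df1, name == "aa")
    selectresult <- subset(df1, name == "aa", select = c(age, sex))
    selectresult <- subset(df1, name == "aa" & sex == "f", select = c(age, sex))
    names()    # shows the column names of a data frame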
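
The sorting and filtering steps above, sketched on a tiny data frame. All names and values in df1 are invented for the example:

```r
# A small hypothetical data frame (names and values are made up)
df1 <- data.frame(name = c("aa", "bb", "aa", "cc"),
                  age  = c(31, 25, 47, 19),
                  sex  = c("f", "m", "m", "f"),
                  stringsAsFactors = FALSE)

positions <- order(df1$age)    # row indices that sort age ascending
sorted_df <- df1[positions, ]  # reorder the rows

# Rows where name is "aa" and sex is "f", keeping only age and sex
selected <- subset(df1, name == "aa" & sex == "f", select = c(age, sex))

print(sorted_df)
print(selected)
```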

    • Lists

    my_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
    shining_list <- list(moviename = mov, actors = act, reviews = rev)    # the pattern is element_name = stored_object
    Three ways to access a list element:
    shining_list[[1]]
    shining_list[["reviews"]]
    shining_list$reviews
    ext_list <- c(my_list, my_name = my_val)    # append an element to the list and give it a name
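
A brief sketch of the three access styles, plus the difference between single and double brackets. The list contents are placeholders:

```r
my_vector <- 1:3
my_matrix <- matrix(1:4, nrow = 2)
my_list <- list(vec = my_vector, mat = my_matrix)

# Three equivalent ways to reach the first element
a <- my_list[[1]]
b <- my_list[["vec"]]
c_elem <- my_list$vec

# Single brackets return a sub-list rather than the element itself
sub_list <- my_list["vec"]

# Append a named element (the value 1980 is a placeholder)
ext_list <- c(my_list, year = 1980)

print(names(ext_list))
```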

    Intermediate R

    • Conditionals and Control Flow

    if (condition) {
      expr
    }    # if statement

    if (condition) {
      expr1
    } else {
      expr2
    }    # if + else statement

    if (condition1) {
      expr1
    } else if (condition2) {
      expr2
    } else if (condition3) {
      expr3
    } else {
      expr4
    }    # chained conditions with else if
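
A minimal runnable instance of the chained form. The function classify is a made-up helper, not part of the course material:

```r
# Hypothetical helper that labels a number by its sign
classify <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

print(classify(-3))
print(classify(0))
print(classify(7))
```

Note that `else` has to sit on the same line as the closing brace of the preceding block; otherwise R treats the `if` as complete when code is run line by line at the top level.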

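    • Loops

    while (condition) {
      expr
    }    # while loop
    A while loop is often combined with an if statement inside its body.
    break    # ends the loop immediately
    # loop version 1: iterate over the elements
    for (p in primes) {
      print(p)
    }
    # loop version 2: iterate over the indices
    for (i in 1:length(primes)) {
      print(primes[i])
    }
    paste(..., sep = " ", collapse = NULL)    # concatenates strings
    for (var1 in seq1) {
      for (var2 in seq2) {
        expr
      }
    }    # nested loops
    next: skip the current iteration and continue with the next one
    break: stop the loop at the current iteration
    substr("abcdef", 2, 4)    # extracts the characters at positions 2 to 4 of "abcdef"
    substring("abcdef", 1:6, 1:6)    # extracts positions 1 to 1, 2 to 2, ..., 6 to 6, i.e. splits the string into single characters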
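
The following sketch shows next and break in one loop, together with paste, substr, and substring. The primes vector is sample input:

```r
primes <- c(2, 3, 5, 7, 11, 13)
visited <- c()

for (p in primes) {
  if (p == 5) next    # skip 5 and continue with 7
  if (p > 10) break   # stop at the first value above 10
  visited <- c(visited, p)
}

# collapse joins the elements of one vector into a single string
label <- paste("visited:", paste(visited, collapse = ", "))

middle <- substr("abcdef", 2, 4)
chars <- substring("abcdef", 1:6, 1:6)

print(label)
print(middle)
print(chars)
```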

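    • Functions

    sample(x, size, replace = FALSE, prob = NULL)    # random sampling: size is the number of items to draw, replace says whether to sample with replacement, prob is a vector of probability weights for the elements of x
    args(f)    # takes a function name and shows the function's argument names with their default values
    mean(x, trim = 0, na.rm = FALSE, ...)
    trim gives a trimmed mean and takes a value between 0 and 0.5. For example, 0.10 discards the largest 10% and the smallest 10% of the data before the arithmetic mean is computed. The default is 0.
    na.rm is a logical value that says whether NA values are removed before the computation.
    sd(x, na.rm = FALSE)    # standard deviation
    install.packages()    # installs a package
    library()    # loads (attaches) a package
    search()    # shows the search path, i.e. which packages are currently attached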
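
A short sketch of default arguments and of what trim and na.rm do. The wrapper center is a hypothetical function written only for this example:

```r
# Hypothetical wrapper whose defaults mirror those of mean()
center <- function(x, trim = 0, na.rm = FALSE) {
  mean(x, trim = trim, na.rm = na.rm)
}

values <- c(1, 2, 3, 4, 100, NA)

plain <- center(values)                              # NA, because na.rm defaults to FALSE
no_na <- center(values, na.rm = TRUE)                # mean of 1, 2, 3, 4, 100
trimmed <- center(values, trim = 0.2, na.rm = TRUE)  # drops the lowest and highest value first

print(args(center))  # argument names and defaults
print(c(plain, no_na, trimmed))
```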

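    • The apply family

    lapply(X, FUN, ...)
    lapply(data, function, arguments passed to the function)    # works on lists and vectors and always returns a list
    split_math <- strsplit(pioneers, split = ":")    # splits strings; roughly the inverse of paste
    tolower() toupper()    # change the case
    split_low <- lapply(split_math, tolower)
    names <- lapply(split_low, function(x) { x[1] })    # first element of each split result
    select_el <- function(x, index) {
      x[index]
    }
    names <- lapply(split_low, select_el, index = 1)    # define a function, then pass it to lapply to pick the first element
    sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)    # like lapply, but simplifies the result to a vector or matrix when possible
    identical(A, B)    # tests whether A and B are exactly equal; TRUE if they are
    vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)    # FUN.VALUE is a template stating the type and length of each result of FUN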
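
The three functions side by side, as a sketch. The pioneers vector is assumed sample input in the "NAME:YEAR" format that the notes imply:

```r
pioneers <- c("GAUSS:1777", "BAYES:1702", "PASCAL:1623", "PEARSON:1857")  # assumed sample input

split_math <- strsplit(pioneers, split = ":")
split_low <- lapply(split_math, tolower)

select_el <- function(x, index) x[index]

names_list <- lapply(split_low, select_el, index = 1)  # a list of length 4
names_vec  <- sapply(split_low, select_el, index = 1)  # simplified to a character vector

# vapply checks that every result is a single number
years_vec <- vapply(split_low, function(x) as.numeric(x[2]),
                    FUN.VALUE = numeric(1))

print(names_vec)
print(years_vec)
```

vapply is the stricter choice: if any call returned something other than one numeric value, it would stop with an error instead of silently changing the shape of the result.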

    • Utilities

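    abs()    # absolute value
    round()    # rounding
    rev()    # reverses the order of the elements
    sort(x, decreasing = FALSE)    # sorts a vector in ascending order; decreasing = TRUE sorts in descending order
    unlist()    # flattens a list into a plain vector
    append()    # appends elements to a vector or list
    seq(from, to, by = )    # start, end, and step size
    seq(from, to, length.out = )    # start, end, and number of elements
    seq(along.with = )    # generates the indices of an existing vector
    seq(length.out = )    # generates 1, 2, ..., length.out (step 1, starting at 1)
    rep(x, each = , times = )    # repeats elements (each) or the whole vector (times)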
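
A few of these utilities applied to one small vector, as an illustrative sketch:

```r
x <- c(3, 1, 2)

desc_sorted <- sort(x, decreasing = TRUE)
by_step     <- seq(1, 10, by = 3)          # fixed step size
by_count    <- seq(0, 1, length.out = 5)   # fixed number of elements
indices     <- seq(along.with = x)         # indices of x
whole_twice <- rep(x, times = 2)           # repeat the whole vector
each_twice  <- rep(x, each = 2)            # repeat every element
flat        <- unlist(list(a = 1, b = 2:3))
extended    <- append(x, 4)

print(by_step)
print(whole_twice)
print(each_twice)
print(flat)
```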
    is.*(): Check for the class of an R object.
    as.*(): Convert an R object from one class to another.
    grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
         fixed = FALSE, useBytes = FALSE, invert = FALSE)    # returns the indices of the elements of x that match the pattern
    grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
          fixed = FALSE, useBytes = FALSE)    # returns TRUE or FALSE for each element of x, depending on whether it matches
    sub and gsub replace text in strings: sub replaces only the first match in each string, while gsub replaces all matches.
    sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)
    gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)    # replacement is the text that is put in place of each match
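
A quick sketch contrasting grep with grepl and sub with gsub. The strings are invented for the example:

```r
emails <- c("john.doe@example.com", "not an email", "jane@example.org")  # made-up strings

has_at  <- grepl("@", emails)               # one TRUE/FALSE per element
at_idx  <- grep("@", emails)                # positions of the matching elements
matches <- grep("@", emails, value = TRUE)  # the matching elements themselves

first_only <- sub("e", "E", "eleven")   # replaces the first match only
all_of_it  <- gsub("e", "E", "eleven")  # replaces every match

print(has_at)
print(at_idx)
print(c(first_only, all_of_it))
```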
    today <- Sys.Date()    # current date
    now <- Sys.time()    # current date and time
    unclass()    # strips the class attribute, e.g. shows the number stored behind a Date
    %Y: 4-digit year (1982)
    %y: 2-digit year (82)
    %m: 2-digit month (01)
    %d: 2-digit day of the month (13)
    %A: weekday (Wednesday)
    %a: abbreviated weekday (Wed)
    %B: month (January)
    %b: abbreviated month (Jan)
    format()    # renders a date or time as text in the given format
    as.Date()
    as.Date(ISOdate(year, month, day))    # convert to a Date object
    %H: hours as a decimal number (00-23)
    %I: hours as a decimal number (01-12)
    %M: minutes as a decimal number
    %S: seconds as a decimal number
    %T: shorthand notation for the typical format %H:%M:%S
    %p: AM/PM indicator
    as.POSIXct()
    diff()    # differences between adjacent elements
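
The format codes above in use, as a small sketch. The dates are arbitrary examples; weekday and month names (%A, %B) depend on the locale, so they are left out here:

```r
d <- as.Date("1982-01-13")

as_text <- format(d, "%d/%m/%Y")                         # Date to text
parsed  <- as.Date("13.01.1982", format = "%d.%m.%Y")    # text to Date
days_since_epoch <- unclass(d)                           # days since 1970-01-01

# diff() on dates gives the gap between adjacent elements
gap <- diff(as.Date(c("2018-07-07", "2018-07-17")))

stamp <- as.POSIXct("2018-07-17 09:51:00", tz = "UTC")
time_label <- format(stamp, "%H:%M")

print(as_text)
print(days_since_epoch)
print(gap)
print(time_label)
```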

    Intermediate R - Practice

    hist()    # draws a histogram
    boxplot()    # draws a box plot

    Introduction to the Tidyverse

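    • Data wrangling

    library()    # loads a package
    library(dplyr)    # data-manipulation package: filter, arrange, mutate, summarise, and table joins
    library(gapminder)    # example data from the Gapminder project
    gapminder %>% filter(year == 1957)    # dplyr's filter() keeps only the rows for 1957
    arrange(hflights_df, DayofMonth, Month, Year)    # dplyr's arrange() sorts rows by one or more columns
    arrange(gapminder, lifeExp)    # ascending
    arrange(gapminder, desc(lifeExp))    # descending
    gapminder %>%
      filter(year == 1957) %>%
      arrange(desc(pop))
    With the pipe (%>%), the result on the left is passed as the first argument of the next call, so gapminder does not have to be repeated in each step.
    mutate() computes on existing columns and adds the result as a new column:
    mutate(gapminder, month = 12 * lifeExp)    # life expectancy in months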
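
The same pipeline on a tiny stand-in table, so it runs without the gapminder package. The data frame countries and its values are invented; only dplyr is assumed to be installed:

```r
library(dplyr)

# Made-up stand-in for the gapminder data
countries <- data.frame(country = c("A", "B", "C", "D"),
                        year    = c(1957, 1957, 2007, 1957),
                        pop     = c(5e6, 9e6, 7e6, 2e6),
                        lifeExp = c(50, 60, 70, 40))

result <- countries %>%
  filter(year == 1957) %>%                  # keep the 1957 rows
  arrange(desc(pop)) %>%                    # largest population first
  mutate(lifeExpMonths = 12 * lifeExp)      # add a derived column

print(result)
```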

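    • Data visualization

    library(ggplot2)
    ggplot(gapminder_1952, aes(x = pop, y = gdpPercap)) +
      geom_point()    # scatter plot
    When the x or y values are spread over several orders of magnitude, a log scale helps:
    ggplot(gapminder_1952, aes(x = pop, y = lifeExp)) + geom_point() + scale_x_log10()
    ggplot(gapminder_1952, aes(x = pop, y = lifeExp, color = continent, size = gdpPercap)) + scale_x_log10() +
      geom_point()    # scatter plot with color and point size mapped to variables
    facet_wrap(~ continent)    # faceting: added at the end, it splits the plot into one small panel per continent
    expand_limits(y = 0)    # added at the end, it makes the y axis include 0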

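    • Grouping and summarizing

    median(x, na.rm = FALSE, ...)    # computes the median
    summarise(.data, ...)    # collapses the (grouped) data into summary rows; several summaries can be given, separated by commas
    group_by(.data, ..., .add = FALSE)    # groups the data; .add = TRUE adds to the existing grouping (older dplyr versions called this argument add)
    ungroup(x, ...)    # removes the grouping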
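
A minimal sketch of grouping followed by two summaries in one call. The table is made up, and dplyr is assumed to be installed:

```r
library(dplyr)

# Made-up data: two continents with a few countries each
countries <- data.frame(continent = c("X", "X", "Y", "Y", "Y"),
                        lifeExp   = c(50, 60, 70, 80, 90),
                        pop       = c(1, 2, 3, 4, 5))

by_continent <- countries %>%
  group_by(continent) %>%
  summarise(medianLifeExp = median(lifeExp),  # one value per group
            totalPop = sum(pop)) %>%
  ungroup()

print(by_continent)
```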

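    • Types of visualizations

    geom_point()    # scatter plot
    geom_line()    # line plot
    geom_col()    # bar chart (bar heights are taken from the y values)
    geom_histogram()    # histogram; only the x aesthetic is needed because the counts are computed
    geom_boxplot()    # box plot
    labs(title = "Comparing GDP per capita across continents")    # added at the end, it sets the title; likewise x = "x" sets the x-axis label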

    Importing Data in R (Part 1)

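    pools <- read.csv("swimming_pools.csv")
    pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)    # stringsAsFactors is an argument of base R's read.csv, not of readr's read_csv
    read.delim("hotdogs.txt")    # header = TRUE by default, so the first line is used as column names
    read_tsv(...., col_names = c(.....))    # col_names sets the name of each column
    read_tsv also accepts skip = number of lines to skip before reading and n_max = maximum number of rows to read. For example, to read only lines 2 and 3 of a file (with col_names supplied, so that no line is used up as a header), use skip = 1, n_max = 2.
    Also in read_tsv, col_types = "cdil_" gives the column types: c = character, d = double, i = integer, l = logical, _ = skip the column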

    path <- file.path("data", "hotdogs.txt")
    hotdogs <- read.table(path,
                          sep = "",    # "" means any whitespace
                          col.names = c("type", "calories", "sodium"))
    head(hotdogs)
    which.min(x)    # returns the position (index) of the minimum
    tom <- hotdogs[which.max(hotdogs$sodium), ]    # the row with the highest sodium value

    hotdogs2 <- read.delim("hotdogs.txt", header = FALSE,
                           col.names = c("type", "calories", "sodium"),
                           colClasses = c("factor", "NULL", "numeric"))    # colClasses sets column types: "NULL" skips a column, NA lets R guess the type
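
To try colClasses and which.max without the course files, the sketch below first writes a small tab-separated file to a temporary location. The three rows are illustrative values only:

```r
# Write a tiny tab-separated file (illustrative rows, no header line)
path <- tempfile(fileext = ".txt")
writeLines(c("Beef\t186\t495",
             "Meat\t181\t477",
             "Poultry\t102\t396"), path)

hotdogs <- read.delim(path, header = FALSE,
                      col.names = c("type", "calories", "sodium"),
                      colClasses = c("factor", "NULL", "numeric"))  # "NULL" drops calories

saltiest <- hotdogs[which.max(hotdogs$sodium), ]  # row with the highest sodium
lightest <- hotdogs[which.min(hotdogs$sodium), ]  # row with the lowest sodium

str(hotdogs)
print(saltiest)
```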

    potatoes <- read_delim("potatoes.txt", delim = "\t", col_names = properties)

    library(data.table)
    read.table()    # reads a file into a data frame
    fread()    # similar to read.table, but faster and more convenient
    potatoes <- fread("potatoes.csv", select = c(6, 8))    # import only columns 6 and 8
    plot(potatoes$texture, potatoes$moistness)    # scatter plot

    • Importing Excel data

    library(readxl)    # reads .xls and .xlsx files
    excel_sheets("urbanpop.xlsx")
    data <- read_excel("data.xlsx", sheet = "my_sheet")
    my_workbook <- lapply(excel_sheets("data.xlsx"),
                          read_excel,
                          path = "data.xlsx")    # reads every sheet into a list
    pop_a <- read_excel("urbanpop_nonames.xlsx", col_names = FALSE)
    cols <- c("country", paste0("year_", 1960:1966))
    pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)    # import an Excel file without a header row and supply the column names
    urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, col_names = FALSE, skip = 21)
    head(urban_pop, n = 11)    # first 11 rows; general form head(..., n = )
    path <- "urbanpop.xls"
    urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)    # read.xls() comes from the gdata package
    na.fail(object, ...)    # returns the object only if it has no missing values, otherwise raises an error
    na.omit(object, ...)    # drops the rows that contain missing values
    na.exclude(object, ...)    # like na.omit(), but residuals and predictions of a fitted model are padded back to the original length
    na.pass(object, ...)    # returns the object unchanged
    excel_sheets("urbanpop.xlsx")    # lists the sheet names of the workbook
    pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)    # imports one sheet

    • Reproducible Excel work with XLConnect

    library(XLConnect)
    loadWorkbook(filename, create = FALSE, password = NULL)    # opens an Excel workbook (create = TRUE creates it if it does not exist)
    my_book <- loadWorkbook("urbanpop.xlsx")
    getSheets(my_book)    # lists the sheets in my_book
    readWorksheet(my_book, sheet = 2)    # reads a sheet from the workbook
    createSheet(object, name)    # creates a new sheet
    writeWorksheet(object, data, sheet, startRow, startCol, header, rownames)    # writes data into a sheet
    saveWorkbook(object, file)    # saves the workbook to the associated Excel file
    renameSheet(object, sheet, newName)    # renames a sheet
    removeSheet(my_book, sheet = 4)    # removes a sheet

    Importing Data in R

    • Importing data from databases (Part 1)

    library(DBI)
    con <- dbConnect(RMySQL::MySQL(),
                     dbname = "tweater",
                     host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                     port = 3306,
                     user = "student",
                     password = "datacamp")
    table_names <- dbListTables(con)    # names of the tables, as a character vector
    dbReadTable(conn, name, ...)
    dbWriteTable(conn, name, value, ...)
    tables <- lapply(table_names, dbReadTable, conn = con)    # import all tables
    dbGetQuery(con, "SELECT age FROM people WHERE gender = 'male'")    # the result comes back as a data frame
    CHAR_LENGTH(name)    # SQL function: the number of characters in name
    res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")    # sends the query without fetching the result yet
    dbFetch(res, n = 1, ...)    # fetches the next n rows and returns them as a data frame
    dbClearResult(res, ...)    # frees the result set
    dbConnect(drv, ...)
    dbDisconnect(conn, ...)

    • Importing data from the web (Part 1)

    read.csv
    library(readr)
    read_csv
    library(gdata)
    read.xls()
    download.file(url_xls, destfile = "local_latitude.xls")    # downloads the xls file at the URL to a local file

    library(httr)
    url <- "http://www.example.com/"
    resp <- GET(url)    # requests the URL and stores the response in resp
    raw_content <- content(resp, as = "raw")    # extracts the body of resp; as sets the form: "raw", "text", or "parsed"

    library(jsonlite)
    wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
    wine <- fromJSON(wine_json)
    str(wine)
    sw4 <- fromJSON(url_sw4)
    sw4$Title

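    json1 <- '[1, 2, 3, 4, 5, 6]'
    fromJSON(json1)    # a JSON array of numbers becomes a vector
    json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
    fromJSON(json2)    # a JSON object becomes a named list with the elements a and b
    json1 <- '[[1, 2], [3, 4]]'
    fromJSON(json1)    # an array of equal-length arrays becomes a 2-dimensional matrix
    json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
    fromJSON(json2)    # an array of objects becomes a data frame with rows and columns

    water_json <- toJSON(water)    # toJSON() converts an R object such as a data frame to JSON
    JSON can be written in a pretty (indented) or a minified format.
    toJSON(water, pretty = TRUE) produces the pretty format; without pretty = TRUE the output is minified.
    Text that is already JSON can be converted between the two formats with prettify() and minify().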
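
The four conversion rules above can be checked directly. This is a sketch that assumes the jsonlite package is installed; the variable names are made up:

```r
library(jsonlite)

v <- fromJSON('[1, 2, 3, 4, 5, 6]')                 # array of numbers
l <- fromJSON('{"a": [1, 2, 3], "b": [4, 5, 6]}')   # object
m <- fromJSON('[[1, 2], [3, 4]]')                   # array of arrays
d <- fromJSON('[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]')  # array of objects

# Convert the data frame back to JSON and read it in again
pretty_json <- toJSON(d, pretty = TRUE)
round_trip <- fromJSON(pretty_json)

str(l)
print(m)
print(d)
print(pretty_json)
```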

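    • Importing data from statistical software packages

    library(haven)
    The haven package can load:
    SAS: read_sas()
    STATA: read_dta(), read_stata()
    SPSS: read_sav(), read_por()

    traits <- read_sav("person.sav")
    summary(traits)    # shows, for each variable, the number of missing values (NA), the minimum, the maximum, and so on
    subset(traits, Extroversion > 40 & Agreeableness > 40)    # selects a subset of the rows by condition
    as_factor(work$GENDER)    # converts this column to a factor
    tail(florida, n = 6)    # shows the last six rows

    demo <- read.spss("international.sav", to.data.frame = TRUE)    # read.spss() (from the foreign package) reads an SPSS file; to.data.frame = TRUE returns a data frame
    boxplot(demo$gdp)    # draws a box plot

    cor(size$height, size$width)    # computes the correlation
    demo_2 <- read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)    # here the value labels of the variables are not converted to R factors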

    Cleaning Data in R

    • Introduction and exploring raw data

    dim()    # shows how many rows and columns a data frame has (its dimensions)

    • Tidying data

    library(tidyr)
    gather(data, key, value, -col, na.rm = FALSE, convert = FALSE)    # turns wide data into long data: key is the name of the new column that holds the former column names, value is the name of the new column that holds their values, and -col keeps the identifier column col out of the gathering
    spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)    # turns long data into wide data
    bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")    # splits one column into two
    bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")    # joins the columns again; the inverse of separate
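
A round trip through the four functions, as a sketch on invented data and assuming tidyr is installed. In current tidyr, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider():

```r
library(tidyr)

# Made-up wide table: one identifier column and three value columns
wide_df <- data.frame(col = c("X", "Y"), A = c(1, 4), B = c(2, 5), C = c(3, 6))

long_df <- gather(wide_df, my_key, my_val, -col)  # wide to long
back    <- spread(long_df, my_key, my_val)        # long to wide again

# unite() and separate() are inverses of each other
codes  <- data.frame(Country = c("A", "B"), ISO = c("AA", "BB"))  # placeholder values
joined <- unite(codes, Country_ISO, Country, ISO, sep = "/")
parts  <- separate(joined, col = Country_ISO, into = c("Country", "ISO"), sep = "/")

print(long_df)
print(back)
print(joined)
```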

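    • Preparing data for analysis

    Types of variables in R: "character", "numeric", "integer" (e.g. class(99L)), "factor" (e.g. class(factor("factor"))), "logical" (TRUE/FALSE)

    library(lubridate)    # converts dates stored as character strings into date objects, for example:
    mdy_hm("July 15, 2012 12:56")    # month, day, year, hour, minute: the function name follows the order of the parts in the raw data

    library(stringr)
    str_trim(c(" Filip ", "Nick ", " Jonathan"))    # str_trim() removes leading and trailing whitespace
    str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")    # pads with leading zeros to width 9, restoring leading zeros that were lost

    toupper()    # converts to upper case
    tolower()    # converts to lower case

    str_detect(c("banana", "kiwi"), "a")    # checks for each string whether it contains "a"
    str_replace(c("banana", "kiwi"), "a", "o")    # replaces "a" with "o" in each string, but only the first "a"
    str_replace_all(c("banana", "kiwi"), "a", "o")    # replaces every "a"
    na.omit(social_df)    # removes the rows of social_df that contain missing values (NA)
    complete.cases()    # returns a logical vector that is TRUE for each row without missing values
    hist()    # draws a histogram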
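
The stringr calls above with their results captured, as a sketch that assumes stringr is installed. The inputs are the ones used in these notes:

```r
library(stringr)

trimmed  <- str_trim(c("  Filip ", "Nick  ", " Jonathan"))
padded   <- str_pad(c("23485W", "8823453Q", "994Z"),
                    width = 9, side = "left", pad = "0")
detected <- str_detect(c("banana", "kiwi"), "a")
first    <- str_replace(c("banana", "kiwi"), "a", "o")      # first match only
every    <- str_replace_all(c("banana", "kiwi"), "a", "o")  # all matches

print(trimmed)
print(padded)
print(detected)
print(first)
print(every)
```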
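
A short base R sketch of the missing-value helpers. The contents of social_df are invented for the example:

```r
# Hypothetical data with two missing values
social_df <- data.frame(name = c("Sarah", "Tom", "David"),
                        n_friends = c(244, NA, 145),
                        status = c("Going out!", "", NA),
                        stringsAsFactors = FALSE)

n_missing <- sum(is.na(social_df))          # total number of NA values
complete  <- complete.cases(social_df)      # TRUE for rows without any NA
cleaned   <- na.omit(social_df)             # drops the incomplete rows
na_index  <- which(is.na(social_df$n_friends))  # positions of NA in one column

print(complete)
print(cleaned)
```

An empty string such as "" is not counted as missing, so values like that have to be recoded to NA first if they should be treated the same way.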

    • Putting it all together

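    class(weather)    # class of the object
    dim(weather)    # dimensions: number of rows and columns
    names(weather)    # column names
    str(weather)    # structure of the data
    library(dplyr)
    glimpse(weather)    # another way to view the structure
    summary(weather)    # summary statistics for each column
    as.character()
    as.numeric()    # convert a variable to a different type

    sum(is.na(weather6))    # number of missing values in weather6
    summary(weather6)    # shows how the missing values are distributed across the columns
    ind <- which(is.na(weather6$Max.Gust.SpeedMPH))    # positions (indices) of the missing values in the given column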


Article link: https://www.haomeiwen.com/subject/mbfwuftx.html