Getting and Cleaning Data - Week

Author: 富士山下裸奔 | Published: 2018-04-12 11:41

    Characteristics of a tidy dataset:

    • Each variable forms a column
    • Each observation forms a row
    • Each type of observational unit forms a table

    Characteristics of untidy datasets:

    Column headers are values, not variable names:
    the dataset actually has three variables: religion, income and frequency.
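
A minimal sketch of this first case, assuming a toy table whose income brackets sit in the column headers. The data frame `pew`, its column names and all numbers are made up for illustration. `gather()` from tidyr is the same function used in the practice section of these notes.

```r
library(tidyr)

# Hypothetical toy data: one row per religion, one column per income bracket.
# The counts are invented for illustration only.
pew <- data.frame(
  religion = c("Agnostic", "Atheist"),
  under_10k = c(27, 12),
  from_10k_to_20k = c(34, 27),
  stringsAsFactors = FALSE
)

# Move the bracket names into an `income` column and the cell values
# into a `frequency` column; `religion` stays as the identifier.
pew_tidy <- gather(pew, key = income, value = frequency, -religion)
print(pew_tidy)
```

After this step every row is one religion/income combination, which matches the three variables named above.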

    Multiple variables are stored in one column:
    demographic groups are broken down by sex (m, f) and age (0-14, 15-25, 25-34, 35-44, 45-54, 55-64).
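
A small sketch of how such a combined column might be split, assuming the group is encoded as a sex letter followed by an age code (for example "m014"). The data frame `tb`, the column name `group` and the case counts are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: `group` mixes sex (first character) and age range.
tb <- data.frame(
  country = c("AD", "AD", "AE", "AE"),
  group = c("m014", "f014", "m1524", "f1524"),
  cases = c(0, 1, 4, 3),
  stringsAsFactors = FALSE
)

# A numeric `sep` is read as a character position, so sep = 1 splits
# after the first character: "m014" becomes sex "m" and age "014".
tb_tidy <- separate(tb, group, into = c("sex", "age"), sep = 1)
print(tb_tidy)
```

The `age` column is left as text here; recoding it into readable ranges would be a separate step.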
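    Variables are stored in both rows and columns:
    there are variables in individual columns (id, year, month), spread across columns (day, d1-d31), and across rows (tmin, tmax: minimum and maximum temperature).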
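
One way to untangle this layout is to gather the day columns first and then spread the tmin/tmax rows back out into columns. The sketch below assumes a column named `element` that holds "tmin" or "tmax"; that name, the data frame `weather` and the temperatures are assumptions made for illustration, and only two day columns are shown.

```r
library(tidyr)

# Hypothetical toy data: days are columns (d1, d2), tmin/tmax are rows.
weather <- data.frame(
  id = "S1", year = 2010, month = 1,
  element = c("tmax", "tmin"),
  d1 = c(27.8, 14.5),
  d2 = c(27.3, 14.4),
  stringsAsFactors = FALSE
)

# Step 1: collapse the day columns into a single `day` variable.
long <- gather(weather, key = day, value = temp, d1:d2)
long$day <- as.integer(sub("d", "", long$day))

# Step 2: turn the tmin/tmax rows into two proper columns.
weather_tidy <- spread(long, key = element, value = temp)
print(weather_tidy)
```

Each row of `weather_tidy` is then one station-day with both temperatures side by side.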
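    Different types of observational units are stored in the same table:
    the billboard dataset actually contains observations on two types of observational units: the song and its rank in each week. As a result, artist, year and time are repeated many times. The dataset needs to be split into two parts: a song dataset that stores the artist, song name and time, and a ranking dataset that gives the rank of each song in each week.
    A single observational unit is stored in multiple tables: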
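
The billboard split described above (one song table, one ranking table) could be sketched like this. The toy data, the column names `track`, `week` and `rank`, and the surrogate key `song_id` are all assumptions made for illustration, not the layout of the real dataset.

```r
library(dplyr)

# Hypothetical toy data: song details repeat on every weekly row.
billboard <- data.frame(
  artist = c("A", "A", "B", "B"),
  track = c("Song 1", "Song 1", "Song 2", "Song 2"),
  time = c("3:30", "3:30", "4:05", "4:05"),
  week = c(1, 2, 1, 2),
  rank = c(87, 82, 91, 87),
  stringsAsFactors = FALSE
)

# Song table: one row per song, with an assumed surrogate key.
songs <- billboard %>%
  distinct(artist, track, time) %>%
  mutate(song_id = row_number())

# Ranking table: one row per song per week, referring to the song by key.
ranks <- billboard %>%
  inner_join(songs, by = c("artist", "track", "time")) %>%
  select(song_id, week, rank)

print(songs)
print(ranks)
```

Joining `ranks` back to `songs` on `song_id` recovers the original combined view when it is needed.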

    PRACTICE

    tidy data
    # Approach
    # 1. select() all columns that do NOT contain the word "total",
    # since if we have the male and female data, we can always
    # recreate the total count in a separate column, if we want it.
    # Hint: Use the contains() function, which you'll
    # find detailed in 'Special functions' section of ?select.
    #
    # 2. gather() all columns EXCEPT score_range, using
    # key = part_sex and value = count.
    #
    # 3. separate() part_sex into two separate variables (columns),
    # called "part" and "sex", respectively. You may need to check
    # the 'Examples' section of ?separate to remember how the 'into'
    # argument should be phrased.
    #
    
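    library(dplyr)
    library(tidyr)

    # sat[2:11] keeps columns 2 through 11 of the original data.
    sat1 <- sat[2:11] %>%
      select(-contains("total")) %>%             # step 1: drop the total columns
      gather(part_sex, count, -score_range) %>%  # step 2: wide to long
      separate(part_sex, c("part", "sex")) %>%   # step 3: split into part and sex
      group_by(part, sex) %>%
      mutate(total = sum(count),                 # total count per part/sex group
             prop = count / total) %>%           # share of that total per score_range
      print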
    

    cleaned data

    Result after processing

    Week 3 Quiz

    Code


    Article link: https://www.haomeiwen.com/subject/lrcxkftx.html