Characteristics of a tidy dataset:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Characteristics of untidy datasets:
• Column headers are values, not variable names. The variables in this example are religion, income and frequency.
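A minimal sketch of how this kind of table could be tidied with `gather()`, the same tidyr verb used in the practice section. The data frame `pew` and all of its values are made up for illustration; only the variable names religion, income and frequency come from the text.

```r
library(tidyr)

# Hypothetical table: income brackets appear as column headers.
# All counts are invented for illustration only.
pew <- data.frame(
  religion = c("A", "B"),
  `<$10k` = c(10, 20),
  `$10-20k` = c(30, 40),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Move the column headers into an `income` column and the cell
# values into a `frequency` column, keeping `religion` as is.
pew_tidy <- gather(pew, income, frequency, -religion)
print(pew_tidy)
```

After gathering, each row holds one religion and income combination with its frequency, so every variable has its own column.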
The demographic groups are broken down by sex (m, f) and age (0-14, 15-25, 25-34, 35-44, 45-54, 55-64), so each group label combines two variables.
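One way to pull the two variables apart is sketched below. The column names such as `m014` (sex followed by age range) and every value are assumptions made for illustration, since the original table is not reproduced here.

```r
library(tidyr)

# Hypothetical table: each header encodes sex and age group together.
# Column names and counts are invented for illustration only.
tb <- data.frame(
  country = c("AA", "BB"),
  year = c(2000, 2000),
  m014 = c(1, 2),
  f014 = c(3, 4),
  m1524 = c(5, 6),
  stringsAsFactors = FALSE
)

# Step 1: gather the demographic columns into key/value pairs.
tb_long <- gather(tb, demo, cases, -country, -year)

# Step 2: split the key after its first character,
# giving sex ("m"/"f") and the age code ("014", "1524").
tb_tidy <- separate(tb_long, demo, c("sex", "age"), sep = 1)
print(tb_tidy)
```

Passing a number to `sep` makes `separate()` split at a character position rather than on a delimiter, which suits headers that have no separator between the two parts.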
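This dataset has variables in individual columns (id, year, month), spread across columns (day, d1-d31), and spread across rows (tmin, tmax, the minimum and maximum temperature).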
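A rough sketch of the two-step fix for this layout: gather the day columns into rows, then spread the tmin/tmax rows into columns. The column name `element` and all values are hypothetical, and only two day columns are used to keep the example short.

```r
library(tidyr)

# Hypothetical weather table: days are columns, tmin/tmax are rows.
# The `element` column name and all temperatures are invented.
weather <- data.frame(
  id = c("S1", "S1"),
  year = c(2010, 2010),
  month = c(1, 1),
  element = c("tmax", "tmin"),
  d1 = c(20, 10),
  d2 = c(22, 12),
  stringsAsFactors = FALSE
)

# Step 1: gather d1, d2, ... into a single `day` column.
weather_long <- gather(weather, day, value, d1:d2)
weather_long$day <- as.integer(sub("d", "", weather_long$day))

# Step 2: spread the element rows into tmax and tmin columns.
weather_tidy <- spread(weather_long, element, value)
print(weather_tidy)
```

The result has one row per station and day, with tmax and tmin as separate columns.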
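The billboard dataset actually contains observations on two types of observational units: the song and its rank in each week. As a result, artist, year and time are repeated many times. The dataset needs to be split into two pieces: a song dataset that stores artist, song name and time, and a ranking dataset that gives the rank of each song in each week.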
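The split described above could look roughly like this with dplyr. The data frame, the column names `track`, `week` and `rank`, and the helper key `song_id` are all hypothetical and only serve to illustrate the idea.

```r
library(dplyr)

# Hypothetical combined table: song facts repeat on every weekly row.
# All names and values are invented for illustration only.
billboard <- data.frame(
  artist = c("X", "X", "Y"),
  year = c(2000, 2000, 2000),
  track = c("Song 1", "Song 1", "Song 2"),
  time = c("3:30", "3:30", "4:05"),
  week = c(1, 2, 1),
  rank = c(80, 75, 90),
  stringsAsFactors = FALSE
)

# Song table: one row per song, with a generated key.
songs <- billboard %>%
  distinct(artist, year, track, time) %>%
  mutate(song_id = row_number())

# Ranking table: one row per song and week, linked by song_id.
ranks <- billboard %>%
  inner_join(songs, by = c("artist", "year", "track", "time")) %>%
  select(song_id, week, rank)

print(songs)
print(ranks)
```

Storing the song facts once and referring to them by key removes the repetition while keeping the two tables joinable.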
PRACTICE
- data: sat.csv
- resource: The 2013 SAT Report on College & Career Readiness
# Solution approach
# 1. select() all columns that do NOT contain the word "total",
# since if we have the male and female data, we can always
# recreate the total count in a separate column, if we want it.
# Hint: Use the contains() function, which you'll
# find detailed in 'Special functions' section of ?select.
#
# 2. gather() all columns EXCEPT score_range, using
# key = part_sex and value = count.
#
# 3. separate() part_sex into two separate variables (columns),
# called "part" and "sex", respectively. You may need to check
# the 'Examples' section of ?separate to remember how the 'into'
# argument should be phrased.
#
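library(dplyr)
library(tidyr)

sat1 <- sat[2:11] %>%
  select(-contains("total")) %>%
  gather(part_sex, count, -score_range) %>%
  separate(part_sex, c("part", "sex")) %>%
  # Extra step: within each (part, sex) group, compute the group total
  # and each score range's share of that total.
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop = count / total) %>%
  print()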
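Since sat.csv is not reproduced here, the same pipeline can be tried on a small stand-in. The sketch below assumes column names of the form `read_male`, `read_fem` and `read_total`, and all counts are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for sat.csv; names and counts are invented.
sat_toy <- data.frame(
  score_range = c("700-800", "600-690"),
  read_male = c(10, 30),
  read_fem = c(20, 20),
  read_total = c(30, 50),
  stringsAsFactors = FALSE
)

sat_tidy <- sat_toy %>%
  select(-contains("total")) %>%                 # drop derived totals
  gather(part_sex, count, -score_range) %>%      # wide to long
  separate(part_sex, c("part", "sex")) %>%       # "read_male" -> "read", "male"
  group_by(part, sex) %>%
  mutate(total = sum(count),
         prop = count / total) %>%
  ungroup()

print(sat_tidy)
```

Within each part and sex group the `prop` values sum to 1, which is a quick sanity check on the result.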
Week 3 Quiz