DataCamp Course <Learning to Use and Manipulate Time Data> Chapter

Author: Jason数据分析生信教室 | Published 2021-07-26 15:02

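Course outline: Learning to Use and Manipulate Time Data

Chapter 1. Dates and times in R
Chapter 2. Parsing and manipulating date-times
Chapter 3. Arithmetic with date-times
Chapter 4. Problems in practice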

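Chapter 1. Dates and times in R

Specifying dates

Dates have attributes that other kinds of data do not. But R will not decide that a value is a date just because you typed something like "2021-07-26"; as far as R can tell, it could just as well be a character string or a factor. So you have to tell R explicitly that the value is a date, and as.Date() is the function for that.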

# The date R 3.0.0 was released
x <- "2013-04-03"
# Examine structure of x
str(x)
 chr "2013-04-03"
# Use as.Date() to interpret x as a date
x_date <- as.Date(x)
# Examine structure of x_date
str(x_date)
 Date[1:1], format: "2013-04-03"
# Store April 10 2014 as a Date
april_10_2014 <- as.Date("2014-04-10")
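
By default as.Date() only reads strings laid out like YYYY-MM-DD (or YYYY/MM/DD). For any other layout it needs a format argument. A minimal sketch, using a made-up day/month/year string that is not part of the course material:

```r
# Hypothetical input written day/month/year; not from the course data
x_dmy <- "03/04/2013"

# Tell as.Date() how to read it: %d = day, %m = month, %Y = four-digit year
x_dmy_date <- as.Date(x_dmy, format = "%d/%m/%Y")
str(x_dmy_date)

# Check that it is the same calendar day as the ISO 8601 string "2013-04-03"
x_dmy_date == as.Date("2013-04-03")
```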

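Importing dates automatically

Two packages are very handy here. The first is readr, which recognizes date columns automatically on import.
Read the file with read_csv() first, then check the structure of the result with str().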

# Load readr, which provides read_csv()
library(readr)

# Use read_csv() to import rversions.csv
releases <- read_csv("rversions.csv")

# Examine the structure of the date column
str(releases$date)
 Date[1:105], format: "1997-12-04" "1997-12-21" "1998-01-10" "1998-03-14" "1998-05-02" ...
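
For comparison, base R's read.csv() does not do this conversion for you. The sketch below uses a tiny inline CSV (two rows made up for illustration, not rversions.csv) to show that the column arrives as character and has to be converted by hand:

```r
# Two made-up rows standing in for a CSV file
csv_text <- "version,date\n3.0.0,2013-04-03\n3.1.0,2014-04-10"

# read.csv() leaves the date column as character
releases_base <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(releases_base$date)

# Convert it explicitly with as.Date()
releases_base$date <- as.Date(releases_base$date)
str(releases_base$date)
```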

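The second is the anytime package, which parses dates and datetimes written in many different formats without needing a format string.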

# Load the anytime package
library(anytime)
Warning message: running command 'timedatectl' had status 1
# Various ways of writing Sep 10 2009
sep_10_2009 <- c("September 10 2009", "2009-09-10", "10 Sep 2009", "09-10-2009")
# Use anytime() to parse sep_10_2009
anytime(sep_10_2009)
[1] "2009-09-10 UTC" "2009-09-10 UTC" "2009-09-10 UTC" "2009-09-10 UTC"
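
Note that "09-10-2009" is ambiguous: it could be September 10 or October 9, and any parser that guesses has to pick one. When you already know the layout, spelling it out with as.Date() removes the guess. A small base R sketch under that assumption:

```r
# The same string read under two different assumptions about field order
ambiguous <- "09-10-2009"

month_first <- as.Date(ambiguous, format = "%m-%d-%Y")  # read as September 10
day_first   <- as.Date(ambiguous, format = "%d-%m-%Y")  # read as October 9

month_first
day_first

# Gap between the two readings
day_first - month_first
```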

Visualizing dates

Color the line by major version, then set the date range on the x axis to plot the dates.

library(ggplot2)

# Set the x axis to the date column
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major)))

# Limit the axis to between 2010-01-01 and 2014-01-01
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  xlim(as.Date("2010-01-01"), as.Date("2014-01-01"))

# Specify breaks every ten years and labels with "%Y"
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")
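
The date_labels argument uses the same format codes as format() and strftime(), so you can try a label format on a single date before putting it in a plot. A quick sketch:

```r
d <- as.Date("2013-04-03")

format(d, "%Y")        # four-digit year, the code used for the axis labels
format(d, "%Y-%m")     # year and month
format(d, "%d/%m/%y")  # day/month/two-digit year
```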

Simple arithmetic with dates

Take the maximum of the date column in releases, which is the most recent release, then compute how long ago that release was.

# Find the largest date
last_release_date <- max(releases$date)

# Filter row for last release (filter() comes from dplyr)
library(dplyr)
last_release <- filter(releases, date == last_release_date)

# Print last_release
last_release

# How long since last release?
Sys.Date() - last_release$date
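
Subtracting two Date objects gives a difftime measured in days. If you want other units, or a plain number, difftime() and as.numeric() help. A sketch using two fixed dates so the result does not depend on today's date:

```r
r_3_0_0       <- as.Date("2013-04-03")
april_10_2014 <- as.Date("2014-04-10")

# Difference as a difftime object
gap <- april_10_2014 - r_3_0_0
gap

# Same gap expressed in weeks
gap_weeks <- difftime(april_10_2014, r_3_0_0, units = "weeks")
gap_weeks

# Drop the difftime class to get a bare number of days
gap_days <- as.numeric(gap, units = "days")
gap_days
```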

Datetimes

Dates use as.Date(); for datetimes (a date plus a time of day) you need as.POSIXct().
The expected format is YYYY-MM-DD HH:MM:SS. You can also set the timezone with the tz argument.

# Use as.POSIXct to enter the datetime 
as.POSIXct("2010-10-01 12:12:00")

# Use as.POSIXct again but set the timezone to `"America/Los_Angeles"`
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")

# Use read_csv to import rversions.csv
releases <- read_csv("rversions.csv")

# Examine structure of datetime column
str(releases$datetime)
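
The tz argument changes which instant the string refers to, not just how it prints. A sketch comparing the same clock time in two timezones (this assumes your system has the standard timezone database installed):

```r
# Same wall-clock string, interpreted in two different timezones
t_utc <- as.POSIXct("2010-10-01 12:12:00", tz = "UTC")
t_la  <- as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")

# 12:12 in Los Angeles happens later than 12:12 UTC
offset_hours <- difftime(t_la, t_utc, units = "hours")
offset_hours

# Print the Los Angeles datetime as UTC clock time
la_in_utc <- format(t_la, "%Y-%m-%d %H:%M:%S", tz = "UTC")
la_in_utc
```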

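One more exercise: set a datetime of your own, then select the rows where r_version is 3.2.0 and datetime is later than that point. Finally, plot the distribution of the data.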

# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("cran-logs_2015-04-17.csv")

# Print logs
logs

# Store the release time as a POSIXct object
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# When is the first download of 3.2.0?
logs %>% 
  filter(datetime > release_time,
    r_version == "3.2.0")

# Examine histograms of downloads by version
ggplot(logs, aes(x = datetime)) +
  geom_histogram() +
  geom_vline(aes(xintercept = as.numeric(release_time)))+
  facet_wrap(~ r_version, ncol = 1)
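
The filtering step does not depend on the CRAN log file. Here is the same idea in base R on a tiny data frame; the rows in toy_logs are invented for illustration and are not from cran-logs_2015-04-17.csv:

```r
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# Invented download records (hypothetical data)
toy_logs <- data.frame(
  datetime = as.POSIXct(
    c("2015-04-16 06:00:00", "2015-04-16 08:30:00",
      "2015-04-16 09:15:00", "2015-04-16 10:00:00"),
    tz = "UTC"
  ),
  r_version = c("3.1.3", "3.2.0", "3.1.3", "3.2.0"),
  stringsAsFactors = FALSE
)

# Rows for 3.2.0 that were downloaded after the release time
after_release <- toy_logs[toy_logs$datetime > release_time &
                            toy_logs$r_version == "3.2.0", ]

# Earliest such download
first_download <- min(after_release$datetime)
first_download
```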
