Polish date times
parent
b15eecf8b3
commit
5ad1eca391
210
datetimes.qmd
|
@ -5,6 +5,9 @@
|
|||
#| echo: false
|
||||
source("_common.R")
|
||||
status("polishing")
|
||||
|
||||
# https://github.com/tidyverse/lubridate/issues/1058
|
||||
options(warnPartialMatchArgs = FALSE)
|
||||
```
|
||||
|
||||
## Introduction
|
||||
|
@ -13,15 +16,14 @@ This chapter will show you how to work with dates and times in R.
|
|||
At first glance, dates and times seem simple.
|
||||
You use them all the time in your regular life, and they don't seem to cause much confusion.
|
||||
However, the more you learn about dates and times, the more complicated they seem to get.
|
||||
To warm up, try these three seemingly simple questions:
|
||||
To warm up, think about how many days there are in a year, and how many hours there are in a day.
|
||||
|
||||
- Does every year have 365 days?
|
||||
- Does every day have 24 hours?
|
||||
- Does every minute have 60 seconds?
|
||||
You probably remembered that most years have 365 days, but leap years have 366.
|
||||
Do you know the full rule for determining if a year is a leap year[^datetimes-1]?
|
||||
The number of hours in a day is a little less obvious: most days have 24 hours, but if you use daylight saving time (DST), one day each year has 23 hours and another has 25.
|
||||
|
||||
We're sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year?
|
||||
(It has three parts.) You might have remembered that many parts of the world use daylight savings time (DST), so that some days have 23 hours, and others have 25.
|
||||
You might not have known that some minutes have 61 seconds because every now and then leap seconds are added because the Earth's rotation is gradually slowing down.
|
||||
[^datetimes-1]: A year is a leap year if it's divisible by 4, unless it's also divisible by 100, except if it's also divisible by 400.
|
||||
In other words, in every set of 400 years, there are 97 leap years.
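The three-part rule in the footnote translates directly into code. Here is a minimal base R sketch; `is_leap_year()` is a hypothetical helper written for illustration (lubridate provides its own `leap_year()`), and the 400-year window is an arbitrary choice.

```r
# Hypothetical helper: divisible by 4, unless divisible by 100,
# except if also divisible by 400
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

is_leap_year(c(1900, 2000, 2023, 2024))  # FALSE TRUE FALSE TRUE

# Any run of 400 consecutive years contains 97 leap years
sum(is_leap_year(1601:2000))  # 97
```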
|
||||
|
||||
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST.
|
||||
This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
|
||||
|
@ -34,7 +36,6 @@ We will also need nycflights13 for practice data.
|
|||
|
||||
```{r}
|
||||
#| message: false
|
||||
|
||||
library(tidyverse)
|
||||
|
||||
library(lubridate)
|
||||
|
@ -53,9 +54,9 @@ There are three types of date/time data that refer to an instant in time:
|
|||
|
||||
- A **date-time** is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second).
|
||||
Tibbles print this as `<dttm>`.
|
||||
Elsewhere in R these are called POSIXct, but that's not a very useful name.
|
||||
Base R calls these POSIXct, but that doesn't exactly trip off the tongue.
|
||||
|
||||
In this chapter we are only going to focus on dates and date-times as R doesn't have a native class for storing times.
|
||||
In this chapter we are going to focus on dates and date-times as R doesn't have a native class for storing times.
|
||||
If you need one, you can use the **hms** package.
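As a quick sketch of what that looks like, the example below builds a time of day with hms. It assumes the hms package is installed, and it calls the functions with the `hms::` prefix so that they don't mask lubridate's own `hms()`.

```r
# Build a time of day from components...
t1 <- hms::hms(hours = 12, minutes = 34, seconds = 56)
# ...or from a string
t2 <- hms::as_hms("12:34:56")

t1 == t2  # TRUE

# Underneath, an hms value is a number of seconds since midnight
as.numeric(t1)  # 45296
```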
|
||||
|
||||
You should always use the simplest possible data type that works for your needs.
|
||||
|
@ -93,14 +94,6 @@ mdy("January 31st, 2017")
|
|||
dmy("31-Jan-2017")
|
||||
```
|
||||
|
||||
These functions also take unquoted numbers.
|
||||
This is the most concise way to create a single date/time object, as you might need when filtering date/time data.
|
||||
`ymd()` is short and unambiguous:
|
||||
|
||||
```{r}
|
||||
ymd(20170131)
|
||||
```
|
||||
|
||||
`ymd()` and friends create dates.
|
||||
To create a date-time, add an underscore and one or more of "h", "m", and "s" to the name of the parsing function:
|
||||
|
||||
|
@ -112,7 +105,7 @@ mdy_hm("01/31/2017 08:01")
|
|||
You can also force the creation of a date-time from a date by supplying a timezone:
|
||||
|
||||
```{r}
|
||||
ymd(20170131, tz = "UTC")
|
||||
ymd("2017-01-31", tz = "UTC")
|
||||
```
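If you want to confirm which type you ended up with, checking the class is a simple way to do it. This is a small sketch that only assumes lubridate is attached:

```r
library(lubridate)

d <- ymd("2017-01-31")               # no time zone: a date
dt <- ymd("2017-01-31", tz = "UTC")  # time zone supplied: a date-time

class(d)   # "Date"
class(dt)  # "POSIXct" "POSIXt"
```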
|
||||
|
||||
### From individual components
|
||||
|
@ -155,9 +148,17 @@ flights_dt <- flights |>
|
|||
flights_dt
|
||||
```
|
||||
|
||||
With this data, we can visualise the distribution of departure times across the year:
|
||||
With this data, we can visualize the distribution of departure times across the year:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A frequency polygon with departure time (Jan-Dec 2013) on the x-axis
|
||||
#| and number of flights on the y-axis (0-1000). The frequency polygon
|
||||
#| is binned by day so you see a time series of flights by day. The
|
||||
#| pattern is dominated by a weekly pattern; there are fewer flights
|
||||
#| on weekends. There are a few days that stand out as having surprisingly
|
||||
#| few flights in early February, early July, late November, and late
|
||||
#| December.
|
||||
flights_dt |>
|
||||
ggplot(aes(dep_time)) +
|
||||
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
|
||||
|
@ -166,6 +167,12 @@ flights_dt |>
|
|||
Or within a single day:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A frequency polygon with departure time (6am - midnight Jan 1) on the
|
||||
#| x-axis, number of flights on the y-axis (0-17), binned into 10 minute
|
||||
#| increments. It's hard to see much pattern because of high variability,
|
||||
#| but most bins have 8-12 flights, and there are markedly fewer flights
|
||||
#| before 6am and after 8pm.
|
||||
flights_dt |>
|
||||
filter(dep_time < ymd(20130102)) |>
|
||||
ggplot(aes(dep_time)) +
|
||||
|
@ -227,7 +234,7 @@ The next section will look at how arithmetic works with date-times.
|
|||
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
|
||||
|
||||
```{r}
|
||||
datetime <- ymd_hms("2016-07-08 12:34:56")
|
||||
datetime <- ymd_hms("2026-07-08 12:34:56")
|
||||
|
||||
year(datetime)
|
||||
month(datetime)
|
||||
|
@ -248,6 +255,12 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
We can use `wday()` to see that more flights depart during the week than on the weekend:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A bar chart with days of the week on the x-axis and number of
|
||||
#| flights on the y-axis. Monday-Friday have roughly the same number of
|
||||
#| flights, ~48,000, decreasing slightly over the course of the week.
|
||||
#| Sunday is a little lower (~45,000), and Saturday is much lower
|
||||
#| (~38,000).
|
||||
flights_dt |>
|
||||
mutate(wday = wday(dep_time, label = TRUE)) |>
|
||||
ggplot(aes(x = wday)) +
|
||||
|
@ -258,6 +271,13 @@ There's an interesting pattern if we look at the average departure delay by minu
|
|||
It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A line chart with minute of actual departure (0-60) on the x-axis and
|
||||
#| average delay (4-20) on the y-axis. Average delay starts at (0, 12),
|
||||
#| steadily increases to (18, 20), then sharply drops, hitting a minimum
|
||||
#| at ~23 minutes past the hour and 9 minutes of delay. It then increases
|
||||
#| again to (35, 17), and sharply decreases to (55, 4). It finishes off
|
||||
#| with an increase to (60, 9).
|
||||
flights_dt |>
|
||||
mutate(minute = minute(dep_time)) |>
|
||||
group_by(minute) |>
|
||||
|
@ -271,6 +291,11 @@ flights_dt |>
|
|||
Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A line chart with minute of scheduled departure (0-60) on the x-axis
|
||||
#| and average delay (4-16). There is relatively little pattern, just a
|
||||
#| small suggestion that the average delay decreases from maybe 10 minutes
|
||||
#| to 8 minutes over the course of the hour.
|
||||
sched_dep <- flights_dt |>
|
||||
mutate(minute = minute(sched_dep_time)) |>
|
||||
group_by(minute) |>
|
||||
|
@ -287,6 +312,12 @@ Well, like much data collected by humans, there's a strong bias towards flights
|
|||
Always be alert for this sort of pattern whenever you work with data that involves human judgement!
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A line plot with departure minute (0-60) on the x-axis and number of
|
||||
#| flights (0-60000) on the y-axis. Most flights are scheduled to depart
|
||||
#| on either the hour (~60,000) or the half hour (~35,000). Otherwise,
|
||||
#| almost all flights are scheduled to depart on multiples of five,
|
||||
#| with a few extra at 15, 45, and 55 minutes.
|
||||
ggplot(sched_dep, aes(minute, n)) +
|
||||
geom_line()
|
||||
```
|
||||
|
@ -298,22 +329,55 @@ Each function takes a vector of dates to adjust and then the name of the unit ro
|
|||
This, for example, allows us to plot the number of flights per week:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A line plot with week (Jan-Dec 2013) on the x-axis and number of
|
||||
#| flights (2,000-7,000) on the y-axis. The pattern is fairly flat from
|
||||
#| February to November with around 7,000 flights per week. There are
|
||||
#| far fewer flights on the first (approximately 4,500 flights) and last
|
||||
#| weeks of the year (approximately 2,500 flights).
|
||||
flights_dt |>
|
||||
count(week = floor_date(dep_time, "week")) |>
|
||||
ggplot(aes(week, n)) +
|
||||
geom_line()
|
||||
geom_line() +
|
||||
geom_point()
|
||||
```
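To see what rounding does to a single value, it can help to try it on one date-time before applying it to a whole column. The sketch below uses an arbitrary example timestamp and lubridate's `floor_date()`, `ceiling_date()`, and `round_date()`:

```r
library(lubridate)

x <- ymd_hms("2026-07-08 12:34:56")

floor_date(x, "hour")    # 2026-07-08 12:00:00 UTC
ceiling_date(x, "day")   # 2026-07-09 UTC
round_date(x, "month")   # 2026-07-01 UTC
```

Because every date-time within the same unit collapses to the same rounded value, the result works well as a grouping variable in `count()` or `group_by()`.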
|
||||
|
||||
Computing the difference between a rounded and unrounded date can be particularly useful.
|
||||
|
||||
### Setting components
|
||||
|
||||
You can also use each accessor function to set the components of a date/time:
|
||||
You can use rounding to show the distribution of flights across the course of a day by computing the difference between `dep_time` and the earliest instant of that day:
|
||||
|
||||
```{r}
|
||||
(datetime <- ymd_hms("2016-07-08 12:34:56"))
|
||||
#| fig-alt: >
|
||||
#| A line plot with departure time on the x-axis. This is in units of seconds
|
||||
#| since midnight so it's hard to interpret.
|
||||
flights_dt |>
|
||||
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
```
|
||||
|
||||
year(datetime) <- 2020
|
||||
Computing the difference between a pair of date-times yields a difftime (more on that in @sec-intervals). We can convert that to an `hms` object to get a more useful x-axis:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A line plot with departure time (midnight to midnight) on the x-axis
|
||||
#| and number of flights on the y-axis (0 to 15,000). There are very few
|
||||
#| (<100) flights before 5am. The number of flights then rises rapidly
|
||||
#| to 12,000 / hour, peaking at 15,000 at 9am, before falling to around
|
||||
#| 8,000 / hour for 10am to 2pm. Number of flights then increases to
|
||||
#| around 12,000 per hour until 8pm, when they rapidly drop again.
|
||||
flights_dt |>
|
||||
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
```
|
||||
|
||||
### Modifying components
|
||||
|
||||
You can also use each accessor function to modify the components of a date/time:
|
||||
|
||||
```{r}
|
||||
(datetime <- ymd_hms("2026-07-08 12:34:56"))
|
||||
|
||||
year(datetime) <- 2030
|
||||
datetime
|
||||
month(datetime) <- 01
|
||||
datetime
|
||||
|
@ -321,33 +385,20 @@ hour(datetime) <- hour(datetime) + 1
|
|||
datetime
|
||||
```
|
||||
|
||||
Alternatively, rather than modifying in place, you can create a new date-time with `update()`.
|
||||
This also allows you to set multiple values at once.
|
||||
Alternatively, rather than modifying an existing variable, you can create a new date-time with `update()`.
|
||||
This also allows you to set multiple values in one step:
|
||||
|
||||
```{r}
|
||||
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
|
||||
update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
|
||||
```
|
||||
|
||||
If values are too big, they will roll over:
|
||||
|
||||
```{r}
|
||||
ymd("2015-02-01") |>
|
||||
update(mday = 30)
|
||||
ymd("2015-02-01") |>
|
||||
update(hour = 400)
|
||||
update(ymd("2023-02-01"), mday = 30)
|
||||
update(ymd("2023-02-01"), hour = 400)
|
||||
```
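You can check the roll-over results by hand. This base R sketch reproduces the arithmetic, assuming that the excess days and hours are simply carried into the next larger unit:

```r
# February 2023 has 28 days, so "February 30" is 29 days after February 1
as.Date("2023-02-01") + 29  # "2023-03-02"

# 400 hours is 16 whole days plus 16 hours
400 %/% 24  # 16
400 %% 24   # 16

# So hour 400 of February 1 lands on February 17 at 16:00
as.POSIXct("2023-02-01", tz = "UTC") + 400 * 3600  # "2023-02-17 16:00:00 UTC"
```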
|
||||
|
||||
You can use `update()` to show the distribution of flights across the course of the day for every day of the year:
|
||||
|
||||
```{r}
|
||||
flights_dt |>
|
||||
mutate(dep_hour = update(dep_time, yday = 1)) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 300)
|
||||
```
|
||||
|
||||
Setting larger components of a date to a constant is a powerful technique that allows you to explore patterns in the smaller components.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. How does the distribution of flight times within a day change over the course of the year?
|
||||
|
@ -386,7 +437,7 @@ In R, when you subtract two dates, you get a difftime object:
|
|||
|
||||
```{r}
|
||||
# How old is Hadley?
|
||||
h_age <- today() - ymd(19791014)
|
||||
h_age <- today() - ymd("1979-10-14")
|
||||
h_age
|
||||
```
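One thing to watch with difftimes is that R picks the unit for you, so two difftimes are not necessarily measured in the same unit. Here is a small base R sketch with made-up dates that shows the issue and one way around it:

```r
gap_days <- as.Date("2024-03-01") - as.Date("2024-02-01")
gap_mins <- as.POSIXct("2024-01-01 12:00:00", tz = "UTC") -
  as.POSIXct("2024-01-01 11:30:00", tz = "UTC")

units(gap_days)  # "days"
units(gap_mins)  # "mins"

# Ask for an explicit unit to compare like with like
as.numeric(gap_days, units = "secs")  # 2505600
as.numeric(gap_mins, units = "secs")  # 1800
```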
|
||||
|
||||
|
@ -431,15 +482,15 @@ last_year <- today() - dyears(1)
|
|||
However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:
|
||||
|
||||
```{r}
|
||||
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
|
||||
one_pm <- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")
|
||||
|
||||
one_pm
|
||||
one_pm + ddays(1)
|
||||
```
|
||||
|
||||
Why is one day after 1pm on March 12, 2pm on March 13?!
|
||||
Why is one day after 1pm March 7, 2pm March 8?
|
||||
If you look carefully at the date you might also notice that the time zones have changed.
|
||||
Because of DST, March 12 only has 23 hours, so if we add a full days worth of seconds we end up with a different time.
|
||||
March 8 only has 23 hours because it's when DST starts, so if we add a full day's worth of seconds we end up with a different time.
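You can reproduce the effect without lubridate by adding a day's worth of seconds yourself. This base R sketch uses a starting point chosen so that the next 24 hours cross the start of US daylight saving time (March 8 in 2026), and it assumes your system's time zone database knows about "America/New_York":

```r
start <- as.POSIXct("2026-03-07 13:00:00", tz = "America/New_York")
later <- start + 24 * 60 * 60  # exactly 86,400 seconds

format(start, "%Y-%m-%d %H:%M %Z")  # "2026-03-07 13:00 EST"
format(later, "%Y-%m-%d %H:%M %Z")  # "2026-03-08 14:00 EDT"
```

The instant really is 86,400 seconds later; it is only the clock on the wall that has jumped forward by an extra hour.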
|
||||
|
||||
### Periods
|
||||
|
||||
|
@ -455,13 +506,9 @@ one_pm + days(1)
|
|||
Like durations, periods can be created with a number of friendly constructor functions.
|
||||
|
||||
```{r}
|
||||
seconds(15)
|
||||
minutes(10)
|
||||
hours(c(12, 24))
|
||||
days(7)
|
||||
months(1:6)
|
||||
weeks(3)
|
||||
years(1)
|
||||
```
|
||||
|
||||
You can add and multiply periods:
|
||||
|
@ -476,8 +523,8 @@ Compared to durations, periods are more likely to do what you expect:
|
|||
|
||||
```{r}
|
||||
# A leap year
|
||||
ymd("2016-01-01") + dyears(1)
|
||||
ymd("2016-01-01") + years(1)
|
||||
ymd("2024-01-01") + dyears(1)
|
||||
ymd("2024-01-01") + years(1)
|
||||
|
||||
# Daylight saving time
|
||||
one_pm + ddays(1)
|
||||
|
@ -500,7 +547,7 @@ We can fix this by adding `days(1)` to the arrival time of each overnight flight
|
|||
flights_dt <- flights_dt |>
|
||||
mutate(
|
||||
overnight = arr_time < dep_time,
|
||||
arr_time = arr_time + days(ifelse(overnight, 1, 0)),
|
||||
arr_time = arr_time + days(if_else(overnight, 1, 0)),
|
||||
sched_arr_time = sched_arr_time + days(overnight * 1)
|
||||
)
|
||||
```
|
||||
|
@ -512,7 +559,7 @@ flights_dt |>
|
|||
filter(overnight, arr_time < dep_time)
|
||||
```
|
||||
|
||||
### Intervals
|
||||
### Intervals {#sec-intervals}
|
||||
|
||||
It's obvious what `dyears(1) / ddays(365.25)` should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365.25 days' worth of seconds.
|
||||
|
||||
|
@ -531,15 +578,18 @@ An interval is a pair of starting and ending date times, or you can think of it
|
|||
You can create an interval by writing `start %--% end`:
|
||||
|
||||
```{r}
|
||||
to_next_year <- today() %--% (today() + years(1))
|
||||
to_next_year
|
||||
y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
|
||||
y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
|
||||
|
||||
y2023
|
||||
y2024
|
||||
```
|
||||
|
||||
You could then divide it by a duration or a period:
|
||||
You could then divide it by `days()` to find out how many days fit in the year:
|
||||
|
||||
```{r}
|
||||
to_next_year / ddays(1)
|
||||
to_next_year / months(1)
|
||||
y2023 / days(1)
|
||||
y2024 / days(1)
|
||||
```
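As a sanity check, plain date subtraction in base R gives the same day counts for these two years. This sketch doesn't use intervals at all, just the same start and end dates:

```r
as.numeric(as.Date("2024-01-01") - as.Date("2023-01-01"))  # 365
as.numeric(as.Date("2025-01-01") - as.Date("2024-01-01"))  # 366 (2024 is a leap year)
```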
|
||||
|
||||
### Summary
|
||||
|
@ -548,17 +598,6 @@ How do you pick between duration, periods, and intervals?
|
|||
As always, pick the simplest data structure that solves your problem.
|
||||
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
|
||||
|
||||
@fig-dt-algebra summarizes permitted arithmetic operations between the different data types.
|
||||
|
||||
```{r}
|
||||
#| label: fig-dt-algebra
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| The allowed arithmetic operations between pairs of date/time classes.
|
||||
|
||||
knitr::include_graphics("diagrams/datetimes-arithmetic.png")
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Explain `days(overnight * 1)` to someone who has just started learning R.
|
||||
|
@ -576,17 +615,19 @@ knitr::include_graphics("diagrams/datetimes-arithmetic.png")
|
|||
Time zones are an enormously complicated topic because of their interaction with geopolitical entities.
|
||||
Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
|
||||
|
||||
<!--# https://www.ietf.org/timezones/tzdb-2018a/theory.html -->
|
||||
|
||||
The first challenge is that everyday names of time zones tend to be ambiguous.
|
||||
For example, if you're American you're probably familiar with EST, or Eastern Standard Time.
|
||||
However, both Australia and Canada also have EST!
|
||||
To avoid confusion, R uses the international standard IANA time zones.
|
||||
These use a consistent naming scheme "<area>/<location>", typically in the form "\<continent\>/\<city\>" (there are a few exceptions because not every country lies on a continent).
|
||||
These use a consistent naming scheme `{area}/{location}`, typically in the form `{continent}/{city}` or `{ocean}/{city}`.
|
||||
Examples include "America/New_York", "Europe/Paris", and "Pacific/Auckland".
|
||||
|
||||
You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country.
|
||||
This is because the IANA database has to record decades' worth of time zone rules.
|
||||
In the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same.
|
||||
Another problem is that the name needs to reflect not only the current behaviour, but also the complete history.
|
||||
Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same.
|
||||
Another problem is that the name needs to reflect not only the current behavior, but also the complete history.
|
||||
For example, there are time zones for both "America/New_York" and "America/Detroit".
|
||||
These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name.
|
||||
It's worth reading the raw time zone database (available at <http://www.iana.org/time-zones>) just to read some of these stories!
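You can see this history for yourself by printing one instant from that period in both zones. The sketch below uses base R and an arbitrary summer day in 1970; what you see depends on the time zone database installed on your system, but if it includes this history the two lines should show different clock times.

```r
# One instant, chosen from the years when Michigan did not follow DST
x <- as.POSIXct("1970-07-01 12:00:00", tz = "UTC")

ny <- format(x, tz = "America/New_York", usetz = TRUE)
detroit <- format(x, tz = "America/Detroit", usetz = TRUE)

ny
detroit
```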
|
||||
|
@ -610,9 +651,14 @@ In R, the time zone is an attribute of the date-time that only controls printing
|
|||
For example, these three objects represent the same instant in time:
|
||||
|
||||
```{r}
|
||||
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
|
||||
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
|
||||
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
|
||||
x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
|
||||
x1
|
||||
|
||||
x2 <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
|
||||
x2
|
||||
|
||||
x3 <- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
|
||||
x3
|
||||
```
|
||||
|
||||
You can verify that they're the same time using subtraction:
|
||||
|
@ -623,7 +669,7 @@ x1 - x3
|
|||
```
|
||||
|
||||
Unless otherwise specified, lubridate always uses UTC.
|
||||
UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and roughly equivalent to its predecessor GMT (Greenwich Mean Time).
|
||||
UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time).
|
||||
It does not have DST, which makes it a convenient representation for computation.
|
||||
Operations that combine date-times, like `c()`, will often drop the time zone.
|
||||
In that case, the date-times will display in your local time zone:
|
||||
|
|
Binary file not shown.
Before Width: | Height: | Size: 73 KiB |