This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. For example:
I'm sure you know that not every year has 365 days, but do you know the full rule for determining if a year is a leap year? You might have remembered that many parts of the world use daylight saving time (DST), so that some days have 23 hours, and others have 25. You probably didn't know that some minutes have 61 seconds: every now and then leap seconds are added because the Earth's rotation is gradually slowing down.
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.
This chapter will focus on the __lubridate__ package, which makes it easier to work with dates and times in R. We will use nycflights13 for practice data, and some packages for EDA.
* A __date-time__ is a date plus a time: it uniquely identifies an
instant in time (typically to the nearest second). Tibbles print this
as `<dttm>`. Elsewhere in R these are called POSIXct, but I don't think
that's a very useful name.
In this chapter we are only going to focus on dates and date-times. R doesn't have a native class for storing times. If you need one, you can use the hms package.
You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we'll come back to at the end of the chapter.
Date/time data often comes as strings. You've seen one approach to parsing date-times with the readr package, in [date-times](#readr-datetimes). Another approach is to use the helper functions provided by lubridate. They automatically work out the format once you tell them the order of the day, month, and year components. To use them, identify the order in which the year, month, and day appear in your dates. Now arrange "y", "m", and "d" in the same order. This is the name of the function in lubridate that will parse your dates. For example:
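As a minimal sketch (assuming lubridate is installed and month names are written in English), the same date in three different component orders can be parsed like this:

```r
library(lubridate)

# Year, month, day order -> ymd()
d1 <- ymd("2017-01-31")

# Month, day, year order -> mdy()
d2 <- mdy("January 31st, 2017")

# Day, month, year order -> dmy()
d3 <- dmy("31-Jan-2017")

# All three strings describe the same calendar date
d1 == d2 && d2 == d3
```

Each call returns a Date, so the three results can be compared directly.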
If you want to create a single date object for use in comparisons (e.g. in `dplyr::filter()`), I recommend using `ymd()` with numeric input. It's short and unambiguous:
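A small sketch of what that looks like. The `dates` vector is made up for illustration, and plain subsetting stands in for `dplyr::filter()`:

```r
library(lubridate)

# An unquoted number in year-month-day order
cutoff <- ymd(20170131)

# Hypothetical dates to compare against the cutoff
dates <- ymd(c("2017-01-15", "2017-01-31", "2017-02-10"))

# Keep only the dates on or after the cutoff
recent <- dates[dates >= cutoff]
recent
```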
If you have a date-time that also contains hours, minutes, or seconds, add an underscore and then one or more of "h", "m", and "s" to the name of the parsing function.
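For instance, a sketch with two made-up strings, one with seconds and one without:

```r
library(lubridate)

# Year-month-day followed by hours, minutes, and seconds
dt1 <- ymd_hms("2017-01-31 20:11:59")

# Month-day-year followed by hours and minutes only
dt2 <- mdy_hm("01/31/2017 08:01")

hour(dt1)    # 20
minute(dt2)  # 1
```

If you don't supply a time zone, both results are date-times in UTC.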
Let's do the same thing for each of the four time columns in `flights`. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once that's done, I focus in on the variables we'll explore in the rest of the chapter.
Now that you know how to get date-time data into R's date-time data structures, let's explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components of the date. The next section will look at how arithmetic works with date-times.
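Here is one way the modulus arithmetic might look. This is only a sketch: the helper name `make_datetime_100` and the example values are illustrative, and it assumes times stored as integers like `517` (meaning 05:17), as in the `flights` time columns.

```r
library(lubridate)

# Hypothetical helper: combine date parts with a clock time stored as HHMM.
# Integer division by 100 gives the hour; the remainder gives the minute.
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

# 517 means 05:17 and 1455 means 14:55
dep <- make_datetime_100(2013, 1, 1, c(517, 1455))
dep
```

The same helper could be applied to each time column inside `dplyr::mutate()`.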
You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
For `month()` and `wday()` you can set `label = TRUE` to return the name of the month or day of the week. Names are abbreviated by default (`abbr = TRUE`), which can be helpful in plots; set `abbr = FALSE` to return the full name.
There's an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
So why do we see such a strong pattern in the delays by actual departure time? Well, like much data collected by humans, there's a strong bias towards flights leaving at "nice" departure times. Always be alert for this sort of pattern whenever your data involves human judgement.
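A quick sketch of the accessors on a single made-up date-time (July 8, 2016 was a Friday, and by default lubridate counts Sunday as day 1 of the week):

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)    # 2016
month(datetime)   # 7
mday(datetime)    # 8
yday(datetime)    # 190, the 190th day of a leap year
wday(datetime)    # 6, i.e. Friday when weeks start on Sunday
hour(datetime)    # 12
minute(datetime)  # 34
second(datetime)  # 56
```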
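As a sketch, the labelled versions return ordered factors rather than numbers. The exact names depend on your locale, so no printed output is shown here:

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

# Abbreviated month name (abbr = TRUE is the default)
m <- month(datetime, label = TRUE)

# Full weekday name
w <- wday(datetime, label = TRUE, abbr = FALSE)

m
w
```

Because the results are ordered factors, plots keep months and weekdays in calendar order instead of alphabetical order.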
An alternative approach to plotting individual components is to round the date to a nearby unit of time, with `floor_date()`, `round_date()`, and `ceiling_date()`. Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to.
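A minimal sketch of the three rounding functions on one made-up date-time. `week_start = 7` is passed explicitly so that weeks start on Sunday regardless of local options:

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")

floor_date(datetime, "hour")     # 2016-07-08 12:00:00
round_date(datetime, "day")      # 2016-07-09, because 12:34 is past noon
ceiling_date(datetime, "month")  # 2016-08-01

# Round down to the start of the week (the preceding Sunday, 2016-07-03)
floor_date(datetime, "week", week_start = 7)
```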
This allows us to, for example, plot the number of flights per week:
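The counting step behind such a plot can be sketched with a handful of made-up departure times in place of the real `flights` data:

```r
library(lubridate)

# Hypothetical departure times, not taken from flights
dep_time <- ymd_hms(c(
  "2013-01-01 05:17:00",
  "2013-01-03 09:00:00",
  "2013-01-07 18:30:00",
  "2013-01-08 06:45:00"
))

# Round each departure down to the Sunday that starts its week
week <- floor_date(dep_time, "week", week_start = 7)

# Count departures per week
weekly_counts <- table(as.Date(week))
weekly_counts
```

With the real data, the same `floor_date()` call could feed `dplyr::count()` and a line plot.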
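Next you'll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you'll learn about three important classes that represent time spans:

* __durations__, which represent an exact number of seconds.
* __periods__, which represent human units like days and months.
* __intervals__, which represent a time span with a starting point.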
In R, when you subtract two dates, you get a difftime object. A difftime records a time span in seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the __duration__.
Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds at the standard rate (60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, and 365 days in a year; recent versions of lubridate use 365.25 days per year instead). You can add and multiply durations:
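A small sketch of the difference, using two fixed dates one leap year apart (the dates are only for illustration):

```r
library(lubridate)

# Subtracting two dates gives a difftime, here measured in days
span <- ymd("2017-01-31") - ymd("2016-01-31")
span

# Converting to a duration always gives seconds
dur <- as.duration(span)
dur
```

The span covers February 29, 2016, so it is 366 days, or 366 * 86400 seconds.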
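A few illustrative constructors and operations, as a sketch. Years are left out because their length in seconds depends on the lubridate version:

```r
library(lubridate)

dminutes(10)            # ten minutes, stored as 600 seconds
dhours(c(12, 24))       # constructors are vectorised
dweeks(3)               # three weeks' worth of seconds

two_days <- 2 * ddays(1)             # multiply a duration by a number
day_and_half <- ddays(1) + dhours(12) # add two durations

# Durations can also be added to dates
tomorrow <- today() + ddays(1)
```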
Why is one day after 1pm on March 12, 2pm on March 13?! If you look carefully at the date you might also notice that the time zones have changed. Because of DST, the clocks skipped an hour in between (March 13 only has 23 hours), so if we add a full day's worth of seconds we end up with a different hour.
You can use __periods__ to handle irregularities in the timeline. Periods are time spans that work with "human" times, like days and months. Periods don't have a fixed length in seconds, which lets them work in an intuitive, human-friendly way.
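The surprise described above can be reproduced with a sketch like the following. It assumes the year 2016 and the `America/New_York` time zone, where DST began at 2am on March 13:

```r
library(lubridate)

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Adding exactly 86400 seconds crosses the DST change,
# so the clock time lands on 2pm the next day
next_day <- one_pm + ddays(1)

hour(one_pm)    # 13
hour(next_day)  # 14
```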
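As a sketch of the contrast with durations, adding a period of one day keeps the clock time fixed even across a DST change (again assuming 2016 in `America/New_York`):

```r
library(lubridate)

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# A period of one day moves the calendar forward by one day
next_day <- one_pm + days(1)
hour(next_day)  # 13, still 1pm

# Periods also respect leap years: one year after Jan 1 is Jan 1
new_year <- ymd("2016-01-01") + years(1)
new_year        # 2017-01-01
```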
Let's use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination _before_ they departed from New York City.
These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding `days(1)` to the arrival time of each overnight flight.
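A minimal sketch of that fix, using two made-up flights instead of the real data (the second one departs late in the evening and lands after midnight):

```r
library(lubridate)

# Hypothetical departure and arrival times built from the same calendar date
dep_time <- ymd_hms(c("2013-01-01 09:00:00", "2013-01-01 23:30:00"))
arr_time <- ymd_hms(c("2013-01-01 11:15:00", "2013-01-01 01:10:00"))

# A flight is overnight if it appears to arrive before it departs
overnight <- arr_time < dep_time

# Add one day to overnight arrivals and zero days to the rest
arr_time <- arr_time + days(overnight * 1)

arr_time
```

After the adjustment every arrival is later than its departure.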
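It's obvious what `dyears(1) / ddays(365)` should return. It should return one (or, in recent versions of lubridate, very nearly one), because durations are always represented in seconds and a duration of a year is defined as a fixed number of seconds: 365 days' worth in older versions, 365.25 days' worth in recent ones.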
What should `years(1) / days(1)` return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There's not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate, with a warning:
If you want a more accurate measurement, you'll have to use an __interval__ instead. An interval is a duration with a starting point: that makes it precise so you can determine exactly how long it is:
How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
The following diagram summarises the interrelationships between the different data types:
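For example, here is a sketch with fixed start dates, so that the answer does not depend on when the code is run:

```r
library(lubridate)

# One calendar year starting in a leap year
start_2016 <- ymd("2016-01-01")
span_2016 <- start_2016 %--% (start_2016 + years(1))

# One calendar year starting in a non-leap year
start_2015 <- ymd("2015-01-01")
span_2015 <- start_2015 %--% (start_2015 + years(1))

# Dividing an interval by a one-day duration gives its exact length in days
days_2016 <- span_2016 / ddays(1)  # 366
days_2015 <- span_2015 / ddays(1)  # 365
```

The `%--%` operator is shorthand for `interval(start, end)`.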
Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don't need to dig into all the details as they're not all important for data analysis, but there are a few challenges we'll need to tackle head on.
The first challenge is that the names of time zones that you're familiar with are not very general. For example, if you're an American you're probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have Eastern standard times which mean different things! To avoid confusion R uses the international standard IANA time zones. These don't have a terribly consistent naming scheme, but tend to fall in one of three camps:
An additional complication of time zones is daylight saving time (DST): many time zones shift by an hour during summer time. For example, the same instants may be the same time or different times in Denver and Phoenix over the course of the year:
This also creates a challenge for determining how much time has elapsed between two date-times. Lubridate also offers a solution for this: the __interval__, which you can coerce into either a duration or a period:
Operations that drop attributes, such as `c()`, will drop the time zone attribute from your date-times. In that case, the date-times will display in your local time zone:
If you do not set the time zone, lubridate will automatically assign the date-time to Coordinated Universal Time (UTC). UTC is the standard time zone used by the scientific community and roughly equates to its predecessor, Greenwich Mean Time (GMT). Since UTC does not observe daylight saving time, it is straightforward to work with times saved in this time zone.
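To see which names your own R installation recognises, base R provides `Sys.timezone()` and `OlsonNames()`. A quick sketch (the output varies by system, so none is shown):

```r
# The time zone R thinks you are currently in
Sys.timezone()

# How many time zone names this installation knows about
length(OlsonNames())

# The first few names
head(OlsonNames())

# Check whether a particular name is available
has_ny <- "America/New_York" %in% OlsonNames()
has_ny
```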
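The Denver and Phoenix case can be sketched with two made-up instants, one in winter and one in summer. Arizona stays on Mountain Standard Time all year, while Denver switches to daylight time in summer:

```r
library(lubridate)

winter <- ymd_hms("2016-01-15 12:00:00", tz = "UTC")
summer <- ymd_hms("2016-07-15 12:00:00", tz = "UTC")

# In winter both cities show the same clock time
hour(with_tz(winter, "America/Denver"))   # 5
hour(with_tz(winter, "America/Phoenix"))  # 5

# In summer Denver is an hour ahead of Phoenix
hour(with_tz(summer, "America/Denver"))   # 6
hour(with_tz(summer, "America/Phoenix"))  # 5
```

`with_tz()` keeps the instant the same and only changes how it is displayed.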
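As a sketch, here is an interval that spans the start of DST (assuming 2016 in `America/New_York`), coerced both ways:

```r
library(lubridate)

start <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
end   <- ymd_hms("2016-03-13 13:00:00", tz = "America/New_York")

span <- interval(start, end)

# Physical time elapsed: only 23 hours, because an hour was skipped
elapsed <- as.duration(span)

# Human time elapsed: one calendar day, 1pm to 1pm
calendar <- as.period(span)

elapsed
calendar
```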
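A short sketch of the default, compared with an explicitly supplied time zone:

```r
library(lubridate)

# No tz argument: the result is in UTC
x_utc <- ymd_hms("2016-07-08 12:34:56")
tz(x_utc)  # "UTC"

# Explicit tz argument: the same clock time in New York
x_ny <- ymd_hms("2016-07-08 12:34:56", tz = "America/New_York")
tz(x_ny)   # "America/New_York"

# The two are different instants, four hours apart in July
difftime(x_ny, x_utc, units = "hours")
```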