Re-render book for O'Reilly
@@ -33,9 +33,9 @@ Creating date/times</h1>
|
||||
<p>To get the current date or date-time you can use <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> or <code><a href="https://lubridate.tidyverse.org/reference/now.html">now()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">today()
|
||||
#> [1] "2022-11-18"
|
||||
#> [1] "2023-01-12"
|
||||
now()
|
||||
#> [1] "2022-11-18 11:36:09 CST"</pre>
|
||||
#> [1] "2023-01-12 17:04:08 CST"</pre>
|
||||
</div>
|
||||
<p>Otherwise, the following sections describe the four ways you’re likely to create a date/time:</p>
|
||||
<ul><li>While reading a file with readr.</li>
|
||||
@@ -61,7 +61,7 @@ read_csv(csv)
|
||||
<p>If you haven’t heard of <strong>ISO8601</strong> before, it’s an international standard<span data-type="footnote"><a href="https://xkcd.com/1179/" class="uri">https://xkcd.com/1179/</a></span> for writing dates where the components of a date are organised from biggest to smallest separated by <code>-</code>. For example, in ISO8601 May 3 2022 is <code>2022-05-03</code>. ISO8601 dates can also include times, where hour, minute, and second are separated by <code>:</code>, and the date and time components are separated by either a <code>T</code> or a space. For example, you could write 4:26pm on May 3 2022 as either <code>2022-05-03 16:26</code> or <code>2022-05-03T16:26</code>.</p>
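<p>As a minimal sketch of the ordering described above (assuming the lubridate package is installed; the variable names are purely illustrative), parsing the string <code>2022-05-03</code> shows that the middle component is the month, and that the space and <code>T</code> separators describe the same instant:</p>

```r
library(lubridate)

# An ISO8601 date reads year, then month, then day.
d <- ymd("2022-05-03")
year(d)   # 2022
month(d)  # 5
day(d)    # 3

# The date and time parts may be separated by a space or by a "T".
with_space <- ymd_hms("2022-05-03 16:26:00")
with_t     <- ymd_hms("2022-05-03T16:26:00")
with_space == with_t  # TRUE
```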
|
||||
<p>For other date-time formats, you’ll need to use <code>col_types</code> plus <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a <code>%</code> followed by a single character. For example, <code>%Y-%m-%d</code> specifies a date that’s a year, <code>-</code>, month (as number) <code>-</code>, day. Table <a href="#tbl-date-formats" data-type="xref">#tbl-date-formats</a> lists all the options.</p>
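<p>To see why the format string matters, here is a small sketch using readr’s <code>parse_date()</code> on a single ambiguous string. The input value and variable names are made up for illustration; the same format strings can be passed to <code>col_date()</code>:</p>

```r
library(readr)

# "%m/%d/%y" reads: month number, "/", day, "/", two-digit year.
us_style <- parse_date("01/02/15", format = "%m/%d/%y")

# "%d/%m/%y" swaps day and month, so the same text gives a different date.
eu_style <- parse_date("01/02/15", format = "%d/%m/%y")

us_style  # 2015-01-02
eu_style  # 2015-02-01
```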
|
||||
<div id="tbl-date-formats" class="anchored">
|
||||
<table class="table"><caption>Table 17.1: All date formats understood by readr</caption>
|
||||
<table class="table"><caption>Table 19.1: All date formats understood by readr</caption>
|
||||
<thead><tr class="header"><th>Type</th>
|
||||
<th>Code</th>
|
||||
<th>Meaning</th>
|
||||
@@ -256,20 +256,20 @@ flights_dt
|
||||
<p>With this data, we can visualize the distribution of departure times across the year:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
ggplot(aes(dep_time)) +
|
||||
ggplot(aes(x = dep_time)) +
|
||||
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-12-1.png" alt="A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Or within a single day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
filter(dep_time < ymd(20130102)) |>
|
||||
ggplot(aes(dep_time)) +
|
||||
ggplot(aes(x = dep_time)) +
|
||||
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-13-1.png" class="img-fluid" alt="A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-13-1.png" alt="A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.</p>
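<p>A quick way to check these units is to convert two values one day apart to numbers. This is only a sketch with illustrative variable names, assuming lubridate is loaded:</p>

```r
library(lubridate)

# Date-times are stored as seconds, so one day apart is 86400 units.
dt1 <- ymd_hms("2013-01-01 00:00:00")
dt2 <- ymd_hms("2013-01-02 00:00:00")
as.numeric(dt2) - as.numeric(dt1)  # 86400

# Dates are stored as days, so one day apart is 1 unit.
d1 <- ymd("2013-01-01")
d2 <- ymd("2013-01-02")
as.numeric(d2) - as.numeric(d1)    # 1
```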
|
||||
@@ -281,9 +281,9 @@ From other types</h2>
|
||||
<p>You may want to switch between a date-time and a date. That’s the job of <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">as_datetime(today())
|
||||
#> [1] "2022-11-18 UTC"
|
||||
#> [1] "2023-01-12 UTC"
|
||||
as_date(now())
|
||||
#> [1] "2022-11-18"</pre>
|
||||
#> [1] "2023-01-12"</pre>
|
||||
</div>
|
||||
<p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code>; if it’s in days, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>.</p>
|
||||
<div class="cell">
|
||||
@@ -357,9 +357,9 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(wday = wday(dep_time, label = TRUE)) |>
|
||||
ggplot(aes(x = wday)) +
|
||||
geom_bar()</pre>
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000)." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-20-1.png" alt="A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>There’s an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!</p>
|
||||
@@ -367,13 +367,14 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(minute = minute(dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
summarize(
|
||||
avg_delay = mean(dep_delay, na.rm = TRUE),
|
||||
n = n()) |>
|
||||
ggplot(aes(minute, avg_delay)) +
|
||||
geom_line()</pre>
|
||||
n = n()
|
||||
) |>
|
||||
ggplot(aes(x = minute, y = avg_delay)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour and 9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9)." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-21-1.png" alt="A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour and 9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Interestingly, if we look at the <em>scheduled</em> departure time we don’t see such a strong pattern:</p>
|
||||
@@ -381,22 +382,24 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
||||
<pre data-type="programlisting" data-code-language="r">sched_dep <- flights_dt |>
|
||||
mutate(minute = minute(sched_dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
summarize(
|
||||
avg_delay = mean(arr_delay, na.rm = TRUE),
|
||||
n = n())
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(sched_dep, aes(minute, avg_delay)) +
|
||||
ggplot(sched_dep, aes(x = minute, y = avg_delay)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16). There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-22-1.png" alt="A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16). There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
|
||||
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times, as <a href="#fig-human-rounding" data-type="xref">#fig-human-rounding</a> shows. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(sched_dep, aes(minute, n)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes." width="576"/></p>
|
||||
|
||||
<figure id="fig-human-rounding"><p><img src="datetimes_files/figure-html/fig-human-rounding-1.png" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes." width="576"/></p>
|
||||
<figcaption>A line plot showing the number of flights scheduled to depart at each minute of the hour. You can see a strong preference for round numbers like 0 and 30 and generally for numbers that are a multiple of five.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
@@ -408,33 +411,33 @@ Rounding</h2>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
count(week = floor_date(dep_time, "week")) |>
|
||||
ggplot(aes(week, n)) +
|
||||
ggplot(aes(x = week, y = n)) +
|
||||
geom_line() +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-24-1.png" class="img-fluid" alt="A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights)." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-24-1.png" alt="A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can use rounding to show the distribution of flights across the course of a day by computing the difference between <code>dep_time</code> and the earliest instant of that day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
ggplot(aes(x = dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
#> Don't know how to automatically pick scale for object of type <difftime>.
|
||||
#> Defaulting to continuous.</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="A line plot with departure time on the x-axis. This is in units of seconds since midnight so it's hard to interpret." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-25-1.png" alt="A line plot with departure time on the x-axis. This is in units of seconds since midnight so it's hard to interpret." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Computing the difference between a pair of date-times yields a difftime (more on that in <a href="#sec-intervals" data-type="xref">#sec-intervals</a>). We can convert that to an <code>hms</code> object to get a more useful x-axis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)</pre>
|
||||
ggplot(aes(x = dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (<100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-26-1.png" alt="A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (<100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again." width="576"/></p>
|
||||
</div>
|
||||
</div>
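<p>The same calculation can be traced on a single value. The sketch below uses a made-up departure time rather than the flights data, and assumes the lubridate and hms packages are installed:</p>

```r
library(lubridate)

# A hypothetical departure time, chosen only for illustration.
dep_time <- ymd_hms("2013-01-03 14:25:00")

# floor_date() rounds down to the earliest instant of that day.
midnight <- floor_date(dep_time, "day")

# Subtracting two date-times gives a difftime; R picks its units automatically.
since_midnight <- dep_time - midnight

# Asking for seconds explicitly removes the ambiguity (14 h 25 min is 51900 s).
secs <- as.numeric(since_midnight, units = "secs")

# An hms object stores the same span as a time of day, which plots more readably.
time_of_day <- hms::as_hms(since_midnight)
```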
|
||||
</section>
|
||||
@@ -442,7 +445,7 @@ Rounding</h2>
|
||||
<section id="modifying-components" data-type="sect2">
|
||||
<h2>
|
||||
Modifying components</h2>
|
||||
<p>You can also use each accessor function to modify the components of a date/time:</p>
|
||||
<p>You can also use each accessor function to modify the components of a date/time. This doesn’t come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">(datetime <- ymd_hms("2026-07-08 12:34:56"))
|
||||
#> [1] "2026-07-08 12:34:56 UTC"
|
||||
@@ -457,7 +460,7 @@ hour(datetime) <- hour(datetime) + 1
|
||||
datetime
|
||||
#> [1] "2030-01-08 13:34:56 UTC"</pre>
|
||||
</div>
|
||||
<p>Alternatively, rather than modifying an existing variabke, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
|
||||
<p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
|
||||
#> [1] "2030-02-02 02:34:56 UTC"</pre>
|
||||
@@ -480,7 +483,7 @@ Exercises</h2>
|
||||
<li><p>How does the average delay time change over the course of a day? Should you use <code>dep_time</code> or <code>sched_dep_time</code>? Why?</p></li>
|
||||
<li><p>On what day of the week should you leave if you want to minimise the chance of a delay?</p></li>
|
||||
<li><p>What makes the distribution of <code>diamonds$carat</code> and <code>flights$sched_dep_time</code> similar?</p></li>
|
||||
<li><p>Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.</p></li>
|
||||
<li><p>Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
@@ -504,12 +507,12 @@ Durations</h2>
|
||||
<pre data-type="programlisting" data-code-language="r"># How old is Hadley?
|
||||
h_age <- today() - ymd("1979-10-14")
|
||||
h_age
|
||||
#> Time difference of 15741 days</pre>
|
||||
#> Time difference of 15796 days</pre>
|
||||
</div>
|
||||
<p>A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the <strong>duration</strong>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">as.duration(h_age)
|
||||
#> [1] "1360022400s (~43.1 years)"</pre>
|
||||
#> [1] "1364774400s (~43.25 years)"</pre>
|
||||
</div>
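<p>The unit ambiguity is easier to see with two spans of different lengths. This is a minimal sketch with illustrative date-times and variable names, assuming lubridate is loaded:</p>

```r
library(lubridate)

start <- ymd_hms("2023-01-01 00:00:00")

# Two difftimes from the same kind of subtraction end up in different units.
short_gap <- ymd_hms("2023-01-01 00:00:30") - start
long_gap  <- ymd_hms("2023-01-01 02:00:00") - start
units(short_gap)  # "secs"
units(long_gap)   # "hours"

# Converting to durations puts both on the same scale: seconds.
as.numeric(as.duration(short_gap))  # 30
as.numeric(as.duration(long_gap))   # 7200
```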
|
||||
<p>Durations come with a bunch of convenient constructors:</p>
|
||||
<div class="cell">
|
||||
@@ -691,7 +694,7 @@ Time zones</h1>
|
||||
<p>And see the complete list of all time zone names with <code><a href="https://rdrr.io/r/base/timezones.html">OlsonNames()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">length(OlsonNames())
|
||||
#> [1] 595
|
||||
#> [1] 596
|
||||
head(OlsonNames())
|
||||
#> [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
|
||||
#> [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"</pre>
|
||||
|
||||