Fix code language
parent 69b4597f3b
commit 868a35ca71
@ -56,6 +56,11 @@ devtools::load_all("../minibook/"); process_book()
html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE)
file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE)

pngs <- list.files("oreilly", pattern = "[.]png$", full.names = TRUE, recursive = TRUE)
dest <- gsub("oreilly", "../r-for-data-science-2e/", pngs)
fs::dir_create(unique(dirname(dest)))
file.copy(pngs, dest, overwrite = TRUE)
```

## Code of Conduct
@ -15,7 +15,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -52,7 +52,7 @@ Variation</h1>
|
|||
Visualizing distributions</h2>
|
||||
<p>How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is <strong>categorical</strong> if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, you can use a bar chart:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (approximately 12500), Premium (approximately 14000), and Ideal (approximately 21500)." width="576"/></p>
|
||||
|
@ -60,7 +60,7 @@ Visualizing distributions</h2>
|
|||
</div>
|
||||
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut)
|
||||
#> # A tibble: 5 × 2
|
||||
#> cut n
|
||||
|
@ -73,7 +73,7 @@ Visualizing distributions</h2>
|
|||
</div>
|
||||
<p>A variable is <strong>continuous</strong> if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, you can use a histogram:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and far fewer, approximately 5000 diamonds, in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
|
||||
|
@ -81,7 +81,7 @@ Visualizing distributions</h2>
|
|||
</div>
|
||||
<p>You can compute this by hand by combining <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut_width(carat, 0.5))
|
||||
#> # A tibble: 11 × 2
|
||||
#> `cut_width(carat, 0.5)` n
|
||||
|
@ -97,7 +97,7 @@ Visualizing distributions</h2>
|
|||
<p>A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. Note that even though it’s not possible to have a <code>carat</code> value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0. This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25. The tallest bar shows that almost 30,000 observations have a <code>carat</code> value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5.</p>
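<p>The binning rule described above can be sketched in base R. This is only an illustration under stated assumptions, not how ggplot2 implements it: the <code>carat</code> vector is made up for the example, and <code>cut()</code> with hand-built breaks merely approximates what <code>geom_histogram()</code> computes.</p>

```r
# Hypothetical carat weights, made up for illustration (not the diamonds data).
carat <- c(0.20, 0.23, 0.31, 0.70, 0.74, 0.90, 1.01, 1.20, 1.52)
binwidth <- 0.5

# Centering the first bin at 0 puts its left edge at -binwidth / 2,
# which is why the first break is negative even though no weight is.
breaks <- seq(-binwidth / 2, max(carat) + binwidth, by = binwidth)

# Assign each value to a bin, then count the observations per bin.
bins <- cut(carat, breaks = breaks)
bin_counts <- table(bins)
bin_counts
```

<p>The heights of the bars in a histogram correspond to the counts in <code>bin_counts</code>; changing <code>binwidth</code> moves every break and therefore every count.</p>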
|
||||
<p>You can set the width of the intervals in a histogram with the <code>binwidth</code> argument, which is measured in the units of the <code>x</code> variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">smaller <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">smaller <- diamonds |>
|
||||
filter(carat < 3)
|
||||
|
||||
ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
|
@ -108,7 +108,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
|
|||
</div>
|
||||
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, but displays the counts with lines rather than bars. It’s much easier to understand overlapping lines than bars.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
|
||||
geom_freqpoly(binwidth = 0.1, linewidth = 0.75)</pre>
|
||||
|
@ -132,7 +132,7 @@ Typical values</h2>
|
|||
<ul><li><p>Why are there more diamonds at whole carats and common fractions of carats?</p></li>
|
||||
<li><p>Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?</p></li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.01)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-9-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak." width="576"/></p>
|
||||
|
@ -145,7 +145,7 @@ Typical values</h2>
|
|||
<li><p>Why might the appearance of clusters be misleading?</p></li>
|
||||
</ul><p>The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
geom_histogram(binwidth = 0.25)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid" alt="A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5." width="576"/></p>
|
||||
|
@ -159,7 +159,7 @@ Typical values</h2>
|
|||
Unusual values</h2>
|
||||
<p>Outliers are observations that are unusual: data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the <code>y</code> variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
geom_histogram(binwidth = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
|
||||
|
@ -167,7 +167,7 @@ Unusual values</h2>
|
|||
</div>
|
||||
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
geom_histogram(binwidth = 0.5) +
|
||||
coord_cartesian(ylim = c(0, 50))</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -177,7 +177,7 @@ Unusual values</h2>
|
|||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> also has an <code>xlim</code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
|
||||
<p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">unusual <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">unusual <- diamonds |>
|
||||
filter(y < 3 | y > 20) |>
|
||||
select(price, x, y, z) |>
|
||||
arrange(y)
|
||||
|
@ -216,7 +216,7 @@ Missing values</h1>
|
|||
<ol type="1"><li>
|
||||
<p>Drop the entire row with the strange values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds2 <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds2 <- diamonds |>
|
||||
filter(between(y, 3, 20))</pre>
|
||||
</div>
|
||||
<p>We don’t recommend this option: one invalid measurement doesn’t mean that all the measurements are invalid. Additionally, if you have low-quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!</p>
|
||||
|
@ -224,14 +224,14 @@ Missing values</h1>
|
|||
<li>
|
||||
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to replace the variable with a modified copy. You can use the <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> function to replace unusual values with <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds2 <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds2 <- diamonds |>
|
||||
mutate(y = if_else(y < 3 | y > 20, NA, y))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three main arguments. The first argument, <code>condition</code>, should be a logical vector. The result will contain the value of the second argument, <code>true</code>, when <code>condition</code> is <code>TRUE</code>, and the value of the third argument, <code>false</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another.</p>
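<p>As a rough sketch of the difference between the two functions, the example below applies both to a small made-up tibble. The <code>sizes</code> data, the <code>y_clean</code> and <code>y_label</code> columns, and the label strings are all hypothetical, and the <code>.default</code> argument assumes dplyr 1.1.0 or later.</p>

```r
library(dplyr)

# Hypothetical measurements, made up for illustration.
sizes <- tibble(y = c(0, 4.5, 6.1, 31.8))

sizes <- sizes |>
  mutate(
    # Two outcomes: keep plausible values, replace the rest with NA.
    y_clean = if_else(y < 3 | y > 20, NA, y),
    # Several outcomes: conditions are checked in order, first match wins.
    y_label = case_when(
      y < 3 ~ "too small",
      y > 20 ~ "too large",
      .default = "plausible"
    )
  )

sizes
```

<p>With only two outcomes <code>if_else()</code> is the shorter choice; once a third outcome appears, <code>case_when()</code> avoids nesting one <code>if_else()</code> inside another.</p>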
|
||||
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
geom_point()
|
||||
#> Warning: Removed 9 rows containing missing values (`geom_point()`).</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -240,12 +240,12 @@ Missing values</h1>
|
|||
</div>
|
||||
<p>To suppress that warning, set <code>na.rm = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
geom_point(na.rm = TRUE)</pre>
|
||||
</div>
|
||||
<p>Other times you want to understand what makes observations with missing values different from observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">nycflights13::flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">nycflights13::flights |>
|
||||
mutate(
|
||||
cancelled = is.na(dep_time),
|
||||
sched_hour = sched_dep_time %/% 100,
|
||||
|
@ -278,7 +278,7 @@ Covariation</h1>
|
|||
A categorical and continuous variable</h2>
|
||||
<p>It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = price)) +
|
||||
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
|
||||
|
@ -286,7 +286,7 @@ A categorical and continuous variable</h2>
|
|||
</div>
|
||||
<p>It’s hard to see the difference in distribution because the overall counts differ so much:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal." width="576"/></p>
|
||||
|
@ -294,7 +294,7 @@ A categorical and continuous variable</h2>
|
|||
</div>
|
||||
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
|
||||
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
|
||||
|
@ -313,7 +313,7 @@ A categorical and continuous variable</h2>
|
|||
</div>
|
||||
<p>Let’s take a look at the distribution of price by cut using <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest." width="576"/></p>
|
||||
|
@ -323,7 +323,7 @@ A categorical and continuous variable</h2>
|
|||
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://rdrr.io/r/stats/reorder.factor.html">reorder()</a></code> function.</p>
|
||||
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
|
||||
|
@ -331,7 +331,7 @@ A categorical and continuous variable</h2>
|
|||
</div>
|
||||
<p>To make the trend easier to see, we can reorder <code>class</code> based on the median value of <code>hwy</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg,
|
||||
mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -340,7 +340,7 @@ A categorical and continuous variable</h2>
|
|||
</div>
|
||||
<p>If you have long variable names, <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg,
|
||||
mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -365,7 +365,7 @@ Exercises</h3>
|
|||
Two categorical variables</h2>
|
||||
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
|
||||
geom_count()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-29-1.png" class="img-fluid" alt="A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000." width="576"/></p>
|
||||
|
@ -374,7 +374,7 @@ Two categorical variables</h2>
|
|||
<p>The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.</p>
|
||||
<p>A more commonly used way of representing the covariation between two categorical variables is a segmented bar chart. To create one, we map the variable that defines the bars to the <code>x</code> aesthetic and the variable that subdivides each bar to the <code>fill</code> aesthetic.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-30-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The number of diamonds for each level of cut increases from Fair to Ideal and the heights of the segments within each bar represent the number of diamonds that fall within each color/cut combination. There appear to be some of each color of diamonds within each level of cut of diamonds." width="576"/></p>
|
||||
|
@ -382,7 +382,7 @@ Two categorical variables</h2>
|
|||
</div>
|
||||
<p>However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
geom_bar(position = "fill")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-31-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The heights of each of the bars representing each cut of diamond are the same, 1. The heights of the segments within each bar represent the proportion of diamonds that fall within each color/cut combination. The proportions don't appear to be very different across the levels of cut." width="576"/></p>
|
||||
|
@ -390,7 +390,7 @@ Two categorical variables</h2>
|
|||
</div>
|
||||
<p>Another approach for exploring the relationship between these variables is computing the counts with dplyr:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(color, cut)
|
||||
#> # A tibble: 35 × 3
|
||||
#> color cut n
|
||||
|
@ -405,7 +405,7 @@ Two categorical variables</h2>
|
|||
</div>
|
||||
<p>Then visualize with <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> and the fill aesthetic:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(color, cut) |>
|
||||
ggplot(mapping = aes(x = color, y = cut)) +
|
||||
geom_tile(mapping = aes(fill = n))</pre>
|
||||
|
@ -430,7 +430,7 @@ Exercises</h3>
|
|||
Two continuous variables</h2>
|
||||
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-34-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential." width="576"/></p>
|
||||
|
@ -438,7 +438,7 @@ Two continuous variables</h2>
|
|||
</div>
|
||||
<p>Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the <code>alpha</code> aesthetic to add transparency.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
geom_point(alpha = 1 / 100)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
|
||||
|
@ -447,7 +447,7 @@ Two continuous variables</h2>
|
|||
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> to bin in two dimensions.</p>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> creates rectangular bins. <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_bin2d()
|
||||
|
||||
# install.packages("hexbin")
|
||||
|
@ -459,7 +459,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
|||
</div>
|
||||
<p>Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin <code>carat</code> and then for each group, display a boxplot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
|
@ -468,7 +468,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
|||
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p>
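<p>As a minimal sketch of that option, the previous plot can be redrawn with <code>varwidth = TRUE</code>. The definition of <code>smaller</code> below is an assumption (diamonds under 3 carats) made so the example runs on its own; substitute whatever subset you are working with.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Assumption: `smaller` holds the diamonds lighter than 3 carats
smaller <- diamonds |>
  filter(carat < 3)

# Boxes for bins with more diamonds are drawn wider
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(
    mapping = aes(group = cut_width(carat, 0.1)),
    varwidth = TRUE
  )</pre>
</div>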
|
||||
<p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-38-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. The diamonds are split into 20 bins, each containing approximately the same number of diamonds. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
|
@ -485,7 +485,7 @@ Exercises</h3>
|
|||
<li>
|
||||
<p>Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of <code>x</code> and <code>y</code> values, which makes the points outliers even though their <code>x</code> and <code>y</code> values appear normal when examined separately.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
|
||||
geom_point() +
|
||||
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -509,7 +509,7 @@ Patterns and models</h1>
|
|||
<li><p>Does the relationship change if you look at individual subgroups of the data?</p></li>
|
||||
</ul><p>A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-40-1.png" class="img-fluid" alt="A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times." width="576"/></p>
|
||||
|
@ -518,7 +518,7 @@ Patterns and models</h1>
|
|||
<p>Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.</p>
|
||||
<p>Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts <code>price</code> from <code>carat</code> and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of <code>price</code> and <code>carat</code>, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidymodels)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidymodels)
|
||||
|
||||
diamonds <- diamonds |>
|
||||
mutate(
|
||||
|
@ -540,7 +540,7 @@ ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
|
|||
</div>
|
||||
<p>Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-42-1.png" class="img-fluid" alt="Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end." width="576"/></p>
|
||||
|
@ -554,18 +554,18 @@ ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
|
|||
ggplot2 calls</h1>
|
||||
<p>As we move on from these introductory chapters, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
geom_freqpoly(binwidth = 0.25)</pre>
|
||||
</div>
|
||||
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
|
||||
<p>Rewriting the previous plot more concisely yields:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(faithful, aes(eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(eruptions)) +
|
||||
geom_freqpoly(binwidth = 0.25)</pre>
|
||||
</div>
|
||||
<p>Sometimes we’ll turn the end of a pipeline of data transformation into a plot. Watch for the transition from <code>|></code> to <code>+</code>. We wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut, clarity) |>
|
||||
ggplot(aes(clarity, cut, fill = n)) +
|
||||
geom_tile()</pre>
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
<h2>
|
||||
Prerequisites</h2>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
@ -21,27 +21,27 @@ Subsetting vectors</h2>
|
|||
<ol type="1"><li>
|
||||
<p><strong>A vector of positive integers</strong>. Subsetting with positive integers keeps the elements at those positions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("one", "two", "three", "four", "five")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("one", "two", "three", "four", "five")
|
||||
x[c(3, 2, 5)]
|
||||
#> [1] "three" "two" "five"</pre>
|
||||
</div>
|
||||
<p>By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x[c(1, 1, 5, 5, 5, 2)]
|
||||
<pre data-type="programlisting" data-code-language="r">x[c(1, 1, 5, 5, 5, 2)]
|
||||
#> [1] "one" "one" "five" "five" "five" "two"</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p><strong>A vector of negative integers</strong>. Negative values drop the elements at the specified positions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x[c(-1, -3, -5)]
|
||||
<pre data-type="programlisting" data-code-language="r">x[c(-1, -3, -5)]
|
||||
#> [1] "two" "four"</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p><strong>A logical vector</strong>. Subsetting with a logical vector keeps all values corresponding to a <code>TRUE</code> value. This is most often useful in conjunction with the comparison functions.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(10, 3, NA, 5, 8, 1, NA)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(10, 3, NA, 5, 8, 1, NA)
|
||||
|
||||
# All non-missing values of x
|
||||
!is.na(x)
|
||||
|
@ -60,7 +60,7 @@ x[x %% 2 == 0]
|
|||
<li>
|
||||
<p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(abc = 1, def = 2, xyz = 5)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(abc = 1, def = 2, xyz = 5)
|
||||
x[c("xyz", "def")]
|
||||
#> xyz def
|
||||
#> 5 2</pre>
|
||||
|
@ -76,7 +76,7 @@ Subsetting data frames</h2>
|
|||
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to select rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
|
||||
<p>Here are a couple of examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = 1:3,
|
||||
y = c("a", "e", "f"),
|
||||
z = runif(3)
|
||||
|
@ -109,7 +109,7 @@ df[df$x > 1, ]
|
|||
<p>We’ll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn’t use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
|
||||
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frames interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>s. So if <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- data.frame(x = 1:3)
|
||||
<pre data-type="programlisting" data-code-language="r">df1 <- data.frame(x = 1:3)
|
||||
df1[, "x"]
|
||||
#> [1] 1 2 3
|
||||
|
||||
|
@ -124,7 +124,7 @@ df2[, "x"]
|
|||
</div>
|
||||
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1[, "x", drop = FALSE]
|
||||
<pre data-type="programlisting" data-code-language="r">df1[, "x", drop = FALSE]
|
||||
#> x
|
||||
#> 1 1
|
||||
#> 2 2
|
||||
|
@ -139,7 +139,7 @@ dplyr equivalents</h2>
|
|||
<ul><li>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = c(2, 3, 1, 1, NA),
|
||||
y = letters[1:5],
|
||||
z = runif(5)
|
||||
|
@ -154,7 +154,7 @@ df[!is.na(df$x) & df$x > 1, ]</pre>
|
|||
<li>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> is equivalent to subsetting the rows with an integer vector, usually created with <code><a href="https://rdrr.io/r/base/order.html">order()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> arrange(x, y)
|
||||
<pre data-type="programlisting" data-code-language="r">df |> arrange(x, y)
|
||||
|
||||
# same as
|
||||
df[order(df$x, df$y), ]</pre>
|
||||
|
@ -164,7 +164,7 @@ df[order(df$x, df$y), ]</pre>
|
|||
<li>
|
||||
<p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> select(x, z)
|
||||
<pre data-type="programlisting" data-code-language="r">df |> select(x, z)
|
||||
|
||||
# same as
|
||||
df[, c("x", "z")]</pre>
|
||||
|
@ -172,7 +172,7 @@ df[, c("x", "z")]</pre>
|
|||
</li>
|
||||
</ul><p>Base R also provides a function that combines the features of <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code>.</span> called <code><a href="https://rdrr.io/r/base/subset.html">subset()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
filter(x > 1) |>
|
||||
select(y, z)
|
||||
#> # A tibble: 2 × 2
|
||||
|
@ -216,7 +216,7 @@ Selecting a single element<code>$</code> and <code>[[</code>
|
|||
Data frames</h2>
|
||||
<p><code>[[</code> and <code>$</code> can be used like <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tb <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">tb <- tibble(
|
||||
x = 1:4,
|
||||
y = c(10, 4, 1, 21)
|
||||
)
|
||||
|
@ -233,7 +233,7 @@ tb$x
|
|||
</div>
|
||||
<p>They can also be used to create new columns, the base R equivalent of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tb$z <- tb$x + tb$y
|
||||
<pre data-type="programlisting" data-code-language="r">tb$z <- tb$x + tb$y
|
||||
tb
|
||||
#> # A tibble: 4 × 3
|
||||
#> x y z
|
||||
|
@ -246,7 +246,7 @@ tb
|
|||
<p>There are a number of other base approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
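<p>As a rough illustration of those three functions (not taken from the gist), the sketch below adds the same column in three ways. The data frame <code>df_base</code> and the copies made from it are hypothetical names chosen for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical data frame used only for this illustration
df_base <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# transform() returns a copy of the data frame with the new column added
df1 <- transform(df_base, z = x + y)

# with() evaluates an expression using the columns as variables;
# the result still has to be assigned to a column
df2 <- df_base
df2$z <- with(df_base, x + y)

# within() runs assignments inside the data frame and returns the modified copy
df3 <- within(df_base, z <- x + y)</pre>
</div>
<p>All three leave <code>df_base</code> itself unchanged, which is the main practical difference from assigning with <code>$</code> directly.</p>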
|
||||
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">max(diamonds$carat)
|
||||
<pre data-type="programlisting" data-code-language="r">max(diamonds$carat)
|
||||
#> [1] 5.01
|
||||
|
||||
levels(diamonds$cut)
|
||||
|
@ -259,7 +259,7 @@ levels(diamonds$cut)
|
|||
Tibbles</h2>
|
||||
<p>There are a couple of important differences between tibbles and base <code>data.frame</code>s when it comes to <code>$</code>. Data frames match the prefix of any variable names (so-called <strong>partial matching</strong>) and don’t complain if a column doesn’t exist:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- data.frame(x1 = 1)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- data.frame(x1 = 1)
|
||||
df$x
|
||||
#> Warning in df$x: partial match of 'x' to 'x1'
|
||||
#> [1] 1
|
||||
|
@ -268,7 +268,7 @@ df$z
|
|||
</div>
|
||||
<p>Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tb <- tibble(x1 = 1)
|
||||
<pre data-type="programlisting" data-code-language="r">tb <- tibble(x1 = 1)
|
||||
|
||||
tb$x
|
||||
#> Warning: Unknown or uninitialised column: `x`.
|
||||
|
@ -285,7 +285,7 @@ tb$z
|
|||
Lists</h2>
|
||||
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">l <- list(
|
||||
<pre data-type="programlisting" data-code-language="r">l <- list(
|
||||
a = 1:3,
|
||||
b = "a string",
|
||||
c = pi,
|
||||
|
@ -295,7 +295,7 @@ Lists</h2>
|
|||
<ul><li>
|
||||
<p><code>[</code> extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str(l[1:2])
|
||||
<pre data-type="programlisting" data-code-language="r">str(l[1:2])
|
||||
#> List of 2
|
||||
#> $ a: int [1:3] 1 2 3
|
||||
#> $ b: chr "a string"
|
||||
|
@ -310,7 +310,7 @@ str(l[4])
|
|||
<li>
|
||||
<p><code>[[</code> and <code>$</code> extract a single component from a list. They remove a level of hierarchy from the list.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str(l[[1]])
|
||||
<pre data-type="programlisting" data-code-language="r">str(l[[1]])
|
||||
#> int [1:3] 1 2 3
|
||||
str(l[[4]])
|
||||
#> List of 2
|
||||
|
@ -348,7 +348,7 @@ str(l$a)
|
|||
</div>
|
||||
<p>This same principle applies when you use 1d <code>[</code> with a data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = 1:3, y = 3:5)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = 1:3, y = 3:5)
|
||||
|
||||
# returns a one-column data frame
|
||||
df["x"]
|
||||
|
@ -380,7 +380,7 @@ Apply family</h1>
|
|||
<p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.</span>. In fact, because we haven’t used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>’s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
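<p>A small sketch of that substitution, using a hypothetical list <code>x</code> and assuming purrr is installed: both calls apply <code>mean()</code> to each element and return a named list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical input: a named list of integer vectors
x <- list(a = 1:3, b = 4:6)

# purrr version
res_map <- purrr::map(x, mean)

# base R version: same arguments, same kind of result (a named list)
res_lapply <- lapply(x, mean)

str(res_lapply)</pre>
</div>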
|
||||
<p>There’s no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
|
||||
|
||||
# First find numeric columns
|
||||
num_cols <- sapply(df, is.numeric)
|
||||
|
@ -399,14 +399,14 @@ df
|
|||
<p>The code above uses a new function, <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code>. It’s similar to <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called <code><a href="https://purrr.tidyverse.org/reference/map.html">map_vec()</a></code> that we didn’t mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
|
||||
<p>Base R provides a stricter version of <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> called <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> call above with this <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> where we specify that we expect <code><a href="https://rdrr.io/r/base/numeric.html">is.numeric()</a></code> to return a logical vector of length 1:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">vapply(df, is.numeric, logical(1))
|
||||
<pre data-type="programlisting" data-code-language="r">vapply(df, is.numeric, logical(1))
|
||||
#> a b c d e
|
||||
#> TRUE TRUE FALSE FALSE TRUE</pre>
|
||||
</div>
|
||||
<p>The distinction between <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> and <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.</p>
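<p>One case where the difference shows up is zero-length input. The sketch below uses two hypothetical lists, <code>inputs</code> and <code>empty</code>, to compare what each function returns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical inputs for illustration
inputs <- list(1:3, 4:6)
empty <- list()

# With non-empty input, sapply() simplifies to an integer vector
s_full <- sapply(inputs, length)

# With empty input, sapply() has nothing to simplify and returns a list
s_empty <- sapply(empty, length)

# vapply() returns an integer vector in both cases, as declared
v_full <- vapply(inputs, length, integer(1))
v_empty <- vapply(empty, length, integer(1))

class(s_empty)
class(v_empty)</pre>
</div>
<p>Code that goes on to do arithmetic with the result would behave differently for <code>s_empty</code> and <code>v_empty</code>, which is why the stricter function is the safer choice inside functions.</p>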
|
||||
<p>Another important member of the apply family is <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> which computes a single grouped summary:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
group_by(cut) |>
|
||||
summarise(price = mean(price))
|
||||
#> # A tibble: 5 × 2
|
||||
|
@ -431,43 +431,43 @@ tapply(diamonds$price, diamonds$cut, mean)
|
|||
For loops</h1>
|
||||
<p>For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are a powerful and general tool that is important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">for (element in vector) {
|
||||
<pre data-type="programlisting" data-code-language="r">for (element in vector) {
|
||||
# do something with element
|
||||
}</pre>
|
||||
</div>
|
||||
<p>The most straightforward use of <code>for()</code> loops is to achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using <code>walk()</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |> walk(append_file)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |> walk(append_file)</pre>
|
||||
</div>
|
||||
<p>We could have used a for loop:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">for (path in paths) {
|
||||
<pre data-type="programlisting" data-code-language="r">for (path in paths) {
|
||||
append_file(path)
|
||||
}</pre>
|
||||
</div>
|
||||
<p>Things get a little trickier if you want to save the output of the for loop, for example reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
|
||||
files <- map(paths, readxl::read_excel)</pre>
|
||||
</div>
|
||||
<p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as <code>paths</code>, which we can create with <code><a href="https://rdrr.io/r/base/vector.html">vector()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- vector("list", length(paths))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">files <- vector("list", length(paths))</pre>
|
||||
</div>
|
||||
<p>Then instead of iterating over the elements of <code>paths</code>, we’ll iterate over their indices, using <code><a href="https://rdrr.io/r/base/seq.html">seq_along()</a></code> to generate one index for each element of paths:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">seq_along(paths)
|
||||
<pre data-type="programlisting" data-code-language="r">seq_along(paths)
|
||||
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre>
|
||||
</div>
|
||||
<p>Using the indices is important because it allows us to link each position in the input with the corresponding position in the output:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">for (i in seq_along(paths)) {
|
||||
<pre data-type="programlisting" data-code-language="r">for (i in seq_along(paths)) {
|
||||
files[[i]] <- readxl::read_excel(paths[[i]])
|
||||
}</pre>
|
||||
</div>
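<p>If you don’t have the Excel files at hand, the same pattern can be tried on toy data. This is only a sketch: <code>ids</code> stands in for the file paths and each small data frame stands in for a file’s contents, both hypothetical.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical inputs standing in for file paths
ids <- c("a", "b", "c")

# Preallocate a list with one slot per input
pieces <- vector("list", length(ids))

for (i in seq_along(ids)) {
  # Store each result at the position matching its input
  pieces[[i]] <- data.frame(id = ids[[i]], n = i)
}

# Stack the per-input data frames into a single data frame
combined <- do.call(rbind, pieces)
combined</pre>
</div>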
|
||||
<p>To combine the list of tibbles into a single tibble you can use <code><a href="https://rdrr.io/r/base/do.call.html">do.call()</a></code> + <code><a href="https://rdrr.io/r/base/cbind.html">rbind()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">do.call(rbind, files)
|
||||
<pre data-type="programlisting" data-code-language="r">do.call(rbind, files)
|
||||
#> # A tibble: 1,704 × 5
|
||||
#> country continent lifeExp pop gdpPercap
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl>
|
||||
|
@ -481,7 +481,7 @@ files <- map(paths, readxl::read_excel)</pre>
|
|||
</div>
|
||||
<p>Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">out <- NULL
|
||||
<pre data-type="programlisting" data-code-language="r">out <- NULL
|
||||
for (path in paths) {
|
||||
out <- rbind(out, readxl::read_excel(path))
|
||||
}</pre>
|
||||
|
@ -495,7 +495,7 @@ Plots</h1>
|
|||
<p>Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise: it takes very little typing to do a basic exploratory plot.</p>
|
||||
<p>There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here’s a quick example from the diamonds dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">hist(diamonds$carat)
|
||||
<pre data-type="programlisting" data-code-language="r">hist(diamonds$carat)
|
||||
|
||||
plot(diamonds$carat, diamonds$price)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
|
|
@ -12,7 +12,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including <strong>ggrepel</strong> and <strong>patchwork</strong>. Rather than loading those extensions here, we’ll refer to their functions explicitly, using the <code>::</code> notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don’t forget you’ll need to install those packages with <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> if you don’t already have them.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -22,7 +22,7 @@ Prerequisites</h2>
|
|||
Label</h1>
|
||||
<p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> function. This example adds a plot title:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(title = "Fuel efficiency generally decreases with engine size")</pre>
|
||||
|
@ -35,7 +35,7 @@ Label</h1>
|
|||
<ul><li><p><code>subtitle</code> adds additional detail in a smaller font beneath the title.</p></li>
|
||||
<li><p><code>caption</code> adds text at the bottom right of the plot, often used to describe the source of the data.</p></li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(
|
||||
|
@ -49,7 +49,7 @@ Label</h1>
|
|||
</div>
|
||||
<p>You can also use <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(
|
||||
|
@ -63,7 +63,7 @@ Label</h1>
|
|||
</div>
|
||||
<p>It’s possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="https://rdrr.io/r/base/substitute.html">quote()</a></code> and read about the available options in <code><a href="https://rdrr.io/r/grDevices/plotmath.html">?plotmath</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = runif(10),
|
||||
y = runif(10)
|
||||
)
|
||||
|
@ -100,7 +100,7 @@ Annotations</h1>
|
|||
<p>In addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> is similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p>
|
||||
<p>There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn’t terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">best_in_class <- mpg |>
|
||||
<pre data-type="programlisting" data-code-language="r">best_in_class <- mpg |>
|
||||
group_by(class) |>
|
||||
filter(row_number(desc(hwy)) == 1)
|
||||
|
||||
|
@ -113,7 +113,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
</div>
|
||||
<p>This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -122,7 +122,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
</div>
|
||||
<p>That helps a bit, but if you look closely at the top left-hand corner, you’ll notice that there are two labels practically on top of each other. This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same. There’s no way that we can fix these by applying the same transformation for every label. Instead, we can use the <strong>ggrepel</strong> package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don’t overlap:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_point(size = 3, shape = 1, data = best_in_class) +
|
||||
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)</pre>
|
||||
|
@ -133,7 +133,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
<p>Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.</p>
|
||||
<p>You can sometimes use the same idea to replace the legend with labels placed directly on the plot. It’s not wonderful for this plot, but it isn’t too bad. (<code>theme(legend.position = "none")</code> turns the legend off; we’ll talk about it more shortly.)</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">class_avg <- mpg |>
|
||||
<pre data-type="programlisting" data-code-language="r">class_avg <- mpg |>
|
||||
group_by(class) |>
|
||||
summarise(
|
||||
displ = median(displ),
|
||||
|
@ -155,7 +155,7 @@ ggplot(mpg, aes(displ, hwy, colour = class)) +
|
|||
</div>
|
||||
<p>Alternatively, you might just want to add a single label to the plot, but you’ll still need to create a data frame. Often, you want the label in the corner of the plot, so it’s convenient to create a new data frame using <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to compute the maximum values of x and y.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">label_info <- mpg |>
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- mpg |>
|
||||
summarise(
|
||||
displ = max(displ),
|
||||
hwy = max(hwy),
|
||||
|
@ -171,7 +171,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
</div>
|
||||
<p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we’re no longer computing the positions from <code>mpg</code>, we can use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> to create the data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">label_info <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- tibble(
|
||||
displ = Inf,
|
||||
hwy = Inf,
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy."
|
||||
|
@ -186,7 +186,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
</div>
|
||||
<p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="https://stringr.tidyverse.org/reference/str_wrap.html">stringr::str_wrap()</a></code> to automatically add line breaks, given the number of characters you want per line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">"Increasing engine size is related to decreasing fuel economy." |>
|
||||
<pre data-type="programlisting" data-code-language="r">"Increasing engine size is related to decreasing fuel economy." |>
|
||||
str_wrap(width = 40) |>
|
||||
writeLines()
|
||||
#> Increasing engine size is related to
|
||||
|
@ -223,12 +223,12 @@ Exercises</h2>
|
|||
Scales</h1>
|
||||
<p>The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive. Normally, ggplot2 automatically adds scales for you. For example, when you type:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class))</pre>
|
||||
</div>
|
||||
<p>ggplot2 automatically adds default scales behind the scenes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
scale_x_continuous() +
|
||||
scale_y_continuous() +
|
||||
|
@ -244,7 +244,7 @@ Scales</h1>
|
|||
Axis ticks and legend keys</h2>
|
||||
<p>There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: <code>breaks</code> and <code>labels</code>. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of <code>breaks</code> is to override the default choice:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point() +
|
||||
scale_y_continuous(breaks = seq(15, 40, by = 5))</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -253,7 +253,7 @@ Axis ticks and legend keys</h2>
|
|||
</div>
|
||||
<p>You can use <code>labels</code> in the same way (a character vector the same length as <code>breaks</code>), but you can also set it to <code>NULL</code> to suppress the labels altogether. This is useful for maps, or for publishing plots where you can’t share the absolute numbers.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point() +
|
||||
scale_x_continuous(labels = NULL) +
|
||||
scale_y_continuous(labels = NULL)</pre>
|
||||
|
@ -264,7 +264,7 @@ Axis ticks and legend keys</h2>
|
|||
<p>You can also use <code>breaks</code> and <code>labels</code> to control the appearance of legends. Collectively axes and legends are called <strong>guides</strong>. Axes are used for x and y aesthetics; legends are used for everything else.</p>
|
||||
<p>Another use of <code>breaks</code> is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">presidential |>
|
||||
<pre data-type="programlisting" data-code-language="r">presidential |>
|
||||
mutate(id = 33 + row_number()) |>
|
||||
ggplot(aes(start, id)) +
|
||||
geom_point() +
|
||||
|
@ -285,7 +285,7 @@ Legend layout</h2>
|
|||
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p>
|
||||
<p>To control the overall position of the legend, you need to use a <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">base <- ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">base <- ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class))
|
||||
|
||||
base + theme(legend.position = "left")
|
||||
|
@ -314,7 +314,7 @@ base + theme(legend.position = "right") # the default</pre>
|
|||
<p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p>
|
||||
<p>To control the display of individual legends, use <code><a href="https://ggplot2.tidyverse.org/reference/guides.html">guides()</a></code> along with <code><a href="https://ggplot2.tidyverse.org/reference/guide_legend.html">guide_legend()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/guide_colourbar.html">guide_colorbar()</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
theme(legend.position = "bottom") +
|
||||
|
@ -332,7 +332,7 @@ Replacing a scale</h2>
|
|||
<p>Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and colour, you’ll be able to quickly pick up other scale replacements.</p>
|
||||
<p>It’s very useful to plot transformations of your variable. For example, as we’ve seen in <a href="#chp-diamond-prices" data-type="xref">#chp-diamond-prices</a> it’s easier to see the precise relationship between <code>carat</code> and <code>price</code> if we log transform them:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(diamonds, aes(carat, price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(carat, price)) +
|
||||
geom_bin2d()
|
||||
|
||||
ggplot(diamonds, aes(log10(carat), log10(price))) +
|
||||
|
@ -350,7 +350,7 @@ ggplot(diamonds, aes(log10(carat), log10(price))) +
|
|||
</div>
|
||||
<p>However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(diamonds, aes(carat, price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(carat, price)) +
|
||||
geom_bin2d() +
|
||||
scale_x_log10() +
|
||||
scale_y_log10()</pre>
|
||||
|
@ -360,7 +360,7 @@ ggplot(diamonds, aes(log10(carat), log10(price))) +
|
|||
</div>
|
||||
<p>Another scale that is frequently customized is colour. The default categorical scale picks colors that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = drv))
|
||||
|
||||
ggplot(mpg, aes(displ, hwy)) +
|
||||
|
@ -379,7 +379,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
</div>
|
||||
<p>Don’t forget simpler techniques. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = drv, shape = drv)) +
|
||||
scale_colour_brewer(palette = "Set1")</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -397,7 +397,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
</div>
|
||||
<p>When you have a predefined mapping between values and colors, use <code><a href="https://ggplot2.tidyverse.org/reference/scale_manual.html">scale_colour_manual()</a></code>. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">presidential |>
|
||||
<pre data-type="programlisting" data-code-language="r">presidential |>
|
||||
mutate(id = 33 + row_number()) |>
|
||||
ggplot(aes(start, id, colour = party)) +
|
||||
geom_point() +
|
||||
|
@ -410,7 +410,7 @@ ggplot(mpg, aes(displ, hwy)) +
|
|||
<p>For continuous colour, you can use the built-in <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_colour_gradient()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_fill_gradient()</a></code>. If you have a diverging scale, you can use <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_colour_gradient2()</a></code>. That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.</p>
|
||||
<p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = rnorm(10000),
|
||||
y = rnorm(10000)
|
||||
)
|
||||
|
@ -455,7 +455,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Why doesn’t the following code override the default scale?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(df, aes(x, y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
scale_colour_gradient(low = "white", high = "red") +
|
||||
coord_fixed()</pre>
|
||||
|
@ -473,7 +473,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Use <code>override.aes</code> to make the legend on the following plot easier to see.</p>
|
||||
<div class="cell" data-fig.format="png">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(diamonds, aes(carat, price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(carat, price)) +
|
||||
geom_point(aes(colour = cut), alpha = 1/20)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-31-1.png" style="width:50.0%"/></p>
|
||||
|
@ -493,7 +493,7 @@ Zooming</h1>
|
|||
</li>
|
||||
</ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>. Compare the following two plots:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, mapping = aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, mapping = aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth() +
|
||||
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
|
||||
|
@ -516,7 +516,7 @@ mpg |>
|
|||
</div>
|
||||
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to <em>expand</em> the limits, for example, to match scales across different plots. If we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">suv <- mpg |> filter(class == "suv")
|
||||
<pre data-type="programlisting" data-code-language="r">suv <- mpg |> filter(class == "suv")
|
||||
compact <- mpg |> filter(class == "compact")
|
||||
|
||||
ggplot(suv, aes(displ, hwy, colour = drv)) +
|
||||
|
@ -537,7 +537,7 @@ ggplot(compact, aes(displ, hwy, colour = drv)) +
|
|||
</div>
|
||||
<p>One way to overcome this problem is to share scales across multiple plots, training the scales with the <code>limits</code> of the full data.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">x_scale <- scale_x_continuous(limits = range(mpg$displ))
|
||||
<pre data-type="programlisting" data-code-language="r">x_scale <- scale_x_continuous(limits = range(mpg$displ))
|
||||
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
|
||||
col_scale <- scale_colour_discrete(limits = unique(mpg$drv))
|
||||
|
||||
|
@ -571,7 +571,7 @@ ggplot(compact, aes(displ, hwy, colour = drv)) +
|
|||
Themes</h1>
|
||||
<p>Finally, you can customize the non-data elements of your plot with a theme:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
theme_bw()</pre>
|
||||
|
@ -597,7 +597,7 @@ Themes</h1>
|
|||
Saving your plots</h1>
|
||||
<p>There are two main ways to get your plots out of R and into your final write-up: <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> and knitr. <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> will save the most recent plot to disk:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + geom_point()
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) + geom_point()
|
||||
ggsave("my-plot.pdf")
|
||||
#> Saving 6 x 4 in image</pre>
|
||||
</div>
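<p>Relying on the most recent plot can be fragile in longer scripts. As a hedged sketch, the plot object and its size can be passed explicitly through the <code>plot</code>, <code>width</code>, <code>height</code>, and <code>units</code> arguments of <code>ggsave()</code>. The names <code>p</code> and <code>out_file</code>, and the choice of a temporary directory, are assumptions made for illustration.</p>

```r
library(ggplot2)

# Keep the plot in a named object instead of relying on the last plot drawn.
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output location: a PDF in the session's temporary directory.
out_file <- file.path(tempdir(), "my-plot.pdf")

# Pass the plot and its dimensions explicitly.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

# Confirm that the file was written.
saved <- file.exists(out_file)
```

<p>Setting the dimensions yourself means the saved figure does not depend on the size of whatever graphics device happens to be open.</p>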
|
||||
|
|
|
@ -10,7 +10,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter, you’ll learn how to load flat files in R with the <strong>readr</strong> package, which is part of the core tidyverse.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -73,7 +73,7 @@ Reading data from a file</h1>
|
|||
</div>
|
||||
<p>We can read this file into R using <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>. The first argument is the most important: it’s the path to the file.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students <- read_csv("data/students.csv")
|
||||
<pre data-type="programlisting" data-code-language="r">students <- read_csv("data/students.csv")
|
||||
#> Rows: 6 Columns: 5
|
||||
#> ── Column specification ─────────────────────────────────────────────────────
|
||||
#> Delimiter: ","
|
||||
|
@ -91,7 +91,7 @@ Practical advice</h2>
|
|||
<p>Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the <code>students</code> data with that in mind.</p>
|
||||
<p>In the <code>favourite.food</code> column, there are a bunch of food items and then the character string <code>N/A</code>, which should have been a real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students <- read_csv("data/students.csv", na = c("N/A", ""))
|
||||
<pre data-type="programlisting" data-code-language="r">students <- read_csv("data/students.csv", na = c("N/A", ""))
|
||||
|
||||
students
|
||||
#> # A tibble: 6 × 5
|
||||
|
@ -106,7 +106,7 @@ students
|
|||
</div>
|
||||
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by back ticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those back ticks:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students |>
|
||||
<pre data-type="programlisting" data-code-language="r">students |>
|
||||
rename(
|
||||
student_id = `Student ID`,
|
||||
full_name = `Full Name`
|
||||
|
@ -123,7 +123,7 @@ students
|
|||
</div>
|
||||
<p>An alternative approach is to use <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code> to use some heuristics to turn them all into snake case at once<span data-type="footnote">The <a href="http://sfirke.github.io/janitor/">janitor</a> package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use <code>|></code>.</span>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students |> janitor::clean_names()
|
||||
<pre data-type="programlisting" data-code-language="r">students |> janitor::clean_names()
|
||||
#> # A tibble: 6 × 5
|
||||
#> student_id full_name favourite_food meal_plan age
|
||||
#> <dbl> <chr> <chr> <chr> <chr>
|
||||
|
@ -136,7 +136,7 @@ students
|
|||
</div>
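<p>As a rough illustration of why the back ticks matter, here is a base R sketch that avoids dplyr and janitor. The data frame <code>students_demo</code> and its values are hypothetical; <code>check.names = FALSE</code> is what stops <code>data.frame()</code> from rewriting the names that contain spaces.</p>

```r
# Hypothetical data frame whose column names contain spaces.
students_demo <- data.frame(
  `Student ID` = 1:2,
  `Full Name` = c("A. Example", "B. Example"),
  check.names = FALSE
)

# Back ticks let you refer to a non-syntactic name.
ids <- students_demo$`Student ID`

# Base R alternative to rename(): assign syntactic names directly.
names(students_demo) <- c("student_id", "full_name")

# After renaming, no back ticks are needed.
full_names <- students_demo$full_name
```

<p>Assigning to <code>names()</code> replaces every name by position, so it is easier to get wrong than <code>rename()</code>, which pairs each new name with the old one.</p>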
|
||||
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_plan</code> is a categorical variable with a known set of possible values, which in R should be represented as a factor:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students |>
|
||||
<pre data-type="programlisting" data-code-language="r">students |>
|
||||
janitor::clean_names() |>
|
||||
mutate(
|
||||
meal_plan = factor(meal_plan)
|
||||
|
@ -154,7 +154,7 @@ students
|
|||
<p>Note that the values in the <code>meal_plan</code> variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<code><chr></code>) to factor (<code><fct></code>). You’ll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
|
||||
<p>Before you move on to analyzing these data, you’ll probably want to fix the <code>age</code> column as well: currently it’s a character variable because of the one observation that is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students <- students |>
|
||||
<pre data-type="programlisting" data-code-language="r">students <- students |>
|
||||
janitor::clean_names() |>
|
||||
mutate(
|
||||
meal_plan = factor(meal_plan),
|
||||
|
@ -179,7 +179,7 @@ students
|
|||
Other arguments</h2>
|
||||
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read csv files that you’ve created in a string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv(
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"a,b,c
|
||||
1,2,3
|
||||
4,5,6"
|
||||
|
@ -192,7 +192,7 @@ Other arguments</h2>
|
|||
</div>
|
||||
<p>Usually <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But sometimes there are a few lines of metadata at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv(
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"The first line of metadata
|
||||
The second line of metadata
|
||||
x,y,z
|
||||
|
@ -217,7 +217,7 @@ read_csv(
|
|||
</div>
|
||||
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings, and instead label the columns sequentially from <code>X1</code> to <code>Xn</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv(
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"1,2,3
|
||||
4,5,6",
|
||||
col_names = FALSE
|
||||
|
@ -230,7 +230,7 @@ read_csv(
|
|||
</div>
|
||||
<p>Alternatively, you can pass <code>col_names</code> a character vector, which will be used as the column names:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv(
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"1,2,3
|
||||
4,5,6",
|
||||
col_names = c("x", "y", "z")
|
||||
|
@ -265,13 +265,13 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. What argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify to read the following text into a data frame?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">"x,y\n1,'a,b'"</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">"x,y\n1,'a,b'"</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Identify what is wrong with each of the following inline CSV files. What happens when you run the code?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv("a,b\n1,2,3\n4,5,6")
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv("a,b\n1,2,3\n4,5,6")
|
||||
read_csv("a,b,c\n1,2\n1,2,3,4")
|
||||
read_csv("a,b\n\"1")
|
||||
read_csv("a,b\n1,2\na,b")
|
||||
|
@ -285,7 +285,7 @@ read_csv("a;b\n1;3")</pre>
|
|||
<li>Creating a new column called <code>3</code> which is <code>2</code> divided by <code>1</code>.</li>
|
||||
<li>Renaming the columns to <code>one</code>, <code>two</code> and <code>three</code>.</li>
|
||||
</ol><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">annoying <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">annoying <- tibble(
|
||||
`1` = 1:10,
|
||||
`2` = `1` * 2 + rnorm(length(`1`))
|
||||
)</pre>
|
||||
|
@ -309,7 +309,7 @@ Guessing types</h2>
|
|||
<li>Otherwise, it must be a string.</li>
|
||||
</ul><p>You can see that behavior in action in this simple example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv("
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv("
|
||||
logical,numeric,date,string
|
||||
TRUE,1,2021-01-15,abc
|
||||
false,4.5,2021-02-15,def
|
||||
|
@ -341,7 +341,7 @@ Missing values, column types, and problems</h2>
|
|||
<p>The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the <code>NA</code> that readr expects.</p>
|
||||
<p>Take this simple one-column CSV file as an example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">csv <- "
|
||||
<pre data-type="programlisting" data-code-language="r">csv <- "
|
||||
x
|
||||
10
|
||||
.
|
||||
|
@ -350,7 +350,7 @@ Missing values, column types, and problems</h2>
|
|||
</div>
|
||||
<p>If we read it without any additional arguments, <code>x</code> becomes a character column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- read_csv(csv)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- read_csv(csv)
|
||||
#> Rows: 4 Columns: 1
|
||||
#> ── Column specification ─────────────────────────────────────────────────────
|
||||
#> Delimiter: ","
|
||||
|
@ -361,7 +361,7 @@ Missing values, column types, and problems</h2>
|
|||
</div>
|
||||
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled amongst them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- read_csv(csv, col_types = list(x = col_double()))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- read_csv(csv, col_types = list(x = col_double()))
|
||||
#> Warning: One or more parsing issues, call `problems()` on your data frame for
|
||||
#> details, e.g.:
|
||||
#> dat <- vroom(...)
|
||||
|
@ -369,7 +369,7 @@ Missing values, column types, and problems</h2>
|
|||
</div>
|
||||
<p>Now <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> reports that there was a problem, and tells us we can find out more with <code><a href="https://readr.tidyverse.org/reference/problems.html">problems()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">problems(df)
|
||||
<pre data-type="programlisting" data-code-language="r">problems(df)
|
||||
#> # A tibble: 1 × 5
|
||||
#> row col expected actual file
|
||||
#> <int> <int> <chr> <chr> <chr>
|
||||
|
@ -377,7 +377,7 @@ Missing values, column types, and problems</h2>
|
|||
</div>
|
||||
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. If we then set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- read_csv(csv, na = ".")
|
||||
<pre data-type="programlisting" data-code-language="r">df <- read_csv(csv, na = ".")
|
||||
#> Rows: 4 Columns: 1
|
||||
#> ── Column specification ─────────────────────────────────────────────────────
|
||||
#> Delimiter: ","
|
||||
|
@ -406,7 +406,7 @@ Column types</h2>
|
|||
<code><a href="https://readr.tidyverse.org/reference/col_skip.html">col_skip()</a></code> skips a column so it’s not included in the result.</li>
|
||||
</ul><p>It’s also possible to override the default column type by switching from <code><a href="https://rdrr.io/r/base/list.html">list()</a></code> to <code><a href="https://readr.tidyverse.org/reference/cols.html">cols()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">csv <- "
|
||||
<pre data-type="programlisting" data-code-language="r">csv <- "
|
||||
x,y,z
|
||||
1,2,3"
|
||||
|
||||
|
@ -418,7 +418,7 @@ read_csv(csv, col_types = cols(.default = col_character()))
|
|||
</div>
|
||||
<p>Another useful helper is <code><a href="https://readr.tidyverse.org/reference/cols.html">cols_only()</a></code> which will read in only the columns you specify:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv(
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"x,y,z
|
||||
1,2,3",
|
||||
col_types = cols_only(x = col_character())
|
||||
|
@ -436,7 +436,7 @@ read_csv(csv, col_types = cols(.default = col_character()))
|
|||
Reading data from multiple files</h1>
|
||||
<p>Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: <code>01-sales.csv</code> for January, <code>02-sales.csv</code> for February, and <code>03-sales.csv</code> for March. With <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> you can read these data in at once and stack them on top of each other in a single data frame.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
|
||||
<pre data-type="programlisting" data-code-language="r">sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
|
||||
read_csv(sales_files, id = "file")
|
||||
#> Rows: 19 Columns: 6
|
||||
#> ── Column specification ─────────────────────────────────────────────────────
|
||||
|
@ -460,7 +460,7 @@ read_csv(sales_files, id = "file")
|
|||
<p>With the additional <code>id</code> parameter we have added a new column called <code>file</code> to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.</p>
|
||||
<p>If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
|
||||
sales_files
|
||||
#> [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"</pre>
|
||||
</div>
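<p>The two steps above, finding the files with <code>list.files()</code> and then passing them to <code>read_csv()</code>, can be combined as in the following minimal sketch. The temporary folder, the <code>amount</code> column, and the file contents are assumptions made up for illustration so that the code runs on its own; they are not the sales data used in this chapter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Hypothetical setup: write three tiny monthly files to a temporary folder
# so the sketch does not depend on the data/ directory used above.
sales_dir <- file.path(tempdir(), "sales-sketch")
dir.create(sales_dir, showWarnings = FALSE)
for (month in c("01", "02", "03")) {
  write_csv(
    data.frame(amount = c(10, 20)),
    file.path(sales_dir, paste0(month, "-sales.csv"))
  )
}

# Find the files by pattern, then read and stack them in a single call.
# id = "file" records which file each row came from.
sales_files <- list.files(sales_dir, pattern = "sales\\.csv$", full.names = TRUE)
sales <- read_csv(sales_files, id = "file", show_col_types = FALSE)
nrow(sales)
#> [1] 6</pre>
</div>
<p>Because the file names are discovered rather than typed, adding a fourth month to the folder would not require any change to the reading code.</p>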
|
||||
|
@ -472,11 +472,11 @@ Writing to a file</h1>
|
|||
<p>readr also comes with two useful functions for writing data back to disk: <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_tsv()</a></code>. Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.</p>
|
||||
<p>The most important arguments are <code>x</code> (the data frame to save), and <code>file</code> (the location to save it). You can also specify how missing values are written with <code>na</code>, and if you want to <code>append</code> to an existing file.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">write_csv(students, "students.csv")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">write_csv(students, "students.csv")</pre>
|
||||
</div>
|
||||
<p>Now let’s read that csv file back in. Note that the type information is lost when you save to csv:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students
|
||||
<pre data-type="programlisting" data-code-language="r">students
|
||||
#> # A tibble: 6 × 5
|
||||
#> student_id full_name favourite_food meal_plan age
|
||||
#> <dbl> <chr> <chr> <fct> <dbl>
|
||||
|
@ -502,7 +502,7 @@ read_csv("students-2.csv")
|
|||
<ol type="1"><li>
|
||||
<p><code><a href="https://readr.tidyverse.org/reference/read_rds.html">write_rds()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_rds.html">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="https://rdrr.io/r/base/readRDS.html">readRDS()</a></code> and <code><a href="https://rdrr.io/r/base/readRDS.html">saveRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">write_rds(students, "students.rds")
|
||||
<pre data-type="programlisting" data-code-language="r">write_rds(students, "students.rds")
|
||||
read_rds("students.rds")
|
||||
#> # A tibble: 6 × 5
|
||||
#> student_id full_name favourite_food meal_plan age
|
||||
|
@ -518,7 +518,7 @@ read_rds("students.rds")
|
|||
<li>
|
||||
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(arrow)
|
||||
<pre data-type="programlisting" data-code-language="r">library(arrow)
|
||||
write_parquet(students, "students.parquet")
|
||||
read_parquet("students.parquet")
|
||||
#> # A tibble: 6 × 5
|
||||
|
@ -540,7 +540,7 @@ read_parquet("students.parquet")
|
|||
Data entry</h1>
|
||||
<p>Sometimes you’ll need to assemble a tibble “by hand”, doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> works by column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">tibble(
|
||||
x = c(1, 2, 5),
|
||||
y = c("h", "m", "g"),
|
||||
z = c(0.08, 0.83, 0.60)
|
||||
|
@ -554,7 +554,7 @@ Data entry</h1>
|
|||
</div>
|
||||
<p>Note that every column in a tibble must be the same size, so you’ll get an error if they’re not:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">tibble(
|
||||
x = c(1, 2),
|
||||
y = c("h", "m", "g"),
|
||||
z = c(0.08, 0.83, 0.6)
|
||||
|
@ -567,7 +567,7 @@ Data entry</h1>
|
|||
</div>
|
||||
<p>Laying out the data by column can make it hard to see how the rows are related, so an alternative is <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>, short for <strong>tr</strong>ansposed t<strong>ibble</strong>, which lets you lay out your data row by row. <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> is customized for data entry in code: column headings start with <code>~</code> and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">tribble(
|
||||
~x, ~y, ~z,
|
||||
"h", 1, 0.08,
|
||||
"m", 2, 0.83,
|
||||
|
|
|
@ -19,7 +19,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
<p>From this chapter on, we’ll suppress the loading message from <code><a href="https://tidyverse.tidyverse.org">library(tidyverse)</a></code>.</p>
|
||||
</section>
|
||||
|
@ -32,7 +32,7 @@ Tidy data</h1>
|
|||
|
||||
<!-- TODO redraw as tables -->
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">table1
|
||||
<pre data-type="programlisting" data-code-language="r">table1
|
||||
#> # A tibble: 6 × 4
|
||||
#> country year cases population
|
||||
#> <chr> <int> <int> <int>
|
||||
|
@ -99,7 +99,7 @@ table4b # population
|
|||
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
|
||||
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with <code>table1</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Compute rate per 10,000
|
||||
<pre data-type="programlisting" data-code-language="r"># Compute rate per 10,000
|
||||
table1 |>
|
||||
mutate(
|
||||
rate = cases / population * 10000
|
||||
|
@ -164,7 +164,7 @@ Pivoting</h1>
|
|||
Data in column names</h2>
|
||||
<p>The <code>billboard</code> dataset records the billboard rank of songs in the year 2000:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">billboard
|
||||
<pre data-type="programlisting" data-code-language="r">billboard
|
||||
#> # A tibble: 317 × 79
|
||||
#> artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
|
||||
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
|
@ -192,7 +192,7 @@ Data in column names</h2>
|
|||
<code>values_to</code> names the variable stored in the cell values, here <code>"rank"</code>.</li>
|
||||
</ul><p>That gives the following call:</p>
|
||||
<div class="cell" data-r.options="{"pillar.print_min":10}">
|
||||
<pre data-type="programlisting" data-code-language="downlit">billboard |>
|
||||
<pre data-type="programlisting" data-code-language="r">billboard |>
|
||||
pivot_longer(
|
||||
cols = starts_with("wk"),
|
||||
names_to = "week",
|
||||
|
@ -215,7 +215,7 @@ Data in column names</h2>
|
|||
</div>
|
||||
<p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">billboard |>
|
||||
<pre data-type="programlisting" data-code-language="r">billboard |>
|
||||
pivot_longer(
|
||||
cols = starts_with("wk"),
|
||||
names_to = "week",
|
||||
|
@ -236,7 +236,7 @@ Data in column names</h2>
|
|||
<p>You might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns <code>wk77</code>, <code>wk78</code>, … would be added to the dataset.</p>
|
||||
<p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">readr::parse_number()</a></code>. <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p>
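<p>As a quick illustration of what <code>parse_number()</code> does in isolation, here is a minimal sketch. The <code>week_labels</code> vector is a made-up stand-in for the kind of values that end up in the <code>week</code> column; it is not taken from the <code>billboard</code> data itself.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Hypothetical labels shaped like the wk1, wk2, ... column names.
week_labels <- c("wk1", "wk2", "wk10", "wk76")

# parse_number() drops the non-numeric prefix and returns a numeric vector.
parse_number(week_labels)
#> [1]  1  2 10 76</pre>
</div>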
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">billboard_tidy <- billboard |>
|
||||
<pre data-type="programlisting" data-code-language="r">billboard_tidy <- billboard |>
|
||||
pivot_longer(
|
||||
cols = starts_with("wk"),
|
||||
names_to = "week",
|
||||
|
@ -260,7 +260,7 @@ billboard_tidy
|
|||
</div>
|
||||
<p>Now we’re in a good position to look at how song ranks vary over time by drawing a plot. The code is shown below and the result is <a href="#fig-billboard-ranks" data-type="xref">#fig-billboard-ranks</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">billboard_tidy |>
|
||||
<pre data-type="programlisting" data-code-language="r">billboard_tidy |>
|
||||
ggplot(aes(week, rank, group = track)) +
|
||||
geom_line(alpha = 1/3) +
|
||||
scale_y_reverse()</pre>
|
||||
|
@ -278,7 +278,7 @@ billboard_tidy
|
|||
How does pivoting work?</h2>
|
||||
<p>Now that you’ve seen what pivoting can do for you, it’s worth taking a little time to gain some intuition about what it does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
~var, ~col1, ~col2,
|
||||
"A", 1, 2,
|
||||
"B", 3, 4,
|
||||
|
@ -287,7 +287,7 @@ How does pivoting work?</h2>
|
|||
</div>
|
||||
<p>Here we’ll say there are three variables: <code>var</code> (already in a variable), <code>name</code> (stored in the column names), and <code>value</code> (the cell values). So we can tidy it with:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
pivot_longer(
|
||||
cols = col1:col2,
|
||||
names_to = "names",
|
||||
|
@ -337,7 +337,7 @@ How does pivoting work?</h2>
|
|||
Many variables in column names</h2>
|
||||
<p>A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the <code>who2</code> dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">who2
|
||||
<pre data-type="programlisting" data-code-language="r">who2
|
||||
#> # A tibble: 7,240 × 58
|
||||
#> country year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65
|
||||
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
|
@ -358,7 +358,7 @@ Many variables in column names</h2>
|
|||
<p>This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code> is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2535</code>/<code>3544</code>/<code>4554</code>/<code>65</code> is the <code>age</code> range.</p>
|
||||
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column names, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">who2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">who2 |>
|
||||
pivot_longer(
|
||||
cols = !(country:year),
|
||||
names_to = c("diagnosis", "gender", "age"),
|
||||
|
@ -393,7 +393,7 @@ Many variables in column names</h2>
|
|||
Data and variable names in the column headers</h2>
|
||||
<p>The next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the <code>household</code> dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">household
|
||||
<pre data-type="programlisting" data-code-language="r">household
|
||||
#> # A tibble: 5 × 5
|
||||
#> family dob_child1 dob_child2 name_child1 name_child2
|
||||
#> <int> <date> <date> <chr> <chr>
|
||||
|
@ -405,7 +405,7 @@ Data and variable names in the column headers</h2>
|
|||
</div>
|
||||
<p>This dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (<code>dob</code>, <code>name</code>) and the values of another (<code>child</code>, with values 1 and 2). To solve this problem we again need to supply a vector to <code>names_to</code> but this time we use the special <code>".value"</code> sentinel. This overrides the usual <code>values_to</code> argument to use the first component of the pivoted column name as a variable name in the output.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">household |>
|
||||
<pre data-type="programlisting" data-code-language="r">household |>
|
||||
pivot_longer(
|
||||
cols = !family,
|
||||
names_to = c(".value", "child"),
|
||||
|
@ -444,7 +444,7 @@ Widening data</h2>
|
|||
<p>So far we’ve used <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p>
|
||||
<p>We’ll start by looking at <code>cms_patient_experience</code>, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_experience
|
||||
#> # A tibble: 500 × 5
|
||||
#> org_pac_id org_nm measure_cd measure_title prf_r…¹
|
||||
#> <chr> <chr> <chr> <chr> <dbl>
|
||||
|
@ -458,7 +458,7 @@ Widening data</h2>
|
|||
</div>
|
||||
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |>
|
||||
distinct(measure_cd, measure_title)
|
||||
#> # A tibble: 6 × 2
|
||||
#> measure_cd measure_title
|
||||
|
@ -473,7 +473,7 @@ Widening data</h2>
|
|||
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> has the opposite interface to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |>
|
||||
pivot_wider(
|
||||
names_from = measure_cd,
|
||||
values_from = prf_rate
|
||||
|
@ -493,7 +493,7 @@ Widening data</h2>
|
|||
</div>
|
||||
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organization. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |>
|
||||
pivot_wider(
|
||||
id_cols = starts_with("org"),
|
||||
names_from = measure_cd,
|
||||
|
@ -519,7 +519,7 @@ Widening data</h2>
|
|||
How does <code>pivot_wider()</code> work?</h2>
|
||||
<p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
~id, ~name, ~value,
|
||||
"A", "x", 1,
|
||||
"B", "y", 2,
|
||||
|
@ -530,7 +530,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
</div>
|
||||
<p>We’ll take the values from the <code>value</code> column and the names from the <code>name</code> column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
pivot_wider(
|
||||
names_from = name,
|
||||
values_from = value
|
||||
|
@ -544,7 +544,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
<p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p>
|
||||
<p>To begin the process, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> needs to first figure out what will go in the rows and columns. Finding the column names is easy: it’s just the values of <code>name</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
distinct(name)
|
||||
#> # A tibble: 3 × 1
|
||||
#> name
|
||||
|
@ -555,7 +555,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
</div>
|
||||
<p>By default, the rows in the output are formed by all the variables that aren’t going into the names or values. These are called the <code>id_cols</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
select(-name, -value) |>
|
||||
distinct()
|
||||
#> # A tibble: 2 × 1
|
||||
|
@ -566,7 +566,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
</div>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> then combines these results to generate an empty data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
select(-name, -value) |>
|
||||
distinct() |>
|
||||
mutate(x = NA, y = NA, z = NA)
|
||||
|
@ -579,7 +579,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
<p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
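<p>As a minimal sketch of the fill-in step, the code below pivots a small hypothetical dataset (the rows are made up for illustration; they are not taken from the chapter) in which id “B” has no value for name “z”. The resulting cell should come out as <code>NA</code>:</p>

```r
library(tidyr)
library(tibble)

# Hypothetical input: every id/name pair is present except ("B", "z")
df <- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "B", "y", 2,
  "A", "y", 3,
  "B", "x", 4,
  "A", "z", 5
)

# One row per id, one column per distinct name
wide <- df |>
  pivot_wider(
    names_from = name,
    values_from = value
  )

# The cell for id "B" and name "z" has no matching input row, so it is NA
wide$z[wide$id == "B"]
```

<p>If an explicit placeholder is preferable to <code>NA</code>, the <code>values_fill</code> argument of <code>pivot_wider()</code> is one way to supply it.</p>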
|
||||
<p>You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
~id, ~name, ~value,
|
||||
"A", "x", 1,
|
||||
"A", "x", 2,
|
||||
|
@ -590,7 +590,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
</div>
|
||||
<p>If we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> pivot_wider(
|
||||
<pre data-type="programlisting" data-code-language="r">df |> pivot_wider(
|
||||
names_from = name,
|
||||
values_from = value
|
||||
)
|
||||
|
@ -611,7 +611,7 @@ How does<code>pivot_wider()</code> work?</h2>
|
|||
</div>
|
||||
<p>Since you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(id, name) |>
|
||||
summarize(n = n(), .groups = "drop") |>
|
||||
filter(n > 1L)
|
||||
|
@ -635,7 +635,7 @@ Untidy data</h1>
|
|||
Presenting data to humans</h2>
|
||||
<p>As you’ve seen, <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(clarity, color)
|
||||
#> # A tibble: 56 × 3
|
||||
#> clarity color n
|
||||
|
@ -650,7 +650,7 @@ Presenting data to humans</h2>
|
|||
</div>
|
||||
<p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to create a form more suitable for display to other humans:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(clarity, color) |>
|
||||
pivot_wider(
|
||||
names_from = color,
|
||||
|
@ -677,7 +677,7 @@ Multivariate statistics</h2>
|
|||
<p>Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable. Sometimes these formats have substantial performance or space advantages, or sometimes they’re just necessary to get closer to the underlying matrix mathematics.</p>
|
||||
<p>We’re not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need. For example, let’s imagine you wanted to cluster the gapminder data to find countries that had similar progression of <code>gdpPercap</code> over time. To do this, we need one row for each country and one column for each year:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(gapminder)
|
||||
<pre data-type="programlisting" data-code-language="r">library(gapminder)
|
||||
|
||||
col_year <- gapminder |>
|
||||
mutate(gdpPercap = log10(gdpPercap)) |>
|
||||
|
@ -701,7 +701,7 @@ col_year
|
|||
</div>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">col_year <- col_year |>
|
||||
<pre data-type="programlisting" data-code-language="r">col_year <- col_year |>
|
||||
column_to_rownames("country")
|
||||
|
||||
head(col_year)
|
||||
|
@ -723,11 +723,11 @@ head(col_year)
|
|||
<p>This makes a data frame, because tibbles don’t support row names<span data-type="footnote">Tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.</span>.</p>
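<p>A small sketch of that conversion, using a made-up two-row tibble (the column names <code>y1952</code> and <code>y1957</code> are hypothetical) rather than the gapminder data:</p>

```r
library(tibble)

# Hypothetical wide data: one row per country, one column per year
scores <- tibble(
  country = c("A", "B"),
  y1952 = c(1.2, 3.4),
  y1957 = c(1.5, 3.1)
)

# Move the identifier out of the columns and into the row names
with_names <- scores |>
  column_to_rownames("country")

is_tibble(with_names)   # no longer a tibble, just a plain data frame
rownames(with_names)    # the former country column

# With only numeric columns left, it converts cleanly to a matrix
m <- as.matrix(with_names)
```

<p>The matrix keeps the row names, which is the shape that matrix-based functions generally expect.</p>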
|
||||
<p>We’re now ready to cluster with (e.g.) <code><a href="https://rdrr.io/r/stats/kmeans.html">kmeans()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cluster <- stats::kmeans(col_year, centers = 6)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">cluster <- stats::kmeans(col_year, centers = 6)</pre>
|
||||
</div>
|
||||
<p>Extracting the data out of this object into a form you can work with is a challenge you’ll need to come back to later in the book, once you’ve learned more about lists. But for now, you can get the clustering membership out with this code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cluster_id <- cluster$cluster |>
|
||||
<pre data-type="programlisting" data-code-language="r">cluster_id <- cluster$cluster |>
|
||||
enframe() |>
|
||||
rename(country = name, cluster_id = value)
|
||||
cluster_id
|
||||
|
@ -744,7 +744,7 @@ cluster_id
|
|||
</div>
|
||||
<p>You could then combine this back with the original data using one of the joins you’ll learn about in <a href="#chp-joins" data-type="xref">#chp-joins</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gapminder |> left_join(cluster_id)
|
||||
<pre data-type="programlisting" data-code-language="r">gapminder |> left_join(cluster_id)
|
||||
#> Joining with `by = join_by(country)`
|
||||
#> # A tibble: 1,704 × 7
|
||||
#> country continent year lifeExp pop gdpPercap cluster_id
|
||||
|
@ -764,7 +764,7 @@ cluster_id
|
|||
Pragmatic computation</h2>
|
||||
<p>Sometimes it’s just easier to answer a question using untidy data. For example, if you’re interested in just the total number of missing values in <code>cms_patient_experience</code>, it’s easier to work with the untidy form:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |>
|
||||
group_by(org_pac_id) |>
|
||||
summarize(
|
||||
n_miss = sum(is.na(prf_rate)),
|
||||
|
@ -785,7 +785,7 @@ Pragmatic computation</h2>
|
|||
<p>So if you’re stuck figuring out how to do some computation, maybe it’s time to switch up the organisation of your data. For computations involving a fixed number of values (like computing differences or ratios), it’s usually easier if the data is in columns; for those with a variable number of values (like sums or means) it’s usually easier in rows. Don’t be afraid to untidy, transform, and re-tidy if needed.</p>
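<p>To make the columns-versus-rows rule of thumb concrete, here is a rough sketch on a tiny invented dataset (the names <code>scores</code>, <code>site</code>, <code>before</code>, and <code>after</code> are assumptions for illustration only). A difference involves a fixed pair of values, so it is computed after pivoting wider; a total can involve any number of values, so it is computed on the long form:</p>

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long data: one row per site and measurement type
scores <- tribble(
  ~site, ~type, ~score,
  "a", "before", 10,
  "a", "after", 14,
  "b", "before", 7,
  "b", "after", 6
)

# Fixed number of values: put them in columns, then subtract
change <- scores |>
  pivot_wider(names_from = type, values_from = score) |>
  mutate(diff = after - before)

# Variable number of values: leave them in rows, then summarize
totals <- scores |>
  group_by(site) |>
  summarize(total = sum(score))
```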
|
||||
<p>Let’s explore this idea by looking at <code>cms_patient_care</code>, which has a similar structure to <code>cms_patient_experience</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_care
|
||||
#> # A tibble: 252 × 5
|
||||
#> ccn facility_name measure_abbr score type
|
||||
#> <chr> <chr> <chr> <dbl> <chr>
|
||||
|
@ -801,7 +801,7 @@ Pragmatic computation</h2>
|
|||
<ul><li>
|
||||
<p>If you want to compute the number of patients that answered yes to the question, you may pivot <code>type</code> into the columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_care |>
|
||||
pivot_wider(
|
||||
names_from = type,
|
||||
values_from = score
|
||||
|
@ -824,7 +824,7 @@ Pragmatic computation</h2>
|
|||
<li>
|
||||
<p>If you want to display the distribution of each metric, you may keep it as is so you could facet by <code>measure_abbr</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_care |>
|
||||
filter(type == "observed") |>
|
||||
ggplot(aes(score)) +
|
||||
geom_histogram(binwidth = 2) +
|
||||
|
@ -835,7 +835,7 @@ Pragmatic computation</h2>
|
|||
<li>
|
||||
<p>If you want to explore how different metrics are related, you may put the measure names in the columns so you could compare them in scatterplots.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care |>
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_care |>
|
||||
filter(type == "observed") |>
|
||||
select(-type) |>
|
||||
pivot_wider(
|
||||
|
|
|
@ -11,7 +11,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
|
||||
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
|
||||
library(tidyverse)
|
||||
#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
|
||||
|
@ -30,7 +30,7 @@ library(tidyverse)
|
|||
nycflights13</h2>
|
||||
<p>To explore the basic dplyr verbs, we’re going to use <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0">Bureau of Transportation Statistics</a>, and is documented in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">?flights</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights
|
||||
<pre data-type="programlisting" data-code-language="r">flights
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
|
@ -58,7 +58,7 @@ dplyr basics</h2>
|
|||
<li><p>The result is always a new data frame.</p></li>
|
||||
</ol><p>Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, <code>|></code>. The pipe takes the thing on its left and passes it along to the function on its right so that <code>x |> f(y)</code> is equivalent to <code>f(x, y)</code>, and <code>x |> f(y) |> g(z)</code> is equivalent to <code>g(f(x, y), z)</code>. The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest == "IAH") |>
|
||||
group_by(year, month, day) |>
|
||||
summarize(
|
||||
|
@ -81,7 +81,7 @@ Rows</h1>
|
|||
</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(arr_delay > 120)
|
||||
#> # A tibble: 10,034 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -99,7 +99,7 @@ Rows</h1>
|
|||
</div>
|
||||
<p>As well as <code>></code> (greater than), you can use <code>>=</code> (greater than or equal to), <code><</code> (less than), <code><=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Flights that departed on January 1
|
||||
<pre data-type="programlisting" data-code-language="r"># Flights that departed on January 1
|
||||
flights |>
|
||||
filter(month == 1 & day == 1)
|
||||
#> # A tibble: 842 × 19
|
||||
|
@ -135,7 +135,7 @@ flights |>
|
|||
</div>
|
||||
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># A shorter way to select flights that departed in January or February
|
||||
<pre data-type="programlisting" data-code-language="r"># A shorter way to select flights that departed in January or February
|
||||
flights |>
|
||||
filter(month %in% c(1, 2))
|
||||
#> # A tibble: 51,955 × 19
|
||||
|
@ -155,7 +155,7 @@ flights |>
|
|||
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
|
||||
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code><-</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">jan1 <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">jan1 <- flights |>
|
||||
filter(month == 1 & day == 1)</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -165,7 +165,7 @@ flights |>
|
|||
Common mistakes</h2>
|
||||
<p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> will let you know when this happens:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month = 1)
|
||||
#> Error in `filter()`:
|
||||
#> ! We detected a named input.
|
||||
|
@ -174,7 +174,7 @@ Common mistakes</h2>
|
|||
</div>
|
||||
<p>Another mistake is to write “or” statements like you would in English:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 1 | 2)</pre>
|
||||
</div>
|
||||
<p>This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in <a href="#sec-boolean-operations" data-type="xref">#sec-boolean-operations</a>.</p>
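<p>As a quick illustration of the symptom only (the full explanation is left to the later section), the sketch below uses a made-up four-row tibble instead of <code>flights</code>. The English-style condition keeps every row, while <code>%in%</code> keeps just the two intended months:</p>

```r
library(dplyr)

# Hypothetical data standing in for flights
df <- tibble(month = c(1, 2, 3, 12))

# English-style "or": runs without error, but no rows are dropped
wrong <- df |>
  filter(month == 1 | 2)

# What was actually meant: month is 1 or month is 2
right <- df |>
  filter(month %in% c(1, 2))

nrow(wrong)
nrow(right)
```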
|
||||
|
@ -186,7 +186,7 @@ Common mistakes</h2>
|
|||
</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(year, month, day, dep_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -204,7 +204,7 @@ Common mistakes</h2>
|
|||
</div>
|
||||
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(desc(dep_delay))
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -222,7 +222,7 @@ Common mistakes</h2>
|
|||
</div>
|
||||
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left roughly on time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_delay <= 10 & dep_delay >= -10) |>
|
||||
arrange(desc(arr_delay))
|
||||
#> # A tibble: 239,109 × 19
|
||||
|
@ -271,7 +271,7 @@ Columns</h1>
|
|||
</h2>
|
||||
<p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60
|
||||
|
@ -293,7 +293,7 @@ Columns</h1>
|
|||
</div>
|
||||
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
|
@ -316,7 +316,7 @@ Columns</h1>
|
|||
</div>
|
||||
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use a variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
|
@ -339,7 +339,7 @@ Columns</h1>
|
|||
</div>
|
||||
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful argument is <code>"used"</code> which allows you to see the inputs and outputs from your calculations:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
  mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
|
@ -365,7 +365,7 @@ Columns</h1>
|
|||
</h2>
|
||||
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Select columns by name
|
||||
<pre data-type="programlisting" data-code-language="r"># Select columns by name
|
||||
flights |>
|
||||
select(year, month, day)
|
||||
#> # A tibble: 336,776 × 3
|
||||
|
@ -436,7 +436,7 @@ flights |>
|
|||
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
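<p>For a taste of what that looks like, here is a minimal sketch on a one-row tibble that borrows a few column names from <code>flights</code> (the values are invented). The pattern is an assumption chosen for illustration: it keeps only the columns named <code>dep_delay</code> or <code>arr_delay</code>:</p>

```r
library(dplyr)

# Hypothetical one-row stand-in for flights
df <- tibble(
  dep_time = 517,
  dep_delay = 2,
  arr_time = 830,
  arr_delay = 11,
  carrier = "UA"
)

# Keep columns whose names match the regular expression
delays <- df |>
  select(matches("^(dep|arr)_delay$"))

names(delays)
```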
|
||||
<p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
select(tail_num = tailnum)
|
||||
#> # A tibble: 336,776 × 1
|
||||
#> tail_num
|
||||
|
@ -457,7 +457,7 @@ flights |>
|
|||
</h2>
|
||||
<p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
rename(tail_num = tailnum)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -483,7 +483,7 @@ flights |>
|
|||
</h2>
|
||||
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
relocate(time_hour, air_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> time_hour air_time year month day dep_time sched_dep…¹ dep_d…²
|
||||
|
@ -501,7 +501,7 @@ flights |>
|
|||
</div>
|
||||
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
relocate(year:dep_time, .after = time_hour)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
|
||||
|
@ -547,13 +547,13 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>What does the <code><a href="https://tidyselect.r-lib.org/reference/all_of.html">any_of()</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">variables <- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">variables <- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">select(flights, contains("TIME"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">select(flights, contains("TIME"))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
|
@ -570,7 +570,7 @@ Groups</h1>
|
|||
</h2>
|
||||
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> # Groups: month [12]
|
||||
|
@ -596,7 +596,7 @@ Groups</h1>
|
|||
</h2>
|
||||
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slight simplification; later on you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(dep_delay)
|
||||
|
@ -614,7 +614,7 @@ Groups</h1>
|
|||
</div>
|
||||
<p>Uh-oh! Something has gone wrong and all of our results are <code>NA</code> (pronounced “N-A”), R’s symbol for a missing value. We’ll come back to discuss missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>, but for now we’ll remove them by using <code>na.rm = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(dep_delay, na.rm = TRUE)
|
||||
|
@ -632,7 +632,7 @@ Groups</h1>
|
|||
</div>
|
||||
<p>You can create any number of summaries in a single call to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>, which returns the number of rows in each group:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(dep_delay, na.rm = TRUE),
|
||||
|
@ -668,7 +668,7 @@ The<code>slice_</code> functions</h2>
|
|||
<code>df |> slice_sample(n = 1)</code> takes one random row.</li>
|
||||
</ul><p>You can vary <code>n</code> to select more than one row, or instead of <code>n =</code>, you can use <code>prop = 0.1</code> to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
slice_max(arr_delay, n = 1)
|
||||
#> # A tibble: 108 × 19
|
||||
|
@ -688,7 +688,7 @@ The<code>slice_</code> functions</h2>
|
|||
</div>
|
||||
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarize(max_delay = max(arr_delay, na.rm = TRUE))
|
||||
#> Warning: There was 1 warning in `summarize()`.
|
||||
|
@ -714,7 +714,7 @@ The<code>slice_</code> functions</h2>
|
|||
Grouping by multiple variables</h2>
|
||||
<p>You can create groups using more than one variable. For example, we could make a group for each day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily <- flights |>
|
||||
group_by(year, month, day)
|
||||
daily
|
||||
#> # A tibble: 336,776 × 19
|
||||
|
@ -734,7 +734,7 @@ daily
|
|||
</div>
|
||||
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily_flights <- daily |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily_flights <- daily |>
|
||||
summarize(
|
||||
n = n()
|
||||
)
|
||||
|
@ -743,7 +743,7 @@ daily
|
|||
</div>
|
||||
<p>If you’re happy with this behavior, you can explicitly request it in order to suppress the message:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily_flights <- daily |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily_flights <- daily |>
|
||||
summarize(
|
||||
n = n(),
|
||||
.groups = "drop_last"
|
||||
|
@ -757,7 +757,7 @@ daily
|
|||
Ungrouping</h2>
|
||||
<p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">daily |>
|
||||
<pre data-type="programlisting" data-code-language="r">daily |>
|
||||
ungroup() |>
|
||||
summarize(
|
||||
delay = mean(dep_delay, na.rm = TRUE),
|
||||
|
@ -787,7 +787,7 @@ Exercises</h2>
|
|||
Case study: aggregates and sample size</h1>
|
||||
<p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">delays <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">delays <- flights |>
|
||||
filter(!is.na(arr_delay), !is.na(tailnum)) |>
|
||||
group_by(tailnum) |>
|
||||
summarize(
|
||||
|
@ -803,7 +803,7 @@ ggplot(delays, aes(delay)) +
|
|||
</div>
|
||||
<p>Wow, there are some planes that have an <em>average</em> delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(delays, aes(n, delay)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(delays, aes(n, delay)) +
|
||||
geom_point(alpha = 1/10)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus average delay. Delays for planes with a very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
|
||||
|
@ -812,7 +812,7 @@ ggplot(delays, aes(delay)) +
|
|||
<p>Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases<span data-type="footnote">*cough* the central limit theorem *cough*.</span>.</p>
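<p>The footnote’s point can be checked with a small simulation. This is only a sketch in base R, separate from the flights data: it assumes an exponential distribution as a stand-in for skewed delays, and <code>sd_of_means()</code> is a hypothetical helper written for this illustration, not part of dplyr or ggplot2.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Simulate how the spread of group means shrinks as group size grows.
set.seed(1014) # arbitrary seed, used only for reproducibility

# Hypothetical helper: draw `reps` groups of size `n` from a skewed
# (exponential) distribution and return the standard deviation of
# the group means.
sd_of_means <- function(n, reps = 2000) {
  means <- replicate(reps, mean(rexp(n, rate = 1)))
  sd(means)
}

sizes <- c(1, 5, 25, 100)
spread <- vapply(sizes, sd_of_means, numeric(1))

# One row per group size, with the observed spread of the means
data.frame(n = sizes, sd_of_mean = spread)</pre>
</div>
<p>Under these assumptions, the spread of the group means should shrink roughly in proportion to 1/sqrt(n), which matches the funnel shape of the scatterplot above.</p>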
|
||||
<p>When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">delays |>
|
||||
<pre data-type="programlisting" data-code-language="r">delays |>
|
||||
filter(n > 25) |>
|
||||
ggplot(aes(n, delay)) +
|
||||
geom_point(alpha = 1/10) +
|
||||
|
@ -824,7 +824,7 @@ ggplot(delays, aes(delay)) +
|
|||
<p>Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from <code>|></code> to <code>+</code>, but it’s not too much of a hassle once you get the hang of it.</p>
|
||||
<p>There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the <strong>Lahman</strong> package to compare what proportion of times a player hits the ball vs. the number of attempts they take:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">batters <- Lahman::Batting |>
|
||||
<pre data-type="programlisting" data-code-language="r">batters <- Lahman::Batting |>
|
||||
group_by(playerID) |>
|
||||
summarize(
|
||||
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
|
||||
|
@ -846,7 +846,7 @@ batters
|
|||
<ol type="1"><li><p>As above, the variation in our aggregate decreases as we get more data points.</p></li>
|
||||
<li><p>There’s a positive correlation between skill (<code>perf</code>) and opportunities to hit the ball (<code>n</code>) because obviously teams want to give their best batters the most opportunities to hit the ball.</p></li>
|
||||
</ol><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">batters |>
|
||||
<pre data-type="programlisting" data-code-language="r">batters |>
|
||||
filter(n > 100) |>
|
||||
ggplot(aes(n, perf)) +
|
||||
geom_point(alpha = 1 / 10) +
|
||||
|
@ -857,7 +857,7 @@ batters
|
|||
</div>
|
||||
<p>This also has important implications for ranking. If you naively sort on <code>desc(perf)</code>, the people with the best batting averages are clearly lucky, not skilled:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">batters |>
|
||||
<pre data-type="programlisting" data-code-language="r">batters |>
|
||||
arrange(desc(perf))
|
||||
#> # A tibble: 20,166 × 3
|
||||
#> playerID perf n
|
||||
|
|
|
@ -13,7 +13,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
|
||||
#> ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
|
||||
|
@ -26,7 +26,7 @@ Prerequisites</h2>
|
|||
<p>That one line of code loads the core tidyverse, the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).</p>
|
||||
<p>If you run this code and get the error message “there is no package called ‘tidyverse’”, you’ll need to first install it, then run <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> once again.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse")
|
||||
<pre data-type="programlisting" data-code-language="r">install.packages("tidyverse")
|
||||
library(tidyverse)</pre>
|
||||
</div>
|
||||
<p>You only need to install a package once, but you need to reload it every time you start a new session.</p>
|
||||
|
@ -43,7 +43,7 @@ First steps</h1>
|
|||
The<code>mpg</code> data frame</h2>
|
||||
<p>You can test your answer with the <code>mpg</code> <strong>data frame</strong> found in ggplot2 (a.k.a. <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">ggplot2::mpg</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>mpg</code> contains observations collected by the US Environmental Protection Agency on 38 car models.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">mpg
|
||||
<pre data-type="programlisting" data-code-language="r">mpg
|
||||
#> # A tibble: 234 × 11
|
||||
#> manufacturer model displ year cyl trans drv cty hwy fl class
|
||||
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
|
||||
|
@ -66,7 +66,7 @@ The<code>mpg</code> data frame</h2>
|
|||
Creating a ggplot</h2>
|
||||
<p>To plot <code>mpg</code>, run this code to put <code>displ</code> on the x-axis and <code>hwy</code> on the y-axis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-5-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association." width="576"/></p>
|
||||
|
@ -121,7 +121,7 @@ Aesthetic mappings</h1>
|
|||
</div>
|
||||
<p>You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the <code>class</code> variable to reveal the class of each car.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy, color = class))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-9-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. The points representing each car are colored according to the class of the car. The legend on the right of the plot shows the mapping between colors and levels of the class variable: 2seater, compact, midsize, minivan, pickup, or suv." width="576"/></p>
|
||||
|
@ -132,7 +132,7 @@ Aesthetic mappings</h1>
|
|||
<p>The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.</p>
|
||||
<p>In the above example, we mapped <code>class</code> to the color aesthetic, but we could have mapped <code>class</code> to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a <em>warning</em> here: mapping an unordered variable (<code>class</code>) to an ordered aesthetic (<code>size</code>) is generally not a good idea because it implies a ranking that does not in fact exist.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy, size = class))
|
||||
#> Warning: Using size for a discrete variable is not advised.</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -141,7 +141,7 @@ Aesthetic mappings</h1>
|
|||
</div>
|
||||
<p>Similarly, we could have mapped <code>class</code> to the <em>alpha</em> aesthetic, which controls the transparency of the points, or to the <em>shape</em> aesthetic, which controls the shape of the points.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Left
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
|
||||
ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
|
||||
|
||||
|
@ -164,7 +164,7 @@ ggplot(data = mpg) +
|
|||
<p>Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.</p>
|
||||
<p>You can also <em>set</em> the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-12-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue." width="576"/></p>
|
||||
|
@ -189,7 +189,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>What’s gone wrong with this code? Why are the points not blue?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-14-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are red and the legend shows a red point that is mapped to the word blue." width="576"/></p>
|
||||
|
@ -210,7 +210,7 @@ Common problems</h1>
|
|||
<p>As you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. We have all been writing R code for years, but every day we still write code that doesn’t work!</p>
|
||||
<p>Start by carefully comparing the code that you’re running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every <code>(</code> is matched with a <code>)</code> and every <code>"</code> is paired with another <code>"</code>. Sometimes you’ll run the code and nothing happens. Check the left-hand side of your console: if it’s a <code>+</code>, it means that R doesn’t think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.</p>
|
||||
<p>One common problem when creating ggplot2 graphics is to put the <code>+</code> in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:</p>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg)
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg)
|
||||
+ geom_point(mapping = aes(x = displ, y = hwy))</pre>
|
||||
<p>If you’re still stuck, try the help. You can get help about any R function by running <code>?function_name</code> in the console, or selecting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful - instead skip down to the examples and look for code that matches what you’re trying to do.</p>
|
||||
<p>If that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.</p>
|
||||
|
@ -222,7 +222,7 @@ Facets</h1>
|
|||
<p>One way to add additional variables to a plot is by mapping them to an aesthetic. Another way, which is particularly useful for categorical variables, is to split your plot into <strong>facets</strong>, subplots that each display one subset of the data.</p>
|
||||
<p>To facet your plot by a single variable, use <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> is a formula<span data-type="footnote">Here “formula” is the name of the type of thing created by <code>~</code>, not a synonym for “equation”.</span>, which you create with <code>~</code> followed by a variable name. The variable that you pass to <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> should be discrete.</p>
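<p>To make the footnote concrete, you can inspect a formula on its own in base R. This is a small sketch that doesn’t involve ggplot2; the name <code>f</code> is chosen only for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">f <- ~cyl     # a one-sided formula; `cyl` is captured, not evaluated
class(f)      # the type of object that `~` creates
all.vars(f)   # the variable names mentioned in the formula</pre>
</div>
<p>Because the variable name is captured rather than evaluated, this runs even when no object called <code>cyl</code> exists in your session.</p>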
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
facet_wrap(~cyl)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -231,7 +231,7 @@ Facets</h1>
|
|||
</div>
|
||||
<p>To facet your plot with the combination of two variables, switch from <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> to <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> is also a formula, but now it’s a double sided formula: <code>rows ~ cols</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ cyl)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -246,7 +246,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>What do the empty cells in the plot with <code>facet_grid(drv ~ cyl)</code> mean? How do they relate to this plot?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = drv, y = cyl))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-17-1.png" alt="Scatterplot of number of cylinders versus type of drive train of cars. The plot shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are front wheel drive." width="576"/></p>
|
||||
|
@ -256,7 +256,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>What plots does the following code make? What does <code>.</code> do?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ .)
|
||||
|
||||
|
@ -268,7 +268,7 @@ ggplot(data = mpg) +
|
|||
<li>
|
||||
<p>Take the first faceted plot in this section:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
facet_wrap(~ class, nrow = 2)</pre>
|
||||
</div>
|
||||
|
@ -278,7 +278,7 @@ ggplot(data = mpg) +
|
|||
<li>
|
||||
<p>Which of the following two plots makes it easier to compare engine size (<code>displ</code>) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ .)
|
||||
|
||||
|
@ -296,7 +296,7 @@ ggplot(data = mpg) +
|
|||
<li>
|
||||
<p>Recreate this plot using <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. How do the positions of the facet labels change?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ .)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -325,7 +325,7 @@ Geometric objects</h1>
|
|||
<p>A <strong>geom</strong> is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.</p>
|
||||
<p>To change the geom in your plot, change the geom function that you add to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. For instance, to make the plots above, you can use this code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Left
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
|
||||
ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy))
|
||||
|
||||
|
@ -335,7 +335,7 @@ ggplot(data = mpg) +
|
|||
</div>
|
||||
<p>Every geom function in ggplot2 takes a <code>mapping</code> argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you <em>could</em> set the linetype of a line. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-24-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed." width="576"/></p>
|
||||
|
@ -352,7 +352,7 @@ ggplot(data = mpg) +
|
|||
<p>ggplot2 provides more than 40 geoms, and extension packages provide even more (see <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a> for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at <a href="https://rstudio.com/resources/cheatsheets" class="uri">https://rstudio.com/resources/cheatsheets</a>. To learn more about any single geom, use the help (e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">?geom_smooth</a></code>).</p>
|
||||
<p>Many geoms, like <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>, use a single geometric object to display multiple rows of data. For these geoms, you can set the <code>group</code> aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the <code>linetype</code> example). It is convenient to rely on this feature because the <code>group</code> aesthetic by itself does not add a legend or distinguishing features to the geoms.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_smooth(mapping = aes(x = displ, y = hwy))
|
||||
|
||||
ggplot(data = mpg) +
|
||||
|
@ -379,7 +379,7 @@ ggplot(data = mpg) +
|
|||
</div>
|
||||
<p>To display multiple geoms in the same plot, add multiple geom functions to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy)) +
|
||||
geom_smooth(mapping = aes(x = displ, y = hwy))</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -388,13 +388,13 @@ ggplot(data = mpg) +
|
|||
</div>
|
||||
<p>This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display <code>cty</code> instead of <code>hwy</code>. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_smooth()</pre>
|
||||
</div>
|
||||
<p>If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings <em>for that layer only</em>. This makes it possible to display different aesthetics in different layers.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
geom_point(mapping = aes(color = class)) +
|
||||
geom_smooth()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -403,7 +403,7 @@ ggplot(data = mpg) +
|
|||
</div>
|
||||
<p>You can use the same idea to specify different <code>data</code> for each layer. Here, our smooth line displays just a subset of the <code>mpg</code> dataset, the subcompact cars. The local data argument in <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> overrides the global data argument in <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> for that layer only.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
geom_point(mapping = aes(color = class)) +
|
||||
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -419,7 +419,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point() +
|
||||
geom_smooth(se = FALSE)</pre>
|
||||
</div>
|
||||
|
@ -427,7 +427,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Earlier in this chapter we used <code>show.legend</code> without explaining it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_smooth(
|
||||
mapping = aes(x = displ, y = hwy, color = drv),
|
||||
show.legend = FALSE
|
||||
|
@ -439,7 +439,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Will these two graphs look different? Why/why not?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_smooth()
|
||||
|
||||
|
@ -485,7 +485,7 @@ ggplot() +
|
|||
Statistical transformations</h1>
|
||||
<p>Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code>. The following chart displays the total number of diamonds in the <code>diamonds</code> dataset, grouped by <code>cut</code>. The <code>diamonds</code> dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the <code>price</code>, <code>carat</code>, <code>color</code>, <code>clarity</code>, and <code>cut</code> of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-35-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
|
||||
|
@ -507,7 +507,7 @@ Statistical transformations</h1>
|
|||
<p>You can learn which stat a geom uses by inspecting the default value for the <code>stat</code> argument. For example, <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">?geom_bar</a></code> shows that the default value for <code>stat</code> is “count”, which means that <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> uses <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code> is documented on the same page as <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>. If you scroll down, the section called “Computed variables” explains that it computes two new variables: <code>count</code> and <code>prop</code>.</p>
|
||||
<p>You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
stat_count(mapping = aes(x = cut))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-37-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
|
||||
|
@ -517,7 +517,7 @@ Statistical transformations</h1>
|
|||
<ol type="1"><li>
|
||||
<p>You might want to override the default stat. In the code below, we change the stat of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> from count (the default) to identity. This lets you map the height of the bars to the raw values of a <span class="math inline">\(y\)</span> variable. Unfortunately, when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or to the previous bar chart, where the height of the bar is generated by counting rows.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">demo <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">demo <- tribble(
|
||||
~cut, ~freq,
|
||||
"Fair", 1610,
|
||||
"Good", 4906,
|
||||
|
@ -537,7 +537,7 @@ ggplot(data = demo) +
|
|||
<li>
|
||||
<p>You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-39-1.png" alt="Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and Ideal 0.40." width="576"/></p>
|
||||
|
@ -548,7 +548,7 @@ ggplot(data = demo) +
|
|||
<li>
|
||||
<p>You might want to draw greater attention to the statistical transformation in your code. For example, you might use <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
stat_summary(
|
||||
mapping = aes(x = cut, y = depth),
|
||||
fun.min = min,
|
||||
|
@ -572,7 +572,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>In our proportion bar chart, we need to set <code>group = 1</code>. Why? In other words, what is the problem with these two graphs?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
|
||||
ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))</pre>
|
||||
|
@ -586,7 +586,7 @@ ggplot(data = diamonds) +
|
|||
Position adjustments</h1>
|
||||
<p>There’s one more piece of magic associated with bar charts. You can color a bar chart using either the <code>color</code> aesthetic, or, more usefully, <code>fill</code>:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, color = cut))
|
||||
ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, fill = cut))</pre>
|
||||
|
@ -603,7 +603,7 @@ ggplot(data = diamonds) +
|
|||
</div>
|
||||
<p>Note what happens if you map the fill aesthetic to another variable, like <code>clarity</code>: the bars are automatically stacked. Each colored rectangle represents a combination of <code>cut</code> and <code>clarity</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, fill = clarity))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-43-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level." width="576"/></p>
|
||||
|
@ -613,7 +613,7 @@ ggplot(data = diamonds) +
|
|||
<ul><li>
|
||||
<p><code>position = "identity"</code> will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting <code>alpha</code> to a small value, or completely transparent by setting <code>fill = NA</code>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
|
||||
geom_bar(alpha = 1/5, position = "identity")
|
||||
ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
|
||||
geom_bar(fill = NA, position = "identity")</pre>
|
||||
|
@ -633,7 +633,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
|
|||
<li>
|
||||
<p><code>position = "fill"</code> works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-45-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Height of each bar is 1 and heights of the colored segments are proportional to the proportion of diamonds with a given clarity level within a given cut level." width="576"/></p>
|
||||
|
@ -643,7 +643,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
|
|||
<li>
|
||||
<p><code>position = "dodge"</code> places overlapping objects directly <em>beside</em> one another. This makes it easier to compare individual values.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds) +
|
||||
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-46-1.png" alt="Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity." width="576"/></p>
|
||||
|
@ -659,7 +659,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
|
|||
<p>The underlying values of <code>hwy</code> and <code>displ</code> are rounded so the points appear on a grid and many points overlap each other. This problem is known as <strong>overplotting</strong>. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of <code>hwy</code> and <code>displ</code> that contains 109 values?</p>
|
||||
<p>You can avoid this gridding by setting the position adjustment to “jitter”. <code>position = "jitter"</code> adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg) +
|
||||
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-48-1.png" alt="Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association." width="576"/></p>
|
||||
|
@ -674,7 +674,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>What is the problem with this plot? How could you improve it?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-49-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset." width="576"/></p>
|
||||
|
@ -694,7 +694,7 @@ Coordinate systems</h1>
|
|||
<ul><li>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html">coord_flip()</a></code> switches the x and y axes. This is useful, for example, if you want horizontal boxplots. It’s also useful for long labels, which are hard to fit on the x-axis without overlapping.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
geom_boxplot()
|
||||
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
geom_boxplot() +
|
||||
|
@ -712,7 +712,7 @@ ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
|||
</div>
|
||||
<p>However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(y = class, x = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(y = class, x = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-visualize_files/figure-html/unnamed-chunk-51-1.png" alt="Side-by-side box plots of highway fuel efficiency of cars. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
|
||||
|
@ -722,7 +722,7 @@ ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
|||
<li>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the <a href="https://ggplot2-book.org/maps.html">Maps chapter</a> of <em>ggplot2: Elegant graphics for data analysis</em>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">nz <- map_data("nz")
|
||||
<pre data-type="programlisting" data-code-language="r">nz <- map_data("nz")
|
||||
|
||||
ggplot(nz, aes(long, lat, group = group)) +
|
||||
geom_polygon(fill = "white", color = "black")
|
||||
|
@ -745,7 +745,7 @@ ggplot(nz, aes(long, lat, group = group)) +
|
|||
<li>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code> uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">bar <- ggplot(data = diamonds) +
|
||||
<pre data-type="programlisting" data-code-language="r">bar <- ggplot(data = diamonds) +
|
||||
geom_bar(
|
||||
mapping = aes(x = cut, fill = cut),
|
||||
show.legend = FALSE,
|
||||
|
@ -778,7 +778,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>What does the plot below tell you about the relationship between city and highway mpg? Why is <code><a href="https://ggplot2.tidyverse.org/reference/coord_fixed.html">coord_fixed()</a></code> important? What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_abline()</a></code> do?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_abline() +
|
||||
coord_fixed()</pre>
|
||||
|
|
|
@ -11,7 +11,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(DBI)
|
||||
<pre data-type="programlisting" data-code-language="r">library(DBI)
|
||||
library(dbplyr)
|
||||
library(tidyverse)</pre>
|
||||
</div>
|
||||
|
@ -43,7 +43,7 @@ Connecting to a database</h1>
|
|||
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
|
||||
<p>Concretely, you create a database connection using <code><a href="https://dbi.r-dbi.org/reference/dbConnect.html">DBI::dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(
|
||||
RMariaDB::MariaDB(),
|
||||
username = "foo"
|
||||
)
|
||||
|
@ -61,11 +61,11 @@ In this book</h2>
|
|||
<p>Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.</p>
|
||||
<p>Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(duckdb::duckdb())</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb())</pre>
|
||||
</div>
|
||||
<p>duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the <code>dbdir</code> argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (<a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>), it’s reasonable to store it in the <code>duckdb</code> directory of the current project:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
@ -74,7 +74,7 @@ In this book</h2>
|
|||
Load some data</h2>
|
||||
<p>Since this is a new database, we need to start by adding some data. Here we’ll add <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">DBI::dbWriteTable()</a></code>. The simplest usage of <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dbWriteTable(con, "mpg", ggplot2::mpg)
|
||||
<pre data-type="programlisting" data-code-language="r">dbWriteTable(con, "mpg", ggplot2::mpg)
|
||||
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
|
||||
</div>
|
||||
<p>If you’re using duckdb in a real project, we highly recommend learning about <code>duckdb_read_csv()</code> and <code>duckdb_register_arrow()</code>. These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.</p>
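<p>As a rough sketch of the first of these, the code below writes <code>mpg</code> to a temporary CSV file and then loads that file into duckdb with <code>duckdb_read_csv()</code>. It assumes the duckdb and ggplot2 packages are installed; <code>tmp_con</code>, <code>csv_path</code>, and the table name <code>mpg_csv</code> are hypothetical names chosen for illustration, and a separate connection is used so the example does not touch <code>con</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A throwaway in-memory database, separate from `con`
tmp_con <- DBI::dbConnect(duckdb::duckdb())

# Create a CSV file to stand in for data that already lives on disk
csv_path <- tempfile(fileext = ".csv")
write.csv(ggplot2::mpg, csv_path, row.names = FALSE)

# Load the CSV file into a new table called mpg_csv
duckdb::duckdb_read_csv(tmp_con, "mpg_csv", csv_path)

# Confirm that the table exists and count its rows
mpg_csv_exists <- DBI::dbExistsTable(tmp_con, "mpg_csv")
mpg_csv_rows <- DBI::dbGetQuery(tmp_con, "SELECT COUNT(*) AS n FROM mpg_csv")$n

# Close the connection and shut down the temporary database
DBI::dbDisconnect(tmp_con, shutdown = TRUE)</pre>
</div>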
|
||||
|
@ -92,7 +92,7 @@ DBI basics</h1>
|
|||
What’s there?</h2>
|
||||
<p>The most important database objects for data scientists are tables. DBI provides two useful functions to either list all the tables in the database<span data-type="footnote">At least, all the tables that you have permission to see.</span> or to check if a specific table already exists:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dbListTables(con)
|
||||
<pre data-type="programlisting" data-code-language="r">dbListTables(con)
|
||||
#> [1] "diamonds" "mpg"
|
||||
dbExistsTable(con, "foo")
|
||||
#> [1] FALSE</pre>
|
||||
|
@ -104,7 +104,7 @@ dbExistsTable(con, "foo")
|
|||
Extract some data</h2>
|
||||
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con |>
|
||||
<pre data-type="programlisting" data-code-language="r">con |>
|
||||
dbReadTable("diamonds") |>
|
||||
as_tibble()
|
||||
#> # A tibble: 53,940 × 10
|
||||
|
@ -127,7 +127,7 @@ Extract some data</h2>
|
|||
Run a query</h2>
|
||||
<p>The way you’ll usually retrieve data is with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sql <- "
|
||||
<pre data-type="programlisting" data-code-language="r">sql <- "
|
||||
SELECT carat, cut, clarity, color, price
|
||||
FROM diamonds
|
||||
WHERE price > 15000
|
||||
|
@ -161,7 +161,7 @@ dbplyr basics</h1>
|
|||
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="https://dtplyr.tidyverse.org">dtplyr</a>, which translates to <a href="https://r-datatable.com">data.table</a>, and <a href="https://multidplyr.tidyverse.org">multidplyr</a>, which executes your code on multiple cores.</p>
|
||||
<p>To use dbplyr, you must first use <code><a href="https://dplyr.tidyverse.org/reference/tbl.html">tbl()</a></code> to create an object that represents a database table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, "diamonds")
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, "diamonds")
|
||||
diamonds_db
|
||||
#> # Source: table<diamonds> [?? x 10]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
|
@ -183,10 +183,10 @@ diamonds_db
|
|||
</div>
|
||||
|
||||
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
|
||||
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
|
||||
|
@ -197,7 +197,7 @@ FROM `planes`</pre></div>
|
|||
|
||||
<p>This object is <strong>lazy</strong>; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db <- diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">big_diamonds_db <- diamonds_db |>
|
||||
filter(price > 15000) |>
|
||||
select(carat:clarity, price)
|
||||
|
||||
|
@ -217,7 +217,7 @@ big_diamonds_db
|
|||
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
|
||||
<p>You can see the SQL code that dbplyr generates with the <code><a href="https://dplyr.tidyverse.org/reference/explain.html">show_query()</a></code> function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">big_diamonds_db |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
#> SELECT carat, cut, color, clarity, price
|
||||
|
@ -226,7 +226,7 @@ big_diamonds_db
|
|||
</div>
|
||||
<p>To get all the data back into R, you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">big_diamonds <- big_diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">big_diamonds <- big_diamonds_db |>
|
||||
collect()
|
||||
big_diamonds
|
||||
#> # A tibble: 1,655 × 5
|
||||
|
@ -249,7 +249,7 @@ SQL</h1>
|
|||
<p>The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.</p>
|
||||
<p>We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: <code>flights</code> and <code>planes</code>. These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dbplyr::copy_nycflights13(con)
|
||||
<pre data-type="programlisting" data-code-language="r">dbplyr::copy_nycflights13(con)
|
||||
#> Creating table: airlines
|
||||
#> Creating table: airports
|
||||
#> Creating table: flights
|
||||
|
@ -268,7 +268,7 @@ SQL basics</h2>
|
|||
<p>The top-level components of SQL are called <strong>statements</strong>. Common statements include <code>CREATE</code> for defining new tables, <code>INSERT</code> for adding data, and <code>SELECT</code> for retrieving data. We will focus on <code>SELECT</code> statements, also called <strong>queries</strong>, because they are almost exclusively what you’ll use as a data scientist.</p>
|
||||
<p>A query is made up of <strong>clauses</strong>. There are five important clauses: <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>ORDER BY</code>, and <code>GROUP BY</code>. Every query must have the <code>SELECT</code><span data-type="footnote">Confusingly, depending on the context, <code>SELECT</code> is either a statement or a clause. To avoid this confusion, we’ll generally use query instead of <code>SELECT</code> statement.</span> and <code>FROM</code><span data-type="footnote">Ok, technically, only the <code>SELECT</code> is required, since you can write queries like <code>SELECT 1+1</code> to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a <code>FROM</code> clause.</span> clauses, and the simplest query is <code>SELECT * FROM table</code>, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> show_query()
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> show_query()
|
||||
#> <SQL>
|
||||
#> SELECT *
|
||||
#> FROM flights
|
||||
|
@ -279,7 +279,7 @@ planes |> show_query()
|
|||
</div>
|
||||
<p><code>WHERE</code> and <code>ORDER BY</code> control which rows are included and how they are ordered:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest == "IAH") |>
|
||||
arrange(dep_delay) |>
|
||||
show_query()
|
||||
|
@ -291,7 +291,7 @@ planes |> show_query()
|
|||
</div>
|
||||
<p><code>GROUP BY</code> converts the query to a summary, causing aggregation to happen:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
show_query()
|
||||
|
@ -312,10 +312,10 @@ planes |> show_query()
|
|||
</div>
|
||||
|
||||
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
|
||||
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
|
||||
|
@ -332,7 +332,7 @@ SELECT</h2>
|
|||
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and, as you’ll learn in the next section, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">planes |>
|
||||
<pre data-type="programlisting" data-code-language="r">planes |>
|
||||
select(tailnum, type, manufacturer, model, year) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -364,10 +364,10 @@ planes |>
|
|||
</div>
|
||||
|
||||
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
|
||||
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
|
||||
|
@ -378,7 +378,7 @@ FROM `planes`</pre></div>
|
|||
|
||||
<p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
speed = distance / (air_time / 60)
|
||||
) |>
|
||||
|
@ -401,7 +401,7 @@ FROM</h2>
|
|||
GROUP BY</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
|
@ -421,7 +421,7 @@ GROUP BY</h2>
|
|||
WHERE</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest == "IAH" | dest == "HOU") |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -444,7 +444,7 @@ flights |>
|
|||
<li>SQL uses only <code>''</code> for strings, not <code>""</code>. In SQL, <code>""</code> is used to identify variables, like R’s <code>``</code>.</li>
|
||||
</ul><p>Another useful SQL operator is <code>IN</code>, which is very close to R’s <code>%in%</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest %in% c("IAH", "HOU")) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -454,7 +454,7 @@ flights |>
|
|||
</div>
|
||||
<p>SQL uses <code>NULL</code> instead of <code>NA</code>. <code>NULL</code>s behave similarly to <code>NA</code>s. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(delay = mean(arr_delay))
|
||||
#> Warning: Missing values are always removed in SQL aggregation functions.
|
||||
|
@ -475,7 +475,7 @@ flights |>
|
|||
<p>If you want to learn more about how NULLs work, you might enjoy “<a href="https://modern-sql.com/concept/three-valued-logic"><em>Three valued logic</em></a>” by Markus Winand.</p>
|
||||
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(!is.na(dep_delay)) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -487,7 +487,7 @@ flights |>
|
|||
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
|
||||
<p>Note that if you <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL: <code>WHERE</code> is evaluated before <code>SELECT</code> and <code>GROUP BY</code>, so SQL needs another clause that’s evaluated afterwards.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(n = n()) |>
|
||||
filter(n > 100) |>
|
||||
|
@ -505,7 +505,7 @@ flights |>
|
|||
ORDER BY</h2>
|
||||
<p>Ordering rows involves a straightforward translation from <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(year, month, day, desc(dep_delay)) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -522,7 +522,7 @@ Subqueries</h2>
|
|||
<p>Sometimes it’s not possible to translate a dplyr pipeline into a single <code>SELECT</code> statement and you need to use a subquery. A <strong>subquery</strong> is just a query used as a data source in the <code>FROM</code> clause, instead of the usual table.</p>
|
||||
<p>dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the <code>SELECT</code> clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes <code>year1</code> and then the second (outer) query can compute <code>year2</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
year1 = year + 1,
|
||||
year2 = year1 + 1
|
||||
|
@ -537,7 +537,7 @@ Subqueries</h2>
|
|||
</div>
|
||||
<p>You’ll also see this if you attempted to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(year1 = year + 1) |>
|
||||
filter(year1 == 2014) |>
|
||||
show_query()
|
||||
|
@ -557,7 +557,7 @@ Subqueries</h2>
|
|||
Joins</h2>
|
||||
<p>If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
left_join(planes |> rename(year_built = year), by = "tailnum") |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -619,7 +619,7 @@ Function translations</h1>
|
|||
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
|
||||
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">summarize_query <- function(df, ...) {
|
||||
<pre data-type="programlisting" data-code-language="r">summarize_query <- function(df, ...) {
|
||||
df |>
|
||||
summarise(...) |>
|
||||
show_query()
|
||||
|
@ -632,7 +632,7 @@ mutate_query <- function(df, ...) {
|
|||
</div>
|
||||
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>, have a relatively simple translation while others, like <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarize_query(
|
||||
mean = mean(arr_delay, na.rm = TRUE),
|
||||
|
@ -652,7 +652,7 @@ mutate_query <- function(df, ...) {
|
|||
</div>
|
||||
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
mutate_query(
|
||||
mean = mean(arr_delay, na.rm = TRUE),
|
||||
|
@ -668,7 +668,7 @@ mutate_query <- function(df, ...) {
|
|||
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p>
|
||||
<p>Window functions include all functions that look forward or backwards, like <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
arrange(time_hour) |>
|
||||
mutate_query(
|
||||
|
@ -686,7 +686,7 @@ mutate_query <- function(df, ...) {
|
|||
<p>Here it’s important to <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
|
||||
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, the dplyr function that it directly inspired. Here’s a couple of simple examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate_query(
|
||||
description = if_else(arr_delay > 0, "delayed", "on-time")
|
||||
)
|
||||
|
@ -712,7 +712,7 @@ flights |>
|
|||
</div>
|
||||
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate_query(
|
||||
description = cut(
|
||||
arr_delay,
|
||||
|
|
|
@ -13,7 +13,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>This chapter will focus on the <strong>lubridate</strong> package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you’re working with dates/times. We will also need nycflights13 for practice data.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
|
||||
library(lubridate)
|
||||
library(nycflights13)</pre>
|
||||
|
@ -32,7 +32,7 @@ Creating date/times</h1>
|
|||
<p>You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.</p>
|
||||
<p>To get the current date or date-time you can use <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> or <code><a href="https://lubridate.tidyverse.org/reference/now.html">now()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">today()
|
||||
<pre data-type="programlisting" data-code-language="r">today()
|
||||
#> [1] "2022-11-18"
|
||||
now()
|
||||
#> [1] "2022-11-18 10:59:07 CST"</pre>
|
||||
|
@ -48,7 +48,7 @@ now()
|
|||
During import</h2>
|
||||
<p>If your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">csv <- "
|
||||
<pre data-type="programlisting" data-code-language="r">csv <- "
|
||||
date,datetime
|
||||
2022-01-02,2022-01-02 05:12
|
||||
"
|
||||
|
@ -137,7 +137,7 @@ read_csv(csv)
|
|||
</tr></tbody></table></div>
|
||||
<p>And this code shows a few options applied to a very ambiguous date:</p>
|
||||
<div class="cell" data-messages="false">
|
||||
<pre data-type="programlisting" data-code-language="downlit">csv <- "
|
||||
<pre data-type="programlisting" data-code-language="r">csv <- "
|
||||
date
|
||||
01/02/15
|
||||
"
|
||||
|
@ -169,7 +169,7 @@ read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
|
|||
From strings</h2>
|
||||
<p>The date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers, which attempt to automatically determine the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ymd("2017-01-31")
|
||||
<pre data-type="programlisting" data-code-language="r">ymd("2017-01-31")
|
||||
#> [1] "2017-01-31"
|
||||
mdy("January 31st, 2017")
|
||||
#> [1] "2017-01-31"
|
||||
|
@ -178,14 +178,14 @@ dmy("31-Jan-2017")
|
|||
</div>
|
||||
<p><code><a href="https://lubridate.tidyverse.org/reference/ymd.html">ymd()</a></code> and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ymd_hms("2017-01-31 20:11:59")
|
||||
<pre data-type="programlisting" data-code-language="r">ymd_hms("2017-01-31 20:11:59")
|
||||
#> [1] "2017-01-31 20:11:59 UTC"
|
||||
mdy_hm("01/31/2017 08:01")
|
||||
#> [1] "2017-01-31 08:01:00 UTC"</pre>
|
||||
</div>
|
||||
<p>You can also force the creation of a date-time from a date by supplying a timezone:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ymd("2017-01-31", tz = "UTC")
|
||||
<pre data-type="programlisting" data-code-language="r">ymd("2017-01-31", tz = "UTC")
|
||||
#> [1] "2017-01-31 UTC"</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -195,7 +195,7 @@ mdy_hm("01/31/2017 08:01")
|
|||
From individual components</h2>
|
||||
<p>Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the <code>flights</code> data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
select(year, month, day, hour, minute)
|
||||
#> # A tibble: 336,776 × 5
|
||||
#> year month day hour minute
|
||||
|
@ -210,7 +210,7 @@ From individual components</h2>
|
|||
</div>
|
||||
<p>To create a date/time from this sort of input, use <code><a href="https://lubridate.tidyverse.org/reference/make_datetime.html">make_date()</a></code> for dates, or <code><a href="https://lubridate.tidyverse.org/reference/make_datetime.html">make_datetime()</a></code> for date-times:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
select(year, month, day, hour, minute) |>
|
||||
mutate(departure = make_datetime(year, month, day, hour, minute))
|
||||
#> # A tibble: 336,776 × 6
|
||||
|
@ -226,7 +226,7 @@ From individual components</h2>
|
|||
</div>
|
||||
<p>Let’s do the same thing for each of the four time columns in <code>flights</code>. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">make_datetime_100 <- function(year, month, day, time) {
|
||||
<pre data-type="programlisting" data-code-language="r">make_datetime_100 <- function(year, month, day, time) {
|
||||
make_datetime(year, month, day, time %/% 100, time %% 100)
|
||||
}
|
||||
|
||||
|
@ -255,7 +255,7 @@ flights_dt
|
|||
</div>
|
||||
<p>With this data, we can visualize the distribution of departure times across the year:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
ggplot(aes(dep_time)) +
|
||||
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -264,7 +264,7 @@ flights_dt
|
|||
</div>
|
||||
<p>Or within a single day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
filter(dep_time < ymd(20130102)) |>
|
||||
ggplot(aes(dep_time)) +
|
||||
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes</pre>
|
||||
|
@ -280,14 +280,14 @@ flights_dt
|
|||
From other types</h2>
|
||||
<p>You may want to switch between a date-time and a date. That’s the job of <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">as_datetime(today())
|
||||
<pre data-type="programlisting" data-code-language="r">as_datetime(today())
|
||||
#> [1] "2022-11-18 UTC"
|
||||
as_date(now())
|
||||
#> [1] "2022-11-18"</pre>
|
||||
</div>
|
||||
<p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code>; if it’s in days, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">as_datetime(60 * 60 * 10)
|
||||
<pre data-type="programlisting" data-code-language="r">as_datetime(60 * 60 * 10)
|
||||
#> [1] "1970-01-01 10:00:00 UTC"
|
||||
as_date(365 * 10 + 2)
|
||||
#> [1] "1980-01-01"</pre>
|
||||
|
@ -300,14 +300,14 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>What happens if you parse a string that contains invalid dates?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ymd(c("2010-10-10", "bananas"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">ymd(c("2010-10-10", "bananas"))</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>What does the <code>tzone</code> argument to <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> do? Why is it important?</p></li>
|
||||
<li>
|
||||
<p>For each of the following date-times show how you’d parse it using a readr column-specification and a lubridate function.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">d1 <- "January 1, 2010"
|
||||
<pre data-type="programlisting" data-code-language="r">d1 <- "January 1, 2010"
|
||||
d2 <- "2015-Mar-07"
|
||||
d3 <- "06-Jun-2017"
|
||||
d4 <- c("August 19 (2015)", "July 1 (2015)")
|
||||
|
@ -329,7 +329,7 @@ Date-time components</h1>
|
|||
Getting components</h2>
|
||||
<p>You can pull out individual parts of the date with the accessor functions <code><a href="https://lubridate.tidyverse.org/reference/year.html">year()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/month.html">month()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/day.html">mday()</a></code> (day of the month), <code><a href="https://lubridate.tidyverse.org/reference/day.html">yday()</a></code> (day of the year), <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> (day of the week), <code><a href="https://lubridate.tidyverse.org/reference/hour.html">hour()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/minute.html">minute()</a></code>, and <code><a href="https://lubridate.tidyverse.org/reference/second.html">second()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">datetime <- ymd_hms("2026-07-08 12:34:56")
|
||||
<pre data-type="programlisting" data-code-language="r">datetime <- ymd_hms("2026-07-08 12:34:56")
|
||||
|
||||
year(datetime)
|
||||
#> [1] 2026
|
||||
|
@ -345,7 +345,7 @@ wday(datetime)
|
|||
</div>
|
||||
<p>For <code><a href="https://lubridate.tidyverse.org/reference/month.html">month()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> you can set <code>label = TRUE</code> to return the abbreviated name of the month or day of the week. Set <code>abbr = FALSE</code> to return the full name.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">month(datetime, label = TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">month(datetime, label = TRUE)
|
||||
#> [1] Jul
|
||||
#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
|
||||
wday(datetime, label = TRUE, abbr = FALSE)
|
||||
|
@ -354,7 +354,7 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
</div>
|
||||
<p>We can use <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> to see that more flights depart during the week than on the weekend:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(wday = wday(dep_time, label = TRUE)) |>
|
||||
ggplot(aes(x = wday)) +
|
||||
geom_bar()</pre>
|
||||
|
@ -364,7 +364,7 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
</div>
|
||||
<p>There’s an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(minute = minute(dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
|
@ -378,7 +378,7 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
</div>
|
||||
<p>Interestingly, if we look at the <em>scheduled</em> departure time we don’t see such a strong pattern:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sched_dep <- flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">sched_dep <- flights_dt |>
|
||||
mutate(minute = minute(sched_dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
|
@ -393,7 +393,7 @@ ggplot(sched_dep, aes(minute, avg_delay)) +
|
|||
</div>
|
||||
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(sched_dep, aes(minute, n)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(sched_dep, aes(minute, n)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes." width="576"/></p>
|
||||
|
@ -406,7 +406,7 @@ ggplot(sched_dep, aes(minute, avg_delay)) +
|
|||
Rounding</h2>
|
||||
<p>An alternative approach to plotting individual components is to round the date to a nearby unit of time, with <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">floor_date()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">round_date()</a></code>, and <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">ceiling_date()</a></code>. Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
count(week = floor_date(dep_time, "week")) |>
|
||||
ggplot(aes(week, n)) +
|
||||
geom_line() +
|
||||
|
@ -417,7 +417,7 @@ Rounding</h2>
|
|||
</div>
|
||||
<p>You can use rounding to show the distribution of flights across the course of a day by computing the difference between <code>dep_time</code> and the earliest instant of that day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
|
@ -429,7 +429,7 @@ Rounding</h2>
|
|||
</div>
|
||||
<p>Computing the difference between a pair of date-times yields a difftime (more on that in <a href="#sec-intervals" data-type="xref">#sec-intervals</a>). We can convert that to an <code>hms</code> object to get a more useful x-axis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)</pre>
|
||||
|
@ -444,7 +444,7 @@ Rounding</h2>
|
|||
Modifying components</h2>
|
||||
<p>You can also use each accessor function to modify the components of a date/time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">(datetime <- ymd_hms("2026-07-08 12:34:56"))
|
||||
<pre data-type="programlisting" data-code-language="r">(datetime <- ymd_hms("2026-07-08 12:34:56"))
|
||||
#> [1] "2026-07-08 12:34:56 UTC"
|
||||
|
||||
year(datetime) <- 2030
|
||||
|
@ -459,12 +459,12 @@ datetime
|
|||
</div>
|
||||
<p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
|
||||
<pre data-type="programlisting" data-code-language="r">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
|
||||
#> [1] "2030-02-02 02:34:56 UTC"</pre>
|
||||
</div>
|
||||
<p>If values are too big, they will roll-over:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">update(ymd("2023-02-01"), mday = 30)
|
||||
<pre data-type="programlisting" data-code-language="r">update(ymd("2023-02-01"), mday = 30)
|
||||
#> [1] "2023-03-02"
|
||||
update(ymd("2023-02-01"), hour = 400)
|
||||
#> [1] "2023-02-17 16:00:00 UTC"</pre>
|
||||
|
@ -501,19 +501,19 @@ Time spans</h1>
|
|||
Durations</h2>
|
||||
<p>In R, when you subtract two dates, you get a difftime object:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># How old is Hadley?
|
||||
<pre data-type="programlisting" data-code-language="r"># How old is Hadley?
|
||||
h_age <- today() - ymd("1979-10-14")
|
||||
h_age
|
||||
#> Time difference of 15741 days</pre>
|
||||
</div>
|
||||
<p>A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the <strong>duration</strong>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">as.duration(h_age)
|
||||
<pre data-type="programlisting" data-code-language="r">as.duration(h_age)
|
||||
#> [1] "1360022400s (~43.1 years)"</pre>
|
||||
</div>
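<p>To make the ambiguity concrete, here is a small sketch (the variable names are hypothetical) in which two difftimes end up with different units depending on the size of the gap, while <code>as.duration()</code> puts both on the same scale of seconds.</p>

```r
library(lubridate)

# Two time spans computed by plain subtraction.
short_gap <- ymd_hms("2024-01-01 00:10:00") - ymd_hms("2024-01-01 00:00:00")
long_gap <- ymd("2024-01-03") - ymd("2024-01-01")

# R picks the units for each difftime, so they need not agree.
units(short_gap)
units(long_gap)

# Durations are always stored in seconds, whatever the original units were.
as.duration(short_gap)
as.duration(long_gap)
```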
|
||||
<p>Durations come with a bunch of convenient constructors:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dseconds(15)
|
||||
<pre data-type="programlisting" data-code-language="r">dseconds(15)
|
||||
#> [1] "15s"
|
||||
dminutes(10)
|
||||
#> [1] "600s (~10 minutes)"
|
||||
|
@ -530,19 +530,19 @@ dyears(1)
|
|||
<p>Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e., 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.</p>
|
||||
<p>You can add and multiply durations:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">2 * dyears(1)
|
||||
<pre data-type="programlisting" data-code-language="r">2 * dyears(1)
|
||||
#> [1] "63115200s (~2 years)"
|
||||
dyears(1) + dweeks(12) + dhours(15)
|
||||
#> [1] "38869200s (~1.23 years)"</pre>
|
||||
</div>
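<p>Because a duration is just a count of seconds, the results above can be rebuilt by hand. The sketch below assumes a lubridate version in which <code>dyears()</code> uses 365.25 days, as the printed output in this chapter suggests. <code>seconds_per_day</code> and <code>by_hand</code> are illustrative names.</p>

```r
library(lubridate)

seconds_per_day <- 24 * 60 * 60

# dyears(1) is the number of seconds in an "average" year of 365.25 days.
as.numeric(dyears(1)) == 365.25 * seconds_per_day

# Rebuild dyears(1) + dweeks(12) + dhours(15) from plain arithmetic.
by_hand <- 365.25 * seconds_per_day + 12 * 7 * seconds_per_day + 15 * 60 * 60
total <- dyears(1) + dweeks(12) + dhours(15)
as.numeric(total) == by_hand
```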
|
||||
<p>You can add and subtract durations to and from days:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tomorrow <- today() + ddays(1)
|
||||
<pre data-type="programlisting" data-code-language="r">tomorrow <- today() + ddays(1)
|
||||
last_year <- today() - dyears(1)</pre>
|
||||
</div>
|
||||
<p>However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">one_pm <- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
|
||||
<pre data-type="programlisting" data-code-language="r">one_pm <- ymd_hms("2026-03-12 13:00:00", tz = "America/New_York")
|
||||
|
||||
one_pm
|
||||
#> [1] "2026-03-12 13:00:00 EDT"
|
||||
|
@ -557,14 +557,14 @@ one_pm + ddays(1)
|
|||
Periods</h2>
|
||||
<p>To solve this problem, lubridate provides <strong>periods</strong>. Periods are time spans that don’t have a fixed length in seconds; instead, they work with “human” times, like days and months. That allows them to work in a more intuitive way:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">one_pm
|
||||
<pre data-type="programlisting" data-code-language="r">one_pm
|
||||
#> [1] "2026-03-12 13:00:00 EDT"
|
||||
one_pm + days(1)
|
||||
#> [1] "2026-03-13 13:00:00 EDT"</pre>
|
||||
</div>
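<p>The contrast between the two kinds of time span is easiest to see across a daylight saving transition. The sketch below assumes that US daylight saving time begins on 2026-03-08, so the calendar day after the starting instant is only 23 hours long. <code>before_dst</code> is a hypothetical name.</p>

```r
library(lubridate)

# Assumption: clocks in New York spring forward on 2026-03-08.
before_dst <- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")

# A duration adds exactly 86,400 seconds, so the clock time shifts.
with_duration <- before_dst + ddays(1)

# A period adds one calendar day, so the clock time is preserved.
with_period <- before_dst + days(1)

hour(with_duration)
hour(with_period)
```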
|
||||
<p>Like durations, periods can be created with a number of friendly constructor functions.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">hours(c(12, 24))
|
||||
<pre data-type="programlisting" data-code-language="r">hours(c(12, 24))
|
||||
#> [1] "12H 0M 0S" "24H 0M 0S"
|
||||
days(7)
|
||||
#> [1] "7d 0H 0M 0S"
|
||||
|
@ -574,14 +574,14 @@ months(1:6)
|
|||
</div>
|
||||
<p>You can add and multiply periods:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">10 * (months(6) + days(1))
|
||||
<pre data-type="programlisting" data-code-language="r">10 * (months(6) + days(1))
|
||||
#> [1] "60m 10d 0H 0M 0S"
|
||||
days(50) + hours(25) + minutes(2)
|
||||
#> [1] "50d 25H 2M 0S"</pre>
|
||||
</div>
|
||||
<p>And of course, add them to dates. Compared to durations, periods are more likely to do what you expect:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># A leap year
|
||||
<pre data-type="programlisting" data-code-language="r"># A leap year
|
||||
ymd("2024-01-01") + dyears(1)
|
||||
#> [1] "2024-12-31 06:00:00 UTC"
|
||||
ymd("2024-01-01") + years(1)
|
||||
|
@ -595,7 +595,7 @@ one_pm + days(1)
|
|||
</div>
|
||||
<p>Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination <em>before</em> they departed from New York City.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
filter(arr_time < dep_time)
|
||||
#> # A tibble: 10,640 × 9
|
||||
#> origin dest dep_delay arr_delay dep_time sched_dep_time
|
||||
|
@ -611,7 +611,7 @@ one_pm + days(1)
|
|||
</div>
|
||||
<p>These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding <code>days(1)</code> to the arrival time of each overnight flight.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt <- flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt <- flights_dt |>
|
||||
mutate(
|
||||
overnight = arr_time < dep_time,
|
||||
arr_time = arr_time + days(if_else(overnight, 1, 0)),
|
||||
|
@ -620,7 +620,7 @@ one_pm + days(1)
|
|||
</div>
|
||||
<p>Now all of our flights obey the laws of physics.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_dt |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
filter(overnight, arr_time < dep_time)
|
||||
#> # A tibble: 10,640 × 10
|
||||
#> origin dest dep_delay arr_delay dep_time sched_dep_time
|
||||
|
@ -642,13 +642,13 @@ Intervals</h2>
|
|||
<p>What does <code>dyears(1) / ddays(365)</code> return? It’s not quite one, because <code>dyears()</code> is defined as the number of seconds in an average year, which is 365.25 days.</p>
|
||||
<p>What should <code>years(1) / days(1)</code> return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">years(1) / days(1)
|
||||
<pre data-type="programlisting" data-code-language="r">years(1) / days(1)
|
||||
#> [1] 365.25</pre>
|
||||
</div>
|
||||
<p>If you want a more accurate measurement, you’ll have to use an <strong>interval</strong>. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.</p>
|
||||
<p>You can create an interval by writing <code>start %--% end</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
|
||||
<pre data-type="programlisting" data-code-language="r">y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
|
||||
y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
|
||||
|
||||
y2023
|
||||
|
@ -658,7 +658,7 @@ y2024
|
|||
</div>
|
||||
<p>You could then divide it by <code><a href="https://lubridate.tidyverse.org/reference/period.html">days()</a></code> to find out how many days fit in the year:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y2023 / days(1)
|
||||
<pre data-type="programlisting" data-code-language="r">y2023 / days(1)
|
||||
#> [1] 365
|
||||
y2024 / days(1)
|
||||
#> [1] 366</pre>
|
||||
|
@ -684,13 +684,13 @@ Time zones</h1>
|
|||
<p>You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located), did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at <a href="https://www.iana.org/time-zones" class="uri">https://www.iana.org/time-zones</a>) just to read some of these stories!</p>
|
||||
<p>You can find out what R thinks your current time zone is with <code><a href="https://rdrr.io/r/base/timezones.html">Sys.timezone()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">Sys.timezone()
|
||||
<pre data-type="programlisting" data-code-language="r">Sys.timezone()
|
||||
#> [1] "America/Chicago"</pre>
|
||||
</div>
|
||||
<p>(If R doesn’t know, you’ll get an <code>NA</code>.)</p>
|
||||
<p>And see the complete list of all time zone names with <code><a href="https://rdrr.io/r/base/timezones.html">OlsonNames()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">length(OlsonNames())
|
||||
<pre data-type="programlisting" data-code-language="r">length(OlsonNames())
|
||||
#> [1] 595
|
||||
head(OlsonNames())
|
||||
#> [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
|
||||
|
@ -698,7 +698,7 @@ head(OlsonNames())
|
|||
</div>
|
||||
<p>In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
|
||||
<pre data-type="programlisting" data-code-language="r">x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
|
||||
x1
|
||||
#> [1] "2024-06-01 12:00:00 EDT"
|
||||
|
||||
|
@ -712,14 +712,14 @@ x3
|
|||
</div>
|
||||
<p>You can verify that they’re the same time using subtraction:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x1 - x2
|
||||
<pre data-type="programlisting" data-code-language="r">x1 - x2
|
||||
#> Time difference of 0 secs
|
||||
x1 - x3
|
||||
#> Time difference of 0 secs</pre>
|
||||
</div>
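<p>The same check can be done with a comparison instead of a subtraction. In this sketch (with made-up names), noon in New York in June corresponds to 16:00 UTC, so the two objects print differently but hold the same underlying instant.</p>

```r
library(lubridate)

noon_ny <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
same_utc <- ymd_hms("2024-06-01 16:00:00", tz = "UTC")

# Comparison looks at the instant, not at the printed clock time.
noon_ny == same_utc

# The underlying number of seconds since 1970-01-01 UTC is identical.
as.numeric(noon_ny) == as.numeric(same_utc)
```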
|
||||
<p>Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>, will often drop the time zone. In that case, the date-times will display in your local time zone:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x4 <- c(x1, x2, x3)
|
||||
<pre data-type="programlisting" data-code-language="r">x4 <- c(x1, x2, x3)
|
||||
x4
|
||||
#> [1] "2024-06-01 12:00:00 EDT" "2024-06-01 12:00:00 EDT"
|
||||
#> [3] "2024-06-01 12:00:00 EDT"</pre>
|
||||
|
@ -728,7 +728,7 @@ x4
|
|||
<ul><li>
|
||||
<p>Keep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
|
||||
<pre data-type="programlisting" data-code-language="r">x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
|
||||
x4a
|
||||
#> [1] "2024-06-02 02:30:00 +1030" "2024-06-02 02:30:00 +1030"
|
||||
#> [3] "2024-06-02 02:30:00 +1030"
|
||||
|
@ -741,7 +741,7 @@ x4a - x4
|
|||
<li>
|
||||
<p>Change the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
|
||||
<pre data-type="programlisting" data-code-language="r">x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
|
||||
x4b
|
||||
#> [1] "2024-06-01 12:00:00 +1030" "2024-06-01 12:00:00 +1030"
|
||||
#> [3] "2024-06-01 12:00:00 +1030"
|
||||
|
|
|
@ -11,7 +11,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the <strong>forcats</strong> package, which is part of the core tidyverse. It provides tools for dealing with <strong>cat</strong>egorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -21,32 +21,32 @@ Prerequisites</h2>
|
|||
Factor basics</h1>
|
||||
<p>Imagine that you have a variable that records month:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x1 <- c("Dec", "Apr", "Jan", "Mar")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">x1 <- c("Dec", "Apr", "Jan", "Mar")</pre>
|
||||
</div>
|
||||
<p>Using a string to record this variable has two problems:</p>
|
||||
<ol type="1"><li>
|
||||
<p>There are only twelve possible months, and there’s nothing saving you from typos:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x2 <- c("Dec", "Apr", "Jam", "Mar")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">x2 <- c("Dec", "Apr", "Jam", "Mar")</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>It doesn’t sort in a useful way:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sort(x1)
|
||||
<pre data-type="programlisting" data-code-language="r">sort(x1)
|
||||
#> [1] "Apr" "Dec" "Jan" "Mar"</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol><p>You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid <strong>levels</strong>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">month_levels <- c(
|
||||
<pre data-type="programlisting" data-code-language="r">month_levels <- c(
|
||||
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
|
||||
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
|
||||
)</pre>
|
||||
</div>
|
||||
<p>Now you can create a factor:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y1 <- factor(x1, levels = month_levels)
|
||||
<pre data-type="programlisting" data-code-language="r">y1 <- factor(x1, levels = month_levels)
|
||||
y1
|
||||
#> [1] Dec Apr Jan Mar
|
||||
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
|
||||
|
@ -57,27 +57,27 @@ sort(y1)
|
|||
</div>
|
||||
<p>And any values not in the levels will be silently converted to NA:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y2 <- factor(x2, levels = month_levels)
|
||||
<pre data-type="programlisting" data-code-language="r">y2 <- factor(x2, levels = month_levels)
|
||||
y2
|
||||
#> [1] Dec Apr <NA> Mar
|
||||
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
|
||||
</div>
|
||||
<p>This seems risky, so you might want to use <code><a href="https://forcats.tidyverse.org/reference/fct.html">fct()</a></code> instead:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y2 <- fct(x2, levels = month_levels)
|
||||
<pre data-type="programlisting" data-code-language="r">y2 <- fct(x2, levels = month_levels)
|
||||
#> Error in `fct()`:
|
||||
#> ! All values of `x` must appear in `levels` or `na`
|
||||
#> ℹ Missing level: "Jam"</pre>
|
||||
</div>
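<p>If you want to stay with base R’s <code>factor()</code>, one possible way to catch such typos yourself is to compare the data against the levels before converting. This is only a sketch using base R set operations, not something the chapter prescribes.</p>

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data but absent from the allowed levels.
bad_values <- setdiff(x2, month_levels)
bad_values

# Positions that factor() would silently turn into NA.
bad_positions <- which(is.na(factor(x2, levels = month_levels)))
bad_positions
```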
|
||||
<p>If you omit the levels, they’ll be taken from the data in alphabetical order:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">factor(x1)
|
||||
<pre data-type="programlisting" data-code-language="r">factor(x1)
|
||||
#> [1] Dec Apr Jan Mar
|
||||
#> Levels: Apr Dec Jan Mar</pre>
|
||||
</div>
|
||||
<p>Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to <code>unique(x)</code>, or after the fact, with <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_inorder()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">f1 <- factor(x1, levels = unique(x1))
|
||||
<pre data-type="programlisting" data-code-language="r">f1 <- factor(x1, levels = unique(x1))
|
||||
f1
|
||||
#> [1] Dec Apr Jan Mar
|
||||
#> Levels: Dec Apr Jan Mar
|
||||
|
@ -89,12 +89,12 @@ f2
|
|||
</div>
|
||||
<p>If you ever need to access the set of valid levels directly, you can do so with <code><a href="https://rdrr.io/r/base/levels.html">levels()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">levels(f2)
|
||||
<pre data-type="programlisting" data-code-language="r">levels(f2)
|
||||
#> [1] "Dec" "Apr" "Jan" "Mar"</pre>
|
||||
</div>
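<p>Note that the set of valid levels can be larger than the set of values that actually occur. A short sketch, reusing the month example from earlier in this section:</p>

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

# All twelve levels are kept, even though only four appear in the data.
nlevels(y1)
length(unique(x1))

# Unused levels show up with a count of zero.
table(y1)[["Feb"]]
```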
|
||||
<p>You can also create a factor when reading your data with readr with <code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">csv <- "
|
||||
<pre data-type="programlisting" data-code-language="r">csv <- "
|
||||
month,value
|
||||
Jan,12
|
||||
Feb,56
|
||||
|
@ -112,7 +112,7 @@ df$month
|
|||
General Social Survey</h1>
|
||||
<p>For the rest of this chapter, we’re going to use <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">forcats::gss_cat</a></code>. It’s a sample of data from the <a href="https://gss.norc.org">General Social Survey</a>, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in <code>gss_cat</code> Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat
|
||||
#> # A tibble: 21,483 × 9
|
||||
#> year marital age race rincome partyid relig denom tvhours
|
||||
#> <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
|
||||
|
@ -127,7 +127,7 @@ General Social Survey</h1>
|
|||
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>
|
||||
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
count(race)
|
||||
#> # A tibble: 3 × 2
|
||||
#> race n
|
||||
|
@ -138,7 +138,7 @@ General Social Survey</h1>
|
|||
</div>
|
||||
<p>Or with a bar chart:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(gss_cat, aes(race)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(gss_cat, aes(race)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A bar chart showing the distribution of race. There are ~2000 records with race "Other", 3000 with race "Black", and other 15,000 with race "White"." width="576"/></p>
|
||||
|
@ -160,7 +160,7 @@ Exercise</h2>
|
|||
Modifying factor order</h1>
|
||||
<p>It’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">relig_summary <- gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">relig_summary <- gss_cat |>
|
||||
group_by(relig) |>
|
||||
summarise(
|
||||
age = mean(age, na.rm = TRUE),
|
||||
|
@ -181,7 +181,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
|
|||
<code>x</code>, a numeric vector that you want to use to reorder the levels.</li>
|
||||
<li>Optionally, <code>fun</code>, a function that’s used if there are multiple values of <code>x</code> for each value of <code>f</code>. The default value is <code>median</code>.</li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. "Other eastern" has the fewest tvhours under 2, and "Don't know" has the highest (over 5)." width="576"/></p>
|
||||
|
@ -190,7 +190,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
|
|||
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.</p>
|
||||
<p>As you start making more complicated transformations, we recommend moving them out of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> and into a separate <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> step. For example, you could rewrite the plot above as:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">relig_summary |>
|
||||
<pre data-type="programlisting" data-code-language="r">relig_summary |>
|
||||
mutate(
|
||||
relig = fct_reorder(relig, tvhours)
|
||||
) |>
|
||||
|
@ -199,7 +199,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
|
|||
</div>
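<p>To see what <code>fct_reorder()</code> does to the levels themselves, independent of any plot, here is a toy sketch. The vectors <code>toy</code> and <code>score</code> are invented for illustration.</p>

```r
library(forcats)

toy <- factor(c("a", "b", "c"))
score <- c(3, 1, 2)

# Default levels are alphabetical.
levels(toy)

# After reordering, levels follow increasing values of `score`.
reordered <- fct_reorder(toy, score)
levels(reordered)

# The values themselves are unchanged; only the level order moves.
as.character(reordered)
```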
|
||||
<p>What if we create a similar plot looking at how average age varies across reported income level?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rincome_summary <- gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">rincome_summary <- gss_cat |>
|
||||
group_by(rincome) |>
|
||||
summarise(
|
||||
age = mean(age, na.rm = TRUE),
|
||||
|
@ -216,7 +216,7 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
|
|||
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
|
||||
<p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="https://forcats.tidyverse.org/reference/fct_relevel.html">fct_relevel()</a></code>. It takes a factor, <code>f</code>, and then any number of levels that you want to move to the front of the line.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="The same scatterplot but now "Not Applicable" is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is "Not applicable"." width="576"/></p>
|
||||
|
@ -225,7 +225,7 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
|
|||
<p>Why do you think the average age for “Not applicable” is so high?</p>
|
||||
<p>Another type of reordering is useful when you are coloring the lines on a plot. <code>fct_reorder2(f, x, y)</code> reorders the factor <code>f</code> by the <code>y</code> values associated with the largest <code>x</code> values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">#|
|
||||
<pre data-type="programlisting" data-code-language="r">#|
|
||||
#| Rearranging the legend makes the plot easier to read because the
|
||||
#| legend colours now match the order of the lines on the far right
|
||||
#| of the plot. You can see some unsurprising patterns: the proportion
|
||||
|
@ -259,7 +259,7 @@ ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
|
|||
</div>
|
||||
<p>Finally, for bar plots, you can use <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code> to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with <code><a href="https://forcats.tidyverse.org/reference/fct_rev.html">fct_rev()</a></code> if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
|
||||
ggplot(aes(marital)) +
|
||||
geom_bar()</pre>
|
||||
|
@ -282,7 +282,7 @@ Exercises</h2>
|
|||
Modifying factor levels</h1>
|
||||
<p>More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. It allows you to recode, or change, the value of each level. For example, take the <code>gss_cat$partyid</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |> count(partyid)
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |> count(partyid)
|
||||
#> # A tibble: 10 × 2
|
||||
#> partyid n
|
||||
#> <fct> <int>
|
||||
|
@ -296,7 +296,7 @@ Modifying factor levels</h1>
|
|||
</div>
|
||||
<p>The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(
|
||||
partyid = fct_recode(partyid,
|
||||
"Republican, strong" = "Strong republican",
|
||||
|
@ -322,7 +322,7 @@ Modifying factor levels</h1>
|
|||
<p><code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code> will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.</p>
|
||||
<p>To combine groups, you can assign multiple old levels to the same new level:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(
|
||||
partyid = fct_recode(partyid,
|
||||
"Republican, strong" = "Strong republican",
|
||||
|
@ -351,7 +351,7 @@ Modifying factor levels</h1>
|
|||
<p>Use this technique with care: if you group together categories that are truly different you will end up with misleading results.</p>
|
||||
<p>If you want to collapse a lot of levels, <code><a href="https://forcats.tidyverse.org/reference/fct_collapse.html">fct_collapse()</a></code> is a useful variant of <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. For each new level, you can provide a vector of old levels:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(
|
||||
partyid = fct_collapse(partyid,
|
||||
"other" = c("No answer", "Don't know", "Other party"),
|
||||
|
@ -371,7 +371,7 @@ Modifying factor levels</h1>
|
|||
</div>
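<p>The same idea on a tiny, made-up factor, which makes the mapping from old levels to new levels easy to check by eye. The level names here are purely illustrative.</p>

```r
library(forcats)

produce <- factor(c("apple", "pear", "kale", "leek"))

# Each argument names a new level and lists the old levels it absorbs.
grouped <- fct_collapse(produce,
  fruit = c("apple", "pear"),
  veg = c("kale", "leek")
)

as.character(grouped)
nlevels(grouped)
```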
|
||||
<p>Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the <code>fct_lump_*()</code> family of functions. <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_lowfreq()</a></code> is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(relig = fct_lump_lowfreq(relig)) |>
|
||||
count(relig)
|
||||
#> # A tibble: 2 × 2
|
||||
|
@ -382,7 +382,7 @@ Modifying factor levels</h1>
|
|||
</div>
|
||||
<p>In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_n()</a></code> to specify that we want exactly 10 groups:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gss_cat |>
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(relig = fct_lump_n(relig, n = 10)) |>
|
||||
count(relig, sort = TRUE) |>
|
||||
print(n = Inf)
|
||||
|
@ -416,7 +416,7 @@ Exercises</h2>
|
|||
Ordered factors</h1>
|
||||
<p>Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with <code><a href="https://rdrr.io/r/base/factor.html">ordered()</a></code>, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use <code><</code> between the factor levels:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ordered(c("a", "b", "c"))
|
||||
<pre data-type="programlisting" data-code-language="r">ordered(c("a", "b", "c"))
|
||||
#> [1] a b c
|
||||
#> Levels: a < b < c</pre>
|
||||
</div>
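<p>As a small illustration of what that ordering gives you, here is a minimal sketch using only base R. The <code>sizes</code> vector and its levels are made up for this example. Because the levels are ordered, comparison operators and <code>max()</code> work on the factor:</p>

```r
# Hypothetical data: the levels argument fixes the order small < medium < large
sizes <- ordered(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large")
)

# Comparisons use the level order, not alphabetical order
sizes < "large"
#> [1]  TRUE FALSE  TRUE

# Summaries that rely on ordering are also defined for ordered factors
max(sizes)
#> [1] large
#> Levels: small < medium < large
```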
|
||||
|
|
|
@ -18,7 +18,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>We’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(nycflights13)</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -29,7 +29,7 @@ library(nycflights13)</pre>
|
|||
Vector functions</h1>
|
||||
<p>We’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
a = rnorm(5),
|
||||
b = rnorm(5),
|
||||
c = rnorm(5),
|
||||
|
@ -62,7 +62,7 @@ df |> mutate(
|
|||
Writing a function</h2>
|
||||
<p>To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> it’s a little easier to see the pattern because each repetition is now one line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
|
||||
<pre data-type="programlisting" data-code-language="r">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
|
||||
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
|
||||
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
|
||||
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)) </pre>
|
||||
|
@ -77,26 +77,26 @@ Writing a function</h2>
|
|||
<li><p>The <strong>body</strong>. The body is the code that’s repeated across all the calls.</p></li>
|
||||
</ol><p>Then you create a function by following the template:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">name <- function(arguments) {
|
||||
<pre data-type="programlisting" data-code-language="r">name <- function(arguments) {
|
||||
body
|
||||
}</pre>
|
||||
</div>
|
||||
<p>For this case that leads to:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rescale01 <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">rescale01 <- function(x) {
|
||||
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
|
||||
}</pre>
|
||||
</div>
|
||||
<p>At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rescale01(c(-10, 0, 10))
|
||||
<pre data-type="programlisting" data-code-language="r">rescale01(c(-10, 0, 10))
|
||||
#> [1] 0.0 0.5 1.0
|
||||
rescale01(c(1, 2, 3, NA, 5))
|
||||
#> [1] 0.00 0.25 0.50 NA 1.00</pre>
|
||||
</div>
|
||||
<p>Then you can rewrite the call to <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> as:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> mutate(
|
||||
<pre data-type="programlisting" data-code-language="r">df |> mutate(
|
||||
a = rescale01(a),
|
||||
b = rescale01(b),
|
||||
c = rescale01(c),
|
||||
|
@ -119,20 +119,20 @@ rescale01(c(1, 2, 3, NA, 5))
|
|||
Improving our function</h2>
|
||||
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work: instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once, we could compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rescale01 <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">rescale01 <- function(x) {
|
||||
rng <- range(x, na.rm = TRUE)
|
||||
(x - rng[1]) / (rng[2] - rng[1])
|
||||
}</pre>
|
||||
</div>
|
||||
<p>Or you might try this function on a vector that includes an infinite value:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1:10, Inf)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1:10, Inf)
|
||||
rescale01(x)
|
||||
#> [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre>
|
||||
</div>
|
||||
<p>That result is not particularly useful so we could ask <code><a href="https://rdrr.io/r/base/range.html">range()</a></code> to ignore infinite values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rescale01 <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">rescale01 <- function(x) {
|
||||
rng <- range(x, na.rm = TRUE, finite = TRUE)
|
||||
(x - rng[1]) / (rng[2] - rng[1])
|
||||
}
|
||||
|
@ -149,13 +149,13 @@ Mutate functions</h2>
|
|||
<p>Now that you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, functions that work well inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output the same length as the input.</p>
|
||||
<p>Let’s start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">z_score <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">z_score <- function(x) {
|
||||
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
|
||||
}</pre>
|
||||
</div>
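<p>A quick way to convince yourself that this does what it claims is to check the mean and standard deviation of the result. The following is a minimal sketch with a made-up input vector; the definition of <code>z_score()</code> is repeated only so the snippet runs on its own:</p>

```r
# Same definition as above, repeated so the snippet is self-contained
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical input with one missing value
z <- z_score(c(2, 4, 6, NA, 8))

# The rescaled values should have mean 0 and standard deviation 1,
# up to floating point error, and the missing value should stay missing
all.equal(mean(z, na.rm = TRUE), 0)
all.equal(sd(z, na.rm = TRUE), 1)
is.na(z[4])
```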
|
||||
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">clamp <- function(x, min, max) {
|
||||
<pre data-type="programlisting" data-code-language="r">clamp <- function(x, min, max) {
|
||||
case_when(
|
||||
x < min ~ min,
|
||||
x > max ~ max,
|
||||
|
@ -167,7 +167,7 @@ clamp(1:10, min = 3, max = 7)
|
|||
</div>
|
||||
<p>Or maybe you’d rather mark those values as <code>NA</code>s:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">na_outside <- function(x, min, max) {
|
||||
<pre data-type="programlisting" data-code-language="r">na_outside <- function(x, min, max) {
|
||||
case_when(
|
||||
x < min ~ NA,
|
||||
x > max ~ NA,
|
||||
|
@ -179,7 +179,7 @@ na_outside(1:10, min = 3, max = 7)
|
|||
</div>
|
||||
<p>Of course functions don’t just need to work with numeric variables. You might want to extract out some repeated string manipulation. Maybe you need to make the first character upper case:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">first_upper <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">first_upper <- function(x) {
|
||||
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
|
||||
x
|
||||
}
|
||||
|
@ -188,7 +188,7 @@ first_upper("hello")
|
|||
</div>
|
||||
<p>Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/NVlabormarket/status/1571939851922198530
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/NVlabormarket/status/1571939851922198530
|
||||
clean_number <- function(x) {
|
||||
is_pct <- str_detect(x, "%")
|
||||
num <- x |>
|
||||
|
@ -205,13 +205,13 @@ clean_number("45%")
|
|||
</div>
|
||||
<p>Sometimes your functions will be highly specialized for one data analysis. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">fix_na <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">fix_na <- function(x) {
|
||||
if_else(x %in% c(997, 998, 999), NA, x)
|
||||
}</pre>
|
||||
</div>
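<p>To see it in action, here is a minimal sketch with made-up survey responses. It assumes dplyr 1.1.0 or later, where <code>if_else()</code> accepts a bare <code>NA</code> alongside a numeric vector; the function definition is repeated only so the snippet runs on its own:</p>

```r
library(dplyr)

# Same definition as above, repeated so the snippet is self-contained
fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}

# Hypothetical responses where 997 and 999 are missing-value codes
fix_na(c(12, 997, 45, 999))
#> [1] 12 NA 45 NA
```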
|
||||
<p>We’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs. For example, you might want to compute the distance between two locations on the globe using the haversine formula. This requires four vectors:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
|
||||
haversine <- function(long1, lat1, long2, lat2, round = 3) {
|
||||
# convert to radians
|
||||
long1 <- long1 * pi / 180
|
||||
|
@ -234,7 +234,7 @@ haversine <- function(long1, lat1, long2, lat2, round = 3) {
|
|||
Summary functions</h2>
|
||||
<p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">commas <- function(x) {
|
||||
<pre data-type="programlisting" data-code-language="r">commas <- function(x) {
|
||||
str_flatten(x, collapse = ", ", last = " and ")
|
||||
}
|
||||
commas(c("cat", "dog", "pigeon"))
|
||||
|
@ -242,7 +242,7 @@ commas(c("cat", "dog", "pigeon"))
|
|||
</div>
|
||||
<p>Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cv <- function(x, na.rm = FALSE) {
|
||||
<pre data-type="programlisting" data-code-language="r">cv <- function(x, na.rm = FALSE) {
|
||||
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
|
||||
}
|
||||
cv(runif(100, min = 0, max = 50))
|
||||
|
@ -252,14 +252,14 @@ cv(runif(100, min = 0, max = 500))
|
|||
</div>
|
||||
<p>Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/gbganalyst/status/1571619641390252033
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/gbganalyst/status/1571619641390252033
|
||||
n_missing <- function(x) {
|
||||
sum(is.na(x))
|
||||
} </pre>
|
||||
</div>
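<p>For instance, a minimal sketch with a made-up vector (the definition is repeated so the snippet runs on its own):</p>

```r
# Same definition as above, repeated so the snippet is self-contained
n_missing <- function(x) {
  sum(is.na(x))
}

# Hypothetical input: two of the four values are missing
n_missing(c(1, NA, 3, NA))
#> [1] 2
```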
|
||||
<p>You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/neilgcurrie/status/1571607727255834625
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/neilgcurrie/status/1571607727255834625
|
||||
mape <- function(actual, predicted) {
|
||||
sum(abs((actual - predicted) / actual)) / length(actual)
|
||||
}</pre>
|
||||
|
@ -278,7 +278,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">mean(is.na(x))
|
||||
<pre data-type="programlisting" data-code-language="r">mean(is.na(x))
|
||||
mean(is.na(y))
|
||||
mean(is.na(z))
|
||||
|
||||
|
@ -302,7 +302,7 @@ round(z / sum(z, na.rm = TRUE) * 100, 1)</pre>
|
|||
<li>
|
||||
<p>Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">is_directory <- function(x) file.info(x)$isdir
|
||||
<pre data-type="programlisting" data-code-language="r">is_directory <- function(x) file.info(x)$isdir
|
||||
is_readable <- function(x) file.access(x, 4) == 0</pre>
|
||||
</div>
|
||||
</li>
|
||||
|
@ -320,7 +320,7 @@ Data frame functions</h1>
|
|||
Indirection and tidy evaluation</h2>
|
||||
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> the unique (distinct) values of a variable:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">pull_unique <- function(df, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">pull_unique <- function(df, var) {
|
||||
df |>
|
||||
distinct(var) |>
|
||||
pull(var)
|
||||
|
@ -328,14 +328,14 @@ Indirection and tidy evaluation</h2>
|
|||
</div>
|
||||
<p>If we try to use it, we get an error:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |> pull_unique(clarity)
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |> pull_unique(clarity)
|
||||
#> Error in `distinct()`:
|
||||
#> ! Must use existing variables.
|
||||
#> ✖ `var` not found in `.data`.</pre>
|
||||
</div>
|
||||
<p>To make the problem a bit clearer, we can use a made-up data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(var = "var", x = "x", y = "y")
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(var = "var", x = "x", y = "y")
|
||||
df |> pull_unique(x)
|
||||
#> [1] "var"
|
||||
df |> pull_unique(y)
|
||||
|
@ -346,7 +346,7 @@ df |> pull_unique(y)
|
|||
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of <code>{{ }}</code> as looking down a tunnel — <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
|
||||
<p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">pull_unique <- function(df, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">pull_unique <- function(df, var) {
|
||||
df |>
|
||||
distinct({{ var }}) |>
|
||||
pull({{ var }})
|
||||
|
@ -373,7 +373,7 @@ When to embrace?</h2>
|
|||
Common use cases</h2>
|
||||
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">summary6 <- function(data, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">summary6 <- function(data, var) {
|
||||
data |> summarise(
|
||||
min = min({{ var }}, na.rm = TRUE),
|
||||
mean = mean({{ var }}, na.rm = TRUE),
|
||||
|
@ -393,7 +393,7 @@ diamonds |> summary6(carat)
|
|||
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> in a helper, we think it’s good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
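<p>To see what <code>.groups = "drop"</code> changes, here is a minimal sketch. The helper <code>mean_mpg()</code> is hypothetical, and <code>mtcars</code> is used only because it ships with R. Even when the input is grouped by two variables, the result comes back ungrouped:</p>

```r
library(dplyr)

# Hypothetical helper that always returns ungrouped output
mean_mpg <- function(data) {
  data |> summarise(mean_mpg = mean(mpg), .groups = "drop")
}

result <- mtcars |>
  group_by(cyl, am) |>
  mean_mpg()

# No grouping variables remain on the result
group_vars(result)
#> character(0)
```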
|
||||
<p>The nice thing about this function is that, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, you can use it on grouped data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
group_by(cut) |>
|
||||
summary6(carat)
|
||||
#> # A tibble: 5 × 7
|
||||
|
@ -407,7 +407,7 @@ diamonds |> summary6(carat)
|
|||
</div>
|
||||
<p>Because the arguments to summarize are data-masking, the <code>var</code> argument to <code>summary6()</code> is also data-masking. That means you can also summarize computed variables:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
group_by(cut) |>
|
||||
summary6(log10(carat))
|
||||
#> # A tibble: 5 × 7
|
||||
|
@ -422,7 +422,7 @@ diamonds |> summary6(carat)
|
|||
<p>To summarize multiple variables you’ll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
|
||||
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/Diabb6/status/1571635146658402309
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/Diabb6/status/1571635146658402309
|
||||
count_prop <- function(df, var, sort = FALSE) {
|
||||
df |>
|
||||
count({{ var }}, sort = sort) |>
|
||||
|
@ -443,7 +443,7 @@ diamonds |> count_prop(clarity)
|
|||
<p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>, and only <code>var</code> needs to be embraced because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> which uses data-masking for all variables in <code>…</code>.</p>
|
||||
<p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">unique_where <- function(df, condition, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">unique_where <- function(df, condition, var) {
|
||||
df |>
|
||||
filter({{ condition }}) |>
|
||||
distinct({{ var }}) |>
|
||||
|
@ -468,7 +468,7 @@ flights |> unique_where(tailnum == "N14228", month)
|
|||
<p>Here we embrace <code>condition</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>.</p>
|
||||
<p>We’ve made all these examples take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights_sub <- function(rows, cols) {
|
||||
<pre data-type="programlisting" data-code-language="r">flights_sub <- function(rows, cols) {
|
||||
flights |>
|
||||
filter({{ rows }}) |>
|
||||
select(time_hour, carrier, flight, {{ cols }})
|
||||
|
@ -494,7 +494,7 @@ flights_sub(dest == "IAH", contains("time"))
|
|||
Data-masking vs tidy-selection</h2>
|
||||
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">count_missing <- function(df, group_vars, x_var) {
|
||||
<pre data-type="programlisting" data-code-language="r">count_missing <- function(df, group_vars, x_var) {
|
||||
df |>
|
||||
group_by({{ group_vars }}) |>
|
||||
summarise(n_miss = sum(is.na({{ x_var }})))
|
||||
|
@ -508,7 +508,7 @@ flights |>
|
|||
</div>
|
||||
<p>This doesn’t work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code>, which allows you to use tidy-selection inside data-masking functions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">count_missing <- function(df, group_vars, x_var) {
|
||||
<pre data-type="programlisting" data-code-language="r">count_missing <- function(df, group_vars, x_var) {
|
||||
df |>
|
||||
group_by(pick({{ group_vars }})) |>
|
||||
summarise(n_miss = sum(is.na({{ x_var }})))
|
||||
|
@ -531,7 +531,7 @@ flights |>
|
|||
</div>
|
||||
<p>Another convenient use of <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> is to make a 2d table of counts. Here we count using all the variables in the <code>rows</code> and <code>columns</code>, then use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to rearrange the counts into a grid:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/pollicipes/status/1571606508944719876
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/pollicipes/status/1571606508944719876
|
||||
count_wide <- function(data, rows, cols) {
|
||||
data |>
|
||||
count(pick(c({{ rows }}, {{ cols }}))) |>
|
||||
|
@ -576,31 +576,31 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Finds all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> filter_severe()</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> filter_severe()</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> group_by(dest) |> summarise_severe()</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> group_by(dest) |> summarise_severe()</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Finds all flights that were cancelled or delayed by more than a user supplied number of hours:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> filter_severe(hours = 2)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> filter_severe(hours = 2)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Summarizes the weather to compute the minimum, mean, and maximum of a user supplied variable:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">weather |> summarise_weather(temp)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">weather |> summarise_weather(temp)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc) into a decimal time (i.e. hours + minutes / 60).</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> standardise_time(sched_dep_time)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> standardise_time(sched_dep_time)</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></li>
|
||||
|
@ -608,7 +608,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Generalize the following function so that you can supply any number of variables to count.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">count_prop <- function(df, var, sort = FALSE) {
|
||||
<pre data-type="programlisting" data-code-language="r">count_prop <- function(df, var, sort = FALSE) {
|
||||
df |>
|
||||
count({{ var }}, sort = sort) |>
|
||||
mutate(prop = n / sum(n))
|
||||
|
@ -623,7 +623,7 @@ Exercises</h2>
|
|||
Plot functions</h1>
|
||||
<p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you’re making a lot of histograms:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
ggplot(aes(carat)) +
|
||||
geom_histogram(binwidth = 0.1)
|
||||
|
||||
|
@ -633,7 +633,7 @@ diamonds |>
|
|||
</div>
|
||||
<p>Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function, so you need to embrace:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">histogram <- function(df, var, binwidth = NULL) {
|
||||
<pre data-type="programlisting" data-code-language="r">histogram <- function(df, var, binwidth = NULL) {
|
||||
df |>
|
||||
ggplot(aes({{ var }})) +
|
||||
geom_histogram(binwidth = binwidth)
|
||||
|
@ -646,7 +646,7 @@ diamonds |> histogram(carat, 0.1)</pre>
|
|||
</div>
|
||||
<p>Note that <code>histogram()</code> returns a ggplot2 plot, so that you can still add on additional components if you want. Just remember to switch from <code>|></code> to <code>+</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
histogram(carat, 0.1) +
|
||||
labs(x = "Size (in carats)", y = "Number of diamonds")</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -659,7 +659,7 @@ diamonds |> histogram(carat, 0.1)</pre>
|
|||
More variables</h2>
|
||||
<p>It’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/tyler_js_smith/status/1574377116988104704
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/tyler_js_smith/status/1574377116988104704
|
||||
|
||||
linearity_check <- function(df, x, y) {
|
||||
df |>
|
||||
|
@ -680,7 +680,7 @@ starwars |>
|
|||
</div>
|
||||
<p>Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/ppaxisa/status/1574398423175921665
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/ppaxisa/status/1574398423175921665
|
||||
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
|
||||
df |>
|
||||
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
|
||||
|
@ -702,7 +702,7 @@ diamonds |> hex_plot(carat, price, depth)</pre>
|
|||
Combining with dplyr</h2>
|
||||
<p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to do a vertical bar chart where you automatically sort the bars in frequency order using <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code>. Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sorted_bars <- function(df, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">sorted_bars <- function(df, var) {
|
||||
df |>
|
||||
mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
|
||||
ggplot(aes(y = {{ var }})) +
|
||||
|
@ -715,7 +715,7 @@ diamonds |> sorted_bars(cut)</pre>
|
|||
</div>
|
||||
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">conditional_bars <- function(df, condition, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">conditional_bars <- function(df, condition, var) {
|
||||
df |>
|
||||
filter({{ condition }}) |>
|
||||
ggplot(aes({{ var }})) +
|
||||
|
@ -729,7 +729,7 @@ diamonds |> conditional_bars(cut == "Good", clarity)</pre>
|
|||
</div>
|
||||
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
|
||||
<pre data-type="programlisting" data-code-language="r"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
|
||||
|
||||
fancy_ts <- function(df, val, group) {
|
||||
labs <- df |>
|
||||
|
@ -768,7 +768,7 @@ fancy_ts(df, value, dist_name)</pre>
|
|||
Faceting</h2>
|
||||
<p>Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/sharoz/status/1574376332821204999
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/sharoz/status/1574376332821204999
|
||||
|
||||
foo <- function(x) {
|
||||
ggplot(mtcars, aes(mpg, disp)) +
|
||||
|
@ -782,7 +782,7 @@ foo(cyl)</pre>
|
|||
</div>
|
||||
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/yutannihilat_en/status/1574387230025875457
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/yutannihilat_en/status/1574387230025875457
|
||||
density <- function(colour, facets, binwidth = 0.1) {
|
||||
diamonds |>
|
||||
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
|
||||
|
@ -810,7 +810,7 @@ density(cut, clarity)</pre>
|
|||
Labeling</h2>
|
||||
<p>Remember the histogram function we showed you earlier?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">histogram <- function(df, var, binwidth = NULL) {
|
||||
<pre data-type="programlisting" data-code-language="r">histogram <- function(df, var, binwidth = NULL) {
|
||||
df |>
|
||||
ggplot(aes({{ var }})) +
|
||||
geom_histogram(binwidth = binwidth)
|
||||
|
@ -819,7 +819,7 @@ Labeling</h2>
|
|||
<p>Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about before: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
|
||||
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">histogram <- function(df, var, binwidth) {
|
||||
<pre data-type="programlisting" data-code-language="r">histogram <- function(df, var, binwidth) {
|
||||
label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
|
||||
|
||||
df |>
|
||||
|
@ -853,7 +853,7 @@ Style</h1>
|
|||
<p>R doesn’t care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.</p>
|
||||
<p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="https://rdrr.io/r/stats/coef.html">coef()</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Too short
|
||||
<pre data-type="programlisting" data-code-language="r"># Too short
|
||||
f()
|
||||
|
||||
# Not a verb, or descriptive
|
||||
|
@ -865,7 +865,7 @@ collapse_years()</pre>
|
|||
</div>
|
||||
<p>R also doesn’t care about how you use white space in your functions but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># missing extra two spaces
|
||||
<pre data-type="programlisting" data-code-language="r"># missing extra two spaces
|
||||
pull_unique <- function(df, var) {
|
||||
df |>
|
||||
distinct({{ var }}) |>
|
||||
|
@ -890,7 +890,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">f1 <- function(string, prefix) {
|
||||
<pre data-type="programlisting" data-code-language="r">f1 <- function(string, prefix) {
|
||||
substr(string, 1, nchar(prefix)) == prefix
|
||||
}
|
||||
f3 <- function(x, y) {
|
||||
|
|
|
@ -92,12 +92,12 @@ The tidyverse</h2>
|
|||
<p>You’ll also need to install some R packages. An R <strong>package</strong> is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.</p>
|
||||
<p>You can install the complete tidyverse with a single line of code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">install.packages("tidyverse")</pre>
|
||||
</div>
|
||||
<p>On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer. If you have problems installing, make sure that you are connected to the internet, and that <a href="https://cloud.r-project.org/" class="uri">https://cloud.r-project.org/</a> isn’t blocked by your firewall or proxy.</p>
|
||||
<p>You will not be able to use the functions, objects, or help files in a package until you load it with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>. Once you have installed a package, you can load it using the <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
|
||||
#> ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
|
||||
|
@ -117,7 +117,7 @@ Other packages</h2>
|
|||
<p>There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles. This doesn’t make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data.</p>
|
||||
<p>In this book we’ll use three data packages from outside the tidyverse:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">install.packages(c("nycflights13", "gapminder", "Lahman"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">install.packages(c("nycflights13", "gapminder", "Lahman"))</pre>
|
||||
</div>
|
||||
<p>These packages provide data on airline flights, world development, and baseball that we’ll use to illustrate key data science ideas.</p>
|
||||
</section>
|
||||
|
@ -128,7 +128,7 @@ Other packages</h2>
|
|||
Running R code</h1>
|
||||
<p>The previous section showed you several examples of running R code. Code in the book looks like this:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">1 + 2
|
||||
<pre data-type="programlisting" data-code-language="r">1 + 2
|
||||
#> [1] 3</pre>
|
||||
</div>
|
||||
<p>If you run the same code in your local console, it will look like this:</p>
|
||||
|
@ -258,7 +258,7 @@ Colophon</h1>
|
|||
<td style="text-align: left;">CRAN (R 4.2.0)</td>
|
||||
</tr></tbody></table></div>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cli:::ruler()
|
||||
<pre data-type="programlisting" data-code-language="r">cli:::ruler()
|
||||
#> ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+--
|
||||
#> 12345678901234567890123456789012345678901234567890123456789012345678901234567</pre>
|
||||
</div>
|
||||
|
|
|
@ -27,7 +27,7 @@ Prerequisites</h2>
|
|||
|
||||
<p>In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We’re going to use just a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -37,7 +37,7 @@ Prerequisites</h2>
|
|||
Modifying multiple columns</h1>
|
||||
<p>Imagine you have this simple tibble and you want to count the number of observations and compute the median of every column.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
a = rnorm(10),
|
||||
b = rnorm(10),
|
||||
c = rnorm(10),
|
||||
|
@ -46,7 +46,7 @@ Modifying multiple columns</h1>
|
|||
</div>
|
||||
<p>You could do it with copy-and-paste:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> summarise(
|
||||
<pre data-type="programlisting" data-code-language="r">df |> summarise(
|
||||
n = n(),
|
||||
a = median(a),
|
||||
b = median(b),
|
||||
|
@ -60,7 +60,7 @@ Modifying multiple columns</h1>
|
|||
</div>
|
||||
<p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead you can use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> summarise(
|
||||
<pre data-type="programlisting" data-code-language="r">df |> summarise(
|
||||
n = n(),
|
||||
across(a:d, median),
|
||||
)
|
||||
|
@ -78,7 +78,7 @@ Selecting columns with<code>.cols</code>
|
|||
<p>The first argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, <code>.cols</code>, selects the columns to transform. This uses the same specifications as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <a href="#sec-select" data-type="xref">#sec-select</a>, so you can use functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">ends_with()</a></code> to select columns based on their name.</p>
|
||||
<p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code>where()</code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
grp = sample(2, 10, replace = TRUE),
|
||||
a = rnorm(10),
|
||||
b = rnorm(10),
|
||||
|
@ -108,7 +108,7 @@ df |>
|
|||
<li>
|
||||
<code>where(is.logical)</code> selects all logical columns.</li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_types <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df_types <- tibble(
|
||||
x1 = 1:3,
|
||||
x2 = runif(3),
|
||||
y1 = sample(letters, 3),
|
||||
|
@ -138,7 +138,7 @@ Calling a single function</h2>
|
|||
<p>The second argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a functional programming language.</p>
|
||||
<p>It’s important to note that we’re passing this function to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> so that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can call it; we’re not calling it ourselves. That means the function name should never be followed by <code>()</code>. If you forget, you’ll get an error:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(grp) |>
|
||||
summarise(across(everything(), median()))
|
||||
#> Error in vapply(.x, .f, .mold, ..., USE.NAMES = FALSE): values must be length 1,
|
||||
|
@ -146,7 +146,7 @@ Calling a single function</h2>
|
|||
</div>
|
||||
<p>This error arises because you’re calling the function with no input, e.g.:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">median()
|
||||
<pre data-type="programlisting" data-code-language="r">median()
|
||||
#> Error in is.factor(x): argument "x" is missing, with no default</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -156,7 +156,7 @@ Calling a single function</h2>
|
|||
Calling multiple functions</h2>
|
||||
<p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> propagates those missing values, giving us a suboptimal output:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
|
||||
<pre data-type="programlisting" data-code-language="r">rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
|
||||
sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
|
||||
}
|
||||
|
||||
|
@ -178,7 +178,7 @@ df_miss |>
|
|||
</div>
|
||||
<p>It would be nice if we could pass along <code>na.rm = TRUE</code> to <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> to remove these missing values. To do so, instead of calling <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> directly, we need to create a new function that calls <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> with the desired arguments:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
across(a:d, function(x) median(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
|
@ -190,7 +190,7 @@ df_miss |>
|
|||
</div>
|
||||
<p>This is a little verbose, so R comes with a handy shortcut: for this sort of throw away, or <strong>anonymous</strong><span data-type="footnote">Anonymous, because we never explicitly gave it a name with <code><-</code>. Another term programmers use for this is “lambda function”.</span>, function you can replace <code>function</code> with <code>\</code><span data-type="footnote">In older code you might see syntax that looks like <code>~ .x + 1</code>. This is another way to write anonymous functions but it only works inside tidyverse functions and always uses the variable name <code>.x</code>. We now recommend the base syntax, <code>\(x) x + 1</code>.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
across(a:d, \(x) median(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
|
@ -198,7 +198,7 @@ df_miss |>
|
|||
</div>
|
||||
<p>In either case, <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> effectively expands to the following code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
a = median(a, na.rm = TRUE),
|
||||
b = median(b, na.rm = TRUE),
|
||||
|
@ -209,7 +209,7 @@ df_miss |>
|
|||
</div>
|
||||
<p>When we remove the missing values from the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, it would be nice to know just how many values we were removing. We can find that out by supplying two functions to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by passing a named list to <code>.fns</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
across(a:d, list(
|
||||
median = \(x) median(x, na.rm = TRUE),
|
||||
|
@ -231,7 +231,7 @@ df_miss |>
|
|||
Column names</h2>
|
||||
<p>The result of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is named according to the specification provided in the <code>.names</code> argument. We could specify our own if we wanted the name of the function to come first<span data-type="footnote">You can’t currently change the order of the columns, but you could reorder them after the fact using <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> or similar.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
across(
|
||||
a:d,
|
||||
|
@ -251,7 +251,7 @@ Column names</h2>
|
|||
</div>
|
||||
<p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
mutate(
|
||||
across(a:d, \(x) coalesce(x, 0))
|
||||
)
|
||||
|
@ -266,7 +266,7 @@ Column names</h2>
|
|||
</div>
|
||||
<p>If you’d like to instead create new columns, you can use the <code>.names</code> argument to give the output new names:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
mutate(
|
||||
across(a:d, \(x) abs(x), .names = "{.col}_abs")
|
||||
)
|
||||
|
@ -286,7 +286,7 @@ Column names</h2>
|
|||
Filtering</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is a great match for <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> but it’s more awkward to use with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&</code>. It’s clear that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> called <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
|
||||
#> # A tibble: 3 × 4
|
||||
#> a b c d
|
||||
#> <dbl> <dbl> <dbl> <dbl>
|
||||
|
@ -317,7 +317,7 @@ df_miss |> filter(if_all(a:d, is.na))
|
|||
<code>across()</code> in functions</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is particularly useful to program with because it allows you to operate on multiple columns. For example, <a href="https://twitter.com/_wurli/status/1571836746899283969">Jacob Scott</a> uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)
|
||||
<pre data-type="programlisting" data-code-language="r">library(lubridate)
|
||||
#> Loading required package: timechange
|
||||
#>
|
||||
#> Attaching package: 'lubridate'
|
||||
|
@ -347,7 +347,7 @@ df_date |>
|
|||
</div>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in <a href="#sec-embracing" data-type="xref">#sec-embracing</a>. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">summarise_means <- function(df, summary_vars = where(is.numeric)) {
|
||||
<pre data-type="programlisting" data-code-language="r">summarise_means <- function(df, summary_vars = where(is.numeric)) {
|
||||
df |>
|
||||
summarise(
|
||||
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
|
||||
|
@ -390,7 +390,7 @@ Vs<code>pivot_longer()</code>
|
|||
</h2>
|
||||
<p>Before we go on, it’s worth pointing out an interesting connection between <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> (<a href="#sec-pivoting" data-type="xref">#sec-pivoting</a>). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
summarise(across(a:d, list(median = median, mean = mean)))
|
||||
#> # A tibble: 1 × 8
|
||||
#> a_median a_mean b_median b_mean c_median c_mean d_median d_mean
|
||||
|
@ -399,7 +399,7 @@ Vs<code>pivot_longer()</code>
|
|||
</div>
|
||||
<p>We could compute the same values by pivoting longer and then summarizing:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">long <- df |>
|
||||
<pre data-type="programlisting" data-code-language="r">long <- df |>
|
||||
pivot_longer(a:d) |>
|
||||
group_by(name) |>
|
||||
summarise(
|
||||
|
@ -417,7 +417,7 @@ long
|
|||
</div>
|
||||
<p>And if you wanted the same structure as <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> you could pivot again:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">long |>
|
||||
<pre data-type="programlisting" data-code-language="r">long |>
|
||||
pivot_wider(
|
||||
names_from = name,
|
||||
values_from = c(median, mean),
|
||||
|
@ -431,7 +431,7 @@ long
|
|||
</div>
|
||||
<p>This is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_paired <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df_paired <- tibble(
|
||||
a_val = rnorm(10),
|
||||
a_wts = runif(10),
|
||||
b_val = rnorm(10),
|
||||
|
@ -444,7 +444,7 @@ long
|
|||
</div>
|
||||
<p>There’s currently no way to do this with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code><span data-type="footnote">Maybe there will be one day, but currently we don’t see how.</span>, but it’s relatively straightforward with <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_long <- df_paired |>
|
||||
<pre data-type="programlisting" data-code-language="r">df_long <- df_paired |>
|
||||
pivot_longer(
|
||||
everything(),
|
||||
names_to = c("group", ".value"),
|
||||
|
@ -488,7 +488,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Explain what each step of the pipeline in this function does. What special feature of <code>where()</code> are we taking advantage of?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">show_missing <- function(df, group_vars, summary_vars = everything()) {
|
||||
<pre data-type="programlisting" data-code-language="r">show_missing <- function(df, group_vars, summary_vars = everything()) {
|
||||
df |>
|
||||
group_by(pick({{ group_vars }})) |>
|
||||
summarise(
|
||||
|
@ -508,14 +508,14 @@ nycflights13::flights |> show_missing(c(year, month, day))</pre>
|
|||
Reading multiple files</h1>
|
||||
<p>In the previous section, you learned how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> to repeat a transformation on multiple columns. In this section, you’ll learn how to use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of Excel spreadsheets<span data-type="footnote">If you instead had a directory of CSV files with the same format, you can use the technique from <a href="#sec-readr-directory" data-type="xref">#sec-readr-directory</a>.</span> you want to read. You could do it with copy and paste:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">data2019 <- readxl::read_excel("data/y2019.xlsx")
|
||||
<pre data-type="programlisting" data-code-language="r">data2019 <- readxl::read_excel("data/y2019.xlsx")
|
||||
data2020 <- readxl::read_excel("data/y2020.xlsx")
|
||||
data2021 <- readxl::read_excel("data/y2021.xlsx")
|
||||
data2022 <- readxl::read_excel("data/y2022.xlsx")</pre>
|
||||
</div>
|
||||
<p>And then use <code><a href="https://dplyr.tidyverse.org/reference/bind_rows.html">dplyr::bind_rows()</a></code> to combine them all together:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">data <- bind_rows(data2019, data2020, data2021, data2022)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">data <- bind_rows(data2019, data2020, data2021, data2022)</pre>
|
||||
</div>
|
||||
<p>You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> to list all the files in a directory, then use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to read each of them into a list, then use <code><a href="https://purrr.tidyverse.org/reference/list_c.html">purrr::list_rbind()</a></code> to combine them into a single data frame. We’ll then discuss how you can handle situations of increasing heterogeneity, where you can’t do exactly the same thing to every file.</p>
|
||||
|
||||
|
@ -528,7 +528,7 @@ Listing files in a directory</h2>
|
|||
<li><p><code>full.names</code> determines whether or not the directory name should be included in the output. You almost always want this to be <code>TRUE</code>.</p></li>
|
||||
</ul><p>To make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one year’s worth of data for 142 countries. We can list them all with the appropriate call to <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
|
||||
paths
|
||||
#> [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
|
||||
#> [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
|
||||
|
@ -552,7 +552,7 @@ gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
|
|||
</div>
|
||||
<p>But putting each sheet into its own variable is going to make it hard to work with them a few steps down the road. Instead, they’ll be easier to work with if we put them into a single object. A list is the perfect tool for this job:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- list(
|
||||
<pre data-type="programlisting" data-code-language="r">files <- list(
|
||||
readxl::read_excel("data/gapminder/1952.xlsx"),
|
||||
readxl::read_excel("data/gapminder/1957.xlsx"),
|
||||
readxl::read_excel("data/gapminder/1962.xlsx"),
|
||||
|
@ -562,7 +562,7 @@ gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
|
|||
</div>
|
||||
<p>Now that you have these data frames in a list, how do you get one out? You can use <code>files[[i]]</code> to extract the i-th element:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files[[3]]
|
||||
<pre data-type="programlisting" data-code-language="r">files[[3]]
|
||||
#> # A tibble: 142 × 5
|
||||
#> country continent lifeExp pop gdpPercap
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl>
|
||||
|
@ -583,7 +583,7 @@ gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
|
|||
</h2>
|
||||
<p>The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to make even better use of our <code>paths</code> vector. <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> is similar to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, but instead of doing something to each column in a data frame, it does something to each element of a vector. <code>map(x, f)</code> is shorthand for:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">list(
|
||||
<pre data-type="programlisting" data-code-language="r">list(
|
||||
f(x[[1]]),
|
||||
f(x[[2]]),
|
||||
...,
|
||||
|
@ -592,7 +592,7 @@ gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
|
|||
</div>
|
||||
<p>So we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> to get a list of 12 data frames:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- map(paths, readxl::read_excel)
|
||||
<pre data-type="programlisting" data-code-language="r">files <- map(paths, readxl::read_excel)
|
||||
length(files)
|
||||
#> [1] 12
|
||||
|
||||
|
@ -611,7 +611,7 @@ files[[1]]
|
|||
<p>(This is another data structure that doesn’t display particularly compactly with <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> so you might want to load it into RStudio and inspect it with <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>).</p>
|
||||
<p>Now we can use <code><a href="https://purrr.tidyverse.org/reference/list_c.html">purrr::list_rbind()</a></code> to combine that list of data frames into a single data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">list_rbind(files)
|
||||
<pre data-type="programlisting" data-code-language="r">list_rbind(files)
|
||||
#> # A tibble: 1,704 × 5
|
||||
#> country continent lifeExp pop gdpPercap
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl>
|
||||
|
@ -625,13 +625,13 @@ files[[1]]
|
|||
</div>
|
||||
<p>Or we could do both steps at once in a pipeline:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
map(readxl::read_excel) |>
|
||||
list_rbind()</pre>
|
||||
</div>
|
||||
<p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>. For example, it’s often useful to peek at the first row of each file with <code>n_max = 1</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
map(\(path) readxl::read_excel(path, n_max = 1)) |>
|
||||
list_rbind()
|
||||
#> # A tibble: 12 × 5
|
||||
|
@ -654,7 +654,7 @@ Data in the path</h2>
|
|||
<p>Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.</p>
|
||||
<p>First, we name the vector of paths. The easiest way to do this is with the <code><a href="https://rlang.r-lib.org/reference/set_names.html">set_names()</a></code> function, which can take a function. Here we use <code><a href="https://rdrr.io/r/base/basename.html">basename()</a></code> to extract just the file name from the full path:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |> set_names(basename)
|
||||
<pre data-type="programlisting" data-code-language="r">paths |> set_names(basename)
|
||||
#> 1952.xlsx 1957.xlsx
|
||||
#> "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
|
||||
#> 1962.xlsx 1967.xlsx
|
||||
|
@ -670,13 +670,13 @@ Data in the path</h2>
|
|||
</div>
|
||||
<p>Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">files <- paths |>
|
||||
set_names(basename) |>
|
||||
map(readxl::read_excel)</pre>
|
||||
</div>
|
||||
<p>That makes this call to <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> shorthand for:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- list(
|
||||
<pre data-type="programlisting" data-code-language="r">files <- list(
|
||||
"1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"),
|
||||
"1957.xlsx" = readxl::read_excel("data/gapminder/1957.xlsx"),
|
||||
"1962.xlsx" = readxl::read_excel("data/gapminder/1962.xlsx"),
|
||||
|
@ -686,7 +686,7 @@ Data in the path</h2>
|
|||
</div>
|
||||
<p>You can also use <code>[[</code> to extract elements by name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files[["1962.xlsx"]]
|
||||
<pre data-type="programlisting" data-code-language="r">files[["1962.xlsx"]]
|
||||
#> # A tibble: 142 × 5
|
||||
#> country continent lifeExp pop gdpPercap
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl>
|
||||
|
@ -700,7 +700,7 @@ Data in the path</h2>
|
|||
</div>
|
||||
<p>Then we use the <code>names_to</code> argument to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> to tell it to save the names into a new column called <code>year</code>, then use <code><a href="https://readr.tidyverse.org/reference/parse_number.html">readr::parse_number()</a></code> to extract the number from the string.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
set_names(basename) |>
|
||||
map(readxl::read_excel) |>
|
||||
list_rbind(names_to = "year") |>
|
||||
|
@ -718,7 +718,7 @@ Data in the path</h2>
|
|||
</div>
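<p>To see the naming mechanism in isolation, here is a small sketch that does not touch any real files. The paths are hypothetical, and the anonymous function passed to <code>map()</code> is just a stand-in for a real reader such as <code>read_excel()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Hypothetical paths; nothing is read from disk in this sketch
paths <- c("data/demo/1952.csv", "data/demo/1957.csv")

# Name each path by its file name
named <- set_names(paths, basename)
names(named)
#> [1] "1952.csv" "1957.csv"

# Stand-in for a reader: returns a one-row data frame per path
dfs <- map(named, \(path) data.frame(n_chars = nchar(path)))

# The names carried along by map() end up in the `file` column
combined <- list_rbind(dfs, names_to = "file")
combined$file
#> [1] "1952.csv" "1957.csv"</pre>
</div>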
|
||||
<p>In more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use <code><a href="https://rlang.r-lib.org/reference/set_names.html">set_names()</a></code> (without any arguments) to record the full path, and then use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">tidyr::separate_wider_delim()</a></code> and friends to turn them into useful columns.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># NOTE: this chapter also depends on dev tidyr (in addition to dev purrr and dev dplyr)
|
||||
<pre data-type="programlisting" data-code-language="r"># NOTE: this chapter also depends on dev tidyr (in addition to dev purrr and dev dplyr)
|
||||
paths |>
|
||||
set_names() |>
|
||||
map(readxl::read_excel) |>
|
||||
|
@ -743,7 +743,7 @@ paths |>
|
|||
Save your work</h2>
|
||||
<p>Now that you’ve done all this hard work to get to a nice tidy data frame, it’s a great time to save your work:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gapminder <- paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">gapminder <- paths |>
|
||||
set_names(basename) |>
|
||||
map(readxl::read_excel) |>
|
||||
list_rbind(names_to = "year") |>
|
||||
|
@ -762,7 +762,7 @@ Many simple iterations</h2>
|
|||
<p>Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.</p>
|
||||
<p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, then call <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> once:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">process_file <- function(path) {
|
||||
<pre data-type="programlisting" data-code-language="r">process_file <- function(path) {
|
||||
df <- read_csv(path)
|
||||
|
||||
df |>
|
||||
|
@ -777,7 +777,7 @@ paths |>
|
|||
</div>
|
||||
<p>Alternatively, you could apply each step of <code>process_file()</code> to every file:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
map(read_csv) |>
|
||||
map(\(df) df |> filter(!is.na(id))) |>
|
||||
map(\(df) df |> mutate(id = tolower(id))) |>
|
||||
|
@ -787,7 +787,7 @@ paths |>
|
|||
<p>We recommend this approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.</p>
|
||||
<p>In this particular example, there’s another optimization you could make, by binding all the data frames together earlier. Then you can rely on regular dplyr behavior:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
map(read_csv) |>
|
||||
list_rbind() |>
|
||||
filter(!is.na(id)) |>
|
||||
|
@ -801,12 +801,12 @@ paths |>
|
|||
Heterogeneous data</h2>
|
||||
<p>Unfortunately sometimes it’s not possible to go from <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> straight to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> because the data frames are so heterogeneous that <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">files <- paths |>
|
||||
map(readxl::read_excel) </pre>
|
||||
</div>
|
||||
<p>Then a very useful strategy is to capture the structure of the data frames as data so that you can explore it using your data science skills. One way to do so is with this handy <code>df_types</code> function that returns a tibble with one row for each column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df_types <- function(df) {
|
||||
<pre data-type="programlisting" data-code-language="r">df_types <- function(df) {
|
||||
tibble(
|
||||
col_name = names(df),
|
||||
col_type = map_chr(df, vctrs::vec_ptype_full),
|
||||
|
@ -839,7 +839,7 @@ df_types(nycflights13::flights)
|
|||
</div>
|
||||
<p>You can then apply this function to all of the files, and maybe do some pivoting to make it easy to see where there are differences. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files |>
|
||||
<pre data-type="programlisting" data-code-language="r">files |>
|
||||
map(df_types) |>
|
||||
list_rbind(names_to = "file_name") |>
|
||||
select(-n_miss) |>
|
||||
|
@ -864,7 +864,7 @@ Handling failures</h2>
|
|||
<p>Sometimes the structure of your data might be sufficiently wild that you can’t even read all the files with a single command. And then you’ll encounter one of the downsides of map: it succeeds or fails as a whole. <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?</p>
|
||||
<p>Luckily, purrr comes with a helper to tackle this problem: <code><a href="https://purrr.tidyverse.org/reference/possibly.html">possibly()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/possibly.html">possibly()</a></code> is what’s known as a function operator: it takes a function and returns a function with modified behavior. In particular, <code><a href="https://purrr.tidyverse.org/reference/possibly.html">possibly()</a></code> changes a function from erroring to returning a value that you specify:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">files <- paths |>
|
||||
<pre data-type="programlisting" data-code-language="r">files <- paths |>
|
||||
map(possibly(\(path) readxl::read_excel(path), NULL))
|
||||
|
||||
data <- files |> list_rbind()</pre>
|
||||
|
@ -872,7 +872,7 @@ data <- files |> list_rbind()</pre>
|
|||
<p>This works particularly well here because <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code>, like many tidyverse functions, automatically ignores <code>NULL</code>s.</p>
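<p>Here is a minimal sketch of that behavior that runs without any files. The <code>read_one()</code> function is a hypothetical stand-in for a reader: it errors on the string input because <code>log()</code> only accepts numbers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Hypothetical "reader" that fails on some inputs: log() errors on a string
read_one <- function(x) {
  data.frame(input = as.character(x), log = log(x))
}

inputs <- list(1, "oops", 100)

# possibly() turns the error into a NULL instead of stopping map()
results <- map(inputs, possibly(read_one, otherwise = NULL))

# The NULLs mark the inputs that failed
failed <- map_lgl(results, is.null)
failed
#> [1] FALSE  TRUE FALSE

# list_rbind() ignores the NULL and binds the two successes
combined <- list_rbind(results)
nrow(combined)
#> [1] 2</pre>
</div>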
|
||||
<p>Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">failed <- map_vec(files, is.null)
|
||||
<pre data-type="programlisting" data-code-language="r">failed <- map_vec(files, is.null)
|
||||
paths[failed]
|
||||
#> character(0)</pre>
|
||||
</div>
|
||||
|
@ -894,13 +894,13 @@ Writing to a database</h2>
|
|||
<p>Sometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do <code>map(files, read_csv)</code>. One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.</p>
|
||||
<p>If you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s <code>duckdb_read_csv()</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(duckdb::duckdb())
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb())
|
||||
duckdb::duckdb_read_csv(con, "gapminder", paths)</pre>
|
||||
</div>
|
||||
<p>This would work well here, but we don’t have csv files; instead we have Excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have one function that will load them all in.</p>
|
||||
<p>We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">template <- readxl::read_excel(paths[[1]])
|
||||
<pre data-type="programlisting" data-code-language="r">template <- readxl::read_excel(paths[[1]])
|
||||
template$year <- 1952
|
||||
template
|
||||
#> # A tibble: 142 × 6
|
||||
|
@ -916,12 +916,12 @@ template
|
|||
</div>
|
||||
<p>Now we can connect to the database, and use <code><a href="https://dbi.r-dbi.org/reference/dbCreateTable.html">DBI::dbCreateTable()</a></code> to turn our template into a database table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(duckdb::duckdb())
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb())
|
||||
DBI::dbCreateTable(con, "gapminder", template)</pre>
|
||||
</div>
|
||||
<p><code>dbCreateTable()</code> doesn’t use the data in <code>template</code>, just the variable names and types. So if we inspect the <code>gapminder</code> table now you’ll see that it’s empty but it has the variables we need with the types we expect:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con |> tbl("gapminder")
|
||||
<pre data-type="programlisting" data-code-language="r">con |> tbl("gapminder")
|
||||
#> # Source: table<gapminder> [0 x 6]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>,
|
||||
|
@ -929,7 +929,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
|
|||
</div>
|
||||
<p>Next, we need a function that takes a single file path, reads it into R, and adds the result to the <code>gapminder</code> table. We can do that by combining <code>read_excel()</code> with <code><a href="https://dbi.r-dbi.org/reference/dbAppendTable.html">DBI::dbAppendTable()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">append_file <- function(path) {
|
||||
<pre data-type="programlisting" data-code-language="r">append_file <- function(path) {
|
||||
df <- readxl::read_excel(path)
|
||||
df$year <- parse_number(basename(path))
|
||||
|
||||
|
@ -938,15 +938,15 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
|
|||
</div>
|
||||
<p>Now we need to call <code>append_file()</code> once for each element of <code>paths</code>. That’s certainly possible with <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |> map(append_file)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |> map(append_file)</pre>
|
||||
</div>
|
||||
<p>But we don’t care about the output of <code>append_file()</code>, so instead of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> it’s slightly nicer to use <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code> does exactly the same thing as <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> but throws the output away:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">paths |> walk(append_file)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">paths |> walk(append_file)</pre>
|
||||
</div>
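<p>The difference between the two is easiest to see on a toy input. In this sketch (the vector and the messages are purely illustrative), <code>map()</code> collects a result for every element, while <code>walk()</code> calls the function only for its side effect and invisibly returns its input:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# map() keeps the result of every call in a list
squares <- map(1:3, \(i) i^2)
length(squares)
#> [1] 3

# walk() runs the function for its side effect (here, a message) ...
out <- walk(1:3, \(i) message("processing item ", i))

# ... and invisibly returns the input unchanged
identical(out, 1:3)
#> [1] TRUE</pre>
</div>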
|
||||
<p>Now we can see if we have all the data in our table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con |>
|
||||
<pre data-type="programlisting" data-code-language="r">con |>
|
||||
tbl("gapminder") |>
|
||||
count(year)
|
||||
#> # Source: SQL [?? x 2]
|
||||
|
@ -968,7 +968,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
|
|||
Writing csv files</h2>
|
||||
<p>The same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the <code><a href="https://ggplot2.tidyverse.org/reference/diamonds.html">ggplot2::diamonds</a></code> data and save one csv file for each <code>clarity</code>. First we need to make those individual datasets. There are many ways you could do that, but there’s one way we particularly like: <code><a href="https://dplyr.tidyverse.org/reference/group_nest.html">group_nest()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">by_clarity <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">by_clarity <- diamonds |>
|
||||
group_nest(clarity)
|
||||
|
||||
by_clarity
|
||||
|
@ -985,7 +985,7 @@ by_clarity
|
|||
</div>
|
||||
<p>This gives us a new tibble with eight rows and two columns. <code>clarity</code> is our grouping variable and <code>data</code> is a list-column containing one tibble for each unique value of <code>clarity</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">by_clarity$data[[1]]
|
||||
<pre data-type="programlisting" data-code-language="r">by_clarity$data[[1]]
|
||||
#> # A tibble: 741 × 9
|
||||
#> carat cut color depth table price x y z
|
||||
#> <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
|
||||
|
@ -999,7 +999,7 @@ by_clarity
|
|||
</div>
|
||||
<p>While we’re here, let’s create a column that gives the name of the output file, using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">by_clarity <- by_clarity |>
|
||||
<pre data-type="programlisting" data-code-language="r">by_clarity <- by_clarity |>
|
||||
mutate(path = str_glue("diamonds-{clarity}.csv"))
|
||||
|
||||
by_clarity
|
||||
|
@ -1016,7 +1016,7 @@ by_clarity
|
|||
</div>
|
||||
<p>So if we were going to save these data frames by hand, we might write something like:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
|
||||
<pre data-type="programlisting" data-code-language="r">write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
|
||||
write_csv(by_clarity$data[[2]], by_clarity$path[[2]])
|
||||
write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
|
||||
...
|
||||
|
@ -1024,7 +1024,7 @@ write_csv(by_clarity$by_clarity[[8]], by_clarity$path[[8]])</pre>
|
|||
</div>
|
||||
<p>This is a little different to our previous uses of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> because there are two arguments that are changing, not just one. That means we need a new function: <code><a href="https://purrr.tidyverse.org/reference/map2.html">map2()</a></code>, which varies both the first and second arguments. And because we again don’t care about the output, we want <code><a href="https://purrr.tidyverse.org/reference/map2.html">walk2()</a></code> rather than <code><a href="https://purrr.tidyverse.org/reference/map2.html">map2()</a></code>. That gives us:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">walk2(by_clarity$data, by_clarity$path, write_csv)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">walk2(by_clarity$data, by_clarity$path, write_csv)</pre>
|
||||
</div>
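<p>The same pattern can be tried safely on toy data. In this sketch the data frames and file names are hypothetical, the files go to a temporary directory, and base <code>write.csv()</code> stands in for <code>write_csv()</code> so that only purrr is required:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Two hypothetical data frames and the paths they should be written to
dfs <- list(data.frame(x = 1:2), data.frame(x = 3:5))
out_paths <- file.path(tempdir(), c("sketch-a.csv", "sketch-b.csv"))

# walk2() pairs up the i-th data frame with the i-th path
walk2(dfs, out_paths, \(df, path) write.csv(df, path, row.names = FALSE))

file.exists(out_paths)
#> [1] TRUE TRUE</pre>
</div>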
|
||||
</section>
|
||||
|
||||
|
@ -1033,7 +1033,7 @@ write_csv(by_clarity$by_clarity[[8]], by_clarity$path[[8]])</pre>
|
|||
Saving plots</h2>
|
||||
<p>We can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">carat_histogram <- function(df) {
|
||||
<pre data-type="programlisting" data-code-language="r">carat_histogram <- function(df) {
|
||||
ggplot(df, aes(carat)) + geom_histogram(binwidth = 0.1)
|
||||
}
|
||||
|
||||
|
@ -1044,7 +1044,7 @@ carat_histogram(by_clarity$data[[1]])</pre>
|
|||
</div>
|
||||
<p>Now we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> to create a list of many plots<span data-type="footnote">You can print <code>by_clarity$plot</code> to get a crude animation — you’ll get one plot for each element of <code>plots</code>. NOTE: this didn’t happen for me.</span> and their eventual file paths:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">by_clarity <- by_clarity |>
|
||||
<pre data-type="programlisting" data-code-language="r">by_clarity <- by_clarity |>
|
||||
mutate(
|
||||
plot = map(data, carat_histogram),
|
||||
path = str_glue("clarity-{clarity}.png")
|
||||
|
@ -1052,7 +1052,7 @@ carat_histogram(by_clarity$data[[1]])</pre>
|
|||
</div>
|
||||
<p>Then use <code><a href="https://purrr.tidyverse.org/reference/map2.html">walk2()</a></code> with <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> to save each plot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">walk2(
|
||||
<pre data-type="programlisting" data-code-language="r">walk2(
|
||||
by_clarity$path,
|
||||
by_clarity$plot,
|
||||
\(path, plot) ggsave(path, plot, width = 6, height = 6)
|
||||
|
@ -1060,7 +1060,7 @@ carat_histogram(by_clarity$data[[1]])</pre>
|
|||
</div>
|
||||
<p>This is shorthand for:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)
|
||||
<pre data-type="programlisting" data-code-language="r">ggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)
|
||||
ggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)
|
||||
ggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)
|
||||
...
|
||||
|
|
|
@ -13,7 +13,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(nycflights13)</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -31,7 +31,7 @@ Primary and foreign keys</h2>
|
|||
<ul><li>
|
||||
<p><code>airlines</code> records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making <code>carrier</code> the primary key.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">airlines
|
||||
<pre data-type="programlisting" data-code-language="r">airlines
|
||||
#> # A tibble: 16 × 2
|
||||
#> carrier name
|
||||
#> <chr> <chr>
|
||||
|
@ -47,7 +47,7 @@ Primary and foreign keys</h2>
|
|||
<li>
|
||||
<p><code>airports</code> records data about each airport. You can identify each airport by its three letter airport code, making <code>faa</code> the primary key.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">airports
|
||||
<pre data-type="programlisting" data-code-language="r">airports
|
||||
#> # A tibble: 1,458 × 8
|
||||
#> faa name lat lon alt tz dst tzone
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
|
||||
|
@ -63,7 +63,7 @@ Primary and foreign keys</h2>
|
|||
<li>
|
||||
<p><code>planes</code> records data about each plane. You can identify a plane by its tail number, making <code>tailnum</code> the primary key.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">planes
|
||||
<pre data-type="programlisting" data-code-language="r">planes
|
||||
#> # A tibble: 3,322 × 9
|
||||
#> tailnum year type manuf…¹ model engines seats speed engine
|
||||
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
|
||||
|
@ -79,7 +79,7 @@ Primary and foreign keys</h2>
|
|||
<li>
|
||||
<p><code>weather</code> records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making <code>origin</code> and <code>time_hour</code> the compound primary key.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">weather
|
||||
<pre data-type="programlisting" data-code-language="r">weather
|
||||
#> # A tibble: 26,115 × 15
|
||||
#> origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…²
|
||||
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
|
@ -122,7 +122,7 @@ Primary and foreign keys</h2>
|
|||
Checking primary keys</h2>
|
||||
<p>Now that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">planes |>
|
||||
<pre data-type="programlisting" data-code-language="r">planes |>
|
||||
count(tailnum) |>
|
||||
filter(n > 1)
|
||||
#> # A tibble: 0 × 2
|
||||
|
@ -136,7 +136,7 @@ weather |>
|
|||
</div>
|
||||
<p>You should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">planes |>
|
||||
<pre data-type="programlisting" data-code-language="r">planes |>
|
||||
filter(is.na(tailnum))
|
||||
#> # A tibble: 0 × 9
|
||||
#> # … with 9 variables: tailnum <chr>, year <int>, type <chr>,
|
||||
|
@ -159,7 +159,7 @@ Surrogate keys</h2>
|
|||
<p>So far we haven’t talked about the primary key for <code>flights</code>. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.</p>
|
||||
<p>After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
count(time_hour, carrier, flight) |>
|
||||
filter(n > 1)
|
||||
#> # A tibble: 0 × 4
|
||||
|
@ -167,7 +167,7 @@ Surrogate keys</h2>
|
|||
</div>
|
||||
<p>Does the absence of duplicates automatically make <code>time_hour</code>-<code>carrier</code>-<code>flight</code> a primary key? It’s certainly a good start, but it doesn’t guarantee it. For example, are altitude and latitude a good primary key for <code>airports</code>?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">airports |>
|
||||
<pre data-type="programlisting" data-code-language="r">airports |>
|
||||
count(alt, lat) |>
|
||||
filter(n > 1)
|
||||
#> # A tibble: 1 × 3
|
||||
|
@ -178,7 +178,7 @@ Surrogate keys</h2>
|
|||
<p>Identifying an airport by its altitude and latitude is clearly a bad idea, and in general it’s not possible to know from the data alone whether or not a combination of variables makes a good primary key. But for flights, the combination of <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.</p>
|
||||
<p>That said, we might be better off introducing a simple numeric surrogate key using the row number:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 <- flights |>
|
||||
mutate(id = row_number(), .before = 1)
|
||||
flights2
|
||||
#> # A tibble: 336,776 × 20
|
||||
|
@ -221,7 +221,7 @@ Basic joins</h1>
|
|||
Mutating joins</h2>
|
||||
<p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> to avoid this problem.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 <- flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 <- flights |>
|
||||
select(year, time_hour, origin, dest, tailnum, carrier)
|
||||
flights2
|
||||
#> # A tibble: 336,776 × 6
|
||||
|
@ -237,7 +237,7 @@ flights2
|
|||
</div>
|
||||
<p>There are four types of mutating join, but there’s one that you’ll use almost all of the time: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>. It’s special because the output will always have the same rows as <code>x</code><span data-type="footnote">That’s not 100% true, but you’ll get a warning whenever it isn’t.</span>. The primary use of <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> is to add in additional metadata. For example, we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> to add the full airline name to the <code>flights2</code> data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(airlines)
|
||||
#> Joining with `by = join_by(carrier)`
|
||||
#> # A tibble: 336,776 × 7
|
||||
|
@ -253,7 +253,7 @@ flights2
|
|||
</div>
|
||||
<p>Or we could find out the temperature and wind speed when each plane departed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(weather |> select(origin, time_hour, temp, wind_speed))
|
||||
#> Joining with `by = join_by(time_hour, origin)`
|
||||
#> # A tibble: 336,776 × 8
|
||||
|
@ -269,7 +269,7 @@ flights2
|
|||
</div>
|
||||
<p>Or what size of plane was flying:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(planes |> select(tailnum, type, engines, seats))
|
||||
#> Joining with `by = join_by(tailnum)`
|
||||
#> # A tibble: 336,776 × 9
|
||||
|
@ -285,7 +285,7 @@ flights2
|
|||
</div>
|
||||
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
filter(tailnum == "N3ALAA") |>
|
||||
left_join(planes |> select(tailnum, type, engines, seats))
|
||||
#> Joining with `by = join_by(tailnum)`
|
||||
|
@ -308,7 +308,7 @@ flights2
|
|||
Specifying join keys</h2>
|
||||
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> will use all variables that appear in both data frames as the join key, a so-called <strong>natural</strong> join. This is a useful heuristic, but it doesn’t always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(planes)
|
||||
#> Joining with `by = join_by(year, tailnum)`
|
||||
#> # A tibble: 336,776 × 13
|
||||
|
@ -325,7 +325,7 @@ Specifying join keys</h2>
|
|||
</div>
|
||||
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is the year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(planes, join_by(tailnum))
|
||||
#> # A tibble: 336,776 × 14
|
||||
#> year.x time_hour origin dest tailnum carrier year.y type
|
||||
|
@ -343,7 +343,7 @@ Specifying join keys</h2>
|
|||
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
|
||||
<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(airports, join_by(dest == faa))
|
||||
#> # A tibble: 336,776 × 13
|
||||
#> year time_hour origin dest tailnum carrier name lat lon
|
||||
|
@ -384,7 +384,7 @@ flights2 |>
|
|||
Filtering joins</h2>
|
||||
<p>As you might guess, the primary action of a <strong>filtering join</strong> is to filter the rows. There are two types: semi-joins and anti-joins. <strong>Semi-joins</strong> keep all rows in <code>x</code> that have a match in <code>y</code>. For example, we could use a semi-join to filter the <code>airports</code> dataset to show just the origin airports:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">airports |>
|
||||
<pre data-type="programlisting" data-code-language="r">airports |>
|
||||
semi_join(flights2, join_by(faa == origin))
|
||||
#> # A tibble: 3 × 8
|
||||
#> faa name lat lon alt tz dst tzone
|
||||
|
@ -395,7 +395,7 @@ Filtering joins</h2>
|
|||
</div>
|
||||
<p>Or just the destinations:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">airports |>
|
||||
<pre data-type="programlisting" data-code-language="r">airports |>
|
||||
semi_join(flights2, join_by(faa == dest))
|
||||
#> # A tibble: 101 × 8
|
||||
#> faa name lat lon alt tz dst tzone
|
||||
|
@ -410,7 +410,7 @@ Filtering joins</h2>
|
|||
</div>
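<p>To see what a semi-join does on something small enough to check by eye, here is a minimal sketch. The tibbles <code>airports_toy</code> and <code>flights_toy</code> are made up for illustration and are not part of nycflights13. A semi-join keeps each matching row of <code>x</code> once, no matter how many times it matches in <code>y</code>, so for a single key it gives the same rows as filtering with <code>%in%</code>.</p>

```r
library(dplyr)

# Toy data invented for illustration.
airports_toy <- tibble(
  faa  = c("JFK", "LGA", "ORD"),
  name = c("John F Kennedy Intl", "La Guardia", "Chicago Ohare Intl")
)
flights_toy <- tibble(origin = c("JFK", "JFK", "LGA"))

# Keep only the airports that appear as an origin; JFK matches twice in
# flights_toy but is still returned once.
via_semi_join <- airports_toy |>
  semi_join(flights_toy, join_by(faa == origin))

# The same rows, selected with a plain filter.
via_filter <- airports_toy |>
  filter(faa %in% flights_toy$origin)
```

<p>Both results contain the JFK and LGA rows only. The <code>%in%</code> form stops being convenient once the key spans several columns, which is where <code>semi_join()</code> is the more natural tool.</p>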
|
||||
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don’t have a match in <code>y</code>. They’re useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don’t show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don’t have a matching destination airport:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
anti_join(airports, join_by(dest == faa)) |>
|
||||
distinct(dest)
|
||||
#> # A tibble: 4 × 1
|
||||
|
@ -423,7 +423,7 @@ Filtering joins</h2>
|
|||
</div>
|
||||
<p>Or we can find which <code>tailnum</code>s are missing from <code>planes</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
anti_join(planes, join_by(tailnum)) |>
|
||||
distinct(tailnum)
|
||||
#> # A tibble: 722 × 1
|
||||
|
@ -446,7 +446,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Imagine you’ve found the top 10 most popular destinations using this code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">top_dest <- flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">top_dest <- flights2 |>
|
||||
count(dest, sort = TRUE) |>
|
||||
head(10)</pre>
|
||||
</div>
|
||||
|
@ -459,7 +459,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Compute the average delay by destination, then join on the <code>airports</code> data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">airports |>
|
||||
<pre data-type="programlisting" data-code-language="r">airports |>
|
||||
semi_join(flights, join_by(faa == dest)) |>
|
||||
ggplot(aes(lon, lat)) +
|
||||
borders("state") +
|
||||
|
@ -477,7 +477,7 @@ Exercises</h2>
|
|||
How do joins work?</h1>
|
||||
<p>Now that you’ve used joins a few times it’s time to learn more about how they work, focusing on how each row in <code>x</code> matches rows in <code>y</code>. We’ll begin by using <a href="#fig-join-setup" data-type="xref">#fig-join-setup</a> to introduce a visual representation of the two simple tibbles defined below. In these examples we’ll use a single key called <code>key</code> and a single value column (<code>val_x</code> and <code>val_y</code>), but the ideas all generalize to multiple keys and multiple values.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">x <- tribble(
|
||||
~key, ~val_x,
|
||||
1, "x1",
|
||||
2, "x2",
|
||||
|
@ -583,7 +583,7 @@ Row matching</h2>
|
|||
<li>There might be the same number of rows if some rows in <code>x</code> don’t match any rows in <code>y</code>, and exactly the same number of rows in <code>x</code> each match two rows in <code>y</code>!</li>
|
||||
</ul><p>Row expansion is a fundamental property of joins, but it’s dangerous because it might happen without you realizing it. To avoid this problem, dplyr will warn whenever there are multiple matches:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
|
||||
<pre data-type="programlisting" data-code-language="r">df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
|
||||
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
|
||||
|
||||
df1 |>
|
||||
|
@ -613,7 +613,7 @@ df1 |>
|
|||
One-to-one mapping</h2>
|
||||
<p>Both <code>unmatched</code> and <code>multiple</code> can take the value <code>"error"</code>, which means that the join will fail unless each row in <code>x</code> matches exactly one row in <code>y</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- tibble(x = 1)
|
||||
<pre data-type="programlisting" data-code-language="r">df1 <- tibble(x = 1)
|
||||
df2 <- tibble(x = c(1, 1))
|
||||
df3 <- tibble(x = 3)
|
||||
|
||||
|
@ -636,12 +636,12 @@ df1 |>
|
|||
Allow multiple rows</h2>
|
||||
<p>Sometimes it’s useful to deliberately expand the number of rows in the output. This can come about naturally if you “flip” the direction of the question you’re asking. For example, as we’ve seen above, it’s natural to supplement the <code>flights</code> data with information about the plane that flew each flight:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(planes, by = "tailnum")</pre>
|
||||
</div>
|
||||
<p>But it’s also reasonable to ask which flights each plane flew:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">plane_flights <- planes |>
|
||||
<pre data-type="programlisting" data-code-language="r">plane_flights <- planes |>
|
||||
select(tailnum, type, engines, seats) |>
|
||||
left_join(flights2, by = "tailnum")
|
||||
#> Warning in left_join(select(planes, tailnum, type, engines, seats), flights2, : Each row in `x` is expected to match at most 1 row in `y`.
|
||||
|
@ -651,7 +651,7 @@ Allow multiple rows</h2>
|
|||
</div>
|
||||
<p>Since this duplicates rows in <code>x</code> (the planes), we need to explicitly say that we’re ok with the multiple matches by setting <code>multiple = "all"</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">plane_flights <- planes |>
|
||||
<pre data-type="programlisting" data-code-language="r">plane_flights <- planes |>
|
||||
select(tailnum, type, engines, seats) |>
|
||||
left_join(flights2, by = "tailnum", multiple = "all")
|
||||
|
||||
|
@ -698,7 +698,7 @@ Non-equi joins</h1>
|
|||
<p>So far you’ve only seen equi-joins, joins where the rows match if the <code>x</code> key equals the <code>y</code> key. Now we’re going to relax that restriction and discuss other ways of determining if a pair of rows match.</p>
|
||||
<p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> and <code>y</code> keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x |> left_join(y, by = "key", keep = TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">x |> left_join(y, by = "key", keep = TRUE)
|
||||
#> # A tibble: 3 × 4
|
||||
#> key.x val_x key.y val_y
|
||||
#> <dbl> <chr> <dbl> <chr>
|
||||
|
@ -748,7 +748,7 @@ Cross joins</h2>
|
|||
</div>
|
||||
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
|
||||
df |> left_join(df, join_by())
|
||||
#> # A tibble: 16 × 2
|
||||
#> name.x name.y
|
||||
|
@ -777,7 +777,7 @@ Inequality joins</h2>
|
|||
</div>
|
||||
<p>Inequality joins are extremely general, so general that it’s hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
|
||||
|
||||
df |> left_join(df, join_by(id < id))
|
||||
#> # A tibble: 7 × 4
|
||||
|
@ -808,14 +808,14 @@ Rolling joins</h2>
|
|||
<p>Rolling joins are particularly useful when you have two tables of dates that don’t perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.</p>
|
||||
<p>For example, imagine that you’re in charge of the party planning commission for your office. Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">parties <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">parties <- tibble(
|
||||
q = 1:4,
|
||||
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
|
||||
)</pre>
|
||||
</div>
|
||||
<p>Now imagine that you have a table of employee birthdays:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">employees <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">employees <- tibble(
|
||||
name = wakefield::name(100),
|
||||
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
|
||||
)
|
||||
|
@ -833,7 +833,7 @@ employees
|
|||
</div>
|
||||
<p>And for each employee we want to find the last party date that comes before (or on) their birthday. We can express that with a rolling join:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">employees |>
|
||||
<pre data-type="programlisting" data-code-language="r">employees |>
|
||||
left_join(parties, join_by(closest(birthday >= party)))
|
||||
#> # A tibble: 100 × 4
|
||||
#> name birthday q party
|
||||
|
@ -848,7 +848,7 @@ employees
|
|||
</div>
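<p>A small, deterministic version of the same join makes it easier to check which party each birthday is paired with. This is only a sketch: the three <code>people</code> are made up, and the dates are built with base R’s <code>as.Date()</code> rather than lubridate so the example has fewer dependencies. With <code>closest(birthday &gt;= party)</code>, each row is matched to the latest party that falls on or before the birthday.</p>

```r
library(dplyr)

parties <- tibble(
  q = 1:4,
  party = as.Date(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)

# Made-up people for illustration; the first birthday falls exactly on a
# party date.
people <- tibble(
  name = c("Ben", "Cy", "Dee"),
  birthday = as.Date(c("2022-04-04", "2022-08-20", "2022-12-25"))
)

# Many parties can satisfy birthday >= party; closest() keeps only the
# one nearest to the birthday.
matched <- people |>
  left_join(parties, join_by(closest(birthday >= party)))
```

<p>In <code>matched</code>, Ben is paired with the April 4 party (<code>q</code> = 2), Cy with July 11 (<code>q</code> = 3), and Dee with October 3 (<code>q</code> = 4).</p>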
|
||||
<p>There is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">employees |>
|
||||
<pre data-type="programlisting" data-code-language="r">employees |>
|
||||
anti_join(parties, join_by(closest(birthday >= party)))
|
||||
#> # A tibble: 4 × 2
|
||||
#> name birthday
|
||||
|
@ -873,7 +873,7 @@ Overlap joins</h2>
|
|||
<code>overlaps(x_lower, x_upper, y_lower, y_upper)</code> is short for <code>x_lower <= y_upper, x_upper >= y_lower</code>.</li>
|
||||
</ul><p>Let’s continue the birthday example to see how you might use them. There’s one problem with the strategy we used above: there’s no party preceding the birthdays Jan 1-9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">parties <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">parties <- tibble(
|
||||
q = 1:4,
|
||||
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
|
||||
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
|
||||
|
@ -890,7 +890,7 @@ parties
|
|||
</div>
|
||||
<p>Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don’t overlap. One way to do this is by using a self-join to check if any start-end interval overlaps with another:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">parties |>
|
||||
<pre data-type="programlisting" data-code-language="r">parties |>
|
||||
inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |>
|
||||
select(start.x, end.x, start.y, end.y)
|
||||
#> # A tibble: 1 × 4
|
||||
|
@ -900,7 +900,7 @@ parties
|
|||
</div>
|
||||
<p>Oops, there is an overlap, so let’s fix that problem and continue:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">parties <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">parties <- tibble(
|
||||
q = 1:4,
|
||||
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
|
||||
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
|
||||
|
@ -909,7 +909,7 @@ parties
|
|||
</div>
|
||||
<p>Now we can match each employee to their party. This is a good place to use <code>unmatched = "error"</code> because we want to quickly find out if any employees didn’t get assigned a party.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">employees |>
|
||||
<pre data-type="programlisting" data-code-language="r">employees |>
|
||||
inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
|
||||
#> # A tibble: 100 × 6
|
||||
#> name birthday q party start end
|
||||
|
@ -930,7 +930,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Can you explain what’s happening with the keys in this equi-join? Why are they different?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x |> full_join(y, by = "key")
|
||||
<pre data-type="programlisting" data-code-language="r">x |> full_join(y, by = "key")
|
||||
#> # A tibble: 4 × 3
|
||||
#> key val_x val_y
|
||||
#> <dbl> <chr> <chr>
|
||||
|
|
|
@ -11,18 +11,18 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(nycflights13)</pre>
|
||||
</div>
|
||||
<p>However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1, 2, 3, 5, 7, 11, 13)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1, 2, 3, 5, 7, 11, 13)
|
||||
x * 2
|
||||
#> [1] 2 4 6 10 14 22 26</pre>
|
||||
</div>
|
||||
<p>This makes it easier to explain individual functions at the cost of making it harder to see how they might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and friends.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x)
|
||||
df |>
|
||||
mutate(y = x * 2)
|
||||
#> # A tibble: 7 × 2
|
||||
|
@ -44,7 +44,7 @@ df |>
|
|||
Comparisons</h1>
|
||||
<p>A very common way to create a logical vector is via a numeric comparison with <code><</code>, <code><=</code>, <code>></code>, <code>>=</code>, <code>!=</code>, and <code>==</code>. So far, we’ve mostly created logical variables transiently within <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> — they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that leave roughly on time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
|
||||
#> # A tibble: 172,286 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -62,7 +62,7 @@ Comparisons</h1>
|
|||
</div>
|
||||
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
daytime = dep_time > 600 & dep_time < 2000,
|
||||
approx_ontime = abs(arr_delay) < 20,
|
||||
|
@ -82,7 +82,7 @@ Comparisons</h1>
|
|||
<p>This is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.</p>
|
||||
<p>All up, the initial filter is equivalent to:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
daytime = dep_time > 600 & dep_time < 2000,
|
||||
approx_ontime = abs(arr_delay) < 20,
|
||||
|
@ -95,24 +95,24 @@ Comparisons</h1>
|
|||
Floating point comparison</h2>
|
||||
<p>Beware of using <code>==</code> with numbers. For example, it looks like this vector contains the numbers 1 and 2:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1 / 49 * 49, sqrt(2) ^ 2)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1 / 49 * 49, sqrt(2) ^ 2)
|
||||
x
|
||||
#> [1] 1 2</pre>
|
||||
</div>
|
||||
<p>But if you test them for equality, you get <code>FALSE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x == c(1, 2)
|
||||
<pre data-type="programlisting" data-code-language="r">x == c(1, 2)
|
||||
#> [1] FALSE FALSE</pre>
|
||||
</div>
|
||||
<p>What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code> and subsequent computations will be very slightly off. We can see the exact values by calling <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> with the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">print(x, digits = 16)
|
||||
<pre data-type="programlisting" data-code-language="r">print(x, digits = 16)
|
||||
#> [1] 0.9999999999999999 2.0000000000000004</pre>
|
||||
</div>
|
||||
<p>You can see why R defaults to rounding these numbers; they really are very close to what you expect.</p>
|
||||
<p>Now that you’ve seen why <code>==</code> is failing, what can you do about it? One option is to use <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> which ignores small differences:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">near(x, c(1, 2))
|
||||
<pre data-type="programlisting" data-code-language="r">near(x, c(1, 2))
|
||||
#> [1] TRUE TRUE</pre>
|
||||
</div>
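<p>To make “ignores small differences” concrete, the sketch below does the comparison by hand in base R. It assumes the tolerance documented as the default for <code>near()</code>, <code>.Machine$double.eps^0.5</code>; treat it as an illustration of the idea rather than a replacement for the function.</p>

```r
x <- c(1 / 49 * 49, sqrt(2) ^ 2)

# Tolerance assumed to match near()'s documented default.
tol <- .Machine$double.eps^0.5

# Exact comparison fails because of tiny rounding errors ...
exactly_equal <- x == c(1, 2)

# ... but the absolute differences are far below the tolerance.
close_enough <- abs(x - c(1, 2)) < tol
```

<p>Here <code>exactly_equal</code> is all <code>FALSE</code> while <code>close_enough</code> is all <code>TRUE</code>. The tolerance is absolute, so for very large or very small numbers a different <code>tol</code> may be more appropriate.</p>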
|
||||
</section>
|
||||
|
@ -122,19 +122,19 @@ x
|
|||
Missing values</h2>
|
||||
<p>Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">NA > 5
|
||||
<pre data-type="programlisting" data-code-language="r">NA > 5
|
||||
#> [1] NA
|
||||
10 == NA
|
||||
#> [1] NA</pre>
|
||||
</div>
|
||||
<p>The most confusing result is this one:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">NA == NA
|
||||
<pre data-type="programlisting" data-code-language="r">NA == NA
|
||||
#> [1] NA</pre>
|
||||
</div>
|
||||
<p>It’s easiest to understand why this is true if we artificially supply a little more context:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Let x be Mary's age. We don't know how old she is.
|
||||
<pre data-type="programlisting" data-code-language="r"># Let x be Mary's age. We don't know how old she is.
|
||||
x <- NA
|
||||
|
||||
# Let y be John's age. We don't know how old he is.
|
||||
|
@ -147,7 +147,7 @@ x == y
|
|||
</div>
|
||||
<p>So if you want to find all flights where <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield <code>NA</code> for every single row, and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> automatically drops missing values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_time == NA)
|
||||
#> # A tibble: 0 × 19
|
||||
#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
|
||||
|
@ -165,7 +165,7 @@ x == y
|
|||
</h2>
|
||||
<p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">is.na(c(TRUE, NA, FALSE))
|
||||
<pre data-type="programlisting" data-code-language="r">is.na(c(TRUE, NA, FALSE))
|
||||
#> [1] FALSE TRUE FALSE
|
||||
is.na(c(1, NA, 3))
|
||||
#> [1] FALSE TRUE FALSE
|
||||
|
@ -174,7 +174,7 @@ is.na(c("a", NA, "b"))
|
|||
</div>
|
||||
<p>We can use <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> to find all the rows with a missing <code>dep_time</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(is.na(dep_time))
|
||||
#> # A tibble: 8,255 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -192,7 +192,7 @@ is.na(c("a", NA, "b"))
|
|||
</div>
|
||||
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 1, day == 1) |>
|
||||
arrange(dep_time)
|
||||
#> # A tibble: 842 × 19
|
||||
|
@ -256,7 +256,7 @@ Boolean algebra</h1>
|
|||
Missing values</h2>
|
||||
<p>The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = c(TRUE, FALSE, NA))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = c(TRUE, FALSE, NA))
|
||||
|
||||
df |>
|
||||
mutate(
|
||||
|
@ -278,12 +278,12 @@ df |>
|
|||
Order of operations</h2>
|
||||
<p>Note that the order of operations doesn’t work like English. Take the following code, which finds all flights that departed in November or December:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 11 | month == 12)</pre>
|
||||
</div>
|
||||
<p>You might be tempted to write it like you’d say in English: “find all flights that departed in November or December”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 11 | 12)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -301,7 +301,7 @@ Order of operations</h2>
|
|||
</div>
|
||||
<p>This code doesn’t error, but it also doesn’t seem to have worked. What’s going on? Here R first evaluates <code>month == 11</code>, creating a logical vector, which we call <code>nov</code>. It then computes <code>nov | 12</code>. When you use a number with a logical operator, it converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code>, which will always be <code>TRUE</code>. As a result, every row will be selected:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
nov = month == 11,
|
||||
final = nov | 12,
|
||||
|
@ -326,26 +326,26 @@ Order of operations</h2>
|
|||
</h2>
|
||||
<p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">1:12 %in% c(1, 5, 11)
|
||||
<pre data-type="programlisting" data-code-language="r">1:12 %in% c(1, 5, 11)
|
||||
#> [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
|
||||
letters[1:10] %in% c("a", "e", "i", "o", "u")
|
||||
#> [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE</pre>
|
||||
</div>
|
||||
<p>So to find all flights in November and December we could write:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month %in% c(11, 12))</pre>
|
||||
</div>
|
||||
<p>Note that <code>%in%</code> obeys different rules for <code>NA</code> to <code>==</code>, as <code>NA %in% NA</code> is <code>TRUE</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">c(1, 2, NA) == NA
|
||||
<pre data-type="programlisting" data-code-language="r">c(1, 2, NA) == NA
|
||||
#> [1] NA NA NA
|
||||
c(1, 2, NA) %in% NA
|
||||
#> [1] FALSE FALSE TRUE</pre>
|
||||
</div>
|
||||
<p>This can make for a useful shortcut:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_time %in% c(NA, 0800))
|
||||
#> # A tibble: 8,803 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -383,7 +383,7 @@ Logical summaries</h2>
|
|||
<p>There are two main logical summaries: <code><a href="https://rdrr.io/r/base/any.html">any()</a></code> and <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>. <code>any(x)</code> is the equivalent of <code>|</code>; it’ll return <code>TRUE</code> if there are any <code>TRUE</code>s in <code>x</code>. <code>all(x)</code> is the equivalent of <code>&</code>; it’ll return <code>TRUE</code> only if all values of <code>x</code> are <code>TRUE</code>. Like all summary functions, they can return <code>NA</code> if there are any missing values present, and as usual you can make the missing values go away with <code>na.rm = TRUE</code>.</p>
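<p>As a minimal sketch of these rules, the cell below applies <code>any()</code> and <code>all()</code> to small toy vectors. The vectors are made up for illustration and are not taken from <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Toy logical vectors (hypothetical values)
any(c(FALSE, TRUE, FALSE))
#> [1] TRUE
all(c(TRUE, TRUE, FALSE))
#> [1] FALSE

# A missing value can make the answer unknown ...
any(c(FALSE, NA))
#> [1] NA
all(c(TRUE, NA))
#> [1] NA

# ... unless you drop it first with na.rm = TRUE
any(c(FALSE, NA), na.rm = TRUE)
#> [1] FALSE
all(c(TRUE, NA), na.rm = TRUE)
#> [1] TRUE</pre>
</div>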
|
||||
<p>For example, we could use <code><a href="https://rdrr.io/r/base/all.html">all()</a></code> to find out if there were days where every flight was delayed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
all_delayed = all(arr_delay >= 0, na.rm = TRUE),
|
||||
|
@ -409,7 +409,7 @@ Logical summaries</h2>
|
|||
Numeric summaries of logical vectors</h2>
|
||||
<p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
|
||||
|
@ -426,7 +426,7 @@ Numeric summaries of logical vectors</h2>
|
|||
</div>
|
||||
<p>Or we could ask how many flights left before 5am, which are often flights that were delayed from the previous day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
n_early = sum(dep_time < 500, na.rm = TRUE),
|
||||
|
@ -452,7 +452,7 @@ Logical subsetting</h2>
|
|||
<p>There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base <code>[</code> (pronounced subset) operator, which you’ll learn more about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>.</p>
|
||||
<p>Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(arr_delay > 0) |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
|
@ -474,7 +474,7 @@ Logical subsetting</h2>
|
|||
<p>This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together<span data-type="footnote">We’ll cover this in <a href="#chp-joins" data-type="xref">#chp-joins</a>.</span>. Instead you could use <code>[</code> to perform an inline filtering: <code>arr_delay[arr_delay > 0]</code> will yield only the positive arrival delays.</p>
|
||||
<p>This leads to:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
|
||||
|
@ -516,30 +516,30 @@ Conditional transformations</h1>
|
|||
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector; the second, <code>true</code>, gives the output when the condition is true; and the third, <code>false</code>, gives the output if the condition is false.</p>
|
||||
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(-3:3, NA)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(-3:3, NA)
|
||||
if_else(x > 0, "+ve", "-ve")
|
||||
#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA</pre>
|
||||
</div>
|
||||
<p>There’s an optional fourth argument, <code>missing</code>, which will be used if the input is <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">if_else(x > 0, "+ve", "-ve", "???")
|
||||
<pre data-type="programlisting" data-code-language="r">if_else(x > 0, "+ve", "-ve", "???")
|
||||
#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"</pre>
|
||||
</div>
|
||||
<p>You can also use vectors for the <code>true</code> and <code>false</code> arguments. For example, this allows us to create a minimal implementation of <code><a href="https://rdrr.io/r/base/MathFun.html">abs()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">if_else(x < 0, -x, x)
|
||||
<pre data-type="programlisting" data-code-language="r">if_else(x < 0, -x, x)
|
||||
#> [1] 3 2 1 0 1 2 3 NA</pre>
|
||||
</div>
|
||||
<p>So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> like this:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x1 <- c(NA, 1, 2, NA)
|
||||
<pre data-type="programlisting" data-code-language="r">x1 <- c(NA, 1, 2, NA)
|
||||
y1 <- c(3, NA, 4, 6)
|
||||
if_else(is.na(x1), y1, x1)
|
||||
#> [1] 3 1 2 6</pre>
|
||||
</div>
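<p>For comparison, here is a small sketch that calls dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> directly on the same toy values. It assumes dplyr is installed and repeats the definitions of <code>x1</code> and <code>y1</code> so that it runs on its own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Same toy vectors as in the if_else() version
x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)

# coalesce() keeps the first non-missing value at each position
coalesce(x1, y1)
#> [1] 3 1 2 6</pre>
</div>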
|
||||
<p>You might have noticed a small infelicity in our labeling: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
|
||||
<pre data-type="programlisting" data-code-language="r">if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
|
||||
#> [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
|
||||
</div>
|
||||
<p>This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">dplyr::case_when()</a></code>.</p>
|
||||
|
@ -552,7 +552,7 @@ if_else(is.na(x1), y1, x1)
|
|||
<p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
|
||||
<p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">case_when(
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
x == 0 ~ "0",
|
||||
x < 0 ~ "-ve",
|
||||
x > 0 ~ "+ve",
|
||||
|
@ -563,7 +563,7 @@ if_else(is.na(x1), y1, x1)
|
|||
<p>This is more code, but it’s also more explicit.</p>
|
||||
<p>To explain how <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> works, let’s explore some simpler cases. If none of the cases match, the output gets an <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">case_when(
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
x < 0 ~ "-ve",
|
||||
x > 0 ~ "+ve"
|
||||
)
|
||||
|
@ -571,7 +571,7 @@ if_else(is.na(x1), y1, x1)
|
|||
</div>
|
||||
<p>If you want to create a “default”/catch-all value, use <code>TRUE</code> on the left-hand side:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">case_when(
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
x < 0 ~ "-ve",
|
||||
x > 0 ~ "+ve",
|
||||
TRUE ~ "???"
|
||||
|
@ -580,7 +580,7 @@ if_else(is.na(x1), y1, x1)
|
|||
</div>
|
||||
<p>And note that if multiple conditions match, only the first will be used:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">case_when(
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
x > 0 ~ "+ve",
|
||||
x > 3 ~ "big"
|
||||
)
|
||||
|
@ -588,7 +588,7 @@ if_else(is.na(x1), y1, x1)
|
|||
</div>
|
||||
<p>Just like with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> you can use variables on both sides of the <code>~</code> and you can mix and match variables as needed for your problem. For example, we could use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> to provide some human readable labels for the arrival delay:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
status = case_when(
|
||||
is.na(arr_delay) ~ "cancelled",
|
||||
|
|
|
@ -11,7 +11,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -26,7 +26,7 @@ Explicit missing values</h1>
|
|||
Last observation carried forward</h2>
|
||||
<p>A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">treatment <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">treatment <- tribble(
|
||||
~person, ~treatment, ~response,
|
||||
"Derrick Whitmore", 1, 7,
|
||||
NA, 2, 10,
|
||||
|
@ -36,7 +36,7 @@ Last observation carried forward</h2>
|
|||
</div>
|
||||
<p>You can fill in these missing values with <code><a href="https://tidyr.tidyverse.org/reference/fill.html">tidyr::fill()</a></code>. It works like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, taking a set of columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">treatment |>
|
||||
<pre data-type="programlisting" data-code-language="r">treatment |>
|
||||
fill(everything())
|
||||
#> # A tibble: 4 × 3
|
||||
#> person treatment response
|
||||
|
@ -54,14 +54,14 @@ Last observation carried forward</h2>
|
|||
Fixed values</h2>
|
||||
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">dplyr::coalesce()</a></code> to replace them:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1, 4, 5, 7, NA)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1, 4, 5, 7, NA)
|
||||
coalesce(x, 0)
|
||||
#> [1] 1 4 5 7 0</pre>
|
||||
</div>
|
||||
<p>Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
|
||||
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">readr::read_csv()</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="https://dplyr.tidyverse.org/reference/na_if.html">dplyr::na_if()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1, 4, 5, 7, -99)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1, 4, 5, 7, -99)
|
||||
na_if(x, -99)
|
||||
#> [1] 1 4 5 7 NA</pre>
|
||||
</div>
|
||||
|
@ -72,7 +72,7 @@ na_if(x, -99)
|
|||
NaN</h2>
|
||||
<p>Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a <code>NaN</code> (pronounced “nan”), or <strong>n</strong>ot <strong>a</strong> <strong>n</strong>umber. It’s not that important to know about because it generally behaves just like <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(NA, NaN)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(NA, NaN)
|
||||
x * 10
|
||||
#> [1] NA NaN
|
||||
x == 1
|
||||
|
@ -83,7 +83,7 @@ is.na(x)
|
|||
<p>In the rare case you need to distinguish an <code>NA</code> from a <code>NaN</code>, you can use <code>is.nan(x)</code>.</p>
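<p>A short sketch of the difference, using a made-up vector: <code>is.na()</code> treats both kinds of value as missing, whereas <code>is.nan()</code> flags only the <code>NaN</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical vector mixing NA, NaN, and an ordinary number
x <- c(NA, NaN, 1)

# Both NA and NaN count as missing
is.na(x)
#> [1]  TRUE  TRUE FALSE

# Only NaN is "not a number"
is.nan(x)
#> [1] FALSE  TRUE FALSE</pre>
</div>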
|
||||
<p>You’ll generally encounter a <code>NaN</code> when you perform a mathematical operation that has an indeterminate result:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">0 / 0
|
||||
<pre data-type="programlisting" data-code-language="r">0 / 0
|
||||
#> [1] NaN
|
||||
0 * Inf
|
||||
#> [1] NaN
|
||||
|
@ -101,7 +101,7 @@ sqrt(-1)
|
|||
Implicit missing values</h1>
|
||||
<p>So far we’ve talked about missing values that are <strong>explicitly</strong> missing, i.e. you can see an <code>NA</code> in your data. But missing values can also be <strong>implicitly</strong> missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">stocks <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">stocks <- tibble(
|
||||
year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
|
||||
qtr = c( 1, 2, 3, 4, 2, 3, 4),
|
||||
price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
|
||||
|
@ -122,7 +122,7 @@ Implicit missing values</h1>
|
|||
Pivoting</h2>
|
||||
<p>You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot <code>stocks</code> to put the quarter (<code>qtr</code>) in the columns, both missing values become explicit:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">stocks |>
|
||||
<pre data-type="programlisting" data-code-language="r">stocks |>
|
||||
pivot_wider(
|
||||
names_from = qtr,
|
||||
values_from = price
|
||||
|
@ -141,7 +141,7 @@ Pivoting</h2>
|
|||
Complete</h2>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/complete.html">tidyr::complete()</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">stocks |>
|
||||
<pre data-type="programlisting" data-code-language="r">stocks |>
|
||||
complete(year, qtr)
|
||||
#> # A tibble: 8 × 3
|
||||
#> year qtr price
|
||||
|
@ -156,7 +156,7 @@ Complete</h2>
|
|||
</div>
|
||||
<p>Typically, you’ll call <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">stocks |>
|
||||
<pre data-type="programlisting" data-code-language="r">stocks |>
|
||||
complete(year = 2019:2021, qtr)
|
||||
#> # A tibble: 12 × 3
|
||||
#> year qtr price
|
||||
|
@ -179,7 +179,7 @@ Joins</h2>
|
|||
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
|
||||
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
|
||||
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
|
||||
|
||||
flights |>
|
||||
distinct(faa = dest) |>
|
||||
|
@ -222,7 +222,7 @@ Exercises</h2>
|
|||
Factors and empty groups</h1>
|
||||
<p>A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">health <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">health <- tibble(
|
||||
name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
|
||||
smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
|
||||
age = c(34L, 88L, 75L, 47L, 56L),
|
||||
|
@ -230,7 +230,7 @@ Factors and empty groups</h1>
|
|||
</div>
|
||||
<p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">health |> count(smoker)
|
||||
<pre data-type="programlisting" data-code-language="r">health |> count(smoker)
|
||||
#> # A tibble: 1 × 2
|
||||
#> smoker n
|
||||
#> <fct> <int>
|
||||
|
@ -238,7 +238,7 @@ Factors and empty groups</h1>
|
|||
</div>
|
||||
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">health |> count(smoker, .drop = FALSE)
|
||||
<pre data-type="programlisting" data-code-language="r">health |> count(smoker, .drop = FALSE)
|
||||
#> # A tibble: 2 × 2
|
||||
#> smoker n
|
||||
#> <fct> <int>
|
||||
|
@ -247,7 +247,7 @@ Factors and empty groups</h1>
|
|||
</div>
|
||||
<p>The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying <code>drop = FALSE</code> to the appropriate discrete axis:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(health, aes(smoker)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(health, aes(smoker)) +
|
||||
geom_bar() +
|
||||
scale_x_discrete()
|
||||
|
||||
|
@ -267,7 +267,7 @@ ggplot(health, aes(smoker)) +
|
|||
</div>
|
||||
<p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">health |>
|
||||
<pre data-type="programlisting" data-code-language="r">health |>
|
||||
group_by(smoker, .drop = FALSE) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
|
@ -291,7 +291,7 @@ ggplot(health, aes(smoker)) +
|
|||
</div>
|
||||
<p>We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># A vector containing two missing values
|
||||
<pre data-type="programlisting" data-code-language="r"># A vector containing two missing values
|
||||
x1 <- c(NA, NA)
|
||||
length(x1)
|
||||
#> [1] 2
|
||||
|
@ -304,7 +304,7 @@ length(x2)
|
|||
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf for empty vectors so if you combine the results with a non-empty vector of new data and recompute you’ll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
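<p>The following sketch illustrates these results with base R only. The names <code>empty</code> and <code>new_data</code> are hypothetical, and because <code>min()</code> and <code>max()</code> warn when given an empty vector, the warnings are silenced here:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">empty <- numeric(0)      # a zero-length vector
new_data <- c(3, 1, 2)   # hypothetical new observations

# sum(empty) / length(empty) is 0 / 0
mean(empty)
#> [1] NaN

# min() and max() of an empty vector (warnings suppressed)
empty_min <- suppressWarnings(min(empty))
empty_max <- suppressWarnings(max(empty))
empty_min
#> [1] Inf
empty_max
#> [1] -Inf

# Combining with new data gives the summary of the new data
min(empty_min, min(new_data))
#> [1] 1
max(empty_max, max(new_data))
#> [1] 3</pre>
</div>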
|
||||
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">health |>
|
||||
<pre data-type="programlisting" data-code-language="r">health |>
|
||||
group_by(smoker) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
|
|
|
@ -11,7 +11,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(nycflights13)</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -23,13 +23,13 @@ Making numbers</h1>
|
|||
<p>In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or something has gone wrong in your data import process.</p>
|
||||
<p>readr provides two useful functions for parsing strings into numbers: <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code>. Use <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> when you have numbers that have been written as strings:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("1.2", "5.6", "1e3")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("1.2", "5.6", "1e3")
|
||||
parse_double(x)
|
||||
#> [1] 1.2 5.6 1000.0</pre>
|
||||
</div>
|
||||
<p>Use <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("$1,234", "USD 3,513", "59%")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("$1,234", "USD 3,513", "59%")
|
||||
parse_number(x)
|
||||
#> [1] 1234 3513 59</pre>
|
||||
</div>
|
||||
|
@ -40,7 +40,7 @@ parse_number(x)
|
|||
Counts</h1>
|
||||
<p>It’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. This function is great for quick exploration and checks during analysis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> count(dest)
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> count(dest)
|
||||
#> # A tibble: 105 × 2
|
||||
#> dest n
|
||||
#> <chr> <int>
|
||||
|
@ -55,7 +55,7 @@ Counts</h1>
|
|||
<p>(Despite the advice in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, we usually put <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)</p>
|
||||
<p>If you want to see the most common values add <code>sort = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> count(dest, sort = TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> count(dest, sort = TRUE)
|
||||
#> # A tibble: 105 × 2
|
||||
#> dest n
|
||||
#> <chr> <int>
|
||||
|
@ -70,7 +70,7 @@ Counts</h1>
|
|||
<p>And remember that if you want to see all the values, you can use <code>|> View()</code> or <code>|> print(n = Inf)</code>.</p>
|
||||
<p>You can perform the same computation “by hand” with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>. This is useful because it allows you to compute other summaries at the same time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
|
@ -89,7 +89,7 @@ Counts</h1>
|
|||
</div>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">n()
|
||||
<pre data-type="programlisting" data-code-language="r">n()
|
||||
#> Error in `n()`:
|
||||
#> ! Must only be used inside data-masking verbs like `mutate()`,
|
||||
#> `filter()`, and `group_by()`.</pre>
|
||||
|
@ -98,7 +98,7 @@ Counts</h1>
|
|||
<ul><li>
|
||||
<p><code>n_distinct(x)</code> counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(
|
||||
carriers = n_distinct(carrier)
|
||||
|
@ -119,7 +119,7 @@ Counts</h1>
|
|||
<li>
|
||||
<p>A weighted count is a sum. For example you could “count” the number of miles each plane flew:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(tailnum) |>
|
||||
summarise(miles = sum(distance))
|
||||
#> # A tibble: 4,044 × 2
|
||||
|
@ -135,7 +135,7 @@ Counts</h1>
|
|||
</div>
|
||||
<p>Weighted counts are a common problem so <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> has a <code>wt</code> argument that does the same thing:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> count(tailnum, wt = distance)
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> count(tailnum, wt = distance)
|
||||
#> # A tibble: 4,044 × 2
|
||||
#> tailnum n
|
||||
#> <chr> <dbl>
|
||||
|
@ -151,7 +151,7 @@ Counts</h1>
|
|||
<li>
|
||||
<p>You can count missing values by combining <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>. In the <code>flights</code> dataset this represents flights that are cancelled:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(n_cancelled = sum(is.na(dep_time)))
|
||||
#> # A tibble: 105 × 2
|
||||
|
@ -189,7 +189,7 @@ Arithmetic and recycling rules</h2>
|
|||
<p>We introduced the basics of arithmetic (<code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>^</code>) in <a href="#chp-workflow-basics" data-type="xref">#chp-workflow-basics</a> and have used them a bunch since. These functions don’t need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the <strong>recycling rules</strong> which determine what happens when the left and right hand sides have different lengths. This is important for operations like <code>flights |> mutate(air_time = air_time / 60)</code> because there are 336,776 numbers on the left of <code>/</code> but only one on the right.</p>
|
||||
<p>R handles mismatched lengths by <strong>recycling,</strong> or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1, 2, 10, 20)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1, 2, 10, 20)
|
||||
x / 5
|
||||
#> [1] 0.2 0.4 2.0 4.0
|
||||
# is shorthand for
|
||||
|
@ -198,7 +198,7 @@ x / c(5, 5, 5, 5)
|
|||
</div>
|
||||
<p>Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn’t a multiple of the shorter:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x * c(1, 2)
|
||||
<pre data-type="programlisting" data-code-language="r">x * c(1, 2)
|
||||
#> [1] 1 4 10 40
|
||||
x * c(1, 2, 3)
|
||||
#> Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
|
||||
|
@ -207,7 +207,7 @@ x * c(1, 2, 3)
|
|||
</div>
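<p>As a minimal sketch of what recycling does, the base R function <code>rep_len()</code> can make the repetition explicit. The names <code>recycled</code>, <code>implicit</code>, and <code>explicit</code> are illustrative only and do not appear elsewhere in the chapter:</p>

```r
x <- c(1, 2, 10, 20)

# Recycling repeats the shorter vector until it is as long as the longer one.
# rep_len() spells out that repetition.
recycled <- rep_len(c(1, 2), length.out = length(x))
recycled
#> [1] 1 2 1 2

# The implicit (recycled) and explicit versions give the same answer.
implicit <- x * c(1, 2)
explicit <- x * recycled
identical(implicit, explicit)
#> [1] TRUE
```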
|
||||
<p>These recycling rules are also applied to logical comparisons (<code>==</code>, <code><</code>, <code><=</code>, <code>></code>, <code>>=</code>, <code>!=</code>) and can lead to a surprising result if you accidentally use <code>==</code> instead of <code>%in%</code> and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == c(1, 2))
|
||||
#> # A tibble: 25,977 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
|
@ -232,7 +232,7 @@ x * c(1, 2, 3)
|
|||
Minimum and maximum</h2>
|
||||
<p>The arithmetic functions work with pairs of variables. Two closely related functions are <code><a href="https://rdrr.io/r/base/Extremes.html">pmin()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">pmax()</a></code>, which when given two or more variables will return the smallest or largest value in each row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
~x, ~y,
|
||||
1, 3,
|
||||
5, 2,
|
||||
|
@ -253,7 +253,7 @@ df |>
|
|||
</div>
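<p>The same functions also work on plain vectors outside a data frame. The sketch below uses three made-up vectors (<code>a</code>, <code>b</code>, and <code>c3</code> are hypothetical names) to show that the result has one value per position, and that <code>na.rm = TRUE</code> skips missing values position by position:</p>

```r
a  <- c(1, 5, 7)
b  <- c(3, 2, NA)
c3 <- c(2, 9, 4)

# One result per position ("row"), ignoring the NA in b.
row_min <- pmin(a, b, c3, na.rm = TRUE)
row_max <- pmax(a, b, c3, na.rm = TRUE)
row_min
#> [1] 1 2 4
row_max
#> [1] 3 9 7

# Without na.rm = TRUE, a missing input makes that position missing.
row_min_na <- pmin(a, b, c3)
row_min_na
#> [1]  1  2 NA
```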
|
||||
<p>Note that these are different to the summary functions <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
mutate(
|
||||
min = min(x, y, na.rm = TRUE),
|
||||
max = max(x, y, na.rm = TRUE)
|
||||
|
@ -272,14 +272,14 @@ df |>
|
|||
Modular arithmetic</h2>
|
||||
<p>Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder. In R, <code>%/%</code> does integer division and <code>%%</code> computes the remainder:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">1:10 %/% 3
|
||||
<pre data-type="programlisting" data-code-language="r">1:10 %/% 3
|
||||
#> [1] 0 0 1 1 1 2 2 2 3 3
|
||||
1:10 %% 3
|
||||
#> [1] 1 2 0 1 2 0 1 2 0 1</pre>
|
||||
</div>
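<p>One way to check your understanding of the two operators is to rebuild the original values from the quotient and the remainder. This is a small sketch in base R; the variable names are illustrative:</p>

```r
x <- 1:10
quotient  <- x %/% 3
remainder <- x %% 3

# Integer division and remainder together recover the original values.
rebuilt <- quotient * 3 + remainder
all(rebuilt == x)
#> [1] TRUE

# For negative numbers, R rounds the quotient down (towards -Inf),
# so the remainder takes the sign of the divisor.
neg_quotient  <- (-7) %/% 3
neg_remainder <- (-7) %% 3
neg_quotient
#> [1] -3
neg_remainder
#> [1] 2
```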
|
||||
<p>Modular arithmetic is handy for the flights dataset, because we can use it to unpack the <code>sched_dep_time</code> variable into <code>hour</code> and <code>minute</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
hour = sched_dep_time %/% 100,
|
||||
minute = sched_dep_time %% 100,
|
||||
|
@ -298,7 +298,7 @@ Modular arithmetic</h2>
|
|||
</div>
|
||||
<p>We can combine that with the <code>mean(is.na(x))</code> trick from <a href="#sec-logical-summaries" data-type="xref">#sec-logical-summaries</a> to see how the proportion of cancelled flights varies over the course of the day. The results are shown in <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(hour = sched_dep_time %/% 100) |>
|
||||
summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
|
||||
filter(hour > 1) |>
|
||||
|
@ -319,7 +319,7 @@ Modular arithmetic</h2>
|
|||
Logarithms</h2>
|
||||
<p>Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert exponential growth to linear growth. For example, take compounding interest — the amount of money you have at <code>year + 1</code> is the amount of money you had at <code>year</code> multiplied by the interest rate. That gives a formula like <code>money = starting * interest ^ year</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">starting <- 100
|
||||
<pre data-type="programlisting" data-code-language="r">starting <- 100
|
||||
interest <- 1.05
|
||||
|
||||
money <- tibble(
|
||||
|
@ -329,7 +329,7 @@ money <- tibble(
|
|||
</div>
|
||||
<p>If you plot this data, you’ll get an exponential curve:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(money, aes(year, money)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(year, money)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="numbers_files/figure-html/unnamed-chunk-22-1.png" width="576"/></p>
|
||||
|
@ -337,7 +337,7 @@ money <- tibble(
|
|||
</div>
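<p>You can also see the effect of the logarithm numerically. Taking logs of <code>money = starting * interest ^ year</code> gives <code>log(money) = log(starting) + year * log(interest)</code>, which is linear in <code>year</code>. The sketch below assumes a 50-year range, which may differ from the range used in the chapter's own <code>money</code> tibble:</p>

```r
starting <- 100
interest <- 1.05
year <- 1:50  # assumed range, for illustration only
money <- starting * interest^year

# On the log scale every year adds the same amount: log(interest).
steps <- diff(log(money))
is_linear <- isTRUE(all.equal(steps, rep(log(interest), length(steps))))
is_linear
#> [1] TRUE
```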
|
||||
<p>Log transforming the y-axis gives a straight line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(money, aes(year, money)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(year, money)) +
|
||||
geom_line() +
|
||||
scale_y_log10()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -354,12 +354,12 @@ money <- tibble(
|
|||
Rounding</h2>
|
||||
<p>Use <code>round(x)</code> to round a number to the nearest integer:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">round(123.456)
|
||||
<pre data-type="programlisting" data-code-language="r">round(123.456)
|
||||
#> [1] 123</pre>
|
||||
</div>
|
||||
<p>You can control the precision of the rounding with the second argument, <code>digits</code>. <code>round(x, digits)</code> rounds to the nearest <code>10^-digits</code> so <code>digits = 2</code> will round to the nearest 0.01. This definition is useful because it implies <code>round(x, -3)</code> will round to the nearest thousand, which indeed it does:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">round(123.456, 2) # two digits
|
||||
<pre data-type="programlisting" data-code-language="r">round(123.456, 2) # two digits
|
||||
#> [1] 123.46
|
||||
round(123.456, 1) # one digit
|
||||
#> [1] 123.5
|
||||
|
@ -370,13 +370,13 @@ round(123.456, -2) # round to nearest hundred
|
|||
</div>
|
||||
<p>There’s one weirdness with <code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> that seems surprising at first glance:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">round(c(1.5, 2.5))
|
||||
<pre data-type="programlisting" data-code-language="r">round(c(1.5, 2.5))
|
||||
#> [1] 2 2</pre>
|
||||
</div>
|
||||
<p><code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> uses what’s known as “round half to even” or Banker’s rounding: if a number is half way between two integers, it will be rounded to the <strong>even</strong> integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.</p>
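<p>A short sketch makes the lack of bias concrete. It compares <code>round()</code> with a hand-written "always round halves up" rule, <code>floor(x + 0.5)</code>, which is shown only for contrast:</p>

```r
halves <- c(0.5, 1.5, 2.5, 3.5)

# Round half to even: two values go down, two go up.
half_even <- round(halves)
half_even
#> [1] 0 2 2 4

# Always rounding halves up shifts every value upwards.
half_up <- floor(halves + 0.5)
half_up
#> [1] 1 2 3 4

# Only round half to even preserves the mean of the original values.
mean(halves)
#> [1] 2
mean(half_even)
#> [1] 2
mean(half_up)
#> [1] 2.5
```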
|
||||
<p><code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> is paired with <code><a href="https://rdrr.io/r/base/Round.html">floor()</a></code> which always rounds down and <code><a href="https://rdrr.io/r/base/Round.html">ceiling()</a></code> which always rounds up:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- 123.456
|
||||
<pre data-type="programlisting" data-code-language="r">x <- 123.456
|
||||
|
||||
floor(x)
|
||||
#> [1] 123
|
||||
|
@ -385,7 +385,7 @@ ceiling(x)
|
|||
</div>
|
||||
<p>These functions don’t have a digits argument, so you can instead scale down, round, and then scale back up:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Round down to nearest two digits
|
||||
<pre data-type="programlisting" data-code-language="r"># Round down to nearest two digits
|
||||
floor(x / 0.01) * 0.01
|
||||
#> [1] 123.45
|
||||
# Round up to nearest two digits
|
||||
|
@ -394,7 +394,7 @@ ceiling(x / 0.01) * 0.01
|
|||
</div>
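<p>If you use this scale, round, and rescale pattern often, you could wrap it in a small helper. <code>round_to()</code> below is a hypothetical function written for this sketch, not part of base R or the tidyverse:</p>

```r
# Hypothetical helper: round x to a multiple of `accuracy`,
# using `f` (round, floor, or ceiling) to pick the direction.
round_to <- function(x, accuracy, f = round) {
  f(x / accuracy) * accuracy
}

x <- 123.456

down    <- round_to(x, 0.01, floor)    # round down to two digits
up      <- round_to(x, 0.01, ceiling)  # round up to two digits
by_four <- round_to(x, 4)              # nearest multiple of 4
quarter <- round_to(x, 0.25)           # nearest multiple of 0.25

c(down, up, by_four, quarter)
#> [1] 123.45 123.46 124.00 123.50
```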
|
||||
<p>You can use the same technique if you want to <code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> to a multiple of some other number:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Round to nearest multiple of 4
|
||||
<pre data-type="programlisting" data-code-language="r"># Round to nearest multiple of 4
|
||||
round(x / 4) * 4
|
||||
#> [1] 124
|
||||
|
||||
|
@ -409,20 +409,20 @@ round(x / 0.25) * 0.25
|
|||
Cutting numbers into ranges</h2>
|
||||
<p>Use <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code><span data-type="footnote">ggplot2 provides some helpers for common cases in <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_interval()</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>. ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.</span> to break up a numeric vector into discrete buckets:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1, 2, 5, 10, 15, 20)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1, 2, 5, 10, 15, 20)
|
||||
cut(x, breaks = c(0, 5, 10, 15, 20))
|
||||
#> [1] (0,5] (0,5] (0,5] (5,10] (10,15] (15,20]
|
||||
#> Levels: (0,5] (5,10] (10,15] (15,20]</pre>
|
||||
</div>
|
||||
<p>The breaks don’t need to be evenly spaced:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cut(x, breaks = c(0, 5, 10, 100))
|
||||
<pre data-type="programlisting" data-code-language="r">cut(x, breaks = c(0, 5, 10, 100))
|
||||
#> [1] (0,5] (0,5] (0,5] (5,10] (10,100] (10,100]
|
||||
#> Levels: (0,5] (5,10] (10,100]</pre>
|
||||
</div>
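<p>By default each interval excludes its left endpoint and includes its right endpoint, as in <code>(0,5]</code>. The sketch below uses the <code>right</code> and <code>include.lowest</code> arguments of <code>cut()</code> to flip that behaviour; the variable names are illustrative:</p>

```r
x <- c(1, 2, 5, 10, 15, 20)

# right = FALSE makes intervals closed on the left: [0,5), [5,10), ...
# The largest value, 20, now falls outside the last interval and becomes NA.
left_closed <- cut(x, breaks = c(0, 5, 10, 15, 20), right = FALSE)
as.character(left_closed)
#> [1] "[0,5)"   "[0,5)"   "[5,10)"  "[10,15)" "[15,20)" NA

# include.lowest = TRUE also closes the final interval, so 20 is kept.
left_closed_all <- cut(
  x,
  breaks = c(0, 5, 10, 15, 20),
  right = FALSE,
  include.lowest = TRUE
)
anyNA(left_closed_all)
#> [1] FALSE
```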
|
||||
<p>You can optionally supply your own <code>labels</code>. Note that there should be one fewer <code>labels</code> than <code>breaks</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cut(x,
|
||||
<pre data-type="programlisting" data-code-language="r">cut(x,
|
||||
breaks = c(0, 5, 10, 15, 20),
|
||||
labels = c("sm", "md", "lg", "xl")
|
||||
)
|
||||
|
@ -431,7 +431,7 @@ cut(x, breaks = c(0, 5, 10, 15, 20))
|
|||
</div>
|
||||
<p>Any values outside of the range of the breaks will become <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y <- c(NA, -10, 5, 10, 30)
|
||||
<pre data-type="programlisting" data-code-language="r">y <- c(NA, -10, 5, 10, 30)
|
||||
cut(y, breaks = c(0, 5, 10, 15, 20))
|
||||
#> [1] <NA> <NA> (0,5] (5,10] <NA>
|
||||
#> Levels: (0,5] (5,10] (10,15] (15,20]</pre>
|
||||
|
@ -444,13 +444,13 @@ cut(y, breaks = c(0, 5, 10, 15, 20))
|
|||
Cumulative and rolling aggregates</h2>
|
||||
<p>Base R provides <code><a href="https://rdrr.io/r/base/cumsum.html">cumsum()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cumprod()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummin()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummax()</a></code> for running, or cumulative, sums, products, mins and maxes. dplyr provides <code><a href="https://dplyr.tidyverse.org/reference/cumall.html">cummean()</a></code> for cumulative means. Cumulative sums tend to come up the most in practice:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- 1:10
|
||||
<pre data-type="programlisting" data-code-language="r">x <- 1:10
|
||||
cumsum(x)
|
||||
#> [1] 1 3 6 10 15 21 28 36 45 55</pre>
|
||||
</div>
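<p>The other cumulative functions follow the same pattern. This sketch runs them on a small made-up vector and also computes a running mean by hand with base R, as a stand-in for what a cumulative mean calculates:</p>

```r
x <- c(3, 1, 4, 1, 5)

run_sum  <- cumsum(x)
run_prod <- cumprod(x)
run_min  <- cummin(x)
run_max  <- cummax(x)

run_sum
#> [1]  3  4  8  9 14
run_prod
#> [1]  3  3 12 12 60
run_min
#> [1] 3 1 1 1 1
run_max
#> [1] 3 3 4 4 5

# A running mean is the running sum divided by the number of values so far.
run_mean <- cumsum(x) / seq_along(x)
```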
|
||||
<p>If you need more complex rolling or sliding aggregates, try the <a href="https://davisvaughan.github.io/slider/">slider</a> package by Davis Vaughan. The following example illustrates some of its features.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(slider)
|
||||
<pre data-type="programlisting" data-code-language="r">library(slider)
|
||||
|
||||
# Same as a cumulative sum
|
||||
slide_vec(x, sum, .before = Inf)
|
||||
|
@ -475,7 +475,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Currently <code>dep_time</code> and <code>sched_dep_time</code> are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem in this plot: there’s a gap between each hour.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 1, day == 1) |>
|
||||
ggplot(aes(sched_dep_time, dep_delay)) +
|
||||
geom_point()
|
||||
|
@ -499,18 +499,18 @@ General transformations</h1>
|
|||
Ranks</h2>
|
||||
<p>dplyr provides a number of ranking functions inspired by SQL, but you should always start with <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::min_rank()</a></code>. It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(1, 2, 2, 3, 4, NA)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(1, 2, 2, 3, 4, NA)
|
||||
min_rank(x)
|
||||
#> [1] 1 2 2 4 5 NA</pre>
|
||||
</div>
|
||||
<p>Note that the smallest values get the lowest ranks; use <code>desc(x)</code> to give the largest values the smallest ranks:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">min_rank(desc(x))
|
||||
<pre data-type="programlisting" data-code-language="r">min_rank(desc(x))
|
||||
#> [1] 5 3 3 2 1 NA</pre>
|
||||
</div>
|
||||
<p>If <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code> doesn’t do what you need, look at the variants <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::row_number()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::dense_rank()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/percent_rank.html">dplyr::percent_rank()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/percent_rank.html">dplyr::cume_dist()</a></code>. See the documentation for details.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = x)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = x)
|
||||
df |>
|
||||
mutate(
|
||||
row_number = row_number(x),
|
||||
|
@ -531,7 +531,7 @@ df |>
|
|||
<p>You can achieve many of the same results by picking the appropriate <code>ties.method</code> argument to base R’s <code><a href="https://rdrr.io/r/base/rank.html">rank()</a></code>; you’ll probably also want to set <code>na.last = "keep"</code> to keep <code>NA</code>s as <code>NA</code>.</p>
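<p>As a rough sketch of the base R route, the calls below reproduce a minimum rank and a first-come rank with <code>rank()</code>. The dense rank uses <code>match()</code> against the sorted unique values, which is one possible workaround rather than a built-in <code>ties.method</code>:</p>

```r
x <- c(1, 2, 2, 3, 4, NA)

# Ties share the smallest rank, like a minimum rank: 1st, 2nd, 2nd, 4th.
rank_min <- rank(x, ties.method = "min", na.last = "keep")
rank_min
#> [1]  1  2  2  4  5 NA

# Ties are broken by order of appearance.
rank_first <- rank(x, ties.method = "first", na.last = "keep")
rank_first
#> [1]  1  2  3  4  5 NA

# Dense rank (no gaps after ties): position among the sorted unique values.
rank_dense <- match(x, sort(unique(x)))
rank_dense
#> [1]  1  2  2  3  4 NA
```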
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/row_number.html">row_number()</a></code> can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with <code>%%</code> or <code>%/%</code> this can be a useful tool for dividing data into similarly sized groups:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = runif(10))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = runif(10))
|
||||
|
||||
df |>
|
||||
mutate(
|
||||
|
@ -557,7 +557,7 @@ df |>
|
|||
Offsets</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">dplyr::lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">dplyr::lag()</a></code> allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with <code>NA</code>s at the start or end:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(2, 5, 11, 11, 19, 35)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(2, 5, 11, 11, 19, 35)
|
||||
lag(x)
|
||||
#> [1] NA 2 5 11 11 19
|
||||
lead(x)
|
||||
|
@ -566,14 +566,14 @@ lead(x)
|
|||
<ul><li>
|
||||
<p><code>x - lag(x)</code> gives you the difference between the current and previous value.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x - lag(x)
|
||||
<pre data-type="programlisting" data-code-language="r">x - lag(x)
|
||||
#> [1] NA 3 6 0 8 16</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p><code>x == lag(x)</code> tells you whether the current value is the same as the previous one, so <code>FALSE</code> marks a change.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x == lag(x)
|
||||
<pre data-type="programlisting" data-code-language="r">x == lag(x)
|
||||
#> [1] NA FALSE FALSE TRUE FALSE FALSE</pre>
|
||||
</div>
|
||||
</li>
|
||||
|
@ -591,7 +591,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>, explore how the average flight delay for an hour is related to the average delay for the previous hour.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(hour = dep_time %/% 100) |>
|
||||
group_by(year, month, day, hour) |>
|
||||
summarise(
|
||||
|
@ -618,7 +618,7 @@ Center</h2>
|
|||
<p>So far, we’ve mostly used <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.</p>
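<p>A tiny made-up example shows how differently the two summaries react to a single extreme value. The delays below are invented for illustration and are not taken from <code>flights</code>:</p>

```r
delays <- c(-5, -2, 0, 3, 4)
mean(delays)
#> [1] 0
median(delays)
#> [1] 0

# One flight that leaves five hours late drags the mean up,
# while the median barely moves.
with_outlier <- c(delays, 300)
mean(with_outlier)
#> [1] 50
median(with_outlier)
#> [1] 1.5
```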
|
||||
<p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean and the median departure delay for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
mean = mean(dep_delay, na.rm = TRUE),
|
||||
|
@ -647,7 +647,7 @@ Minimum, maximum, and quantiles</h2>
|
|||
<p>What if you’re interested in locations other than the center? <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> will give you the smallest and largest values. Another powerful tool is <code><a href="https://rdrr.io/r/stats/quantile.html">quantile()</a></code> which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find a value that’s greater than 95% of the values.</p>
|
||||
<p>For the <code>flights</code> data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
max = max(dep_delay, na.rm = TRUE),
|
||||
|
@ -673,7 +673,7 @@ Spread</h2>
|
|||
<p>Sometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, <code>sd(x)</code>, and the inter-quartile range, <code><a href="https://rdrr.io/r/stats/IQR.html">IQR()</a></code>. We won’t explain <code><a href="https://rdrr.io/r/stats/sd.html">sd()</a></code> here since you’re probably already familiar with it, but <code><a href="https://rdrr.io/r/stats/IQR.html">IQR()</a></code> might be new — it’s <code>quantile(x, 0.75) - quantile(x, 0.25)</code> and gives you the range that contains the middle 50% of the data.</p>
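<p>The relationship between <code>IQR()</code> and <code>quantile()</code> can be checked directly. The vector below is made up for this sketch, with one deliberately extreme value to show that the IQR ignores it:</p>

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 60)

# IQR() is the distance between the 75% and 25% quantiles.
# unname() drops the "75%" label that quantile() attaches.
iqr_by_hand <- unname(quantile(x, 0.75) - quantile(x, 0.25))
iqr_by_hand
#> [1] 6
IQR(x)
#> [1] 6

# Dropping the extreme value shrinks the standard deviation
# far more than it changes the IQR.
sd_shrinks <- sd(x[x != 60]) < sd(x)
sd_shrinks
#> [1] TRUE
```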
|
||||
<p>We can use this to reveal a small oddity in the <code>flights</code> data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it look like one airport, <a href="https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport">EGE</a>, might have moved.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(origin, dest) |>
|
||||
summarise(
|
||||
distance_sd = IQR(distance),
|
||||
|
@ -695,7 +695,7 @@ Distributions</h2>
|
|||
<p>It’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.</p>
|
||||
<p><a href="#fig-flights-dist" data-type="xref">#fig-flights-dist</a> shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
ggplot(aes(dep_delay)) +
|
||||
geom_histogram(binwidth = 15)
|
||||
#> Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).
|
||||
|
@ -724,7 +724,7 @@ flights |>
|
|||
</div>
|
||||
<p>It’s also a good idea to check that distributions for subgroups resemble the whole. <a href="#fig-flights-dist-daily" data-type="xref">#fig-flights-dist-daily</a> overlays a frequency polygon for each day. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_delay < 120) |>
|
||||
ggplot(aes(dep_delay, group = interaction(day, month))) +
|
||||
geom_freqpoly(binwidth = 5, alpha = 1/5)</pre>
|
||||
|
@ -744,7 +744,7 @@ Positions</h2>
|
|||
<p>There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position. You can do this with the base R <code>[</code> function, but we’re not going to cover it in detail until <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>, because it’s a very powerful and general function. For now we’ll introduce three specialized functions that you can use to extract values at a specified position: <code>first(x)</code>, <code>last(x)</code>, and <code>nth(x, n)</code>.</p>
|
||||
<p>For example, we can find the first and last departure for each day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
first_dep = first(dep_time),
|
||||
|
@ -769,7 +769,7 @@ Positions</h2>
|
|||
<p>If you’re familiar with <code>[</code>, you might wonder if you ever need these functions. There are two main reasons: the <code>default</code> argument and the <code>order_by</code> argument. <code>default</code> allows you to set a default value that’s used if the requested position doesn’t exist, e.g. you’re trying to get the 3rd element from a two-element group. <code>order_by</code> lets you locally override the existing ordering of the rows, so you can get the element at the requested position in the ordering specified by <code><a href="https://dplyr.tidyverse.org/reference/order_by.html">order_by()</a></code>.</p>
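<p>As a rough illustration of these two arguments, the sketch below uses small made-up vectors (<code>x</code>, <code>carrier</code>, and <code>sched</code> are hypothetical names, not columns of <code>flights</code>) and assumes dplyr is installed:</p>

```r
library(dplyr)

x <- c(10, 20)

# Position 3 does not exist in a two-element vector, so the default is used.
nth(x, 3)               # a missing value when no default is supplied
nth(x, 3, default = 0)  # falls back to the supplied default

# order_by picks the element according to another vector's ordering,
# without sorting the data first.
carrier <- c("UA", "AA", "DL")
sched <- c(900, 600, 1200)  # hypothetical scheduled departure times
first(carrier, order_by = sched)  # carrier with the earliest time
last(carrier, order_by = sched)   # carrier with the latest time
```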
|
||||
<p>Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
mutate(r = min_rank(desc(sched_dep_time))) |>
|
||||
filter(r %in% c(1, max(r)))
|
||||
|
|
|
@ -13,11 +13,11 @@ format: html</pre>
|
|||
<li>
|
||||
<p>Transiently, by calling <code>quarto::quarto_render()</code> by hand:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">quarto::quarto_render("diamond-sizes.qmd", output_format = "docx")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = "docx")</pre>
|
||||
</div>
|
||||
<p>This is useful if you want to programmatically produce multiple types of output since the <code>output_format</code> argument can also take a list of values.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">quarto::quarto_render("diamond-sizes.qmd", output_format = c("docx", "pdf"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = c("docx", "pdf"))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
|
@ -41,7 +41,7 @@ Output options</h1>
|
|||
<p>Note the special syntax (<code>pdf: default</code>) if you don’t want to override any of the default options.</p>
|
||||
<p>To render to all formats specified in the YAML of a document, you can use <code>output_format = "all"</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">quarto::quarto_render("diamond-sizes.qmd", output_format = "all")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = "all")</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
@ -164,7 +164,7 @@ Interactivity</h1>
|
|||
htmlwidgets</h2>
|
||||
<p>HTML is an interactive format, and you can take advantage of that interactivity with <strong>htmlwidgets</strong>, R functions that produce interactive HTML visualizations. For example, take the <strong>leaflet</strong> map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(leaflet)
|
||||
<pre data-type="programlisting" data-code-language="r">library(leaflet)
|
||||
leaflet() |>
|
||||
setView(174.764, -36.877, zoom = 16) |>
|
||||
addTiles() |>
|
||||
|
@ -192,7 +192,7 @@ format: html
|
|||
server: shiny</pre>
|
||||
<p>Then you can use the “input” functions to add interactive components to the document:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(shiny)
|
||||
<pre data-type="programlisting" data-code-language="r">library(shiny)
|
||||
|
||||
textInput("name", "What is your name?")
|
||||
numericInput("age", "How old are you?", NA, min = 0, max = 150)</pre>
|
||||
|
|
|
@ -333,7 +333,7 @@ Inline code</h2>
|
|||
</blockquote>
|
||||
<p>When inserting numbers into text, <code><a href="https://rdrr.io/r/base/format.html">format()</a></code> is your friend. It allows you to set the number of <code>digits</code> so you don’t print to a ridiculous degree of accuracy, and a <code>big.mark</code> to make numbers easier to read. You might combine these into a helper function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">comma <- function(x) format(x, digits = 2, big.mark = ",")
|
||||
<pre data-type="programlisting" data-code-language="r">comma <- function(x) format(x, digits = 2, big.mark = ",")
|
||||
comma(3452345)
|
||||
#> [1] "3,452,345"
|
||||
comma(.12358124331)
|
||||
|
@ -407,7 +407,7 @@ Tables</h1>
|
|||
<p>Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.</p>
|
||||
<p>By default, Quarto prints data frames and matrices as you’d see them in the console:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">mtcars[1:5, ]
|
||||
<pre data-type="programlisting" data-code-language="r">mtcars[1:5, ]
|
||||
#> mpg cyl disp hp drat wt qsec vs am gear carb
|
||||
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
|
||||
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
|
||||
|
@ -417,7 +417,7 @@ Tables</h1>
|
|||
</div>
|
||||
<p>If you prefer that data be displayed with additional formatting you can use the <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">knitr::kable()</a></code> function. The code below generates <a href="#tbl-kable" data-type="xref">#tbl-kable</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">knitr::kable(mtcars[1:5, ], )</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">knitr::kable(mtcars[1:5, ], )</pre>
|
||||
<div class="cell-output-display">
|
||||
<div id="tbl-kable" class="anchored">
|
||||
<table class="table table-sm table-striped"><caption>Table 27.1: A knitr kable.</caption>
|
||||
|
|
|
@ -11,7 +11,7 @@ Introduction</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(repurrrsive)
|
||||
library(jsonlite)</pre>
|
||||
</div>
|
||||
|
@ -23,7 +23,7 @@ library(jsonlite)</pre>
|
|||
Lists</h1>
|
||||
<p>So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is the same type. If you want to store elements of different types in the same vector, you’ll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x1 <- list(1:4, "a", TRUE)
|
||||
<pre data-type="programlisting" data-code-language="r">x1 <- list(1:4, "a", TRUE)
|
||||
x1
|
||||
#> [[1]]
|
||||
#> [1] 1 2 3 4
|
||||
|
@ -36,7 +36,7 @@ x1
|
|||
</div>
|
||||
<p>It’s often convenient to name the components, or <strong>children</strong>, of a list, which you can do in the same way as naming the columns of a tibble:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x2 <- list(a = 1:2, b = 1:3, c = 1:4)
|
||||
<pre data-type="programlisting" data-code-language="r">x2 <- list(a = 1:2, b = 1:3, c = 1:4)
|
||||
x2
|
||||
#> $a
|
||||
#> [1] 1 2
|
||||
|
@ -49,7 +49,7 @@ x2
|
|||
</div>
|
||||
<p>Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code>, which generates a compact display of the <strong>str</strong>ucture, de-emphasizing the contents:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str(x1)
|
||||
<pre data-type="programlisting" data-code-language="r">str(x1)
|
||||
#> List of 3
|
||||
#> $ : int [1:4] 1 2 3 4
|
||||
#> $ : chr "a"
|
||||
|
@ -67,7 +67,7 @@ str(x2)
|
|||
Hierarchy</h2>
|
||||
<p>Lists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x3 <- list(list(1, 2), list(3, 4))
|
||||
<pre data-type="programlisting" data-code-language="r">x3 <- list(list(1, 2), list(3, 4))
|
||||
str(x3)
|
||||
#> List of 2
|
||||
#> $ :List of 2
|
||||
|
@ -79,7 +79,7 @@ str(x3)
|
|||
</div>
|
||||
<p>This is notably different to <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>, which generates a flat vector:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">c(c(1, 2), c(3, 4))
|
||||
<pre data-type="programlisting" data-code-language="r">c(c(1, 2), c(3, 4))
|
||||
#> [1] 1 2 3 4
|
||||
|
||||
x4 <- c(list(1, 2), list(3, 4))
|
||||
|
@ -92,7 +92,7 @@ str(x4)
|
|||
</div>
|
||||
<p>As lists get more complex, <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> gets more useful, as it lets you see the hierarchy at a glance:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
||||
<pre data-type="programlisting" data-code-language="r">x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
||||
str(x5)
|
||||
#> List of 2
|
||||
#> $ : num 1
|
||||
|
@ -138,7 +138,7 @@ List-columns</h2>
|
|||
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn’t usually belong in a tibble. In particular, list-columns are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like models or resamples in a data frame.</p>
|
||||
<p>Here’s a simple example of a list-column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = 1:2,
|
||||
y = c("a", "b"),
|
||||
z = list(list(1, 2), list(3, 4, 5))
|
||||
|
@ -152,7 +152,7 @@ df
|
|||
</div>
|
||||
<p>There’s nothing special about lists in a tibble; they behave like any other column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
filter(x == 1)
|
||||
#> # A tibble: 1 × 3
|
||||
#> x y z
|
||||
|
@ -162,7 +162,7 @@ df
|
|||
<p>Computing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.</p>
|
||||
<p>The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull the list-column out and apply one of the techniques that you learned above:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
filter(x == 1) |>
|
||||
pull(z) |>
|
||||
str()
|
||||
|
@ -175,13 +175,13 @@ df
|
|||
<div data-type="note"><h1>
|
||||
Base R
|
||||
</h1><p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5))
|
||||
<pre data-type="programlisting" data-code-language="r">data.frame(x = list(1:3, 3:5))
|
||||
#> x.1.3 x.3.5
|
||||
#> 1 1 3
|
||||
#> 2 2 4
|
||||
#> 3 3 5</pre>
|
||||
</div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn’t print particularly well:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">data.frame(
|
||||
<pre data-type="programlisting" data-code-language="r">data.frame(
|
||||
x = I(list(1:2, 3:5)),
|
||||
y = c("1, 2", "3, 4, 5")
|
||||
)
|
||||
|
@ -199,7 +199,7 @@ Unnesting</h1>
|
|||
<p>Now that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.</p>
|
||||
<p>List-columns tend to come in two basic forms: named and unnamed. When the children are <strong>named</strong>, they tend to have the same names in every row. For example, in <code>df1</code>, every element of list-column <code>y</code> has two elements named <code>a</code> and <code>b</code>. Named list-columns naturally unnest into columns: each named element becomes a new named column.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df1 <- tribble(
|
||||
~x, ~y,
|
||||
1, list(a = 11, b = 12),
|
||||
2, list(a = 21, b = 22),
|
||||
|
@ -208,7 +208,7 @@ Unnesting</h1>
|
|||
</div>
|
||||
<p>When the children are <strong>unnamed</strong>, the number of elements tends to vary from row to row. For example, in <code>df2</code>, the elements of list-column <code>y</code> are unnamed and vary in length from one to three. Unnamed list-columns naturally unnest into rows: you’ll get one row for each child.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">
|
||||
<pre data-type="programlisting" data-code-language="r">
|
||||
df2 <- tribble(
|
||||
~x, ~y,
|
||||
1, list(11, 12, 13),
|
||||
|
@ -224,7 +224,7 @@ df2 <- tribble(
|
|||
</h2>
|
||||
<p>When each row has the same number of elements with the same names, like <code>df1</code>, it’s natural to put each component into its own column with <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df1 |>
|
||||
unnest_wider(y)
|
||||
#> # A tibble: 3 × 3
|
||||
#> x a b
|
||||
|
@ -235,7 +235,7 @@ df2 <- tribble(
|
|||
</div>
|
||||
<p>By default, the names of the new columns come exclusively from the names of the list elements, but you can use the <code>names_sep</code> argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df1 |>
|
||||
unnest_wider(y, names_sep = "_")
|
||||
#> # A tibble: 3 × 3
|
||||
#> x y_a y_b
|
||||
|
@ -246,7 +246,7 @@ df2 <- tribble(
|
|||
</div>
|
||||
<p>We can also use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> with unnamed list-columns, as in <code>df2</code>. Since columns require names but the list lacks them, <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> will label them with consecutive integers:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df2 |>
|
||||
unnest_wider(y, names_sep = "_")
|
||||
#> # A tibble: 3 × 4
|
||||
#> x y_1 y_2 y_3
|
||||
|
@ -264,7 +264,7 @@ df2 <- tribble(
|
|||
</h2>
|
||||
<p>When each row contains an unnamed list, it’s most natural to put each element into its own row with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df2 |>
|
||||
unnest_longer(y)
|
||||
#> # A tibble: 6 × 2
|
||||
#> x y
|
||||
|
@ -278,7 +278,7 @@ df2 <- tribble(
|
|||
</div>
|
||||
<p>Note how <code>x</code> is duplicated for each element inside of <code>y</code>: we get one row of output for each element inside the list-column. But what happens if one of the elements is empty, as in the following example?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df6 <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df6 <- tribble(
|
||||
~x, ~y,
|
||||
"a", list(1, 2),
|
||||
"b", list(3),
|
||||
|
@ -295,7 +295,7 @@ df6 |> unnest_longer(y)
|
|||
<p>We get zero rows in the output, so the row effectively disappears. Once <a href="https://github.com/tidyverse/tidyr/issues/1339" class="uri">https://github.com/tidyverse/tidyr/issues/1339</a> is fixed, you’ll be able to keep this row, replacing <code>y</code> with <code>NA</code> by setting <code>keep_empty = TRUE</code>.</p>
|
||||
<p>You can also unnest named list-columns, like <code>df1$y</code>, into rows. Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix <code>_id</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df1 |>
|
||||
unnest_longer(y)
|
||||
#> # A tibble: 6 × 3
|
||||
#> x y y_id
|
||||
|
@ -309,7 +309,7 @@ df6 |> unnest_longer(y)
|
|||
</div>
|
||||
<p>If you don’t want these <code>ids</code>, you can suppress them with <code>indices_include = FALSE</code>. On the other hand, it’s sometimes useful to retain the position of unnamed elements in unnamed list-columns. You can do this with <code>indices_include = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df2 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df2 |>
|
||||
unnest_longer(y, indices_include = TRUE)
|
||||
#> # A tibble: 6 × 3
|
||||
#> x y y_id
|
||||
|
@ -328,7 +328,7 @@ df6 |> unnest_longer(y)
|
|||
Inconsistent types</h2>
|
||||
<p>What happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column <code>y</code> contains two numbers, a character, a factor, and a logical, which can’t normally be mixed in a single column.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df4 <- tribble(
|
||||
~x, ~y,
|
||||
"a", list(1, "a"),
|
||||
"b", list(TRUE, factor("a"), 5)
|
||||
|
@ -336,7 +336,7 @@ Inconsistent types</h2>
|
|||
</div>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y)
|
||||
#> # A tibble: 5 × 2
|
||||
#> x y
|
||||
|
@ -350,7 +350,7 @@ Inconsistent types</h2>
|
|||
<p>As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though the contents of each element are of a different type.</p>
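<p>The distinction between a list holding mixed types and an atomic vector forcing a single type can be sketched in base R. The objects <code>mixed</code> and <code>flat</code> below are illustrative names, not part of the chapter’s data:</p>

```r
# A list keeps each element's own type...
mixed <- list(1, "a", TRUE)
class(mixed)          # the container itself is a list
sapply(mixed, class)  # each child reports its own class

# ...whereas an atomic vector coerces everything to one common type.
flat <- c(1, "a", TRUE)
class(flat)  # everything has been converted to character
flat
```

<p>This is also why character is the only practical common type for values like these: every other type can be written as a string, but not the other way around.</p>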
|
||||
<p>What happens if you find this problem in a dataset you’re trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. It’s not particularly useful here because there’s really only one class that these five values can be converted to: character.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y, transform = as.character)
|
||||
#> # A tibble: 5 × 2
|
||||
#> x y
|
||||
|
@ -363,7 +363,7 @@ Inconsistent types</h2>
|
|||
</div>
|
||||
<p>Another option would be to filter down to the rows that have values of a specific type:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y) |>
|
||||
filter(map_lgl(y, is.numeric))
|
||||
#> # A tibble: 2 × 2
|
||||
|
@ -374,7 +374,7 @@ Inconsistent types</h2>
|
|||
</div>
|
||||
<p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y) |>
|
||||
filter(map_lgl(y, is.numeric)) |>
|
||||
unnest_longer(y)
|
||||
|
@ -406,7 +406,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>From time to time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of <code>y</code> and <code>z</code> are aligned (i.e. <code>y</code> and <code>z</code> will always have the same length within a row, and the first value of <code>y</code> corresponds to the first value of <code>z</code>). What happens if you apply two <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> calls to this data frame? How can you preserve the relationship between <code>y</code> and <code>z</code>? (Hint: carefully read the docs).</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df4 <- tribble(
|
||||
~x, ~y, ~z,
|
||||
"a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
|
||||
"b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
|
||||
|
@ -427,7 +427,7 @@ Very wide data</h2>
|
|||
<p>We’ll start with <code>gh_repos</code>. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; you might want to explore a little on your own with <code>View(gh_repos)</code> before we continue.</p>
|
||||
<p><code>gh_repos</code> is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call the column <code>json</code> for reasons we’ll get to later.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos <- tibble(json = gh_repos)
|
||||
<pre data-type="programlisting" data-code-language="r">repos <- tibble(json = gh_repos)
|
||||
repos
|
||||
#> # A tibble: 6 × 1
|
||||
#> json
|
||||
|
@ -441,7 +441,7 @@ repos
|
|||
</div>
|
||||
<p>This tibble contains 6 rows, one row for each child of <code>gh_repos</code>. Each row contains an unnamed list with either 26 or 30 elements. Since these are unnamed, we’ll start with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put each child in its own row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos |>
|
||||
<pre data-type="programlisting" data-code-language="r">repos |>
|
||||
unnest_longer(json)
|
||||
#> # A tibble: 176 × 1
|
||||
#> json
|
||||
|
@ -456,7 +456,7 @@ repos
|
|||
</div>
|
||||
<p>At first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6) each element of <code>json</code> is still a list. However, there’s an important difference: now each element is a <strong>named</strong> list so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put each element into its own column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos |>
|
||||
<pre data-type="programlisting" data-code-language="r">repos |>
|
||||
unnest_longer(json) |>
|
||||
unnest_wider(json)
|
||||
#> # A tibble: 176 × 68
|
||||
|
@ -478,7 +478,7 @@ repos
|
|||
</div>
|
||||
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos |>
|
||||
<pre data-type="programlisting" data-code-language="r">repos |>
|
||||
unnest_longer(json) |>
|
||||
unnest_wider(json) |>
|
||||
names()
|
||||
|
@ -508,7 +508,7 @@ repos
|
|||
</div>
|
||||
<p>Let’s select a few that look interesting:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos |>
|
||||
<pre data-type="programlisting" data-code-language="r">repos |>
|
||||
unnest_longer(json) |>
|
||||
unnest_wider(json) |>
|
||||
select(id, full_name, owner, description)
|
||||
|
@ -526,7 +526,7 @@ repos
|
|||
<p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
|
||||
<p><code>owner</code> is another list-column, and since it contains a named list, we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to get at the values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos |>
|
||||
<pre data-type="programlisting" data-code-language="r">repos |>
|
||||
unnest_longer(json) |>
|
||||
unnest_wider(json) |>
|
||||
select(id, full_name, owner, description) |>
|
||||
|
@ -540,7 +540,7 @@ repos
|
|||
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
|
||||
<p>Uh oh, this list column also contains an <code>id</code> column and we can’t have two <code>id</code> columns in the same data frame. Rather than following the advice to use <code>names_repair</code> (which would also work), we’ll instead use <code>names_sep</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">repos |>
|
||||
<pre data-type="programlisting" data-code-language="r">repos |>
|
||||
unnest_longer(json) |>
|
||||
unnest_wider(json) |>
|
||||
select(id, full_name, owner, description) |>
|
||||
|
@ -570,7 +570,7 @@ repos
|
|||
Relational data</h2>
|
||||
<p>Nested data is sometimes used to represent data that we’d usually spread out into multiple data frames. For example, take <code>got_chars</code>. Like <code>gh_repos</code> it’s a list, so we start by turning it into a list-column of a tibble:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">chars <- tibble(json = got_chars)
|
||||
<pre data-type="programlisting" data-code-language="r">chars <- tibble(json = got_chars)
|
||||
chars
|
||||
#> # A tibble: 30 × 1
|
||||
#> json
|
||||
|
@ -585,7 +585,7 @@ chars
|
|||
</div>
|
||||
<p>The <code>json</code> column contains named elements, so we’ll start by widening it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">chars |>
|
||||
<pre data-type="programlisting" data-code-language="r">chars |>
|
||||
unnest_wider(json)
|
||||
#> # A tibble: 30 × 18
|
||||
#> url id name gender culture born died alive titles aliases father
|
||||
|
@ -602,7 +602,7 @@ chars
|
|||
</div>
|
||||
<p>And selecting a few columns to make it easier to read:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">characters <- chars |>
|
||||
<pre data-type="programlisting" data-code-language="r">characters <- chars |>
|
||||
unnest_wider(json) |>
|
||||
select(id, name, gender, culture, born, died, alive)
|
||||
characters
|
||||
|
@ -619,7 +619,7 @@ characters
|
|||
</div>
|
||||
<p>There are also many list-columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">chars |>
|
||||
<pre data-type="programlisting" data-code-language="r">chars |>
|
||||
unnest_wider(json) |>
|
||||
select(id, where(is.list))
|
||||
#> # A tibble: 30 × 8
|
||||
|
@ -635,7 +635,7 @@ characters
|
|||
</div>
|
||||
<p>Let’s explore the <code>titles</code> column. It’s an unnamed list-column, so we’ll unnest it into rows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">chars |>
|
||||
<pre data-type="programlisting" data-code-language="r">chars |>
|
||||
unnest_wider(json) |>
|
||||
select(id, titles) |>
|
||||
unnest_longer(titles)
|
||||
|
@ -652,7 +652,7 @@ characters
|
|||
</div>
|
||||
<p>You might expect to see this data in its own table because it would be easy to join to the characters data as needed. To do so, we’ll do a little cleaning: removing the rows containing empty strings and renaming <code>titles</code> to <code>title</code> since each row now only contains a single title.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">titles <- chars |>
|
||||
<pre data-type="programlisting" data-code-language="r">titles <- chars |>
|
||||
unnest_wider(json) |>
|
||||
select(id, titles) |>
|
||||
unnest_longer(titles) |>
|
||||
|
@ -672,7 +672,7 @@ titles
|
|||
</div>
|
||||
<p>Now, for example, we could use this table to find all the characters that are captains and see all their titles:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">captains <- titles |> filter(str_detect(title, "Captain"))
|
||||
<pre data-type="programlisting" data-code-language="r">captains <- titles |> filter(str_detect(title, "Captain"))
|
||||
captains
|
||||
#> # A tibble: 5 × 2
|
||||
#> id title
|
||||
|
@ -705,7 +705,7 @@ characters |>
|
|||
A dash of text analysis</h2>
|
||||
<p>What if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by splitting on <code>" "</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">titles |>
|
||||
<pre data-type="programlisting" data-code-language="r">titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused")
|
||||
#> # A tibble: 53 × 2
|
||||
#> id word
|
||||
|
@ -720,7 +720,7 @@ A dash of text analysis</h2>
|
|||
</div>
|
||||
<p>This creates an unnamed, variable-length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">titles |>
|
||||
<pre data-type="programlisting" data-code-language="r">titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word)
|
||||
#> # A tibble: 202 × 2
|
||||
|
@ -736,7 +736,7 @@ A dash of text analysis</h2>
|
|||
</div>
|
||||
<p>And then we can count that column to find the most common words:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">titles |>
|
||||
<pre data-type="programlisting" data-code-language="r">titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word) |>
|
||||
count(word, sort = TRUE)
|
||||
|
@ -753,7 +753,7 @@ A dash of text analysis</h2>
|
|||
</div>
|
||||
<p>Some of those words are not very interesting, so we could create a list of common words to drop. In text analysis, these are commonly called stop words.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">stop_words <- tibble(word = c("of", "the"))
|
||||
<pre data-type="programlisting" data-code-language="r">stop_words <- tibble(word = c("of", "the"))
|
||||
|
||||
titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
|
@ -780,7 +780,7 @@ titles |>
|
|||
Deeply nested</h2>
|
||||
<p>We’ll finish off these case studies with a list-column that’s very deeply nested and requires repeated rounds of <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to unravel: <code>gmaps_cities</code>. This is a two column tibble containing five city names and the results of using Google’s <a href="https://developers.google.com/maps/documentation/geocoding">geocoding API</a> to determine their location:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities
|
||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities
|
||||
#> # A tibble: 5 × 2
|
||||
#> city json
|
||||
#> <chr> <list>
|
||||
|
@ -792,7 +792,7 @@ Deeply nested</h2>
|
|||
</div>
|
||||
<p><code>json</code> is a list-column with internal names, so we start with an <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities |>
|
||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities |>
|
||||
unnest_wider(json)
|
||||
#> # A tibble: 5 × 3
|
||||
#> city results status
|
||||
|
@ -805,7 +805,7 @@ Deeply nested</h2>
|
|||
</div>
|
||||
<p>This gives us the <code>status</code> and the <code>results</code>. We’ll drop the <code>status</code> column since every value is <code>OK</code>; in a real analysis, you’d also want to capture all the rows where <code>status != "OK"</code> and figure out what went wrong. <code>results</code> is an unnamed list, with either one or two elements (we’ll see why shortly), so we’ll unnest it into rows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities |>
|
||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities |>
|
||||
unnest_wider(json) |>
|
||||
select(-status) |>
|
||||
unnest_longer(results)
|
||||
|
@ -822,7 +822,7 @@ Deeply nested</h2>
|
|||
</div>
|
||||
<p>Now <code>results</code> is a named list, so we’ll use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">locations <- gmaps_cities |>
|
||||
<pre data-type="programlisting" data-code-language="r">locations <- gmaps_cities |>
|
||||
unnest_wider(json) |>
|
||||
select(-status) |>
|
||||
unnest_longer(results) |>
|
||||
|
@ -842,7 +842,7 @@ locations
|
|||
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
|
||||
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">locations |>
|
||||
<pre data-type="programlisting" data-code-language="r">locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
unnest_wider(geometry)
|
||||
#> # A tibble: 7 × 6
|
||||
|
@ -858,7 +858,7 @@ locations
|
|||
</div>
|
||||
<p>That gives us new <code>bounds</code> (a rectangular region) and <code>location</code> (a point). We can unnest <code>location</code> to see the latitude (<code>lat</code>) and longitude (<code>lng</code>):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">locations |>
|
||||
<pre data-type="programlisting" data-code-language="r">locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
unnest_wider(geometry) |>
|
||||
unnest_wider(location)
|
||||
|
@ -875,7 +875,7 @@ locations
|
|||
</div>
|
||||
<p>Extracting the bounds requires a few more steps:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">locations |>
|
||||
<pre data-type="programlisting" data-code-language="r">locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
unnest_wider(geometry) |>
|
||||
# focus on the variables of interest
|
||||
|
@ -894,7 +894,7 @@ locations
|
|||
</div>
|
||||
<p>We then rename <code>southwest</code> and <code>northeast</code> (the corners of the rectangle) so we can use <code>names_sep</code> to create short but evocative names:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">locations |>
|
||||
<pre data-type="programlisting" data-code-language="r">locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
unnest_wider(geometry) |>
|
||||
select(!location:viewport) |>
|
||||
|
@ -915,7 +915,7 @@ locations
|
|||
<p>Note how we unnest two columns simultaneously by supplying a vector of variable names to <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>.</p>
|
||||
<p>This is somewhere that <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned briefly above, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">locations |>
|
||||
<pre data-type="programlisting" data-code-language="r">locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
hoist(
|
||||
geometry,
|
||||
|
@ -946,7 +946,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Explain the following code line-by-line. Why is it interesting? Why does it work for <code>got_chars</code> but might not work in general?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tibble(json = got_chars) |>
|
||||
<pre data-type="programlisting" data-code-language="r">tibble(json = got_chars) |>
|
||||
unnest_wider(json) |>
|
||||
select(id, where(is.list)) |>
|
||||
pivot_longer(
|
||||
|
@ -983,7 +983,7 @@ Data types</h2>
|
|||
jsonlite</h2>
|
||||
<p>To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code> and <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>. In real life, you’ll use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code> to read a JSON file from disk. For example, the repurrrsive package also provides the source for <code>gh_users</code> as a JSON file and you can read it with <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># A path to a json file inside the package:
|
||||
<pre data-type="programlisting" data-code-language="r"># A path to a json file inside the package:
|
||||
gh_users_json()
|
||||
#> [1] "/Users/hadleywickham/Library/R/arm64/4.2/library/repurrrsive/extdata/gh_users.json"
|
||||
|
||||
|
@ -996,7 +996,7 @@ identical(gh_users, gh_users2)
|
|||
</div>
|
||||
<p>In this book, I’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str(parse_json('1'))
|
||||
<pre data-type="programlisting" data-code-language="r">str(parse_json('1'))
|
||||
#> int 1
|
||||
str(parse_json('[1, 2, 3]'))
|
||||
#> List of 3
|
||||
|
@ -1018,7 +1018,7 @@ str(parse_json('{"x": [1, 2, 3]}'))
|
|||
Starting the rectangling process</h2>
|
||||
<p>In most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g. multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with <code>tibble(json)</code> so that each element becomes a row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">json <- '[
|
||||
<pre data-type="programlisting" data-code-language="r">json <- '[
|
||||
{"name": "John", "age": 34},
|
||||
{"name": "Susan", "age": 27}
|
||||
]'
|
||||
|
@ -1040,7 +1040,7 @@ df |>
|
|||
</div>
|
||||
<p>In rarer cases, the JSON consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">json <- '{
|
||||
<pre data-type="programlisting" data-code-language="r">json <- '{
|
||||
"status": "OK",
|
||||
"results": [
|
||||
{"name": "John", "age": 34},
|
||||
|
@ -1067,7 +1067,7 @@ df |>
|
|||
</div>
|
||||
<p>Alternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(results = parse_json(json)$results)
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(results = parse_json(json)$results)
|
||||
df |>
|
||||
unnest_wider(results)
|
||||
#> # A tibble: 2 × 2
|
||||
|
@ -1090,7 +1090,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Rectangle the <code>df_col</code> and <code>df_row</code> below. They represent the two ways of encoding a data frame in JSON.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">json_col <- parse_json('
|
||||
<pre data-type="programlisting" data-code-language="r">json_col <- parse_json('
|
||||
{
|
||||
"x": ["a", "x", "z"],
|
||||
"y": [10, null, 3]
|
||||
|
|
|
@ -20,7 +20,7 @@ Prerequisites</h2>
|
|||
|
||||
<p>In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(babynames)</pre>
|
||||
</div>
|
||||
<p>Throughout this chapter we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:</p>
|
||||
|
@ -39,7 +39,7 @@ Pattern basics</h1>
|
|||
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code><></code>, and, where possible, highlighting the match in blue.</p>
|
||||
<p>The simplest patterns consist of letters and numbers which match those characters exactly:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "berry")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "berry")
|
||||
#> [6] │ bil<berry>
|
||||
#> [7] │ black<berry>
|
||||
#> [10] │ blue<berry>
|
||||
|
@ -56,14 +56,14 @@ str_view(fruit, "BERRY")</pre>
|
|||
</div>
|
||||
<p>Letters and numbers match exactly and are called <strong>literal characters</strong>. Punctuation characters like <code>.</code>, <code>+</code>, <code>*</code>, <code>[</code>, <code>]</code>, <code>?</code> have special meanings<span data-type="footnote">You’ll learn how to escape these special meanings in <a href="#sec-regexp-escaping" data-type="xref">#sec-regexp-escaping</a>.</span> and are called <strong>meta-characters</strong>. For example, <code>.</code> will match any character<span data-type="footnote">Well, any character apart from <code>\n</code>.</span>, so <code>"a."</code> will match any string that contains an “a” followed by another character:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
|
||||
#> [2] │ <ab>
|
||||
#> [3] │ <ae>
|
||||
#> [6] │ e<ab></pre>
|
||||
</div>
|
||||
<p>Or we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "a...e")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "a...e")
|
||||
#> [1] │ <apple>
|
||||
#> [7] │ bl<ackbe>rry
|
||||
#> [48] │ mand<arine>
|
||||
|
@ -81,7 +81,7 @@ str_view(fruit, "BERRY")</pre>
|
|||
<li>
|
||||
<code>*</code> lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).</li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># ab? matches an "a", optionally followed by a "b".
|
||||
<pre data-type="programlisting" data-code-language="r"># ab? matches an "a", optionally followed by a "b".
|
||||
str_view(c("a", "ab", "abb"), "ab?")
|
||||
#> [1] │ <a>
|
||||
#> [2] │ <ab>
|
||||
|
@ -100,7 +100,7 @@ str_view(c("a", "ab", "abb"), "ab*")
|
|||
</div>
|
||||
<p><strong>Character classes</strong> are defined by <code>[]</code> and let you match a set of characters, e.g. <code>[abcd]</code> matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with <code>^</code>: <code>[^abcd]</code> matches anything <strong>except</strong> “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels or four consonants in a row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(words, "[aeiou][aeiou][aeiou]")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][aeiou]")
|
||||
#> [79] │ b<eau>ty
|
||||
#> [565] │ obv<iou>s
|
||||
#> [644] │ prev<iou>s
|
||||
|
@ -116,7 +116,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
|||
</div>
|
||||
<p>You can combine character classes and quantifiers. For example, the following regexp looks for two vowels followed by two or more consonants:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
|
||||
#> [6] │ acc<ount>
|
||||
#> [21] │ ag<ainst>
|
||||
#> [31] │ alr<eady>
|
||||
|
@ -132,7 +132,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
|||
<p>(We’ll learn some more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
|
||||
<p>You can use <strong>alternation</strong>, <code>|</code>, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "apple|pear|banana")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple|pear|banana")
|
||||
#> [1] │ <apple>
|
||||
#> [4] │ <banana>
|
||||
#> [59] │ <pear>
|
||||
|
@ -161,12 +161,12 @@ Key functions</h1>
|
|||
Detect matches</h2>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_detect(c("a", "b", "c"), "[aeiou]")
|
||||
<pre data-type="programlisting" data-code-language="r">str_detect(c("a", "b", "c"), "[aeiou]")
|
||||
#> [1] TRUE FALSE FALSE</pre>
|
||||
</div>
|
||||
<p>Since <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector of the same length as the initial vector, it pairs well with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. For example, this code finds all the most popular names containing a lower-case “x”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
filter(str_detect(name, "x")) |>
|
||||
count(name, wt = n, sort = TRUE)
|
||||
#> # A tibble: 974 × 2
|
||||
|
@ -182,7 +182,7 @@ Detect matches</h2>
|
|||
</div>
|
||||
<p>We can also use <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> by pairing it with <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> or <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>: <code>sum(str_detect(x, pattern))</code> tells you the number of observations that match and <code>mean(str_detect(x, pattern))</code> tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names<span data-type="footnote">This gives us the proportion of <strong>names</strong> that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.</span> that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
group_by(year) |>
|
||||
summarise(prop_x = mean(str_detect(name, "x"))) |>
|
||||
ggplot(aes(year, prop_x)) +
|
||||
|
@ -196,7 +196,7 @@ Detect matches</h2>
|
|||
</div>
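<p>As a small, self-contained sketch of the <code>sum()</code>/<code>mean()</code> idea described above: the vector <code>x</code> and the names <code>has_e</code>, <code>n_match</code> and <code>prop_match</code> are made up for illustration and are not part of the babynames data.</p>

```r
library(stringr)

# Hypothetical toy vector, used only for illustration
x <- c("apple", "banana", "pear")

# str_detect() returns one TRUE/FALSE per element ...
has_e <- str_detect(x, "e")

# ... so sum() counts the matching elements and mean() gives their proportion
n_match <- sum(has_e)
prop_match <- mean(has_e)

n_match
#> [1] 2
prop_match
#> [1] 0.6666667
```

<p>This works because R treats <code>TRUE</code> as 1 and <code>FALSE</code> as 0 in arithmetic, so the same trick applies inside <code>summarize()</code>.</p>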
|
||||
<p>There are two functions that are closely related to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>, namely <code><a href="https://stringr.tidyverse.org/reference/str_subset.html">str_subset()</a></code> which returns just the strings that contain a match and <code><a href="https://stringr.tidyverse.org/reference/str_which.html">str_which()</a></code> which returns the indexes of strings that have a match:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_subset(c("a", "b", "c"), "[aeiou]")
|
||||
<pre data-type="programlisting" data-code-language="r">str_subset(c("a", "b", "c"), "[aeiou]")
|
||||
#> [1] "a"
|
||||
str_which(c("a", "b", "c"), "[aeiou]")
|
||||
#> [1] 1</pre>
|
||||
|
@ -208,20 +208,20 @@ str_which(c("a", "b", "c"), "[aeiou]")
|
|||
Count matches</h2>
|
||||
<p>The next step up in complexity from <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> is <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code>: rather than a simple true or false, it tells you how many matches there are in each string.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("apple", "banana", "pear")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("apple", "banana", "pear")
|
||||
str_count(x, "p")
|
||||
#> [1] 2 0 1</pre>
|
||||
</div>
|
||||
<p>Note that each match starts at the end of the previous match; i.e. regex matches never overlap. For example, in <code>"abababa"</code>, how many times will the pattern <code>"aba"</code> match? Regular expressions say two, not three:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_count("abababa", "aba")
|
||||
<pre data-type="programlisting" data-code-language="r">str_count("abababa", "aba")
|
||||
#> [1] 2
|
||||
str_view("abababa", "aba")
|
||||
#> [1] │ <aba>b<aba></pre>
|
||||
</div>
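<p>To see where those two matches sit, one option is <code>str_locate_all()</code> from stringr, which reports the start and end position of every match. This is only an illustrative sketch; the name <code>loc</code> is ours.</p>

```r
library(stringr)

# One row per match, with "start" and "end" columns
loc <- str_locate_all("abababa", "aba")[[1]]

# The second match starts at position 5, after the first one has ended,
# so the overlapping candidate starting at position 3 is never reported
loc[, "start"]
#> [1] 1 5
loc[, "end"]
#> [1] 3 7
```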
|
||||
<p>It’s natural to use <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. The following example uses <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with character classes to count the number of vowels and consonants in each name.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
count(name) |>
|
||||
mutate(
|
||||
vowels = str_count(name, "[aeiou]"),
|
||||
|
@ -245,7 +245,7 @@ str_view("abababa", "aba")
|
|||
</ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p>
|
||||
<p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
count(name) |>
|
||||
mutate(
|
||||
name = str_to_lower(name),
|
||||
|
@ -270,13 +270,13 @@ str_view("abababa", "aba")
|
|||
Replace values</h2>
|
||||
<p>As well as detecting and counting matches, we can also modify them with <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>. <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> replaces the first match, and as the name suggests, <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code> replaces all matches.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("apple", "pear", "banana")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("apple", "pear", "banana")
|
||||
str_replace_all(x, "[aeiou]", "-")
|
||||
#> [1] "-ppl-" "p--r" "b-n-n-"</pre>
|
||||
</div>
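<p>For comparison, here is a minimal sketch of <code>str_replace()</code> on the same vector, where only the first vowel in each string is replaced (the name <code>first_only</code> is just for illustration):</p>

```r
library(stringr)

x <- c("apple", "pear", "banana")

# str_replace() stops after the first match in each string
first_only <- str_replace(x, "[aeiou]", "-")
first_only
#> [1] "-pple"  "p-ar"   "b-nana"
```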
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove_all()</a></code> are handy shortcuts for <code>str_replace(x, pattern, "")</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("apple", "pear", "banana")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("apple", "pear", "banana")
|
||||
str_remove_all(x, "[aeiou]")
|
||||
#> [1] "ppl" "pr" "bnn"</pre>
|
||||
</div>
|
||||
|
@ -289,7 +289,7 @@ Extract variables</h2>
|
|||
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code>separate_wider_position()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
|
||||
<p>Let’s create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
~str,
|
||||
"<Sheryl>-F_34",
|
||||
"<Kisha>-F_45",
|
||||
|
@ -302,7 +302,7 @@ Extract variables</h2>
|
|||
</div>
|
||||
<p>To extract this data using <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
separate_wider_regex(
|
||||
str,
|
||||
patterns = c(
|
||||
|
@ -346,7 +346,7 @@ Pattern details</h1>
|
|||
Escaping</h2>
|
||||
<p>In order to match a literal <code>.</code>, you need an <strong>escape</strong> which tells the regular expression to match metacharacters literally. Like strings, regexps use the backslash for escaping. So, to match a <code>.</code>, you need the regexp <code>\.</code>. Unfortunately this creates a problem. We use strings to represent regular expressions, and <code>\</code> is also used as an escape symbol in strings. So to create the regular expression <code>\.</code> we need the string <code>"\\."</code>, as the following example shows.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># To create the regular expression \., we need to use \\.
|
||||
<pre data-type="programlisting" data-code-language="r"># To create the regular expression \., we need to use \\.
|
||||
dot <- "\\."
|
||||
|
||||
# But the expression itself only contains one \
|
||||
|
@ -360,7 +360,7 @@ str_view(c("abc", "a.c", "bef"), "a\\.c")
|
|||
<p>In this book, we’ll usually write regular expressions without quotes, like <code>\.</code>. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like <code>"\\."</code>.</p>
|
||||
<p>If <code>\</code> is used as an escape character in regular expressions, how do you match a literal <code>\</code>? Well, you need to escape it, creating the regular expression <code>\\</code>. To create that regular expression, you need to use a string, which also needs to escape <code>\</code>. That means to match a literal <code>\</code> you need to write <code>"\\\\"</code> — you need four backslashes to match one!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "a\\b"
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "a\\b"
|
||||
str_view(x)
|
||||
#> [1] │ a\b
|
||||
str_view(x, "\\\\")
|
||||
|
@ -368,12 +368,12 @@ str_view(x, "\\\\")
|
|||
</div>
|
||||
<p>Alternatively, you might find it easier to use the raw strings you learned about in <a href="#sec-raw-strings" data-type="xref">#sec-raw-strings</a>. That lets you avoid one layer of escaping:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(x, r"{\\}")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(x, r"{\\}")
|
||||
#> [1] │ a<\>b</pre>
|
||||
</div>
|
||||
<p>If you’re trying to match a literal <code>.</code>, <code>$</code>, <code>|</code>, <code>*</code>, <code>+</code>, <code>?</code>, <code>{</code>, <code>}</code>, <code>(</code>, <code>)</code>, there’s an alternative to using a backslash escape: you can use a character class: <code>[.]</code>, <code>[$]</code>, <code>[|]</code>, ... all match the literal values.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
|
||||
#> [2] │ <a.c>
|
||||
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
|
||||
#> [3] │ <a*c></pre>
|
||||
|
@ -386,7 +386,7 @@ str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
|
|||
Anchors</h2>
|
||||
<p>By default, regular expressions will match any part of a string. If you want to match at the start or end, you need to <strong>anchor</strong> the regular expression using <code>^</code> to match the start of the string or <code>$</code> to match the end of the string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "^a")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "^a")
|
||||
#> [1] │ <a>pple
|
||||
#> [2] │ <a>pricot
|
||||
#> [3] │ <a>vocado
|
||||
|
@ -401,7 +401,7 @@ str_view(fruit, "a$")
|
|||
<p>It’s tempting to think that <code>$</code> should match the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.</p>
|
||||
<p>To force a regular expression to match only the full string, anchor it with both <code>^</code> and <code>$</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "apple")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple")
|
||||
#> [1] │ <apple>
|
||||
#> [62] │ pine<apple>
|
||||
str_view(fruit, "^apple$")
|
||||
|
@ -409,7 +409,7 @@ str_view(fruit, "^apple$")
|
|||
</div>
|
||||
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
str_view(x, "sum")
|
||||
#> [1] │ <sum>mary(x)
|
||||
#> [2] │ <sum>marise(df)
|
||||
|
@ -420,14 +420,14 @@ str_view(x, "\\bsum\\b")
|
|||
</div>
|
||||
<p>When used alone, anchors will produce a zero-width match:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view("abc", c("$", "^", "\\b"))
|
||||
<pre data-type="programlisting" data-code-language="r">str_view("abc", c("$", "^", "\\b"))
|
||||
#> [1] │ abc<>
|
||||
#> [2] │ <>abc
|
||||
#> [3] │ <>abc<></pre>
|
||||
</div>
|
||||
<p>This helps you understand what happens when you replace a standalone anchor:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_replace_all("abc", c("$", "^", "\\b"), "--")
|
||||
<pre data-type="programlisting" data-code-language="r">str_replace_all("abc", c("$", "^", "\\b"), "--")
|
||||
#> [1] "abc--" "--abc" "--abc--"</pre>
|
||||
</div>
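<p>Because a standalone anchor matches a zero-width position, replacing it inserts text without removing anything. A minimal sketch, using a made-up vector <code>fruits</code> that is not part of the chapter’s data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical input, for illustration only
fruits <- c("apple", "banana")

# Replacing the start anchor prepends text to every string
prefixed <- str_replace_all(fruits, "^", "> ")
prefixed
#> [1] "> apple"  "> banana"

# Replacing the end anchor appends text to every string
suffixed <- str_replace_all(fruits, "$", "!")
suffixed
#> [1] "apple!"  "banana!"</pre>
</div>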
|
||||
</section>
|
||||
|
@ -444,7 +444,7 @@ Character classes</h2>
|
|||
<code>\</code> escapes special characters, so <code>[\^\-\]]</code> matches <code>^</code>, <code>-</code>, or <code>]</code>.</li>
|
||||
</ul><p>Here are a few examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "abcd ABCD 12345 -!@#%."
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "abcd ABCD 12345 -!@#%."
|
||||
str_view(x, "[abc]+")
|
||||
#> [1] │ <abc>d ABCD 12345 -!@#%.
|
||||
str_view(x, "[a-z]+")
|
||||
|
@ -468,7 +468,7 @@ str_view("a-b-c", "[a\\-c]")
|
|||
<code>\w</code> matches any “word” character, i.e. letters, numbers, and the underscore;<br/><code>\W</code> matches any “non-word” character.</li>
|
||||
</ul><p>The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "abcd ABCD 12345 -!@#%."
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "abcd ABCD 12345 -!@#%."
|
||||
str_view(x, "\\d+")
|
||||
#> [1] │ abcd ABCD <12345> -!@#%.
|
||||
str_view(x, "\\D+")
|
||||
|
@ -496,7 +496,7 @@ Quantifiers</h2>
|
|||
<code>{n,m}</code> matches between n and m times.</li>
|
||||
</ul><p>The following code shows how this works for a few simple examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "-- -x- -xx- -xxx- -xxxx- -xxxxx-"
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "-- -x- -xx- -xxx- -xxxx- -xxxxx-"
|
||||
str_view(x, "-x?-") # [0, 1]
|
||||
#> [1] │ <--> <-x-> -xx- -xxx- -xxxx- -xxxxx-
|
||||
str_view(x, "-x+-") # [1, Inf)
|
||||
|
@ -526,7 +526,7 @@ Grouping and capturing</h2>
|
|||
<p>As well as overriding operator precedence, parentheses have another important effect: they create <strong>capturing groups</strong> that allow you to use sub-components of the match.</p>
|
||||
<p>The first way to use a capturing group is to refer back to it within a match with a <strong>back reference</strong>: <code>\1</code> refers to the match contained in the first parenthesis, <code>\2</code> in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "(..)\\1")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "(..)\\1")
|
||||
#> [4] │ b<anan>a
|
||||
#> [20] │ <coco>nut
|
||||
#> [22] │ <cucu>mber
|
||||
|
@ -536,7 +536,7 @@ Grouping and capturing</h2>
|
|||
</div>
|
||||
<p>And this one finds all words that start and end with the same pair of letters:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(words, "^(..).*\\1$")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "^(..).*\\1$")
|
||||
#> [152] │ <church>
|
||||
#> [217] │ <decide>
|
||||
#> [617] │ <photograph>
|
||||
|
@ -545,7 +545,7 @@ Grouping and capturing</h2>
|
|||
</div>
|
||||
<p>You can also use back references in <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code>. For example, this code switches the order of the second and third words in <code>sentences</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sentences |>
|
||||
<pre data-type="programlisting" data-code-language="r">sentences |>
|
||||
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |>
|
||||
str_view()
|
||||
#> [1] │ The canoe birch slid on the smooth planks.
|
||||
|
@ -562,7 +562,7 @@ Grouping and capturing</h2>
|
|||
</div>
|
||||
<p>If you want to extract the matches for each group you can use <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code>. But <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sentences |>
|
||||
<pre data-type="programlisting" data-code-language="r">sentences |>
|
||||
str_match("the (\\w+) (\\w+)") |>
|
||||
head()
|
||||
#> [,1] [,2] [,3]
|
||||
|
@ -575,7 +575,7 @@ Grouping and capturing</h2>
|
|||
</div>
|
||||
<p>You could convert to a tibble and name the columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sentences |>
|
||||
<pre data-type="programlisting" data-code-language="r">sentences |>
|
||||
str_match("the (\\w+) (\\w+)") |>
|
||||
as_tibble(.name_repair = "minimal") |>
|
||||
set_names("match", "word1", "word2")
|
||||
|
@ -593,7 +593,7 @@ Grouping and capturing</h2>
|
|||
<p>But then you’ve basically recreated your own version of <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. Indeed, behind the scenes, <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> converts your vector of patterns to a single regex that uses grouping to capture the named components.</p>
|
||||
<p>Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with <code>(?:)</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("a gray cat", "a grey dog")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("a gray cat", "a grey dog")
|
||||
str_match(x, "gr(e|a)y")
|
||||
#> [,1] [,2]
|
||||
#> [1,] "gray" "a"
|
||||
|
@ -647,7 +647,7 @@ Pattern control</h1>
|
|||
Regex flags</h2>
|
||||
<p>There are a number of settings that you can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">bananas <- c("banana", "Banana", "BANANA")
|
||||
<pre data-type="programlisting" data-code-language="r">bananas <- c("banana", "Banana", "BANANA")
|
||||
str_view(bananas, "banana")
|
||||
#> [1] │ <banana>
|
||||
str_view(bananas, regex("banana", ignore_case = TRUE))
|
||||
|
@ -659,7 +659,7 @@ str_view(bananas, regex("banana", ignore_case = TRUE))
|
|||
<ul><li>
|
||||
<p><code>dotall = TRUE</code> lets <code>.</code> match everything, including <code>\n</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "Line 1\nLine 2\nLine 3"
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "Line 1\nLine 2\nLine 3"
|
||||
str_view(x, ".Line")
|
||||
str_view(x, regex(".Line", dotall = TRUE))
|
||||
#> [1] │ Line 1<
|
||||
|
@ -670,7 +670,7 @@ str_view(x, regex(".Line", dotall = TRUE))
|
|||
<li>
|
||||
<p><code>multiline = TRUE</code> makes <code>^</code> and <code>$</code> match the start and end of each line rather than the start and end of the complete string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "Line 1\nLine 2\nLine 3"
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "Line 1\nLine 2\nLine 3"
|
||||
str_view(x, "^Line")
|
||||
#> [1] │ <Line> 1
|
||||
#> │ Line 2
|
||||
|
@ -683,7 +683,7 @@ str_view(x, regex("^Line", multiline = TRUE))
|
|||
</li>
|
||||
</ul><p>Finally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try <code>comments = TRUE</code>. It tweaks the pattern language to ignore spaces and new lines, as well as everything after <code>#</code>. This allows you to use comments and whitespace to make complex regular expressions more understandable<span data-type="footnote"><code>comments = TRUE</code> is particularly effective in combination with a raw string, as we use here.</span>, as in the following example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">phone <- regex(
|
||||
<pre data-type="programlisting" data-code-language="r">phone <- regex(
|
||||
r"(
|
||||
\(? # optional opening parens
|
||||
(\d{3}) # area code
|
||||
|
@ -701,7 +701,7 @@ str_match("514-791-8141", phone)
|
|||
</div>
|
||||
<p>If you’re using comments and want to match a space, newline, or <code>#</code>, you’ll need to escape it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view("x x #", regex(r"(x #)", comments = TRUE))
|
||||
<pre data-type="programlisting" data-code-language="r">str_view("x x #", regex(r"(x #)", comments = TRUE))
|
||||
#> [1] │ <x> <x> #
|
||||
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
|
||||
#> [1] │ x <x #></pre>
|
||||
|
@ -713,19 +713,19 @@ str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
|
|||
Fixed matches</h2>
|
||||
<p>You can opt-out of the regular expression rules by using <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(c("", "a", "."), fixed("."))
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(c("", "a", "."), fixed("."))
|
||||
#> [3] │ <.></pre>
|
||||
</div>
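<p>To see what opting out changes, this small sketch runs the same input through <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> with and without <code>fixed()</code>. As a regular expression, <code>.</code> matches any single character; as a fixed string, it matches only a literal period. The result names <code>as_regex</code> and <code>as_fixed</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

x <- c("", "a", ".")

# As a regex, "." matches any character, so "a" and "." both match
as_regex <- str_detect(x, ".")
as_regex
#> [1] FALSE  TRUE  TRUE

# As a fixed string, only the literal period matches
as_fixed <- str_detect(x, fixed("."))
as_fixed
#> [1] FALSE FALSE  TRUE</pre>
</div>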
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code> also gives you the ability to ignore case:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view("x X", "X")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view("x X", "X")
|
||||
#> [1] │ x <X>
|
||||
str_view("x X", fixed("X", ignore_case = TRUE))
|
||||
#> [1] │ <x> <X></pre>
|
||||
</div>
|
||||
<p>If you’re working with non-English text, you will probably want <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">coll()</a></code> instead of <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>, as it implements the full rules for capitalization as used by the <code>locale</code> you specify. See <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a> for more details on locales.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
|
||||
<pre data-type="programlisting" data-code-language="r">str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
|
||||
#> [1] │ i <İ> ı I
|
||||
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
|
||||
#> [1] │ <i> <İ> ı I</pre>
|
||||
|
@ -746,7 +746,7 @@ Practice</h1>
|
|||
Check your work</h2>
|
||||
<p>First, let’s find all sentences that start with “The”. Using the <code>^</code> anchor alone is not enough:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(sentences, "^The")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^The")
|
||||
#> [1] │ <The> birch canoe slid on the smooth planks.
|
||||
#> [4] │ <The>se days a chicken leg is a rare dish.
|
||||
#> [6] │ <The> juice of lemons makes fine punch.
|
||||
|
@ -761,7 +761,7 @@ Check your work</h2>
|
|||
</div>
|
||||
<p>That’s because the pattern also matches sentences starting with words like <code>They</code> or <code>These</code>. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(sentences, "^The\\b")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^The\\b")
|
||||
#> [1] │ <The> birch canoe slid on the smooth planks.
|
||||
#> [6] │ <The> juice of lemons makes fine punch.
|
||||
#> [7] │ <The> box was thrown beside the parked truck.
|
||||
|
@ -776,7 +776,7 @@ Check your work</h2>
|
|||
</div>
|
||||
<p>What about finding all sentences that begin with a pronoun?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(sentences, "^She|He|It|They\\b")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^She|He|It|They\\b")
|
||||
#> [3] │ <It>'s easy to tell the depth of a well.
|
||||
#> [15] │ <He>lp the woman get back to her feet.
|
||||
#> [27] │ <He>r purse was full of useless trash.
|
||||
|
@ -791,7 +791,7 @@ Check your work</h2>
|
|||
</div>
|
||||
<p>A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(sentences, "^(She|He|It|They)\\b")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^(She|He|It|They)\\b")
|
||||
#> [3] │ <It>'s easy to tell the depth of a well.
|
||||
#> [29] │ <It> snowed, rained, and hailed the same morning.
|
||||
#> [63] │ <He> ran half way to the hardware store.
|
||||
|
@ -806,7 +806,7 @@ Check your work</h2>
|
|||
</div>
|
||||
<p>You might wonder how you might spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">pos <- c("He is a boy", "She had a good time")
|
||||
<pre data-type="programlisting" data-code-language="r">pos <- c("He is a boy", "She had a good time")
|
||||
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")
|
||||
|
||||
pattern <- "^(She|He|It|They)\\b"
|
||||
|
@ -823,7 +823,7 @@ str_detect(neg, pattern)
|
|||
Boolean operations</h2>
|
||||
<p>Imagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels (<code>[^aeiou]</code>), then allow that to match any number of letters (<code>[^aeiou]+</code>), then force it to match the whole string by anchoring to the beginning and the end (<code>^[^aeiou]+$</code>):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(words, "^[^aeiou]+$")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "^[^aeiou]+$")
|
||||
#> [123] │ <by>
|
||||
#> [249] │ <dry>
|
||||
#> [328] │ <fly>
|
||||
|
@ -833,7 +833,7 @@ Boolean operations</h2>
|
|||
</div>
|
||||
<p>But you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(words[!str_detect(words, "[aeiou]")])
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words[!str_detect(words, "[aeiou]")])
|
||||
#> [1] │ by
|
||||
#> [2] │ dry
|
||||
#> [3] │ fly
|
||||
|
@ -843,7 +843,7 @@ Boolean operations</h2>
|
|||
</div>
|
||||
<p>This is a useful technique whenever you’re dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine if you want to find all words that contain “a” and “b”. There’s no “and” operator built in to regular expressions so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(words, "a.*b|b.*a")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "a.*b|b.*a")
|
||||
#> [2] │ <ab>le
|
||||
#> [3] │ <ab>out
|
||||
#> [4] │ <ab>solute
|
||||
|
@ -858,7 +858,7 @@ Boolean operations</h2>
|
|||
</div>
|
||||
<p>It’s simpler to combine the results of two calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">words[str_detect(words, "a") & str_detect(words, "b")]
|
||||
<pre data-type="programlisting" data-code-language="r">words[str_detect(words, "a") & str_detect(words, "b")]
|
||||
#> [1] "able" "about" "absolute" "available" "baby" "back"
|
||||
#> [7] "bad" "bag" "balance" "ball" "bank" "bar"
|
||||
#> [13] "base" "basis" "bear" "beat" "beauty" "because"
|
||||
|
@ -867,13 +867,13 @@ Boolean operations</h2>
|
|||
</div>
|
||||
<p>What if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">words[str_detect(words, "a.*e.*i.*o.*u")]
|
||||
<pre data-type="programlisting" data-code-language="r">words[str_detect(words, "a.*e.*i.*o.*u")]
|
||||
# ...
|
||||
words[str_detect(words, "u.*o.*i.*e.*a")]</pre>
|
||||
</div>
|
||||
<p>It’s much simpler to combine five calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">words[
|
||||
<pre data-type="programlisting" data-code-language="r">words[
|
||||
str_detect(words, "a") &
|
||||
str_detect(words, "e") &
|
||||
str_detect(words, "i") &
|
||||
|
@ -890,7 +890,7 @@ words[str_detect(words, "u.*o.*i.*e.*a")]</pre>
|
|||
Creating a pattern with code</h2>
|
||||
<p>What if we wanted to find all <code>sentences</code> that mention a color? The basic idea is simple: we just combine alternation with word boundaries.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(sentences, "\\b(red|green|blue)\\b")
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "\\b(red|green|blue)\\b")
|
||||
#> [2] │ Glue the sheet to the dark <blue> background.
|
||||
#> [26] │ Two <blue> fish swam in the tank.
|
||||
#> [92] │ A wisp of cloud hung in the <blue> air.
|
||||
|
@ -905,16 +905,16 @@ Creating a pattern with code</h2>
|
|||
</div>
|
||||
<p>But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">rgb <- c("red", "green", "blue")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">rgb <- c("red", "green", "blue")</pre>
|
||||
</div>
|
||||
<p>Well, we can! We’d just need to create the pattern from the vector using <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
|
||||
<pre data-type="programlisting" data-code-language="r">str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
|
||||
#> [1] "\\b(red|green|blue)\\b"</pre>
|
||||
</div>
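<p>If you build patterns like this often, one option is to wrap the construction in a small function. The sketch below uses a helper name of our own, <code>words_to_pattern()</code>, which is not part of stringr, and it assumes the words contain no regular expression metacharacters, so no escaping is attempted.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Hypothetical helper: turn a vector of plain words into a
# whole-word alternation pattern. Assumes no special characters.
words_to_pattern <- function(words) {
  str_c("\\b(", str_flatten(words, "|"), ")\\b")
}

pattern <- words_to_pattern(c("red", "green", "blue"))

# "red" matches as a whole word, but not inside "reduced"
hits <- str_detect(c("a red door", "a reduced price"), pattern)
hits
#> [1]  TRUE FALSE</pre>
</div>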
|
||||
<p>We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_view(colors())
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(colors())
|
||||
#> [1] │ white
|
||||
#> [2] │ aliceblue
|
||||
#> [3] │ antiquewhite
|
||||
|
@ -929,7 +929,7 @@ Creating a pattern with code</h2>
|
|||
</div>
|
||||
<p>But let’s first eliminate the numbered variants:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">cols <- colors()
|
||||
<pre data-type="programlisting" data-code-language="r">cols <- colors()
|
||||
cols <- cols[!str_detect(cols, "\\d")]
|
||||
str_view(cols)
|
||||
#> [1] │ white
|
||||
|
@ -946,7 +946,7 @@ str_view(cols)
|
|||
</div>
|
||||
<p>Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
|
||||
<pre data-type="programlisting" data-code-language="r">pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
|
||||
str_view(sentences, pattern)
|
||||
#> [2] │ Glue the sheet to the dark <blue> background.
|
||||
#> [12] │ A rod is used to catch <pink> <salmon>.
|
||||
|
@ -997,14 +997,14 @@ tidyverse</h2>
|
|||
Base R</h2>
|
||||
<p><code>apropos(pattern)</code> searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">apropos("replace")
|
||||
<pre data-type="programlisting" data-code-language="r">apropos("replace")
|
||||
#> [1] "%+replace%" "replace" "replace_na"
|
||||
#> [4] "setReplaceMethod" "str_replace" "str_replace_all"
|
||||
#> [7] "str_replace_na" "theme_replace"</pre>
|
||||
</div>
|
||||
<p><code>list.files(path, pattern)</code> lists all files in <code>path</code> that match a regular expression <code>pattern</code>. For example, you can find all the R Markdown files in the current directory with:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">head(list.files(pattern = "\\.Rmd$"))
|
||||
<pre data-type="programlisting" data-code-language="r">head(list.files(pattern = "\\.Rmd$"))
|
||||
#> character(0)</pre>
|
||||
</div>
|
||||
<p>It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the <a href="https://stringi.gagolewski.com">stringi package</a>, which is in turn built on top of the <a href="https://unicode-org.github.io/icu/userguide/strings/regexp.html">ICU engine</a>, whereas base R functions use either the <a href="https://github.com/laurikari/tre">TRE engine</a> or the <a href="https://www.pcre.org">PCRE engine</a>, depending on whether or not you’ve set <code>perl = TRUE</code>. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the <code>(?…)</code> syntax.</p>
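<p>For the basic patterns covered in this chapter, the three engines should agree. The following sketch runs a word-boundary pattern through base R’s default engine, through base R with <code>perl = TRUE</code>, and through stringr. The input vector and result names are ours, chosen for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

x <- c("summary(x)", "sum(x)")

# Base R with its default engine
base_default <- grepl("\\bsum\\b", x)
# Base R using the PCRE engine
base_perl <- grepl("\\bsum\\b", x, perl = TRUE)
# stringr, which uses the ICU engine via stringi
via_stringr <- str_detect(x, "\\bsum\\b")

base_default
#> [1] FALSE  TRUE
identical(base_default, base_perl) && identical(base_perl, via_stringr)
#> [1] TRUE</pre>
</div>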
|
||||
|
|
|
@ -16,7 +16,7 @@ Excel</h1>
|
|||
Prerequisites</h2>
|
||||
<p>In this chapter, you’ll learn how to load data from Excel spreadsheets in R with the <strong>readxl</strong> package. This package is non-core tidyverse, so you need to load it explicitly but it is installed automatically when you install the tidyverse package.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(readxl)
|
||||
<pre data-type="programlisting" data-code-language="r">library(readxl)
|
||||
library(tidyverse)</pre>
|
||||
</div>
|
||||
<p><strong>xlsx</strong> and <strong>XLConnect</strong> can be used for reading data from and writing data to Excel spreadsheets. However, these two packages require Java installed on your machine and the rJava package. Due to potential challenges with installation, we recommend using alternative packages we’ve introduced in this chapter.</p>
|
||||
|
@ -49,11 +49,11 @@ Reading spreadsheets</h2>
|
|||
</div>
|
||||
<p>The first argument to <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> is the path to the file to read.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students <- read_excel("data/students.xlsx")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">students <- read_excel("data/students.xlsx")</pre>
|
||||
</div>
|
||||
<p><code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> will read the file in as a tibble.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students
|
||||
<pre data-type="programlisting" data-code-language="r">students
|
||||
#> # A tibble: 6 × 5
|
||||
#> `Student ID` `Full Name` favourite.food mealPlan AGE
|
||||
#> <dbl> <chr> <chr> <chr> <chr>
|
||||
|
@ -68,7 +68,7 @@ Reading spreadsheets</h2>
|
|||
<ol type="1"><li>
|
||||
<p>The column names are all over the place. You can provide column names that follow a consistent format; we recommend <code>snake_case</code> using the <code>col_names</code> argument.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(
|
||||
"data/students.xlsx",
|
||||
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
|
||||
)
|
||||
|
@ -85,7 +85,7 @@ Reading spreadsheets</h2>
|
|||
</div>
|
||||
<p>Unfortunately, this didn’t quite do the trick. You now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the <code>skip</code> argument.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(
|
||||
"data/students.xlsx",
|
||||
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
|
||||
skip = 1
|
||||
|
@ -104,7 +104,7 @@ Reading spreadsheets</h2>
|
|||
<li>
|
||||
<p>In the <code>favourite_food</code> column, one of the observations is <code>N/A</code>, which stands for “not available” but it’s currently not recognized as an <code>NA</code> (note the contrast between this <code>N/A</code> and the age of the fourth student in the list). You can specify which character strings should be recognized as <code>NA</code>s with the <code>na</code> argument. By default, only <code>""</code> (empty string, or, in the case of reading from a spreadsheet, an empty cell) is recognized as an <code>NA</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(
|
||||
"data/students.xlsx",
|
||||
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
|
||||
skip = 1,
|
||||
|
@ -124,7 +124,7 @@ Reading spreadsheets</h2>
|
|||
<li>
|
||||
<p>One other remaining issue is that <code>age</code> is read in as a character variable, but it really should be numeric. Just like with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and friends for reading data from flat files, you can supply a <code>col_types</code> argument to <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are <code>"skip"</code>, <code>"guess"</code>, <code>"logical"</code>, <code>"numeric"</code>, <code>"date"</code>, <code>"text"</code> or <code>"list"</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(
|
||||
"data/students.xlsx",
|
||||
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
|
||||
skip = 1,
|
||||
|
@ -144,7 +144,7 @@ Reading spreadsheets</h2>
|
|||
</div>
|
||||
<p>However, this didn’t quite produce the desired result either. By specifying that <code>age</code> should be numeric, we have turned the one cell with the non-numeric entry (which had the value <code>five</code>) into an <code>NA</code>. In this case, we should read age in as <code>"text"</code> and then make the change once the data is loaded in R.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">students <- read_excel(
|
||||
<pre data-type="programlisting" data-code-language="r">students <- read_excel(
|
||||
"data/students.xlsx",
|
||||
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
|
||||
skip = 1,
|
||||
|
@ -187,7 +187,7 @@ Reading individual sheets</h2>
|
|||
</div>
|
||||
<p>You can read a single sheet from a spreadsheet with the <code>sheet</code> argument in <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
|
||||
#> # A tibble: 52 × 8
|
||||
#> species island bill_length_mm bill_dep…¹ flipp…² body_…³ sex year
|
||||
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
|
||||
|
@ -202,7 +202,7 @@ Reading individual sheets</h2>
|
|||
</div>
|
||||
<p>Some variables that appear to contain numerical data are read in as characters due to the character string <code>"NA"</code> not being recognized as a true <code>NA</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">penguins_torgersen <- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
|
||||
<pre data-type="programlisting" data-code-language="r">penguins_torgersen <- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
|
||||
|
||||
penguins_torgersen
|
||||
#> # A tibble: 52 × 8
|
||||
|
@ -219,17 +219,17 @@ penguins_torgersen
|
|||
</div>
|
||||
<p>However, we cheated here a bit. We looked inside the Excel spreadsheet, which is not a recommended workflow. Instead, you can use <code><a href="https://readxl.tidyverse.org/reference/excel_sheets.html">excel_sheets()</a></code> to get information on all sheets in an Excel spreadsheet, and then read the one(s) you’re interested in.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">excel_sheets("data/penguins.xlsx")
|
||||
<pre data-type="programlisting" data-code-language="r">excel_sheets("data/penguins.xlsx")
|
||||
#> [1] "Torgersen Island" "Biscoe Island" "Dream Island"</pre>
|
||||
</div>
|
||||
<p>Once you know the names of the sheets, you can read them in individually with <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">penguins_biscoe <- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
|
||||
<pre data-type="programlisting" data-code-language="r">penguins_biscoe <- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
|
||||
penguins_dream <- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")</pre>
|
||||
</div>
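<p>With more than a handful of sheets, the repeated calls above could be replaced by a loop over the names returned by <code>excel_sheets()</code>. The following is a minimal sketch, not the approach used in this chapter: it assumes the writexl package is installed and builds a small throwaway workbook (the sheet names and the <code>all_sheets</code> object are made up for illustration) so that it runs without <code>data/penguins.xlsx</code>.</p>

```r
library(readxl)
library(writexl)

# Hypothetical stand-in for a multi-sheet workbook: two small sheets
# written to a temporary file. The string "NA" marks missing values.
path <- tempfile(fileext = ".xlsx")
write_xlsx(
  list(
    "Sheet A" = data.frame(id = 1:2, value = c("x", "NA")),
    "Sheet B" = data.frame(id = 3:5, value = c("y", "z", "NA"))
  ),
  path
)

# Discover the sheet names, then read each sheet with the same arguments.
sheets <- excel_sheets(path)
all_sheets <- lapply(sheets, function(s) read_excel(path, sheet = s, na = "NA"))
names(all_sheets) <- sheets

# Number of rows read from each sheet
sapply(all_sheets, nrow)
```

<p>The result is a named list with one tibble per sheet, so each sheet can still be inspected on its own before deciding how to combine them.</p>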
|
||||
<p>In this case the full penguins dataset is spread across three sheets in the spreadsheet. Each sheet has the same number of columns but different numbers of rows.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dim(penguins_torgersen)
|
||||
<pre data-type="programlisting" data-code-language="r">dim(penguins_torgersen)
|
||||
#> [1] 52 8
|
||||
dim(penguins_biscoe)
|
||||
#> [1] 168 8
|
||||
|
@ -238,7 +238,7 @@ dim(penguins_dream)
|
|||
</div>
|
||||
<p>We can put them together with <code><a href="https://dplyr.tidyverse.org/reference/bind_rows.html">bind_rows()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
|
||||
<pre data-type="programlisting" data-code-language="r">penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
|
||||
penguins
|
||||
#> # A tibble: 344 × 8
|
||||
#> species island bill_length_mm bill_depth_mm flippe…¹ body_…² sex year
|
||||
|
@ -269,7 +269,7 @@ Reading part of a sheet</h2>
|
|||
</div>
|
||||
<p>This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the <code><a href="https://readxl.tidyverse.org/reference/readxl_example.html">readxl_example()</a></code> function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> as usual.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">deaths_path <- readxl_example("deaths.xlsx")
|
||||
<pre data-type="programlisting" data-code-language="r">deaths_path <- readxl_example("deaths.xlsx")
|
||||
deaths <- read_excel(deaths_path)
|
||||
#> New names:
|
||||
#> • `` -> `...2`
|
||||
|
@ -292,7 +292,7 @@ deaths
|
|||
<p>The top three rows and the bottom four rows are not part of the data frame.</p>
|
||||
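<p>We could skip these leading rows with <code>skip</code>. Note that we set <code>skip = 4</code>, not <code>3</code>: <code>read_excel()</code> used the first row of the sheet as column names, so the three extraneous rows in the tibble above are rows 2 to 4 of the sheet, and the actual column names are in row 5.</p>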
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4)
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, skip = 4)
|
||||
#> # A tibble: 14 × 6
|
||||
#> Name Profession Age `Has kids` `Date of birth` Date of dea…¹
|
||||
#> <chr> <chr> <chr> <chr> <dttm> <chr>
|
||||
|
@ -306,7 +306,7 @@ deaths
|
|||
</div>
|
||||
<p>We could also set <code>n_max</code> to omit the extraneous rows at the bottom.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4, n_max = 10)
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, skip = 4, n_max = 10)
|
||||
#> # A tibble: 10 × 6
|
||||
#> Name Profe…¹ Age Has k…² `Date of birth` `Date of death`
|
||||
#> <chr> <chr> <dbl> <lgl> <dttm> <dttm>
|
||||
|
@ -324,19 +324,19 @@ deaths
|
|||
<ul><li>
|
||||
<p>Supply this information to the <code>range</code> argument:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = "A5:F15")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, range = "A5:F15")</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Specify rows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = cell_rows(c(5, 15)))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, range = cell_rows(c(5, 15)))</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Specify cells that mark the top-left and bottom-right corners of the data – the top-left corner, <code>A5</code>, translates to <code>c(5, 1)</code> (5th row down, 1st column) and the bottom-right corner, <code>F15</code>, translates to <code>c(15, 6)</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ul><p>If you have control over the sheet, an even better way is to create a “named range”. This is useful within Excel because named ranges make repeated formulas easier to create and they have some useful properties for creating dynamic charts and graphs as well. Even if you’re not working in Excel, named ranges can be useful for identifying which cells to read into R. In the example above, the table we’re reading in is named <code>Table1</code>, so we can read it in with the following.</p>
|
||||
|
@ -369,7 +369,7 @@ Data not in cell values</h2>
|
|||
Writing to Excel</h2>
|
||||
<p>Let’s create a small data frame that we can then write out. Note that <code>item</code> is a factor and <code>quantity</code> is numeric (a double, since the values are written without an <code>L</code> suffix).</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">bake_sale <- tibble(
|
||||
<pre data-type="programlisting" data-code-language="r">bake_sale <- tibble(
|
||||
item = factor(c("brownie", "cupcake", "cookie")),
|
||||
quantity = c(10, 5, 8)
|
||||
)
|
||||
|
@ -384,7 +384,7 @@ bake_sale
|
|||
</div>
|
||||
<p>You can write data back to disk as an Excel file using the <code><a href="https://docs.ropensci.org/writexl/reference/write_xlsx.html">write_xlsx()</a></code> function from the <strong>writexl</strong> package.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(writexl)
|
||||
<pre data-type="programlisting" data-code-language="r">library(writexl)
|
||||
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
|
||||
</div>
|
||||
<p><a href="#fig-bake-sale-excel" data-type="xref">#fig-bake-sale-excel</a> shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting the <code>col_names</code> and <code>format_headers</code> arguments to <code>FALSE</code>.</p>
|
||||
|
@ -398,7 +398,7 @@ write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
|
|||
</div>
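<p>The two header arguments mentioned above can be sketched as follows. This is an illustrative example only: the <code>snacks</code> data frame and the temporary file are made up so that the code runs without touching <code>data/bake-sale.xlsx</code>.</p>

```r
library(writexl)
library(readxl)

# Hypothetical example data; any small data frame would do.
snacks <- data.frame(item = c("brownie", "cupcake"), quantity = c(10, 5))

# Write the values only: no header row, and therefore no bold header formatting.
path <- tempfile(fileext = ".xlsx")
write_xlsx(snacks, path = path, col_names = FALSE, format_headers = FALSE)

# The file has no header row, so tell read_excel() not to use the
# first row as column names.
back <- read_excel(path, col_names = FALSE)
dim(back)
```

<p>Because the file no longer carries column names, readxl generates placeholder names when reading it back.</p>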
|
||||
<p>Just like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see <a href="#sec-writing-to-a-file" data-type="xref">#sec-writing-to-a-file</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/bake-sale.xlsx")
|
||||
<pre data-type="programlisting" data-code-language="r">read_excel("data/bake-sale.xlsx")
|
||||
#> # A tibble: 3 × 2
|
||||
#> item quantity
|
||||
#> <chr> <dbl>
|
||||
|
@ -414,7 +414,7 @@ Formatted output</h2>
|
|||
<p>The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the <strong>openxlsx</strong> package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book, you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats, as well as reading any vignettes that might come with the package.</p>
|
||||
<p>Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the <code>penguins</code> data frame.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(openxlsx)
|
||||
<pre data-type="programlisting" data-code-language="r">library(openxlsx)
|
||||
library(palmerpenguins)
|
||||
|
||||
# Create a workbook (spreadsheet)
|
||||
|
@ -444,7 +444,7 @@ writeDataTable(
|
|||
</div>
|
||||
<p>This creates a workbook object:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">penguins_species
|
||||
<pre data-type="programlisting" data-code-language="r">penguins_species
|
||||
#> A Workbook object.
|
||||
#>
|
||||
#> Worksheets:
|
||||
|
@ -464,7 +464,7 @@ writeDataTable(
|
|||
</div>
|
||||
<p>And we can write this to disk with <code><a href="https://rdrr.io/pkg/openxlsx/man/saveWorkbook.html">saveWorkbook()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre>
|
||||
</div>
|
||||
<p>The resulting spreadsheet is shown in <a href="#fig-penguins-species" data-type="xref">#fig-penguins-species</a>. By default, openxlsx formats the data as an Excel table.</p>
|
||||
<div class="cell">
|
||||
|
|
|
@ -21,7 +21,7 @@ Prerequisites</h2>
|
|||
|
||||
<p>In this chapter, we’ll use functions from the stringr package which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(babynames)</pre>
|
||||
</div>
|
||||
<p>You can easily tell when you’re using a stringr function because all stringr functions start with <code>str_</code>. This is particularly useful if you use RStudio, because typing <code>str_</code> will trigger autocomplete, allowing you to jog your memory of which functions are available.</p>
|
||||
|
@ -38,7 +38,7 @@ library(babynames)</pre>
|
|||
Creating a string</h1>
|
||||
<p>We’ve created strings in passing earlier in the book, but didn’t discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two so in the interests of consistency the <a href="https://style.tidyverse.org/syntax.html#character-vectors">tidyverse style guide</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">string1 <- "This is a string"
|
||||
<pre data-type="programlisting" data-code-language="r">string1 <- "This is a string"
|
||||
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'</pre>
|
||||
</div>
|
||||
<p>If you forget to close a quote, you’ll see <code>+</code>, the continuation character:</p>
|
||||
|
@ -53,16 +53,16 @@ string2 <- 'If I want to include a "quote" inside a string, I use single quot
|
|||
Escapes</h2>
|
||||
<p>To include a literal single or double quote in a string you can use <code>\</code> to “escape” it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">double_quote <- "\"" # or '"'
|
||||
<pre data-type="programlisting" data-code-language="r">double_quote <- "\"" # or '"'
|
||||
single_quote <- '\'' # or "'"</pre>
|
||||
</div>
|
||||
<p>So if you want to include a literal backslash in your string, you’ll need to escape it: <code>"\\"</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">backslash <- "\\"</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">backslash <- "\\"</pre>
|
||||
</div>
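<p>One way to convince yourself that each of the escaped forms above holds a single character is to count characters with base R’s <code>nchar()</code>. This is a small sketch that redefines the three variables so it can be run on its own.</p>

```r
# The same three strings as above, redefined so the sketch is self-contained.
double_quote <- "\""
single_quote <- '\''
backslash <- "\\"

# The escaping backslash is part of the source code, not of the string,
# so each string contains exactly one character.
nchar(c(double_quote, single_quote, backslash))
```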
|
||||
<p>Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code><span data-type="footnote">Or use the base R function <code><a href="https://rdrr.io/r/base/writeLines.html">writeLines()</a></code>.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c(single_quote, double_quote, backslash)
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(single_quote, double_quote, backslash)
|
||||
x
|
||||
#> [1] "'" "\"" "\\"
|
||||
|
||||
|
@ -78,7 +78,7 @@ str_view(x)
|
|||
Raw strings</h2>
|
||||
<p>Creating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the <code>double_quote</code> and <code>single_quote</code> variables:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tricky <- "double_quote <- \"\\\"\" # or '\"'
|
||||
<pre data-type="programlisting" data-code-language="r">tricky <- "double_quote <- \"\\\"\" # or '\"'
|
||||
single_quote <- '\\'' # or \"'\""
|
||||
str_view(tricky)
|
||||
#> [1] │ double_quote <- "\"" # or '"'
|
||||
|
@ -86,7 +86,7 @@ str_view(tricky)
|
|||
</div>
|
||||
<p>That’s a lot of backslashes! (This is sometimes called <a href="https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome">leaning toothpick syndrome</a>.) To eliminate the escaping you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">tricky <- r"(double_quote <- "\"" # or '"'
|
||||
<pre data-type="programlisting" data-code-language="r">tricky <- r"(double_quote <- "\"" # or '"'
|
||||
single_quote <- '\'' # or "'")"
|
||||
str_view(tricky)
|
||||
#> [1] │ double_quote <- "\"" # or '"'
|
||||
|
@ -100,7 +100,7 @@ str_view(tricky)
|
|||
Other special characters</h2>
|
||||
<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code> there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="https://rdrr.io/r/base/Quotes.html">?'"'</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
|
||||
x
|
||||
#> [1] "one\ntwo" "one\ttwo" "µ" "😄"
|
||||
str_view(x)
|
||||
|
@ -125,7 +125,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "This\u00a0is\u00a0tricky"</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "This\u00a0is\u00a0tricky"</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
|
@ -142,7 +142,7 @@ Creating many strings from data</h1>
|
|||
</h2>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code><span data-type="footnote"><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>. There are two main reasons we recommend it: it propagates <code>NA</code>s (rather than converting them to <code>"NA"</code>) and it uses the tidyverse recycling rules.</span> takes any number of vectors as arguments and returns a character vector:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_c("x", "y")
|
||||
<pre data-type="programlisting" data-code-language="r">str_c("x", "y")
|
||||
#> [1] "xy"
|
||||
str_c("x", "y", "z")
|
||||
#> [1] "xyz"
|
||||
|
@ -151,7 +151,7 @@ str_c("Hello ", c("John", "Susan"))
|
|||
</div>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> so it obeys the usual rules for recycling and missing values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">set.seed(1410)
|
||||
<pre data-type="programlisting" data-code-language="r">set.seed(1410)
|
||||
df <- tibble(name = c(wakefield::name(3), NA))
|
||||
df |> mutate(greeting = str_c("Hi ", name, "!"))
|
||||
#> # A tibble: 4 × 2
|
||||
|
@ -164,7 +164,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
|||
</div>
|
||||
<p>If you want missing values to display in some other way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code>. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
mutate(
|
||||
greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
|
||||
greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
|
||||
|
@ -185,7 +185,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
|||
</h2>
|
||||
<p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it’s outside of the quotes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> mutate(greeting = str_glue("Hi {name}!"))
|
||||
<pre data-type="programlisting" data-code-language="r">df |> mutate(greeting = str_glue("Hi {name}!"))
|
||||
#> # A tibble: 4 × 2
|
||||
#> name greeting
|
||||
#> <chr> <glue>
|
||||
|
@ -197,7 +197,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
|||
<p>As you can see, <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, which unfortunately makes it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
|
||||
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. If you guess that you’ll need to somehow escape it, you’re on the right track. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
|
||||
<pre data-type="programlisting" data-code-language="r">df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
|
||||
#> # A tibble: 4 × 2
|
||||
#> name greeting
|
||||
#> <chr> <glue>
|
||||
|
@ -214,7 +214,7 @@ df |> mutate(greeting = str_c("Hi ", name, "!"))
|
|||
</h2>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code>glue()</code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, i.e. something that always returns a single string? That’s the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_flatten(c("x", "y", "z"))
|
||||
<pre data-type="programlisting" data-code-language="r">str_flatten(c("x", "y", "z"))
|
||||
#> [1] "xyz"
|
||||
str_flatten(c("x", "y", "z"), ", ")
|
||||
#> [1] "x, y, z"
|
||||
|
@ -223,7 +223,7 @@ str_flatten(c("x", "y", "z"), ", ", last = ", and ")
|
|||
</div>
|
||||
<p>This makes it work well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tribble(
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tribble(
|
||||
~ name, ~ fruit,
|
||||
"Carmen", "banana",
|
||||
"Carmen", "apple",
|
||||
|
@ -250,7 +250,7 @@ Exercises</h2>
|
|||
<ol type="1"><li>
|
||||
<p>Compare and contrast the results of <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code> with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> for the following inputs:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_c("hi ", NA)
|
||||
<pre data-type="programlisting" data-code-language="r">str_c("hi ", NA)
|
||||
str_c(letters[1:2], letters[1:3])</pre>
|
||||
</div>
|
||||
</li>
|
||||
|
@ -284,7 +284,7 @@ Extracting data from strings</h1>
|
|||
Separating into rows</h2>
|
||||
<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. In the most common case, you’ll use <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> to split based on a delimiter:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- tibble(x = c("a,b,c", "d,e", "f"))
|
||||
<pre data-type="programlisting" data-code-language="r">df1 <- tibble(x = c("a,b,c", "d,e", "f"))
|
||||
df1 |>
|
||||
separate_longer_delim(x, delim = ",")
|
||||
#> # A tibble: 6 × 1
|
||||
|
@ -299,7 +299,7 @@ df1 |>
|
|||
</div>
|
||||
<p>It’s rarer to see <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_position()</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df2 <- tibble(x = c("1211", "131", "21"))
|
||||
<pre data-type="programlisting" data-code-language="r">df2 <- tibble(x = c("1211", "131", "21"))
|
||||
df2 |>
|
||||
separate_longer_position(x, width = 1)
|
||||
#> # A tibble: 9 × 1
|
||||
|
@ -320,7 +320,7 @@ df2 |>
|
|||
Separating into columns</h2>
|
||||
<p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. The functions for this are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> we supply the delimiter and the names in two arguments:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
|
||||
<pre data-type="programlisting" data-code-language="r">df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
|
||||
df3 |>
|
||||
separate_wider_delim(
|
||||
x,
|
||||
|
@ -336,7 +336,7 @@ df3 |>
|
|||
</div>
|
||||
<p>If a specific piece is not useful you can use an <code>NA</code> name to omit it from the results:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df3 |>
|
||||
<pre data-type="programlisting" data-code-language="r">df3 |>
|
||||
separate_wider_delim(
|
||||
x,
|
||||
delim = ".",
|
||||
|
@ -351,7 +351,7 @@ df3 |>
|
|||
</div>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
|
||||
<pre data-type="programlisting" data-code-language="r">df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
|
||||
df4 |>
|
||||
separate_wider_position(
|
||||
x,
|
||||
|
@ -371,7 +371,7 @@ df4 |>
|
|||
Diagnosing widening problems</h2>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code><span data-type="footnote">The same principles apply to <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
|
||||
|
||||
df |>
|
||||
separate_wider_delim(
|
||||
|
@ -387,7 +387,7 @@ df |>
|
|||
</div>
|
||||
<p>You’ll notice that we get an error, but the error gives us some suggestions as to how you might proceed. Let’s start by debugging the problem:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">debug <- df |>
|
||||
<pre data-type="programlisting" data-code-language="r">debug <- df |>
|
||||
separate_wider_delim(
|
||||
x,
|
||||
delim = "-",
|
||||
|
@ -408,7 +408,7 @@ debug
|
|||
</div>
|
||||
<p>When you use the debug mode you get three extra columns added to the output: <code>x_ok</code>, <code>x_pieces</code>, and <code>x_remainder</code> (if you separate a variable with a different name, you’ll get a different prefix). Here, <code>x_ok</code> lets you quickly find the inputs that failed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">debug |> filter(!x_ok)
|
||||
<pre data-type="programlisting" data-code-language="r">debug |> filter(!x_ok)
|
||||
#> # A tibble: 2 × 6
|
||||
#> x y z x_ok x_pieces x_remainder
|
||||
#> <chr> <chr> <chr> <lgl> <int> <chr>
|
||||
|
@ -419,7 +419,7 @@ debug
|
|||
<p>Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove <code>too_few = "debug"</code> to ensure that new problems become errors.</p>
|
||||
<p>In other cases you may just want to fill in the missing pieces with <code>NA</code>s and move on. That’s the job of <code>too_few = "align_start"</code> and <code>too_few = "align_end"</code> which allow you to control where the <code>NA</code>s should go:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
separate_wider_delim(
|
||||
x,
|
||||
delim = "-",
|
||||
|
@ -437,7 +437,7 @@ debug
|
|||
</div>
|
||||
<p>The same principles apply if you have too many pieces:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
|
||||
|
||||
df |>
|
||||
separate_wider_delim(
|
||||
|
@ -453,7 +453,7 @@ df |>
|
|||
</div>
|
||||
<p>But now when we debug the result, you can see the purpose of <code>x_remainder</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">debug <- df |>
|
||||
<pre data-type="programlisting" data-code-language="r">debug <- df |>
|
||||
separate_wider_delim(
|
||||
x,
|
||||
delim = "-",
|
||||
|
@ -471,7 +471,7 @@ debug |> filter(!x_ok)
|
|||
</div>
|
||||
<p>You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">df |>
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
separate_wider_delim(
|
||||
x,
|
||||
delim = "-",
|
||||
|
@ -517,12 +517,12 @@ Letters</h1>
|
|||
Length</h2>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> tells you the number of letters in the string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_length(c("a", "R for data science", NA))
|
||||
<pre data-type="programlisting" data-code-language="r">str_length(c("a", "R for data science", NA))
|
||||
#> [1] 1 18 NA</pre>
|
||||
</div>
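<p>As a small additional sketch (the vector <code>fruit_names</code> is a made-up example, not data from the book), note that <code>str_length()</code> is vectorized, counts spaces as characters, returns zero for an empty string, and keeps missing values missing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# fruit_names is a hypothetical example vector
fruit_names <- c("kiwi", "dragon fruit", "", NA)
str_length(fruit_names)
#> [1]  4 12  0 NA</pre>
</div>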
|
||||
<p>You could use this with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to find the distribution of lengths of US babynames, and then with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
count(length = str_length(name), wt = n)
|
||||
#> # A tibble: 14 × 2
|
||||
#> length n
|
||||
|
@ -556,23 +556,23 @@ babynames |>
|
|||
Subsetting</h2>
|
||||
<p>You can extract parts of a string using <code>str_sub(string, start, end)</code>, where <code>start</code> and <code>end</code> are the letters where the substring should start and end. The <code>start</code> and <code>end</code> arguments are inclusive, so the length of the returned string will be <code>end - start + 1</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- c("Apple", "Banana", "Pear")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("Apple", "Banana", "Pear")
|
||||
str_sub(x, 1, 3)
|
||||
#> [1] "App" "Ban" "Pea"</pre>
|
||||
</div>
|
||||
<p>You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_sub(x, -3, -1)
|
||||
<pre data-type="programlisting" data-code-language="r">str_sub(x, -3, -1)
|
||||
#> [1] "ple" "ana" "ear"</pre>
|
||||
</div>
|
||||
<p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> won’t fail if the string is too short: it will just return as much as possible:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_sub("a", 1, 5)
|
||||
<pre data-type="programlisting" data-code-language="r">str_sub("a", 1, 5)
|
||||
#> [1] "a"</pre>
|
||||
</div>
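<p>To check the inclusive-bounds rule for yourself, here is a minimal sketch (the string <code>s</code> is a hypothetical example): the substring from position 2 to position 4 should have length <code>4 - 2 + 1</code>, and negative positions should count back from the end.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

s <- "Banana"  # hypothetical example string

# positions 2 to 4, both ends included
part <- str_sub(s, 2, 4)
part
#> [1] "ana"
str_length(part) == 4 - 2 + 1
#> [1] TRUE

# the last two letters, counted from the end
str_sub(s, -2, -1)
#> [1] "na"</pre>
</div>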
|
||||
<p>We could use <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to find the first and last letter of each name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">babynames |>
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
mutate(
|
||||
first = str_sub(name, 1, 1),
|
||||
last = str_sub(name, -1, -1)
|
||||
|
@ -598,7 +598,7 @@ Long strings</h2>
|
|||
<li><p><code>str_wrap(x, 30)</code> wraps a string introducing new lines so that each line is at most 30 characters (it doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line)</p></li>
|
||||
</ul><p>The following code shows these functions in action with a made up string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
|
||||
|
||||
str_view(str_trunc(x, 30))
|
||||
#> [1] │ Lorem ipsum dolor sit amet,...
|
||||
|
@ -633,14 +633,14 @@ Non-English text</h1>
|
|||
Encoding</h2>
|
||||
<p>When working with non-English text the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="https://rdrr.io/r/base/rawConversion.html">charToRaw()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">charToRaw("Hadley")
|
||||
<pre data-type="programlisting" data-code-language="r">charToRaw("Hadley")
|
||||
#> [1] 48 61 64 6c 65 79</pre>
|
||||
</div>
|
||||
<p>Each of these six hexadecimal numbers represents one letter: <code>48</code> is H, <code>61</code> is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the <strong>American</strong> Standard Code for Information Interchange.</p>
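<p>The mapping also works in the other direction. As a small sketch using only base R, <code>as.integer()</code> shows the same bytes as decimal numbers, and <code>rawToChar()</code> turns bytes back into a string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">bytes <- charToRaw("Hadley")

# the same six bytes, printed as decimal numbers
as.integer(bytes)
#> [1]  72  97 100 108 101 121

# convert the bytes back into a string
rawToChar(bytes)
#> [1] "Hadley"

# build a string from the first two bytes by hand
rawToChar(as.raw(c(0x48, 0x61)))
#> [1] "Ha"</pre>
</div>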
|
||||
<p>Things aren’t so easy for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte <code>b1</code> is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emojis.</p>
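<p>You can see the Latin1/Latin2 difference for yourself with base R’s <code>iconv()</code>. This is only a sketch: it assumes a UTF-8 session, and the encoding names that <code>iconv()</code> accepts vary by platform, so <code>"ISO-8859-1"</code> and <code>"ISO-8859-2"</code> may need different spellings on your system.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">b1 <- "\xb1"  # a single byte with no declared encoding

# interpret the byte as Latin1
iconv(b1, from = "ISO-8859-1", to = "UTF-8")
#> [1] "±"

# interpret the same byte as Latin2
iconv(b1, from = "ISO-8859-2", to = "UTF-8")
#> [1] "ą"</pre>
</div>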
|
||||
<p>readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish. For example, here are two inline CSVs with unusual encodings<span data-type="footnote">Here I’m using the special <code>\x</code> to encode binary data directly into a string.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x1 <- "text\nEl Ni\xf1o was particularly bad this year"
|
||||
<pre data-type="programlisting" data-code-language="r">x1 <- "text\nEl Ni\xf1o was particularly bad this year"
|
||||
read_csv(x1)
|
||||
#> # A tibble: 1 × 1
|
||||
#> text
|
||||
|
@ -656,7 +656,7 @@ read_csv(x2)
|
|||
</div>
|
||||
<p>To read these correctly you specify the encoding via the <code>locale</code> argument:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">read_csv(x1, locale = locale(encoding = "Latin1"))
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(x1, locale = locale(encoding = "Latin1"))
|
||||
#> # A tibble: 1 × 1
|
||||
#> text
|
||||
#> <chr>
|
||||
|
@ -670,7 +670,7 @@ read_csv(x2, locale = locale(encoding = "Shift-JIS"))
|
|||
</div>
|
||||
<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="https://readr.tidyverse.org/reference/encoding.html">guess_encoding()</a></code> to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">guess_encoding(x1)
|
||||
<pre data-type="programlisting" data-code-language="r">guess_encoding(x1)
|
||||
#> # A tibble: 1 × 2
|
||||
#> encoding confidence
|
||||
#> <chr> <dbl>
|
||||
|
@ -689,21 +689,21 @@ guess_encoding(x2)
|
|||
Letter variations</h2>
|
||||
<p>If you’re working with individual letters (e.g. with <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code>) there’s an important challenge if you’re working with a language that has accents, because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">u <- c("\u00fc", "u\u0308")
|
||||
<pre data-type="programlisting" data-code-language="r">u <- c("\u00fc", "u\u0308")
|
||||
str_view(u)
|
||||
#> [1] │ ü
|
||||
#> [2] │ ü</pre>
|
||||
</div>
|
||||
<p>But they have different lengths and the first characters are different:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_length(u)
|
||||
<pre data-type="programlisting" data-code-language="r">str_length(u)
|
||||
#> [1] 1 2
|
||||
str_sub(u, 1, 1)
|
||||
#> [1] "ü" "u"</pre>
|
||||
</div>
|
||||
<p>Finally, note that these strings are treated as different when you compare them with <code>==</code>, which is why stringr provides the handy <code><a href="https://stringr.tidyverse.org/reference/str_equal.html">str_equal()</a></code> function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">u[[1]] == u[[2]]
|
||||
<pre data-type="programlisting" data-code-language="r">u[[1]] == u[[2]]
|
||||
#> [1] FALSE
|
||||
|
||||
str_equal(u[[1]], u[[2]])
|
||||
|
@ -718,14 +718,14 @@ Locale-dependent function</h2>
|
|||
<p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to using English rules, by using the “en” locale, and requires you to specify the <code>locale</code> argument to override it. Fortunately, there are only two sets of functions where the locale really matters: changing case and sorting.</p>
|
||||
<p>The rules for changing case are not the same in every language. For example, Turkish has two i’s: with and without a dot, and it capitalizes them in a different way to English:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_to_upper(c("i", "ı"))
|
||||
<pre data-type="programlisting" data-code-language="r">str_to_upper(c("i", "ı"))
|
||||
#> [1] "I" "I"
|
||||
str_to_upper(c("i", "ı"), locale = "tr")
|
||||
#> [1] "İ" "I"</pre>
|
||||
</div>
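<p>The same distinction applies when converting to lower case. As a minimal sketch, assuming the Turkish rules in the ICU library that stringr relies on, the dotless capital “I” maps to “ı” and the dotted capital “İ” maps to “i”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# lower-casing with Turkish rules keeps the dotted/dotless distinction
str_to_lower(c("I", "İ"), locale = "tr")
#> [1] "ı" "i"</pre>
</div>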
|
||||
<p>Sorting strings depends on the order of the alphabet, and order of the alphabet is not the same in every language<span data-type="footnote">Sorting in languages that don’t have an alphabet, like Chinese, is more complicated still.</span>! Here’s an example: in Czech, “ch” is a compound letter that appears after <code>h</code> in the alphabet.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">str_sort(c("a", "c", "ch", "h", "z"))
|
||||
<pre data-type="programlisting" data-code-language="r">str_sort(c("a", "c", "ch", "h", "z"))
|
||||
#> [1] "a" "c" "ch" "h" "z"
|
||||
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
|
||||
#> [1] "a" "c" "h" "ch" "z"</pre>
|
||||
|
|
|
@ -5,7 +5,7 @@
|
|||
Coding basics</h1>
|
||||
<p>Let’s review some basics we’ve so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">1 / 200 * 30
|
||||
<pre data-type="programlisting" data-code-language="r">1 / 200 * 30
|
||||
#> [1] 0.15
|
||||
(59 + 73 + 2) / 3
|
||||
#> [1] 44.66667
|
||||
|
@ -14,22 +14,22 @@ sin(pi / 2)
|
|||
</div>
|
||||
<p>You can create new objects with the assignment operator <code><-</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- 3 * 4</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">x <- 3 * 4</pre>
|
||||
</div>
|
||||
<p>You can <strong>c</strong>ombine multiple elements into a vector with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">primes <- c(2, 3, 5, 7, 11, 13)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">primes <- c(2, 3, 5, 7, 11, 13)</pre>
|
||||
</div>
|
||||
<p>And basic arithmetic is applied to every element of the vector:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">primes * 2
|
||||
<pre data-type="programlisting" data-code-language="r">primes * 2
|
||||
#> [1] 4 6 10 14 22 26
|
||||
primes - 1
|
||||
#> [1] 1 2 4 6 10 12</pre>
|
||||
</div>
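<p>The same element-by-element rule holds for other operators and when both sides are vectors of the same length. A small sketch (it repeats the definition of <code>primes</code> so it can be run on its own):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">primes <- c(2, 3, 5, 7, 11, 13)

# division is applied to every element
primes / 2
#> [1] 1.0 1.5 2.5 3.5 5.5 6.5

# two vectors of equal length are combined element by element
primes * primes
#> [1]   4   9  25  49 121 169</pre>
</div>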
|
||||
<p>All R statements where you create objects, <strong>assignment</strong> statements, have the same form:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">object_name <- value</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">object_name <- value</pre>
|
||||
</div>
|
||||
<p>When reading that code, say “object name gets value” in your head.</p>
|
||||
<p>You will make lots of assignments and <code><-</code> is a pain to type. You can save time with RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automatically surrounds <code><-</code> with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.</p>
|
||||
|
@ -41,7 +41,7 @@ Comments</h1>
|
|||
<p>R will ignore any text after <code>#</code>. This allows you to write <strong>comments</strong>, text that is ignored by R but read by other humans. We’ll sometimes include comments in examples explaining what’s happening with the code.</p>
|
||||
<p>Comments can be helpful for briefly describing what the subsequent code does.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># define primes
|
||||
<pre data-type="programlisting" data-code-language="r"># define primes
|
||||
primes <- c(2, 3, 5, 7, 11, 13)
|
||||
|
||||
# multiply primes by 2
|
||||
|
@ -59,7 +59,7 @@ primes * 2
|
|||
What’s in a name?</h1>
|
||||
<p>Object names must start with a letter, and can only contain letters, numbers, <code>_</code> and <code>.</code>. You want your object names to be descriptive, so you’ll need to adopt a convention for multiple words. We recommend <strong>snake_case</strong> where you separate lowercase words with <code>_</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">i_use_snake_case
|
||||
<pre data-type="programlisting" data-code-language="r">i_use_snake_case
|
||||
otherPeopleUseCamelCase
|
||||
some.people.use.periods
|
||||
And_aFew.People_RENOUNCEconvention</pre>
|
||||
|
@ -67,22 +67,22 @@ And_aFew.People_RENOUNCEconvention</pre>
|
|||
<p>We’ll come back to names again when we talk more about code style in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>.</p>
|
||||
<p>You can inspect an object by typing its name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x
|
||||
<pre data-type="programlisting" data-code-language="r">x
|
||||
#> [1] 12</pre>
|
||||
</div>
|
||||
<p>Make another assignment:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">this_is_a_really_long_name <- 2.5</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">this_is_a_really_long_name <- 2.5</pre>
|
||||
</div>
|
||||
<p>To inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.</p>
|
||||
<p>Oops, you made a mistake! The value of <code>this_is_a_really_long_name</code> should be 3.5, not 2.5. Use another keyboard shortcut to help you fix it. Type “this” then press Cmd/Ctrl + ↑. Doing so will list all the commands you’ve typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.</p>
|
||||
<p>Make yet another assignment:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">r_rocks <- 2 ^ 3</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">r_rocks <- 2 ^ 3</pre>
|
||||
</div>
|
||||
<p>Let’s try to inspect it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">r_rock
|
||||
<pre data-type="programlisting" data-code-language="r">r_rock
|
||||
#> Error: object 'r_rock' not found
|
||||
R_rocks
|
||||
#> Error: object 'R_rocks' not found</pre>
|
||||
|
@ -95,17 +95,17 @@ R_rocks
|
|||
Calling functions</h1>
|
||||
<p>R has a large collection of built-in functions that are called like this:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">function_name(arg1 = val1, arg2 = val2, ...)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">function_name(arg1 = val1, arg2 = val2, ...)</pre>
|
||||
</div>
|
||||
<p>Let’s try using <code><a href="https://rdrr.io/r/base/seq.html">seq()</a></code>, which makes regular <strong>seq</strong>uences of numbers and, while we’re at it, learn more helpful features of RStudio. Type <code>se</code> and hit TAB. A popup shows you possible completions. Specify <code><a href="https://rdrr.io/r/base/seq.html">seq()</a></code> by typing more (a <code>q</code>) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.</p>
|
||||
<p>When you’ve selected the function you want, press TAB again. RStudio will add matching opening (<code>(</code>) and closing (<code>)</code>) parentheses for you. Type the arguments <code>1, 10</code> and hit return.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">seq(1, 10)
|
||||
<pre data-type="programlisting" data-code-language="r">seq(1, 10)
|
||||
#> [1] 1 2 3 4 5 6 7 8 9 10</pre>
|
||||
</div>
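<p><code>seq()</code> also accepts named arguments in the <code>arg = val</code> form shown above. As a short sketch, <code>by</code> sets the step between values and <code>length.out</code> sets how many values you get:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># every second number from 1 to 10
seq(1, 10, by = 2)
#> [1] 1 3 5 7 9

# five evenly spaced numbers from 0 to 1
seq(0, 1, length.out = 5)
#> [1] 0.00 0.25 0.50 0.75 1.00</pre>
</div>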
|
||||
<p>Type this code and notice that RStudio provides similar assistance with the paired quotation marks:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">x <- "hello world"</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">x <- "hello world"</pre>
|
||||
</div>
|
||||
<p>Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:</p>
|
||||
<pre><code>> x <- "hello
|
||||
|
@ -125,7 +125,7 @@ Exercises</h1>
|
|||
<ol type="1"><li>
|
||||
<p>Why does this code not work?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">my_variable <- 10
|
||||
<pre data-type="programlisting" data-code-language="r">my_variable <- 10
|
||||
my_varıable
|
||||
#> Error in eval(expr, envir, enclos): object 'my_varıable' not found</pre>
|
||||
</div>
|
||||
|
@ -134,7 +134,7 @@ my_varıable
|
|||
<li>
|
||||
<p>Tweak each of the following R commands so that they run correctly:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">libary(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">libary(tidyverse)
|
||||
|
||||
ggplot(dota = mpg) +
|
||||
geom_point(maping = aes(x = displ, y = hwy))</pre>
|
||||
|
|
|
@ -18,11 +18,11 @@ Making a reprex</h1>
|
|||
<li><p>The other 20% of time you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!</p></li>
|
||||
</ul><p>When creating a reprex by hand, it’s easy to accidentally miss something that means your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y <- 1:4
|
||||
<pre data-type="programlisting" data-code-language="r">y <- 1:4
|
||||
mean(y)</pre>
|
||||
</div>
|
||||
<p>Then call <code>reprex()</code>, where the default target venue is GitHub:</p>
|
||||
<pre data-type="programlisting" data-code-language="downlit">reprex::reprex()</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">reprex::reprex()</pre>
|
||||
<p>A nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The relevant bit of GitHub-flavored Markdown is ready to be pasted from your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):</p>
|
||||
<pre><code>``` r
|
||||
y <- 1:4
|
||||
|
@ -31,7 +31,7 @@ mean(y)
|
|||
```</code></pre>
|
||||
<p>Here’s what that Markdown would look like rendered in a GitHub issue:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">y <- 1:4
|
||||
<pre data-type="programlisting" data-code-language="r">y <- 1:4
|
||||
mean(y)
|
||||
#> [1] 2.5</pre>
|
||||
</div>
|
||||
|
|
|
@ -12,7 +12,7 @@
|
|||
Why use a pipe?</h1>
|
||||
<p>Each individual dplyr verb is quite simple, so solving complex problems typically requires combining multiple verbs. For example, the last chapter finished with a moderately complex pipe:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(!is.na(arr_delay), !is.na(tailnum)) |>
|
||||
group_by(tailnum) |>
|
||||
summarise(
|
||||
|
@ -23,7 +23,7 @@ Why use a pipe?</h1>
|
|||
<p>Even though this pipe has four steps, it’s easy to skim because the verbs come at the start of each line: start with the <code>flights</code> data, then filter, then group, then summarize.</p>
|
||||
<p>What would happen if we didn’t have the pipe? We could nest each function call inside the previous call:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">summarise(
|
||||
<pre data-type="programlisting" data-code-language="r">summarise(
|
||||
group_by(
|
||||
filter(
|
||||
flights,
|
||||
|
@ -38,7 +38,7 @@ Why use a pipe?</h1>
|
|||
</div>
|
||||
<p>Or we could use a bunch of intermediate variables:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
|
||||
<pre data-type="programlisting" data-code-language="r">flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
|
||||
flights2 <- group_by(flights1, tailnum)
|
||||
flights3 <- summarise(flights2,
|
||||
delay = mean(arr_delay, na.rm = TRUE),
|
||||
|
@ -53,7 +53,7 @@ flights3 <- summarise(flight2,
|
|||
magrittr and the <code>%>%</code> pipe</h1>
|
||||
<p>If you’ve been using the tidyverse for a while, you might be familiar with the <code>%>%</code> pipe provided by the <strong>magrittr</strong> package. The magrittr package is included in the core tidyverse, so you can use <code>%>%</code> whenever you load the tidyverse:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
|
||||
mtcars %>%
|
||||
group_by(cyl) %>%
|
||||
|
@ -78,7 +78,7 @@ mtcars %>%
|
|||
<p>The <code>|></code> placeholder is deliberately simple and can’t replicate many features of the <code>%>%</code> placeholder: you can’t pass it to multiple arguments, and it doesn’t have any special behavior when the placeholder is used inside another function. For example, <code>df %>% split(.$var)</code> is equivalent to <code>split(df, df$var)</code> and <code>df %>% {split(.$x, .$y)}</code> is equivalent to <code>split(df$x, df$y)</code>.</p>
|
||||
<p>With <code>%>%</code> you can use <code>.</code> on the left-hand side of operators like <code>$</code>, <code>[[</code>, <code>[</code> (which you’ll learn about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>), so you can extract a single column from a data frame with (e.g.) <code>mtcars %>% .$cyl</code>. A future version of R may add similar support for <code>|></code> and <code>_</code>. For the special case of extracting a column out of a data frame, you can also use <code><a href="https://dplyr.tidyverse.org/reference/pull.html">dplyr::pull()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">mtcars |> pull(cyl)
|
||||
<pre data-type="programlisting" data-code-language="r">mtcars |> pull(cyl)
|
||||
#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4</pre>
|
||||
</div>
|
||||
</li>
|
||||
|
|
|
@ -131,13 +131,13 @@ Where does your analysis live?</h2>
|
|||
</div>
|
||||
<p>And you can print this out in R code by running <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">getwd()
|
||||
<pre data-type="programlisting" data-code-language="r">getwd()
|
||||
#> [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>
|
||||
</div>
|
||||
<p>As a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re nine chapters into this book, and you’re no longer a rank beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, set R’s working directory to the associated directory.</p>
|
||||
<p>You can set the working directory from within R but <strong>we</strong> <strong>do not recommend it</strong>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">setwd("/path/to/my/CoolProject")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">setwd("/path/to/my/CoolProject")</pre>
|
||||
</div>
|
||||
<p>There’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the <strong>RStudio</strong> <strong>project</strong>.</p>
|
||||
</section>
|
||||
|
@ -170,12 +170,12 @@ RStudio projects</h2>
|
|||
<p>Call your project <code>r4ds</code> and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!</p>
|
||||
<p>Once this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">getwd()
|
||||
<pre data-type="programlisting" data-code-language="r">getwd()
|
||||
#> [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>
|
||||
</div>
|
||||
<p>Now enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Next, run the complete script which will save a PDF and CSV file into your project directory. Don’t worry about the details, you’ll learn them later in the book.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
|
||||
ggplot(diamonds, aes(carat, price)) +
|
||||
geom_hex()
|
||||
|
|
|
@ -7,7 +7,7 @@
|
|||
</figure>
|
||||
</div>
|
||||
</div><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(nycflights13)</pre>
|
||||
</div>
|
||||
<section id="names" data-type="sect1">
|
||||
|
@ -15,7 +15,7 @@ library(nycflights13)</pre>
|
|||
Names</h1>
|
||||
<p>We talked briefly about names in <a href="#sec-whats-in-a-name" data-type="xref">#sec-whats-in-a-name</a>. Remember that variable names (those created by <code><-</code> and those created by <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>) should use only lowercase letters, numbers, and <code>_</code>. Use <code>_</code> to separate words within a name.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Strive for:
|
||||
<pre data-type="programlisting" data-code-language="r"># Strive for:
|
||||
short_flights <- flights |> filter(air_time < 60)
|
||||
|
||||
# Avoid:
|
||||
|
@ -30,7 +30,7 @@ SHORTFLIGHTS <- flights |> filter(air_time < 60)</pre>
|
|||
Spaces</h1>
|
||||
<p>Put spaces on either side of mathematical operators apart from <code>^</code> (i.e., <code>+</code>, <code>-</code>, <code>==</code>, <code><</code>, …), and around the assignment operator (<code><-</code>).</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Strive for
|
||||
<pre data-type="programlisting" data-code-language="r"># Strive for
|
||||
z <- (a + b)^2 / d
|
||||
|
||||
# Avoid
|
||||
|
@ -38,7 +38,7 @@ z<-( a + b ) ^ 2/d</pre>
|
|||
</div>
|
||||
<p>Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in regular English.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Strive for
|
||||
<pre data-type="programlisting" data-code-language="r"># Strive for
|
||||
mean(x, na.rm = TRUE)
|
||||
|
||||
# Avoid
|
||||
|
@ -46,7 +46,7 @@ mean (x ,na.rm=TRUE)</pre>
|
|||
</div>
|
||||
<p>It’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, you might want to add spaces so that all the <code>=</code> line up. This makes it easier to skim the code.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
speed = distance / air_time,
|
||||
dep_hour = dep_time %/% 100,
|
||||
|
@ -60,7 +60,7 @@ mean (x ,na.rm=TRUE)</pre>
|
|||
Pipes</h1>
|
||||
<p><code>|></code> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 ft view by skimming the verbs on the left-hand side.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Strive for
|
||||
<pre data-type="programlisting" data-code-language="r"># Strive for
|
||||
flights |>
|
||||
filter(!is.na(arr_delay), !is.na(tailnum)) |>
|
||||
count(dest)
|
||||
|
@ -70,7 +70,7 @@ flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)</pre>
|
|||
</div>
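<p>The same layout can be tried without loading any packages. The sketch below is an assumption-laden stand-in: it uses the built-in <code>mtcars</code> data and the base functions <code>subset()</code>, <code>transform()</code>, and <code>head()</code> rather than the dplyr verbs used in this chapter, and the column name <code>kpl</code> is hypothetical. It needs R 4.1 or later for the native pipe.</p>

```r
# One step per line, with |> at the end of each line.
# Reading down the left-hand side gives the overview: subset, transform, head.
result <- mtcars |>
  subset(cyl == 4) |>
  transform(kpl = mpg * 0.425144) |>  # miles per US gallon to km per litre
  head(3)

# Adding, removing, or reordering a step touches exactly one line.
result
```

<p>Because each step sits on its own line, commenting one out or inserting another does not disturb the rest of the pipeline.</p>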
|
||||
<p>If the function you’re piping into has named arguments (like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>), put each argument on a new line. If the function doesn’t have named arguments (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>) keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Strive for
|
||||
<pre data-type="programlisting" data-code-language="r"># Strive for
|
||||
flights |>
|
||||
group_by(tailnum) |>
|
||||
summarize(
|
||||
|
@ -87,7 +87,7 @@ flights |>
|
|||
</div>
|
||||
<p>After the first step of the pipeline, indent each line by two spaces. If you’re putting each argument on its own line, indent by an extra two spaces. Make sure <code>)</code> is on its own line, and un-indented to match the horizontal position of the function name.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Strive for
|
||||
<pre data-type="programlisting" data-code-language="r"># Strive for
|
||||
flights |>
|
||||
group_by(tailnum) |>
|
||||
summarize(
|
||||
|
@ -112,7 +112,7 @@ flights|>
|
|||
</div>
|
||||
<p>It’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># This fits compactly on one line
|
||||
<pre data-type="programlisting" data-code-language="r"># This fits compactly on one line
|
||||
df |> mutate(y = x + 1)
|
||||
|
||||
# While this takes up 4x as many lines, it's easily extended to
|
||||
|
@ -130,7 +130,7 @@ df |>
|
|||
ggplot2</h1>
|
||||
<p>The same basic rules that apply to the pipe also apply to ggplot2; just treat <code>+</code> the same way as <code>|></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
summarize(
|
||||
delay = mean(arr_delay, na.rm = TRUE)
|
||||
|
@ -141,7 +141,7 @@ ggplot2</h1>
|
|||
</div>
|
||||
<p>Again, if you can’t fit all of the arguments to a function onto a single line, put each argument on its own line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarize(
|
||||
distance = mean(distance),
|
||||
|
@ -164,7 +164,7 @@ ggplot2</h1>
|
|||
Sectioning comments</h1>
|
||||
<p>As your scripts get longer, you can use <strong>sectioning</strong> comments to break up your file into manageable pieces:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit"># Load data --------------------------------------
|
||||
<pre data-type="programlisting" data-code-language="r"># Load data --------------------------------------
|
||||
|
||||
# Plot data --------------------------------------</pre>
|
||||
</div>
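<p>As a rough sketch of how such headers might organize a short script, the example below fills in each section with base R code. The object names and the choice of the built-in <code>mtcars</code> data are assumptions made for illustration only.</p>

```r
# Load data --------------------------------------
cars_data <- mtcars  # built-in dataset, standing in for a real import step

# Summarize data ---------------------------------
# Mean miles per gallon for each number of cylinders.
mean_mpg <- tapply(cars_data$mpg, cars_data$cyl, mean)

# Report results ---------------------------------
print(round(mean_mpg, 1))
```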
|
||||
|
@ -185,7 +185,7 @@ Exercises</h1>
|
|||
<ol type="1"><li>
|
||||
<p>Restyle the following pipelines following the guidelines above.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
|
||||
<pre data-type="programlisting" data-code-language="r">flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
|
||||
|
||||
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)</pre>
|
||||
</div>
|
||||
|
|