Re-render book for O'Reilly
|
@ -1,2 +0,0 @@
|
|||
*.png
|
||||
*.jpg
|
276
oreilly/EDA.html
|
@ -30,7 +30,7 @@ Questions</h1>
|
|||
<p>“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey</p>
|
||||
</blockquote>
|
||||
<p>Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.</p>
|
||||
<p>EDA is fundamentally a creative process. And like most creative processes, the key to asking <em>quality</em> questions is to generate a large <em>quantity</em> of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.</p>
|
||||
<p>EDA is fundamentally a creative process. And like most creative processes, the key to asking <em>quality</em> questions is to generate a large <em>quantity</em> of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.</p>
|
||||
<p>There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:</p>
|
||||
<ol type="1"><li><p>What type of variation occurs within my variables?</p></li>
|
||||
<li><p>What type of covariation occurs between my variables?</p></li>
|
||||
|
@ -45,81 +45,16 @@ Questions</h1>
|
|||
<section id="variation" data-type="sect1">
|
||||
<h1>
|
||||
Variation</h1>
|
||||
<p><strong>Variation</strong> is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.</p>
|
||||
|
||||
<section id="visualizing-distributions" data-type="sect2">
|
||||
<h2>
|
||||
Visualizing distributions</h2>
|
||||
<p>How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is <strong>categorical</strong> if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, you can use a bar chart:</p>
|
||||
<p><strong>Variation</strong> is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values, which you’ve learned about in <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a>.</p>
|
||||
<p>We’ll start our exploration by visualizing the distribution of weights (<code>carat</code>) of ~54,000 diamonds from the <code>diamonds</code> dataset. Since <code>carat</code> is a numerical variable, we can use a histogram:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (approximately 12500), Premium (approximately 14000), and Ideal (approximately 21500)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut)
|
||||
#> # A tibble: 5 × 2
|
||||
#> cut n
|
||||
#> <ord> <int>
|
||||
#> 1 Fair 1610
|
||||
#> 2 Good 4906
|
||||
#> 3 Very Good 12082
|
||||
#> 4 Premium 13791
|
||||
#> 5 Ideal 21551</pre>
|
||||
</div>
|
||||
<p>A variable is <strong>continuous</strong> if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, you can use a histogram:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can compute this by hand by combining <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut_width(carat, 0.5))
|
||||
#> # A tibble: 11 × 2
|
||||
#> `cut_width(carat, 0.5)` n
|
||||
#> <fct> <int>
|
||||
#> 1 [-0.25,0.25] 785
|
||||
#> 2 (0.25,0.75] 29498
|
||||
#> 3 (0.75,1.25] 15977
|
||||
#> 4 (1.25,1.75] 5313
|
||||
#> 5 (1.75,2.25] 2002
|
||||
#> 6 (2.25,2.75] 322
|
||||
#> # … with 5 more rows</pre>
|
||||
</div>
|
||||
<p>A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. Note that even though it’s not possible to have a <code>carat</code> value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0. This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25. The tallest bar shows that almost 30,000 observations have a <code>carat</code> value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5.</p>
|
||||
<p>You can set the width of the intervals in a histogram with the <code>binwidth</code> argument, which is measured in the units of the <code>x</code> variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">smaller <- diamonds |>
|
||||
filter(carat < 3)
|
||||
|
||||
ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.1)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), resulting in many bars. The distribution is right skewed but there are lots of ups and downs in the heights of the bins, creating a jagged outline." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than overlapping bars.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
|
||||
geom_freqpoly(binwidth = 0.1, size = 0.75)
|
||||
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
|
||||
#> ℹ Please use `linewidth` instead.</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid" alt="A frequency polygon of carats of diamonds where each cut of diamond (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 3 and the y-axis ranges from 0 to almost 6000. Ideal diamonds have a much higher peak than the others around 0.25 carats. All cuts of diamonds have right skewed distributions with local peaks at 1 carat and 2 carats. As the cut level increases (from Fair to Ideal), so does the number of diamonds that fall into that category." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>We’ve also customized the thickness of the lines using the <code>size</code> argument in order to make them stand out a bit more against the background.</p>
|
||||
<p>There are a few challenges with this type of plot, which we will come back to in <a href="#sec-cat-cont" data-type="xref">#sec-cat-cont</a> on visualizing a categorical and a continuous variable.</p>
|
||||
<p>Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).</p>
|
||||
</section>
|
||||
|
||||
<section id="typical-values" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -132,10 +67,13 @@ Typical values</h2>
|
|||
<ul><li><p>Why are there more diamonds at whole carats and common fractions of carats?</p></li>
|
||||
<li><p>Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?</p></li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
<pre data-type="programlisting" data-code-language="r">smaller <- diamonds |>
|
||||
filter(carat < 3)
|
||||
|
||||
ggplot(smaller, aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.01)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-9-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:</p>
|
||||
|
@ -145,10 +83,10 @@ Typical values</h2>
|
|||
<li><p>Why might the appearance of clusters be misleading?</p></li>
|
||||
</ul><p>The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(x = eruptions)) +
|
||||
geom_histogram(binwidth = 0.25)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid" alt="A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Many of the questions above will prompt you to explore a relationship <em>between</em> variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.</p>
|
||||
|
@ -159,19 +97,19 @@ Typical values</h2>
|
|||
Unusual values</h2>
|
||||
<p>Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the <code>y</code> variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = y)) +
|
||||
geom_histogram(binwidth = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = y)) +
|
||||
geom_histogram(binwidth = 0.5) +
|
||||
coord_cartesian(ylim = c(0, 50))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> also has an <code>xlim</code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
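<p>The difference between zooming and dropping data can be checked directly. The following is a minimal sketch, not part of the original chapter: it builds the same histogram twice and uses ggplot2's <code>layer_data()</code> helper to total the counts behind each plot. The object names <code>zoomed</code>, <code>clipped</code>, <code>n_zoomed</code>, and <code>n_clipped</code> are made up for illustration.</p>

```r
library(ggplot2)

# Zoom with coord_cartesian(): bins are computed from every row,
# and only the visible window is narrowed afterwards.
zoomed <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(xlim = c(0, 10))

# Restrict with xlim(): rows with y outside [0, 10] are discarded
# (with a warning) before the bins are computed.
clipped <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  xlim(0, 10)

# Total number of observations that went into each set of bars
n_zoomed  <- sum(layer_data(zoomed)$count)
n_clipped <- sum(layer_data(clipped)$count)

# n_zoomed matches nrow(diamonds); n_clipped is smaller because
# the diamonds with y above 10 were thrown away.
c(all_rows = nrow(diamonds), zoomed = n_zoomed, clipped = n_clipped)
```

<p>Because <code>xlim()</code> removes rows before the statistic is computed, it can change bar heights as well as the visible range, which is worth keeping in mind when the goal is only to zoom.</p>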
|
||||
|
@ -205,13 +143,13 @@ Exercises</h2>
|
|||
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
|
||||
<li><p>Explore the distribution of <code>price</code>. Do you discover anything unusual or surprising? (Hint: Carefully think about the <code>binwidth</code> and make sure you try a wide range of values.)</p></li>
|
||||
<li><p>How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?</p></li>
|
||||
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try to zoom so only half a bar shows?</p></li>
|
||||
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try to zoom so only half a bar shows?</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="sec-missing-values-eda" data-type="sect1">
|
||||
<h1>
|
||||
Missing values</h1>
|
||||
Unusual values</h1>
|
||||
<p>If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.</p>
|
||||
<ol type="1"><li>
|
||||
<p>Drop the entire row with the strange values:</p>
|
||||
|
@ -228,19 +166,19 @@ Missing values</h1>
|
|||
mutate(y = if_else(y < 3 | y > 20, NA, y))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three arguments. The first argument, <code>test</code>, should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another.</p>
|
||||
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
|
||||
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three arguments. The first argument, <code>test</code>, should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another. You will learn more about logical vectors in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
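<p>To illustrate the <code>case_when()</code> alternative, here is a small sketch that labels each diamond's <code>y</code> value instead of replacing it with <code>NA</code>. It is not from the original chapter: the column name <code>y_status</code>, the labels, and the object name <code>diamonds_flagged</code> are hypothetical, and the thresholds simply reuse the 3 and 20 cutoffs from the <code>if_else()</code> example.</p>

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds dataset

diamonds_flagged <- diamonds |>
  mutate(
    # Conditions are checked in order; the first one that is TRUE wins.
    y_status = case_when(
      y == 0 ~ "zero",        # a width of 0 mm is physically impossible
      y < 3  ~ "too small",
      y > 20 ~ "too large",
      TRUE   ~ "plausible"    # fallback for every remaining row
    )
  )

# Tabulate how many diamonds fall into each label
diamonds_flagged |>
  count(y_status)
```

<p>Writing the same logic with <code>if_else()</code> would need three nested calls, which is the situation where <code>case_when()</code> tends to be easier to read.</p>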
|
||||
<p>It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds2, aes(x = x, y = y)) +
|
||||
geom_point()
|
||||
#> Warning: Removed 9 rows containing missing values (`geom_point()`).</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a strong, linear association between the two variables. All but one of the diamonds has length greater than 3. The one outlier has a length of 0 and a width of about 6.5." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-13-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a strong, linear association between the two variables. All but one of the diamonds has length greater than 3. The one outlier has a length of 0 and a width of about 6.5." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>To suppress that warning, set <code>na.rm = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds2, aes(x = x, y = y)) +
|
||||
geom_point(na.rm = TRUE)</pre>
|
||||
</div>
|
||||
<p>Other times you want to understand what makes observations with missing values different from observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
|
||||
|
@ -252,10 +190,10 @@ Missing values</h1>
|
|||
sched_min = sched_dep_time %% 100,
|
||||
sched_dep_time = sched_hour + (sched_min / 60)
|
||||
) |>
|
||||
ggplot(mapping = aes(sched_dep_time)) +
|
||||
geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)</pre>
|
||||
ggplot(aes(x = sched_dep_time)) +
|
||||
geom_freqpoly(aes(color = cancelled), binwidth = 1/4)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A frequency polygon of scheduled departure times of flights. Two lines represent flights that are cancelled and not cancelled. The x-axis ranges from 0 to 25 hours and the y-axis ranges from 0 to 10000. The number of flights not cancelled is much higher than the number cancelled." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-15-1.png" class="img-fluid" alt="A frequency polygon of scheduled departure times of flights. Two lines represent flights that are cancelled and not cancelled. The x-axis ranges from 0 to 25 hours and the y-axis ranges from 0 to 10000. The number of flights not cancelled is much higher than the number cancelled." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>
|
||||
|
@ -271,80 +209,73 @@ Exercises</h2>
|
|||
<section id="covariation" data-type="sect1">
|
||||
<h1>
|
||||
Covariation</h1>
|
||||
<p>If variation describes the behavior <em>within</em> a variable, covariation describes the behavior <em>between</em> variables. <strong>Covariation</strong> is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that depends again on the types of variables involved.</p>
|
||||
<p>If variation describes the behavior <em>within</em> a variable, covariation describes the behavior <em>between</em> variables. <strong>Covariation</strong> is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.</p>
|
||||
|
||||
<section id="sec-cat-cont" data-type="sect2">
|
||||
<section id="sec-cat-num" data-type="sect2">
|
||||
<h2>
|
||||
A categorical and continuous variable</h2>
|
||||
<p>It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
|
||||
A categorical and a numerical variable</h2>
|
||||
<p>For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = price)) +
|
||||
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
|
||||
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of diamond (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of diamond (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>It’s hard to see the difference in distribution because the overall counts differ so much:</p>
|
||||
<p>The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count, and the overall counts for each level of <code>cut</code> differ so much that it’s hard to see the differences in the shapes of their distributions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
|
||||
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
|
||||
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that we’re mapping the density to <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="https://ggplot2.tidyverse.org/reference/aes_eval.html">after_stat()</a></code> function to do so.</p>
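<p>To make the standardization concrete, here is a minimal sketch that computes a comparable density by hand with dplyr. It assumes bins whose left edges fall on multiples of the bin width, which may not match the bin alignment that <code>geom_freqpoly()</code> chooses by default, and the names <code>binwidth</code>, <code>bin</code>, and <code>density_by_cut</code> are our own illustrative choices:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(ggplot2)

binwidth <- 500

density_by_cut <- diamonds |>
  # Assign each diamond to a price bin, labeled by the bin's left edge
  # (an assumed alignment, not necessarily the one ggplot2 uses)
  mutate(bin = floor(price / binwidth) * binwidth) |>
  count(cut, bin) |>
  group_by(cut) |>
  # Density = count / (group total * bin width), so each group's area is 1
  mutate(density = n / (sum(n) * binwidth)) |>
  ungroup()

# Check: the area for each cut (sum of density * bin width) should be 1
density_by_cut |>
  group_by(cut) |>
  summarize(area = sum(density * binwidth))</pre>
</div>
<p>Because every group is rescaled to the same total area, a small group such as Fair is no longer dwarfed by a large group such as Ideal.</p>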
|
||||
<p>There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.</p>
|
||||
<p>Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A <strong>boxplot</strong> is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:</p>
|
||||
<ul><li><p>A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.</p></li>
|
||||
<li><p>Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.</p></li>
|
||||
<li><p>A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.</p></li>
|
||||
</ul><div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="images/EDA-boxplot.png" class="img-fluid" alt="A diagram depicting how a boxplot is created following the steps outlined above." width="1066"/></p>
|
||||
</div>
|
||||
</div>
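<p>The components listed above can be sketched numerically for a single group. This is an illustration under our own assumptions, not the exact code <code>geom_boxplot()</code> runs: it uses the default <code>quantile()</code> method, and the variable names (<code>fair</code>, <code>lower_fence</code>, and so on) are hypothetical.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Prices of Fair diamonds only
fair <- diamonds$price[diamonds$cut == "Fair"]

# The box: 25th percentile, median, and 75th percentile
q <- quantile(fair, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Points more than 1.5 * IQR beyond either edge of the box are outliers
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr
is_outlier <- fair < lower_fence | fair > upper_fence

# The whiskers reach the farthest non-outlier point on each side
whiskers <- range(fair[!is_outlier])

q
whiskers
sum(is_outlier)</pre>
</div>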
|
||||
<p>Let’s take a look at the distribution of price by cut using <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>:</p>
|
||||
<p>A visually simpler way to explore this relationship is with side-by-side boxplots.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = price)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-19-1.png" class="img-fluid" alt="Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you’ll be challenged to figure out why.</p>
|
||||
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://rdrr.io/r/stats/reorder.factor.html">reorder()</a></code> function.</p>
|
||||
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> function.</p>
|
||||
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = class, y = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>To make the trend easier to see, we can reorder <code>class</code> based on the median value of <code>hwy</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg,
|
||||
mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg,
|
||||
aes(x = fct_reorder(class, hwy, median), y = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p>
|
||||
</div>
|
||||
</div>
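<p>To see what the reordering does before plotting, you can inspect the factor levels directly. The following is a small sketch; <code>class_by_hwy</code> is a name we made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(forcats)

# Median highway mileage per class, sorted from lowest to highest
sort(tapply(mpg$hwy, mpg$class, median))

# fct_reorder() keeps the values but orders the levels by that median
class_by_hwy <- fct_reorder(mpg$class, mpg$hwy, median)
levels(class_by_hwy)</pre>
</div>
<p>The level order, not the data itself, determines the order of the boxes along the axis.</p>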
|
||||
<p>If you have long variable names, <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg,
|
||||
mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg,
|
||||
aes(x = hwy, y = fct_reorder(class, hwy, median))) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-28-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the y-axis and ordered by increasing median highway mileage." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the y-axis and ordered by increasing median highway mileage." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
@ -354,8 +285,8 @@ Exercises</h3>
|
|||
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
|
||||
<li><p>What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?</p></li>
|
||||
<li><p>Instead of exchanging the x and y variables, add <code><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html">coord_flip()</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li>
|
||||
<li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs cut. What do you learn? How do you interpret the plots?</p></li>
|
||||
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_violin.html">geom_violin()</a></code> with a faceted <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, or a coloured <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>. What are the pros and cons of each method?</p></li>
|
||||
<li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?</p></li>
|
||||
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_violin.html">geom_violin()</a></code> with a faceted <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, or a colored <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>. What are the pros and cons of each method?</p></li>
|
||||
<li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>. List them and briefly describe what each one does.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
@ -365,29 +296,13 @@ Exercises</h3>
|
|||
Two categorical variables</h2>
|
||||
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = color)) +
|
||||
geom_count()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-29-1.png" class="img-fluid" alt="A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.</p>
|
||||
<p>A more commonly used way of representing the covariation between two categorical variables is a segmented bar chart. To create one, we map the variable that defines the main groups to the <code>x</code> aesthetic and the variable that subdivides each group to the <code>fill</code> aesthetic.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-30-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The number of diamonds for each level of cut increases from Fair to Ideal and the heights of the segments within each bar represent the number of diamonds that fall within each color/cut combination. There appear to be some of each color of diamonds within each level of cut of diamonds." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
geom_bar(position = "fill")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-31-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The heights of each of the bars representing each cut of diamond are the same, 1. The heights of the segments within each bar represent the proportion of diamonds that fall within each color/cut combination. The proportions don't appear to be very different across the levels of cut." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Another approach for exploring the relationship between these variables is computing the counts with dplyr:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
|
@ -407,10 +322,10 @@ Two categorical variables</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(color, cut) |>
|
||||
ggplot(mapping = aes(x = color, y = cut)) +
|
||||
geom_tile(mapping = aes(fill = n))</pre>
|
||||
ggplot(aes(x = color, y = cut)) +
|
||||
geom_tile(aes(fill = n))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-33-1.png" class="img-fluid" alt="A tile plot of cut vs. color of diamonds. Each tile represents a cut/color combination and tiles are colored according to the number of observations in each tile. There are more Ideal diamonds than other cuts, with the highest number being Ideal diamonds with color G. Fair diamonds and diamonds with color I are the lowest in frequency." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="A tile plot of cut vs. color of diamonds. Each tile represents a cut/color combination and tiles are colored according to the number of observations in each tile. There are more Ideal diamonds than other cuts, with the highest number being Ideal diamonds with color G. Fair diamonds and diamonds with color I are the lowest in frequency." width="576"/></p>
|
||||
</div>
|
||||
</div>
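<p>Raw counts in a tile plot are dominated by how common each cut is. One possible variation, sketched here under our own naming (<code>color_within_cut</code> and <code>prop</code> are hypothetical), is to fill the tiles with the proportion of each color within a cut instead:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(ggplot2)

color_within_cut <- diamonds |>
  count(cut, color) |>
  group_by(cut) |>
  # Share of each color among diamonds of the same cut
  mutate(prop = n / sum(n)) |>
  ungroup()

ggplot(color_within_cut, aes(x = color, y = cut)) +
  geom_tile(aes(fill = prop))</pre>
</div>
<p>With this scaling, each row of tiles sums to one, so rows can be compared even though the cuts differ greatly in size.</p>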
|
||||
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>
|
||||
|
@ -425,33 +340,33 @@ Exercises</h3>
|
|||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="two-continuous-variables" data-type="sect2">
|
||||
<section id="two-numerical-variables" data-type="sect2">
|
||||
<h2>
|
||||
Two continuous variables</h2>
|
||||
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
|
||||
Two numerical variables</h2>
|
||||
<p>You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-34-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the <code>alpha</code> aesthetic to add transparency.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
|
||||
geom_point(alpha = 1 / 100)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> to bin in two dimensions.</p>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> creates rectangular bins. <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code>.</p>
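<p>The idea behind rectangular binning can be sketched by counting points per 2d bin by hand. This is only an illustration: the bin widths (0.25 carats and 1000 dollars) and the names <code>bins_2d</code>, <code>carat_bin</code>, <code>price_bin</code>, and <code>n_points</code> are our own choices, and <code>smaller</code> is assumed to be the diamonds with <code>carat</code> below 3.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(ggplot2)

# Assumed definition of `smaller`: diamonds under 3 carats
smaller <- diamonds |> filter(carat < 3)

bins_2d <- smaller |>
  count(
    carat_bin = cut_width(carat, 0.25),   # bins along the x direction
    price_bin = cut_width(price, 1000),   # bins along the y direction
    name = "n_points"
  ) |>
  arrange(desc(n_points))

bins_2d</pre>
</div>
<p>Each row is one occupied rectangle, and <code>n_points</code> plays the role of the fill color in the binned plot.</p>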
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(smaller, aes(x = carat, y = price)) +
|
||||
geom_bin2d()
|
||||
|
||||
# install.packages("hexbin")
|
||||
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
ggplot(smaller, aes(x = carat, y = price)) +
|
||||
geom_hex()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
|
||||
|
@ -459,37 +374,37 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
|||
</div>
|
||||
<p>Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin <code>carat</code> and then for each group, display a boxplot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(smaller, aes(x = carat, y = price)) +
|
||||
geom_boxplot(aes(group = cut_width(carat, 0.1)))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-29-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p>
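<p>As a sketch of the <code>varwidth</code> option, the earlier plot can be redrawn with box widths that reflect bin sizes. Here <code>smaller</code> is assumed to be the diamonds with <code>carat</code> below 3, and <code>p_varwidth</code> is a hypothetical name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(ggplot2)

# Assumed definition of `smaller`: diamonds under 3 carats
smaller <- diamonds |> filter(carat < 3)

# Wider boxes summarize more diamonds; narrow boxes summarize few
p_varwidth <- ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)

p_varwidth</pre>
</div>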
|
||||
<p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(smaller, aes(x = carat, y = price)) +
|
||||
geom_boxplot(aes(group = cut_number(carat, 20)))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-38-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. The diamonds are divided into 20 bins that each contain approximately the same number of diamonds, with one box plot per bin. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-30-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. The diamonds are divided into 20 bins that each contain approximately the same number of diamonds, with one box plot per bin. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
</div>
|
||||
</div>
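<p>The difference between the two binning helpers shows up when you tabulate how many diamonds land in each bin. This is a small sketch, again assuming <code>smaller</code> holds the diamonds with <code>carat</code> below 3; <code>width_counts</code> and <code>number_counts</code> are names we introduce for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(ggplot2)

# Assumed definition of `smaller`: diamonds under 3 carats
smaller <- diamonds |> filter(carat < 3)

# Equal-width bins: the number of diamonds per bin varies a lot
width_counts <- table(cut_width(smaller$carat, 0.1))
range(width_counts)

# Equal-count bins: 20 bins with roughly the same number of diamonds
number_counts <- table(cut_number(smaller$carat, 20))
range(number_counts)</pre>
</div>
<p>Ties in <code>carat</code> mean the equal-count bins are only approximately equal, but they are typically far more balanced than the equal-width bins.</p>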
|
||||
|
||||
<section id="exercises-4" data-type="sect3">
|
||||
<h3>
|
||||
Exercises</h3>
|
||||
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
|
||||
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
|
||||
<li><p>Visualize the distribution of <code>carat</code>, partitioned by <code>price</code>.</p></li>
|
||||
<li><p>How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?</p></li>
|
||||
<li><p>Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.</p></li>
|
||||
<li>
|
||||
<p>Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of <code>x</code> and <code>y</code> values, which makes the points outliers even though their <code>x</code> and <code>y</code> values appear normal when examined separately.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = x, y = y)) +
|
||||
geom_point() +
|
||||
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-39-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a positive, strong, linear relationship. There are a few unusual observations above and below the bulk of the data, more below it than above." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-31-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a positive, strong, linear relationship. There are a few unusual observations above and below the bulk of the data, more below it than above." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Why is a scatterplot a better display than a binned plot for this case?</p>
|
||||
|
@ -509,10 +424,10 @@ Patterns and models</h1>
|
|||
<li><p>Does the relationship change if you look at individual subgroups of the data?</p></li>
|
||||
</ul><p>A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(x = eruptions, y = waiting)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-40-1.png" class="img-fluid" alt="A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-32-1.png" class="img-fluid" alt="A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.</p>
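<p>As a rough illustration of the idea that covariation reduces uncertainty, the sketch below compares the spread of <code>waiting</code> on its own with the spread left over after a simple linear fit on <code>eruptions</code>. The use of <code>lm()</code> here is our own choice for illustration, and the variable names <code>sd_raw</code> and <code>sd_resid</code> are made up for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Spread of waiting times when we know nothing else
sd_raw <- sd(faithful$waiting)

# Spread that remains once eruption length is taken into account
# (a plain linear fit, used purely as an illustration)
fit <- lm(waiting ~ eruptions, data = faithful)
sd_resid <- sd(residuals(fit))

c(raw = sd_raw, after_eruptions = sd_resid)</pre>
</div>
<p>When two variables covary, the second number should be noticeably smaller than the first: knowing the eruption length narrows down the plausible waiting times.</p>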
|
||||
|
@ -532,46 +447,23 @@ diamonds_fit <- linear_reg() |>
|
|||
diamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>
|
||||
mutate(.resid = exp(.resid))
|
||||
|
||||
ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
|
||||
ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-41-1.png" class="img-fluid" alt="A scatter plot of residuals vs. carat of diamonds. The x-axis ranges from 0 to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered around low values of carat and residuals. There is a clear, curved pattern showing decrease in residuals as carat increases." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-33-1.png" class="img-fluid" alt="A scatter plot of residuals vs. carat of diamonds. The x-axis ranges from 0 to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered around low values of carat and residuals. There is a clear, curved pattern showing decrease in residuals as carat increases." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds_aug, aes(x = cut, y = .resid)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-42-1.png" class="img-fluid" alt="Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end." width="576"/></p>
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-34-1.png" class="img-fluid" alt="Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
|
||||
</section>
|
||||
|
||||
<section id="ggplot2-calls" data-type="sect1">
|
||||
<h1>
|
||||
ggplot2 calls</h1>
|
||||
<p>As we move on from these introductory chapters, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
geom_freqpoly(binwidth = 0.25)</pre>
|
||||
</div>
|
||||
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
|
||||
<p>Rewriting the previous plot more concisely yields:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(eruptions)) +
|
||||
geom_freqpoly(binwidth = 0.25)</pre>
|
||||
</div>
|
||||
<p>Sometimes we’ll turn the end of a pipeline of data transformation into a plot. Watch for the transition from <code>|></code> to <code>+</code>. We wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut, clarity) |>
|
||||
ggplot(aes(clarity, cut, fill = n)) +
|
||||
geom_tile()</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
|
|
|
@ -0,0 +1,283 @@
|
|||
<section data-type="chapter" id="chp-arrow">
|
||||
<h1><span id="sec-arrow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Arrow</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the <a href="https://parquet.apache.org/">parquet format</a>, an open standards-based format widely used by big data systems.</p>
|
||||
<p>We’ll pair parquet files with <a href="https://arrow.apache.org">Apache Arrow</a>, a multi-language toolbox designed for efficient analysis and transport of large data sets. We’ll use Apache Arrow via the <a href="https://arrow.apache.org/docs/r/">arrow package</a>, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.</p>
|
||||
<p>Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you because the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works best for you.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(arrow)</pre>
|
||||
</div>
|
||||
<p>Later in the chapter, we’ll also see some connections between arrow and duckdb, so we’ll also need dbplyr and duckdb.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(dbplyr, warn.conflicts = FALSE)
|
||||
library(duckdb)
|
||||
#> Loading required package: DBI</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="getting-the-data" data-type="sect1">
|
||||
<h1>
|
||||
Getting the data</h1>
|
||||
<p>We begin by getting a dataset worthy of these tools: a record of item checkouts from Seattle public libraries, available online at <a href="https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6">data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6</a>. This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.</p>
|
||||
<p>The following code will get you a cached copy of the data. The data is a 9GB CSV file, so it will take some time to download: simply getting the data is often the first challenge!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">dir.create("data", showWarnings = FALSE)
|
||||
url <- "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv"
|
||||
|
||||
# Default timeout is 60s; bump it up to an hour
|
||||
options(timeout = 60 * 60)
|
||||
download.file(url, "data/seattle-library-checkouts.csv")</pre>
|
||||
</div>
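<p>Because the download is so large, you may want to avoid repeating it by accident. The following is a minimal sketch of one way to do that; <code>download_once()</code> is a hypothetical helper written for this example, not a function from any package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: only download when the file isn't already on disk
download_once <- function(url, destfile, timeout = 60 * 60) {
  if (file.exists(destfile)) {
    # Nothing to do; return the existing path
    return(invisible(destfile))
  }

  # Raise the timeout for this call only, then restore the old setting
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)

  # mode = "wb" writes the bytes unchanged on every platform
  download.file(url, destfile, mode = "wb")
  invisible(destfile)
}</pre>
</div>
<p>With this in place, calling <code>download_once(url, "data/seattle-library-checkouts.csv")</code> a second time should return immediately instead of fetching the file again.</p>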
|
||||
</section>
|
||||
|
||||
<section id="opening-a-dataset" data-type="sect1">
|
||||
<h1>
|
||||
Opening a dataset</h1>
|
||||
<p>Let’s start by taking a look at the data. At 9GB, this file is large enough that we probably don’t want to load the whole thing into memory. A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16GB. This means we want to avoid <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and instead use <code><a href="https://arrow.apache.org/docs/r/reference/open_dataset.html">arrow::open_dataset()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># partial schema for ISBN column only
|
||||
opts <- CsvConvertOptions$create(col_types = schema(ISBN = string()))
|
||||
|
||||
seattle_csv <- open_dataset(
|
||||
sources = "data/seattle-library-checkouts.csv",
|
||||
format = "csv",
|
||||
convert_options = opts
|
||||
)</pre>
|
||||
</div>
|
||||
<p>(Here we’ve had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first ~83,000 rows don’t contain any data so arrow guesses the wrong types. The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)</p>
|
||||
<p>What happens when this code is run? <code><a href="https://arrow.apache.org/docs/r/reference/open_dataset.html">open_dataset()</a></code> will scan a few thousand rows to figure out the structure of the data set. Then it records what it’s found and stops; it will only read further rows as you specifically request them. This metadata is what we see if we print <code>seattle_csv</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_csv
|
||||
#> FileSystemDataset with 1 csv file
|
||||
#> UsageClass: string
|
||||
#> CheckoutType: string
|
||||
#> MaterialType: string
|
||||
#> CheckoutYear: int64
|
||||
#> CheckoutMonth: int64
|
||||
#> Checkouts: int64
|
||||
#> Title: string
|
||||
#> ISBN: string
|
||||
#> Creator: string
|
||||
#> Subjects: string
|
||||
#> Publisher: string
|
||||
#> PublicationYear: string</pre>
|
||||
</div>
|
||||
<p>The first line in the output tells you that <code>seattle_csv</code> is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed. The remainder of the output tells you the column type that arrow has imputed for each column.</p>
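<p>If you’d like to try this behaviour without the 9GB download, the sketch below does the same thing with a tiny CSV file. It assumes only that the arrow package is installed; the temporary file and the <code>small_csv</code> name are made up for the example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(arrow)

# Write a small built-in data frame to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Opening the dataset records the schema but does not load the rows
small_csv <- open_dataset(csv_path, format = "csv")

small_csv$schema  # column names and the types arrow inferred
nrow(small_csv)   # row count, which requires scanning the file</pre>
</div>
<p>The same thing happens, on a much larger scale, with <code>seattle_csv</code>.</p>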
|
||||
<p>We can see what’s actually in <code>seattle_csv</code> with <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>. This reveals that there are ~41 million rows and 12 columns, and shows us a few values.</p>
|
||||
<div class="cell" data-hash="arrow_cache/html/glimpse-data_07c924738790eb185ebdd8973443e90d">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_csv |> glimpse()
|
||||
#> FileSystemDataset with 1 csv file
|
||||
#> 41,389,465 rows x 12 columns
|
||||
#> $ UsageClass <string> "Physical", "Physical", "Digital", "Physical", "Ph…
|
||||
#> $ CheckoutType <string> "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
|
||||
#> $ MaterialType <string> "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
|
||||
#> $ CheckoutYear <int64> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
|
||||
#> $ CheckoutMonth <int64> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
|
||||
#> $ Checkouts <int64> 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
|
||||
#> $ Title <string> "Super rich : a guide to having it all / Russell S…
|
||||
#> $ ISBN <string> "", "", "", "", "", "", "", "", "", "", "", "", ""…
|
||||
#> $ Creator <string> "Simmons, Russell", "Barclay, James, 1965-", "Tim …
|
||||
#> $ Subjects <string> "Self realization, Conduct of life, Attitude Psych…
|
||||
#> $ Publisher <string> "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
|
||||
#> $ PublicationYear <string> "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…</pre>
|
||||
</div>
|
||||
<p>We can start to use this dataset with dplyr verbs, using <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> to force arrow to perform the computation and return some data. For example, this code tells us the total number of checkouts per year:</p>
|
||||
<div class="cell" data-hash="arrow_cache/html/unnamed-chunk-5_7a5e1ce0bed4d69e849dff75d0c0d8d3">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_csv |>
|
||||
count(CheckoutYear, wt = Checkouts) |>
|
||||
arrange(CheckoutYear) |>
|
||||
collect()
|
||||
#> # A tibble: 18 × 2
|
||||
#> CheckoutYear n
|
||||
#> <int> <int>
|
||||
#> 1 2005 3798685
|
||||
#> 2 2006 6599318
|
||||
#> 3 2007 7126627
|
||||
#> 4 2008 8438486
|
||||
#> 5 2009 9135167
|
||||
#> 6 2010 8608966
|
||||
#> # … with 12 more rows</pre>
|
||||
</div>
|
||||
<p>Thanks to arrow, this code will work regardless of how large the underlying dataset is. But it’s currently rather slow: on Hadley’s computer, it took ~10s to run. That’s not terrible given how much data we have, but we can make it much faster by switching to a better format.</p>
|
||||
</section>
|
||||
|
||||
<section id="the-parquet-format" data-type="sect1">
|
||||
<h1>
|
||||
The parquet format</h1>
|
||||
<p>To make this data easier to work with, let’s switch to the parquet file format and split it up into multiple files. The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.</p>
|
||||
|
||||
<section id="advantages-of-parquet" data-type="sect2">
|
||||
<h2>
|
||||
Advantages of parquet</h2>
|
||||
<p>Like CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it’s a custom binary format designed specifically for the needs of big data. This means that:</p>
|
||||
<ul><li><p>Parquet files are usually smaller than the equivalent CSV file. Parquet relies on <a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/">efficient encodings</a> to keep file size down, and supports file compression. This helps make parquet files fast because there’s less data to move from disk to memory.</p></li>
|
||||
<li><p>Parquet files have a rich type system. As we talked about in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>, a CSV file does not provide any information about column types. For example, a CSV reader has to guess whether <code>"08-10-2022"</code> should be parsed as a string or a date. In contrast, parquet files store data in a way that records the type along with the data.</p></li>
|
||||
<li><p>Parquet files are “column-oriented”. This means that they’re organised column-by-column, much like R’s data frame. This typically leads to better performance for data analysis tasks compared to CSV files, which are organised row-by-row.</p></li>
|
||||
<li><p>Parquet files are “chunked”, which makes it possible to work on different parts of the file at the same time, and, if you’re lucky, to skip some chunks altogether.</p></li>
|
||||
</ul></section>
|
||||
|
||||
<section id="partitioning" data-type="sect2">
|
||||
<h2>
|
||||
Partitioning</h2>
|
||||
<p>As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it’s often useful to split large datasets across many files. When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.</p>
|
||||
<p>There are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data. You’re likely to need to do some experimentation before you find the ideal partitioning for your situation. As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files. You should also try to partition by variables that you filter by; as you’ll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.</p>
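<p>To get a feel for what partitioning looks like on disk, here is a minimal sketch using the small built-in <code>mtcars</code> data rather than a dataset large enough to need it. The directory name <code>part_path</code> is made up for the example, and partitioning by <code>cyl</code> is purely illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(arrow)

part_path <- tempfile("mtcars-by-cyl-")

# group_by() declares the partitioning variable;
# write_dataset() then writes one folder per value of cyl
mtcars |>
  group_by(cyl) |>
  write_dataset(path = part_path, format = "parquet")

part_files <- list.files(part_path, recursive = TRUE)
part_files

# Filtering on the partitioning variable lets arrow
# restrict its attention to the matching folder
four_cyl <- open_dataset(part_path) |>
  filter(cyl == 4) |>
  collect()

nrow(four_cyl)</pre>
</div>
<p>With real data you would pick a partitioning variable that your analyses filter on often, and check that the resulting files fall within the size guidelines above.</p>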
|
||||
</section>
|
||||
|
||||
<section id="rewriting-the-seattle-library-data" data-type="sect2">
|
||||
<h2>
|
||||
Rewriting the Seattle library data</h2>
|
||||
<p>Let’s apply these ideas to the Seattle library data to see how they play out in practice. We’re going to partition by <code>CheckoutYear</code>, since it’s likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.</p>
|
||||
<p>To rewrite the data we define the partition using <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code> and then save the partitions to a directory with <code><a href="https://arrow.apache.org/docs/r/reference/write_dataset.html">arrow::write_dataset()</a></code>. <code><a href="https://arrow.apache.org/docs/r/reference/write_dataset.html">write_dataset()</a></code> has two important arguments: a directory where we’ll create the files and the format we’ll use.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">pq_path <- "data/seattle-library-checkouts"</pre>
|
||||
</div>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_csv |>
|
||||
group_by(CheckoutYear) |>
|
||||
write_dataset(path = pq_path, format = "parquet")</pre>
|
||||
</div>
|
||||
<p>This takes about a minute to run; as we’ll see shortly, this is an initial investment that pays off by making future operations much, much faster.</p>
|
||||
<p>Let’s take a look at what we just produced:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">tibble(
|
||||
files = list.files(pq_path, recursive = TRUE),
|
||||
size_MB = file.size(file.path(pq_path, files)) / 1024^2
|
||||
)
|
||||
#> # A tibble: 18 × 2
|
||||
#> files size_MB
|
||||
#> <chr> <dbl>
|
||||
#> 1 CheckoutYear=2005/part-0.parquet 109.
|
||||
#> 2 CheckoutYear=2006/part-0.parquet 164.
|
||||
#> 3 CheckoutYear=2007/part-0.parquet 178.
|
||||
#> 4 CheckoutYear=2008/part-0.parquet 195.
|
||||
#> 5 CheckoutYear=2009/part-0.parquet 214.
|
||||
#> 6 CheckoutYear=2010/part-0.parquet 222.
|
||||
#> # … with 12 more rows</pre>
|
||||
</div>
|
||||
<p>Our single 9GB CSV file has been rewritten into 18 parquet files. The file names use a “self-describing” convention used by the <a href="https://hive.apache.org">Apache Hive</a> project. Hive-style partitions name folders with a “key=value” convention, so as you might guess, the <code>CheckoutYear=2005</code> directory contains all the data where <code>CheckoutYear</code> is 2005. Each file is between 100 and 300 MB and the total size is now around 4 GB, a little under half the size of the original CSV file. This is as we expect since parquet is a much more efficient format.</p>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="using-dplyr-with-arrow" data-type="sect1">
|
||||
<h1>
|
||||
Using dplyr with arrow</h1>
|
||||
<p>Now that we’ve created these parquet files, we’ll need to read them in again. We use <code><a href="https://arrow.apache.org/docs/r/reference/open_dataset.html">open_dataset()</a></code> again, but this time we give it a directory:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_pq <- open_dataset(pq_path)</pre>
|
||||
</div>
|
||||
<p>Now we can write our dplyr pipeline. For example, we could count the total number of books checked out in each month for the last five years:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">query <- seattle_pq |>
|
||||
filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
|
||||
group_by(CheckoutYear, CheckoutMonth) |>
|
||||
summarise(TotalCheckouts = sum(Checkouts)) |>
|
||||
arrange(CheckoutYear, CheckoutMonth)</pre>
|
||||
</div>
|
||||
<p>Writing dplyr code for arrow data is conceptually similar to dbplyr, <a href="#chp-databases" data-type="xref">#chp-databases</a>: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. If we print out the <code>query</code> object we can see a little information about what we expect Arrow to return when the execution takes place:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">query
|
||||
#> FileSystemDataset (query)
|
||||
#> CheckoutYear: int32
|
||||
#> CheckoutMonth: int64
|
||||
#> TotalCheckouts: int64
|
||||
#>
|
||||
#> * Grouped by CheckoutYear
|
||||
#> * Sorted by CheckoutYear [asc], CheckoutMonth [asc]
|
||||
#> See $.data for the source Arrow object</pre>
|
||||
</div>
|
||||
<p>And we can get the results by calling <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">query |> collect()
|
||||
#> # A tibble: 58 × 3
|
||||
#> # Groups: CheckoutYear [5]
|
||||
#> CheckoutYear CheckoutMonth TotalCheckouts
|
||||
#> <int> <int> <int>
|
||||
#> 1 2018 1 355101
|
||||
#> 2 2018 2 309813
|
||||
#> 3 2018 3 344487
|
||||
#> 4 2018 4 330988
|
||||
#> 5 2018 5 318049
|
||||
#> 6 2018 6 341825
|
||||
#> # … with 52 more rows</pre>
|
||||
</div>
|
||||
<p>Like dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would. However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in <code><a href="https://arrow.apache.org/docs/r/reference/acero.html">?acero</a></code>.</p>
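<p>When you need R code that arrow can’t translate, one common workaround is to do as much filtering as possible in arrow, call <code>collect()</code>, and finish the job in ordinary R. The sketch below illustrates the pattern; <code>shout()</code> is a made-up helper standing in for any such code, and the three-row table is invented for the example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(arrow)

# Hypothetical helper: arbitrary R code that arrow knows nothing about
shout <- function(x) paste0(toupper(x), "!")

items <- arrow_table(
  data.frame(name = c("a", "b", "c"), n = c(1, 5, 10))
)

result <- items |>
  filter(n >= 5) |>             # evaluated by arrow
  collect() |>                  # pull the (now smaller) result into R
  arrange(n) |>                 # from here on, ordinary dplyr on a tibble
  mutate(label = shout(name))

result</pre>
</div>
<p>The earlier you can shrink the data before <code>collect()</code>, the less work is left for R to do in memory.</p>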
|
||||
|
||||
<section id="sec-parquet-fast" data-type="sect2">
|
||||
<h2>
|
||||
Performance</h2>
|
||||
<p>Let’s take a quick look at the performance impact of switching from CSV to parquet. First, let’s time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large CSV file:</p>
|
||||
<div class="cell" data-hash="arrow_cache/html/dataset-performance-csv_4d24d09e336fc39a348b5ad94f60f527">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_csv |>
|
||||
filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
|
||||
group_by(CheckoutMonth) |>
|
||||
summarise(TotalCheckouts = sum(Checkouts)) |>
|
||||
arrange(desc(CheckoutMonth)) |>
|
||||
collect() |>
|
||||
system.time()
|
||||
#> user system elapsed
|
||||
#> 11.980 0.924 11.350</pre>
|
||||
</div>
|
||||
<p>Now let’s use our new version of the data set in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:</p>
|
||||
<div class="cell" data-hash="arrow_cache/html/dataset-performance-multiple-parquet_ad546f5d817df3ad4bdb238240b808d3">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_pq |>
|
||||
filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
|
||||
group_by(CheckoutMonth) |>
|
||||
summarise(TotalCheckouts = sum(Checkouts)) |>
|
||||
arrange(desc(CheckoutMonth)) |>
|
||||
collect() |>
|
||||
system.time()
|
||||
#> user system elapsed
|
||||
#> 0.273 0.045 0.055</pre>
|
||||
</div>
|
||||
<p>The ~200x speedup in elapsed time (11.35s versus 0.055s) is attributable to two factors: the multi-file partitioning, and the format of individual files:</p>
|
||||
<ul><li>Partitioning improves performance because this query uses <code>CheckoutYear == 2021</code> to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.</li>
|
||||
<li>The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata mean that arrow only needs to read the four columns actually used in the query (<code>CheckoutYear</code>, <code>MaterialType</code>, <code>CheckoutMonth</code>, and <code>Checkouts</code>).</li>
|
||||
</ul><p>This massive difference in performance is why it pays off to convert large CSVs to parquet!</p>
|
||||
</section>
|
||||
|
||||
<section id="using-dbplyr-with-arrow" data-type="sect2">
|
||||
<h2>
|
||||
Using dbplyr with arrow</h2>
|
||||
<p>There’s one last advantage of parquet and arrow — it’s very easy to turn an arrow dataset into a duckdb datasource by calling <code><a href="https://arrow.apache.org/docs/r/reference/to_duckdb.html">arrow::to_duckdb()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">seattle_pq |>
|
||||
to_duckdb() |>
|
||||
filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
|
||||
group_by(CheckoutYear) |>
|
||||
summarise(TotalCheckouts = sum(Checkouts)) |>
|
||||
arrange(desc(CheckoutYear)) |>
|
||||
collect()
|
||||
#> Warning: Missing values are always removed in SQL aggregation functions.
|
||||
#> Use `na.rm = TRUE` to silence this warning
|
||||
#> This warning is displayed once every 8 hours.
|
||||
#> # A tibble: 5 × 2
|
||||
#> CheckoutYear TotalCheckouts
|
||||
#> <int> <dbl>
|
||||
#> 1 2022 2431502
|
||||
#> 2 2021 2266438
|
||||
#> 3 2020 1241999
|
||||
#> 4 2019 3931688
|
||||
#> 5 2018 3987569</pre>
|
||||
</div>
|
||||
<p>The neat thing about <code><a href="https://arrow.apache.org/docs/r/reference/to_duckdb.html">to_duckdb()</a></code> is that the transfer doesn’t involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.</p>
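<p>You can experiment with this hand-off on a small in-memory table before trying it on a large dataset. The following is a minimal sketch, assuming the arrow, duckdb, and dbplyr packages are installed; the <code>cars_tbl</code> and <code>by_cyl</code> names are made up for the example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(dbplyr, warn.conflicts = FALSE)
library(arrow)
library(duckdb)

# An in-memory arrow Table built from a small built-in data frame
cars_tbl <- arrow_table(mtcars)

by_cyl <- cars_tbl |>
  to_duckdb() |>                                  # hand the data to duckdb
  group_by(cyl) |>
  summarise(total_hp = sum(hp, na.rm = TRUE)) |>  # translated to SQL by dbplyr
  arrange(cyl) |>
  collect()                                       # run the query, return a tibble

by_cyl</pre>
</div>
<p>Supplying <code>na.rm = TRUE</code> avoids the dbplyr warning about missing values shown in the output above.</p>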
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, but it’s much, much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.</p>
|
||||
<p>Next up you’ll learn about your first non-rectangular data source, which you’ll handle using tools provided by the tidyr package. We’ll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source.</p>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
|
@ -1,5 +1,5 @@
|
|||
<section data-type="chapter" id="chp-base-R">
|
||||
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
|
||||
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and <code>for</code> loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
|
@ -44,14 +44,10 @@ x[c(3, 2, 5)]
|
|||
<pre data-type="programlisting" data-code-language="r">x <- c(10, 3, NA, 5, 8, 1, NA)
|
||||
|
||||
# All non-missing values of x
|
||||
!is.na(x)
|
||||
#> [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE
|
||||
x[!is.na(x)]
|
||||
#> [1] 10 3 5 8 1
|
||||
|
||||
# All even (or missing!) values of x
|
||||
x %% 2 == 0
|
||||
#> [1] TRUE FALSE NA FALSE TRUE FALSE NA
|
||||
x[x %% 2 == 0]
|
||||
#> [1] 10 NA 8 NA</pre>
|
||||
</div>
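<p>If you want only the even values, without the missing ones, one option is to wrap the condition in <code>which()</code>, which returns the positions of the <code>TRUE</code> elements and so drops both <code>FALSE</code> and <code>NA</code>. A minimal sketch, reusing the same <code>x</code> (the name <code>even_positions</code> is ours):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- c(10, 3, NA, 5, 8, 1, NA)

# which() keeps only the positions where the condition is TRUE
even_positions <- which(x %% 2 == 0)
even_positions
#> [1] 1 5

# Subsetting by position no longer introduces missing values
x[even_positions]
#> [1] 10  8</pre>
</div>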
|
||||
|
@ -73,7 +69,7 @@ x[c("xyz", "def")]
|
|||
<section id="subsetting-data-frames" data-type="sect2">
|
||||
<h2>
|
||||
Subsetting data frames</h2>
|
||||
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to selecting rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
|
||||
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to select rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
|
||||
<p>Here are a couple of examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
|
@ -107,7 +103,7 @@ df[df$x > 1, ]
|
|||
#> 2 3 f 0.601</pre>
|
||||
</div>
|
||||
<p>We’ll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn’t use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
|
||||
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frame interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>s. So if <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>col</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
|
||||
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>. If <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df1 <- data.frame(x = 1:3)
|
||||
df1[, "x"]
|
||||
|
@ -124,7 +120,7 @@ df2[, "x"]
|
|||
</div>
|
||||
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df1[, "x", drop = FALSE]
|
||||
<pre data-type="programlisting" data-code-language="r">df1[, "x" , drop = FALSE]
|
||||
#> x
|
||||
#> 1 1
|
||||
#> 2 2
|
||||
|
@ -159,7 +155,7 @@ df[!is.na(df$x) & df$x > 1, ]</pre>
|
|||
# same as
|
||||
df[order(df$x, df$y), ]</pre>
|
||||
</div>
|
||||
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to individual sort columns in decreasing order.</p>
|
||||
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to individually sort columns in decreasing order.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
|
||||
|
@ -209,12 +205,12 @@ Exercises</h2>
|
|||
<h1>
|
||||
Selecting a single element with <code>$</code> and <code>[[</code>
|
||||
</h1>
|
||||
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of a data frames, discuss a couple more differences between <code>data.frames</code> and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
|
||||
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frames</code> and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
|
||||
|
||||
<section id="data-frames" data-type="sect2">
|
||||
<h2>
|
||||
Data frames</h2>
|
||||
<p><code>[[</code> and <code>$</code> can be used like <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
|
||||
<p><code>[[</code> and <code>$</code> can be used to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">tb <- tibble(
|
||||
x = 1:4,
|
||||
|
@ -243,8 +239,8 @@ tb
|
|||
#> 3 3 1 4
|
||||
#> 4 4 21 25</pre>
|
||||
</div>
|
||||
<p>There are a number other base approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
|
||||
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
|
||||
<p>There are a number of other base R approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
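<p>As a rough sketch of what those alternatives look like, the block below adds the same <code>z</code> column in three ways. The data frame <code>df</code> and the names <code>df_a</code>, <code>df_b</code>, and <code>df_c</code> are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# transform() returns a modified copy with the new column added
df_a <- transform(df, z = x + y)

# with() evaluates an expression using the columns of df and returns the result
df_b <- df
df_b$z <- with(df, x + y)

# within() evaluates assignments inside df and returns the modified data frame
df_c <- within(df, z <- x + y)

df_a$z
#> [1] 11  6  4 25</pre>
</div>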
|
||||
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">max(diamonds$carat)
|
||||
#> [1] 5.01
|
||||
|
@ -252,6 +248,14 @@ tb
|
|||
levels(diamonds$cut)
|
||||
#> [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
|
||||
</div>
|
||||
<p>dplyr also provides an equivalent to <code>[[</code>/<code>$</code> that we didn’t mention in <a href="#chp-data-transform" data-type="xref">#chp-data-transform</a>: <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |> pull(carat) |> mean()
|
||||
#> [1] 0.7979397
|
||||
|
||||
diamonds |> pull(cut) |> levels()
|
||||
#> [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="tibbles" data-type="sect2">
|
||||
|
@ -283,7 +287,7 @@ tb$z
|
|||
<section id="lists" data-type="sect2">
|
||||
<h2>
|
||||
Lists</h2>
|
||||
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ to <code>[</code>. Lets illustrate the differences with a list named <code>l</code>:</p>
|
||||
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">l <- list(
|
||||
a = 1:3,
|
||||
|
@ -299,6 +303,9 @@ Lists</h2>
|
|||
#> List of 2
|
||||
#> $ a: int [1:3] 1 2 3
|
||||
#> $ b: chr "a string"
|
||||
str(l[1])
|
||||
#> List of 1
|
||||
#> $ a: int [1:3] 1 2 3
|
||||
str(l[4])
|
||||
#> List of 1
|
||||
#> $ d:List of 2
|
||||
|
@ -376,7 +383,7 @@ Exercises</h2>
|
|||
<section id="apply-family" data-type="sect1">
|
||||
<h1>
|
||||
Apply family</h1>
|
||||
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you’ll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and maps are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.</p>
|
||||
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you’ll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.</p>
|
||||
<p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.</span>. In fact, because we haven’t used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>’s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
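<p>For instance, here is a small sketch of <code>lapply()</code> on a made-up list called <code>values</code>. Like <code>map()</code>, it always returns a list with one element per input:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">values <- list(a = 1:3, b = 4:6)

# Apply mean() to each element; the result is a list with the same names
means <- lapply(values, mean)
str(means)
#> List of 2
#>  $ a: num 2
#>  $ b: num 5</pre>
</div>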
|
||||
<p>There’s no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
|
||||
<div class="cell">
|
||||
|
@ -408,7 +415,7 @@ df
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
group_by(cut) |>
|
||||
summarise(price = mean(price))
|
||||
summarize(price = mean(price))
|
||||
#> # A tibble: 5 × 2
|
||||
#> cut price
|
||||
#> <ord> <dbl>
|
||||
|
@ -423,29 +430,29 @@ tapply(diamonds$price, diamonds$cut, mean)
|
|||
#> 4358.758 3928.864 3981.760 4584.258 3457.542</pre>
|
||||
</div>
|
||||
<p>Unfortunately <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques <a href="https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec">in a gist</a>.</p>
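<p>As a hedged illustration of those gymnastics, the sketch below turns a <code>tapply()</code> result into a data frame by hand. It uses the built-in <code>mtcars</code> dataset rather than <code>diamonds</code> so that it runs without any packages loaded; the names <code>avg_mpg</code> and <code>result</code> are ours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Mean mpg for each number of cylinders, with the groups stored as names
avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The group labels live in the names, so both pieces are pulled out by hand.
# Note that the labels come back as character strings, not numbers.
result <- data.frame(
  cyl = names(avg_mpg),
  mpg = as.vector(avg_mpg)
)
nrow(result)
#> [1] 3</pre>
</div>
<p>Each extra summary or grouping variable needs the same manual treatment, which is why this approach gets awkward quickly.</p>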
|
||||
<p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out of <code>apply(df, 2, something)</code> which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
|
||||
<p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out for <code>apply(df, 2, something)</code>, which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
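<p>One way to see the danger, sketched with a made-up data frame: <code>apply()</code> first converts its input to a matrix, and a matrix can hold only one type, so mixed columns are coerced to character before your function ever sees them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# apply() coerces df to a character matrix, so every column looks like text
apply(df, 2, class)
#>           x           y 
#> "character" "character"

# lapply() visits the columns one at a time and preserves their types
str(lapply(df, class))
#> List of 2
#>  $ x: chr "integer"
#>  $ y: chr "character"</pre>
</div>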
|
||||
</section>
|
||||
|
||||
<section id="for-loops" data-type="sect1">
|
||||
<h1>
|
||||
For loops</h1>
|
||||
<p>For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are powerful and general tool that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:</p>
|
||||
<p><code>for</code> loops are the fundamental building block of iteration that both the apply and map families use under the hood. <code>for</code> loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a <code>for</code> loop looks like this:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">for (element in vector) {
|
||||
# do something with element
|
||||
}</pre>
|
||||
</div>
|
||||
<p>The most straightforward use of <code>for()</code> loops is achieve the same affect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using walk:</p>
|
||||
<p>The most straightforward use of <code>for</code> loops is to achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using <code>walk()</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">paths |> walk(append_file)</pre>
|
||||
</div>
|
||||
<p>We could have used a for loop:</p>
|
||||
<p>We could have used a <code>for</code> loop:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">for (path in paths) {
|
||||
append_file(path)
|
||||
}</pre>
|
||||
</div>
|
||||
<p>Things get a little trickier if you want to save the output of the for-loop, for example reading all of the excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
|
||||
<p>Things get a little trickier if you want to save the output of the <code>for</code> loop, for example reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
|
||||
files <- map(paths, readxl::read_excel)</pre>
|
||||
|
@ -486,23 +493,23 @@ for (path in paths) {
|
|||
out <- rbind(out, readxl::read_excel(path))
|
||||
}</pre>
|
||||
</div>
|
||||
<p>We recommend avoiding this pattern because it can become very slow when the vector is very long. This the source of the persistent canard that <code>for</code> loops are slow: they’re not, but iteratively growing a vector is.</p>
|
||||
<p>We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that <code>for</code> loops are slow: they’re not, but iteratively growing a vector is.</p>
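<p>To make the contrast concrete, here is a small sketch with a plain numeric vector. Both loops compute the same squares; the first grows its result on every iteration, while the second fills a vector allocated up front. The size <code>n</code> and the names <code>grown</code> and <code>filled</code> are arbitrary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">n <- 1e4

# Growing: each c() call copies everything accumulated so far
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the vector is created once and filled in place
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[[i]] <- i^2
}

identical(grown, filled)
#> [1] TRUE</pre>
</div>
<p>If you want to compare the two on your own machine, wrapping each loop in <code>system.time()</code> is a simple way to do it; the gap should widen as <code>n</code> increases.</p>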
|
||||
</section>
|
||||
|
||||
<section id="plots" data-type="sect1">
|
||||
<h1>
|
||||
Plots</h1>
|
||||
<p>Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, modern look. However, base R plotting functions can still be useful because they’re so concise — it’s very little typing to do a basic exploratory plot.</p>
|
||||
<p>Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise — it takes very little typing to do a basic exploratory plot.</p>
|
||||
<p>There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here’s a quick example from the diamonds dataset:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">hist(diamonds$carat)
|
||||
|
||||
plot(diamonds$carat, diamonds$price)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="base-R_files/figure-html/unnamed-chunk-39-1.png" width="576"/></p>
|
||||
<p><img src="base-R_files/figure-html/unnamed-chunk-40-1.png" width="576"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="base-R_files/figure-html/unnamed-chunk-39-2.png" width="576"/></p>
|
||||
<p><img src="base-R_files/figure-html/unnamed-chunk-40-2.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
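<p>For example, <code>with()</code> is one such technique: it evaluates an expression using the columns of a data frame, which saves repeating the data frame name. This sketch uses the built-in <code>mtcars</code> dataset so that it runs without ggplot2 loaded; the name <code>bins</code> is ours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Equivalent to plot(mtcars$wt, mtcars$mpg)
with(mtcars, plot(wt, mpg))

# hist() also returns the computed bins invisibly, which you can capture
bins <- hist(mtcars$mpg)
class(bins)
#> [1] "histogram"</pre>
</div>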
|
||||
|
@ -511,8 +518,8 @@ plot(diamonds$carat, diamonds$price)</pre>
|
|||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter, we’ve shown you selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
|
||||
<p>This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interested in programming and that you’re are looking forward to learning more outside of this book.</p>
|
||||
<p>In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
|
||||
<p>This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book.</p>
|
||||
|
||||
|
||||
</section>
|
||||
|
|
|
@ -1,616 +0,0 @@
|
|||
<section data-type="chapter" id="chp-communicate-plots">
|
||||
<h1><span id="sec-graphics-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Graphics for communication</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then moved on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.</p>
|
||||
<p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.</p>
|
||||
<p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="https://www.amazon.com/gp/product/0321934075/"><em>The Truthful Art</em></a>, by Alberto Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including <strong>ggrepel</strong> and <strong>patchwork</strong>. Rather than loading those extensions here, we’ll refer to their functions explicitly, using the <code>::</code> notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don’t forget you’ll need to install those packages with <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> if you don’t already have them.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="label" data-type="sect1">
|
||||
<h1>
|
||||
Label</h1>
|
||||
<p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> function. This example adds a plot title:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(title = "Fuel efficiency generally decreases with engine size")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-3-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.</p>
|
||||
<p>If you need to add more text, there are two other useful labels that you can use in ggplot2 2.2.0 and above:</p>
|
||||
<ul><li><p><code>subtitle</code> adds additional detail in a smaller font beneath the title.</p></li>
|
||||
<li><p><code>caption</code> adds text at the bottom right of the plot, often used to describe the source of the data.</p></li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(
|
||||
title = "Fuel efficiency generally decreases with engine size",
|
||||
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
|
||||
caption = "Data from fueleconomy.gov"
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-4-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(
|
||||
x = "Engine displacement (L)",
|
||||
y = "Highway fuel economy (mpg)",
|
||||
colour = "Car type"
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-5-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>It’s possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="https://rdrr.io/r/base/substitute.html">quote()</a></code> and read about the available options in <code><a href="https://rdrr.io/r/grDevices/plotmath.html">?plotmath</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = runif(10),
|
||||
y = runif(10)
|
||||
)
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_point() +
|
||||
labs(
|
||||
x = quote(sum(x[i] ^ 2, i == 1, n)),
|
||||
y = quote(alpha + beta + frac(delta, theta))
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-6-1.png" style="width:50.0%"/></p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Create one plot on the fuel economy data with customized <code>title</code>, <code>subtitle</code>, <code>caption</code>, <code>x</code>, <code>y</code>, and <code>colour</code> labels.</p></li>
|
||||
<li>
|
||||
<p>Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-7-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="annotations" data-type="sect1">
|
||||
<h1>
|
||||
Annotations</h1>
|
||||
<p>In addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> is similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p>
|
||||
<p>There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn’t terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">best_in_class <- mpg |>
|
||||
group_by(class) |>
|
||||
filter(row_number(desc(hwy)) == 1)
|
||||
|
||||
ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_text(aes(label = model), data = best_in_class)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-8-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-9-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>That helps a bit, but if you look closely at the top-left corner, you’ll notice that there are two labels practically on top of each other. This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same. There’s no way that we can fix this by applying the same transformation to every label. Instead, we can use the <strong>ggrepel</strong> package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don’t overlap:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_point(size = 3, shape = 1, data = best_in_class) +
|
||||
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-10-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.</p>
|
||||
<p>You can sometimes use the same idea to replace the legend with labels placed directly on the plot. It’s not wonderful for this plot, but it isn’t too bad. (<code>theme(legend.position = "none")</code> turns the legend off; we’ll talk about it more shortly.)</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">class_avg <- mpg |>
|
||||
group_by(class) |>
|
||||
summarise(
|
||||
displ = median(displ),
|
||||
hwy = median(hwy)
|
||||
)
|
||||
|
||||
ggplot(mpg, aes(displ, hwy, colour = class)) +
|
||||
ggrepel::geom_label_repel(aes(label = class),
|
||||
data = class_avg,
|
||||
size = 6,
|
||||
label.size = 0,
|
||||
segment.color = NA
|
||||
) +
|
||||
geom_point() +
|
||||
theme(legend.position = "none")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-11-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Alternatively, you might just want to add a single label to the plot, but you’ll still need to create a data frame. Often, you want the label in the corner of the plot, so it’s convenient to create a new data frame using <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to compute the maximum values of x and y.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- mpg |>
|
||||
summarise(
|
||||
displ = max(displ),
|
||||
hwy = max(hwy),
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy."
|
||||
)
|
||||
|
||||
ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point() +
|
||||
geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-12-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we’re no longer computing the positions from <code>mpg</code>, we can use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> to create the data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- tibble(
|
||||
displ = Inf,
|
||||
hwy = Inf,
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy."
|
||||
)
|
||||
|
||||
ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point() +
|
||||
geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-13-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="https://stringr.tidyverse.org/reference/str_wrap.html">stringr::str_wrap()</a></code> to automatically add line breaks, given the number of characters you want per line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">"Increasing engine size is related to decreasing fuel economy." |>
|
||||
str_wrap(width = 40) |>
|
||||
writeLines()
|
||||
#> Increasing engine size is related to
|
||||
#> decreasing fuel economy.</pre>
|
||||
</div>
|
||||
<p>Note the use of <code>hjust</code> and <code>vjust</code> to control the alignment of the label. <a href="#fig-just" data-type="xref">#fig-just</a> shows all nine possible combinations.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-just"><p><img src="communicate-plots_files/figure-html/fig-just-1.png" style="width:60.0%"/></p>
|
||||
<figcaption>All nine combinations of <code>hjust</code> and <code>vjust</code>.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
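<p>One way a grid of alignments like this could be drawn is sketched below. It is an illustration, not necessarily the code behind the figure: the names <code>hjust_pos</code>, <code>vjust_pos</code>, and <code>just_grid</code> are made up for the example, and it assumes that ggplot2 and tidyr are installed. Each label is anchored at its own grey point, so you can see which side of the text touches the point for every combination.</p>

```r
library(ggplot2)
library(tidyr)

# Numeric positions used to spread the nine labels over a 3 x 3 grid
hjust_pos <- c(left = 0, center = 0.5, right = 1)
vjust_pos <- c(bottom = 0, center = 0.5, top = 1)

# One row per hjust/vjust combination (just_grid is a hypothetical name)
just_grid <- crossing(hj = names(hjust_pos), vj = names(vjust_pos))
just_grid$x <- unname(hjust_pos[just_grid$hj])
just_grid$y <- unname(vjust_pos[just_grid$vj])
just_grid$label <- paste0("hjust = '", just_grid$hj, "'\nvjust = '", just_grid$vj, "'")

ggplot(just_grid, aes(x, y)) +
  # The grey point marks the position each label is aligned against
  geom_point(colour = "grey70", size = 5) +
  geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4) +
  labs(x = NULL, y = NULL)
```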
|
||||
<p>Remember, in addition to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A few ideas:</p>
|
||||
<ul><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_hline()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_vline()</a></code> to add reference lines. We often make them thick (<code>size = 2</code>, or <code>linewidth = 2</code> in ggplot2 3.4.0 and later) and white (<code>colour = "white"</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
|
||||
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_rect()</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li>
|
||||
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_segment.html">geom_segment()</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li>
|
||||
</ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p>
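<p>The ideas in the list above can be combined in one plot, as in the sketch below. The coordinates in <code>highlight</code> and <code>pointer</code> are hypothetical values picked by eye for the <code>mpg</code> data, and the <code>linewidth</code> argument assumes ggplot2 3.4.0 or later (use <code>size</code> in older versions).</p>

```r
library(ggplot2)

# Hypothetical region of interest: large engines with unusually high mileage
highlight <- data.frame(xmin = 5.5, xmax = 7.2, ymin = 22, ymax = 27)
# Hypothetical arrow running from empty space toward that region
pointer <- data.frame(x = 4.5, y = 35, xend = 5.6, yend = 27.5)

ggplot(mpg, aes(displ, hwy)) +
  # Reference line at the mean, added first so it sits underneath the points
  geom_hline(yintercept = mean(mpg$hwy), linewidth = 2, colour = "white") +
  # Rectangle boundaries come from the xmin/xmax/ymin/ymax aesthetics
  geom_rect(
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    data = highlight, inherit.aes = FALSE,
    fill = NA, colour = "red"
  ) +
  # x/y give the start of the arrow, xend/yend the end
  geom_segment(
    aes(x = x, y = y, xend = xend, yend = yend),
    data = pointer, inherit.aes = FALSE,
    arrow = arrow(length = unit(0.1, "inches"))
  ) +
  geom_point()
```

<p>Setting <code>inherit.aes = FALSE</code> keeps the annotation layers from looking for <code>displ</code> and <code>hwy</code> in their own small data frames.</p>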
|
||||
|
||||
<section id="exercises-1" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> with infinite positions to place text at the four corners of the plot.</p></li>
|
||||
<li><p>Read the documentation for <code><a href="https://ggplot2.tidyverse.org/reference/annotate.html">annotate()</a></code>. How can you use it to add a text label to a plot without having to create a tibble?</p></li>
|
||||
<li><p>How do labels with <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)</p></li>
|
||||
<li><p>What arguments to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> control the appearance of the background box?</p></li>
|
||||
<li><p>What are the four arguments to <code><a href="https://rdrr.io/r/grid/arrow.html">arrow()</a></code>? How do they work? Create a series of plots that demonstrate the most important options.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="scales" data-type="sect1">
|
||||
<h1>
|
||||
Scales</h1>
|
||||
<p>The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive. Normally, ggplot2 automatically adds scales for you. For example, when you type:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class))</pre>
|
||||
</div>
|
||||
<p>ggplot2 automatically adds default scales behind the scenes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
scale_x_continuous() +
|
||||
scale_y_continuous() +
|
||||
scale_colour_discrete()</pre>
|
||||
</div>
|
||||
<p>Note the naming scheme for scales: <code>scale_</code> followed by the name of the aesthetic, then <code>_</code>, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. There are lots of non-default scales which you’ll learn about below.</p>
|
||||
<p>The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:</p>
|
||||
<ul><li><p>You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.</p></li>
|
||||
<li><p>You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.</p></li>
|
||||
</ul>
|
||||
<section id="axis-ticks-and-legend-keys" data-type="sect2">
|
||||
<h2>
|
||||
Axis ticks and legend keys</h2>
|
||||
<p>There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: <code>breaks</code> and <code>labels</code>. The <code>breaks</code> argument controls the position of the ticks, or the values associated with the keys. The <code>labels</code> argument controls the text label associated with each tick/key. The most common use of <code>breaks</code> is to override the default choice:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point() +
|
||||
scale_y_continuous(breaks = seq(15, 40, by = 5))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-18-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can use <code>labels</code> in the same way (a character vector the same length as <code>breaks</code>), but you can also set it to <code>NULL</code> to suppress the labels altogether. This is useful for maps, or for publishing plots where you can’t share the absolute numbers.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point() +
|
||||
scale_x_continuous(labels = NULL) +
|
||||
scale_y_continuous(labels = NULL)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-19-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
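<p>To illustrate the character-vector form of <code>labels</code>, here is a small sketch that attaches a unit to each tick. The name <code>hwy_breaks</code> and the "mpg" suffix are choices made for this example only.</p>

```r
library(ggplot2)

# Tick positions; hwy_breaks is a hypothetical helper name
hwy_breaks <- seq(15, 40, by = 5)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(
    breaks = hwy_breaks,
    # One label per break, in the same order as the breaks
    labels = paste(hwy_breaks, "mpg")
  )
```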
|
||||
<p>You can also use <code>breaks</code> and <code>labels</code> to control the appearance of legends. Collectively axes and legends are called <strong>guides</strong>. Axes are used for x and y aesthetics; legends are used for everything else.</p>
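<p>As a sketch of the legend case, the keys for the <code>drv</code> variable could be relabelled as follows. The replacement label text is an assumption made for the example; the break values are the three values that <code>drv</code> takes in <code>mpg</code>.</p>

```r
library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  scale_colour_discrete(
    # breaks are the data values; labels are what the legend displays
    breaks = c("4", "f", "r"),
    labels = c("4-wheel", "front", "rear")
  )
```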
|
||||
<p>Another use of <code>breaks</code> is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">presidential |>
|
||||
mutate(id = 33 + row_number()) |>
|
||||
ggplot(aes(start, id)) +
|
||||
geom_point() +
|
||||
geom_segment(aes(xend = end, yend = id)) +
|
||||
scale_x_date(NULL, breaks = presidential$start, date_labels = "'%y")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-20-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that the specification of breaks and labels for date and datetime scales is a little different:</p>
|
||||
<ul><li><p><code>date_labels</code> takes a format specification, in the same form as <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">parse_datetime()</a></code>.</p></li>
|
||||
<li><p><code>date_breaks</code> (not shown here) takes a string like “2 days” or “1 month”.</p></li>
|
||||
</ul></section>
|
||||
|
||||
<section id="legend-layout" data-type="sect2">
|
||||
<h2>
|
||||
Legend layout</h2>
|
||||
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p>
|
||||
<p>To control the overall position of the legend, you need to use a <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">base <- ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class))
|
||||
|
||||
base + theme(legend.position = "left")
|
||||
base + theme(legend.position = "top")
|
||||
base + theme(legend.position = "bottom")
|
||||
base + theme(legend.position = "right") # the default</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-3.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-4.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p>
|
||||
<p>To control the display of individual legends, use <code><a href="https://ggplot2.tidyverse.org/reference/guides.html">guides()</a></code> along with <code><a href="https://ggplot2.tidyverse.org/reference/guide_legend.html">guide_legend()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/guide_colourbar.html">guide_colorbar()</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(colour = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
theme(legend.position = "bottom") +
|
||||
guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))
|
||||
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-22-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="replacing-a-scale" data-type="sect2">
|
||||
<h2>
|
||||
Replacing a scale</h2>
|
||||
<p>Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and colour, you’ll be able to quickly pick up other scale replacements.</p>
|
||||
<p>It’s very useful to plot transformations of your variable. For example, as we’ve seen in <a href="#chp-diamond-prices" data-type="xref">#chp-diamond-prices</a>, it’s easier to see the precise relationship between <code>carat</code> and <code>price</code> if we log transform them:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(carat, price)) +
|
||||
geom_bin2d()
|
||||
|
||||
ggplot(diamonds, aes(log10(carat), log10(price))) +
|
||||
geom_bin2d()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-23-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-23-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(carat, price)) +
|
||||
geom_bin2d() +
|
||||
scale_x_log10() +
|
||||
scale_y_log10()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-24-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Another scale that is frequently customized is colour. The default categorical scale picks colors that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = drv))
|
||||
|
||||
ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = drv)) +
|
||||
scale_colour_brewer(palette = "Set1")</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-25-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-25-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Don’t forget simpler techniques. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = drv, shape = drv)) +
|
||||
scale_colour_brewer(palette = "Set1")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-26-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The ColorBrewer scales are documented online at <a href="https://colorbrewer2.org/" class="uri">https://colorbrewer2.org/</a> and made available in R via the <strong>RColorBrewer</strong> package, by Erich Neuwirth. <a href="#fig-brewer" data-type="xref">#fig-brewer</a> shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code> to make a continuous variable into a categorical variable.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-brewer"><p><img src="communicate-plots_files/figure-html/fig-brewer-1.png" width="576"/></p>
|
||||
<figcaption>All ColorBrewer scales.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>When you have a predefined mapping between values and colors, use <code><a href="https://ggplot2.tidyverse.org/reference/scale_manual.html">scale_colour_manual()</a></code>. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">presidential |>
|
||||
mutate(id = 33 + row_number()) |>
|
||||
ggplot(aes(start, id, colour = party)) +
|
||||
geom_point() +
|
||||
geom_segment(aes(xend = end, yend = id)) +
|
||||
scale_colour_manual(values = c(Republican = "red", Democratic = "blue"))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-28-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>For continuous colour, you can use the built-in <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_colour_gradient()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_fill_gradient()</a></code>. If you have a diverging scale, you can use <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_colour_gradient2()</a></code>. That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.</p>
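<p>A minimal sketch of a diverging scale follows, colouring each car by how far its highway mileage sits above or below the mean. The names <code>mpg_centred</code> and <code>hwy_diff</code>, and the three colours, are assumptions made for this example.</p>

```r
library(ggplot2)

# hwy_diff is a hypothetical helper column: distance from the mean highway mileage
mpg_centred <- transform(mpg, hwy_diff = hwy - mean(hwy))

ggplot(mpg_centred, aes(displ, hwy)) +
  geom_point(aes(colour = hwy_diff)) +
  # midpoint = 0 puts the neutral colour at the mean
  scale_colour_gradient2(low = "blue", mid = "grey80", high = "red", midpoint = 0)
```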
|
||||
<p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = rnorm(10000),
|
||||
y = rnorm(10000)
|
||||
)
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
coord_fixed() +
|
||||
labs(title = "Default, continuous")
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
coord_fixed() +
|
||||
scale_fill_viridis_c() +
|
||||
labs(title = "Viridis, continuous")
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
coord_fixed() +
|
||||
scale_fill_viridis_b() +
|
||||
labs(title = "Viridis, binned")</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-29-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-29-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-29-3.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that all colour scales come in two varieties: <code>scale_colour_x()</code> and <code>scale_fill_x()</code> for the <code>colour</code> and <code>fill</code> aesthetics respectively (the colour scales are available in both UK and US spellings).</p>
|
||||
</section>
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
<p>Why doesn’t the following code override the default scale?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
scale_colour_gradient(low = "white", high = "red") +
|
||||
coord_fixed()</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>What is the first argument to every scale? How does it compare to <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code>?</p></li>
|
||||
<li>
|
||||
<p>Change the display of the presidential terms by:</p>
|
||||
<ol type="a"><li>Combining the two variants shown above.</li>
|
||||
<li>Improving the display of the y axis.</li>
|
||||
<li>Labelling each term with the name of the president.</li>
|
||||
<li>Adding informative plot labels.</li>
|
||||
<li>Placing breaks every 4 years (this is trickier than it seems!).</li>
|
||||
</ol></li>
|
||||
<li>
|
||||
<p>Use <code>override.aes</code> to make the legend on the following plot easier to see.</p>
|
||||
<div class="cell" data-fig.format="png">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(carat, price)) +
|
||||
geom_point(aes(colour = cut), alpha = 1/20)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-31-1.png" style="width:50.0%"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="zooming" data-type="sect1">
|
||||
<h1>
|
||||
Zooming</h1>
|
||||
<p>There are three ways to control the plot limits:</p>
|
||||
<ol type="1"><li>Adjusting what data are plotted</li>
|
||||
<li>Setting the limits in each scale</li>
|
||||
<li>Setting <code>xlim</code> and <code>ylim</code> in <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>
|
||||
</li>
|
||||
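</ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>. Compare the following two plots: the first zooms with <code>coord_cartesian()</code>, so the smoother is still fitted to all of the data, while the second filters the data first, so the smoother is fitted only to the points that remain.</p>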
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, mapping = aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth() +
|
||||
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
|
||||
|
||||
mpg |>
|
||||
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
|
||||
ggplot(aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-32-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-32-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to <em>expand</em> the limits, for example, to match scales across different plots. For instance, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">suv <- mpg |> filter(class == "suv")
|
||||
compact <- mpg |> filter(class == "compact")
|
||||
|
||||
ggplot(suv, aes(displ, hwy, colour = drv)) +
|
||||
geom_point()
|
||||
|
||||
ggplot(compact, aes(displ, hwy, colour = drv)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-33-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-33-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>One way to overcome this problem is to share scales across multiple plots, training the scales with the <code>limits</code> of the full data.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">x_scale <- scale_x_continuous(limits = range(mpg$displ))
|
||||
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
|
||||
col_scale <- scale_colour_discrete(limits = unique(mpg$drv))
|
||||
|
||||
ggplot(suv, aes(displ, hwy, colour = drv)) +
|
||||
geom_point() +
|
||||
x_scale +
|
||||
y_scale +
|
||||
col_scale
|
||||
|
||||
ggplot(compact, aes(displ, hwy, colour = drv)) +
|
||||
geom_point() +
|
||||
x_scale +
|
||||
y_scale +
|
||||
col_scale</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-34-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-34-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.</p>
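<p>For comparison, the faceting alternative might look like the sketch below. The name <code>suv_compact</code> is made up for the example. Because both panels belong to one plot, they share the x, y, and colour scales by default.</p>

```r
library(ggplot2)

# Keep only the two classes compared above (suv_compact is a hypothetical name)
suv_compact <- subset(mpg, class %in% c("suv", "compact"))

ggplot(suv_compact, aes(displ, hwy, colour = drv)) +
  geom_point() +
  facet_wrap(~class)
```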
|
||||
</section>
|
||||
|
||||
<section id="themes" data-type="sect1">
|
||||
<h1>
|
||||
Themes</h1>
|
||||
<p>Finally, you can customize the non-data elements of your plot with a theme:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
theme_bw()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-35-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>ggplot2 includes eight themes by default, as shown in <a href="#fig-themes" data-type="xref">#fig-themes</a>. Many more are included in add-on packages like <strong>ggthemes</strong> (<a href="https://jrnold.github.io/ggthemes" class="uri">https://jrnold.github.io/ggthemes</a>), by Jeffrey Arnold.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-themes"><p><img src="images/visualization-themes.png" alt="Eight barplots created with ggplot2, each with one of the eight built-in themes: theme_bw() - White background with grid lines, theme_light() - Light axes and grid lines, theme_classic() - Classic theme, axes but no grid lines, theme_linedraw() - Only black lines, theme_dark() - Dark background for contrast, theme_minimal() - Minimal theme, no background, theme_gray() - Gray background (default theme), theme_void() - Empty theme, only geoms are visible." width="1600"/></p>
|
||||
<figcaption>The eight themes built into ggplot2.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the gray background creates a continuous field of color, which ensures that the plot is perceived as a single visual entity.</p>
|
||||
<p>It’s also possible to control individual components of each theme, like the size and color of the font used for the y axis. Unfortunately, this level of detail is outside the scope of this book, so you’ll need to read the <a href="https://ggplot2-book.org/">ggplot2 book</a> for the full details. You can also create your own themes, if you are trying to match a particular corporate or journal style.</p>
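<p>As a small taste of what adjusting individual theme components looks like, the sketch below changes the y-axis title and moves the legend using <code>theme()</code> and <code>element_text()</code>. The specific size, color, and legend position are arbitrary choices for illustration, and the name <code>p_themed</code> is ours.</p>

```r
library(ggplot2)

p_themed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  theme_bw() +
  theme(
    # Make the y-axis title larger and dark red (illustrative values).
    axis.title.y = element_text(size = 14, colour = "darkred"),
    # Move the legend below the plot.
    legend.position = "bottom"
  )

p_themed
```

<p>Because <code>theme()</code> is added after <code>theme_bw()</code>, it overrides only the listed components and leaves the rest of the complete theme in place.</p>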
|
||||
</section>
|
||||
|
||||
<section id="sec-ggsave" data-type="sect1">
|
||||
<h1>
|
||||
Saving your plots</h1>
|
||||
<p>There are two main ways to get your plots out of R and into your final write-up: <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> and knitr. <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> will save the most recent plot to disk:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(displ, hwy)) + geom_point()
|
||||
ggsave("my-plot.pdf")
|
||||
#> Saving 6 x 4 in image</pre>
|
||||
</div>
|
||||
<p>If you don’t specify the <code>width</code> and <code>height</code>, they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.</p>
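<p>A minimal sketch of a reproducible call is shown below. It passes the plot object and the size explicitly, so the result doesn’t depend on the most recent plot or on the current device. The output path (a temporary directory) and the 6 x 4 inch size are illustrative assumptions.</p>

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# The file location is only an example; a temporary directory is used so
# that the sketch doesn't write into your project.
out_file <- file.path(tempdir(), "my-plot.pdf")

# Explicit plot, width, height, and units make the output reproducible.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```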
|
||||
<p>Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics. You can learn more about <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> in the documentation.</p>
|
||||
</section>
|
||||
|
||||
<section id="learning-more" data-type="sect1">
|
||||
<h1>
|
||||
Learning more</h1>
|
||||
<p>The absolute best place to learn more is the ggplot2 book: <a href="https://ggplot2-book.org/"><em>ggplot2: Elegant graphics for data analysis</em></a>. It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.</p>
|
||||
<p>Another great resource is the ggplot2 extensions gallery <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a>. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.</p>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
|
@ -2,7 +2,7 @@
|
|||
<h1><span id="sec-communicate-intro" class="quarto-section-identifier d-none d-lg-block">Communicate</span></h1><p>So far, you’ve learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualization. However, it doesn’t matter how great your analysis is unless you can explain it to others: you need to <strong>communicate</strong> your results.</p><div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-ds-communicate"><p><img src="diagrams/data-science/communicate.png" alt="A diagram displaying the data science cycle with visualize and communicate highlighed in blue. " width="535"/></p>
|
||||
<figure id="fig-ds-communicate"><p><img src="diagrams/data-science/communicate.png" alt="A diagram displaying the data science cycle with communicate highlighted in blue. " width="535"/></p>
|
||||
<figcaption>Figure 1: Communication is the final part of the data science process; if you can’t communicate your results to other humans, it doesn’t matter how great your analysis is.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
|
|
|
@ -0,0 +1,859 @@
|
|||
<section data-type="chapter" id="chp-communication">
|
||||
<h1><span id="sec-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Communication</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.</p>
|
||||
<p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.</p>
|
||||
<p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="https://www.amazon.com/gp/product/0321934075/">The Truthful Art</a>, by Alberto Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, <strong>scales</strong> to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including <strong>ggrepel</strong> (<a href="https://ggrepel.slowkow.com/">https://ggrepel.slowkow.com</a>) by Kamil Slowikowski and <strong>patchwork</strong> (<a href="https://patchwork.data-imaginist.com/">https://patchwork.data-imaginist.com</a>) by Thomas Lin Pedersen. Don’t forget that you’ll need to install those packages with <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> if you don’t already have them.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
#> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
|
||||
#> ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
|
||||
#> ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
|
||||
#> ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
|
||||
#> ✔ purrr 1.0.1
|
||||
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
|
||||
#> ✖ dplyr::filter() masks stats::filter()
|
||||
#> ✖ dplyr::lag() masks stats::lag()
|
||||
#> ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
|
||||
library(ggrepel)
|
||||
library(patchwork)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="labels" data-type="sect1">
|
||||
<h1>
|
||||
Labels</h1>
|
||||
<p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> function. This example adds a plot title:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(title = "Fuel efficiency generally decreases with engine size")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-3-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid. The plot is titled "Fuel efficiency generally decreases with engine size"." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.</p>
|
||||
<p>If you need to add more text, there are two other useful labels:</p>
|
||||
<ul><li><p><code>subtitle</code> adds additional detail in a smaller font beneath the title.</p></li>
|
||||
<li><p><code>caption</code> adds text at the bottom right of the plot, often used to describe the source of the data.</p></li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(
|
||||
title = "Fuel efficiency generally decreases with engine size",
|
||||
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
|
||||
caption = "Data from fueleconomy.gov"
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-4-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid. The plot is titled "Fuel efficiency generally decreases with engine size". The subtitle is "Two seaters (sports cars) are an exception because of their light weight" and the caption is "Data from fueleconomy.gov"." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
labs(
|
||||
x = "Engine displacement (L)",
|
||||
y = "Highway fuel economy (mpg)",
|
||||
color = "Car type"
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-5-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid. The x-axis is labelled "Engine displacement (L)" and the y-axis is labelled "Highway fuel economy (mpg)". The legend is labelled "Car type"." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>It’s possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="https://rdrr.io/r/base/substitute.html">quote()</a></code> and read about the available options in <code><a href="https://rdrr.io/r/grDevices/plotmath.html">?plotmath</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = 1:10,
|
||||
y = x ^ 2
|
||||
)
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_point() +
|
||||
labs(
|
||||
x = quote(sum(x[i] ^ 2, i == 1, n)),
|
||||
y = quote(alpha + beta + frac(delta, theta))
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-6-1.png" style="width:50.0%" alt="Scatterplot with math text on the x and y axis labels. X-axis label says sum of x_i squared, for i from 1 to n. Y-axis label says alpha + beta + delta over theta."/></p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Create one plot on the fuel economy data with customized <code>title</code>, <code>subtitle</code>, <code>caption</code>, <code>x</code>, <code>y</code>, and <code>color</code> labels.</p></li>
|
||||
<li>
|
||||
<p>Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-7-1.png" alt="Scatterplot of highway versus city fuel efficiency. Shapes and colors of points are determined by type of drive train." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="annotations" data-type="sect1">
|
||||
<h1>
|
||||
Annotations</h1>
|
||||
<p>In addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> is similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p>
|
||||
<p>There are two possible sources of labels. First, you might have a tibble that provides labels. In the following plot we pull out the car with the largest engine size in each drive type and save its information in a new data frame called <code>label_info</code>. In order to create the <code>label_info</code> data frame we used a number of new dplyr functions. You’ll learn more about each of these soon!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- mpg |>
|
||||
group_by(drv) |>
|
||||
arrange(desc(displ)) |>
|
||||
slice_head(n = 1) |>
|
||||
mutate(
|
||||
drive_type = case_when(
|
||||
drv == "f" ~ "front-wheel drive",
|
||||
drv == "r" ~ "rear-wheel drive",
|
||||
drv == "4" ~ "4-wheel drive"
|
||||
)
|
||||
) |>
|
||||
select(displ, hwy, drv, drive_type)
|
||||
|
||||
label_info
|
||||
#> # A tibble: 3 × 4
|
||||
#> # Groups: drv [3]
|
||||
#> displ hwy drv drive_type
|
||||
#> <dbl> <int> <chr> <chr>
|
||||
#> 1 6.5 17 4 4-wheel drive
|
||||
#> 2 5.3 25 f front-wheel drive
|
||||
#> 3 7 24 r rear-wheel drive</pre>
|
||||
</div>
|
||||
<p>Then, we use this new data frame to label the three groups directly on the plot, replacing the legend. Using the <code>fontface</code> and <code>size</code> arguments we can customize the look of the text labels: they’re bold and larger than the rest of the text on the plot. (<code>theme(legend.position = "none")</code> turns the legend off; we’ll talk about it more shortly.)</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point(alpha = 0.3) +
|
||||
geom_smooth(se = FALSE) +
|
||||
geom_text(
|
||||
data = label_info,
|
||||
aes(x = displ, y = hwy, label = drive_type),
|
||||
fontface = "bold", size = 5, hjust = "right", vjust = "bottom"
|
||||
) +
|
||||
theme(legend.position = "none")
|
||||
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-9-1.png" alt="Scatterplot of highway mileage versus engine size where points are colored by drive type. Smooth curves for each drive type are overlaid. Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note the use of <code>hjust</code> and <code>vjust</code> to control the alignment of the label. <a href="#fig-just" data-type="xref">#fig-just</a> shows all nine possible combinations.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-just"><p><img src="communication_files/figure-html/fig-just-1.png" style="width:60.0%" alt="A 1x1 grid. At (0,0) hjust is set to left and vjust is set to bottom. At (0.5, 0) hjust is center and vjust is bottom and at (1, 0) hjust is right and vjust is bottom. At (0, 0.5) hjust is left and vjust is center, at (0.5, 0.5) hjust is center and vjust is center, and at (1, 0.5) hjust is right and vjust is center. Finally, at (0, 1) hjust is left and vjust is top, at (0.5, 1) hjust is center and vjust is top, and at (1, 1) hjust is right and vjust is top."/></p>
|
||||
<figcaption>All nine combinations of <code>hjust</code> and <code>vjust</code>.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
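<p>The code that draws this figure is not shown, but a grid like it can be sketched as follows. The helper names (<code>hjust_pos</code>, <code>vjust_pos</code>, <code>just_grid</code>) are ours, and the numeric positions are used only to place the reference points; the alignment itself comes from mapping the character values to the <code>hjust</code> and <code>vjust</code> aesthetics.</p>

```r
library(tidyverse)

# Numeric positions for each named alignment (only used to place the points).
hjust_pos <- c(left = 0, center = 0.5, right = 1)
vjust_pos <- c(bottom = 0, center = 0.5, top = 1)

# One row per combination of horizontal and vertical alignment.
just_grid <- crossing(hj = names(hjust_pos), vj = names(vjust_pos)) |>
  mutate(
    x = unname(hjust_pos[hj]),
    y = unname(vjust_pos[vj]),
    label = paste0("hjust = '", hj, "'\nvjust = '", vj, "'")
  )

# Each label is aligned relative to its grey reference point.
ggplot(just_grid, aes(x = x, y = y)) +
  geom_point(color = "grey70", size = 2) +
  geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4)
```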
|
||||
<p>However, the annotated plot we made above is hard to read because the labels overlap with each other and with the points. We can make things a little better by switching to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code>, which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point(alpha = 0.3) +
|
||||
geom_smooth(se = FALSE) +
|
||||
geom_label(
|
||||
data = label_info,
|
||||
aes(x = displ, y = hwy, label = drive_type),
|
||||
fontface = "bold", size = 5, hjust = "right", alpha = 0.5, nudge_y = 2
|
||||
) +
|
||||
theme(legend.position = "none")
|
||||
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-11-1.png" alt="Scatterplot of highway mileage versus engine size where points are colored by drive type. Smooth curves for each drive type are overlaid. Labels identify the curves as front-wheel, rear-wheel, and 4-wheel drive. Each label is drawn in a box with a white, semi-transparent background." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>That helps a bit, but two of the labels still overlap with each other. This is difficult to fix by applying the same transformation for every label. Instead, we can use the <code><a href="https://rdrr.io/pkg/ggrepel/man/geom_text_repel.html">geom_label_repel()</a></code> function from the ggrepel package. This useful package will automatically adjust labels so that they don’t overlap:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point(alpha = 0.3) +
|
||||
geom_smooth(se = FALSE) +
|
||||
geom_label_repel(
|
||||
data = label_info,
|
||||
aes(x = displ, y = hwy, label = drive_type),
|
||||
fontface = "bold", size = 5, nudge_y = 2
|
||||
) +
|
||||
theme(legend.position = "none")
|
||||
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-12-1.png" alt="Scatterplot of highway mileage versus engine size where points are colored by drive type. Smooth curves for each drive type are overlaid. Labels identify the curves as front-wheel, rear-wheel, and 4-wheel drive. Each label is drawn in a box with a white background and positioned so that the labels do not overlap." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use the same idea to highlight certain points on a plot with <code><a href="https://rdrr.io/pkg/ggrepel/man/geom_text_repel.html">geom_text_repel()</a></code> from the ggrepel package. Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">potential_outliers <- mpg |>
|
||||
filter(hwy > 40 | (hwy > 20 & displ > 5))
|
||||
|
||||
ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_text_repel(data = potential_outliers, aes(label = model)) +
|
||||
geom_point(data = potential_outliers, color = "red") +
|
||||
geom_point(data = potential_outliers, color = "red", size = 3, shape = "circle open")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-13-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. Points where highway mileage is above 40 as well as above 20 with engine size above 5 are red, with a hollow red circle, and labelled with model name of the car." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Alternatively, you might just want to add a single label to the plot, but you’ll still need to create a data frame. Often, you want the label in the corner of the plot, so it’s convenient to create a new data frame using <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to compute the maximum values of x and y.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- mpg |>
|
||||
summarize(
|
||||
displ = max(displ),
|
||||
hwy = max(hwy),
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy."
|
||||
)
|
||||
|
||||
ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_text(
|
||||
data = label_info, aes(label = label),
|
||||
vjust = "top", hjust = "right"
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-14-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. On the top right corner, inset a bit from the corner, is an annotation that reads "increasing engine size is related to decreasing fuel economy". The text spans two lines." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we’re no longer computing the positions from <code>mpg</code>, we can use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> to create the data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">label_info <- tibble(
|
||||
displ = Inf,
|
||||
hwy = Inf,
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy."
|
||||
)
|
||||
|
||||
ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. On the top right corner, flush against the corner, is an annotation that reads "increasing engine size is related to decreasing fuel economy". The text spans two lines." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Alternatively, we can add the annotation without creating a new data frame, using <code><a href="https://ggplot2.tidyverse.org/reference/annotate.html">annotate()</a></code>. This function adds a geom to a plot, but it doesn’t map variables of a data frame to an aesthetic. The first argument of this function, <code>geom</code>, is the geometric object you want to use for annotation.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
annotate(
|
||||
geom = "text", x = Inf, y = Inf,
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy.",
|
||||
vjust = "top", hjust = "right"
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-16-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. On the top right corner, flush against the corner, is an annotation that reads "increasing engine size is related to decreasing fuel economy". The text spans two lines." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use a label geom instead of a text geom, as we did earlier, and set aesthetics like color. Another approach for drawing attention to a plot feature is to use a segment geom with the <code>arrow</code> argument. The <code>x</code> and <code>y</code> aesthetics define the starting location of the segment, and <code>xend</code> and <code>yend</code> define the end location.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
annotate(
|
||||
geom = "label", x = 3.5, y = 38,
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy.",
|
||||
hjust = "left", color = "red"
|
||||
) +
|
||||
annotate(
|
||||
geom = "segment",
|
||||
x = 3, y = 35, xend = 5, yend = 25, color = "red",
|
||||
arrow = arrow(type = "closed")
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-17-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. A red arrow pointing down follows the trend of the points and the annotation placed next to the arrow reads "increasing engine size is related to decreasing fuel economy". The arrow and the annotation text are red." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="https://stringr.tidyverse.org/reference/str_wrap.html">stringr::str_wrap()</a></code> to automatically add line breaks, given the number of characters you want per line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">"Increasing engine size is related to decreasing fuel economy." |>
|
||||
str_wrap(width = 40) |>
|
||||
writeLines()
|
||||
#> Increasing engine size is related to
|
||||
#> decreasing fuel economy.</pre>
|
||||
</div>
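<p>The wrapped string can then be passed straight to a label. The sketch below reuses the <code>annotate()</code> call from earlier in this section, with the manual <code>"\n"</code> replaced by <code>str_wrap()</code>; the name <code>wrapped_label</code> is ours.</p>

```r
library(tidyverse)

# Let str_wrap() choose the line breaks instead of inserting "\n" by hand.
wrapped_label <- str_wrap(
  "Increasing engine size is related to decreasing fuel economy.",
  width = 40
)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(
    geom = "text", x = Inf, y = Inf,
    label = wrapped_label,
    vjust = "top", hjust = "right"
  )
```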
|
||||
<p>Remember, in addition to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A couple of ideas:</p>
|
||||
<ul><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_hline()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_vline()</a></code> to add reference lines. We often make them thick (<code>linewidth = 2</code>) and white (<code>color = "white"</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
|
||||
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_rect()</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li>
|
||||
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_segment.html">geom_segment()</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li>
|
||||
</ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p>
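<p>As a rough sketch of the first two ideas in the list above, the plot below adds a white reference line underneath the points and a shaded rectangle around a group of interest. The rectangle boundaries were picked by eye for illustration, the name <code>p_annotated</code> is ours, and <code>linewidth</code> assumes ggplot2 3.4.0 or later.</p>

```r
library(ggplot2)

p_annotated <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Reference line at the mean highway mileage, added first so that it is
  # drawn underneath the points. `linewidth` assumes ggplot2 >= 3.4.0.
  geom_hline(yintercept = mean(mpg$hwy), linewidth = 2, color = "white") +
  # Shaded rectangle around large-engine cars with relatively high mileage;
  # the boundaries are illustrative, not computed from the data.
  annotate(
    geom = "rect",
    xmin = 5, xmax = 7.2, ymin = 22, ymax = 27.5,
    fill = "red", alpha = 0.2
  ) +
  geom_point()

p_annotated
```

<p>Layers are drawn in the order they are added, which is why the reference line and rectangle come before <code>geom_point()</code>.</p>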
|
||||
|
||||
<section id="exercises-1" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> with infinite positions to place text at the four corners of the plot.</p></li>
|
||||
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/annotate.html">annotate()</a></code> to add a point geom in the middle of your last plot without having to create a tibble. Customize the shape, size, or color of the point.</p></li>
|
||||
<li><p>How do labels with <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)</p></li>
|
||||
<li><p>What arguments to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> control the appearance of the background box?</p></li>
|
||||
<li><p>What are the four arguments to <code><a href="https://rdrr.io/r/grid/arrow.html">arrow()</a></code>? How do they work? Create a series of plots that demonstrate the most important options.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="scales" data-type="sect1">
|
||||
<h1>
|
||||
Scales</h1>
|
||||
<p>The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive.</p>
|
||||
|
||||
<section id="default-scales" data-type="sect2">
|
||||
<h2>
|
||||
Default scales</h2>
|
||||
<p>Normally, ggplot2 automatically adds scales for you. For example, when you type:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class))</pre>
|
||||
</div>
|
||||
<p>ggplot2 automatically adds default scales behind the scenes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
scale_x_continuous() +
|
||||
scale_y_continuous() +
|
||||
scale_color_discrete()</pre>
|
||||
</div>
|
||||
<p>Note the naming scheme for scales: <code>scale_</code> followed by the name of the aesthetic, then <code>_</code>, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. There are lots of non-default scales, which you’ll learn about below.</p>
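<p>One way to see the naming scheme in practice is to list the scale functions that ggplot2 exports for a single aesthetic. The following is a minimal sketch; the variable name <code>color_scales</code> is only illustrative, and the exact set of functions returned depends on your installed ggplot2 version.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# List every exported function whose name starts with "scale_color_".
# Each one follows the pattern scale_ + aesthetic + _ + scale name.
color_scales <- ls("package:ggplot2", pattern = "^scale_color_")
head(color_scales)</pre>
</div>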
|
||||
<p>The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:</p>
|
||||
<ul><li><p>You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.</p></li>
|
||||
<li><p>You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.</p></li>
|
||||
</ul></section>
|
||||
|
||||
<section id="axis-ticks-and-legend-keys" data-type="sect2">
|
||||
<h2>
|
||||
Axis ticks and legend keys</h2>
|
||||
<p>There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: <code>breaks</code> and <code>labels</code>. The <code>breaks</code> argument controls the position of the ticks, or the values associated with the keys. The <code>labels</code> argument controls the text label associated with each tick or key. The most common use of <code>breaks</code> is to override the default choice:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
scale_y_continuous(breaks = seq(15, 40, by = 5))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-21-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. The y-axis has breaks starting at 15 and ending at 40, increasing by 5." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can use <code>labels</code> in the same way (a character vector the same length as <code>breaks</code>), but you can also set it to <code>NULL</code> to suppress the labels altogether. This is useful for maps, or for publishing plots where you can’t share the absolute numbers.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
scale_x_continuous(labels = NULL) +
|
||||
scale_y_continuous(labels = NULL)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-22-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. The x and y-axes do not have any labels at the axis ticks." width="576"/></p>
|
||||
</div>
|
||||
</div>
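<p>Supplying explicit label text works the same way as supplying breaks. The sketch below pairs three breaks with three labels; the label wording ("20 mpg" and so on) and the name <code>y_scale</code> are illustrative choices, not something prescribed by ggplot2.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# breaks and labels must have the same length: one label per break.
y_scale <- scale_y_continuous(
  breaks = c(20, 30, 40),
  labels = c("20 mpg", "30 mpg", "40 mpg")
)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  y_scale</pre>
</div>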
|
||||
<p>The <code>labels</code> argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc. The plot on the left shows default labelling with <code>label_dollar()</code>, which adds a dollar sign as well as a comma as the thousands separator. The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix “K” (for “thousands”) as well as adding custom breaks. Note that <code>breaks</code> is in the original scale of the data.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
|
||||
ggplot(diamonds, aes(x = cut, y = price)) +
|
||||
geom_boxplot(alpha = 0.05) +
|
||||
scale_y_continuous(labels = scales::label_dollar())
|
||||
|
||||
# Right
|
||||
ggplot(diamonds, aes(x = cut, y = price)) +
|
||||
geom_boxplot(alpha = 0.05) +
|
||||
scale_y_continuous(
|
||||
labels = scales::label_dollar(scale = 1/1000, suffix = "K"),
|
||||
breaks = seq(1000, 19000, by = 6000)
|
||||
)</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-23-1.png" alt="Two side-by-side box plots of price versus cut of diamonds. The outliers are transparent. On both plots the y-axis labels are formatted as dollars. The y-axis labels on the left plot start at $0 and go to $15,000, increasing by $5,000. The y-axis labels on the right plot start at $1K and go to $19K, increasing by $6K." width="576"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-23-2.png" alt="Two side-by-side box plots of price versus cut of diamonds. The outliers are transparent. On both plots the y-axis labels are formatted as dollars. The y-axis labels on the left plot start at $0 and go to $15,000, increasing by $5,000. The y-axis labels on the right plot start at $1K and go to $19K, increasing by $6K." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Another handy label function is <code>label_percent()</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
|
||||
geom_bar(position = "fill") +
|
||||
scale_y_continuous(
|
||||
name = "Percentage",
|
||||
labels = scales::label_percent()
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-24-1.png" alt="Segmented bar plots of cut, filled with levels of clarity. The y-axis labels start at 0% and go to 100%, increasing by 25%. The y-axis label name is &quot;Percentage&quot;." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use <code>breaks</code> and <code>labels</code> to control the appearance of legends. Collectively axes and legends are called <strong>guides</strong>. Axes are used for x and y aesthetics; legends are used for everything else.</p>
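<p>For a legend on a discrete scale, <code>labels</code> can be a named vector that maps each data value to the text shown next to its key. The following sketch relabels the drive train codes in <code>mpg</code>; the legend title and label text are illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Names are the values found in mpg$drv; values are the text to display.
drv_labels <- c("4" = "4-wheel", "f" = "front", "r" = "rear")

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(name = "Drive train", labels = drv_labels)</pre>
</div>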
|
||||
<p>Another use of <code>breaks</code> is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">presidential |>
|
||||
mutate(id = 33 + row_number()) |>
|
||||
ggplot(aes(x = start, y = id)) +
|
||||
geom_point() +
|
||||
geom_segment(aes(xend = end, yend = id)) +
|
||||
scale_x_date(name = NULL, breaks = presidential$start, date_labels = "'%y")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-25-1.png" alt="Line plot of id number of presidents versus the year they started their presidency. Start year is marked with a point and a segment that starts there and ends at the end of the presidency. The x-axis labels are formatted as two digit years starting with an apostrophe, e.g., '53." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that the specification of breaks and labels for date and datetime scales is a little different:</p>
|
||||
<ul><li><p><code>date_labels</code> takes a format specification, in the same form as <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">parse_datetime()</a></code>.</p></li>
|
||||
<li><p><code>date_breaks</code> (not shown here) takes a string like “2 days” or “1 month”.</p></li>
|
||||
</ul></section>
|
||||
|
||||
<section id="legend-layout" data-type="sect2">
|
||||
<h2>
|
||||
Legend layout</h2>
|
||||
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p>
|
||||
<p>To control the overall position of the legend, you need to use a <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">base <- ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class))
|
||||
|
||||
base + theme(legend.position = "left")
|
||||
base + theme(legend.position = "top")
|
||||
base + theme(legend.position = "bottom")
|
||||
base + theme(legend.position = "right") # the default</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-26-1.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-26-2.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-26-3.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-26-4.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p>
|
||||
<p>To control the display of individual legends, use <code><a href="https://ggplot2.tidyverse.org/reference/guides.html">guides()</a></code> along with <code><a href="https://ggplot2.tidyverse.org/reference/guide_legend.html">guide_legend()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/guide_colourbar.html">guide_colorbar()</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
theme(legend.position = "bottom") +
|
||||
guides(color = guide_legend(nrow = 1, override.aes = list(size = 4)))
|
||||
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Overlaid on the plot is a smooth curve. The legend is in the bottom and classes are listed horizontally in a row. The points in the legend are larger than the points in the plot." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="replacing-a-scale" data-type="sect2">
|
||||
<h2>
|
||||
Replacing a scale</h2>
|
||||
<p>Rather than just tweaking the details a little, you can replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and color scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and color, you’ll be able to quickly pick up other scale replacements.</p>
|
||||
<p>It’s very useful to plot transformations of your variable. For example, it’s easier to see the precise relationship between <code>carat</code> and <code>price</code> if we log transform them:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
|
||||
ggplot(diamonds, aes(x = carat, y = price)) +
|
||||
geom_bin2d()
|
||||
|
||||
# Right
|
||||
ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
|
||||
geom_bin2d()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-28-1.png" alt="Two plots of price versus carat of diamonds. Data binned and the color of the rectangles representing each bin based on the number of points that fall into that bin. In the plot on the right, price and carat values are logged and the axis labels shows the logged values." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-28-2.png" alt="Two plots of price versus carat of diamonds. Data binned and the color of the rectangles representing each bin based on the number of points that fall into that bin. In the plot on the right, price and carat values are logged and the axis labels shows the logged values." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can do it with the scale. This is visually identical, except the axes are labelled on the original data scale.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
|
||||
geom_bin2d() +
|
||||
scale_x_log10() +
|
||||
scale_y_log10()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-29-1.png" alt="Plot of price versus carat of diamonds. Data binned and the color of the rectangles representing each bin based on the number of points that fall into that bin. The axis labels are on the original data scale." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Another scale that is frequently customized is color. The default categorical scale picks colors that are evenly spaced around the color wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of color blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = drv))
|
||||
|
||||
ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = drv)) +
|
||||
scale_color_brewer(palette = "Set1")</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-30-1.png" alt="Two scatterplots of highway mileage versus engine size where points are colored by drive type. The plot on the left uses the default ggplot2 color palette and the plot on the right uses a different color palette." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-30-2.png" alt="Two scatterplots of highway mileage versus engine size where points are colored by drive type. The plot on the left uses the default ggplot2 color palette and the plot on the right uses a different color palette." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Don’t forget simpler techniques. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = drv, shape = drv)) +
|
||||
scale_color_brewer(palette = "Set1")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-31-1.png" alt="Scatterplot of highway mileage versus engine size where both color and shape of points are based on drive type. The color palette is not the default ggplot2 palette." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The ColorBrewer scales are documented online at <a href="https://colorbrewer2.org/" class="uri">https://colorbrewer2.org/</a> and made available in R via the <strong>RColorBrewer</strong> package, by Erich Neuwirth. <a href="#fig-brewer" data-type="xref">#fig-brewer</a> shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code> to make a continuous variable into a categorical variable.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-brewer"><p><img src="communication_files/figure-html/fig-brewer-1.png" alt="All ColorBrewer scales. One group goes from light to dark colors. Another group is a set of non-ordinal colors. And the last group has diverging scales (from dark to light to dark again). Within each set there are a number of palettes." width="576"/></p>
|
||||
<figcaption>All ColorBrewer scales.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
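<p>As a sketch of how a sequential palette pairs with <code>cut()</code>, the example below bins city mileage into three ordered groups and colors the points with the "Blues" palette. The cut points (15 and 20) and the names <code>mpg_binned</code> and <code>cty_group</code> are arbitrary choices for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

# Turn the continuous cty variable into three ordered bins.
mpg_binned <- mpg |>
  mutate(cty_group = cut(cty, breaks = c(0, 15, 20, 40)))

# A sequential palette conveys the low-to-high ordering of the bins.
ggplot(mpg_binned, aes(x = displ, y = hwy, color = cty_group)) +
  geom_point() +
  scale_color_brewer(palette = "Blues")</pre>
</div>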
|
||||
<p>When you have a predefined mapping between values and colors, use <code><a href="https://ggplot2.tidyverse.org/reference/scale_manual.html">scale_color_manual()</a></code>. For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">presidential |>
|
||||
mutate(id = 33 + row_number()) |>
|
||||
ggplot(aes(x = start, y = id, color = party)) +
|
||||
geom_point() +
|
||||
geom_segment(aes(xend = end, yend = id)) +
|
||||
scale_color_manual(values = c(Republican = "red", Democratic = "blue"))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-33-1.png" alt="Line plot of id number of presidents versus the year they started their presidency. Start year is marked with a point and a segment that starts there and ends at the end of the presidency. Democratic presidents are represented in blue and Republicans in red." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>For continuous color, you can use the built-in <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_color_gradient()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_fill_gradient()</a></code>. If you have a diverging scale, you can use <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_color_gradient2()</a></code>. That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.</p>
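<p>A minimal sketch of a diverging scale: the example centers highway mileage on its mean and colors points by the difference, so below-average and above-average cars get different hues. The derived column <code>hwy_diff</code> and the specific colors are illustrative assumptions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

# Difference from the mean: negative is below average, positive is above.
mpg_centered <- mpg |>
  mutate(hwy_diff = hwy - mean(hwy))

# midpoint = 0 places the neutral color at the mean.
ggplot(mpg_centered, aes(x = displ, y = hwy, color = hwy_diff)) +
  geom_point() +
  scale_color_gradient2(
    low = "blue", mid = "gray80", high = "red",
    midpoint = 0
  )</pre>
</div>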
|
||||
<p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = rnorm(10000),
|
||||
y = rnorm(10000)
|
||||
)
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
coord_fixed() +
|
||||
labs(title = "Default, continuous")
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
coord_fixed() +
|
||||
scale_fill_viridis_c() +
|
||||
labs(title = "Viridis, continuous")
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
coord_fixed() +
|
||||
scale_fill_viridis_b() +
|
||||
labs(title = "Viridis, binned")</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-34-1.png" alt="Three hex plots where the color of the hexes show the number of observations that fall into that hex bin. The first plot uses the default, continuous ggplot2 scale. The second plot uses the viridis, continuous scale, and the third plot uses the viridis, binned scale." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-34-2.png" alt="Three hex plots where the color of the hexes show the number of observations that fall into that hex bin. The first plot uses the default, continuous ggplot2 scale. The second plot uses the viridis, continuous scale, and the third plot uses the viridis, binned scale." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-34-3.png" alt="Three hex plots where the color of the hexes show the number of observations that fall into that hex bin. The first plot uses the default, continuous ggplot2 scale. The second plot uses the viridis, continuous scale, and the third plot uses the viridis, binned scale." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that all color scales come in two varieties: <code>scale_color_x()</code> and <code>scale_fill_x()</code> for the <code>color</code> and <code>fill</code> aesthetics respectively (the color scales are available in both UK and US spellings).</p>
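<p>The short sketch below illustrates both points: the two spellings are aliases for the same function, and a bar chart, which is colored through <code>fill</code>, needs the <code>scale_fill_</code> variant of a palette. The choice of the "Blues" palette is arbitrary.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# The UK and US spellings refer to the same function.
identical(scale_colour_viridis_c, scale_color_viridis_c)

# Bars are colored through the fill aesthetic, so use a scale_fill_*() scale.
ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar() +
  scale_fill_brewer(palette = "Blues")</pre>
</div>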
|
||||
</section>
|
||||
|
||||
<section id="zooming" data-type="sect2">
|
||||
<h2>
|
||||
Zooming</h2>
|
||||
<p>There are three ways to control the plot limits:</p>
|
||||
<ol type="1"><li>Adjusting what data are plotted.</li>
|
||||
<li>Setting the limits in each scale.</li>
|
||||
<li>Setting <code>xlim</code> and <code>ylim</code> in <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>.</li>
|
||||
</ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>. Compare the following two plots:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth() +
|
||||
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
|
||||
|
||||
mpg |>
|
||||
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
|
||||
ggplot(aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-35-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-35-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
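<p>For comparison, here is a sketch of the second option from the list, setting <code>limits</code> on the position scales. With the default settings, values outside the limits are dropped before the smoother is fitted, so the result resembles the filtered plot rather than the <code>coord_cartesian()</code> one, and ggplot2 typically warns about the removed rows. The names <code>x_limits</code> and <code>y_limits</code> are illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

x_limits <- c(5, 7)
y_limits <- c(10, 30)

# Out-of-range observations are discarded, so geom_smooth() only sees
# the data that fall inside the limits.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  scale_x_continuous(limits = x_limits) +
  scale_y_continuous(limits = y_limits)</pre>
</div>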
|
||||
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to <em>expand</em> the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">suv <- mpg |> filter(class == "suv")
|
||||
compact <- mpg |> filter(class == "compact")
|
||||
|
||||
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point()
|
||||
|
||||
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-36-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-36-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>One way to overcome this problem is to share scales across multiple plots, training the scales with the <code>limits</code> of the full data.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">x_scale <- scale_x_continuous(limits = range(mpg$displ))
|
||||
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
|
||||
col_scale <- scale_color_discrete(limits = unique(mpg$drv))
|
||||
|
||||
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point() +
|
||||
x_scale +
|
||||
y_scale +
|
||||
col_scale
|
||||
|
||||
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point() +
|
||||
x_scale +
|
||||
y_scale +
|
||||
col_scale</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-37-1.png" width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-37-2.png" width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.</p>
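<p>The faceting alternative might look like the following sketch. Facets share their x, y, and color scales by default, so the two panels are directly comparable without building the scales by hand. The name <code>two_classes</code> is illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

# Keep only the two classes being compared.
two_classes <- mpg |>
  filter(class %in% c("suv", "compact"))

# One panel per class, with scales shared across panels.
ggplot(two_classes, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  facet_wrap(~class)</pre>
</div>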
|
||||
</section>
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
<p>Why doesn’t the following code override the default scale?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = rnorm(10000),
|
||||
y = rnorm(10000)
|
||||
)
|
||||
|
||||
ggplot(df, aes(x, y)) +
|
||||
geom_hex() +
|
||||
scale_color_gradient(low = "white", high = "red") +
|
||||
coord_fixed()</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>What is the first argument to every scale? How does it compare to <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code>?</p></li>
|
||||
<li>
|
||||
<p>Change the display of the presidential terms by:</p>
|
||||
<ol type="a"><li>Combining the two variants shown above.</li>
|
||||
<li>Improving the display of the y axis.</li>
|
||||
<li>Labelling each term with the name of the president.</li>
|
||||
<li>Adding informative plot labels.</li>
|
||||
<li>Placing breaks every 4 years (this is trickier than it seems!).</li>
|
||||
</ol></li>
|
||||
<li>
|
||||
<p>Use <code>override.aes</code> to make the legend on the following plot easier to see.</p>
|
||||
<div class="cell" data-fig.format="png">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
|
||||
geom_point(aes(color = cut), alpha = 1/20)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-39-1.png" style="width:50.0%" alt="Scatterplot of price versus carat of diamonds. The points are colored by cut of the diamonds and they're very transparent."/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="sec-themes" data-type="sect1">
|
||||
<h1>
|
||||
Themes</h1>
|
||||
<p>Finally, you can customize the non-data elements of your plot with a theme:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(aes(color = class)) +
|
||||
geom_smooth(se = FALSE) +
|
||||
theme_bw()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-40-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>ggplot2 includes eight themes by default, as shown in <a href="#fig-themes" data-type="xref">#fig-themes</a>. Many more are included in add-on packages like <strong>ggthemes</strong> (<a href="https://jrnold.github.io/ggthemes" class="uri">https://jrnold.github.io/ggthemes</a>), by Jeffrey Arnold. You can also create your own themes, if you are trying to match a particular corporate or journal style.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-themes"><p><img src="images/visualization-themes.png" alt="Eight barplots created with ggplot2, each with one of the eight built-in themes: theme_bw() - White background with grid lines, theme_light() - Light axes and grid lines, theme_classic() - Classic theme, axes but no grid lines, theme_linedraw() - Only black lines, theme_dark() - Dark background for contrast, theme_minimal() - Minimal theme, no background, theme_gray() - Gray background (default theme), theme_void() - Empty theme, only geoms are visible." width="1600"/></p>
|
||||
<figcaption>The eight themes built into ggplot2.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
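<p>If you do want your own theme, one common approach is to wrap a built-in theme in a function and layer a few <code>theme()</code> overrides on top. This is only a sketch: <code>theme_report()</code> is a hypothetical name, and the particular overrides are arbitrary examples rather than a recommended style.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical house style: start from theme_minimal() and adjust a few
# elements. base_size sets the base font size in points.
theme_report <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  labs(title = "Highway mileage by engine size") +
  theme_report()</pre>
</div>
<p>Wrapping the overrides in a function means the same style can be added to every plot in a report with a single call.</p>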
|
||||
<p>Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.</p>
|
||||
<p>It’s also possible to control individual components of each theme, like the size and color of the font used for the y axis. We’ve already seen that <code>legend.position</code> controls where the legend is drawn. There are many other aspects of the legend that can be customized with <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code>. For example, in the plot below we change the direction of the legend and put a black border around it. A few other helpful <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> components are used to change the placement and format of the title and caption text.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
|
||||
geom_point() +
|
||||
labs(
|
||||
title = "Highway mileage decreases as engine size increases",
|
||||
caption = "Source: https://fueleconomy.gov."
|
||||
) +
|
||||
theme(
|
||||
legend.position = c(0.6, 0.7),
|
||||
legend.direction = "horizontal",
|
||||
legend.box.background = element_rect(color = "black"),
|
||||
plot.title = element_text(face = "bold"),
|
||||
plot.title.position = "plot",
|
||||
plot.caption.position = "plot",
|
||||
plot.caption = element_text(hjust = 0)
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-42-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
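<p>Individual components can also be layered on top of one of the built-in themes. The following is a minimal sketch rather than a recommended style: the particular components chosen here (<code>panel.grid.minor</code>, <code>legend.title</code>) are only illustrative. Note that the <code>theme()</code> call comes after <code>theme_bw()</code>, because adding a complete theme replaces any settings made before it.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Start from a built-in theme, then adjust a few individual components.
# The tweaks must come after theme_bw(); a complete theme added later
# would overwrite them.
p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  theme_bw() +
  theme(
    panel.grid.minor = element_blank(),           # drop the minor grid lines
    legend.title = element_text(face = "italic"), # italicize the legend title
    legend.position = "bottom"                    # move the legend below the plot
  )

p</pre>
</div>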
|
||||
<p>For an overview of all <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> components, see help with <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">?theme</a></code>. The <a href="https://ggplot2-book.org/">ggplot2 book</a> is also a great place to go for the full details on theming.</p>
|
||||
|
||||
<section id="exercises-3" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>Pick a theme offered by the ggthemes package and apply it to the last plot you made.</li>
|
||||
<li>Make the axis labels of your plot blue and bolded.</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="layout" data-type="sect1">
|
||||
<h1>
|
||||
Layout</h1>
|
||||
<p>So far we’ve talked about how to create and modify a single plot. What if you have multiple plots you want to lay out in a certain way? The patchwork package allows you to combine separate plots into the same graphic. We loaded this package earlier in the chapter.</p>
|
||||
<p>To place two plots next to each other, you can simply add them to each other. Note that you first need to create the plots and save them as objects (in the following example they’re called <code>p1</code> and <code>p2</code>). Then, you place them next to each other with <code>+</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
labs(title = "Plot 1")
|
||||
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
|
||||
geom_boxplot() +
|
||||
labs(title = "Plot 2")
|
||||
p1 + p2</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-43-1.png" alt="Two plots (a scatterplot of highway mileage versus engine size and a side-by-side boxplots of highway mileage versus drive train) placed next to each other." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>It’s important to note that in the above code chunk we did not use a new function from the patchwork package. Instead, the package added a new functionality to the <code>+</code> operator.</p>
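<p>One way to see this for yourself is to save the combined plot and inspect it. The sketch below assumes ggplot2 and patchwork are installed; the object name <code>combined</code> and the title text are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(patchwork)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# Adding two ggplots produces a patchwork object rather than a single ggplot
combined <- p1 + p2
class(combined)

# The result can be extended further with patchwork functions
combined + plot_annotation(title = "Two views of highway mileage")</pre>
</div>
<p>Because the combined plot is an ordinary R object, you can store it, add to it later, or pass it to <code>ggsave()</code> like any other plot.</p>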
|
||||
<p>You can also create arbitrary plot layouts with patchwork. In the following, <code>|</code> places <code>p1</code> and <code>p3</code> next to each other and <code>/</code> moves <code>p2</code> to the next line.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
|
||||
geom_point() +
|
||||
labs(title = "Plot 3")
|
||||
(p1 | p3) / p2</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-44-1.png" alt="Three plots laid out such that the first and third plots are next to each other and the second plot is stretched beneath them. The first plot is a scatterplot of highway mileage versus engine size, the third plot is a scatterplot of highway mileage versus city mileage, and the second plot is side-by-side boxplots of highway mileage versus drive train." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Additionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots. In the following, we have 5 plots. We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with <code>& theme(legend.position = "top")</code>. Note the use of the <code>&</code> operator here instead of the usual <code>+</code>. This is because we’re modifying the theme for the patchwork plot as opposed to the individual ggplots. The legend is placed on top, inside the <code><a href="https://patchwork.data-imaginist.com/reference/guide_area.html">guide_area()</a></code>. Finally, we have also customized the heights of the various components of our patchwork – the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatter plot 4. Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">p1 <- ggplot(mpg, aes(x = drv, y = cty, color = drv)) +
|
||||
geom_boxplot(show.legend = FALSE) +
|
||||
labs(title = "Plot 1")
|
||||
|
||||
p2 <- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) +
|
||||
geom_boxplot(show.legend = FALSE) +
|
||||
labs(title = "Plot 2")
|
||||
|
||||
p3 <- ggplot(mpg, aes(x = cty, color = drv, fill = drv)) +
|
||||
geom_density(alpha = 0.5) +
|
||||
labs(title = "Plot 3")
|
||||
|
||||
p4 <- ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) +
|
||||
geom_density(alpha = 0.5) +
|
||||
labs(title = "Plot 4")
|
||||
|
||||
p5 <- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
|
||||
geom_point(show.legend = FALSE) +
|
||||
facet_wrap(~drv) +
|
||||
labs(title = "Plot 5")
|
||||
|
||||
(guide_area() / (p1 + p2) / (p3 + p4) / p5) +
|
||||
plot_annotation(
|
||||
title = "City and highway mileage for cars with different drive trains",
|
||||
    caption = "Source: https://fueleconomy.gov."
|
||||
) +
|
||||
plot_layout(
|
||||
guides = "collect",
|
||||
heights = c(1, 3, 2, 4)
|
||||
) &
|
||||
  theme(legend.position = "top")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-45-1.png" alt="Five plots laid out such that first two plots are next to each other. Plots three and four are underneath them. And the fifth plot stretches under them. The patchworked plot is titled "City and highway mileage for cars with different drive trains" and captioned "Source: https://fueleconomy.gov". The first two plots are side-by-side box plots. Plots 3 and 4 are density plots. And the fifth plot is a faceted scatterplot. Each of these plots shows geoms colored by drive train, but the patchworked plot has only one legend that applies to all of them, above the plots and beneath the title." width="576"/></p>
|
||||
</div>
|
||||
</div>
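<p>The same ideas are easier to see in a smaller patchwork. The following sketch stacks two density plots, collects their shared legend, and gives the second row twice the height of the first. The object names <code>top</code>, <code>bottom</code>, and <code>stacked</code> are made up for this illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(patchwork)

top <- ggplot(mpg, aes(x = cty, fill = drv)) +
  geom_density(alpha = 0.5)
bottom <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.5)

# guides = "collect" merges the two identical legends into one;
# heights = c(1, 2) makes the second row twice as tall as the first;
# & applies the theme setting to every plot in the patchwork
stacked <- (top / bottom) +
  plot_layout(guides = "collect", heights = c(1, 2)) &
  theme(legend.position = "top")

stacked</pre>
</div>
<p>The values passed to <code>heights</code> are relative, so <code>c(1, 2)</code> and <code>c(2, 4)</code> should give the same proportions.</p>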
|
||||
<p>If you’d like to learn more about combining and laying out multiple plots with patchwork, we recommend looking through the guides on the package website: <a href="https://patchwork.data-imaginist.com" class="uri">https://patchwork.data-imaginist.com</a>.</p>
|
||||
|
||||
<section id="exercises-4" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
<p>What happens if you omit the parentheses in the following plot layout? Can you explain why this happens?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
labs(title = "Plot 1")
|
||||
p2 <- ggplot(mpg, aes(x = drv, y = hwy)) +
|
||||
geom_boxplot() +
|
||||
labs(title = "Plot 2")
|
||||
p3 <- ggplot(mpg, aes(x = cty, y = hwy)) +
|
||||
geom_point() +
|
||||
labs(title = "Plot 3")
|
||||
|
||||
(p1 | p2) / p3</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-46-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Using the three plots from the previous exercise, recreate the following patchwork.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="communication_files/figure-html/unnamed-chunk-47-1.png" alt="Three plots: Plot 1 is a scatterplot of highway mileage versus engine size. Plot 2 is side-by-side box plots of highway mileage versus drive train. Plot 3 is side-by-side box plots of city mileage versus drive train. Plot 1 is on the first row. Plots 2 and 3 are on the next row, each spanning half the width of Plot 1. Plot 1 is labelled "Fig. A", Plot 2 is labelled "Fig. B", and Plot 3 is labelled "Fig. C"." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter you’ve learned about adding plot labels such as title, subtitle, and caption, as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graph using both simple and complex plot layouts.</p>
|
||||
<p>While you’ve so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we’ve barely scratched the surface of what you can create with ggplot2. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, <a href="https://ggplot2-book.org"><em>ggplot2: Elegant Graphics for Data Analysis</em></a>. Other useful resources are the <a href="https://r-graphics.org"><em>R Graphics Cookbook</em></a> by Winston Chang and <a href="https://clauswilke.com/dataviz/"><em>Fundamentals of Data Visualization</em></a> by Claus Wilke.</p>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
|
@ -3,7 +3,8 @@
|
|||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you’ll learn how to read plain-text rectangular files into R.</p>
|
||||
<p>Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you’ve learned to your own data at some point. In this chapter, you’ll learn the basics of reading data files into R.</p>
|
||||
<p>Specifically, this chapter will focus on reading plain-text rectangular files. We’ll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you’ll learn how to handcraft data frames in R.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -18,7 +19,7 @@ Prerequisites</h2>
|
|||
<section id="reading-data-from-a-file" data-type="sect1">
|
||||
<h1>
|
||||
Reading data from a file</h1>
|
||||
<p>To begin we’ll focus on the most rectangular data file type: the CSV, short for comma separate values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows give the data.</p>
|
||||
<p>To begin, we’ll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data.</p>
|
||||
<div class="cell">
|
||||
<pre><code>#> Student ID,Full Name,favourite.food,mealPlan,AGE
|
||||
#> 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
|
||||
|
@ -83,13 +84,13 @@ Reading data from a file</h1>
|
|||
#> ℹ Use `spec()` to retrieve the full column specification for this data.
|
||||
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
|
||||
</div>
|
||||
<p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr and we’ll come back to in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
|
||||
<p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we’ll return to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
|
||||
|
||||
<section id="practical-advice" data-type="sect2">
|
||||
<h2>
|
||||
Practical advice</h2>
|
||||
<p>Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the <code>students</code> data with that in mind.</p>
|
||||
<p>In the <code>favourite.food</code> column, there are a bunch of food items and then the character string <code>N/A</code>, which should have been an real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
|
||||
<p>In the <code>favourite.food</code> column, there are a bunch of food items, and then the character string <code>N/A</code>, which should have been a real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">students <- read_csv("data/students.csv", na = c("N/A", ""))
|
||||
|
||||
|
@ -104,7 +105,7 @@ students
|
|||
#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
|
||||
#> 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
|
||||
</div>
|
||||
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by back ticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those back ticks:</p>
|
||||
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those backticks:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">students |>
|
||||
rename(
|
||||
|
@ -134,7 +135,7 @@ students
|
|||
#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
|
||||
#> 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
|
||||
</div>
|
||||
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_type</code> is a categorical variable with a known set of possible values, which in R should be represent as factor:</p>
|
||||
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_type</code> is a categorical variable with a known set of possible values, which in R should be represented as a factor:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">students |>
|
||||
janitor::clean_names() |>
|
||||
|
@ -151,8 +152,8 @@ students
|
|||
#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
|
||||
#> 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
|
||||
</div>
|
||||
<p>Note that the values in the <code>meal_type</code> variable has stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<code><chr></code>) to factor (<code><fct></code>). You’ll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
|
||||
<p>Before you move on to analyzing these data, you’ll probably want to fix the <code>age</code> column as well: currently it’s a character variable because of the one observation that is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
|
||||
<p>Note that the values in the <code>meal_type</code> variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<code><chr></code>) to factor (<code><fct></code>). You’ll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
|
||||
<p>Before you analyze these data, you’ll probably want to fix the <code>age</code> column. Currently, it’s a character variable because one of the observations is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">students <- students |>
|
||||
janitor::clean_names() |>
|
||||
|
@ -177,7 +178,7 @@ students
|
|||
<section id="other-arguments" data-type="sect2">
|
||||
<h2>
|
||||
Other arguments</h2>
|
||||
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read csv files that you’ve created in a string:</p>
|
||||
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read CSV files that you’ve created in a string:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"a,b,c
|
||||
|
@ -190,7 +191,7 @@ Other arguments</h2>
|
|||
#> 1 1 2 3
|
||||
#> 2 4 5 6</pre>
|
||||
</div>
|
||||
<p>Usually <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But sometime there are a few lines of metadata at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
|
||||
<p>Usually, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But it’s not uncommon for a few lines of metadata to be included at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"The first line of metadata
|
||||
|
@ -215,7 +216,7 @@ read_csv(
|
|||
#> <dbl> <dbl> <dbl>
|
||||
#> 1 1 2 3</pre>
|
||||
</div>
|
||||
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings, and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
|
||||
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"1,2,3
|
||||
|
@ -228,7 +229,7 @@ read_csv(
|
|||
#> 1 1 2 3
|
||||
#> 2 4 5 6</pre>
|
||||
</div>
|
||||
<p>Alternatively you can pass <code>col_names</code> a character vector which will be used as the column names:</p>
|
||||
<p>Alternatively, you can pass <code>col_names</code> a character vector which will be used as the column names:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">read_csv(
|
||||
"1,2,3
|
||||
|
@ -241,19 +242,19 @@ read_csv(
|
|||
#> 1 1 2 3
|
||||
#> 2 4 5 6</pre>
|
||||
</div>
|
||||
<p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and carefully read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>’s many other arguments.)</p>
|
||||
<p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>’s many other arguments.)</p>
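<p>These arguments can be combined in a single call. The following is a small sketch on made-up data: the metadata line, the column names, and the values are all hypothetical. It wraps the string in <code>I()</code>, which newer versions of readr recommend for marking literal data, and uses <code>skip = 2</code> to drop both the metadata line and the original header, since the replacement names are supplied through <code>col_names</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Hypothetical file contents: one line of metadata, a header row, three data rows
raw <- "Exported from the school database
id,score
1,10
2,N/A
3,30"

scores <- read_csv(
  I(raw),                                # I() marks the string as literal data
  skip = 2,                              # skip the metadata line and the old header
  col_names = c("student_id", "score"),  # supply our own column names
  na = c("N/A", ""),                     # treat "N/A" and empty strings as missing
  show_col_types = FALSE                 # quiet the column specification message
)

scores</pre>
</div>
<p>If you wanted to keep the header from the file instead, you would use <code>skip = 1</code> and leave out <code>col_names</code>.</p>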
|
||||
</section>
|
||||
|
||||
<section id="other-file-types" data-type="sect2">
|
||||
<h2>
|
||||
Other file types</h2>
|
||||
<p>Once you’ve mastered <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:</p>
|
||||
<ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon separated files. These use <code>;</code> instead of <code>,</code> to separate fields, and are common in countries that use <code>,</code> as the decimal marker.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> reads tab delimited files.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimited if you don’t specify it.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed width files. You can specify fields either by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or their position with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed width files where columns are separated by white space.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache style log files.</p></li>
|
||||
<ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon-separated files. These use <code>;</code> instead of <code>,</code> to separate fields and are common in countries that use <code>,</code> as the decimal marker.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> reads tab-delimited files.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed-width files. You can specify fields by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or by their positions with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed-width files where columns are separated by white space.</p></li>
|
||||
<li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache-style log files.</p></li>
|
||||
</ul></section>
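<p>As a rough illustration of the functions listed above, the sketch below reads three tiny inline datasets that differ only in their delimiter. The data and the object names (<code>semi</code>, <code>tabs</code>, <code>pipes</code>) are made up, and the strings are wrapped in <code>I()</code> so readr treats them as literal data rather than file paths.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Semicolon-separated fields, with "," as the decimal mark
semi <- read_csv2(I("a;b\n1,5;2\n3,0;4"), show_col_types = FALSE)

# Tab-separated fields
tabs <- read_tsv(I("a\tb\n1\t2"), show_col_types = FALSE)

# Any other delimiter can be given explicitly to read_delim()
pipes <- read_delim(I("a|b\n1|2"), delim = "|", show_col_types = FALSE)

semi
tabs
pipes</pre>
</div>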
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
|
@ -263,7 +264,7 @@ Exercises</h2>
|
|||
<li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> have in common?</p></li>
|
||||
<li><p>What are the most important arguments to <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code>?</p></li>
|
||||
<li>
|
||||
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. What argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify to read the following text into a data frame?</p>
|
||||
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. To read the following text into a data frame, what argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">"x,y\n1,'a,b'"</pre>
|
||||
</div>
|
||||
|
@ -281,9 +282,9 @@ read_csv("a;b\n1;3")</pre>
|
|||
<li>
|
||||
<p>Practice referring to non-syntactic names in the following data frame by:</p>
|
||||
<ol type="a"><li>Extracting the variable called <code>1</code>.</li>
|
||||
<li>Plotting a scatterplot of <code>1</code> vs <code>2</code>.</li>
|
||||
<li>Creating a new column called <code>3</code> which is <code>2</code> divided by <code>1</code>.</li>
|
||||
<li>Renaming the columns to <code>one</code>, <code>two</code> and <code>three</code>.</li>
|
||||
<li>Plotting a scatterplot of <code>1</code> vs. <code>2</code>.</li>
|
||||
<li>Creating a new column called <code>3</code>, which is <code>2</code> divided by <code>1</code>.</li>
|
||||
<li>Renaming the columns to <code>one</code>, <code>two</code>, and <code>three</code>.</li>
|
||||
</ol><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">annoying <- tibble(
|
||||
`1` = 1:10,
|
||||
|
@ -297,15 +298,15 @@ read_csv("a;b\n1;3")</pre>
|
|||
<section id="sec-col-types" data-type="sect1">
|
||||
<h1>
|
||||
Controlling column types</h1>
|
||||
<p>A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and if needed, how to supply the column types yourself. Finally, we’ll mention a couple of general strategies that are a useful if readr is failing catastrophically and you need to get more insight in to the structure of your file.</p>
|
||||
<p>A CSV file doesn’t contain any information about the type of each variable (i.e., whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.</p>
|
||||
|
||||
<section id="guessing-types" data-type="sect2">
|
||||
<h2>
|
||||
Guessing types</h2>
|
||||
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring an missing values. It then works through the following questions:</p>
|
||||
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:</p>
|
||||
<ul><li>Does it contain only <code>F</code>, <code>T</code>, <code>FALSE</code>, or <code>TRUE</code> (ignoring case)? If so, it’s a logical.</li>
|
||||
<li>Does it contain only numbers (e.g. <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, it’s a number.</li>
|
||||
<li>Does it match match the ISO8601 standard? If so, it’s a date or date-time. (We’ll come back to date/times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>).</li>
|
||||
<li>Does it contain only numbers (e.g., <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, it’s a number.</li>
|
||||
<li>Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll return to date-times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>).</li>
|
||||
<li>Otherwise, it must be a string.</li>
|
||||
</ul><p>You can see that behavior in action in this simple example:</p>
|
||||
<div class="cell">
|
||||
|
@ -332,13 +333,13 @@ Guessing types</h2>
|
|||
#> 2 FALSE 4.5 2021-02-15 def
|
||||
#> 3 TRUE Inf 2021-02-16 ghi</pre>
|
||||
</div>
|
||||
<p>This heuristic works well if you have a clean dataset, but in real life you’ll encounter a selection of weird and wonderful failures.</p>
|
||||
<p>This heuristic works well if you have a clean dataset, but in real life, you’ll encounter a selection of weird and beautiful failures.</p>
|
||||
</section>
|
||||
|
||||
<section id="missing-values-column-types-and-problems" data-type="sect2">
|
||||
<h2>
|
||||
Missing values, column types, and problems</h2>
|
||||
<p>The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type. One of the most common causes for this a missing value, recorded using something other than the <code>NA</code> that stringr expects.</p>
|
||||
<p>The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the <code>NA</code> that readr expects.</p>
|
||||
<p>Take this simple one-column CSV file as an example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">csv <- "
|
||||
|
@ -359,7 +360,7 @@ Missing values, column types, and problems</h2>
|
|||
#> ℹ Use `spec()` to retrieve the full column specification for this data.
|
||||
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
|
||||
</div>
|
||||
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled amongst them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
|
||||
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled among them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- read_csv(csv, col_types = list(x = col_double()))
|
||||
#> Warning: One or more parsing issues, call `problems()` on your data frame for
|
||||
|
@ -371,9 +372,9 @@ Missing values, column types, and problems</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">problems(df)
|
||||
#> # A tibble: 1 × 5
|
||||
#> row col expected actual file
|
||||
#> <int> <int> <chr> <chr> <chr>
|
||||
#> 1 3 1 a double . /private/tmp/RtmpZYGhlj/file9e8176037b8c</pre>
|
||||
#> row col expected actual file
|
||||
#> <int> <int> <chr> <chr> <chr>
|
||||
#> 1 3 1 a double . /private/tmp/Rtmp1nE0XP/file11b88112257a4</pre>
|
||||
</div>
|
||||
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So when we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
|
||||
<div class="cell">
|
||||
|
@ -395,11 +396,11 @@ Column types</h2>
|
|||
<ul><li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_logical()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_double()</a></code> read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.</li>
|
||||
<li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We distinguish because integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
|
||||
<li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_character()</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies some object but that it doesn’t make sense to (e.g.) divide in half.</li>
|
||||
<li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates and date-time respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code>, and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
|
||||
<li>
|
||||
<code><a href="https://readr.tidyverse.org/reference/parse_number.html">col_number()</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li>
|
||||
<li>
|
||||
|
@ -498,7 +499,7 @@ read_csv("students-2.csv")
|
|||
#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
|
||||
#> 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
|
||||
</div>
|
||||
<p>This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two main options:</p>
|
||||
<p>This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load in. There are two main alternatives:</p>
|
||||
<ol type="1"><li>
|
||||
<p><code><a href="https://readr.tidyverse.org/reference/read_rds.html">write_rds()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_rds.html">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="https://rdrr.io/r/base/readRDS.html">readRDS()</a></code> and <code><a href="https://rdrr.io/r/base/readRDS.html">saveRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
|
||||
<div class="cell">
|
||||
|
@ -516,7 +517,7 @@ read_rds("students.rds")
|
|||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:</p>
|
||||
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in <a href="#chp-arrow" data-type="xref">#chp-arrow</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(arrow)
|
||||
write_parquet(students, "students.parquet")
|
||||
|
@ -532,7 +533,7 @@ read_parquet("students.parquet")
|
|||
#> 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol><p>Parquet tends to be much faster than RDS and is usable outside of R, but does require you install the arrow package.</p>
|
||||
</ol><p>Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.</p>
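<p>The difference from CSV can be sketched with a small round trip. The tibble, its column names, and the temporary file paths below are made up for illustration; the point is only that a factor column comes back as a character vector from CSV but keeps its type in RDS:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)
library(tibble)

# Hypothetical data with a factor column
meals <- tibble(
  id = 1:3,
  meal = factor(c("lunch", "dinner", "lunch"))
)

# Write the same data to a CSV file and an RDS file in temporary locations
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write_csv(meals, csv_path)
write_rds(meals, rds_path)

# CSV stores only text, so the factor has to be re-specified on import
from_csv <- read_csv(csv_path, show_col_types = FALSE)
class(from_csv$meal)

# RDS stores the R object itself, so the factor survives
from_rds <- read_rds(rds_path)
class(from_rds$meal)</pre>
</div>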
|
||||
</section>
|
||||
|
||||
<section id="data-entry" data-type="sect1">
|
||||
|
@ -586,7 +587,7 @@ Data entry</h1>
|
|||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how csv files work, some of the problems you might encounter, and how to overcome them. We’ll come to data import a few times in this book: <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> from Excel and googlesheets, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> from JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> from websites.</p>
|
||||
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll return to data import a few times in this book: <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> covers Excel and Google Sheets, <a href="#chp-databases" data-type="xref">#chp-databases</a> shows you how to load data from databases, <a href="#chp-arrow" data-type="xref">#chp-arrow</a> covers parquet files, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> covers JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> covers websites.</p>
|
||||
<p>Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p>
|
||||
|
||||
|
||||
|
|
|
@ -12,12 +12,12 @@ Introduction</h1>
|
|||
— Hadley Wickham</p>
|
||||
</blockquote>
|
||||
<p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
|
||||
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to simple toy dataset. Then we’ll dive into the main tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data, without changing any of the values. We’ll finish up with a discussion of usefully untidy data, and how you can create it if needed.</p>
|
||||
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
|
||||
<p>In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
|
@ -28,7 +28,7 @@ Prerequisites</h2>
|
|||
<section id="sec-tidy-data" data-type="sect1">
|
||||
<h1>
|
||||
Tidy data</h1>
|
||||
<p>You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables: <em>country</em>, <em>year</em>, <em>population</em>, and <em>cases</em> of TB (tuberculosis), but each dataset organizes the values in a different way.</p>
|
||||
<p>You can represent the same underlying data in multiple ways. The example below shows the same data organized in four different ways. Each dataset shows the same values of four variables: <em>country</em>, <em>year</em>, <em>population</em>, and <em>cases</em> of TB (tuberculosis), but each dataset organizes the values in a different way.</p>
|
||||
|
||||
<!-- TODO redraw as tables -->
|
||||
<div class="cell">
|
||||
|
@ -83,7 +83,7 @@ table4b # population
|
|||
<p>These are all representations of the same underlying data, but they are not equally easy to use. One of them, <code>table1</code>, will be much easier to work with inside the tidyverse because it’s tidy.</p>
|
||||
<p>There are three interrelated rules that make a dataset tidy:</p>
|
||||
<ol type="1"><li>Each variable is a column; each column is a variable.</li>
|
||||
<li>Each observation is row; each row is an observation.</li>
|
||||
<li>Each observation is a row; each row is an observation.</li>
|
||||
<li>Each value is a cell; each cell is a single value.</li>
|
||||
</ol><p><a href="#fig-tidy-structure" data-type="xref">#fig-tidy-structure</a> shows the rules visually.</p>
|
||||
<div class="cell">
|
||||
|
@ -96,8 +96,8 @@ table4b # population
|
|||
</div>
|
||||
<p>Why ensure that your data is tidy? There are two main advantages:</p>
|
||||
<ol type="1"><li><p>There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.</p></li>
|
||||
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
|
||||
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with <code>table1</code>.</p>
|
||||
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
|
||||
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with <code>table1</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># Compute rate per 10,000
|
||||
table1 |>
|
||||
|
@ -124,12 +124,12 @@ table1 |>
|
|||
#> 2 2000 296920
|
||||
|
||||
# Visualise changes over time
|
||||
ggplot(table1, aes(year, cases)) +
|
||||
ggplot(table1, aes(x = year, y = cases)) +
|
||||
geom_line(aes(group = country), color = "grey50") +
|
||||
geom_point(aes(color = country, shape = country)) +
|
||||
scale_x_continuous(breaks = c(1999, 2000))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-tidy_files/figure-html/unnamed-chunk-5-1.png" alt="This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale." width="480"/></p>
|
||||
<p><img src="data-tidy_files/figure-html/unnamed-chunk-5-1.png" alt="This figure shows the number of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale." width="480"/></p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
@ -166,15 +166,15 @@ Data in column names</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">billboard
|
||||
#> # A tibble: 317 × 79
|
||||
#> artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
|
||||
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
|
||||
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
|
||||
#> 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
|
||||
#> 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
|
||||
#> 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
|
||||
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
|
||||
#> # … with 311 more rows, 68 more variables: wk9 <dbl>, wk10 <dbl>,
|
||||
#> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
|
||||
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
|
||||
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
|
||||
#> 3 3 Doors… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
|
||||
#> 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59
|
||||
#> 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
|
||||
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
|
||||
#> # … with 311 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
|
||||
#> # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
|
||||
#> # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
|
||||
#> # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
|
||||
|
@ -261,7 +261,7 @@ billboard_tidy
|
|||
<p>Now we’re in a good position to look at how song ranks vary over time by drawing a plot. The code is shown below and the result is <a href="#fig-billboard-ranks" data-type="xref">#fig-billboard-ranks</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">billboard_tidy |>
|
||||
ggplot(aes(week, rank, group = track)) +
|
||||
ggplot(aes(x = week, y = rank, group = track)) +
|
||||
geom_line(alpha = 1/3) +
|
||||
scale_y_reverse()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -339,21 +339,21 @@ Many variables in column names</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">who2
|
||||
#> # A tibble: 7,240 × 58
|
||||
#> country year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65
|
||||
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 Afghanistan 1980 NA NA NA NA NA NA NA
|
||||
#> 2 Afghanistan 1981 NA NA NA NA NA NA NA
|
||||
#> 3 Afghanistan 1982 NA NA NA NA NA NA NA
|
||||
#> 4 Afghanistan 1983 NA NA NA NA NA NA NA
|
||||
#> 5 Afghanistan 1984 NA NA NA NA NA NA NA
|
||||
#> 6 Afghanistan 1985 NA NA NA NA NA NA NA
|
||||
#> # … with 7,234 more rows, 49 more variables: sp_f_014 <dbl>,
|
||||
#> # sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>,
|
||||
#> # sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>,
|
||||
#> # sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>,
|
||||
#> # sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>,
|
||||
#> # sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>,
|
||||
#> # ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, ep_m_3544 <dbl>, …</pre>
|
||||
#> country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
|
||||
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 Afghanist… 1980 NA NA NA NA NA NA
|
||||
#> 2 Afghanist… 1981 NA NA NA NA NA NA
|
||||
#> 3 Afghanist… 1982 NA NA NA NA NA NA
|
||||
#> 4 Afghanist… 1983 NA NA NA NA NA NA
|
||||
#> 5 Afghanist… 1984 NA NA NA NA NA NA
|
||||
#> 6 Afghanist… 1985 NA NA NA NA NA NA
|
||||
#> # … with 7,234 more rows, and 50 more variables: sp_m_65 <dbl>,
|
||||
#> # sp_f_014 <dbl>, sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>,
|
||||
#> # sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>,
|
||||
#> # sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>,
|
||||
#> # sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>,
|
||||
#> # sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>,
|
||||
#> # sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, …</pre>
|
||||
</div>
|
||||
<p>This dataset records tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>sn</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
|
||||
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
|
||||
|
@ -446,15 +446,15 @@ Widening data</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_experience
|
||||
#> # A tibble: 500 × 5
|
||||
#> org_pac_id org_nm measure_cd measure_title prf_r…¹
|
||||
#> org_pac_id org_nm measure_cd measure_title prf_rate
|
||||
#> <chr> <chr> <chr> <chr> <dbl>
|
||||
#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS … 63
|
||||
#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS … 87
|
||||
#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS … 86
|
||||
#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS … 57
|
||||
#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS … 85
|
||||
#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS … 24
|
||||
#> # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
|
||||
#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS… 63
|
||||
#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS… 87
|
||||
#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS… 86
|
||||
#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS… 57
|
||||
#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS… 85
|
||||
#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS… 24
|
||||
#> # … with 494 more rows</pre>
|
||||
</div>
|
||||
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
|
||||
<div class="cell">
|
||||
|
@ -479,17 +479,16 @@ Widening data</h2>
|
|||
values_from = prf_rate
|
||||
)
|
||||
#> # A tibble: 500 × 9
|
||||
#> org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
|
||||
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 0446157747 USC CAR… CAHPS … 63 NA NA NA NA NA
|
||||
#> 2 0446157747 USC CAR… CAHPS … NA 87 NA NA NA NA
|
||||
#> 3 0446157747 USC CAR… CAHPS … NA NA 86 NA NA NA
|
||||
#> 4 0446157747 USC CAR… CAHPS … NA NA NA 57 NA NA
|
||||
#> 5 0446157747 USC CAR… CAHPS … NA NA NA NA 85 NA
|
||||
#> 6 0446157747 USC CAR… CAHPS … NA NA NA NA NA 24
|
||||
#> # … with 494 more rows, and abbreviated variable names ¹measure_title,
|
||||
#> # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
|
||||
#> # ⁷CAHPS_GRP_12</pre>
|
||||
#> org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3
|
||||
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
|
||||
#> 1 0446157747 USC CARE MEDI… CAHPS for MI… 63 NA NA
|
||||
#> 2 0446157747 USC CARE MEDI… CAHPS for MI… NA 87 NA
|
||||
#> 3 0446157747 USC CARE MEDI… CAHPS for MI… NA NA 86
|
||||
#> 4 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
|
||||
#> 5 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
|
||||
#> 6 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
|
||||
#> # … with 494 more rows, and 3 more variables: CAHPS_GRP_5 <dbl>,
|
||||
#> # CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl></pre>
|
||||
</div>
|
||||
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organization. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
|
||||
<div class="cell">
|
||||
|
@ -500,16 +499,16 @@ Widening data</h2>
|
|||
values_from = prf_rate
|
||||
)
|
||||
#> # A tibble: 95 × 8
|
||||
#> org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 0446157747 USC CARE MEDICA… 63 87 86 57 85 24
|
||||
#> 2 0446162697 ASSOCIATION OF … 59 85 83 63 88 22
|
||||
#> 3 0547164295 BEAVER MEDICAL … 49 NA 75 44 73 12
|
||||
#> 4 0749333730 CAPE PHYSICIANS… 67 84 85 65 82 24
|
||||
#> 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64 87 28
|
||||
#> 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
|
||||
#> # … with 89 more rows, and abbreviated variable names ¹CAHPS_GRP_1,
|
||||
#> # ²CAHPS_GRP_2, ³CAHPS_GRP_3, ⁴CAHPS_GRP_5, ⁵CAHPS_GRP_8, ⁶CAHPS_GRP_12</pre>
|
||||
#> org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5
|
||||
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 0446157747 USC CARE MEDICA… 63 87 86 57
|
||||
#> 2 0446162697 ASSOCIATION OF … 59 85 83 63
|
||||
#> 3 0547164295 BEAVER MEDICAL … 49 NA 75 44
|
||||
#> 4 0749333730 CAPE PHYSICIANS… 67 84 85 65
|
||||
#> 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64
|
||||
#> 6 0840109864 REX HOSPITAL INC 73 87 84 67
|
||||
#> # … with 89 more rows, and 2 more variables: CAHPS_GRP_8 <dbl>,
|
||||
#> # CAHPS_GRP_12 <dbl></pre>
|
||||
</div>
|
||||
<p>This gives us the output that we’re looking for.</p>
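<p>The same idea can be tried on a toy table. The following is a minimal sketch with made-up organizations, measures, and column names; it only illustrates how <code>id_cols</code> collapses the result to one row per organization:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)

# Hypothetical data: two organizations, two measures each
scores <- tribble(
  ~org_id, ~org_name,  ~measure, ~measure_title,  ~rate,
  "A",     "Clinic A", "m1",     "Timely care",   63,
  "A",     "Clinic A", "m2",     "Communication", 87,
  "B",     "Clinic B", "m1",     "Timely care",   59,
  "B",     "Clinic B", "m2",     "Communication", 85
)

# Only the org_* columns identify a row, so measure_title is dropped
# and each organization ends up on a single row
wide <- scores |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure,
    values_from = rate
  )
wide</pre>
</div>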
|
||||
</section>
|
||||
|
@ -826,7 +825,7 @@ Pragmatic computation</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">cms_patient_care |>
|
||||
filter(type == "observed") |>
|
||||
ggplot(aes(score)) +
|
||||
ggplot(aes(x = score)) +
|
||||
geom_histogram(binwidth = 2) +
|
||||
facet_wrap(vars(measure_abbr))
|
||||
#> Warning: Removed 1 rows containing non-finite values (`stat_bin()`).</pre>
|
||||
|
@ -842,7 +841,7 @@ Pragmatic computation</h2>
|
|||
names_from = measure_abbr,
|
||||
values_from = score
|
||||
) |>
|
||||
ggplot(aes(dyspnea_screening, dyspena_treatment)) +
|
||||
ggplot(aes(x = dyspnea_screening, y = dyspena_treatment)) +
|
||||
geom_point() +
|
||||
coord_equal()</pre>
|
||||
</div>
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
<h1>
|
||||
Introduction</h1>
|
||||
<p>Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the <strong>dplyr</strong> package and a new dataset on flights that departed New York City in 2013.</p>
|
||||
<p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll come back these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).</p>
|
||||
<p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action. We’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g., numbers, strings, dates).</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -13,14 +13,16 @@ Prerequisites</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
|
||||
library(tidyverse)
|
||||
#> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ──
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
|
||||
#> ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
|
||||
#> ✔ tidyr 1.2.1.9001 ✔ stringr 1.4.1.9000
|
||||
#> ✔ readr 2.1.3 ✔ forcats 0.5.2
|
||||
#> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
|
||||
#> ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
|
||||
#> ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
|
||||
#> ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
|
||||
#> ✔ purrr 1.0.1
|
||||
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
|
||||
#> ✖ dplyr::filter() masks stats::filter()
|
||||
#> ✖ dplyr::lag() masks stats::lag()</pre>
|
||||
#> ✖ dplyr::lag() masks stats::lag()
|
||||
#> ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors</pre>
|
||||
</div>
|
||||
<p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr masks some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
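<p>As a minimal sketch of why the full name matters, the two <code>filter()</code> functions do unrelated jobs. The vector <code>x</code> and the data frame <code>df</code> below are made up for illustration and are not part of the chapter’s data.</p>

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# Base R's stats::filter() applies a linear filter to a numeric series;
# here the weights describe a two-point moving average.
smoothed <- stats::filter(x, rep(1 / 2, 2))

# dplyr::filter() keeps the rows of a data frame that match a condition.
df <- tibble(x = x)
kept <- dplyr::filter(df, x > 3)
kept
```

<p>Writing the package name explicitly means the call does the same thing regardless of which packages happen to be loaded.</p>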
|
||||
</section>
|
||||
|
@ -32,21 +34,45 @@ nycflights13</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. To see everything you can use <code>print(flights, width = Inf)</code> to show everything in the console, but it’s generally more convenient to instead use <code>View(flights)</code> to open the dataset in the scrollable RStudio viewer.</p>
|
||||
<p>You might have noticed the short abbreviations that follow each column name. These tell you the type of each variable: <code><int></code> is short for integer, <code><dbl></code> is short for double (aka real numbers), <code><chr></code> for character (aka strings), and <code><dttm></code> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.</p>
|
||||
<p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably <code>View(flights)</code>, which will open an interactive, scrollable, and filterable view. Otherwise you can use <code>print(flights, width = Inf)</code> to show all columns, or call <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">glimpse(flights)
|
||||
#> Rows: 336,776
|
||||
#> Columns: 19
|
||||
#> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
|
||||
#> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
|
||||
#> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
|
||||
#> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…
|
||||
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…
|
||||
#> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…
|
||||
#> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…
|
||||
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…
|
||||
#> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…
|
||||
#> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
|
||||
#> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…
|
||||
#> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
|
||||
#> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
|
||||
#> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
|
||||
#> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…
|
||||
#> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…
|
||||
#> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
|
||||
#> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…
|
||||
#> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…</pre>
|
||||
</div>
|
||||
<p>In both views, the variable names are followed by abbreviations that tell you the type of each variable: <code><int></code> is short for integer, <code><dbl></code> is short for double (aka real numbers), <code><chr></code> for character (aka strings), and <code><dttm></code> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.</p>
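<p>The abbreviations correspond to ordinary R classes, which you can also inspect directly. The sketch below builds a small hypothetical tibble, <code>toy</code>, with one column of each type mentioned above and looks up the first class of every column; the column values are invented for illustration.</p>

```r
library(dplyr)

# A tiny stand-in for flights with one column per type discussed above.
toy <- tibble(
  flight    = c(1545L, 1714L),                       # <int>
  dep_delay = c(2, 4),                               # <dbl>
  carrier   = c("UA", "AA"),                         # <chr>
  time_hour = as.POSIXct(
    c("2013-01-01 05:00:00", "2013-01-01 06:00:00"),
    tz = "UTC"
  )                                                  # <dttm>
)

glimpse(toy)

# First class of each column, as a named character vector.
col_types <- vapply(toy, function(col) class(col)[1], character(1))
col_types
```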
|
||||
</section>
|
||||
|
||||
<section id="dplyr-basics" data-type="sect2">
|
||||
|
@ -66,14 +92,14 @@ dplyr basics</h2>
|
|||
)</pre>
|
||||
</div>
|
||||
<p>The code starts with the <code>flights</code> dataset, then filters it, then groups it, then summarizes it. We’ll come back to the pipe and its alternatives in <a href="#sec-pipes" data-type="xref">#sec-pipes</a>.</p>
|
||||
<p>dplyr’s verbs are organised into four groups based on what they operate on: <strong>rows</strong>, <strong>columns</strong>, <strong>groups</strong>, or <strong>tables</strong>. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to verb that work on tables in <a href="#chp-joins" data-type="xref">#chp-joins</a>. Let’s dive in!</p>
|
||||
<p>dplyr’s verbs are organised into four groups based on what they operate on: <strong>rows</strong>, <strong>columns</strong>, <strong>groups</strong>, or <strong>tables</strong>. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to verbs that work on tables in <a href="#chp-joins" data-type="xref">#chp-joins</a>. Let’s dive in!</p>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="rows" data-type="sect1">
|
||||
<h1>
|
||||
Rows</h1>
|
||||
<p>The most important verbs that operate on rows are <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, which changes which rows are present without changing their order, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.</p>
|
||||
<p>The most important verbs that operate on rows are <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, which changes which rows are present without changing their order, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, which finds rows with unique values but, unlike <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, can also optionally modify the columns.</p>
|
||||
|
||||
<section id="filter" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -84,18 +110,18 @@ Rows</h1>
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(arr_delay > 120)
|
||||
#> # A tibble: 10,034 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 811 630 101 1047 830 137 MQ
|
||||
#> 2 2013 1 1 848 1835 853 1001 1950 851 MQ
|
||||
#> 3 2013 1 1 957 733 144 1056 853 123 UA
|
||||
#> 4 2013 1 1 1114 900 134 1447 1222 145 UA
|
||||
#> 5 2013 1 1 1505 1310 115 1638 1431 127 EV
|
||||
#> 6 2013 1 1 1525 1340 105 1831 1626 125 B6
|
||||
#> # … with 10,028 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 811 630 101 1047 830
|
||||
#> 2 2013 1 1 848 1835 853 1001 1950
|
||||
#> 3 2013 1 1 957 733 144 1056 853
|
||||
#> 4 2013 1 1 1114 900 134 1447 1222
|
||||
#> 5 2013 1 1 1505 1310 115 1638 1431
|
||||
#> 6 2013 1 1 1525 1340 105 1831 1626
|
||||
#> # … with 10,028 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>As well as <code>></code> (greater than), you can use <code>>=</code> (greater than or equal to), <code><</code> (less than), <code><=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
|
||||
<div class="cell">
|
||||
|
@ -103,35 +129,35 @@ Rows</h1>
|
|||
flights |>
|
||||
filter(month == 1 & day == 1)
|
||||
#> # A tibble: 842 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>
|
||||
|
||||
# Flights that departed in January or February
|
||||
flights |>
|
||||
filter(month == 1 | month == 2)
|
||||
#> # A tibble: 51,955 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 51,949 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
|
||||
<div class="cell">
|
||||
|
@ -139,18 +165,18 @@ flights |>
|
|||
flights |>
|
||||
filter(month %in% c(1, 2))
|
||||
#> # A tibble: 51,955 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 51,949 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
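<p>To make the equivalence concrete, here is a small sketch on a made-up tibble (<code>toy</code> is hypothetical, not part of nycflights13) showing that the <code>|</code> form and the <code>%in%</code> form keep the same rows.</p>

```r
library(dplyr)

toy <- tibble(month = c(1, 2, 3, 1, 12), flight = 1:5)

# Two ways of asking for January or February rows.
with_or <- toy |> filter(month == 1 | month == 2)
with_in <- toy |> filter(month %in% c(1, 2))

# Both keep the same flights, in the same order.
identical(with_or$flight, with_in$flight)
```

<p>The <code>%in%</code> form tends to scale better as the list of acceptable values grows, because you only extend the vector on the right.</p>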
|
||||
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code><-</code>:</p>
|
||||
|
@ -189,36 +215,36 @@ Common mistakes</h2>
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(year, month, day, dep_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(desc(dep_delay))
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 9 641 900 1301 1242 1530 1272 HA
|
||||
#> 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
|
||||
#> 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
|
||||
#> 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
|
||||
#> 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
|
||||
#> 6 2013 4 10 1100 1900 960 1342 2211 931 DL
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 9 641 900 1301 1242 1530
|
||||
#> 2 2013 6 15 1432 1935 1137 1607 2120
|
||||
#> 3 2013 1 10 1121 1635 1126 1239 1810
|
||||
#> 4 2013 9 20 1139 1845 1014 1457 2210
|
||||
#> 5 2013 7 22 845 1600 1005 1044 1815
|
||||
#> 6 2013 4 10 1100 1900 960 1342 2211
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival but left roughly on time:</p>
|
||||
<div class="cell">
|
||||
|
@ -226,21 +252,61 @@ Common mistakes</h2>
|
|||
filter(dep_delay <= 10 & dep_delay >= -10) |>
|
||||
arrange(desc(arr_delay))
|
||||
#> # A tibble: 239,109 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 11 1 658 700 -2 1329 1015 194 VX
|
||||
#> 2 2013 4 18 558 600 -2 1149 850 179 AA
|
||||
#> 3 2013 7 7 1659 1700 -1 2050 1823 147 US
|
||||
#> 4 2013 7 22 1606 1615 -9 2056 1831 145 DL
|
||||
#> 5 2013 9 19 648 641 7 1035 810 145 UA
|
||||
#> 6 2013 4 18 655 700 -5 1213 950 143 AA
|
||||
#> # … with 239,103 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 11 1 658 700 -2 1329 1015
|
||||
#> 2 2013 4 18 558 600 -2 1149 850
|
||||
#> 3 2013 7 7 1659 1700 -1 2050 1823
|
||||
#> 4 2013 7 22 1606 1615 -9 2056 1831
|
||||
#> 5 2013 9 19 648 641 7 1035 810
|
||||
#> 6 2013 4 18 655 700 -5 1213 950
|
||||
#> # … with 239,103 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
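<p>The same filter-then-sort pattern is easier to follow on a handful of rows. The sketch below uses an invented tibble, <code>toy</code>, and also shows <code>between()</code>, a dplyr helper that could stand in for the pair of comparisons; treating it as a replacement here is our suggestion rather than something the chapter does.</p>

```r
library(dplyr)

toy <- tibble(
  flight    = c("A", "B", "C", "D"),
  dep_delay = c(-2, 45, 5, -8),
  arr_delay = c(30, 60, 90, -5)
)

# Keep flights that left within 10 minutes of schedule,
# then put the latest arrivals first.
on_time_but_late <- toy |>
  filter(dep_delay >= -10 & dep_delay <= 10) |>
  arrange(desc(arr_delay))

# between() is inclusive at both ends, so this keeps the same rows.
with_between <- toy |>
  filter(between(dep_delay, -10, 10)) |>
  arrange(desc(arr_delay))

on_time_but_late
```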
|
||||
</section>
|
||||
|
||||
<section id="distinct" data-type="sect2">
|
||||
<h2>
|
||||
<code>distinct()</code>
|
||||
</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># This would remove any duplicate rows if there were any
|
||||
flights |>
|
||||
distinct()
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>
|
||||
|
||||
# This finds all unique origin and destination pairs.
|
||||
flights |>
|
||||
distinct(origin, dest)
|
||||
#> # A tibble: 224 × 2
|
||||
#> origin dest
|
||||
#> <chr> <chr>
|
||||
#> 1 EWR IAH
|
||||
#> 2 LGA IAH
|
||||
#> 3 JFK MIA
|
||||
#> 4 JFK BQN
|
||||
#> 5 LGA ATL
|
||||
#> 6 EWR ORD
|
||||
#> # … with 218 more rows</pre>
|
||||
</div>
|
||||
<p>Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> for <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and then filtering as needed.</p>
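<p>One way to follow that advice is sketched below on a small invented tibble (<code>toy</code> is hypothetical): <code>count()</code> tallies each origin and destination pair, and a later <code>filter()</code> on the resulting <code>n</code> column separates repeated pairs from pairs that occur once.</p>

```r
library(dplyr)

toy <- tibble(
  origin = c("EWR", "EWR", "JFK", "LGA", "LGA", "LGA"),
  dest   = c("IAH", "IAH", "MIA", "ATL", "ATL", "ORD")
)

# One row per origin/dest pair, with the number of occurrences in n.
pair_counts <- toy |> count(origin, dest)

# Pairs that appear more than once, and pairs that appear exactly once.
repeated    <- pair_counts |> filter(n > 1)
unique_only <- pair_counts |> filter(n == 1)

repeated
```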
|
||||
</section>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
|
@ -255,15 +321,16 @@ Exercises</h2>
|
|||
</ol></li>
|
||||
<li><p>Sort <code>flights</code> to find the flights with the longest departure delays. Find the flights that left earliest in the morning.</p></li>
|
||||
<li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li>
|
||||
<li><p>Which flights traveled the farthest? Which traveled the shortest?</p></li>
|
||||
<li><p>Does it matter what order you used <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
|
||||
<li><p>Was there a flight on every day of 2013?</p></li>
|
||||
<li><p>Which flights traveled the farthest distance? Which traveled the least distance?</p></li>
|
||||
<li><p>Does it matter in what order you use <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="columns" data-type="sect1">
|
||||
<h1>
|
||||
Columns</h1>
|
||||
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> creates new columns that are functions of the existing columns; <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> change which columns are present, their names, or their positions.</p>
|
||||
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> creates new columns that are functions of the existing columns; <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> change which columns are present, their names, or their positions. We’ll also discuss <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> since it allows you to get a column out of a data frame.</p>
|
||||
|
||||
<section id="sec-mutate" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -277,19 +344,18 @@ Columns</h1>
|
|||
speed = distance / air_time * 60
|
||||
)
|
||||
#> # A tibble: 336,776 × 21
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 11 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>, and abbreviated
|
||||
#> # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
|
||||
#> # ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 13 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>, gain <dbl>, speed <dbl></pre>
|
||||
</div>
|
||||
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
|
||||
<div class="cell">
|
||||
|
@ -300,21 +366,20 @@ Columns</h1>
|
|||
.before = 1
|
||||
)
|
||||
#> # A tibble: 336,776 × 21
|
||||
#> gain speed year month day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
|
||||
#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 -9 370. 2013 1 1 517 515 2 830 819
|
||||
#> 2 -16 374. 2013 1 1 533 529 4 850 830
|
||||
#> 3 -31 408. 2013 1 1 542 540 2 923 850
|
||||
#> 4 17 517. 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 19 394. 2013 1 1 554 600 -6 812 837
|
||||
#> 6 -16 288. 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
|
||||
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time</pre>
|
||||
#> gain speed year month day dep_time sched_dep_time dep_delay arr_time
|
||||
#> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl> <int>
|
||||
#> 1 -9 370. 2013 1 1 517 515 2 830
|
||||
#> 2 -16 374. 2013 1 1 533 529 4 850
|
||||
#> 3 -31 408. 2013 1 1 542 540 2 923
|
||||
#> 4 17 517. 2013 1 1 544 545 -1 1004
|
||||
#> 5 19 394. 2013 1 1 554 600 -6 812
|
||||
#> 6 -16 288. 2013 1 1 554 558 -4 740
|
||||
#> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
|
||||
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can the name of a variable name instead of a position. For example, we could add the new variables after <code>day:</code></p>
|
||||
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use the variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
|
@ -323,19 +388,18 @@ Columns</h1>
|
|||
.after = day
|
||||
)
|
||||
#> # A tibble: 336,776 × 21
|
||||
#> year month day gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
|
||||
#> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 -9 370. 517 515 2 830 819
|
||||
#> 2 2013 1 1 -16 374. 533 529 4 850 830
|
||||
#> 3 2013 1 1 -31 408. 542 540 2 923 850
|
||||
#> 4 2013 1 1 17 517. 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 19 394. 554 600 -6 812 837
|
||||
#> 6 2013 1 1 -16 288. 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
|
||||
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time</pre>
|
||||
#> year month day gain speed dep_time sched_dep_time dep_delay arr_time
|
||||
#> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl> <int>
|
||||
#> 1 2013 1 1 -9 370. 517 515 2 830
|
||||
#> 2 2013 1 1 -16 374. 533 529 4 850
|
||||
#> 3 2013 1 1 -31 408. 542 540 2 923
|
||||
#> 4 2013 1 1 17 517. 544 545 -1 1004
|
||||
#> 5 2013 1 1 19 394. 554 600 -6 812
|
||||
#> 6 2013 1 1 -16 288. 554 558 -4 740
|
||||
#> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
|
||||
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm></pre>
|
||||
</div>
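<p>As a minimal sketch of the name-versus-position point, the following uses a small made-up tibble (<code>trips</code> and its columns are hypothetical, not part of nycflights13) and places the same new column once by name and once by position:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data standing in for flights
trips <- tibble(
  id = 1:3,
  distance = c(100, 250, 400),
  minutes = c(60, 120, 150)
)

# Place the new column right after `id`, referring to the column by name
by_name <- trips |>
  mutate(speed = distance / minutes * 60, .after = id)

# The same placement expressed as a position (after the first column)
by_position <- trips |>
  mutate(speed = distance / minutes * 60, .after = 1)

names(by_name)
names(by_position)</pre>
</div>
<p>Both calls should give the same column order; referring to the column by name is usually more robust, because it keeps working if columns are later added or reordered.</p>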
|
||||
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which keeps only the inputs and outputs of your calculations:</p>
|
||||
<div class="cell">
|
||||
|
@ -397,18 +461,17 @@ flights |>
|
|||
flights |>
|
||||
select(!year:day)
|
||||
#> # A tibble: 336,776 × 16
|
||||
#> dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum
|
||||
#> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
|
||||
#> 1 517 515 2 830 819 11 UA 1545 N14228
|
||||
#> 2 533 529 4 850 830 20 UA 1714 N24211
|
||||
#> 3 542 540 2 923 850 33 AA 1141 N619AA
|
||||
#> 4 544 545 -1 1004 1022 -18 B6 725 N804JB
|
||||
#> 5 554 600 -6 812 837 -25 DL 461 N668DN
|
||||
#> 6 554 558 -4 740 728 12 UA 1696 N39463
|
||||
#> # … with 336,770 more rows, 7 more variables: origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
|
||||
#> # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
|
||||
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
|
||||
#> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 517 515 2 830 819 11 UA
|
||||
#> 2 533 529 4 850 830 20 UA
|
||||
#> 3 542 540 2 923 850 33 AA
|
||||
#> 4 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 554 600 -6 812 837 -25 DL
|
||||
#> 6 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, and 9 more variables: flight <int>,
|
||||
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
|
||||
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
|
||||
|
||||
# Select all columns that are characters
|
||||
flights |>
|
||||
|
@ -433,7 +496,7 @@ flights |>
|
|||
<code>contains("ijk")</code>: matches names that contain “ijk”.</li>
|
||||
<li>
|
||||
<code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li>
|
||||
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
|
||||
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
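<p>To make the helpers concrete, here is a small sketch on a made-up tibble (<code>scores</code> and its column names are assumptions chosen for illustration). It uses <code>num_range()</code>, <code>contains()</code>, and a simple regular expression with <code>matches()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data
scores <- tibble(
  name = c("a", "b"),
  x1 = 1:2,
  x2 = 3:4,
  x3 = 5:6,
  total_x = c(9L, 12L)
)

# Columns x1, x2, and x3, built from a prefix and a numeric range
by_range <- scores |> select(num_range("x", 1:3)) |> names()

# Any column whose name contains "total"
by_contains <- scores |> select(contains("total")) |> names()

# Columns whose whole name is "x" followed by a single digit
by_regex <- scores |> select(matches("^x[0-9]$")) |> names()

by_range
by_contains
by_regex</pre>
</div>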
|
||||
<p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
|
@ -460,18 +523,18 @@ flights |>
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
rename(tail_num = tailnum)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tail_num <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tail_num <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
|
||||
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code> which provides some useful automated cleaning.</p>
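<p>The difference between renaming inside <code>select()</code> and using <code>rename()</code> can be sketched on a tiny made-up tibble (<code>planes_toy</code> is hypothetical):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data
planes_toy <- tibble(
  tailnum = c("N1", "N2"),
  year = c(2001L, 2005L),
  seats = c(100L, 150L)
)

# select() keeps only the columns you mention, under their new names
selected <- planes_toy |> select(tail_num = tailnum)

# rename() changes the name and keeps every other column
renamed <- planes_toy |> rename(tail_num = tailnum)

names(selected)
names(renamed)</pre>
</div>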
|
||||
|
@ -486,51 +549,51 @@ flights |>
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
relocate(time_hour, air_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> time_hour air_time year month day dep_time sched_dep…¹ dep_d…²
|
||||
#> <dttm> <dbl> <int> <int> <int> <int> <int> <dbl>
|
||||
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2
|
||||
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4
|
||||
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2
|
||||
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1
|
||||
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6
|
||||
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4
|
||||
#> # … with 336,770 more rows, 11 more variables: arr_time <int>,
|
||||
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
|
||||
#> # tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, and abbreviated variable names ¹sched_dep_time, ²dep_delay</pre>
|
||||
#> time_hour air_time year month day dep_time sched_dep_time
|
||||
#> <dttm> <dbl> <int> <int> <int> <int> <int>
|
||||
#> 1 2013-01-01 05:00:00 227 2013 1 1 517 515
|
||||
#> 2 2013-01-01 05:00:00 227 2013 1 1 533 529
|
||||
#> 3 2013-01-01 05:00:00 160 2013 1 1 542 540
|
||||
#> 4 2013-01-01 05:00:00 183 2013 1 1 544 545
|
||||
#> 5 2013-01-01 06:00:00 116 2013 1 1 554 600
|
||||
#> 6 2013-01-01 05:00:00 150 2013 1 1 554 558
|
||||
#> # … with 336,770 more rows, and 12 more variables: dep_delay <dbl>,
|
||||
#> # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
|
||||
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
|
||||
#> # hour <dbl>, minute <dbl></pre>
|
||||
</div>
|
||||
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
relocate(year:dep_time, .after = time_hour)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
|
||||
#> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
|
||||
#> 1 515 2 830 819 11 UA 1545 N14228 EWR IAH
|
||||
#> 2 529 4 850 830 20 UA 1714 N24211 LGA IAH
|
||||
#> 3 540 2 923 850 33 AA 1141 N619AA JFK MIA
|
||||
#> 4 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
|
||||
#> 5 600 -6 812 837 -25 DL 461 N668DN LGA ATL
|
||||
#> 6 558 -4 740 728 12 UA 1696 N39463 EWR ORD
|
||||
#> # … with 336,770 more rows, 9 more variables: air_time <dbl>,
|
||||
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>,
|
||||
#> # month <int>, day <int>, dep_time <int>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
|
||||
#> sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
|
||||
#> <int> <dbl> <int> <int> <dbl> <chr> <int>
|
||||
#> 1 515 2 830 819 11 UA 1545
|
||||
#> 2 529 4 850 830 20 UA 1714
|
||||
#> 3 540 2 923 850 33 AA 1141
|
||||
#> 4 545 -1 1004 1022 -18 B6 725
|
||||
#> 5 600 -6 812 837 -25 DL 461
|
||||
#> 6 558 -4 740 728 12 UA 1696
|
||||
#> # … with 336,770 more rows, and 12 more variables: tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, year <int>, month <int>, day <int>,
|
||||
#> # dep_time <int>
|
||||
flights |>
|
||||
relocate(starts_with("arr"), .before = dep_time)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier
|
||||
#> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <int> <chr>
|
||||
#> 1 2013 1 1 830 11 517 515 2 819 UA
|
||||
#> 2 2013 1 1 850 20 533 529 4 830 UA
|
||||
#> 3 2013 1 1 923 33 542 540 2 850 AA
|
||||
#> 4 2013 1 1 1004 -18 544 545 -1 1022 B6
|
||||
#> 5 2013 1 1 812 -25 554 600 -6 837 DL
|
||||
#> 6 2013 1 1 740 12 554 558 -4 728 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹arr_delay, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵sched_arr_time</pre>
|
||||
#> year month day arr_time arr_delay dep_time sched_dep_time dep_delay
|
||||
#> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
|
||||
#> 1 2013 1 1 830 11 517 515 2
|
||||
#> 2 2013 1 1 850 20 533 529 4
|
||||
#> 3 2013 1 1 923 33 542 540 2
|
||||
#> 4 2013 1 1 1004 -18 544 545 -1
|
||||
#> 5 2013 1 1 812 -25 554 600 -6
|
||||
#> 6 2013 1 1 740 12 554 558 -4
|
||||
#> # … with 336,770 more rows, and 11 more variables: sched_arr_time <int>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
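<p>Because the <code>flights</code> output is wide, the effect of <code>relocate()</code> can be easier to see on a four-column toy tibble. This is only an illustrative sketch; <code>toy</code> and its columns are made up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data with columns a, b, c, d
toy <- tibble(a = 1, b = 2, c = 3, d = 4)

# Default: move the named column to the front (d, a, b, c)
front <- toy |> relocate(d)

# Move a range of columns after `a` (a, c, d, b)
after_a <- toy |> relocate(c:d, .after = a)

# Selection helpers work too: move columns starting with "d" before `b` (a, d, b, c)
before_b <- toy |> relocate(starts_with("d"), .before = b)

names(front)
names(after_a)
names(before_b)</pre>
</div>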
|
||||
</section>
|
||||
|
||||
|
@ -574,27 +637,27 @@ Groups</h1>
|
|||
group_by(month)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> # Groups: month [12]
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”.</p>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”. <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.</p>
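<p>One way to convince yourself of this is to inspect the grouping metadata directly. The sketch below uses a made-up tibble (<code>toy</code> is hypothetical) and compares the same <code>summarize()</code> call with and without grouping:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical toy data
toy <- tibble(
  month = c(1, 1, 2),
  delay = c(10, 20, 5)
)

grouped <- toy |> group_by(month)

# The rows are untouched; only the grouping metadata is new
nrow(grouped)
group_vars(grouped)
n_groups(grouped)

# The same verb now behaves differently
overall <- toy |> summarize(avg_delay = mean(delay))       # one row in total
by_month <- grouped |> summarize(avg_delay = mean(delay))  # one row per month

overall
by_month</pre>
</div>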
|
||||
</section>
|
||||
|
||||
<section id="sec-summarize" data-type="sect2">
|
||||
<h2>
|
||||
<code>summarize()</code>
|
||||
</h2>
|
||||
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slightly simplification; later on you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
|
||||
<p>The most important grouped operation is a summary, which collapses each group to a single row. In dplyr, this operation is performed by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code><span data-type="footnote">Or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, if you prefer British English.</span>, as shown by the following example, which computes the average departure delay by month:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(month) |>
|
||||
|
@ -665,7 +728,7 @@ The<code>slice_</code> functions</h2>
|
|||
<li>
|
||||
<code>df |> slice_max(x, n = 1)</code> takes the row with the largest value of <code>x</code>.</li>
|
||||
<li>
|
||||
<code>df |> slice_sample(x, n = 1)</code> takes one random row.</li>
|
||||
<code>df |> slice_sample(n = 1)</code> takes one random row.</li>
|
||||
</ul><p>You can vary <code>n</code> to select more than one row, or instead of <code>n =</code>, you can use <code>prop = 0.1</code> to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
|
@ -673,18 +736,18 @@ The<code>slice_</code> functions</h2>
|
|||
slice_max(arr_delay, n = 1)
|
||||
#> # A tibble: 108 × 19
|
||||
#> # Groups: dest [105]
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 7 22 2145 2007 98 132 2259 153 B6
|
||||
#> 2 2013 7 23 1139 800 219 1250 909 221 B6
|
||||
#> 3 2013 1 25 123 2000 323 229 2101 328 EV
|
||||
#> 4 2013 8 17 1740 1625 75 2042 2003 39 UA
|
||||
#> 5 2013 7 22 2257 759 898 121 1026 895 DL
|
||||
#> 6 2013 7 10 2056 1505 351 2347 1758 349 UA
|
||||
#> # … with 102 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 7 22 2145 2007 98 132 2259
|
||||
#> 2 2013 7 23 1139 800 219 1250 909
|
||||
#> 3 2013 1 25 123 2000 323 229 2101
|
||||
#> 4 2013 8 17 1740 1625 75 2042 2003
|
||||
#> 5 2013 7 22 2257 759 898 121 1026
|
||||
#> 6 2013 7 10 2056 1505 351 2347 1758
|
||||
#> # … with 102 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
|
||||
<div class="cell">
|
||||
|
@ -692,7 +755,7 @@ The<code>slice_</code> functions</h2>
|
|||
group_by(dest) |>
|
||||
summarize(max_delay = max(arr_delay, na.rm = TRUE))
|
||||
#> Warning: There was 1 warning in `summarize()`.
|
||||
#> ℹ In argument `max_delay = max(arr_delay, na.rm = TRUE)`.
|
||||
#> ℹ In argument: `max_delay = max(arr_delay, na.rm = TRUE)`.
|
||||
#> ℹ In group 52: `dest = "LGA"`.
|
||||
#> Caused by warning in `max()`:
|
||||
#> ! no non-missing arguments to max; returning -Inf
|
||||
|
@ -719,18 +782,18 @@ Grouping by multiple variables</h2>
|
|||
daily
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> # Groups: year, month, day [365]
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
|
||||
<div class="cell">
|
||||
|
@ -779,6 +842,66 @@ Exercises</h2>
|
|||
<li><p>How do delays vary over the course of the day? Illustrate your answer with a plot.</p></li>
|
||||
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code> and friends?</p></li>
|
||||
<li><p>Explain what <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> does in terms of the dplyr verbs you just learned. What does the <code>sort</code> argument to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> do?</p></li>
|
||||
<li>
|
||||
<p>Suppose we have the following tiny data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = 1:5,
|
||||
y = c("a", "b", "a", "a", "b"),
|
||||
z = c("K", "K", "L", "L", "K")
|
||||
)</pre>
|
||||
</div>
|
||||
<ol type="a"><li>
|
||||
<p>What does the following code do? Run it, analyze the result, and describe what <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> does.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(y)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>What does the following code do? Run it, analyze the result, and describe what <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> does. Also comment on how it’s different from the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> in part (a).</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
arrange(y)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(y) |>
|
||||
summarize(mean_x = mean(x))</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does. Then, comment on what the message says.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(y, z) |>
|
||||
summarize(mean_x = mean(x))</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does. How is the output different from the one in part (d)?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(y, z) |>
|
||||
summarize(mean_x = mean(x), .groups = "drop")</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>What do the following pipelines do? Run both, analyze the results, and describe what each pipeline does. How are the outputs of the two pipelines different?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(y, z) |>
|
||||
summarize(mean_x = mean(x))
|
||||
|
||||
df |>
|
||||
group_by(y, z) |>
|
||||
mutate(mean_x = mean(x))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
|
@ -795,18 +918,18 @@ Case study: aggregates and sample size</h1>
|
|||
n = n()
|
||||
)
|
||||
|
||||
ggplot(delays, aes(delay)) +
|
||||
ggplot(delays, aes(x = delay)) +
|
||||
geom_freqpoly(binwidth = 10)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-36-1.png" class="img-fluid" alt="A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours." width="576"/></p>
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-45-1.png" class="img-fluid" alt="A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Wow, there are some planes that have an <em>average</em> delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(delays, aes(n, delay)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(delays, aes(x = n, y = delay)) +
|
||||
geom_point(alpha = 1/10)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus after delay. Delays for planes with very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus average delay. Delays for planes with very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases<span data-type="footnote">*cough* the central limit theorem *cough*.</span>.</p>
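<p>The effect can be reproduced with a short base R simulation. This is only a sketch under stated assumptions: the delays are drawn from a normal distribution with an arbitrary mean and standard deviation (real delays are skewed), and the group sizes and number of replicates are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">set.seed(123)

# Hypothetical group sizes (e.g., number of flights per plane)
group_sizes <- c(5, 50, 500)

# For each size, simulate 200 groups, take each group's mean,
# and measure how much those means vary
spread_of_means <- sapply(group_sizes, function(n) {
  means <- replicate(200, mean(rnorm(n, mean = 10, sd = 30)))
  sd(means)
})
names(spread_of_means) <- group_sizes

spread_of_means</pre>
</div>
<p>The spread of the group means should shrink roughly in proportion to one over the square root of the group size, which is why small groups dominate the extremes of the scatterplot.</p>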
|
||||
|
@ -814,11 +937,11 @@ ggplot(delays, aes(delay)) +
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">delays |>
|
||||
filter(n > 25) |>
|
||||
ggplot(aes(n, delay)) +
|
||||
ggplot(aes(x = n, y = delay)) +
|
||||
geom_point(alpha = 1/10) +
|
||||
geom_smooth(se = FALSE)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-38-1.png" class="img-fluid" alt="Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights." width="576"/></p>
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-47-1.png" class="img-fluid" alt="Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from <code>|></code> to <code>+</code>, but it’s not too much of a hassle once you get the hang of it.</p>
|
||||
|
@ -848,11 +971,11 @@ batters
|
|||
</ol><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">batters |>
|
||||
filter(n > 100) |>
|
||||
ggplot(aes(n, perf)) +
|
||||
ggplot(aes(x = n, y = perf)) +
|
||||
geom_point(alpha = 1 / 10) +
|
||||
geom_smooth(se = FALSE)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-40-1.png" class="img-fluid" alt="A scatterplot of number of batting opportunites vs batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 at when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope reaching ~0.3 when n is ~15,000." width="576"/></p>
|
||||
<p><img src="data-transform_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" alt="A scatterplot of number of batting opportunities vs. batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope reaching ~0.3 when n is ~15,000." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>This also has important implications for ranking. If you naively sort on <code>desc(ba)</code>, the people with the best batting averages are clearly lucky, not skilled:</p>
|
||||
|
@ -876,7 +999,7 @@ batters
|
|||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with the individual variable. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
|
||||
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
|
||||
<p>For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, <code>|></code>, why we recommend it, and a little of the history that led from magrittr’s <code>%>%</code> to base R’s <code>|></code>.</p>
|
||||
|
||||
|
||||
|
|
|
@ -164,7 +164,7 @@ dbplyr basics</h1>
|
|||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, "diamonds")
|
||||
diamonds_db
|
||||
#> # Source: table<diamonds> [?? x 10]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> carat cut color clarity depth table price x y z
|
||||
#> <dbl> <fct> <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
|
||||
#> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
|
||||
|
@ -203,7 +203,7 @@ FROM `planes`</pre></div>
|
|||
|
||||
big_diamonds_db
|
||||
#> # Source: SQL [?? x 5]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> carat cut color clarity price
|
||||
#> <dbl> <fct> <fct> <fct> <int>
|
||||
#> 1 1.54 Premium E VS2 15002
|
||||
|
@ -293,7 +293,7 @@ planes |> show_query()
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
#> SELECT dest, AVG(dep_delay) AS dep_delay
|
||||
|
@ -399,11 +399,11 @@ FROM</h2>
|
|||
<section id="group-by" data-type="sect2">
|
||||
<h2>
|
||||
GROUP BY</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> is translated to the <code>SELECT</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
avg_price = mean(price, na.rm = TRUE)
|
||||
) |>
|
||||
|
@ -456,12 +456,12 @@ flights |>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(delay = mean(arr_delay))
|
||||
summarize(delay = mean(arr_delay))
|
||||
#> Warning: Missing values are always removed in SQL aggregation functions.
|
||||
#> Use `na.rm = TRUE` to silence this warning
|
||||
#> This warning is displayed once every 8 hours.
|
||||
#> # Source: SQL [?? x 2]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> dest delay
|
||||
#> <chr> <dbl>
|
||||
#> 1 ATL 11.3
|
||||
|
@ -489,7 +489,7 @@ flights |>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(n = n()) |>
|
||||
summarize(n = n()) |>
|
||||
filter(n > 100) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
|
@ -617,11 +617,11 @@ FROM flights</pre>
|
|||
<h1>
|
||||
Function translations</h1>
|
||||
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
|
||||
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
|
||||
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">summarize_query <- function(df, ...) {
|
||||
df |>
|
||||
summarise(...) |>
|
||||
summarize(...) |>
|
||||
show_query()
|
||||
}
|
||||
mutate_query <- function(df, ...) {
|
||||
|
@ -729,15 +729,18 @@ flights |>
|
|||
#> FROM flights</pre>
|
||||
</div>
|
||||
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
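<p>As a rough sketch of how you might explore these translations yourself, the following uses dbplyr’s <code>lazy_frame()</code> with a simulated backend, so no real database is needed. The <code>simulate_postgres()</code> backend, the <code>demo_db</code> table, and its <code>city</code> column are assumptions chosen for illustration; they are not the DuckDB connection used elsewhere in this chapter, and the exact SQL may vary between backends and dbplyr versions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(dbplyr)

# A table that exists only for translation purposes (hypothetical name and column).
# simulate_postgres() is an assumed stand-in backend, not a live connection.
demo_db <- lazy_frame(city = "a", con = simulate_postgres())

# toupper() is a base R string function that dbplyr knows how to translate
demo_db |>
  mutate(city_upper = toupper(city)) |>
  show_query()</pre>
</div>
<p>If a function has no translation, dbplyr generally passes the call through to the database unchanged, so inspecting the generated SQL like this is a quick way to check what will actually run.</p>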
|
||||
</section>
|
||||
|
||||
<section id="learning-more" data-type="sect2">
|
||||
<h2>
|
||||
Learning more</h2>
|
||||
<p>If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s <em>the</em> most commonly used language for working with data, and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
|
||||
<ul><li>
|
||||
<a href="https://sqlfordatascientists.com"><em>SQL for Data Scientists</em></a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
|
||||
<li>
|
||||
<a href="https://www.practicalsql.com"><em>Practical SQL</em></a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
|
||||
</ul></section>
|
||||
</ul><p>In the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.</p>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
||||
|
|
|
@ -33,9 +33,9 @@ Creating date/times</h1>
|
|||
<p>To get the current date or date-time you can use <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> or <code><a href="https://lubridate.tidyverse.org/reference/now.html">now()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">today()
|
||||
#> [1] "2022-11-18"
|
||||
#> [1] "2023-01-12"
|
||||
now()
|
||||
#> [1] "2022-11-18 11:36:09 CST"</pre>
|
||||
#> [1] "2023-01-12 17:04:08 CST"</pre>
|
||||
</div>
|
||||
<p>Otherwise, the following sections describe the four ways you’re likely to create a date/time:</p>
|
||||
<ul><li>While reading a file with readr.</li>
|
||||
|
@ -61,7 +61,7 @@ read_csv(csv)
|
|||
<p>If you haven’t heard of <strong>ISO8601</strong> before, it’s an international standard<span data-type="footnote"><a href="https://xkcd.com/1179/" class="uri">https://xkcd.com/1179/</a></span> for writing dates where the components of a date are organised from biggest to smallest separated by <code>-</code>. For example, in ISO8601 May 3 2022 is <code>2022-05-03</code>. ISO8601 dates can also include times, where hour, minute, and second are separated by <code>:</code>, and the date and time components are separated by either a <code>T</code> or a space. For example, you could write 4:26pm on May 3 2022 as either <code>2022-05-03 16:26</code> or <code>2022-05-03T16:26</code>.</p>
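<p>To make the component order concrete, here is a minimal sketch, assuming lubridate is loaded as it is elsewhere in this chapter. The variable name <code>d</code> is ours, chosen for illustration. Because ISO8601 runs from the biggest component to the smallest, the string <code>2022-05-03</code> parses with month 5 and day 3:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

# Year, then month, then day
d <- ymd("2022-05-03")
month(d)
#> [1] 5
day(d)
#> [1] 3

# Date and time separated by a space ...
ymd_hm("2022-05-03 16:26")

# ... or by a T
ymd_hms("2022-05-03T16:26:00")</pre>
</div>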
|
||||
<p>For other date-time formats, you’ll need to use <code>col_types</code> plus <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a <code>%</code> followed by a single character. For example, <code>%Y-%m-%d</code> specifies a date that’s a year, <code>-</code>, month (as number) <code>-</code>, day. Table <a href="#tbl-date-formats" data-type="xref">#tbl-date-formats</a> lists all the options.</p>
|
||||
<div id="tbl-date-formats" class="anchored">
|
||||
<table class="table"><caption>Table 17.1: All date formats understood by readr</caption>
|
||||
<table class="table"><caption>Table 19.1: All date formats understood by readr</caption>
|
||||
<thead><tr class="header"><th>Type</th>
|
||||
<th>Code</th>
|
||||
<th>Meaning</th>
|
||||
|
@ -256,20 +256,20 @@ flights_dt
|
|||
<p>With this data, we can visualize the distribution of departure times across the year:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
ggplot(aes(dep_time)) +
|
||||
ggplot(aes(x = dep_time)) +
|
||||
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-12-1.png" alt="A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Or within a single day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
filter(dep_time < ymd(20130102)) |>
|
||||
ggplot(aes(dep_time)) +
|
||||
ggplot(aes(x = dep_time)) +
|
||||
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-13-1.png" class="img-fluid" alt="A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-13-1.png" alt="A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.</p>
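<p>You can check these units yourself with a small sketch, assuming lubridate is loaded. The names <code>dt</code> and <code>d</code> are ours, chosen for illustration. Converting to numeric exposes the underlying count of seconds (for date-times) or days (for dates):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

# A date-time is stored as a number of seconds, so one day later is 86400 more
dt <- ymd_hms("2013-01-01 00:00:00")
as.numeric(dt + ddays(1)) - as.numeric(dt)
#> [1] 86400

# A date is stored as a number of days, so one day later is 1 more
d <- ymd("2013-01-01")
as.numeric(d + 1) - as.numeric(d)
#> [1] 1</pre>
</div>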
|
||||
|
@ -281,9 +281,9 @@ From other types</h2>
|
|||
<p>You may want to switch between a date-time and a date. That’s the job of <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">as_datetime(today())
|
||||
#> [1] "2022-11-18 UTC"
|
||||
#> [1] "2023-01-12 UTC"
|
||||
as_date(now())
|
||||
#> [1] "2022-11-18"</pre>
|
||||
#> [1] "2023-01-12"</pre>
|
||||
</div>
|
||||
<p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code>; if it’s in days, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>.</p>
|
||||
<div class="cell">
|
||||
|
@ -357,9 +357,9 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(wday = wday(dep_time, label = TRUE)) |>
|
||||
ggplot(aes(x = wday)) +
|
||||
geom_bar()</pre>
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000)." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-20-1.png" alt="A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>There’s an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!</p>
|
||||
|
@ -367,13 +367,14 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(minute = minute(dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
summarize(
|
||||
avg_delay = mean(dep_delay, na.rm = TRUE),
|
||||
n = n()) |>
|
||||
ggplot(aes(minute, avg_delay)) +
|
||||
geom_line()</pre>
|
||||
n = n()
|
||||
) |>
|
||||
ggplot(aes(x = minute, y = avg_delay)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour and 9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9)." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-21-1.png" alt="A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour and 9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Interestingly, if we look at the <em>scheduled</em> departure time we don’t see such a strong pattern:</p>
|
||||
|
@ -381,22 +382,24 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
<pre data-type="programlisting" data-code-language="r">sched_dep <- flights_dt |>
|
||||
mutate(minute = minute(sched_dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
summarize(
|
||||
avg_delay = mean(arr_delay, na.rm = TRUE),
|
||||
n = n())
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(sched_dep, aes(minute, avg_delay)) +
|
||||
ggplot(sched_dep, aes(x = minute, y = avg_delay)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16). There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-22-1.png" alt="A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16). There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
|
||||
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times, as <a href="#fig-human-rounding" data-type="xref">#fig-human-rounding</a> shows. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(sched_dep, aes(minute, n)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes." width="576"/></p>
|
||||
|
||||
<figure id="fig-human-rounding"><p><img src="datetimes_files/figure-html/fig-human-rounding-1.png" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes. " width="576"/></p>
|
||||
<figcaption>A frequency polygon showing the number of flights scheduled to depart at each minute of the hour. You can see a strong preference for round numbers like 0 and 30 and generally for numbers that are a multiple of five.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -408,33 +411,33 @@ Rounding</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
count(week = floor_date(dep_time, "week")) |>
|
||||
ggplot(aes(week, n)) +
|
||||
ggplot(aes(x = week, y = n)) +
|
||||
geom_line() +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-24-1.png" class="img-fluid" alt="A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights)." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-24-1.png" alt="A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights)." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can use rounding to show the distribution of flights across the course of a day by computing the difference between <code>dep_time</code> and the earliest instant of that day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
ggplot(aes(x = dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)
|
||||
#> Don't know how to automatically pick scale for object of type <difftime>.
|
||||
#> Defaulting to continuous.</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="A line plot with departure time on the x-axis. This is in units of seconds since midnight so it's hard to interpret." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-25-1.png" alt="A line plot with departure time on the x-axis. This is in units of seconds since midnight so it's hard to interpret." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Computing the difference between a pair of date-times yields a difftime (more on that in <a href="#sec-intervals" data-type="xref">#sec-intervals</a>). We can convert that to an <code>hms</code> object to get a more useful x-axis:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_dt |>
|
||||
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |>
|
||||
ggplot(aes(dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)</pre>
|
||||
ggplot(aes(x = dep_hour)) +
|
||||
geom_freqpoly(binwidth = 60 * 30)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (<100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again." width="576"/></p>
|
||||
<p><img src="datetimes_files/figure-html/unnamed-chunk-26-1.png" alt="A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (<100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -442,7 +445,7 @@ Rounding</h2>
|
|||
<section id="modifying-components" data-type="sect2">
|
||||
<h2>
|
||||
Modifying components</h2>
|
||||
<p>You can also use each accessor function to modify the components of a date/time:</p>
|
||||
<p>You can also use each accessor function to modify the components of a date/time. This doesn’t come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">(datetime <- ymd_hms("2026-07-08 12:34:56"))
|
||||
#> [1] "2026-07-08 12:34:56 UTC"
|
||||
|
@ -457,7 +460,7 @@ hour(datetime) <- hour(datetime) + 1
|
|||
datetime
|
||||
#> [1] "2030-01-08 13:34:56 UTC"</pre>
|
||||
</div>
|
||||
<p>Alternatively, rather than modifying an existing variabke, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
|
||||
<p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
|
||||
#> [1] "2030-02-02 02:34:56 UTC"</pre>
|
||||
|
@ -480,7 +483,7 @@ Exercises</h2>
|
|||
<li><p>How does the average delay time change over the course of a day? Should you use <code>dep_time</code> or <code>sched_dep_time</code>? Why?</p></li>
|
||||
<li><p>On what day of the week should you leave if you want to minimise the chance of a delay?</p></li>
|
||||
<li><p>What makes the distribution of <code>diamonds$carat</code> and <code>flights$sched_dep_time</code> similar?</p></li>
|
||||
<li><p>Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.</p></li>
|
||||
<li><p>Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
|
@ -504,12 +507,12 @@ Durations</h2>
|
|||
<pre data-type="programlisting" data-code-language="r"># How old is Hadley?
|
||||
h_age <- today() - ymd("1979-10-14")
|
||||
h_age
|
||||
#> Time difference of 15741 days</pre>
|
||||
#> Time difference of 15796 days</pre>
|
||||
</div>
|
||||
<p>A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the <strong>duration</strong>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">as.duration(h_age)
|
||||
#> [1] "1360022400s (~43.1 years)"</pre>
|
||||
#> [1] "1364774400s (~43.25 years)"</pre>
|
||||
</div>
|
||||
<p>Durations come with a bunch of convenient constructors:</p>
|
||||
<div class="cell">
|
||||
|
@ -691,7 +694,7 @@ Time zones</h1>
|
|||
<p>And see the complete list of all time zone names with <code><a href="https://rdrr.io/r/base/timezones.html">OlsonNames()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">length(OlsonNames())
|
||||
#> [1] 595
|
||||
#> [1] 596
|
||||
head(OlsonNames())
|
||||
#> [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
|
||||
#> [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"</pre>
|
||||
|
|
|
@ -138,7 +138,7 @@ General Social Survey</h1>
|
|||
</div>
|
||||
<p>Or with a bar chart:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(gss_cat, aes(race)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(gss_cat, aes(x = race)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A bar chart showing the distribution of race. There are ~2000 records with race "Other", 3000 with race "Black", and over 15,000 with race "White"." width="576"/></p>
|
||||
|
@ -162,13 +162,13 @@ Modifying factor order</h1>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">relig_summary <- gss_cat |>
|
||||
group_by(relig) |>
|
||||
summarise(
|
||||
summarize(
|
||||
age = mean(age, na.rm = TRUE),
|
||||
tvhours = mean(tvhours, na.rm = TRUE),
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(relig_summary, aes(tvhours, relig)) +
|
||||
ggplot(relig_summary, aes(x = tvhours, y = relig)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
|
||||
|
@ -181,7 +181,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
|
|||
<code>x</code>, a numeric vector that you want to use to reorder the levels.</li>
|
||||
<li>Optionally, <code>fun</code>, a function that’s used if there are multiple values of <code>x</code> for each value of <code>f</code>. The default value is <code>median</code>.</li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. "Other eastern" has the fewest tvhours under 2, and "Don't know" has the highest (over 5)." width="576"/></p>
|
||||
|
@ -194,20 +194,20 @@ ggplot(relig_summary, aes(tvhours, relig)) +
|
|||
mutate(
|
||||
relig = fct_reorder(relig, tvhours)
|
||||
) |>
|
||||
ggplot(aes(tvhours, relig)) +
|
||||
ggplot(aes(x = tvhours, y = relig)) +
|
||||
geom_point()</pre>
|
||||
</div>
|
||||
<p>What if we create a similar plot looking at how average age varies across reported income level?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">rincome_summary <- gss_cat |>
|
||||
group_by(rincome) |>
|
||||
summarise(
|
||||
summarize(
|
||||
age = mean(age, na.rm = TRUE),
|
||||
tvhours = mean(tvhours, na.rm = TRUE),
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
|
||||
ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then <$1000, then $8000-9999." width="576"/></p>
|
||||
|
@ -216,7 +216,7 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
|
|||
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
|
||||
<p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="https://forcats.tidyverse.org/reference/fct_relevel.html">fct_relevel()</a></code>. It takes a factor, <code>f</code>, and then any number of levels that you want to move to the front of the line.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="The same scatterplot but now "Not Applicable" is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is "Not applicable"." width="576"/></p>
|
||||
|
@ -227,7 +227,7 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
|
|||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">#|
|
||||
#| Rearranging the legend makes the plot easier to read because the
|
||||
#| legend colours now match the order of the lines on the far right
|
||||
#| legend colors now match the order of the lines on the far right
|
||||
#| of the plot. You can see some unsurprising patterns: the proportion
|
||||
#| never married decreases with age, married forms an upside down U
|
||||
#| shape, and widowed starts off low but increases steeply after age
|
||||
|
@ -240,12 +240,12 @@ by_age <- gss_cat |>
|
|||
prop = n / sum(n)
|
||||
)
|
||||
|
||||
ggplot(by_age, aes(age, prop, colour = marital)) +
|
||||
ggplot(by_age, aes(x = age, y = prop, color = marital)) +
|
||||
geom_line(na.rm = TRUE)
|
||||
|
||||
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
|
||||
ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
|
||||
geom_line() +
|
||||
labs(colour = "marital")</pre>
|
||||
labs(color = "marital")</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
|
@ -261,7 +261,7 @@ ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">gss_cat |>
|
||||
mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
|
||||
ggplot(aes(marital)) +
|
||||
ggplot(aes(x = marital)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="factors_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
|
||||
|
|
|
@ -60,7 +60,7 @@ df |> mutate(
|
|||
<section id="writing-a-function" data-type="sect2">
|
||||
<h2>
|
||||
Writing a function</h2>
|
||||
<p>To write a function you need to first analyse your repeated code to figure what parts are constant and what parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> it’s a little easier to see the pattern because each repetition is now one line:</p>
|
||||
<p>To write a function you need to first analyze your repeated code to figure out what parts are constant and what parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, it’s a little easier to see the pattern because each repetition is now one line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
|
||||
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
|
||||
|
@ -73,8 +73,8 @@ Writing a function</h2>
|
|||
</div>
|
||||
<p>To turn this into a function you need three things:</p>
|
||||
<ol type="1"><li><p>A <strong>name</strong>. Here we’ll use <code>rescale01</code> because this function rescales a vector to lie between 0 and 1.</p></li>
|
||||
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls and our analysis above tells us that have just one. We’ll call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
|
||||
<li><p>The <strong>body</strong>. The body is the code that repeated across all the calls.</p></li>
|
||||
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
|
||||
<li><p>The <strong>body</strong>. The body is the code that’s repeated across all the calls.</p></li>
|
||||
</ol><p>Then you create a function by following the template:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">name <- function(arguments) {
|
||||
|
@ -117,7 +117,7 @@ rescale01(c(1, 2, 3, NA, 5))
|
|||
<section id="improving-our-function" data-type="sect2">
|
||||
<h2>
|
||||
Improving our function</h2>
|
||||
<p>You might notice <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
|
||||
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">rescale01 <- function(x) {
|
||||
rng <- range(x, na.rm = TRUE)
|
||||
|
@ -136,6 +136,7 @@ rescale01(x)
|
|||
rng <- range(x, na.rm = TRUE, finite = TRUE)
|
||||
(x - rng[1]) / (rng[2] - rng[1])
|
||||
}
|
||||
|
||||
rescale01(x)
|
||||
#> [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
|
||||
#> [8] 0.7777778 0.8888889 1.0000000 Inf</pre>
|
||||
|
@ -146,14 +147,14 @@ rescale01(x)
|
|||
<section id="mutate-functions" data-type="sect2">
|
||||
<h2>
|
||||
Mutate functions</h2>
|
||||
<p>Now you’ve got the basic idea of functions, lets take a look a whole bunch of examples. We’ll start by looking at “mutate” functions, functions that work well like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output the same length as the input.</p>
|
||||
<p>Lets start with a simple variation of <code>rescale01()</code>. Maybe you want compute the Z-score, rescaling a vector to have to a mean of zero and a standard deviation of one:</p>
|
||||
<p>Now you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output of the same length as the input.</p>
|
||||
<p>Let’s start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">z_score <- function(x) {
|
||||
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
|
||||
}</pre>
|
||||
</div>
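<p>As a quick sanity check, a rescaled vector should have a mean of roughly zero and a standard deviation of one, with missing values left in place. This is a minimal sketch in base R; the input vector is made up for illustration.</p>

```r
# Same definition as above, repeated so the sketch runs on its own
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical input with one missing value
z <- z_score(c(1, 2, 3, NA, 5))

mean(z, na.rm = TRUE)  # approximately 0, up to floating-point error
sd(z, na.rm = TRUE)    # approximately 1
is.na(z)               # the NA stays in the fourth position
```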
|
||||
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie in between a minimum or a maximum:</p>
|
||||
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> and give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">clamp <- function(x, min, max) {
|
||||
case_when(
|
||||
|
@ -162,6 +163,7 @@ Mutate functions</h2>
|
|||
.default = x
|
||||
)
|
||||
}
|
||||
|
||||
clamp(1:10, min = 3, max = 7)
|
||||
#> [1] 3 3 3 4 5 6 7 7 7 7</pre>
|
||||
</div>
|
||||
|
@ -174,15 +176,17 @@ clamp(1:10, min = 3, max = 7)
|
|||
.default = x
|
||||
)
|
||||
}
|
||||
|
||||
na_outside(1:10, min = 3, max = 7)
|
||||
#> [1] NA NA 3 4 5 6 7 NA NA NA</pre>
|
||||
</div>
|
||||
<p>Of course functions don’t just need to work with numeric variables. You might want to extract out some repeated string manipulation. Maybe you need to make the first character upper case:</p>
|
||||
<p>Of course functions don’t just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">first_upper <- function(x) {
|
||||
str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
|
||||
x
|
||||
}
|
||||
|
||||
first_upper("hello")
|
||||
#> [1] "Hello"</pre>
|
||||
</div>
|
||||
|
@ -198,12 +202,13 @@ clean_number <- function(x) {
|
|||
as.numeric(x)
|
||||
if_else(is_pct, num / 100, num)
|
||||
}
|
||||
|
||||
clean_number("$12,300")
|
||||
#> [1] 12300
|
||||
clean_number("45%")
|
||||
#> [1] 0.45</pre>
|
||||
</div>
|
||||
<p>Sometimes your functions will be highly specialized for one data analysis. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
|
||||
<p>Sometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">fix_na <- function(x) {
|
||||
if_else(x %in% c(997, 998, 999), NA, x)
|
||||
|
@ -237,14 +242,16 @@ Summary functions</h2>
|
|||
<pre data-type="programlisting" data-code-language="r">commas <- function(x) {
|
||||
str_flatten(x, collapse = ", ", last = " and ")
|
||||
}
|
||||
|
||||
commas(c("cat", "dog", "pigeon"))
|
||||
#> [1] "cat, dog and pigeon"</pre>
|
||||
</div>
|
||||
<p>Or you might wrap up a simple computation, like for the coefficient of variation, which divides standard deviation by the mean:</p>
|
||||
<p>Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">cv <- function(x, na.rm = FALSE) {
|
||||
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
|
||||
}
|
||||
|
||||
cv(runif(100, min = 0, max = 50))
|
||||
#> [1] 0.5196276
|
||||
cv(runif(100, min = 0, max = 500))
|
||||
|
@ -318,42 +325,62 @@ Data frame functions</h1>
|
|||
<section id="indirection-and-tidy-evaluation" data-type="sect2">
|
||||
<h2>
|
||||
Indirection and tidy evaluation</h2>
|
||||
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> the unique (distinct) values of a variable:</p>
|
||||
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: <code>grouped_mean()</code>. The goal of this function is to compute the mean of <code>mean_var</code> grouped by <code>group_var</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">pull_unique <- function(df, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">grouped_mean <- function(df, group_var, mean_var) {
|
||||
df |>
|
||||
distinct(var) |>
|
||||
pull(var)
|
||||
group_by(group_var) |>
|
||||
summarize(mean(mean_var))
|
||||
}</pre>
|
||||
</div>
|
||||
<p>If we try and use it, we get an error:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |> pull_unique(clarity)
|
||||
#> Error in `distinct()` at ]8;line = 38:col = 2;file:///Users/hadleywickham/Documents/dplyr/dplyr/R/pull.Rdplyr/R/pull.R:38:2]8;;:
|
||||
#> ! Must use existing variables.
|
||||
#> ✖ `var` not found in `.data`.</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |> grouped_mean(cut, carat)
|
||||
#> Error in `group_by()`:
|
||||
#> ! Must group by variables found in `.data`.
|
||||
#> ✖ Column `group_var` is not found.</pre>
|
||||
</div>
|
||||
<p>To make the problem a bit more clear we can use a made up data frame:</p>
|
||||
<p>To make the problem a bit more clear, we can use a made up data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(var = "var", x = "x", y = "y")
|
||||
df |> pull_unique(x)
|
||||
#> [1] "var"
|
||||
df |> pull_unique(y)
|
||||
#> [1] "var"</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
mean_var = 1,
|
||||
group_var = "g",
|
||||
group = 1,
|
||||
x = 10,
|
||||
y = 100
|
||||
)
|
||||
|
||||
df |> grouped_mean(group, x)
|
||||
#> # A tibble: 1 × 2
|
||||
#> group_var `mean(mean_var)`
|
||||
#> <chr> <dbl>
|
||||
#> 1 g 1
|
||||
df |> grouped_mean(group, y)
|
||||
#> # A tibble: 1 × 2
|
||||
#> group_var `mean(mean_var)`
|
||||
#> <chr> <dbl>
|
||||
#> 1 g 1</pre>
|
||||
</div>
|
||||
<p>Regardless of how we call <code>pull_unique()</code> it always does <code>df |> distinct(var) |> pull(var)</code>, instead of <code>df |> distinct(x) |> pull(x)</code> or <code>df |> distinct(y) |> pull(y)</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
|
||||
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> not to treat <code>var</code> as the name of a variable, but instead look inside <code>var</code> for the variable we actually want to use.</p>
|
||||
<p>Regardless of how we call <code>grouped_mean()</code> it always does <code>df |> group_by(group_var) |> summarize(mean(mean_var))</code>, instead of <code>df |> group_by(group) |> summarize(mean(x))</code> or <code>df |> group_by(group) |> summarize(mean(y))</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
|
||||
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> not to treat <code>group_var</code> and <code>mean_var</code> as the names of variables, but instead to look inside them for the variables we actually want to use.</p>
|
||||
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of <code>{{ }}</code> as looking down a tunnel — <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
|
||||
<p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p>
|
||||
<p>So to make <code>grouped_mean()</code> work, we need to surround <code>group_var</code> and <code>mean_var</code> with <code>{{ }}</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">pull_unique <- function(df, var) {
|
||||
<pre data-type="programlisting" data-code-language="r">grouped_mean <- function(df, group_var, mean_var) {
|
||||
df |>
|
||||
distinct({{ var }}) |>
|
||||
pull({{ var }})
|
||||
group_by({{ group_var }}) |>
|
||||
summarize(mean({{ mean_var }}))
|
||||
}
|
||||
diamonds |> pull_unique(clarity)
|
||||
#> [1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
|
||||
#> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF</pre>
|
||||
|
||||
diamonds |> grouped_mean(cut, carat)
|
||||
#> # A tibble: 5 × 2
|
||||
#> cut `mean(carat)`
|
||||
#> <ord> <dbl>
|
||||
#> 1 Fair 1.05
|
||||
#> 2 Good 0.849
|
||||
#> 3 Very Good 0.806
|
||||
#> 4 Premium 0.892
|
||||
#> 5 Ideal 0.703</pre>
|
||||
</div>
|
||||
<p>Success!</p>
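<p>To confirm that embracing really fixes the indirection problem, we can rerun the embraced function on the made-up data frame from earlier. This is a minimal sketch assuming dplyr is attached; <code>res_x</code> and <code>res_y</code> are illustrative names.</p>

```r
library(dplyr)

# Embraced version, repeated so the sketch runs on its own
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

# The made-up data frame whose column names collide with the argument names
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

# Now the function groups by `group` and averages the column we asked for
res_x <- df |> grouped_mean(group, x)  # mean is 10, not 1
res_y <- df |> grouped_mean(group, y)  # mean is 100, not 1
```

<p>The columns named <code>mean_var</code> and <code>group_var</code> are no longer picked up by accident, because dplyr now looks inside the arguments rather than at their names.</p>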
|
||||
</section>
|
||||
|
@ -361,11 +388,11 @@ diamonds |> pull_unique(clarity)
|
|||
<section id="sec-embracing" data-type="sect2">
|
||||
<h2>
|
||||
When to embrace?</h2>
|
||||
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which corresponding to the two most common sub-types of tidy evaluation:</p>
|
||||
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> that compute with variables.</p></li>
|
||||
<li><p><strong>Tidy-selection</strong>: this is used for for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
|
||||
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:</p>
|
||||
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> that compute with variables.</p></li>
|
||||
<li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
|
||||
</ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p>
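<p>The compute-versus-select distinction can be sketched with two tiny wrappers, one around a data-masking verb and one around a tidy-selection verb. The helper names <code>keep_rows()</code> and <code>keep_cols()</code> are hypothetical and exist only for this illustration; the sketch assumes dplyr is attached and uses the built-in <code>mtcars</code> data.</p>

```r
library(dplyr)

# Hypothetical helper: filter() is data-masking, so `condition` can compute
keep_rows <- function(df, condition) {
  df |> filter({{ condition }})
}

# Hypothetical helper: select() uses tidy-selection, so `cols` can select
keep_cols <- function(df, cols) {
  df |> select({{ cols }})
}

# Computing with a variable (x + 1 style) works in the data-masking wrapper
high_mpg <- mtcars |> keep_rows(mpg + 1 > 30)

# Selecting a range of variables (a:x style) works in the tidy-selection wrapper
first_cols <- mtcars |> keep_cols(mpg:disp)
names(first_cols)
#> [1] "mpg"  "cyl"  "disp"
```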
|
||||
<p>In the following sections we’ll explore the sorts of handy functions you might write once you understand embracing.</p>
|
||||
<p>In the following sections, we’ll explore the sorts of handy functions you might write once you understand embracing.</p>
|
||||
</section>
|
||||
|
||||
<section id="common-use-cases" data-type="sect2">
|
||||
|
@ -374,7 +401,7 @@ Common use cases</h2>
|
|||
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">summary6 <- function(data, var) {
|
||||
data |> summarise(
|
||||
data |> summarize(
|
||||
min = min({{ var }}, na.rm = TRUE),
|
||||
mean = mean({{ var }}, na.rm = TRUE),
|
||||
median = median({{ var }}, na.rm = TRUE),
|
||||
|
@ -384,14 +411,15 @@ Common use cases</h2>
|
|||
.groups = "drop"
|
||||
)
|
||||
}
|
||||
|
||||
diamonds |> summary6(carat)
|
||||
#> # A tibble: 1 × 6
|
||||
#> min mean median max n n_miss
|
||||
#> <dbl> <dbl> <dbl> <dbl> <int> <int>
|
||||
#> 1 0.2 0.798 0.7 5.01 53940 0</pre>
|
||||
</div>
|
||||
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> in a helper, we think it’s good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
|
||||
<p>The nice thing about this function is because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> you can used it on grouped data:</p>
|
||||
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> in a helper, we think it’s good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
|
||||
<p>The nice thing about this function is that, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, you can use it on grouped data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
group_by(cut) |>
|
||||
|
@ -405,7 +433,7 @@ diamonds |> summary6(carat)
|
|||
#> 4 Premium 0.2 0.892 0.86 4.01 13791 0
|
||||
#> 5 Ideal 0.2 0.703 0.54 3.5 21551 0</pre>
|
||||
</div>
|
||||
<p>Because the arguments to summarize are data-masking that also means that the <code>var</code> argument to <code>summary6()</code> is data-masking. That means you can also summarize computed variables:</p>
|
||||
<p>Furthermore, since the arguments to summarize are data-masking, the <code>var</code> argument to <code>summary6()</code> is also data-masking. That means you can also summarize computed variables:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
group_by(cut) |>
|
||||
|
@ -419,8 +447,8 @@ diamonds |> summary6(carat)
|
|||
#> 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0
|
||||
#> 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre>
|
||||
</div>
|
||||
<p>To summarize multiple variables you’ll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
|
||||
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
|
||||
<p>To summarize multiple variables, you’ll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
|
||||
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/Diabb6/status/1571635146658402309
|
||||
count_prop <- function(df, var, sort = FALSE) {
|
||||
|
@ -428,6 +456,7 @@ count_prop <- function(df, var, sort = FALSE) {
|
|||
count({{ var }}, sort = sort) |>
|
||||
mutate(prop = n / sum(n))
|
||||
}
|
||||
|
||||
diamonds |> count_prop(clarity)
|
||||
#> # A tibble: 8 × 3
|
||||
#> clarity n prop
|
||||
|
@ -447,26 +476,36 @@ diamonds |> count_prop(clarity)
|
|||
df |>
|
||||
filter({{ condition }}) |>
|
||||
distinct({{ var }}) |>
|
||||
arrange({{ var }}) |>
|
||||
pull({{ var }})
|
||||
arrange({{ var }})
|
||||
}
|
||||
|
||||
# Find all the destinations in December
|
||||
flights |> unique_where(month == 12, dest)
|
||||
#> [1] "ABQ" "ALB" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA" "BOS" "BQN" "BTV"
|
||||
#> [13] "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHS" "CLE" "CLT" "CMH" "CVG" "DAY"
|
||||
#> [25] "DCA" "DEN" "DFW" "DSM" "DTW" "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN"
|
||||
#> [37] "HNL" "HOU" "IAD" "IAH" "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LGB" "MCI"
|
||||
#> [49] "MCO" "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "OAK" "OKC"
|
||||
#> [61] "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE" "PSP" "PVD" "PWM"
|
||||
#> [73] "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN" "SDF" "SEA" "SFO" "SJC"
|
||||
#> [85] "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT" "SYR" "TPA" "TUL" "TYS" "XNA"
|
||||
#> # A tibble: 96 × 1
|
||||
#> dest
|
||||
#> <chr>
|
||||
#> 1 ABQ
|
||||
#> 2 ALB
|
||||
#> 3 ATL
|
||||
#> 4 AUS
|
||||
#> 5 AVL
|
||||
#> 6 BDL
|
||||
#> # … with 90 more rows
|
||||
# Which months did plane N14228 fly in?
|
||||
flights |> unique_where(tailnum == "N14228", month)
|
||||
#> [1] 1 2 3 4 5 6 7 8 9 10 12</pre>
|
||||
#> # A tibble: 11 × 1
|
||||
#> month
|
||||
#> <int>
|
||||
#> 1 1
|
||||
#> 2 2
|
||||
#> 3 3
|
||||
#> 4 4
|
||||
#> 5 5
|
||||
#> 6 6
|
||||
#> # … with 5 more rows</pre>
|
||||
</div>
|
||||
<p>Here we embrace <code>condition</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because its passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>.</p>
|
||||
<p>We’ve made all these examples take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
|
||||
<p>Here we embrace <code>condition</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>.</p>
|
||||
<p>We’ve written all these examples to take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights_sub <- function(rows, cols) {
|
||||
flights |>
|
||||
|
@ -476,43 +515,45 @@ flights |> unique_where(tailnum == "N14228", month)
|
|||
|
||||
flights_sub(dest == "IAH", contains("time"))
|
||||
#> # A tibble: 7,198 × 8
|
||||
#> time_hour carrier flight dep_time sched…¹ arr_t…² sched…³ air_t…⁴
|
||||
#> <dttm> <chr> <int> <int> <int> <int> <int> <dbl>
|
||||
#> 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227
|
||||
#> 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227
|
||||
#> 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229
|
||||
#> 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238
|
||||
#> 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249
|
||||
#> 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233
|
||||
#> # … with 7,192 more rows, and abbreviated variable names ¹sched_dep_time,
|
||||
#> # ²arr_time, ³sched_arr_time, ⁴air_time</pre>
|
||||
#> time_hour carrier flight dep_time sched_dep_time arr_time
|
||||
#> <dttm> <chr> <int> <int> <int> <int>
|
||||
#> 1 2013-01-01 05:00:00 UA 1545 517 515 830
|
||||
#> 2 2013-01-01 05:00:00 UA 1714 533 529 850
|
||||
#> 3 2013-01-01 06:00:00 UA 496 623 627 933
|
||||
#> 4 2013-01-01 07:00:00 UA 473 728 732 1041
|
||||
#> 5 2013-01-01 07:00:00 UA 1479 739 739 1104
|
||||
#> 6 2013-01-01 09:00:00 UA 1220 908 908 1228
|
||||
#> # … with 7,192 more rows, and 2 more variables: sched_arr_time <int>,
|
||||
#> # air_time <dbl></pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="data-masking-vs-tidy-selection" data-type="sect2">
|
||||
<section id="data-masking-vs.-tidy-selection" data-type="sect2">
|
||||
<h2>
|
||||
Data-masking vs tidy-selection</h2>
|
||||
Data-masking vs. tidy-selection</h2>
|
||||
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">count_missing <- function(df, group_vars, x_var) {
|
||||
df |>
|
||||
group_by({{ group_vars }}) |>
|
||||
summarise(n_miss = sum(is.na({{ x_var }})))
|
||||
summarize(n_miss = sum(is.na({{ x_var }})))
|
||||
}
|
||||
|
||||
flights |>
|
||||
count_missing(c(year, month, day), dep_time)
|
||||
#> Error in `group_by()` at dplyr/R/summarise.R:127:2:
|
||||
#> ℹ In argument: `..1 = c(year, month, day)`.
|
||||
#> Error in `group_by()`:
|
||||
#> ℹ In argument: `c(year, month, day)`.
|
||||
#> Caused by error:
|
||||
#> ! `..1` must be size 336776 or 1, not 1010328.</pre>
|
||||
#> ! `c(year, month, day)` must be size 336776 or 1, not 1010328.</pre>
|
||||
</div>
|
||||
<p>This doesn’t work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> which allows you to use use tidy-selection inside data-masking functions:</p>
|
||||
<p>This doesn’t work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> function, which allows you to use tidy-selection inside data-masking functions:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">count_missing <- function(df, group_vars, x_var) {
|
||||
df |>
|
||||
group_by(pick({{ group_vars }})) |>
|
||||
summarise(n_miss = sum(is.na({{ x_var }})))
|
||||
summarize(n_miss = sum(is.na({{ x_var }})))
|
||||
}
|
||||
|
||||
flights |>
|
||||
count_missing(c(year, month, day), dep_time)
|
||||
#> `summarise()` has grouped output by 'year', 'month'. You can override using
|
||||
|
@ -542,6 +583,7 @@ count_wide <- function(data, rows, cols) {
|
|||
values_fill = 0
|
||||
)
|
||||
}
|
||||
|
||||
diamonds |> count_wide(clarity, cut)
|
||||
#> # A tibble: 8 × 6
|
||||
#> clarity Fair Good `Very Good` Premium Ideal
|
||||
|
@ -572,9 +614,9 @@ diamonds |> count_wide(c(clarity, color), cut)
|
|||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
<p>Using the datasets from nyclights13, write functions that:</p>
|
||||
<p>Using the datasets from nycflights13, write a function that:</p>
|
||||
<ol type="1"><li>
|
||||
<p>Find all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
|
||||
<p>Finds all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> filter_severe()</pre>
|
||||
</div>
|
||||
|
@ -582,7 +624,7 @@ Exercises</h2>
|
|||
<li>
|
||||
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> group_by(dest) |> summarise_severe()</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> group_by(dest) |> summarize_severe()</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
|
@ -592,19 +634,19 @@ Exercises</h2>
|
|||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Summarizes the weather to compute the minum, mean, and maximum, of a user supplied variable:</p>
|
||||
<p>Summarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">weather |> summarise_weather(temp)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">weather |> summarize_weather(temp)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc) into a decimal time (i.e. hours + minutes / 60).</p>
|
||||
<p>Converts the user-supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc.) into a decimal time (i.e. hours + (minutes / 60)).</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> standardise_time(sched_dep_time)</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></li>
|
||||
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-select: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
|
||||
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
|
||||
<li>
|
||||
<p>Generalize the following function so that you can supply any number of variables to count.</p>
|
||||
<div class="cell">
|
||||
|
@ -621,21 +663,21 @@ Exercises</h2>
|
|||
<section id="plot-functions" data-type="sect1">
|
||||
<h1>
|
||||
Plot functions</h1>
|
||||
<p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you’re making a lot of histograms:</p>
|
||||
<p>Instead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you’re making a lot of histograms:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
ggplot(aes(carat)) +
|
||||
ggplot(aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.1)
|
||||
|
||||
diamonds |>
|
||||
ggplot(aes(carat)) +
|
||||
ggplot(aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.05)</pre>
|
||||
</div>
|
||||
<p>Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy as once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function so that you need to embrace:</p>
|
||||
<p>Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy as pie once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function and you need to embrace:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">histogram <- function(df, var, binwidth = NULL) {
|
||||
df |>
|
||||
ggplot(aes({{ var }})) +
|
||||
ggplot(aes(x = {{ var }})) +
|
||||
geom_histogram(binwidth = binwidth)
|
||||
}
|
||||
|
||||
|
@ -644,7 +686,7 @@ diamonds |> histogram(carat, 0.1)</pre>
|
|||
<p><img src="functions_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note that <code>histogram()</code> returns a ggplot2 plot, so that you can still add on additional components if you want. Just remember to switch from <code>|></code> to <code>+</code>:</p>
|
||||
<p>Note that <code>histogram()</code> returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from <code>|></code> to <code>+</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
histogram(carat, 0.1) +
|
||||
|
@ -660,10 +702,9 @@ More variables</h2>
|
|||
<p>It’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/tyler_js_smith/status/1574377116988104704
|
||||
|
||||
linearity_check <- function(df, x, y) {
|
||||
df |>
|
||||
ggplot(aes({{ x }}, {{ y }})) +
|
||||
ggplot(aes(x = {{ x }}, y = {{ y }})) +
|
||||
geom_point() +
|
||||
geom_smooth(method = "loess", color = "red", se = FALSE) +
|
||||
geom_smooth(method = "lm", color = "blue", se = FALSE)
|
||||
|
@ -683,13 +724,14 @@ starwars |>
|
|||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/ppaxisa/status/1574398423175921665
|
||||
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
|
||||
df |>
|
||||
ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
|
||||
ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) +
|
||||
stat_summary_hex(
|
||||
aes(colour = after_scale(fill)), # make border same colour as fill
|
||||
aes(color = after_scale(fill)), # make border same color as fill
|
||||
bins = bins,
|
||||
fun = fun,
|
||||
)
|
||||
}
|
||||
|
||||
diamonds |> hex_plot(carat, price, depth)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="functions_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" width="576"/></p>
|
||||
|
@ -708,17 +750,19 @@ Combining with dplyr</h2>
|
|||
ggplot(aes(y = {{ var }})) +
|
||||
geom_bar()
|
||||
}
|
||||
|
||||
diamonds |> sorted_bars(cut)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="functions_files/figure-html/unnamed-chunk-50-1.png" class="img-fluid" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Or you could maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
|
||||
<p>We have to use a new operator here, <code>:=</code>, because we are generating the variable name based on user-supplied data. Variable names go on the left hand side of <code>=</code>, but R’s syntax doesn’t allow anything to the left of <code>=</code> except for a single literal name. To work around this problem, we use the special operator <code>:=</code> which tidy evaluation treats in exactly the same way as <code>=</code>.</p>
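<p>As a minimal sketch of the same idea outside of plotting, the hypothetical helper below (<code>double_col()</code> is a name made up for this example, not part of dplyr) embraces the user-supplied variable on the left-hand side of <code>:=</code> so that the result overwrites that column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical helper for illustration: doubles one user-chosen column.
# `{{ var }} := ...` names the output column after the embraced variable.
double_col <- function(df, var) {
  df |>
    mutate({{ var }} := {{ var }} * 2)
}

# Column `a` is doubled in place; column `b` is left untouched
tibble(a = 1:3, b = 4:6) |> double_col(a)</pre>
</div>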
|
||||
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">conditional_bars <- function(df, condition, var) {
|
||||
df |>
|
||||
filter({{ condition }}) |>
|
||||
ggplot(aes({{ var }})) +
|
||||
ggplot(aes(x = {{ var }})) +
|
||||
geom_bar()
|
||||
}
|
||||
|
||||
|
@ -727,17 +771,16 @@ diamonds |> conditional_bars(cut == "Good", clarity)</pre>
|
|||
<p><img src="functions_files/figure-html/unnamed-chunk-51-1.png" class="img-fluid" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can also get creative and display data summaries in other way. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
|
||||
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
|
||||
|
||||
fancy_ts <- function(df, val, group) {
|
||||
labs <- df |>
|
||||
group_by({{group}}) |>
|
||||
summarize(breaks = max({{val}}))
|
||||
group_by({{ group }}) |>
|
||||
summarize(breaks = max({{ val }}))
|
||||
|
||||
df |>
|
||||
ggplot(aes(date, {{val}}, group = {{group}}, color = {{group}})) +
|
||||
ggplot(aes(x = date, y = {{ val }}, group = {{ group }}, color = {{ group }})) +
|
||||
geom_path() +
|
||||
scale_y_continuous(
|
||||
breaks = labs$breaks,
|
||||
|
@ -753,6 +796,7 @@ df <- tibble(
|
|||
dist4 = sort(rnorm(50, 15, 1)),
|
||||
date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
|
||||
)
|
||||
|
||||
df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
|
||||
|
||||
fancy_ts(df, value, dist_name)</pre>
|
||||
|
@ -766,26 +810,26 @@ fancy_ts(df, value, dist_name)</pre>
|
|||
<section id="faceting" data-type="sect2">
|
||||
<h2>
|
||||
Faceting</h2>
|
||||
<p>Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work. so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
|
||||
<p>Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work. So you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/sharoz/status/1574376332821204999
|
||||
|
||||
foo <- function(x) {
|
||||
ggplot(mtcars, aes(mpg, disp)) +
|
||||
ggplot(mtcars, aes(x = mpg, y = disp)) +
|
||||
geom_point() +
|
||||
facet_wrap(vars({{ x }}))
|
||||
}
|
||||
|
||||
foo(cyl)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="functions_files/figure-html/unnamed-chunk-53-1.png" class="img-fluid" width="576"/></p>
|
||||
</div>
|
||||
</div>
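<p>The same pattern carries over to <code>facet_grid()</code>, whose <code>rows</code> and <code>cols</code> arguments also accept <code>vars()</code>. The following is only a sketch; <code>facet_scatter()</code> and its argument names are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper: scatterplot of mtcars faceted by two user-supplied variables.
# Each variable is embraced inside its own vars() call.
facet_scatter <- function(row_var, col_var) {
  ggplot(mtcars, aes(x = mpg, y = disp)) +
    geom_point() +
    facet_grid(rows = vars({{ row_var }}), cols = vars({{ col_var }}))
}

facet_scatter(cyl, am)</pre>
</div>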
|
||||
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution <code>bill_length_mm</code> from palmerpenguins dataset.</p>
|
||||
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/yutannihilat_en/status/1574387230025875457
|
||||
density <- function(colour, facets, binwidth = 0.1) {
|
||||
density <- function(color, facets, binwidth = 0.1) {
|
||||
diamonds |>
|
||||
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
|
||||
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
|
||||
geom_freqpoly(binwidth = binwidth) +
|
||||
facet_wrap(vars({{ facets }}))
|
||||
}
|
||||
|
@ -812,18 +856,18 @@ Labeling</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">histogram <- function(df, var, binwidth = NULL) {
|
||||
df |>
|
||||
ggplot(aes({{ var }})) +
|
||||
ggplot(aes(x = {{ var }})) +
|
||||
geom_histogram(binwidth = binwidth)
|
||||
}</pre>
|
||||
</div>
|
||||
<p>Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from package we haven’t talked about before: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
|
||||
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically insert the appropriate variable name:</p>
|
||||
<p>Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about yet: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
|
||||
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">histogram <- function(df, var, binwidth) {
|
||||
label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
|
||||
|
||||
df |>
|
||||
ggplot(aes({{ var }})) +
|
||||
ggplot(aes(x = {{ var }})) +
|
||||
geom_histogram(binwidth = binwidth) +
|
||||
labs(title = label)
|
||||
}
|
||||
|
@ -833,17 +877,16 @@ diamonds |> histogram(carat, 0.1)</pre>
|
|||
<p><img src="functions_files/figure-html/unnamed-chunk-56-1.png" class="img-fluid" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can use the same approach any other place that you might supply a string in a ggplot2 plot.</p>
|
||||
<p>You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.</p>
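<p>For instance, the sketch below reuses <code>rlang::englue()</code> for an axis label instead of a title. The function name <code>labeled_histogram()</code> and the label wording are assumptions made for this example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical variant of histogram() that labels the x-axis rather than the title
labeled_histogram <- function(df, var, binwidth) {
  # {{ var }} inserts the variable name; {binwidth} inserts the value
  x_label <- rlang::englue("{{ var }} (binwidth = {binwidth})")

  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(x = x_label)
}

diamonds |> labeled_histogram(carat, 0.1)</pre>
</div>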
|
||||
</section>
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>Build up a rich plotting function by incrementally implementing each of the steps below.
|
||||
<p>Build up a rich plotting function by incrementally implementing each of the steps below:</p>
|
||||
<ol type="1"><li><p>Draw a scatterplot given dataset and <code>x</code> and <code>y</code> variables.</p></li>
|
||||
<li><p>Add a line of best fit (i.e. a linear model with no standard errors).</p></li>
|
||||
<li><p>Add a title.</p></li>
|
||||
</ol></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
|
@ -866,21 +909,20 @@ collapse_years()</pre>
|
|||
<p>R also doesn’t care about how you use white space in your functions but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># missing extra two spaces
|
||||
pull_unique <- function(df, var) {
|
||||
df |>
|
||||
distinct({{ var }}) |>
|
||||
pull({{ var }})
|
||||
density <- function(color, facets, binwidth = 0.1) {
|
||||
diamonds |>
|
||||
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
|
||||
geom_freqpoly(binwidth = binwidth) +
|
||||
facet_wrap(vars({{ facets }}))
|
||||
}
|
||||
|
||||
# Pipe indented incorrectly
|
||||
pull_unique <- function(df, var) {
|
||||
df |>
|
||||
distinct({{ var }}) |>
|
||||
pull({{ var }})
|
||||
}
|
||||
|
||||
# Missing {} and all one line
|
||||
pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})</pre>
|
||||
density <- function(color, facets, binwidth = 0.1) {
|
||||
diamonds |>
|
||||
ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
|
||||
geom_freqpoly(binwidth = binwidth) +
|
||||
facet_wrap(vars({{ facets }}))
|
||||
}</pre>
|
||||
</div>
|
||||
<p>As you can see we recommend putting extra spaces inside of <code>{{ }}</code>. This makes it very obvious that something unusual is happening.</p>
|
||||
|
||||
|
@ -893,20 +935,21 @@ Exercises</h2>
|
|||
<pre data-type="programlisting" data-code-language="r">f1 <- function(string, prefix) {
|
||||
substr(string, 1, nchar(prefix)) == prefix
|
||||
}
|
||||
|
||||
f3 <- function(x, y) {
|
||||
rep(y, length.out = length(x))
|
||||
}</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li>
|
||||
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
|
||||
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc. would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frames, or creating a plot. Along the way your saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
|
||||
<p>In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
|
||||
<p>We have only shown you the bare minimum to get started with functions and there’s much more to learn. A few places to learn more are:</p>
|
||||
<ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="https://dplyr.tidyverse.org/articles/programming.html">programming with dplyr</a> and <a href="https://tidyr.tidyverse.org/articles/programming.html">programming with tidyr</a> and learn more about the theory in <a href="https://rlang.r-lib.org/reference/topic-data-mask.html">What is data-masking and why do I need {{?</a>.</li>
|
||||
<li>To learn more about reducing duplication in your ggplot2 code, read the <a href="https://ggplot2-book.org/programming.html" class="uri">Programming with ggplot2</a> chapter of the ggplot2 book.</li>
|
||||
|
|
|
@ -0,0 +1,14 @@
|
|||
<div data-type="part">
|
||||
<h1><span id="sec-import" class="quarto-section-identifier d-none d-lg-block">Import</span></h1><p>In this part of the book, you’ll learn how to import a wider range of data into R, as well as how to get it into a form useful for analysis. Sometimes this is just a matter of calling a function from the appropriate data import package. But in more complex cases it might require both tidying and transformation in order to get to the tidy rectangle that you’d prefer to work with.</p><div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-ds-import"><p><img src="diagrams/data-science/import.png" alt="Our data science model with import highlighted in blue. " width="535"/></p>
|
||||
<figcaption>Figure 1: Data import is the beginning of the data science process; without data you can’t do data science!</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div><p>In this part of the book you’ll learn how to access data stored in the following ways:</p><ul><li><p>In <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>, you’ll learn how to import data from Excel spreadsheets and Google Sheets.</p></li>
|
||||
<li><p>In <a href="#chp-databases" data-type="xref">#chp-databases</a>, you’ll learn about getting data out of a database and into R (and you’ll also learn a little about how to get data out of R and into a database).</p></li>
|
||||
<li><p>In <a href="#chp-arrow" data-type="xref">#chp-arrow</a>, you’ll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it’s stored in the parquet format.</p></li>
|
||||
<li><p>In <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>, you’ll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.</p></li>
|
||||
<li><p>In <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a>, you’ll learn web “scraping”, the art and science of extracting data from web pages.</p></li>
|
||||
</ul><p>There are two important tidyverse packages that we don’t discuss here: haven and xml2. If you’re working with data from SPSS, Stata, or SAS files, check out the <strong>haven</strong> package, <a href="https://haven.tidyverse.org" class="uri">https://haven.tidyverse.org</a>. If you’re working with XML data, check out the <strong>xml2</strong> package, <a href="https://xml2.r-lib.org" class="uri">https://xml2.r-lib.org</a>. Otherwise, you’ll need to do some research to figure out which package you’ll need to use; Google is your friend here 😃.</p></div>
|
|
@ -3,12 +3,12 @@
|
|||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you’d need to explicitly double each element of x using some sort of for loop.</p>
|
||||
<p>In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you’d need to explicitly double each element of <code>x</code> using some sort of for loop.</p>
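<p>To make the contrast concrete, here is a small sketch of both approaches written in R. The vector <code>x</code> and its values are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- c(1, 2, 3)

# Implicit iteration: arithmetic is vectorized, so this doubles every element
doubled <- 2 * x

# Explicit iteration: the kind of loop you would need in many other languages
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- 2 * x[i]
}

# Both approaches give the same result
identical(doubled, doubled_loop)
#> [1] TRUE</pre>
</div>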
|
||||
<p>This book has already given you a small but powerful number of tools that perform the same action for multiple “things”:</p>
|
||||
<ul><li>
|
||||
<code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> draw a plot for each subset.</li>
|
||||
<li>
|
||||
<code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> plus <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> computes a summary statistics for each subset.</li>
|
||||
<code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> plus <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> computes summary statistics for each subset.</li>
|
||||
<li>
|
||||
<code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> create new rows and columns for each element of a list-column.</li>
|
||||
</ul><p>Now it’s time to learn some more general tools, often called <strong>functional programming</strong> tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.</p>
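<p>As a small taste of what “a function that takes another function as input” can look like, the sketch below passes <code>median()</code> to purrr’s <code>map_dbl()</code>. The list <code>x</code> is made-up data for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# Made-up data for illustration: a named list of numeric vectors
x <- list(a = c(1, 5, 9), b = c(2, 4), c = 10)

# map_dbl() takes a function (here median) as an argument, applies it to
# each element of the list, and returns a named numeric vector:
# a = 5, b = 3, c = 10
map_dbl(x, median)</pre>
</div>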
|
||||
|
@ -25,7 +25,7 @@ Prerequisites</h2>
|
|||
|
||||
<p>This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev version with <code>devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr"))</code>.</p></div>
|
||||
|
||||
<p>In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We’re going to use just a couple of purrr functions from in this chapter, but it’s a great package to explore as you improve your programming skills.</p>
|
||||
<p>In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We’re just going to use a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
|
@ -46,7 +46,7 @@ Modifying multiple columns</h1>
|
|||
</div>
|
||||
<p>You could do it with copy-and-paste:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |> summarise(
|
||||
<pre data-type="programlisting" data-code-language="r">df |> summarize(
|
||||
n = n(),
|
||||
a = median(a),
|
||||
b = median(b),
|
||||
|
@ -58,9 +58,9 @@ Modifying multiple columns</h1>
|
|||
#> <int> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 10 -0.246 -0.287 -0.0567 0.144</pre>
|
||||
</div>
|
||||
<p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead you can use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>:</p>
|
||||
<p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead, you can use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |> summarise(
|
||||
<pre data-type="programlisting" data-code-language="r">df |> summarize(
|
||||
n = n(),
|
||||
across(a:d, median),
|
||||
)
|
||||
|
@ -76,7 +76,7 @@ Modifying multiple columns</h1>
|
|||
Selecting columns with<code>.cols</code>
|
||||
</h2>
|
||||
<p>The first argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, <code>.cols</code>, selects the columns to transform. This uses the same specifications as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <a href="#sec-select" data-type="xref">#sec-select</a>, so you can use functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">ends_with()</a></code> to select columns based on their name.</p>
|
||||
<p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code>where()</code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
|
||||
<p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
grp = sample(2, 10, replace = TRUE),
|
||||
|
@ -88,15 +88,15 @@ Selecting columns with<code>.cols</code>
|
|||
|
||||
df |>
|
||||
group_by(grp) |>
|
||||
summarise(across(everything(), median))
|
||||
summarize(across(everything(), median))
|
||||
#> # A tibble: 2 × 5
|
||||
#> grp a b c d
|
||||
#> <int> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 1 -0.0935 -0.0163 0.363 0.364
|
||||
#> 2 2 0.312 -0.0576 0.208 0.565</pre>
|
||||
</div>
|
||||
<p>Note grouping columns (<code>grp</code> here) are not included in <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, because they’re automatically preserved by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>.</p>
|
||||
<p><code>where()</code> allows you to select columns based on their type:</p>
|
||||
<p>Note that grouping columns (<code>grp</code> here) are not included in <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, because they’re automatically preserved by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
|
||||
<p><code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code> allows you to select columns based on their type:</p>
|
||||
<ul><li>
|
||||
<code>where(is.numeric)</code> selects all numeric columns.</li>
|
||||
<li>
|
||||
|
@ -116,33 +116,35 @@ df |>
|
|||
)
|
||||
|
||||
df_types |>
|
||||
summarise(across(where(is.numeric), mean))
|
||||
summarize(across(where(is.numeric), mean))
|
||||
#> # A tibble: 1 × 2
|
||||
#> x1 x2
|
||||
#> <dbl> <dbl>
|
||||
#> 1 2 0.370
|
||||
|
||||
df_types |>
|
||||
summarise(across(where(is.character), str_flatten))
|
||||
summarize(across(where(is.character), str_flatten))
|
||||
#> # A tibble: 1 × 2
|
||||
#> y1 y2
|
||||
#> <chr> <chr>
|
||||
#> 1 kjh bananaappleegg</pre>
|
||||
</div>
|
||||
<p>Just like other selectors, you can combine these with Boolean algebra. For example, <code>!where(is.numeric)</code> selects all non-numeric columns and <code>starts_with("a") & where(is.logical)</code> selects all logical columns whose name starts with “a”.</p>
|
||||
<p>Just like other selectors, you can combine these with Boolean algebra. For example, <code>!where(is.numeric)</code> selects all non-numeric columns, and <code>starts_with("a") & where(is.logical)</code> selects all logical columns whose name starts with “a”.</p>
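<p>As a minimal sketch of combining selectors, the following uses a made-up tibble (<code>df_flags</code> and its column names are assumptions for illustration, not part of this chapter’s data) to show both patterns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data: two logical columns, one numeric, one character
df_flags <- tibble(
  a_ok  = c(TRUE, FALSE, TRUE),
  a_val = c(1, 2, 3),
  b_ok  = c(FALSE, FALSE, TRUE),
  name  = c("x", "y", "z")
)

# Logical columns whose name starts with "a": only a_ok qualifies
starts_a_lgl <- df_flags |>
  summarize(across(starts_with("a") & where(is.logical), sum))

# Every non-numeric column: a_ok, b_ok, and name (a_val is dropped)
non_numeric <- df_flags |>
  summarize(across(!where(is.numeric), n_distinct))</pre>
</div>
<p>In the first call, <code>a_val</code> is excluded because it is numeric rather than logical, and <code>b_ok</code> is excluded because its name does not start with “a”.</p>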
|
||||
</section>
|
||||
|
||||
<section id="calling-a-single-function" data-type="sect2">
|
||||
<h2>
|
||||
Calling a single function</h2>
|
||||
<p>The second argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a function programming language.</p>
|
||||
<p>The second argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a functional programming language.</p>
|
||||
<p>It’s important to note that we’re passing this function to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> so that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can call it; we’re not calling it ourselves. That means the function name should never be followed by <code>()</code>. If you forget, you’ll get an error:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(grp) |>
|
||||
summarise(across(everything(), median()))
|
||||
#> Error in vapply(.x, .f, .mold, ..., USE.NAMES = FALSE): values must be length 1,
|
||||
#> but FUN(X[[1]]) result is length 0</pre>
|
||||
summarize(across(everything(), median()))
|
||||
#> Error in `summarize()`:
|
||||
#> ℹ In argument: `across(everything(), median())`.
|
||||
#> Caused by error in `is.factor()`:
|
||||
#> ! argument "x" is missing, with no default</pre>
|
||||
</div>
|
||||
<p>This error arises because you’re calling the function with no input, e.g.:</p>
|
||||
<div class="cell">
|
||||
|
@ -154,7 +156,7 @@ Calling a single function</h2>
|
|||
<section id="calling-multiple-functions" data-type="sect2">
|
||||
<h2>
|
||||
Calling multiple functions</h2>
|
||||
<p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Lets motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> propagates those missing values, giving us a suboptimal output:</p>
|
||||
<p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> propagates those missing values, giving us a suboptimal output:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
|
||||
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
|
||||
|
@ -167,7 +169,7 @@ df_miss <- tibble(
|
|||
d = rnorm(5)
|
||||
)
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, median),
|
||||
n = n()
|
||||
)
|
||||
|
@ -179,7 +181,7 @@ df_miss |>
|
|||
<p>It would be nice if we could pass along <code>na.rm = TRUE</code> to <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> to remove these missing values. To do so, instead of calling <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> directly, we need to create a new function that calls <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> with the desired arguments:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, function(x) median(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
)
|
||||
|
@ -191,7 +193,7 @@ df_miss |>
|
|||
<p>This is a little verbose, so R comes with a handy shortcut: for this sort of throwaway, or <strong>anonymous</strong><span data-type="footnote">Anonymous, because we never explicitly gave it a name with <code><-</code>. Another term programmers use for this is “lambda function”.</span>, function you can replace <code>function</code> with <code>\</code><span data-type="footnote">In older code you might see syntax that looks like <code>~ .x + 1</code>. This is another way to write anonymous functions but it only works inside tidyverse functions and always uses the variable name <code>.x</code>. We now recommend the base syntax, <code>\(x) x + 1</code>.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, \(x) median(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
)</pre>
|
||||
|
@ -199,7 +201,7 @@ df_miss |>
|
|||
<p>In either case, <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> effectively expands to the following code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
a = median(a, na.rm = TRUE),
|
||||
b = median(b, na.rm = TRUE),
|
||||
c = median(c, na.rm = TRUE),
|
||||
|
@ -207,10 +209,10 @@ df_miss |>
|
|||
n = n()
|
||||
)</pre>
|
||||
</div>
|
||||
<p>When we remove the missing values from the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, it would be nice to know just how many values we were removing. We can find that out by supplying two functions to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to <code>.fns</code>:</p>
|
||||
<p>When we remove the missing values from the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, it would be nice to know just how many values were removed. We can find that out by supplying two functions to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to <code>.fns</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, list(
|
||||
median = \(x) median(x, na.rm = TRUE),
|
||||
n_miss = \(x) sum(is.na(x))
|
||||
|
@ -218,10 +220,10 @@ df_miss |>
|
|||
n = n()
|
||||
)
|
||||
#> # A tibble: 1 × 9
|
||||
#> a_median a_n_miss b_median b_n_miss c_median c_n_miss d_med…¹ d_n_m…² n
|
||||
#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int>
|
||||
#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5
|
||||
#> # … with abbreviated variable names ¹d_median, ²d_n_miss</pre>
|
||||
#> a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss
|
||||
#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
|
||||
#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0
|
||||
#> # … with 1 more variable: n <int></pre>
|
||||
</div>
|
||||
<p>If you look carefully, you might intuit that the columns are named using a glue specification (<a href="#sec-glue" data-type="xref">#sec-glue</a>) like <code>{.col}_{.fn}</code> where <code>.col</code> is the name of the original column and <code>.fn</code> is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the <code>.names</code> argument to supply your own glue spec.</p>
|
||||
</section>
|
||||
|
@ -232,7 +234,7 @@ Column names</h2>
|
|||
<p>The result of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is named according to the specification provided in the <code>.names</code> argument. We could specify our own if we wanted the name of the function to come first<span data-type="footnote">You can’t currently change the order of the columns, but you could reorder them after the fact using <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> or similar.</span>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(
|
||||
a:d,
|
||||
list(
|
||||
|
@ -244,12 +246,12 @@ Column names</h2>
|
|||
n = n(),
|
||||
)
|
||||
#> # A tibble: 1 × 9
|
||||
#> median_a n_miss_a median_b n_miss_b median_c n_miss_c media…¹ n_mis…² n
|
||||
#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> <int>
|
||||
#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5
|
||||
#> # … with abbreviated variable names ¹median_d, ²n_miss_d</pre>
|
||||
#> median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d
|
||||
#> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int>
|
||||
#> 1 0.429 1 -0.721 1 -0.796 2 0.704 0
|
||||
#> # … with 1 more variable: n <int></pre>
|
||||
</div>
|
||||
<p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
|
||||
<p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default, the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |>
|
||||
mutate(
|
||||
|
@ -284,7 +286,7 @@ Column names</h2>
|
|||
<section id="filtering" data-type="sect2">
|
||||
<h2>
|
||||
Filtering</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is a great match for <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> but it’s more awkward to use with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&</code>. It’s clear that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> called <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>:</p>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is a great match for <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> but it’s more awkward to use with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&</code>. It’s clear that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> called <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
|
||||
#> # A tibble: 3 × 4
|
||||
|
@ -318,12 +320,6 @@ df_miss |> filter(if_all(a:d, is.na))
|
|||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is particularly useful to program with because it allows you to operate on multiple columns. For example, <a href="https://twitter.com/_wurli/status/1571836746899283969">Jacob Scott</a> uses this little helper, which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(lubridate)
|
||||
#> Loading required package: timechange
|
||||
#>
|
||||
#> Attaching package: 'lubridate'
|
||||
#> The following objects are masked from 'package:base':
|
||||
#>
|
||||
#> date, intersect, setdiff, union
|
||||
|
||||
expand_dates <- function(df) {
|
||||
df |>
|
||||
|
@ -347,16 +343,16 @@ df_date |>
|
|||
</div>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in <a href="#sec-embracing" data-type="xref">#sec-embracing</a>. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">summarise_means <- function(df, summary_vars = where(is.numeric)) {
|
||||
<pre data-type="programlisting" data-code-language="r">summarize_means <- function(df, summary_vars = where(is.numeric)) {
|
||||
df |>
|
||||
summarise(
|
||||
summarize(
|
||||
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
)
|
||||
}
|
||||
diamonds |>
|
||||
group_by(clarity) |>
|
||||
summarise_means()
|
||||
summarize_means()
|
||||
#> # A tibble: 8 × 9
|
||||
#> clarity carat depth table price x y z n
|
||||
#> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
|
||||
|
@ -370,7 +366,7 @@ diamonds |>
|
|||
|
||||
diamonds |>
|
||||
group_by(clarity) |>
|
||||
summarise_means(c(carat, x:z))
|
||||
summarize_means(c(carat, x:z))
|
||||
#> # A tibble: 8 × 6
|
||||
#> clarity carat x y z n
|
||||
#> <ord> <dbl> <dbl> <dbl> <dbl> <int>
|
||||
|
@ -391,7 +387,7 @@ Vs<code>pivot_longer()</code>
|
|||
<p>Before we go on, it’s worth pointing out an interesting connection between <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> (<a href="#sec-pivoting" data-type="xref">#sec-pivoting</a>). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
summarise(across(a:d, list(median = median, mean = mean)))
|
||||
summarize(across(a:d, list(median = median, mean = mean)))
|
||||
#> # A tibble: 1 × 8
|
||||
#> a_median a_mean b_median b_mean c_median c_mean d_median d_mean
|
||||
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
|
@ -402,7 +398,7 @@ Vs<code>pivot_longer()</code>
|
|||
<pre data-type="programlisting" data-code-language="r">long <- df |>
|
||||
pivot_longer(a:d) |>
|
||||
group_by(name) |>
|
||||
summarise(
|
||||
summarize(
|
||||
median = median(value),
|
||||
mean = mean(value)
|
||||
)
|
||||
|
@ -464,7 +460,7 @@ df_long
|
|||
|
||||
df_long |>
|
||||
group_by(group) |>
|
||||
summarise(mean = weighted.mean(val, wts))
|
||||
summarize(mean = weighted.mean(val, wts))
|
||||
#> # A tibble: 4 × 2
|
||||
#> group mean
|
||||
#> <chr> <dbl>
|
||||
|
@ -486,12 +482,12 @@ Exercises</h2>
|
|||
<li><p>It is possible to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> where it’s equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>. Can you explain why?</p></li>
|
||||
<li><p>Adjust <code>expand_dates()</code> to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?</p></li>
|
||||
<li>
|
||||
<p>Explain what each step of the pipeline in this function does. What special feature of <code>where()</code> are we taking advantage of?</p>
|
||||
<p>Explain what each step of the pipeline in this function does. What special feature of <code><a href="https://tidyselect.r-lib.org/reference/where.html">where()</a></code> are we taking advantage of?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">show_missing <- function(df, group_vars, summary_vars = everything()) {
|
||||
df |>
|
||||
group_by(pick({{ group_vars }})) |>
|
||||
summarise(
|
||||
summarize(
|
||||
across({{ summary_vars }}, \(x) sum(is.na(x))),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
|
@ -522,7 +518,7 @@ data2022 <- readxl::read_excel("data/y2022.xlsx")</pre>
|
|||
<section id="listing-files-in-a-directory" data-type="sect2">
|
||||
<h2>
|
||||
Listing files in a directory</h2>
|
||||
<p>As the name suggests, <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> lists the files in a directory. TO CONSIDER: why not use it via the more obvious name <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code>? You’ll almost always use three arguments:</p>
|
||||
<p>As the name suggests, <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> lists the files in a directory. You’ll almost always use three arguments:</p>
|
||||
<ul><li><p>The first argument, <code>path</code>, is the directory to look in.</p></li>
|
||||
<li><p><code>pattern</code> is a regular expression used to filter the file names. The most common pattern is something like <code>[.]xlsx$</code> or <code>[.]csv$</code> to find all files with a specified extension.</p></li>
|
||||
<li><p><code>full.names</code> determines whether or not the directory name should be included in the output. You almost always want this to be <code>TRUE</code>.</p></li>
|
||||
|
@ -608,7 +604,7 @@ files[[1]]
|
|||
#> 6 Australia Oceania 69.1 8691212 10040.
|
||||
#> # … with 136 more rows</pre>
|
||||
</div>
|
||||
<p>(This is another data structure that doesn’t display particularly compactly with <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> so you might want to load into RStudio and inspect it with <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>).</p>
|
||||
<p>(This is another data structure that doesn’t display particularly compactly with <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> so you might want to load it into RStudio and inspect it with <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>).</p>
|
||||
<p>Now we can use <code><a href="https://purrr.tidyverse.org/reference/list_c.html">purrr::list_rbind()</a></code> to combine that list of data frames into a single data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">list_rbind(files)
|
||||
|
@ -623,13 +619,13 @@ files[[1]]
|
|||
#> 6 Australia Oceania 69.1 8691212 10040.
|
||||
#> # … with 1,698 more rows</pre>
|
||||
</div>
|
||||
<p>Or we could do both steps at once in pipeline:</p>
|
||||
<p>Or we could do both steps at once in a pipeline:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
map(readxl::read_excel) |>
|
||||
list_rbind()</pre>
|
||||
</div>
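<p>If you want to experiment with this pattern without any files on disk, here is a small sketch that uses in-memory tibbles instead. The list <code>sheets</code> and its contents are invented for illustration and merely stand in for the data frames that <code>read_excel()</code> would return:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical stand-in for a list of data frames read from files
sheets <- list(
  tibble(country = c("A", "B"), pop = c(10, 20)),
  tibble(country = c("A", "B"), pop = c(11, 22))
)

# map() applies a function to each element and returns a list
top_rows <- sheets |> map(\(df) head(df, 1))

# list_rbind() stacks a list of data frames into a single data frame
combined <- sheets |> list_rbind()</pre>
</div>
<p>Here <code>top_rows</code> is still a list of two one-row tibbles, while <code>combined</code> is a single tibble containing all four rows.</p>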
|
||||
<p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>. For example, it’s often useful to peak at the first few row of the data with <code>n_max = 1</code>:</p>
|
||||
<p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>. For example, it’s often useful to peek at the first few rows of the data with <code>n_max = 1</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
map(\(path) readxl::read_excel(path, n_max = 1)) |>
|
||||
|
@ -651,7 +647,7 @@ files[[1]]
|
|||
<section id="sec-data-in-the-path" data-type="sect2">
|
||||
<h2>
|
||||
Data in the path</h2>
|
||||
<p>Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.</p>
|
||||
<p>Sometimes the name of the file is data itself. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things:</p>
|
||||
<p>First, we name the vector of paths. The easiest way to do this is with the <code><a href="https://rlang.r-lib.org/reference/set_names.html">set_names()</a></code> function, which can take a function. Here we use <code><a href="https://rdrr.io/r/base/basename.html">basename()</a></code> to extract just the file name from the full path:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">paths |> set_names(basename)
|
||||
|
@ -752,15 +748,15 @@ Save your work</h2>
|
|||
write_csv(gapminder, "gapminder.csv")</pre>
|
||||
</div>
|
||||
<p>Now when you come back to this problem in the future, you can read in a single csv file.</p>
|
||||
<p>If you’re working in a project, we’d suggest calling the file that does this sort of data prep work something like <code>0-cleanup.R.</code> The <code>0</code> in the file name suggests that this should be run before anything else.</p>
|
||||
<p>If you’re working in a project, we’d suggest calling the file that does this sort of data prep work something like <code>0-cleanup.R</code>. The <code>0</code> in the file name suggests that this should be run before anything else.</p>
|
||||
<p>If your input data files change over time, you might consider learning a tool like <a href="https://docs.ropensci.org/targets/">targets</a> to set up your data cleaning code to automatically re-run whenever one of the input files is modified.</p>
|
||||
</section>
|
||||
|
||||
<section id="many-simple-iterations" data-type="sect2">
|
||||
<h2>
|
||||
Many simple iterations</h2>
|
||||
<p>Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic basic options: you can do one round of iteration with a complex function, or do a multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better by doing multiple simple iterations.</p>
|
||||
<p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is write a function that takes a file and does all those steps then call <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> once:</p>
|
||||
<p>Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience, most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.</p>
|
||||
<p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps then call <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> once:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">process_file <- function(path) {
|
||||
df <- read_csv(path)
|
||||
|
@ -784,7 +780,7 @@ paths |>
|
|||
map(\(df) df |> pivot_longer(jan:dec, names_to = "month")) |>
|
||||
list_rbind()</pre>
|
||||
</div>
|
||||
<p>We recommend this approach because it stops you getting fixated on getting the first file right because moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.</p>
|
||||
<p>We recommend this approach because it stops you getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.</p>
|
||||
<p>In this particular example, there’s another optimization you could make, by binding all the data frames together earlier. Then you can rely on regular dplyr behavior:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">paths |>
|
||||
|
@ -799,12 +795,12 @@ paths |>
|
|||
<section id="heterogeneous-data" data-type="sect2">
|
||||
<h2>
|
||||
Heterogeneous data</h2>
|
||||
<p>Unfortunately sometimes it’s not possible to go from <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> straight to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> because the data frames are so heterogeneous that <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:</p>
|
||||
<p>Unfortunately, sometimes it’s not possible to go from <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> straight to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> because the data frames are so heterogeneous that <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">files <- paths |>
|
||||
map(readxl::read_excel) </pre>
|
||||
</div>
|
||||
<p>Then a very useful strategy is to capture the structure of the data frames to data so that you can explore it using your data science skills. One way to do so is with this handy <code>df_types</code> function that returns a tibble with one row for each column:</p>
|
||||
<p>Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills. One way to do so is with this handy <code>df_types</code> function that returns a tibble with one row for each column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df_types <- function(df) {
|
||||
tibble(
|
||||
|
@ -837,7 +833,7 @@ df_types(nycflights13::flights)
|
|||
#> 6 dep_delay double 8255
|
||||
#> # … with 13 more rows</pre>
|
||||
</div>
|
||||
<p>You can then apply this function all of the files, and maybe do some pivoting to make it easy to see where there are differences. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:</p>
|
||||
<p>You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">files |>
|
||||
map(df_types) |>
|
||||
|
@ -855,7 +851,7 @@ df_types(nycflights13::flights)
|
|||
#> 6 1977.xlsx character character double double double
|
||||
#> # … with 6 more rows</pre>
|
||||
</div>
|
||||
<p>If the files have heterogeneous formats you might need to do more processing before you can successfully merge them. Unfortunately we’re now going to leave you to figure that out on your own, but you might want to read about <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> and <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> allows you to selectively modify elements of a list based on their values; <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code> allows you to selectively modify elements based on their names.</p>
|
||||
<p>If the files have heterogeneous formats, you might need to do more processing before you can successfully merge them. Unfortunately, we’re now going to leave you to figure that out on your own, but you might want to read about <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> and <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> allows you to selectively modify elements of a list based on their values; <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code> allows you to selectively modify elements based on their names.</p>
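<p>As a minimal sketch of the difference between the two, the example below uses a small made-up list <code>x</code> (not one of the files above): <code>map_if()</code> doubles only the numeric elements, while <code>map_at()</code> upper-cases only the element named <code>b</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)

# A hypothetical list, for illustration only
x <- list(a = 1:3, b = c("u", "v", "w"), c = 4:6)

# map_if() picks elements by value: only numeric vectors are doubled
doubled <- x |> map_if(is.numeric, \(v) v * 2)

# map_at() picks elements by name: only "b" is upper-cased
shouted <- x |> map_at("b", toupper)</pre>
</div>
<p>In both cases the elements that aren’t selected are returned unchanged, so the result is a list with the same length and names as the input.</p>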
|
||||
</section>
|
||||
|
||||
<section id="handling-failures" data-type="sect2">
|
||||
|
@ -870,7 +866,7 @@ Handling failures</h2>
|
|||
data <- files |> list_rbind()</pre>
|
||||
</div>
|
||||
<p>This works particularly well here because <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code>, like many tidyverse functions, automatically ignores <code>NULL</code>s.</p>
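<p>A small sketch of that behavior, using made-up data frames rather than the files above: the <code>NULL</code> in the middle stands in for a file that failed to load.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(purrr)
library(tibble)

# Hypothetical results: the second "file" failed and was recorded as NULL
results <- list(tibble(x = 1:2), NULL, tibble(x = 3L))

# The NULL is dropped, leaving only the rows from the two data frames
combined <- list_rbind(results)

# A logical vector marking which elements failed
failed_idx <- map_vec(results, is.null)</pre>
</div>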
|
||||
<p>Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed load and what do to about it. Start by getting the paths that failed:</p>
|
||||
<p>Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">failed <- map_vec(files, is.null)
|
||||
paths[failed]
|
||||
|
@ -885,13 +881,13 @@ paths[failed]
|
|||
Saving multiple outputs</h1>
|
||||
<p>In the last section, you learned about <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>, which is useful for reading multiple files into a single object. In this section, we’ll explore roughly the opposite problem: how can you take one or more R objects and save them to one or more files? We’ll explore this challenge using three examples:</p>
|
||||
<ul><li>Saving multiple data frames into one database.</li>
|
||||
<li>Saving multiple data frames into multiple csv files.</li>
|
||||
<li>Saving multiple data frames into multiple <code>.csv</code> files.</li>
|
||||
<li>Saving multiple plots to multiple <code>.png</code> files.</li>
|
||||
</ul>
|
||||
<section id="sec-save-database" data-type="sect2">
|
||||
<h2>
|
||||
Writing to a database</h2>
|
||||
<p>Sometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do <code>map(files, read_csv)</code>. One approach to deal with this problem is to load your into a database so you can access just the bits you need with dbplyr.</p>
|
||||
<p>Sometimes when working with many files, it’s not possible to fit all your data into memory at once, and you can’t do <code>map(files, read_csv)</code>. One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.</p>
|
||||
<p>If you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s <code>duckdb_read_csv()</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb())
|
||||
|
@ -914,7 +910,7 @@ template
|
|||
#> 6 Australia Oceania 69.1 8691212 10040. 1952
|
||||
#> # … with 136 more rows</pre>
|
||||
</div>
|
||||
<p>Now we can connect to the database, and use <code><a href="https://dbi.r-dbi.org/reference/dbCreateTable.html">DBI::dbCreateTable()</a></code> to turn our template into database table:</p>
|
||||
<p>Now we can connect to the database, and use <code><a href="https://dbi.r-dbi.org/reference/dbCreateTable.html">DBI::dbCreateTable()</a></code> to turn our template into a database table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb())
|
||||
DBI::dbCreateTable(con, "gapminder", template)</pre>
|
||||
|
@ -923,7 +919,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">con |> tbl("gapminder")
|
||||
#> # Source: table<gapminder> [0 x 6]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>,
|
||||
#> # pop <dbl>, gdpPercap <dbl>, year <dbl></pre>
|
||||
</div>
|
||||
|
@ -950,15 +946,15 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
|
|||
tbl("gapminder") |>
|
||||
count(year)
|
||||
#> # Source: SQL [?? x 2]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
#> year n
|
||||
#> <dbl> <dbl>
|
||||
#> 1 1952 142
|
||||
#> 2 1987 142
|
||||
#> 3 1957 142
|
||||
#> 4 1992 142
|
||||
#> 5 1962 142
|
||||
#> 6 1997 142
|
||||
#> 2 1957 142
|
||||
#> 3 1962 142
|
||||
#> 4 1967 142
|
||||
#> 5 1972 142
|
||||
#> 6 1977 142
|
||||
#> # … with more rows</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -997,7 +993,7 @@ by_clarity
|
|||
#> 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4
|
||||
#> # … with 735 more rows</pre>
|
||||
</div>
|
||||
<p>While we’re here, lets create a column that gives the name of output file, using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>:</p>
|
||||
<p>While we’re here, let’s create a column that gives the name of the output file, using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">by_clarity <- by_clarity |>
|
||||
mutate(path = str_glue("diamonds-{clarity}.csv"))
|
||||
|
@ -1034,7 +1030,7 @@ Saving plots</h2>
|
|||
<p>We can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">carat_histogram <- function(df) {
|
||||
ggplot(df, aes(carat)) + geom_histogram(binwidth = 0.1)
|
||||
ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1)
|
||||
}
|
||||
|
||||
carat_histogram(by_clarity$data[[1]])</pre>
|
||||
|
@ -1078,8 +1074,8 @@ ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)</pre>
|
|||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="https://adv-r.hadley.nz/functionals.html">Functionals chapter</a> of <em>Advanced R</em> and consulting the <a href="https://purrr.tidyverse.org">purrr website</a>.</p>
|
||||
<p>If you know much about iteration in other languages you might be surprised that we didn’t discuss the <code>for</code> loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each columns or each group. And when you can’t, you can often use a functional programming tool like <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> that does something to each element of a list. However, you will see <code>for</code> loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools.</p>
|
||||
<p>In this chapter, you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="https://adv-r.hadley.nz/functionals.html">Functionals chapter</a> of <em>Advanced R</em> and consulting the <a href="https://purrr.tidyverse.org">purrr website</a>.</p>
|
||||
<p>If you know much about iteration in other languages, you might be surprised that we didn’t discuss the <code>for</code> loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can’t, you can often use a functional programming tool like <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> that does something to each element of a list. However, you will see <code>for</code> loops in wild-caught code, so you’ll learn about them in the next chapter where we’ll discuss some important base R tools.</p>
|
||||
|
||||
|
||||
</section>
|
||||
|
|
|
@ -65,15 +65,15 @@ Primary and foreign keys</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">planes
|
||||
#> # A tibble: 3,322 × 9
|
||||
#> tailnum year type manuf…¹ model engines seats speed engine
|
||||
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
|
||||
#> 1 N10156 2004 Fixed wing multi en… EMBRAER EMB-… 2 55 NA Turbo…
|
||||
#> 2 N102UW 1998 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
|
||||
#> 3 N103US 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
|
||||
#> 4 N104UW 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
|
||||
#> 5 N10575 2002 Fixed wing multi en… EMBRAER EMB-… 2 55 NA Turbo…
|
||||
#> 6 N105UW 1999 Fixed wing multi en… AIRBUS… A320… 2 182 NA Turbo…
|
||||
#> # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
|
||||
#> tailnum year type manufacturer model engines seats speed engine
|
||||
#> <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
|
||||
#> 1 N10156 2004 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
|
||||
#> 2 N102UW 1998 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
|
||||
#> 3 N103US 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
|
||||
#> 4 N104UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
|
||||
#> 5 N10575 2002 Fixed wing mul… EMBRAER EMB-… 2 55 NA Turbo…
|
||||
#> 6 N105UW 1999 Fixed wing mul… AIRBUS INDU… A320… 2 182 NA Turbo…
|
||||
#> # … with 3,316 more rows</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
|
@ -81,17 +81,16 @@ Primary and foreign keys</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">weather
|
||||
#> # A tibble: 26,115 × 15
|
||||
#> origin year month day hour temp dewp humid wind_dir wind_sp…¹ wind_…²
|
||||
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
|
||||
#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
|
||||
#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
|
||||
#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
|
||||
#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
|
||||
#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
|
||||
#> # … with 26,109 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
|
||||
#> # visib <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹wind_speed, ²wind_gust</pre>
|
||||
#> origin year month day hour temp dewp humid wind_dir wind_speed
|
||||
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
|
||||
#> 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
|
||||
#> 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
|
||||
#> 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
|
||||
#> 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
|
||||
#> 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
|
||||
#> 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
|
||||
#> # … with 26,109 more rows, and 5 more variables: wind_gust <dbl>,
|
||||
#> # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm></pre>
|
||||
</div>
|
||||
</li>
|
||||
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
|
||||
|
@ -102,7 +101,7 @@ Primary and foreign keys</h2>
|
|||
<li>
|
||||
<code>flights$origin</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
|
||||
<li>
|
||||
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code> .</li>
|
||||
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
|
||||
<li>
|
||||
<code>flights$origin</code>-<code>flights$time_hour</code> is a compound foreign key that corresponds to the compound primary key <code>weather$origin</code>-<code>weather$time_hour</code>.</li>
|
||||
</ul><p>These relationships are summarized visually in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>.</p>
|
||||
|
@ -110,7 +109,7 @@ Primary and foreign keys</h2>
|
|||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-flights-relationships"><p><img src="diagrams/relational.png" alt="The relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa connected to the flights$origin and flights$dest. planes$tailnum is connected to the flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between airports, planes, airlines, and weather data frames." width="502"/></p>
|
||||
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are coloured grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
|
||||
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
|
@ -182,19 +181,18 @@ Surrogate keys</h2>
|
|||
mutate(id = row_number(), .before = 1)
|
||||
flights2
|
||||
#> # A tibble: 336,776 × 20
|
||||
#> id year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
|
||||
#> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl>
|
||||
#> 1 1 2013 1 1 517 515 2 830 819 11
|
||||
#> 2 2 2013 1 1 533 529 4 850 830 20
|
||||
#> 3 3 2013 1 1 542 540 2 923 850 33
|
||||
#> 4 4 2013 1 1 544 545 -1 1004 1022 -18
|
||||
#> 5 5 2013 1 1 554 600 -6 812 837 -25
|
||||
#> 6 6 2013 1 1 554 558 -4 740 728 12
|
||||
#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
|
||||
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
|
||||
#> # hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable
|
||||
#> # names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
|
||||
#> # ⁵arr_delay</pre>
|
||||
#> id year month day dep_time sched_dep_time dep_delay arr_time
|
||||
#> <int> <int> <int> <int> <int> <int> <dbl> <int>
|
||||
#> 1 1 2013 1 1 517 515 2 830
|
||||
#> 2 2 2013 1 1 533 529 4 850
|
||||
#> 3 3 2013 1 1 542 540 2 923
|
||||
#> 4 4 2013 1 1 544 545 -1 1004
|
||||
#> 5 5 2013 1 1 554 600 -6 812
|
||||
#> 6 6 2013 1 1 554 558 -4 740
|
||||
#> # … with 336,770 more rows, and 12 more variables: sched_arr_time <int>,
|
||||
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>Surrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430 which departed at 9am on 2013-01-03.</p>
|
||||
</section>
|
||||
|
@ -312,16 +310,16 @@ Specifying join keys</h2>
|
|||
left_join(planes)
|
||||
#> Joining with `by = join_by(year, tailnum)`
|
||||
#> # A tibble: 336,776 × 13
|
||||
#> year time_hour origin dest tailnum carrier type manufa…¹ model
|
||||
#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
|
||||
#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA> <NA>
|
||||
#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA> <NA>
|
||||
#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA> <NA>
|
||||
#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA> <NA>
|
||||
#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA> <NA>
|
||||
#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA> <NA>
|
||||
#> # … with 336,770 more rows, 4 more variables: engines <int>, seats <int>,
|
||||
#> # speed <int>, engine <chr>, and abbreviated variable name ¹manufacturer</pre>
|
||||
#> year time_hour origin dest tailnum carrier type manufacturer
|
||||
#> <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
|
||||
#> 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA>
|
||||
#> 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA>
|
||||
#> 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA>
|
||||
#> 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA>
|
||||
#> 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA>
|
||||
#> 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA>
|
||||
#> # … with 336,770 more rows, and 5 more variables: model <chr>,
|
||||
#> # engines <int>, seats <int>, speed <int>, engine <chr></pre>
|
||||
</div>
|
||||
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is the year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
|
||||
<div class="cell">
|
||||
|
@ -341,7 +339,7 @@ Specifying join keys</h2>
|
|||
</div>
|
||||
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
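<p>For instance, here is a sketch with two tiny made-up tables (<code>flights_small</code> and <code>planes_small</code> are hypothetical stand-ins for the real data) that replaces <code>.x</code> and <code>.y</code> with more descriptive suffixes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical miniature versions of flights and planes
flights_small <- tibble(tailnum = c("N10156", "N102UW"), year = c(2013L, 2013L))
planes_small <- tibble(tailnum = c("N10156", "N102UW"), year = c(2004L, 1998L))

# suffix takes a character vector of length 2: one suffix for x, one for y
joined <- flights_small |>
  left_join(planes_small, join_by(tailnum), suffix = c("_flight", "_plane"))

names(joined)</pre>
</div>
<p>Only columns whose names clash get a suffix; the join key <code>tailnum</code> is left as is.</p>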
|
||||
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
|
||||
<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flight2</code> and <code>airports</code> table: either by <code>dest</code> or <code>origin:</code></p>
|
||||
<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights2 |>
|
||||
left_join(airports, join_by(dest == faa))
|
||||
|
@ -461,12 +459,12 @@ Exercises</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">airports |>
|
||||
semi_join(flights, join_by(faa == dest)) |>
|
||||
ggplot(aes(lon, lat)) +
|
||||
ggplot(aes(x = lon, y = lat)) +
|
||||
borders("state") +
|
||||
geom_point() +
|
||||
coord_quickmap()</pre>
|
||||
</div>
|
||||
<p>You might want to use the <code>size</code> or <code>colour</code> of the points to display the average delay for each airport.</p>
|
||||
<p>You might want to use the <code>size</code> or <code>color</code> of the points to display the average delay for each airport.</p>
|
||||
</li>
|
||||
<li><p>What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.</p></li>
|
||||
</ol></section>
|
||||
|
@ -493,8 +491,8 @@ y <- tribble(
|
|||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are coloured: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
|
||||
<figcaption>Graphical representation of two simple tables. The coloured <code>key</code> columns map background colour to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
|
||||
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are colored: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
|
||||
<figcaption>Graphical representation of two simple tables. The colored <code>key</code> columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
|
@ -710,11 +708,11 @@ Non-equi joins</h1>
|
|||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-inner-both"><p><img src="diagrams/join/inner-both.png" alt="A join diagram showing an inner join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one." width="415"/></p>
|
||||
<figcaption>An left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
|
||||
<figcaption>An inner join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>When we move away from equi-joins we’ll always show the keys, because the key values will often different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr’s join functions understand this distinction equi and non-equi joins so will always show both keys when you perform a non-equi join.</p>
|
||||
<p>When we move away from equi-joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr’s join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.</p>
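<p>A minimal sketch of such a join, assuming the small <code>x</code> and <code>y</code> tables from earlier in the chapter look like the ones recreated below (their exact contents are an assumption here):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Assumed contents of the example tables
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Match whenever x$key is greater than or equal to y$key;
# in join_by() the left-hand side refers to x and the right-hand side to y
gte <- x |> inner_join(y, join_by(key >= key))

names(gte)</pre>
</div>
<p>Because the matched keys can differ, the result keeps both of them, as <code>key.x</code> and <code>key.y</code>, and a row of <code>x</code> can appear more than once when it matches several rows of <code>y</code>.</p>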
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
|
@ -746,10 +744,10 @@ Cross joins</h2>
|
|||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>.</p>
|
||||
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>. Cross joins use a different join function because there’s no distinction between inner/left/right/full when you’re matching every row.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
|
||||
df |> left_join(df, join_by())
|
||||
df |> cross_join(df)
|
||||
#> # A tibble: 16 × 2
|
||||
#> name.x name.y
|
||||
#> <chr> <chr>
|
||||
|
|
|
@ -0,0 +1,738 @@
|
|||
<section data-type="chapter" id="chp-layers">
|
||||
<h1><span id="sec-layers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Layers</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a>, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make <em>any</em> type of plot with ggplot2.</p>
|
||||
<p>In this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.</p>
|
||||
<p>We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
#> ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
|
||||
#> ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
|
||||
#> ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
|
||||
#> ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
|
||||
#> ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
|
||||
#> ✔ purrr 1.0.1
|
||||
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
|
||||
#> ✖ dplyr::filter() masks stats::filter()
|
||||
#> ✖ dplyr::lag() masks stats::lag()
|
||||
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="aesthetic-mappings" data-type="sect1">
|
||||
<h1>
|
||||
Aesthetic mappings</h1>
|
||||
<blockquote class="blockquote">
|
||||
<p>“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey</p>
|
||||
</blockquote>
|
||||
<p>The <code>mpg</code> data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">mpg
#> # A tibble: 234 × 11
#>   manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class
#>   <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr>
#> 1 audi         a4      1.8  1999     4 auto(l5) f        18    29 p     comp…
#> 2 audi         a4      1.8  1999     4 manual(… f        21    29 p     comp…
#> 3 audi         a4      2    2008     4 manual(… f        20    31 p     comp…
#> 4 audi         a4      2    2008     4 auto(av) f        21    30 p     comp…
#> 5 audi         a4      2.8  1999     6 auto(l5) f        16    26 p     comp…
#> 6 audi         a4      2.8  1999     6 manual(… f        18    26 p     comp…
#> # … with 228 more rows</pre>
|
||||
</div>
|
||||
<p>Among the variables in <code>mpg</code> are:</p>
|
||||
<ol type="1"><li><p><code>displ</code>: A car’s engine size, in liters. A numerical variable.</p></li>
|
||||
<li><p><code>hwy</code>: A car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. A numerical variable.</p></li>
|
||||
<li><p><code>class</code>: Type of car. A categorical variable.</p></li>
|
||||
</ol><p>You can learn about <code>mpg</code> on its help page by running <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code>.</p>
|
||||
<p>Let’s start by visualizing the relationship between <code>displ</code> and <code>hwy</code> for various <code>class</code>es of cars. We can do this with a scatterplot where the numerical variables are mapped to the <code>x</code> and <code>y</code> aesthetics and the categorical variable is mapped to an aesthetic like <code>color</code> or <code>shape</code>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Right
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
#> Warning: The shape palette can deal with a maximum of 6 discrete values
#> because more than 6 becomes difficult to discriminate; you have 7.
#> Consider specifying shapes manually if you must have them.
#> Warning: Removed 62 rows containing missing values (`geom_point()`).</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-4-1.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the color aesthetic, resulting in different colors for each class. In the plot on the right class is mapped to the shape aesthetic, resulting in different plotting character shapes for each class, except for suv. Each plot comes with a legend that shows the mapping between color or shape and levels of the class variable." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-4-2.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the color aesthetic, resulting in different colors for each class. In the plot on the right class is mapped to the shape aesthetic, resulting in different plotting character shapes for each class, except for suv. Each plot comes with a legend that shows the mapping between color or shape and levels of the class variable." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>When <code>class</code> is mapped to <code>shape</code>, we get two warnings:</p>
|
||||
<blockquote class="blockquote">
|
||||
<p>1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.</p>
|
||||
<p>2: Removed 62 rows containing missing values (<code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>).</p>
|
||||
</blockquote>
|
||||
<p>Since ggplot2 uses only six shapes at a time by default, additional groups go unplotted when you use the shape aesthetic. The second warning is related: there are 62 SUVs in the dataset, and they are the rows that are not plotted.</p>
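<p>The first warning suggests specifying shapes manually. A minimal sketch of one way to do that, assuming <code>scale_shape_manual()</code> from ggplot2 and an arbitrary choice of seven shape codes (the name <code>class_shapes</code> is ours, for illustration only):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# One shape code per class; mpg has seven classes, so we supply seven codes.
# The particular codes chosen here are arbitrary.
class_shapes <- c(0, 1, 2, 3, 4, 5, 6)

p_shapes <- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = class_shapes)

p_shapes</pre>
</div>
<p>With a shape supplied for every class, no points should be dropped, although seven shapes remain hard to tell apart, which is the concern the warning raises.</p>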
|
||||
<p>We can also map <code>class</code> to the <code>size</code> or <code>alpha</code> (transparency) aesthetics.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = displ, y = hwy, size = class)) +
  geom_point()
#> Warning: Using size for a discrete variable is not advised.

# Right
ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
  geom_point()
#> Warning: Using alpha for a discrete variable is not advised.</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-5-1.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right class is mapped to the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-5-2.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right class is mapped to the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Both of these produce warnings as well:</p>
|
||||
<blockquote class="blockquote">
|
||||
<p>Using size for a discrete variable is not advised.</p>
<p>Using alpha for a discrete variable is not advised.</p>
|
||||
</blockquote>
|
||||
<p>Mapping a non-ordinal discrete (categorical) variable (<code>class</code>) to an ordered aesthetic (<code>size</code> or <code>alpha</code>) is generally not a good idea because it implies a ranking that does not in fact exist.</p>
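<p>By contrast, an ordered aesthetic suits a variable that does have an order. As a hedged illustration (the choice of <code>cty</code>, the city fuel efficiency column shown in the <code>mpg</code> printout, is ours and not part of the example above), mapping a numerical variable to <code>size</code> should not trigger this warning:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# cty is numerical, so larger points correspond to larger values.
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()

p_size</pre>
</div>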
|
||||
<p>To recap, the <code>alpha</code> aesthetic controls the transparency of the points and the <code>shape</code> aesthetic controls their shape; <code>class</code> can be mapped to either, subject to the warnings above.</p>
|
||||
<p>Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.</p>
|
||||
<p>You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-6-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You can set an aesthetic manually by name as an argument of your geom function. In other words, it goes <em>outside</em> of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>. You’ll need to pick a value that makes sense for that aesthetic:</p>
|
||||
<ul><li>The name of a color as a character string.</li>
|
||||
<li>The size of a point in mm.</li>
|
||||
<li>The shape of a point as a number, as shown in <a href="#fig-shapes" data-type="xref">#fig-shapes</a>.</li>
|
||||
</ul><div class="cell" data-layout-align="center">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-shapes"><p><img src="layers_files/figure-html/fig-shapes-1.png" alt="Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue." width="576"/></p>
|
||||
<figcaption>R has 26 built-in shapes that are identified by numbers (0–25). There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the <code>color</code> and <code>fill</code> aesthetics. The hollow shapes (0–14) have a border determined by <code>color</code>; the solid shapes (15–20) are filled with <code>color</code>; the filled shapes (21–25) have a border of <code>color</code> and are filled with <code>fill</code>.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
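<p>To make the distinction in the caption concrete, here is a small sketch that sets several aesthetics manually at once. The specific values are arbitrary; shape 21 is used because it is one of the filled shapes that responds to both <code>color</code> and <code>fill</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# All values are set outside aes(), so they apply to every point
# and no legend is created.
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(
    shape = 21,       # filled circle with a border
    color = "black",  # border color
    fill = "blue",    # interior color
    size = 3          # point size in mm
  )

p_set</pre>
</div>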
|
||||
<p>So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at <a href="https://ggplot2.tidyverse.org/articles/ggplot2-specs.html" class="uri">https://ggplot2.tidyverse.org/articles/ggplot2-specs.html</a>.</p>
|
||||
<p>The specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.</p>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Create a scatterplot of <code>hwy</code> vs. <code>displ</code> where the points are pink filled-in triangles.</p></li>
|
||||
<li>
|
||||
<p>Why did the following code not result in a plot with blue points?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-8-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are red and the legend shows a red point that is mapped to the word blue." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>What does the <code>stroke</code> aesthetic do? What shapes does it work with? (Hint: use <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">?geom_point</a></code>)</p></li>
|
||||
<li><p>What happens if you map an aesthetic to something other than a variable name, like <code>aes(color = displ < 5)</code>? Note, you’ll also need to specify x and y.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="sec-geometric-objects" data-type="sect1">
|
||||
<h1>
|
||||
Geometric objects</h1>
|
||||
<p>How are these two plots similar?</p>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-9-1.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-9-2.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.</p>
|
||||
<p>To change the geom in your plot, change the geom function that you add to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. For instance, to make the plots above, you can use this code:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()</pre>
|
||||
</div>
|
||||
<p>Every geom function in ggplot2 takes a <code>mapping</code> argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you <em>could</em> set the linetype of a line. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># Left: the shape mapping is ignored by geom_smooth()
ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_smooth()

# Right: each value of drv gets its own linetype
ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) +
  geom_smooth()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-11-1.png" alt="Two plots of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves. On the left, three smooth curves, all with the same linetype. On the right, three smooth curves with different line types (solid, dashed, or long dashed) for each type of drive train. In both plots, confidence intervals around the smooth curves are also displayed." width="576"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-11-2.png" alt="Two plots of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves. On the left, three smooth curves, all with the same linetype. On the right, three smooth curves with different line types (solid, dashed, or long dashed) for each type of drive train. In both plots, confidence intervals around the smooth curves are also displayed." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Here, <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> separates the cars into three lines based on their <code>drv</code> value, which describes a car’s drive train. One line describes all of the points that have a <code>4</code> value, one line describes all of the points that have an <code>f</code> value, and one line describes all of the points that have an <code>r</code> value. Here, <code>4</code> stands for four-wheel drive, <code>f</code> for front-wheel drive, and <code>r</code> for rear-wheel drive.</p>
|
||||
<p>If this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to <code>drv</code>.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-12-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with points (colored by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed." width="576"/></p>
|
||||
</div>
|
||||
</div>
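<p>The code for this plot is not shown. One way to produce a plot matching its description, offered as a sketch rather than the exact code behind the figure, is to map <code>color</code> for the whole plot and <code>linetype</code> in the smooth layer only:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# color = drv is set in ggplot(), so it applies to both layers;
# linetype = drv is set only in the smooth layer.
p_overlay <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(aes(linetype = drv))

p_overlay</pre>
</div>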
|
||||
<p>Notice that this plot contains two geoms in the same graph.</p>
|
||||
<p>Many geoms, like <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>, use a single geometric object to display multiple rows of data. For these geoms, you can set the <code>group</code> aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the <code>linetype</code> example). It is convenient to rely on this feature because the <code>group</code> aesthetic by itself does not add a legend or distinguishing features to the geoms.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(group = drv))

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(color = drv), show.legend = FALSE)</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-13-1.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars on the x-axis, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-13-2.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars on the x-axis, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-13-3.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars on the x-axis, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings <em>for that layer only</em>. This makes it possible to display different aesthetics in different layers.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-14-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can use the same idea to specify different <code>data</code> for each layer. Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code> overrides the global data argument in <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> for that layer only.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    color = "red"
  ) +
  geom_point(
    data = mpg |> filter(class == "2seater"),
    shape = "circle open", size = 3, color = "red"
  )</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where two-seater cars are highlighted with red points and open red circles." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>(You’ll learn how <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> works in the chapter on data transformations: for now, just know that this command selects only the two-seater cars.)</p>
|
||||
<p>Geoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed while the boxplot reveals two potential outliers.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Middle
ggplot(mpg, aes(x = hwy)) +
  geom_density()

# Right
ggplot(mpg, aes(x = hwy)) +
  geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-16-1.png" alt="Three plots: histogram, density plot, and box plot of highway mileage." width="576"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-16-2.png" alt="Three plots: histogram, density plot, and box plot of highway mileage." width="576"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-16-3.png" alt="Three plots: histogram, density plot, and box plot of highway mileage." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>ggplot2 provides more than 40 geoms, but these don’t cover all possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a> for a sampling). For example, the <strong>ggridges</strong> package (<a href="https://wilkelab.org/ggridges/" class="uri">https://wilkelab.org/ggridges</a>) makes ridgeline plots, which are useful for visualizing the density of a numerical variable for different levels of a categorical variable. In the following plot, not only did we use a new geom (<code><a href="https://wilkelab.org/ggridges/reference/geom_density_ridges.html">geom_density_ridges()</a></code>), but we also mapped the same variable to multiple aesthetics (<code>drv</code> to <code>y</code>, <code>fill</code>, and <code>color</code>) and set an aesthetic (<code>alpha = 0.5</code>) to make the density curves transparent.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(ggridges)

ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
  geom_density_ridges(alpha = 0.5, show.legend = FALSE)
#> Picking joint bandwidth of 1.28</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-17-1.png" alt="Density curves for highway mileage for cars with rear wheel, front wheel, and 4-wheel drives plotted separately. The distribution is bimodal and roughly symmetric for rear and 4 wheel drive cars and unimodal and right skewed for front wheel drive cars." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: <a href="https://ggplot2.tidyverse.org/reference" class="uri">https://ggplot2.tidyverse.org/reference</a>. To learn more about any single geom, use the help (e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">?geom_smooth</a></code>).</p>
|
||||
|
||||
<section id="exercises-1" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?</p></li>
|
||||
<li>
|
||||
<p>Earlier in this chapter we used <code>show.legend</code> without explaining it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(color = drv), show.legend = FALSE)</pre>
|
||||
</div>
|
||||
<p>What does <code>show.legend = FALSE</code> do here? What happens if you remove it? Why do you think we used it earlier?</p>
|
||||
</li>
|
||||
<li><p>What does the <code>se</code> argument to <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> do?</p></li>
|
||||
<li>
|
||||
<p>Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s <code>drv</code>.</p>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-19-1.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with a different line type is fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-19-2.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with a different line type is fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-19-3.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with a different line type is fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-19-4.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with a different line type is fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-19-5.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-19-6.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="facets" data-type="sect1">
|
||||
<h1>
|
||||
Facets</h1>
|
||||
<p>In <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a> you learned about faceting with <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code>, which splits a plot into subplots that each display one subset of the data based on a categorical variable.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
facet_wrap(~cyl)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-20-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by number of cylinders, with facets spanning two rows." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>To facet your plot with the combination of two variables, switch from <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> to <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> is also a formula, but now it’s a double sided formula: <code>rows ~ cols</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
facet_grid(drv ~ cyl)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-21-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>By default, all facets share the same scale for the x and y axes. This is useful when you want to compare data across facets, but it can be limiting when you want to better visualize the relationship within each facet. Setting the <code>scales</code> argument in a faceting function to <code>"free"</code> allows the axis scales to differ across both rows and columns. The other options for this argument are <code>"free_x"</code>, which allows different x-axis scales across columns, and <code>"free_y"</code>, which allows different y-axis scales across rows.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
facet_grid(drv ~ cyl, scales = "free")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-22-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive. Facets within a row share the same y-scale and facets within a column share the same x-scale." width="576"/></p>
|
||||
</div>
|
||||
</div>
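<p>To free only one axis, pass <code>"free_y"</code> or <code>"free_x"</code> instead. The following is a minimal sketch, assuming the ggplot2 package is installed; the object name <code>p_free_y</code> is ours and chosen only for illustration. Each row of facets should get its own y-axis range, while all facets keep a common x-axis.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Let the y-axis (hwy) vary from row to row; the x-axis (displ) stays shared.
p_free_y <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl, scales = "free_y")

p_free_y</pre>
</div>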
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>What happens if you facet on a continuous variable?</p></li>
|
||||
<li>
|
||||
<p>What do the empty cells in the plot with <code>facet_grid(drv ~ cyl)</code> mean? How do they relate to the following plot?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
|
||||
geom_point(aes(x = drv, y = cyl))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-23-1.png" alt="Scatterplot of number of cylinders versus type of drive train of cars. The plot shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are rear wheel drive." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>What plots does the following code make? What does <code>.</code> do?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
|
||||
geom_point(aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ .)
|
||||
|
||||
ggplot(mpg) +
|
||||
geom_point(aes(x = displ, y = hwy)) +
|
||||
facet_grid(. ~ cyl)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Take the following faceted plot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
|
||||
geom_point(aes(x = displ, y = hwy)) +
|
||||
facet_wrap(~ class, nrow = 2)</pre>
|
||||
</div>
|
||||
<p>What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?</p>
|
||||
</li>
|
||||
<li><p>Read <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">?facet_wrap</a></code>. What does <code>nrow</code> do? What does <code>ncol</code> do? What other options control the layout of the individual panels? Why doesn’t <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> have <code>nrow</code> and <code>ncol</code> arguments?</p></li>
|
||||
<li>
|
||||
<p>Which of the following two plots makes it easier to compare engine size (<code>displ</code>) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
|
||||
geom_point(aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ .)
|
||||
|
||||
ggplot(mpg) +
|
||||
geom_point(aes(x = displ, y = hwy)) +
|
||||
facet_grid(. ~ drv)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-26-1.png" alt="Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the top plot, facets are organized across rows and in the second, across columns." width="576"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-26-2.png" alt="Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the top plot, facets are organized across rows and in the second, across columns." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>Recreate this plot using <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. How do the positions of the facet labels change?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
|
||||
geom_point(aes(x = displ, y = hwy)) +
|
||||
facet_grid(drv ~ .)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="statistical-transformations" data-type="sect1">
|
||||
<h1>
|
||||
Statistical transformations</h1>
|
||||
<p>Consider a basic bar chart, drawn with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code>. The following chart displays the total number of diamonds in the <code>diamonds</code> dataset, grouped by <code>cut</code>. The <code>diamonds</code> dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the <code>price</code>, <code>carat</code>, <code>color</code>, <code>clarity</code>, and <code>cut</code> of each diamond. The chart shows that more diamonds are available with high-quality cuts than with low-quality cuts.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-28-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>On the x-axis, the chart displays <code>cut</code>, a variable from <code>diamonds</code>. On the y-axis, it displays count, but count is not a variable in <code>diamonds</code>! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:</p>
|
||||
<ul><li><p>Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.</p></li>
|
||||
<li><p>Smoothers fit a model to your data and then plot predictions from the model.</p></li>
|
||||
<li><p>Boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.</p></li>
|
||||
</ul><p>The algorithm used to calculate new values for a graph is called a <strong>stat</strong>, short for statistical transformation. <a href="#fig-vis-stat-bar" data-type="xref">#fig-vis-stat-bar</a> shows how this process works with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-vis-stat-bar"><p><img src="images/visualization-stat-bar.png" style="width:100.0%" alt="A figure demonstrating three steps of creating a bar chart. Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar() transforms the data with the count stat, which returns a data set of cut values and counts. Step 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis."/></p>
|
||||
<figcaption>When creating a bar chart, we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can learn which stat a geom uses by inspecting the default value for the <code>stat</code> argument. For example, <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">?geom_bar</a></code> shows that the default value for <code>stat</code> is “count”, which means that <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> uses <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code> is documented on the same page as <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>. If you scroll down, the section called “Computed variables” explains that it computes two new variables: <code>count</code> and <code>prop</code>.</p>
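<p>One way to see these computed variables for yourself is <code>layer_data()</code>, which returns the data frame that a layer draws after its stat has run. The following is a sketch, assuming ggplot2 is installed; the names <code>p</code> and <code>computed</code> are illustrative and not part of the original example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

p <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Extract the data produced by stat_count() for the first (and only) layer.
computed <- layer_data(p, i = 1)

# One row per bar; count and prop are the variables computed by the stat.
computed[, c("x", "count", "prop")]</pre>
</div>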
|
||||
<p>Every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:</p>
|
||||
<ol type="1"><li>
|
||||
<p>You might want to override the default stat. In the code below, we change the stat of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> from count (the default) to identity. This lets us map the height of the bars to the raw values of a <span class="math inline">\(y\)</span> variable.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">cut_frequencies <- tribble(
|
||||
~cut, ~freq,
|
||||
"Fair", 1610,
|
||||
"Good", 4906,
|
||||
"Very Good", 12082,
|
||||
"Premium", 13791,
|
||||
"Ideal", 21551
|
||||
)
|
||||
|
||||
ggplot(cut_frequencies, aes(x = cut, y = freq)) +
|
||||
geom_bar(stat = "identity")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-30-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p>You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-31-1.png" alt="Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and Ideal 0.40." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>To find the variables computed by the stat, look for the section titled “computed variables” in the help for <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>You might want to draw greater attention to the statistical transformation in your code. For example, you might use <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds) +
|
||||
stat_summary(
|
||||
aes(x = cut, y = depth),
|
||||
fun.min = min,
|
||||
fun.max = max,
|
||||
fun = median
|
||||
)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-32-1.png" alt="A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol><p>ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">?stat_bin</a></code>.</p>
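<p>Because every stat has a default geom, a stat function can also stand in for the geom. The sketch below, which assumes ggplot2 is installed, should draw the same bar chart as the <code>geom_bar()</code> call at the start of this section, since <code>stat_count()</code> draws bars by default. The object names <code>by_geom</code> and <code>by_stat</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Start from the geom and rely on its default stat (count).
by_geom <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Start from the stat and rely on its default geom (bar).
by_stat <- ggplot(diamonds, aes(x = cut)) +
  stat_count()

by_geom
by_stat</pre>
</div>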
|
||||
|
||||
<section id="exercises-3" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>What is the default geom associated with <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>? How could you rewrite the previous plot to use that geom function instead of the stat function?</p></li>
|
||||
<li><p>What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code> do? How is it different from <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>?</p></li>
|
||||
<li><p>Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?</p></li>
|
||||
<li><p>What variables does <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">stat_smooth()</a></code> compute? What parameters control its behavior?</p></li>
|
||||
<li>
|
||||
<p>In our proportion bar chart, we need to set <code>group = 1</code>. Why? In other words, what is the problem with these two graphs?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = after_stat(prop))) +
|
||||
geom_bar()
|
||||
ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) +
|
||||
geom_bar()</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="position-adjustments" data-type="sect1">
|
||||
<h1>
|
||||
Position adjustments</h1>
|
||||
<p>There’s one more piece of magic associated with bar charts. You can color a bar chart using either the <code>color</code> aesthetic, or, more usefully, <code>fill</code>:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, color = cut)) +
|
||||
geom_bar()
|
||||
ggplot(diamonds, aes(x = cut, fill = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-34-1.png" alt="Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-34-2.png" alt="Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>Note what happens if you map the fill aesthetic to another variable, like <code>clarity</code>: the bars are automatically stacked. Each colored rectangle represents a combination of <code>cut</code> and <code>clarity</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-35-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The stacking is performed automatically using the <strong>position adjustment</strong> specified by the <code>position</code> argument. If you don’t want a stacked bar chart, you can use one of three other options: <code>"identity"</code>, <code>"dodge"</code> or <code>"fill"</code>.</p>
|
||||
<ul><li>
|
||||
<p><code>position = "identity"</code> will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see the overlapping, we need to make the bars either slightly transparent, by setting <code>alpha</code> to a small value, or completely transparent, by setting <code>fill = NA</code>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
|
||||
geom_bar(alpha = 1/5, position = "identity")
|
||||
ggplot(diamonds, aes(x = cut, color = clarity)) +
|
||||
geom_bar(fill = NA, position = "identity")</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-36-1.png" alt="Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-36-2.png" alt="Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
<p>The identity position adjustment is more useful for 2d geoms, like points, where it is the default.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p><code>position = "fill"</code> works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
|
||||
geom_bar(position = "fill")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-37-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Height of each bar is 1 and heights of the colored segments are proportional to the proportion of diamonds with a given clarity level within a given cut level." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p><code>position = "dodge"</code> places overlapping objects directly <em>beside</em> one another. This makes it easier to compare individual values.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
|
||||
geom_bar(position = "dodge")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-38-1.png" alt="Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ul><p>There’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-39-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>The underlying values of <code>hwy</code> and <code>displ</code> are rounded so the points appear on a grid and many points overlap each other. This problem is known as <strong>overplotting</strong>. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of <code>hwy</code> and <code>displ</code> that contains 109 values?</p>
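<p>You can check how much overlap there is by counting the distinct positions in the data. This is a sketch that assumes dplyr is installed alongside ggplot2; the names <code>n_obs</code>, <code>n_positions</code>, and <code>overlaps</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of observations versus number of distinct (displ, hwy) positions.
n_obs <- nrow(mpg)
n_positions <- nrow(distinct(mpg, displ, hwy))

# Positions ranked by how many observations are drawn on top of each other.
overlaps <- count(mpg, displ, hwy, sort = TRUE)
overlaps</pre>
</div>
<p>Comparing <code>n_obs</code> with <code>n_positions</code> shows how many points are hidden, and the first rows of <code>overlaps</code> show where the overlap is heaviest.</p>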
|
||||
<p>You can avoid this gridding by setting the position adjustment to “jitter”. <code>position = "jitter"</code> adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
|
||||
geom_point(position = "jitter")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-40-1.png" alt="Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph <em>more</em> revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for <code>geom_point(position = "jitter")</code>: <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>.</p>
|
||||
<p>To learn more about a position adjustment, look up the help page associated with each adjustment: <code><a href="https://ggplot2.tidyverse.org/reference/position_dodge.html">?position_dodge</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_fill</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_identity.html">?position_identity</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_jitter.html">?position_jitter</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_stack</a></code>.</p>
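<p>Each string passed to <code>position</code> has a matching <code>position_*()</code> function, and calling the function directly lets you set its arguments. As a sketch (assuming ggplot2 is installed; the seed value 123 is arbitrary and the name <code>p_jitter</code> is ours), fixing <code>seed</code> in <code>position_jitter()</code> should make the random noise reproducible, so the plot looks the same each time it is drawn.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Same plot as geom_point(position = "jitter"), but with a fixed random seed.
p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(seed = 123))

p_jitter</pre>
</div>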
|
||||
|
||||
<section id="exercises-4" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>
|
||||
<p>What is the problem with this plot? How could you improve it?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = cty, y = hwy)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-41-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>What parameters to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> control the amount of jittering?</p></li>
|
||||
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> with <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>.</p></li>
|
||||
<li><p>What’s the default position adjustment for <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>? Create a visualization of the <code>mpg</code> dataset that demonstrates it.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="coordinate-systems" data-type="sect1">
|
||||
<h1>
|
||||
Coordinate systems</h1>
|
||||
<p>Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are two other coordinate systems that are occasionally helpful.</p>
|
||||
<ul><li>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the <a href="https://ggplot2-book.org/maps.html">Maps chapter</a> of <em>ggplot2: Elegant graphics for data analysis</em>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">nz <- map_data("nz")
|
||||
|
||||
ggplot(nz, aes(x = long, y = lat, group = group)) +
|
||||
geom_polygon(fill = "white", color = "black")
|
||||
|
||||
ggplot(nz, aes(x = long, y = lat, group = group)) +
|
||||
geom_polygon(fill = "white", color = "black") +
|
||||
coord_quickmap()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-42-1.png" alt="Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-42-2.png" alt="Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
<li>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code> uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">bar <- ggplot(data = diamonds) +
|
||||
geom_bar(
|
||||
mapping = aes(x = cut, fill = cut),
|
||||
show.legend = FALSE,
|
||||
width = 1
|
||||
) +
|
||||
theme(aspect.ratio = 1) +
|
||||
labs(x = NULL, y = NULL)
|
||||
|
||||
bar + coord_flip()
|
||||
bar + coord_polar()</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
<div class="quarto-layout-row quarto-layout-valign-top">
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-43-1.png" alt="There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data." width="384"/></p>
|
||||
</div>
|
||||
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-43-2.png" alt="There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data." width="384"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ul>
|
||||
<section id="exercises-5" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Turn a stacked bar chart into a pie chart using <code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code>.</p></li>
|
||||
<li><p>What’s the difference between <code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_map()</a></code>?</p></li>
|
||||
<li>
|
||||
<p>What does the plot below tell you about the relationship between city and highway mpg? Why is <code><a href="https://ggplot2.tidyverse.org/reference/coord_fixed.html">coord_fixed()</a></code> important? What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_abline()</a></code> do?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_abline() +
|
||||
coord_fixed()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="layers_files/figure-html/unnamed-chunk-44-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The plot also has a straight line that follows the trend of the relationship between the variables but does not go through the cloud of points; it is beneath it." width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="the-layered-grammar-of-graphics" data-type="sect1">
|
||||
<h1>
|
||||
The layered grammar of graphics</h1>
|
||||
<p>We can expand on the graphing template you learned in <span class="quarto-unresolved-ref">?sec-graphing-template</span> by adding position adjustments, stats, coordinate systems, and faceting:</p>
|
||||
<pre><code>ggplot(data = <DATA>) +
|
||||
<GEOM_FUNCTION>(
|
||||
mapping = aes(<MAPPINGS>),
|
||||
stat = <STAT>,
|
||||
position = <POSITION>
|
||||
) +
|
||||
<COORDINATE_FUNCTION> +
|
||||
<FACET_FUNCTION></code></pre>
|
||||
<p>Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.</p>
|
||||
<p>The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe <em>any</em> plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.</p>
|
||||
<p>To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="images/visualization-grammar.png" alt="A figure demonstrating the steps for going from raw data to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level." width="1332"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>At this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.</p>
|
||||
<p>You could use this method to build <em>any</em> plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.</p>
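<p>As a minimal sketch of the template with every slot filled in explicitly, the code below uses the <code>diamonds</code> dataset from earlier in the chapter. The particular choices (a dodged bar chart, flipped coordinates, and facets by <code>color</code>) are illustrative assumptions rather than a recommended design, and <code>stat = "count"</code> only spells out what <code>geom_bar()</code> already uses by default.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# One value for each of the seven template parameters
p <- ggplot(data = diamonds) +           # data
  geom_bar(                              # geom function
    mapping = aes(x = cut, fill = clarity),  # mappings
    stat = "count",                      # stat (the default for geom_bar)
    position = "dodge"                   # position adjustment
  ) +
  coord_flip() +                         # coordinate function
  facet_wrap(~color)                     # facet function

p</pre>
</div>
<p>Dropping <code>stat</code>, <code>position</code>, <code>coord_flip()</code>, or <code>facet_wrap()</code> still produces a valid plot, because ggplot2 falls back to its defaults for those parts.</p>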
|
||||
<p>If you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “<a href="https://vita.had.co.nz/papers/layered-grammar.pdf">The Layered Grammar of Graphics</a>”, the scientific paper that describes the theory of ggplot2 in detail.</p>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter you learned about the layered grammar of graphics, starting with aesthetics and geometries to build a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems that allow you to fundamentally change what <code>x</code> and <code>y</code> mean. One layer we have not yet touched on is theme, which we will introduce in <a href="#sec-themes" data-type="xref">#sec-themes</a>.</p>
|
||||
<p>Two very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at <a href="https://posit.co/resources/cheatsheets" class="uri">https://posit.co/resources/cheatsheets</a> ) and the ggplot2 package website (<a href="https://ggplot2.tidyverse.org/">https://ggplot2.tidyverse.org</a>).</p>
|
||||
<p>An important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it’s always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom.</p>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
|
@ -3,7 +3,7 @@
|
|||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate in the course of almost every analysis.</p>
|
||||
<p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p>
|
||||
<p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
|
@ -20,7 +20,7 @@ library(nycflights13)</pre>
|
|||
x * 2
|
||||
#> [1] 2 4 6 10 14 22 26</pre>
|
||||
</div>
|
||||
<p>This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside data frame with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and friends.</p>
|
||||
<p>This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and friends.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x)
|
||||
df |>
|
||||
|
@ -47,18 +47,18 @@ Comparisons</h1>
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
|
||||
#> # A tibble: 172,286 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 601 600 1 844 850 -6 B6
|
||||
#> 2 2013 1 1 602 610 -8 812 820 -8 DL
|
||||
#> 3 2013 1 1 602 605 -3 821 805 16 MQ
|
||||
#> 4 2013 1 1 606 610 -4 858 910 -12 AA
|
||||
#> 5 2013 1 1 606 610 -4 837 845 -8 DL
|
||||
#> 6 2013 1 1 607 607 0 858 915 -17 UA
|
||||
#> # … with 172,280 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 601 600 1 844 850
|
||||
#> 2 2013 1 1 602 610 -8 812 820
|
||||
#> 3 2013 1 1 602 605 -3 821 805
|
||||
#> 4 2013 1 1 606 610 -4 858 910
|
||||
#> 5 2013 1 1 606 610 -4 837 845
|
||||
#> 6 2013 1 1 607 607 0 858 915
|
||||
#> # … with 172,280 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
|
||||
<div class="cell">
|
||||
|
@ -104,7 +104,7 @@ x
|
|||
<pre data-type="programlisting" data-code-language="r">x == c(1, 2)
|
||||
#> [1] FALSE FALSE</pre>
|
||||
</div>
|
||||
<p>What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code> and subsequent computations will be very slightly off. We can see the exact values by calling <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> with the the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p>
|
||||
<p>What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code> and subsequent computations will be very slightly off. We can see the exact values by calling <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> with the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">print(x, digits = 16)
|
||||
#> [1] 0.9999999999999999 2.0000000000000004</pre>
|
||||
|
@ -145,7 +145,7 @@ x == y
|
|||
#> [1] NA
|
||||
# We don't know!</pre>
|
||||
</div>
|
||||
<p>So if you want to find all flights with <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield a <code>NA</code> for every single row, and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> automatically drops missing values:</p>
|
||||
<p>So if you want to find all flights where <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield <code>NA</code> for every single row, and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> automatically drops missing values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_time == NA)
|
||||
|
@ -177,18 +177,18 @@ is.na(c("a", NA, "b"))
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(is.na(dep_time))
|
||||
#> # A tibble: 8,255 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 NA 1630 NA NA 1815 NA EV
|
||||
#> 2 2013 1 1 NA 1935 NA NA 2240 NA AA
|
||||
#> 3 2013 1 1 NA 1500 NA NA 1825 NA AA
|
||||
#> 4 2013 1 1 NA 600 NA NA 901 NA B6
|
||||
#> 5 2013 1 2 NA 1540 NA NA 1747 NA EV
|
||||
#> 6 2013 1 2 NA 1620 NA NA 1746 NA EV
|
||||
#> # … with 8,249 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 NA 1630 NA NA 1815
|
||||
#> 2 2013 1 1 NA 1935 NA NA 2240
|
||||
#> 3 2013 1 1 NA 1500 NA NA 1825
|
||||
#> 4 2013 1 1 NA 600 NA NA 901
|
||||
#> 5 2013 1 2 NA 1540 NA NA 1747
|
||||
#> 6 2013 1 2 NA 1620 NA NA 1746
|
||||
#> # … with 8,249 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
|
||||
<div class="cell">
|
||||
|
@ -196,35 +196,35 @@ is.na(c("a", NA, "b"))
|
|||
filter(month == 1, day == 1) |>
|
||||
arrange(dep_time)
|
||||
#> # A tibble: 842 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>
|
||||
|
||||
flights |>
|
||||
filter(month == 1, day == 1) |>
|
||||
arrange(desc(is.na(dep_time)), dep_time)
|
||||
#> # A tibble: 842 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 NA 1630 NA NA 1815 NA EV
|
||||
#> 2 2013 1 1 NA 1935 NA NA 2240 NA AA
|
||||
#> 3 2013 1 1 NA 1500 NA NA 1825 NA AA
|
||||
#> 4 2013 1 1 NA 600 NA NA 901 NA B6
|
||||
#> 5 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 6 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 NA 1630 NA NA 1815
|
||||
#> 2 2013 1 1 NA 1935 NA NA 2240
|
||||
#> 3 2013 1 1 NA 1500 NA NA 1825
|
||||
#> 4 2013 1 1 NA 600 NA NA 901
|
||||
#> 5 2013 1 1 517 515 2 830 819
|
||||
#> 6 2013 1 1 533 529 4 850 830
|
||||
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>We’ll come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
|
||||
</section>
|
||||
|
@ -240,7 +240,7 @@ Exercises</h2>
|
|||
<section id="boolean-algebra" data-type="sect1">
|
||||
<h1>
|
||||
Boolean algebra</h1>
|
||||
<p>Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, <code>&</code> is “and”, <code>|</code> is “or”, and <code>!</code> is “not”, and <code><a href="https://rdrr.io/r/base/Logic.html">xor()</a></code> is exclusive or<span data-type="footnote">That is, <code>xor(x, y)</code> is true if x is true, or y is true, but not both. This is how we usually use “or” In English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.</span>. <a href="#fig-bool-ops" data-type="xref">#fig-bool-ops</a> shows the complete set of Boolean operations and how they work.</p>
|
||||
<p>Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, <code>&</code> is “and”, <code>|</code> is “or”, <code>!</code> is “not”, and <code><a href="https://rdrr.io/r/base/Logic.html">xor()</a></code> is exclusive or<span data-type="footnote">That is, <code>xor(x, y)</code> is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.</span>. <a href="#fig-bool-ops" data-type="xref">#fig-bool-ops</a> shows the complete set of Boolean operations and how they work.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
||||
|
@ -249,7 +249,7 @@ Boolean algebra</h1>
|
|||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>As well as <code>&</code> and <code>|</code>, R also has <code>&&</code> and <code>||</code>. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single <code>TRUE</code> or <code>FALSE</code>. They’re important for programming, not data science</p>
|
||||
<p>As well as <code>&</code> and <code>|</code>, R also has <code>&&</code> and <code>||</code>. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single <code>TRUE</code> or <code>FALSE</code>. They’re important for programming, not data science.</p>
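<p>The difference can be seen on a pair of small free-floating vectors. This is a minimal sketch in base R; the vectors <code>x</code> and <code>y</code> are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# & and | work element by element and return a vector
x & y
#> [1]  TRUE FALSE FALSE
x | y
#> [1] TRUE TRUE TRUE

# && and || take single values and stop as soon as the answer is known,
# so the right-hand side here is never evaluated
FALSE && stop("never evaluated")
#> [1] FALSE
TRUE || stop("never evaluated")
#> [1] TRUE</pre>
</div>
<p>Because <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> need one result per row, the vectorized <code>&</code> and <code>|</code> are the operators to reach for inside them.</p>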
|
||||
|
||||
<section id="sec-na-boolean" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -276,30 +276,30 @@ df |>
|
|||
<section id="order-of-operations" data-type="sect2">
|
||||
<h2>
|
||||
Order of operations</h2>
|
||||
<p>Note that the order of operations doesn’t work like English. Take the following code finds all flights that departed in November or December:</p>
|
||||
<p>Note that the order of operations doesn’t work like English. Take the following code that finds all flights that departed in November or December:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 11 | month == 12)</pre>
|
||||
</div>
|
||||
<p>You might be tempted to write it like you’d say in English: “find all flights that departed in November or December”:</p>
|
||||
<p>You might be tempted to write it like you’d say in English: “Find all flights that departed in November or December”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 11 | 12)
|
||||
#> # A tibble: 336,776 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 533 529 4 850 830 20 UA
|
||||
#> 3 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022 -18 B6
|
||||
#> 5 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 6 2013 1 1 554 558 -4 740 728 12 UA
|
||||
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 533 529 4 850 830
|
||||
#> 3 2013 1 1 542 540 2 923 850
|
||||
#> 4 2013 1 1 544 545 -1 1004 1022
|
||||
#> 5 2013 1 1 554 600 -6 812 837
|
||||
#> 6 2013 1 1 554 558 -4 740 728
|
||||
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>This code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here R first evaluates <code>month == 11</code> creating a logical vector, which we call <code>nov</code>. It computes <code>nov | 12</code>. When you use a number with a logical operator it converts everything apart from 0 to TRUE, so this is equivalent to <code>nov | TRUE</code> which will always be <code>TRUE</code>, so every row will be selected:</p>
|
||||
<p>This code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates <code>month == 11</code> creating a logical vector, which we call <code>nov</code>. It computes <code>nov | 12</code>. When you use a number with a logical operator it converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code> which will always be <code>TRUE</code>, so every row will be selected:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
|
@ -348,18 +348,18 @@ c(1, 2, NA) %in% NA
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_time %in% c(NA, 0800))
|
||||
#> # A tibble: 8,803 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 800 800 0 1022 1014 8 DL
|
||||
#> 2 2013 1 1 800 810 -10 949 955 -6 MQ
|
||||
#> 3 2013 1 1 NA 1630 NA NA 1815 NA EV
|
||||
#> 4 2013 1 1 NA 1935 NA NA 2240 NA AA
|
||||
#> 5 2013 1 1 NA 1500 NA NA 1825 NA AA
|
||||
#> 6 2013 1 1 NA 600 NA NA 901 NA B6
|
||||
#> # … with 8,797 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 800 800 0 1022 1014
|
||||
#> 2 2013 1 1 800 810 -10 949 955
|
||||
#> 3 2013 1 1 NA 1630 NA NA 1815
|
||||
#> 4 2013 1 1 NA 1935 NA NA 2240
|
||||
#> 5 2013 1 1 NA 1500 NA NA 1825
|
||||
#> 6 2013 1 1 NA 600 NA NA 901
|
||||
#> # … with 8,797 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
@ -368,7 +368,7 @@ c(1, 2, NA) %in% NA
|
|||
Exercises</h2>
|
||||
<ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> are missing, but <code>arr_delay</code> is.</li>
|
||||
<li>How many flights have a missing <code>dep_time</code>? What other variables are missing in these rows? What might these rows represent?</li>
|
||||
<li>Assuming that a missing <code>dep_time</code> implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and average delay of non-cancelled flights?</li>
|
||||
<li>Assuming that a missing <code>dep_time</code> implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
|
@ -385,7 +385,7 @@ Logical summaries</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
all_delayed = all(arr_delay >= 0, na.rm = TRUE),
|
||||
any_delayed = any(arr_delay >= 0, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
|
@ -404,18 +404,18 @@ Logical summaries</h2>
|
|||
<p>In most cases, however, <code><a href="https://rdrr.io/r/base/any.html">any()</a></code> and <code><a href="https://rdrr.io/r/base/all.html">all()</a></code> are a little too crude, and it would be nice to be able to get a little more detail about how many values are <code>TRUE</code> or <code>FALSE</code>. That leads us to the numeric summaries.</p>
|
||||
</section>
|
||||
|
||||
<section id="numeric-summaries-of-logical-vectors" data-type="sect2">
|
||||
<section id="sec-numeric-summaries-of-logicals" data-type="sect2">
|
||||
<h2>
|
||||
Numeric summaries of logical vectors</h2>
|
||||
<p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p>
|
||||
<p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
ggplot(aes(prop_delayed)) +
|
||||
ggplot(aes(x = prop_delayed)) +
|
||||
geom_histogram(binwidth = 0.05)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
||||
|
@ -424,11 +424,11 @@ Numeric summaries of logical vectors</h2>
|
|||
</figure>
|
||||
</div>
|
||||
</div>
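<p>The coercion behind these summaries can be checked on a small free-floating vector. The delays below are made up for illustration, with <code>NA</code> standing in for a flight with no recorded arrival.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical arrival delays in minutes
delays <- c(12, -3, 0, 45, NA, -10)
late <- delays > 0
late
#> [1]  TRUE FALSE FALSE  TRUE    NA FALSE

# Number of TRUEs, ignoring the missing value
sum(late, na.rm = TRUE)
#> [1] 2

# Proportion of TRUEs among the non-missing values
mean(late, na.rm = TRUE)
#> [1] 0.4</pre>
</div>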
|
||||
<p>Or we could ask how many flights left before 5am, which are often flights that were delayed from the previous day:</p>
|
||||
<p>Or we could ask how many flights left before 5am; these are often flights that were delayed from the previous day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n_early = sum(dep_time < 500, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
|
@ -450,12 +450,12 @@ Numeric summaries of logical vectors</h2>
|
|||
<h2>
|
||||
Logical subsetting</h2>
|
||||
<p>There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base <code>[</code> (pronounced subset) operator, which you’ll learn more about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>.</p>
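<p>As a minimal sketch of how <code>[</code> behaves with a logical vector, consider the made-up delays below. Note that a missing value in the condition produces a missing value in the result, which is why <code>na.rm = TRUE</code> is still needed when summarizing the subset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical arrival delays in minutes
delays <- c(10, -5, 30, NA)

# Keep only the elements where the condition is TRUE (NA stays NA)
delays[delays > 0]
#> [1] 10 30 NA

# Average of the positive delays and of the negative delays
mean(delays[delays > 0], na.rm = TRUE)
#> [1] 20
mean(delays[delays < 0], na.rm = TRUE)
#> [1] -5</pre>
</div>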
|
||||
<p>Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights:</p>
|
||||
<p>Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights and then calculate the average delay:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(arr_delay > 0) |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
behind = mean(arr_delay),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
|
@ -476,7 +476,7 @@ Logical subsetting</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
|
||||
ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
|
||||
n = n(),
|
||||
|
@ -500,7 +500,7 @@ Logical subsetting</h2>
|
|||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
|
||||
<li>What does <code><a href="https://rdrr.io/r/base/prod.html">prod()</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? What does <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.</li>
|
||||
<li>What does <code><a href="https://rdrr.io/r/base/prod.html">prod()</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? What does <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
|
@ -513,7 +513,7 @@ Conditional transformations</h1>
|
|||
<h2>
|
||||
<code>if_else()</code>
|
||||
</h2>
|
||||
<p>If you want to use one value when a condition is true and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if you variables have incompatible types.</span>. You’ll always use the first three argument of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
|
||||
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
|
||||
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c(-3:3, NA)
|
||||
|
@ -537,7 +537,7 @@ y1 <- c(3, NA, 4, 6)
|
|||
if_else(is.na(x1), y1, x1)
|
||||
#> [1] 3 1 2 6</pre>
|
||||
</div>
|
||||
<p>You might have noticed a small infelicity in our labeling: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>:</p>
|
||||
<p>You might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
|
||||
#> [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
|
||||
|
@ -549,7 +549,7 @@ if_else(is.na(x1), y1, x1)
|
|||
<h2>
|
||||
<code>case_when()</code>
|
||||
</h2>
|
||||
<p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different computations. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
|
||||
<p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
|
||||
<p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">case_when(
|
||||
|
@ -592,11 +592,11 @@ if_else(is.na(x1), y1, x1)
|
|||
mutate(
|
||||
status = case_when(
|
||||
is.na(arr_delay) ~ "cancelled",
|
||||
arr_delay > 60 ~ "very late",
|
||||
arr_delay > 15 ~ "late",
|
||||
abs(arr_delay) <= 15 ~ "on time",
|
||||
arr_delay < -15 ~ "early",
|
||||
arr_delay < -30 ~ "very early",
|
||||
arr_delay < -15 ~ "early",
|
||||
abs(arr_delay) <= 15 ~ "on time",
|
||||
arr_delay > 60 ~ "very late",
|
||||
arr_delay > 15 ~ "late",
|
||||
),
|
||||
.keep = "used"
|
||||
)
|
||||
|
@ -612,13 +612,38 @@ if_else(is.na(x1), y1, x1)
|
|||
#> # … with 336,770 more rows</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
<section id="compatible-types" data-type="sect2">
|
||||
<h2>
|
||||
Compatible types</h2>
|
||||
<p>Note that both <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> require <strong>compatible</strong> types in the output. If they’re not compatible, you’ll see errors like this:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">if_else(TRUE, "a", 1)
|
||||
#> Error in `if_else()`:
|
||||
#> ! Can't combine `true` <character> and `false` <double>.
|
||||
|
||||
case_when(
|
||||
x < -1 ~ TRUE,
|
||||
x > 0 ~ lubridate::now()
|
||||
)
|
||||
#> Error in `case_when()`:
|
||||
#> ! Can't combine `TRUE` <logical> and `lubridate::now()` <datetime<local>>.</pre>
|
||||
</div>
|
||||
<p>Overall, relatively few types are compatible, because automatically converting one type of vector to another is a common source of errors. Here are the most important cases that are compatible:</p>
|
||||
<ul><li>Numeric and logical vectors are compatible, as we discussed in <a href="#sec-numeric-summaries-of-logicals" data-type="xref">#sec-numeric-summaries-of-logicals</a>.</li>
|
||||
<li>Strings and factors (<a href="#chp-factors" data-type="xref">#chp-factors</a>) are compatible, because you can think of a factor as a string with a restricted set of values.</li>
|
||||
<li>Dates and date-times, which we’ll discuss in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>, are compatible because you can think of a date as a special case of date-time.</li>
|
||||
<li>
|
||||
<code>NA</code>, which is technically a logical vector, is compatible with everything because every vector has some way of representing a missing value.</li>
|
||||
</ul><p>We don’t expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.</p>
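<p>The compatible cases listed above can be tried directly. This is a minimal sketch that assumes dplyr 1.1.0 or later is installed (earlier versions were stricter about types); the vector <code>x</code> and the labels are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x <- c(-2, 0, 3, NA)

# Logical and numeric outputs combine into a numeric result
if_else(x > 0, TRUE, 0)
#> [1]  0  0  1 NA

# A bare NA combines with a character output; the result is a character vector
if_else(x > 0, "+ve", NA)

# A factor and a string combine into a character result
if_else(x > 0, factor("+ve"), "not +ve")</pre>
</div>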
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>></code>, <code><</code>, <code><=</code>, <code>=></code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> that allow you to return values depending on the value of a logical vector.</p>
|
||||
<p>We’ll see logical vectors again and in the following chapters. For example in <a href="#chp-strings" data-type="xref">#chp-strings</a> you’ll learn about <code>str_detect(x, pattern)</code> which returns a logical vector that’s <code>TRUE</code> for the elements of <code>x</code> that match the <code>pattern</code>, and in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move onto the next most important type of vector: numeric vectors.</p>
|
||||
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>></code>, <code><</code>, <code><=</code>, <code>>=</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions that allow you to return values depending on the value of a logical vector.</p>
|
||||
<p>We’ll see logical vectors again and again in the following chapters. For example, in <a href="#chp-strings" data-type="xref">#chp-strings</a> you’ll learn about <code>str_detect(x, pattern)</code>, which returns a logical vector that’s <code>TRUE</code> for the elements of <code>x</code> that match the <code>pattern</code>, and in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.</p>
|
||||
|
||||
|
||||
</section>
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>You’ve already learned the basics of missing values earlier in the book. You first saw them in <a href="#sec-summarize" data-type="xref">#sec-summarize</a> where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in <a href="#sec-na-comparison" data-type="xref">#sec-na-comparison</a>. Now we’ll come back to them in more depth, so you can learn more of the details.</p>
|
||||
<p>You’ve already learned the basics of missing values earlier in the book. You first saw them in <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a> where they resulted in a warning when making a plot as well as in <a href="#sec-summarize" data-type="xref">#sec-summarize</a> where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in <a href="#sec-na-comparison" data-type="xref">#sec-na-comparison</a>. Now we’ll come back to them in more depth, so you can learn more of the details.</p>
|
||||
<p>We’ll start by discussing some general tools for working with missing values recorded as <code>NA</code>s. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
|
@ -247,11 +247,11 @@ Factors and empty groups</h1>
|
|||
</div>
|
||||
<p>The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying <code>drop = FALSE</code> to the appropriate discrete axis:</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(health, aes(smoker)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(health, aes(x = smoker)) +
|
||||
geom_bar() +
|
||||
scale_x_discrete()
|
||||
|
||||
ggplot(health, aes(smoker)) +
|
||||
ggplot(health, aes(x = smoker)) +
|
||||
geom_bar() +
|
||||
scale_x_discrete(drop = FALSE)</pre>
|
||||
<div class="cell quarto-layout-panel">
|
||||
|
@ -269,16 +269,16 @@ ggplot(health, aes(smoker)) +
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">health |>
|
||||
group_by(smoker, .drop = FALSE) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
mean_age = mean(age),
|
||||
min_age = min(age),
|
||||
max_age = max(age),
|
||||
sd_age = sd(age)
|
||||
)
|
||||
#> Warning: There were 2 warnings in `summarise()`.
|
||||
#> Warning: There were 2 warnings in `summarize()`.
|
||||
#> The first warning was:
|
||||
#> ℹ In argument `min_age = min(age)`.
|
||||
#> ℹ In argument: `min_age = min(age)`.
|
||||
#> ℹ In group 1: `smoker = yes`.
|
||||
#> Caused by warning in `min()`:
|
||||
#> ! no non-missing arguments to min; returning Inf
|
||||
|
@ -306,7 +306,7 @@ length(x2)
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">health |>
|
||||
group_by(smoker) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
mean_age = mean(age),
|
||||
min_age = min(age),
|
||||
|
|
|
@ -4,11 +4,20 @@
|
|||
<h1>
|
||||
Introduction</h1>
|
||||
<p>Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.</p>
|
||||
<p>We’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we’ll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
|
||||
<p>We’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we’ll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<div data-type="important"><div class="callout-body d-flex">
|
||||
<div class="callout-icon-container">
|
||||
<i class="callout-icon"/>
|
||||
</div>
|
||||
|
||||
</div>
|
||||
|
||||
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/dplyr")</code>.</p></div>
|
||||
|
||||
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
|
@ -20,7 +29,7 @@ library(nycflights13)</pre>
|
|||
<section id="making-numbers" data-type="sect1">
|
||||
<h1>
|
||||
Making numbers</h1>
|
||||
<p>In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or something has gone wrong in your data import process.</p>
|
||||
<p>In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.</p>
|
||||
<p>readr provides two useful functions for parsing strings into numbers: <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code>. Use <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> when you have numbers that have been written as strings:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("1.2", "5.6", "1e3")
|
||||
|
@ -53,7 +62,7 @@ Counts</h1>
|
|||
#> # … with 99 more rows</pre>
|
||||
</div>
|
||||
<p>(Despite the advice in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, we usually put <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)</p>
|
||||
<p>If you want to see the most common values add <code>sort = TRUE</code>:</p>
|
||||
<p>If you want to see the most common values, add <code>sort = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> count(dest, sort = TRUE)
|
||||
#> # A tibble: 105 × 2
|
||||
|
@ -68,11 +77,11 @@ Counts</h1>
|
|||
#> # … with 99 more rows</pre>
|
||||
</div>
|
||||
<p>And remember that if you want to see all the values, you can use <code>|> View()</code> or <code>|> print(n = Inf)</code>.</p>
|
||||
<p>You can perform the same computation “by hand” with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>. This is useful because it allows you to compute other summaries at the same time:</p>
|
||||
<p>You can perform the same computation “by hand” with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>. This is useful because it allows you to compute other summaries at the same time:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
delay = mean(arr_delay, na.rm = TRUE)
|
||||
)
|
||||
|
@ -100,7 +109,7 @@ Counts</h1>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(
|
||||
summarize(
|
||||
carriers = n_distinct(carrier)
|
||||
) |>
|
||||
arrange(desc(carriers))
|
||||
|
@ -121,7 +130,7 @@ Counts</h1>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(tailnum) |>
|
||||
summarise(miles = sum(distance))
|
||||
summarize(miles = sum(distance))
|
||||
#> # A tibble: 4,044 × 2
|
||||
#> tailnum miles
|
||||
#> <chr> <dbl>
|
||||
|
@ -153,7 +162,7 @@ Counts</h1>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(n_cancelled = sum(is.na(dep_time)))
|
||||
summarize(n_cancelled = sum(is.na(dep_time)))
|
||||
#> # A tibble: 105 × 2
|
||||
#> dest n_cancelled
|
||||
#> <chr> <int>
|
||||
|
@ -171,7 +180,7 @@ Counts</h1>
|
|||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>How can you use <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to count the number of rows with a missing value for a given variable?</li>
|
||||
<li>Expand the following calls to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to instead use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>:
|
||||
<li>Expand the following calls to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to instead use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>:
|
||||
<ol type="1"><li><p><code>flights |> count(dest, sort = TRUE)</code></p></li>
|
||||
<li><p><code>flights |> count(tailnum, wt = distance)</code></p></li>
|
||||
</ol></li>
|
||||
|
@ -210,20 +219,20 @@ x * c(1, 2, 3)
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == c(1, 2))
|
||||
#> # A tibble: 25,977 × 19
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 542 540 2 923 850 33 AA
|
||||
#> 3 2013 1 1 554 600 -6 812 837 -25 DL
|
||||
#> 4 2013 1 1 555 600 -5 913 854 19 B6
|
||||
#> 5 2013 1 1 557 600 -3 838 846 -8 B6
|
||||
#> 6 2013 1 1 558 600 -2 849 851 -2 B6
|
||||
#> # … with 25,971 more rows, 9 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 542 540 2 923 850
|
||||
#> 3 2013 1 1 554 600 -6 812 837
|
||||
#> 4 2013 1 1 555 600 -5 913 854
|
||||
#> 5 2013 1 1 557 600 -3 838 846
|
||||
#> 6 2013 1 1 558 600 -2 849 851
|
||||
#> # … with 25,971 more rows, and 11 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm></pre>
|
||||
</div>
|
||||
<p>The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unforuntately there’s no warning because <code>flights</code> has an even number of rows.</p>
|
||||
<p>The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because <code>flights</code> has an even number of rows.</p>
|
||||
<p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
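<p>The silent recycling is easier to see on a short vector. The sketch below uses a made-up <code>month</code> vector and contrasts <code>==</code> with base R’s <code>%in%</code> operator, which is not part of the passage above and is shown here only as one way to ask whether each value matches any of several candidates:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">month <- c(1, 1, 2, 2, 3, 3)

# c(1, 2) is silently recycled to c(1, 2, 1, 2, 1, 2) and compared pairwise
month == c(1, 2)
#> [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

# %in% checks each element against the whole set of values
month %in% c(1, 2)
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE</pre>
</div>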
|
||||
</section>
|
||||
|
||||
|
@ -277,7 +286,7 @@ Modular arithmetic</h2>
|
|||
1:10 %% 3
|
||||
#> [1] 1 2 0 1 2 0 1 2 0 1</pre>
|
||||
</div>
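<p>The two operators fit together: multiplying the integer quotient by the divisor and adding the remainder recovers the original number. The following sketch checks this on a small made-up vector (the names <code>quotient</code> and <code>remainder</code> are illustrative) and includes a negative value to show how R rounds the quotient:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- c(1:10, -7)
n <- 3

quotient  <- x %/% n
remainder <- x %% n

# Integer division and remainder recombine into the original values
all(quotient * n + remainder == x)
#> [1] TRUE

# For a negative number the quotient is rounded down,
# so the remainder is still between 0 and n - 1
-7 %/% 3
#> [1] -3
-7 %% 3
#> [1] 2</pre>
</div>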
|
||||
<p>Modular arithmetic is handy for the flights dataset, because we can use it to unpack the <code>sched_dep_time</code> variable into and <code>hour</code> and <code>minute</code>:</p>
|
||||
<p>Modular arithmetic is handy for the flights dataset, because we can use it to unpack the <code>sched_dep_time</code> variable into <code>hour</code> and <code>minute</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
|
@ -300,9 +309,9 @@ Modular arithmetic</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(hour = sched_dep_time %/% 100) |>
|
||||
summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
|
||||
summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
|
||||
filter(hour > 1) |>
|
||||
ggplot(aes(hour, prop_cancelled)) +
|
||||
ggplot(aes(x = hour, y = prop_cancelled)) +
|
||||
geom_line(color = "grey50") +
|
||||
geom_point(aes(size = n))</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -323,13 +332,13 @@ Logarithms</h2>
|
|||
interest <- 1.05
|
||||
|
||||
money <- tibble(
|
||||
year = 2000 + 1:50,
|
||||
money = starting * interest^(1:50)
|
||||
year = 1:50,
|
||||
money = starting * interest ^ year
|
||||
)</pre>
|
||||
</div>
|
||||
<p>If you plot this data, you’ll get an exponential curve:</p>
|
||||
<p>If you plot this data, you’ll get an exponential curve showing how your money grows year by year at an interest rate of 1.05:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(year, money)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(x = year, y = money)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="numbers_files/figure-html/unnamed-chunk-22-1.png" width="576"/></p>
|
||||
|
@ -337,15 +346,15 @@ money <- tibble(
|
|||
</div>
|
||||
<p>Log transforming the y-axis gives a straight line:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(year, money)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(x = year, y = money)) +
|
||||
geom_line() +
|
||||
scale_y_log10()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="numbers_files/figure-html/unnamed-chunk-23-1.png" width="576"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>This a straight line because a little algebra reveals that <code>log(money) = log(starting) + n * log(interest)</code>, which matches the pattern for a line, <code>y = m * x + b</code>. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there’s underlying exponential growth.</p>
|
||||
<p>If you’re log-transforming your data with dplyr you have a choice of three logarithms provided by base R: <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> (the natural log, base e), <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> (base 2), and <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> (base 10). We recommend using <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code>. <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> is easy to interpret because difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> is easy to back-transform because (e.g) 3 is 10^3 = 1000.</p>
|
||||
<p>This is a straight line because a little algebra reveals that <code>log10(money) = log10(interest) * year + log10(starting)</code>, which matches the pattern for a line, <code>y = m * x + b</code>. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there’s underlying exponential growth.</p>
|
||||
<p>If you’re log-transforming your data with dplyr you have a choice of three logarithms provided by base R: <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> (the natural log, base e), <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> (base 2), and <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> (base 10). We recommend using <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code>. <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> is easy to back-transform because (e.g.) 3 is 10^3 = 1000.</p>
|
||||
<p>The inverse of <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> is <code><a href="https://rdrr.io/r/base/Log.html">exp()</a></code>; to compute the inverse of <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> you’ll need to use <code>2^</code> or <code>10^</code>.</p>
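<p>As a rough illustration of these points, the following sketch uses made-up values (<code>x</code>, <code>starting</code>, and <code>interest</code> are assumptions for the example, not data from this chapter) to check the doubling interpretation of <code>log2()</code>, the three back-transformations, and the constant slope that exponential growth produces on a log scale:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- c(1, 2, 4, 8)

# Each doubling adds 1 on the log2 scale
log2(x)
#> [1] 0 1 2 3

# Back-transform with exp(), 2^, or 10^ (compared up to floating point error)
back_e <- all.equal(exp(log(x)), x)
back_2 <- all.equal(2^log2(x), x)
back_10 <- all.equal(10^log10(x), x)

# Hypothetical exponential growth: 5% interest on a starting value of 250
starting <- 250
interest <- 1.05
year <- 1:50
money <- starting * interest^year

# On the log10 scale, every year adds the same amount: log10(interest)
slope <- diff(log10(money))
constant_slope <- all.equal(slope, rep(log10(interest), length(slope)))</pre>
</div>
<p>All four comparisons should come back <code>TRUE</code>. The last one is the straight-line pattern described above, seen numerically rather than in a plot.</p>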
|
||||
</section>
|
||||
|
||||
|
@ -383,7 +392,7 @@ floor(x)
|
|||
ceiling(x)
|
||||
#> [1] 124</pre>
|
||||
</div>
|
||||
<p>These functions don’t have a digits argument, so you can instead scale down, round, and then scale back up:</p>
|
||||
<p>These functions don’t have a <code>digits</code> argument, so you can instead scale down, round, and then scale back up:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># Round down to nearest two digits
|
||||
floor(x / 0.01) * 0.01
|
||||
|
@ -439,7 +448,7 @@ cut(y, breaks = c(0, 5, 10, 15, 20))
|
|||
<p>See the documentation for other useful arguments like <code>right</code> and <code>include.lowest</code>, which control if the intervals are <code>[a, b)</code> or <code>(a, b]</code> and if the lowest interval should be <code>[a, b]</code>.</p>
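<p>A small sketch of how those two arguments change the result, using a made-up vector <code>y</code> whose values sit exactly on the breaks, since those are the edge cases the arguments control (the variable names here are illustrative only):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y <- c(0, 5, 10)
breaks <- c(0, 5, 10)

# Default: intervals are (a, b], so 0 falls outside every interval
default <- as.character(cut(y, breaks = breaks))
# NA, "(0,5]", "(5,10]"

# right = FALSE: intervals are [a, b), so now 10 falls outside instead
left_closed <- as.character(cut(y, breaks = breaks, right = FALSE))
# "[0,5)", "[5,10)", NA

# include.lowest = TRUE: the lowest interval becomes [a, b], so 0 is kept
with_lowest <- as.character(cut(y, breaks = breaks, include.lowest = TRUE))
# "[0,5]", "[0,5]", "(5,10]"</pre>
</div>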
|
||||
</section>
|
||||
|
||||
<section id="cumulative-and-rolling-aggregates" data-type="sect2">
|
||||
<section id="sec-cumulative-and-rolling-aggregates" data-type="sect2">
|
||||
<h2>
|
||||
Cumulative and rolling aggregates</h2>
|
||||
<p>Base R provides <code><a href="https://rdrr.io/r/base/cumsum.html">cumsum()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cumprod()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummin()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummax()</a></code> for running, or cumulative, sums, products, mins and maxes. dplyr provides <code><a href="https://dplyr.tidyverse.org/reference/cumall.html">cummean()</a></code> for cumulative means. Cumulative sums tend to come up the most in practice:</p>
|
||||
|
@ -477,7 +486,7 @@ Exercises</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(month == 1, day == 1) |>
|
||||
ggplot(aes(sched_dep_time, dep_delay)) +
|
||||
ggplot(aes(x = sched_dep_time, y = dep_delay)) +
|
||||
geom_point()
|
||||
#> Warning: Removed 4 rows containing missing values (`geom_point()`).</pre>
|
||||
<div class="cell-output-display">
|
||||
|
@ -580,13 +589,95 @@ lead(x)
|
|||
</ul><p>You can lead or lag by more than one position by using the second argument, <code>n</code>.</p>
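<p>For instance, a minimal sketch with a small made-up vector (the values of <code>x</code> are assumed for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x <- c(2, 5, 11, 11, 19, 35)

# Shift values two positions later; the first two slots have nothing to refer to
lag(x, n = 2)
#> [1] NA NA  2  5 11 11

# Shift values two positions earlier; the last two slots are padded with NA
lead(x, n = 2)
#> [1] 11 11 19 35 NA NA</pre>
</div>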
|
||||
</section>
|
||||
|
||||
<section id="consecutive-identifiers" data-type="sect2">
|
||||
<h2>
|
||||
Consecutive identifiers</h2>
|
||||
<p>Sometimes you want to start a new group every time some event occurs. For example, when you’re looking at website data, it’s common to want to break up events into sessions, where a session is defined as a gap of more than x minutes since the last activity.</p>
|
||||
<p>For example, imagine you have the times when someone visited a website:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">events <- tibble(
|
||||
time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
|
||||
)</pre>
|
||||
</div>
|
||||
<p>And you’ve computed the time lag between the events, and figured out if there’s a gap that’s big enough to qualify:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">events <- events |>
|
||||
mutate(
|
||||
diff = time - lag(time, default = first(time)),
|
||||
gap = diff >= 5
|
||||
)
|
||||
events
|
||||
#> # A tibble: 14 × 3
|
||||
#> time diff gap
|
||||
#> <dbl> <dbl> <lgl>
|
||||
#> 1 0 0 FALSE
|
||||
#> 2 1 1 FALSE
|
||||
#> 3 2 1 FALSE
|
||||
#> 4 3 1 FALSE
|
||||
#> 5 5 2 FALSE
|
||||
#> 6 10 5 TRUE
|
||||
#> # … with 8 more rows</pre>
|
||||
</div>
|
||||
<p>But how do we go from that logical vector to something that we can <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>? <code><a href="https://rdrr.io/r/base/cumsum.html">cumsum()</a></code> from <a href="#sec-cumulative-and-rolling-aggregates" data-type="xref">#sec-cumulative-and-rolling-aggregates</a> comes to the rescue: each time <code>gap</code> is <code>TRUE</code>, <code>group</code> is incremented by one (see <a href="#sec-numeric-summaries-of-logicals" data-type="xref">#sec-numeric-summaries-of-logicals</a> on the numerical interpretation of logicals):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">events |> mutate(
|
||||
group = cumsum(gap)
|
||||
)
|
||||
#> # A tibble: 14 × 4
|
||||
#> time diff gap group
|
||||
#> <dbl> <dbl> <lgl> <int>
|
||||
#> 1 0 0 FALSE 0
|
||||
#> 2 1 1 FALSE 0
|
||||
#> 3 2 1 FALSE 0
|
||||
#> 4 3 1 FALSE 0
|
||||
#> 5 5 2 FALSE 0
|
||||
#> 6 10 5 TRUE 1
|
||||
#> # … with 8 more rows</pre>
|
||||
</div>
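<p>Once you have <code>group</code>, it behaves like any other grouping variable. As a sketch (the <code>sessions</code> name and the choice of summaries are illustrative assumptions, not part of the example above), you could describe each session like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

events <- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)

sessions <- events |>
  mutate(
    # TRUE marks the first event after a gap of at least 5
    gap = time - lag(time, default = first(time)) >= 5,
    # Each TRUE counts as 1, so the running total numbers the sessions
    group = cumsum(gap)
  ) |>
  group_by(group) |>
  summarize(start = min(time), end = max(time), n = n())</pre>
</div>
<p>With these times, the gaps at 10 and 27 split the events into three sessions containing 5, 6, and 3 events respectively.</p>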
|
||||
<p>Another approach for creating grouping variables is <code><a href="https://dplyr.tidyverse.org/reference/consecutive_id.html">consecutive_id()</a></code>, which starts a new group every time one of its arguments changes. For example, inspired by <a href="https://stackoverflow.com/questions/27482712">this stackoverflow question</a>, imagine you have a data frame with a bunch of repeated values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
|
||||
y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
|
||||
)
|
||||
df
|
||||
#> # A tibble: 12 × 2
|
||||
#> x y
|
||||
#> <chr> <dbl>
|
||||
#> 1 a 1
|
||||
#> 2 a 2
|
||||
#> 3 a 3
|
||||
#> 4 b 2
|
||||
#> 5 c 4
|
||||
#> 6 c 1
|
||||
#> # … with 6 more rows</pre>
|
||||
</div>
|
||||
<p>You want to keep the first row from each repeated <code>x</code>. That’s easier to express with a combination of <code><a href="https://dplyr.tidyverse.org/reference/consecutive_id.html">consecutive_id()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_head()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
group_by(id = consecutive_id(x)) |>
|
||||
slice_head(n = 1)
|
||||
#> # A tibble: 7 × 3
|
||||
#> # Groups: id [7]
|
||||
#> x y id
|
||||
#> <chr> <dbl> <int>
|
||||
#> 1 a 1 1
|
||||
#> 2 b 2 2
|
||||
#> 3 c 4 3
|
||||
#> 4 d 3 4
|
||||
#> 5 e 9 5
|
||||
#> 6 a 4 6
|
||||
#> # … with 1 more row</pre>
|
||||
</div>
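<p>To see why <code>consecutive_id()</code> is needed here rather than grouping by <code>x</code> directly, it can help to look at the identifiers on their own. A minimal sketch with a shorter, made-up vector (<code>runs</code> is a hypothetical name):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

runs <- c("a", "a", "b", "a", "a")

# The id increases every time the value changes, so the second run of "a"
# gets its own id instead of being merged with the first
consecutive_id(runs)
#> [1] 1 1 2 3 3</pre>
</div>
<p>Grouping by the value itself would instead put all four <code>"a"</code>s in one group, which loses the distinction between the two runs.</p>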
|
||||
</section>
|
||||
|
||||
<section id="exercises-2" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code>.</p></li>
|
||||
<li><p>Which plane (<code>tailnum</code>) has the worst on-time record?</p></li>
|
||||
<li><p>What time of day should you fly if you want to avoid delays as much as possible?</p></li>
|
||||
<li><p>What does <code>flights |> group_by(dest() |> filter(row_number() < 4)</code> do? What does <code>flights |> group_by(dest() |> filter(row_number(dep_delay) < 4)</code> do?</p></li>
|
||||
<li><p>What does <code>flights |> group_by(dest) |> filter(row_number() < 4)</code> do? What does <code>flights |> group_by(dest) |> filter(row_number(dep_delay) < 4)</code> do?</p></li>
|
||||
<li><p>For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.</p></li>
|
||||
<li>
|
||||
<p>Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>, explore how the average flight delay for an hour is related to the average delay for the previous hour.</p>
|
||||
|
@ -594,7 +685,7 @@ Exercises</h2>
|
|||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(hour = dep_time %/% 100) |>
|
||||
group_by(year, month, day, hour) |>
|
||||
summarise(
|
||||
summarize(
|
||||
dep_delay = mean(dep_delay, na.rm = TRUE),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
|
@ -602,7 +693,7 @@ Exercises</h2>
|
|||
filter(n > 5)</pre>
|
||||
</div>
|
||||
</li>
|
||||
<li><p>Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?</p></li>
|
||||
<li><p>Look at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?</p></li>
|
||||
<li><p>Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.</p></li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
@ -610,23 +701,23 @@ Exercises</h2>
|
|||
<section id="numeric-summaries" data-type="sect1">
|
||||
<h1>
|
||||
Numeric summaries</h1>
|
||||
<p>Just using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here are a selection that you might find useful.</p>
|
||||
<p>Just using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.</p>
|
||||
|
||||
<section id="center" data-type="sect2">
|
||||
<h2>
|
||||
Center</h2>
|
||||
<p>So far, we’ve mostly used <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.</p>
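<p>A tiny sketch with made-up numbers shows how much a single unusual value can move the mean while leaving the median alone (<code>typical</code> and <code>with_outlier</code> are hypothetical names):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">typical <- c(1, 2, 3, 4, 5)
with_outlier <- c(1, 2, 3, 4, 100)

# Without the outlier, the two measures agree
mean(typical)
#> [1] 3
median(typical)
#> [1] 3

# One extreme value drags the mean upward, but the middle value is unchanged
mean(with_outlier)
#> [1] 22
median(with_outlier)
#> [1] 3</pre>
</div>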
|
||||
<p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean vs the median when looking at the hourly vs median departure delay. The median delay is always smaller than the mean delay because because flights sometimes leave multiple hours late, but never leave multiple hours early.</p>
|
||||
<p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean vs. the median departure delay for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
mean = mean(dep_delay, na.rm = TRUE),
|
||||
median = median(dep_delay, na.rm = TRUE),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
ggplot(aes(mean, median)) +
|
||||
ggplot(aes(x = mean, y = median)) +
|
||||
geom_abline(slope = 1, intercept = 0, color = "white", size = 2) +
|
||||
geom_point()
|
||||
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
|
||||
|
@ -644,12 +735,12 @@ Center</h2>
|
|||
<section id="sec-min-max-summary" data-type="sect2">
|
||||
<h2>
|
||||
Minimum, maximum, and quantiles</h2>
|
||||
<p>What if you’re interested in locations other than the center? <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> will give you the largest and smallest values. Another powerful tool is <code><a href="https://rdrr.io/r/stats/quantile.html">quantile()</a></code> which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find a value that’s greater than 95% of the values.</p>
|
||||
<p>What if you’re interested in locations other than the center? <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> will give you the smallest and largest values. Another powerful tool is <code><a href="https://rdrr.io/r/stats/quantile.html">quantile()</a></code>, which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find the value that’s greater than 95% of the values.</p>
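<p>As a quick sanity check of those definitions, here is a sketch on a made-up vector where the answers are easy to predict. It assumes the default quantile algorithm and uses <code>names = FALSE</code> only to keep the printed output short:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x <- 0:100

quantile(x, 0.25, names = FALSE)
#> [1] 25

# The 50% quantile is the median
quantile(x, 0.5, names = FALSE) == median(x)
#> [1] TRUE

# The 95% quantile sits below the maximum
quantile(x, 0.95, names = FALSE)
#> [1] 95
max(x)
#> [1] 100</pre>
</div>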
|
||||
<p>For the <code>flights</code> data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
max = max(dep_delay, na.rm = TRUE),
|
||||
q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
|
@ -675,7 +766,7 @@ Spread</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(origin, dest) |>
|
||||
summarise(
|
||||
summarize(
|
||||
distance_sd = IQR(distance),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
|
@ -696,13 +787,13 @@ Distributions</h2>
|
|||
<p><a href="#fig-flights-dist" data-type="xref">#fig-flights-dist</a> shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
ggplot(aes(dep_delay)) +
|
||||
ggplot(aes(x = dep_delay)) +
|
||||
geom_histogram(binwidth = 15)
|
||||
#> Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).
|
||||
|
||||
flights |>
|
||||
filter(dep_delay < 120) |>
|
||||
ggplot(aes(dep_delay)) +
|
||||
ggplot(aes(x = dep_delay)) +
|
||||
geom_histogram(binwidth = 5)</pre>
|
||||
<div id="fig-flights-dist" class="cell quarto-layout-panel">
|
||||
<figure class="figure"><div class="quarto-layout-row quarto-layout-valign-top">
|
||||
|
@ -719,14 +810,14 @@ flights |>
|
|||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p/><figcaption class="figure-caption">Figure 13.3: The distribution of <code>dep_delay</code> appears highly skewed to the right in both histograms.</figcaption><p/>
|
||||
<p/><figcaption class="figure-caption">Figure 15.3: The distribution of <code>dep_delay</code> appears highly skewed to the right in both histograms.</figcaption><p/>
|
||||
</figure></div>
|
||||
</div>
|
||||
<p>It’s also a good idea to check that distributions for subgroups resemble the whole. <a href="#fig-flights-dist-daily" data-type="xref">#fig-flights-dist-daily</a> overlays a frequency polygon for each day. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dep_delay < 120) |>
|
||||
ggplot(aes(dep_delay, group = interaction(day, month))) +
|
||||
ggplot(aes(x = dep_delay, group = interaction(day, month))) +
|
||||
geom_freqpoly(binwidth = 5, alpha = 1/5)</pre>
|
||||
<div class="cell-output-display">
|
||||
|
||||
|
@ -735,18 +826,18 @@ flights |>
|
|||
</figure>
|
||||
</div>
|
||||
</div>
|
||||
<p>Don’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in <a href="#sec-sample-size" data-type="xref">#sec-sample-size</a>: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.</p>
|
||||
<p>Don’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs. the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in <a href="#sec-sample-size" data-type="xref">#sec-sample-size</a>: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.</p>
|
||||
</section>
|
||||
|
||||
<section id="positions" data-type="sect2">
|
||||
<h2>
|
||||
Positions</h2>
|
||||
<p>There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at specific position. You can do this with the base R <code>[</code> function, but we’re not going to cover it in detail until <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>, because it’s a very powerful and general function. For now we’ll introduce three specialized functions that you can use to extract values at a specified position: <code>first(x)</code>, <code>last(x)</code>, and <code>nth(x, n)</code>.</p>
|
||||
<p>There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position. You can do this with the base R <code>[</code> function, but we’re not going to cover it in detail until <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>, because it’s a very powerful and general function. For now we’ll introduce three specialized functions that you can use to extract values at a specified position: <code>first(x)</code>, <code>last(x)</code>, and <code>nth(x, n)</code>.</p>
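<p>Before applying these functions to grouped data, here is a minimal sketch of what they return on a plain vector (<code>v</code> and its values are made up for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

v <- c(10, 20, 30, 40, 50)

first(v)
#> [1] 10
nth(v, 2)
#> [1] 20
last(v)
#> [1] 50</pre>
</div>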
|
||||
<p>For example, we can find the first and last departure for each day:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
first_dep = first(dep_time),
|
||||
fifth_dep = nth(dep_time, 5),
|
||||
last_dep = last(dep_time)
|
||||
|
@ -775,18 +866,18 @@ Positions</h2>
|
|||
filter(r %in% c(1, max(r)))
|
||||
#> # A tibble: 1,195 × 20
|
||||
#> # Groups: year, month, day [365]
|
||||
#> year month day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
|
||||
#> 1 2013 1 1 517 515 2 830 819 11 UA
|
||||
#> 2 2013 1 1 2353 2359 -6 425 445 -20 B6
|
||||
#> 3 2013 1 1 2353 2359 -6 418 442 -24 B6
|
||||
#> 4 2013 1 1 2356 2359 -3 425 437 -12 B6
|
||||
#> 5 2013 1 2 42 2359 43 518 442 36 B6
|
||||
#> 6 2013 1 2 458 500 -2 703 650 13 US
|
||||
#> # … with 1,189 more rows, 10 more variables: flight <int>, tailnum <chr>,
|
||||
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
|
||||
#> # minute <dbl>, time_hour <dttm>, r <int>, and abbreviated variable names
|
||||
#> # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
|
||||
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
|
||||
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
|
||||
#> 1 2013 1 1 517 515 2 830 819
|
||||
#> 2 2013 1 1 2353 2359 -6 425 445
|
||||
#> 3 2013 1 1 2353 2359 -6 418 442
|
||||
#> 4 2013 1 1 2356 2359 -3 425 437
|
||||
#> 5 2013 1 2 42 2359 43 518 442
|
||||
#> 6 2013 1 2 458 500 -2 703 650
|
||||
#> # … with 1,189 more rows, and 12 more variables: arr_delay <dbl>,
|
||||
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
|
||||
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
|
||||
#> # time_hour <dttm>, r <int></pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
@ -794,7 +885,7 @@ Positions</h2>
|
|||
<h2>
|
||||
With <code>mutate()</code>
|
||||
</h2>
|
||||
<p>As the names suggest, the summary functions are typically paired with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, particularly when you want do some sort of group standardization. For example:</p>
|
||||
<p>As the names suggest, the summary functions are typically paired with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, particularly when you want to do some sort of group standardization. For example:</p>
|
||||
<ul><li>
|
||||
<code>x / sum(x)</code> calculates the proportion of a total.</li>
|
||||
<li>
|
||||
|
|
|
@ -1,19 +1,9 @@
|
|||
<section data-type="chapter" id="chp-preface-2e">
|
||||
<h1>Preface to the second edition</h1><p>Welcome to the second edition of “R for Data Science”.</p>
|
||||
<section id="major-changes" data-type="sectNA">
|
||||
<h1>Major changes</h1>
|
||||
<ul><li><p>The first part is renamed to “whole game” to reflect the entire data science cycle. It gains a new chapter that briefly introduces the basics of reading data from csv files.</p></li>
|
||||
<li><p>The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.</p></li>
|
||||
<li><p>We’ve added new chapters on column-wise and row-wise operations.</p></li>
|
||||
<li><p>We’ve added a new set of chapters on import that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and scraping data from the web.</p></li>
|
||||
<li><p>The modeling part has been removed. For modeling, we recommend using packages from <a href="https://www.tidymodels.org/">tidymodels</a> and reading <a href="https://www.tmwr.org/">Tidy Modeling with R</a> by Max Kuhn and Julia Silge to learn more about them.</p></li>
|
||||
<li><p>We’ve switched from the magrittr pipe to the base pipe.</p></li>
|
||||
</ul></section>
|
||||
|
||||
<section id="acknowledgements" data-type="sectNA">
|
||||
<h1>Acknowledgements</h1>
|
||||
<p><em>TO DO: Add acknowledgements.</em></p>
|
||||
|
||||
|
||||
</section>
|
||||
</section>
|
||||
<h1>Preface to the second edition</h1><p>Welcome to the second edition of “R for Data Science”! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we had included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).</p><p>A brief summary of the biggest changes follows:</p><ul><li><p>The first part of the book has been renamed to “Whole game”. The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.</p></li>
|
||||
<li><p>The second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition.</p></li>
|
||||
<li><p>The third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.</p></li>
|
||||
<li><p>The fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to now embrace working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.</p></li>
|
||||
<li><p>The “Program” part continues, but has been rewritten from top to bottom to focus on the most important parts of function writing and iteration. Function writing now includes sections on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier over the last few years. We’ve added a new chapter on important base R functions that you’re likely to see when reading R code found in the wild.</p></li>
|
||||
<li><p>The modeling part has been removed. We never had enough room to fully do modeling justice, and there are now much better resources available. We generally recommend using the <a href="https://www.tidymodels.org/">tidymodels</a> packages and reading <a href="https://www.tmwr.org/">Tidy Modeling with R</a> by Max Kuhn and Julia Silge.</p></li>
|
||||
<li><p>The communicate part continues as well, but features Quarto instead of R Markdown as the tool of choice for authoring reproducible computational documents.</p></li>
|
||||
</ul><p>Other changes include switching from magrittr’s pipe (<code>%>%</code>) to the base pipe (<code>|></code>) and switching the book’s source from R Markdown to Quarto.</p></section>
|
||||
|
|
|
@ -3,7 +3,7 @@
|
|||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>So far you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.</p>
|
||||
<p>So far, you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.</p>
|
||||
<p>There are two ways to set the output of a document:</p>
|
||||
<ol type="1"><li>
|
||||
<p>Permanently, by modifying the YAML header:</p>
|
||||
|
@ -26,7 +26,7 @@ format: html</pre>
|
|||
<h1>
|
||||
Output options</h1>
|
||||
<p>Quarto offers a wide range of output formats. You can find the complete list at <a href="https://quarto.org/docs/output-formats/all-formats.html" class="uri">https://quarto.org/docs/output-formats/all-formats.html</a>. Many formats share some output options (e.g., <code>toc: true</code> for including a table of contents), but others have options that are format-specific (e.g., <code>code-fold: true</code> collapses code chunks into a <code><details></code> tag for HTML output so the user can display it on demand; it’s not applicable in a PDF or Word document).</p>
|
||||
<p>To override the default voptions, you need to use an expanded <code>format</code> field. For example, if you wanted to render an <code>html</code> with a floating table of contents, you’d use:</p>
|
||||
<p>To override the default options, you need to use an expanded <code>format</code> field. For example, if you wanted to render an <code>html</code> with a floating table of contents, you’d use:</p>
|
||||
<pre data-type="programlisting" data-code-language="yaml">format:
|
||||
html:
|
||||
toc: true
|
||||
|
@ -38,7 +38,7 @@ Output options</h1>
|
|||
toc_float: true
|
||||
pdf: default
|
||||
docx: default</pre>
|
||||
<p>Note the special syntax (<code>pdf: default</code>) if you don’t want to override any of the default options.</p>
|
||||
<p>Note the special syntax (<code>pdf: default</code>) if you don’t want to override any default options.</p>
|
||||
<p>To render to all formats specified in the YAML of a document, you can use <code>output_format = "all"</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = "all")</pre>
|
||||
|
@ -48,14 +48,14 @@ Output options</h1>
|
|||
<section id="documents" data-type="sect1">
|
||||
<h1>
|
||||
Documents</h1>
|
||||
<p>The previous chapter focused on the default <code>html</code> output. There are a number of basic variations on that theme, generating different types of documents. For example:</p>
|
||||
<ul><li><p><code>pdf</code> makes a PDF with LaTeX (an open source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.</p></li>
|
||||
<p>The previous chapter focused on the default <code>html</code> output. There are several basic variations on that theme, generating different types of documents. For example:</p>
|
||||
<ul><li><p><code>pdf</code> makes a PDF with LaTeX (an open-source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.</p></li>
|
||||
<li><p><code>docx</code> for Microsoft Word (<code>.docx</code>) documents.</p></li>
|
||||
<li><p><code>odt</code> for OpenDocument Text (<code>.odt</code>) documents.</p></li>
|
||||
<li><p><code>rtf</code> for Rich Text Format (<code>.rtf</code>) documents.</p></li>
|
||||
<li><p><code>gfm</code> for a GitHub Flavored Markdown (<code>.md</code>) document.</p></li>
|
||||
<li><p><code>ipynb</code> for Jupyter Notebooks (<code>.ipynb</code>).</p></li>
|
||||
</ul><p>Remember, when generating a document to share with decision makers, you can turn off the default display of code by setting global options in document YAML:</p>
|
||||
</ul><p>Remember, when generating a document to share with decision-makers, you can turn off the default display of code by setting global options in document YAML:</p>
|
||||
<pre data-type="programlisting" data-code-language="yaml">execute:
|
||||
echo: false</pre>
|
||||
<p>For <code>html</code> documents another option is to make the code chunks hidden by default, but visible with a click:</p>
|
||||
|
@ -67,7 +67,7 @@ Documents</h1>
|
|||
<section id="presentations" data-type="sect1">
|
||||
<h1>
|
||||
Presentations</h1>
|
||||
<p>You can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (<code>##</code>) level header. Additionally, first (<code>#</code>) level headers can be used to indicate the beginning of a new section with a section title slide that is by default centered in the middle.</p>
|
||||
<p>You can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (<code>##</code>) level header. Additionally, first (<code>#</code>) level headers indicate the beginning of a new section with a section title slide that is, by default, centered in the middle.</p>
|
||||
<p>Quarto supports a variety of presentation formats, including:</p>
|
||||
<ol type="1"><li><p><code>revealjs</code> - HTML presentation with revealjs</p></li>
|
||||
<li><p><code>pptx</code> - PowerPoint presentation</p></li>
|
||||
|
@ -78,7 +78,7 @@ Presentations</h1>
|
|||
<section id="dashboards" data-type="sect1">
|
||||
<h1>
|
||||
Dashboards</h1>
|
||||
<p>Dashboards are a useful way to communicate large amounts of information visually and quickly. A dashboard-like look can be achieved with Quarto using document layout options like sidebars, tabsets, multi-column layouts, etc.</p>
|
||||
<p>Dashboards are a useful way to communicate information visually and quickly. A dashboard-like look can be achieved with Quarto using document layout options like sidebars, tabsets, multi-column layouts, etc.</p>
|
||||
<p>For example, you can produce this dashboard:</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
|
@ -157,7 +157,7 @@ diamonds |>
|
|||
<section id="interactivity" data-type="sect1">
|
||||
<h1>
|
||||
Interactivity</h1>
|
||||
<p>Any HTML documents can contain interactive components.</p>
|
||||
<p>Any HTML document can contain interactive components.</p>
|
||||
|
||||
<section id="htmlwidgets" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -170,23 +170,23 @@ leaflet() |>
|
|||
addTiles() |>
|
||||
addMarkers(174.764, -36.877, popup = "Maungawhau") </pre>
|
||||
<div class="cell-output-display">
|
||||
<div id="htmlwidget-ac96cb3ee4656e2e9ec3" style="width:100%;height:433px;" class="leaflet html-widget"/>
|
||||
<div class="leaflet html-widget html-fill-item-overflow-hidden html-fill-item" id="htmlwidget-ac96cb3ee4656e2e9ec3" style="width:100%;height:433px;"/>
|
||||
<script type="application/json" data-for="htmlwidget-ac96cb3ee4656e2e9ec3"><![CDATA[{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"setView":[[-36.877,174.764],16,[]],"calls":[{"method":"addTiles","args":["https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"© <a href=\"https://openstreetmap.org\">OpenStreetMap<\/a> contributors, <a href=\"https://creativecommons.org/licenses/by-sa/2.0/\">CC-BY-SA<\/a>"}]},{"method":"addMarkers","args":[-36.877,174.764,null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},"Maungawhau",null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]}],"limits":{"lat":[-36.877,-36.877],"lng":[174.764,174.764]}},"evals":[],"jsHooks":[]}]]></script></div>
|
||||
</div>
|
||||
<p>The great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about it.</p>
|
||||
<p>There are many packages that provide htmlwidgets, including:</p>
|
||||
<ul><li><p><strong>dygraphs</strong>, <a href="https://rstudio.github.io/dygraphs/" class="uri">https://rstudio.github.io/dygraphs</a>, for interactive time series visualisations.</p></li>
|
||||
<ul><li><p><strong>dygraphs</strong>, <a href="https://rstudio.github.io/dygraphs/" class="uri">https://rstudio.github.io/dygraphs</a>, for interactive time series visualizations.</p></li>
|
||||
<li><p><strong>DT</strong>, <a href="https://rstudio.github.io/DT" class="uri">https://rstudio.github.io/DT/</a>, for interactive tables.</p></li>
|
||||
<li><p><strong>threejs</strong>, <a href="https://bwlewis.github.io/rthreejs/" class="uri">https://bwlewis.github.io/rthreejs</a>, for interactive 3D plots.</p></li>
|
||||
<li><p><strong>DiagrammeR</strong>, <a href="https://rich-iannone.github.io/DiagrammeR" class="uri">https://rich-iannone.github.io/DiagrammeR</a>, for diagrams (like flow charts and simple node-link diagrams).</p></li>
|
||||
</ul><p>To learn more about htmlwidgets and see a more complete list of packages that provide them visit <a href="https://www.htmlwidgets.org" class="uri">https://www.htmlwidgets.org</a>.</p>
|
||||
</ul><p>To learn more about htmlwidgets and see a complete list of packages that provide them, visit <a href="https://www.htmlwidgets.org" class="uri">https://www.htmlwidgets.org</a>.</p>
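<p>As one more illustration of how little code an htmlwidget needs, the sketch below builds an interactive table with <strong>DT</strong>. It assumes the DT package is installed; the choice of <code>mtcars</code> and the <code>pageLength</code> setting are ours, purely for illustration.</p>
<pre data-type="programlisting" data-code-language="r">library(DT)

# Turn the first ten rows of a built-in data frame into a sortable,
# searchable table. `pageLength` (rows shown per page) is an illustrative choice.
widget <- datatable(head(mtcars, 10), options = list(pageLength = 5))

# Printing the object in an HTML document renders the interactive table
widget</pre>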
|
||||
</section>
|
||||
|
||||
<section id="shiny" data-type="sect2">
|
||||
<h2>
|
||||
Shiny</h2>
|
||||
<p>htmlwidgets provide <strong>client-side</strong> interactivity — all the interactivity happens in the browser, independently of R. On one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use <strong>shiny</strong>, a package that allows you to create interactivity using R code, not JavaScript.</p>
|
||||
<p>To call Shiny code from an Quarto document, add <code>server: shiny</code> to the YAML header:</p>
|
||||
<p>htmlwidgets provide <strong>client-side</strong> interactivity — all the interactivity happens in the browser, independently of R. On the one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use <strong>shiny</strong>, a package that allows you to create interactivity using R code, not JavaScript.</p>
|
||||
<p>To call Shiny code from a Quarto document, add <code>server: shiny</code> to the YAML header:</p>
|
||||
<pre data-type="programlisting" data-code-language="yaml">title: "Shiny Web App"
|
||||
format: html
|
||||
server: shiny</pre>
|
||||
|
@ -200,11 +200,11 @@ numericInput("age", "How old are you?", NA, min = 0, max = 150)</pre>
|
|||
<p>You also need a code chunk with the chunk option <code>context: server</code>, which contains the code that needs to run in a Shiny server.</p>
|
||||
<div class="cell">
|
||||
<div class="cell-output-display">
|
||||
<p><img src="quarto/quarto-shiny.png" class="img-fluid" alt="Two input boxes on top of each other. Top one says "What is your name?", the bottom one "How old are you?"." width="650"/></p>
|
||||
<p><img src="quarto/quarto-shiny.png" class="img-fluid" alt="Two input boxes on top of each other. Top one says, &quot;What is your name?&quot;, the bottom, &quot;How old are you?&quot;." width="650"/></p>
|
||||
</div>
|
||||
</div>
|
||||
<p>You can then refer to the values with <code>input$name</code> and <code>input$age</code>, and the code that uses them will be automatically re-run whenever they change.</p>
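<p>As a rough sketch of what the server side might look like, the code below defines a small helper that turns the two inputs into a message. The helper name <code>make_greeting()</code> and the output name <code>greeting</code> are hypothetical, chosen only for illustration; the commented line shows how the helper could be wired up inside a chunk with <code>context: server</code>.</p>
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: builds a message from the values of the two inputs
make_greeting <- function(name, age) {
  # Inputs may be empty or NA before the reader has typed anything
  if (is.null(name) || name == "" || is.null(age) || is.na(age)) {
    return("Please enter your name and age.")
  }
  paste0("Hello ", name, "! You are ", age, " years old.")
}

# Inside a chunk with `#| context: server`, `input` and `output` exist,
# so the helper could be used like this (assumed wiring, not run here):
# output$greeting <- renderText(make_greeting(input$name, input$age))</pre>
<p>Keeping the logic in an ordinary function means you can test it in a regular R session, without a running Shiny server.</p>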
|
||||
<p>We can’t show you a live shiny app here because shiny interactions occur on the <strong>server-side</strong>. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.</p>
|
||||
<p>We can’t show you a live shiny app here because shiny interactions occur on the <strong>server-side</strong>. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.</p>
|
||||
<p>To learn more about Shiny, we recommend reading <em>Mastering Shiny</em> by Hadley Wickham, <a href="https://mastering-shiny.org/">https://mastering-shiny.org</a>.</p>
|
||||
</section>
|
||||
</section>
|
||||
|
@ -212,16 +212,12 @@ numericInput("age", "How old are you?", NA, min = 0, max = 150)</pre>
|
|||
<section id="websites-and-books" data-type="sect1">
|
||||
<h1>
|
||||
Websites and books</h1>
|
||||
<p>With a little additional infrastructure you can use Quarto to generate a complete website:</p>
|
||||
<p>With a bit of additional infrastructure, you can use Quarto to generate a complete website or book:</p>
|
||||
<ul><li><p>Put your <code>.qmd</code> files in a single directory. <code>index.qmd</code> will become the home page.</p></li>
|
||||
<li>
|
||||
<p>Add a YAML file named <code>_quarto.yml</code> that provides the navigation for the site. In this file, set the <code>project</code> type:</p>
|
||||
<ul><li>For a website, set <code>type: book</code>:</li>
|
||||
</ul><pre data-type="programlisting" data-code-language="yaml">project:
|
||||
<p>Add a YAML file named <code>_quarto.yml</code> that provides the navigation for the site. In this file, set the <code>project</code> type to either <code>book</code> or <code>website</code>, e.g.:</p>
|
||||
<pre data-type="programlisting" data-code-language="yaml">project:
|
||||
type: book</pre>
|
||||
<ul><li>For a website, set <code>type: website</code>:</li>
|
||||
</ul><pre data-type="programlisting" data-code-language="yaml">project:
|
||||
type: website</pre>
|
||||
</li>
|
||||
</ul><p>For example, the following <code>_quarto.yml</code> file creates a website from three source files: <code>index.qmd</code> (the home page), <code>viridis-colors.qmd</code>, and <code>terrain-colors.qmd</code>.</p>
|
||||
<div class="cell">
|
||||
|
@ -275,11 +271,11 @@ Other formats</h1>
|
|||
<section id="learning-more" data-type="sect1">
|
||||
<h1>
|
||||
Learning more</h1>
|
||||
<p>To learn more about effective communication in these different formats we recommend the following resources:</p>
|
||||
<ul><li><p>To improve your presentation skills, try <a href="https://amzn.com/0321820800"><em>Presentation Patterns</em></a>, by Neal Ford, Matthew McCollough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.</p></li>
|
||||
<p>To learn more about effective communication in these different formats, we recommend the following resources:</p>
|
||||
<ul><li><p>To improve your presentation skills, try <a href="https://presentationpatterns.com/"><em>Presentation Patterns</em></a> by Neal Ford, Matthew McCullough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.</p></li>
|
||||
<li><p>If you give academic talks, you might like the <a href="https://github.com/jtleek/talkguide"><em>Leek group guide to giving talks</em></a>.</p></li>
|
||||
<li><p>We haven’t taken it outselves, but we’ve heard good things about Matt McGarrity’s online course on public speaking: <a href="https://www.coursera.org/learn/public-speaking" class="uri">https://www.coursera.org/learn/public-speaking</a>.</p></li>
|
||||
<li><p>If you are creating a lot of dashboards, make sure to read Stephen Few’s <a href="https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167"><em>Information Dashboard Design: The Effective Visual Communication of Data</em></a>. It will help you create dashboards that are truly useful, not just pretty to look at.</p></li>
|
||||
<li><p>We haven’t taken it ourselves, but we’ve heard good things about Matt McGarrity’s online course on public speaking: <a href="https://www.coursera.org/learn/public-speaking" class="uri">https://www.coursera.org/learn/public-speaking</a>.</p></li>
|
||||
<li><p>If you are creating many dashboards, make sure to read Stephen Few’s <a href="https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167"><em>Information Dashboard Design: The Effective Visual Communication of Data</em></a>. It will help you create dashboards that are truly useful, not just pretty to look at.</p></li>
|
||||
<li><p>Effectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams’ <a href="https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151"><em>The Non-Designer’s Design Book</em></a> is a great place to start.</p></li>
|
||||
</ul></section>
|
||||
</section>
|
||||
|
|
|
@ -48,7 +48,7 @@ The distribution of the remainder is shown below:
|
|||
#| echo: false
|
||||
|
||||
smaller |>
|
||||
ggplot(aes(carat)) +
|
||||
ggplot(aes(x = carat)) +
|
||||
geom_freqpoly(binwidth = 0.01)
|
||||
```</code></pre>
|
||||
</div>
|
||||
|
@ -235,7 +235,7 @@ Chunk options</h2>
|
|||
<li><p><code>message: false</code> or <code>warning: false</code> prevents messages or warnings from appearing in the finished file.</p></li>
|
||||
<li><p><code>results: hide</code> hides printed output; <code>fig-show: hide</code> hides plots.</p></li>
|
||||
<li><p><code>error: true</code> causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your <code>.qmd</code>. It’s also useful if you’re teaching R and want to deliberately include an error. The default, <code>error: false</code> causes rendering to fail if there is a single error in the document.</p></li>
|
||||
</ul><p>Each of these chunk options get added to the header of the chunk, following <code>#|</code>, e.g., in the following chunk the result is not printed since <code>eval</code> is set to false.</p>
|
||||
</ul><p>Each of these chunk options gets added to the header of the chunk, following <code>#|</code>. For example, in the following chunk the result is not printed since <code>eval</code> is set to false.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="markdown">```{r}
|
||||
#| label: simple-multiplication
|
||||
|
@ -311,7 +311,7 @@ Global options</h2>
|
|||
<pre data-type="programlisting" data-code-language="yaml">title: "My report"
|
||||
execute:
|
||||
echo: false</pre>
|
||||
<p>Since Quarto is designed to be multi-lingual (works with R as well as other languages like Python, Julia, etc.), all of the knitr options are not available at the document execution level since some of them only work with knitr and not other engines Quarto uses for running code in other languages (e.g., Jupyter). You can, however, still set these as global options for your document under the <code>knitr</code> field, under <code>opts_chunk</code>. For example, when writing books and tutorials we set:</p>
|
||||
<p>Since Quarto is designed to be multi-lingual (it works with R as well as other languages like Python and Julia), not all of the knitr options are available at the document execution level: some of them only work with knitr and not with the other engines Quarto uses for running code in other languages (e.g. Jupyter). You can, however, still set these as global options for your document under the <code>knitr</code> field, under <code>opts_chunk</code>. For example, when writing books and tutorials we set:</p>
|
||||
<pre data-type="programlisting" data-code-language="yaml">title: "Tutorial"
|
||||
knitr:
|
||||
opts_chunk:
|
||||
|
@ -344,7 +344,7 @@ comma(.12358124331)
|
|||
<section id="exercises-3" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Add a section that explores how diamond sizes vary by cut, colour, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting <code>echo: false</code> on each chunk, set a global option.</p></li>
|
||||
<ol type="1"><li><p>Add a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting <code>echo: false</code> on each chunk, set a global option.</p></li>
|
||||
<li><p>Download <code>diamond-sizes.qmd</code> from <a href="https://github.com/hadley/r4ds/tree/main/quarto" class="uri">https://github.com/hadley/r4ds/tree/main/quarto</a>. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.</p></li>
|
||||
<li><p>Modify <code>diamond-sizes.qmd</code> to use <code>label_comma()</code> to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats.</p></li>
|
||||
</ol></section>
|
||||
|
@ -353,14 +353,14 @@ Exercises</h2>
|
|||
<section id="sec-figures" data-type="sect1">
|
||||
<h1>
|
||||
Figures</h1>
|
||||
<p>The figures in a Quarto document can be embedded (e.g., a PNG or JPEG file) or generated as a result of a code chunk.</p>
|
||||
<p>The figures in a Quarto document can be embedded (e.g. a PNG or JPEG file) or generated as a result of a code chunk.</p>
|
||||
<p>To embed an image from an external file, you can use the Insert menu in RStudio and select Figure / Image. This opens a menu where you can browse to the image you want to insert, add alternative text or a caption to it, and adjust its size. In the visual editor, you can also paste an image from your clipboard into your document, and RStudio will place a copy of that image in your project folder.</p>
|
||||
<p>If you include a code chunk that generates a figure (e.g., includes a <code>ggplot()</code> call), the resulting figure will be automatically included in your Quarto document.</p>
|
||||
<p>If you include a code chunk that generates a figure (e.g. includes a <code>ggplot()</code> call), the resulting figure will be automatically included in your Quarto document.</p>
|
||||
|
||||
<section id="figure-sizing" data-type="sect2">
|
||||
<h2>
|
||||
Figure sizing</h2>
|
||||
<p>The biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: <code>fig-width</code>, <code>fig-height</code>, <code>fig-asp</code>, <code>out-width</code> and <code>out-height</code>. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).</p>
|
||||
<p>The biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: <code>fig-width</code>, <code>fig-height</code>, <code>fig-asp</code>, <code>out-width</code> and <code>out-height</code>. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e. height, width, and aspect ratio: pick two of three).</p>
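<p>To make the “pick two of three” point concrete, here is a small sketch of the arithmetic. It assumes, as knitr does, that the aspect ratio is height divided by width; the specific values are illustrative only.</p>
<pre data-type="programlisting" data-code-language="r">fig_width <- 6    # width of the figure created by R, in inches
fig_asp <- 0.618  # aspect ratio: height / width

# Once width and aspect ratio are fixed, the height follows
fig_height <- fig_width * fig_asp
fig_height
#> [1] 3.708</pre>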
|
||||
<!-- TODO: https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/ -->
|
||||
<p>We recommend three of the five options:</p>
|
||||
<ul><li><p>Plots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set <code>fig-width: 6</code> (6”) and <code>fig-asp: 0.618</code> (the golden ratio) in the defaults. Then in individual chunks, only adjust <code>fig-asp</code>.</p></li>
|
||||
|
@ -420,7 +420,7 @@ Tables</h1>
|
|||
<pre data-type="programlisting" data-code-language="r">knitr::kable(mtcars[1:5, ])</pre>
|
||||
<div class="cell-output-display">
|
||||
<div id="tbl-kable" class="anchored">
|
||||
<table class="table table-sm table-striped"><caption>Table 27.1: A knitr kable.</caption>
|
||||
<table class="table table-sm table-striped"><caption>Table 30.1: A knitr kable.</caption>
|
||||
<colgroup><col style="width: 26%"/><col style="width: 7%"/><col style="width: 5%"/><col style="width: 7%"/><col style="width: 5%"/><col style="width: 7%"/><col style="width: 8%"/><col style="width: 8%"/><col style="width: 4%"/><col style="width: 4%"/><col style="width: 7%"/><col style="width: 7%"/></colgroup><thead><tr class="header"><th style="text-align: left;"/>
|
||||
<th style="text-align: right;">mpg</th>
|
||||
<th style="text-align: right;">cyl</th>
|
||||
|
@ -497,7 +497,7 @@ Tables</h1>
|
|||
</div>
|
||||
</div>
|
||||
<p>Read the documentation for <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">?knitr::kable</a></code> to see the other ways in which you can customize the table. For even deeper customization, consider the <strong>gt</strong>, <strong>huxtable</strong>, <strong>reactable</strong>, <strong>kableExtra</strong>, <strong>xtable</strong>, <strong>stargazer</strong>, <strong>pander</strong>, <strong>tables</strong>, and <strong>ascii</strong> packages. Each provides a set of tools for returning formatted tables from R code.</p>
|
||||
<p>There is also a rich set of options for controlling how figures are embedded. You’ll learn about these in <a href="#chp-communicate-plots" data-type="xref">#chp-communicate-plots</a>.</p>
|
||||
<p>There is also a rich set of options for controlling how figures are embedded. You’ll learn about these in <span class="quarto-unresolved-ref">?sec-graphics-communication</span>.</p>
|
||||
|
||||
<section id="exercises-5" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -565,18 +565,28 @@ rawdata <- readr::read_csv("a_very_large_file.csv")
|
|||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li>Set up a network of chunks where <code>d</code> depends on <code>c</code> and <code>b</code>, and both <code>b</code> and <code>c</code> depend on <code>a</code>. Have each chunk print <code><a href="https://lubridate.tidyverse.org/reference/now.html">lubridate::now()</a></code>, set <code>cache: true</code>, then verify your understanding of caching.</li>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="troubleshooting" data-type="sect1">
|
||||
<h1>
|
||||
Troubleshooting</h1>
|
||||
<p>Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.</p>
|
||||
<p>One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.</p>
|
||||
<p>If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu (under Run region) or with the keyboard shortcut Ctrl + Alt + R. If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.</p>
|
||||
<p>If that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code> in a chunk.</p>
|
||||
<p>Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set <code>error: true</code> on the chunk causing the problem, then use <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> and <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> to check that settings are as you expect.</p>
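<p>One way to carry out this comparison is to run the same small diagnostic in both environments and compare the output. The sketch below is only a starting point: the helper name <code>session_snapshot()</code> is hypothetical, and the settings it collects are our guess at common culprits, not an exhaustive list.</p>
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: collect a few settings that often differ between
# an interactive session and a render
session_snapshot <- function() {
  list(
    wd = getwd(),              # working directory
    lib_paths = .libPaths(),   # where packages are loaded from
    attached = search(),       # attached packages and environments
    locale = Sys.getlocale(),  # can affect sorting and parsing
    digits = getOption("digits")
  )
}

# Run this in the console and in a chunk, then compare the two outputs
snap <- session_snapshot()
str(snap)</pre>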
|
||||
</section>
|
||||
</section>
|
||||
|
||||
<section id="yaml-header" data-type="sect1">
|
||||
<h1>
|
||||
|
@ -586,7 +596,7 @@ YAML header</h1>
|
|||
<section id="self-contained" data-type="sect2">
|
||||
<h2>
|
||||
Self-contained</h2>
|
||||
<p>HTML documents typically have a number of external dependencies (e.g. images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a <code>_files</code> folder in the same directory as your <code>.qmd</code> file. If you publish the HTML file on a hosting platform (e.g., QuartoPub, <a href="https://quartopub.com/" class="uri">https://quartopub.com/</a>), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained, HTML document that embeds all of its dependencies. You can do this by specifying the <code>embed-resources</code> option:</p>
|
||||
<p>HTML documents typically have a number of external dependencies (e.g. images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a <code>_files</code> folder in the same directory as your <code>.qmd</code> file. If you publish the HTML file on a hosting platform (e.g. QuartoPub, <a href="https://quartopub.com/" class="uri">https://quartopub.com/</a>), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained, HTML document that embeds all of its dependencies. You can do this by specifying the <code>embed-resources</code> option:</p>
|
||||
<p>For example, rendering <code>report.qmd</code> to HTML produces <code>report.html</code> plus a <code>report_files</code> directory; the following option embeds those dependencies in the HTML file instead:</p>
|
||||
<pre data-type="programlisting" data-code-language="yaml">format:
|
||||
html:
|
||||
|
@ -620,7 +630,7 @@ class <- mpg |> filter(class == params$my_class)
|
|||
```{r}
|
||||
#| message: false
|
||||
|
||||
ggplot(class, aes(displ, hwy)) +
|
||||
ggplot(class, aes(x = displ, y = hwy)) +
|
||||
geom_point() +
|
||||
geom_smooth(se = FALSE)
|
||||
```</code></pre>
|
||||
|
|
|
@ -1,15 +1,15 @@
|
|||
<section data-type="chapter" id="chp-rectangling">
|
||||
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data rectangling</span></span></h1>
|
||||
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Hierarchical data</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In this chapter, you’ll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
|
||||
<p>In this chapter, you’ll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
|
||||
<p>To learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">tidyr::unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">tidyr::unnest_wider()</a></code>. We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.</p>
|
||||
<p>In this chapter, we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(repurrrsive)
|
||||
|
@ -21,7 +21,7 @@ library(jsonlite)</pre>
|
|||
<section id="lists" data-type="sect1">
|
||||
<h1>
|
||||
Lists</h1>
|
||||
<p>So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is the same type. If you want to store element of different types in the same vector, you’ll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
|
||||
<p>So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you’ll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">x1 <- list(1:4, "a", TRUE)
|
||||
x1
|
||||
|
@ -135,7 +135,7 @@ str(x5)
|
|||
<section id="list-columns" data-type="sect2">
|
||||
<h2>
|
||||
List-columns</h2>
|
||||
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn’t usually belong in a tibble. In particular, list-columns are are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like models or resamples in a data frame.</p>
|
||||
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to place objects in a tibble that wouldn’t usually belong in there. In particular, list-columns are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like model outputs or resamples in a data frame.</p>
|
||||
<p>Here’s a simple example of a list-column:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||||
|
@ -160,7 +160,7 @@ df
|
|||
#> 1 1 a <list [2]></pre>
|
||||
</div>
|
||||
<p>Computing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.</p>
|
||||
<p>The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull the list-column out and apply one of the techniques that you learned above:</p>
|
||||
<p>The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull the list-column out and apply one of the techniques that you’ve learned above:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df |>
|
||||
filter(x == 1) |>
|
||||
|
@ -188,7 +188,7 @@ Base R
|
|||
#> x y
|
||||
#> 1 1, 2 1, 2
|
||||
#> 2 3, 4, 5 3, 4, 5</pre>
|
||||
</div><p>It’s easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like either vectors and the print method has been designed with lists in mind.</p></div>
|
||||
</div><p>It’s easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
|
||||
|
||||
</section>
|
||||
</section>
|
||||
|
@ -307,7 +307,7 @@ df6 |> unnest_longer(y)
|
|||
#> 5 3 31 a
|
||||
#> 6 3 32 b</pre>
|
||||
</div>
|
||||
<p>If you don’t want these <code>ids</code>, you can suppress them with <code>indices_include = FALSE</code>. On the other hand, it’s sometimes useful to retain the position of unnamed elements in unnamed list-columns. You can do this with <code>indices_include = TRUE</code>:</p>
|
||||
<p>If you don’t want these <code>ids</code>, you can suppress them with <code>indices_include = FALSE</code>. On the other hand, sometimes the positions of the elements are meaningful, and even if the elements are unnamed, you might still want to track their indices. You can do this with <code>indices_include = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df2 |>
|
||||
unnest_longer(y, indices_include = TRUE)
|
||||
|
@ -326,7 +326,7 @@ df6 |> unnest_longer(y)
|
|||
<section id="inconsistent-types" data-type="sect2">
|
||||
<h2>
|
||||
Inconsistent types</h2>
|
||||
<p>What happens if you unnest a list-column contains different types of vector? For example, take the following dataset where the list-column <code>y</code> contains two numbers, a factor, and a logical, which can’t normally be mixed in a single column.</p>
|
||||
<p>What happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column <code>y</code> contains two numbers, a factor, and a logical, which can’t normally be mixed in a single column.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df4 <- tribble(
|
||||
~x, ~y,
|
||||
|
@ -334,7 +334,7 @@ Inconsistent types</h2>
|
|||
"b", list(TRUE, factor("a"), 5)
|
||||
)</pre>
|
||||
</div>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns change, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
|
||||
<p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y)
|
||||
|
@ -348,7 +348,7 @@ Inconsistent types</h2>
|
|||
#> 5 b <dbl [1]></pre>
|
||||
</div>
|
||||
<p>As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. It doesn’t: every element is still a list, even though the contents of each element are of a different type.</p>
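<p>To make that distinction concrete, here is a minimal base R sketch. The values in <code>y</code> are hypothetical stand-ins for one mixed-type list-column, not the chapter’s dataset. The container has a single type (list), while each element keeps its own class.</p>

```r
# Hypothetical values standing in for one mixed-type list-column.
y <- list(1, TRUE, factor("a"), 5)

# The container itself has a single type: list.
container_type <- typeof(y)

# Each element keeps its own class inside the list.
element_classes <- vapply(y, function(el) class(el)[1], character(1))

# An atomic vector cannot do this: c() coerces everything to one type
# (here character), which is why mixed values are kept in a list.
flattened <- c(1, TRUE, "a")

container_type
element_classes
class(flattened)
```

<p>The factor is left out of the <code>c()</code> call only to keep the coercion easy to read; the point is that an atomic vector loses the original types, whereas the list preserves them.</p>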
|
||||
<p>What happens if you find this problem in a dataset you’re trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. It’s not particularly useful here because there’s only really one class that these five class can be converted to character.</p>
|
||||
<p>What happens if you find this problem in a dataset you’re trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. However, it’s not particularly useful here because there’s only really one class that these five classes can all be converted to: character.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y, transform = as.character)
|
||||
|
@ -372,7 +372,7 @@ Inconsistent types</h2>
|
|||
#> 1 a <dbl [1]>
|
||||
#> 2 b <dbl [1]></pre>
|
||||
</div>
|
||||
<p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more:</p>
|
||||
<p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more. This gives us a rectangular dataset of just the numeric values.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">df4 |>
|
||||
unnest_longer(y) |>
|
||||
|
@ -392,12 +392,12 @@ Inconsistent types</h2>
|
|||
Other functions</h2>
|
||||
<p>tidyr has a few other useful rectangling functions that we’re not going to cover in this book:</p>
|
||||
<ul><li>
|
||||
<code><a href="https://tidyr.tidyverse.org/reference/unnest_auto.html">unnest_auto()</a></code> automatically picks between <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> based on the structure of the list-column. It’s a great for rapid exploration, but ultimately its a bad idea because it doesn’t force you to understand how your data is structured, and makes your code harder to understand.</li>
|
||||
<code><a href="https://tidyr.tidyverse.org/reference/unnest_auto.html">unnest_auto()</a></code> automatically picks between <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> based on the structure of the list-column. It’s great for rapid exploration, but ultimately it’s a bad idea because it doesn’t force you to understand how your data is structured, and makes your code harder to understand.</li>
|
||||
<li>
|
||||
<code><a href="https://tidyr.tidyverse.org/reference/unnest.html">unnest()</a></code> expands both rows and columns. It’s useful when you have a list-column that contains a 2d structure like a data frame, which you don’t see in this book.</li>
|
||||
<code><a href="https://tidyr.tidyverse.org/reference/unnest.html">unnest()</a></code> expands both rows and columns. It’s useful when you have a list-column that contains a 2d structure like a data frame, which you don’t see in this book, but you might encounter if you use the <a href="https://www.tmwr.org/base-r.html#combining-base-r-models-and-the-tidyverse">tidymodels</a> ecosystem.</li>
|
||||
<li>
|
||||
<code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code> allows you to reach into a deeply nested list and extract just the components that you need. It’s mostly equivalent to repeated invocations of <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> + <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> so read up on it if you’re trying to extract just a couple of important variables embedded in a bunch of data that you don’t care about.</li>
|
||||
</ul><p>These are good to know about when you’re reading other people’s code or tackling rarer rectangling challenges.</p>
|
||||
</ul><p>These functions are good to know about as you might encounter them when reading other people’s code or tackling rarer rectangling challenges yourself.</p>
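<p>As a rough illustration of two of the functions listed above, the sketch below uses small made-up tibbles. The names <code>nested</code>, <code>pets</code>, and their columns are assumptions for this example only. It shows <code>unnest()</code> expanding a list-column of data frames, and <code>hoist()</code> pulling selected components out of a list-column of named lists.</p>

```r
library(tibble)
library(tidyr)

# Hypothetical data: a list-column of small data frames (a 2d structure).
nested <- tibble(
  group = c("a", "b"),
  data = list(
    tibble(x = 1:2, y = c("p", "q")),
    tibble(x = 3L, y = "r")
  )
)

# unnest() adds rows (one per inner row) and columns (x and y) at once.
flat <- nested |> unnest(data)

# Hypothetical data: a list-column of named lists with extra fields.
pets <- tibble(
  id = 1:2,
  info = list(
    list(name = "Rex", species = "dog", tags = c("big", "loud")),
    list(name = "Tom", species = "cat", tags = c("small"))
  )
)

# hoist() extracts only the requested components; an unnamed string picks a
# top-level field, and a list() gives a path into a nested component.
picked <- pets |>
  hoist(info, "name", first_tag = list("tags", 1L))

flat
picked
```

<p>Anything not hoisted (here <code>species</code> and the remaining <code>tags</code>) stays behind in the <code>info</code> list-column, which is what makes <code>hoist()</code> convenient when you only care about a couple of fields.</p>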
|
||||
</section>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
|
@ -424,7 +424,7 @@ Case studies</h1>
|
|||
<section id="very-wide-data" data-type="sect2">
|
||||
<h2>
|
||||
Very wide data</h2>
|
||||
<p>We’ll with <code>gh_repos</code>. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; you might want to explore a little on your own with <code>View(gh_repos)</code> before we continue.</p>
|
||||
<p>We’ll start with <code>gh_repos</code>. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; we recommend exploring a little on your own with <code>View(gh_repos)</code> before we continue.</p>
|
||||
<p><code>gh_repos</code> is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call the column <code>json</code> for reasons we’ll get to later.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">repos <- tibble(json = gh_repos)
|
||||
|
@ -460,21 +460,21 @@ repos
|
|||
unnest_longer(json) |>
|
||||
unnest_wider(json)
|
||||
#> # A tibble: 176 × 68
|
||||
#> id name full_…¹ owner private html_…² descr…³ fork url
|
||||
#> <int> <chr> <chr> <list> <lgl> <chr> <chr> <lgl> <chr>
|
||||
#> 1 61160198 after gaborc… <named list> FALSE https:… Run Co… FALSE http…
|
||||
#> 2 40500181 argufy gaborc… <named list> FALSE https:… Declar… FALSE http…
|
||||
#> 3 36442442 ask gaborc… <named list> FALSE https:… Friend… FALSE http…
|
||||
#> 4 34924886 baseimpo… gaborc… <named list> FALSE https:… Do we … FALSE http…
|
||||
#> 5 61620661 citest gaborc… <named list> FALSE https:… Test R… TRUE http…
|
||||
#> 6 33907457 clisymbo… gaborc… <named list> FALSE https:… Unicod… FALSE http…
|
||||
#> # … with 170 more rows, 59 more variables: forks_url <chr>, keys_url <chr>,
|
||||
#> # collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
|
||||
#> # issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
|
||||
#> # branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
|
||||
#> # git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
|
||||
#> # languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
|
||||
#> # subscribers_url <chr>, subscription_url <chr>, commits_url <chr>, …</pre>
|
||||
#> id name full_name owner private html_url description fork
|
||||
#> <int> <chr> <chr> <list> <lgl> <chr> <chr> <lgl>
|
||||
#> 1 61160198 after gaborcsa… <named list> FALSE https:/… Run Code i… FALSE
|
||||
#> 2 40500181 argufy gaborcsa… <named list> FALSE https:/… Declarativ… FALSE
|
||||
#> 3 36442442 ask gaborcsa… <named list> FALSE https:/… Friendly C… FALSE
|
||||
#> 4 34924886 baseimp… gaborcsa… <named list> FALSE https:/… Do we get … FALSE
|
||||
#> 5 61620661 citest gaborcsa… <named list> FALSE https:/… Test R pac… TRUE
|
||||
#> 6 33907457 clisymb… gaborcsa… <named list> FALSE https:/… Unicode sy… FALSE
|
||||
#> # … with 170 more rows, and 60 more variables: url <chr>, forks_url <chr>,
|
||||
#> # keys_url <chr>, collaborators_url <chr>, teams_url <chr>,
|
||||
#> # hooks_url <chr>, issue_events_url <chr>, events_url <chr>,
|
||||
#> # assignees_url <chr>, branches_url <chr>, tags_url <chr>,
|
||||
#> # blobs_url <chr>, git_tags_url <chr>, git_refs_url <chr>,
|
||||
#> # trees_url <chr>, statuses_url <chr>, languages_url <chr>,
|
||||
#> # stargazers_url <chr>, contributors_url <chr>, subscribers_url <chr>, …</pre>
|
||||
</div>
|
||||
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
|
||||
<div class="cell">
|
||||
|
@ -531,7 +531,7 @@ repos
|
|||
unnest_wider(json) |>
|
||||
select(id, full_name, owner, description) |>
|
||||
unnest_wider(owner)
|
||||
#> Error in `unpack()`:
|
||||
#> Error in `unpack()` at tidyr/R/unnest-wider.R:121:2:
|
||||
#> ! Names must be unique.
|
||||
#> ✖ These names are duplicated:
|
||||
#> * "id" at locations 1 and 4.
|
||||
|
@ -546,21 +546,21 @@ repos
|
|||
select(id, full_name, owner, description) |>
|
||||
unnest_wider(owner, names_sep = "_")
|
||||
#> # A tibble: 176 × 20
|
||||
#> id full_name owner…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷
|
||||
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
|
||||
#> 1 61160198 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
|
||||
#> 2 40500181 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
|
||||
#> 3 36442442 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
|
||||
#> 4 34924886 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
|
||||
#> 5 61620661 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
|
||||
#> 6 33907457 gaborcsar… gaborc… 660288 https:… "" https:… https:… https:…
|
||||
#> # … with 170 more rows, 11 more variables: owner_following_url <chr>,
|
||||
#> # owner_gists_url <chr>, owner_starred_url <chr>,
|
||||
#> # owner_subscriptions_url <chr>, owner_organizations_url <chr>,
|
||||
#> # owner_repos_url <chr>, owner_events_url <chr>,
|
||||
#> # owner_received_events_url <chr>, owner_type <chr>,
|
||||
#> # owner_site_admin <lgl>, description <chr>, and abbreviated variable
|
||||
#> # names ¹owner_login, ²owner_id, ³owner_avatar_url, ⁴owner_gravatar_id, …</pre>
|
||||
#> id full_name owner_login owner_id owner_avatar_url owner_gravatar_id
|
||||
#> <int> <chr> <chr> <int> <chr> <chr>
|
||||
#> 1 61160198 gaborcsar… gaborcsardi 660288 https://avatars… ""
|
||||
#> 2 40500181 gaborcsar… gaborcsardi 660288 https://avatars… ""
|
||||
#> 3 36442442 gaborcsar… gaborcsardi 660288 https://avatars… ""
|
||||
#> 4 34924886 gaborcsar… gaborcsardi 660288 https://avatars… ""
|
||||
#> 5 61620661 gaborcsar… gaborcsardi 660288 https://avatars… ""
|
||||
#> 6 33907457 gaborcsar… gaborcsardi 660288 https://avatars… ""
|
||||
#> # … with 170 more rows, and 14 more variables: owner_url <chr>,
|
||||
#> # owner_html_url <chr>, owner_followers_url <chr>,
|
||||
#> # owner_following_url <chr>, owner_gists_url <chr>,
|
||||
#> # owner_starred_url <chr>, owner_subscriptions_url <chr>,
|
||||
#> # owner_organizations_url <chr>, owner_repos_url <chr>,
|
||||
#> # owner_events_url <chr>, owner_received_events_url <chr>,
|
||||
#> # owner_type <chr>, owner_site_admin <lgl>, description <chr></pre>
|
||||
</div>
|
||||
<p>This gives another wide dataset, but you can see that <code>owner</code> appears to contain a lot of additional data about the person who “owns” the repository.</p>
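<p>The name clash and its fix can be reproduced on a much smaller scale. The tibble below is a hypothetical miniature of the repository data (the values and the <code>repos_small</code> name are made up), in which both the outer table and the nested <code>owner</code> list have an <code>id</code> field.</p>

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the repos data: both levels have an `id` field.
repos_small <- tibble(
  id = c(101L, 102L),
  owner = list(
    list(login = "alice", id = 1L),
    list(login = "bob", id = 2L)
  )
)

# names_sep prefixes each new column with the outer column name, so the
# inner `id` becomes `owner_id` and no longer clashes with the outer `id`.
owners <- repos_small |> unnest_wider(owner, names_sep = "_")

names(owners)
```

<p>Without <code>names_sep</code>, the same call would try to create a second <code>id</code> column, which is the duplicate-name situation reported in the error above.</p>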
|
||||
</section>
|
||||
|
@ -568,7 +568,7 @@ repos
|
|||
<section id="relational-data" data-type="sect2">
|
||||
<h2>
|
||||
Relational data</h2>
|
||||
<p>Nested data is sometimes used to represent data that we’d usually spread out into multiple data frames. For example, take <code>got_chars</code>. Like <code>gh_repos</code> it’s a list, so we start by turning it into a list-column of a tibble:</p>
|
||||
<p>Nested data is sometimes used to represent data that we’d usually spread out into multiple data frames. For example, take <code>got_chars</code>, which contains data about characters that appear in Game of Thrones. Like <code>gh_repos</code>, it’s a list, so we start by turning it into a list-column of a tibble:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">chars <- tibble(json = got_chars)
|
||||
chars
|
||||
|
@ -623,15 +623,15 @@ characters
|
|||
unnest_wider(json) |>
|
||||
select(id, where(is.list))
|
||||
#> # A tibble: 30 × 8
|
||||
#> id titles aliases allegiances books povBooks tvSeries playe…¹
|
||||
#> <int> <list> <list> <list> <list> <list> <list> <list>
|
||||
#> 1 1022 <chr [3]> <chr [4]> <chr [1]> <chr [3]> <chr [2]> <chr> <chr>
|
||||
#> 2 1052 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr [4]> <chr> <chr>
|
||||
#> 3 1074 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr> <chr>
|
||||
#> 4 1109 <chr [1]> <chr [1]> <NULL> <chr [1]> <chr [1]> <chr> <chr>
|
||||
#> 5 1166 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr [2]> <chr> <chr>
|
||||
#> 6 1267 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr [1]> <chr> <chr>
|
||||
#> # … with 24 more rows, and abbreviated variable name ¹playedBy</pre>
|
||||
#> id titles aliases allegiances books povBooks tvSeries playedBy
|
||||
#> <int> <list> <list> <list> <list> <list> <list> <list>
|
||||
#> 1 1022 <chr [2]> <chr [4]> <chr [1]> <chr [3]> <chr> <chr> <chr>
|
||||
#> 2 1052 <chr [2]> <chr [11]> <chr [1]> <chr [2]> <chr> <chr> <chr>
|
||||
#> 3 1074 <chr [2]> <chr [1]> <chr [1]> <chr [3]> <chr> <chr> <chr>
|
||||
#> 4 1109 <chr [1]> <chr [1]> <NULL> <chr [1]> <chr> <chr> <chr>
|
||||
#> 5 1166 <chr [1]> <chr [1]> <chr [1]> <chr [3]> <chr> <chr> <chr>
|
||||
#> 6 1267 <chr [1]> <chr [1]> <NULL> <chr [2]> <chr> <chr> <chr>
|
||||
#> # … with 24 more rows</pre>
|
||||
</div>
|
||||
<p>Let’s explore the <code>titles</code> column. It’s an unnamed list-column, so we’ll unnest it into rows:</p>
|
||||
<div class="cell">
|
||||
|
@ -639,16 +639,16 @@ characters
|
|||
unnest_wider(json) |>
|
||||
select(id, titles) |>
|
||||
unnest_longer(titles)
|
||||
#> # A tibble: 60 × 2
|
||||
#> # A tibble: 59 × 2
|
||||
#> id titles
|
||||
#> <int> <chr>
|
||||
#> 1 1022 Prince of Winterfell
|
||||
#> 2 1022 Captain of Sea Bitch
|
||||
#> 3 1022 Lord of the Iron Islands (by law of the green lands)
|
||||
#> 4 1052 Acting Hand of the King (former)
|
||||
#> 5 1052 Master of Coin (former)
|
||||
#> 6 1074 Lord Captain of the Iron Fleet
|
||||
#> # … with 54 more rows</pre>
|
||||
#> 2 1022 Lord of the Iron Islands (by law of the green lands)
|
||||
#> 3 1052 Acting Hand of the King (former)
|
||||
#> 4 1052 Master of Coin (former)
|
||||
#> 5 1074 Lord Captain of the Iron Fleet
|
||||
#> 6 1074 Master of the Iron Victory
|
||||
#> # … with 53 more rows</pre>
|
||||
</div>
|
||||
<p>You might expect to see this data in its own table because it would be easy to join to the characters data as needed. To do so, we’ll do a little cleaning: removing the rows containing empty strings and renaming <code>titles</code> to <code>title</code> since each row now only contains a single title.</p>
|
||||
<div class="cell">
|
||||
|
@ -659,43 +659,42 @@ characters
|
|||
filter(titles != "") |>
|
||||
rename(title = titles)
|
||||
titles
|
||||
#> # A tibble: 53 × 2
|
||||
#> # A tibble: 52 × 2
|
||||
#> id title
|
||||
#> <int> <chr>
|
||||
#> 1 1022 Prince of Winterfell
|
||||
#> 2 1022 Captain of Sea Bitch
|
||||
#> 3 1022 Lord of the Iron Islands (by law of the green lands)
|
||||
#> 4 1052 Acting Hand of the King (former)
|
||||
#> 5 1052 Master of Coin (former)
|
||||
#> 6 1074 Lord Captain of the Iron Fleet
|
||||
#> # … with 47 more rows</pre>
|
||||
#> 2 1022 Lord of the Iron Islands (by law of the green lands)
|
||||
#> 3 1052 Acting Hand of the King (former)
|
||||
#> 4 1052 Master of Coin (former)
|
||||
#> 5 1074 Lord Captain of the Iron Fleet
|
||||
#> 6 1074 Master of the Iron Victory
|
||||
#> # … with 46 more rows</pre>
|
||||
</div>
|
||||
<p>Now, for example, we could use this table tofind all the characters that are captains and see all their titles:</p>
|
||||
<p>Now, for example, we could use this table to find all the characters that are captains and see all their titles:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">captains <- titles |> filter(str_detect(title, "Captain"))
|
||||
captains
|
||||
#> # A tibble: 5 × 2
|
||||
#> # A tibble: 4 × 2
|
||||
#> id title
|
||||
#> <int> <chr>
|
||||
#> 1 1022 Captain of Sea Bitch
|
||||
#> 2 1074 Lord Captain of the Iron Fleet
|
||||
#> 3 1166 Captain of the Guard at Sunspear
|
||||
#> 4 150 Captain of the Black Wind
|
||||
#> 5 60 Captain of the Golden Storm (formerly)
|
||||
#> 1 1074 Lord Captain of the Iron Fleet
|
||||
#> 2 1166 Captain of the Guard at Sunspear
|
||||
#> 3 150 Captain of the Black Wind
|
||||
#> 4 60 Captain of the Golden Storm (formerly)
|
||||
|
||||
characters |>
|
||||
select(id, name) |>
|
||||
inner_join(titles, by = "id", multiple = "all")
|
||||
#> # A tibble: 53 × 3
|
||||
#> # A tibble: 52 × 3
|
||||
#> id name title
|
||||
#> <int> <chr> <chr>
|
||||
#> 1 1022 Theon Greyjoy Prince of Winterfell
|
||||
#> 2 1022 Theon Greyjoy Captain of Sea Bitch
|
||||
#> 3 1022 Theon Greyjoy Lord of the Iron Islands (by law of the green land…
|
||||
#> 4 1052 Tyrion Lannister Acting Hand of the King (former)
|
||||
#> 5 1052 Tyrion Lannister Master of Coin (former)
|
||||
#> 6 1074 Victarion Greyjoy Lord Captain of the Iron Fleet
|
||||
#> # … with 47 more rows</pre>
|
||||
#> 2 1022 Theon Greyjoy Lord of the Iron Islands (by law of the green land…
|
||||
#> 3 1052 Tyrion Lannister Acting Hand of the King (former)
|
||||
#> 4 1052 Tyrion Lannister Master of Coin (former)
|
||||
#> 5 1074 Victarion Greyjoy Lord Captain of the Iron Fleet
|
||||
#> 6 1074 Victarion Greyjoy Master of the Iron Victory
|
||||
#> # … with 46 more rows</pre>
|
||||
</div>
|
||||
<p>You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.</p>
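<p>For instance, the same recipe could be applied to an aliases column. The sketch below uses a tiny made-up <code>characters_small</code> tibble rather than the real <code>got_chars</code> data, so the names and values are assumptions; the steps mirror the ones used for <code>titles</code>.</p>

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the characters data, with one list-column.
characters_small <- tibble(
  id = c(1L, 2L),
  name = c("Ann", "Bo"),
  aliases = list(c("The Quiet", ""), c("Bo the Bold", "Red"))
)

# One row per alias: unnest, drop empty strings, use a singular name.
aliases <- characters_small |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  filter(aliases != "") |>
  rename(alias = aliases)

# Join back to the character names only when they are needed.
joined <- characters_small |>
  select(id, name) |>
  inner_join(aliases, by = "id", multiple = "all")

joined
```

<p>Keeping each list-column in its own long table means the main character table stays one row per character, and the <code>id</code> key is all you need to bring the pieces back together.</p>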
|
||||
</section>
|
||||
|
@ -703,36 +702,36 @@ characters |>
|
|||
<section id="a-dash-of-text-analysis" data-type="sect2">
|
||||
<h2>
|
||||
A dash of text analysis</h2>
|
||||
<p>What if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by spitting on <code>" "</code>:</p>
|
||||
<p>Sticking with the same data, what if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by splitting on <code>" "</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused")
|
||||
#> # A tibble: 53 × 2
|
||||
#> # A tibble: 52 × 2
|
||||
#> id word
|
||||
#> <int> <list>
|
||||
#> 1 1022 <chr [3]>
|
||||
#> 2 1022 <chr [4]>
|
||||
#> 3 1022 <chr [11]>
|
||||
#> 4 1052 <chr [6]>
|
||||
#> 5 1052 <chr [4]>
|
||||
#> 6 1074 <chr [6]>
|
||||
#> # … with 47 more rows</pre>
|
||||
#> 2 1022 <chr [11]>
|
||||
#> 3 1052 <chr [6]>
|
||||
#> 4 1052 <chr [4]>
|
||||
#> 5 1074 <chr [6]>
|
||||
#> 6 1074 <chr [5]>
|
||||
#> # … with 46 more rows</pre>
|
||||
</div>
|
||||
<p>This creates a unnamed variable length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
|
||||
<p>This creates an unnamed, variable-length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word)
|
||||
#> # A tibble: 202 × 2
|
||||
#> # A tibble: 198 × 2
|
||||
#> id word
|
||||
#> <int> <chr>
|
||||
#> 1 1022 Prince
|
||||
#> 2 1022 of
|
||||
#> 3 1022 Winterfell
|
||||
#> 4 1022 Captain
|
||||
#> 4 1022 Lord
|
||||
#> 5 1022 of
|
||||
#> 6 1022 Sea
|
||||
#> # … with 196 more rows</pre>
|
||||
#> 6 1022 the
|
||||
#> # … with 192 more rows</pre>
|
||||
</div>
|
||||
<p>And then we can count that column to find the most common words:</p>
|
||||
<div class="cell">
|
||||
|
@ -740,18 +739,18 @@ A dash of text analysis</h2>
|
|||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word) |>
|
||||
count(word, sort = TRUE)
|
||||
#> # A tibble: 78 × 2
|
||||
#> word n
|
||||
#> <chr> <int>
|
||||
#> 1 of 41
|
||||
#> 2 the 29
|
||||
#> 3 Lord 9
|
||||
#> 4 Hand 6
|
||||
#> 5 Captain 5
|
||||
#> 6 King 5
|
||||
#> # … with 72 more rows</pre>
|
||||
#> # A tibble: 77 × 2
|
||||
#> word n
|
||||
#> <chr> <int>
|
||||
#> 1 of 40
|
||||
#> 2 the 29
|
||||
#> 3 Lord 9
|
||||
#> 4 Hand 6
|
||||
#> 5 King 5
|
||||
#> 6 Princess 5
|
||||
#> # … with 71 more rows</pre>
|
||||
</div>
|
||||
<p>Some of those words are not very interesting so we could create a list of common words to drop. In text analysis these is commonly called stop words.</p>
|
||||
<p>Some of those words are not very interesting so we could create a list of common words to drop. In text analysis these are commonly called stop words.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">stop_words <- tibble(word = c("of", "the"))
|
||||
|
||||
|
@ -761,16 +760,16 @@ titles |>
|
|||
anti_join(stop_words) |>
|
||||
count(word, sort = TRUE)
|
||||
#> Joining with `by = join_by(word)`
|
||||
#> # A tibble: 76 × 2
|
||||
#> # A tibble: 75 × 2
|
||||
#> word n
|
||||
#> <chr> <int>
|
||||
#> 1 Lord 9
|
||||
#> 2 Hand 6
|
||||
#> 3 Captain 5
|
||||
#> 4 King 5
|
||||
#> 5 Princess 5
|
||||
#> 6 Queen 5
|
||||
#> # … with 70 more rows</pre>
|
||||
#> 3 King 5
|
||||
#> 4 Princess 5
|
||||
#> 5 Queen 5
|
||||
#> 6 Ser 5
|
||||
#> # … with 69 more rows</pre>
|
||||
</div>
|
||||
<p>Breaking up text into individual fragments is a powerful idea that underlies much of text analysis. If this sounds interesting, a good place to learn more is <a href="https://www.tidytextmining.com">Text Mining with R</a> by Julia Silge and David Robinson.</p>
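<p>For comparison, the same split, filter, and count idea can be sketched in base R. The three titles below are a hand-picked sample, not the full dataset, and the variable names are made up for this example.</p>

```r
# A small hand-picked sample of titles (not the full dataset).
titles_sample <- c(
  "Prince of Winterfell",
  "Lord of the Iron Islands",
  "Lord Captain of the Iron Fleet"
)

# Split each title on single spaces and flatten into one vector of words.
words <- unlist(strsplit(titles_sample, " ", fixed = TRUE))

# Drop the stop words, then count what is left, most frequent first.
stop_words_sample <- c("of", "the")
kept <- words[!words %in% stop_words_sample]
counts <- sort(table(kept), decreasing = TRUE)

counts
```

<p>Here <code>strsplit()</code> plays the role of <code>str_split()</code>, <code>unlist()</code> the role of <code>unnest_longer()</code>, and <code>table()</code> the role of <code>count()</code>, at the cost of losing the <code>id</code> that ties each word back to its character.</p>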
|
||||
</section>
|
||||
|
@ -803,7 +802,7 @@ Deeply nested</h2>
|
|||
#> 4 Chicago <list [1]> OK
|
||||
#> 5 Arlington <list [2]> OK</pre>
|
||||
</div>
|
||||
<p>This gives us the <code>status</code> and the <code>results</code>. We’ll drop the status column since they’re all <code>OK</code>; in a real analysis, you’d also want capture all the rows where <code>status != "OK"</code> and figure out what went wrong. <code>results</code> is an unnamed list, with either one or two elements (we’ll see why shortly) so we’ll unnest it into rows:</p>
|
||||
<p>This gives us the <code>status</code> and the <code>results</code>. We’ll drop the status column since they’re all <code>OK</code>; in a real analysis, you’d also want to capture all the rows where <code>status != "OK"</code> and figure out what went wrong. <code>results</code> is an unnamed list, with either one or two elements (we’ll see why shortly) so we’ll unnest it into rows:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities |>
|
||||
unnest_wider(json) |>
|
||||
|
@ -829,15 +828,15 @@ Deeply nested</h2>
|
|||
unnest_wider(results)
|
||||
locations
|
||||
#> # A tibble: 7 × 6
|
||||
#> city address_components formatted_address geometry place…¹ types
|
||||
#> <chr> <list> <chr> <list> <chr> <list>
|
||||
#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAY… <list>
|
||||
#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-b… <list>
|
||||
#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-… <list>
|
||||
#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOw… <list>
|
||||
#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7c… <list>
|
||||
#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05… <list>
|
||||
#> # … with 1 more row, and abbreviated variable name ¹place_id</pre>
|
||||
#> city address_compone…¹ formatted_address geometry place_id types
|
||||
#> <chr> <list> <chr> <list> <chr> <list>
|
||||
#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAYW… <list>
|
||||
#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-bD… <list>
|
||||
#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-T… <list>
|
||||
#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOwg… <list>
|
||||
#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7cv… <list>
|
||||
#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05g… <list>
|
||||
#> # … with 1 more row, and abbreviated variable name ¹address_components</pre>
|
||||
</div>
|
||||
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
|
||||
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
|
||||
|
@ -846,15 +845,15 @@ locations
|
|||
select(city, formatted_address, geometry) |>
|
||||
unnest_wider(geometry)
|
||||
#> # A tibble: 7 × 6
|
||||
#> city formatted_address bounds location locat…¹ viewport
|
||||
#> <chr> <chr> <list> <list> <chr> <list>
|
||||
#> 1 Houston Houston, TX, USA <named list> <named list> APPROX… <named list>
|
||||
#> 2 Washington Washington, USA <named list> <named list> APPROX… <named list>
|
||||
#> 3 Washington Washington, DC, … <named list> <named list> APPROX… <named list>
|
||||
#> 4 New York New York, NY, USA <named list> <named list> APPROX… <named list>
|
||||
#> 5 Chicago Chicago, IL, USA <named list> <named list> APPROX… <named list>
|
||||
#> 6 Arlington Arlington, TX, U… <named list> <named list> APPROX… <named list>
|
||||
#> # … with 1 more row, and abbreviated variable name ¹location_type</pre>
|
||||
#> city formatted_address bounds location location_type
|
||||
#> <chr> <chr> <list> <list> <chr>
|
||||
#> 1 Houston Houston, TX, USA <named list [2]> <named list> APPROXIMATE
|
||||
#> 2 Washington Washington, USA <named list [2]> <named list> APPROXIMATE
|
||||
#> 3 Washington Washington, DC, USA <named list [2]> <named list> APPROXIMATE
|
||||
#> 4 New York New York, NY, USA <named list [2]> <named list> APPROXIMATE
|
||||
#> 5 Chicago Chicago, IL, USA <named list [2]> <named list> APPROXIMATE
|
||||
#> 6 Arlington Arlington, TX, USA <named list [2]> <named list> APPROXIMATE
|
||||
#> # … with 1 more row, and 1 more variable: viewport <list></pre>
|
||||
</div>
|
||||
<p>That gives us new <code>bounds</code> (a rectangular region) and <code>location</code> (a point). We can unnest <code>location</code> to see the latitude (<code>lat</code>) and longitude (<code>lng</code>):</p>
|
||||
<div class="cell">
|
||||
|
@ -863,15 +862,15 @@ locations
|
|||
unnest_wider(geometry) |>
|
||||
unnest_wider(location)
|
||||
#> # A tibble: 7 × 7
|
||||
#> city formatted_address bounds lat lng locat…¹ viewport
|
||||
#> <chr> <chr> <list> <dbl> <dbl> <chr> <list>
|
||||
#> 1 Houston Houston, TX, USA <named list> 29.8 -95.4 APPROX… <named list>
|
||||
#> 2 Washington Washington, USA <named list> 47.8 -121. APPROX… <named list>
|
||||
#> 3 Washington Washington, DC, … <named list> 38.9 -77.0 APPROX… <named list>
|
||||
#> 4 New York New York, NY, USA <named list> 40.7 -74.0 APPROX… <named list>
|
||||
#> 5 Chicago Chicago, IL, USA <named list> 41.9 -87.6 APPROX… <named list>
|
||||
#> 6 Arlington Arlington, TX, U… <named list> 32.7 -97.1 APPROX… <named list>
|
||||
#> # … with 1 more row, and abbreviated variable name ¹location_type</pre>
|
||||
#> city formatted_address bounds lat lng location_type
|
||||
#> <chr> <chr> <list> <dbl> <dbl> <chr>
|
||||
#> 1 Houston Houston, TX, USA <named list [2]> 29.8 -95.4 APPROXIMATE
|
||||
#> 2 Washington Washington, USA <named list [2]> 47.8 -121. APPROXIMATE
|
||||
#> 3 Washington Washington, DC, USA <named list [2]> 38.9 -77.0 APPROXIMATE
|
||||
#> 4 New York New York, NY, USA <named list [2]> 40.7 -74.0 APPROXIMATE
|
||||
#> 5 Chicago Chicago, IL, USA <named list [2]> 41.9 -87.6 APPROXIMATE
|
||||
#> 6 Arlington Arlington, TX, USA <named list [2]> 32.7 -97.1 APPROXIMATE
|
||||
#> # … with 1 more row, and 1 more variable: viewport <list></pre>
|
||||
</div>
|
||||
<p>Extracting the bounds requires a few more steps:</p>
|
||||
<div class="cell">
|
||||
|
@ -913,7 +912,7 @@ locations
|
|||
#> # … with 1 more row</pre>
|
||||
</div>
|
||||
<p>Note how we unnest two columns simultaneously by supplying a vector of variable names to <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>.</p>
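<p>To see the two-column form in isolation, here is a minimal sketch on made-up data. The <code>corners</code> tibble, its city labels, and its coordinates are hypothetical and only loosely modeled on a bounding box; <code>names_sep</code> is the tidyr argument that prefixes each new column with the name of the column it came from.</p>

```r
library(tibble)
library(tidyr)

# Hypothetical data: two named-list columns per row, standing in for the
# northeast and southwest corners of a bounding box.
corners <- tibble(
  city = c("A", "B"),
  ne = list(list(lat = 30.1, lng = -95.0), list(lat = 41.0, lng = -73.7)),
  sw = list(list(lat = 29.5, lng = -95.9), list(lat = 40.5, lng = -74.3))
)

# Unnest both list-columns in a single call. names_sep keeps the two
# lat/lng pairs apart by prefixing them with "ne_" and "sw_".
wide <- corners |>
  unnest_wider(c(ne, sw), names_sep = "_")

names(wide)
```

<p>Without <code>names_sep</code>, both list-columns would try to create columns called <code>lat</code> and <code>lng</code>, so supplying it is the simplest way to keep the names unique.</p>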
|
||||
<p>This is somewhere that <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned briefly above, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
|
||||
<p>This is where <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned earlier in the chapter, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
|
@ -972,7 +971,7 @@ Data types</h2>
|
|||
<p>JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:</p>
|
||||
<ul><li>The simplest type is a null (<code>null</code>) which plays the same role as both <code>NULL</code> and <code>NA</code> in R. It represents the absence of data.</li>
|
||||
<li>A <strong>string</strong> is much like a string in R, but must always use double quotes.</li>
|
||||
<li>A <strong>number</strong> is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.</li>
|
||||
<li>A <strong>number</strong> is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support <code>Inf</code>, <code>-Inf</code>, or <code>NaN</code>.</li>
|
||||
<li>A <strong>boolean</strong> is similar to R’s <code>TRUE</code> and <code>FALSE</code>, but uses lowercase <code>true</code> and <code>false</code>.</li>
|
||||
</ul><p>JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.</p>
|
||||
<p>Both arrays and objects are similar to lists in R; the difference is whether or not they’re named. An <strong>array</strong> is like an unnamed list, and is written with <code>[]</code>. For example <code>[1, 2, 3]</code> is an array containing 3 numbers, and <code>[null, 1, "string", false]</code> is an array that contains a null, a number, a string, and a boolean. An <strong>object</strong> is like a named list, and is written with <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, <code>{"x": 1, "y": 2}</code> is an object that maps <code>x</code> to 1 and <code>y</code> to 2.</p>
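<p>As a rough illustration of that mapping, the sketch below parses the examples from the paragraph above with <code>parse_json()</code> from the jsonlite package. The variable names are ours, and the point is only to show which JSON types become named or unnamed R lists.</p>

```r
library(jsonlite)

# A JSON array becomes an unnamed list.
arr <- parse_json('[1, 2, 3]')

# A JSON object becomes a named list; the keys become the names.
obj <- parse_json('{"x": 1, "y": 2}')

# Arrays can mix types, including null, which arrives in R as NULL.
mixed <- parse_json('[null, 1, "string", false]')

str(arr)
str(obj)
str(mixed)
```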
|
||||
|
@ -994,7 +993,7 @@ gh_users2 <- read_json(gh_users_json())
|
|||
identical(gh_users, gh_users2)
|
||||
#> [1] TRUE</pre>
|
||||
</div>
|
||||
<p>In this book, I’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here’s three simple JSON datasets, starting with a number, then putting a few number in an array, then putting that array in an object:</p>
|
||||
<p>In this book, we’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str(parse_json('1'))
|
||||
#> int 1
|
||||
|
@ -1038,7 +1037,7 @@ df |>
|
|||
#> 1 John 34
|
||||
#> 2 Susan 27</pre>
|
||||
</div>
|
||||
<p>In rarer cases, the JSON consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it a list, before you put it in a tibble.</p>
|
||||
<p>In rarer cases, the JSON file consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">json <- '{
|
||||
"status": "OK",
|
||||
|
@ -1114,7 +1113,7 @@ df_row <- tibble(json = json_row)</pre>
|
|||
<section id="summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter, you learned what lists are, how you can generate the from JSON files, and how turn them into rectangular data frames. Surprisingly we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions.</p>
|
||||
<p>In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.</p>
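<p>As a compact recap, the following sketch applies the two functions in sequence to a small, invented list-column. The <code>groups</code> tibble and the names and ages in it are hypothetical; the pattern of going longer first and wider second is just one common order, not the only one.</p>

```r
library(tibble)
library(tidyr)

# Hypothetical nested data: each row holds an unnamed list of people,
# and each person is a named list.
groups <- tibble(
  id = 1:2,
  people = list(
    list(list(name = "John", age = 34), list(name = "Susan", age = 27)),
    list(list(name = "Ana", age = 41))
  )
)

flat <- groups |>
  unnest_longer(people) |>  # unnamed list: one row per person
  unnest_wider(people)      # named list: one column per field

flat
```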
|
||||
<p>JSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.</p>
|
||||
|
||||
|
||||
|
|
|
@ -3,8 +3,8 @@
|
|||
<section id="introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. In this chapter we’ll focusing on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
|
||||
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with, and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish up with a survey of other places in the tidyverse and base R where you might use regexes.</p>
|
||||
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
|
||||
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
|
@ -16,14 +16,14 @@ Prerequisites</h2>
|
|||
|
||||
</div>
|
||||
|
||||
<p>This chapter relies on features only found in stringr 1.5.0 and tidyr 1.3.0 which are still in development. If you want to live life on the edge, you can get the dev versions with <code>devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr"))</code>.</p></div>
|
||||
<p>This chapter relies on features only found in tidyr 1.3.0, which is still in development. If you want to live on the edge, you can get the dev version with <code>devtools::install_github("tidyverse/tidyr")</code>.</p></div>
|
||||
|
||||
<p>In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
|
||||
library(babynames)</pre>
|
||||
</div>
|
||||
<p>Through this chapter we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:</p>
|
||||
<p>Throughout this chapter, we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:</p>
|
||||
<ul><li>
|
||||
<code>fruit</code> contains the names of 80 fruits.</li>
|
||||
<li>
|
||||
|
@ -36,7 +36,7 @@ library(babynames)</pre>
|
|||
<section id="sec-reg-basics" data-type="sect1">
|
||||
<h1>
|
||||
Pattern basics</h1>
|
||||
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code><></code>, and, where possible, highlighting the match in blue.</p>
|
||||
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code><></code>, and, where possible, highlighting the match in blue.</p>
|
||||
<p>The simplest patterns consist of letters and numbers which match those characters exactly:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "berry")
|
||||
|
@ -75,11 +75,11 @@ str_view(fruit, "BERRY")</pre>
|
|||
</div>
|
||||
<p><strong>Quantifiers</strong> control how many times a pattern can match:</p>
|
||||
<ul><li>
|
||||
<code>?</code> makes a pattern optional (i.e. it matches 0 or 1 times)</li>
|
||||
<code>?</code> makes a pattern optional (i.e., it matches 0 or 1 times)</li>
|
||||
<li>
|
||||
<code>+</code> lets a pattern repeat (i.e. it matches at least once)</li>
|
||||
<code>+</code> lets a pattern repeat (i.e., it matches at least once)</li>
|
||||
<li>
|
||||
<code>*</code> lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).</li>
|
||||
<code>*</code> lets a pattern be optional or repeat (i.e., it matches any number of times, including 0).</li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r"># ab? matches an "a", optionally followed by a "b".
|
||||
str_view(c("a", "ab", "abb"), "ab?")
|
||||
|
@ -98,7 +98,7 @@ str_view(c("a", "ab", "abb"), "ab*")
|
|||
#> [2] │ <ab>
|
||||
#> [3] │ <abb></pre>
|
||||
</div>
|
||||
<p><strong>Character classes</strong> are defined by <code>[]</code> and let you match a set set of characters, e.g. <code>[abcd]</code> matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with <code>^</code>: <code>[^abcd]</code> matches anything <strong>except</strong> “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels or four consonants in a row:</p>
|
||||
<p><strong>Character classes</strong> are defined by <code>[]</code> and let you match a set of characters, e.g. <code>[abcd]</code> matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with <code>^</code>: <code>[^abcd]</code> matches anything <strong>except</strong> “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels or four consonants in a row:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][aeiou]")
|
||||
#> [79] │ b<eau>ty
|
||||
|
@ -114,7 +114,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
|||
#> [830] │ su<pply>
|
||||
#> [836] │ <syst>em</pre>
|
||||
</div>
|
||||
<p>You can combine character classes and quantifiers. For example, the following regexp looks for two vowel followed by two or more consonants:</p>
|
||||
<p>You can combine character classes and quantifiers. For example, the following regexp looks for two vowels followed by two or more consonants:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
|
||||
#> [6] │ acc<ount>
|
||||
|
@ -129,7 +129,7 @@ str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
|
|||
#> [79] │ be<auty>
|
||||
#> ... and 62 more</pre>
|
||||
</div>
|
||||
<p>(We’ll learn some more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
|
||||
<p>(We’ll learn more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
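<p>As a small preview, and assuming the standard <code>{n}</code> repetition syntax that the later section covers, the sketch below checks that the compact form selects the same elements of <code>words</code> as the spelled-out character classes. The variable names are ours.</p>

```r
library(stringr)

# Spelled out: three vowels in a row.
long_form <- str_detect(words, "[aeiou][aeiou][aeiou]")

# Compact: {3} repeats the preceding character class exactly three times.
short_form <- str_detect(words, "[aeiou]{3}")

# Both patterns should flag exactly the same words.
identical(long_form, short_form)
```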
|
||||
<p>You can use <strong>alternation</strong>, <code>|</code>, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple|pear|banana")
|
||||
|
@ -154,12 +154,12 @@ Exercises</h2>
|
|||
<section id="sec-stringr-regex-funs" data-type="sect1">
|
||||
<h1>
|
||||
Key functions</h1>
|
||||
<p>Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn about how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.</p>
|
||||
<p>Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.</p>
|
||||
|
||||
<section id="detect-matches" data-type="sect2">
|
||||
<h2>
|
||||
Detect matches</h2>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p>
|
||||
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matches an element of the character vector and <code>FALSE</code> otherwise:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">str_detect(c("a", "b", "c"), "[aeiou]")
|
||||
#> [1] TRUE FALSE FALSE</pre>
|
||||
|
@ -184,12 +184,12 @@ Detect matches</h2>
|
|||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">babynames |>
|
||||
group_by(year) |>
|
||||
summarise(prop_x = mean(str_detect(name, "x"))) |>
|
||||
ggplot(aes(year, prop_x)) +
|
||||
summarize(prop_x = mean(str_detect(name, "x"))) |>
|
||||
ggplot(aes(x = year, y = prop_x)) +
|
||||
geom_line()</pre>
|
||||
<div class="cell-output-display">
|
||||
|
||||
<figure id="fig-x-names"><p><img src="regexps_files/figure-html/fig-x-names-1.png" alt="A timeseries showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019." width="576"/></p>
|
||||
<figure id="fig-x-names"><p><img src="regexps_files/figure-html/fig-x-names-1.png" alt="A time series showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019." width="576"/></p>
|
||||
<figcaption>A time series showing the proportion of baby names that contain a lower case “x”.</figcaption>
|
||||
</figure>
|
||||
</div>
|
||||
|
@ -241,7 +241,7 @@ str_view("abababa", "aba")
|
|||
<p>If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:</p>
|
||||
<ul><li>Add the upper case vowels to the character class: <code>str_count(name, "[aeiouAEIOU]")</code>.</li>
|
||||
<li>Tell the regular expression to ignore case: <code>str_count(name, regex("[aeiou]", ignore_case = TRUE))</code>. We’ll talk about this more in <a href="#sec-flags" data-type="xref">#sec-flags</a>.</li>
|
||||
<li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>. You learned about this function in <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a>.</li>
|
||||
<li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>.</li>
|
||||
</ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p>
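<p>To make the comparison concrete, here is a minimal sketch that runs all three approaches on a tiny, made-up vector of names and confirms that they agree. The <code>name</code> vector is hypothetical; the functions are the stringr ones named in the list above.</p>

```r
library(stringr)

# Hypothetical names, each starting with an upper case vowel.
name <- c("Aaban", "Eve", "Otto")

# 1. Add the upper case vowels to the character class.
n_class <- str_count(name, "[aeiouAEIOU]")

# 2. Wrap the pattern (not the string) in regex() and ignore case.
n_flag <- str_count(name, regex("[aeiou]", ignore_case = TRUE))

# 3. Lower-case the input before counting.
n_lower <- str_count(str_to_lower(name), "[aeiou]")

n_class
```

<p>Note that in the second approach <code>regex()</code> wraps the pattern, which is the second argument, rather than the strings being searched.</p>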
|
||||
<p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p>
|
||||
<div class="cell">
|
||||
|
@ -283,7 +283,7 @@ str_remove_all(x, "[aeiou]")
|
|||
<p>These functions are naturally paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.</p>
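<p>A minimal sketch of that layered cleaning is shown below. The <code>raw</code> tibble and its phone numbers are invented for illustration, and the two-step order (remove parentheses, then remove separators) is just one reasonable choice.</p>

```r
library(dplyr)
library(stringr)

# Hypothetical messy input: the same number in three formats.
raw <- tibble(
  phone = c("(555) 123-4567", "555.123.4567", " 555 123 4567 ")
)

clean <- raw |>
  mutate(
    phone = phone |>
      str_remove_all("[()]") |>        # layer 1: drop parentheses
      str_replace_all("[ .-]+", "")    # layer 2: drop spaces, dots, and dashes
  )

clean$phone
```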
|
||||
</section>
|
||||
|
||||
<section id="extract-variables" data-type="sect2">
|
||||
<section id="sec-extract-variables" data-type="sect2">
|
||||
<h2>
|
||||
Extract variables</h2>
|
||||
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code>separate_wider_location()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
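<p>As a rough sketch of how this works, assuming tidyr 1.3.0 or later is installed, the example below splits an invented <code>code</code> column into two pieces. The data and the column names <code>prefix</code> and <code>number</code> are hypothetical. Named elements of <code>patterns</code> become new columns, while unnamed elements must match but are dropped from the output.</p>

```r
library(tibble)
library(tidyr)

# Hypothetical data: a letter prefix, a dash, then some digits.
codes <- tibble(code = c("AB-123", "CD-45"))

parts <- codes |>
  separate_wider_regex(
    code,
    patterns = c(
      prefix = "[A-Z]+",  # named: kept as a column
      "-",                # unnamed: matched, then discarded
      number = "[0-9]+"   # named: kept as a column
    )
  )

parts
```

<p>The new columns are character vectors, so a follow-up step such as <code>as.integer()</code> inside <code>mutate()</code> would be needed if you want <code>number</code> to be numeric.</p>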
|
||||
|
@ -407,12 +407,12 @@ str_view(fruit, "a$")
|
|||
str_view(fruit, "^apple$")
|
||||
#> [1] │ <apple></pre>
|
||||
</div>
|
||||
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly when using RStudio’s find and replace tool. For example, if to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
|
||||
<p>You can also match the boundary between words (i.e., the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code>, and so on:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
<pre data-type="programlisting" data-code-language="r">x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
|
||||
str_view(x, "sum")
|
||||
#> [1] │ <sum>mary(x)
|
||||
#> [2] │ <sum>marise(df)
|
||||
#> [2] │ <sum>marize(df)
|
||||
#> [3] │ row<sum>(x)
|
||||
#> [4] │ <sum>(x)
|
||||
str_view(x, "\\bsum\\b")
|
||||
|
@ -621,7 +621,7 @@ Exercises</h2>
|
|||
<li>Contain at least two vowel-consonant pairs in a row.</li>
|
||||
<li>Only consist of repeated vowel-consonant pairs.</li>
|
||||
</ol></li>
|
||||
<li><p>Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. Try and make the shortest possible regex!</p></li>
|
||||
<li><p>Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. Try to make the shortest possible regex!</p></li>
|
||||
<li><p>Switch the first and last letters in <code>words</code>. Which of those strings are still <code>words</code>?</p></li>
|
||||
<li>
|
||||
<p>Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)</p>
|
||||
|
|