Remove O'Reilly files

This commit is contained in:
Hadley Wickham 2023-02-14 07:37:43 -06:00
parent ad72f30b9a
commit 911528d48f
114 changed files with 1 additions and 18824 deletions

2
.gitignore vendored

@ -19,4 +19,4 @@ site_libs
/data/seattle-library-checkouts.csv
/data/seattle-library-checkouts.parquet
/data/seattle-library-checkouts
oreilly


@ -1,473 +0,0 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1>
<section id="EDA-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
<ol type="1"><li><p>Generate questions about your data.</p></li>
<li><p>Search for answers by visualizing, transforming, and modelling your data.</p></li>
<li><p>Use what you learn to refine your questions and/or generate new questions.</p></li>
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.</p>
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>
<section id="EDA-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="questions" data-type="sect1">
<h1>
Questions</h1>
<blockquote class="blockquote">
<p>“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox</p>
</blockquote>
<blockquote class="blockquote">
<p>“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey</p>
</blockquote>
<p>Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.</p>
<p>EDA is fundamentally a creative process. And like most creative processes, the key to asking <em>quality</em> questions is to generate a large <em>quantity</em> of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights can be gleaned from your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.</p>
<p>There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:</p>
<ol type="1"><li><p>What type of variation occurs within my variables?</p></li>
<li><p>What type of covariation occurs between my variables?</p></li>
</ol><p>The rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question. To make the discussion easier, let’s define some terms:</p>
<ul><li><p>A <strong>variable</strong> is a quantity, quality, or property that you can measure.</p></li>
<li><p>A <strong>value</strong> is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.</p></li>
<li><p>An <strong>observation</strong> is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.</p></li>
<li><p><strong>Tabular data</strong> is a set of values, each associated with a variable and an observation. Tabular data is <em>tidy</em> if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.</p></li>
</ul><p>So far, all of the data that you’ve seen has been tidy. In real life, most data isn’t tidy, so we’ll come back to these ideas again in <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>.</p>
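<p>As a small illustration of the terms defined above, the sketch below builds a tidy tibble by hand. The table name and its values are hypothetical, invented only for this example: each row is one observation, each column is one variable, and each cell holds a single value.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data, made up for illustration only
toy_diamonds &lt;- tribble(
  ~id, ~carat, ~cut,
    1,   0.23, "Ideal",
    2,   0.21, "Premium",
    3,   0.29, "Good"
)

# Three observations (rows) of three variables (columns)
toy_diamonds</pre>
</div>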
</section>
<section id="variation" data-type="sect1">
<h1>
Variation</h1>
<p><strong>Variation</strong> is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values, which you’ve learned about in <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a>.</p>
<p>We’ll start our exploration by visualizing the distribution of weights (<code>carat</code>) of ~54,000 diamonds from the <code>diamonds</code> dataset. Since <code>carat</code> is a numerical variable, we can use a histogram:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
</div>
</div>
<p>Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).</p>
<section id="typical-values" data-type="sect2">
<h2>
Typical values</h2>
<p>In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:</p>
<ul><li><p>Which values are the most common? Why?</p></li>
<li><p>Which values are rare? Why? Does that match your expectations?</p></li>
<li><p>Can you see any unusual patterns? What might explain them?</p></li>
</ul><p>As an example, the histogram below suggests several interesting questions:</p>
<ul><li><p>Why are there more diamonds at whole carats and common fractions of carats?</p></li>
<li><p>Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?</p></li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="r">smaller &lt;- diamonds |&gt;
  filter(carat &lt; 3)

ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.01)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-4-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak." width="576"/></p>
</div>
</div>
<p>Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:</p>
<ul><li><p>How are the observations within each cluster similar to each other?</p></li>
<li><p>How are the observations in separate clusters different from each other?</p></li>
<li><p>How can you explain or describe the clusters?</p></li>
<li><p>Why might the appearance of clusters be misleading?</p></li>
</ul><p>The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5." width="576"/></p>
</div>
</div>
<p>Many of the questions above will prompt you to explore a relationship <em>between</em> variables, for example, to see if the values of one variable can explain the behavior of another variable. Well get to that shortly.</p>
</section>
<section id="unusual-values" data-type="sect2">
<h2>
Unusual values</h2>
<p>Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the <code>y</code> variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
</div>
</div>
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p>
</div>
</div>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> also has an <code>xlim</code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
<p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">unusual &lt;- diamonds |&gt;
  filter(y &lt; 3 | y &gt; 20) |&gt;
  select(price, x, y, z) |&gt;
  arrange(y)
unusual
#&gt; # A tibble: 9 × 4
#&gt;   price     x     y     z
#&gt;   &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1  5139  0      0    0
#&gt; 2  6381  0      0    0
#&gt; 3 12800  0      0    0
#&gt; 4 15686  0      0    0
#&gt; 5 18034  0      0    0
#&gt; 6  2130  0      0    0
#&gt; 7  2130  0      0    0
#&gt; 8  2075  5.15  31.8  5.12
#&gt; 9 12210  8.09  58.9  8.06</pre>
</div>
<p>The <code>y</code> variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!</p>
<p>It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
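<p>One way to carry out that comparison, sketched here under the assumption that the cutoffs of 3 and 20 used in the filter above are a reasonable definition of “unusual”, is to compute the same summaries on the full data and on the data with those rows removed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Summaries of y using every row
diamonds |&gt;
  summarize(mean_y = mean(y), sd_y = sd(y), max_y = max(y))

# The same summaries after dropping the rows flagged as unusual
diamonds |&gt;
  filter(between(y, 3, 20)) |&gt;
  summarize(mean_y = mean(y), sd_y = sd(y), max_y = max(y))</pre>
</div>
<p>If the two sets of summaries are close, the outliers have little influence on this particular result; a large gap is a signal to investigate before dropping anything.</p>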
</section>
<section id="EDA-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
<li><p>Explore the distribution of <code>price</code>. Do you discover anything unusual or surprising? (Hint: Carefully think about the <code>binwidth</code> and make sure you try a wide range of values.)</p></li>
<li><p>How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try and zoom so only half a bar shows?</p></li>
</ol></section>
</section>
<section id="sec-missing-values-eda" data-type="sect1">
<h1>
Unusual values</h1>
<p>If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.</p>
<ol type="1"><li>
<p>Drop the entire row with the strange values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds2 &lt;- diamonds |&gt;
  filter(between(y, 3, 20))</pre>
</div>
<p>We don’t recommend this option: one invalid measurement doesn’t mean that all the measurements in the row are invalid. Additionally, if you have low-quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!</p>
</li>
<li>
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to replace the variable with a modified copy. You can use the <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> function to replace unusual values with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds2 &lt;- diamonds |&gt;
  mutate(y = if_else(y &lt; 3 | y &gt; 20, NA, y))</pre>
</div>
</li>
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three arguments. The first argument <code>test</code> should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another. You will learn more about logical vectors in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
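<p>As a sketch of the <code>case_when()</code> alternative, the same replacement can be written with one condition per line. This assumes dplyr 1.1.0 or later, where the <code>.default</code> argument is available; the name <code>diamonds3</code> is only illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

diamonds3 &lt;- diamonds |&gt;
  mutate(
    y = case_when(
      y &lt; 3  ~ NA,   # implausibly small
      y &gt; 20 ~ NA,   # implausibly large
      .default = y   # keep every other value unchanged
    )
  )

# Count how many values were replaced with NA
sum(is.na(diamonds3$y))</pre>
</div>
<p>Because the conditions match the filter used earlier, the count should agree with the nine unusual rows found there.</p>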
<p>It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds2, aes(x = x, y = y)) +
  geom_point()
#&gt; Warning: Removed 9 rows containing missing values (`geom_point()`).</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-13-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a strong, linear association between the two variables. All but one of the diamonds has length greater than 3. The one outlier has a length of 0 and a width of about 6.5." width="576"/></p>
</div>
</div>
<p>To suppress that warning, set <code>na.rm = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds2, aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)</pre>
</div>
<p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">nycflights13::flights |&gt;
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |&gt;
  ggplot(aes(x = sched_dep_time)) +
    geom_freqpoly(aes(color = cancelled), binwidth = 1/4)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-15-1.png" class="img-fluid" alt="A frequency polygon of scheduled departure times of flights. Two lines represent flights that are cancelled and not cancelled. The x-axis ranges from 0 to 25 hours and the y-axis ranges from 0 to 10000. The number of flights not cancelled is much higher than the number cancelled." width="576"/></p>
</div>
</div>
<p>However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>
<section id="EDA-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
<li><p>What does <code>na.rm = TRUE</code> do in <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> and <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>?</p></li>
</ol></section>
</section>
<section id="covariation" data-type="sect1">
<h1>
Covariation</h1>
<p>If variation describes the behavior <em>within</em> a variable, covariation describes the behavior <em>between</em> variables. <strong>Covariation</strong> is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables.</p>
<section id="sec-cat-num" data-type="sect2">
<h2>
A categorical and a numerical variable</h2>
<p>For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
</div>
</div>
<p>The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for this sort of comparison because the height is given by the count, and the overall counts differ so much across the levels of <code>cut</code> that it is hard to see the differences in the shapes of their distributions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal." width="576"/></p>
</div>
</div>
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
</div>
<p>Note that we’re mapping the density to <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="https://ggplot2.tidyverse.org/reference/aes_eval.html">after_stat()</a></code> function to do so.</p>
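<p>To check what <code>after_stat(density)</code> computes, one option is to pull the computed values out of the plot with ggplot2’s <code>layer_data()</code> and confirm that, for each group, density multiplied by the bin width sums to about one. This is only a sketch; the object name <code>p</code> is illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

p &lt;- ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_freqpoly(aes(color = cut), binwidth = 500)

# layer_data() returns the values computed by the stat for the first layer;
# there is one group per level of cut
layer_data(p) |&gt;
  group_by(group) |&gt;
  summarize(area = sum(density * 500))  # 500 is the binwidth used above</pre>
</div>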
<p>There’s something rather surprising about the density plot: it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret; there’s a lot going on in this plot.</p>
<p>A visually simpler plot for exploring this relationship is using side-by-side boxplots.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-19-1.png" class="img-fluid" alt="Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest." width="576"/></p>
</div>
</div>
<p>We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). They support the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you’ll be challenged to figure out why.</p>
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> function.</p>
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
</div>
</div>
<p>To make the trend easier to see, we can reorder <code>class</code> based on the median value of <code>hwy</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg,
       aes(x = fct_reorder(class, hwy, median), y = hwy)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p>
</div>
</div>
<p>If you have long variable names, <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg,
       aes(x = hwy, y = fct_reorder(class, hwy, median))) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the y-axis and ordered by increasing median highway mileage." width="576"/></p>
</div>
</div>
<section id="EDA-exercises-2" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
<li><p>What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?</p></li>
<li><p>Instead of exchanging the x and y variables, add <code><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html">coord_flip()</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li>
<li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_violin.html">geom_violin()</a></code> with a faceted <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, or a colored <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>. What are the pros and cons of each method?</p></li>
<li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>. List them and briefly describe what each one does.</p></li>
</ol></section>
</section>
<section id="EDA-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000." width="576"/></p>
</div>
</div>
<p>The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.</p>
<p>Another approach for exploring the relationship between these variables is computing the counts with dplyr:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  count(color, cut)
#&gt; # A tibble: 35 × 3
#&gt; color cut n
#&gt; &lt;ord&gt; &lt;ord&gt; &lt;int&gt;
#&gt; 1 D Fair 163
#&gt; 2 D Good 662
#&gt; 3 D Very Good 1513
#&gt; 4 D Premium 1603
#&gt; 5 D Ideal 2834
#&gt; 6 E Fair 224
#&gt; # … with 29 more rows</pre>
</div>
<p>Then visualize with <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> and the fill aesthetic:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  count(color, cut) |&gt;
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="A tile plot of cut vs. color of diamonds. Each tile represents a cut/color combination and tiles are colored according to the number of observations in each tile. There are more Ideal diamonds than other cuts, with the highest number being Ideal diamonds with color G. Fair diamonds and diamonds with color I are the lowest in frequency." width="576"/></p>
</div>
</div>
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>
<section id="EDA-exercises-3" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
<li><p>How does the segmented bar chart change if color is mapped to the <code>x</code> aesthetic and <code>cut</code> is mapped to the <code>fill</code> aesthetic? Calculate the counts that fall into each of the segments.</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?</p></li>
<li><p>Why is it slightly better to use <code>aes(x = color, y = cut)</code> rather than <code>aes(x = cut, y = color)</code> in the example above?</p></li>
</ol></section>
</section>
<section id="EDA-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential." width="576"/></p>
</div>
</div>
<p>Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the <code>alpha</code> aesthetic to add transparency.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 1 / 100)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than in other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
</div>
</div>
<p>But using transparency can be challenging for very large datasets. Another solution is to use bins. Previously you used <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> to bin in two dimensions.</p>
<p><code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> creates rectangular bins. <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code>.</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(smaller, aes(x = carat, y = price)) +
  geom_bin2d()

# install.packages("hexbin")
ggplot(smaller, aes(x = carat, y = price)) +
  geom_hex()</pre>
<div class="cell quarto-layout-panel">
</div>
</div>
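<p>Bin size has a large effect on what a binned plot reveals, so it is worth trying a few values. The following sketch assumes that <code>smaller</code> is the subset of <code>diamonds</code> with <code>carat &lt; 3</code> defined earlier in the chapter; the <code>binwidth</code> values are arbitrary choices for illustration, not recommendations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Assumption: `smaller` matches the subset defined earlier in the chapter
smaller &lt;- diamonds |&gt;
  filter(carat &lt; 3)

# binwidth takes one width per axis: 0.1 carats wide, 500 dollars tall
ggplot(smaller, aes(x = carat, y = price)) +
  geom_bin2d(binwidth = c(0.1, 500))</pre>
</div>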
<p>Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin <code>carat</code> and then for each group, display a boxplot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-29-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
</div>
</div>
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p>
<p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>:</p>
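<p>As a minimal sketch of the <code>varwidth</code> option, again assuming that <code>smaller</code> is the <code>carat &lt; 3</code> subset of <code>diamonds</code> defined earlier in the chapter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Assumption: `smaller` matches the subset defined earlier in the chapter
smaller &lt;- diamonds |&gt;
  filter(carat &lt; 3)

# Wider boxes summarize more diamonds; narrow boxes summarize few
ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)), varwidth = TRUE)</pre>
</div>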
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-30-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. The diamonds are divided into 20 bins, each containing approximately the same number of diamonds. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
</div>
</div>
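<p>To see the difference between the two binning functions without a plot, you can count the observations in each bin. This is a sketch with a deliberately small number of bins (the bin width of 0.5 and the 5 bins are arbitrary choices), and it assumes the same <code>smaller</code> subset as before:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Assumption: `smaller` matches the subset defined earlier in the chapter
smaller &lt;- diamonds |&gt;
  filter(carat &lt; 3)

# Equal-width bins: every bin spans 0.5 carats, but counts differ a lot
by_width &lt;- smaller |&gt;
  count(bin = cut_width(carat, 0.5))
by_width

# Equal-count bins: counts are roughly equal, but bin widths differ
by_number &lt;- smaller |&gt;
  count(bin = cut_number(carat, 5))
by_number</pre>
</div>
<p>Because many diamonds share the same carat value, <code>cut_number()</code> can only make the counts approximately equal.</p>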
<section id="EDA-exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
<li><p>Visualize the distribution of <code>carat</code>, partitioned by <code>price</code>.</p></li>
<li><p>How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?</p></li>
<li><p>Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.</p></li>
<li>
<p>Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of <code>x</code> and <code>y</code> values, which makes the points outliers even though their <code>x</code> and <code>y</code> values appear normal when examined separately.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-31-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a positive, strong, linear relationship. There are a few unusual observations above and below the bulk of the data, more below it than above." width="576"/></p>
</div>
</div>
<p>Why is a scatterplot a better display than a binned plot for this case?</p>
</li>
</ol></section>
</section>
</section>
<section id="patterns-and-models" data-type="sect1">
<h1>
Patterns and models</h1>
<p>Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:</p>
<ul><li><p>Could this pattern be due to coincidence (i.e. random chance)?</p></li>
<li><p>How can you describe the relationship implied by the pattern?</p></li>
<li><p>How strong is the relationship implied by the pattern?</p></li>
<li><p>What other variables might affect the relationship?</p></li>
<li><p>Does the relationship change if you look at individual subgroups of the data?</p></li>
</ul><p>A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-32-1.png" class="img-fluid" alt="A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times." width="576"/></p>
</div>
</div>
<p>Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.</p>
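<p>One way to make “covariation reduces uncertainty” concrete is to compare how much <code>waiting</code> varies on its own with how much it varies once <code>eruptions</code> is taken into account. The sketch below uses a straight-line fit from base R purely for illustration; the names <code>faithful_fit</code>, <code>spread_overall</code>, and <code>spread_given_eruptions</code> are made up for this example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># How strongly do the two variables move together?
cor(faithful$eruptions, faithful$waiting)

# Spread of waiting times when you know nothing else
spread_overall &lt;- sd(faithful$waiting)

# Spread that remains after predicting waiting from eruptions
faithful_fit &lt;- lm(waiting ~ eruptions, data = faithful)
spread_given_eruptions &lt;- sd(resid(faithful_fit))

c(overall = spread_overall, given_eruptions = spread_given_eruptions)</pre>
</div>
<p>The second number is smaller than the first: knowing the eruption length narrows down the plausible waiting times.</p>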
<p>Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts <code>price</code> from <code>carat</code> and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of <code>price</code> and <code>carat</code>, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidymodels)

# Work on the log scale, where the relationship is close to linear
diamonds &lt;- diamonds |&gt;
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  )

# Fit a linear model predicting log(price) from log(carat)
diamonds_fit &lt;- linear_reg() |&gt;
  fit(log_price ~ log_carat, data = diamonds)

# Add residuals, then exponentiate them back to the raw price scale
diamonds_aug &lt;- augment(diamonds_fit, new_data = diamonds) |&gt;
  mutate(.resid = exp(.resid))

ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-33-1.png" class="img-fluid" alt="A scatter plot of residuals vs. carat of diamonds. The x-axis ranges from 0 to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered around low values of carat and residuals. There is a clear, curved pattern showing decrease in residuals as carat increases." width="576"/></p>
</div>
</div>
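<p>If it helps to see what the exponentiated residuals mean, here is a rough base R equivalent. It is a sketch, not the chapter’s method: it assumes the tidymodels default <code>"lm"</code> engine, and the names <code>carat_fit</code>, <code>diamonds_resid</code>, <code>log_resid</code>, and <code>ratio</code> are invented for the example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Same model as above, fitted directly with lm()
carat_fit &lt;- lm(log(price) ~ log(carat), data = diamonds)

diamonds_resid &lt;- diamonds |&gt;
  mutate(
    # Residual on the log scale: observed log(price) minus predicted
    log_resid = resid(carat_fit),
    # exp() turns that difference into observed price / predicted price
    ratio = exp(log_resid)
  )

diamonds_resid |&gt;
  select(carat, price, ratio)</pre>
</div>
<p>A <code>ratio</code> of 1 means the diamond costs exactly what the model predicts for its weight; values above 1 mean it is more expensive than predicted, and values below 1 mean it is cheaper.</p>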
<p>Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds_aug, aes(x = cut, y = .resid)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-34-1.png" class="img-fluid" alt="Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end." width="576"/></p>
</div>
</div>
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
</section>
<section id="EDA-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.</p>
<p>In the next chapter, we’ll tackle our final piece of workflow advice: how to get help when you’re stuck.</p>
</section>
</section>


@ -1,283 +0,0 @@
<section data-type="chapter" id="chp-arrow">
<h1><span id="sec-arrow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Arrow</span></span></h1>
<section id="arrow-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>CSV files are designed to be easily read by humans. They’re a good interchange format because they’re very simple and they can be read by every tool under the sun. But CSV files aren’t very efficient: you have to do quite a lot of work to read the data into R. In this chapter, you’ll learn about a powerful alternative: the <a href="https://parquet.apache.org/">parquet format</a>, an open standards-based format widely used by big data systems.</p>
<p>We’ll pair parquet files with <a href="https://arrow.apache.org">Apache Arrow</a>, a multi-language toolbox designed for efficient analysis and transport of large data sets. We’ll use Apache Arrow via the <a href="https://arrow.apache.org/docs/r/">arrow package</a>, which provides a dplyr backend allowing you to analyze larger-than-memory datasets using familiar dplyr syntax. As an additional benefit, arrow is extremely fast: you’ll see some examples later in the chapter.</p>
<p>Both arrow and dbplyr provide dplyr backends, so you might wonder when to use each. In many cases, the choice is made for you, because the data is already in a database or in parquet files, and you’ll want to work with it as is. But if you’re starting with your own data (perhaps CSV files), you can either load it into a database or convert it to parquet. In general, it’s hard to know what will work best, so in the early stages of your analysis we’d encourage you to try both and pick the one that works the best for you.</p>
<section id="arrow-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll continue to use the tidyverse, particularly dplyr, but we’ll pair it with the arrow package which is designed specifically for working with large data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(arrow)</pre>
</div>
<p>Later in the chapter, we’ll also see some connections between arrow and duckdb, so we’ll also need dbplyr and duckdb.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dbplyr, warn.conflicts = FALSE)
library(duckdb)
#&gt; Loading required package: DBI</pre>
</div>
</section>
</section>
<section id="getting-the-data" data-type="sect1">
<h1>
Getting the data</h1>
<p>We begin by getting a dataset worthy of these tools: a data set of item checkouts from Seattle public libraries, available online at <a href="https://data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6">data.seattle.gov/Community/Checkouts-by-Title/tmmm-ytt6</a>. This dataset contains 41,389,465 rows that tell you how many times each book was checked out each month from April 2005 to October 2022.</p>
<p>The following code will get you a cached copy of the data. The data is a 9GB CSV file, so it will take some time to download: simply getting the data is often the first challenge!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dir.create("data", showWarnings = FALSE)
url &lt;- "https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv"
# Default timeout is 60s; bump it up to an hour
options(timeout = 60 * 60)
download.file(url, "data/seattle-library-checkouts.csv")</pre>
</div>
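<p>Because the download is so large, you may want to avoid repeating it by accident. The helper below is a hypothetical sketch, not part of the original workflow: <code>download_if_missing()</code> and its arguments are names invented for this example. It only checks whether the destination file exists, so it cannot tell a complete file from a partial one left behind by an interrupted download.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: download `url` to `dest` only if `dest` is missing.
# Returns TRUE (invisibly) if a download happened, FALSE otherwise.
download_if_missing &lt;- function(url, dest, timeout = 60 * 60) {
  if (file.exists(dest)) {
    return(invisible(FALSE))
  }
  dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)

  # Raise the timeout for this call only, then restore the old value
  old &lt;- options(timeout = timeout)
  on.exit(options(old), add = TRUE)

  download.file(url, dest, mode = "wb")
  invisible(TRUE)
}

# Usage, with the URL from the code above:
# download_if_missing(url, "data/seattle-library-checkouts.csv")</pre>
</div>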
</section>
<section id="opening-a-dataset" data-type="sect1">
<h1>
Opening a dataset</h1>
<p>Let’s start by taking a look at the data. At 9GB, this file is large enough that we probably don’t want to load the whole thing into memory. A good rule of thumb is that you usually want at least twice as much memory as the size of the data, and many laptops top out at 16 GB. This means we want to avoid <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and instead use <code><a href="https://arrow.apache.org/docs/r/reference/open_dataset.html">arrow::open_dataset()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># partial schema for ISBN column only
opts &lt;- CsvConvertOptions$create(col_types = schema(ISBN = string()))
seattle_csv &lt;- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv",
  convert_options = opts
)</pre>
</div>
<p>(Here we’ve had to use some relatively advanced code to parse the ISBN variable correctly: this is because the first ~83,000 rows don’t contain any data so arrow guesses the wrong types. The arrow team is aware of this problem and there will hopefully be a better approach by the time you read this chapter.)</p>
<p>What happens when this code is run? <code><a href="https://arrow.apache.org/docs/r/reference/open_dataset.html">open_dataset()</a></code> will scan a few thousand rows to figure out the structure of the data set. Then it records what it’s found and stops; it will only read further rows as you specifically request them. This metadata is what we see if we print <code>seattle_csv</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seattle_csv
#&gt; FileSystemDataset with 1 csv file
#&gt; UsageClass: string
#&gt; CheckoutType: string
#&gt; MaterialType: string
#&gt; CheckoutYear: int64
#&gt; CheckoutMonth: int64
#&gt; Checkouts: int64
#&gt; Title: string
#&gt; ISBN: string
#&gt; Creator: string
#&gt; Subjects: string
#&gt; Publisher: string
#&gt; PublicationYear: string</pre>
</div>
<p>The first line in the output tells you that <code>seattle_csv</code> is stored locally on-disk as a single CSV file; it will only be loaded into memory as needed. The remainder of the output tells you the column type that arrow has inferred for each column.</p>
<p>We can see what’s actually in the data with <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>. This reveals that there are ~41 million rows and 12 columns, and shows us a few values.</p>
<div class="cell" data-hash="arrow_cache/html/glimpse-data_07c924738790eb185ebdd8973443e90d">
<pre data-type="programlisting" data-code-language="r">seattle_csv |&gt; glimpse()
#&gt; FileSystemDataset with 1 csv file
#&gt; 41,389,465 rows x 12 columns
#&gt; $ UsageClass &lt;string&gt; "Physical", "Physical", "Digital", "Physical", "Ph…
#&gt; $ CheckoutType &lt;string&gt; "Horizon", "Horizon", "OverDrive", "Horizon", "Hor…
#&gt; $ MaterialType &lt;string&gt; "BOOK", "BOOK", "EBOOK", "BOOK", "SOUNDDISC", "BOO…
#&gt; $ CheckoutYear &lt;int64&gt; 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
#&gt; $ CheckoutMonth &lt;int64&gt; 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
#&gt; $ Checkouts &lt;int64&gt; 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 2, 3, 2, 1, 3, 2,…
#&gt; $ Title &lt;string&gt; "Super rich : a guide to having it all / Russell S…
#&gt; $ ISBN &lt;string&gt; "", "", "", "", "", "", "", "", "", "", "", "", ""…
#&gt; $ Creator &lt;string&gt; "Simmons, Russell", "Barclay, James, 1965-", "Tim …
#&gt; $ Subjects &lt;string&gt; "Self realization, Conduct of life, Attitude Psych…
#&gt; $ Publisher &lt;string&gt; "Gotham Books,", "Pyr,", "Random House, Inc.", "Di…
#&gt; $ PublicationYear &lt;string&gt; "c2011.", "2010.", "2015", "2005.", "c2004.", "c20…</pre>
</div>
<p>We can start to use this dataset with dplyr verbs, using <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> to force arrow to perform the computation and return some data. For example, this code tells us the total number of checkouts per year:</p>
<div class="cell" data-hash="arrow_cache/html/unnamed-chunk-5_7a5e1ce0bed4d69e849dff75d0c0d8d3">
<pre data-type="programlisting" data-code-language="r">seattle_csv |&gt;
  count(CheckoutYear, wt = Checkouts) |&gt;
  arrange(CheckoutYear) |&gt;
  collect()
#&gt; # A tibble: 18 × 2
#&gt; CheckoutYear n
#&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2005 3798685
#&gt; 2 2006 6599318
#&gt; 3 2007 7126627
#&gt; 4 2008 8438486
#&gt; 5 2009 9135167
#&gt; 6 2010 8608966
#&gt; # … with 12 more rows</pre>
</div>
<p>Thanks to arrow, this code will work regardless of how large the underlying dataset is. But it’s currently rather slow: on Hadley’s computer, it took ~10s to run. That’s not terrible given how much data we have, but we can make it much faster by switching to a better format.</p>
</section>
<section id="the-parquet-format" data-type="sect1">
<h1>
The parquet format</h1>
<p>To make this data easier to work with, let’s switch to the parquet file format and split it up into multiple files. The following sections will first introduce you to parquet and partitioning, and then apply what we learned to the Seattle library data.</p>
<section id="advantages-of-parquet" data-type="sect2">
<h2>
Advantages of parquet</h2>
<p>Like CSV, parquet is used for rectangular data, but instead of being a text format that you can read with any file editor, it’s a custom binary format designed specifically for the needs of big data. This means that:</p>
<ul><li><p>Parquet files are usually smaller than the equivalent CSV file. Parquet relies on <a href="https://parquet.apache.org/docs/file-format/data-pages/encodings/">efficient encodings</a> to keep file size down, and supports file compression. This helps make parquet files fast because there’s less data to move from disk to memory.</p></li>
<li><p>Parquet files have a rich type system. As we talked about in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>, a CSV file does not provide any information about column types. For example, a CSV reader has to guess whether <code>"08-10-2022"</code> should be parsed as a string or a date. In contrast, parquet files store data in a way that records the type along with the data.</p></li>
<li><p>Parquet files are “column-oriented”. This means that they’re organised column-by-column, much like R’s data frame. This typically leads to better performance for data analysis tasks compared to CSV files, which are organised row-by-row.</p></li>
<li><p>Parquet files are “chunked”, which makes it possible to work on different parts of the file at the same time, and, if you’re lucky, to skip some chunks altogether.</p></li>
</ul></section>
<section id="partitioning" data-type="sect2">
<h2>
Partitioning</h2>
<p>As datasets get larger and larger, storing all the data in a single file gets increasingly painful and it’s often useful to split large datasets across many files. When this structuring is done intelligently, this strategy can lead to significant improvements in performance because many analyses will only require a subset of the files.</p>
<p>There are no hard and fast rules about how to partition your data set: the results will depend on your data, access patterns, and the systems that read the data. You’re likely to need to do some experimentation before you find the ideal partitioning for your situation. As a rough guide, arrow suggests that you avoid files smaller than 20MB and larger than 2GB and avoid partitions that produce more than 10,000 files. You should also try to partition by variables that you filter by; as you’ll see shortly, that allows arrow to skip a lot of work by reading only the relevant files.</p>
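<p>While experimenting, it can help to check a candidate partitioning against these rough guidelines. The function below is a hypothetical sketch: <code>check_partition_sizes()</code>, its arguments, and its column names are invented for this example, and the default thresholds simply restate the 20MB and 2GB guidance above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical helper: list every file under `path` and flag the ones
# that fall outside the rough size guidance
check_partition_sizes &lt;- function(path, min_mb = 20, max_mb = 2048) {
  files &lt;- list.files(path, recursive = TRUE)
  tibble(
    file = files,
    size_mb = file.size(file.path(path, files)) / 1024^2,
    too_small = size_mb &lt; min_mb,
    too_large = size_mb &gt; max_mb
  )
}

# Usage, once a partitioned dataset has been written to a directory:
# check_partition_sizes("data/seattle-library-checkouts") |&gt;
#   count(too_small, too_large)</pre>
</div>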
</section>
<section id="rewriting-the-seattle-library-data" data-type="sect2">
<h2>
Rewriting the Seattle library data</h2>
<p>Let’s apply these ideas to the Seattle library data to see how they play out in practice. We’re going to partition by <code>CheckoutYear</code>, since it’s likely some analyses will only want to look at recent data and partitioning by year yields 18 chunks of a reasonable size.</p>
<p>To rewrite the data we define the partition using <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code> and then save the partitions to a directory with <code><a href="https://arrow.apache.org/docs/r/reference/write_dataset.html">arrow::write_dataset()</a></code>. <code><a href="https://arrow.apache.org/docs/r/reference/write_dataset.html">write_dataset()</a></code> has two important arguments: a directory where we’ll create the files and the format we’ll use.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">pq_path &lt;- "data/seattle-library-checkouts"</pre>
</div>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seattle_csv |&gt;
  group_by(CheckoutYear) |&gt;
  write_dataset(path = pq_path, format = "parquet")</pre>
</div>
<p>This takes about a minute to run; as we’ll see shortly this is an initial investment that pays off by making future operations much much faster.</p>
<p>Let’s take a look at what we just produced:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tibble(
  files = list.files(pq_path, recursive = TRUE),
  size_MB = file.size(file.path(pq_path, files)) / 1024^2
)
#&gt; # A tibble: 18 × 2
#&gt; files size_MB
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 CheckoutYear=2005/part-0.parquet 109.
#&gt; 2 CheckoutYear=2006/part-0.parquet 164.
#&gt; 3 CheckoutYear=2007/part-0.parquet 178.
#&gt; 4 CheckoutYear=2008/part-0.parquet 195.
#&gt; 5 CheckoutYear=2009/part-0.parquet 214.
#&gt; 6 CheckoutYear=2010/part-0.parquet 222.
#&gt; # … with 12 more rows</pre>
</div>
<p>Our single 9GB CSV file has been rewritten into 18 parquet files. The file names use a “self-describing” convention used by the <a href="https://hive.apache.org">Apache Hive</a> project. Hive-style partitions name folders with a “key=value” convention, so as you might guess, the <code>CheckoutYear=2005</code> directory contains all the data where <code>CheckoutYear</code> is 2005. Each file is between 100 and 300 MB and the total size is now around 4 GB, a little under half the size of the original CSV file. This is as we expect since parquet is a much more efficient format.</p>
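<p>If you want to try the same steps without the 9GB download, the sketch below repeats them on the small built-in <code>mtcars</code> data, partitioned by <code>cyl</code>. The dataset, the temporary directory, and the name <code>demo_path</code> are choices made for this example only.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(arrow)

demo_path &lt;- file.path(tempdir(), "mtcars-by-cyl")

# The grouping variable becomes the partitioning variable
mtcars |&gt;
  group_by(cyl) |&gt;
  write_dataset(path = demo_path, format = "parquet")

# One "key=value" directory per distinct value of cyl
demo_files &lt;- list.files(demo_path, recursive = TRUE)
demo_files

# Reading the directory back recovers cyl from the directory names
four_cyl &lt;- open_dataset(demo_path) |&gt;
  filter(cyl == 4) |&gt;
  collect()
four_cyl</pre>
</div>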
</section>
</section>
<section id="using-dplyr-with-arrow" data-type="sect1">
<h1>
Using dplyr with arrow</h1>
<p>Now we’ve created these parquet files, we’ll need to read them in again. We use <code><a href="https://arrow.apache.org/docs/r/reference/open_dataset.html">open_dataset()</a></code> again, but this time we give it a directory:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seattle_pq &lt;- open_dataset(pq_path)</pre>
</div>
<p>Now we can write our dplyr pipeline. For example, we could count the total number of books checked out in each month for the last five years:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">query &lt;- seattle_pq |&gt;
  filter(CheckoutYear &gt;= 2018, MaterialType == "BOOK") |&gt;
  group_by(CheckoutYear, CheckoutMonth) |&gt;
  summarise(TotalCheckouts = sum(Checkouts)) |&gt;
  arrange(CheckoutYear, CheckoutMonth)</pre>
</div>
<p>Writing dplyr code for arrow data is conceptually similar to dbplyr, <a href="#chp-databases" data-type="xref">#chp-databases</a>: you write dplyr code, which is automatically transformed into a query that the Apache Arrow C++ library understands, which is then executed when you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. If we print out the <code>query</code> object we can see a little information about what we expect Arrow to return when the execution takes place:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">query
#&gt; FileSystemDataset (query)
#&gt; CheckoutYear: int32
#&gt; CheckoutMonth: int64
#&gt; TotalCheckouts: int64
#&gt;
#&gt; * Grouped by CheckoutYear
#&gt; * Sorted by CheckoutYear [asc], CheckoutMonth [asc]
#&gt; See $.data for the source Arrow object</pre>
</div>
<p>And we can get the results by calling <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">query |&gt; collect()
#&gt; # A tibble: 58 × 3
#&gt; # Groups: CheckoutYear [5]
#&gt; CheckoutYear CheckoutMonth TotalCheckouts
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2018 1 355101
#&gt; 2 2018 2 309813
#&gt; 3 2018 3 344487
#&gt; 4 2018 4 330988
#&gt; 5 2018 5 318049
#&gt; 6 2018 6 341825
#&gt; # … with 52 more rows</pre>
</div>
<p>Like dbplyr, arrow only understands some R expressions, so you may not be able to write exactly the same code you usually would. However, the list of operations and functions supported is fairly extensive and continues to grow; find a complete list of currently supported functions in <code><a href="https://arrow.apache.org/docs/r/reference/acero.html">?acero</a></code>.</p>
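<p>When you hit an expression that arrow does not know how to translate, one common workaround is to do as much filtering and selecting as you can in arrow, call <code>collect()</code>, and finish the work in ordinary R. The sketch below illustrates the pattern on the small built-in <code>mtcars</code> data; <code>label_efficiency()</code> is a made-up function standing in for any R code of your own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(arrow)

# Hypothetical R function, standing in for code arrow cannot translate
label_efficiency &lt;- function(mpg) {
  ifelse(mpg &gt; 25, "efficient", "thirsty")
}

result &lt;- arrow_table(mtcars) |&gt;
  filter(cyl == 4) |&gt;   # evaluated by arrow
  select(mpg, wt) |&gt;    # evaluated by arrow
  collect() |&gt;          # the reduced data comes back to R as a tibble
  mutate(label = label_efficiency(mpg)) # ordinary dplyr from here on

result</pre>
</div>
<p>The earlier you can shrink the data inside arrow, the less has to be pulled into memory by <code>collect()</code>.</p>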
<section id="sec-parquet-fast" data-type="sect2">
<h2>
Performance</h2>
<p>Let’s take a quick look at the performance impact of switching from CSV to parquet. First, let’s time how long it takes to calculate the number of books checked out in each month of 2021, when the data is stored as a single large csv:</p>
<div class="cell" data-hash="arrow_cache/html/dataset-performance-csv_4d24d09e336fc39a348b5ad94f60f527">
<pre data-type="programlisting" data-code-language="r">seattle_csv |&gt;
filter(CheckoutYear == 2021, MaterialType == "BOOK") |&gt;
group_by(CheckoutMonth) |&gt;
summarise(TotalCheckouts = sum(Checkouts)) |&gt;
arrange(desc(CheckoutMonth)) |&gt;
collect() |&gt;
system.time()
#&gt; user system elapsed
#&gt; 11.980 0.924 11.350</pre>
</div>
<p>Now let’s use our new version of the data set in which the Seattle library checkout data has been partitioned into 18 smaller parquet files:</p>
<div class="cell" data-hash="arrow_cache/html/dataset-performance-multiple-parquet_ad546f5d817df3ad4bdb238240b808d3">
<pre data-type="programlisting" data-code-language="r">seattle_pq |&gt;
filter(CheckoutYear == 2021, MaterialType == "BOOK") |&gt;
group_by(CheckoutMonth) |&gt;
summarise(TotalCheckouts = sum(Checkouts)) |&gt;
arrange(desc(CheckoutMonth)) |&gt;
collect() |&gt;
system.time()
#&gt; user system elapsed
#&gt; 0.273 0.045 0.055</pre>
</div>
<p>The roughly 200x speedup in elapsed time (11.35 versus 0.055 seconds) is attributable to two factors: the multi-file partitioning and the format of individual files:</p>
<ul><li>Partitioning improves performance because this query uses <code>CheckoutYear == 2021</code> to filter the data, and arrow is smart enough to recognize that it only needs to read 1 of the 18 parquet files.</li>
<li>The parquet format improves performance by storing data in a binary format that can be read more directly into memory. The column-wise format and rich metadata mean that arrow only needs to read the four columns actually used in the query (<code>CheckoutYear</code>, <code>MaterialType</code>, <code>CheckoutMonth</code>, and <code>Checkouts</code>).</li>
</ul><p>This massive difference in performance is why it pays off to convert large CSVs to parquet!</p>
</section>
<section id="using-dbplyr-with-arrow" data-type="sect2">
<h2>
Using dbplyr with arrow</h2>
<p>There’s one last advantage of parquet and arrow: it’s very easy to turn an arrow dataset into a duckdb datasource by calling <code><a href="https://arrow.apache.org/docs/r/reference/to_duckdb.html">arrow::to_duckdb()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seattle_pq |&gt;
to_duckdb() |&gt;
filter(CheckoutYear &gt;= 2018, MaterialType == "BOOK") |&gt;
group_by(CheckoutYear) |&gt;
summarise(TotalCheckouts = sum(Checkouts)) |&gt;
arrange(desc(CheckoutYear)) |&gt;
collect()
#&gt; Warning: Missing values are always removed in SQL aggregation functions.
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # A tibble: 5 × 2
#&gt; CheckoutYear TotalCheckouts
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2022 2431502
#&gt; 2 2021 2266438
#&gt; 3 2020 1241999
#&gt; 4 2019 3931688
#&gt; 5 2018 3987569</pre>
</div>
<p>The neat thing about <code><a href="https://arrow.apache.org/docs/r/reference/to_duckdb.html">to_duckdb()</a></code> is that the transfer doesn’t involve any memory copying, and speaks to the goals of the arrow ecosystem: enabling seamless transitions from one computing environment to another.</p>
</section>
</section>
<section id="arrow-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve been given a taste of the arrow package, which provides a dplyr backend for working with large on-disk datasets. It can work with CSV files, but it’s much, much faster if you convert your data to parquet. Parquet is a binary data format that’s designed specifically for data analysis on modern computers. Far fewer tools can work with parquet files compared to CSV, but its partitioned, compressed, and columnar structure makes it much more efficient to analyze.</p>
<p>Next up you’ll learn about your first non-rectangular data source, which you’ll handle using tools provided by the tidyr package. We’ll focus on data that comes from JSON files, but the general principles apply to tree-like data regardless of its source.</p>
</section>
</section>


@ -1,526 +0,0 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and <code>for</code> loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
<section id="base-R-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
<section id="sec-subset-many" data-type="sect1">
<h1>
Selecting multiple elements with [
</h1>
<p><code>[</code> is used to extract sub-components from vectors and data frames, and is called like <code>x[i]</code> or <code>x[i, j]</code>. In this section, we’ll introduce you to the power of <code>[</code>, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of <code>[</code>.</p>
<section id="subsetting-vectors" data-type="sect2">
<h2>
Subsetting vectors</h2>
<p>There are five main types of things that you can subset a vector with, i.e. that can be the <code>i</code> in <code>x[i]</code>:</p>
<ol type="1"><li>
<p><strong>A vector of positive integers</strong>. Subsetting with positive integers keeps the elements at those positions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
#&gt; [1] "three" "two" "five"</pre>
</div>
<p>By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x[c(1, 1, 5, 5, 5, 2)]
#&gt; [1] "one" "one" "five" "five" "five" "two"</pre>
</div>
</li>
<li>
<p><strong>A vector of negative integers</strong>. Negative values drop the elements at the specified positions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x[c(-1, -3, -5)]
#&gt; [1] "two" "four"</pre>
</div>
</li>
<li>
<p><strong>A logical vector</strong>. Subsetting with a logical vector keeps all values corresponding to a <code>TRUE</code> value. This is most often useful in conjunction with the comparison functions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]
#&gt; [1] 10 3 5 8 1
# All even (or missing!) values of x
x[x %% 2 == 0]
#&gt; [1] 10 NA 8 NA</pre>
</div>
<p>Note that, unlike <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, <code>NA</code> indices will be included in the output as <code>NA</code>s.</p>
</li>
<li>
<p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
#&gt; xyz def
#&gt; 5 2</pre>
</div>
<p>As with subsetting with positive integers, you can use a character vector to duplicate individual entries.</p>
</li>
<li><p><strong>Nothing</strong>. The final type of subsetting is nothing, <code>x[]</code>, which returns the complete <code>x</code>. This is not useful for subsetting vectors, but as we’ll see shortly it is useful when subsetting 2d structures like tibbles.</p></li>
</ol></section>
<section id="subsetting-data-frames" data-type="sect2">
<h2>
Subsetting data frames</h2>
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to select rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
<p>Here are a couple of examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = 1:3,
y = c("a", "e", "f"),
z = runif(3)
)
# Select first row and second column
df[1, 2]
#&gt; # A tibble: 1 × 1
#&gt; y
#&gt; &lt;chr&gt;
#&gt; 1 a
# Select all rows and columns x and y
df[, c("x" , "y")]
#&gt; # A tibble: 3 × 2
#&gt; x y
#&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1 a
#&gt; 2 2 e
#&gt; 3 3 f
# Select rows where `x` is greater than 1 and all columns
df[df$x &gt; 1, ]
#&gt; # A tibble: 2 × 3
#&gt; x y z
#&gt; &lt;int&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 e 0.834
#&gt; 2 3 f 0.601</pre>
</div>
<p>We’ll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn’t use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>. If <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1 &lt;- data.frame(x = 1:3)
df1[, "x"]
#&gt; [1] 1 2 3
df2 &lt;- tibble(x = 1:3)
df2[, "x"]
#&gt; # A tibble: 3 × 1
#&gt; x
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3</pre>
</div>
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1[, "x" , drop = FALSE]
#&gt; x
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3</pre>
</div>
</section>
<section id="dplyr-equivalents" data-type="sect2">
<h2>
dplyr equivalents</h2>
<p>A number of dplyr verbs are special cases of <code>[</code>:</p>
<ul><li>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = c(2, 3, 1, 1, NA),
y = letters[1:5],
z = runif(5)
)
df |&gt; filter(x &gt; 1)
# same as
df[!is.na(df$x) &amp; df$x &gt; 1, ]</pre>
</div>
<p>Another common technique in the wild is to use <code><a href="https://rdrr.io/r/base/which.html">which()</a></code> for its side-effect of dropping missing values: <code>df[which(df$x &gt; 1), ]</code>.</p>
</li>
<li>
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> is equivalent to subsetting the rows with an integer vector, usually created with <code><a href="https://rdrr.io/r/base/order.html">order()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; arrange(x, y)
# same as
df[order(df$x, df$y), ]</pre>
</div>
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to individually sort columns in decreasing order.</p>
</li>
<li>
<p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; select(x, z)
# same as
df[, c("x", "z")]</pre>
</div>
</li>
</ul><p>Base R also provides a function that combines the features of <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code>.</span> called <code><a href="https://rdrr.io/r/base/subset.html">subset()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
filter(x &gt; 1) |&gt;
select(y, z)
#&gt; # A tibble: 2 × 2
#&gt; y z
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 a 0.157
#&gt; 2 b 0.00740
# same as
df |&gt; subset(x &gt; 1, c(y, z))
#&gt; # A tibble: 2 × 2
#&gt; y z
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 a 0.157
#&gt; 2 b 0.00740</pre>
</div>
<p>This function was the inspiration for much of dplyr’s syntax.</p>
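<p>The <code>which()</code> and <code>-rank()</code> techniques mentioned in the list above are easiest to see on a small example. The following is a minimal sketch rather than a canonical recipe; it assumes a made-up base <code>data.frame</code> so that it runs without any packages loaded.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- data.frame(
  x = c(2, 3, 1, 1, NA),
  y = c("a", "b", "c", "d", "e")
)

# which() returns the positions of the TRUE values and silently drops NAs,
# so the row where x is missing is not selected
big &lt;- df[which(df$x &gt; 1), ]
big$y
#&gt; [1] "a" "b"

# Sort by x descending, breaking ties by y ascending: -rank() reverses the
# order of one column only; missing values still sort last
sorted &lt;- df[order(-rank(df$x), df$y), ]
sorted$y
#&gt; [1] "b" "a" "c" "d" "e"</pre>
</div>
<p>Because <code>rank()</code> also works on character vectors, the same trick can reverse the order of a non-numeric column, where a plain minus sign would fail.</p>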
</section>
<section id="base-R-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Create functions that take a vector as input and return:</p>
<ol type="a"><li>The elements at even numbered positions.</li>
<li>Every element except the last value.</li>
<li>Only even values (and no missing values).</li>
</ol></li>
<li><p>Why is <code>x[-which(x &gt; 0)]</code> not the same as <code>x[x &lt;= 0]</code>? Read the documentation for <code><a href="https://rdrr.io/r/base/which.html">which()</a></code> and do some experiments to figure it out.</p></li>
</ol></section>
</section>
<section id="sec-subset-one" data-type="sect1">
<h1>
Selecting a single element with $ and [[
</h1>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frame</code>s and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
<section id="data-frames" data-type="sect2">
<h2>
Data frames</h2>
<p><code>[[</code> and <code>$</code> can be used to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tb &lt;- tibble(
x = 1:4,
y = c(10, 4, 1, 21)
)
# by position
tb[[1]]
#&gt; [1] 1 2 3 4
# by name
tb[["x"]]
#&gt; [1] 1 2 3 4
tb$x
#&gt; [1] 1 2 3 4</pre>
</div>
<p>They can also be used to create new columns, the base R equivalent of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tb$z &lt;- tb$x + tb$y
tb
#&gt; # A tibble: 4 × 3
#&gt; x y z
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 10 11
#&gt; 2 2 4 6
#&gt; 3 3 1 4
#&gt; 4 4 21 25</pre>
</div>
<p>There are a number of other base R approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
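<p>As a rough illustration of those three functions (a sketch of our own, not taken from the gist), here is the same <code>z</code> column created three ways. It assumes a plain <code>data.frame</code> with the same values as <code>tb</code> above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# transform() returns a modified copy; columns are referred to by bare name
df1 &lt;- transform(df, z = x + y)

# within() evaluates a block of assignments inside the data frame
df2 &lt;- within(df, {
  z &lt;- x + y
})

# with() only evaluates the expression; you assign the result yourself
df3 &lt;- df
df3$z &lt;- with(df, x + y)

df1$z
#&gt; [1] 11  6  4 25</pre>
</div>
<p>All three approaches produce the same <code>z</code> column here; they differ mainly in how much code you can run inside the data frame’s context.</p>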
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">max(diamonds$carat)
#&gt; [1] 5.01
levels(diamonds$cut)
#&gt; [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
</div>
<p>dplyr also provides an equivalent to <code>[[</code>/<code>$</code> that we didn’t mention in <a href="#chp-data-transform" data-type="xref">#chp-data-transform</a>: <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; pull(carat) |&gt; max()
#&gt; [1] 5.01
diamonds |&gt; pull(cut) |&gt; levels()
#&gt; [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
</div>
</section>
<section id="tibbles" data-type="sect2">
<h2>
Tibbles</h2>
<p>There are a couple of important differences between tibbles and base <code>data.frame</code>s when it comes to <code>$</code>. Data frames match the prefix of any variable names (so-called <strong>partial matching</strong>) and don’t complain if a column doesn’t exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- data.frame(x1 = 1)
df$x
#&gt; Warning in df$x: partial match of 'x' to 'x1'
#&gt; [1] 1
df$z
#&gt; NULL</pre>
</div>
<p>Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tb &lt;- tibble(x1 = 1)
tb$x
#&gt; Warning: Unknown or uninitialised column: `x`.
#&gt; NULL
tb$z
#&gt; Warning: Unknown or uninitialised column: `z`.
#&gt; NULL</pre>
</div>
<p>For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.</p>
</section>
<section id="base-R-lists" data-type="sect2">
<h2>
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">l &lt;- list(
a = 1:3,
b = "a string",
c = pi,
d = list(-1, -5)
)</pre>
</div>
<ul><li>
<p><code>[</code> extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str(l[1:2])
#&gt; List of 2
#&gt; $ a: int [1:3] 1 2 3
#&gt; $ b: chr "a string"
str(l[1])
#&gt; List of 1
#&gt; $ a: int [1:3] 1 2 3
str(l[4])
#&gt; List of 1
#&gt; $ d:List of 2
#&gt; ..$ : num -1
#&gt; ..$ : num -5</pre>
</div>
<p>Like with vectors, you can subset with a logical, integer, or character vector.</p>
</li>
<li>
<p><code>[[</code> and <code>$</code> extract a single component from a list. They remove a level of hierarchy from the list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str(l[[1]])
#&gt; int [1:3] 1 2 3
str(l[[4]])
#&gt; List of 2
#&gt; $ : num -1
#&gt; $ : num -5
str(l$a)
#&gt; int [1:3] 1 2 3</pre>
</div>
</li>
</ul><p>The difference between <code>[</code> and <code>[[</code> is particularly important for lists because <code>[[</code> drills down into the list while <code>[</code> returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in <a href="#fig-pepper-1" data-type="xref">#fig-pepper-1</a>. If this pepper shaker is your list <code>pepper</code>, then <code>pepper[1]</code> is a pepper shaker containing a single pepper packet, as in <a href="#fig-pepper-2" data-type="xref">#fig-pepper-2</a>. <code>pepper[2]</code> would look the same, but would contain the second packet. <code>pepper[1:2]</code> would be a pepper shaker containing two pepper packets. <code>pepper[[1]]</code> would extract the pepper packet itself, as in <a href="#fig-pepper-3" data-type="xref">#fig-pepper-3</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pepper-1"><p><img src="images/pepper.jpg" style="width:25.0%" alt="A photo of a glass pepper shaker. Instead of the pepper shaker containing pepper, it contains many packets of pepper."/></p>
<figcaption>A pepper shaker that Hadley once found in his hotel room.</figcaption>
</figure>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pepper-2"><p><img src="images/pepper-1.jpg" style="width:25.0%" alt="A photo of the glass pepper shaker containing just one packet of pepper."/></p>
<figcaption><code>pepper[1]</code></figcaption>
</figure>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pepper-3"><p><img src="images/pepper-2.jpg" style="width:25.0%" alt="A photo of single packet of pepper."/></p>
<figcaption><code>pepper[[1]]</code></figcaption>
</figure>
</div>
</div>
<p>This same principle applies when you use 1d <code>[</code> with a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = 1:3, y = 3:5)
# returns a one-column data frame
df["x"]
#&gt; # A tibble: 3 × 1
#&gt; x
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3
# returns the contents of x
df[["x"]]
#&gt; [1] 1 2 3</pre>
</div>
</section>
<section id="base-R-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens when you use <code>[[</code> with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?</p></li>
<li><p>What would <code>pepper[[1]][1]</code> be? What about <code>pepper[[1]][[1]]</code>?</p></li>
</ol></section>
</section>
<section id="apply-family" data-type="sect1">
<h1>
Apply family</h1>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you’ll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.</p>
<p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.</span>. In fact, because we haven’t used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>’s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
<p>There’s no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
# First find numeric columns
num_cols &lt;- sapply(df, is.numeric)
num_cols
#&gt; a b c d e
#&gt; TRUE TRUE FALSE FALSE TRUE
# Then transform each column with lapply() then replace the original values
df[, num_cols] &lt;- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
df
#&gt; # A tibble: 1 × 5
#&gt; a b c d e
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 4 a b 8</pre>
</div>
<p>The code above uses a new function, <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code>. It’s similar to <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called <code><a href="https://purrr.tidyverse.org/reference/map.html">map_vec()</a></code> that we didn’t mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
<p>Base R provides a stricter version of <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> called <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> call above with this <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> where we specify that we expect <code><a href="https://rdrr.io/r/base/numeric.html">is.numeric()</a></code> to return a logical vector of length 1:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">vapply(df, is.numeric, logical(1))
#&gt; a b c d e
#&gt; TRUE TRUE FALSE FALSE TRUE</pre>
</div>
<p>The distinction between <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> and <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.</p>
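<p>One “unusual input” where the two functions visibly disagree is a zero-length vector. The sketch below uses an empty list as a stand-in for, say, a data frame with no columns; the variable name <code>empty</code> is ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">empty &lt;- list()

# sapply() has nothing to simplify, so it returns an empty list
sapply(empty, is.numeric)
#&gt; list()

# vapply() always returns the declared type, even for zero-length input
vapply(empty, is.numeric, logical(1))
#&gt; logical(0)</pre>
</div>
<p>Code that goes on to use the result as a logical vector, for example as a column index, would behave differently in the two cases, which is why <code>vapply()</code> is the safer choice inside functions.</p>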
<p>Another important member of the apply family is <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> which computes a single grouped summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summarize(price = mean(price))
#&gt; # A tibble: 5 × 2
#&gt; cut price
#&gt; &lt;ord&gt; &lt;dbl&gt;
#&gt; 1 Fair 4359.
#&gt; 2 Good 3929.
#&gt; 3 Very Good 3982.
#&gt; 4 Premium 4584.
#&gt; 5 Ideal 3458.
tapply(diamonds$price, diamonds$cut, mean)
#&gt; Fair Good Very Good Premium Ideal
#&gt; 4358.758 3928.864 3981.760 4584.258 3457.542</pre>
</div>
<p>Unfortunately <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques <a href="https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec">in a gist</a>.</p>
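<p>To give a sense of the gymnastics involved, here is one possible way to combine two <code>tapply()</code> summaries into a data frame. This is a sketch of our own, not one of the techniques from the gist, and the <code>sales</code> data is made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sales &lt;- data.frame(
  region = c("east", "west", "east", "west", "east"),
  amount = c(10, 20, 30, 60, 50)
)

# One tapply() call per summary; both are named by the grouping values
avg &lt;- tapply(sales$amount, sales$region, mean)
n &lt;- tapply(sales$amount, sales$region, length)

# Move the names into a column and strip them from the values
summary_df &lt;- data.frame(
  region = names(avg),
  mean_amount = as.vector(avg),
  n = as.vector(n)
)
summary_df
#&gt;   region mean_amount n
#&gt; 1   east          30 3
#&gt; 2   west          40 2</pre>
</div>
<p>This relies on both <code>tapply()</code> calls returning the groups in the same order, which holds here because they share the same grouping vector.</p>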
<p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out for <code>apply(df, 2, something)</code>, which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
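<p>A small example shows why <code>apply(df, 2, something)</code> can be dangerous. The data frame below is made up for illustration; the key assumption is that it mixes column types.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- data.frame(x = 1:3, y = c("a", "b", "c"))

# apply() first converts the data frame to a matrix; a matrix holds a single
# type, so every column becomes character before the function sees it
apply(df, 2, is.numeric)
#&gt;     x     y
#&gt; FALSE FALSE

# sapply() works column by column, so each column keeps its own type
sapply(df, is.numeric)
#&gt;     x     y
#&gt;  TRUE FALSE</pre>
</div>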
</section>
<section id="for-loops" data-type="sect1">
<h1>
For loops</h1>
<p><code>for</code> loops are the fundamental building block of iteration that both the apply and map families use under the hood. <code>for</code> loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a <code>for</code> loop looks like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">for (element in vector) {
# do something with element
}</pre>
</div>
<p>The most straightforward use of <code>for</code> loops is to achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using <code>walk()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt; walk(append_file)</pre>
</div>
<p>We could have used a <code>for</code> loop:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">for (path in paths) {
append_file(path)
}</pre>
</div>
<p>Things get a little trickier if you want to save the output of the <code>for</code> loop, for example reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files &lt;- map(paths, readxl::read_excel)</pre>
</div>
<p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as <code>paths</code>, which we can create with <code><a href="https://rdrr.io/r/base/vector.html">vector()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">files &lt;- vector("list", length(paths))</pre>
</div>
<p>Then instead of iterating over the elements of <code>paths</code>, we'll iterate over their indices, using <code><a href="https://rdrr.io/r/base/seq.html">seq_along()</a></code> to generate one index for each element of <code>paths</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">seq_along(paths)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre>
</div>
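<p>One reason to prefer <code>seq_along()</code> over the common <code>1:length(x)</code> idiom is the empty-input case. The following small sketch uses a hypothetical empty vector, <code>empty</code>, standing in for a directory with no matching files:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical empty input, e.g. a directory with no matching files
empty &lt;- character()

# seq_along() returns no indices, so a loop over it runs zero times
seq_along(empty)
#&gt; integer(0)

# 1:length() counts down from 1 to 0, so a loop would run twice
1:length(empty)
#&gt; [1] 1 0</pre>
</div>
<p>With <code>seq_along()</code> the loop body is simply skipped, while <code>1:length(empty)</code> would try to use the indices 1 and 0, neither of which refers to a real element.</p>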
<p>Using the indices is important because it lets us link each position in the input with the corresponding position in the output:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">for (i in seq_along(paths)) {
  files[[i]] &lt;- readxl::read_excel(paths[[i]])
}</pre>
</div>
<p>To combine the list of tibbles into a single tibble you can use <code><a href="https://rdrr.io/r/base/do.call.html">do.call()</a></code> + <code><a href="https://rdrr.io/r/base/cbind.html">rbind()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">do.call(rbind, files)
#&gt; # A tibble: 1,704 × 5
#&gt;   country     continent lifeExp      pop gdpPercap
#&gt;   &lt;chr&gt;       &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;
#&gt; 1 Afghanistan Asia         28.8  8425333      779.
#&gt; 2 Albania     Europe       55.2  1282697     1601.
#&gt; 3 Algeria     Africa       43.1  9279525     2449.
#&gt; 4 Angola      Africa       30.0  4232095     3521.
#&gt; 5 Argentina   Americas     62.5 17876956     5911.
#&gt; 6 Australia   Oceania      69.1  8691212    10040.
#&gt; # … with 1,698 more rows</pre>
</div>
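<p>If <code>do.call()</code> is unfamiliar, it may help to see that it calls a function with the elements of a list as its arguments. The sketch below uses two small hypothetical data frames, stored in <code>pieces</code>, in place of the files read from disk:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical stand-ins for the data frames read from disk
pieces &lt;- list(
  data.frame(id = 1:2, value = c(10, 20)),
  data.frame(id = 3:4, value = c(30, 40))
)

# do.call() passes each list element as a separate argument to rbind(),
# so these two calls do the same thing
combined &lt;- do.call(rbind, pieces)
identical(combined, rbind(pieces[[1]], pieces[[2]]))
#&gt; [1] TRUE

nrow(combined)
#&gt; [1] 4</pre>
</div>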
<p>Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">out &lt;- NULL
for (path in paths) {
  out &lt;- rbind(out, readxl::read_excel(path))
}</pre>
</div>
<p>We recommend avoiding this pattern because it can become very slow when there are many pieces to combine: each call to <code>rbind()</code> copies everything accumulated so far. This is the source of the persistent canard that <code>for</code> loops are slow: they're not, but iteratively growing a vector is.</p>
</section>
<section id="plots" data-type="sect1">
<h1>
Plots</h1>
<p>Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they're so concise: it takes very little typing to do a basic exploratory plot.</p>
<p>There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here's a quick example from the diamonds dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">hist(diamonds$carat)
plot(diamonds$carat, diamonds$price)</pre>
<div class="cell-output-display">
<p><img src="base-R_files/figure-html/unnamed-chunk-40-1.png" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="base-R_files/figure-html/unnamed-chunk-40-2.png" width="576"/></p>
</div>
</div>
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
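<p>As a sketch of the "some other technique" mentioned above, here are three ways to hand columns to base plotting functions. It uses the built-in <code>mtcars</code> dataset rather than <code>diamonds</code> so that it runs without any packages, and the variable <code>col</code> is just an illustrative name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># mtcars ships with base R, so no packages are needed
hist(mtcars$mpg)

# with() evaluates the call inside the data frame, avoiding repeated `$`
with(mtcars, plot(wt, mpg))

# [[ also works, which is handy when the column name is stored in a variable
col &lt;- "mpg"
hist(mtcars[[col]])</pre>
</div>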
</section>
<section id="base-R-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, we've shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
<p>This chapter concludes the programming section of the book. You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.</p>
</section>
</section>

@ -1,12 +0,0 @@
<div data-type="part">
<h1><span id="sec-communicate-intro" class="quarto-section-identifier d-none d-lg-block">Communicate</span></h1><p>So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualization. However, it doesn't matter how great your analysis is unless you can explain it to others: you need to <strong>communicate</strong> your results.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-communicate"><p><img src="diagrams/data-science/communicate.png" alt="A diagram displaying the data science cycle with communicate highlighted in blue. " width="535"/></p>
<figcaption>Figure 1: Communication is the final part of the data science process; if you can't communicate your results to other humans, it doesn't matter how great your analysis is.</figcaption>
</figure>
</div>
</div><p>Communication is the theme of the following three chapters:</p><ul><li><p>In <a href="#chp-quarto" data-type="xref">#chp-quarto</a>, you will learn about Quarto, a tool for integrating prose, code, and results. You can use Quarto for analyst-to-analyst communication as well as analyst-to-decision-maker communication. Thanks to the power of Quarto formats, you can even use the same document for both purposes.</p></li>
<li><p>In <a href="#chp-quarto-formats" data-type="xref">#chp-quarto-formats</a>, you'll learn a little about the many other varieties of outputs you can produce using Quarto, including dashboards, websites, and books.</p></li>
<li><p>We'll finish up with <a href="#chp-quarto-workflow" data-type="xref">#chp-quarto-workflow</a>, where you'll learn about the “analysis notebook” and how to systematically record your successes and failures so that you can learn from them.</p></li>
</ul><p>These chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans. However, there are lots of other great books about communication, which we'll point you to at the end of each chapter.</p></div>

@ -1,849 +0,0 @@
<section data-type="chapter" id="chp-communication">
<h1><span id="sec-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Communication</span></span></h1>
<section id="communication-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know, even before looking, which variables the plot will display. You make each plot for a purpose, quickly look at it, and then move on to the next plot. In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.</p>
<p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you'll learn some of the tools that ggplot2 provides to do so.</p>
<p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="https://www.amazon.com/gp/product/0321934075/">The Truthful Art</a>, by Alberto Cairo. It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p>
<section id="communication-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, <strong>scales</strong> to override the default breaks, labels, transformations and palettes, and a few ggplot2 extension packages, including <strong>ggrepel</strong> (<a href="https://ggrepel.slowkow.com/">https://ggrepel.slowkow.com</a>) by Kamil Slowikowski and <strong>patchwork</strong> (<a href="https://patchwork.data-imaginist.com/">https://patchwork.data-imaginist.com</a>) by Thomas Lin Pedersen. Don't forget that you'll need to install those packages with <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> if you don't already have them.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(ggrepel)
library(patchwork)</pre>
</div>
</section>
</section>
<section id="labels" data-type="sect1">
<h1>
Labels</h1>
<p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> function. This example adds a plot title:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-3-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid. The plot is titled &quot;Fuel efficiency generally decreases with engine size&quot;." width="576"/></p>
</div>
</div>
<p>The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.</p>
<p>If you need to add more text, there are two other useful labels:</p>
<ul><li><p><code>subtitle</code> adds additional detail in a smaller font beneath the title.</p></li>
<li><p><code>caption</code> adds text at the bottom right of the plot, often used to describe the source of the data.</p></li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-4-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid. The plot is titled &quot;Fuel efficiency generally decreases with engine size&quot;. The subtitle is &quot;Two seaters (sports cars) are an exception because of their light weight&quot; and the caption is &quot;Data from fueleconomy.gov&quot;." width="576"/></p>
</div>
</div>
<p>You can also use <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> to replace the axis and legend titles. It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    color = "Car type"
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-5-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid. The x-axis is labelled &quot;Engine displacement (L)&quot; and the y-axis is labelled &quot;Highway fuel economy (mpg)&quot;. The legend is labelled &quot;Car type&quot;." width="576"/></p>
</div>
</div>
<p>It's possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="https://rdrr.io/r/base/substitute.html">quote()</a></code> and read about the available options in <code><a href="https://rdrr.io/r/grDevices/plotmath.html">?plotmath</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
  x = 1:10,
  y = x ^ 2
)
ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-6-1.png" style="width:50.0%" alt="Scatterplot with math text on the x and y axis labels. X-axis label says sum of x_i squared, for i from 1 to n. Y-axis label says alpha + beta + delta over theta."/></p>
</div>
</div>
<section id="communication-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create one plot on the fuel economy data with customized <code>title</code>, <code>subtitle</code>, <code>caption</code>, <code>x</code>, <code>y</code>, and <code>color</code> labels.</p></li>
<li>
<p>Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-7-1.png" alt="Scatterplot of highway versus city fuel efficiency. Shapes and colors of points are determined by type of drive train." width="576"/></p>
</div>
</div>
</li>
<li><p>Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.</p></li>
</ol></section>
</section>
<section id="annotations" data-type="sect1">
<h1>
Annotations</h1>
<p>In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> is similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p>
<p>There are two possible sources of labels. First, you might have a tibble that provides labels. In the following plot we pull out the cars with the highest engine size in each drive type and save their information as a new data frame called <code>label_info</code>. In order to create the <code>label_info</code> data frame we used a number of new dplyr functions. You'll learn more about each of these soon!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">label_info &lt;- mpg |&gt;
  group_by(drv) |&gt;
  arrange(desc(displ)) |&gt;
  slice_head(n = 1) |&gt;
  mutate(
    drive_type = case_when(
      drv == "f" ~ "front-wheel drive",
      drv == "r" ~ "rear-wheel drive",
      drv == "4" ~ "4-wheel drive"
    )
  ) |&gt;
  select(displ, hwy, drv, drive_type)
label_info
#&gt; # A tibble: 3 × 4
#&gt; # Groups:   drv [3]
#&gt;   displ   hwy drv   drive_type
#&gt;   &lt;dbl&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1   6.5    17 4     4-wheel drive
#&gt; 2   5.3    25 f     front-wheel drive
#&gt; 3   7      24 r     rear-wheel drive</pre>
</div>
<p>Then, we use this new data frame to label the three groups directly on the plot, replacing the legend. Using the <code>fontface</code> and <code>size</code> arguments we can customize the look of the text labels: they're bold and larger than the rest of the text on the plot. (<code>theme(legend.position = "none")</code> turns the legend off; we'll talk about it more shortly.)</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_text(
    data = label_info,
    aes(x = displ, y = hwy, label = drive_type),
    fontface = "bold", size = 5, hjust = "right", vjust = "bottom"
  ) +
  theme(legend.position = "none")
#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-9-1.png" alt="Scatterplot of highway mileage versus engine size where points are colored by drive type. Smooth curves for each drive type are overlaid. Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel." width="576"/></p>
</div>
</div>
<p>Note the use of <code>hjust</code> and <code>vjust</code> to control the alignment of the label. <a href="#fig-just" data-type="xref">#fig-just</a> shows all nine possible combinations.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-just"><p><img src="communication_files/figure-html/fig-just-1.png" style="width:60.0%" alt="A 1x1 grid. At (0,0) hjust is set to left and vjust is set to bottom. At (0.5, 0) hjust is center and vjust is bottom and at (1, 0) hjust is right and vjust is bottom. At (0, 0.5) hjust is left and vjust is center, at (0.5, 0.5) hjust is center and vjust is center, and at (1, 0.5) hjust is right and vjust is center. Finally, at (0, 1) hjust is left and vjust is top, at (0.5, 1) hjust is center and vjust is top, and at (1, 1) hjust is right and vjust is top."/></p>
<figcaption>All nine combinations of <code>hjust</code> and <code>vjust</code>.</figcaption>
</figure>
</div>
</div>
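<p>A figure along these lines could be drawn with the following sketch. It is not necessarily the code used to produce the figure above; the helper vector <code>pos</code> and the data frame <code>just</code> are names we made up for illustration. The key idea is that <code>hjust</code> and <code>vjust</code> can be mapped as aesthetics inside <code>aes()</code>, so each label gets its own alignment:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Numeric position of each named alignment within a unit square (our assumption)
pos &lt;- c(left = 0, bottom = 0, center = 0.5, right = 1, top = 1)

# One row per combination of horizontal and vertical alignment
just &lt;- expand.grid(
  hj = c("left", "center", "right"),
  vj = c("bottom", "center", "top"),
  stringsAsFactors = FALSE
)
just$x &lt;- unname(pos[just$hj])
just$y &lt;- unname(pos[just$vj])
just$label &lt;- paste0("hjust = '", just$hj, "'\nvjust = '", just$vj, "'")

ggplot(just, aes(x = x, y = y)) +
  geom_point(color = "grey70", size = 5) +
  # hjust and vjust are mapped per row rather than set once for the layer
  geom_text(aes(label = label, hjust = hj, vjust = vj)) +
  labs(x = NULL, y = NULL)</pre>
</div>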
<p>However, the annotated plot we made above is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_label(
    data = label_info,
    aes(x = displ, y = hwy, label = drive_type),
    fontface = "bold", size = 5, hjust = "right", alpha = 0.5, nudge_y = 2
  ) +
  theme(legend.position = "none")
#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-11-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to drive type. Smooth curves for each drive type are overlaid. Each curve is labelled with its drive type, and the labels are boxes with a white, semi-transparent background." width="576"/></p>
</div>
</div>
<p>That helps a bit, but two of the labels still overlap with each other. This is difficult to fix by applying the same transformation for every label. Instead, we can use the <code><a href="https://rdrr.io/pkg/ggrepel/man/geom_text_repel.html">geom_label_repel()</a></code> function from the ggrepel package. This useful package will automatically adjust labels so that they don't overlap:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_label_repel(
    data = label_info,
    aes(x = displ, y = hwy, label = drive_type),
    fontface = "bold", size = 5, nudge_y = 2
  ) +
  theme(legend.position = "none")
#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-12-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to drive type. Smooth curves for each drive type are overlaid. Each curve is labelled with its drive type, and the labels are boxes with a white background, positioned so that they do not overlap." width="576"/></p>
</div>
</div>
<p>You can also use the same idea to highlight certain points on a plot with <code><a href="https://rdrr.io/pkg/ggrepel/man/geom_text_repel.html">geom_text_repel()</a></code> from the ggrepel package. Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">potential_outliers &lt;- mpg |&gt;
  filter(hwy &gt; 40 | (hwy &gt; 20 &amp; displ &gt; 5))
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_text_repel(data = potential_outliers, aes(label = model)) +
  geom_point(data = potential_outliers, color = "red") +
  geom_point(data = potential_outliers, color = "red", size = 3, shape = "circle open")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-13-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. Points where highway mileage is above 40 as well as above 20 with engine size above 5 are red, with a hollow red circle, and labelled with model name of the car." width="576"/></p>
</div>
</div>
<p>Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame. Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to compute the maximum values of x and y.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">label_info &lt;- mpg |&gt;
  summarize(
    displ = max(displ),
    hwy = max(hwy),
    label = "Increasing engine size is \nrelated to decreasing fuel economy."
  )
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_text(
    data = label_info, aes(label = label),
    vjust = "top", hjust = "right"
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-14-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. On the top right corner, inset a bit from the corner, is an annotation that reads &quot;increasing engine size is related to decreasing fuel economy&quot;. The text spans two lines." width="576"/></p>
</div>
</div>
<p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we're no longer computing the positions from <code>mpg</code>, we can use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> to create the data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">label_info &lt;- tibble(
  displ = Inf,
  hwy = Inf,
  label = "Increasing engine size is \nrelated to decreasing fuel economy."
)
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. On the top right corner, flush against the corner, is an annotation that reads &quot;increasing engine size is related to decreasing fuel economy&quot;. The text spans two lines." width="576"/></p>
</div>
</div>
<p>Alternatively, we can add the annotation without creating a new data frame, using <code><a href="https://ggplot2.tidyverse.org/reference/annotate.html">annotate()</a></code>. This function adds a geom to a plot, but it doesn't map variables of a data frame to an aesthetic. The first argument of this function, <code>geom</code>, is the geometric object you want to use for annotation.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(
    geom = "text", x = Inf, y = Inf,
    label = "Increasing engine size is \nrelated to decreasing fuel economy.",
    vjust = "top", hjust = "right"
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-16-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. On the top right corner, flush against the corner, is an annotation that reads &quot;increasing engine size is related to decreasing fuel economy&quot;. The text spans two lines." width="576"/></p>
</div>
</div>
<p>You can also use a label geom instead of a text geom, as we did earlier, and set aesthetics such as color. Another approach for drawing attention to a plot feature is using a segment geom with the <code>arrow</code> argument. The <code>x</code> and <code>y</code> aesthetics define the starting location of the segment, and <code>xend</code> and <code>yend</code> define the end location.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(
    geom = "label", x = 3.5, y = 38,
    label = "Increasing engine size is \nrelated to decreasing fuel economy.",
    hjust = "left", color = "red"
  ) +
  annotate(
    geom = "segment",
    x = 3, y = 35, xend = 5, yend = 25, color = "red",
    arrow = arrow(type = "closed")
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-17-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. A red arrow pointing down follows the trend of the points and the annotation placed next to the arrow reads &quot;increasing engine size is related to decreasing fuel economy&quot;. The arrow and the annotation text are red." width="576"/></p>
</div>
</div>
<p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="https://stringr.tidyverse.org/reference/str_wrap.html">stringr::str_wrap()</a></code> to automatically add line breaks, given the number of characters you want per line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">"Increasing engine size is related to decreasing fuel economy." |&gt;
  str_wrap(width = 40) |&gt;
  writeLines()
#&gt; Increasing engine size is related to
#&gt; decreasing fuel economy.</pre>
</div>
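<p>To connect this back to the annotated plots, the wrapped string can be passed straight to the <code>label</code> argument. This is only a sketch: the variable name <code>trend_text</code> and the width of 30 characters are our own choices, and you would tune the width to the space available in your plot.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(stringr)

# Wrap the annotation once, up front; width = 30 is an arbitrary choice
trend_text &lt;- str_wrap(
  "Increasing engine size is related to decreasing fuel economy.",
  width = 30
)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(
    geom = "text", x = Inf, y = Inf,
    label = trend_text, vjust = "top", hjust = "right"
  )</pre>
</div>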
<p>Remember, in addition to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A couple of ideas:</p>
<ul><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_hline()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_vline()</a></code> to add reference lines. We often make them thick (<code>linewidth = 2</code>) and white (<code>color = "white"</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_rect()</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_segment.html">geom_segment()</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li>
</ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p>
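<p>As a minimal sketch of the first two ideas, the plot below adds a thick white reference line underneath the points and a rectangle around a cluster of points. The reference value of 30 and the rectangle boundaries are illustrative values picked by eye, not values taken from the text:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  # Reference line is added first so it sits underneath the points
  geom_hline(yintercept = 30, linewidth = 2, color = "white") +
  # Rectangle around large-engine cars with relatively high mileage
  # (boundaries are illustrative and chosen by eye)
  annotate(
    geom = "rect",
    xmin = 5.5, xmax = 7.2, ymin = 22, ymax = 27,
    fill = NA, color = "red"
  ) +
  geom_point()</pre>
</div>
<p>Because layers are drawn in the order they are added, putting <code>geom_point()</code> last keeps the data on top of both annotations.</p>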
<section id="communication-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> with infinite positions to place text at the four corners of the plot.</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/annotate.html">annotate()</a></code> to add a point geom in the middle of your last plot without having to create a tibble. Customize the shape, size, or color of the point.</p></li>
<li><p>How do labels with <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)</p></li>
<li><p>What arguments to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> control the appearance of the background box?</p></li>
<li><p>What are the four arguments to <code><a href="https://rdrr.io/r/grid/arrow.html">arrow()</a></code>? How do they work? Create a series of plots that demonstrate the most important options.</p></li>
</ol></section>
</section>
<section id="scales" data-type="sect1">
<h1>
Scales</h1>
<p>The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive.</p>
<section id="default-scales" data-type="sect2">
<h2>
Default scales</h2>
<p>Normally, ggplot2 automatically adds scales for you. For example, when you type:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))</pre>
</div>
<p>ggplot2 automatically adds default scales behind the scenes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()</pre>
</div>
<p>Note the naming scheme for scales: <code>scale_</code> followed by the name of the aesthetic, then <code>_</code>, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. There are lots of non-default scales that you'll learn about below.</p>
<p>The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:</p>
<ul><li><p>You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.</p></li>
<li><p>You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.</p></li>
</ul></section>
<section id="axis-ticks-and-legend-keys" data-type="sect2">
<h2>
Axis ticks and legend keys</h2>
<p>There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: <code>breaks</code> and <code>labels</code>. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of <code>breaks</code> is to override the default choice:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-21-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. The y-axis has breaks starting at 15 and ending at 40, increasing by 5." width="576"/></p>
</div>
</div>
<p>You can use <code>labels</code> in the same way (a character vector the same length as <code>breaks</code>), but you can also set it to <code>NULL</code> to suppress the labels altogether. This is useful for maps, or for publishing plots where you can’t share the absolute numbers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-22-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars. The x and y-axes do not have any labels at the axis ticks." width="576"/></p>
</div>
</div>
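<p>The character-vector form of <code>labels</code> mentioned above can be sketched as follows. The helper vectors <code>hwy_breaks</code> and <code>hwy_labels</code> and the “mpg” suffix are assumptions made for this illustration; the only requirement is that both vectors have the same length.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper vectors: six tick positions and six matching labels.
hwy_breaks &lt;- seq(15, 40, by = 5)
hwy_labels &lt;- paste(hwy_breaks, "mpg")

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_continuous(breaks = hwy_breaks, labels = hwy_labels)</pre>
</div>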
<p>The <code>labels</code> argument coupled with labelling functions from the scales package is also useful for formatting numbers as currency, percent, etc. The plot on the left shows default labelling with <code>label_dollar()</code>, which adds a dollar sign as well as a thousand separator comma. The plot on the right adds further customization by dividing dollar values by 1,000 and adding a suffix “K” (for “thousands”) as well as adding custom breaks. Note that <code>breaks</code> is in the original scale of the data.</p>
<div>
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(alpha = 0.05) +
  scale_y_continuous(labels = scales::label_dollar())

# Right
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(alpha = 0.05) +
  scale_y_continuous(
    labels = scales::label_dollar(scale = 1/1000, suffix = "K"),
    breaks = seq(1000, 19000, by = 6000)
  )</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-23-1.png" alt="Two side-by-side box plots of price versus cut of diamonds. The outliers are transparent. On both plots the y-axis labels are formatted as dollars. The y-axis labels on the left plot start at $0 and go to $15,000, increasing by $5,000. The y-axis labels on the right plot start at $1K and go to $19K, increasing by $6K." width="576"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-23-2.png" alt="Two side-by-side box plots of price versus cut of diamonds. The outliers are transparent. On both plots the y-axis labels are formatted as dollars. The y-axis labels on the left plot start at $0 and go to $15,000, increasing by $5,000. The y-axis labels on the right plot start at $1K and go to $19K, increasing by $6K." width="576"/></p>
</div>
</div>
</div>
</div>
<p>Another handy label function is <code>label_percent()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill") +
  scale_y_continuous(
    name = "Percentage",
    labels = scales::label_percent()
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-24-1.png" alt="Segmented bar plots of cut, filled with levels of clarity. The y-axis labels start at 0% and go to 100%, increasing by 25%. The y-axis label name is &quot;Percentage&quot;." width="576"/></p>
</div>
</div>
<p>You can also use <code>breaks</code> and <code>labels</code> to control the appearance of legends. Collectively axes and legends are called <strong>guides</strong>. Axes are used for x and y aesthetics; legends are used for everything else.</p>
<p>Another use of <code>breaks</code> is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">presidential |&gt;
mutate(id = 33 + row_number()) |&gt;
ggplot(aes(x = start, y = id)) +
geom_point() +
geom_segment(aes(xend = end, yend = id)) +
scale_x_date(name = NULL, breaks = presidential$start, date_labels = "'%y")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-25-1.png" alt="Line plot of id number of presidents versus the year they started their presidency. Start year is marked with a point and a segment that starts there and ends at the end of the presidency. The x-axis labels are formatted as two digit years starting with an apostrophe, e.g., '53." width="576"/></p>
</div>
</div>
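<p>The plot above places a break at each exact start date. As a rough sketch of an alternative, date scales also accept a <code>date_breaks</code> string that asks for regularly spaced breaks. The <code>"8 years"</code> spacing and the <code>"%Y"</code> label format are illustrative choices, not values taken from this chapter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

presidential |&gt;
  mutate(id = 33 + row_number()) |&gt;
  ggplot(aes(x = start, y = id)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  # Regularly spaced breaks, labelled with four-digit years.
  scale_x_date(name = NULL, date_breaks = "8 years", date_labels = "%Y")</pre>
</div>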
<p>Note that the specification of breaks and labels for date and datetime scales is a little different:</p>
<ul><li><p><code>date_labels</code> takes a format specification, in the same form as <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">parse_datetime()</a></code>.</p></li>
<li><p><code>date_breaks</code> (not shown here) takes a string like “2 days” or “1 month”.</p></li>
</ul></section>
<section id="legend-layout" data-type="sect2">
<h2>
Legend layout</h2>
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p>
<p>To control the overall position of the legend, you need to use a <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p>
<div>
<pre data-type="programlisting" data-code-language="r">base &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class))
base + theme(legend.position = "left")
base + theme(legend.position = "top")
base + theme(legend.position = "bottom")
base + theme(legend.position = "right") # the default</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-26-1.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-26-2.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-26-3.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-26-4.png" alt="Four scatterplots of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Clockwise, the legend is placed on the left, top, bottom, and right of the plot." width="384"/></p>
</div>
</div>
</div>
</div>
<p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p>
<p>To control the display of individual legends, use <code><a href="https://ggplot2.tidyverse.org/reference/guides.html">guides()</a></code> along with <code><a href="https://ggplot2.tidyverse.org/reference/guide_legend.html">guide_legend()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/guide_colourbar.html">guide_colorbar()</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme(legend.position = "bottom") +
guides(color = guide_legend(nrow = 1, override.aes = list(size = 4)))
#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars where points are colored based on class of car. Overlaid on the plot is a smooth curve. The legend is in the bottom and classes are listed horizontally in a row. The points in the legend are larger than the points in the plot." width="576"/></p>
</div>
</div>
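<p>The example above uses <code>guide_legend()</code> because <code>class</code> is discrete. For a continuous variable the guide is a color bar, and a minimal sketch of customizing it with <code>guide_colorbar()</code> might look like this. Mapping <code>cty</code> to color, the title text, and reversing the bar are all choices made only for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty)) +
  theme(legend.position = "bottom") +
  # cty is continuous, so its guide is a color bar rather than a legend.
  guides(color = guide_colorbar(title = "City mpg", reverse = TRUE))</pre>
</div>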
</section>
<section id="replacing-a-scale" data-type="sect2">
<h2>
Replacing a scale</h2>
<p>Instead of just tweaking the details a little, you can replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and color scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and color, you’ll be able to quickly pick up other scale replacements.</p>
<p>It’s very useful to plot transformations of your variable. For example, it’s easier to see the precise relationship between <code>carat</code> and <code>price</code> if we log transform them:</p>
<div>
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(diamonds, aes(x = carat, y = price)) +
geom_bin2d()
# Right
ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
geom_bin2d()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-28-1.png" alt="Two plots of price versus carat of diamonds. Data binned and the color of the rectangles representing each bin based on the number of points that fall into that bin. In the plot on the right, price and carat values are logged and the axis labels show the logged values." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-28-2.png" alt="Two plots of price versus carat of diamonds. Data binned and the color of the rectangles representing each bin based on the number of points that fall into that bin. In the plot on the right, price and carat values are logged and the axis labels show the logged values." width="384"/></p>
</div>
</div>
</div>
</div>
<p>However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
geom_bin2d() +
scale_x_log10() +
scale_y_log10()</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-29-1.png" alt="Plot of price versus carat of diamonds. Data binned and the color of the rectangles representing each bin based on the number of points that fall into that bin. The axis labels are on the original data scale." width="576"/></p>
</div>
</div>
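<p>Replacing a scale and tweaking its parameters can be combined. As a small sketch, the log-transformed y scale below also formats its labels as currency with <code>scales::label_dollar()</code>; pairing the two here is an illustrative choice rather than something the plot above requires.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d() +
  scale_x_log10() +
  # Log-transform the axis, but label it in dollars on the original data scale.
  scale_y_log10(labels = scales::label_dollar())</pre>
</div>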
<p>Another scale that is frequently customized is color. The default categorical scale picks colors that are evenly spaced around the color wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of color blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green color blindness.</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv))
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
scale_color_brewer(palette = "Set1")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-30-1.png" alt="Two scatterplots of highway mileage versus engine size where points are colored by drive type. The plot on the left uses the default ggplot2 color palette and the plot on the right uses a different color palette." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-30-2.png" alt="Two scatterplots of highway mileage versus engine size where points are colored by drive type. The plot on the left uses the default ggplot2 color palette and the plot on the right uses a different color palette." width="384"/></p>
</div>
</div>
</div>
</div>
<p>Don’t forget simpler techniques. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv, shape = drv)) +
scale_color_brewer(palette = "Set1")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-31-1.png" alt="Scatterplot of highway mileage versus engine size where both color and shape of points are based on drive type. The color palette is not the default ggplot2 palette." width="576"/></p>
</div>
</div>
<p>The ColorBrewer scales are documented online at <a href="https://colorbrewer2.org/" class="uri">https://colorbrewer2.org/</a> and made available in R via the <strong>RColorBrewer</strong> package, by Erich Neuwirth. <a href="#fig-brewer" data-type="xref">#fig-brewer</a> shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code> to make a continuous variable into a categorical variable.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-brewer"><p><img src="communication_files/figure-html/fig-brewer-1.png" alt="All ColorBrewer scales. One group goes from light to dark colors. Another group is a set of non ordinal colors. And the last group has diverging scales (from dark to light to dark again). Within each set there are a number of palettes." width="576"/></p>
<figcaption>All ColorBrewer scales.</figcaption>
</figure>
</div>
</div>
<p>When you have a predefined mapping between values and colors, use <code><a href="https://ggplot2.tidyverse.org/reference/scale_manual.html">scale_color_manual()</a></code>. For example, if we map presidential party to color, we want to use the standard mapping of red for Republicans and blue for Democrats:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">presidential |&gt;
mutate(id = 33 + row_number()) |&gt;
ggplot(aes(x = start, y = id, color = party)) +
geom_point() +
geom_segment(aes(xend = end, yend = id)) +
scale_color_manual(values = c(Republican = "red", Democratic = "blue"))</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-33-1.png" alt="Line plot of id number of presidents versus the year they started their presidency. Start year is marked with a point and a segment that starts there and ends at the end of the presidency. Democratic presidents are represented in blue and Republicans in red." width="576"/></p>
</div>
</div>
<p>For continuous color, you can use the built-in <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_color_gradient()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_fill_gradient()</a></code>. If you have a diverging scale, you can use <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_color_gradient2()</a></code>. That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.</p>
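<p>As a minimal sketch of the diverging case, the plot below colors each car by how far its highway mileage sits above or below the mean. The helper <code>mean_hwy</code>, the legend title, and the three colors are assumptions chosen for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper: the reference value the diverging scale is centered on.
mean_hwy &lt;- mean(mpg$hwy)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = hwy - mean_hwy)) +
  scale_color_gradient2(
    name = "hwy minus mean",
    low = "blue", mid = "grey80", high = "red",
    midpoint = 0 # differences of zero get the neutral middle color
  )</pre>
</div>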
<p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p>
<div>
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
  x = rnorm(10000),
  y = rnorm(10000)
)

ggplot(df, aes(x, y)) +
  geom_hex() +
  coord_fixed() +
  labs(title = "Default, continuous")

ggplot(df, aes(x, y)) +
  geom_hex() +
  coord_fixed() +
  scale_fill_viridis_c() +
  labs(title = "Viridis, continuous")

ggplot(df, aes(x, y)) +
  geom_hex() +
  coord_fixed() +
  scale_fill_viridis_b() +
  labs(title = "Viridis, binned")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-34-1.png" alt="Three hex plots where the color of the hexes show the number of observations that fall into that hex bin. The first plot uses the default, continuous ggplot2 scale. The second plot uses the viridis, continuous scale, and the third plot uses the viridis, binned scale." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-34-2.png" alt="Three hex plots where the color of the hexes show the number of observations that fall into that hex bin. The first plot uses the default, continuous ggplot2 scale. The second plot uses the viridis, continuous scale, and the third plot uses the viridis, binned scale." width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-34-3.png" alt="Three hex plots where the color of the hexes show the number of observations that fall into that hex bin. The first plot uses the default, continuous ggplot2 scale. The second plot uses the viridis, continuous scale, and the third plot uses the viridis, binned scale." width="384"/></p>
</div>
</div>
</div>
</div>
<p>Note that all color scales come in two varieties: <code>scale_color_x()</code> and <code>scale_fill_x()</code> for the <code>color</code> and <code>fill</code> aesthetics respectively (the color scales are available in both UK and US spellings).</p>
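<p>A short sketch of the two varieties, using the discrete viridis palette: points are drawn with the <code>color</code> aesthetic and bars with the <code>fill</code> aesthetic, so each needs the matching scale. The first call deliberately uses the UK spelling to show that it is accepted; the choice of plots is illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Points use the color aesthetic, so the scale is scale_colour_*() or scale_color_*().
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  scale_colour_viridis_d()

# Bars use the fill aesthetic, so the matching scale is scale_fill_*().
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_viridis_d()</pre>
</div>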
</section>
<section id="zooming" data-type="sect2">
<h2>
Zooming</h2>
<p>There are three ways to control the plot limits:</p>
<ol type="1"><li>Adjusting what data are plotted.</li>
<li>Setting the limits in each scale.</li>
<li>Setting <code>xlim</code> and <code>ylim</code> in <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>.</li>
</ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>. Compare the following two plots:</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
mpg |&gt;
filter(displ &gt;= 5, displ &lt;= 7, hwy &gt;= 10, hwy &lt;= 30) |&gt;
ggplot(aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-35-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-35-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
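<p>The two plots above differ because <code>coord_cartesian()</code> only changes the visible window, so <code>geom_smooth()</code> is still fit to every row of <code>mpg</code>, whereas <code>filter()</code> drops rows before the smoother sees them. The minimal sketch below counts the rows each approach hands to the geoms; <code>zoomed</code> is a name introduced only for this illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

zoomed &lt;- mpg |&gt;
  filter(displ &gt;= 5, displ &lt;= 7, hwy &gt;= 10, hwy &lt;= 30)

# coord_cartesian() keeps every row of mpg; filter() keeps only this subset.
nrow(mpg)
nrow(zoomed)</pre>
</div>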
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to <em>expand</em> the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the color aesthetic) have different ranges.</p>
<div>
<pre data-type="programlisting" data-code-language="r">suv &lt;- mpg |&gt; filter(class == "suv")
compact &lt;- mpg |&gt; filter(class == "compact")
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
geom_point()
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-36-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-36-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>One way to overcome this problem is to share scales across multiple plots, training the scales with the <code>limits</code> of the full data.</p>
<div>
<pre data-type="programlisting" data-code-language="r">x_scale &lt;- scale_x_continuous(limits = range(mpg$displ))
y_scale &lt;- scale_y_continuous(limits = range(mpg$hwy))
col_scale &lt;- scale_color_discrete(limits = unique(mpg$drv))
ggplot(suv, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
ggplot(compact, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-37-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communication_files/figure-html/unnamed-chunk-37-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.</p>
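<p>For comparison, here is a sketch of the faceting alternative mentioned above. Because both panels come from a single plot, they share the x, y, and color scales by default; filtering with <code>%in%</code> is just one way to select the two classes.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

mpg |&gt;
  filter(class %in% c("suv", "compact")) |&gt;
  ggplot(aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # One panel per class, with scales shared across the panels by default.
  facet_wrap(~class)</pre>
</div>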
</section>
<section id="communication-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Why doesn’t the following code override the default scale?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
scale_color_gradient(low = "white", high = "red") +
coord_fixed()</pre>
</div>
</li>
<li><p>What is the first argument to every scale? How does it compare to <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code>?</p></li>
<li>
<p>Change the display of the presidential terms by:</p>
<ol type="a"><li>Combining the two variants shown above.</li>
<li>Improving the display of the y axis.</li>
<li>Labelling each term with the name of the president.</li>
<li>Adding informative plot labels.</li>
<li>Placing breaks every 4 years (this is trickier than it seems!).</li>
</ol></li>
<li>
<p>Use <code>override.aes</code> to make the legend on the following plot easier to see.</p>
<div class="cell" data-fig.format="png">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(aes(color = cut), alpha = 1/20)</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-39-1.png" style="width:50.0%" alt="Scatterplot of price versus carat of diamonds. The points are colored by cut of the diamonds and they're very transparent."/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="sec-themes" data-type="sect1">
<h1>
Themes</h1>
<p>Finally, you can customize the non-data elements of your plot with a theme:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-40-1.png" width="576"/></p>
</div>
</div>
<p>ggplot2 includes eight themes by default, as shown in <a href="#fig-themes" data-type="xref">#fig-themes</a>. Many more are included in add-on packages like <strong>ggthemes</strong> (<a href="https://jrnold.github.io/ggthemes" class="uri">https://jrnold.github.io/ggthemes</a>), by Jeffrey Arnold. You can also create your own themes, if you are trying to match a particular corporate or journal style.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-themes"><p><img src="images/visualization-themes.png" alt="Eight barplots created with ggplot2, each with one of the eight built-in themes: theme_bw() - White background with grid lines, theme_light() - Light axes and grid lines, theme_classic() - Classic theme, axes but no grid lines, theme_linedraw() - Only black lines, theme_dark() - Dark background for contrast, theme_minimal() - Minimal theme, no background, theme_gray() - Gray background (default theme), theme_void() - Empty theme, only geoms are visible." width="1600"/></p>
<figcaption>The eight themes built-in to ggplot2.</figcaption>
</figure>
</div>
</div>
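<p>One lightweight way to create your own theme, as mentioned above, is to start from a built-in theme and store your adjustments in an object that you can add to any plot. This is only a sketch: the name <code>theme_report</code> and the specific settings are hypothetical and stand in for whatever a corporate or journal style would require.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical reusable theme: a built-in theme plus a few overrides.
theme_report &lt;- theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "bottom"
  )

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  labs(title = "Highway mileage by engine size") +
  theme_report</pre>
</div>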
<p>Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The gray background gives the plot a similar typographic color to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the gray background creates a continuous field of color which ensures that the plot is perceived as a single visual entity.</p>
<p>It’s also possible to control individual components of each theme, like the size and color of the font used for the y axis. We’ve already seen that <code>legend.position</code> controls where the legend is drawn. There are many other aspects of the legend that can be customized with <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code>. For example, in the plot below we change the direction of the legend as well as put a black border around it. A few other helpful <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> components are used to change the placement and format of the title and caption text.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(
    title = "Highway mileage decreases as engine size increases",
    caption = "Source: https://fueleconomy.gov."
  ) +
  theme(
    legend.position = c(0.6, 0.7),
    legend.direction = "horizontal",
    legend.box.background = element_rect(color = "black"),
    plot.title = element_text(face = "bold"),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 0)
  )</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-42-1.png" width="576"/></p>
</div>
</div>
<p>For an overview of all <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> components, see help with <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">?theme</a></code>. The <a href="https://ggplot2-book.org/">ggplot2 book</a> is also a great place to go for the full details on theming.</p>
<section id="communication-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Pick a theme offered by the ggthemes package and apply it to the last plot you made.</li>
<li>Make the axis labels of your plot blue and bolded.</li>
</ol></section>
</section>
<section id="layout" data-type="sect1">
<h1>
Layout</h1>
<p>So far we talked about how to create and modify a single plot. What if you have multiple plots you want to lay out in a certain way? The patchwork package allows you to combine separate plots into the same graphic. We loaded this package earlier in the chapter.</p>
<p>To place two plots next to each other, you can simply add them to each other. Note that you first need to create the plots and save them as objects (in the following example they’re called <code>p1</code> and <code>p2</code>). Then, you place them next to each other with <code>+</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">p1 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(title = "Plot 1")
p2 &lt;- ggplot(mpg, aes(x = drv, y = hwy)) +
geom_boxplot() +
labs(title = "Plot 2")
p1 + p2</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-43-1.png" alt="Two plots (a scatterplot of highway mileage versus engine size and a side-by-side boxplots of highway mileage versus drive train) placed next to each other." width="576"/></p>
</div>
</div>
<p>It’s important to note that in the above code chunk we did not use a new function from the patchwork package. Instead, the package added new functionality to the <code>+</code> operator.</p>
<p>You can also create arbitrary plot layouts with patchwork. In the following, <code>|</code> places <code>p1</code> and <code>p3</code> next to each other and <code>/</code> moves <code>p2</code> to the next line.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">p3 &lt;- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
labs(title = "Plot 3")
(p1 | p3) / p2</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-44-1.png" alt="Three plots laid out such that the first and third plots are next to each other and the second plot is stretched beneath them. The first plot is a scatterplot of highway mileage versus engine size, the third plot is a scatterplot of highway mileage versus city mileage, and the second plot is side-by-side boxplots of highway mileage versus drive train." width="576"/></p>
</div>
</div>
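<p>The layout operators can also be combined with <code>plot_layout()</code> when the panels should not share the space equally. The following is a minimal sketch rather than part of the original example: it assumes ggplot2 and patchwork are installed, uses the hypothetical names <code>scatter</code> and <code>boxes</code> so it does not overwrite the plots above, and picks the <code>widths</code> values arbitrarily.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(patchwork)

# Two stand-alone plots, named so they don't clash with p1, p2, and p3
scatter &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
boxes &lt;- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

# `|` places the plots side by side; widths are relative, so the
# scatterplot gets twice the horizontal space of the boxplots
combined &lt;- (scatter | boxes) + plot_layout(widths = c(2, 1))
combined</pre>
</div>
<p>Because the widths are relative, <code>c(2, 1)</code> and <code>c(4, 2)</code> should describe the same layout.</p>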
<p>Additionally, patchwork allows you to collect legends from multiple plots into one common legend, customize the placement of the legend as well as dimensions of the plots, and add a common title, subtitle, caption, etc. to your plots. In the following, we have 5 plots. We have turned off the legends on the box plots and the scatterplot and collected the legends for the density plots at the top of the plot with <code>&amp; theme(legend.position = "top")</code>. Note the use of the <code>&amp;</code> operator here instead of the usual <code>+</code>. This is because we're modifying the theme for the patchwork plot as opposed to the individual ggplots. The legend is placed on top, inside the <code><a href="https://patchwork.data-imaginist.com/reference/guide_area.html">guide_area()</a></code>. Finally, we have also customized the heights of the various components of our patchwork: the guide has a height of 1, the box plots 3, density plots 2, and the faceted scatter plot 4. Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">p1 &lt;- ggplot(mpg, aes(x = drv, y = cty, color = drv)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Plot 1")
p2 &lt;- ggplot(mpg, aes(x = drv, y = hwy, color = drv)) +
geom_boxplot(show.legend = FALSE) +
labs(title = "Plot 2")
p3 &lt;- ggplot(mpg, aes(x = cty, color = drv, fill = drv)) +
geom_density(alpha = 0.5) +
labs(title = "Plot 3")
p4 &lt;- ggplot(mpg, aes(x = hwy, color = drv, fill = drv)) +
geom_density(alpha = 0.5) +
labs(title = "Plot 4")
p5 &lt;- ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
geom_point(show.legend = FALSE) +
facet_wrap(~drv) +
labs(title = "Plot 5")
(guide_area() / (p1 + p2) / (p3 + p4) / p5) +
plot_annotation(
title = "City and highway mileage for cars with different drive trains",
caption = "Source: https://fueleconomy.gov."
) +
plot_layout(
guides = "collect",
heights = c(1, 3, 2, 4)
) &amp;
theme(legend.position = "top")</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-45-1.png" alt="Five plots laid out such that the first two plots are next to each other. Plots three and four are underneath them. And the fifth plot stretches under them. The patchworked plot is titled &quot;City and highway mileage for cars with different drive trains&quot; and captioned &quot;Source: https://fueleconomy.gov&quot;. The first two plots are side-by-side box plots. Plots 3 and 4 are density plots. And the fifth plot is a faceted scatterplot. Each of these plots shows geoms colored by drive train, but the patchworked plot has only one legend that applies to all of them, above the plots and beneath the title." width="576"/></p>
</div>
</div>
<p>If you'd like to learn more about combining and laying out multiple plots with patchwork, we recommend looking through the guides on the package website: <a href="https://patchwork.data-imaginist.com" class="uri">https://patchwork.data-imaginist.com</a>.</p>
<section id="communication-exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>What happens if you omit the parentheses in the following plot layout? Can you explain why this happens?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">p1 &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
labs(title = "Plot 1")
p2 &lt;- ggplot(mpg, aes(x = drv, y = hwy)) +
geom_boxplot() +
labs(title = "Plot 2")
p3 &lt;- ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
labs(title = "Plot 3")
(p1 | p2) / p3</pre>
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-46-1.png" width="576"/></p>
</div>
</div>
</li>
<li>
<p>Using the three plots from the previous exercise, recreate the following patchwork.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="communication_files/figure-html/unnamed-chunk-47-1.png" alt="Three plots: Plot 1 is a scatterplot of highway mileage versus engine size. Plot 2 is side-by-side box plots of highway mileage versus drive train. Plot 3 is side-by-side box plots of city mileage versus drive train. Plot 1 is on the first row. Plots 2 and 3 are on the next row, each spanning half the width of Plot 1. Plot 1 is labelled &quot;Fig. A&quot;, Plot 2 is labelled &quot;Fig. B&quot;, and Plot 3 is labelled &quot;Fig. C&quot;." width="576"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="communication-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you've learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customizing the axis scales, and changing the theme of your plot. You've also learned about combining multiple plots in a single graph using both simple and complex plot layouts.</p>
<p>While you've so far learned about how to make many different types of plots and how to customize them using a variety of techniques, we've barely scratched the surface of what you can create with ggplot2. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, <a href="https://ggplot2-book.org"><em>ggplot2: Elegant Graphics for Data Analysis</em></a>. Other useful resources are the <a href="https://r-graphics.org"><em>R Graphics Cookbook</em></a> by Winston Chang and <a href="https://clauswilke.com/dataviz/"><em>Fundamentals of Data Visualization</em></a> by Claus Wilke.</p>
</section>
</section>


@ -1,595 +0,0 @@
<section data-type="chapter" id="chp-data-import">
<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1>
<section id="data-import-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Working with data provided by R packages is a great way to learn data science tools, but you want to apply what you've learned to your own data at some point. In this chapter, you'll learn the basics of reading data files into R.</p>
<p>Specifically, this chapter will focus on reading plain-text rectangular files. We'll start with practical advice for handling features like column names, types, and missing data. You will then learn about reading data from multiple files at once and writing data from R to a file. Finally, you'll learn how to handcraft data frames in R.</p>
<section id="data-import-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, you'll learn how to load flat files in R with the <strong>readr</strong> package, which is part of the core tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="reading-data-from-a-file" data-type="sect1">
<h1>
Reading data from a file</h1>
<p>To begin, we'll focus on the most common rectangular data file type: CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows provide the data.</p>
<div class="cell">
<pre><code>#&gt; Student ID,Full Name,favourite.food,mealPlan,AGE
#&gt; 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
#&gt; 2,Barclay Lynn,French fries,Lunch only,5
#&gt; 3,Jayendra Lyne,N/A,Breakfast and lunch,7
#&gt; 4,Leon Rossini,Anchovies,Lunch only,
#&gt; 5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
#&gt; 6,Güvenç Attila,Ice cream,Lunch only,6</code></pre>
</div>
<p><a href="#tbl-students-table" data-type="xref">#tbl-students-table</a> shows a representation of the same data as a table.</p>
<div class="cell">
<div class="cell-output-display">
<div id="tbl-students-table" class="anchored">
<table class="table table-sm table-striped"><caption>Table 8.1: Data from the students.csv file as a table.</caption>
<colgroup><col style="width: 15%"/><col style="width: 23%"/><col style="width: 26%"/><col style="width: 27%"/><col style="width: 6%"/></colgroup><thead><tr class="header"><th style="text-align: right;">Student ID</th>
<th style="text-align: left;">Full Name</th>
<th style="text-align: left;">favourite.food</th>
<th style="text-align: left;">mealPlan</th>
<th style="text-align: left;">AGE</th>
</tr></thead><tbody><tr class="odd"><td style="text-align: right;">1</td>
<td style="text-align: left;">Sunil Huffmann</td>
<td style="text-align: left;">Strawberry yoghurt</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">4</td>
</tr><tr class="even"><td style="text-align: right;">2</td>
<td style="text-align: left;">Barclay Lynn</td>
<td style="text-align: left;">French fries</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">5</td>
</tr><tr class="odd"><td style="text-align: right;">3</td>
<td style="text-align: left;">Jayendra Lyne</td>
<td style="text-align: left;">N/A</td>
<td style="text-align: left;">Breakfast and lunch</td>
<td style="text-align: left;">7</td>
</tr><tr class="even"><td style="text-align: right;">4</td>
<td style="text-align: left;">Leon Rossini</td>
<td style="text-align: left;">Anchovies</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">NA</td>
</tr><tr class="odd"><td style="text-align: right;">5</td>
<td style="text-align: left;">Chidiegwu Dunkel</td>
<td style="text-align: left;">Pizza</td>
<td style="text-align: left;">Breakfast and lunch</td>
<td style="text-align: left;">five</td>
</tr><tr class="even"><td style="text-align: right;">6</td>
<td style="text-align: left;">Güvenç Attila</td>
<td style="text-align: left;">Ice cream</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">6</td>
</tr></tbody></table></div>
</div>
</div>
<p>We can read this file into R using <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>. The first argument is the most important: it's the path to the file.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students &lt;- read_csv("data/students.csv")
#&gt; Rows: 6 Columns: 5
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (4): Full Name, favourite.food, mealPlan, AGE
#&gt; dbl (1): Student ID
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
<p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about retrieving the full column specification and how to quiet this message. This message is an integral part of readr, and we'll return to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
<section id="practical-advice" data-type="sect2">
<h2>
Practical advice</h2>
<p>Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let's take another look at the <code>students</code> data with that in mind.</p>
<p>In the <code>favourite.food</code> column, there are a bunch of food items, and then the character string <code>N/A</code>, which should have been a real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students &lt;- read_csv("data/students.csv", na = c("N/A", ""))
students
#&gt; # A tibble: 6 × 5
#&gt; `Student ID` `Full Name` favourite.food mealPlan AGE
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by backticks. That's because they contain spaces, breaking R's usual rules for variable names. To refer to them, you need to use those backticks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students |&gt;
rename(
student_id = `Student ID`,
full_name = `Full Name`
)
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite.food mealPlan AGE
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>An alternative approach is to use <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code>, which uses some heuristics to turn them all into snake case at once<span data-type="footnote">The <a href="http://sfirke.github.io/janitor/">janitor</a> package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use <code>|&gt;</code>.</span>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students |&gt; janitor::clean_names()
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_plan</code> is a categorical variable with a known set of possible values, which in R should be represented as a factor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students |&gt;
janitor::clean_names() |&gt;
mutate(
meal_plan = factor(meal_plan)
)
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>Note that the values in the <code>meal_plan</code> variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (<code>&lt;chr&gt;</code>) to factor (<code>&lt;fct&gt;</code>). You'll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
<p>Before you analyze these data, you'll probably want to fix the <code>age</code> column. Currently, it's a character variable because one of the observations is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students &lt;- students |&gt;
janitor::clean_names() |&gt;
mutate(
meal_plan = factor(meal_plan),
age = parse_number(if_else(age == "five", "5", age))
)
students
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</section>
<section id="other-arguments" data-type="sect2">
<h2>
Other arguments</h2>
<p>There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read CSV files that you've created in a string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"a,b,c
1,2,3
4,5,6"
)
#&gt; # A tibble: 2 × 3
#&gt; a b c
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>Usually, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But it's not uncommon for a few lines of metadata to be included at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"The first line of metadata
The second line of metadata
x,y,z
1,2,3",
skip = 2
)
#&gt; # A tibble: 1 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
read_csv(
"# A comment I want to skip
x,y,z
1,2,3",
comment = "#"
)
#&gt; # A tibble: 1 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3</pre>
</div>
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"1,2,3
4,5,6",
col_names = FALSE
)
#&gt; # A tibble: 2 × 3
#&gt; X1 X2 X3
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>Alternatively, you can pass <code>col_names</code> a character vector which will be used as the column names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"1,2,3
4,5,6",
col_names = c("x", "y", "z")
)
#&gt; # A tibble: 2 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>These arguments are all you need to know to read the majority of CSV files that you'll encounter in practice. (For the rest, you'll need to carefully inspect your <code>.csv</code> file and read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>'s many other arguments.)</p>
</section>
<section id="other-file-types" data-type="sect2">
<h2>
Other file types</h2>
<p>Once you've mastered <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, using readr's other functions is straightforward; it's just a matter of knowing which function to reach for:</p>
<ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon-separated files. These use <code>;</code> instead of <code>,</code> to separate fields and are common in countries that use <code>,</code> as the decimal marker.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> reads tab-delimited files.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don't specify it.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed-width files. You can specify fields by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or by their positions with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed-width files where columns are separated by white space.</p></li>
<li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache-style log files.</p></li>
</ul></section>
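<p>As a rough illustration of the functions listed under other file types, the sketch below reads three tiny inline files that differ only in their delimiter. It assumes readr 2.0 or later, where <code>I()</code> marks a string as literal data; the column names and values are made up for the example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Semicolon-separated fields, with "," as the decimal mark
semi &lt;- read_csv2(I("x;y\n1,5;2\n3,25;4"))

# Tab-separated fields
tabs &lt;- read_tsv(I("x\ty\n1\t2\n3\t4"))

# Any other single-character delimiter, supplied explicitly
colons &lt;- read_delim(I("x:y\n1:2\n3:4"), delim = ":")

semi$x  # the decimal commas are parsed as 1.5 and 3.25</pre>
</div>
<p>All three calls return tibbles with the same shape, so the code that follows an import usually doesn't need to care which reader produced the data.</p>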
<section id="data-import-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What function would you use to read a file where fields were separated with “|”?</p></li>
<li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> have in common?</p></li>
<li><p>What are the most important arguments to <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code>?</p></li>
<li>
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems, they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. To read the following text into a data frame, what argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">"x,y\n1,'a,b'"</pre>
</div>
</li>
<li>
<p>Identify what is wrong with each of the following inline CSV files. What happens when you run the code?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")</pre>
</div>
</li>
<li>
<p>Practice referring to non-syntactic names in the following data frame by:</p>
<ol type="a"><li>Extracting the variable called <code>1</code>.</li>
<li>Plotting a scatterplot of <code>1</code> vs. <code>2</code>.</li>
<li>Creating a new column called <code>3</code>, which is <code>2</code> divided by <code>1</code>.</li>
<li>Renaming the columns to <code>one</code>, <code>two</code>, and <code>three</code>.</li>
</ol><div class="cell">
<pre data-type="programlisting" data-code-language="r">annoying &lt;- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)</pre>
</div>
</li>
</ol></section>
</section>
<section id="sec-col-types" data-type="sect1">
<h1>
Controlling column types</h1>
<p>A CSV file doesn't contain any information about the type of each variable (i.e., whether it's a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we'll mention a few general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.</p>
<section id="guessing-types" data-type="sect2">
<h2>
Guessing types</h2>
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring missing values. It then works through the following questions:</p>
<ul><li>Does it contain only <code>F</code>, <code>T</code>, <code>FALSE</code>, or <code>TRUE</code> (ignoring case)? If so, it's a logical.</li>
<li>Does it contain only numbers (e.g., <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, it's a number.</li>
<li>Does it match the ISO8601 standard? If so, it's a date or date-time. (We'll return to date-times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>.)</li>
<li>Otherwise, it must be a string.</li>
</ul><p>You can see that behavior in action in this simple example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv("
logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi"
)
#&gt; Rows: 3 Columns: 4
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): string
#&gt; dbl (1): numeric
#&gt; lgl (1): logical
#&gt; date (1): date
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.
#&gt; # A tibble: 3 × 4
#&gt; logical numeric date string
#&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;date&gt; &lt;chr&gt;
#&gt; 1 TRUE 1 2021-01-15 abc
#&gt; 2 FALSE 4.5 2021-02-15 def
#&gt; 3 TRUE Inf 2021-02-16 ghi</pre>
</div>
<p>This heuristic works well if you have a clean dataset, but in real life, you'll encounter a selection of weird and beautiful failures.</p>
</section>
<section id="missing-values-column-types-and-problems" data-type="sect2">
<h2>
Missing values, column types, and problems</h2>
<p>The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type. One of the most common causes for this is a missing value, recorded using something other than the <code>NA</code> that readr expects.</p>
<p>Take this simple one-column CSV file as an example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
x
10
.
20
30"</pre>
</div>
<p>If we read it without any additional arguments, <code>x</code> becomes a character column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- read_csv(csv)
#&gt; Rows: 4 Columns: 1
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): x
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled among them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- read_csv(csv, col_types = list(x = col_double()))
#&gt; Warning: One or more parsing issues, call `problems()` on your data frame for
#&gt; details, e.g.:
#&gt; dat &lt;- vroom(...)
#&gt; problems(dat)</pre>
</div>
<p>Now <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> reports that there was a problem, and tells us we can find out more with <code><a href="https://readr.tidyverse.org/reference/problems.html">problems()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">problems(df)
#&gt; # A tibble: 1 × 5
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/Rtmpx37bAU/filec1bb57d587a7</pre>
</div>
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So when we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- read_csv(csv, na = ".")
#&gt; Rows: 4 Columns: 1
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; dbl (1): x
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
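<p>The two steps can also be combined: declare the missing-value marker and the expected type together, then confirm that nothing was flagged. This is only a sketch, assuming readr 2.0 or later (for <code>I()</code>) and reusing the same one-column data under the hypothetical name <code>csv_text</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

csv_text &lt;- "x\n10\n.\n20\n30"

# Treat "." as missing and require x to be a double
checked &lt;- read_csv(
  I(csv_text),
  na = ".",
  col_types = list(x = col_double())
)

nrow(problems(checked))  # zero rows means no parsing issues were recorded
checked$x                # the "." entry is now NA</pre>
</div>
<p>Supplying <code>col_types</code> also silences the column specification message, which is convenient once you know what a file should contain.</p>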
</section>
<section id="column-types" data-type="sect2">
<h2>
Column types</h2>
<p>readr provides a total of nine column types for you to use:</p>
<ul><li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_logical()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_double()</a></code> read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We seldom distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_character()</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies some object, but that it doesn't make sense to (e.g.) divide in half.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code>, and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/parse_number.html">col_number()</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li>
<li>
<code><a href="https://readr.tidyverse.org/reference/col_skip.html">col_skip()</a></code> skips a column so it's not included in the result.</li>
</ul><p>It's also possible to override the default column type by switching from <code><a href="https://rdrr.io/r/base/list.html">list()</a></code> to <code><a href="https://readr.tidyverse.org/reference/cols.html">cols()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
x,y,z
1,2,3"
read_csv(csv, col_types = cols(.default = col_character()))
#&gt; # A tibble: 1 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 2 3</pre>
</div>
<p>Another useful helper is <code><a href="https://readr.tidyverse.org/reference/cols.html">cols_only()</a></code> which will read in only the columns you specify:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">read_csv(
"x,y,z
1,2,3",
col_types = cols_only(x = col_character())
)
#&gt; # A tibble: 1 × 1
#&gt; x
#&gt; &lt;chr&gt;
#&gt; 1 1</pre>
</div>
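<p>The two ideas can be combined. The following is a minimal sketch, reusing the toy <code>csv</code> string from above, of how you might set a default type with <code>.default</code> and still override a single column by name; the choice of <code>x</code> as the overridden column is only for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

csv &lt;- "
x,y,z
1,2,3"

# Read every column as character by default, but parse `x` as a double.
df &lt;- read_csv(
  csv,
  col_types = cols(.default = col_character(), x = col_double())
)

# Inspect the resulting column types.
sapply(df, class)</pre>
</div>
<p>Here <code>x</code> should come back as a number while <code>y</code> and <code>z</code> stay as strings.</p>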
</section>
</section>
<section id="sec-readr-directory" data-type="sect1">
<h1>
Reading data from multiple files</h1>
<p>Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: <code>01-sales.csv</code> for January, <code>02-sales.csv</code> for February, and <code>03-sales.csv</code> for March. With <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> you can read these files in at once and stack them on top of each other in a single data frame.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sales_files &lt;- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#&gt; Rows: 19 Columns: 6
#&gt; ── Column specification ─────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): month
#&gt; dbl (4): year, brand, item, n
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.
#&gt; # A tibble: 19 × 6
#&gt; file month year brand item n
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 data/01-sales.csv January 2019 1 1234 3
#&gt; 2 data/01-sales.csv January 2019 1 8721 9
#&gt; 3 data/01-sales.csv January 2019 1 1822 2
#&gt; 4 data/01-sales.csv January 2019 2 3333 1
#&gt; 5 data/01-sales.csv January 2019 2 2156 9
#&gt; 6 data/01-sales.csv January 2019 2 3987 6
#&gt; # … with 13 more rows</pre>
</div>
<p>With the additional <code>id</code> parameter we have added a new column called <code>file</code> to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.</p>
<p>If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sales_files &lt;- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
#&gt; [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"</pre>
</div>
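<p>To see the two steps working together without depending on the book’s <code>data/</code> directory, here is a small self-contained sketch. The temporary directory, the file contents, and the object names are all made up for illustration; only <code>list.files()</code>, <code>write_csv()</code>, and <code>read_csv()</code> with <code>id</code> come from the text above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# Hypothetical throwaway directory standing in for `data/`.
dir &lt;- file.path(tempdir(), "sales-demo")
dir.create(dir, showWarnings = FALSE)

# Two tiny made-up monthly files.
write_csv(data.frame(month = "January", n = c(3, 9)), file.path(dir, "01-sales.csv"))
write_csv(data.frame(month = "February", n = c(4, 1)), file.path(dir, "02-sales.csv"))

# Find the files by pattern, then read and stack them in one call.
sales_files &lt;- list.files(dir, pattern = "sales\\.csv$", full.names = TRUE)
sales &lt;- read_csv(sales_files, id = "file", show_col_types = FALSE)
sales</pre>
</div>
<p>The result should have one row per input row, with the <code>file</code> column recording which file each row came from.</p>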
</section>
<section id="sec-writing-to-a-file" data-type="sect1">
<h1>
Writing to a file</h1>
<p>readr also comes with two useful functions for writing data back to disk: <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_tsv()</a></code>. Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.</p>
<p>The most important arguments are <code>x</code> (the data frame to save), and <code>file</code> (the location to save it). You can also specify how missing values are written with <code>na</code>, and if you want to <code>append</code> to an existing file.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">write_csv(students, "students.csv")</pre>
</div>
<p>Now let’s read that CSV file back in. Note that the type information is lost when you save to CSV:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">students
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>This makes CSVs a little unreliable for caching interim results, because you need to recreate the column specification every time you load the data. There are two main alternatives:</p>
<ol type="1"><li>
<p><code><a href="https://readr.tidyverse.org/reference/read_rds.html">write_rds()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_rds.html">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="https://rdrr.io/r/base/readRDS.html">saveRDS()</a></code> and <code><a href="https://rdrr.io/r/base/readRDS.html">readRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">write_rds(students, "students.rds")
read_rds("students.rds")
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
<li>
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages. We’ll return to arrow in more depth in <a href="#chp-arrow" data-type="xref">#chp-arrow</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(arrow)
write_parquet(students, "students.parquet")
read_parquet("students.parquet")
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne NA Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
</ol><p>Parquet tends to be much faster than RDS and is usable outside of R, but does require the arrow package.</p>
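<p>If you do need to stay with CSV, one option is to supply the column specification again when reading. The sketch below uses a small made-up stand-in for <code>students</code> (the object names <code>demo</code>, <code>path</code>, and <code>restored</code> are assumptions for illustration) and shows <code>meal_plan</code> coming back as a factor once <code>col_factor()</code> is requested.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)
library(tibble)

# A small stand-in for `students`; only the factor column matters here.
demo &lt;- tibble(
  student_id = c(1, 2),
  meal_plan = factor(c("Lunch only", "Breakfast and lunch"))
)

path &lt;- tempfile(fileext = ".csv")
write_csv(demo, path)

# Without a specification, meal_plan is guessed as character.
plain &lt;- read_csv(path, show_col_types = FALSE)
class(plain$meal_plan)

# Supplying the type again restores a factor; other columns are still guessed.
restored &lt;- read_csv(path, col_types = cols(meal_plan = col_factor()))
class(restored$meal_plan)</pre>
</div>
<p>Note that with no <code>levels</code> supplied, <code>col_factor()</code> takes the levels in order of appearance, so the level order may differ from the original factor.</p>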
</section>
<section id="data-entry" data-type="sect1">
<h1>
Data entry</h1>
<p>Sometimes you’ll need to assemble a tibble “by hand”, doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> works by column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tibble(
x = c(1, 2, 5),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.60)
)
#&gt; # A tibble: 3 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 1 h 0.08
#&gt; 2 2 m 0.83
#&gt; 3 5 g 0.6</pre>
</div>
<p>Note that every column in a tibble must be the same size, so you’ll get an error if they’re not:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tibble(
x = c(1, 2),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.6)
)
#&gt; Error:
#&gt; ! Tibble columns must have compatible sizes.
#&gt; • Size 2: Existing data.
#&gt; • Size 3: Column `y`.
#&gt; Only values of size one are recycled.</pre>
</div>
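<p>The last line of that error message points at the one exception: a value of size one is recycled to match the other columns. A minimal sketch (the object name <code>recycled</code> is just for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tibble)

# `y` has size one, so it is repeated for every row of `x`.
recycled &lt;- tibble(
  x = c(1, 2, 5),
  y = "h"
)
recycled</pre>
</div>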
<p>Laying out the data by column can make it hard to see how the rows are related, so an alternative is <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>, short for <strong>tr</strong>ansposed t<strong>ibble</strong>, which lets you lay out your data row by row. <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> is customized for data entry in code: column headings start with <code>~</code> and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tribble(
~x, ~y, ~z,
"h", 1, 0.08,
"m", 2, 0.83,
"g", 5, 0.60,
)
#&gt; # A tibble: 3 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 h 1 0.08
#&gt; 2 m 2 0.83
#&gt; 3 g 5 0.6</pre>
</div>
<p>We’ll use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> later in the book to construct small examples to demonstrate how various functions work.</p>
</section>
<section id="data-import-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> covers Excel and Google Sheets, <a href="#chp-databases" data-type="xref">#chp-databases</a> shows you how to load data from databases, <a href="#chp-arrow" data-type="xref">#chp-arrow</a> covers parquet files, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> covers JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> covers websites.</p>
<p>Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p>
</section>
</section>

View File

@ -1,850 +0,0 @@
<section data-type="chapter" id="chp-data-tidy">
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1>
<section id="data-tidy-introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
<p>“Happy families are all alike; every unhappy family is unhappy in its own way.”<br/>
— Leo Tolstoy</p>
</blockquote>
<blockquote class="blockquote">
<p>“Tidy datasets are all alike, but every messy dataset is messy in its own way.”<br/>
— Hadley Wickham</p>
</blockquote>
<p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.</p>
<section id="data-tidy-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
<p>From this chapter on, we’ll suppress the loading message from <code><a href="https://tidyverse.tidyverse.org">library(tidyverse)</a></code>.</p>
</section>
</section>
<section id="sec-tidy-data" data-type="sect1">
<h1>
Tidy data</h1>
<p>You can represent the same underlying data in multiple ways. The example below shows the same data organized in four different ways. Each dataset shows the same values of four variables: <em>country</em>, <em>year</em>, <em>population</em>, and <em>cases</em> of TB (tuberculosis), but each dataset organizes the values in a different way.</p>
<!-- TODO redraw as tables -->
<div class="cell">
<pre data-type="programlisting" data-code-language="r">table1
#&gt; # A tibble: 6 × 4
#&gt; country year cases population
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 745 19987071
#&gt; 2 Afghanistan 2000 2666 20595360
#&gt; 3 Brazil 1999 37737 172006362
#&gt; 4 Brazil 2000 80488 174504898
#&gt; 5 China 1999 212258 1272915272
#&gt; 6 China 2000 213766 1280428583
table2
#&gt; # A tibble: 12 × 4
#&gt; country year type count
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 cases 745
#&gt; 2 Afghanistan 1999 population 19987071
#&gt; 3 Afghanistan 2000 cases 2666
#&gt; 4 Afghanistan 2000 population 20595360
#&gt; 5 Brazil 1999 cases 37737
#&gt; 6 Brazil 1999 population 172006362
#&gt; # … with 6 more rows
table3
#&gt; # A tibble: 6 × 3
#&gt; country year rate
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 Afghanistan 1999 745/19987071
#&gt; 2 Afghanistan 2000 2666/20595360
#&gt; 3 Brazil 1999 37737/172006362
#&gt; 4 Brazil 2000 80488/174504898
#&gt; 5 China 1999 212258/1272915272
#&gt; 6 China 2000 213766/1280428583
# Spread across two tibbles
table4a # cases
#&gt; # A tibble: 3 × 3
#&gt; country `1999` `2000`
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 745 2666
#&gt; 2 Brazil 37737 80488
#&gt; 3 China 212258 213766
table4b # population
#&gt; # A tibble: 3 × 3
#&gt; country `1999` `2000`
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 19987071 20595360
#&gt; 2 Brazil 172006362 174504898
#&gt; 3 China 1272915272 1280428583</pre>
</div>
<p>These are all representations of the same underlying data, but they are not equally easy to use. One of them, <code>table1</code>, will be much easier to work with inside the tidyverse because it’s tidy.</p>
<p>There are three interrelated rules that make a dataset tidy:</p>
<ol type="1"><li>Each variable is a column; each column is a variable.</li>
<li>Each observation is a row; each row is an observation.</li>
<li>Each value is a cell; each cell is a single value.</li>
</ol><p><a href="#fig-tidy-structure" data-type="xref">#fig-tidy-structure</a> shows the rules visually.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-tidy-structure"><p><img src="images/tidy-1.png" alt="Three panels, each representing a tidy data frame. The first panel shows that each variable is a column. The second panel shows that each observation is a row. The third panel shows that each value is a cell." width="683"/></p>
<figcaption>The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.</figcaption>
</figure>
</div>
</div>
<p>Why ensure that your data is tidy? There are two main advantages:</p>
<ol type="1"><li><p>There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.</p></li>
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with <code>table1</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Compute rate per 10,000
table1 |&gt;
mutate(
rate = cases / population * 10000
)
#&gt; # A tibble: 6 × 5
#&gt; country year cases population rate
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 745 19987071 0.373
#&gt; 2 Afghanistan 2000 2666 20595360 1.29
#&gt; 3 Brazil 1999 37737 172006362 2.19
#&gt; 4 Brazil 2000 80488 174504898 4.61
#&gt; 5 China 1999 212258 1272915272 1.67
#&gt; 6 China 2000 213766 1280428583 1.67
# Compute cases per year
table1 |&gt;
count(year, wt = cases)
#&gt; # A tibble: 2 × 2
#&gt; year n
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1999 250740
#&gt; 2 2000 296920
# Visualise changes over time
ggplot(table1, aes(x = year, y = cases)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country, shape = country)) +
scale_x_continuous(breaks = c(1999, 2000))</pre>
<div class="cell-output-display">
<p><img src="data-tidy_files/figure-html/unnamed-chunk-5-1.png" alt="This figure shows the number of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale." width="480"/></p>
</div>
</div>
<section id="data-tidy-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Using prose, describe how the variables and observations are organised in each of the sample tables.</p></li>
<li>
<p>Sketch out the process you’d use to calculate the <code>rate</code> for <code>table2</code> and <code>table4a</code> + <code>table4b</code>. You will need to perform four operations:</p>
<ol type="a"><li>Extract the number of TB cases per country per year.</li>
<li>Extract the matching population per country per year.</li>
<li>Divide cases by population, and multiply by 10000.</li>
<li>Store back in the appropriate place.</li>
</ol><p>You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.</p>
</li>
<li><p>Recreate the plot showing change in cases over time using <code>table2</code> instead of <code>table1</code>. What do you need to do first?</p></li>
</ol></section>
</section>
<section id="sec-pivoting" data-type="sect1">
<h1>
Pivoting</h1>
<p>The principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:</p>
<ol type="1"><li><p>Data is often organised to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.</p></li>
<li><p>Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.</p></li>
</ol><p>This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll <strong>pivot</strong> your data into a tidy form, with variables in the columns and observations in the rows.</p>
<p>tidyr provides two functions for pivoting data: <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>, which makes datasets <strong>longer</strong> by increasing rows and reducing columns, and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which makes datasets <strong>wider</strong> by increasing columns and reducing rows. The following sections work through the use of <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to tackle a wide range of realistic datasets. These examples are drawn from <code><a href="https://tidyr.tidyverse.org/articles/pivot.html">vignette("pivot", package = "tidyr")</a></code>, which you should check out if you want to see more variations and more challenging problems.</p>
<p>Let’s dive in.</p>
<section id="sec-billboard" data-type="sect2">
<h2>
Data in column names</h2>
<p>The <code>billboard</code> dataset records the billboard rank of songs in the year 2000:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard
#&gt; # A tibble: 317 × 79
#&gt; artist track date.entered wk1 wk2 wk3 wk4 wk5
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87
#&gt; 2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA
#&gt; 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66
#&gt; 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67
#&gt; 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17
#&gt; 6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26
#&gt; # … with 311 more rows, and 71 more variables: wk6 &lt;dbl&gt;, wk7 &lt;dbl&gt;,
#&gt; # wk8 &lt;dbl&gt;, wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;, wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
<ul><li>
<code>cols</code> specifies which columns need to be pivoted, i.e., which columns aren’t variables. This argument uses the same syntax as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, so here we could use <code>!c(artist, track, date.entered)</code> or <code>starts_with("wk")</code>.</li>
<li>
<code>names_to</code> names the variable stored in the column names, here <code>"week"</code>.</li>
<li>
<code>values_to</code> names the variable stored in the cell values, here <code>"rank"</code>.</li>
</ul><p>That gives the following call:</p>
<div class="cell" data-r.options="{&quot;pillar.print_min&quot;:10}">
<pre data-type="programlisting" data-code-language="r">billboard |&gt;
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank"
)
#&gt; # A tibble: 24,092 × 5
#&gt; artist track date.entered week rank
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#&gt; 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#&gt; 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#&gt; 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#&gt; 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#&gt; 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#&gt; 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
#&gt; 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
#&gt; 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
#&gt; 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
#&gt; # … with 24,082 more rows</pre>
</div>
<p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard |&gt;
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
#&gt; # A tibble: 5,307 × 5
#&gt; artist track date.entered week rank
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#&gt; 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#&gt; 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#&gt; 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#&gt; 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#&gt; 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#&gt; # … with 5,301 more rows</pre>
</div>
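<p>The two ways of writing <code>cols</code> mentioned earlier select the same columns in this dataset. As a quick check, and only as a sketch (the object names <code>by_prefix</code> and <code>by_exclusion</code> are made up for illustration), you could pivot both ways and compare the results:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Select the week columns by their shared prefix...
by_prefix &lt;- billboard |&gt;
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

# ...or by excluding the three columns that are already variables.
by_exclusion &lt;- billboard |&gt;
  pivot_longer(
    cols = !c(artist, track, date.entered),
    names_to = "week",
    values_to = "rank"
  )

# Compare the two results.
all.equal(by_prefix, by_exclusion)</pre>
</div>
<p>Which form you prefer is a matter of taste: selecting by prefix is more robust if new non-week columns are added later, while excluding by name is more robust if the week columns are ever renamed.</p>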
<p>You might also wonder what happens if a song is in the top 100 for more than 76 weeks. We can’t tell from this data, but you might guess that additional columns <code>wk77</code>, <code>wk78</code>, … would be added to the dataset.</p>
<p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">readr::parse_number()</a></code>. <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard_tidy &lt;- billboard |&gt;
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) |&gt;
mutate(
week = parse_number(week)
)
billboard_tidy
#&gt; # A tibble: 5,307 × 5
#&gt; artist track date.entered week rank
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#&gt; 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#&gt; 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#&gt; 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#&gt; 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#&gt; 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#&gt; # … with 5,301 more rows</pre>
</div>
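<p>To get a feel for what <code>parse_number()</code> is doing in that pipeline, it can help to call it directly on a few strings. This is a small sketch; the currency string is an extra made-up input, not something from the billboard data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# The "wk" prefix is ignored and the number is kept.
parse_number(c("wk1", "wk10", "wk76"))
#&gt; [1]  1 10 76

# Currency symbols and grouping marks are ignored too (default locale).
parse_number("$1,234.56")
#&gt; [1] 1234.56</pre>
</div>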
<p>Now we’re in a good position to look at how song ranks vary over time by drawing a plot. The code is shown below and the result is <a href="#fig-billboard-ranks" data-type="xref">#fig-billboard-ranks</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard_tidy |&gt;
ggplot(aes(x = week, y = rank, group = track)) +
geom_line(alpha = 1/3) +
scale_y_reverse()</pre>
<div class="cell-output-display">
<figure id="fig-billboard-ranks"><p><img src="data-tidy_files/figure-html/fig-billboard-ranks-1.png" alt="A line plot with week on the x-axis and rank on the y-axis, where each line represents a song. Most songs appear to start at a high rank, rapidly accelerate to a low rank, and then decay again. There are surprisingly few tracks in the region when week is &gt;20 and rank is &gt;50." width="576"/></p>
<figcaption>A line plot showing how the rank of a song changes over time.</figcaption>
</figure>
</div>
</div>
</section>
<section id="how-does-pivoting-work" data-type="sect2">
<h2>
How does pivoting work?</h2>
<p>Now that you’ve seen what pivoting can do for you, it’s worth taking a little time to gain some intuition about what it does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
~var, ~col1, ~col2,
"A", 1, 2,
"B", 3, 4,
"C", 5, 6
)</pre>
</div>
<p>Here we’ll say there are three variables: <code>var</code> (already in a column), <code>names</code> (stored in the column names), and <code>values</code> (the cell values). So we can tidy it with:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
pivot_longer(
cols = col1:col2,
names_to = "names",
values_to = "values"
)
#&gt; # A tibble: 6 × 3
#&gt; var names values
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 A col1 1
#&gt; 2 A col2 2
#&gt; 3 B col1 3
#&gt; 4 B col2 4
#&gt; 5 C col1 5
#&gt; 6 C col2 6</pre>
</div>
<p>How does this transformation take place? It’s easier to see if we take it component by component. Columns that are already variables need to be repeated, once for each column in <code>cols</code>, as shown in <a href="#fig-pivot-variables" data-type="xref">#fig-pivot-variables</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pivot-variables"><p><img src="diagrams/tidy-data/variables.png" alt="A diagram showing how `pivot_longer()` transforms a simple dataset, using color to highlight how the values in the `var` column (&quot;A&quot;, &quot;B&quot;, &quot;C&quot;) are each repeated twice in the output because there are two columns being pivoted (&quot;col1&quot; and &quot;col2&quot;)." width="469"/></p>
<figcaption>Columns that are already variables need to be repeated, once for each column that is pivoted.</figcaption>
</figure>
</div>
</div>
<p>The column names become values in a new variable, whose name is given by <code>names_to</code>, as shown in <a href="#fig-pivot-names" data-type="xref">#fig-pivot-names</a>. They need to be repeated once for each row in the original dataset.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pivot-names"><p><img src="diagrams/tidy-data/column-names.png" alt="A diagram showing how `pivot_longer()` transforms a simple data set, using color to highlight how column names (&quot;col1&quot; and &quot;col2&quot;) become the values in a new `var` column. They are repeated three times because there were three rows in the input." width="469"/></p>
<figcaption>The column names of pivoted columns become a new column.</figcaption>
</figure>
</div>
</div>
<p>The cell values also become values in a new variable, with a name given by <code>values_to</code>. They are unwound row by row. <a href="#fig-pivot-values" data-type="xref">#fig-pivot-values</a> illustrates the process.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pivot-values"><p><img src="diagrams/tidy-data/cell-values.png" alt="A diagram showing how `pivot_longer()` transforms data, using color to highlight how the cell values (the numbers 1 to 6) become the values in a new `value` column. They are unwound row-by-row, so the original rows (1,2), then (3,4), then (5,6), become a column running from 1 to 6." width="469"/></p>
<figcaption>The number of values is preserved (not repeated), but unwound row-by-row.</figcaption>
</figure>
</div>
</div>
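<p>The three steps can also be written out by hand, which may help confirm the intuition. The following is only a sketch of the idea using base R’s <code>rep()</code> and <code>t()</code>; it is not how <code>pivot_longer()</code> is implemented, and the names <code>by_hand</code> and <code>by_pivot</code> are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

df &lt;- tribble(
  ~var, ~col1, ~col2,
  "A", 1, 2,
  "B", 3, 4,
  "C", 5, 6
)

by_hand &lt;- tibble(
  # Existing variables: each value repeated once per pivoted column.
  var = rep(df$var, each = 2),
  # Column names: the full set repeated once per original row.
  names = rep(c("col1", "col2"), times = nrow(df)),
  # Cell values: unwound row by row (transpose, then read down the columns).
  values = as.vector(t(as.matrix(df[c("col1", "col2")])))
)

by_pivot &lt;- df |&gt;
  pivot_longer(
    cols = col1:col2,
    names_to = "names",
    values_to = "values"
  )

# Compare the hand-built tibble with the pivoted one.
all.equal(by_hand, by_pivot)</pre>
</div>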
</section>
<section id="many-variables-in-column-names" data-type="sect2">
<h2>
Many variables in column names</h2>
<p>A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the <code>who2</code> dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">who2
#&gt; # A tibble: 7,240 × 58
#&gt; country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1980 NA NA NA NA NA
#&gt; 2 Afghanistan 1981 NA NA NA NA NA
#&gt; 3 Afghanistan 1982 NA NA NA NA NA
#&gt; 4 Afghanistan 1983 NA NA NA NA NA
#&gt; 5 Afghanistan 1984 NA NA NA NA NA
#&gt; 6 Afghanistan 1985 NA NA NA NA NA
#&gt; # … with 7,234 more rows, and 51 more variables: sp_m_5564 &lt;dbl&gt;,
#&gt; # sp_m_65 &lt;dbl&gt;, sp_f_014 &lt;dbl&gt;, sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, …</pre>
</div>
<p>This dataset records information about tuberculosis diagnoses collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the original column name into pieces:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">who2 |&gt;
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
#&gt; # A tibble: 405,440 × 6
#&gt; country year diagnosis gender age count
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1980 sp m 014 NA
#&gt; 2 Afghanistan 1980 sp m 1524 NA
#&gt; 3 Afghanistan 1980 sp m 2534 NA
#&gt; 4 Afghanistan 1980 sp m 3544 NA
#&gt; 5 Afghanistan 1980 sp m 4554 NA
#&gt; 6 Afghanistan 1980 sp m 5564 NA
#&gt; # … with 405,434 more rows</pre>
</div>
<p>An alternative to <code>names_sep</code> is <code>names_pattern</code>, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
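<p>As a rough preview of what that looks like, here is a minimal sketch on a made-up data frame (not <code>who2</code>). The column names <code>x_1</code> and <code>y_2</code> and the regular expression are assumptions for illustration; each parenthesised group in the pattern fills one entry of <code>names_to</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)

# Hypothetical toy data: column names combine a letter and a number
df &lt;- tribble(
  ~id, ~x_1, ~y_2,
  "A",    1,    2,
  "B",    3,    4
)

by_pattern &lt;- df |&gt;
  pivot_longer(
    cols = !id,
    names_to = c("name", "number"),
    # One capture group per entry in names_to
    names_pattern = "([a-z]+)_([0-9]+)",
    values_to = "value"
  )

by_pattern</pre>
</div>
<p>For names this regular, <code>names_sep = "_"</code> would do the same job; <code>names_pattern</code> earns its keep when there is no single separator to split on.</p>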
<p>Conceptually, splitting column names into several variables is only a minor variation on the simpler case you’ve already seen. <a href="#fig-pivot-multiple-names" data-type="xref">#fig-pivot-multiple-names</a> shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that gives better performance.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pivot-multiple-names"><p><img src="diagrams/tidy-data/multiple-names.png" alt="A diagram that uses color to illustrate how supplying `names_sep` and multiple `names_to` creates multiple variables in the output. The input has variable names &quot;x_1&quot; and &quot;y_2&quot; which are split up by &quot;_&quot; to create name and number columns in the output. This is similar to the case with a single `names_to`, but what would have been a single output variable is now separated into multiple variables." width="600"/></p>
<figcaption>Pivoting with many variables in the column names means that each column name now fills in values in multiple output columns.</figcaption>
</figure>
</div>
</div>
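<p>The “two steps” mental model can be written out directly. This is only a sketch: it assumes a toy data frame like the one in the figure, an intermediate column we’ve called <code>key</code>, and <code>separate_wider_delim()</code> from tidyr 1.3.0 or later. It is not how <code>pivot_longer()</code> is implemented.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)

# Hypothetical toy data in the spirit of the figure
df &lt;- tribble(
  ~id, ~x_1, ~y_2,
  "A",    1,    2,
  "B",    3,    4
)

# One step: split the names while pivoting
one_step &lt;- df |&gt;
  pivot_longer(
    cols = !id,
    names_to = c("name", "number"),
    names_sep = "_",
    values_to = "value"
  )

# Two steps: pivot into a single `key` column, then split it
two_step &lt;- df |&gt;
  pivot_longer(
    cols = !id,
    names_to = "key",
    values_to = "value"
  ) |&gt;
  separate_wider_delim(key, delim = "_", names = c("name", "number"))

all.equal(one_step, two_step)</pre>
</div>
<p>Both routes should give the same four-row result; the one-step version simply avoids building the intermediate column.</p>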
</section>
<section id="data-and-variable-names-in-the-column-headers" data-type="sect2">
<h2>
Data and variable names in the column headers</h2>
<p>The next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the <code>household</code> dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">household
#&gt; # A tibble: 5 × 5
#&gt; family dob_child1 dob_child2 name_child1 name_child2
#&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 1998-11-26 2000-01-29 Susan Jose
#&gt; 2 2 1996-06-22 NA Mark &lt;NA&gt;
#&gt; 3 3 2002-07-11 2004-04-05 Sam Seth
#&gt; 4 4 2004-10-10 2009-08-27 Craig Khai
#&gt; 5 5 2000-12-05 2005-02-28 Parker Gracie</pre>
</div>
<p>This dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (<code>dob</code>, <code>name</code>) and the values of another (<code>child</code>, with values 1 and 2). To solve this problem we again need to supply a vector to <code>names_to</code> but this time we use the special <code>".value"</code> sentinel. This overrides the usual <code>values_to</code> argument to use the first component of the pivoted column name as a variable name in the output.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">household |&gt;
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  ) |&gt;
  mutate(
    child = parse_number(child)
  )
#&gt; # A tibble: 9 × 4
#&gt; family child dob name
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;date&gt; &lt;chr&gt;
#&gt; 1 1 1 1998-11-26 Susan
#&gt; 2 1 2 2000-01-29 Jose
#&gt; 3 2 1 1996-06-22 Mark
#&gt; 4 3 1 2002-07-11 Sam
#&gt; 5 3 2 2004-04-05 Seth
#&gt; 6 4 1 2004-10-10 Craig
#&gt; # … with 3 more rows</pre>
</div>
<p>We again use <code>values_drop_na = TRUE</code>, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> to convert (e.g.) <code>child1</code> into 1.</p>
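<p>To see what those two helpers contribute, here is a small sketch on a hypothetical two-family subset (the object name <code>small</code> is ours, and dates are kept as plain strings for brevity). It compares the row counts with and without <code>values_drop_na</code>, then runs <code>parse_number()</code> on its own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)
library(readr)

# Hypothetical subset: family 2 has only one child
small &lt;- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~name_child1, ~name_child2,
  1L,      "1998-11-26", "2000-01-29", "Susan",      "Jose",
  2L,      "1996-06-22", NA,           "Mark",       NA
)

# Without values_drop_na, family 2 gets an all-missing row for child2
kept &lt;- small |&gt;
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_"
  )

# With values_drop_na = TRUE, that row is removed
dropped &lt;- small |&gt;
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )

nrow(kept)
nrow(dropped)

# parse_number() keeps just the numeric part of each string
parse_number(c("child1", "child2"))</pre>
</div>
<p>The first result has four rows and the second has three, because the only row removed is the one where both <code>dob</code> and <code>name</code> are missing.</p>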
<p><a href="#fig-pivot-names-and-values" data-type="xref">#fig-pivot-names-and-values</a> illustrates the basic idea with a simpler example. When you use <code>".value"</code> in <code>names_to</code>, the column names in the input contribute to both values and variable names in the output.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pivot-names-and-values"><p><img src="diagrams/tidy-data/names-and-values.png" alt="A diagram that uses color to illustrate how the special &quot;.value&quot; sentinel works. The input has names &quot;x_1&quot;, &quot;x_2&quot;, &quot;y_1&quot;, and &quot;y_2&quot;, and we want to use the first component (&quot;x&quot;, &quot;y&quot;) as a variable name and the second (&quot;1&quot;, &quot;2&quot;) as the value for a new &quot;id&quot; column." width="540"/></p>
<figcaption>Pivoting with <code>names_to = c(".value", "id")</code> splits the column names into two components: the first part determines the output column name (<code>x</code> or <code>y</code>), and the second part determines the value of the <code>id</code> column.</figcaption>
</figure>
</div>
</div>
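<p>The figure above can be reproduced in a few lines. In this sketch the input columns follow the figure, but the <code>grp</code> column and the cell values are made up so that the rows can be told apart.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)

# Hypothetical input: names combine a variable name (x, y) and an id (1, 2)
df &lt;- tribble(
  ~grp, ~x_1, ~x_2, ~y_1, ~y_2,
  "a",     1,    2,    3,    4,
  "b",     5,    6,    7,    8
)

long &lt;- df |&gt;
  pivot_longer(
    cols = !grp,
    # First component becomes an output column name, second fills `id`
    names_to = c(".value", "id"),
    names_sep = "_"
  )

long</pre>
</div>
<p>The result has one row per combination of <code>grp</code> and <code>id</code>, with <code>x</code> and <code>y</code> as ordinary columns; note that no <code>values_to</code> is needed because <code>".value"</code> supplies the names.</p>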
</section>
<section id="widening-data" data-type="sect2">
<h2>
Widening data</h2>
<p>So far we’ve used <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p>
<p>We’ll start by looking at <code>cms_patient_experience</code>, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_experience
#&gt; # A tibble: 500 × 5
#&gt; org_pac_id org_nm measure_cd measure_title prf_rate
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS… 63
#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS… 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS… 86
#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS… 57
#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS… 85
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS… 24
#&gt; # … with 494 more rows</pre>
</div>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |&gt;
  distinct(measure_cd, measure_title)
#&gt; # A tibble: 6 × 2
#&gt; measure_cd measure_title
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
#&gt; 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#&gt; 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#&gt; 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#&gt; 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre>
</div>
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> has the opposite interface to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |&gt;
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
#&gt; # A tibble: 500 × 9
#&gt; org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… 63 NA
#&gt; 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#&gt; # … with 494 more rows, and 4 more variables: CAHPS_GRP_3 &lt;dbl&gt;,
#&gt; # CAHPS_GRP_5 &lt;dbl&gt;, CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;</pre>
</div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organization. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |&gt;
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
#&gt; # A tibble: 95 × 8
#&gt; org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICA… 63 87 86 57
#&gt; 2 0446162697 ASSOCIATION OF … 59 85 83 63
#&gt; 3 0547164295 BEAVER MEDICAL … 49 NA 75 44
#&gt; 4 0749333730 CAPE PHYSICIANS… 67 84 85 65
#&gt; 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64
#&gt; 6 0840109864 REX HOSPITAL INC 73 87 84 67
#&gt; # … with 89 more rows, and 2 more variables: CAHPS_GRP_8 &lt;dbl&gt;,
#&gt; # CAHPS_GRP_12 &lt;dbl&gt;</pre>
</div>
<p>This gives us the output that we’re looking for.</p>
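<p>The effect of <code>id_cols</code> is easier to see at a smaller scale. The sketch below uses a made-up four-row dataset; the column names <code>org</code>, <code>code</code>, <code>title</code>, and <code>rate</code> are hypothetical stand-ins for the CMS columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)

# Hypothetical data: two organisations, two measures each
df &lt;- tribble(
  ~org, ~code, ~title,        ~rate,
  "a",  "m1",  "Measure one",    10,
  "a",  "m2",  "Measure two",    20,
  "b",  "m1",  "Measure one",    30,
  "b",  "m2",  "Measure two",    40
)

# Default: `title` is treated as an identifier, so rows don't collapse
wide_default &lt;- df |&gt;
  pivot_wider(
    names_from = code,
    values_from = rate
  )

# Explicit id_cols: one row per organisation
wide_ids &lt;- df |&gt;
  pivot_wider(
    id_cols = org,
    names_from = code,
    values_from = rate
  )

nrow(wide_default)
nrow(wide_ids)</pre>
</div>
<p>The default call keeps four rows (one per combination of <code>org</code> and <code>title</code>), while the version with <code>id_cols</code> has two.</p>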
</section>
<section id="how-does-pivot_wider-work" data-type="sect2">
<h2>
How does pivot_wider() work?</h2>
<p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "B", "y", 2,
  "B", "x", 3,
  "A", "y", 4,
  "A", "z", 5,
)</pre>
</div>
<p>We’ll take the values from the <code>value</code> column and the names from the <code>name</code> column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  pivot_wider(
    names_from = name,
    values_from = value
  )
#&gt; # A tibble: 2 × 4
#&gt; id x y z
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 A 1 4 5
#&gt; 2 B 3 2 NA</pre>
</div>
<p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p>
<p>To begin the process, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> needs to first figure out what will go in the rows and columns. Finding the column names is easy: it’s just the unique values of <code>name</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  distinct(name)
#&gt; # A tibble: 3 × 1
#&gt; name
#&gt; &lt;chr&gt;
#&gt; 1 x
#&gt; 2 y
#&gt; 3 z</pre>
</div>
<p>By default, the rows in the output are formed by all the variables that aren’t going into the names or values. These are called the <code>id_cols</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  select(-name, -value) |&gt;
  distinct()
#&gt; # A tibble: 2 × 1
#&gt; id
#&gt; &lt;chr&gt;
#&gt; 1 A
#&gt; 2 B</pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> then combines these results to generate an empty data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  select(-name, -value) |&gt;
  distinct() |&gt;
  mutate(x = NA, y = NA, z = NA)
#&gt; # A tibble: 2 × 4
#&gt; id x y z
#&gt; &lt;chr&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 A NA NA NA
#&gt; 2 B NA NA NA</pre>
</div>
<p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
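<p>The filling step can be imitated with a short loop. This is a mental model only, assuming the same five-row <code>df</code> as above; it is not how tidyr does the work internally, and the empty columns are typed as doubles here so they can hold the values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(tidyr)
library(tibble)

df &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "B", "y", 2,
  "B", "x", 3,
  "A", "y", 4,
  "A", "z", 5
)

# Start from the empty frame: one row per id, one column per name
out &lt;- df |&gt;
  select(-name, -value) |&gt;
  distinct() |&gt;
  mutate(x = NA_real_, y = NA_real_, z = NA_real_)

# Put each input value in the cell addressed by its id (row) and name (column)
for (i in seq_len(nrow(df))) {
  out[[df$name[i]]][out$id == df$id[i]] &lt;- df$value[i]
}

out

# Compare with the real thing
all.equal(out, pivot_wider(df, names_from = name, values_from = value))</pre>
</div>
<p>The cell for id “B” and name “z” is never visited by the loop, which is why it stays <code>NA</code>.</p>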
<p>You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "A", "x", 2,
  "A", "y", 3,
  "B", "x", 4,
  "B", "y", 5,
)</pre>
</div>
<p>If we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; pivot_wider(
  names_from = name,
  values_from = value
)
#&gt; Warning: Values from `value` are not uniquely identified; output will contain
#&gt; list-cols.
#&gt; • Use `values_fn = list` to suppress this warning.
#&gt; • Use `values_fn = {summary_fun}` to summarise duplicates.
#&gt; • Use the following dplyr code to identify duplicates.
#&gt; {data} %&gt;%
#&gt; dplyr::group_by(id, name) %&gt;%
#&gt; dplyr::summarise(n = dplyr::n(), .groups = "drop") %&gt;%
#&gt; dplyr::filter(n &gt; 1L)
#&gt; # A tibble: 2 × 3
#&gt; id x y
#&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 A &lt;dbl [2]&gt; &lt;dbl [1]&gt;
#&gt; 2 B &lt;dbl [1]&gt; &lt;dbl [1]&gt;</pre>
</div>
<p>Since you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  group_by(id, name) |&gt;
  summarize(n = n(), .groups = "drop") |&gt;
  filter(n &gt; 1L)
#&gt; # A tibble: 1 × 3
#&gt; id name n
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 A x 2</pre>
</div>
<p>It’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.</p>
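<p>If summarizing turns out to be the right fix, one possible sketch looks like this. It assumes that averaging the duplicates is acceptable, which is an arbitrary choice made here for illustration; your data may call for <code>sum()</code>, <code>max()</code>, or a repair at the source instead.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(tidyr)
library(tibble)

df &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "A", "x", 2,
  "A", "y", 3,
  "B", "x", 4,
  "B", "y", 5
)

# Collapse duplicates first, then pivot
by_hand &lt;- df |&gt;
  group_by(id, name) |&gt;
  summarize(value = mean(value), .groups = "drop") |&gt;
  pivot_wider(
    names_from = name,
    values_from = value
  )

# Or let pivot_wider() apply the summary, as the warning suggests
with_fn &lt;- df |&gt;
  pivot_wider(
    names_from = name,
    values_from = value,
    values_fn = mean
  )

by_hand</pre>
</div>
<p>Either way the two rows for id “A” and name “x” end up as a single cell, and the output contains ordinary numeric columns rather than list-columns.</p>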
</section>
</section>
<section id="untidy-data" data-type="sect1">
<h1>
Untidy data</h1>
<p>While <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> is occasionally useful for making tidy data, its real strength is making <strong>untidy</strong> data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.</p>
<p>The following sections will show a few examples of <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.</p>
<section id="presenting-data-to-humans" data-type="sect2">
<h2>
Presenting data to humans</h2>
<p>As you’ve seen, <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  count(clarity, color)
#&gt; # A tibble: 56 × 3
#&gt; clarity color n
#&gt; &lt;ord&gt; &lt;ord&gt; &lt;int&gt;
#&gt; 1 I1 D 42
#&gt; 2 I1 E 102
#&gt; 3 I1 F 143
#&gt; 4 I1 G 150
#&gt; 5 I1 H 162
#&gt; 6 I1 I 92
#&gt; # … with 50 more rows</pre>
</div>
<p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to create a form more suitable for display to other humans:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  count(clarity, color) |&gt;
  pivot_wider(
    names_from = color,
    values_from = n
  )
#&gt; # A tibble: 8 × 8
#&gt; clarity D E F G H I J
#&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 42 102 143 150 162 92 50
#&gt; 2 SI2 1370 1713 1609 1548 1563 912 479
#&gt; 3 SI1 2083 2426 2131 1976 2275 1424 750
#&gt; 4 VS2 1697 2470 2201 2347 1643 1169 731
#&gt; 5 VS1 705 1281 1364 2148 1169 962 542
#&gt; 6 VVS2 553 991 975 1443 608 365 131
#&gt; # … with 2 more rows</pre>
</div>
<p>This display also makes it easy to compare in two directions, horizontally and vertically, much like <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>.</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like <a href="https://gt.rstudio.com">gt</a>. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.</p>
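<p>As a taste of what that looks like, here is a minimal sketch that hands the widened counts to gt. It assumes the gt package is installed; the table title is made up, and converting <code>clarity</code> to character is a cautious choice on our part rather than something gt is known to require.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(tidyr)
library(ggplot2) # provides the diamonds data
library(gt)

tbl &lt;- diamonds |&gt;
  count(clarity, color) |&gt;
  pivot_wider(
    names_from = color,
    values_from = n
  ) |&gt;
  # Use plain-text labels for the row stub
  mutate(clarity = as.character(clarity)) |&gt;
  gt(rowname_col = "clarity") |&gt;
  tab_header(title = "Number of diamonds by clarity and color")

tbl</pre>
</div>
<p>Printing <code>tbl</code> in an interactive session or a rendered document shows the formatted table; from here you would layer on further gt functions to refine it.</p>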
</section>
<section id="multivariate-statistics" data-type="sect2">
<h2>
Multivariate statistics</h2>
<p>Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable. Sometimes these formats have substantial performance or space advantages, or sometimes they’re just necessary to get closer to the underlying matrix mathematics.</p>
<p>We’re not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need. For example, let’s imagine you wanted to cluster the gapminder data to find countries that had similar progression of <code>gdpPercap</code> over time. To do this, we need one row for each country and one column for each year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(gapminder)
col_year &lt;- gapminder |&gt;
  mutate(gdpPercap = log10(gdpPercap)) |&gt;
  pivot_wider(
    id_cols = country,
    names_from = year,
    values_from = gdpPercap
  )
col_year
#&gt; # A tibble: 142 × 13
#&gt; country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
#&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 2.89 2.91 2.93 2.92 2.87 2.90 2.99 2.93 2.81
#&gt; 2 Albania 3.20 3.29 3.36 3.44 3.52 3.55 3.56 3.57 3.40
#&gt; 3 Algeria 3.39 3.48 3.41 3.51 3.62 3.69 3.76 3.75 3.70
#&gt; 4 Angola 3.55 3.58 3.63 3.74 3.74 3.48 3.44 3.39 3.42
#&gt; 5 Argentina 3.77 3.84 3.85 3.91 3.98 4.00 3.95 3.96 3.97
#&gt; 6 Australia 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37
#&gt; # … with 136 more rows, and 3 more variables: `1997` &lt;dbl&gt;, `2002` &lt;dbl&gt;,
#&gt; # `2007` &lt;dbl&gt;</pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">col_year &lt;- col_year |&gt;
  column_to_rownames("country")
head(col_year)
#&gt; 1952 1957 1962 1967 1972 1977 1982
#&gt; Afghanistan 2.891786 2.914265 2.931000 2.922309 2.869221 2.895485 2.990344
#&gt; Albania 3.204407 3.288313 3.364155 3.440940 3.520277 3.548144 3.560012
#&gt; Algeria 3.388990 3.479140 3.406679 3.511481 3.621453 3.691118 3.759302
#&gt; Angola 3.546618 3.582965 3.630354 3.742157 3.738248 3.478371 3.440429
#&gt; Argentina 3.771684 3.836125 3.853282 3.905955 3.975112 4.003419 3.954141
#&gt; Australia 4.001716 4.039400 4.086973 4.162150 4.225015 4.263262 4.289522
#&gt; 1987 1992 1997 2002 2007
#&gt; Afghanistan 2.930641 2.812473 2.803007 2.861376 2.988818
#&gt; Albania 3.572748 3.397495 3.504206 3.663155 3.773569
#&gt; Algeria 3.754452 3.700982 3.680996 3.723295 3.794025
#&gt; Angola 3.385644 3.419600 3.357390 3.442995 3.680991
#&gt; Argentina 3.960931 3.968876 4.040099 3.944366 4.106510
#&gt; Australia 4.340224 4.369675 4.431331 4.486965 4.537005</pre>
</div>
<p>This makes a data frame, because tibbles don’t support row names<span data-type="footnote">Tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.</span>.</p>
<p>We’re now ready to cluster with (e.g.) <code><a href="https://rdrr.io/r/stats/kmeans.html">kmeans()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cluster &lt;- stats::kmeans(col_year, centers = 6)</pre>
</div>
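<p>One practical wrinkle: <code>kmeans()</code> chooses its starting centers at random, so the cluster numbers can differ from run to run. The sketch below shows one way to make the result repeatable, rebuilding <code>col_year</code> so that it stands alone; the seed and <code>nstart = 25</code> are arbitrary values picked for illustration, not part of the analysis above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(tidyr)
library(tibble)
library(gapminder)

# Same wide, row-named layout as above
col_year &lt;- gapminder |&gt;
  mutate(gdpPercap = log10(gdpPercap)) |&gt;
  pivot_wider(
    id_cols = country,
    names_from = year,
    values_from = gdpPercap
  ) |&gt;
  column_to_rownames("country")

# Fix the random number generator so the clustering is repeatable
set.seed(1014)
# nstart tries several random starts and keeps the best one
cluster &lt;- kmeans(col_year, centers = 6, nstart = 25)

# One cluster label per country, named by the row names
head(cluster$cluster)</pre>
</div>
<p>Without <code>set.seed()</code>, a country that lands in cluster 4 today might be labelled cluster 2 tomorrow, even when the groupings themselves are unchanged.</p>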
<p>Extracting the data out of this object into a form you can work with is a challenge you’ll need to come back to later in the book, once you’ve learned more about lists. But for now, you can get the clustering membership out with this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cluster_id &lt;- cluster$cluster |&gt;
  enframe() |&gt;
  rename(country = name, cluster_id = value)
cluster_id
#&gt; # A tibble: 142 × 2
#&gt; country cluster_id
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 Afghanistan 4
#&gt; 2 Albania 2
#&gt; 3 Algeria 6
#&gt; 4 Angola 2
#&gt; 5 Argentina 5
#&gt; 6 Australia 1
#&gt; # … with 136 more rows</pre>
</div>
<p>You could then combine this back with the original data using one of the joins you’ll learn about in <a href="#chp-joins" data-type="xref">#chp-joins</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gapminder |&gt; left_join(cluster_id)
#&gt; Joining with `by = join_by(country)`
#&gt; # A tibble: 1,704 × 7
#&gt; country continent year lifeExp pop gdpPercap cluster_id
#&gt; &lt;chr&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 Afghanistan Asia 1952 28.8 8425333 779. 4
#&gt; 2 Afghanistan Asia 1957 30.3 9240934 821. 4
#&gt; 3 Afghanistan Asia 1962 32.0 10267083 853. 4
#&gt; 4 Afghanistan Asia 1967 34.0 11537966 836. 4
#&gt; 5 Afghanistan Asia 1972 36.1 13079460 740. 4
#&gt; 6 Afghanistan Asia 1977 38.4 14880372 786. 4
#&gt; # … with 1,698 more rows</pre>
</div>
</section>
<section id="pragmatic-computation" data-type="sect2">
<h2>
Pragmatic computation</h2>
<p>Sometimes it’s just easier to answer a question using untidy data. For example, if you’re interested in just the total number of missing values in <code>cms_patient_experience</code>, it’s easier to work with the untidy form:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_experience |&gt;
  group_by(org_pac_id) |&gt;
  summarize(
    n_miss = sum(is.na(prf_rate)),
    n = n(),
  )
#&gt; # A tibble: 95 × 3
#&gt; org_pac_id n_miss n
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0446157747 0 6
#&gt; 2 0446162697 0 6
#&gt; 3 0547164295 1 6
#&gt; 4 0749333730 0 6
#&gt; 5 0840104360 0 6
#&gt; 6 0840109864 0 6
#&gt; # … with 89 more rows</pre>
</div>
<p>This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.</p>
<p>So if you’re stuck figuring out how to do some computation, maybe it’s time to switch up the organisation of your data. For computations involving a fixed number of values (like computing differences or ratios), it’s usually easier if the data is in columns; for those with a variable number of values (like sums or means) it’s usually easier in rows. Don’t be afraid to untidy, transform, and re-tidy if needed.</p>
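<p>The rule of thumb above can be sketched with a tiny made-up dataset. Everything here (the <code>scores</code> tibble and its <code>student</code>, <code>stage</code>, and <code>score</code> columns) is hypothetical and exists only to contrast the two layouts.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: each student is scored before and after a course
scores &lt;- tribble(
  ~student, ~stage,   ~score,
  "a",      "before",     60,
  "a",      "after",      75,
  "b",      "before",     80,
  "b",      "after",      70
)

# Variable number of values (a mean): easiest with the values in rows
means &lt;- scores |&gt;
  group_by(student) |&gt;
  summarize(mean_score = mean(score))

# Fixed number of values (a difference): easiest with the values in columns
changes &lt;- scores |&gt;
  pivot_wider(
    names_from = stage,
    values_from = score
  ) |&gt;
  mutate(change = after - before)

means
changes</pre>
</div>
<p>The mean would work unchanged if a student had three or ten scores, whereas the difference only makes sense because there are exactly two named columns to subtract.</p>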
<p>Let’s explore this idea by looking at <code>cms_patient_care</code>, which has a similar structure to <code>cms_patient_experience</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_care
#&gt; # A tibble: 252 × 5
#&gt; ccn facility_name measure_abbr score type
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 011500 BAPTIST HOSPICE beliefs_addressed 202 denominator
#&gt; 2 011500 BAPTIST HOSPICE beliefs_addressed 100 observed
#&gt; 3 011500 BAPTIST HOSPICE composite_process 202 denominator
#&gt; 4 011500 BAPTIST HOSPICE composite_process 88.1 observed
#&gt; 5 011500 BAPTIST HOSPICE dyspena_treatment 110 denominator
#&gt; 6 011500 BAPTIST HOSPICE dyspena_treatment 99.1 observed
#&gt; # … with 246 more rows</pre>
</div>
<p>It contains information about 9 measures (<code>beliefs_addressed</code>, <code>composite_process</code>, <code>dyspena_treatment</code>, …) on 14 different facilities (identified by <code>ccn</code> with a name given by <code>facility_name</code>). Compared to <code>cms_patient_experience</code>, however, each measurement is recorded in two rows, distinguished by <code>type</code>: an <code>observed</code> score, the percentage of patients who answered yes to the survey question, and a <code>denominator</code>, the number of patients that the question applies to. Depending on what you want to do next, you may find any of the following three structures useful:</p>
<ul><li>
<p>If you want to compute the number of patients that answered yes to the question, you may pivot <code>type</code> into the columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_care |&gt;
  pivot_wider(
    names_from = type,
    values_from = score
  ) |&gt;
  mutate(
    numerator = round(observed / 100 * denominator)
  )
#&gt; # A tibble: 126 × 6
#&gt; ccn facility_name measure_abbr denominator observed numerator
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 011500 BAPTIST HOSPICE beliefs_addressed 202 100 202
#&gt; 2 011500 BAPTIST HOSPICE composite_process 202 88.1 178
#&gt; 3 011500 BAPTIST HOSPICE dyspena_treatment 110 99.1 109
#&gt; 4 011500 BAPTIST HOSPICE dyspnea_screening 202 100 202
#&gt; 5 011500 BAPTIST HOSPICE opioid_bowel 61 100 61
#&gt; 6 011500 BAPTIST HOSPICE pain_assessment 107 100 107
#&gt; # … with 120 more rows</pre>
</div>
</li>
<li>
<p>If you want to display the distribution of each metric, you may keep it as is so you could facet by <code>measure_abbr</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_care |&gt;
  filter(type == "observed") |&gt;
  ggplot(aes(x = score)) +
    geom_histogram(binwidth = 2) +
    facet_wrap(vars(measure_abbr))
#&gt; Warning: Removed 1 rows containing non-finite values (`stat_bin()`).</pre>
</div>
</li>
<li>
<p>If you want to explore how different metrics are related, you may put the measure names in the columns so you could compare them in scatterplots.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cms_patient_care |&gt;
  filter(type == "observed") |&gt;
  select(-type) |&gt;
  pivot_wider(
    names_from = measure_abbr,
    values_from = score
  ) |&gt;
  ggplot(aes(x = dyspnea_screening, y = dyspena_treatment)) +
    geom_point() +
    coord_equal()</pre>
</div>
</li>
</ul></section>
</section>
<section id="data-tidy-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem so we also showed you some places where you might want to deliberately untidy your data in order to present to humans, feed into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about its history and theoretical underpinnings in the <a href="https://www.jstatsoft.org/article/view/v059i10">Tidy Data</a> paper published in the Journal of Statistical Software.</p>
<p>In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code “tidy” (ha!) in order to make it easy for you and others to read and understand your code.</p>
</section>
</section>


@ -1,968 +0,0 @@
<section data-type="chapter" id="chp-data-transform">
<h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1>
<section id="data-transform-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the <strong>dplyr</strong> package and a new dataset on flights that departed New York City in 2013.</p>
<p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll start with functions that operate on rows and then columns of a data frame. We will then introduce the ability to work with groups. We will end the chapter with a case study that showcases these functions in action, and we’ll come back to the functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).</p>
<section id="data-transform-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
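<p>As a small illustration of the <code>packagename::functionname()</code> syntax, the sketch below calls both functions named <code>filter()</code> explicitly. The vector <code>x</code> and the tibble <code>df</code> are made up for this example, and the three-point moving average is just one arbitrary use of the base function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Toy data, invented for this illustration
x &lt;- c(1, 2, 3, 4, 5)
df &lt;- tibble(x = x)

# Base R's filter(): applies a linear filter to a series
# (here, a three-point moving average)
stats::filter(x, rep(1 / 3, 3))

# dplyr's filter(): keeps the rows of a data frame that meet a condition
dplyr::filter(df, x &gt; 3)</pre>
</div>
<p>Spelling out the package is never wrong, so it is a safe fallback whenever you are unsure which version of a function will be called.</p>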
</section>
<section id="nycflights13" data-type="sect2">
<h2>
nycflights13</h2>
<p>To explore the basic dplyr verbs, we’re going to use <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0">Bureau of Transportation Statistics</a>, and is documented in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">?flights</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. There are a few options to see everything. If you’re using RStudio, the most convenient is probably <code>View(flights)</code>, which will open an interactive scrollable and filterable view. Otherwise you can use <code>print(flights, width = Inf)</code> to show all columns, or call <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">glimpse(flights)
#&gt; Rows: 336,776
#&gt; Columns: 19
#&gt; $ year &lt;int&gt; 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
#&gt; $ month &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#&gt; $ day &lt;int&gt; 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#&gt; $ dep_time &lt;int&gt; 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55…
#&gt; $ sched_dep_time &lt;int&gt; 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60…
#&gt; $ dep_delay &lt;dbl&gt; 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,…
#&gt; $ arr_time &lt;int&gt; 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8…
#&gt; $ sched_arr_time &lt;int&gt; 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8…
#&gt; $ arr_delay &lt;dbl&gt; 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,…
#&gt; $ carrier &lt;chr&gt; "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"…
#&gt; $ flight &lt;int&gt; 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301…
#&gt; $ tailnum &lt;chr&gt; "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N…
#&gt; $ origin &lt;chr&gt; "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG…
#&gt; $ dest &lt;chr&gt; "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA…
#&gt; $ air_time &lt;dbl&gt; 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149…
#&gt; $ distance &lt;dbl&gt; 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73…
#&gt; $ hour &lt;dbl&gt; 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6…
#&gt; $ minute &lt;dbl&gt; 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59…
#&gt; $ time_hour &lt;dttm&gt; 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0…</pre>
</div>
<p>In both views, the variable names are followed by abbreviations that tell you the type of each variable: <code>&lt;int&gt;</code> is short for integer, <code>&lt;dbl&gt;</code> is short for double (aka real numbers), <code>&lt;chr&gt;</code> for character (aka strings), and <code>&lt;dttm&gt;</code> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.</p>
</section>
<section id="dplyr-basics" data-type="sect2">
<h2>
dplyr basics</h2>
<p>You’re about to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:</p>
<ol type="1"><li><p>The first argument is always a data frame.</p></li>
<li><p>The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).</p></li>
<li><p>The result is always a new data frame.</p></li>
</ol><p>Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, <code>|&gt;</code>. The pipe takes the thing on its left and passes it along to the function on its right so that <code>x |&gt; f(y)</code> is equivalent to <code>f(x, y)</code>, and <code>x |&gt; f(y) |&gt; g(z)</code> is equivalent to <code>g(f(x, y), z)</code>. The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dest == "IAH") |&gt;
group_by(year, month, day) |&gt;
summarize(
arr_delay = mean(arr_delay, na.rm = TRUE)
)</pre>
</div>
<p>The code starts with the <code>flights</code> dataset, then filters it, then groups it, then summarizes it. We’ll come back to the pipe and its alternatives in <a href="#sec-pipes" data-type="xref">#sec-pipes</a>.</p>
<p>dplyr’s verbs are organised into four groups based on what they operate on: <strong>rows</strong>, <strong>columns</strong>, <strong>groups</strong>, or <strong>tables</strong>. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to verbs that work on tables in <a href="#chp-joins" data-type="xref">#chp-joins</a>. Let’s dive in!</p>
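<p>To make the equivalence between the piped and nested forms concrete, here is a minimal sketch that runs one <code>filter()</code> call both ways. The names <code>piped</code> and <code>nested</code> are ours, and the sketch assumes R 4.1 or later, where the native pipe is available:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# "Take flights, then filter it"
piped &lt;- flights |&gt; filter(dest == "IAH")

# The same call written without the pipe
nested &lt;- filter(flights, dest == "IAH")

# Check whether the two results are exactly the same
identical(piped, nested)</pre>
</div>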
</section>
</section>
<section id="rows" data-type="sect1">
<h1>
Rows</h1>
<p>The most important verbs that operate on rows are <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, which changes which rows are present without changing their order, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged. We’ll also discuss <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> which finds rows with unique values but unlike <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> it can also optionally modify the columns.</p>
<section id="filter" data-type="sect2">
<h2>
filter()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(arr_delay &gt; 120)
#&gt; # A tibble: 10,034 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 811 630 101 1047 830
#&gt; 2 2013 1 1 848 1835 853 1001 1950
#&gt; 3 2013 1 1 957 733 144 1056 853
#&gt; 4 2013 1 1 1114 900 134 1447 1222
#&gt; 5 2013 1 1 1505 1310 115 1638 1431
#&gt; 6 2013 1 1 1525 1340 105 1831 1626
#&gt; # … with 10,028 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>As well as <code>&gt;</code> (greater than), you can use <code>&gt;=</code> (greater than or equal to), <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&amp;</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Flights that departed on January 1
flights |&gt;
filter(month == 1 &amp; day == 1)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
# Flights that departed in January or February
flights |&gt;
filter(month == 1 | month == 2)
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A shorter way to select flights that departed in January or February
flights |&gt;
filter(month %in% c(1, 2))
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 51,949 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
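<p>It can help to see what <code>%in%</code> produces on its own, outside of <code>filter()</code>. The following sketch uses a short made-up vector called <code>month</code> (not the column in <code>flights</code>) to show that both spellings give one logical value per element:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A toy vector, invented for this illustration
month &lt;- c(1, 2, 3, 12)

# TRUE where the element is 1 or 2, FALSE for 3 and 12
month %in% c(1, 2)

# The longer spelling yields the same logical vector
month == 1 | month == 2</pre>
</div>
<p><code>filter()</code> then keeps exactly the rows where that logical vector is <code>TRUE</code>.</p>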
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
<p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">jan1 &lt;- flights |&gt;
filter(month == 1 &amp; day == 1)</pre>
</div>
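<p>A quick way to confirm that the input is left alone is to compare row counts after the assignment. This is only a sanity-check sketch; the counts in the comments are the ones shown in the printed output earlier in this section:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

jan1 &lt;- flights |&gt;
  filter(month == 1 &amp; day == 1)

nrow(jan1)     # the saved result: 842 rows
nrow(flights)  # the original data frame: still 336,776 rows</pre>
</div>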
</section>
<section id="common-mistakes" data-type="sect2">
<h2>
Common mistakes</h2>
<p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> will let you know when this happens:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(month = 1)
#&gt; Error in `filter()`:
#&gt; ! We detected a named input.
#&gt; This usually means that you've used `=` instead of `==`.
#&gt; Did you mean `month == 1`?</pre>
</div>
<p>Another mistake is to write “or” statements like you would in English:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(month == 1 | 2)</pre>
</div>
<p>This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in <a href="#sec-boolean-operations" data-type="xref">#sec-boolean-operations</a>.</p>
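<p>Without getting into the reason yet, one way to convince yourself that the English-style condition misbehaves is to count the rows it keeps and compare that with the <code>%in%</code> version. This is just a diagnostic sketch:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# The English-style "or": keeps every row, not just January and February
flights |&gt;
  filter(month == 1 | 2) |&gt;
  nrow()

# The intended condition: keeps only the 51,955 January and February flights
flights |&gt;
  filter(month %in% c(1, 2)) |&gt;
  nrow()</pre>
</div>
<p>Whenever a filter returns far more rows than you expected, a comparison like this is a cheap first check.</p>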
</section>
<section id="arrange" data-type="sect2">
<h2>
arrange()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
arrange(year, month, day, dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
arrange(desc(dep_delay))
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 9 641 900 1301 1242 1530
#&gt; 2 2013 6 15 1432 1935 1137 1607 2120
#&gt; 3 2013 1 10 1121 1635 1126 1239 1810
#&gt; 4 2013 9 20 1139 1845 1014 1457 2210
#&gt; 5 2013 7 22 845 1600 1005 1044 1815
#&gt; 6 2013 4 10 1100 1900 960 1342 2211
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
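<p>Tie-breaking is easier to see on a tiny data frame than on <code>flights</code>. In the sketch below, the tibble <code>df</code> and its columns <code>g</code> and <code>x</code> are invented for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Toy data with repeated values in g, so ties need breaking
df &lt;- tibble(
  g = c("b", "a", "b", "a"),
  x = c(2, 2, 1, 1)
)

# Sort by g, then break ties within each value of g using x (ascending)
df |&gt;
  arrange(g, x)

# Same primary sort, but ties are broken by x in descending order
df |&gt;
  arrange(g, desc(x))</pre>
</div>
<p>Note that <code>desc()</code> only applies to the column it wraps, so you can mix ascending and descending columns in one call.</p>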
<p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that were most delayed on arrival that left roughly on time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt;
arrange(desc(arr_delay))
#&gt; # A tibble: 239,109 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 11 1 658 700 -2 1329 1015
#&gt; 2 2013 4 18 558 600 -2 1149 850
#&gt; 3 2013 7 7 1659 1700 -1 2050 1823
#&gt; 4 2013 7 22 1606 1615 -9 2056 1831
#&gt; 5 2013 9 19 648 641 7 1035 810
#&gt; 6 2013 4 18 655 700 -5 1213 950
#&gt; # … with 239,103 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="distinct" data-type="sect2">
<h2>
distinct()
</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> finds all the unique rows in a dataset, so in a technical sense, it primarily operates on the rows. Most of the time, however, you’ll want the distinct combination of some variables, so you can also optionally supply column names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># This would remove any duplicate rows if there were any
flights |&gt;
distinct()
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
# This finds all unique origin and destination pairs.
flights |&gt;
distinct(origin, dest)
#&gt; # A tibble: 224 × 2
#&gt; origin dest
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 EWR IAH
#&gt; 2 LGA IAH
#&gt; 3 JFK MIA
#&gt; 4 JFK BQN
#&gt; 5 LGA ATL
#&gt; 6 EWR ORD
#&gt; # … with 218 more rows</pre>
</div>
<p>Note that if you want to find the number of duplicates, or rows that weren’t duplicated, you’re better off swapping <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> for <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and then filtering as needed.</p>
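<p>As a rough sketch of that pattern, the example below counts each combination of values and then filters on the count. The tibble <code>df</code> is made up; <code>n</code> is the column that <code>count()</code> adds by default:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Toy data with some repeated rows, invented for this illustration
df &lt;- tibble(
  x = c(1, 1, 2, 3, 3, 3),
  y = c("a", "a", "b", "c", "c", "c")
)

# How many times does each combination of x and y occur?
counts &lt;- df |&gt;
  count(x, y)

# Combinations that occur more than once (the duplicates)
counts |&gt;
  filter(n &gt; 1)

# Combinations that occur exactly once
counts |&gt;
  filter(n == 1)</pre>
</div>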
</section>
<section id="data-transform-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Find all flights that</p>
<ol type="a"><li>Had an arrival delay of two or more hours</li>
<li>Flew to Houston (<code>IAH</code> or <code>HOU</code>)</li>
<li>Were operated by United, American, or Delta</li>
<li>Departed in summer (July, August, and September)</li>
<li>Arrived more than two hours late, but didn’t leave late</li>
<li>Were delayed by at least an hour, but made up over 30 minutes in flight</li>
</ol></li>
<li><p>Sort <code>flights</code> to find the flights with longest departure delays. Find the flights that left earliest in the morning.</p></li>
<li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li>
<li><p>Was there a flight on every day of 2013?</p></li>
<li><p>Which flights traveled the farthest distance? Which traveled the least distance?</p></li>
<li><p>Does it matter what order you used <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
</ol></section>
</section>
<section id="columns" data-type="sect1">
<h1>
Columns</h1>
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> creates new columns that are functions of the existing columns; <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> change which columns are present, their names, or their positions. We’ll also discuss <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> since it allows you to get a column out of a data frame.</p>
<section id="sec-mutate" data-type="sect2">
<h2>
mutate()
</h2>
<p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
#&gt; # A tibble: 336,776 × 21
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 13 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.before = 1
)
#&gt; # A tibble: 336,776 × 21
#&gt; gain speed year month day dep_time sched_dep_time dep_delay arr_time
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 -9 370. 2013 1 1 517 515 2 830
#&gt; 2 -16 374. 2013 1 1 533 529 4 850
#&gt; 3 -31 408. 2013 1 1 542 540 2 923
#&gt; 4 17 517. 2013 1 1 544 545 -1 1004
#&gt; 5 19 394. 2013 1 1 554 600 -6 812
#&gt; 6 -16 288. 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use the variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.after = day
)
#&gt; # A tibble: 336,776 × 21
#&gt; year month day gain speed dep_time sched_dep_time dep_delay arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013 1 1 -9 370. 517 515 2 830
#&gt; 2 2013 1 1 -16 374. 533 529 4 850
#&gt; 3 2013 1 1 -31 408. 542 540 2 923
#&gt; 4 2013 1 1 17 517. 544 545 -1 1004
#&gt; 5 2013 1 1 19 394. 554 600 -6 812
#&gt; 6 2013 1 1 -16 288. 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which allows you to see the inputs and outputs from your calculations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
.keep = "used"
)
#&gt; # A tibble: 336,776 × 6
#&gt; dep_delay arr_delay air_time gain hours gain_per_hour
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 11 227 -9 3.78 -2.38
#&gt; 2 4 20 227 -16 3.78 -4.23
#&gt; 3 2 33 160 -31 2.67 -11.6
#&gt; 4 -1 -18 183 17 3.05 5.57
#&gt; 5 -6 -25 116 19 1.93 9.83
#&gt; 6 -4 12 150 -16 2.5 -6.4
#&gt; # … with 336,770 more rows</pre>
</div>
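<p>For comparison, here is a small sketch of <code>.keep = "used"</code> alongside two of the other values the argument accepts in current dplyr, <code>"unused"</code> and <code>"none"</code>. The tibble <code>df</code> and the column <code>total</code> are invented for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Toy data, invented for this illustration
df &lt;- tibble(a = 1:3, b = 4:6, c = 7:9)

# "used": keep only the inputs (a and b) plus the new column
df |&gt;
  mutate(total = a + b, .keep = "used")

# "unused": drop the inputs, keep the other columns plus the new column
df |&gt;
  mutate(total = a + b, .keep = "unused")

# "none": keep only the newly created column
df |&gt;
  mutate(total = a + b, .keep = "none")</pre>
</div>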
</section>
<section id="sec-select" data-type="sect2">
<h2>
select()
</h2>
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Select columns by name
flights |&gt;
select(year, month, day)
#&gt; # A tibble: 336,776 × 3
#&gt; year month day
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1
#&gt; 2 2013 1 1
#&gt; 3 2013 1 1
#&gt; 4 2013 1 1
#&gt; 5 2013 1 1
#&gt; 6 2013 1 1
#&gt; # … with 336,770 more rows
# Select all columns between year and day (inclusive)
flights |&gt;
select(year:day)
#&gt; # A tibble: 336,776 × 3
#&gt; year month day
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1
#&gt; 2 2013 1 1
#&gt; 3 2013 1 1
#&gt; 4 2013 1 1
#&gt; 5 2013 1 1
#&gt; 6 2013 1 1
#&gt; # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
flights |&gt;
select(!year:day)
#&gt; # A tibble: 336,776 × 16
#&gt; dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 517 515 2 830 819 11 UA
#&gt; 2 533 529 4 850 830 20 UA
#&gt; 3 542 540 2 923 850 33 AA
#&gt; 4 544 545 -1 1004 1022 -18 B6
#&gt; 5 554 600 -6 812 837 -25 DL
#&gt; 6 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, and 9 more variables: flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, …
# Select all columns that are characters
flights |&gt;
select(where(is.character))
#&gt; # A tibble: 336,776 × 4
#&gt; carrier tailnum origin dest
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 UA N14228 EWR IAH
#&gt; 2 UA N24211 LGA IAH
#&gt; 3 AA N619AA JFK MIA
#&gt; 4 B6 N804JB JFK BQN
#&gt; 5 DL N668DN LGA ATL
#&gt; 6 UA N39463 EWR ORD
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are a number of helper functions you can use within <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<ul><li>
<code>starts_with("abc")</code>: matches names that begin with “abc”.</li>
<li>
<code>ends_with("xyz")</code>: matches names that end with “xyz”.</li>
<li>
<code>contains("ijk")</code>: matches names that contain “ijk”.</li>
<li>
<code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li>
</ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
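<p>To see a few of these helpers in action on <code>flights</code>, the sketch below pipes each selection into <code>names()</code> so that only the matching column names are printed. The search strings are arbitrary choices for this example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# Columns whose names begin with "dep": dep_time and dep_delay
flights |&gt;
  select(starts_with("dep")) |&gt;
  names()

# Columns whose names end with "delay": dep_delay and arr_delay
flights |&gt;
  select(ends_with("delay")) |&gt;
  names()

# Columns whose names contain "arr_" anywhere
flights |&gt;
  select(contains("arr_")) |&gt;
  names()</pre>
</div>
<p>Note that <code>sched_dep_time</code> is not matched by <code>starts_with("dep")</code>, because the helper only looks at the beginning of the name.</p>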
<p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
select(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 1
#&gt; tail_num
#&gt; &lt;chr&gt;
#&gt; 1 N14228
#&gt; 2 N24211
#&gt; 3 N619AA
#&gt; 4 N804JB
#&gt; 5 N668DN
#&gt; 6 N39463
#&gt; # … with 336,770 more rows</pre>
</div>
</section>
<section id="rename" data-type="sect2">
<h2>
rename()
</h2>
<p>If you want to keep all the existing variables and rename just a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
rename(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tail_num &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code> which provides some useful automated cleaning.</p>
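<p>To make the difference between the two verbs concrete, the following sketch compares them on a small, made-up data frame (<code>toy</code> and the new name <code>destination</code> are hypothetical):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# A made-up data frame standing in for flights
toy &lt;- tibble(year = 2013L, tailnum = "N14228", dest = "IAH")

# select() keeps only the columns you mention
selected &lt;- toy |&gt;
  select(tail_num = tailnum) |&gt;
  names()
selected
#&gt; [1] "tail_num"

# rename() keeps every column in its original position,
# and can rename several columns in one call
renamed &lt;- toy |&gt;
  rename(tail_num = tailnum, destination = dest) |&gt;
  names()
renamed
#&gt; [1] "year"        "tail_num"    "destination"</pre>
</div>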
</section>
<section id="relocate" data-type="sect2">
<h2>
relocate()
</h2>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
relocate(time_hour, air_time)
#&gt; # A tibble: 336,776 × 19
#&gt; time_hour air_time year month day dep_time sched_dep_time
#&gt; &lt;dttm&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 227 2013 1 1 517 515
#&gt; 2 2013-01-01 05:00:00 227 2013 1 1 533 529
#&gt; 3 2013-01-01 05:00:00 160 2013 1 1 542 540
#&gt; 4 2013-01-01 05:00:00 183 2013 1 1 544 545
#&gt; 5 2013-01-01 06:00:00 116 2013 1 1 554 600
#&gt; 6 2013-01-01 05:00:00 150 2013 1 1 554 558
#&gt; # … with 336,770 more rows, and 12 more variables: dep_delay &lt;dbl&gt;,
#&gt; # arr_time &lt;int&gt;, sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, …</pre>
</div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
relocate(year:dep_time, .after = time_hour)
#&gt; # A tibble: 336,776 × 19
#&gt; sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 515 2 830 819 11 UA 1545
#&gt; 2 529 4 850 830 20 UA 1714
#&gt; 3 540 2 923 850 33 AA 1141
#&gt; 4 545 -1 1004 1022 -18 B6 725
#&gt; 5 600 -6 812 837 -25 DL 461
#&gt; 6 558 -4 740 728 12 UA 1696
#&gt; # … with 336,770 more rows, and 12 more variables: tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, …
flights |&gt;
relocate(starts_with("arr"), .before = dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day arr_time arr_delay dep_time sched_dep_time dep_delay
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 830 11 517 515 2
#&gt; 2 2013 1 1 850 20 533 529 4
#&gt; 3 2013 1 1 923 33 542 540 2
#&gt; 4 2013 1 1 1004 -18 544 545 -1
#&gt; 5 2013 1 1 812 -25 554 600 -6
#&gt; 6 2013 1 1 740 12 554 558 -4
#&gt; # … with 336,770 more rows, and 11 more variables: sched_arr_time &lt;int&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
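<p>The column order is easier to follow on a small example. This sketch uses a made-up four-column data frame (<code>toy</code> is hypothetical) and shows the default behavior alongside <code>.after</code> and <code>.before</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# A made-up data frame with columns a, b, c, d
toy &lt;- tibble(a = 1, b = "x", c = 3, d = "y")

# Default: the named column moves to the front
moved_front &lt;- toy |&gt;
  relocate(d) |&gt;
  names()
moved_front
#&gt; [1] "d" "a" "b" "c"

# .after: place a immediately after c
moved_after &lt;- toy |&gt;
  relocate(a, .after = c) |&gt;
  names()
moved_after
#&gt; [1] "b" "c" "a" "d"

# .before also works with selection helpers such as where()
moved_before &lt;- toy |&gt;
  relocate(where(is.character), .before = a) |&gt;
  names()
moved_before
#&gt; [1] "b" "d" "a" "c"</pre>
</div>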
</section>
<section id="data-transform-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<div class="cell">
</div>
<ol type="1"><li><p>Compare <code>air_time</code> with <code>arr_time - dep_time</code>. What do you expect to see? What do you see? What do you need to do to fix it?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code>, and <code>dep_delay</code>. How would you expect those three numbers to be related?</p></li>
<li><p>Brainstorm as many ways as possible to select <code>dep_time</code>, <code>dep_delay</code>, <code>arr_time</code>, and <code>arr_delay</code> from <code>flights</code>.</p></li>
<li><p>What happens if you include the name of a variable multiple times in a <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> call?</p></li>
<li>
<p>What does the <code><a href="https://tidyselect.r-lib.org/reference/all_of.html">any_of()</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">variables &lt;- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
</div>
</li>
<li>
<p>Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">select(flights, contains("TIME"))</pre>
</div>
</li>
</ol></section>
</section>
<section id="groups" data-type="sect1">
<h1>
Groups</h1>
<p>So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, and the slice family of functions.</p>
<section id="group_by" data-type="sect2">
<h2>
group_by()
</h2>
<p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month)
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: month [12]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”. <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t do anything by itself; instead it changes the behavior of the subsequent verbs.</p>
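<p>One way to convince yourself of this is to inspect the grouping metadata directly. The sketch below uses a made-up data frame (<code>toy</code> is hypothetical) together with dplyr’s <code>group_vars()</code> and <code>n_groups()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# A made-up data frame with two distinct months
toy &lt;- tibble(month = c(1L, 1L, 2L), dep_delay = c(2, 4, -1))

by_month &lt;- toy |&gt;
  group_by(month)

# The values are untouched ...
identical(toy$dep_delay, by_month$dep_delay)
#&gt; [1] TRUE

# ... but the result now records which variables define the groups
group_vars(toy)
#&gt; character(0)
group_vars(by_month)
#&gt; [1] "month"
n_groups(by_month)
#&gt; [1] 2</pre>
</div>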
</section>
<section id="sec-summarize" data-type="sect2">
<h2>
summarize()
</h2>
<p>The most important grouped operation is a summary, which collapses each group to a single row. In dplyr, this operation is performed by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code><span data-type="footnote">Or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, if you prefer British English.</span>, as shown by the following example, which computes the average departure delay by month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(dep_delay)
)
#&gt; # A tibble: 12 × 2
#&gt; month delay
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 NA
#&gt; 2 2 NA
#&gt; 3 3 NA
#&gt; 4 4 NA
#&gt; 5 5 NA
#&gt; 6 6 NA
#&gt; # … with 6 more rows</pre>
</div>
<p>Uh-oh! Something has gone wrong and all of our results are <code>NA</code> (pronounced “N-A”), R’s symbol for a missing value. We’ll come back to discuss missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>, but for now we’ll remove them by using <code>na.rm = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE)
)
#&gt; # A tibble: 12 × 2
#&gt; month delay
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 10.0
#&gt; 2 2 10.8
#&gt; 3 3 13.2
#&gt; 4 4 13.9
#&gt; 5 5 13.0
#&gt; 6 6 20.8
#&gt; # … with 6 more rows</pre>
</div>
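<p>To see why a single missing value turns a whole summary into <code>NA</code>, it can help to try <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> on a small vector. The values below are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Four made-up delays, one of which is missing
delays &lt;- c(2, 4, NA, -1)

# Any missing input makes the mean missing too
mean_with_na &lt;- mean(delays)
mean_with_na
#&gt; [1] NA

# na.rm = TRUE drops the NA first, so this is (2 + 4 - 1) / 3
mean_without_na &lt;- mean(delays, na.rm = TRUE)
mean_without_na
#&gt; [1] 1.666667</pre>
</div>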
<p>You can create any number of summaries in a single call to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>, which returns the number of rows in each group:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(month) |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE),
n = n()
)
#&gt; # A tibble: 12 × 3
#&gt; month delay n
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 1 10.0 27004
#&gt; 2 2 10.8 24951
#&gt; 3 3 13.2 28834
#&gt; 4 4 13.9 28330
#&gt; 5 5 13.0 28796
#&gt; 6 6 20.8 28243
#&gt; # … with 6 more rows</pre>
</div>
<p>Means and counts can get you a surprisingly long way in data science!</p>
</section>
<section id="the-slice_-functions" data-type="sect2">
<h2>
The slice_ functions</h2>
<p>There are five handy functions that allow you to pick off specific rows within each group:</p>
<ul><li>
<code>df |&gt; slice_head(n = 1)</code> takes the first row from each group.</li>
<li>
<code>df |&gt; slice_tail(n = 1)</code> takes the last row in each group.</li>
<li>
<code>df |&gt; slice_min(x, n = 1)</code> takes the row with the smallest value of <code>x</code>.</li>
<li>
<code>df |&gt; slice_max(x, n = 1)</code> takes the row with the largest value of <code>x</code>.</li>
<li>
<code>df |&gt; slice_sample(n = 1)</code> takes one random row.</li>
</ul><p>You can vary <code>n</code> to select more than one row, or instead of <code>n =</code>, you can use <code>prop = 0.1</code> to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
slice_max(arr_delay, n = 1)
#&gt; # A tibble: 108 × 19
#&gt; # Groups: dest [105]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 7 22 2145 2007 98 132 2259
#&gt; 2 2013 7 23 1139 800 219 1250 909
#&gt; 3 2013 1 25 123 2000 323 229 2101
#&gt; 4 2013 8 17 1740 1625 75 2042 2003
#&gt; 5 2013 7 22 2257 759 898 121 1026
#&gt; 6 2013 7 10 2056 1505 351 2347 1758
#&gt; # … with 102 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(max_delay = max(arr_delay, na.rm = TRUE))
#&gt; Warning: There was 1 warning in `summarize()`.
#&gt; In argument: `max_delay = max(arr_delay, na.rm = TRUE)`.
#&gt; In group 52: `dest = "LGA"`.
#&gt; Caused by warning in `max()`:
#&gt; ! no non-missing arguments to max; returning -Inf
#&gt; # A tibble: 105 × 2
#&gt; dest max_delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ABQ 153
#&gt; 2 ACK 221
#&gt; 3 ALB 328
#&gt; 4 ANC 39
#&gt; 5 ATL 895
#&gt; 6 AUS 349
#&gt; # … with 99 more rows</pre>
</div>
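<p>You may have noticed that the <code>slice_max()</code> output has 108 rows even though there are only 105 destinations. The likely explanation is ties: by default <code>slice_min()</code> and <code>slice_max()</code> keep every row that ties for the extreme value. The following sketch, on a made-up data frame (<code>toy</code> is hypothetical), shows how <code>with_ties = FALSE</code> gives exactly one row per group:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Made-up data: two ATL flights tie for the largest arrival delay
toy &lt;- tibble(
  dest = c("ATL", "ATL", "ATL", "BOS"),
  arr_delay = c(30, 30, 5, 12)
)

# Default (with_ties = TRUE): both tied ATL rows are kept
with_ties &lt;- toy |&gt;
  group_by(dest) |&gt;
  slice_max(arr_delay, n = 1)
nrow(with_ties)
#&gt; [1] 3

# with_ties = FALSE: exactly one row per destination
no_ties &lt;- toy |&gt;
  group_by(dest) |&gt;
  slice_max(arr_delay, n = 1, with_ties = FALSE)
nrow(no_ties)
#&gt; [1] 2</pre>
</div>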
</section>
<section id="grouping-by-multiple-variables" data-type="sect2">
<h2>
Grouping by multiple variables</h2>
<p>You can create groups using more than one variable. For example, we could make a group for each day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily &lt;- flights |&gt;
group_by(year, month, day)
daily
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily_flights &lt;- daily |&gt;
summarize(
n = n()
)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.</pre>
</div>
<p>If you’re happy with this behavior, you can explicitly request it in order to suppress the message:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily_flights &lt;- daily |&gt;
summarize(
n = n(),
.groups = "drop_last"
)</pre>
</div>
<p>Alternatively, change the default behavior by setting a different value, e.g. <code>"drop"</code> to drop all grouping or <code>"keep"</code> to preserve the same groups.</p>
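<p>The effect of each option is easiest to see by checking which grouping variables remain afterwards. Here is a minimal sketch on a made-up data frame (<code>toy_daily</code> is hypothetical), using <code>group_vars()</code> to list the remaining groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# A made-up data frame grouped the same way as daily
toy_daily &lt;- tibble(
  year = 2013L,
  month = c(1L, 1L, 2L),
  day = c(1L, 2L, 1L)
) |&gt;
  group_by(year, month, day)

# "drop_last" peels off the last grouping variable (day)
after_drop_last &lt;- toy_daily |&gt;
  summarize(n = n(), .groups = "drop_last") |&gt;
  group_vars()
after_drop_last
#&gt; [1] "year"  "month"

# "drop" removes all grouping
after_drop &lt;- toy_daily |&gt;
  summarize(n = n(), .groups = "drop") |&gt;
  group_vars()
after_drop
#&gt; character(0)

# "keep" preserves the original grouping
after_keep &lt;- toy_daily |&gt;
  summarize(n = n(), .groups = "keep") |&gt;
  group_vars()
after_keep
#&gt; [1] "year"  "month" "day"</pre>
</div>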
</section>
<section id="ungrouping" data-type="sect2">
<h2>
Ungrouping</h2>
<p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">daily |&gt;
ungroup() |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE),
flights = n()
)
#&gt; # A tibble: 1 × 2
#&gt; delay flights
#&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 12.6 336776</pre>
</div>
<p>As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.</p>
</section>
<section id="data-transform-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
<li><p>Find the most delayed flight to each destination.</p></li>
<li><p>How do delays vary over the course of the day? Illustrate your answer with a plot.</p></li>
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code> and friends?</p></li>
<li><p>Explain what <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> does in terms of the dplyr verbs you just learned. What does the <code>sort</code> argument to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> do?</p></li>
<li>
<p>Suppose we have the following tiny data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = 1:5,
y = c("a", "b", "a", "a", "b"),
z = c("K", "K", "L", "L", "K")
)</pre>
</div>
<ol type="a"><li>
<p>What does the following code do? Run it, analyze the result, and describe what <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> does.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y)</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> does. Also comment on how it’s different from the <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> in part (a).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
arrange(y)</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y) |&gt;
summarize(mean_x = mean(x))</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does. Then, comment on what the message says.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y, z) |&gt;
summarize(mean_x = mean(x))</pre>
</div>
</li>
<li>
<p>What does the following code do? Run it, analyze the result, and describe what the pipeline does. How is the output different from the one in part (d)?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y, z) |&gt;
summarize(mean_x = mean(x), .groups = "drop")</pre>
</div>
</li>
<li>
<p>What do the following pipelines do? Run both, analyze the results, and describe what each pipeline does. How are the outputs of the two pipelines different?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
group_by(y, z) |&gt;
summarize(mean_x = mean(x))
df |&gt;
group_by(y, z) |&gt;
mutate(mean_x = mean(x))</pre>
</div>
</li>
</ol></li>
</ol></section>
</section>
<section id="sec-sample-size" data-type="sect1">
<h1>
Case study: aggregates and sample size</h1>
<p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">delays &lt;- flights |&gt;
filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(delays, aes(x = delay)) +
geom_freqpoly(binwidth = 10)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-45-1.png" class="img-fluid" alt="A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours." width="576"/></p>
</div>
</div>
<p>Wow, there are some planes that have an <em>average</em> delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(delays, aes(x = n, y = delay)) +
geom_point(alpha = 1/10)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus average delay. Delays for planes with a very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
</div>
</div>
<p>Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases<span data-type="footnote">*cough* the central limit theorem *cough*.</span>.</p>
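<p>You can reproduce this effect with a small simulation. The sketch below is not based on the flights data: it assumes made-up delays drawn from an exponential distribution with mean 10, and the helper <code>spread_of_means()</code> is hypothetical. It measures how much the mean of <code>n</code> values varies from sample to sample:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">set.seed(1014)

# Standard deviation of the sample mean across many simulated groups,
# each containing n made-up delays with a true mean of 10
spread_of_means &lt;- function(n, reps = 1000) {
  sd(replicate(reps, mean(rexp(n, rate = 1 / 10))))
}

spread_small &lt;- spread_of_means(5)   # groups of 5 observations
spread_large &lt;- spread_of_means(500) # groups of 500 observations

# The spread shrinks roughly in proportion to 1 / sqrt(n)
c(n_5 = spread_small, n_500 = spread_large)</pre>
</div>
<p>With groups of 5 the means scatter widely around 10, while with groups of 500 they cluster tightly, which is the same funnel shape you see in the scatterplot.</p>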
<p>When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">delays |&gt;
filter(n &gt; 25) |&gt;
ggplot(aes(x = n, y = delay)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-47-1.png" class="img-fluid" alt="Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights." width="576"/></p>
</div>
</div>
<p>Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from <code>|&gt;</code> to <code>+</code>, but it’s not too much of a hassle once you get the hang of it.</p>
<p>There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the <strong>Lahman</strong> package to compare what proportion of times a player hits the ball vs. the number of attempts they take:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">batters &lt;- Lahman::Batting |&gt;
group_by(playerID) |&gt;
summarize(
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
n = sum(AB, na.rm = TRUE)
)
batters
#&gt; # A tibble: 20,166 × 3
#&gt; playerID perf n
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 aardsda01 0 4
#&gt; 2 aaronha01 0.305 12364
#&gt; 3 aaronto01 0.229 944
#&gt; 4 aasedo01 0 5
#&gt; 5 abadan01 0.0952 21
#&gt; 6 abadfe01 0.111 9
#&gt; # … with 20,160 more rows</pre>
</div>
<p>When we plot the skill of the batter (measured by the batting average, <code>perf</code>) against the number of opportunities to hit the ball (measured by times at bat, <code>n</code>), we see two patterns:</p>
<ol type="1"><li><p>As above, the variation in our aggregate decreases as we get more data points.</p></li>
<li><p>There’s a positive correlation between skill (<code>perf</code>) and opportunities to hit the ball (<code>n</code>) because teams want to give their best batters the most opportunities to hit the ball.</p></li>
</ol><div class="cell">
<pre data-type="programlisting" data-code-language="r">batters |&gt;
filter(n &gt; 100) |&gt;
ggplot(aes(x = n, y = perf)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" alt="A scatterplot of number of batting opportunities vs. batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope reaching ~0.3 when n is ~15,000." width="576"/></p>
</div>
</div>
<p>This also has important implications for ranking. If you naively sort on <code>desc(perf)</code>, the people with the best batting averages are clearly lucky, not skilled:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">batters |&gt;
arrange(desc(perf))
#&gt; # A tibble: 20,166 × 3
#&gt; playerID perf n
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 abramge01 1 1
#&gt; 2 alberan01 1 1
#&gt; 3 banisje01 1 1
#&gt; 4 bartocl01 1 1
#&gt; 5 bassdo01 1 1
#&gt; 6 birasst01 1 2
#&gt; # … with 20,160 more rows</pre>
</div>
<p>You can find a good explanation of this problem and how to overcome it at <a href="http://varianceexplained.org/r/empirical_bayes_baseball/" class="uri">http://varianceexplained.org/r/empirical_bayes_baseball/</a> and <a href="https://www.evanmiller.org/how-not-to-sort-by-average-rating.html" class="uri">https://www.evanmiller.org/how-not-to-sort-by-average-rating.html</a>.</p>
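<p>To give a flavor of the general idea, here is a deliberately simplified sketch that pulls each average toward a typical value by adding pseudo-counts before ranking. It is not the exact method from either article, and everything in it is made up for illustration: the three players, their counts, and the prior of 25 hits in 100 at bats.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Three made-up players: one lucky at bat, one modest record, one long career
toy &lt;- tibble(
  playerID = c("lucky01", "solid01", "star01"),
  hits = c(1L, 30L, 3000L),
  at_bats = c(1L, 100L, 10000L)
)

# Hypothetical prior: pretend every player starts with 25 hits in 100 at bats
prior_hits &lt;- 25
prior_at_bats &lt;- 100

ranked &lt;- toy |&gt;
  mutate(
    perf = hits / at_bats,
    # Players with few at bats are pulled strongly toward 0.25;
    # players with many at bats barely move
    perf_adjusted = (hits + prior_hits) / (at_bats + prior_at_bats)
  ) |&gt;
  arrange(desc(perf_adjusted))

ranked$playerID
#&gt; [1] "star01"  "solid01" "lucky01"</pre>
</div>
<p>Sorting on the raw <code>perf</code> would put the player with a single at bat first; after the adjustment, that player drops to the bottom because one observation is not enough to outweigh the prior.</p>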
</section>
<section id="data-transform-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
<p>For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, <code>|&gt;</code>, why we recommend it, and a little of the history that led from magrittr’s <code>%&gt;%</code> to base R’s <code>|&gt;</code>.</p>
</section>
</section>


@ -1,620 +0,0 @@
<section data-type="chapter" id="chp-data-visualize">
<h1><span id="sec-data-visualization" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data visualization</span></span></h1>
<section id="data-visualize-introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
<p>“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey</p>
</blockquote>
<p>R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the <strong>grammar of graphics</strong>, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.</p>
<p>This chapter will teach you how to visualize your data using <strong>ggplot2</strong>. We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects, the fundamental building blocks of ggplot2. We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. We’ll finish off with saving your plots and troubleshooting tips.</p>
<section id="data-visualize-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>That one line of code loads the core tidyverse: the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded)<span data-type="footnote">You can eliminate that message and force conflict resolution to happen on demand by using the conflicted package, which becomes more important as you load more packages. You can learn more about conflicted at <a href="https://conflicted.r-lib.org" class="uri">https://conflicted.r-lib.org</a>.</span>.</p>
<p>If you run this code and get the error message <code>there is no package called 'tidyverse'</code>, you’ll need to first install it, then run <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> once again.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">install.packages("tidyverse")
library(tidyverse)</pre>
</div>
<p>You only need to install a package once, but you need to load it every time you start a new session.</p>
<p>In addition to tidyverse, we will also use the <strong>palmerpenguins</strong> package, which includes the <code>penguins</code> dataset containing body measurements for penguins on three islands in the Palmer Archipelago.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(palmerpenguins)</pre>
</div>
</section>
</section>
<section id="first-steps" data-type="sect1">
<h1>
First steps</h1>
<p>Let’s use our first graph to answer a question: Do penguins with longer flippers weigh more or less than penguins with shorter flippers? You probably already have an answer, but try to make your answer precise. What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguin? And how about by the island where the penguin lives?</p>
<section id="the-penguins-data-frame" data-type="sect2">
<h2>
The penguins data frame</h2>
<p>You can test your answer with the <code>penguins</code> <strong>data frame</strong> found in palmerpenguins (a.k.a. <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">palmerpenguins::penguins</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>penguins</code> contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER<span data-type="footnote">Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. <a href="https://allisonhorst.github.io/palmerpenguins/" class="uri">https://allisonhorst.github.io/palmerpenguins/</a>. doi: 10.5281/zenodo.3960218.</span>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">penguins
#&gt; # A tibble: 344 × 8
#&gt;   species island    bill_length_mm bill_depth_mm flipper_length_mm
#&gt;   &lt;fct&gt;   &lt;fct&gt;              &lt;dbl&gt;         &lt;dbl&gt;             &lt;int&gt;
#&gt; 1 Adelie  Torgersen           39.1          18.7               181
#&gt; 2 Adelie  Torgersen           39.5          17.4               186
#&gt; 3 Adelie  Torgersen           40.3          18                 195
#&gt; 4 Adelie  Torgersen           NA            NA                  NA
#&gt; 5 Adelie  Torgersen           36.7          19.3               193
#&gt; 6 Adelie  Torgersen           39.3          20.6               190
#&gt; # … with 338 more rows, and 3 more variables: body_mass_g &lt;int&gt;, sex &lt;fct&gt;,
#&gt; #   year &lt;int&gt;</pre>
</div>
<p>This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>. Or, if you're in RStudio, run <code>View(penguins)</code> to open an interactive data viewer.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">glimpse(penguins)
#&gt; Rows: 344
#&gt; Columns: 8
#&gt; $ species           &lt;fct&gt; Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A…
#&gt; $ island            &lt;fct&gt; Torgersen, Torgersen, Torgersen, Torgersen, Torge…
#&gt; $ bill_length_mm    &lt;dbl&gt; 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.…
#&gt; $ bill_depth_mm     &lt;dbl&gt; 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.…
#&gt; $ flipper_length_mm &lt;int&gt; 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, …
#&gt; $ body_mass_g       &lt;int&gt; 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347…
#&gt; $ sex               &lt;fct&gt; male, female, female, NA, female, male, female, m…
#&gt; $ year              &lt;int&gt; 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2…</pre>
</div>
<p>Among the variables in <code>penguins</code> are:</p>
<ol type="1"><li><p><code>species</code>: a penguin's species (Adelie, Chinstrap, or Gentoo).</p></li>
<li><p><code>flipper_length_mm</code>: length of a penguin's flipper, in millimeters.</p></li>
<li><p><code>body_mass_g</code>: body mass of a penguin, in grams.</p></li>
</ol><p>To learn more about <code>penguins</code>, open its help page by running <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">?penguins</a></code>.</p>
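<p>As a quick check of the dimensions quoted above, the following minimal sketch uses base R's <code>dim()</code> and <code>names()</code>. It assumes only that palmerpenguins is installed.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(palmerpenguins)

# Number of rows (observations) and columns (variables)
dim(penguins)
#&gt; [1] 344   8

# Column names, in the order they appear in the data frame
names(penguins)</pre>
</div>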
</section>
<section id="sec-ultimate-goal" data-type="sect2">
<h2>
Ultimate goal</h2>
<p>Our ultimate goal in this chapter is to recreate the following visualization displaying the relationship between flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-7-1.png" alt="A scatterplot of body mass vs. flipper length of penguins, with a smooth curve displaying the relationship between these two variables overlaid. The plot displays a positive, fairly linear, and relatively strong relationship between these two variables. Species (Adelie, Chinstrap, and Gentoo) are represented with different colors and shapes. The relationship between body mass and flipper length is roughly the same for these three species, and Gentoo penguins are larger than penguins from the other two species." width="576"/></p>
</div>
</div>
</section>
<section id="creating-a-ggplot" data-type="sect2">
<h2>
Creating a ggplot</h2>
<p>Let's recreate this plot layer-by-layer.</p>
<p>With ggplot2, you begin a plot with the function <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>, defining a plot object that you then add layers to. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> is the dataset to use in the graph, so <code>ggplot(data = penguins)</code> creates an empty graph. This is not a very exciting plot, but you can think of it like an empty canvas you'll paint the remaining layers of your plot onto.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(data = penguins)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-8-1.png" alt="A blank, gray plot area." width="576"/></p>
</div>
</div>
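<p>Because <code>ggplot()</code> returns a plot object, you can also store it in a variable and print it later. The sketch below is one way to see this; the name <code>p</code> is only an illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(palmerpenguins)

# Store the empty plot instead of printing it right away
p &lt;- ggplot(data = penguins)

# The stored object is a ggplot; nothing is drawn until it is printed
inherits(p, "ggplot")

# Printing the object draws the same blank canvas as above
p</pre>
</div>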
<p>Next, we need to tell <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> the variables from this data frame that we want to map to visual properties (<strong>aesthetics</strong>) of the plot. The <code>mapping</code> argument of the <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> function defines how variables in your dataset are mapped to visual properties of your plot. The <code>mapping</code> argument is always paired with the <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> function, and the <code>x</code> and <code>y</code> arguments of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> specify which variables to map to the x and y axes. For now, we will only map flipper length to the <code>x</code> aesthetic and body mass to the <code>y</code> aesthetic. ggplot2 looks for the mapped variables in the <code>data</code> argument, in this case, <code>penguins</code>.</p>
<p>The following plots show the result of adding these mappings, one at a time.</p>
<div>
<pre data-type="programlisting" data-code-language="r"># Left: map flipper length to the x-axis only
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm)
)

# Right: also map body mass to the y-axis
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-9-1.png" alt="There are two plots. The plot on the left shows flipper length on the x-axis. The values range from 170 to 230. The plot on the right also shows body mass on the y-axis. The values range from 3000 to 6000." width="576"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-9-2.png" alt="There are two plots. The plot on the left shows flipper length on the x-axis. The values range from 170 to 230. The plot on the right also shows body mass on the y-axis. The values range from 3000 to 6000." width="576"/></p>
</div>
</div>
</div>
</div>
<p>Our empty canvas now has more structure: it's clear where flipper lengths will be displayed (on the x-axis) and where body masses will be displayed (on the y-axis). But the penguins themselves are not yet on the plot. This is because we have not yet articulated, in our code, how to represent the observations from our data frame on our plot.</p>
<p>To do so, we need to define a <strong>geom</strong>: the geometrical object that a plot uses to represent data. These geometric objects are made available in ggplot2 with functions that start with <code>geom_</code>. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms (<code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>), line charts use line geoms (<code><a href="https://ggplot2.tidyverse.org/reference/geom_path.html">geom_line()</a></code>), boxplots use boxplot geoms (<code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>), and so on. Scatterplots break the trend; they use the point geom: <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>.</p>
<p>The function <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code> adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn a whole bunch of geoms throughout the book, particularly in <a href="#chp-layers" data-type="xref">#chp-layers</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()
#&gt; Warning: Removed 2 rows containing missing values (`geom_point()`).</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-10-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. The plot displays a positive, linear, and relatively strong relationship between these two variables." width="576"/></p>
</div>
</div>
<p>Now we have something that looks like what we might think of as a “scatter plot”. It doesn't yet match our “ultimate goal” plot, but using this plot we can start answering the question that motivated our exploration: “What does the relationship between flipper length and body mass look like?” The relationship appears to be positive, fairly linear, and moderately strong. Penguins with longer flippers are generally larger in terms of their body mass.</p>
<p>Before we add more layers to this plot, let's pause for a moment and review the warning message we got:</p>
<blockquote class="blockquote">
<p>Removed 2 rows containing missing values (<code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>).</p>
</blockquote>
<p>We're seeing this message because there are two penguins in our dataset with missing body mass and flipper length values and ggplot2 has no way of representing them on the plot. You don't need to worry about understanding the following code yet (you will learn about it in <a href="#chp-data-transform" data-type="xref">#chp-data-transform</a>), but it's one way of identifying the observations with <code>NA</code>s for either body mass or flipper length.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">penguins |&gt;
  select(species, flipper_length_mm, body_mass_g) |&gt;
  filter(is.na(body_mass_g) | is.na(flipper_length_mm))
#&gt; # A tibble: 2 × 3
#&gt;   species flipper_length_mm body_mass_g
#&gt;   &lt;fct&gt;               &lt;int&gt;       &lt;int&gt;
#&gt; 1 Adelie                 NA          NA
#&gt; 2 Gentoo                 NA          NA</pre>
</div>
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. This type of warning is probably one of the most common types of warnings you will see when working with real data; missing values are a very common issue and you'll learn more about them throughout the book, particularly in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>. For the remaining plots in this chapter we will suppress this warning so it's not printed alongside every single plot we make.</p>
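<p>If you would rather drop the incomplete rows yourself than rely on ggplot2 to remove them, one option is to filter them out before plotting. This is a sketch rather than a recommendation, and the name <code>penguins_complete</code> is made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Keep only rows where both plotted variables are present
penguins_complete &lt;- penguins |&gt;
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

nrow(penguins_complete)
#&gt; [1] 342

# Every remaining row can be drawn, so no rows are removed at plot time
ggplot(
  data = penguins_complete,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()</pre>
</div>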
</section>
<section id="adding-aesthetics-and-layers" data-type="sect2">
<h2>
Adding aesthetics and layers</h2>
<p>Scatterplots are useful for displaying the relationship between two variables, but it's always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship. Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between flipper length and body mass. We will do this by representing species with different colored points.</p>
<p>To achieve this, where should <code>species</code> go in the ggplot call from earlier? If you guessed “in the aesthetic mapping, inside of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>”, you're already getting the hang of creating data visualizations with ggplot2! And if not, don't worry. Throughout the book you will make many more ggplots and have many more opportunities to check your intuition as you make them.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-12-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. The plot displays a positive, fairly linear, and relatively strong relationship between these two variables. Species (Adelie, Chinstrap, and Gentoo) are represented with different colors." width="576"/></p>
</div>
</div>
<p>When a variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as <strong>scaling</strong>. ggplot2 will also add a legend that explains which values correspond to which levels.</p>
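<p>To see scaling at work, you can inspect the values ggplot2 computed for a layer with <code>layer_data()</code>. The sketch below assumes the default color scale; the exact colors depend on your ggplot2 version and theme, so none are shown here. The names <code>p</code> and <code>point_data</code> are illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(palmerpenguins)

p &lt;- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

# Data for the first layer after the scales have been applied;
# ggplot2 stores the color aesthetic in a column named `colour`
point_data &lt;- layer_data(p, 1)

# One distinct color per species level
unique(point_data$colour)</pre>
</div>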
<p>Now let's add one more layer: a smooth curve displaying the relationship between body mass and flipper length. Before you proceed, refer back to the code above, and think about how we can add this to our existing plot.</p>
<p>Since this is a new geometric object representing our data, we will add a new geom: <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-13-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. Overlaid on the scatterplot are three smooth curves displaying the relationship between these variables for each species (Adelie, Chinstrap, and Gentoo). Different penguin species are plotted in different colors for the points and the smooth curves." width="576"/></p>
</div>
</div>
<p>We have successfully added smooth curves, but this plot doesn't look like the plot from <a href="#sec-ultimate-goal" data-type="xref">#sec-ultimate-goal</a>, which only has one curve for the entire dataset as opposed to separate curves for each of the penguin species.</p>
<p>When aesthetic mappings are defined in <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>, at the <em>global</em> level, they're inherited by each of the subsequent geom layers of the plot. However, each geom function in ggplot2 can also take a <code>mapping</code> argument, which allows for aesthetic mappings at the <em>local</em> level. Since we want points to be colored based on species but don't want the smooth curves to be separated out for them, we should specify <code>color = species</code> for <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code> only.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-14-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. Overlaid on the scatterplot is a single smooth curve displaying the relationship between these variables across all species (Adelie, Chinstrap, and Gentoo). Different penguin species are plotted in different colors for the points only." width="576"/></p>
</div>
</div>
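<p>One way to confirm what each layer inherited is to count the groups ggplot2 formed for the smooth layer. This sketch leans on <code>layer_data()</code>, which returns the data computed for a given layer; <code>p_global</code> and <code>p_local</code> are illustrative names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(palmerpenguins)

# Color mapped globally: inherited by both geoms
p_global &lt;- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth()

# Color mapped locally: used by the points only
p_local &lt;- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth()

# Layer 2 is geom_smooth(); each group gets its own curve
# (messages about the smoothing method and removed rows are omitted)
length(unique(layer_data(p_global, 2)$group))
#&gt; [1] 3
length(unique(layer_data(p_local, 2)$group))
#&gt; [1] 1</pre>
</div>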
<p>Voila! We have something that looks very much like our ultimate goal, though it's not yet perfect. We still need to use different shapes for each species of penguins and improve labels.</p>
<p>It's generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences. Therefore, in addition to color, we can also map <code>species</code> to the <code>shape</code> aesthetic.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-15-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. Overlaid on the scatterplot is a single smooth curve displaying the relationship between these variables across all species (Adelie, Chinstrap, and Gentoo). Different penguin species are plotted in different colors and shapes for the points only." width="576"/></p>
</div>
</div>
<p>Note that the legend is automatically updated to reflect the different shapes of the points as well.</p>
<p>And finally, we can improve the labels of our plot using the <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> function in a new layer. Some of the arguments to <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> might be self-explanatory: <code>title</code> adds a title and <code>subtitle</code> adds a subtitle to the plot. Other arguments match the aesthetic mappings: <code>x</code> is the x-axis label, <code>y</code> is the y-axis label, and <code>color</code> and <code>shape</code> define the label for the legend.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(color = species, shape = species)) +
  geom_smooth() +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  )</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-16-1.png" alt="A scatterplot of body mass vs. flipper length of penguins, with a smooth curve displaying the relationship between these two variables overlaid. The plot displays a positive, fairly linear, and relatively strong relationship between these two variables. Species (Adelie, Chinstrap, and Gentoo) are represented with different colors and shapes. The relationship between body mass and flipper length is roughly the same for these three species, and Gentoo penguins are larger than penguins from the other two species." width="576"/></p>
</div>
</div>
<p>We finally have a plot that perfectly matches our “ultimate goal”!</p>
</section>
<section id="data-visualize-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How many rows are in <code>penguins</code>? How many columns?</p></li>
<li><p>What does the <code>bill_depth_mm</code> variable in the <code>penguins</code> data frame describe? Read the help for <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">?penguins</a></code> to find out.</p></li>
<li><p>Make a scatterplot of <code>bill_depth_mm</code> vs. <code>bill_length_mm</code>. Describe the relationship between these two variables.</p></li>
<li><p>What happens if you make a scatterplot of <code>species</code> vs. <code>bill_depth_mm</code>? Why is the plot not useful?</p></li>
<li>
<p>Why does the following give an error and how would you fix it?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(data = penguins) +
  geom_point()</pre>
</div>
</li>
<li><p>What does the <code>na.rm</code> argument do in <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>? What is the default value of the argument? Create a scatterplot where you successfully use this argument set to <code>TRUE</code>.</p></li>
<li><p>Add the following caption to the plot you made in the previous exercise: “Data come from the palmerpenguins package.” Hint: Take a look at the documentation for <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code>.</p></li>
<li>
<p>Recreate the following visualization. What aesthetic should <code>bill_depth_mm</code> be mapped to? And should it be mapped at the global level or at the geom level?</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-18-1.png" alt="A scatterplot of body mass vs. flipper length of penguins, colored by bill depth. A smooth curve of the relationship between body mass and flipper length is overlaid. The relationship is positive, fairly linear, and moderately strong." width="576"/></p>
</div>
</div>
</li>
<li>
<p>Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = island)
) +
  geom_point() +
  geom_smooth(se = FALSE)</pre>
</div>
</li>
<li>
<p>Will these two graphs look different? Why/why not?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_smooth(
    data = penguins,
    mapping = aes(x = flipper_length_mm, y = body_mass_g)
  )</pre>
</div>
</li>
</ol></section>
</section>
<section id="ggplot2-calls" data-type="sect1">
<h1>
ggplot2 calls</h1>
<p>As we move on from these introductory sections, we'll transition to a more concise expression of ggplot2 code. So far we've been very explicit, which is helpful when you are learning:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()</pre>
</div>
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> are <code>data</code> and <code>mapping</code>; in the remainder of the book, we won't supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. That's a really important programming concern that we'll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
<p>Rewriting the previous plot more concisely yields:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()</pre>
</div>
<p>In the future, you'll also learn about the pipe, which will allow you to create that plot with:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">penguins |&gt;
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()</pre>
</div>
<p>This is the most common syntax you'll see in the wild.</p>
</section>
<section id="visualizing-distributions" data-type="sect1">
<h1>
Visualizing distributions</h1>
<p>How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.</p>
<section id="a-categorical-variable" data-type="sect2">
<h2>
A categorical variable</h2>
<p>A variable is <strong>categorical</strong> if it can only take one of a small set of values. To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each <code>x</code> value.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = species)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-24-1.png" alt="A bar chart of frequencies of species of penguins: Adelie (approximately 150), Chinstrap (approximately 90), Gentoo (approximately 125)." width="576"/></p>
</div>
</div>
<p>In bar plots of categorical variables with non-ordered levels, like the penguin <code>species</code> above, it's often preferable to reorder the bars based on their frequencies. Doing so requires transforming the variable to a factor (how R handles categorical data) and then reordering the levels of that factor.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-25-1.png" alt="A bar chart of frequencies of species of penguins, where the bars are ordered in decreasing order of their heights (frequencies): Adelie (approximately 150), Gentoo (approximately 125), Chinstrap (approximately 90)." width="576"/></p>
</div>
</div>
<p>You will learn more about factors and functions for dealing with factors (like <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code> shown above) in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
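<p>To see what <code>fct_infreq()</code> does to the factor itself, you can compare the level order before and after. A minimal sketch, assuming forcats and palmerpenguins are installed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)
library(palmerpenguins)

# Original level order (alphabetical)
levels(penguins$species)
#&gt; [1] "Adelie"    "Chinstrap" "Gentoo"

# Levels reordered from most to least frequent
levels(fct_infreq(penguins$species))
#&gt; [1] "Adelie"    "Gentoo"    "Chinstrap"</pre>
</div>
<p>Only the order of the levels changes; the values stored in the column stay the same, which is why the bars move but their heights do not.</p>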
</section>
<section id="a-numerical-variable" data-type="sect2">
<h2>
A numerical variable</h2>
<p>A variable is <strong>numerical</strong> if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of numerical variables. To visualize the distribution of a numerical variable, you can use a histogram or a density plot.</p>
<div>
<pre data-type="programlisting" data-code-language="r"># Left: histogram
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

# Right: density plot
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-26-1.png" alt="A histogram (on the left) and density plot (on the right) of body masses of penguins. The distribution is unimodal and right skewed, ranging between approximately 2500 to 6500 grams." width="576"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-26-2.png" alt="A histogram (on the left) and density plot (on the right) of body masses of penguins. The distribution is unimodal and right skewed, ranging between approximately 2500 to 6500 grams." width="576"/></p>
</div>
</div>
</div>
</div>
<p>A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that 39 observations have a <code>body_mass_g</code> value between 3,500 and 3,700 grams, which are the left and right edges of the bar.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">penguins |&gt;
  count(cut_width(body_mass_g, 200))
#&gt; # A tibble: 19 × 2
#&gt;   `cut_width(body_mass_g, 200)`     n
#&gt;   &lt;fct&gt;                         &lt;int&gt;
#&gt; 1 [2.7e+03,2.9e+03]                 7
#&gt; 2 (2.9e+03,3.1e+03]                10
#&gt; 3 (3.1e+03,3.3e+03]                23
#&gt; 4 (3.3e+03,3.5e+03]                38
#&gt; 5 (3.5e+03,3.7e+03]                39
#&gt; 6 (3.7e+03,3.9e+03]                37
#&gt; # … with 13 more rows</pre>
</div>
<p>You can set the width of the intervals in a histogram with the <code>binwidth</code> argument, which is measured in the units of the <code>x</code> variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. In the plots below, a binwidth of 20 is too narrow: it results in too many bars, making it difficult to determine the shape of the distribution. Similarly, a binwidth of 2,000 is too wide: all the data are binned into only three bars, which also makes it difficult to determine the shape of the distribution.</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-28-1.png" alt="Three histograms of body masses of penguins, one with binwidth of 20 (left), one with binwidth of 200 (center), and one with binwidth of 2000 (right). The histogram with binwidth of 20 shows lots of ups and downs in the heights of the bins, creating a jagged outline. The histogram with binwidth of 2000 shows only three bins." width="576"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-28-2.png" alt="Three histograms of body masses of penguins, one with binwidth of 20 (left), one with binwidth of 200 (center), and one with binwidth of 2000 (right). The histogram with binwidth of 20 shows lots of ups and downs in the heights of the bins, creating a jagged outline. The histogram with binwidth of 2000 shows only three bins." width="576"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-28-3.png" alt="Three histograms of body masses of penguins, one with binwidth of 20 (left), one with binwidth of 200 (center), and one with binwidth of 2000 (right). The histogram with binwidth of 20 shows lots of ups and downs in the heights of the bins, creating a jagged outline. The histogram with binwidth of 2000 shows only three bins." width="576"/></p>
</div>
</div>
</div>
</div>
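<p>To get a feel for how strongly the width changes the binning, you can count the intervals that <code>cut_width()</code> creates for different widths. This is a rough sketch: the helper <code>n_intervals()</code> is hypothetical, and <code>cut_width()</code> is used here as a stand-in for the binning that <code>geom_histogram()</code> performs internally.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(palmerpenguins)

# Body masses with the missing values dropped
mass &lt;- penguins$body_mass_g[!is.na(penguins$body_mass_g)]

# Hypothetical helper: number of intervals cut_width() creates for a width
n_intervals &lt;- function(width) nlevels(cut_width(mass, width))

n_intervals(200)
#&gt; [1] 18
n_intervals(2000)
#&gt; [1] 3</pre>
</div>
<p>The three intervals for a width of 2,000 are consistent with the three bars described above. Calling <code>n_intervals(20)</code> shows how many more intervals the narrowest setting produces.</p>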
</section>
<section id="data-visualize-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Make a bar plot of <code>species</code> of <code>penguins</code>, where you assign <code>species</code> to the <code>y</code> aesthetic. How is this plot different?</p></li>
<li>
<p>How are the following two plots different? Which aesthetic, <code>color</code> or <code>fill</code>, is more useful for changing the color of bars?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")

ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")</pre>
</div>
</li>
<li><p>What does the <code>bins</code> argument in <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> do?</p></li>
<li><p>Make a histogram of the <code>carat</code> variable in the <code>diamonds</code> dataset. Experiment with different binwidths. What binwidth reveals the most interesting patterns?</p></li>
</ol></section>
</section>
<section id="visualizing-relationships" data-type="sect1">
<h1>
Visualizing relationships</h1>
<p>To visualize a relationship we need to have at least two variables mapped to aesthetics of a plot. In the following sections you will learn about commonly used plots for visualizing relationships between two or more variables and the geoms used for creating them.</p>
<section id="a-numerical-and-a-categorical-variable" data-type="sect2">
<h2>
A numerical and a categorical variable</h2>
<p>To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots. A <strong>boxplot</strong> is a type of visual shorthand for a distribution of values that is popular among statisticians. As shown in <a href="#fig-eda-boxplot" data-type="xref">#fig-eda-boxplot</a>, each boxplot consists of:</p>
<ul><li><p>A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.</p></li>
<li><p>Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.</p></li>
<li><p>A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.</p></li>
</ul><div class="cell">
<div class="cell-output-display">
<figure id="fig-eda-boxplot"><p><img src="images/EDA-boxplot.png" alt="A diagram depicting how a boxplot is created following the steps outlined above." width="1066"/></p>
<figcaption>Diagram depicting how a boxplot is created.</figcaption>
</figure>
</div>
</div>
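<p>To make the pieces of a boxplot concrete, the quantities it draws can also be computed by hand. The following is a minimal sketch, assuming that <code>penguins</code> comes from the palmerpenguins package and that dplyr is installed; the column names (<code>q1</code>, <code>med</code>, <code>q3</code>, <code>iqr</code>, <code>lower_fence</code>, <code>upper_fence</code>) are our own labels for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(palmerpenguins)

# Quartiles, IQR, and the 1.5 * IQR cutoffs for each species
box_stats &lt;- penguins |&gt;
  group_by(species) |&gt;
  summarize(
    q1 = quantile(body_mass_g, 0.25, na.rm = TRUE),
    med = median(body_mass_g, na.rm = TRUE),
    q3 = quantile(body_mass_g, 0.75, na.rm = TRUE),
    iqr = q3 - q1,
    lower_fence = q1 - 1.5 * iqr,
    upper_fence = q3 + 1.5 * iqr
  )
box_stats</pre>
</div>
<p>Observations below <code>lower_fence</code> or above <code>upper_fence</code> are the ones a boxplot would draw as individual points; the whiskers stop at the most extreme observations inside those limits.</p>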
<p>Let’s take a look at the distribution of body mass by species using <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-31-1.png" alt="Side-by-side box plots of distributions of body masses of Adelie, Chinstrap, and Gentoo penguins. The distribution of Adelie and Chinstrap penguins' body masses appear to be symmetric with medians around 3750 grams. The median body mass of Gentoo penguins is much higher, around 5000 grams, and the distribution of the body masses of these penguins appears to be somewhat right skewed." width="576"/></p>
</div>
</div>
<p>Alternatively, we can make frequency polygons with <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, but instead of displaying the counts with bars, it uses lines. It’s much easier to compare overlapping lines than the overlapping bars of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>. There are a few challenges with this type of plot, which we will come back to in <a href="#sec-cat-num" data-type="xref">#sec-cat-num</a> on exploring the covariation between a categorical and a numerical variable.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = body_mass_g, color = species)) +
geom_freqpoly(binwidth = 200, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-32-1.png" alt="A frequency polygon of body masses of penguins by species of penguins. Each species (Adelie, Chinstrap, and Gentoo) is represented with different colored outlines for the polygons." width="576"/></p>
</div>
</div>
<p>We’ve also customized the thickness of the lines using the <code>linewidth</code> argument in order to make them stand out a bit more against the background.</p>
<p>We can also use overlaid density plots, with <code>species</code> mapped to both <code>color</code> and <code>fill</code> aesthetics and using the <code>alpha</code> aesthetic to add transparency to the filled density curves. This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque). In the following plot it’s <em>set</em> to 0.5.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
geom_density(alpha = 0.5)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-33-1.png" alt="A density plot of body masses of penguins by species. Each species of penguins (Adelie, Chinstrap, and Gentoo) is represented with a different colored outline for its density curve. The density curves are also filled with the same colors, with some transparency added." width="576"/></p>
</div>
</div>
<p>Note the terminology we have used here:</p>
<ul><li>We <em>map</em> variables to aesthetics if we want the visual attribute represented by that aesthetic to vary based on the values of that variable.</li>
<li>Otherwise, we <em>set</em> the value of an aesthetic.</li>
</ul></section>
<section id="data-visualize-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>We can use segmented bar plots to visualize the relationship between two categorical variables. In creating this bar chart, we map the variable that defines the bars to the <code>x</code> aesthetic and the variable that subdivides each bar to the <code>fill</code> aesthetic.</p>
<p>Below are two segmented bar plots, both displaying the relationship between <code>island</code> and <code>species</code>, or specifically, visualizing the distribution of <code>species</code> within each island. The plot on the left shows the frequencies of each species of penguins on each island and the plot on the right shows the relative frequencies (proportions) of each species within each island (despite the incorrectly labeled y-axis that says “count”). The relative frequency plot, created by setting <code>position = "fill"</code> in the geom, is more useful for comparing species distributions across islands since it’s not affected by the unequal numbers of penguins across the islands. Taken together, the two plots show that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = island, fill = species)) +
geom_bar()
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = "fill")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-1.png" alt="Bar plot of penguin species by island (Biscoe, Dream, and Torgersen), showing the frequencies of each species on each island." width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-2.png" alt="Bar plot of penguin species by island (Biscoe, Dream, and Torgersen), showing the relative frequencies of each species within each island." width="576"/></p>
</div>
</div>
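<p>The numbers behind the two bar plots can also be tabulated directly, which is a handy check on what you read off the plots. This is a minimal sketch, assuming <code>penguins</code> comes from the palmerpenguins package and that dplyr is installed; <code>prop</code> is our own column name for the within-island proportion:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(palmerpenguins)

# n is what geom_bar() draws by default;
# prop is what position = "fill" draws
species_by_island &lt;- penguins |&gt;
  count(island, species) |&gt;
  group_by(island) |&gt;
  mutate(prop = n / sum(n)) |&gt;
  ungroup()
species_by_island</pre>
</div>
<p>Within each island the values of <code>prop</code> sum to 1, which is why every bar in the relative frequency plot has the same height.</p>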
</section>
<section id="data-visualize-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>So far you’ve learned about scatterplots (created with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>) and smooth curves (created with <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-35-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. The plot displays a positive, linear, relatively strong relationship between these two variables." width="576"/></p>
</div>
</div>
</section>
<section id="three-or-more-variables" data-type="sect2">
<h2>
Three or more variables</h2>
<p>One way to add additional variables to a plot is by mapping them to an aesthetic. For example, in the following scatterplot the colors of points represent species and the shapes of points represent islands.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = island))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-36-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. The plot displays a positive, linear, relatively strong relationship between these two variables. The points are colored based on the species of the penguins and the shapes of the points represent islands (round points are Biscoe island, triangles are Dream island, and squared are Torgersen island). The plot is very busy and it's difficult to distinguish the shapes of the points." width="576"/></p>
</div>
</div>
<p>However, adding too many aesthetic mappings to a plot makes it cluttered and difficult to make sense of. Another way to add variables, which is particularly useful for categorical variables, is to split your plot into <strong>facets</strong>, subplots that each display one subset of the data.</p>
<p>To facet your plot by a single variable, use <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> is a formula<span data-type="footnote">Here “formula” is the name of the type of thing created by <code>~</code>, not a synonym for “equation”.</span>, which you create with <code>~</code> followed by a variable name. The variable that you pass to <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> should be categorical.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~island)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-37-1.png" alt="A scatterplot of body mass vs. flipper length of penguins. The shapes and colors of points represent species. Penguins from each island are on a separate facet. Within each facet, the relationship between body mass and flipper length is positive, linear, relatively strong." width="768"/></p>
</div>
</div>
<p>You will learn about many other geoms for visualizing distributions of variables and relationships between them in <a href="#chp-layers" data-type="xref">#chp-layers</a>.</p>
</section>
<section id="data-visualize-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which variables in <code>mpg</code> are categorical? Which variables are continuous? (Hint: type <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code> to read the documentation for the dataset). How can you see this information when you run <code>mpg</code>?</p></li>
<li><p>Make a scatterplot of <code>hwy</code> vs. <code>displ</code> using the <code>mpg</code> data frame. Next, map a third, numerical variable to <code>color</code>, then <code>size</code>, then both <code>color</code> and <code>size</code>, then <code>shape</code>. How do these aesthetics behave differently for categorical vs. numerical variables?</p></li>
<li><p>In the scatterplot of <code>hwy</code> vs. <code>displ</code>, what happens if you map a third variable to <code>linewidth</code>?</p></li>
<li><p>What happens if you map the same variable to multiple aesthetics?</p></li>
<li><p>Make a scatterplot of <code>bill_depth_mm</code> vs. <code>bill_length_mm</code> and color the points by <code>species</code>. What does adding coloring by species reveal about the relationship between these two variables?</p></li>
<li>
<p>Why does the following yield two separate legends? How would you fix it to combine the two legends?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(
data = penguins,
mapping = aes(
x = bill_length_mm, y = bill_depth_mm,
color = species, shape = species
)
) +
geom_point() +
labs(color = "Species")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-38-1.png" alt="Scatterplot of bill depth vs. bill length where different color and shape pairings represent each species. The plot has two legends, one labelled &quot;species&quot; which shows the shape scale and the other that shows the color scale." width="576"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="sec-ggsave" data-type="sect1">
<h1>
Saving your plots</h1>
<p>Once you’ve made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere. That’s the job of <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code>, which will save the most recent plot to disk:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
ggsave(filename = "my-plot.png")</pre>
</div>
<p>This will save your plot to your working directory, a concept you’ll learn more about in <a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>.</p>
<p>If you don’t specify the <code>width</code> and <code>height</code> they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> in the documentation.</p>
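<p>As a minimal sketch of a reproducible save, the call below fixes the size explicitly and passes the plot object by name instead of relying on the most recent plot. The object name <code>p</code> and the particular dimensions are arbitrary choices for illustration, and <code>penguins</code> is assumed to come from the palmerpenguins package:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(palmerpenguins)

p &lt;- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Explicit size and resolution, independent of the plotting device
ggsave(
  filename = "my-plot.png",
  plot = p,
  width = 6,
  height = 4,
  units = "in",
  dpi = 300
)</pre>
</div>
<p>Naming the plot also protects you if another plot happens to be drawn between creating the one you want and saving it.</p>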
<p>Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in <a href="#chp-quarto" data-type="xref">#chp-quarto</a>.</p>
<section id="data-visualize-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Run the following lines of code. Which of the two plots is saved as <code>mpg-plot.png</code>? Why?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = class)) +
geom_bar()
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point()
ggsave("mpg-plot.png")</pre>
</div>
</li>
<li><p>What do you need to change in the code above to save the plot as a PDF instead of a PNG?</p></li>
</ol></section>
</section>
<section id="common-problems" data-type="sect1">
<h1>
Common problems</h1>
<p>As you start to run R code, you’re likely to run into problems. Don’t worry: it happens to everyone. We have all been writing R code for years, but every day we still write code that doesn’t work!</p>
<p>Start by carefully comparing the code that you’re running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every <code>(</code> is matched with a <code>)</code> and every <code>"</code> is paired with another <code>"</code>. Sometimes you’ll run the code and nothing happens. Check the left-hand side of your console: if it’s a <code>+</code>, it means that R doesn’t think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s usually easiest to start from scratch again by pressing ESCAPE to abort processing the current command.</p>
<p>One common problem when creating ggplot2 graphics is to put the <code>+</code> in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))</pre>
</div>
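<p>For comparison, here is the same plot with the <code>+</code> moved to the end of the first line, so that R knows the expression continues onto the next line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# The trailing + tells R that the expression is not finished yet
fixed_plot &lt;- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
fixed_plot</pre>
</div>
<p>The name <code>fixed_plot</code> is only there for illustration; the important part is where the <code>+</code> sits.</p>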
<p>If you’re still stuck, try the help. You can get help about any R function by running <code>?function_name</code> in the console, or selecting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful; instead, skip down to the examples and look for code that matches what you’re trying to do.</p>
<p>If that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.</p>
</section>
<section id="data-visualize-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size and shape. You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.</p>
<p>We’ll use visualizations again and again throughout this book, introducing new techniques as we need them, and we’ll take a deeper dive into creating visualizations with ggplot2 in <a href="#chp-layers" data-type="xref">#chp-layers</a> through <a href="#chp-EDA" data-type="xref">#chp-EDA</a>.</p>
<p>With the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organized as you write increasing amounts of R code.</p>
</section>
</section>

@ -1,733 +0,0 @@
<section data-type="chapter" id="chp-databases">
<h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1>
<section id="databases-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a <code>.csv</code> for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.</p>
<p>In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL<span data-type="footnote">SQL is either pronounced “s”-“q”-“l” or “sequel”.</span> query. <strong>SQL</strong>, short for <strong>s</strong>tructured <strong>q</strong>uery <strong>l</strong>anguage, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.</p>
<section id="databases-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)
library(dbplyr)
library(tidyverse)</pre>
</div>
</section>
</section>
<section id="database-basics" data-type="sect1">
<h1>
Database basics</h1>
<p>At the simplest level, you can think about a database as a collection of data frames, called <strong>tables</strong> in database terminology. Like a data.frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high-level differences between data frames and database tables:</p>
<ul><li><p>Database tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).</p></li>
<li><p>Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.</p></li>
<li><p>Most classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called <strong>row-oriented</strong> because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of <strong>column-oriented</strong> databases that make analyzing the existing data much faster.</p></li>
</ul><p>Databases are run by database management systems (<strong>DBMS</strong>s for short), which come in three basic forms:</p>
<ul><li>
<strong>Client-server</strong> DBMSs run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMSs include PostgreSQL, MariaDB, SQL Server, and Oracle.</li>
<li>
<strong>Cloud</strong> DBMSs, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMSs, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.</li>
<li>
<strong>In-process</strong> DBMSs, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.</li>
</ul></section>
<section id="connecting-to-a-database" data-type="sect1">
<h1>
Connecting to a database</h1>
<p>To connect to the database from R, you’ll use a pair of packages:</p>
<ul><li><p>You’ll always use DBI (<strong>d</strong>ata<strong>b</strong>ase <strong>i</strong>nterface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.</p></li>
<li><p>You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for Postgres and RMariaDB for MySQL.</p></li>
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMSs. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
<p>Concretely, you create a database connection using <code><a href="https://dbi.r-dbi.org/reference/dbConnect.html">DBI::dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(
RMariaDB::MariaDB(),
username = "foo"
)
con &lt;- DBI::dbConnect(
RPostgres::Postgres(),
host = "databases.mycompany.com",
port = 1234
)</pre>
</div>
<p>The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (<strong>d</strong>ata<strong>b</strong>ase <strong>a</strong>dministrator). The initial setup will often take a little fiddling (and maybe some googling) to get right, but you’ll generally only need to do it once.</p>
<section id="in-this-book" data-type="sect2">
<h2>
In this book</h2>
<p>Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code and just as easily take what you learn and apply it elsewhere.</p>
<p>Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(duckdb::duckdb())</pre>
</div>
<p>duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the <code>dbdir</code> argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (<a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>), it’s reasonable to store it in the <code>duckdb</code> directory of the current project:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con &lt;- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
</div>
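<p>With a persistent database it is also worth closing the connection once you are finished with it. The following is a minimal sketch, assuming a throwaway file under <code>tempdir()</code> rather than your project directory; the names <code>db_path</code> and <code>tmp_con</code> are hypothetical, chosen so that they do not clash with the <code>con</code> object used in the rest of the chapter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A throwaway location for the database file (illustration only)
db_path &lt;- file.path(tempdir(), "example.duckdb")

tmp_con &lt;- DBI::dbConnect(duckdb::duckdb(), dbdir = db_path)
DBI::dbWriteTable(tmp_con, "mpg", ggplot2::mpg)

# Close the connection and shut down the in-process database
DBI::dbDisconnect(tmp_con, shutdown = TRUE)</pre>
</div>
<p>Reconnecting later with the same <code>dbdir</code> should give you back the tables you wrote, which is the point of making the database persistent.</p>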
</section>
<section id="sec-load-data" data-type="sect2">
<h2>
Load some data</h2>
<p>Since this is a new database, we need to start by adding some data. Here we’ll add <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">DBI::dbWriteTable()</a></code>. The simplest usage of <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
</div>
<p>If you’re using duckdb in a real project, we highly recommend learning about <code>duckdb_read_csv()</code> and <code>duckdb_register_arrow()</code>. These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.</p>
<p>We’ll also show off a useful technique for loading multiple files into a database in <a href="#sec-save-database" data-type="xref">#sec-save-database</a>.</p>
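<p>As a rough sketch of what loading a CSV directly might look like, the example below writes <code>mpg</code> to a temporary file and then asks duckdb to read that file into a table. The file, the table name <code>mpg_csv</code>, and the separate connection <code>csv_con</code> are all made up for illustration; check the duckdb documentation for the options your version supports:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Create a small CSV file to stand in for real data on disk
csv_path &lt;- tempfile(fileext = ".csv")
write.csv(ggplot2::mpg, csv_path, row.names = FALSE)

csv_con &lt;- DBI::dbConnect(duckdb::duckdb())

# duckdb reads the file itself; the rows never pass through an R data frame
duckdb::duckdb_read_csv(csv_con, "mpg_csv", csv_path)

tables &lt;- DBI::dbListTables(csv_con)
tables

DBI::dbDisconnect(csv_con, shutdown = TRUE)</pre>
</div>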
</section>
</section>
<section id="dbi-basics" data-type="sect1">
<h1>
DBI basics</h1>
<p>Now that we’ve connected to a database with some data in it, let’s perform some basic operations with DBI.</p>
<section id="whats-there" data-type="sect2">
<h2>
What’s there?</h2>
<p>The most important database objects for data scientists are tables. DBI provides two useful functions to either list all the tables in the database<span data-type="footnote">At least, all the tables that you have permission to see.</span> or to check if a specific table already exists:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dbListTables(con)
#&gt; [1] "diamonds" "mpg"
dbExistsTable(con, "foo")
#&gt; [1] FALSE</pre>
</div>
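<p>One place these functions come in handy is guarding against writing a table twice. The following is a minimal sketch on a separate in-memory connection; <code>demo_con</code> and <code>fields</code> are hypothetical names used only here, and <code>dbListFields()</code> is another DBI function that lists the columns of a table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

demo_con &lt;- dbConnect(duckdb::duckdb())

# Only create the table if it is not already there
if (!dbExistsTable(demo_con, "mpg")) {
  dbWriteTable(demo_con, "mpg", ggplot2::mpg)
}

# Inspect the column names without pulling any rows into R
fields &lt;- dbListFields(demo_con, "mpg")
fields

dbDisconnect(demo_con, shutdown = TRUE)</pre>
</div>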
</section>
<section id="extract-some-data" data-type="sect2">
<h2>
Extract some data</h2>
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">con |&gt;
dbReadTable("diamonds") |&gt;
as_tibble()
#&gt; # A tibble: 53,940 × 10
#&gt; carat cut color clarity depth table price x y z
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#&gt; 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#&gt; 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#&gt; 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#&gt; 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with 53,934 more rows</pre>
</div>
<p><code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> returns a <code>data.frame</code> so we use <code><a href="https://tibble.tidyverse.org/reference/as_tibble.html">as_tibble()</a></code> to convert it into a tibble so that it prints nicely.</p>
<p>In real life, it’s rare that you’ll use <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.</p>
</section>
<section id="sec-dbGetQuery" data-type="sect2">
<h2>
Run a query</h2>
<p>The way you’ll usually retrieve data is with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sql &lt;- "
SELECT carat, cut, clarity, color, price
FROM diamonds
WHERE price &gt; 15000
"
as_tibble(dbGetQuery(con, sql))
#&gt; # A tibble: 1,655 × 5
#&gt; carat cut clarity color price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium VS2 E 15002
#&gt; 2 1.19 Ideal VVS1 F 15005
#&gt; 3 2.1 Premium SI1 I 15007
#&gt; 4 1.69 Ideal SI1 D 15011
#&gt; 5 1.5 Very Good VVS2 G 15013
#&gt; 6 1.73 Very Good VS1 G 15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where <code>price</code> is greater than 15,000.</p>
<p>You’ll need to be a little careful with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> since it can potentially return more data than you have memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="https://dbi.r-dbi.org/reference/dbSendQuery.html">dbSendQuery()</a></code> to get a “result set” which you can page through by calling <code><a href="https://dbi.r-dbi.org/reference/dbFetch.html">dbFetch()</a></code> until <code><a href="https://dbi.r-dbi.org/reference/dbHasCompleted.html">dbHasCompleted()</a></code> returns <code>TRUE</code>.</p>
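<p>The paging pattern can be sketched as a simple loop. This example uses a separate in-memory connection (<code>demo_con</code>, a hypothetical name) and an arbitrary page size of 500 rows; how much memory paging actually saves depends on how the DBMS and its R package deliver results, so treat this as an illustration of the DBI calls rather than a performance recipe:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

demo_con &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(demo_con, "diamonds", ggplot2::diamonds)

# Send the query without retrieving any rows yet
res &lt;- dbSendQuery(
  demo_con,
  "SELECT carat, price FROM diamonds WHERE price &gt; 15000"
)

n_rows &lt;- 0
while (!dbHasCompleted(res)) {
  chunk &lt;- dbFetch(res, n = 500)   # at most 500 rows per page
  n_rows &lt;- n_rows + nrow(chunk)   # replace with your own processing
}

# Always release the result set when you are done with it
dbClearResult(res)
dbDisconnect(demo_con, shutdown = TRUE)

n_rows</pre>
</div>
<p>Here each page is only counted, but in practice you would summarize or write out each <code>chunk</code> before fetching the next one.</p>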
</section>
<section id="databases-other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
</section>
</section>
<section id="dbplyr-basics" data-type="sect1">
<h1>
dbplyr basics</h1>
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="https://dtplyr.tidyverse.org">dtplyr</a> which translates to <a href="https://r-datatable.com">data.table</a>, and <a href="https://multidplyr.tidyverse.org">multidplyr</a> which executes your code on multiple cores.</p>
<p>To use dbplyr, you must first use <code><a href="https://dplyr.tidyverse.org/reference/tbl.html">tbl()</a></code> to create an object that represents a database table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, "diamonds")
diamonds_db
#&gt; # Source: table&lt;diamonds&gt; [?? x 10]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; carat cut color clarity depth table price x y z
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#&gt; 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#&gt; 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#&gt; 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#&gt; 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with more rows</pre>
</div>
<div data-type="note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div>
<p>Other times you might want to use your own SQL query as a starting point:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
</div>
</div>
<p>This object is <strong>lazy</strong>; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">big_diamonds_db &lt;- diamonds_db |&gt;
filter(price &gt; 15000) |&gt;
select(carat:clarity, price)
big_diamonds_db
#&gt; # Source: SQL [?? x 5]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; carat cut color clarity price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium E VS2 15002
#&gt; 2 1.19 Ideal F VVS1 15005
#&gt; 3 2.1 Premium I SI1 15007
#&gt; 4 1.69 Ideal D SI1 15011
#&gt; 5 1.5 Very Good G VVS2 15013
#&gt; 6 1.73 Very Good G VS1 15014
#&gt; # … with more rows</pre>
</div>
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
<p>You can see the SQL code that dbplyr generates by calling <code><a href="https://dplyr.tidyverse.org/reference/explain.html">show_query()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">big_diamonds_db |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT carat, cut, color, clarity, price
#&gt; FROM diamonds
#&gt; WHERE (price &gt; 15000.0)</pre>
</div>
<p>To get all the data back into R, you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">big_diamonds &lt;- big_diamonds_db |&gt;
collect()
big_diamonds
#&gt; # A tibble: 1,655 × 5
#&gt; carat cut color clarity price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium E VS2 15002
#&gt; 2 1.19 Ideal F VVS1 15005
#&gt; 3 2.1 Premium I SI1 15007
#&gt; 4 1.69 Ideal D SI1 15011
#&gt; 5 1.5 Very Good G VVS2 15013
#&gt; 6 1.73 Very Good G VS1 15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p>
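<p>That workflow might look like the following sketch. The connection <code>demo_con</code> and the choice of summary are assumptions made for illustration; the point is only the division of labor between the database and R.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection holding a copy of the diamonds data
demo_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(demo_con, "diamonds", ggplot2::diamonds)

# Steps 1 and 2: filter and aggregate inside the database,
# then bring the small result into R as a tibble
price_by_cut &lt;- tbl(demo_con, "diamonds") |&gt;
  filter(carat &gt;= 1) |&gt;
  group_by(cut) |&gt;
  summarize(n = n(), avg_price = mean(price, na.rm = TRUE)) |&gt;
  collect()

# Step 3: continue with ordinary R code on the in-memory tibble
price_summary &lt;- price_by_cut |&gt;
  mutate(share = n / sum(n)) |&gt;
  arrange(desc(avg_price))

DBI::dbDisconnect(demo_con, shutdown = TRUE)
price_summary</pre>
</div>
<p>Only one row per cut crosses from the database into R, which is what makes this pattern practical for tables far larger than memory.</p>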
</section>
<section id="sql" data-type="sect1">
<h1>
SQL</h1>
<p>The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.</p>
<p>We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: <code>flights</code> and <code>planes</code>. These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dbplyr::copy_nycflights13(con)
#&gt; Creating table: airlines
#&gt; Creating table: airports
#&gt; Creating table: flights
#&gt; Creating table: planes
#&gt; Creating table: weather
flights &lt;- tbl(con, "flights")
planes &lt;- tbl(con, "planes")</pre>
</div>
<section id="sql-basics" data-type="sect2">
<h2>
SQL basics</h2>
<p>The top-level components of SQL are called <strong>statements</strong>. Common statements include <code>CREATE</code> for defining new tables, <code>INSERT</code> for adding data, and <code>SELECT</code> for retrieving data. We will focus on <code>SELECT</code> statements, also called <strong>queries</strong>, because they are almost exclusively what you’ll use as a data scientist.</p>
<p>A query is made up of <strong>clauses</strong>. There are five important clauses: <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>ORDER BY</code>, and <code>GROUP BY</code>. Every query must have the <code>SELECT</code><span data-type="footnote">Confusingly, depending on the context, <code>SELECT</code> is either a statement or a clause. To avoid this confusion, we’ll generally use query instead of <code>SELECT</code> statement.</span> and <code>FROM</code><span data-type="footnote">Ok, technically, only the <code>SELECT</code> is required, since you can write queries like <code>SELECT 1+1</code> to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a <code>FROM</code> clause.</span> clauses and the simplest query is <code>SELECT * FROM table</code>, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
planes |&gt; show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM planes</pre>
</div>
<p><code>WHERE</code> and <code>ORDER BY</code> control which rows are included and how they are ordered:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dest == "IAH") |&gt;
arrange(dep_delay) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest = 'IAH')
#&gt; ORDER BY dep_delay</pre>
</div>
<p><code>GROUP BY</code> converts the query to a summary, causing aggregation to happen:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT dest, AVG(dep_delay) AS dep_delay
#&gt; FROM flights
#&gt; GROUP BY dest</pre>
</div>
<p>There are two important differences between dplyr verbs and SELECT clauses:</p>
<ul><li>In SQL, case doesn’t matter: you can write <code>select</code>, <code>SELECT</code>, or even <code>SeLeCt</code>. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.</li>
<li>In SQL, order matters: you must always write the clauses in the order <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>GROUP BY</code>, <code>ORDER BY</code>. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first <code>FROM</code>, then <code>WHERE</code>, <code>GROUP BY</code>, <code>SELECT</code>, and <code>ORDER BY</code>.</li>
</ul><p>The following sections explore each clause in more detail.</p>
<div data-type="note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
</div>
</div>
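<p>To see the required clause order in one place, here is a sketch of a pipeline that exercises all five clauses. <code>flights_demo</code> is a made-up four-row table and <code>clause_con</code> a hypothetical connection, used only so the example runs on its own; the exact SQL text printed by <code>show_query()</code> may vary between dbplyr versions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection and a tiny made-up table
clause_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(clause_con, "flights_demo", data.frame(
  origin = c("JFK", "JFK", "LGA", "JFK"),
  dest = c("IAH", "HOU", "IAH", "IAH"),
  dep_delay = c(5, 20, -3, 11)
))

delay_query &lt;- tbl(clause_con, "flights_demo") |&gt;
  filter(origin == "JFK") |&gt;                              # WHERE
  group_by(dest) |&gt;                                       # GROUP BY
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |&gt; # SELECT
  arrange(avg_delay)                                      # ORDER BY

# Print the generated SQL, keep a copy as text, and run the query
delay_query |&gt; show_query()
delay_sql &lt;- as.character(dbplyr::sql_render(delay_query))
delay_result &lt;- collect(delay_query)

DBI::dbDisconnect(clause_con, shutdown = TRUE)</pre>
</div>
<p>Whatever order you write the dplyr verbs in, the generated SQL lists its clauses in the fixed order <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>GROUP BY</code>, <code>ORDER BY</code>.</p>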
</section>
<section id="select" data-type="sect2">
<h2>
SELECT</h2>
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and, as you’ll learn in the next section, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, "type", manufacturer, model, "year"
#&gt; FROM planes
planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
rename(year_built = year) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, "type", manufacturer, model, "year" AS year_built
#&gt; FROM planes
planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
relocate(manufacturer, model, .before = type) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, manufacturer, model, "type", "year"
#&gt; FROM planes</pre>
</div>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the old name is on the left and the new name is on the right.</p>
<div data-type="note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p>
<p>When working with other databases you’re likely to see every variable name quoted, because only a handful of client packages, like duckdb, know what all the reserved words are; the rest quote everything to be safe.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre>
<p>Some other database systems use backticks instead of quotes:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre>
</div>
</div>
<p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
speed = distance / (air_time / 60)
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *, distance / (air_time / 60.0) AS speed
#&gt; FROM flights</pre>
</div>
<p>We’ll come back to the translation of individual components (like <code>/</code>) in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="from" data-type="sect2">
<h2>
FROM</h2>
<p>The <code>FROM</code> clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.</p>
</section>
<section id="group-by" data-type="sect2">
<h2>
GROUP BY</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> is translated to the <code>SELECT</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db |&gt;
group_by(cut) |&gt;
summarize(
n = n(),
avg_price = mean(price, na.rm = TRUE)
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price
#&gt; FROM diamonds
#&gt; GROUP BY cut</pre>
</div>
<p>We’ll come back to what’s happening with the translation of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="where" data-type="sect2">
<h2>
WHERE</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dest == "IAH" | dest == "HOU") |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest = 'IAH' OR dest = 'HOU')
flights |&gt;
filter(arr_delay &gt; 0 &amp; arr_delay &lt; 20) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (arr_delay &gt; 0.0 AND arr_delay &lt; 20.0)</pre>
</div>
<p>There are a few important details to note here:</p>
<ul><li>
<code>|</code> becomes <code>OR</code> and <code>&amp;</code> becomes <code>AND</code>.</li>
<li>SQL uses <code>=</code> for comparison, not <code>==</code>. SQL doesn’t have assignment, so there’s no potential for confusion there.</li>
<li>SQL uses only <code>''</code> for strings, not <code>""</code>. In SQL, <code>""</code> is used to identify variables, like R’s <code>``</code>.</li>
</ul><p>Another useful SQL operator is <code>IN</code>, which is very close to R’s <code>%in%</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dest %in% c("IAH", "HOU")) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest IN ('IAH', 'HOU'))</pre>
</div>
<p>SQL uses <code>NULL</code> instead of <code>NA</code>. <code>NULL</code>s behave similarly to <code>NA</code>s. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(delay = mean(arr_delay))
#&gt; Warning: Missing values are always removed in SQL aggregation functions.
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # Source: SQL [?? x 2]
#&gt; # Database: DuckDB 0.6.1 [root@Darwin 22.2.0:R 4.2.1/:memory:]
#&gt; dest delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ATL 11.3
#&gt; 2 ORD 5.88
#&gt; 3 RDU 10.1
#&gt; 4 IAD 13.9
#&gt; 5 DTW 5.43
#&gt; 6 LAX 0.547
#&gt; # … with more rows</pre>
</div>
<p>If you want to learn more about how NULLs work, you might enjoy “<a href="https://modern-sql.com/concept/three-valued-logic"><em>Three valued logic</em></a>” by Markus Winand.</p>
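<p>Both behaviors are easy to see on a tiny table. The sketch below uses a made-up three-row table, <code>delays_demo</code>, on a hypothetical connection <code>null_con</code>; one of the two ATL rows has a missing delay.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection; the NA is stored as NULL in the database
null_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(null_con, "delays_demo", data.frame(
  dest = c("ATL", "ATL", "ORD"),
  arr_delay = c(10, NA, 4)
))
delays_db &lt;- tbl(null_con, "delays_demo")

# Comparison: NULL compared with 5 is NULL rather than TRUE,
# so the row with the missing delay is not returned
late &lt;- delays_db |&gt;
  filter(arr_delay &gt; 5) |&gt;
  collect()

# Aggregation: AVG() skips the NULL, so ATL is averaged
# over its single known value
avg_delay &lt;- delays_db |&gt;
  group_by(dest) |&gt;
  summarize(delay = mean(arr_delay, na.rm = TRUE)) |&gt;
  collect()

DBI::dbDisconnect(null_con, shutdown = TRUE)</pre>
</div>
<p>Here <code>late</code> contains only the ATL row with a delay of 10, and <code>avg_delay</code> reports 10 for ATL and 4 for ORD.</p>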
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(!is.na(dep_delay)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (NOT((dep_delay IS NULL)))</pre>
</div>
<p>This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:</p>
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
<p>Note that if you <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL, created because <code>WHERE</code> is evaluated before <code>SELECT</code>, so it needs another clause that’s evaluated afterwards.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds_db |&gt;
group_by(cut) |&gt;
summarize(n = n()) |&gt;
filter(n &gt; 100) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT cut, COUNT(*) AS n
#&gt; FROM diamonds
#&gt; GROUP BY cut
#&gt; HAVING (COUNT(*) &gt; 100.0)</pre>
</div>
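<p>The difference between the two kinds of filter is easiest to see when a pipeline contains both. In the sketch below (using a hypothetical connection <code>having_con</code>), the first <code>filter()</code> acts on raw rows and the second on the summary. You would expect the first to become <code>WHERE</code> and the second <code>HAVING</code>, although whether dbplyr fits both into a single query may depend on its version.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection holding a copy of the diamonds data
having_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(having_con, "diamonds", ggplot2::diamonds)

common_cuts &lt;- tbl(having_con, "diamonds") |&gt;
  filter(price &gt; 1000) |&gt; # applies to individual rows
  group_by(cut) |&gt;
  summarize(n = n()) |&gt;
  filter(n &gt; 100)         # applies to the summarized groups

# Inspect the generated SQL, then run the query
common_cuts |&gt; show_query()
common_cuts_sql &lt;- as.character(dbplyr::sql_render(common_cuts))
common_cuts_df &lt;- collect(common_cuts)

DBI::dbDisconnect(having_con, shutdown = TRUE)</pre>
</div>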
</section>
<section id="order-by" data-type="sect2">
<h2>
ORDER BY</h2>
<p>Ordering rows involves a straightforward translation from <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
arrange(year, month, day, desc(dep_delay)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; ORDER BY "year", "month", "day", dep_delay DESC</pre>
</div>
<p>Notice how <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p>
</section>
<section id="subqueries" data-type="sect2">
<h2>
Subqueries</h2>
<p>Sometimes it’s not possible to translate a dplyr pipeline into a single <code>SELECT</code> statement and you need to use a subquery. A <strong>subquery</strong> is just a query used as a data source in the <code>FROM</code> clause, instead of the usual table.</p>
<p>dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the <code>SELECT</code> clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes <code>year1</code> and then the second (outer) query can compute <code>year2</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
year1 = year + 1,
year2 = year1 + 1
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *, year1 + 1.0 AS year2
#&gt; FROM (
#&gt; SELECT *, "year" + 1.0 AS year1
#&gt; FROM flights
#&gt; ) q01</pre>
</div>
<p>You’ll also see this if you attempt to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(year1 = year + 1) |&gt;
filter(year1 == 2014) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM (
#&gt; SELECT *, "year" + 1.0 AS year1
#&gt; FROM flights
#&gt; ) q01
#&gt; WHERE (year1 = 2014.0)</pre>
</div>
<p>Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.</p>
</section>
<section id="databases-joins" data-type="sect2">
<h2>
Joins</h2>
<p>If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
left_join(planes |&gt; rename(year_built = year), by = "tailnum") |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; flights.*,
#&gt; planes."year" AS year_built,
#&gt; "type",
#&gt; manufacturer,
#&gt; model,
#&gt; engines,
#&gt; seats,
#&gt; speed,
#&gt; engine
#&gt; FROM flights
#&gt; LEFT JOIN planes
#&gt; ON (flights.tailnum = planes.tailnum)</pre>
</div>
<p>The main thing to notice here is the syntax: SQL joins use sub-clauses of the <code>FROM</code> clause to bring in additional tables, using <code>ON</code> to define how the tables are related.</p>
<p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum)
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)</pre>
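<p>You can check guesses like these with <code>show_query()</code>, and see how the join types differ by counting the rows each one returns. The sketch below uses two made-up three-row tables on a hypothetical connection <code>join_con</code>: planes N1 and N2 appear in both tables, N3 appears only in the flights table, and N4 only in the planes table.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection and two tiny made-up tables
join_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(join_con, "flights_demo", data.frame(
  tailnum = c("N1", "N2", "N3"),
  dest = c("IAH", "HOU", "ATL")
))
DBI::dbWriteTable(join_con, "planes_demo", data.frame(
  tailnum = c("N1", "N2", "N4"),
  seats = c(150, 180, 90)
))
flights_demo &lt;- tbl(join_con, "flights_demo")
planes_demo &lt;- tbl(join_con, "planes_demo")

# Inspect the SQL for one join type
inner_join(flights_demo, planes_demo, by = "tailnum") |&gt;
  show_query()

# Count the rows returned by each join type
join_rows &lt;- c(
  left = nrow(collect(left_join(flights_demo, planes_demo, by = "tailnum"))),
  inner = nrow(collect(inner_join(flights_demo, planes_demo, by = "tailnum"))),
  right = nrow(collect(right_join(flights_demo, planes_demo, by = "tailnum"))),
  full = nrow(collect(full_join(flights_demo, planes_demo, by = "tailnum")))
)

DBI::dbDisconnect(join_con, shutdown = TRUE)
join_rows</pre>
</div>
<p>The left and right joins each return three rows, the inner join returns the two matching rows, and the full join returns four.</p>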
<p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, and to build a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="https://cynkra.github.io/dm/">dm package</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p>
</section>
<section id="other-verbs" data-type="sect2">
<h2>
Other verbs</h2>
<p>dbplyr also translates other verbs like <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code>slice_*()</code>, and <code><a href="https://generics.r-lib.org/reference/setops.html">intersect()</a></code>, and a growing selection of tidyr functions like <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
</section>
<section id="databases-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> translated to? How about <code><a href="https://rdrr.io/r/utils/head.html">head()</a></code>?</p></li>
<li>
<p>Explain what each of the following SQL queries does and try to recreate them using dbplyr.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT *
FROM flights
WHERE dep_delay &lt; arr_delay
SELECT *, distance / (air_time / 60) AS speed
FROM flights</pre>
</li>
</ol></section>
</section>
<section id="sec-sql-expressions" data-type="sect1">
<h1>
Function translations</h1>
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summarize_query &lt;- function(df, ...) {
  df |&gt;
    summarize(...) |&gt;
    show_query()
}
mutate_query &lt;- function(df, ...) {
  df |&gt;
    mutate(..., .keep = "none") |&gt;
    show_query()
}</pre>
</div>
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>, have a relatively simple translation while others, like <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
summarize_query(
mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE)
)
#&gt; `summarise()` has grouped output by "year" and "month". You can override
#&gt; using the `.groups` argument.
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",
#&gt; "month",
#&gt; "day",
#&gt; AVG(arr_delay) AS mean,
#&gt; PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median
#&gt; FROM flights
#&gt; GROUP BY "year", "month", "day"</pre>
</div>
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
mutate_query(
mean = mean(arr_delay, na.rm = TRUE),
)
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",
#&gt; "month",
#&gt; "day",
#&gt; AVG(arr_delay) OVER (PARTITION BY "year", "month", "day") AS mean
#&gt; FROM flights</pre>
</div>
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p>
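<p>The practical consequence is that a grouped <code>summarize()</code> collapses each group to one row, while a grouped <code>mutate()</code> keeps every row and attaches the group-level value to it. A minimal sketch, using a made-up three-row table on a hypothetical connection <code>window_con</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection and a tiny made-up table
window_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(window_con, "arrivals_demo", data.frame(
  dest = c("IAH", "IAH", "HOU"),
  arr_delay = c(10, 20, 4)
))
arrivals_db &lt;- tbl(window_con, "arrivals_demo")

# summarize(): one row per group (GROUP BY)
per_dest &lt;- arrivals_db |&gt;
  group_by(dest) |&gt;
  summarize(mean_delay = mean(arr_delay, na.rm = TRUE)) |&gt;
  collect()

# mutate(): every row kept, group mean attached (OVER with PARTITION BY)
per_flight &lt;- arrivals_db |&gt;
  group_by(dest) |&gt;
  mutate(mean_delay = mean(arr_delay, na.rm = TRUE)) |&gt;
  collect()

DBI::dbDisconnect(window_con, shutdown = TRUE)
c(summarize = nrow(per_dest), mutate = nrow(per_flight))</pre>
</div>
<p>The summary has two rows, one per destination, while the mutated table still has all three rows, with both IAH rows carrying a mean delay of 15.</p>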
<p>Window functions include all functions that look forward or backwards, like <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
arrange(time_hour) |&gt;
mutate_query(
lead = lead(arr_delay),
lag = lag(arr_delay)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; dest,
#&gt; LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,
#&gt; LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
#&gt; FROM flights
#&gt; ORDER BY time_hour</pre>
</div>
<p>Here it’s important to <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
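<p>To make the role of the ordering concrete, the sketch below stores rows in a deliberately shuffled order and still recovers the correct previous delay for each destination. The table <code>hourly_demo</code> and the connection <code>lag_con</code> are made up for illustration, and the final sort happens in R only to make the result easy to read.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical demo connection; rows are inserted out of order on purpose
lag_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(lag_con, "hourly_demo", data.frame(
  dest = c("IAH", "HOU", "IAH", "HOU", "IAH"),
  hour = c(2, 1, 1, 2, 3),
  arr_delay = c(7, 1, 5, 3, 9)
))

prev_delays &lt;- tbl(lag_con, "hourly_demo") |&gt;
  group_by(dest) |&gt;
  arrange(hour) |&gt;                        # defines the window ordering
  mutate(prev_delay = lag(arr_delay)) |&gt;
  collect() |&gt;
  ungroup() |&gt;
  arrange(dest, hour)                     # sort in R for display only

DBI::dbDisconnect(lag_con, shutdown = TRUE)
prev_delays</pre>
</div>
<p>Within each destination, the first hour has a missing <code>prev_delay</code> and every later hour carries the delay from the hour before it, regardless of how the rows were stored.</p>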
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate_query(
description = if_else(arr_delay &gt; 0, "delayed", "on-time")
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE WHEN (arr_delay &gt; 0.0) THEN 'delayed' WHEN NOT (arr_delay &gt; 0.0) THEN 'on-time' END AS description
#&gt; FROM flights
flights |&gt;
mutate_query(
description =
case_when(
arr_delay &lt; -5 ~ "early",
arr_delay &lt; 5 ~ "on-time",
arr_delay &gt;= 5 ~ "late"
)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE
#&gt; WHEN (arr_delay &lt; -5.0) THEN 'early'
#&gt; WHEN (arr_delay &lt; 5.0) THEN 'on-time'
#&gt; WHEN (arr_delay &gt;= 5.0) THEN 'late'
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate_query(
description = cut(
arr_delay,
breaks = c(-Inf, -5, 5, Inf),
labels = c("early", "on-time", "late")
)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE
#&gt; WHEN (arr_delay &lt;= -5.0) THEN 'early'
#&gt; WHEN (arr_delay &lt;= 5.0) THEN 'on-time'
#&gt; WHEN (arr_delay &gt; 5.0) THEN 'late'
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
</section>
<section id="databases-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s <em>the</em> most commonly used language for working with data and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<ul><li>
<a href="https://sqlfordatascientists.com"><em>SQL for Data Scientists</em></a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
<li>
<a href="https://www.practicalsql.com"><em>Practical SQL</em></a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
</ul><p>In the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.</p>
</section>
</section>

View File

@ -1,764 +0,0 @@
<section data-type="chapter" id="chp-datetimes">
<h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1>
<section id="datetimes-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!</p>
<p>To warm up, think about how many days there are in a year, and how many hours there are in a day. You probably remembered that most years have 365 days, but leap years have 366. Do you know the full rule for determining if a year is a leap year<span data-type="footnote">A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.</span>? The number of hours in a day is a little less obvious: most days have 24 hours, but in places that use daylight saving time (DST), one day each year has 23 hours and another has 25.</p>
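<p>The leap-year rule in the footnote is easy to check in code. The sketch below uses <code>lubridate::leap_year()</code>, which accepts plain year numbers as well as dates; the particular years are arbitrary examples, and the range 1601 to 2000 is just one convenient 400-year cycle:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># 1900 is divisible by 100 but not 400; 2000 is divisible by 400
years &lt;- c(1900, 2000, 2023, 2024)
is_leap &lt;- lubridate::leap_year(years)
is_leap
#&gt; [1] FALSE  TRUE FALSE  TRUE

# Count the leap years in one full 400-year cycle
leap_count &lt;- sum(lubridate::leap_year(1601:2000))
leap_count
#&gt; [1] 97</pre>
</div>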
<p>Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.</p>
<p>We’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.</p>
<section id="datetimes-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter will focus on the <strong>lubridate</strong> package, which makes it easier to work with dates and times in R. As of the latest tidyverse release, lubridate is part of core tidyverse. We will also need nycflights13 for practice data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="sec-creating-datetimes" data-type="sect1">
<h1>
Creating date/times</h1>
<p>There are three types of date/time data that refer to an instant in time:</p>
<ul><li><p>A <strong>date</strong>. Tibbles print this as <code>&lt;date&gt;</code>.</p></li>
<li><p>A <strong>time</strong> within a day. Tibbles print this as <code>&lt;time&gt;</code>.</p></li>
<li><p>A <strong>date-time</strong> is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <code>&lt;dttm&gt;</code>. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.</p></li>
</ul><p>In this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the <strong>hms</strong> package.</p>
<p>You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.</p>
<p>To get the current date or date-time you can use <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> or <code><a href="https://lubridate.tidyverse.org/reference/now.html">now()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">today()
#&gt; [1] "2023-01-26"
now()
#&gt; [1] "2023-01-26 10:32:54 CST"</pre>
</div>
<p>Otherwise, the following sections describe the four ways you’re likely to create a date/time:</p>
<ul><li>While reading a file with readr.</li>
<li>From a string.</li>
<li>From individual date-time components.</li>
<li>From an existing date/time object.</li>
</ul>
<section id="during-import" data-type="sect2">
<h2>
During import</h2>
<p>If your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
date,datetime
2022-01-02,2022-01-02 05:12
"
read_csv(csv)
#&gt; # A tibble: 1 × 2
#&gt; date datetime
#&gt; &lt;date&gt; &lt;dttm&gt;
#&gt; 1 2022-01-02 2022-01-02 05:12:00</pre>
</div>
<p>If you haven’t heard of <strong>ISO8601</strong> before, it’s an international standard<span data-type="footnote"><a href="https://xkcd.com/1179/" class="uri">https://xkcd.com/1179/</a></span> for writing dates where the components of a date are organised from biggest to smallest separated by <code>-</code>. For example, in ISO8601 May 3 2022 is <code>2022-05-03</code>. ISO8601 dates can also include times, where hour, minute, and second are separated by <code>:</code>, and the date and time components are separated by either a <code>T</code> or a space. For example, you could write 4:26pm on May 3 2022 as either <code>2022-05-03 16:26</code> or <code>2022-05-03T16:26</code>.</p>
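<p>As a quick check of the two separators, the sketch below parses both spellings with readr’s <code>parse_datetime()</code>, whose default format expects ISO8601, and confirms that they describe the same instant. The variable names are ours, chosen for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Same date-time, written with a space and with a "T" separator
with_space &lt;- readr::parse_datetime("2022-05-03 16:26")
with_t &lt;- readr::parse_datetime("2022-05-03T16:26")
with_space
#&gt; [1] "2022-05-03 16:26:00 UTC"
with_space == with_t
#&gt; [1] TRUE</pre>
</div>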
<p>For other date-time formats, you’ll need to use <code>col_types</code> plus <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a <code>%</code> followed by a single character. For example, <code>%Y-%m-%d</code> specifies a date that’s a year, <code>-</code>, month (as number), <code>-</code>, day. Table <a href="#tbl-date-formats" data-type="xref">#tbl-date-formats</a> lists all the options.</p>
<div id="tbl-date-formats" class="anchored">
<table class="table"><caption>Table 19.1: All date formats understood by readr</caption>
<thead><tr class="header"><th>Type</th>
<th>Code</th>
<th>Meaning</th>
<th>Example</th>
</tr></thead><tbody><tr class="odd"><td>Year</td>
<td><code>%Y</code></td>
<td>4 digit year</td>
<td>2021</td>
</tr><tr class="even"><td/>
<td><code>%y</code></td>
<td>2 digit year</td>
<td>21</td>
</tr><tr class="odd"><td>Month</td>
<td><code>%m</code></td>
<td>Number</td>
<td>2</td>
</tr><tr class="even"><td/>
<td><code>%b</code></td>
<td>Abbreviated name</td>
<td>Feb</td>
</tr><tr class="odd"><td/>
<td><code>%B</code></td>
<td>Full name</td>
<td>February</td>
</tr><tr class="even"><td>Day</td>
<td><code>%d</code></td>
<td>Two digits</td>
<td>02</td>
</tr><tr class="odd"><td/>
<td><code>%e</code></td>
<td>One or two digits</td>
<td>2</td>
</tr><tr class="even"><td>Time</td>
<td><code>%H</code></td>
<td>24-hour hour</td>
<td>13</td>
</tr><tr class="odd"><td/>
<td><code>%I</code></td>
<td>12-hour hour</td>
<td>1</td>
</tr><tr class="even"><td/>
<td><code>%p</code></td>
<td>AM/PM</td>
<td>pm</td>
</tr><tr class="odd"><td/>
<td><code>%M</code></td>
<td>Minutes</td>
<td>35</td>
</tr><tr class="even"><td/>
<td><code>%S</code></td>
<td>Seconds</td>
<td>45</td>
</tr><tr class="odd"><td/>
<td><code>%OS</code></td>
<td>Seconds with decimal component</td>
<td>45.35</td>
</tr><tr class="even"><td/>
<td><code>%Z</code></td>
<td>Time zone name</td>
<td>America/Chicago</td>
</tr><tr class="odd"><td/>
<td><code>%z</code></td>
<td>Offset from UTC</td>
<td>+0800</td>
</tr><tr class="even"><td>Other</td>
<td><code>%.</code></td>
<td>Skip one non-digit</td>
<td>:</td>
</tr><tr class="odd"><td/>
<td><code>%*</code></td>
<td>Skip any number of non-digits</td>
<td/>
</tr></tbody></table></div>
<p>This code shows a few options applied to a very ambiguous date:</p>
<div class="cell" data-messages="false">
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
date
01/02/15
"
read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
#&gt; # A tibble: 1 × 1
#&gt; date
#&gt; &lt;date&gt;
#&gt; 1 2015-01-02
read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
#&gt; # A tibble: 1 × 1
#&gt; date
#&gt; &lt;date&gt;
#&gt; 1 2015-02-01
read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
#&gt; # A tibble: 1 × 1
#&gt; date
#&gt; &lt;date&gt;
#&gt; 1 2001-02-15</pre>
</div>
<p>Note that no matter how you specify the date format, it’s always displayed the same way once you get it into R.</p>
<p>If you’re using <code>%b</code> or <code>%B</code> and working with non-English dates, you’ll also need to provide a <code><a href="https://readr.tidyverse.org/reference/locale.html">locale()</a></code>. See the list of built-in languages in <code><a href="https://readr.tidyverse.org/reference/date_names.html">date_names_langs()</a></code>, or create your own with <code><a href="https://readr.tidyverse.org/reference/date_names.html">date_names()</a></code>.</p>
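<p>For instance, a French month name can be parsed by passing a French locale. This sketch uses readr’s <code>parse_date()</code>, the vector counterpart of <code>col_date()</code>; the input string is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># %B matches the full month name in the language of the supplied locale
french_date &lt;- readr::parse_date(
  "1 janvier 2015",
  "%d %B %Y",
  locale = readr::locale("fr")
)
french_date
#&gt; [1] "2015-01-01"</pre>
</div>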
</section>
<section id="from-strings" data-type="sect2">
<h2>
From strings</h2>
<p>The date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers, which attempt to automatically determine the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ymd("2017-01-31")
#&gt; [1] "2017-01-31"
mdy("January 31st, 2017")
#&gt; [1] "2017-01-31"
dmy("31-Jan-2017")
#&gt; [1] "2017-01-31"</pre>
</div>
<p><code><a href="https://lubridate.tidyverse.org/reference/ymd.html">ymd()</a></code> and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ymd_hms("2017-01-31 20:11:59")
#&gt; [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
#&gt; [1] "2017-01-31 08:01:00 UTC"</pre>
</div>
<p>You can also force the creation of a date-time from a date by supplying a time zone:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ymd("2017-01-31", tz = "UTC")
#&gt; [1] "2017-01-31 UTC"</pre>
</div>
</section>
<section id="from-individual-components" data-type="sect2">
<h2>
From individual components</h2>
<p>Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the <code>flights</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
select(year, month, day, hour, minute)
#&gt; # A tibble: 336,776 × 5
#&gt; year month day hour minute
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 5 15
#&gt; 2 2013 1 1 5 29
#&gt; 3 2013 1 1 5 40
#&gt; 4 2013 1 1 5 45
#&gt; 5 2013 1 1 6 0
#&gt; 6 2013 1 1 5 58
#&gt; # … with 336,770 more rows</pre>
</div>
<p>To create a date/time from this sort of input, use <code><a href="https://lubridate.tidyverse.org/reference/make_datetime.html">make_date()</a></code> for dates, or <code><a href="https://lubridate.tidyverse.org/reference/make_datetime.html">make_datetime()</a></code> for date-times:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
select(year, month, day, hour, minute) |&gt;
mutate(departure = make_datetime(year, month, day, hour, minute))
#&gt; # A tibble: 336,776 × 6
#&gt; year month day hour minute departure
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dttm&gt;
#&gt; 1 2013 1 1 5 15 2013-01-01 05:15:00
#&gt; 2 2013 1 1 5 29 2013-01-01 05:29:00
#&gt; 3 2013 1 1 5 40 2013-01-01 05:40:00
#&gt; 4 2013 1 1 5 45 2013-01-01 05:45:00
#&gt; 5 2013 1 1 6 0 2013-01-01 06:00:00
#&gt; 6 2013 1 1 5 58 2013-01-01 05:58:00
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Let’s do the same thing for each of the four time columns in <code>flights</code>. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">make_datetime_100 &lt;- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt &lt;- flights |&gt;
filter(!is.na(dep_time), !is.na(arr_time)) |&gt;
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) |&gt;
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
#&gt; # A tibble: 328,063 × 9
#&gt; origin dest dep_delay arr_delay dep_time sched_dep_time
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
#&gt; 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
#&gt; 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
#&gt; 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
#&gt; 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
#&gt; 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
#&gt; # … with 328,057 more rows, and 3 more variables: arr_time &lt;dttm&gt;,
#&gt; # sched_arr_time &lt;dttm&gt;, air_time &lt;dbl&gt;</pre>
</div>
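<p>The arithmetic inside <code>make_datetime_100()</code> relies on the times being stored as numbers such as 517 for 5:17am. Integer division by 100 yields the hour and the remainder yields the minute, as this small sketch with made-up values shows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical clock times in the same HHMM layout as flights$dep_time
times &lt;- c(517, 1359, 5)
hours &lt;- times %/% 100   # integer division keeps the leading digits
minutes &lt;- times %% 100  # the remainder keeps the last two digits
hours
#&gt; [1]  5 13  0
minutes
#&gt; [1] 17 59  5</pre>
</div>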
<p>With this data, we can visualize the distribution of departure times across the year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
ggplot(aes(x = dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-12-1.png" alt="A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December." width="576"/></p>
</div>
</div>
<p>Or within a single day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
filter(dep_time &lt; ymd(20130102)) |&gt;
ggplot(aes(x = dep_time)) +
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-13-1.png" alt="A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm." width="576"/></p>
</div>
</div>
<p>Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.</p>
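<p>You can see these units directly by converting to a number. A minimal sketch, using one day after 1970-01-01 (the origin R counts from) so the values are easy to read:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

# A date-time is stored as seconds since the origin
dttm_value &lt;- as.numeric(ymd_hms("1970-01-02 00:00:00"))
dttm_value
#&gt; [1] 86400

# A date is stored as days since the origin
date_value &lt;- as.numeric(ymd("1970-01-02"))
date_value
#&gt; [1] 1</pre>
</div>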
</section>
<section id="from-other-types" data-type="sect2">
<h2>
From other types</h2>
<p>You may want to switch between a date-time and a date. That’s the job of <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">as_datetime(today())
#&gt; [1] "2023-01-26 UTC"
as_date(now())
#&gt; [1] "2023-01-26"</pre>
</div>
<p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code>; if it’s in days, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">as_datetime(60 * 60 * 10)
#&gt; [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
#&gt; [1] "1980-01-01"</pre>
</div>
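<p>Some systems record offsets in milliseconds rather than seconds. That case is not covered above, so treat this as a sketch under that assumption: divide by 1000 before calling <code>as_datetime()</code>. The value here is made up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

# Hypothetical offset: exactly one day, expressed in milliseconds
millis &lt;- 86400 * 1000
from_millis &lt;- as_datetime(millis / 1000)
from_millis
#&gt; [1] "1970-01-02 UTC"</pre>
</div>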
</section>
<section id="datetimes-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>What happens if you parse a string that contains invalid dates?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ymd(c("2010-10-10", "bananas"))</pre>
</div>
</li>
<li><p>What does the <code>tzone</code> argument to <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> do? Why is it important?</p></li>
<li>
<p>For each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">d1 &lt;- "January 1, 2010"
d2 &lt;- "2015-Mar-07"
d3 &lt;- "06-Jun-2017"
d4 &lt;- c("August 19 (2015)", "July 1 (2015)")
d5 &lt;- "12/30/14" # Dec 30, 2014
t1 &lt;- "1705"
t2 &lt;- "11:15:10.12 PM"</pre>
</div>
</li>
</ol></section>
</section>
<section id="date-time-components" data-type="sect1">
<h1>
Date-time components</h1>
<p>Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.</p>
<section id="getting-components" data-type="sect2">
<h2>
Getting components</h2>
<p>You can pull out individual parts of the date with the accessor functions <code><a href="https://lubridate.tidyverse.org/reference/year.html">year()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/month.html">month()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/day.html">mday()</a></code> (day of the month), <code><a href="https://lubridate.tidyverse.org/reference/day.html">yday()</a></code> (day of the year), <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> (day of the week), <code><a href="https://lubridate.tidyverse.org/reference/hour.html">hour()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/minute.html">minute()</a></code>, and <code><a href="https://lubridate.tidyverse.org/reference/second.html">second()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">datetime &lt;- ymd_hms("2026-07-08 12:34:56")
year(datetime)
#&gt; [1] 2026
month(datetime)
#&gt; [1] 7
mday(datetime)
#&gt; [1] 8
yday(datetime)
#&gt; [1] 189
wday(datetime)
#&gt; [1] 4</pre>
</div>
<p>For <code><a href="https://lubridate.tidyverse.org/reference/month.html">month()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> you can set <code>label = TRUE</code> to return the abbreviated name of the month or day of the week. Set <code>abbr = FALSE</code> to return the full name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">month(datetime, label = TRUE)
#&gt; [1] Jul
#&gt; 12 Levels: Jan &lt; Feb &lt; Mar &lt; Apr &lt; May &lt; Jun &lt; Jul &lt; Aug &lt; Sep &lt; ... &lt; Dec
wday(datetime, label = TRUE, abbr = FALSE)
#&gt; [1] Wednesday
#&gt; 7 Levels: Sunday &lt; Monday &lt; Tuesday &lt; Wednesday &lt; Thursday &lt; ... &lt; Saturday</pre>
</div>
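<p>By default <code>wday()</code> numbers the days from Sunday, which is why a Wednesday comes out as 4. If you prefer weeks that start on Monday, the <code>week_start</code> argument changes the numbering. A small sketch, reusing the same date-time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

datetime &lt;- ymd_hms("2026-07-08 12:34:56")

# Default numbering: Sunday = 1, so Wednesday = 4
sunday_first &lt;- wday(datetime)
sunday_first
#&gt; [1] 4

# Monday = 1, so Wednesday = 3
monday_first &lt;- wday(datetime, week_start = 1)
monday_first
#&gt; [1] 3</pre>
</div>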
<p>We can use <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> to see that more flights depart during the week than on the weekend:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
mutate(wday = wday(dep_time, label = TRUE)) |&gt;
ggplot(aes(x = wday)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-20-1.png" alt="A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000)." width="576"/></p>
</div>
</div>
<p>There’s an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
mutate(minute = minute(dep_time)) |&gt;
group_by(minute) |&gt;
summarize(
avg_delay = mean(dep_delay, na.rm = TRUE),
n = n()
) |&gt;
ggplot(aes(x = minute, y = avg_delay)) +
geom_line()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-21-1.png" alt="A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour and 9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9)." width="576"/></p>
</div>
</div>
<p>Interestingly, if we look at the <em>scheduled</em> departure time we don’t see such a strong pattern:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sched_dep &lt;- flights_dt |&gt;
mutate(minute = minute(sched_dep_time)) |&gt;
group_by(minute) |&gt;
summarize(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(sched_dep, aes(x = minute, y = avg_delay)) +
geom_line()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-22-1.png" alt="A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16). There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour." width="576"/></p>
</div>
</div>
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times, as <a href="#fig-human-rounding" data-type="xref">#fig-human-rounding</a> shows. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-human-rounding"><p><img src="datetimes_files/figure-html/fig-human-rounding-1.png" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes. " width="576"/></p>
<figcaption>A frequency polygon showing the number of flights scheduled to depart at each minute of the hour. You can see a strong preference for round numbers like 0 and 30 and generally for numbers that are a multiple of five.</figcaption>
</figure>
</div>
</div>
</section>
<section id="rounding" data-type="sect2">
<h2>
Rounding</h2>
<p>An alternative approach to plotting individual components is to round the date to a nearby unit of time, with <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">floor_date()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">round_date()</a></code>, and <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">ceiling_date()</a></code>. Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to. This, for example, allows us to plot the number of flights per week:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
count(week = floor_date(dep_time, "week")) |&gt;
ggplot(aes(x = week, y = n)) +
geom_line() +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-24-1.png" alt="A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights)." width="576"/></p>
</div>
</div>
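<p>To see what each rounding function does to a single value, here is a minimal sketch applied to one date-time (the same one used earlier in this section), rounding to the hour and to the month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

datetime &lt;- ymd_hms("2026-07-08 12:34:56")

# Round down, to the nearest, and up to a whole hour
floor_date(datetime, "hour")
#&gt; [1] "2026-07-08 12:00:00 UTC"
round_date(datetime, "hour")
#&gt; [1] "2026-07-08 13:00:00 UTC"
ceiling_date(datetime, "hour")
#&gt; [1] "2026-07-08 13:00:00 UTC"

# Larger units work the same way
floor_date(datetime, "month")
#&gt; [1] "2026-07-01 UTC"</pre>
</div>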
<p>You can use rounding to show the distribution of flights across the course of a day by computing the difference between <code>dep_time</code> and the earliest instant of that day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |&gt;
ggplot(aes(x = dep_hour)) +
geom_freqpoly(binwidth = 60 * 30)
#&gt; Don't know how to automatically pick scale for object of type &lt;difftime&gt;.
#&gt; Defaulting to continuous.</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-25-1.png" alt="A line plot with departure time on the x-axis. This is units of seconds since midnight so it's hard to interpret." width="576"/></p>
</div>
</div>
<p>Computing the difference between a pair of date-times yields a difftime (more on that in <a href="#sec-intervals" data-type="xref">#sec-intervals</a>). We can convert that to an <code>hms</code> object to get a more useful x-axis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |&gt;
ggplot(aes(x = dep_hour)) +
geom_freqpoly(binwidth = 60 * 30)</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-26-1.png" alt="A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (&lt;100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again." width="576"/></p>
</div>
</div>
</section>
<section id="modifying-components" data-type="sect2">
<h2>
Modifying components</h2>
<p>You can also use each accessor function to modify the components of a date/time. This doesn’t come up much in data analysis, but can be useful when cleaning data that has clearly incorrect dates.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(datetime &lt;- ymd_hms("2026-07-08 12:34:56"))
#&gt; [1] "2026-07-08 12:34:56 UTC"
year(datetime) &lt;- 2030
datetime
#&gt; [1] "2030-07-08 12:34:56 UTC"
month(datetime) &lt;- 01
datetime
#&gt; [1] "2030-01-08 12:34:56 UTC"
hour(datetime) &lt;- hour(datetime) + 1
datetime
#&gt; [1] "2030-01-08 13:34:56 UTC"</pre>
</div>
<p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
#&gt; [1] "2030-02-02 02:34:56 UTC"</pre>
</div>
<p>If values are too big, they will roll over:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">update(ymd("2023-02-01"), mday = 30)
#&gt; [1] "2023-03-02"
update(ymd("2023-02-01"), hour = 400)
#&gt; [1] "2023-02-17 16:00:00 UTC"</pre>
</div>
</section>
<section id="datetimes-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How does the distribution of flight times within a day change over the course of the year?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code> and <code>dep_delay</code>. Are they consistent? Explain your findings.</p></li>
<li><p>Compare <code>air_time</code> with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)</p></li>
<li><p>How does the average delay time change over the course of a day? Should you use <code>dep_time</code> or <code>sched_dep_time</code>? Why?</p></li>
<li><p>On what day of the week should you leave if you want to minimise the chance of a delay?</p></li>
<li><p>What makes the distribution of <code>diamonds$carat</code> and <code>flights$sched_dep_time</code> similar?</p></li>
<li><p>Confirm our hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.</p></li>
</ol></section>
</section>
<section id="time-spans" data-type="sect1">
<h1>
Time spans</h1>
<p>Next you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:</p>
<ul><li>
<strong>Durations</strong>, which represent an exact number of seconds.</li>
<li>
<strong>Periods</strong>, which represent human units like weeks and months.</li>
<li>
<strong>Intervals</strong>, which represent a starting and ending point.</li>
</ul><p>How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.</p>
<section id="durations" data-type="sect2">
<h2>
Durations</h2>
<p>In R, when you subtract two dates, you get a difftime object:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># How old is Hadley?
h_age &lt;- today() - ymd("1979-10-14")
h_age
#&gt; Time difference of 15810 days</pre>
</div>
<p>A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the <strong>duration</strong>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">as.duration(h_age)
#&gt; [1] "1365984000s (~43.29 years)"</pre>
</div>
<p>Durations come with a bunch of convenient constructors:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">dseconds(15)
#&gt; [1] "15s"
dminutes(10)
#&gt; [1] "600s (~10 minutes)"
dhours(c(12, 24))
#&gt; [1] "43200s (~12 hours)" "86400s (~1 days)"
ddays(0:5)
#&gt; [1] "0s" "86400s (~1 days)" "172800s (~2 days)"
#&gt; [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
dweeks(3)
#&gt; [1] "1814400s (~3 weeks)"
dyears(1)
#&gt; [1] "31557600s (~1 years)"</pre>
</div>
<p>Durations always record the time span in seconds. The smaller units convert to seconds in the obvious way: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.</p>
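<p>The conversion rules can be checked directly. This is only an illustrative sketch: it uses <code>as.numeric()</code> to pull out the number of seconds stored in each duration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

# Each constructor is a fixed multiple of seconds
as.numeric(dminutes(1))
#&gt; [1] 60
as.numeric(dhours(1))   # 60 * 60
#&gt; [1] 3600
as.numeric(ddays(1))    # 24 * 3600
#&gt; [1] 86400
as.numeric(dweeks(1))   # 7 * 86400
#&gt; [1] 604800

# A duration year is the "average" year of 365.25 days
dyears(1) == ddays(365.25)
#&gt; [1] TRUE</pre>
</div>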
<p>You can add and multiply durations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">2 * dyears(1)
#&gt; [1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
#&gt; [1] "38869200s (~1.23 years)"</pre>
</div>
<p>You can add and subtract durations to and from days:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tomorrow &lt;- today() + ddays(1)
last_year &lt;- today() - dyears(1)</pre>
</div>
<p>However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">one_am &lt;- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_am
#&gt; [1] "2026-03-08 01:00:00 EST"
one_am + ddays(1)
#&gt; [1] "2026-03-09 02:00:00 EDT"</pre>
</div>
<p>Why is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.</p>
</section>
<section id="periods" data-type="sect2">
<h2>
Periods</h2>
<p>To solve this problem, lubridate provides <strong>periods</strong>. Periods are time spans but don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">one_am
#&gt; [1] "2026-03-08 01:00:00 EST"
one_am + days(1)
#&gt; [1] "2026-03-09 01:00:00 EDT"</pre>
</div>
<p>Like durations, periods can be created with a number of friendly constructor functions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">hours(c(12, 24))
#&gt; [1] "12H 0M 0S" "24H 0M 0S"
days(7)
#&gt; [1] "7d 0H 0M 0S"
months(1:6)
#&gt; [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
#&gt; [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"</pre>
</div>
<p>You can add and multiply periods:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">10 * (months(6) + days(1))
#&gt; [1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
#&gt; [1] "50d 25H 2M 0S"</pre>
</div>
<p>And of course, add them to dates. Compared to durations, periods are more likely to do what you expect:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A leap year
ymd("2024-01-01") + dyears(1)
#&gt; [1] "2024-12-31 06:00:00 UTC"
ymd("2024-01-01") + years(1)
#&gt; [1] "2025-01-01"
# Daylight saving time
one_am + ddays(1)
#&gt; [1] "2026-03-09 02:00:00 EDT"
one_am + days(1)
#&gt; [1] "2026-03-09 01:00:00 EDT"</pre>
</div>
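<p>One way to see the difference is to measure the physical time that actually elapses in each case. The sketch below recreates <code>one_am</code> so that it runs on its own, and uses base R’s <code>difftime()</code> to report the gap in hours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

one_am &lt;- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")

# A duration of one day is always 24 hours of physical time
difftime(one_am + ddays(1), one_am, units = "hours")
#&gt; Time difference of 24 hours

# A period of one day keeps the clock time, so only 23 hours elapse
# across the start of DST
difftime(one_am + days(1), one_am, units = "hours")
#&gt; Time difference of 23 hours</pre>
</div>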
<p>Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination <em>before</em> they departed from New York City.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
  filter(arr_time &lt; dep_time)
#&gt; # A tibble: 10,640 × 9
#&gt;   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
#&gt;   &lt;chr&gt;  &lt;chr&gt;     &lt;dbl&gt;     &lt;dbl&gt; &lt;dttm&gt;              &lt;dttm&gt;             
#&gt; 1 EWR    BQN           9        -4 2013-01-01 19:29:00 2013-01-01 19:20:00
#&gt; 2 JFK    DFW          59        NA 2013-01-01 19:39:00 2013-01-01 18:40:00
#&gt; 3 EWR    TPA          -2         9 2013-01-01 20:58:00 2013-01-01 21:00:00
#&gt; 4 EWR    SJU          -6       -12 2013-01-01 21:02:00 2013-01-01 21:08:00
#&gt; 5 EWR    SFO          11       -14 2013-01-01 21:08:00 2013-01-01 20:57:00
#&gt; 6 LGA    FLL         -10        -2 2013-01-01 21:20:00 2013-01-01 21:30:00
#&gt; # … with 10,634 more rows, and 3 more variables: arr_time &lt;dttm&gt;,
#&gt; #   sched_arr_time &lt;dttm&gt;, air_time &lt;dbl&gt;</pre>
</div>
<p>These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding <code>days(1)</code> to the arrival time of each overnight flight.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt &lt;- flights_dt |&gt;
  mutate(
    overnight = arr_time &lt; dep_time,
    arr_time = arr_time + days(if_else(overnight, 1, 0)),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )</pre>
</div>
<p>Now all of our flights obey the laws of physics.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_dt |&gt;
  filter(overnight, arr_time &lt; dep_time)
#&gt; # A tibble: 0 × 10
#&gt; # … with 10 variables: origin &lt;chr&gt;, dest &lt;chr&gt;, dep_delay &lt;dbl&gt;,
#&gt; #   arr_delay &lt;dbl&gt;, dep_time &lt;dttm&gt;, sched_dep_time &lt;dttm&gt;,
#&gt; #   arr_time &lt;dttm&gt;, sched_arr_time &lt;dttm&gt;, air_time &lt;dbl&gt;, overnight &lt;lgl&gt;</pre>
</div>
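<p>The same idea can be tried on a tiny made-up table before applying it to the full data. This is a sketch under stated assumptions: <code>toy</code> and <code>fixed</code> are hypothetical names, and the two flights are invented, one same-day and one overnight.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(lubridate)

# Two invented flights: the second lands after midnight
toy &lt;- tibble(
  dep_time = ymd_hm(c("2013-01-01 09:00", "2013-01-01 23:30")),
  arr_time = ymd_hm(c("2013-01-01 11:00", "2013-01-01 01:15"))
)

fixed &lt;- toy |&gt;
  mutate(
    overnight = arr_time &lt; dep_time,
    # Add one day only where the arrival appears to precede the departure
    arr_time = arr_time + days(if_else(overnight, 1, 0))
  )

fixed$overnight
#&gt; [1] FALSE  TRUE
all(fixed$arr_time &gt; fixed$dep_time)
#&gt; [1] TRUE</pre>
</div>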
</section>
<section id="sec-intervals" data-type="sect2">
<h2>
Intervals</h2>
<p>What should <code>dyears(1) / ddays(365)</code> return? Not quite one: durations are always represented by a number of seconds, and a duration of a year is defined as 365.25 days’ worth of seconds, so the result is slightly more than one (365.25 / 365 ≈ 1.0007).</p>
<p>What should <code>years(1) / days(1)</code> return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">years(1) / days(1)
#&gt; [1] 365.25</pre>
</div>
<p>If you want a more accurate measurement, you’ll have to use an <strong>interval</strong>. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.</p>
<p>You can create an interval by writing <code>start %--% end</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y2023 &lt;- ymd("2023-01-01") %--% ymd("2024-01-01")
y2024 &lt;- ymd("2024-01-01") %--% ymd("2025-01-01")
y2023
#&gt; [1] 2023-01-01 UTC--2024-01-01 UTC
y2024
#&gt; [1] 2024-01-01 UTC--2025-01-01 UTC</pre>
</div>
<p>You could then divide it by <code><a href="https://lubridate.tidyverse.org/reference/period.html">days()</a></code> to find out how many days fit in the year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y2023 / days(1)
#&gt; [1] 365
y2024 / days(1)
#&gt; [1] 366</pre>
</div>
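<p>Because an interval knows its endpoints, it can also report its exact length and test whether a date falls inside it. The following sketch recreates <code>y2024</code> so that it runs on its own, and uses lubridate’s <code>int_length()</code>, <code>as.duration()</code>, and <code>%within%</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

y2024 &lt;- ymd("2024-01-01") %--% ymd("2025-01-01")

# Exact length in seconds: 366 days because 2024 is a leap year
int_length(y2024) == 366 * 86400
#&gt; [1] TRUE
as.duration(y2024) == ddays(366)
#&gt; [1] TRUE

# Does a given date fall inside the interval?
ymd("2024-02-29") %within% y2024
#&gt; [1] TRUE
ymd("2025-02-28") %within% y2024
#&gt; [1] FALSE</pre>
</div>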
</section>
<section id="datetimes-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain <code>days(overnight * 1)</code> to someone who has just started learning R. How does it work?</p></li>
<li><p>Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the <em>current</em> year.</p></li>
<li><p>Write a function that given your birthday (as a date), returns how old you are in years.</p></li>
<li><p>Why can’t <code>(today() %--% (today() + years(1))) / months(1)</code> work?</p></li>
</ol></section>
</section>
<section id="time-zones" data-type="sect1">
<h1>
Time zones</h1>
<p>Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all important for data analysis, but there are a few challenges we’ll need to tackle head on.</p>
<!--# https://www.ietf.org/timezones/tzdb-2018a/theory.html -->
<p>The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme <code>{area}/{location}</code>, typically in the form <code>{continent}/{city}</code> or <code>{ocean}/{city}</code>. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.</p>
<p>You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades’ worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at <a href="https://www.iana.org/time-zones" class="uri">https://www.iana.org/time-zones</a>) just to read some of these stories!</p>
<p>You can find out what R thinks your current time zone is with <code><a href="https://rdrr.io/r/base/timezones.html">Sys.timezone()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">Sys.timezone()
#&gt; [1] "America/Chicago"</pre>
</div>
<p>(If R doesn’t know, you’ll get an <code>NA</code>.)</p>
<p>And see the complete list of all time zone names with <code><a href="https://rdrr.io/r/base/timezones.html">OlsonNames()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">length(OlsonNames())
#&gt; [1] 597
head(OlsonNames())
#&gt; [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
#&gt; [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"</pre>
</div>
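<p>Since <code>OlsonNames()</code> returns an ordinary character vector, you can search it with base R tools. A brief sketch, assuming your R installation ships the usual IANA database:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Check whether a name is a valid time zone before using it
"America/New_York" %in% OlsonNames()
#&gt; [1] TRUE
"America/Gotham" %in% OlsonNames()
#&gt; [1] FALSE

# List every zone in one area
australia &lt;- grep("^Australia/", OlsonNames(), value = TRUE)
head(australia, 3)</pre>
</div>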
<p>In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 &lt;- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
x1
#&gt; [1] "2024-06-01 12:00:00 EDT"
x2 &lt;- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
x2
#&gt; [1] "2024-06-01 18:00:00 CEST"
x3 &lt;- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
x3
#&gt; [1] "2024-06-02 04:00:00 NZST"</pre>
</div>
<p>You can verify that they’re the same time using subtraction:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 - x2
#&gt; Time difference of 0 secs
x1 - x3
#&gt; Time difference of 0 secs</pre>
</div>
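<p>Another way to convince yourself that the time zone only affects printing is to look at what is stored. In this sketch (which recreates <code>x1</code> and <code>x2</code> so that it runs on its own), the underlying number of seconds since 1970-01-01 UTC is identical, and only the time zone attribute differs:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(lubridate)

x1 &lt;- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
x2 &lt;- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")

# Same stored instant
as.numeric(x1) == as.numeric(x2)
#&gt; [1] TRUE

# Different display attribute
tz(x1)
#&gt; [1] "America/New_York"
tz(x2)
#&gt; [1] "Europe/Copenhagen"</pre>
</div>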
<p>Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>, will often drop the time zone. In that case, the date-times will display in the time zone of the first element:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x4 &lt;- c(x1, x2, x3)
x4
#&gt; [1] "2024-06-01 12:00:00 EDT" "2024-06-01 12:00:00 EDT"
#&gt; [3] "2024-06-01 12:00:00 EDT"</pre>
</div>
<p>You can change the time zone in two ways:</p>
<ul><li>
<p>Keep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x4a &lt;- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a
#&gt; [1] "2024-06-02 02:30:00 +1030" "2024-06-02 02:30:00 +1030"
#&gt; [3] "2024-06-02 02:30:00 +1030"
x4a - x4
#&gt; Time differences in secs
#&gt; [1] 0 0 0</pre>
</div>
<p>(This also illustrates another challenge of time zones: they’re not all integer hour offsets!)</p>
</li>
<li>
<p>Change the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x4b &lt;- force_tz(x4, tzone = "Australia/Lord_Howe")
x4b
#&gt; [1] "2024-06-01 12:00:00 +1030" "2024-06-01 12:00:00 +1030"
#&gt; [3] "2024-06-01 12:00:00 +1030"
x4b - x4
#&gt; Time differences in hours
#&gt; [1] -14.5 -14.5 -14.5</pre>
</div>
</li>
</ul></section>
<section id="datetimes-summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why: date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight saving time boundary or involves a leap year, the functions need to be able to handle it.</p>
<p>The next chapter gives a roundup of missing values. You’ve seen them in a few places and have no doubt encountered them in your own analysis, and it’s now time to provide a grab bag of useful techniques for dealing with them.</p>
</section>
</section>

@ -1,424 +0,0 @@
<section data-type="chapter" id="chp-factors">
<h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1>
<section id="factors-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.</p>
<p>We’ll start by motivating why factors are needed for data analysis and how you can create them with <code><a href="https://rdrr.io/r/base/factor.html">factor()</a></code>. We’ll then introduce you to the <code>gss_cat</code> dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.</p>
<section id="factors-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the <strong>forcats</strong> package, which is part of the core tidyverse. It provides tools for dealing with <strong>cat</strong>egorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="factor-basics" data-type="sect1">
<h1>
Factor basics</h1>
<p>Imagine that you have a variable that records month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 &lt;- c("Dec", "Apr", "Jan", "Mar")</pre>
</div>
<p>Using a string to record this variable has two problems:</p>
<ol type="1"><li>
<p>There are only twelve possible months, and there’s nothing saving you from typos:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x2 &lt;- c("Dec", "Apr", "Jam", "Mar")</pre>
</div>
</li>
<li>
<p>It doesn’t sort in a useful way:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sort(x1)
#&gt; [1] "Apr" "Dec" "Jan" "Mar"</pre>
</div>
</li>
</ol><p>You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid <strong>levels</strong>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">month_levels &lt;- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)</pre>
</div>
<p>Now you can create a factor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y1 &lt;- factor(x1, levels = month_levels)
y1
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
#&gt; [1] Jan Mar Apr Dec
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
<p>And any values not in the levels will be silently converted to <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y2 &lt;- factor(x2, levels = month_levels)
y2
#&gt; [1] Dec Apr &lt;NA&gt; Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
<p>This seems risky, so you might want to use <code><a href="https://forcats.tidyverse.org/reference/fct.html">fct()</a></code> instead:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y2 &lt;- fct(x2, levels = month_levels)
#&gt; Error in `fct()`:
#&gt; ! All values of `x` must appear in `levels` or `na`
#&gt; Missing level: "Jam"</pre>
</div>
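<p>If you do stay with <code><a href="https://rdrr.io/r/base/factor.html">factor()</a></code>, a quick check after the fact can reveal which inputs were silently turned into <code>NA</code>. A minimal sketch that recreates <code>x2</code> and <code>month_levels</code> so it runs on its own (<code>dropped</code> is just an illustrative name):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">month_levels &lt;- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 &lt;- c("Dec", "Apr", "Jam", "Mar")
y2 &lt;- factor(x2, levels = month_levels)

# Inputs that became NA because they are not valid levels
dropped &lt;- x2[is.na(y2)]
dropped
#&gt; [1] "Jam"

# Equivalent check that never builds the factor
setdiff(x2, month_levels)
#&gt; [1] "Jam"</pre>
</div>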
<p>If you omit the levels, they’ll be taken from the data in alphabetical order:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">factor(x1)
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Apr Dec Jan Mar</pre>
</div>
<p>Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to <code>unique(x)</code>, or after the fact, with <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_inorder()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">f1 &lt;- factor(x1, levels = unique(x1))
f1
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Dec Apr Jan Mar
f2 &lt;- x1 |&gt; factor() |&gt; fct_inorder()
f2
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Dec Apr Jan Mar</pre>
</div>
<p>If you ever need to access the set of valid levels directly, you can do so with <code><a href="https://rdrr.io/r/base/levels.html">levels()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">levels(f2)
#&gt; [1] "Dec" "Apr" "Jan" "Mar"</pre>
</div>
<p>You can also create a factor while reading your data with readr, using <code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
month,value
Jan,12
Feb,56
Mar,12"
df &lt;- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
#&gt; [1] Jan Feb Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
</section>
<section id="general-social-survey" data-type="sect1">
<h1>
General Social Survey</h1>
<p>For the rest of this chapter, we’re going to use <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">forcats::gss_cat</a></code>. It’s a sample of data from the <a href="https://gss.norc.org">General Social Survey</a>, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in <code>gss_cat</code> Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat
#&gt; # A tibble: 21,483 × 9
#&gt;    year marital         age race  rincome        partyid           
#&gt;   &lt;int&gt; &lt;fct&gt;         &lt;int&gt; &lt;fct&gt; &lt;fct&gt;          &lt;fct&gt;             
#&gt; 1  2000 Never married    26 White $8000 to 9999  Ind,near rep      
#&gt; 2  2000 Divorced         48 White $8000 to 9999  Not str republican
#&gt; 3  2000 Widowed          67 White Not applicable Independent       
#&gt; 4  2000 Never married    39 White Not applicable Ind,near rep      
#&gt; 5  2000 Divorced         25 White Not applicable Not str democrat  
#&gt; 6  2000 Married          25 White $20000 - 24999 Strong democrat   
#&gt; # … with 21,477 more rows, and 3 more variables: relig &lt;fct&gt;, denom &lt;fct&gt;,
#&gt; #   tvhours &lt;int&gt;</pre>
</div>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  count(race)
#&gt; # A tibble: 3 × 2
#&gt;   race      n
#&gt;   &lt;fct&gt; &lt;int&gt;
#&gt; 1 Other  1959
#&gt; 2 Black  3129
#&gt; 3 White 16395</pre>
</div>
<p>When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.</p>
<section id="exercise" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of <code>rincome</code> (reported income). What makes the default bar chart hard to understand? How could you improve the plot?</p></li>
<li><p>What is the most common <code>relig</code> in this survey? What’s the most common <code>partyid</code>?</p></li>
<li><p>Which <code>relig</code> does <code>denom</code> (denomination) apply to? How can you find out with a table? How can you find out with a visualization?</p></li>
</ol></section>
</section>
<section id="modifying-factor-order" data-type="sect1">
<h1>
Modifying factor order</h1>
<p>It’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">relig_summary &lt;- gss_cat |&gt;
  group_by(relig) |&gt;
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(x = tvhours, y = relig)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
</div>
</div>
<p>It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of <code>relig</code> using <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code>. <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> takes three arguments:</p>
<ul><li>
<code>.f</code>, the factor whose levels you want to modify.</li>
<li>
<code>.x</code>, a numeric vector that you want to use to reorder the levels.</li>
<li>Optionally, <code>.fun</code>, a function that’s used if there are multiple values of <code>.x</code> for each value of <code>.f</code>. The default value is <code>median</code>.</li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
</div>
</div>
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and people in the Hinduism and Other Eastern categories watch much less.</p>
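<p>To see the ordering rule in isolation, here is a minimal sketch on a made-up factor (the vectors <code>f</code> and <code>x</code> are illustrative and not taken from <code>gss_cat</code>). With several values of <code>x</code> per level, the default summary is the median, and supplying a different function through <code>.fun</code> can change the order:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)

f &lt;- factor(c("a", "a", "b", "b", "c", "c"))
x &lt;- c(10, 30, 1, 3, 5, 100)

# Medians: a = 20, b = 2, c = 52.5
levels(fct_reorder(f, x))
#&gt; [1] "b" "a" "c"

# Minimums: a = 10, b = 1, c = 5
levels(fct_reorder(f, x, .fun = min))
#&gt; [1] "b" "c" "a"</pre>
</div>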
<p>As you start making more complicated transformations, we recommend moving them out of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> and into a separate <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> step. For example, you could rewrite the plot above as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">relig_summary |&gt;
  mutate(
    relig = fct_reorder(relig, tvhours)
  ) |&gt;
  ggplot(aes(x = tvhours, y = relig)) +
  geom_point()</pre>
</div>
<p>What if we create a similar plot looking at how average age varies across reported income level?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rincome_summary &lt;- gss_cat |&gt;
  group_by(rincome) |&gt;
  summarize(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-19-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
</div>
</div>
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
<p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="https://forcats.tidyverse.org/reference/fct_relevel.html">fct_relevel()</a></code>. It takes a factor, <code>.f</code>, and then any number of levels that you want to move to the front of the line.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
</div>
</div>
<p>Why do you think the average age for “Not applicable” is so high?</p>
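<p>The effect on the levels themselves is easy to inspect on a small made-up factor (the level names here are invented for illustration). The sketch also uses the <code>after</code> argument of <code><a href="https://forcats.tidyverse.org/reference/fct_relevel.html">fct_relevel()</a></code>, which moves a level somewhere other than the front:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)

f &lt;- factor(
  c("low", "mid", "high", "n/a"),
  levels = c("low", "mid", "high", "n/a")
)

# Move one level to the front; the rest keep their relative order
levels(fct_relevel(f, "n/a"))
#&gt; [1] "n/a"  "low"  "mid"  "high"

# Move a level to the end instead
levels(fct_relevel(f, "low", after = Inf))
#&gt; [1] "mid"  "high" "n/a"  "low"</pre>
</div>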
<p>Another type of reordering is useful when you are coloring the lines on a plot. <code>fct_reorder2(.f, .x, .y)</code> reorders the factor <code>.f</code> by the <code>.y</code> values associated with the largest <code>.x</code> values. This makes the plot easier to read because the colors of the lines at the far right of the plot will line up with the legend.</p>
<div>
<pre data-type="programlisting" data-code-language="r">by_age &lt;- gss_cat |&gt;
  filter(!is.na(age)) |&gt;
  count(age, marital) |&gt;
  group_by(age) |&gt;
  mutate(
    prop = n / sum(n)
  )

ggplot(by_age, aes(x = age, y = prop, color = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(color = "marital")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-21-2.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60." width="384"/></p>
</div>
</div>
</div>
</div>
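<p>The rule is easier to follow on a handful of made-up points. In this sketch (all values invented), each group has one observation at <code>x = 1</code> and one at <code>x = 2</code>; the levels end up sorted by the <code>y</code> value at the largest <code>x</code>, from highest to lowest, which is the order in which the line ends appear on the right of a plot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)

g &lt;- factor(c("a", "a", "b", "b", "c", "c"))
x &lt;- c(1, 2, 1, 2, 1, 2)
y &lt;- c(5, 1, 0, 3, 9, 2)

# y at the largest x (x = 2): a = 1, b = 3, c = 2
levels(fct_reorder2(g, x, y))
#&gt; [1] "b" "c" "a"</pre>
</div>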
<p>Finally, for bar plots, you can use <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code> to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with <code><a href="https://forcats.tidyverse.org/reference/fct_rev.html">fct_rev()</a></code> if you want them in increasing frequency, so that in the bar plot the largest values are on the right, not the left.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  mutate(marital = marital |&gt; fct_infreq() |&gt; fct_rev()) |&gt;
  ggplot(aes(x = marital)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
</div>
</div>
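<p>To see what these two helpers do to the levels themselves, here is a minimal sketch on a made-up factor (<code>f</code> is hypothetical and not part of <code>gss_cat</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)

# A toy factor: "a" appears three times, "c" twice, and "b" once
f &lt;- factor(c("a", "a", "a", "b", "c", "c"))

# By default the levels are in alphabetical order
levels(f)
#&gt; [1] "a" "b" "c"

# fct_infreq() puts the most common level first
levels(fct_infreq(f))
#&gt; [1] "a" "c" "b"

# Adding fct_rev() flips that order, so the most common level comes last
levels(fct_rev(fct_infreq(f)))
#&gt; [1] "b" "c" "a"</pre>
</div>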
<section id="factors-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>There are some suspiciously high numbers in <code>tvhours</code>. Is the mean a good summary?</p></li>
<li><p>For each factor in <code>gss_cat</code> identify whether the order of the levels is arbitrary or principled.</p></li>
<li><p>Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?</p></li>
</ol></section>
</section>
<section id="modifying-factor-levels" data-type="sect1">
<h1>
Modifying factor levels</h1>
<p>More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. It allows you to recode, or change, the value of each level. For example, take the <code>partyid</code> variable in <code>gss_cat</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt; count(partyid)
#&gt; # A tibble: 10 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 No answer 154
#&gt; 2 Don't know 1
#&gt; 3 Other party 393
#&gt; 4 Strong republican 2314
#&gt; 5 Not str republican 3032
#&gt; 6 Ind,near rep 1791
#&gt; # … with 4 more rows</pre>
</div>
<p>The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |&gt;
  count(partyid)
#&gt; # A tibble: 10 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 No answer 154
#&gt; 2 Don't know 1
#&gt; 3 Other party 393
#&gt; 4 Republican, strong 2314
#&gt; 5 Republican, weak 3032
#&gt; 6 Independent, near rep 1791
#&gt; # … with 4 more rows</pre>
</div>
<p><code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code> will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.</p>
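<p>A small sketch of both behaviours, using a made-up factor <code>f</code> (hypothetical, not taken from <code>gss_cat</code>): levels that are not mentioned pass through untouched, and a misspelled old level produces a warning.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)

# A toy factor with three levels
f &lt;- factor(c("x", "y", "z"))

# Only "x" is mentioned, so "y" and "z" are kept as is
levels(fct_recode(f, "X" = "x"))
#&gt; [1] "X" "y" "z"

# "w" is not a level of f, so this call emits a warning about the unknown level
fct_recode(f, "W" = "w")</pre>
</div>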
<p>To combine groups, you can assign multiple old levels to the same new level:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |&gt;
  count(partyid)
#&gt; # A tibble: 8 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Other 548
#&gt; 2 Republican, strong 2314
#&gt; 3 Republican, weak 3032
#&gt; 4 Independent, near rep 1791
#&gt; 5 Independent 4119
#&gt; 6 Independent, near dem 2499
#&gt; # … with 2 more rows</pre>
</div>
<p>Use this technique with care: if you group together categories that are truly different you will end up with misleading results.</p>
<p>If you want to collapse a lot of levels, <code><a href="https://forcats.tidyverse.org/reference/fct_collapse.html">fct_collapse()</a></code> is a useful variant of <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. For each new level, you can provide a vector of old levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |&gt;
  count(partyid)
#&gt; # A tibble: 4 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 other 548
#&gt; 2 rep 5346
#&gt; 3 ind 8409
#&gt; 4 dem 7180</pre>
</div>
<p>Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the <code>fct_lump_*()</code> family of functions. <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_lowfreq()</a></code> is a simple starting point that progressively lumps the smallest groups into “Other”, always keeping “Other” as the smallest category.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  mutate(relig = fct_lump_lowfreq(relig)) |&gt;
  count(relig)
#&gt; # A tibble: 2 × 2
#&gt; relig n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Protestant 10846
#&gt; 2 Other 10637</pre>
</div>
<p>In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_n()</a></code> to specify that we want exactly 10 groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
  mutate(relig = fct_lump_n(relig, n = 10)) |&gt;
  count(relig, sort = TRUE) |&gt;
  print(n = Inf)
#&gt; # A tibble: 10 × 2
#&gt; relig n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Protestant 10846
#&gt; 2 Catholic 5124
#&gt; 3 None 3523
#&gt; 4 Christian 689
#&gt; 5 Other 458
#&gt; 6 Jewish 388
#&gt; 7 Buddhism 147
#&gt; 8 Inter-nondenominational 109
#&gt; 9 Moslem/islam 104
#&gt; 10 Orthodox-christian 95</pre>
</div>
<p>Read the documentation to learn about <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_min()</a></code> and <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_prop()</a></code> which are useful in other cases.</p>
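<p>As a rough illustration of those two variants, here is a minimal sketch on a made-up factor (<code>f</code> and the thresholds are hypothetical choices for this example): <code>fct_lump_min()</code> lumps by a minimum count, while <code>fct_lump_prop()</code> lumps by a proportion of the values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(forcats)

# A toy factor: "a" appears 10 times, "b" 5 times, "c" and "d" once each
f &lt;- factor(rep(c("a", "b", "c", "d"), times = c(10, 5, 1, 1)))

# Lump every level that appears fewer than 3 times
levels(fct_lump_min(f, min = 3))
#&gt; [1] "a"     "b"     "Other"

# Lump every level that makes up no more than 10% of the values
levels(fct_lump_prop(f, prop = 0.1))
#&gt; [1] "a"     "b"     "Other"</pre>
</div>
<p>Here both calls give the same answer because “c” and “d” are rare by either measure; on real data the two thresholds usually pick out different sets of levels.</p>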
<section id="factors-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?</p></li>
<li><p>How could you collapse <code>rincome</code> into a small set of categories?</p></li>
<li><p>Notice there are 9 groups (excluding other) in the <code>fct_lump</code> example above. Why not 10? (Hint: type <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">?fct_lump</a></code>, and find the default for the argument <code>other_level</code> is “Other”.)</p></li>
</ol></section>
</section>
<section id="ordered-factors" data-type="sect1">
<h1>
Ordered factors</h1>
<p>Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with <code><a href="https://rdrr.io/r/base/factor.html">ordered()</a></code>, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use <code>&lt;</code> between the factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ordered(c("a", "b", "c"))
#&gt; [1] a b c
#&gt; Levels: a &lt; b &lt; c</pre>
</div>
<p>In practice, <code><a href="https://rdrr.io/r/base/factor.html">ordered()</a></code> factors behave very similarly to regular factors. There are only two places where you might notice different behavior:</p>
<ul><li>If you map an ordered factor to color or fill in ggplot2, it will default to <code>scale_color_viridis_d()</code>/<code>scale_fill_viridis_d()</code>, a color scale that implies a ranking.</li>
<li>If you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend <code>vignette("contrasts", package = "faux")</code> by Lisa DeBruine.</li>
</ul><p>Given the arguable utility of these differences, we don’t generally recommend using ordered factors.</p>
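<p>If you are curious what the modelling difference looks like, the following sketch compares the contrasts that base R assigns to an ordered and an unordered factor. It assumes R’s default <code>contrasts</code> option is unchanged, and the factor <code>size</code> is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A made-up ordered factor with three levels
size &lt;- ordered(
  c("low", "medium", "high"),
  levels = c("low", "medium", "high")
)

# Ordered factors get polynomial contrasts: a linear (.L) and a quadratic (.Q) term
colnames(contrasts(size))
#&gt; [1] ".L" ".Q"

# The same levels as a regular factor get treatment (dummy) contrasts instead
colnames(contrasts(factor(size, ordered = FALSE)))
#&gt; [1] "medium" "high"</pre>
</div>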
</section>
<section id="factors-summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter introduced you to the handy forcats package for working with factors and to its most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor-related challenge that you haven’t encountered before, we highly recommend skimming the <a href="https://forcats.tidyverse.org/reference/index.html">reference index</a> to see if there’s a canned function that can help solve your problem.</p>
<p>If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, <a href="https://peerj.com/preprints/3163/"><em>Wrangling categorical data in R</em></a>. This paper lays out some of the history discussed in <a href="https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/"><em>stringsAsFactors: An unauthorized biography</em></a> and <a href="https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh"><em>stringsAsFactors = &lt;sigh&gt;</em></a>, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia &amp; Nick!</p>
<p>In the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!</p>
</section>
</section>


@ -1,952 +0,0 @@
<section data-type="chapter" id="chp-functions">
<h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1>
<section id="functions-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:</p>
<ol type="1"><li><p>You can give a function an evocative name that makes your code easier to understand.</p></li>
<li><p>As requirements change, you only need to update code in one place, instead of many.</p></li>
<li><p>You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).</p></li>
</ol><p>A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you’ll learn about three useful types of functions:</p>
<ul><li>Vector functions take one or more vectors as input and return a vector as output.</li>
<li>Data frame functions take a data frame as input and return a data frame as output.</li>
<li>Plot functions take a data frame as input and return a plot as output.</li>
</ul><p>Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for <a href="https://twitter.com/hadleywickham/status/1571603361350164486">general functions</a> and <a href="https://twitter.com/hadleywickham/status/1574373127349575680">plotting functions</a> to see even more functions.</p>
<section id="functions-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>We’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="vector-functions" data-type="sect1">
<h1>
Vector functions</h1>
<p>We’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)
df |&gt; mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
#&gt; # A tibble: 5 × 4
#&gt; a b c d
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.339 2.59 0.291 0
#&gt; 2 0.880 0 0.611 0.557
#&gt; 3 0 1.37 1 0.752
#&gt; 4 0.795 1.37 0 1
#&gt; 5 1 1.34 0.580 0.394</pre>
</div>
<p>You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an <code>a</code> to a <code>b</code>. Preventing this type of mistake is one very good reason to learn how to write functions.</p>
<section id="writing-a-function" data-type="sect2">
<h2>
Writing a function</h2>
<p>To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, it’s a little easier to see the pattern because each repetition is now one line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)) </pre>
</div>
<p>To make this a bit clearer we can replace the bit that varies with <code>█</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))</pre>
</div>
<p>To turn this into a function you need three things:</p>
<ol type="1"><li><p>A <strong>name</strong>. Here we’ll use <code>rescale01</code> because this function rescales a vector to lie between 0 and 1.</p></li>
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls and our analysis above tells us that we have just one. We’ll call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
<li><p>The <strong>body</strong>. The body is the code that’s repeated across all the calls.</p></li>
</ol><p>Then you create a function by following the template:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">name &lt;- function(arguments) {
  body
}</pre>
</div>
<p>For this case that leads to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}</pre>
</div>
<p>At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01(c(-10, 0, 10))
#&gt; [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
#&gt; [1] 0.00 0.25 0.50 NA 1.00</pre>
</div>
<p>Then you can rewrite the call to <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt; mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
#&gt; # A tibble: 5 × 4
#&gt; a b c d
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.339 1 0.291 0
#&gt; 2 0.880 0 0.611 0.557
#&gt; 3 0 0.530 1 0.752
#&gt; 4 0.795 0.531 0 1
#&gt; 5 1 0.518 0.580 0.394</pre>
</div>
<p>(In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> to reduce the duplication even further so all you need is <code>df |&gt; mutate(across(a:d, rescale01))</code>.)</p>
</section>
<section id="improving-our-function" data-type="sect2">
<h2>
Improving our function</h2>
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
  rng &lt;- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}</pre>
</div>
<p>Or you might try this function on a vector that includes an infinite value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1:10, Inf)
rescale01(x)
#&gt; [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre>
</div>
<p>That result is not particularly useful so we could ask <code><a href="https://rdrr.io/r/base/range.html">range()</a></code> to ignore infinite values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rescale01 &lt;- function(x) {
  rng &lt;- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
#&gt; [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
#&gt; [8] 0.7777778 0.8888889 1.0000000 Inf</pre>
</div>
<p>These changes illustrate an important benefit of functions: because we’ve moved the repeated code into a function, we only need to make the change in one place.</p>
</section>
<section id="mutate-functions" data-type="sect2">
<h2>
Mutate functions</h2>
<p>Now you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output of the same length as the input.</p>
<p>Let’s start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">z_score &lt;- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}</pre>
</div>
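<p>As a quick sanity check, the sketch below repeats the definition so it can be run on its own and applies it to a small made-up vector (<code>x</code> is hypothetical). The rescaled values should have a mean of 0 and a standard deviation of 1, with missing values left in place:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">z_score &lt;- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# The non-missing values have mean 4 and standard deviation 2
x &lt;- c(2, 4, 6, NA)
z &lt;- z_score(x)
z
#&gt; [1] -1  0  1 NA

# After rescaling, the mean is 0 and the standard deviation is 1
mean(z, na.rm = TRUE)
#&gt; [1] 0
sd(z, na.rm = TRUE)
#&gt; [1] 1</pre>
</div>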
<p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> and give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">clamp &lt;- function(x, min, max) {
  case_when(
    x &lt; min ~ min,
    x &gt; max ~ max,
    .default = x
  )
}
clamp(1:10, min = 3, max = 7)
#&gt; [1] 3 3 3 4 5 6 7 7 7 7</pre>
</div>
<p>Or maybe you’d rather mark those values as <code>NA</code>s:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">na_outside &lt;- function(x, min, max) {
  case_when(
    x &lt; min ~ NA,
    x &gt; max ~ NA,
    .default = x
  )
}
na_outside(1:10, min = 3, max = 7)
#&gt; [1] NA NA 3 4 5 6 7 NA NA NA</pre>
</div>
<p>Of course functions don’t just need to work with numeric variables. You might want to do some repeated string manipulation. Maybe you need to make the first character upper case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">first_upper &lt;- function(x) {
  str_sub(x, 1, 1) &lt;- str_to_upper(str_sub(x, 1, 1))
  x
}
first_upper("hello")
#&gt; [1] "Hello"</pre>
</div>
<p>Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number &lt;- function(x) {
  is_pct &lt;- str_detect(x, "%")
  num &lt;- x |&gt;
    str_remove_all("%") |&gt;
    str_remove_all(",") |&gt;
    str_remove_all(fixed("$")) |&gt;
    as.numeric()
  if_else(is_pct, num / 100, num)
}
clean_number("$12,300")
#&gt; [1] 12300
clean_number("45%")
#&gt; [1] 0.45</pre>
</div>
<p>Sometimes your functions will be highly specialized for one data analysis step. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">fix_na &lt;- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}</pre>
</div>
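<p>A short usage sketch, with the definition repeated so it runs on its own. The <code>survey</code> tibble and its columns are made up for illustration, and passing a bare <code>NA</code> to <code>if_else()</code> assumes dplyr 1.1.0 or later:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

fix_na &lt;- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}

# On a bare vector, the sentinel codes become NA
fix_na(c(23, 997, 41, 999))
#&gt; [1] 23 NA 41 NA

# Inside mutate(), the same helper cleans several columns
survey &lt;- tibble(
  age = c(23, 997, 41),
  hours = c(998, 40, 999)
)
survey_clean &lt;- survey |&gt;
  mutate(
    age = fix_na(age),
    hours = fix_na(hours)
  )</pre>
</div>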
<p>We’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs. For example, you might want to compute the distance between two locations on the globe using the haversine formula. This requires four vectors:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 &lt;- long1 * pi / 180
  lat1 &lt;- lat1 * pi / 180
  long2 &lt;- long2 * pi / 180
  lat2 &lt;- lat2 * pi / 180
  R &lt;- 6371 # Earth mean radius in km
  a &lt;- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  d &lt;- R * 2 * asin(sqrt(a))
  round(d, round)
}</pre>
</div>
</section>
<section id="summary-functions" data-type="sect2">
<h2>
Summary functions</h2>
<p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">commas &lt;- function(x) {
  str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
#&gt; [1] "cat, dog and pigeon"</pre>
</div>
<p>Or you might wrap up a simple computation, like for the coefficient of variation, which divides the standard deviation by the mean:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cv &lt;- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
#&gt; [1] 0.5196276
cv(runif(100, min = 0, max = 500))
#&gt; [1] 0.5652554</pre>
</div>
<p>Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/gbganalyst/status/1571619641390252033
n_missing &lt;- function(x) {
  sum(is.na(x))
} </pre>
</div>
<p>You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/neilgcurrie/status/1571607727255834625
mape &lt;- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual)
}</pre>
</div>
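<p>To make the arithmetic concrete, here is a small worked sketch with made-up numbers (<code>actual</code> and <code>predicted</code> are hypothetical, and the definition is repeated so the example runs on its own). Two predictions are off by 10% and one is exact, so the result is 0.2 / 3:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mape &lt;- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual)
}

actual &lt;- c(100, 200, 400)
predicted &lt;- c(110, 180, 400)

# Relative errors are 0.1, 0.1, and 0, so the mean is 0.2 / 3
mape(actual, predicted)
#&gt; [1] 0.06666667</pre>
</div>
<p>Because each error is divided by <code>actual</code>, this version returns <code>Inf</code> or <code>NaN</code> when an actual value is zero and <code>NA</code> when any input is missing, so you may want to filter or handle those cases first.</p>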
<div data-type="note"><h1>
RStudio
</h1>
<p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p>
<ul><li><p>To find the definition of a function that you’ve written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
<li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li>
</ul>
</div>
</section>
<section id="functions-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)</pre>
</div>
</li>
<li><p>In the second variant of <code>rescale01()</code>, infinite values are left unchanged. Can you rewrite <code>rescale01()</code> so that <code>-Inf</code> is mapped to 0, and <code>Inf</code> is mapped to 1?</p></li>
<li><p>Given a vector of birthdates, write a function to compute the age in years.</p></li>
<li><p>Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as <span class="math display">\[
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
\]</span> where <span class="math inline">\(\bar{x} = (\sum_i^n x_i) / n\)</span> is the sample mean. Skewness is defined as <span class="math display">\[
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
\]</span></p></li>
<li><p>Write <code>both_na()</code>, a summary function that takes two vectors of the same length and returns the number of positions that have an <code>NA</code> in both vectors.</p></li>
<li>
<p>Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">is_directory &lt;- function(x) file.info(x)$isdir
is_readable &lt;- function(x) file.access(x, 4) == 0</pre>
</div>
</li>
</ol></section>
</section>
<section id="data-frame-functions" data-type="sect1">
<h1>
Data frame functions</h1>
<p>Vector functions are useful for pulling out code that’s repeated within a dplyr verb. But you’ll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or vector.</p>
<p>To let you write a function that uses dplyr verbs, we’ll first introduce you to the challenge of indirection and how you can overcome it with embracing, <code>{{ }}</code>. With this theory under your belt, we’ll then show you a bunch of examples to illustrate what you might do with it.</p>
<section id="indirection-and-tidy-evaluation" data-type="sect2">
<h2>
Indirection and tidy evaluation</h2>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: <code>grouped_mean()</code>. The goal of this function is to compute the mean of <code>mean_var</code> grouped by <code>group_var</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">grouped_mean &lt;- function(df, group_var, mean_var) {
  df |&gt;
    group_by(group_var) |&gt;
    summarize(mean(mean_var))
}</pre>
</div>
<p>If we try and use it, we get an error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; grouped_mean(cut, carat)
#&gt; Error in `group_by()`:
#&gt; ! Must group by variables found in `.data`.
#&gt; ✖ Column `group_var` is not found.</pre>
</div>
<p>To make the problem a bit more clear, we can use a made up data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)
df |&gt; grouped_mean(group, x)
#&gt; # A tibble: 1 × 2
#&gt; group_var `mean(mean_var)`
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 g 1
df |&gt; grouped_mean(group, y)
#&gt; # A tibble: 1 × 2
#&gt; group_var `mean(mean_var)`
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 g 1</pre>
</div>
<p>Regardless of how we call <code>grouped_mean()</code> it always does <code>df |&gt; group_by(group_var) |&gt; summarize(mean(mean_var))</code>, instead of <code>df |&gt; group_by(group) |&gt; summarize(mean(x))</code> or <code>df |&gt; group_by(group) |&gt; summarize(mean(y))</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code>group_by()</code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> not to treat <code>group_var</code> and <code>mean_var</code> as the names of the variables, but instead to look inside them for the variables we actually want to use.</p>
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
<p>So to make <code>grouped_mean()</code> work, we need to surround <code>group_var</code> and <code>mean_var</code> with <code>{{ }}</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">grouped_mean &lt;- function(df, group_var, mean_var) {
  df |&gt;
    group_by({{ group_var }}) |&gt;
    summarize(mean({{ mean_var }}))
}
diamonds |&gt; grouped_mean(cut, carat)
#&gt; # A tibble: 5 × 2
#&gt; cut `mean(carat)`
#&gt; &lt;ord&gt; &lt;dbl&gt;
#&gt; 1 Fair 1.05
#&gt; 2 Good 0.849
#&gt; 3 Very Good 0.806
#&gt; 4 Premium 0.892
#&gt; 5 Ideal 0.703</pre>
</div>
<p>Success!</p>
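<p>To confirm that embracing also fixes the made-up data frame from earlier, the sketch below repeats both definitions so it runs on its own. With the embraced version, the function should group by <code>group</code> and average <code>x</code> or <code>y</code>, rather than the columns that happen to be called <code>group_var</code> and <code>mean_var</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

grouped_mean &lt;- function(df, group_var, mean_var) {
  df |&gt;
    group_by({{ group_var }}) |&gt;
    summarize(mean({{ mean_var }}))
}

# The same decoy data frame: it has columns named after the arguments
df &lt;- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

# Now the results depend on the columns we actually asked for
by_x &lt;- df |&gt; grouped_mean(group, x) # mean of x is 10
by_y &lt;- df |&gt; grouped_mean(group, y) # mean of y is 100</pre>
</div>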
</section>
<section id="sec-embracing" data-type="sect2">
<h2>
When to embrace?</h2>
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately, this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs which correspond to the two most common sub-types of tidy evaluation:</p>
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> that compute with variables.</p></li>
<li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
</ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p>
<p>In the following sections, we'll explore the sorts of handy functions you might write once you understand embracing.</p>
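<p>As a minimal sketch of the two cases (the helper names <code>rescale()</code> and <code>keep_cols()</code> are hypothetical, made up for illustration), the first function below embraces an argument that is passed to <code>mutate()</code>, which uses data-masking, so callers can supply an expression. The second embraces an argument that is passed to <code>select()</code>, which uses tidy-selection, so callers can supply a selection such as <code>mpg:disp</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical helper: `var` is passed to mutate(), a data-masking function,
# so callers can supply any expression that computes with columns.
rescale &lt;- function(df, var) {
  df |&gt;
    mutate(scaled = {{ var }} / max({{ var }}, na.rm = TRUE))
}

# Hypothetical helper: `cols` is passed to select(), a tidy-selection function,
# so callers can supply column ranges or selection helpers.
keep_cols &lt;- function(df, cols) {
  df |&gt;
    select({{ cols }})
}

# Compute with an expression (data-masking)
mtcars |&gt; rescale(hp * 2) |&gt; head(3)

# Select a range of columns (tidy-selection)
mtcars |&gt; keep_cols(mpg:disp) |&gt; names()</pre>
</div>
<p>Swapping the two calls (for example, passing <code>mpg:disp</code> to <code>rescale()</code>) would not do what you want, which is why it helps to check the documentation for each argument you forward.</p>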
</section>
<section id="common-use-cases" data-type="sect2">
<h2>
Common use cases</h2>
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">summary6 &lt;- function(data, var) {
  data |&gt; summarize(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}
diamonds |&gt; summary6(carat)
#&gt; # A tibble: 1 × 6
#&gt; min mean median max n n_miss
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.2 0.798 0.7 5.01 53940 0</pre>
</div>
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> in a helper, we think it's good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, you can use it on grouped data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  group_by(cut) |&gt;
  summary6(carat)
#&gt; # A tibble: 5 × 7
#&gt; cut min mean median max n n_miss
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Fair 0.22 1.05 1 5.01 1610 0
#&gt; 2 Good 0.23 0.849 0.82 3.01 4906 0
#&gt; 3 Very Good 0.2 0.806 0.71 4 12082 0
#&gt; 4 Premium 0.2 0.892 0.86 4.01 13791 0
#&gt; 5 Ideal 0.2 0.703 0.54 3.5 21551 0</pre>
</div>
<p>Furthermore, since the arguments to <code>summarize()</code> are data-masking, the <code>var</code> argument to <code>summary6()</code> is data-masking too. That means you can also summarize computed variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  group_by(cut) |&gt;
  summary6(log10(carat))
#&gt; # A tibble: 5 × 7
#&gt; cut min mean median max n n_miss
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Fair -0.658 -0.0273 0 0.700 1610 0
#&gt; 2 Good -0.638 -0.133 -0.0862 0.479 4906 0
#&gt; 3 Very Good -0.699 -0.164 -0.149 0.602 12082 0
#&gt; 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0
#&gt; 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre>
</div>
<p>To summarize multiple variables, you'll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you'll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/Diabb6/status/1571635146658402309
count_prop &lt;- function(df, var, sort = FALSE) {
  df |&gt;
    count({{ var }}, sort = sort) |&gt;
    mutate(prop = n / sum(n))
}
diamonds |&gt; count_prop(clarity)
#&gt; # A tibble: 8 × 3
#&gt; clarity n prop
#&gt; &lt;ord&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 I1 741 0.0137
#&gt; 2 SI2 9194 0.170
#&gt; 3 SI1 13065 0.242
#&gt; 4 VS2 12258 0.227
#&gt; 5 VS1 8171 0.151
#&gt; 6 VVS2 5066 0.0939
#&gt; # … with 2 more rows</pre>
</div>
<p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>. Only <code>var</code> needs to be embraced because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, which uses data-masking for all variables in <code>...</code>.</p>
<p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">unique_where &lt;- function(df, condition, var) {
  df |&gt;
    filter({{ condition }}) |&gt;
    distinct({{ var }}) |&gt;
    arrange({{ var }})
}
# Find all the destinations in December
flights |&gt; unique_where(month == 12, dest)
#&gt; # A tibble: 96 × 1
#&gt; dest
#&gt; &lt;chr&gt;
#&gt; 1 ABQ
#&gt; 2 ALB
#&gt; 3 ATL
#&gt; 4 AUS
#&gt; 5 AVL
#&gt; 6 BDL
#&gt; # … with 90 more rows
# Which months did plane N14228 fly in?
flights |&gt; unique_where(tailnum == "N14228", month)
#&gt; # A tibble: 11 × 1
#&gt; month
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3
#&gt; 4 4
#&gt; 5 5
#&gt; 6 6
#&gt; # … with 5 more rows</pre>
</div>
<p>Here we embrace <code>condition</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>.</p>
<p>We've written all these examples to take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights_sub &lt;- function(rows, cols) {
  flights |&gt;
    filter({{ rows }}) |&gt;
    select(time_hour, carrier, flight, {{ cols }})
}
flights_sub(dest == "IAH", contains("time"))
#&gt; # A tibble: 7,198 × 8
#&gt; time_hour carrier flight dep_time sched_dep_time arr_time
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228
#&gt; # … with 7,192 more rows, and 2 more variables: sched_arr_time &lt;int&gt;,
#&gt; # air_time &lt;dbl&gt;</pre>
</div>
</section>
<section id="data-masking-vs.-tidy-selection" data-type="sect2">
<h2>
Data-masking vs. tidy-selection</h2>
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by({{ group_vars }}) |&gt;
    summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
  count_missing(c(year, month, day), dep_time)
#&gt; Error in `group_by()`:
#&gt; In argument: `c(year, month, day)`.
#&gt; Caused by error:
#&gt; ! `c(year, month, day)` must be size 336776 or 1, not 1010328.</pre>
</div>
<p>This doesn't work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> function, which allows you to use tidy-selection inside data-masking functions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by(pick({{ group_vars }})) |&gt;
    summarize(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
  count_missing(c(year, month, day), dep_time)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.
#&gt; # A tibble: 365 × 4
#&gt; # Groups: year, month [12]
#&gt; year month day n_miss
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 4
#&gt; 2 2013 1 2 8
#&gt; 3 2013 1 3 10
#&gt; 4 2013 1 4 6
#&gt; 5 2013 1 5 3
#&gt; 6 2013 1 6 1
#&gt; # … with 359 more rows</pre>
</div>
<p>Another convenient use of <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> is to make a 2d table of counts. Here we count using all the variables in <code>rows</code> and <code>cols</code>, then use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to rearrange the counts into a grid:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/pollicipes/status/1571606508944719876
count_wide &lt;- function(data, rows, cols) {
  data |&gt;
    count(pick(c({{ rows }}, {{ cols }}))) |&gt;
    pivot_wider(
      names_from = {{ cols }},
      values_from = n,
      names_sort = TRUE,
      values_fill = 0
    )
}
diamonds |&gt; count_wide(clarity, cut)
#&gt; # A tibble: 8 × 6
#&gt; clarity Fair Good `Very Good` Premium Ideal
#&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 210 96 84 205 146
#&gt; 2 SI2 466 1081 2100 2949 2598
#&gt; 3 SI1 408 1560 3240 3575 4282
#&gt; 4 VS2 261 978 2591 3357 5071
#&gt; 5 VS1 170 648 1775 1989 3589
#&gt; 6 VVS2 69 286 1235 870 2606
#&gt; # … with 2 more rows
diamonds |&gt; count_wide(c(clarity, color), cut)
#&gt; # A tibble: 56 × 7
#&gt; clarity color Fair Good `Very Good` Premium Ideal
#&gt; &lt;ord&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 D 4 8 5 12 13
#&gt; 2 I1 E 9 23 22 30 18
#&gt; 3 I1 F 35 19 13 34 42
#&gt; 4 I1 G 53 19 16 46 16
#&gt; 5 I1 H 52 14 12 46 38
#&gt; 6 I1 I 34 9 8 24 17
#&gt; # … with 50 more rows</pre>
</div>
<p>While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> docs you can see that <code>names_from</code> uses tidy-selection.</p>
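<p>To illustrate the same idea with another tidyr function, here is a small sketch (the name <code>stack_cols()</code> is hypothetical, and the built-in <code>iris</code> data is used only to keep the example self-contained). It embraces an argument that is forwarded to the <code>cols</code> argument of <code>pivot_longer()</code>, which is also documented as using tidy-selection:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)

# Hypothetical helper: `cols` is forwarded to pivot_longer(), whose `cols`
# argument uses tidy-selection, so it needs to be embraced.
stack_cols &lt;- function(df, cols) {
  df |&gt;
    pivot_longer(
      cols = {{ cols }},
      names_to = "variable",
      values_to = "value"
    )
}

# Any tidy-selection works here: a range, c(), or a helper like starts_with()
stacked &lt;- stack_cols(iris, starts_with("Sepal"))</pre>
</div>
<p>Because <code>cols</code> is a tidy-selection argument, a call such as <code>stack_cols(iris, Sepal.Length:Petal.Width)</code> should work just as well.</p>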
</section>
<section id="functions-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Using the datasets from nycflights13, write a function that:</p>
<ol type="1"><li>
<p>Finds all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe()</pre>
</div>
</li>
<li>
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; group_by(dest) |&gt; summarize_severe()</pre>
</div>
</li>
<li>
<p>Finds all flights that were cancelled or delayed by more than a user supplied number of hours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; filter_severe(hours = 2)</pre>
</div>
</li>
<li>
<p>Summarizes the weather to compute the minimum, mean, and maximum of a user supplied variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">weather |&gt; summarize_weather(temp)</pre>
</div>
</li>
<li>
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc.) into a decimal time (i.e. hours + (minutes / 60)).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; standardise_time(sched_dep_time)</pre>
</div>
</li>
</ol></li>
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
<li>
<p>Generalize the following function so that you can supply any number of variables to count.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">count_prop &lt;- function(df, var, sort = FALSE) {
  df |&gt;
    count({{ var }}, sort = sort) |&gt;
    mutate(prop = n / sum(n))
}</pre>
</div>
</li>
</ol></section>
</section>
<section id="plot-functions" data-type="sect1">
<h1>
Plot functions</h1>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately, you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you're making a lot of histograms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
diamonds |&gt;
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.05)</pre>
</div>
<p>Wouldn't it be nice if you could wrap this up into a histogram function? This is easy as pie once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function and you need to embrace:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}
diamonds |&gt; histogram(carat, 0.1)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Note that <code>histogram()</code> returns a ggplot2 plot, meaning you can still add on additional components if you want. Just remember to switch from <code>|&gt;</code> to <code>+</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
  histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")</pre>
</div>
<section id="more-variables" data-type="sect2">
<h2>
More variables</h2>
<p>It's straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a dataset is linear by overlaying a smooth line and a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check &lt;- function(df, x, y) {
  df |&gt;
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}
starwars |&gt;
  filter(mass &lt; 1000) |&gt;
  linearity_check(mass, height)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-48-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot &lt;- function(df, x, y, z, bins = 20, fun = "mean") {
  df |&gt;
    ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) +
    stat_summary_hex(
      aes(color = after_scale(fill)), # make border same color as fill
      bins = bins,
      fun = fun,
    )
}
diamonds |&gt; hex_plot(carat, price, depth)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
</section>
<section id="combining-with-dplyr" data-type="sect2">
<h2>
Combining with dplyr</h2>
<p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code>. Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sorted_bars &lt;- function(df, var) {
  df |&gt;
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |&gt;
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}
diamonds |&gt; sorted_bars(cut)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-50-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>We have to use a new operator here, <code>:=</code>, because we are generating the variable name based on user-supplied data. Variable names go on the left-hand side of <code>=</code>, but R's syntax doesn't allow anything to the left of <code>=</code> except for a single literal name. To work around this problem, we use the special operator <code>:=</code>, which tidy evaluation treats in exactly the same way as <code>=</code>.</p>
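<p>As a further sketch of the same pattern outside of plotting (the helper name <code>to_log10()</code> is hypothetical), the following function overwrites whichever column the caller names with its base-10 logarithm. The column name on the left of <code>:=</code> comes from the embraced argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical helper: replace the column given by `var` with log10() of itself.
# `{{ var }} :=` lets the output column name come from the caller's argument.
to_log10 &lt;- function(df, var) {
  df |&gt;
    mutate({{ var }} := log10({{ var }}))
}

# Only `x` is transformed; `y` is left untouched
tibble(x = c(1, 10, 100), y = c(2, 4, 6)) |&gt; to_log10(x)</pre>
</div>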
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">conditional_bars &lt;- function(df, condition, var) {
  df |&gt;
    filter({{ condition }}) |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_bar()
}
diamonds |&gt; conditional_bars(cut == "Good", clarity)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-51-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts &lt;- function(df, val, group) {
  labs &lt;- df |&gt;
    group_by({{ group }}) |&gt;
    summarize(breaks = max({{ val }}))
  df |&gt;
    ggplot(aes(x = date, y = {{ val }}, group = {{ group }}, color = {{ group }})) +
    geom_path() +
    scale_y_continuous(
      breaks = labs$breaks,
      labels = scales::label_comma(),
      minor_breaks = NULL,
      guide = guide_axis(position = "right")
    )
}
df &lt;- tibble(
  dist1 = sort(rnorm(50, 5, 2)),
  dist2 = sort(rnorm(50, 8, 3)),
  dist4 = sort(rnorm(50, 15, 1)),
  date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)
df &lt;- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
fancy_ts(df, value, dist_name)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-52-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Next we'll discuss two more complicated cases: faceting and automatic labeling.</p>
</section>
<section id="faceting" data-type="sect2">
<h2>
Faceting</h2>
<p>Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work. So you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/sharoz/status/1574376332821204999
foo &lt;- function(x) {
  ggplot(mtcars, aes(x = mpg, y = disp)) +
    geom_point() +
    facet_wrap(vars({{ x }}))
}
foo(cyl)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-53-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># https://twitter.com/yutannihilat_en/status/1574387230025875457
density &lt;- function(color, facets, binwidth = 0.1) {
  diamonds |&gt;
    ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}
density()
density(cut)
density(cut, clarity)</pre>
</div>
</section>
<section id="labeling" data-type="sect2">
<h2>
Labeling</h2>
<p>Remember the histogram function we showed you earlier?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth = NULL) {
  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}</pre>
</div>
<p>Wouldn't it be nice if we could label the output with the variable and the bin width that was used? To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about yet: rlang. rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">histogram &lt;- function(df, var, binwidth) {
  label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}
diamonds |&gt; histogram(carat, 0.1)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-56-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can use the same approach in any other place where you want to supply a string in a ggplot2 plot.</p>
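<p>For instance, the following sketch applies the same idea to an axis label. The function name <code>histogram_units()</code> and its <code>units</code> argument are hypothetical, invented for this illustration; only <code>rlang::englue()</code> and the ggplot2 calls come from the text above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Hypothetical helper: label the x axis with the variable name and a
# caller-supplied unit string, built with rlang::englue().
histogram_units &lt;- function(df, var, units, binwidth = NULL) {
  # {{ var }} inserts the variable name; {units} inserts the string value
  x_label &lt;- rlang::englue("{{ var }} (in {units})")

  df |&gt;
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(x = x_label)
}

diamonds |&gt; histogram_units(carat, "carats", binwidth = 0.1)</pre>
</div>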
</section>
<section id="functions-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<p>Build up a rich plotting function by incrementally implementing each of the steps below:</p>
<ol type="1"><li><p>Draw a scatterplot given dataset and <code>x</code> and <code>y</code> variables.</p></li>
<li><p>Add a line of best fit (i.e. a linear model with no standard errors).</p></li>
<li><p>Add a title.</p></li>
</ol></section>
</section>
<section id="style" data-type="sect1">
<h1>
Style</h1>
<p>R doesn't care what your function or arguments are called, but the names make a big difference for humans. Ideally, the name of your function will be short but clearly evoke what the function does. That's hard! But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.</p>
<p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="https://rdrr.io/r/stats/coef.html">coef()</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don't be afraid to rename a function if you figure out a better name later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Too short
f()
# Not a verb, or descriptive
my_awesome_function()
# Long, but clear
impute_missing()
collapse_years()</pre>
</div>
<p>R also doesn't care about how you use white space in your functions, but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Missing extra two spaces
density &lt;- function(color, facets, binwidth = 0.1) {
diamonds |&gt;
  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
  geom_freqpoly(binwidth = binwidth) +
  facet_wrap(vars({{ facets }}))
}
# Pipe indented incorrectly
density &lt;- function(color, facets, binwidth = 0.1) {
  diamonds |&gt;
  ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
  geom_freqpoly(binwidth = binwidth) +
  facet_wrap(vars({{ facets }}))
}</pre>
</div>
<p>As you can see we recommend putting extra spaces inside of <code>{{ }}</code>. This makes it very obvious that something unusual is happening.</p>
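<p>For comparison with the two mis-indented versions above, a version of the same function that follows these rules might look like the sketch below: the body is indented by two spaces, and every line after the first line of the pipeline is indented by a further two:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Body indented by two spaces; continuation lines of the pipeline and the
# `+` chain are indented by two more.
density &lt;- function(color, facets, binwidth = 0.1) {
  diamonds |&gt;
    ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}</pre>
</div>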
<section id="functions-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">f1 &lt;- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}
f3 &lt;- function(x, y) {
  rep(y, length.out = length(x))
}</pre>
</div>
</li>
<li><p>Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li>
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc. would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
</ol></section>
</section>
<section id="functions-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
<p>We have only shown you the bare minimum to get started with functions and there's much more to learn. A few places to learn more are:</p>
<ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="https://dplyr.tidyverse.org/articles/programming.html">programming with dplyr</a> and <a href="https://tidyr.tidyverse.org/articles/programming.html">programming with tidyr</a> and learn more about the theory in <a href="https://rlang.r-lib.org/reference/topic-data-mask.html">What is data-masking and why do I need {{?</a>.</li>
<li>To learn more about reducing duplication in your ggplot2 code, read the <a href="https://ggplot2-book.org/programming.html" class="uri">Programming with ggplot2</a> chapter of the ggplot2 book.</li>
<li>For more advice on function style, see the <a href="https://style.tidyverse.org/functions.html" class="uri">tidyverse style guide</a>.</li>
</ul><p>In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far. These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration, which gives you further tools for reducing code duplication.</p>
</section>
</section>


@ -1,14 +0,0 @@
<div data-type="part">
<h1><span id="sec-import" class="quarto-section-identifier d-none d-lg-block">Import</span></h1><p>In this part of the book, you'll learn how to import a wider range of data into R, as well as how to get it into a form useful for analysis. Sometimes this is just a matter of calling a function from the appropriate data import package. But in more complex cases it might require both tidying and transformation in order to get to the tidy rectangle that you'd prefer to work with.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-import"><p><img src="diagrams/data-science/import.png" alt="Our data science model with import highlighted in blue. " width="535"/></p>
<figcaption>Figure 1: Data import is the beginning of the data science process; without data you can't do data science!</figcaption>
</figure>
</div>
</div><p>In this part of the book you'll learn how to access data stored in the following ways:</p><ul><li><p>In <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>, you'll learn how to import data from Excel spreadsheets and Google Sheets.</p></li>
<li><p>In <a href="#chp-databases" data-type="xref">#chp-databases</a>, you'll learn about getting data out of a database and into R (and you'll also learn a little about how to get data out of R and into a database).</p></li>
<li><p>In <a href="#chp-arrow" data-type="xref">#chp-arrow</a>, you'll learn about Arrow, a powerful tool for working with out-of-memory data, particularly when it's stored in the parquet format.</p></li>
<li><p>In <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>, you'll learn how to work with hierarchical data, including the deeply nested lists produced by data stored in the JSON format.</p></li>
<li><p>In <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a>, you'll learn web “scraping”, the art and science of extracting data from web pages.</p></li>
</ul><p>There are two important tidyverse packages that we don't discuss here: haven and xml2. If you're working with data from SPSS, Stata, and SAS files, check out the <strong>haven</strong> package, <a href="https://haven.tidyverse.org" class="uri">https://haven.tidyverse.org</a>. If you're working with XML data, check out the <strong>xml2</strong> package, <a href="https://xml2.r-lib.org" class="uri">https://xml2.r-lib.org</a>. Otherwise, you'll need to do some research to figure out which package you'll need to use; Google is your friend here 😃.</p></div>


@ -1,954 +0,0 @@
<section data-type="chapter" id="chp-joins">
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1>
<section id="joins-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must <strong>join</strong> them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:</p>
<ul><li>Mutating joins, which add new variables to one data frame from matching observations in another.</li>
<li>Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.</li>
</ul><p>We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.</p>
<section id="joins-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="keys" data-type="sect1">
<h1>
Keys</h1>
<p>To understand joins, you first need to understand how two tables can be connected through a pair of keys, one in each table. In this section, you’ll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
<section id="primary-and-foreign-keys" data-type="sect2">
<h2>
Primary and foreign keys</h2>
<p>Every join involves a pair of keys: a primary key and a foreign key. A <strong>primary key</strong> is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a <strong>compound key</strong>. For example, in nycflights13:</p>
<ul><li>
<p><code>airlines</code> records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two-letter carrier code, making <code>carrier</code> the primary key.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airlines
#&gt; # A tibble: 16 × 2
#&gt; carrier name
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 9E Endeavor Air Inc.
#&gt; 2 AA American Airlines Inc.
#&gt; 3 AS Alaska Airlines Inc.
#&gt; 4 B6 JetBlue Airways
#&gt; 5 DL Delta Air Lines Inc.
#&gt; 6 EV ExpressJet Airlines Inc.
#&gt; # … with 10 more rows</pre>
</div>
</li>
<li>
<p><code>airports</code> records data about each airport. You can identify each airport by its three-letter airport code, making <code>faa</code> the primary key.</p>
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">airports
#&gt; # A tibble: 1,458 × 8
#&gt; faa name lat lon alt tz dst
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A
#&gt; 6 0A9 Elizabethton Municipal Airpo… 36.4 -82.2 1593 -5 A
#&gt; # … with 1,452 more rows, and 1 more variable: tzone &lt;chr&gt;</pre>
</div>
</li>
<li>
<p><code>planes</code> records data about each plane. You can identify a plane by its tail number, making <code>tailnum</code> the primary key.</p>
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manufacturer model engines
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 N10156 2004 Fixed wing multi… EMBRAER EMB-145XR 2
#&gt; 2 N102UW 1998 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 3 N103US 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 4 N104UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; 5 N10575 2002 Fixed wing multi… EMBRAER EMB-145LR 2
#&gt; 6 N105UW 1999 Fixed wing multi… AIRBUS INDUSTR… A320-214 2
#&gt; # … with 3,316 more rows, and 3 more variables: seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
</li>
<li>
<p><code>weather</code> records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making <code>origin</code> and <code>time_hour</code> the compound primary key.</p>
<div class="cell" data-r.options="{&quot;width&quot;:67}">
<pre data-type="programlisting" data-code-language="r">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240
#&gt; # … with 26,109 more rows, and 6 more variables: wind_speed &lt;dbl&gt;,
#&gt; # wind_gust &lt;dbl&gt;, precip &lt;dbl&gt;, pressure &lt;dbl&gt;, visib &lt;dbl&gt;, …</pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
<ul><li>
<code>flights$tailnum</code> is a foreign key that corresponds to the primary key <code>planes$tailnum</code>.</li>
<li>
<code>flights$carrier</code> is a foreign key that corresponds to the primary key <code>airlines$carrier</code>.</li>
<li>
<code>flights$origin</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$origin</code>-<code>flights$time_hour</code> is a compound foreign key that corresponds to the compound primary key <code>weather$origin</code>-<code>weather$time_hour</code>.</li>
</ul><p>These relationships are summarized visually in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-flights-relationships"><p><img src="diagrams/relational.png" alt="The relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa connected to the flights$origin and flights$dest. planes$tailnum is connected to the flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between airports, planes, airlines, and weather data frames." width="502"/></p>
<figcaption>Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
</figure>
</div>
</div>
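<p>One way to sanity-check a foreign key is to confirm that the values it takes also appear in the corresponding primary key. The following is a minimal sketch using base R’s <code>%in%</code>; it looks only at the <code>carrier</code> and <code>tailnum</code> relationships and is offered as an illustration, not as part of the original analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# TRUE only if every carrier code used in flights has a row in airlines
all(flights$carrier %in% airlines$carrier)

# Number of flights whose tail number is recorded but has no row in planes
flights |&gt;
  filter(!is.na(tailnum), !tailnum %in% planes$tailnum) |&gt;
  nrow()</pre>
</div>
<p>A non-zero count from the second check means some foreign key values have no matching primary key, which matters once you start joining the tables.</p>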
<p>You’ll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you’ll see shortly, will make your joining life much easier. It’s also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There’s only one exception: <code>year</code> means year of departure in <code>flights</code> and year of manufacture in <code>planes</code>. This will become important when we start actually joining tables together.</p>
</section>
<section id="checking-primary-keys" data-type="sect2">
<h2>
Checking primary keys</h2>
<p>Now that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">planes |&gt;
  count(tailnum) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 0 × 2
#&gt; # … with 2 variables: tailnum &lt;chr&gt;, n &lt;int&gt;
weather |&gt;
  count(time_hour, origin) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 0 × 3
#&gt; # … with 3 variables: time_hour &lt;dttm&gt;, origin &lt;chr&gt;, n &lt;int&gt;</pre>
</div>
<p>You should also check for missing values in your primary keys: if a value is missing then it can’t identify an observation!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">planes |&gt;
  filter(is.na(tailnum))
#&gt; # A tibble: 0 × 9
#&gt; # … with 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, …
weather |&gt;
  filter(is.na(time_hour) | is.na(origin))
#&gt; # A tibble: 0 × 15
#&gt; # … with 15 variables: origin &lt;chr&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,
#&gt; # hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;, …</pre>
</div>
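<p>Both checks can be seen in action on a small made-up table whose key is deliberately broken. The tibble <code>toy</code> and its columns are hypothetical and exist only for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# A made-up table: id 2 appears twice and one id is missing
toy &lt;- tibble(
  id = c(1, 2, 2, NA),
  value = c("a", "b", "c", "d")
)

# Duplicated key values: keeps the single summary row for id 2
toy |&gt;
  count(id) |&gt;
  filter(n &gt; 1)

# Missing key values: keeps the row whose id is NA
toy |&gt;
  filter(is.na(id))</pre>
</div>
<p>If either query returns any rows, the candidate key does not uniquely identify every observation and needs more thought before you join on it.</p>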
</section>
<section id="surrogate-keys" data-type="sect2">
<h2>
Surrogate keys</h2>
<p>So far we haven’t talked about the primary key for <code>flights</code>. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.</p>
<p>After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  count(time_hour, carrier, flight) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 0 × 4
#&gt; # … with 4 variables: time_hour &lt;dttm&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, n &lt;int&gt;</pre>
</div>
<p>Does the absence of duplicates automatically make <code>time_hour</code>-<code>carrier</code>-<code>flight</code> a primary key? It’s certainly a good start, but it doesn’t guarantee it. For example, are altitude and latitude a good primary key for <code>airports</code>?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airports |&gt;
  count(alt, lat) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 1 × 3
#&gt; alt lat n
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 13 40.6 2</pre>
</div>
<p>Identifying an airport by its altitude and latitude is clearly a bad idea, and in general it’s not possible to know from the data alone whether or not a combination of variables makes a good primary key. But for flights, the combination of <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.</p>
<p>That said, we might be better off introducing a simple numeric surrogate key using the row number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 &lt;- flights |&gt;
  mutate(id = row_number(), .before = 1)
flights2
#&gt; # A tibble: 336,776 × 20
#&gt; id year month day dep_time sched_dep_time dep_delay arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 1 2013 1 1 517 515 2 830
#&gt; 2 2 2013 1 1 533 529 4 850
#&gt; 3 3 2013 1 1 542 540 2 923
#&gt; 4 4 2013 1 1 544 545 -1 1004
#&gt; 5 5 2013 1 1 554 600 -6 812
#&gt; 6 6 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, and 12 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, …</pre>
</div>
<p>Surrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to ask them to look at UA430, which departed at 9am on 2013-01-03.</p>
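<p>As a rough sketch of why this is convenient, looking up a single observation by a surrogate key is a one-condition filter. The name <code>flights_id</code> is a hypothetical one chosen here so the example does not overwrite any other object:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

# Add a row-number surrogate key as the first column
flights_id &lt;- flights |&gt;
  mutate(id = row_number(), .before = 1)

# Look up one flight by its surrogate key
flights_id |&gt;
  filter(id == 2001)</pre>
</div>
<p>Keep in mind that a row-number key depends on the current row order, so it is only stable if you create it once and carry it along, rather than regenerating it after sorting or filtering.</p>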
</section>
<section id="joins-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>We forgot to draw the relationship between <code>weather</code> and <code>airports</code> in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>. What is the relationship and how should it appear in the diagram?</p></li>
<li><p><code>weather</code> only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to <code>flights</code>?</p></li>
<li><p>The <code>year</code>, <code>month</code>, <code>day</code>, <code>hour</code>, and <code>origin</code> variables almost form a compound key for <code>weather</code>, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?</p></li>
<li><p>We know that some days of the year are special and fewer people than usual fly on them (e.g. Christmas Eve and Christmas Day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?</p></li>
<li><p>Draw a diagram illustrating the connections between the <code>Batting</code>, <code>People</code>, and <code>Salaries</code> data frames in the Lahman package. Draw another diagram that shows the relationship between <code>People</code>, <code>Managers</code>, <code>AwardsManagers</code>. How would you characterise the relationship between the <code>Batting</code>, <code>Pitching</code>, and <code>Fielding</code> data frames?</p></li>
</ol></section>
</section>
<section id="sec-mutating-joins" data-type="sect1">
<h1>
Basic joins</h1>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, you’ll learn how to use one mutating join, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, and two filtering joins, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. In the next section, you’ll learn exactly how these functions work, and about the remaining <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>.</p>
<section id="mutating-joins" data-type="sect2">
<h2>
Mutating joins</h2>
<p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> to avoid this problem.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier)
flights2
#&gt; # A tibble: 336,776 × 6
#&gt; year time_hour origin dest tailnum carrier
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are four types of mutating join, but there’s one that you’ll use almost all of the time: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>. It’s special because the output will always have the same rows as <code>x</code><span data-type="footnote">That’s not 100% true, but you’ll get a warning whenever it isn’t.</span>. The primary use of <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> is to add in additional metadata. For example, we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> to add the full airline name to the <code>flights2</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  left_join(airlines)
#&gt; Joining with `by = join_by(carrier)`
#&gt; # A tibble: 336,776 × 7
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines In…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines In…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines I…
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines In…
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Or we could find out the temperature and wind speed when each plane departed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  left_join(weather |&gt; select(origin, time_hour, temp, wind_speed))
#&gt; Joining with `by = join_by(time_hour, origin)`
#&gt; # A tibble: 336,776 × 8
#&gt; year time_hour origin dest tailnum carrier temp wind_speed
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 39.0 12.7
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 39.9 15.0
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 39.0 15.0
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 39.0 15.0
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 39.9 16.1
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 39.0 12.7
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Or what size of plane was flying:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 336,776 × 9
#&gt; year time_hour origin dest tailnum carrier type
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wing multi en…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wing multi en…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wing multi en…
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wing multi en…
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wing multi en…
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wing multi en…
#&gt; # … with 336,770 more rows, and 2 more variables: engines &lt;int&gt;, seats &lt;int&gt;</pre>
</div>
<p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  filter(tailnum == "N3ALAA") |&gt;
  left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 63 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 2 2013 2013-01-02 18:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 3 2013 2013-01-03 06:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 4 2013 2013-01-07 19:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 5 2013 2013-01-08 17:00:00 JFK ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 6 2013 2013-01-16 06:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; # … with 57 more rows</pre>
</div>
<p>We’ll come back to this problem a few times in the rest of the chapter.</p>
</section>
<section id="specifying-join-keys" data-type="sect2">
<h2>
Specifying join keys</h2>
<p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> will use all variables that appear in both data frames as the join key, the so-called <strong>natural</strong> join. This is a useful heuristic, but it doesn’t always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  left_join(planes)
#&gt; Joining with `by = join_by(year, tailnum)`
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier type manufacturer
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 336,770 more rows, and 5 more variables: model &lt;chr&gt;,
#&gt; # engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;</pre>
</div>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is the year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  left_join(planes, join_by(tailnum))
#&gt; # A tibble: 336,776 × 14
#&gt; year.x time_hour origin dest tailnum carrier year.y
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012
#&gt; # … with 336,770 more rows, and 7 more variables: type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, …</pre>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
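<p>As a brief sketch of the <code>suffix</code> argument, the join can be given more descriptive suffixes. The suffix strings <code>"_flight"</code> and <code>"_plane"</code> are arbitrary choices made for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

flights2 &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier)

# The two year columns become year_flight and year_plane
flights2 |&gt;
  left_join(planes, join_by(tailnum), suffix = c("_flight", "_plane")) |&gt;
  select(tailnum, year_flight, year_plane)</pre>
</div>
<p>Suffixes are only added to non-key variables that appear in both data frames, so <code>tailnum</code> keeps its name.</p>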
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
<p>Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  left_join(airports, join_by(dest == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George Bush Interco…
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George Bush Interco…
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami Intl
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfield Jackson …
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago Ohare Intl
#&gt; # … with 336,770 more rows, and 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;,
#&gt; # alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;
flights2 |&gt;
  left_join(airports, join_by(origin == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newark Liberty Intl
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La Guardia
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John F Kennedy Intl
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John F Kennedy Intl
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La Guardia
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newark Liberty Intl
#&gt; # … with 336,770 more rows, and 6 more variables: lat &lt;dbl&gt;, lon &lt;dbl&gt;,
#&gt; # alt &lt;dbl&gt;, tz &lt;dbl&gt;, dst &lt;chr&gt;, tzone &lt;chr&gt;</pre>
</div>
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
<ul><li>
<code>by = "x"</code> corresponds to <code>join_by(x)</code>.</li>
<li>
<code>by = c("a" = "x")</code> corresponds to <code>join_by(a == x)</code>.</li>
</ul><p>Now that it exists, we prefer <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code> since it provides a clearer and more flexible specification.</p>
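<p>To make the correspondence concrete, here is a small sketch that runs the same join with both specifications and compares the results. The object names <code>old_style</code> and <code>new_style</code> are hypothetical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

flights2 &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier)

# Older character-vector specification
old_style &lt;- flights2 |&gt;
  left_join(airports, by = c("dest" = "faa"))

# Equivalent join_by() specification
new_style &lt;- flights2 |&gt;
  left_join(airports, join_by(dest == faa))

# Compare the two results
identical(old_style, new_style)</pre>
</div>
<p>In both forms the left-hand name refers to the column in <code>x</code> and the right-hand name to the column in <code>y</code>.</p>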
</section>
<section id="filtering-joins" data-type="sect2">
<h2>
Filtering joins</h2>
<p>As you might guess, the primary action of a <strong>filtering join</strong> is to filter the rows. There are two types: semi-joins and anti-joins. <strong>Semi-joins</strong> keep all rows in <code>x</code> that have a match in <code>y</code>. For example, we could use a semi-join to filter the <code>airports</code> dataset to show just the origin airports:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airports |&gt;
  semi_join(flights2, join_by(faa == origin))
#&gt; # A tibble: 3 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York
#&gt; 2 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York
#&gt; 3 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York</pre>
</div>
<p>Or just the destinations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airports |&gt;
  semi_join(flights2, join_by(faa == dest))
#&gt; # A tibble: 101 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque Internati… 35.0 -107. 5355 -7 A America/Denver
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A America/New_Yo…
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A America/New_Yo…
#&gt; 4 ANC Ted Stevens Anchorage… 61.2 -150. 152 -9 A America/Anchor…
#&gt; 5 ATL Hartsfield Jackson At… 33.6 -84.4 1026 -5 A America/New_Yo…
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A America/Chicago
#&gt; # … with 95 more rows</pre>
</div>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don’t have a match in <code>y</code>. They’re useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don’t show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don’t have a matching destination airport:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  anti_join(airports, join_by(dest == faa)) |&gt;
  distinct(dest)
#&gt; # A tibble: 4 × 1
#&gt; dest
#&gt; &lt;chr&gt;
#&gt; 1 BQN
#&gt; 2 SJU
#&gt; 3 STT
#&gt; 4 PSE</pre>
</div>
<p>Or we can find which <code>tailnum</code>s are missing from <code>planes</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
  anti_join(planes, join_by(tailnum)) |&gt;
  distinct(tailnum)
#&gt; # A tibble: 722 × 1
#&gt; tailnum
#&gt; &lt;chr&gt;
#&gt; 1 N3ALAA
#&gt; 2 N3DUAA
#&gt; 3 N542MQ
#&gt; 4 N730MQ
#&gt; 5 N9EAMQ
#&gt; 6 N532UA
#&gt; # … with 716 more rows</pre>
</div>
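<p>Because a semi-join keeps the rows of <code>x</code> that have a match and an anti-join keeps the rows that don’t, the two together should account for every row of <code>x</code> exactly once. A minimal sketch of that check follows; the names <code>matched</code> and <code>unmatched</code> are hypothetical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

flights2 &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier)

# Flights whose tail number has a row in planes
matched &lt;- flights2 |&gt;
  semi_join(planes, join_by(tailnum))

# Flights whose tail number has no row in planes
unmatched &lt;- flights2 |&gt;
  anti_join(planes, join_by(tailnum))

# The two pieces together cover flights2
nrow(matched) + nrow(unmatched) == nrow(flights2)</pre>
</div>
<p>Unlike mutating joins, filtering joins never add columns or duplicate rows, which is why this simple row count works as a check.</p>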
</section>
<section id="joins-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the <code>weather</code> data. Can you see any patterns?</p></li>
<li>
<p>Imagine you’ve found the top 10 most popular destinations using this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">top_dest &lt;- flights2 |&gt;
  count(dest, sort = TRUE) |&gt;
  head(10)
</div>
<p>How can you find all flights to those destinations?</p>
</li>
<li><p>Does every departing flight have corresponding weather data for that hour?</p></li>
<li><p>What do the tail numbers that don’t have a matching record in <code>planes</code> have in common? (Hint: one variable explains ~90% of the problems.)</p></li>
<li><p>Add a column to <code>planes</code> that lists every <code>carrier</code> that has flown that plane. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned in previous chapters.</p></li>
<li><p>Add the latitude and the longitude of the origin <em>and</em> destination airport to <code>flights</code>. Is it easier to rename the columns before or after the join?</p></li>
<li>
<p>Compute the average delay by destination, then join on the <code>airports</code> data frame so you can show the spatial distribution of delays. Here's an easy way to draw a map of the United States:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">airports |&gt;
semi_join(flights, join_by(faa == dest)) |&gt;
ggplot(aes(x = lon, y = lat)) +
borders("state") +
geom_point() +
coord_quickmap()</pre>
</div>
<p>You might want to use the <code>size</code> or <code>color</code> of the points to display the average delay for each airport.</p>
</li>
<li><p>What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.</p></li>
</ol></section>
</section>
<section id="how-do-joins-work" data-type="sect1">
<h1>
How do joins work?</h1>
<p>Now that you've used joins a few times, it's time to learn more about how they work, focusing on how each row in <code>x</code> matches rows in <code>y</code>. We'll begin by using <a href="#fig-join-setup" data-type="xref">#fig-join-setup</a> to introduce a visual representation of the two simple tibbles defined below. In these examples we'll use a single key called <code>key</code> and a single value column (<code>val_x</code> and <code>val_y</code>), but the ideas all generalize to multiple keys and multiple values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y &lt;- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)</pre>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-setup"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are colored: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
<figcaption>Graphical representation of two simple tables. The colored <code>key</code> columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
</figure>
</div>
</div>
<p><a href="#fig-join-setup2" data-type="xref">#fig-join-setup2</a> shows all potential matches between <code>x</code> and <code>y</code> as the intersection between lines drawn from each row of <code>x</code> and each row of <code>y</code>. The rows and columns in the output are primarily determined by <code>x</code>, so the <code>x</code> table is horizontal and lines up with the output.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-setup2"><p><img src="diagrams/join/setup2.png" alt="x and y are placed at right-angles, with horizontal lines extending from x and vertical lines extending from y. There are 3 rows in x and 3 rows in y, which leads to nine intersections representing nine potential matches." width="170"/></p>
<figcaption>To understand how joins work, it's useful to think of every possible match. Here we show that with a grid of connecting lines.</figcaption>
</figure>
</div>
</div>
<p>In an actual join, matches will be indicated with dots, as in <a href="#fig-join-inner" data-type="xref">#fig-join-inner</a>. The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values. The join shown here is a so-called <strong>equi</strong> <strong>inner join</strong>, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both <code>x</code> and <code>y</code>. Equi-joins are the most common type of join, so we'll typically omit the “equi” prefix and just call it an inner join. We'll come back to non-equi joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-inner"><p><img src="diagrams/join/inner.png" alt="x and y are placed at right-angles with lines forming a grid of potential matches. Keys 1 and 2 appear in both x and y, so we get a match, indicated by a dot. Each dot corresponds to a row in the output, so the resulting joined data frame has two rows." width="363"/></p>
<figcaption>An inner join matches each row in <code>x</code> to the row in <code>y</code> that has the same value of <code>key</code>. Each match becomes a row in the output.</figcaption>
</figure>
</div>
</div>
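<p>To connect the diagram to code, here is a minimal sketch of the inner join on the <code>x</code> and <code>y</code> tibbles defined above (they are redefined so the block runs on its own; the name <code>inner</code> is introduced here only for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y &lt;- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Only keys 1 and 2 appear in both tables, so the result has two rows
inner &lt;- x |&gt; inner_join(y, join_by(key))
inner</pre>
</div>
<p>Each of the two dots in the diagram corresponds to one row of <code>inner</code>.</p>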
<p>An <strong>outer join</strong> keeps observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with <code>NA</code>. There are three types of outer joins:</p>
<ul><li>
<p>A <strong>left join</strong> keeps all observations in <code>x</code>, <a href="#fig-join-left" data-type="xref">#fig-join-left</a>. Every row of <code>x</code> is preserved in the output because it can fall back to matching a row of <code>NA</code>s in <code>y</code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-left"><p><img src="diagrams/join/left.png" alt="Compared to the previous diagram showing an inner join, the y table gets a new virtual row containing NA that will match any row in x that didn't otherwise match. This means that the output now has three rows. For key = 3, which matches this virtual row, val_y takes value NA." width="385"/></p>
<figcaption>A visual representation of the left join where every row in <code>x</code> appears in the output.</figcaption>
</figure>
</div>
</div>
</li>
<li>
<p>A <strong>right join</strong> keeps all observations in <code>y</code>, <a href="#fig-join-right" data-type="xref">#fig-join-right</a>. Every row of <code>y</code> is preserved in the output because it can fall back to matching a row of <code>NA</code>s in <code>x</code>. The output still matches <code>x</code> as much as possible; any extra rows from <code>y</code> are added to the end.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-right"><p><img src="diagrams/join/right.png" alt="Compared to the previous diagram showing a left join, the x table now gains a virtual row so that every row in y gets a match in x. val_x contains NA for the row in y that didn't match x." width="380"/></p>
<figcaption>A visual representation of the right join where every row of <code>y</code> appears in the output.</figcaption>
</figure>
</div>
</div>
</li>
<li>
<p>A <strong>full join</strong> keeps all observations that appear in <code>x</code> or <code>y</code>, <a href="#fig-join-full" data-type="xref">#fig-join-full</a>. Every row of <code>x</code> and <code>y</code> is included in the output because both <code>x</code> and <code>y</code> have a fallback row of <code>NA</code>s. Again, the output starts with all rows from <code>x</code>, followed by the remaining unmatched <code>y</code> rows.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-full"><p><img src="diagrams/join/full.png" alt="Now both x and y have a virtual row that always matches. The result has 4 rows: keys 1, 2, 3, and 4 with all values from val_x and val_y. However, val_y for key 3 and val_x for key 4 are NAs, since those keys don't have a match in the other data frame." width="388"/></p>
<figcaption>A visual representation of the full join where every row in <code>x</code> and <code>y</code> appears in the output.</figcaption>
</figure>
</div>
</div>
</li>
</ul><p>Another way to show how the types of outer join differ is with a Venn diagram, as in <a href="#fig-join-venn" data-type="xref">#fig-join-venn</a>. However, this is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what's happening with the columns.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-venn"><p><img src="diagrams/join/venn.png" alt="Venn diagrams for inner, full, left, and right joins. Each join represented with two intersecting circles representing data frames x and y, with x on the left and y on the right. Shading indicates the result of the join." width="385"/></p>
<figcaption>Venn diagrams showing the difference between inner, left, right, and full joins.</figcaption>
</figure>
</div>
</div>
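<p>The three outer joins can also be compared directly in code. The following is a minimal sketch on the same <code>x</code> and <code>y</code> tibbles (redefined so the block is self-contained; the result names <code>left</code>, <code>right</code>, and <code>full</code> are introduced here for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y &lt;- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

# Keeps keys 1, 2, 3; val_y is NA for key 3
left &lt;- x |&gt; left_join(y, join_by(key))

# Keeps keys 1, 2, 4; val_x is NA for key 4
right &lt;- x |&gt; right_join(y, join_by(key))

# Keeps keys 1, 2, 3, 4; NA wherever the other table had no match
full &lt;- x |&gt; full_join(y, join_by(key))

left
right
full</pre>
</div>
<p>Unlike the Venn diagram, printing the results shows both which rows survive and where the <code>NA</code>s land in the value columns.</p>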
<section id="row-matching" data-type="sect2">
<h2>
Row matching</h2>
<p>So far we've explored what happens if a row in <code>x</code> matches zero or one rows in <code>y</code>. What happens if it matches more than one row? To understand what's going on, let's first narrow our focus to <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> and then draw a picture, <a href="#fig-join-match-types" data-type="xref">#fig-join-match-types</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-match-types"><p><img src="diagrams/join/match-types.png" alt="A join diagram where x has key values 1, 2, and 3, and y has key values 1, 2, 2. The output has three rows because key 1 matches one row, key 2 matches two rows, and key 3 matches zero rows." width="348"/></p>
<figcaption>The three ways a row in <code>x</code> can match. <code>x1</code> matches one row in <code>y</code>, <code>x2</code> matches two rows in <code>y</code>, <code>x3</code> matches zero rows in <code>y</code>. Note that while there are three rows in <code>x</code> and three rows in the output, there isn't a direct correspondence between the rows.</figcaption>
</figure>
</div>
</div>
<p>There are three possible outcomes for a row in <code>x</code>:</p>
<ul><li>If it doesn't match anything, it's dropped.</li>
<li>If it matches 1 row in <code>y</code>, it's preserved.</li>
<li>If it matches more than 1 row in <code>y</code>, it's duplicated once for each match.</li>
</ul><p>In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in <code>x</code>:</p>
<ul><li>There might be fewer rows if some rows in <code>x</code> don't match any rows in <code>y</code>.</li>
<li>There might be more rows if some rows in <code>x</code> match multiple rows in <code>y</code>.</li>
<li>There might be the same number of rows if every row in <code>x</code> matches exactly one row in <code>y</code>.</li>
<li>There might be the same number of rows if some rows in <code>x</code> don't match any rows in <code>y</code> and exactly the same number of rows in <code>x</code> match two rows in <code>y</code>.</li>
</ul><p>Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it. To avoid this problem, dplyr will warn whenever there are multiple matches:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1 &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 &lt;- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
df1 |&gt;
inner_join(df2, join_by(key))
#&gt; Warning in inner_join(df1, df2, join_by(key)): Each row in `x` is expected to match at most 1 row in `y`.
#&gt; Row 2 of `x` matches multiple rows.
#&gt; If multiple matches are expected, set `multiple = "all"` to silence this
#&gt; warning.
#&gt; # A tibble: 3 × 3
#&gt; key val_x val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 x1 y1
#&gt; 2 2 x2 y2
#&gt; 3 2 x2 y3</pre>
</div>
<p>This is one reason we like <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> — if it runs without warning, you know that each row of the output matches the row in the same position in <code>x</code>.</p>
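<p>A complementary check that doesn't rely on warnings is to look for duplicated keys in <code>y</code> before joining, and to compare row counts afterwards. The block below is a minimal sketch using the same <code>df1</code> and <code>df2</code> as above; the names <code>dup_keys</code> and <code>joined</code> are introduced here for illustration, and <code>multiple = "all"</code> is set only so the join runs quietly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

df1 &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 &lt;- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# Keys that occur more than once in df2 will duplicate rows of df1
dup_keys &lt;- df2 |&gt;
  count(key) |&gt;
  filter(n &gt; 1)
dup_keys

# Compare row counts before and after the join
joined &lt;- df1 |&gt; left_join(df2, join_by(key), multiple = "all")
nrow(joined) == nrow(df1)</pre>
</div>
<p>Here key 2 is flagged as duplicated, and the row-count comparison is <code>FALSE</code> because the left join returns four rows for the three rows of <code>df1</code>.</p>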
<p>You can gain further control over row matching with two arguments:</p>
<ul><li>
<code>unmatched</code> controls what happens when a row in <code>x</code> fails to match any rows in <code>y</code>. It defaults to <code>"drop"</code> which will silently drop any unmatched rows.</li>
<li>
<code>multiple</code> controls what happens when a row in <code>x</code> matches more than one row in <code>y</code>. For equi-joins, it defaults to <code>"warn"</code> which emits a warning message if any rows have multiple matches.</li>
</ul><p>There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.</p>
</section>
<section id="one-to-one-mapping" data-type="sect2">
<h2>
One-to-one mapping</h2>
<p>Both <code>unmatched</code> and <code>multiple</code> can take value <code>"error"</code> which means that the join will fail unless each row in <code>x</code> matches exactly one row in <code>y</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1 &lt;- tibble(x = 1)
df2 &lt;- tibble(x = c(1, 1))
df3 &lt;- tibble(x = 3)
df1 |&gt;
inner_join(df2, join_by(x), unmatched = "error", multiple = "error")
#&gt; Error in `inner_join()`:
#&gt; ! Each row in `x` must match at most 1 row in `y`.
#&gt; Row 1 of `x` matches multiple rows.
df1 |&gt;
inner_join(df3, join_by(x), unmatched = "error", multiple = "error")
#&gt; Error in `inner_join()`:
#&gt; ! Each row of `x` must have a match in `y`.
#&gt; Row 1 of `x` does not have a match.</pre>
</div>
<p>Note that <code>unmatched = "error"</code> is not useful with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> because, as described above, every row in <code>x</code> has a fallback match to a virtual row in <code>y</code>.</p>
</section>
<section id="allow-multiple-rows" data-type="sect2">
<h2>
Allow multiple rows</h2>
<p>Sometimes it's useful to deliberately expand the number of rows in the output. This can come about naturally if you “flip” the direction of the question you're asking. For example, as we've seen above, it's natural to supplement the <code>flights</code> data with information about the plane that flew each flight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights2 |&gt;
left_join(planes, by = "tailnum")</pre>
</div>
<p>But it's also reasonable to ask which flights each plane flew:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">plane_flights &lt;- planes |&gt;
select(tailnum, type, engines, seats) |&gt;
left_join(flights2, by = "tailnum")
#&gt; Warning in left_join(select(planes, tailnum, type, engines, seats), flights2, : Each row in `x` is expected to match at most 1 row in `y`.
#&gt; Row 1 of `x` matches multiple rows.
#&gt; If multiple matches are expected, set `multiple = "all"` to silence this
#&gt; warning.</pre>
</div>
<p>Since this duplicates rows in <code>x</code> (the planes), we need to explicitly say that we're OK with the multiple matches by setting <code>multiple = "all"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">plane_flights &lt;- planes |&gt;
select(tailnum, type, engines, seats) |&gt;
left_join(flights2, by = "tailnum", multiple = "all")
plane_flights
#&gt; # A tibble: 284,170 × 9
#&gt; tailnum type engines seats year time_hour origin
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 06:00:00 EWR
#&gt; 2 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 10:00:00 EWR
#&gt; 3 N10156 Fixed wing multi en… 2 55 2013 2013-01-10 15:00:00 EWR
#&gt; 4 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 06:00:00 EWR
#&gt; 5 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 11:00:00 EWR
#&gt; 6 N10156 Fixed wing multi en… 2 55 2013 2013-01-11 18:00:00 EWR
#&gt; # … with 284,164 more rows, and 2 more variables: dest &lt;chr&gt;, carrier &lt;chr&gt;</pre>
</div>
</section>
<section id="sec-non-equi-joins" data-type="sect2">
<h2>
Filtering joins</h2>
<p>The number of matches also determines the behavior of the filtering joins. The semi-join keeps rows in <code>x</code> that have one or more matches in <code>y</code>, as in <a href="#fig-join-semi" data-type="xref">#fig-join-semi</a>. The anti-join keeps rows in <code>x</code> that match zero rows in <code>y</code>, as in <a href="#fig-join-anti" data-type="xref">#fig-join-anti</a>. In both cases, only the existence of a match is important; it doesn't matter how many times it matches. This means that filtering joins never duplicate rows like mutating joins do.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-semi"><p><img src="diagrams/join/semi.png" alt="A join diagram with old friends x and y. In a semi join, only the presence of a match matters so the output contains the same columns as x." width="318"/></p>
<figcaption>In a semi-join it only matters that there is a match; otherwise values in <code>y</code> don't affect the output.</figcaption>
</figure>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-anti"><p><img src="diagrams/join/anti.png" alt="An anti-join is the inverse of a semi-join so matches are drawn with red lines indicating that they will be dropped from the output." width="317"/></p>
<figcaption>An anti-join is the inverse of a semi-join, dropping rows from <code>x</code> that have a match in <code>y</code>.</figcaption>
</figure>
</div>
</div>
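<p>The claim that filtering joins never duplicate rows can be checked on a small example. This sketch reuses the <code>df1</code> and <code>df2</code> tibbles from the row-matching section, where key 2 appears twice in <code>df2</code>; the names <code>kept</code> and <code>dropped</code> are introduced here for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

df1 &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 &lt;- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# Keys 1 and 2 have at least one match; key 2 still appears only once
kept &lt;- df1 |&gt; semi_join(df2, join_by(key))

# Key 3 has no match in df2
dropped &lt;- df1 |&gt; anti_join(df2, join_by(key))

kept
dropped</pre>
</div>
<p>Together, <code>kept</code> and <code>dropped</code> partition the rows of <code>df1</code>, and neither gains any columns from <code>df2</code>.</p>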
</section>
</section>
<section id="non-equi-joins" data-type="sect1">
<h1>
Non-equi joins</h1>
<p>So far you've only seen equi-joins, joins where the rows match if the <code>x</code> key equals the <code>y</code> key. Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.</p>
<p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> keys and <code>y</code> keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x |&gt; left_join(y, by = "key", keep = TRUE)
#&gt; # A tibble: 3 × 4
#&gt; key.x val_x key.y val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 1 x1 1 y1
#&gt; 2 2 x2 2 y2
#&gt; 3 3 x3 NA &lt;NA&gt;</pre>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-inner-both"><p><img src="diagrams/join/inner-both.png" alt="A join diagram showing an inner join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one." width="415"/></p>
<figcaption>A left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
</figure>
</div>
</div>
<p>When we move away from equi-joins we'll always show the keys, because the key values will often be different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr's join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-gte"><p><img src="diagrams/join/gte.png" alt="A join diagram illustrating join_by(key &gt;= key). The first row of x matches one row of y and the second and third rows each match two rows. This means the output has five rows containing each of the following (key.x, key.y) pairs: (1, 1), (2, 1), (2, 2), (3, 1), (3, 2)." width="385"/></p>
<figcaption>A non-equi join where the <code>x</code> key must be greater than or equal to the <code>y</code> key. Many rows generate multiple matches.</figcaption>
</figure>
</div>
</div>
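<p>The join in the figure can be written directly with <code>join_by()</code>. The following is a minimal sketch on the <code>x</code> and <code>y</code> tibbles from earlier in the chapter (redefined so the block runs on its own; <code>gte</code> is a name introduced here for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y &lt;- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

# Match whenever the x key is greater than or equal to the y key.
# Both keys are kept automatically, as key.x and key.y.
gte &lt;- x |&gt; inner_join(y, join_by(key &gt;= key))
gte</pre>
</div>
<p>The result should contain the five (<code>key.x</code>, <code>key.y</code>) pairs (1, 1), (2, 1), (2, 2), (3, 1), and (3, 2); the <code>y</code> row with key 4 never matches because no <code>x</code> key is that large.</p>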
<p>Non-equi-join isn't a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi-join:</p>
<ul><li>
<strong>Cross joins</strong> match every pair of rows.</li>
<li>
<strong>Inequality joins</strong> use <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, and <code>&gt;=</code> instead of <code>==</code>.</li>
<li>
<strong>Rolling joins</strong> are similar to inequality joins but only find the closest match.</li>
<li>
<strong>Overlap joins</strong> are a special type of inequality join designed to work with ranges.</li>
</ul><p>Each of these is described in more detail in the following sections.</p>
<section id="cross-joins" data-type="sect2">
<h2>
Cross joins</h2>
<p>A cross join matches everything, as in <a href="#fig-join-cross" data-type="xref">#fig-join-cross</a>, generating the Cartesian product of rows. This means the output will have <code>nrow(x) * nrow(y)</code> rows.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-cross"><p><img src="diagrams/join/cross.png" alt="A join diagram showing a dot for every combination of x and y." width="155"/></p>
<figcaption>A cross join matches each row in <code>x</code> with every row in <code>y</code>.</figcaption>
</figure>
</div>
</div>
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we're joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>. Cross joins use a different join function because there's no distinction between inner/left/right/full when you're matching every row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(name = c("John", "Simon", "Tracy", "Max"))
df |&gt; cross_join(df)
#&gt; # A tibble: 16 × 2
#&gt; name.x name.y
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 John John
#&gt; 2 John Simon
#&gt; 3 John Tracy
#&gt; 4 John Max
#&gt; 5 Simon John
#&gt; 6 Simon Simon
#&gt; # … with 10 more rows</pre>
</div>
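<p>As a quick check of the row-count claim, here is a sketch that cross joins the <code>x</code> and <code>y</code> tibbles used earlier in the chapter (redefined so the block is self-contained; <code>pairs</code> is a name introduced here for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- tribble(~key, ~val_x, 1, "x1", 2, "x2", 3, "x3")
y &lt;- tribble(~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

# Every row of x is paired with every row of y
pairs &lt;- cross_join(x, y)

# The output has nrow(x) * nrow(y) rows
nrow(pairs) == nrow(x) * nrow(y)</pre>
</div>
<p>Because the row count is a product, cross joins grow quickly, so it's worth estimating the output size before running one on large tables.</p>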
</section>
<section id="inequality-joins" data-type="sect2">
<h2>
Inequality joins</h2>
<p>Inequality joins use <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, or <code>&gt;</code> to restrict the set of possible matches, as in <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a> and <a href="#fig-join-lt" data-type="xref">#fig-join-lt</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-lt"><p><img src="diagrams/join/lt.png" width="185"/></p>
<figcaption>An inequality join where <code>x</code> is joined to <code>y</code> on rows where the key of <code>x</code> is less than the key of <code>y</code>. This makes a triangular shape in the top-left corner.</figcaption>
</figure>
</div>
</div>
<p>Inequality joins are extremely general, so general that it's hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
df |&gt; left_join(df, join_by(id &lt; id))
#&gt; # A tibble: 7 × 4
#&gt; id.x name.x id.y name.y
#&gt; &lt;int&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1 John 2 Simon
#&gt; 2 1 John 3 Tracy
#&gt; 3 1 John 4 Max
#&gt; 4 2 Simon 3 Tracy
#&gt; 5 2 Simon 4 Max
#&gt; 6 3 Tracy 4 Max
#&gt; # … with 1 more row</pre>
</div>
</section>
<section id="rolling-joins" data-type="sect2">
<h2>
Rolling joins</h2>
<p>Rolling joins are a special type of inequality join where instead of getting <em>every</em> row that satisfies the inequality, you get just the closest row, as in <a href="#fig-join-closest" data-type="xref">#fig-join-closest</a>. You can turn any inequality join into a rolling join by adding <code>closest()</code>. For example <code>join_by(closest(x &lt;= y))</code> matches the smallest <code>y</code> that's greater than or equal to <code>x</code>, and <code>join_by(closest(x &gt; y))</code> matches the biggest <code>y</code> that's less than <code>x</code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-closest"><p><img src="diagrams/join/closest.png" alt="A rolling join is a subset of an inequality join so some matches are grayed out indicating that they're not used because they're not the &quot;closest&quot;." width="262"/></p>
<figcaption>A rolling join is similar to a greater-than-or-equal inequality join but only matches the closest value.</figcaption>
</figure>
</div>
</div>
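<p>The two <code>closest()</code> forms described above can be tried on a tiny example. The tibbles <code>a</code> and <code>b</code> below are hypothetical, made up only to illustrate the matching rule; their columns are named <code>x</code> and <code>y</code> to mirror the expressions in the text:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data, invented for illustration
a &lt;- tibble(x = c(1, 5, 10))
b &lt;- tibble(y = c(0, 4, 8))

# Smallest y that is greater than or equal to x:
# 1 matches 4, 5 matches 8, and 10 has no match so it is dropped
up &lt;- a |&gt; inner_join(b, join_by(closest(x &lt;= y)))

# Biggest y that is less than x:
# 1 matches 0, 5 matches 4, 10 matches 8
down &lt;- a |&gt; inner_join(b, join_by(closest(x &gt; y)))

up
down</pre>
</div>
<p>Without <code>closest()</code>, the second join would return every smaller <code>y</code> for each <code>x</code> (six rows rather than three).</p>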
<p>Rolling joins are particularly useful when you have two tables of dates that don't perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.</p>
<p>For example, imagine that you're in charge of the party planning commission for your office. Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">parties &lt;- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)</pre>
</div>
<p>Now imagine that you have a table of employee birthdays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees &lt;- tibble(
name = sample(babynames::babynames$name, 100),
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
#&gt; # A tibble: 100 × 2
#&gt; name birthday
#&gt; &lt;chr&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13
#&gt; 2 Shonnie 2022-03-30
#&gt; 3 Burnard 2022-01-10
#&gt; 4 Omer 2022-11-25
#&gt; 5 Hillel 2022-07-30
#&gt; 6 Curlie 2022-12-11
#&gt; # … with 94 more rows</pre>
</div>
<p>And for each employee we want to find the most recent party date that falls on or before their birthday. We can express that with a rolling join:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees |&gt;
left_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 100 × 4
#&gt; name birthday q party
#&gt; &lt;chr&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13 3 2022-07-11
#&gt; 2 Shonnie 2022-03-30 1 2022-01-10
#&gt; 3 Burnard 2022-01-10 1 2022-01-10
#&gt; 4 Omer 2022-11-25 4 2022-10-03
#&gt; 5 Hillel 2022-07-30 3 2022-07-11
#&gt; 6 Curlie 2022-12-11 4 2022-10-03
#&gt; # … with 94 more rows</pre>
</div>
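<p>There is, however, one problem with this approach: anyone with a birthday before January 10 doesn't get a party. We can look for those employees with an anti-join (in the random sample shown here, the result happens to be empty):</p>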
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees |&gt;
anti_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 0 × 2
#&gt; # … with 2 variables: name &lt;chr&gt;, birthday &lt;date&gt;</pre>
</div>
<p>To resolve that issue we'll need to tackle the problem a different way, with overlap joins.</p>
</section>
<section id="overlap-joins" data-type="sect2">
<h2>
Overlap joins</h2>
<p>Overlap joins provide three helpers that use inequality joins to make it easier to work with intervals:</p>
<ul><li>
<code>between(x, y_lower, y_upper)</code> is short for <code>x &gt;= y_lower, x &lt;= y_upper</code>.</li>
<li>
<code>within(x_lower, x_upper, y_lower, y_upper)</code> is short for <code>x_lower &gt;= y_lower, x_upper &lt;= y_upper</code>.</li>
<li>
<code>overlaps(x_lower, x_upper, y_lower, y_upper)</code> is short for <code>x_lower &lt;= y_upper, x_upper &gt;= y_lower</code>.</li>
</ul><p>Let's continue the birthday example to see how you might use them. There's one problem with the strategy we used above: there's no party preceding the birthdays from January 1 to 9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">parties &lt;- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
)
parties
#&gt; # A tibble: 4 × 4
#&gt; q party start end
#&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 2 2 2022-04-04 2022-04-04 2022-07-11
#&gt; 3 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 4 4 2022-10-03 2022-10-03 2022-12-31</pre>
</div>
<p>Hadley is hopelessly bad at data entry, so he also wanted to check that the party periods don't overlap. One way to do this is to use a self-join to check whether any start-end interval overlaps with another:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">parties |&gt;
inner_join(parties, join_by(overlaps(start, end, start, end), q &lt; q)) |&gt;
select(start.x, end.x, start.y, end.y)
#&gt; # A tibble: 1 × 4
#&gt; start.x end.x start.y end.y
#&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 2022-04-04 2022-07-11 2022-07-11 2022-10-02</pre>
</div>
<p>Oops, there is an overlap, so let's fix that problem and continue:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">parties &lt;- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)</pre>
</div>
<p>Now we can match each employee to their party. This is a good place to use <code>unmatched = "error"</code> because we want to quickly find out if any employees didn't get assigned a party.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">employees |&gt;
inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
#&gt; # A tibble: 100 × 6
#&gt; name birthday q party start end
#&gt; &lt;chr&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 Case 2022-09-13 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 2 Shonnie 2022-03-30 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 3 Burnard 2022-01-10 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 4 Omer 2022-11-25 4 2022-10-03 2022-10-03 2022-12-31
#&gt; 5 Hillel 2022-07-30 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 6 Curlie 2022-12-11 4 2022-10-03 2022-10-03 2022-12-31
#&gt; # … with 94 more rows</pre>
</div>
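<p>To see how the three helpers differ, it can help to run them side by side on numeric ranges. The sketch below uses hypothetical tibbles (<code>points</code>, <code>segs</code>, and <code>wins</code>, all invented for illustration) rather than the party data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data, invented for illustration
points &lt;- tibble(p = c(2, 7, 11))
segs &lt;- tibble(id = 1:3, lo = c(1, 4, 8), hi = c(3, 6, 12))
wins &lt;- tibble(w = c("a", "b"), start = c(0, 5), end = c(5, 10))

# between(): a single x value falls inside a y range.
# 2 falls in window a, 7 falls in window b, 11 falls in neither.
in_window &lt;- points |&gt;
  inner_join(wins, join_by(between(p, start, end)))

# within(): the whole x range lies inside a y range.
# Only segment 1 (1 to 3) sits entirely inside a window (a, 0 to 5).
contained &lt;- segs |&gt;
  inner_join(wins, join_by(within(lo, hi, start, end)))

# overlaps(): the x range and the y range share at least one point.
# Segment 1 touches a, segment 2 touches a and b, segment 3 touches b.
touching &lt;- segs |&gt;
  inner_join(wins, join_by(overlaps(lo, hi, start, end)))

in_window
contained
touching</pre>
</div>
<p>The three results have two, one, and four rows respectively, which reflects how much stricter <code>within()</code> is than <code>overlaps()</code> on the same ranges.</p>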
</section>
<section id="joins-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Can you explain what's happening with the keys in this equi-join? Why are they different?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x |&gt; full_join(y, by = "key")
#&gt; # A tibble: 4 × 3
#&gt; key val_x val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 x1 y1
#&gt; 2 2 x2 y2
#&gt; 3 3 x3 &lt;NA&gt;
#&gt; 4 4 &lt;NA&gt; y3
x |&gt; full_join(y, by = "key", keep = TRUE)
#&gt; # A tibble: 4 × 4
#&gt; key.x val_x key.y val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 1 x1 1 y1
#&gt; 2 2 x2 2 y2
#&gt; 3 3 x3 NA &lt;NA&gt;
#&gt; 4 NA &lt;NA&gt; 4 y3</pre>
</div>
</li>
<li><p>When finding if any party period overlapped with another party period, we used <code>q &lt; q</code> in the <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>. Why? What happens if you remove this inequality?</p></li>
</ol></section>
</section>
<section id="joins-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.</p>
<p>This chapter concludes the “Transform” part of the book where the focus was on the tools you could use with individual columns and tibbles. You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.</p>
<p>In the next part of the book, you’ll learn more about getting various types of data into R in a tidy form.</p>
</section>
</section>

@ -1,728 +0,0 @@
<section data-type="chapter" id="chp-layers">
<h1><span id="sec-layers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Layers</span></span></h1>
<section id="layers-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In the <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a>, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make <em>any</em> type of plot with ggplot2.</p>
<p>In this chapter, you’ll expand on that foundation as you learn about the layered grammar of graphics. We’ll start with a deeper dive into aesthetic mappings, geometric objects, and facets. Then, you will learn about statistical transformations ggplot2 makes under the hood when creating a plot. These transformations are used to calculate new values to plot, such as the heights of bars in a bar plot or medians in a box plot. You will also learn about position adjustments, which modify how geoms are displayed in your plots. Finally, we’ll briefly introduce coordinate systems.</p>
<p>We will not cover every single function and option for each of these layers, but we will walk you through the most important and commonly used functionality provided by ggplot2 as well as introduce you to packages that extend ggplot2.</p>
<section id="layers-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="aesthetic-mappings" data-type="sect1">
<h1>
Aesthetic mappings</h1>
<blockquote class="blockquote">
<p>“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey</p>
</blockquote>
<p>The <code>mpg</code> data frame that is bundled with the ggplot2 package contains 234 observations collected by the US Environmental Protection Agency on 38 car models.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mpg
#&gt; # A tibble: 234 × 11
#&gt; manufacturer model displ year cyl trans drv cty hwy fl
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
#&gt; 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p
#&gt; 3 audi a4 2 2008 4 manual(m6) f 20 31 p
#&gt; 4 audi a4 2 2008 4 auto(av) f 21 30 p
#&gt; 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p
#&gt; 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p
#&gt; # … with 228 more rows, and 1 more variable: class &lt;chr&gt;</pre>
</div>
<p>Among the variables in <code>mpg</code> are:</p>
<ol type="1"><li><p><code>displ</code>: A car’s engine size, in liters. A numerical variable.</p></li>
<li><p><code>hwy</code>: A car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance. A numerical variable.</p></li>
<li><p><code>class</code>: Type of car. A categorical variable.</p></li>
</ol><p>You can learn about <code>mpg</code> on its help page by running <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code>.</p>
<p>Let’s start by visualizing the relationship between <code>displ</code> and <code>hwy</code> for various <code>class</code>es of cars. We can do this with a scatterplot where the numerical variables are mapped to the <code>x</code> and <code>y</code> aesthetics and the categorical variable is mapped to an aesthetic like <code>color</code> or <code>shape</code>.</p>
<div>
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point()
# Right
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
geom_point()
#&gt; Warning: The shape palette can deal with a maximum of 6 discrete values
#&gt; because more than 6 becomes difficult to discriminate; you have 7.
#&gt; Consider specifying shapes manually if you must have them.
#&gt; Warning: Removed 62 rows containing missing values (`geom_point()`).</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-4-1.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the color aesthetic, resulting in different colors for each class. In the plot on the right class is mapped to the shape aesthetic, resulting in different plotting character shapes for each class, except for suv. Each plot comes with a legend that shows the mapping between color or shape and levels of the class variable." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-4-2.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the color aesthetic, resulting in different colors for each class. In the plot on the right class is mapped to the shape aesthetic, resulting in different plotting character shapes for each class, except for suv. Each plot comes with a legend that shows the mapping between color or shape and levels of the class variable." width="384"/></p>
</div>
</div>
</div>
</div>
<p>When <code>class</code> is mapped to <code>shape</code>, we get two warnings:</p>
<blockquote class="blockquote">
<p>1: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you must have them.</p>
<p>2: Removed 62 rows containing missing values (<code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>).</p>
</blockquote>
<p>Since ggplot2 will only use six shapes at a time, by default, additional groups will go unplotted when you use the shape aesthetic. The second warning is related: there are 62 SUVs in the dataset and they’re not plotted.</p>
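<p>The first warning suggests specifying shapes manually. A minimal sketch of that, assuming <code>scale_shape_manual()</code> with seven shape codes chosen arbitrarily for illustration (one per level of <code>class</code>), might look like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Supply one shape code per class so that no group is left unplotted.
# The codes 0 to 6 are an arbitrary illustrative choice.
p &lt;- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = 0:6)
p</pre>
</div>
<p>Seven shapes are still hard to tell apart, which is why the warning discourages this unless you really need it.</p>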
<p>Similarly, we can map <code>class</code> to <code>size</code> or <code>alpha</code> (transparency) aesthetics as well.</p>
<div>
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = displ, y = hwy, size = class)) +
geom_point()
#&gt; Warning: Using size for a discrete variable is not advised.
# Right
ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) +
geom_point()
#&gt; Warning: Using alpha for a discrete variable is not advised.</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-5-1.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right class is mapped to the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-5-2.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right class is mapped to the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable." width="384"/></p>
</div>
</div>
</div>
</div>
<p>Both of these produce warnings as well:</p>
<blockquote class="blockquote">
<p>Using alpha for a discrete variable is not advised.</p>
</blockquote>
<p>Mapping a non-ordinal discrete (categorical) variable (<code>class</code>) to an ordered aesthetic (<code>size</code> or <code>alpha</code>) is generally not a good idea because it implies a ranking that does not in fact exist.</p>
<p>Similarly, we could have mapped <code>class</code> to the <code>alpha</code> aesthetic, which controls the transparency of the points, or to the <code>shape</code> aesthetic, which controls the shape of the points.</p>
<p>Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.</p>
<p>You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(color = "blue")</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-6-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue." width="576"/></p>
</div>
</div>
<p>Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. You can set an aesthetic manually by name as an argument of your geom function. In other words, it goes <em>outside</em> of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>. You’ll need to pick a value that makes sense for that aesthetic:</p>
<ul><li>The name of a color as a character string.</li>
<li>The size of a point in mm.</li>
<li>The shape of a point as a number, as shown in <a href="#fig-shapes" data-type="xref">#fig-shapes</a>.</li>
</ul><div class="cell" data-layout-align="center">
<div class="cell-output-display">
<figure id="fig-shapes"><p><img src="layers_files/figure-html/fig-shapes-1.png" alt="Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue." width="576"/></p>
<figcaption>R has 26 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the <code>color</code> and <code>fill</code> aesthetics. The hollow shapes (0–14) have a border determined by <code>color</code>; the solid shapes (15–20) are filled with <code>color</code>; the filled shapes (21–25) have a border of <code>color</code> and are filled with <code>fill</code>.</figcaption>
</figure>
</div>
</div>
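<p>To see how <code>color</code> and <code>fill</code> interact for one of the filled shapes, here is a small sketch. The shape number, colors, and size are arbitrary values picked for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Shape 21 is a filled circle: `color` sets the border, `fill` sets the interior.
# All four values are set outside aes(), so they apply to every point.
p &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, color = "black", fill = "orange", size = 3)
p</pre>
</div>
<p>Swapping in a hollow shape such as <code>1</code> should make the <code>fill</code> value have no visible effect, since only <code>color</code> applies to those shapes.</p>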
<p>So far we have discussed aesthetics that we can map or set in a scatterplot, when using a point geom. You can learn more about all possible aesthetic mappings in the aesthetic specifications vignette at <a href="https://ggplot2.tidyverse.org/articles/ggplot2-specs.html" class="uri">https://ggplot2.tidyverse.org/articles/ggplot2-specs.html</a>.</p>
<p>The specific aesthetics you can use for a plot depend on the geom you use to represent the data. In the next section we dive deeper into geoms.</p>
<section id="layers-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create a scatterplot of <code>hwy</code> vs. <code>displ</code> where the points are pink filled in triangles.</p></li>
<li>
<p>Why did the following code not result in a plot with blue points?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
geom_point(aes(x = displ, y = hwy, color = "blue"))</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-8-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are red and the legend shows a red point that is mapped to the word blue." width="576"/></p>
</div>
</div>
</li>
<li><p>What does the <code>stroke</code> aesthetic do? What shapes does it work with? (Hint: use <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">?geom_point</a></code>)</p></li>
<li><p>What happens if you map an aesthetic to something other than a variable name, like <code>aes(color = displ &lt; 5)</code>? Note, you’ll also need to specify x and y.</p></li>
</ol></section>
</section>
<section id="sec-geometric-objects" data-type="sect1">
<h1>
Geometric objects</h1>
<p>How are these two plots similar?</p>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-9-1.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-9-2.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="384"/></p>
</div>
</div>
</div>
<p>Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different geometric object, geom, to represent the data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.</p>
<p>To change the geom in your plot, change the geom function that you add to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. For instance, to make the plots above, you can use this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
# Right
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth()</pre>
</div>
<p>Every geom function in ggplot2 takes a <code>mapping</code> argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. If you try, ggplot2 will silently ignore that aesthetic mapping. On the other hand, you <em>could</em> set the linetype of a line. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
geom_smooth()
ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) +
geom_smooth()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-11-1.png" alt="Two plots of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves. On the left, three smooth curves, all with the same linetype. On the right, three smooth curves with different line types (solid, dashed, or long dashed) for each type of drive train. In both plots, confidence intervals around the smooth curves are also displayed." width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-11-2.png" alt="Two plots of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves. On the left, three smooth curves, all with the same linetype. On the right, three smooth curves with different line types (solid, dashed, or long dashed) for each type of drive train. In both plots, confidence intervals around the smooth curves are also displayed." width="576"/></p>
</div>
</div>
<p>Here, <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> separates the cars into three lines based on their <code>drv</code> value, which describes a car’s drive train. One line describes all of the points that have a <code>4</code> value, one line describes all of the points that have an <code>f</code> value, and one line describes all of the points that have an <code>r</code> value. Here, <code>4</code> stands for four-wheel drive, <code>f</code> for front-wheel drive, and <code>r</code> for rear-wheel drive.</p>
<p>If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to <code>drv</code>.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-12-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with points (colored by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed." width="576"/></p>
</div>
</div>
<p>Notice that this plot contains two geoms in the same graph.</p>
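<p>The plot above is shown without its code. One plausible way to draw something like it, assuming <code>color</code> is mapped globally and <code>linetype</code> is mapped only in the smooth layer, is sketched here:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# The global color mapping applies to both layers;
# linetype is mapped locally, so it affects only the smooth curves.
p &lt;- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(aes(linetype = drv))
p</pre>
</div>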
<p>Many geoms, like <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>, use a single geometric object to display multiple rows of data. For these geoms, you can set the <code>group</code> aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the <code>linetype</code> example). It is convenient to rely on this feature because the <code>group</code> aesthetic by itself does not add a legend or distinguishing features to the geoms.</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth(aes(group = drv))
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth(aes(color = drv), show.legend = FALSE)</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-13-1.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars on the x-axis, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-13-2.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars on the x-axis, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-13-3.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars on the x-axis, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
</div>
</div>
</div>
</div>
<p>If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings <em>for that layer only</em>. This makes it possible to display different aesthetics in different layers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-14-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it." width="576"/></p>
</div>
</div>
<p>You can use the same idea to specify different <code>data</code> for each layer. Here, we use red points as well as open circles to highlight two-seater cars. The local data argument in <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code> overrides the global data argument in <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> for that layer only.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_point(
data = mpg |&gt; filter(class == "2seater"),
color = "red"
) +
geom_point(
data = mpg |&gt; filter(class == "2seater"),
shape = "circle open", size = 3, color = "red"
)</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where two-seater cars are highlighted with red points and open red circles." width="576"/></p>
</div>
</div>
<p>(You’ll learn how <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> works in the chapter on data transformations: for now, just know that this command selects only the two-seater cars.)</p>
<p>Geoms are the fundamental building blocks of ggplot2. You can completely transform the look of your plot by changing its geom, and different geoms can reveal different features of your data. For example, the histogram and density plot below reveal that the distribution of highway mileage is bimodal and right skewed while the boxplot reveals two potential outliers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Left
ggplot(mpg, aes(x = hwy)) +
geom_histogram(binwidth = 2)
# Middle
ggplot(mpg, aes(x = hwy)) +
geom_density()
# Right
ggplot(mpg, aes(x = hwy)) +
geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-16-1.png" alt="Three plots: histogram, density plot, and box plot of highway mileage." width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-16-2.png" alt="Three plots: histogram, density plot, and box plot of highway mileage." width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-16-3.png" alt="Three plots: histogram, density plot, and box plot of highway mileage." width="576"/></p>
</div>
</div>
<p>ggplot2 provides more than 40 geoms but these don’t cover all possible plots one could make. If you need a different geom, we recommend looking into extension packages first to see if someone else has already implemented it (see <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a> for a sampling). For example, the <strong>ggridges</strong> package (<a href="https://wilkelab.org/ggridges/" class="uri">https://wilkelab.org/ggridges</a>) is useful for making ridgeline plots, which can be useful for visualizing the density of a numerical variable for different levels of a categorical variable. In the following plot not only did we use a new geom (<code><a href="https://wilkelab.org/ggridges/reference/geom_density_ridges.html">geom_density_ridges()</a></code>), but we have also mapped the same variable to multiple aesthetics (<code>drv</code> to <code>y</code>, <code>fill</code>, and <code>color</code>) as well as set an aesthetic (<code>alpha = 0.5</code>) to make the density curves transparent.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggridges)
ggplot(mpg, aes(x = hwy, y = drv, fill = drv, color = drv)) +
geom_density_ridges(alpha = 0.5, show.legend = FALSE)
#&gt; Picking joint bandwidth of 1.28</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-17-1.png" alt="Density curves for highway mileage for cars with rear wheel, front wheel, and 4-wheel drives plotted separately. The distribution is bimodal and roughly symmetric for rear and 4 wheel drive cars and unimodal and right skewed for front wheel drive cars." width="576"/></p>
</div>
</div>
<p>The best place to get a comprehensive overview of all of the geoms ggplot2 offers, as well as all functions in the package, is the reference page: <a href="https://ggplot2.tidyverse.org/reference" class="uri">https://ggplot2.tidyverse.org/reference</a>. To learn more about any single geom, use the help (e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">?geom_smooth</a></code>).</p>
<section id="layers-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?</p></li>
<li>
<p>Earlier in this chapter we used <code>show.legend</code> without explaining it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
geom_smooth(aes(color = drv), show.legend = FALSE)</pre>
</div>
<p>What does <code>show.legend = FALSE</code> do here? What happens if you remove it? Why do you think we used it earlier?</p>
</li>
<li><p>What does the <code>se</code> argument to <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> do?</p></li>
<li>
<p>Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s <code>drv</code>.</p>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-19-1.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-19-2.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-19-3.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-19-4.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-19-5.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-19-6.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
</div>
</div>
</li>
</ol></section>
</section>
<section id="facets" data-type="sect1">
<h1>
Facets</h1>
<p>In <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a> you learned about faceting with <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code>, which splits a plot into subplots that each display one subset of the data based on a categorical variable.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~cyl)</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-20-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by number of cylinders, with facets spanning two rows." width="576"/></p>
</div>
</div>
<p>To facet your plot with the combination of two variables, switch from <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> to <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> is also a formula, but now it’s a double-sided formula: <code>rows ~ cols</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-21-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive." width="576"/></p>
</div>
</div>
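<p>As a minimal sketch of the same layout, the row and column variables can also be named explicitly with <code>vars()</code> through the <code>rows</code> and <code>cols</code> arguments of <code>facet_grid()</code>. The object name <code>p_grid</code> is only illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Same grid as facet_grid(drv ~ cyl): drv defines the rows, cyl the columns
p_grid &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl))
p_grid</pre>
</div>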
<p>By default, all facets share the same scale for the x and y axes. This is useful when you want to compare data across facets, but it can be limiting when you want to see the relationship within each facet more clearly. Setting the <code>scales</code> argument in a faceting function to <code>"free"</code> allows different axis scales across both rows and columns. Other options for this argument are <code>"free_x"</code> (the x-axis scale can differ across columns, with the y-axis fixed) and <code>"free_y"</code> (the y-axis scale can differ across rows, with the x-axis fixed).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl, scales = "free")</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-22-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive. Facets within a row share the same y-scale and facets within a column share the same x-scale." width="576"/></p>
</div>
</div>
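<p>To see the effect of freeing only one axis, a small sketch with <code>scales = "free_y"</code> might look like the following: each row of facets gets its own y-axis range, while every facet keeps a common x-axis. The name <code>p_free_y</code> is illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Only the y-axis is freed, so each drive-train row is scaled to its own hwy range
p_free_y &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl, scales = "free_y")
p_free_y</pre>
</div>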
<section id="layers-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens if you facet on a continuous variable?</p></li>
<li>
<p>What do the empty cells in the plot with <code>facet_grid(drv ~ cyl)</code> mean? How do they relate to this plot?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
  geom_point(aes(x = drv, y = cyl))</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-23-1.png" alt="Scatterplot of number of cylinders versus type of drive train of cars. The plot shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are rear wheel drive." width="576"/></p>
</div>
</div>
</li>
<li>
<p>What plots does the following code make? What does <code>.</code> do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)</pre>
</div>
</li>
<li>
<p>Take the first faceted plot in this section:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_wrap(~ cyl, nrow = 2)</pre>
</div>
<p>What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?</p>
</li>
<li><p>Read <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">?facet_wrap</a></code>. What does <code>nrow</code> do? What does <code>ncol</code> do? What other options control the layout of the individual panels? Why doesn’t <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> have <code>nrow</code> and <code>ncol</code> arguments?</p></li>
<li>
<p>Which of the following two plots makes it easier to compare engine size (<code>displ</code>) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv)</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-26-1.png" alt="Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the first plot, facets are organized across rows and in the second, across columns." width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-26-2.png" alt="Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the first plot, facets are organized across rows and in the second, across columns." width="576"/></p>
</div>
</div>
</li>
<li>
<p>Recreate this plot using <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. How do the positions of the facet labels change?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows." width="576"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="statistical-transformations" data-type="sect1">
<h1>
Statistical transformations</h1>
<p>Consider a basic bar chart, drawn with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code>. The following chart displays the total number of diamonds in the <code>diamonds</code> dataset, grouped by <code>cut</code>. The <code>diamonds</code> dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the <code>price</code>, <code>carat</code>, <code>color</code>, <code>clarity</code>, and <code>cut</code> of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-28-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
</div>
</div>
<p>On the x-axis, the chart displays <code>cut</code>, a variable from <code>diamonds</code>. On the y-axis, it displays count, but count is not a variable in <code>diamonds</code>! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:</p>
<ul><li><p>Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.</p></li>
<li><p>Smoothers fit a model to your data and then plot predictions from the model.</p></li>
<li><p>Boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.</p></li>
</ul><p>The algorithm used to calculate new values for a graph is called a <strong>stat</strong>, short for statistical transformation. <a href="#fig-vis-stat-bar" data-type="xref">#fig-vis-stat-bar</a> shows how this process works with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-vis-stat-bar"><p><img src="images/visualization-stat-bar.png" style="width:100.0%" alt="A figure demonstrating three steps of creating a bar chart. Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar() transforms the data with the count stat, which returns a data set of cut values and counts. Step 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis."/></p>
<figcaption>When creating a bar chart, we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.</figcaption>
</figure>
</div>
</div>
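<p>The aggregation step can be approximated by hand. The sketch below assumes dplyr is loaded and computes one row per <code>cut</code>, with a count and a proportion of all diamonds. This is roughly what the count stat produces when all rows are treated as a single group. The name <code>cut_summary</code> is illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2) # for the diamonds dataset
library(dplyr)

# One row per cut: the number of diamonds and their share of the whole dataset
cut_summary &lt;- diamonds |&gt;
  count(cut, name = "count") |&gt;
  mutate(prop = count / sum(count))
cut_summary</pre>
</div>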
<p>You can learn which stat a geom uses by inspecting the default value for the <code>stat</code> argument. For example, <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">?geom_bar</a></code> shows that the default value for <code>stat</code> is “count”, which means that <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> uses <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code> is documented on the same page as <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>. If you scroll down, the section called “Computed variables” explains that it computes two new variables: <code>count</code> and <code>prop</code>.</p>
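<p>One way to look at the computed variables directly, rather than only reading about them, is ggplot2’s <code>layer_data()</code>, which returns the data a layer draws after its stat has run. A brief sketch, where <code>p_bar</code> and <code>bar_data</code> are illustrative names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

p_bar &lt;- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# One row per bar; the count and prop columns are added by stat_count()
bar_data &lt;- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]</pre>
</div>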
<p>Every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:</p>
<ol type="1"><li>
<p>You might want to override the default stat. In the code below, we change the stat of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> from count (the default) to identity. This lets us map the height of the bars to the raw values of a <span class="math inline">\(y\)</span> variable.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cut_frequencies &lt;- tribble(
  ~cut, ~freq,
  "Fair", 1610,
  "Good", 4906,
  "Very Good", 12082,
  "Premium", 13791,
  "Ideal", 21551
)

ggplot(cut_frequencies, aes(x = cut, y = freq)) +
  geom_bar(stat = "identity")</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-30-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
</div>
</div>
</li>
<li>
<p>You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-31-1.png" alt="Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and Ideal 0.40." width="576"/></p>
</div>
</div>
<p>To find the variables computed by the stat, look for the section titled “computed variables” in the help for <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>.</p>
</li>
<li>
<p>You might want to draw greater attention to the statistical transformation in your code. For example, you might use <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds) +
  stat_summary(
    aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-32-1.png" alt="A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point." width="576"/></p>
</div>
</div>
</li>
</ol><p>ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">?stat_bin</a></code>.</p>
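<p>Because each stat has a default geom, a stat function can stand in for the geom in a layer. As a small sketch, the bar chart from the start of this section can be drawn with <code>stat_count()</code>, whose default geom is the bar. The name <code>p_stat</code> is illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Draws the same bars as geom_bar(), but names the stat instead of the geom
p_stat &lt;- ggplot(diamonds, aes(x = cut)) +
  stat_count()
p_stat</pre>
</div>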
<section id="layers-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is the default geom associated with <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>? How could you rewrite the previous plot to use that geom function instead of the stat function?</p></li>
<li><p>What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code> do? How is it different from <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>?</p></li>
<li><p>Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?</p></li>
<li><p>What variables does <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">stat_smooth()</a></code> compute? What parameters control its behavior?</p></li>
<li>
<p>In our proportion bar chart, we need to set <code>group = 1</code>. Why? In other words, what is the problem with these two graphs?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, y = after_stat(prop))) +
  geom_bar()

ggplot(diamonds, aes(x = cut, fill = color, y = after_stat(prop))) +
  geom_bar()</pre>
</div>
</li>
</ol></section>
</section>
<section id="position-adjustments" data-type="sect1">
<h1>
Position adjustments</h1>
<p>There’s one more piece of magic associated with bar charts. You can color a bar chart using either the <code>color</code> aesthetic, or, more usefully, <code>fill</code>:</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, color = cut)) +
  geom_bar()

ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-34-1.png" alt="Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-34-2.png" alt="Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category." width="384"/></p>
</div>
</div>
</div>
</div>
<p>Note what happens if you map the fill aesthetic to another variable, like <code>clarity</code>: the bars are automatically stacked. Each colored rectangle represents a combination of <code>cut</code> and <code>clarity</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-35-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level." width="576"/></p>
</div>
</div>
<p>The stacking is performed automatically using the <strong>position adjustment</strong> specified by the <code>position</code> argument. If you don’t want a stacked bar chart, you can use one of three other options: <code>"identity"</code>, <code>"dodge"</code> or <code>"fill"</code>.</p>
<ul><li>
<p><code>position = "identity"</code> will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting <code>alpha</code> to a small value, or completely transparent by setting <code>fill = NA</code>.</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

ggplot(diamonds, aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = "identity")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-36-1.png" alt="Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-36-2.png" alt="Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors." width="384"/></p>
</div>
</div>
</div>
</div>
<p>The identity position adjustment is more useful for 2d geoms, like points, where it is the default.</p>
</li>
<li>
<p><code>position = "fill"</code> works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-37-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Height of each bar is 1 and heights of the colored segments are proportional to the proportion of diamonds with a given clarity level within a given cut level." width="576"/></p>
</div>
</div>
</li>
<li>
<p><code>position = "dodge"</code> places overlapping objects directly <em>beside</em> one another. This makes it easier to compare individual values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-38-1.png" alt="Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity." width="576"/></p>
</div>
</div>
</li>
</ul><p>There’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-39-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association." width="576"/></p>
</div>
</div>
<p>The underlying values of <code>hwy</code> and <code>displ</code> are rounded so the points appear on a grid and many points overlap each other. This problem is known as <strong>overplotting</strong>. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of <code>hwy</code> and <code>displ</code> that contains 109 values?</p>
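<p>A quick way to check how much overlap there is, assuming dplyr is available, is to count how many cars share each combination of <code>displ</code> and <code>hwy</code>. In this sketch, where <code>overlaps</code> is an illustrative name, every row with <code>n</code> greater than 1 marks a position where points sit on top of each other.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2) # for the mpg dataset
library(dplyr)

# One row per distinct (displ, hwy) position, with the number of cars plotted there
overlaps &lt;- mpg |&gt;
  count(displ, hwy, sort = TRUE)

nrow(overlaps) # number of distinct positions drawn in the scatterplot
overlaps # most heavily overplotted positions first</pre>
</div>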
<p>You can avoid this gridding by setting the position adjustment to “jitter”. <code>position = "jitter"</code> adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-40-1.png" alt="Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association." width="576"/></p>
</div>
</div>
<p>Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph <em>more</em> revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for <code>geom_point(position = "jitter")</code>: <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>.</p>
<p>To learn more about a position adjustment, look up the help page associated with each adjustment: <code><a href="https://ggplot2.tidyverse.org/reference/position_dodge.html">?position_dodge</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_fill</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_identity.html">?position_identity</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_jitter.html">?position_jitter</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_stack</a></code>.</p>
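<p>Each of these adjustments is also an ordinary function, so a position can be supplied as a function call instead of a string when you need to pass it arguments. As one hedged example, <code>position_jitter()</code> takes a <code>seed</code> argument, which makes the random noise reproducible from one rendering to the next. The seed value and the name <code>p_jitter</code> are arbitrary.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Same jittered scatterplot, but the noise is identical every time it is drawn
p_jitter &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(seed = 123))
p_jitter</pre>
</div>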
<section id="layers-exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>What is the problem with this plot? How could you improve it?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-41-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset." width="576"/></p>
</div>
</div>
</li>
<li><p>What parameters to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> control the amount of jittering?</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> with <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>.</p></li>
<li><p>What’s the default position adjustment for <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>? Create a visualization of the <code>mpg</code> dataset that demonstrates it.</p></li>
</ol></section>
</section>
<section id="coordinate-systems" data-type="sect1">
<h1>
Coordinate systems</h1>
<p>Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are two other coordinate systems that are occasionally helpful.</p>
<ul><li>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the <a href="https://ggplot2-book.org/maps.html">Maps chapter</a> of <em>ggplot2: Elegant graphics for data analysis</em>.</p>
<div>
<pre data-type="programlisting" data-code-language="r">nz &lt;- map_data("nz")

ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-42-1.png" alt="Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-42-2.png" alt="Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct." width="384"/></p>
</div>
</div>
</div>
</div>
</li>
<li>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code> uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.</p>
<div>
<pre data-type="programlisting" data-code-language="r">bar &lt;- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-43-1.png" alt="There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="layers_files/figure-html/unnamed-chunk-43-2.png" alt="There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data." width="384"/></p>
</div>
</div>
</div>
</div>
</li>
</ul>
<section id="layers-exercises-5" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Turn a stacked bar chart into a pie chart using <code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code>.</p></li>
<li><p>What’s the difference between <code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_map()</a></code>?</p></li>
<li>
<p>What does the plot below tell you about the relationship between city and highway mpg? Why is <code><a href="https://ggplot2.tidyverse.org/reference/coord_fixed.html">coord_fixed()</a></code> important? What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_abline()</a></code> do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()</pre>
<div class="cell-output-display">
<p><img src="layers_files/figure-html/unnamed-chunk-44-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The plot also has a straight line that follows the trend of the relationship between the variables but does not go through the cloud of points; it is beneath it." width="576"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="the-layered-grammar-of-graphics" data-type="sect1">
<h1>
The layered grammar of graphics</h1>
<p>We can expand on the graphing template you learned in <span class="quarto-unresolved-ref">?sec-graphing-template</span> by adding position adjustments, stats, coordinate systems, and faceting:</p>
<pre><code>ggplot(data = &lt;DATA&gt;) +
&lt;GEOM_FUNCTION&gt;(
mapping = aes(&lt;MAPPINGS&gt;),
stat = &lt;STAT&gt;,
position = &lt;POSITION&gt;
) +
&lt;COORDINATE_FUNCTION&gt; +
&lt;FACET_FUNCTION&gt;</code></pre>
<p>Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.</p>
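<p>As a rough illustration of the template, the sketch below fills in all seven parameters explicitly for the <code>diamonds</code> dataset that ships with ggplot2. The particular choices (a filled bar chart of <code>cut</code> by <code>clarity</code>, flipped and faceted by <code>color</code>) are our own assumptions for the sake of the example, and most of them could be left to ggplot2’s defaults:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Illustrative choices only: every template slot is spelled out
p &lt;- ggplot(data = diamonds) +               # DATA
  geom_bar(                                   # GEOM_FUNCTION
    mapping = aes(x = cut, fill = clarity),   # MAPPINGS
    stat = "count",                           # STAT (the default for geom_bar)
    position = "fill"                         # POSITION
  ) +
  coord_flip() +                              # COORDINATE_FUNCTION
  facet_wrap(~color)                          # FACET_FUNCTION

p</pre>
</div>
<p>Dropping <code>stat</code>, <code>position</code>, the coordinate function, and the facet function still gives a valid plot, because ggplot2 supplies defaults for them.</p>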
<p>The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe <em>any</em> plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.</p>
<p>To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat). Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic. You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="images/visualization-grammar.png" alt="A figure demonstrating the steps for going from raw data to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level." width="1332"/></p>
</div>
</div>
<p>At this point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.</p>
<p>You could use this method to build <em>any</em> plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.</p>
<p>If you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “<a href="https://vita.had.co.nz/papers/layered-grammar.pdf">The Layered Grammar of Graphics</a>”, the scientific paper that describes the theory of ggplot2 in detail.</p>
</section>
<section id="layers-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about the layered grammar of graphics: aesthetics and geometries for building a simple plot, facets for splitting the plot into subsets, statistics for understanding how geoms are calculated, position adjustments for controlling the fine details of position when geoms might otherwise overlap, and coordinate systems, which allow you to fundamentally change what <code>x</code> and <code>y</code> mean. One layer we have not yet touched on is theme, which we will introduce in <a href="#sec-themes" data-type="xref">#sec-themes</a>.</p>
<p>Two very useful resources for getting an overview of the complete ggplot2 functionality are the ggplot2 cheatsheet (which you can find at <a href="https://posit.co/resources/cheatsheets" class="uri">https://posit.co/resources/cheatsheets</a>) and the ggplot2 package website (<a href="https://ggplot2.tidyverse.org/">https://ggplot2.tidyverse.org</a>).</p>
<p>An important lesson you should take from this chapter is that when you feel the need for a geom that is not provided by ggplot2, it’s always a good idea to look into whether someone else has already solved your problem by creating a ggplot2 extension package that offers that geom.</p>
</section>
</section>


@ -1,637 +0,0 @@
<section data-type="chapter" id="chp-logicals">
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1>
<section id="logicals-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p>
<p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
<section id="logicals-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
<p>However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 3, 5, 7, 11, 13)
x * 2
#&gt; [1] 2 4 6 10 14 22 26</pre>
</div>
<p>This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and friends.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x)
df |&gt;
mutate(y = x * 2)
#&gt; # A tibble: 7 × 2
#&gt; x y
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2
#&gt; 2 2 4
#&gt; 3 3 6
#&gt; 4 5 10
#&gt; 5 7 14
#&gt; 6 11 22
#&gt; # … with 1 more row</pre>
</div>
</section>
</section>
<section id="comparisons" data-type="sect1">
<h1>
Comparisons</h1>
<p>A very common way to create a logical vector is via a numeric comparison with <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code>, and <code>==</code>. So far, we’ve mostly created logical variables transiently within <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>: they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that arrive roughly on time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dep_time &gt; 600 &amp; dep_time &lt; 2000 &amp; abs(arr_delay) &lt; 20)
#&gt; # A tibble: 172,286 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 601 600 1 844 850
#&gt; 2 2013 1 1 602 610 -8 812 820
#&gt; 3 2013 1 1 602 605 -3 821 805
#&gt; 4 2013 1 1 606 610 -4 858 910
#&gt; 5 2013 1 1 606 610 -4 837 845
#&gt; 6 2013 1 1 607 607 0 858 915
#&gt; # … with 172,280 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
daytime = dep_time &gt; 600 &amp; dep_time &lt; 2000,
approx_ontime = abs(arr_delay) &lt; 20,
.keep = "used"
)
#&gt; # A tibble: 336,776 × 4
#&gt; dep_time arr_delay daytime approx_ontime
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 517 11 FALSE TRUE
#&gt; 2 533 20 FALSE FALSE
#&gt; 3 542 33 FALSE FALSE
#&gt; 4 544 -18 FALSE TRUE
#&gt; 5 554 -25 FALSE FALSE
#&gt; 6 554 12 FALSE TRUE
#&gt; # … with 336,770 more rows</pre>
</div>
<p>This is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.</p>
<p>All up, the initial filter is equivalent to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
daytime = dep_time &gt; 600 &amp; dep_time &lt; 2000,
approx_ontime = abs(arr_delay) &lt; 20,
) |&gt;
filter(daytime &amp; approx_ontime)</pre>
</div>
<section id="sec-fp-comparison" data-type="sect2">
<h2>
Floating point comparison</h2>
<p>Beware of using <code>==</code> with numbers. For example, it looks like this vector contains the numbers 1 and 2:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1 / 49 * 49, sqrt(2) ^ 2)
x
#&gt; [1] 1 2</pre>
</div>
<p>But if you test them for equality, you get <code>FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x == c(1, 2)
#&gt; [1] FALSE FALSE</pre>
</div>
<p>What’s going on? Computers store numbers with a fixed number of digits of precision, so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code>, and subsequent computations will be very slightly off. We can see more of the stored values by calling <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> with the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">print(x, digits = 16)
#&gt; [1] 0.9999999999999999 2.0000000000000004</pre>
</div>
<p>You can see why R defaults to rounding these numbers; they really are very close to what you expect.</p>
<p>Now that you’ve seen why <code>==</code> is failing, what can you do about it? One option is to use <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code>, which ignores small differences:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">near(x, c(1, 2))
#&gt; [1] TRUE TRUE</pre>
</div>
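<p>To get a feel for how small these differences are, here is a minimal sketch that reuses the same two values. The tolerance of <code>1e-12</code> is an arbitrary choice for illustration, and <code><a href="https://rdrr.io/r/base/all.equal.html">all.equal()</a></code> is shown only as a base R alternative, not as what <code>near()</code> does internally:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- c(1 / 49 * 49, sqrt(2)^2)

# The differences from the "expected" values are tiny, but not zero
x - c(1, 2)

# near() accepts a tol argument if the default tolerance doesn't suit you
near(x, c(1, 2), tol = 1e-12)
#&gt; [1] TRUE TRUE

# Base R: all.equal() also compares with a tolerance; wrapping it in
# isTRUE() gives a single TRUE or FALSE for the whole vector
isTRUE(all.equal(x, c(1, 2)))
#&gt; [1] TRUE</pre>
</div>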
</section>
<section id="sec-na-comparison" data-type="sect2">
<h2>
Missing values</h2>
<p>Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">NA &gt; 5
#&gt; [1] NA
10 == NA
#&gt; [1] NA</pre>
</div>
<p>The most confusing result is this one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">NA == NA
#&gt; [1] NA</pre>
</div>
<p>It’s easiest to understand why this is true if we artificially supply a little more context:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Let x be Mary's age. We don't know how old she is.
x &lt;- NA
# Let y be John's age. We don't know how old he is.
y &lt;- NA
# Are John and Mary the same age?
x == y
#&gt; [1] NA
# We don't know!</pre>
</div>
<p>So if you want to find all flights where <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield <code>NA</code> for every single row, and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> automatically drops missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dep_time == NA)
#&gt; # A tibble: 0 × 19
#&gt; # … with 19 variables: year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;,
#&gt; # sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;, …</pre>
</div>
<p>Instead, we’ll need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
</section>
<section id="is.na" data-type="sect2">
<h2>
is.na()
</h2>
<p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">is.na(c(TRUE, NA, FALSE))
#&gt; [1] FALSE TRUE FALSE
is.na(c(1, NA, 3))
#&gt; [1] FALSE TRUE FALSE
is.na(c("a", NA, "b"))
#&gt; [1] FALSE TRUE FALSE</pre>
</div>
<p>We can use <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> to find all the rows with a missing <code>dep_time</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(is.na(dep_time))
#&gt; # A tibble: 8,255 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815
#&gt; 2 2013 1 1 NA 1935 NA NA 2240
#&gt; 3 2013 1 1 NA 1500 NA NA 1825
#&gt; 4 2013 1 1 NA 600 NA NA 901
#&gt; 5 2013 1 2 NA 1540 NA NA 1747
#&gt; 6 2013 1 2 NA 1620 NA NA 1746
#&gt; # … with 8,249 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(month == 1, day == 1) |&gt;
arrange(dep_time)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …
flights |&gt;
filter(month == 1, day == 1) |&gt;
arrange(desc(is.na(dep_time)), dep_time)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815
#&gt; 2 2013 1 1 NA 1935 NA NA 2240
#&gt; 3 2013 1 1 NA 1500 NA NA 1825
#&gt; 4 2013 1 1 NA 600 NA NA 901
#&gt; 5 2013 1 1 517 515 2 830 819
#&gt; 6 2013 1 1 533 529 4 850 830
#&gt; # … with 836 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>We’ll come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
</section>
<section id="logicals-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> work? Type <code>near</code> to see the source code.</li>
<li>Use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> together to describe how the missing values in <code>dep_time</code>, <code>sched_dep_time</code> and <code>dep_delay</code> are connected.</li>
</ol></section>
</section>
<section id="boolean-algebra" data-type="sect1">
<h1>
Boolean algebra</h1>
<p>Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, <code>&amp;</code> is “and”, <code>|</code> is “or”, <code>!</code> is “not”, and <code><a href="https://rdrr.io/r/base/Logic.html">xor()</a></code> is exclusive or<span data-type="footnote">That is, <code>xor(x, y)</code> is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.</span>. <a href="#fig-bool-ops" data-type="xref">#fig-bool-ops</a> shows the complete set of Boolean operations and how they work.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-bool-ops"><p><img src="diagrams/transform.png" alt="Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. y &amp; !x is y but none of x; x &amp; y is the intersection of x and y; x &amp; !y is x but none of y; x is all of x none of y; xor(x, y) is everything except the intersection of x and y; y is all of y and none of x; and x | y is everything." width="395"/></p>
<figcaption>The complete set of Boolean operations. <code>x</code> is the left-hand circle, <code>y</code> is the right-hand circle, and the shaded regions show which parts each operator selects.</figcaption>
</figure>
</div>
</div>
<p>As well as <code>&amp;</code> and <code>|</code>, R also has <code>&amp;&amp;</code> and <code>||</code>. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single <code>TRUE</code> or <code>FALSE</code>. They’re important for programming, not data science.</p>
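<p>The difference between the two families of operators can be sketched with a couple of made-up vectors. The names <code>x</code>, <code>y</code>, and <code>n</code> below are purely illustrative:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(TRUE, FALSE, TRUE)
y &lt;- c(TRUE, TRUE, FALSE)

# &amp; and | are vectorized: one result per element
x &amp; y
#&gt; [1]  TRUE FALSE FALSE
x | y
#&gt; [1] TRUE TRUE TRUE

# &amp;&amp; and || work on single values, which suits conditions in if();
# the right-hand side is only evaluated when it is needed
n &lt;- 5
size &lt;- if (is.numeric(n) &amp;&amp; n &gt; 0) "positive" else "not positive"
size
#&gt; [1] "positive"</pre>
</div>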
<section id="sec-na-boolean" data-type="sect2">
<h2>
Missing values</h2>
<p>The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = c(TRUE, FALSE, NA))
df |&gt;
mutate(
and = x &amp; NA,
or = x | NA
)
#&gt; # A tibble: 3 × 3
#&gt; x and or
#&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 TRUE NA TRUE
#&gt; 2 FALSE FALSE NA
#&gt; 3 NA NA NA</pre>
</div>
<p>To understand what’s going on, think about <code>NA | TRUE</code>. A missing value in a logical vector means that the value could either be <code>TRUE</code> or <code>FALSE</code>. <code>TRUE | TRUE</code> and <code>FALSE | TRUE</code> are both <code>TRUE</code>, so <code>NA | TRUE</code> must also be <code>TRUE</code>. Similar reasoning applies with <code>NA &amp; FALSE</code>.</p>
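<p>Spelling out the four combinations with scalar values may make the pattern easier to see. This is just the reasoning above written as code: the result is only known when it would be the same whichever value the <code>NA</code> stands for.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># FALSE &amp; anything is FALSE, so the unknown value doesn't matter
NA &amp; FALSE
#&gt; [1] FALSE

# TRUE &amp; TRUE is TRUE but FALSE &amp; TRUE is FALSE, so the answer is unknown
NA &amp; TRUE
#&gt; [1] NA

# TRUE | anything is TRUE, so the unknown value doesn't matter
NA | TRUE
#&gt; [1] TRUE

# TRUE | FALSE is TRUE but FALSE | FALSE is FALSE, so the answer is unknown
NA | FALSE
#&gt; [1] NA</pre>
</div>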
</section>
<section id="order-of-operations" data-type="sect2">
<h2>
Order of operations</h2>
<p>Note that the order of operations doesn’t work like English. Take the following code that finds all flights that departed in November or December:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(month == 11 | month == 12)</pre>
</div>
<p>You might be tempted to write it the way you’d say it in English, “Find all flights that departed in November or December”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(month == 11 | 12)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 533 529 4 850 830
#&gt; 3 2013 1 1 542 540 2 923 850
#&gt; 4 2013 1 1 544 545 -1 1004 1022
#&gt; 5 2013 1 1 554 600 -6 812 837
#&gt; 6 2013 1 1 554 558 -4 740 728
#&gt; # … with 336,770 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>This code doesn’t error, but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates <code>month == 11</code>, creating a logical vector, which we call <code>nov</code>. It then computes <code>nov | 12</code>. When you use a number with a logical operator, R converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code>, which will always be <code>TRUE</code>, so every row will be selected:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
nov = month == 11,
final = nov | 12,
.keep = "used"
)
#&gt; # A tibble: 336,776 × 3
#&gt; month nov final
#&gt; &lt;int&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 1 FALSE TRUE
#&gt; 2 1 FALSE TRUE
#&gt; 3 1 FALSE TRUE
#&gt; 4 1 FALSE TRUE
#&gt; 5 1 FALSE TRUE
#&gt; 6 1 FALSE TRUE
#&gt; # … with 336,770 more rows</pre>
</div>
</section>
<section id="in" data-type="sect2">
<h2>
%in%
</h2>
<p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">1:12 %in% c(1, 5, 11)
#&gt; [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
letters[1:10] %in% c("a", "e", "i", "o", "u")
#&gt; [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE</pre>
</div>
<p>So to find all flights in November and December we could write:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(month %in% c(11, 12))</pre>
</div>
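<p>On a small made-up vector (the name <code>month</code> here is a free-floating vector, not the <code>flights</code> column), you can check that the two spellings agree for non-missing values. Treat this as a sketch rather than a proof:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">month &lt;- c(1, 11, 12, NA)

# The long form: two comparisons joined with |
long_form &lt;- month == 11 | month == 12
long_form
#&gt; [1] FALSE  TRUE  TRUE    NA

# The short form with %in%
short_form &lt;- month %in% c(11, 12)
short_form
#&gt; [1] FALSE  TRUE  TRUE FALSE</pre>
</div>
<p>Inside <code>filter()</code> both versions keep the same rows, because <code>filter()</code> drops rows where the condition is <code>NA</code> as well as rows where it is <code>FALSE</code>.</p>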
<p>Note that <code>%in%</code> obeys different rules for <code>NA</code> to <code>==</code>, as <code>NA %in% NA</code> is <code>TRUE</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">c(1, 2, NA) == NA
#&gt; [1] NA NA NA
c(1, 2, NA) %in% NA
#&gt; [1] FALSE FALSE TRUE</pre>
</div>
<p>This can make for a useful shortcut:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(dep_time %in% c(NA, 0800))
#&gt; # A tibble: 8,803 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 800 800 0 1022 1014
#&gt; 2 2013 1 1 800 810 -10 949 955
#&gt; 3 2013 1 1 NA 1630 NA NA 1815
#&gt; 4 2013 1 1 NA 1935 NA NA 2240
#&gt; 5 2013 1 1 NA 1500 NA NA 1825
#&gt; 6 2013 1 1 NA 600 NA NA 901
#&gt; # … with 8,797 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="logicals-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> are missing, but <code>arr_delay</code> is.</li>
<li>How many flights have a missing <code>dep_time</code>? What other variables are missing in these rows? What might these rows represent?</li>
<li>Assuming that a missing <code>dep_time</code> implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and the average delay of non-cancelled flights?</li>
</ol></section>
</section>
<section id="sec-logical-summaries" data-type="sect1">
<h1>
Summaries</h1>
<p>The following sections describe some useful techniques for summarizing logical vectors. As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.</p>
<section id="logical-summaries" data-type="sect2">
<h2>
Logical summaries</h2>
<p>There are two main logical summaries: <code><a href="https://rdrr.io/r/base/any.html">any()</a></code> and <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>. <code>any(x)</code> is the equivalent of <code>|</code>; it’ll return <code>TRUE</code> if there are any <code>TRUE</code>s in <code>x</code>. <code>all(x)</code> is the equivalent of <code>&amp;</code>; it’ll return <code>TRUE</code> only if all values of <code>x</code> are <code>TRUE</code>s. Like most summary functions, they can return <code>NA</code> when missing values are present, and as usual you can make the missing values go away with <code>na.rm = TRUE</code>.</p>
<p>For example, we could use <code><a href="https://rdrr.io/r/base/all.html">all()</a></code> to find out if there were days where every flight was delayed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
summarize(
all_delayed = all(arr_delay &gt;= 0, na.rm = TRUE),
any_delayed = any(arr_delay &gt;= 0, na.rm = TRUE),
.groups = "drop"
)
#&gt; # A tibble: 365 × 5
#&gt; year month day all_delayed any_delayed
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 2013 1 1 FALSE TRUE
#&gt; 2 2013 1 2 FALSE TRUE
#&gt; 3 2013 1 3 FALSE TRUE
#&gt; 4 2013 1 4 FALSE TRUE
#&gt; 5 2013 1 5 FALSE TRUE
#&gt; 6 2013 1 6 FALSE TRUE
#&gt; # … with 359 more rows</pre>
</div>
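<p>Because <code>any()</code> and <code>all()</code> follow the same rules as <code>|</code> and <code>&amp;</code>, their behavior with missing values is worth checking on a few small vectors. The inputs below are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># One TRUE is enough for any(), whatever the missing value is
any(c(TRUE, NA))
#&gt; [1] TRUE
any(c(FALSE, NA))
#&gt; [1] NA

# One FALSE is enough to make all() FALSE
all(c(FALSE, NA))
#&gt; [1] FALSE
all(c(TRUE, NA))
#&gt; [1] NA

# na.rm = TRUE drops the missing values first
any(c(FALSE, NA), na.rm = TRUE)
#&gt; [1] FALSE
all(c(TRUE, NA), na.rm = TRUE)
#&gt; [1] TRUE

# With nothing left to check, all() is TRUE and any() is FALSE
all(logical(0))
#&gt; [1] TRUE
any(logical(0))
#&gt; [1] FALSE</pre>
</div>
<p>The last two results matter when you use <code>na.rm = TRUE</code> on a group where every value is missing.</p>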
<p>In most cases, however, <code><a href="https://rdrr.io/r/base/any.html">any()</a></code> and <code><a href="https://rdrr.io/r/base/all.html">all()</a></code> are a little too crude, and it would be nice to be able to get a little more detail about how many values are <code>TRUE</code> or <code>FALSE</code>. That leads us to the numeric summaries.</p>
</section>
<section id="sec-numeric-summaries-of-logicals" data-type="sect2">
<h2>
Numeric summaries of logical vectors</h2>
<p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
summarize(
prop_delayed = mean(arr_delay &gt; 0, na.rm = TRUE),
.groups = "drop"
) |&gt;
ggplot(aes(x = prop_delayed)) +
geom_histogram(binwidth = 0.05)</pre>
<div class="cell-output-display">
<figure id="fig-prop-delayed-dist"><p><img src="logicals_files/figure-html/fig-prop-delayed-dist-1.png" alt="The distribution is unimodal and mildly right skewed. The distribution peaks around 30% delayed flights." width="576"/></p>
<figcaption>A histogram showing the proportion of delayed flights each day.</figcaption>
</figure>
</div>
</div>
<p>Or we could ask: “How many flights left before 5am?”, which are often flights that were delayed from the previous day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
summarize(
n_early = sum(dep_time &lt; 500, na.rm = TRUE),
.groups = "drop"
) |&gt;
arrange(desc(n_early))
#&gt; # A tibble: 365 × 4
#&gt; year month day n_early
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 6 28 32
#&gt; 2 2013 4 10 30
#&gt; 3 2013 7 28 30
#&gt; 4 2013 3 18 29
#&gt; 5 2013 7 7 29
#&gt; 6 2013 7 10 29
#&gt; # … with 359 more rows</pre>
</div>
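<p>To see the coercion in isolation, here is a minimal sketch on a made-up logical vector (<code>delayed</code> is a hypothetical name, not a column in <code>flights</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">delayed &lt;- c(TRUE, FALSE, TRUE, TRUE, NA)

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(delayed, na.rm = TRUE)
#&gt; [1] 3

# mean() gives the proportion of TRUEs among the non-missing values
mean(delayed, na.rm = TRUE)
#&gt; [1] 0.75

# Without na.rm = TRUE, a single missing value makes the result unknown
sum(delayed)
#&gt; [1] NA</pre>
</div>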
</section>
<section id="logical-subsetting" data-type="sect2">
<h2>
Logical subsetting</h2>
<p>There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base <code>[</code> (pronounced subset) operator, which you’ll learn more about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>.</p>
<p>Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights and then calculate the average delay:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
filter(arr_delay &gt; 0) |&gt;
group_by(year, month, day) |&gt;
summarize(
behind = mean(arr_delay),
n = n(),
.groups = "drop"
)
#&gt; # A tibble: 365 × 5
#&gt; year month day behind n
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013 1 1 32.5 461
#&gt; 2 2013 1 2 32.0 535
#&gt; 3 2013 1 3 27.7 460
#&gt; 4 2013 1 4 28.3 297
#&gt; 5 2013 1 5 22.6 238
#&gt; 6 2013 1 6 24.4 381
#&gt; # … with 359 more rows</pre>
</div>
<p>This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together<span data-type="footnote">We’ll cover this in <a href="#chp-joins" data-type="xref">#chp-joins</a>.</span>. Instead you could use <code>[</code> to perform an inline filtering: <code>arr_delay[arr_delay &gt; 0]</code> will yield only the positive arrival delays.</p>
<p>This leads to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(year, month, day) |&gt;
summarize(
behind = mean(arr_delay[arr_delay &gt; 0], na.rm = TRUE),
ahead = mean(arr_delay[arr_delay &lt; 0], na.rm = TRUE),
n = n(),
.groups = "drop"
)
#&gt; # A tibble: 365 × 6
#&gt; year month day behind ahead n
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013 1 1 32.5 -12.5 842
#&gt; 2 2013 1 2 32.0 -14.3 943
#&gt; 3 2013 1 3 27.7 -18.2 914
#&gt; 4 2013 1 4 28.3 -17.0 915
#&gt; 5 2013 1 5 22.6 -14.0 720
#&gt; 6 2013 1 6 24.4 -13.6 832
#&gt; # … with 359 more rows</pre>
</div>
<p>Also note the difference in the group size: in the first chunk <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the number of delayed flights per day; in the second, <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the total number of flights.</p>
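<p>To see what the inline filter does on its own, outside of <code>summarize()</code>, here is a small sketch with made-up delays (the values are illustrative, not taken from <code>flights</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Made-up arrival delays in minutes; negative means the flight arrived early
arr_delay &lt;- c(12, -5, 30, -20, NA, 0)

# [ keeps the elements where the logical vector is TRUE. An NA in the
# logical vector produces an NA in the result, which is why the summaries
# above still need na.rm = TRUE.
late &lt;- arr_delay[arr_delay &gt; 0]
late
#&gt; [1] 12 30 NA

mean(late, na.rm = TRUE)
#&gt; [1] 21</pre>
</div>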
</section>
<section id="logicals-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
<li>What does <code><a href="https://rdrr.io/r/base/prod.html">prod()</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? What does <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.</li>
</ol></section>
</section>
<section id="conditional-transformations" data-type="sect1">
<h1>
Conditional transformations</h1>
<p>One of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>.</p>
<section id="if_else" data-type="sect2">
<h2>
if_else()
</h2>
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector; the second, <code>true</code>, gives the output when the condition is true; and the third, <code>false</code>, gives the output when the condition is false.</p>
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(-3:3, NA)
if_else(x &gt; 0, "+ve", "-ve")
#&gt; [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA</pre>
</div>
<p>There’s an optional fourth argument, <code>missing</code>, which will be used if the input is <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">if_else(x &gt; 0, "+ve", "-ve", "???")
#&gt; [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>You can also use vectors for the <code>true</code> and <code>false</code> arguments. For example, this allows us to create a minimal implementation of <code><a href="https://rdrr.io/r/base/MathFun.html">abs()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">if_else(x &lt; 0, -x, x)
#&gt; [1] 3 2 1 0 1 2 3 NA</pre>
</div>
<p>So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 &lt;- c(NA, 1, 2, NA)
y1 &lt;- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
#&gt; [1] 3 1 2 6</pre>
</div>
<p>You might have noticed a small infelicity in our labeling example above: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">if_else(x == 0, "0", if_else(x &lt; 0, "-ve", "+ve"), "???")
#&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">dplyr::case_when()</a></code>.</p>
</section>
<section id="case_when" data-type="sect2">
<h2>
case_when()
</h2>
<p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
<p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(-3:3, NA)
case_when(
x == 0 ~ "0",
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve",
is.na(x) ~ "???"
)
#&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>This is more code, but it’s also more explicit.</p>
<p>To explain how <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> works, let’s explore some simpler cases. If none of the cases match, the output gets an <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve"
)
#&gt; [1] "-ve" "-ve" "-ve" NA "+ve" "+ve" "+ve" NA</pre>
</div>
<p>If you want to create a “default”/catch all value, use <code>TRUE</code> on the left hand side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve",
TRUE ~ "???"
)
#&gt; [1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>And note that if multiple conditions match, only the first will be used:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">case_when(
x &gt; 0 ~ "+ve",
x &gt; 2 ~ "big"
)
#&gt; [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre>
</div>
<p>Just like with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> you can use variables on both sides of the <code>~</code> and you can mix and match variables as needed for your problem. For example, we could use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> to provide some human readable labels for the arrival delay:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
mutate(
status = case_when(
is.na(arr_delay) ~ "cancelled",
arr_delay &lt; -30 ~ "very early",
arr_delay &lt; -15 ~ "early",
abs(arr_delay) &lt;= 15 ~ "on time",
arr_delay &lt; 60 ~ "late",
arr_delay &lt; Inf ~ "very late",
),
.keep = "used"
)
#&gt; # A tibble: 336,776 × 2
#&gt; arr_delay status
#&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 11 on time
#&gt; 2 20 late
#&gt; 3 33 late
#&gt; 4 -18 early
#&gt; 5 -25 early
#&gt; 6 12 on time
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Be wary when writing this sort of complex <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> statement; my first two attempts used a mix of <code>&lt;</code> and <code>&gt;</code> and I kept accidentally creating overlapping conditions.</p>
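<p>The following sketch illustrates how overlapping conditions can go wrong. It uses a made-up vector <code>delay</code> and simplified cutoffs (both are assumptions for illustration, not the code from the earlier attempts): because only the first matching condition is used, placing a broad condition first hides the more specific ones.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Made-up delays in minutes
delay &lt;- c(-40, -20, 5, 45, 90)

# Ordered from most to least specific: each value gets the intended label
ok &lt;- case_when(
  delay &lt; -30 ~ "very early",
  delay &lt; -15 ~ "early",
  delay &lt;= 15 ~ "on time",
  delay &lt; 60  ~ "late",
  TRUE        ~ "very late"
)
ok
#&gt; [1] "very early" "early"      "on time"    "late"       "very late"

# Same conditions with the broadest cutoff first: it matches every value
# below 60, so the more specific conditions after it are never reached
overlapping &lt;- case_when(
  delay &lt; 60  ~ "late",
  delay &lt; -30 ~ "very early",
  delay &lt; -15 ~ "early",
  delay &lt;= 15 ~ "on time",
  TRUE        ~ "very late"
)
overlapping
#&gt; [1] "late"      "late"      "late"      "late"      "very late"</pre>
</div>
<p>One way to guard against this is to write every condition with the same comparison operator and order the cutoffs from smallest to largest.</p>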
</section>
<section id="compatible-types" data-type="sect2">
<h2>
Compatible types</h2>
<p>Note that both <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> require <strong>compatible</strong> types in the output. If they’re not compatible, you’ll see errors like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">if_else(TRUE, "a", 1)
#&gt; Error in `if_else()`:
#&gt; ! Can't combine `true` &lt;character&gt; and `false` &lt;double&gt;.
case_when(
x &lt; -1 ~ TRUE,
x &gt; 0 ~ lubridate::now()
)
#&gt; Error in `case_when()`:
#&gt; ! Can't combine `TRUE` &lt;logical&gt; and `lubridate::now()` &lt;datetime&lt;local&gt;&gt;.</pre>
</div>
<p>Overall, relatively few types are compatible, because automatically converting one type of vector to another is a common source of errors. Here are the most important cases that are compatible:</p>
<ul><li>Numeric and logical vectors are compatible, as we discussed in <a href="#sec-numeric-summaries-of-logicals" data-type="xref">#sec-numeric-summaries-of-logicals</a>.</li>
<li>Strings and factors (<a href="#chp-factors" data-type="xref">#chp-factors</a>) are compatible, because you can think of a factor as a string with a restricted set of values.</li>
<li>Dates and date-times, which we’ll discuss in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>, are compatible because you can think of a date as a special case of date-time.</li>
<li>
<code>NA</code>, which is technically a logical vector, is compatible with everything because every vector has some way of representing a missing value.</li>
</ul><p>We don’t expect you to memorize these rules, but they should become second nature over time because they are applied consistently throughout the tidyverse.</p>
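<p>A few of the compatible combinations listed above can be tried directly. This is a sketch that assumes dplyr 1.1.0 or later, which is the version the error messages above appear to come from; the variable names are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

cond &lt;- c(TRUE, FALSE)

# Logical and numeric: the logical value is converted to a number
num &lt;- if_else(cond, TRUE, 2)
num
#&gt; [1] 1 2

# NA is compatible with everything, here with a character vector
chr &lt;- if_else(cond, "a", NA)
chr
#&gt; [1] "a" NA

# A factor and a string combine to a character vector
fct_chr &lt;- if_else(cond, factor("x"), "y")
fct_chr
#&gt; [1] "x" "y"</pre>
</div>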
</section>
</section>
<section id="logicals-summary" data-type="sect1">
<h1>
Summary</h1>
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions that allow you to return values depending on the value of a logical vector.</p>
<p>We’ll see logical vectors again and again in the following chapters. For example, in <a href="#chp-strings" data-type="xref">#chp-strings</a> you’ll learn about <code>str_detect(x, pattern)</code>, which returns a logical vector that’s <code>TRUE</code> for the elements of <code>x</code> that match the <code>pattern</code>, and in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.</p>
</section>
</section>


@ -1,334 +0,0 @@
<section data-type="chapter" id="chp-missing-values">
<h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1>
<section id="missing-values-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>You’ve already learned the basics of missing values earlier in the book. You first saw them in <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a> where they resulted in a warning when making a plot as well as in <a href="#sec-summarize" data-type="xref">#sec-summarize</a> where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in <a href="#sec-na-comparison" data-type="xref">#sec-na-comparison</a>. Now we’ll come back to them in more depth, so you can learn more of the details.</p>
<p>We’ll start by discussing some general tools for working with missing values recorded as <code>NA</code>s. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.</p>
<section id="missing-values-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="explicit-missing-values" data-type="sect1">
<h1>
Explicit missing values</h1>
<p>To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an <code>NA</code>.</p>
<section id="last-observation-carried-forward" data-type="sect2">
<h2>
Last observation carried forward</h2>
<p>A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">treatment &lt;- tribble(
~person, ~treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, NA,
"Katherine Burke", 1, 4
)</pre>
</div>
<p>You can fill in these missing values with <code><a href="https://tidyr.tidyverse.org/reference/fill.html">tidyr::fill()</a></code>. It works like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, taking a set of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">treatment |&gt;
fill(everything())
#&gt; # A tibble: 4 × 3
#&gt; person treatment response
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Derrick Whitmore 1 7
#&gt; 2 Derrick Whitmore 2 10
#&gt; 3 Derrick Whitmore 3 10
#&gt; 4 Katherine Burke 1 4</pre>
</div>
<p>This treatment is sometimes called “last observation carried forward”, or <strong>locf</strong> for short. You can use the <code>.direction</code> argument to fill in missing values that have been generated in more exotic ways.</p>
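<p>As one sketch of a more exotic case, suppose the label sits on the <em>last</em> row of each block instead of the first. The <code>scores</code> data below is hypothetical; <code>.direction = "up"</code> then fills each missing value from the next non-missing value below it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Hypothetical data where the group label closes each block of rows
scores &lt;- tribble(
  ~group, ~score,
  NA,      1,
  NA,      2,
  "a",     3,
  NA,      4,
  "b",     5
)

filled &lt;- scores |&gt;
  fill(group, .direction = "up")
filled$group
#&gt; [1] "a" "a" "a" "b" "b"</pre>
</div>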
</section>
<section id="fixed-values" data-type="sect2">
<h2>
Fixed values</h2>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">dplyr::coalesce()</a></code> to replace them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#&gt; [1] 1 4 5 7 0</pre>
</div>
<p>Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">readr::read_csv()</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="https://dplyr.tidyverse.org/reference/na_if.html">dplyr::na_if()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 4, 5, 7, -99)
na_if(x, -99)
#&gt; [1] 1 4 5 7 NA</pre>
</div>
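<p>For the first option, handling the sentinel value while reading, a minimal sketch might look like this. The inline CSV text and its column names are made up, and wrapping the string in <code>I()</code> to mark it as literal data assumes readr 2.0 or later:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(readr)

# I() marks the string as literal CSV data rather than a file path
csv &lt;- I("id,value\n1,4\n2,-99\n3,7\n")

# Treat -99 as missing, alongside the default "" and "NA"
df &lt;- read_csv(csv, na = c("", "NA", "-99"), show_col_types = FALSE)
df$value
#&gt; [1]  4 NA  7</pre>
</div>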
</section>
<section id="nan" data-type="sect2">
<h2>
NaN</h2>
<p>Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a <code>NaN</code> (pronounced “nan”), or <strong>n</strong>ot <strong>a</strong> <strong>n</strong>umber. It’s not that important to know about because it generally behaves just like <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(NA, NaN)
x * 10
#&gt; [1] NA NaN
x == 1
#&gt; [1] NA NA
is.na(x)
#&gt; [1] TRUE TRUE</pre>
</div>
<p>In the rare case you need to distinguish an <code>NA</code> from a <code>NaN</code>, you can use <code>is.nan(x)</code>.</p>
<p>You’ll generally encounter a <code>NaN</code> when you perform a mathematical operation that has an indeterminate result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">0 / 0
#&gt; [1] NaN
0 * Inf
#&gt; [1] NaN
Inf - Inf
#&gt; [1] NaN
sqrt(-1)
#&gt; Warning in sqrt(-1): NaNs produced
#&gt; [1] NaN</pre>
</div>
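<p>To make the distinction between <code>NA</code> and <code>NaN</code> concrete, here is a short sketch of <code>is.nan()</code> next to <code>is.na()</code> on a made-up vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(NA, NaN, 1)

# is.na() is TRUE for both kinds of missing value
is.na(x)
#&gt; [1]  TRUE  TRUE FALSE

# is.nan() is TRUE only for NaN
is.nan(x)
#&gt; [1] FALSE  TRUE FALSE

# Keep only the values that are missing but not NaN
only_na &lt;- x[is.na(x) &amp; !is.nan(x)]
only_na
#&gt; [1] NA</pre>
</div>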
</section>
</section>
<section id="sec-missing-implicit" data-type="sect1">
<h1>
Implicit missing values</h1>
<p>So far we’ve talked about missing values that are <strong>explicitly</strong> missing, i.e. you can see an <code>NA</code> in your data. But missing values can also be <strong>implicitly</strong> missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks &lt;- tibble(
year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)</pre>
</div>
<p>This dataset has two missing observations:</p>
<ul><li><p>The <code>price</code> in the fourth quarter of 2020 is explicitly missing, because its value is <code>NA</code>.</p></li>
<li><p>The <code>price</code> for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.</p></li>
</ul><p>One way to think about the difference is with this Zen-like koan:</p>
<blockquote class="blockquote">
<p>An explicit missing value is the presence of an absence.<br/></p>
<p>An implicit missing value is the absence of a presence.</p>
</blockquote>
<p>Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.</p>
<section id="pivoting" data-type="sect2">
<h2>
Pivoting</h2>
<p>You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot <code>stocks</code> to put <code>qtr</code> in the columns, both missing values become explicit:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
pivot_wider(
names_from = qtr,
values_from = price
)
#&gt; # A tibble: 2 × 5
#&gt; year `1` `2` `3` `4`
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2020 1.88 0.59 0.35 NA
#&gt; 2 2021 NA 0.92 0.17 2.66</pre>
</div>
<p>By default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting <code>values_drop_na = TRUE</code>. See the examples in <a href="#sec-tidy-data" data-type="xref">#sec-tidy-data</a> for more details.</p>
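<p>A sketch of the round trip, rebuilding <code>stocks</code> so the block is self-contained (the intermediate names <code>wide</code>, <code>long_all</code>, and <code>long_dropped</code> are made up for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

stocks &lt;- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

wide &lt;- stocks |&gt;
  pivot_wider(names_from = qtr, values_from = price)

# Default: both NAs are kept, giving 2 years x 4 quarters = 8 rows
long_all &lt;- wide |&gt;
  pivot_longer(!year, names_to = "qtr", values_to = "price")
nrow(long_all)
#&gt; [1] 8

# values_drop_na = TRUE makes every missing price implicit again
long_dropped &lt;- wide |&gt;
  pivot_longer(
    !year,
    names_to = "qtr",
    values_to = "price",
    values_drop_na = TRUE
  )
nrow(long_dropped)
#&gt; [1] 6</pre>
</div>
<p>Note that the second result has one row fewer than the original <code>stocks</code>: the explicit <code>NA</code> for the fourth quarter of 2020 is dropped along with the one that pivoting introduced.</p>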
</section>
<section id="complete" data-type="sect2">
<h2>
Complete</h2>
<p><code><a href="https://tidyr.tidyverse.org/reference/complete.html">tidyr::complete()</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
complete(year, qtr)
#&gt; # A tibble: 8 × 3
#&gt; year qtr price
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2020 1 1.88
#&gt; 2 2020 2 0.59
#&gt; 3 2020 3 0.35
#&gt; 4 2020 4 NA
#&gt; 5 2021 1 NA
#&gt; 6 2021 2 0.92
#&gt; # … with 2 more rows</pre>
</div>
<p>Typically, you’ll call <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stocks |&gt;
complete(year = 2019:2021, qtr)
#&gt; # A tibble: 12 × 3
#&gt; year qtr price
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2019 1 NA
#&gt; 2 2019 2 NA
#&gt; 3 2019 3 NA
#&gt; 4 2019 4 NA
#&gt; 5 2020 1 1.88
#&gt; 6 2020 2 0.59
#&gt; # … with 6 more rows</pre>
</div>
<p>If the range of a variable is correct, but not all values are present, you could use <code>full_seq(x, 1)</code> to generate all values from <code>min(x)</code> to <code>max(x)</code> spaced out by 1.</p>
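<p>For instance, with a made-up vector of years in which 2019 never appears, <code>full_seq()</code> fills the gap (a small sketch; the data is hypothetical):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)

# Made-up years: the range is right, but 2019 is absent
year &lt;- c(2018, 2020, 2021)

all_years &lt;- full_seq(year, 1)
all_years
#&gt; [1] 2018 2019 2020 2021</pre>
</div>
<p>Inside <code>complete()</code> this could be written as <code>complete(year = full_seq(year, 1), qtr)</code>.</p>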
<p>In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">dplyr::full_join()</a></code>.</p>
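<p>A minimal sketch of that manual approach, using the <code>stocks</code> data from above. The <code>expected</code> data frame is a hypothetical stand-in built with <code>expand_grid()</code>; in practice it could be constructed in any way that suits your problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

stocks &lt;- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Hypothetical rule: every quarter of 2020 and 2021 should be present
expected &lt;- expand_grid(year = c(2020, 2021), qtr = c(1, 2, 3, 4))

# Rows of `expected` without a match in `stocks` get an NA price
completed &lt;- expected |&gt;
  full_join(stocks, by = join_by(year, qtr))

nrow(completed)
#&gt; [1] 8
sum(is.na(completed$price))
#&gt; [1] 2</pre>
</div>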
</section>
<section id="missing-values-joins" data-type="sect2">
<h2>
Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(nycflights13)
flights |&gt;
distinct(faa = dest) |&gt;
anti_join(airports)
#&gt; Joining with `by = join_by(faa)`
#&gt; # A tibble: 4 × 1
#&gt; faa
#&gt; &lt;chr&gt;
#&gt; 1 BQN
#&gt; 2 SJU
#&gt; 3 STT
#&gt; 4 PSE
flights |&gt;
distinct(tailnum) |&gt;
anti_join(planes)
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 722 × 1
#&gt; tailnum
#&gt; &lt;chr&gt;
#&gt; 1 N3ALAA
#&gt; 2 N3DUAA
#&gt; 3 N542MQ
#&gt; 4 N730MQ
#&gt; 5 N9EAMQ
#&gt; 6 N532UA
#&gt; # … with 716 more rows</pre>
</div>
</section>
<section id="missing-values-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Can you find any relationship between the carrier and the rows that appear to be missing from <code>planes</code>?</li>
</ol></section>
</section>
<section id="factors-and-empty-groups" data-type="sect1">
<h1>
Factors and empty groups</h1>
<p>A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health &lt;- tibble(
name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
age = c(34L, 88L, 75L, 47L, 56L),
)</pre>
</div>
<p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt; count(smoker)
#&gt; # A tibble: 1 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 no 5</pre>
</div>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can ask <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt; count(smoker, .drop = FALSE)
#&gt; # A tibble: 2 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 yes 0
#&gt; 2 no 5</pre>
</div>
<p>The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying <code>drop = FALSE</code> to the appropriate discrete axis:</p>
<div>
<pre data-type="programlisting" data-code-language="r">ggplot(health, aes(x = smoker)) +
geom_bar() +
scale_x_discrete()
ggplot(health, aes(x = smoker)) +
geom_bar() +
scale_x_discrete(drop = FALSE)</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="missing-values_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A bar chart with a single value on the x-axis, &quot;no&quot;." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="missing-values_files/figure-html/unnamed-chunk-17-2.png" class="img-fluid" alt="The same bar chart as the last plot, but now with two values on the x-axis, &quot;yes&quot; and &quot;no&quot;. There is no bar for the &quot;yes&quot; category." width="288"/></p>
</div>
</div>
</div>
</div>
<p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt;
group_by(smoker, .drop = FALSE) |&gt;
summarize(
n = n(),
mean_age = mean(age),
min_age = min(age),
max_age = max(age),
sd_age = sd(age)
)
#&gt; Warning: There were 2 warnings in `summarize()`.
#&gt; The first warning was:
#&gt; In argument: `min_age = min(age)`.
#&gt; In group 1: `smoker = yes`.
#&gt; Caused by warning in `min()`:
#&gt; ! no non-missing arguments to min; returning Inf
#&gt; Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#&gt; # A tibble: 2 × 6
#&gt; smoker n mean_age min_age max_age sd_age
#&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 yes 0 NaN Inf -Inf NA
#&gt; 2 no 5 60 34 88 21.6</pre>
</div>
<p>We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A vector containing two missing values
x1 &lt;- c(NA, NA)
length(x1)
#&gt; [1] 2
# A vector containing nothing
x2 &lt;- numeric()
length(x2)
#&gt; [1] 0</pre>
</div>
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
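<p>These identities can be checked directly. The sketch below uses an empty vector <code>x</code> and a made-up vector <code>y</code> standing in for the new data; because <code>min()</code> and <code>max()</code> warn on empty input, the warnings are suppressed here:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- numeric()   # an empty vector
y &lt;- c(3, 8)     # made-up "new data"

# What mean(x) computes for an empty vector: 0 / 0
sum(x) / length(x)
#&gt; [1] NaN

# min() and max() of an empty vector
min_x &lt;- suppressWarnings(min(x))
max_x &lt;- suppressWarnings(max(x))
min_x
#&gt; [1] Inf
max_x
#&gt; [1] -Inf

# Combining with real data gives the min and max of that data
min(min_x, min(y))
#&gt; [1] 3
max(max_x, max(y))
#&gt; [1] 8</pre>
</div>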
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">health |&gt;
group_by(smoker) |&gt;
summarize(
n = n(),
mean_age = mean(age),
min_age = min(age),
max_age = max(age),
sd_age = sd(age)
) |&gt;
complete(smoker)
#&gt; # A tibble: 2 × 6
#&gt; smoker n mean_age min_age max_age sd_age
#&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 yes NA NA NA NA NA
#&gt; 2 no 5 60 34 88 21.6</pre>
</div>
<p>The main drawback of this approach is that you get an <code>NA</code> for the count, even though you know that it should be zero.</p>
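<p>One possible workaround, sketched here rather than prescribed, is the <code>fill</code> argument of <code>complete()</code>, which supplies a replacement value per column for the rows it creates. The summary is shortened to two columns to keep the example small:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

health &lt;- tibble(
  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age    = c(34L, 88L, 75L, 47L, 56L)
)

counts &lt;- health |&gt;
  group_by(smoker) |&gt;
  summarize(n = n(), mean_age = mean(age)) |&gt;
  # Fill the count with 0 for newly created rows; mean_age stays NA
  complete(smoker, fill = list(n = 0L))

counts$n
#&gt; [1] 0 5</pre>
</div>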
</section>
<section id="missing-values-summary" data-type="sect1">
<h1>
Summary</h1>
<p>Missing values are weird! Sometimes they're recorded as an explicit <code>NA</code> but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values and for uncovering implicit missing values, and has discussed some of the ways that implicit can become explicit and vice versa.</p>
<p>The next chapter is the final one in this part of the book: joins. This is a bit of a change from the chapters so far because we're going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.</p>
</section>
</section>


@ -1,905 +0,0 @@
<section data-type="chapter" id="chp-numbers">
<h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1>
<section id="numbers-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Numeric vectors are the backbone of data science, and you've already used them a bunch of times earlier in the book. Now it's time to systematically survey what you can do with them in R, ensuring that you're well situated to tackle any future problem involving numeric vectors.</p>
<p>We'll start by giving you a couple of tools to make numbers if you have strings, and then go into a little more detail on <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we'll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We'll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
<section id="numbers-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/dplyr")</code>.</p>
</div>
</div>
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we'll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we'll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="making-numbers" data-type="sect1">
<h1>
Making numbers</h1>
<p>In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double. In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or because something has gone wrong in your data import process.</p>
<p>readr provides two useful functions for parsing strings into numbers: <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code>. Use <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> when you have numbers that have been written as strings:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("1.2", "5.6", "1e3")
parse_double(x)
#&gt; [1] 1.2 5.6 1000.0</pre>
</div>
<p>Use <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("$1,234", "USD 3,513", "59%")
parse_number(x)
#&gt; [1] 1234 3513 59</pre>
</div>
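<p>To see why a dedicated parser helps, it can be useful to compare with base R's <code>as.numeric()</code>. The sketch below uses only base R: it handles plain numeric strings, but returns <code>NA</code> (with a warning, silenced here) as soon as a string contains currency symbols, grouping commas, or a percent sign.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("$1,234", "USD 3,513", "59%")

# Base R cannot interpret the extra text, so every value becomes NA
messy &lt;- suppressWarnings(as.numeric(x))
messy
#&gt; [1] NA NA NA

# Plain numeric strings, including scientific notation, convert cleanly
clean &lt;- as.numeric(c("1.2", "5.6", "1e3"))
clean
#&gt; [1] 1.2 5.6 1000.0</pre>
</div>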
</section>
<section id="counts" data-type="sect1">
<h1>
Counts</h1>
<p>It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. This function is great for quick exploration and checks during analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(dest)
#&gt; # A tibble: 105 × 2
#&gt; dest n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ABQ 254
#&gt; 2 ACK 265
#&gt; 3 ALB 439
#&gt; 4 ANC 8
#&gt; 5 ATL 17215
#&gt; 6 AUS 2439
#&gt; # … with 99 more rows</pre>
</div>
<p>(Despite the advice in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, we usually put <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> on a single line because it's usually used at the console for a quick check that a calculation is working as expected.)</p>
<p>If you want to see the most common values, add <code>sort = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(dest, sort = TRUE)
#&gt; # A tibble: 105 × 2
#&gt; dest n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ORD 17283
#&gt; 2 ATL 17215
#&gt; 3 LAX 16174
#&gt; 4 BOS 15508
#&gt; 5 MCO 14082
#&gt; 6 CLT 14064
#&gt; # … with 99 more rows</pre>
</div>
<p>And remember that if you want to see all the values, you can use <code>|&gt; View()</code> or <code>|&gt; print(n = Inf)</code>.</p>
<p>You can perform the same computation “by hand” with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>. This is useful because it allows you to compute other summaries at the same time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(dest) |&gt;
  summarize(
    n = n(),
    delay = mean(arr_delay, na.rm = TRUE)
  )
#&gt; # A tibble: 105 × 3
#&gt; dest n delay
#&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 ABQ 254 4.38
#&gt; 2 ACK 265 4.85
#&gt; 3 ALB 439 14.4
#&gt; 4 ANC 8 -2.5
#&gt; 5 ATL 17215 11.3
#&gt; 6 AUS 2439 6.02
#&gt; # … with 99 more rows</pre>
</div>
<p><code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> is a special summary function that doesn't take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">n()
#&gt; Error in `n()`:
#&gt; ! Must only be used inside data-masking verbs like `mutate()`,
#&gt; `filter()`, and `group_by()`.</pre>
</div>
<p>There are a couple of variants of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> that you might find useful:</p>
<ul><li>
<p><code>n_distinct(x)</code> counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(dest) |&gt;
  summarize(carriers = n_distinct(carrier)) |&gt;
  arrange(desc(carriers))
#&gt; # A tibble: 105 × 2
#&gt; dest carriers
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ATL 7
#&gt; 2 BOS 7
#&gt; 3 CLT 7
#&gt; 4 ORD 7
#&gt; 5 TPA 7
#&gt; 6 AUS 6
#&gt; # … with 99 more rows</pre>
</div>
</li>
<li>
<p>A weighted count is a sum. For example, you could “count” the number of miles each plane flew:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(tailnum) |&gt;
  summarize(miles = sum(distance))
#&gt; # A tibble: 4,044 × 2
#&gt; tailnum miles
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 D942DN 3418
#&gt; 2 N0EGMQ 250866
#&gt; 3 N10156 115966
#&gt; 4 N102UW 25722
#&gt; 5 N103US 24619
#&gt; 6 N104UW 25157
#&gt; # … with 4,038 more rows</pre>
</div>
<p>Weighted counts are a common problem so <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> has a <code>wt</code> argument that does the same thing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(tailnum, wt = distance)</pre>
</div>
</li>
<li>
<p>You can count missing values by combining <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>. In the <code>flights</code> dataset this represents flights that are cancelled:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(dest) |&gt;
  summarize(n_cancelled = sum(is.na(dep_time)))
#&gt; # A tibble: 105 × 2
#&gt; dest n_cancelled
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ABQ 0
#&gt; 2 ACK 0
#&gt; 3 ALB 20
#&gt; 4 ANC 0
#&gt; 5 ATL 317
#&gt; 6 AUS 21
#&gt; # … with 99 more rows</pre>
</div>
</li>
</ul>
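<p>The variants above can all be combined in a single <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> call. The following is a minimal sketch on a made-up tibble called <code>trips</code> (the data and the name are hypothetical, chosen only so that the results are easy to check by hand):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Toy data: five scheduled flights, one of which never departed
trips &lt;- tibble(
  dest     = c("ATL", "ATL", "ATL", "BOS", "BOS"),
  carrier  = c("DL", "DL", "FL", "B6", "B6"),
  distance = c(762, 762, 762, 187, 187),
  dep_time = c(600, NA, 930, 715, 1245)
)

by_dest &lt;- trips |&gt;
  group_by(dest) |&gt;
  summarize(
    n = n(),                            # rows per destination: 3 and 2
    carriers = n_distinct(carrier),     # distinct carriers: 2 and 1
    miles = sum(distance),              # weighted count: 2286 and 374
    n_cancelled = sum(is.na(dep_time))  # missing departure times: 1 and 0
  )
by_dest

# count() with wt gives the same weighted count in a column called n
weighted &lt;- trips |&gt; count(dest, wt = distance)
weighted</pre>
</div>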
<section id="numbers-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How can you use <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to count the number of rows with a missing value for a given variable?</li>
<li>Expand the following calls to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to instead use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>:
<ol type="1"><li><p><code>flights |&gt; count(dest, sort = TRUE)</code></p></li>
<li><p><code>flights |&gt; count(tailnum, wt = distance)</code></p></li>
</ol></li>
</ol></section>
</section>
<section id="numeric-transformations" data-type="sect1">
<h1>
Numeric transformations</h1>
<p>Transformation functions work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It's impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don't list them here because they're rarely needed for data science.</p>
<section id="sec-recycling" data-type="sect2">
<h2>
Arithmetic and recycling rules</h2>
<p>We introduced the basics of arithmetic (<code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>^</code>) in <a href="#chp-workflow-basics" data-type="xref">#chp-workflow-basics</a> and have used them a bunch since. These functions don't need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the <strong>recycling rules</strong> which determine what happens when the left and right hand sides have different lengths. This is important for operations like <code>flights |&gt; mutate(air_time = air_time / 60)</code> because there are 336,776 numbers on the left of <code>/</code> but only one on the right.</p>
<p>R handles mismatched lengths by <strong>recycling,</strong> or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 10, 20)
x / 5
#&gt; [1] 0.2 0.4 2.0 4.0
# is shorthand for
x / c(5, 5, 5, 5)
#&gt; [1] 0.2 0.4 2.0 4.0</pre>
</div>
<p>Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn't a multiple of the shorter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x * c(1, 2)
#&gt; [1] 1 4 10 40
x * c(1, 2, 3)
#&gt; Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
#&gt; object length
#&gt; [1] 1 4 30 20</pre>
</div>
<p>These recycling rules are also applied to logical comparisons (<code>==</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code>) and can lead to a surprising result if you accidentally use <code>==</code> instead of <code>%in%</code> and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  filter(month == c(1, 2))
#&gt; # A tibble: 25,977 × 19
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 542 540 2 923 850
#&gt; 3 2013 1 1 554 600 -6 812 837
#&gt; 4 2013 1 1 555 600 -5 913 854
#&gt; 5 2013 1 1 557 600 -3 838 846
#&gt; 6 2013 1 1 558 600 -2 849 851
#&gt; # … with 25,971 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
<p>The code runs without error, but it doesn't return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there's no warning because <code>flights</code> has an even number of rows.</p>
<p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn't help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
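<p>The difference between <code>==</code> and <code>%in%</code> is easier to see on a short vector. The sketch below uses a made-up <code>month</code> vector rather than the real <code>flights</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">month &lt;- c(1, 1, 2, 2, 3, 1)

# == recycles c(1, 2) to c(1, 2, 1, 2, 1, 2) and compares position by position
recycled &lt;- month == c(1, 2)
recycled
#&gt; [1] TRUE FALSE FALSE TRUE FALSE FALSE

# %in% asks whether each month appears anywhere in c(1, 2)
matched &lt;- month %in% c(1, 2)
matched
#&gt; [1] TRUE TRUE TRUE TRUE FALSE TRUE</pre>
</div>
<p>Applied to the example above, the version that does what was intended is <code>flights |&gt; filter(month %in% c(1, 2))</code>.</p>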
</section>
<section id="minimum-and-maximum" data-type="sect2">
<h2>
Minimum and maximum</h2>
<p>The arithmetic functions work with pairs of variables. Two closely related functions are <code><a href="https://rdrr.io/r/base/Extremes.html">pmin()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">pmax()</a></code>, which when given two or more variables will return the smallest or largest value in each row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
  ~x, ~y,
  1,  3,
  5,  2,
  7, NA,
)
df |&gt;
  mutate(
    min = pmin(x, y, na.rm = TRUE),
    max = pmax(x, y, na.rm = TRUE)
  )
#&gt; # A tibble: 3 × 4
#&gt; x y min max
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 3 1 3
#&gt; 2 5 2 2 5
#&gt; 3 7 NA 7 7</pre>
</div>
<p>Note that these are different to the summary functions <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> which take multiple observations and return a single value. You can tell that you've used the wrong form when all the minimums and all the maximums have the same value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  mutate(
    min = min(x, y, na.rm = TRUE),
    max = max(x, y, na.rm = TRUE)
  )
#&gt; # A tibble: 3 × 4
#&gt; x y min max
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 3 1 7
#&gt; 2 5 2 1 7
#&gt; 3 7 NA 1 7</pre>
</div>
</section>
<section id="modular-arithmetic" data-type="sect2">
<h2>
Modular arithmetic</h2>
<p>Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder. In R, <code>%/%</code> does integer division and <code>%%</code> computes the remainder:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">1:10 %/% 3
#&gt; [1] 0 0 1 1 1 2 2 2 3 3
1:10 %% 3
#&gt; [1] 1 2 0 1 2 0 1 2 0 1</pre>
</div>
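<p>The two operators are designed to fit together: multiplying the integer quotient by the divisor and adding the remainder recovers the original value. The short check below also shows how R treats negative numbers, where <code>%/%</code> rounds down and <code>%%</code> takes the sign of the divisor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(7, -7)

quotient &lt;- x %/% 3
quotient
#&gt; [1] 2 -3

remainder &lt;- x %% 3
remainder
#&gt; [1] 1 2

# The two pieces always reassemble into the original value
quotient * 3 + remainder
#&gt; [1] 7 -7</pre>
</div>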
<p>Modular arithmetic is handy for the flights dataset, because we can use it to unpack the <code>sched_dep_time</code> variable into <code>hour</code> and <code>minute</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  mutate(
    hour = sched_dep_time %/% 100,
    minute = sched_dep_time %% 100,
    .keep = "used"
  )
#&gt; # A tibble: 336,776 × 3
#&gt; sched_dep_time hour minute
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 515 5 15
#&gt; 2 529 5 29
#&gt; 3 540 5 40
#&gt; 4 545 5 45
#&gt; 5 600 6 0
#&gt; 6 558 5 58
#&gt; # … with 336,770 more rows</pre>
</div>
<p>We can combine that with the <code>mean(is.na(x))</code> trick from <a href="#sec-logical-summaries" data-type="xref">#sec-logical-summaries</a> to see how the proportion of cancelled flights varies over the course of the day. The results are shown in <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(hour = sched_dep_time %/% 100) |&gt;
  summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |&gt;
  filter(hour &gt; 1) |&gt;
  ggplot(aes(x = hour, y = prop_cancelled)) +
  geom_line(color = "grey50") +
  geom_point(aes(size = n))</pre>
<div class="cell-output-display">
<figure id="fig-prop-cancelled"><p><img src="numbers_files/figure-html/fig-prop-cancelled-1.png" alt="A line plot showing how proportion of cancelled flights changes over the course of the day. The proportion starts low at around 0.5% at 6am, then steadily increases over the course of the day until peaking at 4% at 7pm. The proportion of cancelled flights then drops rapidly getting down to around 1% by midnight." width="576"/></p>
<figcaption>A line plot with scheduled departure hour on the x-axis, and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.</figcaption>
</figure>
</div>
</div>
</section>
<section id="logarithms" data-type="sect2">
<h2>
Logarithms</h2>
<p>Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert exponential growth to linear growth. For example, take compounding interest — the amount of money you have at <code>year + 1</code> is the amount of money you had at <code>year</code> multiplied by the interest rate. That gives a formula like <code>money = starting * interest ^ year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">starting &lt;- 100
interest &lt;- 1.05
money &lt;- tibble(
  year = 1:50,
  money = starting * interest ^ year
)</pre>
</div>
<p>If you plot this data, you'll get an exponential curve showing how your money grows year by year at an interest rate of 1.05:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(x = year, y = money)) +
  geom_line()</pre>
<div class="cell-output-display">
<p><img src="numbers_files/figure-html/unnamed-chunk-22-1.png" width="576"/></p>
</div>
</div>
<p>Log transforming the y-axis gives a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(money, aes(x = year, y = money)) +
  geom_line() +
  scale_y_log10()</pre>
<div class="cell-output-display">
<p><img src="numbers_files/figure-html/unnamed-chunk-23-1.png" width="576"/></p>
</div>
</div>
<p>This is a straight line because a little algebra reveals that <code>log10(money) = log10(interest) * year + log10(starting)</code>, which matches the pattern for a line, <code>y = m * x + b</code>. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.</p>
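<p>You can confirm the algebra numerically. The sketch below rebuilds the same series as plain vectors and checks that, on the log scale, every year adds the same amount (the slope, <code>log10(interest)</code>) and that the intercept is <code>log10(starting)</code>. The names <code>steps</code> and <code>intercept</code> are introduced here only for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">starting &lt;- 100
interest &lt;- 1.05
year &lt;- 1:50
money &lt;- starting * interest ^ year

# Slope: the year-over-year change in log10(money) is constant
steps &lt;- diff(log10(money))
all.equal(steps, rep(log10(interest), length(steps)))
#&gt; [1] TRUE

# Intercept: removing the slope term leaves log10(starting), which is 2
intercept &lt;- log10(money) - log10(interest) * year
all.equal(intercept, rep(log10(starting), length(year)))
#&gt; [1] TRUE</pre>
</div>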
<p>If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> (the natural log, base e), <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> (base 2), and <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> (base 10). We recommend using <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code>. <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> is easy to back-transform because (e.g.) 3 is 10^3 = 1000.</p>
<p>The inverse of <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> is <code><a href="https://rdrr.io/r/base/Log.html">exp()</a></code>; to compute the inverse of <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> you'll need to use <code>2^</code> or <code>10^</code>.</p>
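<p>A short round trip illustrates each inverse. Because of floating point arithmetic the results are compared with <code>all.equal()</code> rather than <code>==</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 10, 1000)

# Each logarithm is undone by raising its base to the result
all.equal(exp(log(x)), x)
#&gt; [1] TRUE
all.equal(2 ^ log2(x), x)
#&gt; [1] TRUE
all.equal(10 ^ log10(x), x)
#&gt; [1] TRUE

# A difference of 1 on the log2 scale is a doubling on the original scale
log2(8) - log2(4)
#&gt; [1] 1</pre>
</div>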
</section>
<section id="sec-rounding" data-type="sect2">
<h2>
Rounding</h2>
<p>Use <code>round(x)</code> to round a number to the nearest integer:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">round(123.456)
#&gt; [1] 123</pre>
</div>
<p>You can control the precision of the rounding with the second argument, <code>digits</code>. <code>round(x, digits)</code> rounds to the nearest <code>10^-digits</code>, so <code>digits = 2</code> will round to the nearest 0.01. This definition is useful because it implies <code>round(x, -3)</code> will round to the nearest thousand, which indeed it does:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">round(123.456, 2) # two digits
#&gt; [1] 123.46
round(123.456, 1) # one digit
#&gt; [1] 123.5
round(123.456, -1) # round to nearest ten
#&gt; [1] 120
round(123.456, -2) # round to nearest hundred
#&gt; [1] 100</pre>
</div>
<p>There's one weirdness with <code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> that seems surprising at first glance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">round(c(1.5, 2.5))
#&gt; [1] 2 2</pre>
</div>
<p><code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> uses what's known as “round half to even” or banker's rounding: if a number is half way between two integers, it will be rounded to the <strong>even</strong> integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.</p>
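<p>A slightly longer vector makes the pattern easier to see. For comparison, the sketch also defines <code>round_half_up()</code>, a hypothetical helper that is not part of base R and that always rounds halves up (towards positive infinity):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(0.5, 1.5, 2.5, 3.5)

# Halves go to the nearest even integer: two round down, two round up
round(x)
#&gt; [1] 0 2 2 4

# Hypothetical helper (not in base R): always round halves up
round_half_up &lt;- function(x) floor(x + 0.5)
round_half_up(x)
#&gt; [1] 1 2 3 4</pre>
</div>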
<p><code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> is paired with <code><a href="https://rdrr.io/r/base/Round.html">floor()</a></code> which always rounds down and <code><a href="https://rdrr.io/r/base/Round.html">ceiling()</a></code> which always rounds up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- 123.456
floor(x)
#&gt; [1] 123
ceiling(x)
#&gt; [1] 124</pre>
</div>
<p>These functions don't have a <code>digits</code> argument, so you can instead scale down, round, and then scale back up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Round down to nearest two digits
floor(x / 0.01) * 0.01
#&gt; [1] 123.45
# Round up to nearest two digits
ceiling(x / 0.01) * 0.01
#&gt; [1] 123.46</pre>
</div>
<p>You can use the same technique if you want to <code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> to a multiple of some other number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Round to nearest multiple of 4
round(x / 4) * 4
#&gt; [1] 124
# Round to nearest 0.25
round(x / 0.25) * 0.25
#&gt; [1] 123.5</pre>
</div>
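<p>If you use this pattern often, you could wrap it in a small function. The sketch below is one possible way to do that; <code>round_to()</code> and its arguments <code>accuracy</code> and <code>f</code> are hypothetical names, not part of base R or the tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: scale down by `accuracy`, apply the rounding
# function `f`, then scale back up
round_to &lt;- function(x, accuracy, f = round) {
  f(x / accuracy) * accuracy
}

x &lt;- 123.456
round_to(x, 4)
#&gt; [1] 124
round_to(x, 0.25)
#&gt; [1] 123.5
round_to(x, 10, f = floor)
#&gt; [1] 120
round_to(x, 10, f = ceiling)
#&gt; [1] 130</pre>
</div>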
</section>
<section id="cutting-numbers-into-ranges" data-type="sect2">
<h2>
Cutting numbers into ranges</h2>
<p>Use <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code><span data-type="footnote">ggplot2 provides some helpers for common cases in <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_interval()</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>. ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.</span> to break up a numeric vector into discrete buckets:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20))
#&gt; [1] (0,5] (0,5] (0,5] (5,10] (10,15] (15,20]
#&gt; Levels: (0,5] (5,10] (10,15] (15,20]</pre>
</div>
<p>The breaks don't need to be evenly spaced:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cut(x, breaks = c(0, 5, 10, 100))
#&gt; [1] (0,5] (0,5] (0,5] (5,10] (10,100] (10,100]
#&gt; Levels: (0,5] (5,10] (10,100]</pre>
</div>
<p>You can optionally supply your own <code>labels</code>. Note that there should be one fewer <code>labels</code> than <code>breaks</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cut(x,
  breaks = c(0, 5, 10, 15, 20),
  labels = c("sm", "md", "lg", "xl")
)
#&gt; [1] sm sm sm md lg xl
#&gt; Levels: sm md lg xl</pre>
</div>
<p>Any values outside of the range of the breaks will become <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">y &lt;- c(NA, -10, 5, 10, 30)
cut(y, breaks = c(0, 5, 10, 15, 20))
#&gt; [1] &lt;NA&gt; &lt;NA&gt; (0,5] (5,10] &lt;NA&gt;
#&gt; Levels: (0,5] (5,10] (10,15] (15,20]</pre>
</div>
<p>See the documentation for other useful arguments like <code>right</code> and <code>include.lowest</code>, which control if the intervals are <code>[a, b)</code> or <code>(a, b]</code> and if the lowest interval should be <code>[a, b]</code>.</p>
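<p>As a brief illustration of those two arguments, the sketch below reuses the earlier values of <code>x</code> and stores the breaks in a variable called <code>breaks</code> (a name chosen here for convenience):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 5, 10, 15, 20)
breaks &lt;- c(0, 5, 10, 15, 20)

# right = FALSE closes intervals on the left, so 5 moves into [5,10)
# and 20 falls outside the last interval, [15,20)
left_closed &lt;- cut(x, breaks = breaks, right = FALSE)
left_closed
#&gt; [1] [0,5) [0,5) [5,10) [10,15) [15,20) &lt;NA&gt;
#&gt; Levels: [0,5) [5,10) [10,15) [15,20)

# Adding include.lowest = TRUE closes the outermost interval, here [15,20]
both_ends &lt;- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
both_ends
#&gt; [1] [0,5) [0,5) [5,10) [10,15) [15,20] [15,20]
#&gt; Levels: [0,5) [5,10) [10,15) [15,20]</pre>
</div>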
</section>
<section id="sec-cumulative-and-rolling-aggregates" data-type="sect2">
<h2>
Cumulative and rolling aggregates</h2>
<p>Base R provides <code><a href="https://rdrr.io/r/base/cumsum.html">cumsum()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cumprod()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummin()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummax()</a></code> for running, or cumulative, sums, products, mins and maxes. dplyr provides <code><a href="https://dplyr.tidyverse.org/reference/cumall.html">cummean()</a></code> for cumulative means. Cumulative sums tend to come up the most in practice:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- 1:10
cumsum(x)
#&gt; [1] 1 3 6 10 15 21 28 36 45 55</pre>
</div>
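<p>The other cumulative functions follow the same pattern. The sketch below uses a vector that is not sorted, which makes the running minimum and maximum easier to follow. It sticks to base R, computing the cumulative mean as the running sum divided by the number of values seen so far.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(3, 1, 4, 1, 5)

cumsum(x)
#&gt; [1] 3 4 8 9 14
cumprod(x)
#&gt; [1] 3 3 12 12 60
cummin(x)
#&gt; [1] 3 1 1 1 1
cummax(x)
#&gt; [1] 3 3 4 4 5

# Cumulative mean: running sum divided by the count so far
cumsum(x) / seq_along(x)
#&gt; [1] 3.000000 2.000000 2.666667 2.250000 2.800000</pre>
</div>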
<p>If you need more complex rolling or sliding aggregates, try the <a href="https://davisvaughan.github.io/slider/">slider</a> package by Davis Vaughan. The following example illustrates some of its features.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(slider)
# Same as a cumulative sum
slide_vec(x, sum, .before = Inf)
#&gt; [1] 1 3 6 10 15 21 28 36 45 55
# Sum the current element and the one before it
slide_vec(x, sum, .before = 1)
#&gt; [1] 1 3 5 7 9 11 13 15 17 19
# Sum the current element and the two before and after it
slide_vec(x, sum, .before = 2, .after = 2)
#&gt; [1] 6 10 15 20 25 30 35 40 34 27
# Only compute if the window is complete
slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
#&gt; [1] NA NA 15 20 25 30 35 40 NA NA</pre>
</div>
</section>
<section id="numbers-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain in words what each line of the code used to generate <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a> does.</p></li>
<li><p>What trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?</p></li>
<li>
<p>Currently <code>dep_time</code> and <code>sched_dep_time</code> are convenient to look at, but hard to compute with because they're not really continuous numbers. You can see the basic problem in this plot: there's a gap between each hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  filter(month == 1, day == 1) |&gt;
  ggplot(aes(x = sched_dep_time, y = dep_delay)) +
  geom_point()
#&gt; Warning: Removed 4 rows containing missing values (`geom_point()`).</pre>
<div class="cell-output-display">
<p><img src="numbers_files/figure-html/unnamed-chunk-36-1.png" width="576"/></p>
</div>
</div>
<p>Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).</p>
</li>
</ol></section>
</section>
<section id="general-transformations" data-type="sect1">
<h1>
General transformations</h1>
<p>The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.</p>
<section id="ranks" data-type="sect2">
<h2>
Ranks</h2>
<p>dplyr provides a number of ranking functions inspired by SQL, but you should always start with <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::min_rank()</a></code>. It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 2, 3, 4, NA)
min_rank(x)
#&gt; [1] 1 2 2 4 5 NA</pre>
</div>
<p>Note that the smallest values get the lowest ranks; use <code>desc(x)</code> to give the largest values the smallest ranks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">min_rank(desc(x))
#&gt; [1] 5 3 3 2 1 NA</pre>
</div>
<p>If <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code> doesn't do what you need, look at the variants <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::row_number()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::dense_rank()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/percent_rank.html">dplyr::percent_rank()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/percent_rank.html">dplyr::cume_dist()</a></code>. See the documentation for details.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = x)
df |&gt;
  mutate(
    row_number = row_number(x),
    dense_rank = dense_rank(x),
    percent_rank = percent_rank(x),
    cume_dist = cume_dist(x)
  )
#&gt; # A tibble: 6 × 5
#&gt; x row_number dense_rank percent_rank cume_dist
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 1 1 0 0.2
#&gt; 2 2 2 2 0.25 0.6
#&gt; 3 2 3 2 0.25 0.6
#&gt; 4 3 4 3 0.75 0.8
#&gt; 5 4 5 4 1 1
#&gt; 6 NA NA NA NA NA</pre>
</div>
<p>You can achieve many of the same results by picking the appropriate <code>ties.method</code> argument to base R's <code><a href="https://rdrr.io/r/base/rank.html">rank()</a></code>; you'll probably also want to set <code>na.last = "keep"</code> to keep <code>NA</code>s as <code>NA</code>.</p>
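<p>For instance, here is a sketch of how three of the <code>ties.method</code> options in base R treat the tied values in the same vector used above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 2, 3, 4, NA)

# Ties share the smallest rank, matching min_rank()
rank(x, ties.method = "min", na.last = "keep")
#&gt; [1] 1 2 2 4 5 NA

# Ties are broken by position, matching row_number()
rank(x, ties.method = "first", na.last = "keep")
#&gt; [1] 1 2 3 4 5 NA

# The default, "average", gives tied values the mean of their ranks
rank(x, na.last = "keep")
#&gt; [1] 1.0 2.5 2.5 4.0 5.0 NA</pre>
</div>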
<p><code><a href="https://dplyr.tidyverse.org/reference/row_number.html">row_number()</a></code> can also be used without any arguments when inside a dplyr verb. In this case, it'll give the number of the “current” row. When combined with <code>%%</code> or <code>%/%</code> this can be a useful tool for dividing data into similarly sized groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = runif(10))
df |&gt;
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = row0 %/% 3,
  )
#&gt; # A tibble: 10 × 4
#&gt; x row0 three_groups three_in_each_group
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.0808 0 0 0
#&gt; 2 0.834 1 1 0
#&gt; 3 0.601 2 2 0
#&gt; 4 0.157 3 0 1
#&gt; 5 0.00740 4 1 1
#&gt; 6 0.466 5 2 1
#&gt; # … with 4 more rows</pre>
</div>
</section>
<section id="offsets" data-type="sect2">
<h2>
Offsets</h2>
<p><code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">dplyr::lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">dplyr::lag()</a></code> allow you to refer to the values just after or just before the “current” value. They return a vector of the same length as the input, padded with <code>NA</code>s at the end or start, respectively:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(2, 5, 11, 11, 19, 35)
lag(x)
#&gt; [1] NA 2 5 11 11 19
lead(x)
#&gt; [1] 5 11 11 19 35 NA</pre>
</div>
<ul><li>
<p><code>x - lag(x)</code> gives you the difference between the current and previous value.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x - lag(x)
#&gt; [1] NA 3 6 0 8 16</pre>
</div>
</li>
<li>
<p><code>x == lag(x)</code> tells you whether the current value is the same as the previous one; it is <code>FALSE</code> wherever the value changes.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x == lag(x)
#&gt; [1] NA FALSE FALSE TRUE FALSE FALSE</pre>
</div>
</li>
</ul><p>You can lead or lag by more than one position by using the second argument, <code>n</code>.</p>
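<p>A brief sketch using the same <code>x</code> as above. The <code>default</code> argument shown at the end replaces the padding <code>NA</code>s with a value of your choice:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- c(2, 5, 11, 11, 19, 35)

# Look two positions back or two positions ahead
lag(x, n = 2)
#&gt; [1] NA NA  2  5 11 11
lead(x, n = 2)
#&gt; [1] 11 11 19 35 NA NA

# Pad with 0 instead of NA
lag(x, n = 2, default = 0)
#&gt; [1]  0  0  2  5 11 11</pre>
</div>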
</section>
<section id="consecutive-identifiers" data-type="sect2">
<h2>
Consecutive identifiers</h2>
<p>Sometimes you want to start a new group every time some event occurs. For example, when you’re looking at website data, it’s common to want to break up events into sessions, where you begin a new session after a gap of more than x minutes since the last activity.</p>
<p>For example, imagine you have the times when someone visited a website:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">events &lt;- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)</pre>
</div>
<p>And you’ve computed the time between each event, and figured out if there’s a gap that’s big enough to qualify:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">events &lt;- events |&gt;
  mutate(
    diff = time - lag(time, default = first(time)),
    gap = diff &gt;= 5
  )
events
#&gt; # A tibble: 14 × 3
#&gt; time diff gap
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt;
#&gt; 1 0 0 FALSE
#&gt; 2 1 1 FALSE
#&gt; 3 2 1 FALSE
#&gt; 4 3 1 FALSE
#&gt; 5 5 2 FALSE
#&gt; 6 10 5 TRUE
#&gt; # … with 8 more rows</pre>
</div>
<p>But how do we go from that logical vector to something that we can <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>? <code><a href="https://rdrr.io/r/base/cumsum.html">cumsum()</a></code> from <a href="#sec-cumulative-and-rolling-aggregates" data-type="xref">#sec-cumulative-and-rolling-aggregates</a> comes to the rescue: each gap, i.e. each row where <code>gap</code> is <code>TRUE</code>, increments <code>group</code> by one (see <a href="#sec-numeric-summaries-of-logicals" data-type="xref">#sec-numeric-summaries-of-logicals</a> on the numerical interpretation of logicals):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">events |&gt; mutate(
  group = cumsum(gap)
)
#&gt; # A tibble: 14 × 4
#&gt; time diff gap group
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;int&gt;
#&gt; 1 0 0 FALSE 0
#&gt; 2 1 1 FALSE 0
#&gt; 3 2 1 FALSE 0
#&gt; 4 3 1 FALSE 0
#&gt; 5 5 2 FALSE 0
#&gt; 6 10 5 TRUE 1
#&gt; # … with 8 more rows</pre>
</div>
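<p>Once each row carries a <code>group</code>, you can summarize one session per row. The following is a minimal sketch that repeats the steps above in a single pipeline; the summary column names (<code>start</code>, <code>end</code>, <code>n_events</code>) are our own illustrative choices:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

events &lt;- tibble(
  time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27, 28, 30)
)

sessions &lt;- events |&gt;
  mutate(
    diff = time - lag(time, default = first(time)),
    gap = diff &gt;= 5,
    group = cumsum(gap)   # each TRUE starts a new session
  ) |&gt;
  group_by(group) |&gt;
  summarize(
    start = min(time),
    end = max(time),
    n_events = n()
  )
sessions</pre>
</div>
<p>With these times you get three sessions, starting at 0, 10, and 27 and containing 5, 6, and 3 events.</p>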
<p>Another approach for creating grouping variables is <code><a href="https://dplyr.tidyverse.org/reference/consecutive_id.html">consecutive_id()</a></code>, which starts a new group every time one of its arguments changes. For example, inspired by <a href="https://stackoverflow.com/questions/27482712">this Stack Overflow question</a>, imagine you have a data frame with a bunch of repeated values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
  x = c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b"),
  y = c(1, 2, 3, 2, 4, 1, 3, 9, 4, 8, 10, 199)
)
df
#&gt; # A tibble: 12 × 2
#&gt; x y
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 a 1
#&gt; 2 a 2
#&gt; 3 a 3
#&gt; 4 b 2
#&gt; 5 c 4
#&gt; 6 c 1
#&gt; # … with 6 more rows</pre>
</div>
<p>You want to keep the first row from each repeated <code>x</code>. That’s easier to express with a combination of <code><a href="https://dplyr.tidyverse.org/reference/consecutive_id.html">consecutive_id()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_head()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
  group_by(id = consecutive_id(x)) |&gt;
  slice_head(n = 1)
#&gt; # A tibble: 7 × 3
#&gt; # Groups: id [7]
#&gt; x y id
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 a 1 1
#&gt; 2 b 2 2
#&gt; 3 c 4 3
#&gt; 4 d 3 4
#&gt; 5 e 9 5
#&gt; 6 a 4 6
#&gt; # … with 1 more row</pre>
</div>
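<p>To see what <code>consecutive_id()</code> produces on its own, here is a minimal sketch with a made-up vector (not part of the data above):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

runs &lt;- c("a", "a", "b", "a")

# The id increases every time the value differs from the one before it
consecutive_id(runs)
#&gt; [1] 1 1 2 3</pre>
</div>
<p>Note that the second run of <code>"a"</code> gets its own id; grouping by the values themselves would instead put all three <code>"a"</code>s in one group.</p>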
</section>
<section id="numbers-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code>.</p></li>
<li><p>Which plane (<code>tailnum</code>) has the worst on-time record?</p></li>
<li><p>What time of day should you fly if you want to avoid delays as much as possible?</p></li>
<li><p>What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number() &lt; 4)</code> do? What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number(dep_delay) &lt; 4)</code> do?</p></li>
<li><p>For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.</p></li>
<li>
<p>Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>, explore how the average flight delay for an hour is related to the average delay for the previous hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  mutate(hour = dep_time %/% 100) |&gt;
  group_by(year, month, day, hour) |&gt;
  summarize(
    dep_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |&gt;
  filter(n &gt; 5)</pre>
</div>
</li>
<li><p>Look at each destination. Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)? Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?</p></li>
<li><p>Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.</p></li>
</ol></section>
</section>
<section id="numeric-summaries" data-type="sect1">
<h1>
Numeric summaries</h1>
<p>Just using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.</p>
<section id="center" data-type="sect2">
<h2>
Center</h2>
<p>So far, we’ve mostly used <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.</p>
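<p>A tiny made-up vector (our own, not taken from <code>flights</code>) illustrates how much a single extreme value can pull the mean while leaving the median untouched:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">delays &lt;- c(1, 2, 3, 4, 100)

# The one large value drags the mean far above most of the data
mean(delays)
#&gt; [1] 22

# The median only depends on the middle value
median(delays)
#&gt; [1] 3</pre>
</div>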
<p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean and the median departure delay for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(year, month, day) |&gt;
  summarize(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |&gt;
  ggplot(aes(x = mean, y = median)) +
  geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) +
  geom_point()</pre>
<div class="cell-output-display">
<figure id="fig-mean-vs-median"><p><img src="numbers_files/figure-html/fig-mean-vs-median-1.png" alt="All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55." width="576"/></p>
<figcaption>A scatterplot showing the differences of summarizing daily departure delay with median instead of mean.</figcaption>
</figure>
</div>
</div>
<p>You might also wonder about the <strong>mode</strong>, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn’t work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there’s no mode function included in base R<span data-type="footnote">The <code><a href="https://rdrr.io/r/base/mode.html">mode()</a></code> function does something quite different!</span>.</p>
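<p>If you do need the most common value, one possible approach is to count with <code>table()</code>. This is only a sketch on a made-up vector, and it also shows the tie problem described above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(1, 2, 2, 3, 3)

# mode() reports how the vector is stored, not its most common value
mode(x)
#&gt; [1] "numeric"

# Count each distinct value and keep those with the highest count
counts &lt;- table(x)
names(counts)[counts == max(counts)]
#&gt; [1] "2" "3"</pre>
</div>
<p>Note that the result is a character vector (the names of the table), and that two values tie for most common here.</p>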
</section>
<section id="sec-min-max-summary" data-type="sect2">
<h2>
Minimum, maximum, and quantiles</h2>
<p>What if you’re interested in locations other than the center? <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> will give you the smallest and largest values. Another powerful tool is <code><a href="https://rdrr.io/r/stats/quantile.html">quantile()</a></code> which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find the value that’s greater than 95% of the values.</p>
<p>For the <code>flights</code> data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(year, month, day) |&gt;
  summarize(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = "drop"
  )
#&gt; # A tibble: 365 × 5
#&gt; year month day max q95
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 853 70.1
#&gt; 2 2013 1 2 379 85
#&gt; 3 2013 1 3 291 68
#&gt; 4 2013 1 4 288 60
#&gt; 5 2013 1 5 327 41
#&gt; 6 2013 1 6 202 51
#&gt; # … with 359 more rows</pre>
</div>
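<p>To check how <code>quantile()</code> behaves on a vector where the answers are easy to verify by eye, here is a small sketch (<code>0:100</code> is our own toy input, not the <code>flights</code> data, and <code>names = FALSE</code> just drops the percentage labels from the result):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- 0:100

# Several quantiles at once
quantile(x, c(0.25, 0.5, 0.95), names = FALSE)
#&gt; [1] 25 50 95

# The 50% quantile is the median
quantile(x, 0.5, names = FALSE) == median(x)
#&gt; [1] TRUE</pre>
</div>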
</section>
<section id="spread" data-type="sect2">
<h2>
Spread</h2>
<p>Sometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, <code>sd(x)</code>, and the inter-quartile range, <code><a href="https://rdrr.io/r/stats/IQR.html">IQR()</a></code>. We won’t explain <code><a href="https://rdrr.io/r/stats/sd.html">sd()</a></code> here since you’re probably already familiar with it, but <code><a href="https://rdrr.io/r/stats/IQR.html">IQR()</a></code> might be new: it’s <code>quantile(x, 0.75) - quantile(x, 0.25)</code> and gives you the range that contains the middle 50% of the data.</p>
<p>We can use this to reveal a small oddity in the <code>flights</code> data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it look like one airport, <a href="https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport">EGE</a>, might have moved.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(origin, dest) |&gt;
  summarize(
    distance_iqr = IQR(distance),
    n = n(),
    .groups = "drop"
  ) |&gt;
  filter(distance_iqr &gt; 0)
#&gt; # A tibble: 2 × 4
#&gt; origin dest distance_iqr n
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 EWR EGE 1 110
#&gt; 2 JFK EGE 1 103</pre>
</div>
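<p>To confirm the definition of <code>IQR()</code> given at the start of this section, the following sketch computes it both ways on a small made-up vector (again using <code>names = FALSE</code> to drop the percentage labels):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(2, 5, 11, 11, 19, 35)

IQR(x)
#&gt; [1] 10.5

# The same value from the two quartiles directly
quantile(x, 0.75, names = FALSE) - quantile(x, 0.25, names = FALSE)
#&gt; [1] 10.5</pre>
</div>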
</section>
<section id="distributions" data-type="sect2">
<h2>
Distributions</h2>
<p>It’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.</p>
<p><a href="#fig-flights-dist" data-type="xref">#fig-flights-dist</a> shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.</p>
<div>
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  ggplot(aes(x = dep_delay)) +
  geom_histogram(binwidth = 15)
#&gt; Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).
flights |&gt;
  filter(dep_delay &lt; 120) |&gt;
  ggplot(aes(x = dep_delay)) +
  geom_histogram(binwidth = 5)</pre>
<div id="fig-flights-dist" class="cell quarto-layout-panel">
<figure class="figure"><div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell quarto-layout-cell-subref" style="flex-basis: 50.0%;justify-content: center;">
<figure id="fig-flights-dist-1"><p><img src="numbers_files/figure-html/fig-flights-dist-1.png" alt="Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that. " data-ref-parent="fig-flights-dist" width="384"/></p>
<figcaption>(a) Histogram shows the full range of delays.</figcaption>
</figure>
</div>
<div class="cell-output-display quarto-layout-cell quarto-layout-cell-subref" style="flex-basis: 50.0%;justify-content: center;">
<figure id="fig-flights-dist-2"><p><img src="numbers_files/figure-html/fig-flights-dist-2.png" alt="Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that. " data-ref-parent="fig-flights-dist" width="384"/></p>
<figcaption>(b) Histogram is zoomed in to show delays less than 2 hours.</figcaption>
</figure>
</div>
</div>
<p/><figcaption class="figure-caption">Figure 15.3: The distribution of <code>dep_delay</code> appears highly skewed to the right in both histograms.</figcaption><p/>
</figure></div>
</div>
<p>It’s also a good idea to check that distributions for subgroups resemble the whole. <a href="#fig-flights-dist-daily" data-type="xref">#fig-flights-dist-daily</a> overlays a frequency polygon for each day. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  filter(dep_delay &lt; 120) |&gt;
  ggplot(aes(x = dep_delay, group = interaction(day, month))) +
  geom_freqpoly(binwidth = 5, alpha = 1/5)</pre>
<div class="cell-output-display">
<figure id="fig-flights-dist-daily"><p><img src="numbers_files/figure-html/fig-flights-dist-daily-1.png" alt="The distribution of `dep_delay` is highly right skewed with a strong peak slightly less than 0. The 365 frequency polygons are mostly overlapping forming a thick black band." width="576"/></p>
<figcaption>365 frequency polygons of <code>dep_delay</code>, one for each day. The frequency polygons appear to have the same shape, suggesting that it’s reasonable to compare days by looking at just a few summary statistics.</figcaption>
</figure>
</div>
</div>
<p>Don’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs. the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in <a href="#sec-sample-size" data-type="xref">#sec-sample-size</a>: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.</p>
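<p>As one possible shape for such a custom summary, the sketch below summarizes early and late departures separately and keeps the counts alongside. The <code>delays</code> tibble and its values are hypothetical stand-ins for <code>flights</code>, and the column names are our own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical delays in minutes; negative values are early departures
delays &lt;- tibble(dep_delay = c(-5, -3, -1, 0, 2, 10, 45, 180, NA))

delay_summary &lt;- delays |&gt;
  summarize(
    n = n(),
    n_early = sum(dep_delay &lt; 0, na.rm = TRUE),
    mean_early = mean(dep_delay[dep_delay &lt; 0], na.rm = TRUE),
    n_late = sum(dep_delay &gt; 0, na.rm = TRUE),
    median_late = median(dep_delay[dep_delay &gt; 0], na.rm = TRUE)
  )
delay_summary</pre>
</div>
<p>For these made-up values, the 3 early departures leave 3 minutes early on average, while the 4 late departures have a median delay of 27.5 minutes.</p>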
</section>
<section id="positions" data-type="sect2">
<h2>
Positions</h2>
<p>There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position. You can do this with the base R <code>[</code> function, but we’re not going to cover it in detail until <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>, because it’s a very powerful and general function. For now we’ll introduce three specialized functions that you can use to extract values at a specified position: <code>first(x)</code>, <code>last(x)</code>, and <code>nth(x, n)</code>.</p>
<p>For example, we can find the first, fifth, and last departure for each day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(year, month, day) |&gt;
  summarize(
    first_dep = first(dep_time),
    fifth_dep = nth(dep_time, 5),
    last_dep = last(dep_time)
  )
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using
#&gt; the `.groups` argument.
#&gt; # A tibble: 365 × 6
#&gt; # Groups: year, month [12]
#&gt; year month day first_dep fifth_dep last_dep
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 554 NA
#&gt; 2 2013 1 2 42 535 NA
#&gt; 3 2013 1 3 32 520 NA
#&gt; 4 2013 1 4 25 531 NA
#&gt; 5 2013 1 5 14 534 NA
#&gt; 6 2013 1 6 16 555 NA
#&gt; # … with 359 more rows</pre>
</div>
<p>(These functions currently lack an <code>na.rm</code> argument but will hopefully be fixed by the time you read this book: <a href="https://github.com/tidyverse/dplyr/issues/6242" class="uri">https://github.com/tidyverse/dplyr/issues/6242</a>).</p>
<p>If you’re familiar with <code>[</code>, you might wonder if you ever need these functions. There are two main reasons: the <code>default</code> argument and the <code>order_by</code> argument. <code>default</code> allows you to set a default value that’s used if the requested position doesn’t exist, e.g. you’re trying to get the 3rd element from a two-element group. <code>order_by</code> lets you locally override the existing ordering of the rows, so you can get the element at a given position in the ordering defined by <code><a href="https://dplyr.tidyverse.org/reference/order_by.html">order_by()</a></code>.</p>
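<p>A short sketch of both arguments, using made-up vectors (<code>name</code> and <code>score</code> are hypothetical names chosen for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

x &lt;- c(10, 20)

# Position 3 doesn't exist, so you get a missing value ...
nth(x, 3)
#&gt; [1] NA

# ... unless you supply a default
nth(x, 3, default = 0)
#&gt; [1] 0

# order_by: take the first and last `name` when sorted by `score`
name &lt;- c("b", "a", "c")
score &lt;- c(2, 1, 3)
first(name, order_by = score)
#&gt; [1] "a"
last(name, order_by = score)
#&gt; [1] "c"</pre>
</div>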
<p>Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
  group_by(year, month, day) |&gt;
  mutate(r = min_rank(desc(sched_dep_time))) |&gt;
  filter(r %in% c(1, max(r)))
#&gt; # A tibble: 1,195 × 20
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 515 2 830 819
#&gt; 2 2013 1 1 2353 2359 -6 425 445
#&gt; 3 2013 1 1 2353 2359 -6 418 442
#&gt; 4 2013 1 1 2356 2359 -3 425 437
#&gt; 5 2013 1 2 42 2359 43 518 442
#&gt; 6 2013 1 2 458 500 -2 703 650
#&gt; # … with 1,189 more rows, and 12 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, …</pre>
</div>
</section>
<section id="with-mutate" data-type="sect2">
<h2>
With mutate()
</h2>
<p>As the names suggest, the summary functions are typically paired with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, particularly when you want to do some sort of group standardization. For example:</p>
<ul><li>
<code>x / sum(x)</code> calculates the proportion of a total.</li>
<li>
<code>(x - mean(x)) / sd(x)</code> computes a Z-score (standardized to mean 0 and sd 1).</li>
<li>
<code>x / first(x)</code> computes an index based on the first observation.</li>
</ul></section>
<section id="numbers-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:</p>
<ul><li>A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.</li>
<li>A flight is always 10 minutes late.</li>
<li>A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.</li>
<li>99% of the time a flight is on time. 1% of the time it’s 2 hours late.</li>
</ul><p>Which do you think is more important: arrival delay or departure delay?</p>
</li>
<li><p>Which destinations show the greatest variation in air speed?</p></li>
<li><p>Create a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations?</p></li>
</ol></section>
</section>
<section id="numbers-summary" data-type="sect1">
<h1>
Summary</h1>
<p>You’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.</p>
<p>Over the next two chapters, we’ll dive into working with strings with the stringr package. Strings are a big topic so they get two chapters, one on the fundamentals of strings and one on regular expressions.</p>
</section>
</section>

@ -1,9 +0,0 @@
<section data-type="chapter" id="chp-preface-2e">
<h1>Preface to the second edition</h1><p>Welcome to the second edition of “R for Data Science”! This is a major reworking of the first edition, removing material we no longer think is useful, adding material we wish we’d included in the first edition, and generally updating the text and code to reflect changes in best practices. We’re also very excited to welcome a new co-author: Mine Çetinkaya-Rundel, a noted data science educator and one of our colleagues at Posit (the company formerly known as RStudio).</p><p>A brief summary of the biggest changes follows:</p><ul><li><p>The first part of the book has been renamed to “Whole game”. The goal of this section is to give you the rough details of the “whole game” of data science before we dive into the details.</p></li>
<li><p>The second part of the book is “Visualize”. This part gives data visualization tools and best practices a more thorough coverage compared to the first edition. The best place to get all the details is still the <a href="http://ggplot2-book.org/">ggplot2 book</a>, but now R4DS covers more of the most important techniques.</p></li>
<li><p>The third part of the book is now called “Transform” and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room to cover all the details.</p></li>
<li><p>The fourth part of the book is called “Import”. It’s a new set of chapters that goes beyond reading flat text files to working with spreadsheets, getting data out of databases, working with big data, rectangling hierarchical data, and scraping data from web sites.</p></li>
<li><p>The “Program” part remains, but has been rewritten from top-to-bottom to focus on the most important parts of function writing and iteration. Function writing now includes details on how to wrap tidyverse functions (dealing with the challenges of tidy evaluation), since this has become much easier and more important over the last few years. We’ve added a new chapter on important base R functions that you’re likely to see in wild-caught R code.</p></li>
<li><p>The modeling part has been removed. We never had enough room to fully do modelling justice, and there are now much better resources available. We generally recommend using the <a href="https://www.tidymodels.org/">tidymodels</a> packages and reading <a href="https://www.tmwr.org/">Tidy Modeling with R</a> by Max Kuhn and Julia Silge.</p></li>
<li><p>The communicate part remains, but has been thoroughly updated to feature Quarto instead of R Markdown. This edition of the book has been written in Quarto, and it’s clearly the tool of the future.</p></li>
</ul></section>

@ -1,12 +0,0 @@
<div data-type="part">
<h1><span id="sec-program-intro" class="quarto-section-identifier d-none d-lg-block">Program</span></h1><p>In this part of the book, you’ll improve your programming skills. Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-program"><p><img src="diagrams/data-science/program.png" alt="Our model of the data science process with program (import, tidy, transform, visualize, model, and communicate, i.e. everything) highlighted in blue." width="535"/></p>
<figcaption>Figure 1: Programming is the water in which all the other components swim.</figcaption>
</figure>
</div>
</div><p>Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.</p><p>In the following three chapters, you’ll learn skills to improve your programming:</p><ol type="1"><li><p>Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in <a href="#chp-functions" data-type="xref">#chp-functions</a>, you’ll learn how to write <strong>functions</strong> which let you extract out repeated code so that it can be easily reused.</p></li>
<li><p>Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for <strong>iteration</strong> that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p></li>
<li><p>As you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In <a href="#chp-base-R" data-type="xref">#chp-base-R</a>, you’ll learn some of the most important base R functions that you’ll see in the wild.</p></li>
</ol><p>The goal of these chapters is to teach you the minimum about programming that you need for data science. Once you have mastered the material here, we strongly recommend that you continue to invest in your programming skills. We’ve written two books that you might find helpful. <a href="https://rstudio-education.github.io/hopr/"><em>Hands on Programming with R</em></a>, by Garrett Grolemund, is an introduction to R as a programming language and is a great place to start if R is your first programming language. <a href="https://adv-r.hadley.nz/"><em>Advanced R</em></a> by Hadley Wickham dives into the details of R the programming language; it’s a great place to start if you have existing programming experience and a great next step once you’ve internalized the ideas in these chapters.</p></div>

@ -1,281 +0,0 @@
<section data-type="chapter" id="chp-quarto-formats">
<h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1>
<section id="quarto-formats-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.</p>
<p>There are two ways to set the output of a document:</p>
<ol type="1"><li>
<p>Permanently, by modifying the YAML header:</p>
<pre data-type="programlisting" data-code-language="yaml">title: "Diamond sizes"
format: html</pre>
</li>
<li>
<p>Transiently, by calling <code>quarto::quarto_render()</code> by hand:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = "docx")</pre>
</div>
<p>This is useful if you want to programmatically produce multiple types of output since the <code>output_format</code> argument can also take a list of values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = c("docx", "pdf"))</pre>
</div>
</li>
</ol></section>
<section id="output-options" data-type="sect1">
<h1>
Output options</h1>
<p>Quarto offers a wide range of output formats. You can find the complete list at <a href="https://quarto.org/docs/output-formats/all-formats.html" class="uri">https://quarto.org/docs/output-formats/all-formats.html</a>. Many formats share some output options (e.g., <code>toc: true</code> for including a table of contents), but others have options that are format specific (e.g., <code>code-fold: true</code> collapses code chunks into a <code>&lt;details&gt;</code> tag for HTML output so the user can display it on demand; it’s not applicable in a PDF or Word document).</p>
<p>To override the default options, you need to use an expanded <code>format</code> field. For example, if you wanted to render an <code>html</code> with a floating table of contents, you’d use:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    toc: true
    toc_float: true</pre>
<p>You can even render to multiple outputs by supplying a list of formats:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    toc: true
    toc_float: true
  pdf: default
  docx: default</pre>
<p>Note the special syntax (<code>pdf: default</code>) if you don’t want to override any default options.</p>
<p>To render to all formats specified in the YAML of a document, you can use <code>output_format = "all"</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">quarto::quarto_render("diamond-sizes.qmd", output_format = "all")</pre>
</div>
</section>
<section id="documents" data-type="sect1">
<h1>
Documents</h1>
<p>The previous chapter focused on the default <code>html</code> output. There are several basic variations on that theme, generating different types of documents. For example:</p>
<ul><li><p><code>pdf</code> makes a PDF with LaTeX (an open-source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.</p></li>
<li><p><code>docx</code> for Microsoft Word (<code>.docx</code>) documents.</p></li>
<li><p><code>odt</code> for OpenDocument Text (<code>.odt</code>) documents.</p></li>
<li><p><code>rtf</code> for Rich Text Format (<code>.rtf</code>) documents.</p></li>
<li><p><code>gfm</code> for a GitHub Flavored Markdown (<code>.md</code>) document.</p></li>
<li><p><code>ipynb</code> for Jupyter Notebooks (<code>.ipynb</code>).</p></li>
</ul><p>Remember, when generating a document to share with decision-makers, you can turn off the default display of code by setting global options in document YAML:</p>
<pre data-type="programlisting" data-code-language="yaml">execute:
  echo: false</pre>
<p>For <code>html</code> documents another option is to make the code chunks hidden by default, but visible with a click:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    code-fold: true</pre>
</section>
<section id="presentations" data-type="sect1">
<h1>
Presentations</h1>
<p>You can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (<code>##</code>) level header. Additionally, first (<code>#</code>) level headers indicate the beginning of a new section with a section title slide that is, by default, centered in the middle.</p>
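<p>To make the heading rules concrete, here is a minimal sketch that writes a small presentation source file from R. The slide titles, placeholder text, and file name are invented for illustration, and the render call is left commented out because it assumes the <strong>quarto</strong> R package and the Quarto CLI are available.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Illustrative slide source; titles, text, and file name are made up for this sketch.
slides &lt;- c(
  "---",
  "title: \"Diamond sizes\"",
  "format: revealjs",
  "---",
  "",
  "# Overview",             # first-level header: section title slide
  "",
  "## Carat",               # second-level header: starts a new slide
  "",
  "Text for the first slide.",
  "",
  "## Price",               # second-level header: starts another slide
  "",
  "Text for the second slide."
)

# Write the source to a temporary directory so nothing in the project is touched
path &lt;- file.path(tempdir(), "diamond-slides.qmd")
writeLines(slides, path)

# Count the lines that start a new content slide
n_slides &lt;- sum(grepl("^## ", slides))

# Rendering needs the quarto package and CLI, so it is left commented out:
# quarto::quarto_render(path)</pre>
</div>
<p>Switching <code>format: revealjs</code> to <code>pptx</code> or <code>beamer</code> in the header should be enough to target one of the other presentation formats listed below.</p>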
<p>Quarto supports a variety of presentation formats, including:</p>
<ol type="1"><li><p><code>revealjs</code> - HTML presentation with revealjs</p></li>
<li><p><code>pptx</code> - PowerPoint presentation</p></li>
<li><p><code>beamer</code> - PDF presentation with LaTeX Beamer.</p></li>
</ol><p>You can read more about creating presentations with Quarto at <a href="https://quarto.org/docs/presentations/">https://quarto.org/docs/presentations</a>.</p>
</section>
<section id="dashboards" data-type="sect1">
<h1>
Dashboards</h1>
<p>Dashboards are a useful way to communicate information visually and quickly. A dashboard-like look can be achieved with Quarto using document layout options like sidebars, tabsets, multi-column layouts, etc.</p>
<p>For example, you can produce this dashboard:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/quarto-dashboard.png" class="img-fluid" alt="Quarto dashboard with the title &quot;Diamonds dashboard&quot;. The first tab shows four plots of the diamonds dataset. The second tab shows summary statistics for price and carat of diamonds. The third tab shows an interactive data table of the first 100 diamonds." width="540"/></p>
</div>
</div>
<p>Using this code:</p>
<div class="cell">
<pre><code>---
title: "💍 Diamonds dashboard"
format: html
execute:
  echo: false
---
```{r}
#| label: setup
#| include: false
library(tidyverse)
library(gt)
```
::: panel-tabset
## Plots
```{r}
#| layout: [[30,-5, 30, -5, 30], [100]]
ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.1)
ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500)
ggplot(diamonds, aes(x = cut, color = cut)) + geom_bar()
ggplot(diamonds, aes(x = carat, y = price, color = cut)) + geom_point()
```
## Summaries
```{r}
diamonds |&gt;
  select(price, carat, cut) |&gt;
  group_by(cut) |&gt;
  summarize(
    across(where(is.numeric), list(mean = mean, median = median, sd = sd, IQR = IQR))
  ) |&gt;
  pivot_longer(cols = -cut) |&gt;
  pivot_wider(names_from = cut, values_from = value) |&gt;
  separate(name, into = c("var", "stat")) |&gt;
  mutate(
    var = str_to_title(var),
    stat = str_to_title(stat),
    stat = if_else(stat == "Iqr", "IQR", stat)
  ) |&gt;
  group_by(var) |&gt;
  gt() |&gt;
  fmt_currency(columns = -stat, rows = 1:4, decimals = 0) |&gt;
  fmt_number(columns = -stat, rows = 5:8) |&gt;
  cols_align(columns = -stat, align = "center") |&gt;
  cols_label(stat = "")
```
## Data
```{r}
diamonds |&gt;
  arrange(desc(carat)) |&gt;
  slice_head(n = 100) |&gt;
  select(price, carat, cut) |&gt;
  DT::datatable()
```
:::</code></pre>
</div>
<p>To learn more about Quarto component layouts, visit <a href="https://quarto.org/docs/interactive/layout.html" class="uri">https://quarto.org/docs/interactive/layout.html</a>.</p>
</section>
<section id="interactivity" data-type="sect1">
<h1>
Interactivity</h1>
<p>Any HTML document can contain interactive components.</p>
<section id="htmlwidgets" data-type="sect2">
<h2>
htmlwidgets</h2>
<p>HTML is an interactive format, and you can take advantage of that interactivity with <strong>htmlwidgets</strong>, R functions that produce interactive HTML visualizations. For example, take the <strong>leaflet</strong> map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(leaflet)
leaflet() |&gt;
  setView(174.764, -36.877, zoom = 16) |&gt;
  addTiles() |&gt;
  addMarkers(174.764, -36.877, popup = "Maungawhau")</pre>
<div class="cell-output-display">
<div class="leaflet html-widget html-fill-item-overflow-hidden html-fill-item" id="htmlwidget-ac96cb3ee4656e2e9ec3" style="width:100%;height:433px;"/>
<script type="application/json" data-for="htmlwidget-ac96cb3ee4656e2e9ec3"><![CDATA[{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"setView":[[-36.877,174.764],16,[]],"calls":[{"method":"addTiles","args":["https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"&copy; <a href=\"https://openstreetmap.org\">OpenStreetMap<\/a> contributors, <a href=\"https://creativecommons.org/licenses/by-sa/2.0/\">CC-BY-SA<\/a>"}]},{"method":"addMarkers","args":[-36.877,174.764,null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},"Maungawhau",null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]}],"limits":{"lat":[-36.877,-36.877],"lng":[174.764,174.764]}},"evals":[],"jsHooks":[]}]]></script></div>
</div>
<p>The great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about them.</p>
<p>There are many packages that provide htmlwidgets, including:</p>
<ul><li><p><strong>dygraphs</strong>, <a href="https://rstudio.github.io/dygraphs/" class="uri">https://rstudio.github.io/dygraphs</a>, for interactive time series visualizations.</p></li>
<li><p><strong>DT</strong>, <a href="https://rstudio.github.io/DT" class="uri">https://rstudio.github.io/DT/</a>, for interactive tables.</p></li>
<li><p><strong>threejs</strong>, <a href="https://bwlewis.github.io/rthreejs/" class="uri">https://bwlewis.github.io/rthreejs</a> for interactive 3d plots.</p></li>
<li><p><strong>DiagrammeR</strong>, <a href="https://rich-iannone.github.io/DiagrammeR" class="uri">https://rich-iannone.github.io/DiagrammeR</a> for diagrams (like flow charts and simple node-link diagrams).</p></li>
</ul><p>To learn more about htmlwidgets and see a complete list of packages that provide them visit <a href="https://www.htmlwidgets.org" class="uri">https://www.htmlwidgets.org</a>.</p>
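<p>As one more illustration beyond the leaflet map, the sketch below builds an interactive time series plot with <strong>dygraphs</strong>. It assumes the dygraphs package is installed; <code>ldeaths</code> is a monthly time series from the datasets package that ships with R, and <code>deaths_widget</code> is just an illustrative variable name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dygraphs)

# ldeaths: monthly deaths from lung diseases in the UK (datasets package)
deaths_widget &lt;- dygraph(ldeaths, main = "Monthly deaths from lung diseases in the UK") |&gt;
  dyRangeSelector()  # adds a draggable range selector under the plot

# Printing the widget displays it in HTML output
deaths_widget</pre>
</div>
<p>As with leaflet, everything here is R code; the JavaScript that powers the zooming and panning stays inside the package.</p>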
</section>
<section id="shiny" data-type="sect2">
<h2>
Shiny</h2>
<p>htmlwidgets provide <strong>client-side</strong> interactivity: all the interactivity happens in the browser, independently of R. On the one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use <strong>shiny</strong>, a package that allows you to create interactivity using R code, not JavaScript.</p>
<p>To call Shiny code from a Quarto document, add <code>server: shiny</code> to the YAML header:</p>
<pre data-type="programlisting" data-code-language="yaml">title: "Shiny Web App"
format: html
server: shiny</pre>
<p>Then you can use the “input” functions to add interactive components to the document:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(shiny)
textInput("name", "What is your name?")
numericInput("age", "How old are you?", NA, min = 0, max = 150)</pre>
</div>
<p>You also need a code chunk with the chunk option <code>context: server</code>, which contains the code that needs to run in a Shiny server.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/quarto-shiny.png" class="img-fluid" alt="Two input boxes on top of each other. Top one says, &quot;What is your name?&quot;, the bottom, &quot;How old are you?&quot;." width="650"/></p>
</div>
</div>
<p>You can then refer to the values with <code>input$name</code> and <code>input$age</code>, and the code that uses them will be automatically re-run whenever they change.</p>
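<p>As a minimal sketch of how the pieces fit together, the two chunks below display a greeting built from the <code>name</code> input defined above. The output id <code>greeting</code> is a hypothetical name chosen for this example, and the chunks assume the document has <code>server: shiny</code> in its YAML header.</p>
<div class="cell">
<pre><code>```{r}
# Placeholder in the document where the server-side text will appear
textOutput("greeting")
```
```{r}
#| context: server
# Runs in the Shiny server; re-runs whenever input$name changes
output$greeting &lt;- renderText({
  req(input$name)  # wait until a name has been entered
  paste0("Hello, ", input$name, "!")
})
```</code></pre>
</div>
<p>The first chunk is rendered into the page, while the second is only ever executed by the Shiny server, which is why the two live in separate chunks.</p>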
<p>We can’t show you a live shiny app here because shiny interactions occur on the <strong>server-side</strong>. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public-facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.</p>
<p>To learn more about Shiny, we recommend reading <em>Mastering Shiny</em> by Hadley Wickham, <a href="https://mastering-shiny.org/">https://mastering-shiny.org</a>.</p>
</section>
</section>
<section id="websites-and-books" data-type="sect1">
<h1>
Websites and books</h1>
<p>With a bit of additional infrastructure, you can use Quarto to generate a complete website or book:</p>
<ul><li><p>Put your <code>.qmd</code> files in a single directory. <code>index.qmd</code> will become the home page.</p></li>
<li>
<p>Add a YAML file named <code>_quarto.yml</code> that provides the navigation for the site. In this file, set the <code>project</code> type to either <code>book</code> or <code>website</code>, e.g.:</p>
<pre data-type="programlisting" data-code-language="yaml">project:
  type: book</pre>
</li>
</ul><p>For example, the following <code>_quarto.yml</code> file creates a website from three source files: <code>index.qmd</code> (the home page), <code>viridis-colors.qmd</code>, and <code>terrain-colors.qmd</code>.</p>
<div class="cell">
<pre><code>project:
  type: website
website:
  title: "A website on color scales"
  navbar:
    left:
      - href: index.qmd
        text: Home
      - href: viridis-colors.qmd
        text: Viridis colors
      - href: terrain-colors.qmd
        text: Terrain colors</code></pre>
</div>
<p>The <code>_quarto.yml</code> file you need for a book is very similarly structured. The following example shows how you can create a book with four chapters that renders to three different outputs (<code>html</code>, <code>pdf</code>, and <code>epub</code>). Once again, the source files are <code>.qmd</code> files.</p>
<div class="cell">
<pre><code>project:
  type: book
book:
  title: "A book on color scales"
  author: "Jane Coloriste"
  chapters:
    - index.qmd
    - intro.qmd
    - viridis-colors.qmd
    - terrain-colors.qmd
format:
  html:
    theme: cosmo
  pdf: default
  epub: default</code></pre>
</div>
<p>We recommend that you use an RStudio project for your websites and books. Based on the <code>_quarto.yml</code> file, RStudio will recognize the type of project you’re working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. Both websites and books can also be rendered using <code>quarto::quarto_render()</code>.</p>
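<p>For rendering from the console, a hedged sketch might look like the following. It assumes the <strong>quarto</strong> R package is installed and that the working directory is the project root containing <code>_quarto.yml</code>; the guard simply skips rendering when that file is missing.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Assumes the quarto R package is installed and the working directory is the project root.
if (file.exists("_quarto.yml")) {
  # With no input, quarto_render() renders the project in the working directory
  quarto::quarto_render()

  # A single file from the project can be rendered by naming it
  quarto::quarto_render("viridis-colors.qmd")
} else {
  message("No _quarto.yml found here; switch to the project root first.")
}</pre>
</div>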
<p>Read more at <a href="https://quarto.org/docs/websites" class="uri">https://quarto.org/docs/websites</a> about Quarto websites and <a href="https://quarto.org/docs/books" class="uri">https://quarto.org/docs/books</a> about books.</p>
</section>
<section id="other-formats" data-type="sect1">
<h1>
Other formats</h1>
<p>Quarto offers even more output formats:</p>
<ul><li><p>You can write journal articles using Quarto Journal Templates: <a href="https://quarto.org/docs/journals/templates.html" class="uri">https://quarto.org/docs/journals/templates.html</a>.</p></li>
<li><p>You can output Quarto documents to Jupyter Notebooks with <code>format: ipynb</code>: <a href="https://quarto.org/docs/reference/formats/ipynb.html" class="uri">https://quarto.org/docs/reference/formats/ipynb.html</a>.</p></li>
</ul><p>See <a href="https://quarto.org/docs/output-formats/all-formats.html" class="uri">https://quarto.org/docs/output-formats/all-formats.html</a> for a list of even more formats.</p>
</section>
<section id="quarto-formats-learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>To learn more about effective communication in these different formats, we recommend the following resources:</p>
<ul><li><p>To improve your presentation skills, try <a href="https://presentationpatterns.com/"><em>Presentation Patterns</em></a> by Neal Ford, Matthew McCullough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.</p></li>
<li><p>If you give academic talks, you might like the <a href="https://github.com/jtleek/talkguide"><em>Leek group guide to giving talks</em></a>.</p></li>
<li><p>We haven’t taken it ourselves, but we’ve heard good things about Matt McGarrity’s online course on public speaking: <a href="https://www.coursera.org/learn/public-speaking" class="uri">https://www.coursera.org/learn/public-speaking</a>.</p></li>
<li><p>If you are creating many dashboards, make sure to read Stephen Few’s <a href="https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167"><em>Information Dashboard Design: The Effective Visual Communication of Data</em></a>. It will help you create dashboards that are truly useful, not just pretty to look at.</p></li>
<li><p>Effectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams’ <a href="https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151"><em>The Non-Designer’s Design Book</em></a> is a great place to start.</p></li>
</ul></section>
</section>


@ -1,17 +0,0 @@
<section data-type="chapter" id="chp-quarto-workflow">
<h1><span id="sec-quarto-workflow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto workflow</span></span></h1><p>Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the <em>console</em>, then capture what works in the <em>script editor</em>. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.</p><p>Quarto is also important because it so tightly integrates prose and code. This makes it a great <strong>analysis notebook</strong> because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:</p><ul><li><p>Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!</p></li>
<li><p>Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.</p></li>
<li><p>Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.</p></li>
</ul><p>Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (<a href="https://colinpurrington.com/tips/lab-notebooks" class="uri">https://colinpurrington.com/tips/lab-notebooks</a>) to come up with the following tips:</p><ul><li><p>Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.</p></li>
<li>
<p>Use the YAML header date field to record the date you started working on the notebook:</p>
<pre data-type="programlisting" data-code-language="yaml">date: 2016-08-23</pre>
<p>Use ISO8601 YYYY-MM-DD format so that there’s no ambiguity. Use it even if you don’t normally write dates that way!</p>
</li>
<li><p>If you spend a lot of time on an analysis idea and it turns out to be a dead end, don’t delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.</p></li>
<li><p>Generally, you’re better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tibble::tribble()</a></code>.</p></li>
<li><p>If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.</p></li>
<li><p>Before you finish for the day, make sure you can render the notebook. If you’re using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.</p></li>
<li><p>If you want your code to be reproducible in the long run (i.e. so you can come back to run it next month or next year), you’ll need to track the versions of the packages that your code uses. A rigorous approach is to use <strong>renv</strong>, <a href="https://rstudio.github.io/renv/index.html" class="uri">https://rstudio.github.io/renv/index.html</a>, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs <code><a href="https://rdrr.io/r/utils/sessionInfo.html">sessionInfo()</a></code>. That won’t let you easily recreate your packages as they are today, but at least you’ll know what they were.</p></li>
<li><p>You are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.</p></li>
</ul></section>


@ -1,684 +0,0 @@
<section data-type="chapter" id="chp-quarto">
<h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1>
<section id="quarto-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.</p>
<p>Quarto files are designed to be used in three ways:</p>
<ol type="1"><li><p>For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.</p></li>
<li><p>For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).</p></li>
<li><p>As an environment in which to <em>do</em> data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.</p></li>
</ol><p>Quarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through <code>?</code>. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation page at <a href="https://quarto.org/" class="uri">https://quarto.org</a> for help.</p>
<p>If you’re an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You’re not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system as well as extends it with native support for multiple programming languages like Python and Julia in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.</p>
<section id="quarto-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>You need the Quarto command line interface (Quarto CLI), but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed.</p>
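<p>If you want to confirm that the Quarto CLI can be found from R, the sketch below is one way to check. It assumes the <strong>quarto</strong> R package is installed, which this chapter does not otherwise require for interactive use; <code>quarto_bin</code> is just an illustrative variable name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Assumes the quarto R package is installed (install.packages("quarto")).
# quarto_path() returns the path to the Quarto executable, or NULL if none is found.
quarto_bin &lt;- quarto::quarto_path()

if (is.null(quarto_bin)) {
  message("Quarto CLI not found; a recent RStudio release normally bundles it.")
} else {
  message("Quarto CLI found at: ", quarto_bin)
}</pre>
</div>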
</section>
</section>
<section id="quarto-basics" data-type="sect1">
<h1>
Quarto basics</h1>
<p>This is a Quarto file, a plain text file that has the extension <code>.qmd</code>:</p>
<div class="cell">
<pre><code>---
title: "Diamond sizes"
date: 2022-09-12
format: html
---
```{r}
#| label: setup
#| include: false
library(tidyverse)
smaller &lt;- diamonds |&gt;
  filter(carat &lt;= 2.5)
```
We have data about `r nrow(diamonds)` diamonds.
Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.
The distribution of the remainder is shown below:
```{r}
#| label: plot-smaller-diamonds
#| echo: false
smaller |&gt;
  ggplot(aes(x = carat)) +
  geom_freqpoly(binwidth = 0.01)
```</code></pre>
</div>
<p>It contains three important types of content:</p>
<ol type="1"><li>An (optional) <strong>YAML header</strong> surrounded by <code>---</code>s.</li>
<li>
<strong>Chunks</strong> of R code surrounded by <code>```</code>.</li>
<li>Text mixed with simple text formatting like <code># heading</code> and <code>_italics_</code>.</li>
</ol><p>When you open a <code>.qmd</code>, you get a notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/diamond-sizes-notebook.png" class="img-fluid" style="width:90.0%" alt="RStudio window with a Quarto document titled &quot;diamond-sizes.qmd&quot; on the left and a blank Viewer window on the right. The Quarto document has a code chunk that creates a frequency plot of diamonds that weigh less than 2.5 carats. The plot shows that the frequency decreases as the weight increases."/></p>
</div>
</div>
<p>If you don’t like seeing your plots and output in your document and would rather make use of RStudio’s console and plot panes, you can click on the gear icon next to “Render” and switch to “Chunk Output in Console”.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/diamond-sizes-console-output.png" class="img-fluid" style="width:90.0%" alt="RStudio window with a Quarto document titled &quot;diamond-sizes.qmd&quot; on the left and the Plot pane on the bottom right. The Quarto document has a code chunk that creates a frequency plot of diamonds that weigh less than 2.5 carats. The plot is displayed in the Plot pane and shows that the frequency decreases as the weight increases. The RStudio option to show Chunk Output in Console is also highlighted."/></p>
</div>
</div>
<p>To produce a complete report containing all text, code, and results, click “Render” or press Cmd/Ctrl + Shift + K. You can also do this programmatically with <code>quarto::quarto_render("diamond-sizes.qmd")</code>. This will display the report in the viewer pane and create an HTML file.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/diamond-sizes-report.png" class="img-fluid" style="width:90.0%" alt="RStudio window with a Quarto document titled &quot;diamond-sizes.qmd&quot; on the left and the Plot pane on the bottom right. The rendered document does not show any of the code, but the code is visible in the source document."/></p>
</div>
</div>
<p>When you render the document, Quarto sends the <code>.qmd</code> file to <strong>knitr</strong>, <a href="https://yihui.name/knitr/" class="uri">https://yihui.name/knitr</a>, which executes all of the code chunks and creates a new markdown (<code>.md</code>) document which includes the code and its output. The markdown file generated by knitr is then processed by <strong>pandoc</strong>, <a href="https://pandoc.org/" class="uri">https://pandoc.org</a>, which is responsible for creating the finished file. The advantage of this two-step workflow is that you can create a very wide range of output formats, as you’ll learn about in <a href="#chp-quarto-formats" data-type="xref">#chp-quarto-formats</a>.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="images/quarto-flow.png" class="img-fluid" style="width:75.0%" alt="Workflow diagram starting with a qmd file, then knitr, then md, then pandoc, then PDF, MS Word, or HTML."/></p>
</div>
</div>
<p>To get started with your own <code>.qmd</code> file, select <em>File &gt; New File &gt; Quarto Document…</em> in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.</p>
<p>The following sections dive into the three components of a Quarto document in more detail: the markdown text, the code chunks, and the YAML header.</p>
<section id="quarto-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create a new Quarto document using <em>File &gt; New File &gt; Quarto Document</em>. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.</p></li>
<li><p>Create one new Quarto document for each of the three built-in formats: HTML, PDF and Word. Render each of the three documents. How do the outputs differ? How do the inputs differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)</p></li>
</ol></section>
</section>
<section id="visual-editor" data-type="sect1">
<h1>
Visual editor</h1>
<p>The Visual editor in RStudio provides a <a href="https://en.wikipedia.org/wiki/WYSIWYM">WYSIWYM</a> interface for authoring Quarto documents. Under the hood, prose in Quarto documents (<code>.qmd</code> files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more, as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in <a href="#sec-source-editor" data-type="xref">#sec-source-editor</a>, it still requires learning new syntax. Therefore, if you’re new to computational documents like <code>.qmd</code> files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.</p>
<p>In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all <kbd>⌘ /</kbd> shortcut to insert just about anything. If you are at the beginning of a line (as shown below), you can also enter just <kbd>/</kbd> to invoke the shortcut.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/quarto-visual-editor.png" class="img-fluid" style="width:75.0%" alt="A Quarto document displaying various features of the visual editor such as text formatting (italic, bold, underline, small caps, code, superscript, and subscript), first through third level headings, bulleted and numbered lists, links, linked phrases, and images (along with a pop-up window for customizing image size, adding a caption and alt text, etc.), tables with a header row, and the insert anything tool with options to insert an R code chunk, a Python code chunk, a div, a bullet list, a numbered list, or a first level heading (the top few choices in the tool)."/></p>
</div>
</div>
<p>Inserting images and customizing how they are displayed is also facilitated with the visual editor. You can either paste an image from your clipboard directly into the visual editor (and RStudio will place a copy of that image in the project directory and link to it) or you can use the visual editor’s Insert &gt; Figure / Image menu to browse to the image you want to insert or paste its URL. In addition, using the same menu you can resize the image as well as add a caption, alternative text, and a link.</p>
<p>The visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.</p>
<p>Most importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.</p>
<section id="quarto-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises. -->
</section>
</section>
<section id="sec-source-editor" data-type="sect1">
<h1>
Source editor</h1>
<p>You can also edit Quarto documents using the Source editor in RStudio, without the assistance of the Visual editor. While the Visual editor will feel familiar to those with experience writing in tools like Google Docs, the Source editor will feel familiar to those with experience writing R scripts or R Markdown documents. The Source editor can also be useful for debugging any Quarto syntax errors since it’s often easier to catch these in plain text.</p>
<p>The guide below shows how to use Pandoc’s Markdown for authoring Quarto documents in the source editor.</p>
<div class="cell">
<pre><code>## Text formatting
*italic* **bold** [underline]{.underline} ~~strikeout~~ [small caps]{.smallcaps} `code` superscript^2^ and subscript~2~
## Headings
# 1st Level Header
## 2nd Level Header
### 3rd Level Header
## Lists
- Bulleted list item 1
- Item 2
    - Item 2a
    - Item 2b
1. Numbered list item 1
2. Item 2.
   The numbers are incremented automatically in the output.
## Links and images
&lt;http://example.com&gt;
[linked phrase](http://example.com)
![optional caption text](quarto.png){fig-alt="Quarto logo and the word quarto spelled in small case letters"}
## Tables
| First Header | Second Header |
|--------------|---------------|
| Content Cell | Content Cell |
| Content Cell | Content Cell |
</code></pre>
</div>
<p>The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with <em>Help &gt; Markdown Quick Reference</em>.</p>
<section id="quarto-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.</p></li>
<li>
<p>Using the visual editor, figure out how to:</p>
<ol type="a"><li>Add a footnote.</li>
<li>Add a horizontal rule.</li>
<li>Add a block quote.</li>
</ol></li>
<li>
<p>Now, using the source editor and the Markdown quick reference, figure out how to:</p>
<ol type="a"><li>Add a footnote.</li>
<li>Add a horizontal rule.</li>
<li>Add a block quote.</li>
</ol></li>
<li><p>Copy and paste the contents of <code>diamond-sizes.qmd</code> from <a href="https://github.com/hadley/r4ds/tree/main/quarto" class="uri">https://github.com/hadley/r4ds/tree/main/quarto</a> into a local Quarto document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.</p></li>
</ol></section>
</section>
<section id="code-chunks" data-type="sect1">
<h1>
Code chunks</h1>
<p>To run code inside a Quarto document, you need to insert a chunk. There are three ways to do so:</p>
<ol type="1"><li><p>The keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.</p></li>
<li><p>The “Insert” button icon in the editor toolbar.</p></li>
<li><p>By manually typing the chunk delimiters <code>```{r}</code> and <code>```</code>.</p></li>
</ol><p>We’d recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!</p>
<p>You can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focused around a single task.</p>
<p>The following sections describe the chunk header which consists of <code>```{r}</code>, followed by an optional chunk label and various other chunk options, each on their own line, marked by <code>#|</code>.</p>
<section id="chunk-label" data-type="sect2">
<h2>
Chunk label</h2>
<p>Chunks can be given an optional label, e.g.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: simple-addition
1 + 1
```</pre>
<pre><code>#&gt; [1] 2</code></pre>
</div>
<p>This has three advantages:</p>
<ol type="1"><li>
<p>You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/quarto-chunk-nav.png" class="img-fluid" style="width:30.0%" alt="Snippet of RStudio IDE showing only the drop-down code navigator which shows three chunks. Chunk 1 is setup. Chunk 2 is cars and it is in a section called Quarto. Chunk 3 is pressure and it is in a section called Including plots."/></p>
</div>
</div>
</li>
<li><p>Graphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in <a href="#sec-figures" data-type="xref">#sec-figures</a>.</p></li>
<li><p>You can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that in <a href="#sec-caching" data-type="xref">#sec-caching</a>.</p></li>
</ol><p>Your chunk labels should be short but evocative and should not contain spaces. We recommend using dashes (<code>-</code>) to separate words (instead of underscores, <code>_</code>) and avoiding other special characters in chunk labels.</p>
<p>You are generally free to label your chunk however you like, but there is one chunk name that imbues special behavior: <code>setup</code>. When you’re in notebook mode, the chunk named <code>setup</code> will be run automatically once, before any other code is run.</p>
<p>Additionally, chunk labels cannot be duplicated. Each chunk label must be unique.</p>
</section>
<section id="chunk-options" data-type="sect2">
<h2>
Chunk options</h2>
<p>Chunk output can be customized with <strong>options</strong>, fields supplied to the chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at <a href="https://yihui.name/knitr/options/" class="uri">https://yihui.name/knitr/options</a>.</p>
<p>The most important set of options controls if your code block is executed and what results are inserted in the finished report:</p>
<ul><li><p><code>eval: false</code> prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.</p></li>
<li><p><code>include: false</code> runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.</p></li>
<li><p><code>echo: false</code> prevents the code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.</p></li>
<li><p><code>message: false</code> or <code>warning: false</code> prevents messages or warnings from appearing in the finished file.</p></li>
<li><p><code>results: hide</code> hides printed output; <code>fig-show: hide</code> hides plots.</p></li>
<li><p><code>error: true</code> causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your <code>.qmd</code>. It’s also useful if you’re teaching R and want to deliberately include an error. The default, <code>error: false</code>, causes rendering to fail if there is a single error in the document.</p></li>
</ul><p>Each of these chunk options gets added to the header of the chunk, following <code>#|</code>, e.g. in the following chunk the result is not printed since <code>eval</code> is set to false.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: simple-multiplication
#| eval: false
2 * 2
```</pre>
</div>
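<p>Several options can be combined in one header, each on its own <code>#|</code> line. As a minimal sketch (the label <code>diamond-summary</code> and the code in the chunk body are our own illustration, not part of the example document), the following chunk runs its code and shows the result, but hides the code itself along with any messages or warnings produced while loading packages:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: diamond-summary
#| echo: false
#| message: false
#| warning: false
library(tidyverse)
diamonds |&gt; count(cut)
```</pre>
</div>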
<p>The following table summarizes which types of output each option suppresses:</p>
<table class="table"><colgroup><col style="width: 24%"/><col style="width: 13%"/><col style="width: 14%"/><col style="width: 10%"/><col style="width: 9%"/><col style="width: 13%"/><col style="width: 13%"/></colgroup><thead><tr class="header"><th>Option</th>
<th style="text-align: center;">Run code</th>
<th style="text-align: center;">Show code</th>
<th style="text-align: center;">Output</th>
<th style="text-align: center;">Plots</th>
<th style="text-align: center;">Messages</th>
<th style="text-align: center;">Warnings</th>
</tr></thead><tbody><tr class="odd"><td><code>eval: false</code></td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
</tr><tr class="even"><td><code>include: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
</tr><tr class="odd"><td><code>echo: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
</tr><tr class="even"><td><code>results: hide</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
</tr><tr class="odd"><td><code>fig-show: hide</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
</tr><tr class="even"><td><code>message: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
</tr><tr class="odd"><td><code>warning: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
</tr></tbody></table></section>
<section id="global-options" data-type="sect2">
<h2>
Global options</h2>
<p>As you work more with knitr, you will discover that some of the default chunk options don’t fit your needs and you want to change them.</p>
<p>You can do this by adding the preferred options in the document YAML, under <code>execute</code>. For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set <code>echo: false</code> at the document level. That will hide the code by default, showing only the chunks you deliberately choose to show (with <code>echo: true</code>). You might consider setting <code>message: false</code> and <code>warning: false</code>, but that would make it harder to debug problems because you wouldn’t see any messages in the final document.</p>
<pre data-type="programlisting" data-code-language="yaml">title: "My report"
execute:
  echo: false</pre>
<p>Since Quarto is designed to be multi-lingual (it works with R as well as other languages like Python, Julia, etc.), not all of the knitr options are available at the document execution level, because some of them only work with knitr and not with the other engines Quarto uses for running code in other languages (e.g. Jupyter). You can, however, still set these as global options for your document under the <code>knitr</code> field, under <code>opts_chunk</code>. For example, when writing books and tutorials we set:</p>
<pre data-type="programlisting" data-code-language="yaml">title: "Tutorial"
knitr:
  opts_chunk:
    comment: "#&gt;"
    collapse: true</pre>
<p>This uses our preferred comment formatting and ensures that the code and output are kept closely entwined.</p>
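<p>A global option is only a default: an individual chunk can still override it in its own header. As a sketch, assuming the document-level <code>echo: false</code> setting described above (the label <code>show-this-code</code> is hypothetical), this chunk would display its code even though the rest of the document hides it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: show-this-code
#| echo: true
summary(mtcars$mpg)
```</pre>
</div>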
</section>
<section id="inline-code" data-type="sect2">
<h2>
Inline code</h2>
<p>There is one other way to embed R code into a Quarto document: directly into the text, with: <code>`r `</code>. This can be very useful if you mention properties of your data in the text. For example, the example document used at the start of the chapter had:</p>
<blockquote class="blockquote">
<p>We have data about <code>`r nrow(diamonds)`</code> diamonds. Only <code>`r nrow(diamonds) - nrow(smaller)`</code> are larger than 2.5 carats. The distribution of the remainder is shown below:</p>
</blockquote>
<p>When the report is rendered, the results of these computations are inserted into the text:</p>
<blockquote class="blockquote">
<p>We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:</p>
</blockquote>
<p>When inserting numbers into text, <code><a href="https://rdrr.io/r/base/format.html">format()</a></code> is your friend. It allows you to set the number of <code>digits</code> so you don’t print to a ridiculous degree of accuracy, and a <code>big.mark</code> to make numbers easier to read. You might combine these into a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">comma &lt;- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345)
#&gt; [1] "3,452,345"
comma(.12358124331)
#&gt; [1] "0.12"</pre>
</div>
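<p>As a further illustration, here is a minimal sketch that applies the same <code>comma()</code> helper to the two counts quoted in the rendered text above and to a percentage derived from them. The variable names <code>n_diamonds</code>, <code>n_large</code>, and <code>pct_large</code> are ours, introduced only for this example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">comma &lt;- function(x) format(x, digits = 2, big.mark = ",")

# Counts taken from the rendered example above
n_diamonds &lt;- 53940
n_large &lt;- 126

comma(n_diamonds)
#&gt; [1] "53,940"

# Share of diamonds larger than 2.5 carats, as a percentage
pct_large &lt;- 100 * n_large / n_diamonds
comma(pct_large)
#&gt; [1] "0.23"</pre>
</div>
<p>In a document you would typically call the helper inline, e.g. <code>`r comma(nrow(diamonds))`</code>, so that the text stays in sync with the data.</p>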
</section>
<section id="quarto-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Add a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting <code>echo: false</code> on each chunk, set a global option.</p></li>
<li><p>Download <code>diamond-sizes.qmd</code> from <a href="https://github.com/hadley/r4ds/tree/main/quarto" class="uri">https://github.com/hadley/r4ds/tree/main/quarto</a>. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.</p></li>
<li><p>Modify <code>diamond-sizes.qmd</code> to use <code>label_comma()</code> to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats.</p></li>
</ol></section>
</section>
<section id="sec-figures" data-type="sect1">
<h1>
Figures</h1>
<p>The figures in a Quarto document can be embedded (e.g. a PNG or JPEG file) or generated as a result of a code chunk.</p>
<p>To embed an image from an external file, you can use the Insert menu in RStudio and select Figure / Image. This will pop open a menu where you can browse to the image you want to insert as well as add alternative text or caption to it and adjust its size. In the visual editor you can also simply paste an image from your clipboard into your document and RStudio will place a copy of that image in your project folder.</p>
<p>If you include a code chunk that generates a figure (e.g. includes a <code>ggplot()</code> call), the resulting figure will be automatically included in your Quarto document.</p>
<section id="figure-sizing" data-type="sect2">
<h2>
Figure sizing</h2>
<p>The biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: <code>fig-width</code>, <code>fig-height</code>, <code>fig-asp</code>, <code>out-width</code> and <code>out-height</code>. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e. height, width, and aspect ratio: pick two of three).</p>
<!-- TODO: https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/ -->
<p>We recommend three of the five options:</p>
<ul><li><p>Plots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set <code>fig-width: 6</code> (6”) and <code>fig-asp: 0.618</code> (the golden ratio) in the defaults. Then in individual chunks, only adjust <code>fig-asp</code>.</p></li>
<li><p>Control the output size with <code>out-width</code> and set it to a percentage of the line width. We suggest <code>out-width: "70%"</code> and <code>fig-align: "center"</code>. That gives plots room to breathe, without taking up too much space.</p></li>
<li><p>To put multiple plots in a single row, set the <code>out-width</code> to <code>50%</code> for two plots, <code>33%</code> for three plots, or <code>25%</code> for four plots, and set <code>fig-align: "default"</code>. Depending on what you’re trying to illustrate (e.g. show data or show plot variations), you might also tweak <code>fig-width</code>, as discussed below.</p></li>
</ul><p>If you find that you’re having to squint to read the text in your plot, you need to tweak <code>fig-width</code>. If <code>fig-width</code> is larger than the size the figure is rendered in the final doc, the text will be too small; if <code>fig-width</code> is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the <code>fig-width</code> and the eventual width in your document. To illustrate the principle, the following three plots have <code>fig-width</code> of 4, 6, and 8 respectively:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto_files/figure-html/unnamed-chunk-15-1.png" class="img-fluid" width="384"/></p>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" width="768"/></p>
</div>
</div>
<p>If you want to make sure the font size is consistent across all your figures, whenever you set <code>out-width</code>, you’ll also need to adjust <code>fig-width</code> to maintain the same ratio with your default <code>out-width</code>. For example, if your default <code>fig-width</code> is 6 and your default <code>out-width</code> is 70%, when you set <code>out-width: "50%"</code> you’ll need to set <code>fig-width</code> to 4.3 (6 * 0.5 / 0.7).</p>
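<p>The same arithmetic can be wrapped in a small helper. This is only a sketch: <code>fig_width_for()</code> is a hypothetical function of our own, and its defaults assume the recommended <code>fig-width</code> of 6 and <code>out-width</code> of 70%.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: scale fig-width so that text size matches the defaults.
# out_width and default_out_width are proportions of the line width (0.5 = 50%).
fig_width_for &lt;- function(out_width,
                          default_fig_width = 6,
                          default_out_width = 0.7) {
  default_fig_width * out_width / default_out_width
}

round(fig_width_for(0.5), 1)   # two plots per row
#&gt; [1] 4.3
round(fig_width_for(0.25), 1)  # four plots per row
#&gt; [1] 2.1</pre>
</div>
<p>The result is the value you would then supply as <code>fig-width</code> in the chunk that sets the narrower <code>out-width</code>.</p>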
</section>
<section id="other-important-options" data-type="sect2">
<h2>
Other important options</h2>
<p>When mingling code and text, like in this book, you can set <code>fig-show: "hold"</code> so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.</p>
<p>To add a caption to the plot, use <code>fig-cap</code>. In Quarto this will change the figure from inline to “floating”.</p>
<p>If you’re producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set <code>fig-format: "png"</code> to force the use of PNGs. They are slightly lower quality, but will be much more compact.</p>
<p>It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse them in other circumstances (e.g. if you want to quickly drop a single plot into an email or a tweet).</p>
</section>
<section id="quarto-exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises -->
</section>
</section>
<section id="quarto-tables" data-type="sect1">
<h1>
Tables</h1>
<p>Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.</p>
<p>By default, Quarto prints data frames and matrices as you’d see them in the console:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">mtcars[1:5, ]
#&gt; mpg cyl disp hp drat wt qsec vs am gear carb
#&gt; Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#&gt; Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#&gt; Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#&gt; Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#&gt; Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2</pre>
</div>
<p>If you prefer that data be displayed with additional formatting you can use the <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">knitr::kable()</a></code> function. The code below generates <a href="#tbl-kable" data-type="xref">#tbl-kable</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">knitr::kable(mtcars[1:5, ])</pre>
<div class="cell-output-display">
<div id="tbl-kable" class="anchored">
<table class="table table-sm table-striped"><caption>Table 30.1: A knitr kable.</caption>
<colgroup><col style="width: 26%"/><col style="width: 7%"/><col style="width: 5%"/><col style="width: 7%"/><col style="width: 5%"/><col style="width: 7%"/><col style="width: 8%"/><col style="width: 8%"/><col style="width: 4%"/><col style="width: 4%"/><col style="width: 7%"/><col style="width: 7%"/></colgroup><thead><tr class="header"><th style="text-align: left;"/>
<th style="text-align: right;">mpg</th>
<th style="text-align: right;">cyl</th>
<th style="text-align: right;">disp</th>
<th style="text-align: right;">hp</th>
<th style="text-align: right;">drat</th>
<th style="text-align: right;">wt</th>
<th style="text-align: right;">qsec</th>
<th style="text-align: right;">vs</th>
<th style="text-align: right;">am</th>
<th style="text-align: right;">gear</th>
<th style="text-align: right;">carb</th>
</tr></thead><tbody><tr class="odd"><td style="text-align: left;">Mazda RX4</td>
<td style="text-align: right;">21.0</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">160</td>
<td style="text-align: right;">110</td>
<td style="text-align: right;">3.90</td>
<td style="text-align: right;">2.620</td>
<td style="text-align: right;">16.46</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">4</td>
</tr><tr class="even"><td style="text-align: left;">Mazda RX4 Wag</td>
<td style="text-align: right;">21.0</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">160</td>
<td style="text-align: right;">110</td>
<td style="text-align: right;">3.90</td>
<td style="text-align: right;">2.875</td>
<td style="text-align: right;">17.02</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">4</td>
</tr><tr class="odd"><td style="text-align: left;">Datsun 710</td>
<td style="text-align: right;">22.8</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">108</td>
<td style="text-align: right;">93</td>
<td style="text-align: right;">3.85</td>
<td style="text-align: right;">2.320</td>
<td style="text-align: right;">18.61</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">1</td>
</tr><tr class="even"><td style="text-align: left;">Hornet 4 Drive</td>
<td style="text-align: right;">21.4</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">258</td>
<td style="text-align: right;">110</td>
<td style="text-align: right;">3.08</td>
<td style="text-align: right;">3.215</td>
<td style="text-align: right;">19.44</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">3</td>
<td style="text-align: right;">1</td>
</tr><tr class="odd"><td style="text-align: left;">Hornet Sportabout</td>
<td style="text-align: right;">18.7</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">360</td>
<td style="text-align: right;">175</td>
<td style="text-align: right;">3.15</td>
<td style="text-align: right;">3.440</td>
<td style="text-align: right;">17.02</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">3</td>
<td style="text-align: right;">2</td>
</tr></tbody></table></div>
</div>
</div>
<p>Read the documentation for <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">?knitr::kable</a></code> to see the other ways in which you can customize the table. For even deeper customization, consider the <strong>gt</strong>, <strong>huxtable</strong>, <strong>reactable</strong>, <strong>kableExtra</strong>, <strong>xtable</strong>, <strong>stargazer</strong>, <strong>pander</strong>, <strong>tables</strong>, and <strong>ascii</strong> packages. Each provides a set of tools for returning formatted tables from R code.</p>
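<p>As a small sketch of that customization, the call below uses two documented <code>kable()</code> arguments, <code>digits</code> and <code>caption</code>, on a subset of columns. The column selection and caption text are our own choices for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Keep three columns, round numeric values to one decimal place,
# and attach a caption to the table
tbl &lt;- knitr::kable(
  mtcars[1:5, c("mpg", "hp", "wt")],
  digits = 1,
  caption = "Selected variables for the first five cars in mtcars."
)
tbl</pre>
</div>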
<p>There is also a rich set of options for controlling how figures are embedded. You’ll learn about these in <span class="quarto-unresolved-ref">?sec-graphics-communication</span>.</p>
<section id="quarto-exercises-5" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises -->
</section>
</section>
<section id="sec-caching" data-type="sect1">
<h1>
Caching</h1>
<p>Normally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is <code>cache: true</code>.</p>
<p>You can enable the Knitr cache at the document level for caching the results of all computations in a document using standard YAML options:</p>
<pre data-type="programlisting" data-code-language="yaml">---
title: "My Document"
execute:
  cache: true
---</pre>
<p>You can also enable caching at the chunk level for caching the results of computation in a specific chunk:</p>
<div class="cell" data-hash="quarto_cache/html/unnamed-chunk-20_0ece1c5566ef654926248351b9afb313">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| cache: true
# code for lengthy computation...
```</pre>
</div>
<p>When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.</p>
<p>The caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the <code>processed_data</code> chunk depends on the <code>raw-data</code> chunk:</p>
<pre><code>```{r}
#| label: raw-data
rawdata &lt;- readr::read_csv("a_very_large_file.csv")
```
```{r}
#| label: processed_data
#| cache: true
processed_data &lt;- rawdata |&gt;
  filter(!is.na(import_var)) |&gt;
  mutate(new_variable = complicated_transformation(x, y, z))
```</code></pre>
<p>Caching the <code>processed_data</code> chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get rerun if the <code>read_csv()</code> call changes. You can avoid that problem with the <code>dependson</code> chunk option:</p>
<pre><code>```{r}
#| label: processed-data
#| cache: true
#| dependson: "raw-data"
processed_data &lt;- rawdata |&gt;
  filter(!is.na(import_var)) |&gt;
  mutate(new_variable = complicated_transformation(x, y, z))
```</code></pre>
<p><code>dependson</code> should contain a character vector of <em>every</em> chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.</p>
<p>Note that the chunks won’t update if <code>a_very_large_file.csv</code> changes, because knitr caching only tracks changes within the <code>.qmd</code> file. If you want to also track changes to that file you can use the <code>cache.extra</code> option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is <code><a href="https://rdrr.io/r/base/file.info.html">file.info()</a></code>: it returns a bunch of information about the file including when it was last modified. Then you can write:</p>
<pre><code>```{r}
#| label: raw-data
#| cache.extra: file.info("a_very_large_file.csv")
rawdata &lt;- readr::read_csv("a_very_large_file.csv")
```</code></pre>
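<p>To see what such an expression evaluates to, the sketch below runs <code>file.info()</code> on a small temporary file that stands in for <code>a_very_large_file.csv</code>. It also shows <code>tools::md5sum()</code>, which hashes the file's contents. Using that hash in <code>cache.extra</code> is our own suggestion rather than something the text above prescribes, and hashing a genuinely large file takes time.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Write a small temporary file to stand in for a_very_large_file.csv
path &lt;- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), path)

# file.info() returns one row per file, including size and modification time
info &lt;- file.info(path)
info[, c("size", "mtime")]

# A content hash changes only when the contents of the file change
hash_before &lt;- tools::md5sum(path)
writeLines(c("x,y", "1,2", "3,4"), path)
hash_after &lt;- tools::md5sum(path)
unname(hash_before) != unname(hash_after)
#&gt; [1] TRUE</pre>
</div>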
<p>As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with <code><a href="https://rdrr.io/pkg/knitr/man/clean_cache.html">knitr::clean_cache()</a></code>.</p>
<p>We’ve followed the advice of <a href="https://twitter.com/drob/status/738786604731490304">David Robinson</a> to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the <code>dependson</code> specification.</p>
<section id="exercises-6" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Set up a network of chunks where <code>d</code> depends on <code>c</code> and <code>b</code>, and both <code>b</code> and <code>c</code> depend on <code>a</code>. Have each chunk print <code><a href="https://lubridate.tidyverse.org/reference/now.html">lubridate::now()</a></code>, set <code>cache: true</code>, then verify your understanding of caching.</li>
</ol></section>
</section>
<section id="troubleshooting" data-type="sect1">
<h1>
Troubleshooting</h1>
<p>Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.</p>
<p>One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.</p>
<p>If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu, under Run region, or with the keyboard shortcut Ctrl + Alt + R. If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.</p>
<p>If that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check the working directory is what you expect by including <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code> in a chunk.</p>
<p>Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set <code>error: true</code> on the chunk causing the problem, then use <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> and <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> to check that settings are as you expect.</p>
</section>
<section id="yaml-header" data-type="sect1">
<h1>
YAML header</h1>
<p>You can control many other “whole document” settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it’s “YAML Ain’t Markup Language”, which is designed for representing hierarchical data in a way that’s easy for humans to read and write. Quarto uses it to control many details of the output. Here we’ll discuss three: self-contained documents, document parameters, and bibliographies.</p>
<section id="self-contained" data-type="sect2">
<h2>
Self-contained</h2>
<p>HTML documents typically have a number of external dependencies (e.g. images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a <code>_files</code> folder in the same directory as your <code>.qmd</code> file. If you publish the HTML file on a hosting platform (e.g. Quarto Pub, <a href="https://quartopub.com/" class="uri">https://quartopub.com/</a>), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained HTML document that embeds all of its dependencies. You can do this by specifying the <code>embed-resources</code> option:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    embed-resources: true</pre>
<p>The resulting file will be self-contained, such that it will need no external files and no internet access to be displayed properly by a browser.</p>
</section>
<section id="parameters" data-type="sect2">
<h2>
Parameters</h2>
<p>Quarto documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the <code>params</code> field.</p>
<p>This example uses a <code>my_class</code> parameter to determine which class of cars to display:</p>
<div class="cell">
<pre><code>---
format: html
params:
  my_class: "suv"
---

```{r}
#| label: setup
#| include: false

library(tidyverse)

class &lt;- mpg |&gt; filter(class == params$my_class)
```

# Fuel economy for `r params$my_class`s

```{r}
#| message: false

ggplot(class, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
```</code></pre>
</div>
<p>As you can see, parameters are available within the code chunks as a read-only list named <code>params</code>.</p>
<p>You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with <code>!r</code>. This is a good way to specify date/time parameters.</p>
<pre data-type="programlisting" data-code-language="yaml">params:
  start: !r lubridate::ymd("2015-01-01")
  snapshot: !r lubridate::ymd_hms("2015-01-01 12:30:00")</pre>
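<p>To re-render one report for several parameter values, one option is <code>quarto_render()</code> from the quarto R package, whose <code>execute_params</code> argument takes a named list. The sketch below assumes the fuel economy example is saved as <code>fuel-economy.qmd</code> (a hypothetical file name), and it only renders when that file and the quarto package are actually available.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical file name; adjust to wherever the report is saved
input &lt;- "fuel-economy.qmd"
classes &lt;- c("suv", "compact", "pickup")

# One output file name per parameter value
outputs &lt;- paste0("fuel-economy-", classes, ".html")

# Render only when the quarto package and the input file are present
if (requireNamespace("quarto", quietly = TRUE) &amp;&amp; file.exists(input)) {
  for (i in seq_along(classes)) {
    quarto::quarto_render(
      input,
      output_file = outputs[i],
      execute_params = list(my_class = classes[i])
    )
  }
}</pre>
</div>
<p>Each pass overrides the default <code>my_class</code> declared in the YAML header, so the same source file yields one report per class.</p>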
</section>
<section id="bibliographies-and-citations" data-type="sect2">
<h2>
Bibliographies and Citations</h2>
<p>Quarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.</p>
<p>To add a citation using the visual editor, go to Insert &gt; Citation. Citations can be inserted from a variety of sources:</p>
<ol type="1"><li><p><a href="https://quarto.org/docs/visual-editor/technical.html#citations-from-dois">DOI</a> (Document Object Identifier) references.</p></li>
<li><p><a href="https://quarto.org/docs/visual-editor/technical.html#citations-from-zotero">Zotero</a> personal or group libraries.</p></li>
<li><p>Searches of <a href="https://www.crossref.org/">Crossref</a>, <a href="https://datacite.org/">DataCite</a>, or <a href="https://pubmed.ncbi.nlm.nih.gov/">PubMed</a>.</p></li>
<li><p>Your document bibliography (a <code>.bib</code> file in the directory of your document).</p></li>
</ol><p>Under the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g. <code>[@citation]</code>).</p>
<p>If you add a citation using one of the first three methods, the visual editor will automatically create a <code>bibliography.bib</code> file for you and add the reference to it. It will also add a <code>bibliography</code> field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats including BibLaTeX, BibTeX, EndNote, and Medline.</p>
<p>To create a citation within your .qmd file in the source editor, use a key composed of @ + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:</p>
<pre data-type="programlisting" data-code-language="markdown">Separate multiple citations with a `;`: Blah blah [@smith04; @doe99].
You can add arbitrary comments inside the square brackets:
Blah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].
Remove the square brackets to create an in-text citation: @smith04
says blah, or @smith04 [p. 33] says blah.
Add a `-` before the citation to suppress the author's name:
Smith says blah [-@smith04].</pre>
<p>When Quarto renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as <code># References</code> or <code># Bibliography</code>.</p>
<p>You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the <code>csl</code> field:</p>
<pre data-type="programlisting" data-code-language="yaml">bibliography: rmarkdown.bib
csl: apa.csl</pre>
<p>As with the <code>bibliography</code> field, the <code>csl</code> field should contain a path to the file. Here we assume that the CSL file is in the same directory as the .qmd file. A good place to find CSL style files for common bibliography styles is <a href="https://github.com/citation-style-language/styles" class="uri">https://github.com/citation-style-language/styles</a>.</p>
</section>
</section>
<section id="quarto-learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: <a href="https://quarto.org/" class="uri">https://quarto.org</a>.</p>
<p>There are two important topics that we haven’t covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user friendly introduction to Git and GitHub for R users, by Jenny Bryan. The book is freely available online: <a href="https://happygitwithr.com" class="uri">https://happygitwithr.com</a>.</p>
<p>We have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either <a href="https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416"><em>Style: Lessons in Clarity and Grace</em></a> by Joseph M. Williams &amp; Joseph Bizup, or <a href="https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327"><em>The Sense of Structure: Writing from the Reader’s Perspective</em></a> by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing clearer. (These books are rather expensive if purchased new, but they’re used by many English classes so there are plenty of cheap second-hand copies.) George Gopen also has a number of short articles on writing at <a href="https://www.georgegopen.com/the-litigation-articles.html" class="uri">https://www.georgegopen.com/the-litigation-articles.html</a>. They are aimed at lawyers, but almost everything applies to data scientists too.</p>
</section>
</section>

@ -1,962 +0,0 @@
<section data-type="chapter" id="chp-regexps">
<h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1>
<section id="regexps-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In <a href="#chp-strings" data-type="xref">#chp-strings</a>, you learned a whole bunch of useful functions for working with strings. This chapter will focus on functions that use <strong>regular expressions</strong>, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex”<span data-type="footnote">You can pronounce it with either a hard-g (reg-x) or a soft-g (rej-x).</span> or “regexp”.</p>
<p>The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish with a survey of other places in the tidyverse and base R where you might use regexes.</p>
<section id="regexps-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(babynames)</pre>
</div>
<p>Throughout this chapter, we’ll use a mix of very simple inline examples so you can get the basic idea, the baby names data, and three character vectors from stringr:</p>
<ul><li>
<code>fruit</code> contains the names of 80 fruits.</li>
<li>
<code>words</code> contains 980 common English words.</li>
<li>
<code>sentences</code> contains 720 short sentences.</li>
</ul></section>
</section>
<section id="sec-reg-basics" data-type="sect1">
<h1>
Pattern basics</h1>
<p>We’ll use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs. its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code>&lt;&gt;</code>, and, where possible, highlighting the match in blue.</p>
<p>The simplest patterns consist of letters and numbers which match those characters exactly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "berry")
#&gt; [6] │ bil&lt;berry&gt;
#&gt; [7] │ black&lt;berry&gt;
#&gt; [10] │ blue&lt;berry&gt;
#&gt; [11] │ boysen&lt;berry&gt;
#&gt; [19] │ cloud&lt;berry&gt;
#&gt; [21] │ cran&lt;berry&gt;
#&gt; ... and 8 more
str_view(fruit, "BERRY")</pre>
</div>
<p>Letters and numbers match exactly and are called <strong>literal characters</strong>. Punctuation characters like <code>.</code>, <code>+</code>, <code>*</code>, <code>[</code>, <code>]</code>, <code>?</code> have special meanings<span data-type="footnote">You’ll learn how to escape these special meanings in <a href="#sec-regexp-escaping" data-type="xref">#sec-regexp-escaping</a>.</span> and are called <strong>meta-characters</strong>. For example, <code>.</code> will match any character<span data-type="footnote">Well, any character apart from <code>\n</code>.</span>, so <code>"a."</code> will match any string that contains an “a” followed by another character:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
#&gt; [2] │ &lt;ab&gt;
#&gt; [3] │ &lt;ae&gt;
#&gt; [6] │ e&lt;ab&gt;</pre>
</div>
<p>Or we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "a...e")
#&gt; [1] │ &lt;apple&gt;
#&gt; [7] │ bl&lt;ackbe&gt;rry
#&gt; [48] │ mand&lt;arine&gt;
#&gt; [51] │ nect&lt;arine&gt;
#&gt; [62] │ pine&lt;apple&gt;
#&gt; [64] │ pomegr&lt;anate&gt;
#&gt; ... and 2 more</pre>
</div>
<p><strong>Quantifiers</strong> control how many times a pattern can match:</p>
<ul><li>
<code>?</code> makes a pattern optional (i.e., it matches 0 or 1 times)</li>
<li>
<code>+</code> lets a pattern repeat (i.e., it matches at least once)</li>
<li>
<code>*</code> lets a pattern be optional or repeat (i.e., it matches any number of times, including 0).</li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="r"># ab? matches an "a", optionally followed by a "b".
str_view(c("a", "ab", "abb"), "ab?")
#&gt; [1] │ &lt;a&gt;
#&gt; [2] │ &lt;ab&gt;
#&gt; [3] │ &lt;ab&gt;b
# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")
#&gt; [2] │ &lt;ab&gt;
#&gt; [3] │ &lt;abb&gt;
# ab* matches an "a", followed by any number of "b"s.
str_view(c("a", "ab", "abb"), "ab*")
#&gt; [1] │ &lt;a&gt;
#&gt; [2] │ &lt;ab&gt;
#&gt; [3] │ &lt;abb&gt;</pre>
</div>
<p><strong>Character classes</strong> are defined by <code>[]</code> and let you match a set of characters, e.g. <code>[abcd]</code> matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with <code>^</code>: <code>[^abcd]</code> matches anything <strong>except</strong> “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels or four consonants in a row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][aeiou]")
#&gt; [79] │ b&lt;eau&gt;ty
#&gt; [565] │ obv&lt;iou&gt;s
#&gt; [644] │ prev&lt;iou&gt;s
#&gt; [670] │ q&lt;uie&gt;t
#&gt; [741] │ ser&lt;iou&gt;s
#&gt; [915] │ var&lt;iou&gt;s
str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
#&gt; [45] │ a&lt;pply&gt;
#&gt; [198] │ cou&lt;ntry&gt;
#&gt; [424] │ indu&lt;stry&gt;
#&gt; [830] │ su&lt;pply&gt;
#&gt; [836] │ &lt;syst&gt;em</pre>
</div>
<p>You can combine character classes and quantifiers. For example, the following regexp looks for two vowels followed by two or more consonants:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
#&gt; [6] │ acc&lt;ount&gt;
#&gt; [21] │ ag&lt;ainst&gt;
#&gt; [31] │ alr&lt;eady&gt;
#&gt; [34] │ alth&lt;ough&gt;
#&gt; [37] │ am&lt;ount&gt;
#&gt; [46] │ app&lt;oint&gt;
#&gt; ... and 66 more</pre>
</div>
<p>(We’ll learn more elegant ways to express these ideas in <a href="#sec-quantifiers" data-type="xref">#sec-quantifiers</a>.)</p>
<p>You can use <strong>alternation</strong>, <code>|</code> to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple|pear|banana")
#&gt; [1] │ &lt;apple&gt;
#&gt; [4] │ &lt;banana&gt;
#&gt; [59] │ &lt;pear&gt;
#&gt; [62] │ pine&lt;apple&gt;
str_view(fruit, "aa|ee|ii|oo|uu")
#&gt; [9] │ bl&lt;oo&gt;d orange
#&gt; [33] │ g&lt;oo&gt;seberry
#&gt; [47] │ lych&lt;ee&gt;
#&gt; [66] │ purple mangost&lt;ee&gt;n</pre>
</div>
<p>Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.</p>
</section>
<section id="sec-stringr-regex-funs" data-type="sect1">
<h1>
Key functions</h1>
<p>Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.</p>
<section id="detect-matches" data-type="sect2">
<h2>
Detect matches</h2>
<p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matches an element of the character vector and <code>FALSE</code> otherwise:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_detect(c("a", "b", "c"), "[aeiou]")
#&gt; [1] TRUE FALSE FALSE</pre>
</div>
<p>Since <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector of the same length as the initial vector, it pairs well with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. For example, this code finds all the most popular names containing a lower-case “x”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">babynames |&gt;
filter(str_detect(name, "x")) |&gt;
count(name, wt = n, sort = TRUE)
#&gt; # A tibble: 974 × 2
#&gt; name n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 Alexander 665492
#&gt; 2 Alexis 399551
#&gt; 3 Alex 278705
#&gt; 4 Alexandra 232223
#&gt; 5 Max 148787
#&gt; 6 Alexa 123032
#&gt; # … with 968 more rows</pre>
</div>
<p>We can also use <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> by pairing it with <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> or <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>: <code>sum(str_detect(x, pattern))</code> tells you the number of observations that match and <code>mean(str_detect(x, pattern))</code> tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names<span data-type="footnote">This gives us the proportion of <strong>names</strong> that contain an “x”; if you wanted the proportion of babies with a name containing an x, you’d need to perform a weighted mean.</span> that contain “x”, broken down by year. It looks like they’ve radically increased in popularity lately!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">babynames |&gt;
group_by(year) |&gt;
summarize(prop_x = mean(str_detect(name, "x"))) |&gt;
ggplot(aes(x = year, y = prop_x)) +
geom_line()</pre>
<div class="cell-output-display">
<figure id="fig-x-names"><p><img src="regexps_files/figure-html/fig-x-names-1.png" alt="A time series showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019." width="576"/></p>
<figcaption>A time series showing the proportion of baby names that contain a lower case “x”.</figcaption>
</figure>
</div>
</div>
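<p>The footnote’s distinction between the proportion of names and the proportion of babies can be seen on a small made-up table. The names and counts below are invented for illustration and are not taken from <code>babynames</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Invented counts, purely for illustration
toy &lt;- data.frame(
  name = c("Alex", "Max", "Mary"),
  n = c(10, 5, 85)
)
has_x &lt;- str_detect(toy$name, "x")

# Proportion of names containing "x": 2 of 3
prop_names &lt;- mean(has_x)
# Proportion of babies whose name contains "x": 15 of 100
prop_babies &lt;- weighted.mean(has_x, w = toy$n)</pre>
</div>
<p>In a grouped pipeline, the analogous change would be to replace <code>mean(str_detect(name, "x"))</code> with <code>weighted.mean(str_detect(name, "x"), w = n)</code>.</p>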
<p>There are two functions that are closely related to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>, namely <code><a href="https://stringr.tidyverse.org/reference/str_subset.html">str_subset()</a></code> which returns just the strings that contain a match and <code><a href="https://stringr.tidyverse.org/reference/str_which.html">str_which()</a></code> which returns the indexes of strings that have a match:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_subset(c("a", "b", "c"), "[aeiou]")
#&gt; [1] "a"
str_which(c("a", "b", "c"), "[aeiou]")
#&gt; [1] 1</pre>
</div>
</section>
<section id="count-matches" data-type="sect2">
<h2>
Count matches</h2>
<p>The next step up in complexity from <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> is <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code>: rather than a simple true or false, it tells you how many matches there are in each string.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("apple", "banana", "pear")
str_count(x, "p")
#&gt; [1] 2 0 1</pre>
</div>
<p>Note that each match starts at the end of the previous match; i.e. regex matches never overlap. For example, in <code>"abababa"</code>, how many times will the pattern <code>"aba"</code> match? Regular expressions say two, not three:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_count("abababa", "aba")
#&gt; [1] 2
str_view("abababa", "aba")
#&gt; [1] │ &lt;aba&gt;b&lt;aba&gt;</pre>
</div>
<p>It’s natural to use <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. The following example uses <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with character classes to count the number of vowels and consonants in each name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">babynames |&gt;
count(name) |&gt;
mutate(
vowels = str_count(name, "[aeiou]"),
consonants = str_count(name, "[^aeiou]")
)
#&gt; # A tibble: 97,310 × 4
#&gt; name n vowels consonants
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Aaban 10 2 3
#&gt; 2 Aabha 5 2 3
#&gt; 3 Aabid 2 2 3
#&gt; 4 Aabir 1 2 3
#&gt; 5 Aabriella 5 4 5
#&gt; 6 Aada 1 2 2
#&gt; # … with 97,304 more rows</pre>
</div>
<p>If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:</p>
<ul><li>Add the upper case vowels to the character class: <code>str_count(name, "[aeiouAEIOU]")</code>.</li>
<li>Tell the regular expression to ignore case: <code>str_count(name, regex("[aeiou]", ignore_case = TRUE))</code>. We’ll talk about this more in <a href="#sec-flags" data-type="xref">#sec-flags</a>.</li>
<li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>.</li>
</ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p>
<p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">babynames |&gt;
count(name) |&gt;
mutate(
name = str_to_lower(name),
vowels = str_count(name, "[aeiou]"),
consonants = str_count(name, "[^aeiou]")
)
#&gt; # A tibble: 97,310 × 4
#&gt; name n vowels consonants
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 aaban 10 3 2
#&gt; 2 aabha 5 3 2
#&gt; 3 aabid 2 3 2
#&gt; 4 aabir 1 3 2
#&gt; 5 aabriella 5 5 4
#&gt; 6 aada 1 3 1
#&gt; # … with 97,304 more rows</pre>
</div>
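<p>As a quick check, the three fixes listed earlier can be compared on a single name. This is only a sketch on one hand-picked string, but it shows that all three approaches agree:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

name &lt;- "Aaban"

# 1. Add upper case vowels to the character class
fix_class &lt;- str_count(name, "[aeiouAEIOU]")
# 2. Wrap the pattern (not the string) in regex() and ignore case
fix_flag &lt;- str_count(name, regex("[aeiou]", ignore_case = TRUE))
# 3. Lower-case the string before counting
fix_lower &lt;- str_count(str_to_lower(name), "[aeiou]")

c(fix_class, fix_flag, fix_lower)
#&gt; [1] 3 3 3</pre>
</div>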
</section>
<section id="replace-values" data-type="sect2">
<h2>
Replace values</h2>
<p>As well as detecting and counting matches, we can also modify them with <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>. <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> replaces the first match, and as the name suggests, <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code> replaces all matches.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
#&gt; [1] "-ppl-" "p--r" "b-n-n-"</pre>
</div>
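<p>For contrast, here is the same input passed to <code>str_replace()</code>, which stops after the first match in each string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

x &lt;- c("apple", "pear", "banana")
# Only the first vowel in each string is replaced
first_only &lt;- str_replace(x, "[aeiou]", "-")
first_only
#&gt; [1] "-pple"  "p-ar"   "b-nana"</pre>
</div>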
<p><code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove_all()</a></code> are handy shortcuts for <code>str_replace(x, pattern, "")</code> and <code>str_replace_all(x, pattern, "")</code>, respectively.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("apple", "pear", "banana")
str_remove_all(x, "[aeiou]")
#&gt; [1] "ppl" "pr" "bnn"</pre>
</div>
<p>These functions are naturally paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.</p>
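<p>A minimal sketch of that layered cleaning, using an invented <code>price</code> column (the data and column name are assumptions made up for this example): each <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> step strips one kind of inconsistency before the final conversion.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(stringr)

# Invented values with inconsistent formatting
df &lt;- tibble(price = c("$1,200 ", " $950", "$3,400.50"))

cleaned &lt;- df |&gt;
  mutate(
    price = str_remove_all(price, "[$,]"),  # drop currency symbols and commas
    price = str_remove_all(price, " "),     # drop stray spaces
    price = as.numeric(price)               # convert once the text is clean
  )
cleaned$price
#&gt; [1] 1200.0  950.0 3400.5</pre>
</div>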
</section>
<section id="sec-extract-variables" data-type="sect2">
<h2>
Extract variables</h2>
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
<p>Let’s create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
~str,
"&lt;Sheryl&gt;-F_34",
"&lt;Kisha&gt;-F_45",
"&lt;Brandon&gt;-N_33",
"&lt;Sharon&gt;-F_38",
"&lt;Penny&gt;-F_58",
"&lt;Justin&gt;-M_41",
"&lt;Patricia&gt;-F_84",
)</pre>
</div>
<p>To extract this data using <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
separate_wider_regex(
str,
patterns = c(
"&lt;", name = "[A-Za-z]+", "&gt;-",
gender = ".", "_",
age = "[0-9]+"
)
)
#&gt; # A tibble: 7 × 3
#&gt; name gender age
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Sheryl F 34
#&gt; 2 Kisha F 45
#&gt; 3 Brandon N 33
#&gt; 4 Sharon F 38
#&gt; 5 Penny F 58
#&gt; 6 Justin M 41
#&gt; # … with 1 more row</pre>
</div>
<p>If the match fails, you can use <code>too_few = "debug"</code> to figure out what went wrong, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code>.</p>
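<p>Here is a small sketch of debug mode on a deliberately malformed input, assuming a tidyr version that provides <code>separate_wider_regex()</code> (1.3.0 or later). The second row is invented and lacks the age suffix, so the full sequence of patterns cannot match it. Rather than stopping with an error, debug mode keeps every row and adds diagnostic columns alongside the extracted ones; expect a warning announcing those extra columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)
library(tibble)

# The second row lacks the "_age" suffix, so the full pattern cannot match
bad &lt;- tribble(
  ~str,
  "&lt;Sheryl&gt;-F_34",
  "&lt;Kisha&gt;-F"
)

debugged &lt;- bad |&gt;
  separate_wider_regex(
    str,
    patterns = c(
      "&lt;", name = "[A-Za-z]+", "&gt;-",
      gender = ".", "_",
      age = "[0-9]+"
    ),
    too_few = "debug"
  )
debugged</pre>
</div>
<p>Inspecting the extra columns for the rows that failed is usually enough to see which piece of the pattern stopped matching.</p>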
</section>
<section id="regexps-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)</p></li>
<li><p>Replace all forward slashes in a string with backslashes.</p></li>
<li><p>Implement a simple version of <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> using <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>.</p></li>
<li><p>Create a regular expression that will match telephone numbers as commonly written in your country.</p></li>
</ol></section>
</section>
<section id="pattern-details" data-type="sect1">
<h1>
Pattern details</h1>
<p>Now that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with <strong>escaping</strong>, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about <strong>anchors</strong> which allow you to match the start or end of the string. Then, you’ll learn more about <strong>character classes</strong> and their shortcuts which allow you to match any character from a set. Next, you’ll learn the final details of <strong>quantifiers</strong> which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of <strong>operator precedence</strong> and parentheses. And we’ll finish off with some details of <strong>grouping</strong> components of the pattern.</p>
<p>The terms we use here are the technical names for each component. They’re not always the most evocative of their purpose, but it’s very helpful to know the correct terms if you later want to Google for more details.</p>
<section id="sec-regexp-escaping" data-type="sect2">
<h2>
Escaping</h2>
<p>In order to match a literal <code>.</code>, you need an <strong>escape</strong> which tells the regular expression to match metacharacters literally. Like strings, regexps use the backslash for escaping. So, to match a <code>.</code>, you need the regexp <code>\.</code>. Unfortunately this creates a problem. We use strings to represent regular expressions, and <code>\</code> is also used as an escape symbol in strings. So to create the regular expression <code>\.</code> we need the string <code>"\\."</code>, as the following example shows.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># To create the regular expression \., we need to use \\.
dot &lt;- "\\."
# But the expression itself only contains one \
str_view(dot)
#&gt; [1] │ \.
# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")
#&gt; [2] │ &lt;a.c&gt;</pre>
</div>
<p>In this book, we’ll usually write regular expressions without quotes, like <code>\.</code>. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like <code>"\\."</code>.</p>
<p>If <code>\</code> is used as an escape character in regular expressions, how do you match a literal <code>\</code>? Well, you need to escape it, creating the regular expression <code>\\</code>. To create that regular expression, you need to use a string, which also needs to escape <code>\</code>. That means to match a literal <code>\</code> you need to write <code>"\\\\"</code> — you need four backslashes to match one!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "a\\b"
str_view(x)
#&gt; [1] │ a\b
str_view(x, "\\\\")
#&gt; [1] │ a&lt;\&gt;b</pre>
</div>
<p>Alternatively, you might find it easier to use the raw strings you learned about in <a href="#sec-raw-strings" data-type="xref">#sec-raw-strings</a>. That lets you avoid one layer of escaping:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(x, r"{\\}")
#&gt; [1] │ a&lt;\&gt;b</pre>
</div>
<p>If you’re trying to match a literal <code>.</code>, <code>$</code>, <code>|</code>, <code>*</code>, <code>+</code>, <code>?</code>, <code>{</code>, <code>}</code>, <code>(</code>, <code>)</code>, there’s an alternative to using a backslash escape: you can use a character class: <code>[.]</code>, <code>[$]</code>, <code>[|]</code>, ... all match the literal values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
#&gt; [2] │ &lt;a.c&gt;
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
#&gt; [3] │ &lt;a*c&gt;</pre>
</div>
<p>The full set of metacharacters is <code>.^$\|*+?{}[]()</code>. In general, look at punctuation characters with suspicion; if your regular expression isn’t matching what you think it should, check if you’ve used any of these characters.</p>
</section>
<section id="anchors" data-type="sect2">
<h2>
Anchors</h2>
<p>By default, regular expressions will match any part of a string. If you want to match at the start or end, you need to <strong>anchor</strong> the regular expression using <code>^</code> to match the start of the string or <code>$</code> to match the end of the string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "^a")
#&gt; [1] │ &lt;a&gt;pple
#&gt; [2] │ &lt;a&gt;pricot
#&gt; [3] │ &lt;a&gt;vocado
str_view(fruit, "a$")
#&gt; [4] │ banan&lt;a&gt;
#&gt; [15] │ cherimoy&lt;a&gt;
#&gt; [30] │ feijo&lt;a&gt;
#&gt; [36] │ guav&lt;a&gt;
#&gt; [56] │ papay&lt;a&gt;
#&gt; [74] │ satsum&lt;a&gt;</pre>
</div>
<p>It’s tempting to think that <code>$</code> should match the start of a string, because that’s how we write dollar amounts, but it’s not what regular expressions want.</p>
<p>To force a regular expression to match only the full string, anchor it with both <code>^</code> and <code>$</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "apple")
#&gt; [1] │ &lt;apple&gt;
#&gt; [62] │ pine&lt;apple&gt;
str_view(fruit, "^apple$")
#&gt; [1] │ &lt;apple&gt;</pre>
</div>
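<p>Anchoring both ends is a handy way to check that a string is exactly what you expect. Here is a small sketch with a made-up vector, using <code>str_detect()</code> so the difference shows up as a logical vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# A made-up vector for illustration
x &lt;- c("cat", "concat", "cat food", "Cat")

# Unanchored: TRUE wherever "cat" appears anywhere in the string
str_detect(x, "cat")
#&gt; [1]  TRUE  TRUE  TRUE FALSE

# Anchored at both ends: TRUE only for the exact string "cat"
str_detect(x, "^cat$")
#&gt; [1]  TRUE FALSE FALSE FALSE</pre>
</div>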
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarize</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
#&gt; [1] │ &lt;sum&gt;mary(x)
#&gt; [2] │ &lt;sum&gt;marize(df)
#&gt; [3] │ row&lt;sum&gt;(x)
#&gt; [4] │ &lt;sum&gt;(x)
str_view(x, "\\bsum\\b")
#&gt; [4] │ &lt;sum&gt;(x)</pre>
</div>
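<p>The same idea applies when replacing text. As a quick sketch with a made-up sentence, adding <code>\b</code> on both sides restricts the replacement to the whole word:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# A made-up sentence for illustration
x &lt;- "the cat sat on the catalog"

# Without word boundaries, "cat" inside "catalog" is replaced too
str_replace_all(x, "cat", "dog")
#&gt; [1] "the dog sat on the dogalog"

# With word boundaries, only the standalone word is replaced
str_replace_all(x, "\\bcat\\b", "dog")
#&gt; [1] "the dog sat on the catalog"</pre>
</div>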
<p>When used alone, anchors will produce a zero-width match:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view("abc", c("$", "^", "\\b"))
#&gt; [1] │ abc&lt;&gt;
#&gt; [2] │ &lt;&gt;abc
#&gt; [3] │ &lt;&gt;abc&lt;&gt;</pre>
</div>
<p>This helps you understand what happens when you replace a standalone anchor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_replace_all("abc", c("$", "^", "\\b"), "--")
#&gt; [1] "abc--" "--abc" "--abc--"</pre>
</div>
</section>
<section id="character-classes" data-type="sect2">
<h2>
Character classes</h2>
<p>A <strong>character class</strong>, or character <strong>set</strong>, allows you to match any character in a set. As we discussed above, you can construct your own sets with <code>[]</code>, where <code>[abc]</code> matches a, b, or c. There are three characters that have special meaning inside of <code>[]</code>:</p>
<ul><li>
<code>-</code> defines a range, e.g. <code>[a-z]</code> matches any lower case letter and <code>[0-9]</code> matches any digit.</li>
<li>
<code>^</code> takes the inverse of the set, e.g. <code>[^abc]</code> matches anything except a, b, or c.</li>
<li>
<code>\</code> escapes special characters, so <code>[\^\-\]]</code> matches <code>^</code>, <code>-</code>, or <code>]</code>.</li>
</ul><p>Here are a few examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
#&gt; [1] │ &lt;abc&gt;d ABCD 12345 -!@#%.
str_view(x, "[a-z]+")
#&gt; [1] │ &lt;abcd&gt; ABCD 12345 -!@#%.
str_view(x, "[^a-z0-9]+")
#&gt; [1] │ abcd&lt; ABCD &gt;12345&lt; -!@#%.&gt;
# You need an escape to match characters that are otherwise
# special inside of []
str_view("a-b-c", "[a-c]")
#&gt; [1] │ &lt;a&gt;-&lt;b&gt;-&lt;c&gt;
str_view("a-b-c", "[a\\-c]")
#&gt; [1] │ &lt;a&gt;&lt;-&gt;b&lt;-&gt;&lt;c&gt;</pre>
</div>
<p>Some character classes are used so commonly that they get their own shortcut. You’ve already seen <code>.</code>, which matches any character apart from a newline. There are three other particularly useful pairs<span data-type="footnote">Remember, to create a regular expression containing <code>\d</code> or <code>\s</code>, you’ll need to escape the <code>\</code> for the string, so you’ll type <code>"\\d"</code> or <code>"\\s"</code>.</span>:</p>
<ul><li>
<code>\d</code> matches any digit;<br/><code>\D</code> matches anything that isn’t a digit.</li>
<li>
<code>\s</code> matches any whitespace (e.g. space, tab, newline);<br/><code>\S</code> matches anything that isn’t whitespace.</li>
<li>
<code>\w</code> matches any “word” character, i.e. letters, numbers, and the underscore;<br/><code>\W</code> matches any “non-word” character.</li>
</ul><p>The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
#&gt; [1] │ abcd ABCD &lt;12345&gt; -!@#%.
str_view(x, "\\D+")
#&gt; [1] │ &lt;abcd ABCD &gt;12345&lt; -!@#%.&gt;
str_view(x, "\\w+")
#&gt; [1] │ &lt;abcd&gt; &lt;ABCD&gt; &lt;12345&gt; -!@#%.
str_view(x, "\\W+")
#&gt; [1] │ abcd&lt; &gt;ABCD&lt; &gt;12345&lt; -!@#%.&gt;
str_view(x, "\\s+")
#&gt; [1] │ abcd&lt; &gt;ABCD&lt; &gt;12345&lt; &gt;-!@#%.
str_view(x, "\\S+")
#&gt; [1] │ &lt;abcd&gt; &lt;ABCD&gt; &lt;12345&gt; &lt;-!@#%.&gt;</pre>
</div>
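<p>These shortcuts are most useful in combination with other stringr functions. The sketch below uses made-up strings to show three typical jobs: extracting runs of digits, squashing runs of whitespace, and splitting on anything that isn’t a word character.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# A made-up string with irregular spacing and a tab
x &lt;- "Order 66:  3 apples,\t12 pears"

# Pull out every run of digits
str_extract_all(x, "\\d+")
#&gt; [[1]]
#&gt; [1] "66" "3"  "12"

# Collapse any run of whitespace into a single space
str_replace_all(x, "\\s+", " ")
#&gt; [1] "Order 66: 3 apples, 12 pears"

# Split on runs of non-word characters
str_split("one, two;three", "\\W+")
#&gt; [[1]]
#&gt; [1] "one"   "two"   "three"</pre>
</div>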
</section>
<section id="sec-quantifiers" data-type="sect2">
<h2>
Quantifiers</h2>
<p><strong>Quantifiers</strong> control how many times a pattern matches. In <a href="#sec-reg-basics" data-type="xref">#sec-reg-basics</a> you learned about <code>?</code> (0 or 1 matches), <code>+</code> (1 or more matches), and <code>*</code> (0 or more matches). For example, <code>colou?r</code> will match American or British spelling, <code>\d+</code> will match one or more digits, and <code>\s?</code> will optionally match a single item of whitespace. You can also specify the number of matches precisely with <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>:</p>
<ul><li>
<code>{n}</code> matches exactly n times.</li>
<li>
<code>{n,}</code> matches at least n times.</li>
<li>
<code>{n,m}</code> matches between n and m times.</li>
</ul><p>The following code shows how this works for a few simple examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "-- -x- -xx- -xxx- -xxxx- -xxxxx-"
str_view(x, "-x?-") # [0, 1]
#&gt; [1] │ &lt;--&gt; &lt;-x-&gt; -xx- -xxx- -xxxx- -xxxxx-
str_view(x, "-x+-") # [1, Inf)
#&gt; [1] │ -- &lt;-x-&gt; &lt;-xx-&gt; &lt;-xxx-&gt; &lt;-xxxx-&gt; &lt;-xxxxx-&gt;
str_view(x, "-x*-") # [0, Inf)
#&gt; [1] │ &lt;--&gt; &lt;-x-&gt; &lt;-xx-&gt; &lt;-xxx-&gt; &lt;-xxxx-&gt; &lt;-xxxxx-&gt;
str_view(x, "-x{2}-") # [2, 2]
#&gt; [1] │ -- -x- &lt;-xx-&gt; -xxx- -xxxx- -xxxxx-
str_view(x, "-x{2,}-") # [2, Inf)
#&gt; [1] │ -- -x- &lt;-xx-&gt; &lt;-xxx-&gt; &lt;-xxxx-&gt; &lt;-xxxxx-&gt;
str_view(x, "-x{2,3}-") # [2, 3]
#&gt; [1] │ -- -x- &lt;-xx-&gt; &lt;-xxx-&gt; -xxxx- -xxxxx-</pre>
</div>
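<p>Counted quantifiers become more useful once you combine them with anchors or character classes. The following sketch uses made-up inputs; note that quantifiers take as many repetitions as they can, so <code>{2,3}</code> matches three characters whenever three are available.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Made-up identifiers of varying length
ids &lt;- c("123", "12345", "1234567")

# With anchors, {3,5} accepts only strings of three to five digits
str_detect(ids, "^\\d{3,5}$")
#&gt; [1]  TRUE  TRUE FALSE

# {2,3} takes three "a"s when it can
str_extract("aaaa", "a{2,3}")
#&gt; [1] "aaa"

# {4,} skips "42" and finds the first run of at least four digits
str_extract("Item 42, order 000123", "\\d{4,}")
#&gt; [1] "000123"</pre>
</div>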
</section>
<section id="operator-precedence-and-parentheses" data-type="sect2">
<h2>
Operator precedence and parentheses</h2>
<p>What does <code>ab+</code> match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does <code>^a|b$</code> match? Does it match the complete string “a” or the complete string “b”, or does it match a string starting with “a” or a string ending with “b”?</p>
<p>The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school. You know that <code>a + b * c</code> is equivalent to <code>a + (b * c)</code> not <code>(a + b) * c</code> because <code>*</code> has higher precedence and <code>+</code> has lower precedence: you compute <code>*</code> before <code>+</code>.</p>
<p>Similarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence, which means that <code>ab+</code> is equivalent to <code>a(b+)</code>, and <code>^a|b$</code> is equivalent to <code>(^a)|(b$)</code>. Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.</p>
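<p>You can check these rules for yourself by comparing each pattern with its explicitly parenthesized alternatives. This is a small sketch on made-up strings:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

x &lt;- c("ab", "abb", "abab")

# ab+ means a(b+): an "a" followed by one or more "b"s
str_extract(x, "ab+")
#&gt; [1] "ab"  "abb" "ab"

# (ab)+ repeats the whole pair instead
str_extract(x, "(ab)+")
#&gt; [1] "ab"   "ab"   "abab"

y &lt;- c("a", "b", "ax", "xb", "xa")

# ^a|b$ means (^a)|(b$): starts with "a" OR ends with "b"
str_detect(y, "^a|b$")
#&gt; [1]  TRUE  TRUE  TRUE  TRUE FALSE

# ^(a|b)$ matches only the complete strings "a" or "b"
str_detect(y, "^(a|b)$")
#&gt; [1]  TRUE  TRUE FALSE FALSE FALSE</pre>
</div>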
</section>
<section id="grouping-and-capturing" data-type="sect2">
<h2>
Grouping and capturing</h2>
<p>As well as overriding operator precedence, parentheses have another important effect: they create <strong>capturing groups</strong> that allow you to use sub-components of the match.</p>
<p>The first way to use a capturing group is to refer back to it within a match with a <strong>back reference</strong>: <code>\1</code> refers to the match contained in the first parenthesis, <code>\2</code> in the second parenthesis, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(fruit, "(..)\\1")
#&gt; [4] │ b&lt;anan&gt;a
#&gt; [20] │ &lt;coco&gt;nut
#&gt; [22] │ &lt;cucu&gt;mber
#&gt; [41] │ &lt;juju&gt;be
#&gt; [56] │ &lt;papa&gt;ya
#&gt; [73] │ s&lt;alal&gt; berry</pre>
</div>
<p>And this one finds all words that start and end with the same pair of letters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(words, "^(..).*\\1$")
#&gt; [152] │ &lt;church&gt;
#&gt; [217] │ &lt;decide&gt;
#&gt; [617] │ &lt;photograph&gt;
#&gt; [699] │ &lt;require&gt;
#&gt; [739] │ &lt;sense&gt;</pre>
</div>
<p>You can also use back references in <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code>. For example, this code switches the order of the second and third words in <code>sentences</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sentences |&gt;
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |&gt;
str_view()
#&gt; [1] │ The canoe birch slid on the smooth planks.
#&gt; [2] │ Glue sheet the to the dark blue background.
#&gt; [3] │ It's to easy tell the depth of a well.
#&gt; [4] │ These a days chicken leg is a rare dish.
#&gt; [5] │ Rice often is served in round bowls.
#&gt; [6] │ The of juice lemons makes fine punch.
#&gt; ... and 714 more</pre>
</div>
<p>If you want to extract the matches for each group you can use <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code>. But <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sentences |&gt;
str_match("the (\\w+) (\\w+)") |&gt;
head()
#&gt; [,1] [,2] [,3]
#&gt; [1,] "the smooth planks" "smooth" "planks"
#&gt; [2,] "the sheet to" "sheet" "to"
#&gt; [3,] "the depth of" "depth" "of"
#&gt; [4,] NA NA NA
#&gt; [5,] NA NA NA
#&gt; [6,] NA NA NA</pre>
</div>
<p>You could convert to a tibble and name the columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">sentences |&gt;
str_match("the (\\w+) (\\w+)") |&gt;
as_tibble(.name_repair = "minimal") |&gt;
set_names("match", "word1", "word2")
#&gt; # A tibble: 720 × 3
#&gt; match word1 word2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 the smooth planks smooth planks
#&gt; 2 the sheet to sheet to
#&gt; 3 the depth of depth of
#&gt; 4 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 714 more rows</pre>
</div>
<p>But then you’ve basically recreated your own version of <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. Indeed, behind the scenes, <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> converts your vector of patterns to a single regex that uses grouping to capture the named components.</p>
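<p>For comparison, here is a rough sketch of the same extraction written with <code>separate_wider_regex()</code>. It assumes tidyr 1.3.0 or later, and the tibble <code>df</code> with its <code>phrase</code> column is made up for illustration. Named elements of <code>patterns</code> become columns, while unnamed elements must match but are dropped:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)

# A made-up tibble holding three phrases that fit the pattern exactly
df &lt;- tibble::tibble(
  phrase = c("the smooth planks", "the sheet to", "the depth of")
)

# The pieces of `patterns` are matched in order against the whole string
res &lt;- df |&gt;
  separate_wider_regex(
    phrase,
    patterns = c("the ", word1 = "\\w+", " ", word2 = "\\w+")
  )

# The result has one column per named pattern
names(res)
#&gt; [1] "word1" "word2"
res$word1
#&gt; [1] "smooth" "sheet"  "depth"</pre>
</div>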
<p>Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with <code>(?:)</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c("a gray cat", "a grey dog")
str_match(x, "gr(e|a)y")
#&gt; [,1] [,2]
#&gt; [1,] "gray" "a"
#&gt; [2,] "grey" "e"
str_match(x, "gr(?:e|a)y")
#&gt; [,1]
#&gt; [1,] "gray"
#&gt; [2,] "grey"</pre>
</div>
</section>
<section id="regexps-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How would you match the literal string <code>"'\</code>? How about <code>"$^$"</code>?</p></li>
<li><p>Explain why each of these patterns doesn’t match a <code>\</code>: <code>"\"</code>, <code>"\\"</code>, <code>"\\\"</code>.</p></li>
<li>
<p>Given the corpus of common words in <code><a href="https://stringr.tidyverse.org/reference/stringr-data.html">stringr::words</a></code>, create regular expressions that find all words that:</p>
<ol type="a"><li>Start with “y”.</li>
<li>Don’t start with “y”.</li>
<li>End with “x”.</li>
<li>Are exactly three letters long. (Don’t cheat by using <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code>!)</li>
<li>Have seven letters or more.</li>
<li>Contain a vowel-consonant pair.</li>
<li>Contain at least two vowel-consonant pairs in a row.</li>
<li>Only consist of repeated vowel-consonant pairs.</li>
</ol></li>
<li><p>Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarise/summarize, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. Try and make the shortest possible regex!</p></li>
<li><p>Switch the first and last letters in <code>words</code>. Which of those strings are still <code>words</code>?</p></li>
<li>
<p>Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)</p>
<ol type="a"><li><code>^.*$</code></li>
<li><code>"\\{.+\\}"</code></li>
<li><code>\d{4}-\d{2}-\d{2}</code></li>
<li><code>"\\\\{4}"</code></li>
<li><code>\..\..\..</code></li>
<li><code>(.)\1\1</code></li>
<li><code>"(..)\\1"</code></li>
</ol></li>
<li><p>Solve the beginner regexp crosswords at <a href="https://regexcrossword.com/challenges/beginner" class="uri">https://regexcrossword.com/challenges/beginner</a>.</p></li>
</ol></section>
</section>
<section id="pattern-control" data-type="sect1">
<h1>
Pattern control</h1>
<p>It’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so-called regex flags and match various types of fixed strings, as described below.</p>
<section id="sec-flags" data-type="sect2">
<h2>
Regex flags</h2>
<p>There are a number of settings that can be used to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">bananas &lt;- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
#&gt; [1] │ &lt;banana&gt;
str_view(bananas, regex("banana", ignore_case = TRUE))
#&gt; [1] │ &lt;banana&gt;
#&gt; [2] │ &lt;Banana&gt;
#&gt; [3] │ &lt;BANANA&gt;</pre>
</div>
<p>If you’re doing a lot of work with multiline strings (i.e. strings that contain <code>\n</code>), <code>dotall</code> and <code>multiline</code> may also be useful:</p>
<ul><li>
<p><code>dotall = TRUE</code> lets <code>.</code> match everything, including <code>\n</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE))
#&gt; [1] │ Line 1&lt;
#&gt; │ Line&gt; 2&lt;
#&gt; │ Line&gt; 3</pre>
</div>
</li>
<li>
<p><code>multiline = TRUE</code> makes <code>^</code> and <code>$</code> match the start and end of each line rather than the start and end of the complete string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
#&gt; [1] │ &lt;Line&gt; 1
#&gt; │ Line 2
#&gt; │ Line 3
str_view(x, regex("^Line", multiline = TRUE))
#&gt; [1] │ &lt;Line&gt; 1
#&gt; │ &lt;Line&gt; 2
#&gt; │ &lt;Line&gt; 3</pre>
</div>
</li>
</ul><p>Finally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try <code>comments = TRUE</code>. It tweaks the pattern language to ignore spaces and new lines, as well as everything after <code>#</code>. This allows you to use comments and whitespace to make complex regular expressions more understandable<span data-type="footnote"><code>comments = TRUE</code> is particularly effective in combination with a raw string, as we use here.</span>, as in the following example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">phone &lt;- regex(
r"(
\(? # optional opening parens
(\d{3}) # area code
[)\ -]? # optional closing parens, space, or dash
(\d{3}) # another three numbers
[\ -]? # optional space or dash
(\d{4}) # four more numbers
)",
comments = TRUE
)
str_match("514-791-8141", phone)
#&gt; [,1] [,2] [,3] [,4]
#&gt; [1,] "514-791-8141" "514" "791" "8141"</pre>
</div>
<p>If you’re using comments and want to match a space, newline, or <code>#</code>, you’ll need to escape it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view("x x #", regex(r"(x #)", comments = TRUE))
#&gt; [1] │ &lt;x&gt; &lt;x&gt; #
str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
#&gt; [1] │ x &lt;x #&gt;</pre>
</div>
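<p>Flags can also be combined in a single <code>regex()</code> call. As a brief sketch, assuming a made-up multi-line log string, the following counts lines that start with “error” regardless of case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# A made-up three-line log for illustration
log &lt;- "error: disk full\nok\nERROR: timeout"

# Without multiline, ^ only matches at the start of the whole string
str_count(log, regex("^error", ignore_case = TRUE))
#&gt; [1] 1

# With multiline, ^ matches at the start of every line
str_count(log, regex("^error", ignore_case = TRUE, multiline = TRUE))
#&gt; [1] 2</pre>
</div>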
</section>
<section id="fixed-matches" data-type="sect2">
<h2>
Fixed matches</h2>
<p>You can opt out of the regular expression rules by using <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(c("", "a", "."), fixed("."))
#&gt; [3] │ &lt;.&gt;</pre>
</div>
<p><code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code> also gives you the ability to ignore case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view("x X", "X")
#&gt; [1] │ x &lt;X&gt;
str_view("x X", fixed("X", ignore_case = TRUE))
#&gt; [1] │ &lt;x&gt; &lt;X&gt;</pre>
</div>
<p>If you’re working with non-English text, you will probably want <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">coll()</a></code> instead of <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>, as it implements the full rules for capitalization as used by the <code>locale</code> you specify. See <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a> for more details on locales.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
#&gt; [1] │ i &lt;İ&gt; ı I
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
#&gt; [1] │ &lt;i&gt; &lt;İ&gt; ı I</pre>
</div>
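<p>To see why <code>fixed()</code> matters for punctuation-heavy text, compare how the same one-character pattern behaves with and without it. This is a minimal sketch on a made-up string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# A made-up string of 15 characters, three of which are literal dots
x &lt;- "1.5 + 2.5 = 4.0"

# As a regex, . matches every character
str_count(x, ".")
#&gt; [1] 15

# As a fixed string, . matches only literal dots
str_count(x, fixed("."))
#&gt; [1] 3

# fixed() also spares you from escaping other metacharacters like +
str_replace_all(x, fixed("+"), "plus")
#&gt; [1] "1.5 plus 2.5 = 4.0"</pre>
</div>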
</section>
</section>
<section id="practice" data-type="sect1">
<h1>
Practice</h1>
<p>To put these ideas into practice, we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:</p>
<ol type="1"><li>checking your work by creating simple positive and negative controls</li>
<li>combining regular expressions with Boolean algebra</li>
<li>creating complex patterns using string manipulation</li>
</ol>
<section id="check-your-work" data-type="sect2">
<h2>
Check your work</h2>
<p>First, let’s find all sentences that start with “The”. Using the <code>^</code> anchor alone is not enough:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^The")
#&gt; [1] │ &lt;The&gt; birch canoe slid on the smooth planks.
#&gt; [4] │ &lt;The&gt;se days a chicken leg is a rare dish.
#&gt; [6] │ &lt;The&gt; juice of lemons makes fine punch.
#&gt; [7] │ &lt;The&gt; box was thrown beside the parked truck.
#&gt; [8] │ &lt;The&gt; hogs were fed chopped corn and garbage.
#&gt; [11] │ &lt;The&gt; boy was there when the sun rose.
#&gt; ... and 271 more</pre>
</div>
<p>That’s because the pattern also matches sentences starting with words like <code>They</code> or <code>These</code>. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^The\\b")
#&gt; [1] │ &lt;The&gt; birch canoe slid on the smooth planks.
#&gt; [6] │ &lt;The&gt; juice of lemons makes fine punch.
#&gt; [7] │ &lt;The&gt; box was thrown beside the parked truck.
#&gt; [8] │ &lt;The&gt; hogs were fed chopped corn and garbage.
#&gt; [11] │ &lt;The&gt; boy was there when the sun rose.
#&gt; [13] │ &lt;The&gt; source of the huge river is the clear spring.
#&gt; ... and 250 more</pre>
</div>
<p>What about finding all sentences that begin with a pronoun?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^She|He|It|They\\b")
#&gt; [3] │ &lt;It&gt;'s easy to tell the depth of a well.
#&gt; [15] │ &lt;He&gt;lp the woman get back to her feet.
#&gt; [27] │ &lt;He&gt;r purse was full of useless trash.
#&gt; [29] │ &lt;It&gt; snowed, rained, and hailed the same morning.
#&gt; [63] │ &lt;He&gt; ran half way to the hardware store.
#&gt; [90] │ &lt;He&gt; lay prone and hardly moved a limb.
#&gt; ... and 57 more</pre>
</div>
<p>A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "^(She|He|It|They)\\b")
#&gt; [3] │ &lt;It&gt;'s easy to tell the depth of a well.
#&gt; [29] │ &lt;It&gt; snowed, rained, and hailed the same morning.
#&gt; [63] │ &lt;He&gt; ran half way to the hardware store.
#&gt; [90] │ &lt;He&gt; lay prone and hardly moved a limb.
#&gt; [116] │ &lt;He&gt; ordered peach pie with ice cream.
#&gt; [127] │ &lt;It&gt; caught its hind paw in a rusty trap.
#&gt; ... and 51 more</pre>
</div>
<p>You might wonder how you’d spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">pos &lt;- c("He is a boy", "She had a good time")
neg &lt;- c("Shells come from the sea", "Hadley said 'It's a great day'")
pattern &lt;- "^(She|He|It|They)\\b"
str_detect(pos, pattern)
#&gt; [1] TRUE TRUE
str_detect(neg, pattern)
#&gt; [1] FALSE FALSE</pre>
</div>
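<p>If you find yourself repeating this check, you could wrap it in a small helper. The sketch below defines <code>check_pattern()</code>, a hypothetical function of our own (it is not part of stringr), which reports the positive controls a pattern misses and the negative controls it wrongly matches:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# check_pattern() is a hypothetical helper, not a stringr function
check_pattern &lt;- function(pattern, pos, neg) {
  list(
    missed = pos[!str_detect(pos, pattern)],  # should match but don't
    spurious = neg[str_detect(neg, pattern)]  # shouldn't match but do
  )
}

pos &lt;- c("He is a boy", "She had a good time")
neg &lt;- c("Shells come from the sea", "Hadley said 'It's a great day'")

# The pattern without parentheses wrongly matches both negative controls
length(check_pattern("^She|He|It|They\\b", pos, neg)$spurious)
#&gt; [1] 2

# The corrected pattern matches none of them and misses no positives
length(check_pattern("^(She|He|It|They)\\b", pos, neg)$spurious)
#&gt; [1] 0
length(check_pattern("^(She|He|It|They)\\b", pos, neg)$missed)
#&gt; [1] 0</pre>
</div>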
<p>It’s typically much easier to come up with good positive examples than negative examples, because it takes a while before you’re good enough with regular expressions to predict where your weaknesses are. Nevertheless, negative examples are still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.</p>
</section>
<section id="sec-boolean-operations" data-type="sect2">
<h2>
Boolean operations</h2>
<p>Imagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels (<code>[^aeiou]</code>), then allow that to match any number of letters (<code>[^aeiou]+</code>), then force it to match the whole string by anchoring to the beginning and the end (<code>^[^aeiou]+$</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(words, "^[^aeiou]+$")
#&gt; [123] │ &lt;by&gt;
#&gt; [249] │ &lt;dry&gt;
#&gt; [328] │ &lt;fly&gt;
#&gt; [538] │ &lt;mrs&gt;
#&gt; [895] │ &lt;try&gt;
#&gt; [952] │ &lt;why&gt;</pre>
</div>
<p>But you can make this problem a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(words[!str_detect(words, "[aeiou]")])
#&gt; [1] │ by
#&gt; [2] │ dry
#&gt; [3] │ fly
#&gt; [4] │ mrs
#&gt; [5] │ try
#&gt; [6] │ why</pre>
</div>
<p>This is a useful technique whenever you’re dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine you want to find all words that contain “a” and “b”. There’s no “and” operator built in to regular expressions so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(words, "a.*b|b.*a")
#&gt; [2] │ &lt;ab&gt;le
#&gt; [3] │ &lt;ab&gt;out
#&gt; [4] │ &lt;ab&gt;solute
#&gt; [62] │ &lt;availab&gt;le
#&gt; [66] │ &lt;ba&gt;by
#&gt; [67] │ &lt;ba&gt;ck
#&gt; ... and 24 more</pre>
</div>
<p>It’s simpler to combine the results of two calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">words[str_detect(words, "a") &amp; str_detect(words, "b")]
#&gt; [1] "able" "about" "absolute" "available" "baby" "back"
#&gt; [7] "bad" "bag" "balance" "ball" "bank" "bar"
#&gt; [13] "base" "basis" "bear" "beat" "beauty" "because"
#&gt; [19] "black" "board" "boat" "break" "brilliant" "britain"
#&gt; [25] "debate" "husband" "labour" "maybe" "probable" "table"</pre>
</div>
<p>What if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">words[str_detect(words, "a.*e.*i.*o.*u")]
# ...
words[str_detect(words, "u.*o.*i.*e.*a")]</pre>
</div>
<p>It’s much simpler to combine five calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">words[
str_detect(words, "a") &amp;
str_detect(words, "e") &amp;
str_detect(words, "i") &amp;
str_detect(words, "o") &amp;
str_detect(words, "u")
]
#&gt; character(0)</pre>
</div>
<p>In general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think if you could break the problem down into smaller pieces, solving each challenge before moving onto the next one.</p>
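<p>When the list of conditions gets long, you can also build the combination with code rather than typing each call. The following is one possible sketch using base R’s <code>lapply()</code> and <code>Reduce()</code>; the vector <code>x</code> is made up so that the result isn’t empty:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# A made-up vector; two of these words contain every vowel
x &lt;- c("education", "sequoia", "banana", "rhythm")
vowels &lt;- c("a", "e", "i", "o", "u")

# One logical vector per vowel
has_each &lt;- lapply(vowels, function(v) str_detect(x, v))

# Combine them all with &amp;, so a word must pass every test
all_vowels &lt;- Reduce(`&amp;`, has_each)

x[all_vowels]
#&gt; [1] "education" "sequoia"</pre>
</div>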
</section>
<section id="creating-a-pattern-with-code" data-type="sect2">
<h2>
Creating a pattern with code</h2>
<p>What if we wanted to find all <code>sentences</code> that mention a color? The basic idea is simple: we just combine alternation with word boundaries.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(sentences, "\\b(red|green|blue)\\b")
#&gt; [2] │ Glue the sheet to the dark &lt;blue&gt; background.
#&gt; [26] │ Two &lt;blue&gt; fish swam in the tank.
#&gt; [92] │ A wisp of cloud hung in the &lt;blue&gt; air.
#&gt; [148] │ The spot on the blotter was made by &lt;green&gt; ink.
#&gt; [160] │ The sofa cushion is &lt;red&gt; and of light weight.
#&gt; [174] │ The sky that morning was clear and bright &lt;blue&gt;.
#&gt; ... and 20 more</pre>
</div>
<p>But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">rgb &lt;- c("red", "green", "blue")</pre>
</div>
<p>Well, we can! We’d just need to create the pattern from the vector using <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
#&gt; [1] "\\b(red|green|blue)\\b"</pre>
</div>
<p>We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str_view(colors())
#&gt; [1] │ white
#&gt; [2] │ aliceblue
#&gt; [3] │ antiquewhite
#&gt; [4] │ antiquewhite1
#&gt; [5] │ antiquewhite2
#&gt; [6] │ antiquewhite3
#&gt; ... and 651 more</pre>
</div>
<p>But let’s first eliminate the numbered variants:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">cols &lt;- colors()
cols &lt;- cols[!str_detect(cols, "\\d")]
str_view(cols)
#&gt; [1] │ white
#&gt; [2] │ aliceblue
#&gt; [3] │ antiquewhite
#&gt; [4] │ aquamarine
#&gt; [5] │ azure
#&gt; [6] │ beige
#&gt; ... and 137 more</pre>
</div>
<p>Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">pattern &lt;- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
str_view(sentences, pattern)
#&gt; [2] │ Glue the sheet to the dark &lt;blue&gt; background.
#&gt; [12] │ A rod is used to catch &lt;pink&gt; &lt;salmon&gt;.
#&gt; [26] │ Two &lt;blue&gt; fish swam in the tank.
#&gt; [66] │ Cars and busses stalled in &lt;snow&gt; drifts.
#&gt; [92] │ A wisp of cloud hung in the &lt;blue&gt; air.
#&gt; [112] │ Leaves turn &lt;brown&gt; and &lt;yellow&gt; in the fall.
#&gt; ... and 57 more</pre>
</div>
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
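<p>To illustrate what can go wrong, here is a small sketch with made-up search terms that do contain metacharacters. It assumes a version of stringr that provides <code>str_escape()</code> (1.5.0 or later); the names <code>terms</code>, <code>naive</code>, <code>safe</code>, and <code>candidates</code> are ours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(stringr)

# Made-up search terms containing ., +, and parentheses
terms &lt;- c("a.c", "1+1", "(x)")
candidates &lt;- c("abc", "a.c", "11", "1+1", "x", "(x)")

# Pasting the raw strings together treats them as regular expressions
naive &lt;- str_flatten(terms, "|")
str_detect(candidates, naive)
#&gt; [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

# Escaping first makes every term match literally
safe &lt;- str_flatten(str_escape(terms), "|")
str_detect(candidates, safe)
#&gt; [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE</pre>
</div>
<p>Note that the unescaped pattern not only matches strings it shouldn’t, it also fails to find the literal <code>1+1</code>, because <code>1+1</code> as a regular expression means “one or more 1s followed by a 1”.</p>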
</section>
<section id="regexps-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> calls.</p>
<ol type="a"><li>Find all <code>words</code> that start or end with <code>x</code>.</li>
<li>Find all <code>words</code> that start with a vowel and end with a consonant.</li>
<li>Are there any <code>words</code> that contain at least one of each different vowel?</li>
</ol></li>
<li><p>Construct patterns to find evidence for and against the rule “i before e except after c”.</p></li>
<li><p><code><a href="https://rdrr.io/r/grDevices/colors.html">colors()</a></code> contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified.)</p></li>
<li><p>Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the <code><a href="https://rdrr.io/r/utils/data.html">data()</a></code> function: <code>data(package = "datasets")$results[, "Item"]</code>. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.</p></li>
</ol></section>
</section>
<section id="regular-expressions-in-other-places" data-type="sect1">
<h1>
Regular expressions in other places</h1>
<p>Just like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.</p>
<section id="tidyverse" data-type="sect2">
<h2>
tidyverse</h2>
<p>There are three other particularly useful places where you might want to use regular expressions:</p>
<ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>).</p></li>
<li><p><code>pivot_longer()</code>’s <code>names_pattern</code> argument takes a regular expression containing capture groups, which plays a similar role to the patterns you supply to <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s useful when extracting data out of variable names with a complex structure.</p></li>
<li><p>The <code>delim</code> argument in <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
</ul></section>
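<p>The three tidyverse uses listed above can be sketched as follows. This is an illustration only: the data frame and its column names are hypothetical, and the code assumes recent versions of dplyr, tidyr, and stringr.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data, invented for this illustration
df &lt;- tibble(id = 1, height_2020 = 150, height_2021 = 155, pair = "a, b")

# matches(): select every column whose name ends in an underscore and four digits
by_name &lt;- df |&gt; select(matches("_\\d{4}$"))
names(by_name)
#&gt; [1] "height_2020" "height_2021"

# names_pattern: each capture group feeds one entry of names_to
long &lt;- df |&gt;
  select(!pair) |&gt;
  pivot_longer(!id, names_to = c("measure", "year"), names_pattern = "(.*)_(\\d+)")

# delim = regex(...): split on a comma that is optionally followed by a space
pieces &lt;- df |&gt;
  select(pair) |&gt;
  separate_wider_delim(pair, delim = regex(", ?"), names = c("first", "second"))</pre>
</div>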
<section id="base-r" data-type="sect2">
<h2>
Base R</h2>
<p><code>apropos(pattern)</code> searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">apropos("replace")
#&gt; [1] "%+replace%" "replace" "replace_na"
#&gt; [4] "setReplaceMethod" "str_replace" "str_replace_all"
#&gt; [7] "str_replace_na" "theme_replace"</pre>
</div>
<p><code>list.files(path, pattern)</code> lists all files in <code>path</code> that match a regular expression <code>pattern</code>. For example, you can find all the R Markdown files in the current directory with:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">head(list.files(pattern = "\\.Rmd$"))
#&gt; character(0)</pre>
</div>
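<p>To see the pattern pick out files, here is a small self-contained sketch. It creates a throwaway directory under <code>tempdir()</code>, and the file names are hypothetical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Create a throwaway directory holding a few hypothetical, empty files
tmp &lt;- file.path(tempdir(), "list-files-demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("report.Rmd", "notes.Rmd", "data.csv", "Rmd-backup.txt"))))

# Only names that end in ".Rmd" match; "Rmd-backup.txt" does not
list.files(tmp, pattern = "\\.Rmd$")
#&gt; [1] "notes.Rmd"  "report.Rmd"</pre>
</div>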
<p>It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the <a href="https://stringi.gagolewski.com">stringi package</a>, which is in turn built on top of the <a href="https://unicode-org.github.io/icu/userguide/strings/regexp.html">ICU engine</a>, whereas base R functions use either the <a href="https://github.com/laurikari/tre">TRE engine</a> or the <a href="https://www.pcre.org">PCRE engine</a>, depending on whether or not you’ve set <code>perl = TRUE</code>. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the <code>(?…)</code> syntax.</p>
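<p>As a rough illustration of one such difference, consider a lookahead, which uses the <code>(?=…)</code> syntax. The pattern and input below are invented for this sketch, and the default engine’s behavior in the last call is an expectation rather than a guarantee, so the call is wrapped in <code>tryCatch()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A lookahead: match "a" only when it is followed by "b"
pattern &lt;- "a(?=b)"

# PCRE, selected with perl = TRUE, understands lookaheads
grepl(pattern, "ab", perl = TRUE)
#&gt; [1] TRUE

# stringr's ICU engine does too
stringr::str_detect("ab", pattern)
#&gt; [1] TRUE

# The default engine is expected to reject the pattern, so catch the error
tryCatch(
  grepl(pattern, "ab"),
  error = function(e) "not supported by the default engine"
)</pre>
</div>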
</section>
</section>
<section id="regexps-summary" data-type="sect1">
<h1>
Summary</h1>
<p>With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.</p>
<p>In this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.</p>
<p>A good place to start is <code><a href="https://stringr.tidyverse.org/articles/regular-expressions.html">vignette("regular-expressions", package = "stringr")</a></code>: it documents the full set of syntax supported by stringr. Another useful reference is <a href="https://www.regular-expressions.info/tutorial.html">https://www.regular-expressions.info/</a>. It’s not R-specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.</p>
<p>It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.</p>
<p>In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.</p>
</section>
</section>
