Commit O'Reilly HTML files to monitor fixes

Hadley Wickham 2022-11-18 10:28:19 -06:00
parent 31ef77ced6
commit 4caea5281b
39 changed files with 17407 additions and 1 deletions

2
.gitignore vendored

@@ -6,7 +6,7 @@ _main.rds
_book
*.md
!CODE_OF_CONDUCT.md
-*.html
+/*.html
!plausible.html
search_index.json
libs

2
oreilly/.gitignore vendored Normal file

@@ -0,0 +1,2 @@
*.png
*.jpg

591
oreilly/EDA.html Normal file

@@ -0,0 +1,591 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
<ol type="1"><li><p>Generate questions about your data.</p></li>
<li><p>Search for answers by visualizing, transforming, and modelling your data.</p></li>
<li><p>Use what you learn to refine your questions and/or generate new questions.</p></li>
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.</p>
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="questions" data-type="sect1">
<h1>
Questions</h1>
<blockquote class="blockquote">
<p>“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox</p>
</blockquote>
<blockquote class="blockquote">
<p>“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey</p>
</blockquote>
<p>Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.</p>
<p>EDA is fundamentally a creative process. And like most creative processes, the key to asking <em>quality</em> questions is to generate a large <em>quantity</em> of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.</p>
<p>There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:</p>
<ol type="1"><li><p>What type of variation occurs within my variables?</p></li>
<li><p>What type of covariation occurs between my variables?</p></li>
</ol><p>The rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question. To make the discussion easier, let’s define some terms:</p>
<ul><li><p>A <strong>variable</strong> is a quantity, quality, or property that you can measure.</p></li>
<li><p>A <strong>value</strong> is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.</p></li>
<li><p>An <strong>observation</strong> is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.</p></li>
<li><p><strong>Tabular data</strong> is a set of values, each associated with a variable and an observation. Tabular data is <em>tidy</em> if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.</p></li>
</ul><p>So far, all of the data that you’ve seen has been tidy. In real life, most data isn’t tidy, so we’ll come back to these ideas again in <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>.</p>
</section>
<section id="variation" data-type="sect1">
<h1>
Variation</h1>
<p><strong>Variation</strong> is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.</p>
<section id="visualizing-distributions" data-type="sect2">
<h2>
Visualizing distributions</h2>
<p>How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is <strong>categorical</strong> if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, you can use a bar chart:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (approximately 12500), Premium (approximately 14000), and Ideal (approximately 21500)." width="576"/></p>
</div>
</div>
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="https://dplyr.tidyverse.org/reference/count">count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(cut)
#&gt; # A tibble: 5 × 2
#&gt; cut n
#&gt; &lt;ord&gt; &lt;int&gt;
#&gt; 1 Fair 1610
#&gt; 2 Good 4906
#&gt; 3 Very Good 12082
#&gt; 4 Premium 13791
#&gt; 5 Ideal 21551</pre>
</div>
<p>A variable is <strong>continuous</strong> if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, you can use a histogram:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.5)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
</div>
</div>
<p>You can compute this by hand by combining <code><a href="https://dplyr.tidyverse.org/reference/count">count()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval">cut_width()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(cut_width(carat, 0.5))
#&gt; # A tibble: 11 × 2
#&gt; `cut_width(carat, 0.5)` n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 [-0.25,0.25] 785
#&gt; 2 (0.25,0.75] 29498
#&gt; 3 (0.75,1.25] 15977
#&gt; 4 (1.25,1.75] 5313
#&gt; 5 (1.75,2.25] 2002
#&gt; 6 (2.25,2.75] 322
#&gt; # … with 5 more rows</pre>
</div>
<p>A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. Note that even though it’s not possible to have a <code>carat</code> value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0. This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25. The tallest bar shows that almost 30,000 observations have a <code>carat</code> value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5.</p>
<p>You can set the width of the intervals in a histogram with the <code>binwidth</code> argument, which is measured in the units of the <code>x</code> variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">smaller &lt;- diamonds |&gt;
filter(carat &lt; 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), resulting in many bars. The distribution is right skewed but there are lots of ups and downs in the heights of the bins, creating a jagged outline." width="576"/></p>
</div>
</div>
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram">geom_freqpoly()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram">geom_histogram()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram">geom_histogram()</a></code>, but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than overlapping bars.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid" alt="A frequency polygon of carats of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 3 and the y-axis ranges from 0 to almost 6000. Ideal diamonds have a much higher peak than the others around 0.25 carats. All cuts of diamonds have right skewed distributions with local peaks at 1 carat and 2 carats. As the cut level increases (from Fair to Ideal), so does the number of diamonds that fall into that category." width="576"/></p>
</div>
</div>
<p>We’ve also customized the thickness of the lines using the <code>linewidth</code> argument in order to make them stand out a bit more against the background.</p>
<p>There are a few challenges with this type of plot, which we will come back to in <a href="#sec-cat-cont" data-type="xref">#sec-cat-cont</a> on visualizing a categorical and a continuous variable.</p>
<p>Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).</p>
</section>
<section id="typical-values" data-type="sect2">
<h2>
Typical values</h2>
<p>In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:</p>
<ul><li><p>Which values are the most common? Why?</p></li>
<li><p>Which values are rare? Why? Does that match your expectations?</p></li>
<li><p>Can you see any unusual patterns? What might explain them?</p></li>
</ul><p>As an example, the histogram below suggests several interesting questions:</p>
<ul><li><p>Why are there more diamonds at whole carats and common fractions of carats?</p></li>
<li><p>Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?</p></li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-9-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak." width="576"/></p>
</div>
</div>
<p>Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:</p>
<ul><li><p>How are the observations within each cluster similar to each other?</p></li>
<li><p>How are the observations in separate clusters different from each other?</p></li>
<li><p>How can you explain or describe the clusters?</p></li>
<li><p>Why might the appearance of clusters be misleading?</p></li>
</ul><p>The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid" alt="A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5." width="576"/></p>
</div>
</div>
<p>Many of the questions above will prompt you to explore a relationship <em>between</em> variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.</p>
</section>
<section id="unusual-values" data-type="sect2">
<h2>
Unusual values</h2>
<p>Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the <code>y</code> variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
geom_histogram(binwidth = 0.5)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
</div>
</div>
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian">coord_cartesian()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
geom_histogram(binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p>
</div>
</div>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian">coord_cartesian()</a></code> also has an <code>xlim</code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
<p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">unusual &lt;- diamonds |&gt;
filter(y &lt; 3 | y &gt; 20) |&gt;
select(price, x, y, z) |&gt;
arrange(y)
unusual
#&gt; # A tibble: 9 × 4
#&gt; price x y z
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 5139 0 0 0
#&gt; 2 6381 0 0 0
#&gt; 3 12800 0 0 0
#&gt; 4 15686 0 0 0
#&gt; 5 18034 0 0 0
#&gt; 6 2130 0 0 0
#&gt; 7 2130 0 0 0
#&gt; 8 2075 5.15 31.8 5.12
#&gt; 9 12210 8.09 58.9 8.06</pre>
</div>
<p>The <code>y</code> variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!</p>
<p>It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
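<p>As a minimal sketch of that with-and-without comparison, the code below summarizes <code>y</code> once on all rows and once after dropping the rows flagged above. The summary statistics and the names <code>all_rows</code>, <code>trimmed</code>, and <code>comparison</code> are illustrative choices rather than a prescribed workflow; in practice you would rerun whatever analysis you actually care about.</p>

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# Summarize y using every row, including the suspicious values.
all_rows <- diamonds |>
  summarize(n = n(), mean_y = mean(y), sd_y = sd(y), max_y = max(y))

# The same summary after dropping the rows flagged as unusual above.
trimmed <- diamonds |>
  filter(between(y, 3, 20)) |>
  summarize(n = n(), mean_y = mean(y), sd_y = sd(y), max_y = max(y))

# Stack the two summaries; the `data` column records which version
# each row describes.
comparison <- bind_rows(all_rows = all_rows, trimmed = trimmed, .id = "data")
comparison
```

<p>If the two rows of <code>comparison</code> are close, the outliers have little influence on these particular summaries; a large gap is a signal to investigate before dropping anything.</p>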
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
<li><p>Explore the distribution of <code>price</code>. Do you discover anything unusual or surprising? (Hint: Carefully think about the <code>binwidth</code> and make sure you try a wide range of values.)</p></li>
<li><p>How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian">coord_cartesian()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/lims">xlim()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/lims">ylim()</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try and zoom so only half a bar shows?</p></li>
</ol></section>
</section>
<section id="sec-missing-values-eda" data-type="sect1">
<h1>
Missing values</h1>
<p>If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.</p>
<ol type="1"><li>
<p>Drop the entire row with the strange values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds2 &lt;- diamonds |&gt;
filter(between(y, 3, 20))</pre>
</div>
<p>We don’t recommend this option: one invalid measurement doesn’t mean that all of the measurements in the row are invalid. Additionally, if you have low quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!</p>
</li>
<li>
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="https://dplyr.tidyverse.org/reference/mutate">mutate()</a></code> to replace the variable with a modified copy. You can use the <code><a href="https://dplyr.tidyverse.org/reference/if_else">if_else()</a></code> function to replace unusual values with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds2 &lt;- diamonds |&gt;
mutate(y = if_else(y &lt; 3 | y &gt; 20, NA, y))</pre>
</div>
</li>
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else">if_else()</a></code> has three arguments. The first argument, <code>test</code>, should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else">if_else()</a></code> statements nested inside one another.</p>
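<p>To illustrate the nested-conditions case, here is a small sketch that uses <code>case_when()</code> to label each diamond according to its <code>y</code> value. The column name <code>y_flag</code> and the label strings are hypothetical, chosen only for this example; the thresholds are the ones used in the filter above.</p>

```r
library(dplyr)
library(ggplot2) # provides the diamonds dataset

# y_flag is a hypothetical helper column used only for illustration.
flagged <- diamonds |>
  mutate(
    y_flag = case_when(
      y == 0 ~ "zero",       # conditions are checked in order,
      y < 3  ~ "too small",  # so this only catches non-zero values below 3
      y > 20 ~ "too large",
      TRUE   ~ "plausible"   # fallback for everything else
    )
  )

# How many diamonds fall into each category?
flagged |>
  count(y_flag)
```

<p>Written with <code>if_else()</code>, the same logic would need three calls nested inside one another, which is harder to read and to extend.</p>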
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
#&gt; Warning: Removed 9 rows containing missing values (`geom_point()`).</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a strong, linear association between the two variables. All but one of the diamonds has length greater than 3. The one outlier has a length of 0 and a width of about 6.5." width="576"/></p>
</div>
</div>
<p>To suppress that warning, set <code>na.rm = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)</pre>
</div>
<p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights">nycflights13::flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA">is.na()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">nycflights13::flights |&gt;
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + (sched_min / 60)
) |&gt;
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A frequency polygon of scheduled departure times of flights. Two lines represent flights that are cancelled and not cancelled. The x-axis ranges from 0 to 25 hours and the y-axis ranges from 0 to 10000. The number of flights not cancelled is much higher than the number cancelled." width="576"/></p>
</div>
</div>
<p>However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
<li><p>What does <code>na.rm = TRUE</code> do in <code><a href="https://rdrr.io/r/base/mean">mean()</a></code> and <code><a href="https://rdrr.io/r/base/sum">sum()</a></code>?</p></li>
</ol></section>
</section>
<section id="covariation" data-type="sect1">
<h1>
Covariation</h1>
<p>If variation describes the behavior <em>within</em> a variable, covariation describes the behavior <em>between</em> variables. <strong>Covariation</strong> is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that depends again on the types of variables involved.</p>
<section id="sec-cat-cont" data-type="sect2">
<h2>
A categorical and continuous variable</h2>
<p>It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
</div>
</div>
<p>It’s hard to see the difference in distribution because the overall counts differ so much:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal." width="576"/></p>
</div>
</div>
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
</div>
<p>Note that we’re mapping the density to <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes_eval" data-type="xref">after_stat()</a></code> function to do so.</p>
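<p>To see what that calculation involves, the following sketch reproduces it by hand with dplyr. It assumes bins of width 500 with edges at multiples of 500, which may not line up exactly with the bins ggplot2 chooses, and the names <code>binwidth</code>, <code>density_check</code>, and <code>area</code> are ours. Within each cut, the density of a bin is its count divided by the product of the group total and the bin width, so the area for each cut should come out as one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Assumed bin width, matching the plot above
binwidth &lt;- 500

density_check &lt;- diamonds |&gt;
  # Assign each price to a bin of width 500, with edges at multiples of 500
  mutate(bin = cut_width(price, binwidth, boundary = 0)) |&gt;
  count(cut, bin) |&gt;
  group_by(cut) |&gt;
  # Standardize the counts within each cut
  mutate(density = n / (sum(n) * binwidth)) |&gt;
  # Area = sum of (height x width) over the bins of each cut
  summarize(area = sum(density * binwidth))

density_check</pre>
</div>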
<p>There’s something rather surprising about this plot: it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret; there’s a lot going on in this plot.</p>
<p>Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A <strong>boxplot</strong> is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:</p>
<ul><li><p>A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.</p></li>
<li><p>Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual so are plotted individually.</p></li>
<li><p>A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.</p></li>
</ul><div class="cell">
<div class="cell-output-display">
<p><img src="images/EDA-boxplot.png" class="img-fluid" alt="A diagram depicting how a boxplot is created following the steps outlined above." width="1066"/></p>
</div>
</div>
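<p>The components described above can also be computed by hand. The following is a minimal sketch that uses the prices of Fair diamonds as an arbitrary example; all variable names are ours, and the numbers ggplot2 draws could differ slightly depending on how it computes quantiles.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

fair &lt;- diamonds |&gt; filter(cut == "Fair")

# The box: 25th, 50th, and 75th percentiles
q &lt;- quantile(fair$price, c(0.25, 0.5, 0.75), names = FALSE)
iqr &lt;- q[3] - q[1]

# Points more than 1.5 * IQR from either edge of the box are outliers
lower_fence &lt;- q[1] - 1.5 * iqr
upper_fence &lt;- q[3] + 1.5 * iqr
is_outlier &lt;- fair$price &lt; lower_fence | fair$price &gt; upper_fence

# The whiskers stop at the most extreme non-outlying observations
whiskers &lt;- range(fair$price[!is_outlier])

list(box = q, iqr = iqr, whiskers = whiskers, n_outliers = sum(is_outlier))</pre>
</div>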
<p>Let’s take a look at the distribution of price by cut using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">geom_boxplot()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest." width="576"/></p>
</div>
</div>
<p>We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). This plot supports the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you’ll be challenged to figure out why.</p>
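<p>One way to double-check what the boxplots show is to compute a few summaries directly. This is only a sketch (the name <code>price_by_cut</code> and the summary columns are ours), but it puts numbers next to the medians drawn in the plot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

price_by_cut &lt;- diamonds |&gt;
  group_by(cut) |&gt;
  summarize(
    n = n(),                      # number of diamonds of each cut
    median_price = median(price), # the line in the middle of each box
    mean_price = mean(price)
  )

price_by_cut</pre>
</div>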
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code>fct_reorder()</code> function.</p>
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
</div>
</div>
<p>To make the trend easier to see, we can reorder <code>class</code> based on the median value of <code>hwy</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
       mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p>
</div>
</div>
<p>If you have long variable names, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
       mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-28-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the y-axis and ordered by increasing median highway mileage." width="576"/></p>
</div>
</div>
<section id="exercises-2" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
<li><p>What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?</p></li>
<li><p>Instead of exchanging the x and y variables, add <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_flip" data-type="xref">coord_flip()</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li>
<li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs cut. What do you learn? How do you interpret the plots?</p></li>
<li><p>Compare and contrast <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_violin" data-type="xref">geom_violin()</a></code> with a faceted <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">geom_histogram()</a></code>, or a coloured <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">geom_freqpoly()</a></code>. What are the pros and cons of each method?</p></li>
<li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">geom_jitter()</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">geom_jitter()</a></code>. List them and briefly describe what each one does.</p></li>
</ol></section>
</section>
<section id="two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_count" data-type="xref">geom_count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
  geom_count()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-29-1.png" class="img-fluid" alt="A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000." width="576"/></p>
</div>
</div>
<p>The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.</p>
<p>A more commonly used way of representing the covariation between two categorical variables is a segmented bar chart. To create one, we map the variable that defines the bars to the <code>x</code> aesthetic and the variable that subdivides each bar to the <code>fill</code> aesthetic.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-30-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The number of diamonds for each level of cut increases from Fair to Ideal and the heights of the segments within each bar represent the number of diamonds that fall within each color/cut combination. There appear to be some of each color of diamonds within each level of cut of diamonds." width="576"/></p>
</div>
</div>
<p>However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
  geom_bar(position = "fill")</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-31-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The heights of each of the bars representing each cut of diamond are the same, 1. The heights of the segments within each bar represent the proportion of diamonds that fall within each color/cut combination. The proportions don't appear to be very different across the levels of cut." width="576"/></p>
</div>
</div>
<p>Another approach for exploring the relationship between these variables is computing the counts with dplyr:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  count(color, cut)
#&gt; # A tibble: 35 × 3
#&gt;   color cut           n
#&gt;   &lt;ord&gt; &lt;ord&gt;     &lt;int&gt;
#&gt; 1 D     Fair        163
#&gt; 2 D     Good        662
#&gt; 3 D     Very Good  1513
#&gt; 4 D     Premium    1603
#&gt; 5 D     Ideal      2834
#&gt; 6 E     Fair        224
#&gt; # … with 29 more rows</pre>
</div>
<p>Then visualize with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">geom_tile()</a></code> and the fill aesthetic:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  count(color, cut) |&gt;
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-33-1.png" class="img-fluid" alt="A tile plot of cut vs. color of diamonds. Each tile represents a cut/color combination and tiles are colored according to the number of observations in each tile. There are more Ideal diamonds than other cuts, with the highest number being Ideal diamonds with color G. Fair diamonds and diamonds with color I are the lowest in frequency." width="576"/></p>
</div>
</div>
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>
<section id="exercises-3" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
<li><p>How does the segmented bar chart change if <code>color</code> is mapped to the <code>x</code> aesthetic and <code>cut</code> is mapped to the <code>fill</code> aesthetic? Calculate the counts that fall into each of the segments.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">geom_tile()</a></code> together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?</p></li>
<li><p>Why is it slightly better to use <code>aes(x = color, y = cut)</code> rather than <code>aes(x = cut, y = color)</code> in the example above?</p></li>
</ol></section>
</section>
<section id="two-continuous-variables" data-type="sect2">
<h2>
Two continuous variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-34-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential." width="576"/></p>
</div>
</div>
<p>Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the <code>alpha</code> aesthetic to add transparency.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 1 / 100)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
</div>
</div>
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">geom_histogram()</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">geom_bin2d()</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">geom_hex()</a></code> to bin in two dimensions.</p>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">geom_bin2d()</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">geom_bin2d()</a></code> creates rectangular bins. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">geom_hex()</a></code>.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_bin2d()

# install.packages("hexbin")
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_hex()</pre>
<div class="cell quarto-layout-panel">
</div>
</div>
<p>Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin <code>carat</code> and then for each group, display a boxplot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
</div>
</div>
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of each boxplot reflect the number of points it summarizes with <code>varwidth = TRUE</code>.</p>
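<p>As a sketch, the previous plot with variable-width boxes might look like the following. We assume here that <code>smaller</code> is the subset of diamonds under 3 carats and recreate it so the example stands on its own. According to the ggplot2 documentation, the box widths scale with the square root of the number of observations in each group, so wider boxes summarize more diamonds.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Assumption: `smaller` is the subset of diamonds under 3 carats
smaller &lt;- diamonds |&gt; filter(carat &lt; 3)

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(
    mapping = aes(group = cut_width(carat, 0.1)),
    varwidth = TRUE # wider boxes summarize more observations
  )</pre>
</div>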
<p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">cut_number()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-38-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. The diamonds are split into 20 bins containing approximately equal numbers of diamonds, and each box plot represents one bin. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
</div>
</div>
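<p>To make the difference between the two binning strategies concrete, you can count how many diamonds land in each bin. This is a rough sketch: it assumes that <code>smaller</code> is the subset of diamonds under 3 carats, and the names <code>width_bins</code> and <code>number_bins</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Assumption: `smaller` is the subset of diamonds under 3 carats
smaller &lt;- diamonds |&gt; filter(carat &lt; 3)

# Fixed-width bins: the number of diamonds per bin varies widely
width_bins &lt;- smaller |&gt; count(bin = cut_width(carat, 0.1))
range(width_bins$n)

# Equal-count bins: 20 bins with roughly the same number of diamonds each
number_bins &lt;- smaller |&gt; count(bin = cut_number(carat, 20))
range(number_bins$n)</pre>
</div>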
<section id="exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">cut_width()</a></code> vs <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
<li><p>Visualize the distribution of <code>carat</code>, partitioned by <code>price</code>.</p></li>
<li><p>How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?</p></li>
<li><p>Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.</p></li>
<li>
<p>Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of <code>x</code> and <code>y</code> values, which makes the points outliers even though their <code>x</code> and <code>y</code> values appear normal when examined separately.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
  geom_point() +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-39-1.png" class="img-fluid" alt="A scatterplot of widths vs. lengths of diamonds. There is a positive, strong, linear relationship. There are a few unusual observations above and below the bulk of the data, more below it than above." width="576"/></p>
</div>
</div>
<p>Why is a scatterplot a better display than a binned plot for this case?</p>
</li>
</ol></section>
</section>
</section>
<section id="patterns-and-models" data-type="sect1">
<h1>
Patterns and models</h1>
<p>Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:</p>
<ul><li><p>Could this pattern be due to coincidence (i.e. random chance)?</p></li>
<li><p>How can you describe the relationship implied by the pattern?</p></li>
<li><p>How strong is the relationship implied by the pattern?</p></li>
<li><p>What other variables might affect the relationship?</p></li>
<li><p>Does the relationship change if you look at individual subgroups of the data?</p></li>
</ul><p>A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-40-1.png" class="img-fluid" alt="A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times." width="576"/></p>
</div>
</div>
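<p>The question of how strong the relationship is can also be given a rough numeric answer. As a sketch, the correlation coefficient from base R’s <code>cor()</code> summarizes the strength of the linear association between the two variables (the name <code>faithful_cor</code> is ours). It is only one possible summary, and it says nothing about the two clusters visible in the plot.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Correlation between eruption length and waiting time;
# values near 1 indicate a strong positive linear association
faithful_cor &lt;- cor(faithful$eruptions, faithful$waiting)
faithful_cor</pre>
</div>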
<p>Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.</p>
<p>Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts <code>price</code> from <code>carat</code> and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of <code>price</code> and <code>carat</code>, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidymodels)

diamonds &lt;- diamonds |&gt;
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  )

diamonds_fit &lt;- linear_reg() |&gt;
  fit(log_price ~ log_carat, data = diamonds)

diamonds_aug &lt;- augment(diamonds_fit, new_data = diamonds) |&gt;
  mutate(.resid = exp(.resid))

ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-41-1.png" class="img-fluid" alt="A scatter plot of residuals vs. carat of diamonds. The x-axis ranges from 0 to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered around low values of carat and residuals. There is a clear, curved pattern showing decrease in residuals as carat increases." width="576"/></p>
</div>
</div>
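<p>To make the exponentiated residuals more concrete, here is a rough base R equivalent of the model above, assuming that <code>linear_reg()</code> with its default engine fits an ordinary least squares model. The names <code>fit</code>, <code>price_ratio</code>, and <code>predicted_price</code> are ours. Because the model is fit on the log scale, exponentiating a residual gives the ratio of the actual price to the price predicted from carat alone: a value of 2 means the diamond costs twice what its size would suggest.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Ordinary least squares on the log scale
fit &lt;- lm(log(price) ~ log(carat), data = diamonds)

# exp(residual) = actual price / price predicted from carat
price_ratio &lt;- exp(residuals(fit))
predicted_price &lt;- exp(fitted(fit))

# The two ways of computing the ratio agree
all.equal(unname(price_ratio), diamonds$price / unname(predicted_price))</pre>
</div>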
<p>Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-42-1.png" class="img-fluid" alt="Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end." width="576"/></p>
</div>
</div>
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
</section>
<section id="ggplot2-calls" data-type="sect1">
<h1>
ggplot2 calls</h1>
<p>As we move on from these introductory chapters, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_freqpoly(binwidth = 0.25)</pre>
</div>
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">ggplot()</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">aes()</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
<p>Rewriting the previous plot more concisely yields:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(faithful, aes(eruptions)) +
  geom_freqpoly(binwidth = 0.25)</pre>
</div>
<p>Sometimes we’ll turn the end of a pipeline of data transformation into a plot. Watch for the transition from <code>|&gt;</code> to <code>+</code>. We wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  count(cut, clarity) |&gt;
  ggplot(aes(clarity, cut, fill = n)) +
  geom_tile()</pre>
</div>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.</p>
<p>In the next chapter, we’ll tackle our final piece of workflow advice: how to get help when you’re stuck.</p>
</section>
</section>

527
oreilly/base-R.html Normal file

@ -0,0 +1,527 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">library()</a></code> to load packages, to <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">sum()</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
</section>
<section id="sec-subset-many" data-type="sect1">
<h1>
Selecting multiple elements with <code>[</code>
</h1>
<p><code>[</code> is used to extract sub-components from vectors and data frames, and is called like <code>x[i]</code> or <code>x[i, j]</code>. In this section, we’ll introduce you to the power of <code>[</code>, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of <code>[</code>.</p>
<section id="subsetting-vectors" data-type="sect2">
<h2>
Subsetting vectors</h2>
<p>There are five main types of things that you can subset a vector with, i.e. that can be the <code>i</code> in <code>x[i]</code>:</p>
<ol type="1"><li>
<p><strong>A vector of positive integers</strong>. Subsetting with positive integers keeps the elements at those positions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
#&gt; [1] "three" "two" "five"</pre>
</div>
<p>By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x[c(1, 1, 5, 5, 5, 2)]
#&gt; [1] "one" "one" "five" "five" "five" "two"</pre>
</div>
</li>
<li>
<p><strong>A vector of negative integers</strong>. Negative values drop the elements at the specified positions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x[c(-1, -3, -5)]
#&gt; [1] "two" "four"</pre>
</div>
</li>
<li>
<p><strong>A logical vector</strong>. Subsetting with a logical vector keeps all values corresponding to a <code>TRUE</code> value. This is most often useful in conjunction with the comparison functions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
!is.na(x)
#&gt; [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE
x[!is.na(x)]
#&gt; [1] 10 3 5 8 1
# All even (or missing!) values of x
x %% 2 == 0
#&gt; [1] TRUE FALSE NA FALSE TRUE FALSE NA
x[x %% 2 == 0]
#&gt; [1] 10 NA 8 NA</pre>
</div>
<p>Note that, unlike <code>filter()</code>, <code>NA</code> indices will be included in the output as <code>NA</code>s.</p>
</li>
<li>
<p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
#&gt; xyz def
#&gt; 5 2</pre>
</div>
<p>As with subsetting with positive integers, you can use a character vector to duplicate individual entries.</p>
</li>
<li><p><strong>Nothing</strong>. The final type of subsetting is nothing, <code>x[]</code>, which returns the complete <code>x</code>. This is not useful for subsetting vectors, but as we’ll see shortly it is useful when subsetting 2d structures like tibbles.</p></li>
</ol></section>
<section id="subsetting-data-frames" data-type="sect2">
<h2>
Subsetting data frames</h2>
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to select rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
<p>Here are a couple of examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
x = 1:3,
y = c("a", "e", "f"),
z = runif(3)
)
# Select first row and second column
df[1, 2]
#&gt; # A tibble: 1 × 1
#&gt; y
#&gt; &lt;chr&gt;
#&gt; 1 a
# Select all rows and columns x and y
df[, c("x" , "y")]
#&gt; # A tibble: 3 × 2
#&gt; x y
#&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1 a
#&gt; 2 2 e
#&gt; 3 3 f
# Select rows where `x` is greater than 1 and all columns
df[df$x &gt; 1, ]
#&gt; # A tibble: 2 × 3
#&gt; x y z
#&gt; &lt;int&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 e 0.834
#&gt; 2 3 f 0.601</pre>
</div>
<p>We’ll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn’t use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
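<p>As a minimal sketch of combining these ideas, the example below filters rows and picks columns in a single call. It uses a base <code>data.frame</code> named <code>df_demo</code> with made-up values (both are assumptions for illustration, not part of the example above) so that it runs without any packages. <code>with()</code> is one base R way to avoid repeating the <code>df_demo$</code> prefix:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_demo &lt;- data.frame(x = 1:3, y = c("a", "e", "f"), z = c(0.2, 0.8, 0.6))

# Rows where x &gt; 1 and z &gt; 0.7, keeping only columns x and y
res &lt;- df_demo[df_demo$x &gt; 1 &amp; df_demo$z &gt; 0.7, c("x", "y")]
res
#&gt;   x y
#&gt; 2 2 e

# with() evaluates the condition using the columns of df_demo
res_with &lt;- with(df_demo, df_demo[x &gt; 1 &amp; z &gt; 0.7, c("x", "y")])
identical(res, res_with)
#&gt; [1] TRUE</pre>
</div>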
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frames interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>s. So if <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 &lt;- data.frame(x = 1:3)
df1[, "x"]
#&gt; [1] 1 2 3
df2 &lt;- tibble(x = 1:3)
df2[, "x"]
#&gt; # A tibble: 3 × 1
#&gt; x
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3</pre>
</div>
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1[, "x", drop = FALSE]
#&gt; x
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3</pre>
</div>
</section>
<section id="dplyr-equivalents" data-type="sect2">
<h2>
dplyr equivalents</h2>
<p>A number of dplyr verbs are special cases of <code>[</code>:</p>
<ul><li>
<p><code>filter()</code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
x = c(2, 3, 1, 1, NA),
y = letters[1:5],
z = runif(5)
)
df |&gt; filter(x &gt; 1)
# same as
df[!is.na(df$x) &amp; df$x &gt; 1, ]</pre>
</div>
<p>Another common technique in the wild is to use <code>which()</code> for its side-effect of dropping missing values: <code>df[which(df$x &gt; 1), ]</code>.</p>
</li>
<li>
<p><code>arrange()</code> is equivalent to subsetting the rows with an integer vector, usually created with <code>order()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; arrange(x, y)
# same as
df[order(df$x, df$y), ]</pre>
</div>
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to sort individual columns in decreasing order.</p>
</li>
<li>
<p>Both <code>select()</code> and <code>relocate()</code> are similar to subsetting the columns with a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; select(x, z)
# same as
df[, c("x", "z")]</pre>
</div>
</li>
</ul><p>Base R also provides a function that combines the features of <code>filter()</code> and <code>select()</code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code>starts_with()</code>.</span> called <code>subset()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
filter(x &gt; 1) |&gt;
select(y, z)
#&gt; # A tibble: 2 × 2
#&gt; y z
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 a 0.157
#&gt; 2 b 0.00740
# same as
df |&gt; subset(x &gt; 1, c(y, z))
#&gt; # A tibble: 2 × 2
#&gt; y z
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 a 0.157
#&gt; 2 b 0.00740</pre>
</div>
<p>This function was the inspiration for much of dplyr’s syntax.</p>
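<p>The <code>which()</code> and <code>-rank()</code> idioms mentioned in this section can be sketched as follows. The data frame <code>df_demo</code> is a made-up base <code>data.frame</code> stand-in (an assumption for illustration) so that the example runs without the tidyverse:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_demo &lt;- data.frame(x = c(2, 3, 1, 1, NA), y = letters[1:5])

# A bare logical condition keeps an all-NA row for the missing x...
nrow(df_demo[df_demo$x &gt; 1, ])
#&gt; [1] 3

# ...whereas which() returns only the positions that are TRUE
which(df_demo$x &gt; 1)
#&gt; [1] 1 2
nrow(df_demo[which(df_demo$x &gt; 1), ])
#&gt; [1] 2

# -rank() reverses a single sort key, and also works for character
# columns, which can't be negated directly
df_demo[order(-rank(df_demo$y)), "y"]
#&gt; [1] "e" "d" "c" "b" "a"</pre>
</div>
<p>Note that the last call returns a plain character vector rather than a data frame, because <code>df_demo</code> is a <code>data.frame</code> and only one column was selected.</p>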
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Create functions that take a vector as input and return:</p>
<ol type="a"><li>The elements at even numbered positions.</li>
<li>Every element except the last value.</li>
<li>Only even values (and no missing values).</li>
</ol></li>
<li><p>Why is <code>x[-which(x &gt; 0)]</code> not the same as <code>x[x &lt;= 0]</code>? Read the documentation for <code>which()</code> and do some experiments to figure it out.</p></li>
</ol></section>
</section>
<section id="sec-subset-one" data-type="sect1">
<h1>
Selecting a single element with <code>$</code> and <code>[[</code>
</h1>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frame</code>s and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
<section id="data-frames" data-type="sect2">
<h2>
Data frames</h2>
<p><code>[[</code> and <code>$</code> can be used like <code>pull()</code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb &lt;- tibble(
x = 1:4,
y = c(10, 4, 1, 21)
)
# by position
tb[[1]]
#&gt; [1] 1 2 3 4
# by name
tb[["x"]]
#&gt; [1] 1 2 3 4
tb$x
#&gt; [1] 1 2 3 4</pre>
</div>
<p>They can also be used to create new columns, the base R equivalent of <code>mutate()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb$z &lt;- tb$x + tb$y
tb
#&gt; # A tibble: 4 × 3
#&gt; x y z
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 10 11
#&gt; 2 2 4 6
#&gt; 3 3 1 4
#&gt; 4 4 21 25</pre>
</div>
<p>There are a number of other base approaches to creating new columns, including <code>transform()</code>, <code>with()</code>, and <code>within()</code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
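<p>To give a rough idea of how these base helpers behave, here is a small sketch. The data frame <code>df_demo</code> and the new column name <code>z</code> are illustrative assumptions that mirror the example above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_demo &lt;- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# with() evaluates an expression using the columns as variables
with(df_demo, x + y)
#&gt; [1] 11  6  4 25

# transform() returns a copy of the data frame with z added
df2 &lt;- transform(df_demo, z = x + y)

# within() also returns a modified copy, using ordinary assignment
df3 &lt;- within(df_demo, z &lt;- x + y)

identical(df2$z, df3$z)
#&gt; [1] TRUE</pre>
</div>
<p>None of these functions modify <code>df_demo</code> in place; you need to assign the result if you want to keep it.</p>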
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code>summarise()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">max(diamonds$carat)
#&gt; [1] 5.01
levels(diamonds$cut)
#&gt; [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
</div>
</section>
<section id="tibbles" data-type="sect2">
<h2>
Tibbles</h2>
<p>There are a couple of important differences between tibbles and base <code>data.frame</code>s when it comes to <code>$</code>. Data frames match the prefix of any variable names (so-called <strong>partial matching</strong>) and don’t complain if a column doesn’t exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- data.frame(x1 = 1)
df$x
#&gt; Warning in df$x: partial match of 'x' to 'x1'
#&gt; [1] 1
df$z
#&gt; NULL</pre>
</div>
<p>Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb &lt;- tibble(x1 = 1)
tb$x
#&gt; Warning: Unknown or uninitialised column: `x`.
#&gt; NULL
tb$z
#&gt; Warning: Unknown or uninitialised column: `z`.
#&gt; NULL</pre>
</div>
<p>For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.</p>
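<p>If you work with base <code>data.frame</code>s and want to sidestep partial matching, one option is <code>[[</code>, which matches names exactly by default. A small sketch, reusing the shape of the example above under the hypothetical name <code>df_demo</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_demo &lt;- data.frame(x1 = 1)

# `$` partially matches the prefix "x" to the column x1...
df_demo$x
#&gt; [1] 1

# ...but `[[` requires an exact name by default, so it returns NULL
df_demo[["x"]]
#&gt; NULL</pre>
</div>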
</section>
<section id="lists" data-type="sect2">
<h2>
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">l &lt;- list(
a = 1:3,
b = "a string",
c = pi,
d = list(-1, -5)
)</pre>
</div>
<ul><li>
<p><code>[</code> extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str(l[1:2])
#&gt; List of 2
#&gt; $ a: int [1:3] 1 2 3
#&gt; $ b: chr "a string"
str(l[4])
#&gt; List of 1
#&gt; $ d:List of 2
#&gt; ..$ : num -1
#&gt; ..$ : num -5</pre>
</div>
<p>Like with vectors, you can subset with a logical, integer, or character vector.</p>
</li>
<li>
<p><code>[[</code> and <code>$</code> extract a single component from a list. They remove a level of hierarchy from the list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str(l[[1]])
#&gt; int [1:3] 1 2 3
str(l[[4]])
#&gt; List of 2
#&gt; $ : num -1
#&gt; $ : num -5
str(l$a)
#&gt; int [1:3] 1 2 3</pre>
</div>
</li>
</ul><p>The difference between <code>[</code> and <code>[[</code> is particularly important for lists because <code>[[</code> drills down into the list while <code>[</code> returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in <a href="#fig-pepper-1" data-type="xref">#fig-pepper-1</a>. If this pepper shaker is your list <code>pepper</code>, then <code>pepper[1]</code> is a pepper shaker containing a single pepper packet, as in <a href="#fig-pepper-2" data-type="xref">#fig-pepper-2</a>. <code>pepper[2]</code> would look the same, but would contain the second packet. <code>pepper[1:2]</code> would be a pepper shaker containing two pepper packets. <code>pepper[[1]]</code> would extract the pepper packet itself, as in <a href="#fig-pepper-3" data-type="xref">#fig-pepper-3</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pepper-1"><p><img src="images/pepper.jpg" style="width:25.0%" alt="A photo of a glass pepper shaker. Instead of the pepper shaker containing pepper, it contains many packets of pepper."/></p>
<figcaption>Figure 26.1: A pepper shaker that Hadley once found in his hotel room.</figcaption>
</figure>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pepper-2" class="figure"><p><img src="images/pepper-1.jpg" style="width:25.0%" alt="A photo of the glass pepper shaker containing just one packet of pepper."/></p>
<figcaption class="figure-caption">Figure 26.2: <code>pepper[1]</code></figcaption>
</figure>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pepper-3" class="figure"><p><img src="images/pepper-2.jpg" style="width:25.0%" alt="A photo of a single packet of pepper."/></p>
<figcaption class="figure-caption">Figure 26.3: <code>pepper[[1]]</code></figcaption>
</figure>
</div>
</div>
<p>This same principle applies when you use 1d <code>[</code> with a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = 1:3, y = 3:5)
# returns a one-column data frame
df["x"]
#&gt; # A tibble: 3 × 1
#&gt; x
#&gt; &lt;int&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 3
# returns the contents of x
df[["x"]]
#&gt; [1] 1 2 3</pre>
</div>
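<p>One practical consequence, sketched below with a hypothetical list <code>l_demo</code>, is that functions expecting a vector generally need the result of <code>[[</code> rather than <code>[</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">l_demo &lt;- list(a = 1:3, d = list(-1, -5))

# `[` keeps the list wrapper: a list of length 1 that contains `a`
class(l_demo["a"])
#&gt; [1] "list"

# `[[` removes the wrapper and returns the element itself
class(l_demo[["a"]])
#&gt; [1] "integer"
sum(l_demo[["a"]])
#&gt; [1] 6

# `[[` can be chained to drill down through nested lists
l_demo[["d"]][[2]]
#&gt; [1] -5</pre>
</div>
<p>Calling <code>sum(l_demo["a"])</code> instead would fail, because <code>sum()</code> can’t add up a list.</p>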
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens when you use <code>[[</code> with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?</p></li>
<li><p>What would <code>pepper[[1]][1]</code> be? What about <code>pepper[[1]][[1]]</code>?</p></li>
</ol></section>
</section>
<section id="apply-family" data-type="sect1">
<h1>
Apply family</h1>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code>across()</code> and the map family of functions. In this section, you’ll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.</p>
<p>The most important member of this family is <code>lapply()</code>, which is very similar to <code>map()</code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.</span>. In fact, because we haven’t used any of <code>map()</code>’s more advanced features, you can replace every <code>map()</code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code>lapply()</code>.</p>
<p>There’s no exact base R equivalent to <code>across()</code> but you can get close by using <code>[</code> with <code>lapply()</code>. This works because under the hood, data frames are lists of columns, so calling <code>lapply()</code> on a data frame applies the function to each column.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
# First find numeric columns
num_cols &lt;- sapply(df, is.numeric)
num_cols
#&gt; a b c d e
#&gt; TRUE TRUE FALSE FALSE TRUE
# Then transform each column with lapply() then replace the original values
df[, num_cols] &lt;- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
df
#&gt; # A tibble: 1 × 5
#&gt; a b c d e
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 4 a b 8</pre>
</div>
<p>The code above uses a new function, <code>sapply()</code>. It’s similar to <code>lapply()</code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called <code>map_vec()</code> that we didn’t mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
<p>Base R provides a stricter version of <code>sapply()</code> called <code>vapply()</code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code>sapply()</code> call above with this <code>vapply()</code> where we specify that we expect <code>is.numeric()</code> to return a logical vector of length 1:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">vapply(df, is.numeric, logical(1))
#&gt; a b c d e
#&gt; TRUE TRUE FALSE FALSE TRUE</pre>
</div>
<p>The distinction between <code>sapply()</code> and <code>vapply()</code> is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.</p>
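<p>Zero-length input is one such unusual case. The sketch below wraps each function in a hypothetical helper (<code>double_s()</code> and <code>double_v()</code> are names invented for this illustration) to show how the return type can change:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">double_s &lt;- function(x) sapply(x, \(el) el * 2)
double_v &lt;- function(x) vapply(x, \(el) el * 2, numeric(1))

# Both agree on ordinary input...
double_s(c(1, 2, 3))
#&gt; [1] 2 4 6
double_v(c(1, 2, 3))
#&gt; [1] 2 4 6

# ...but with nothing to simplify, sapply() falls back to an empty list,
# while vapply() still returns the promised type
double_s(numeric())
#&gt; list()
double_v(numeric())
#&gt; numeric(0)</pre>
</div>
<p>Code that later does arithmetic on the result would work with the output of <code>double_v()</code> in both cases, but could fail on the empty list returned by <code>double_s()</code>.</p>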
<p>Another important member of the apply family is <code>tapply()</code> which computes a single grouped summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
group_by(cut) |&gt;
summarise(price = mean(price))
#&gt; # A tibble: 5 × 2
#&gt; cut price
#&gt; &lt;ord&gt; &lt;dbl&gt;
#&gt; 1 Fair 4359.
#&gt; 2 Good 3929.
#&gt; 3 Very Good 3982.
#&gt; 4 Premium 4584.
#&gt; 5 Ideal 3458.
tapply(diamonds$price, diamonds$cut, mean)
#&gt; Fair Good Very Good Premium Ideal
#&gt; 4358.758 3928.864 3981.760 4584.258 3457.542</pre>
</div>
<p>Unfortunately <code>tapply()</code> returns its results in a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use <code>tapply()</code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques at <a href="https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec" class="uri">https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec</a>.</p>
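<p>As one possible illustration of those gymnastics, the sketch below rebuilds a data frame from a <code>tapply()</code> result by hand and compares it with <code>aggregate()</code>, another base function that returns a data frame directly. It uses the built-in <code>mtcars</code> dataset rather than <code>diamonds</code>, purely so that it runs without loading any packages:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Mean mpg for each number of cylinders, returned with one name per group
avg &lt;- tapply(mtcars$mpg, mtcars$cyl, mean)

# Collect the group labels and the summaries into a data frame by hand
by_hand &lt;- data.frame(cyl = names(avg), mpg = as.vector(avg))

# aggregate() does the grouping and returns a data frame in one step
agg &lt;- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

all.equal(by_hand$mpg, agg$mpg)
#&gt; [1] TRUE</pre>
</div>
<p>Note that the hand-built version stores <code>cyl</code> as character, because the group labels come from names, whereas <code>aggregate()</code> keeps it numeric.</p>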
<p>The final member of the apply family is the titular <code>apply()</code>, which works with matrices and arrays. In particular, watch out for <code>apply(df, 2, something)</code>, which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
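<p>A short sketch of why this can be dangerous, using a made-up data frame <code>df_demo</code> with mixed column types: <code>apply()</code> converts its input to a matrix first, and a matrix can only hold one type.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_demo &lt;- data.frame(a = 1:2, b = c("x", "y"))

# The data frame becomes a character matrix, so no column looks numeric
apply(df_demo, 2, is.numeric)
#&gt;     a     b 
#&gt; FALSE FALSE

# sapply() (like lapply()) works on the original columns
sapply(df_demo, is.numeric)
#&gt;     a     b 
#&gt;  TRUE FALSE

# apply() is intended for matrices, e.g. column sums of a numeric matrix
m &lt;- matrix(1:6, nrow = 2)
apply(m, 2, sum)
#&gt; [1]  3  7 11</pre>
</div>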
</section>
<section id="for-loops" data-type="sect1">
<h1>
For loops</h1>
<p>For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">for (element in vector) {
# do something with element
}</pre>
</div>
<p>The most straightforward use of <code>for()</code> loops is to achieve the same effect as <code>walk()</code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using <code>walk()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; walk(append_file)</pre>
</div>
<p>We could have used a for loop:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">for (path in paths) {
append_file(path)
}</pre>
</div>
<p>Things get a little trickier if you want to save the output of the for loop, for example reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files &lt;- map(paths, readxl::read_excel)</pre>
</div>
<p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as <code>paths</code>, which we can create with <code>vector()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- vector("list", length(paths))</pre>
</div>
<p>Then instead of iterating over the elements of <code>paths</code>, we’ll iterate over their indices, using <code>seq_along()</code> to generate one index for each element of <code>paths</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">seq_along(paths)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre>
</div>
<p>Using the indices is important because it allows us to link each position in the input with the corresponding position in the output:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">for (i in seq_along(paths)) {
files[[i]] &lt;- readxl::read_excel(paths[[i]])
}</pre>
</div>
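<p>You may also see <code>1:length(x)</code> used to generate indices in the wild. The sketch below, which uses a hypothetical empty vector named <code>empty</code>, shows one reason <code>seq_along()</code> is generally the safer choice:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">empty &lt;- character()

# seq_along() yields no indices for an empty input, so a loop body is skipped
seq_along(empty)
#&gt; integer(0)

# 1:length() counts down from 1 to 0 instead, so a loop would run twice
1:length(empty)
#&gt; [1] 1 0</pre>
</div>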
<p>To combine the list of tibbles into a single tibble you can use <code>do.call()</code> + <code>rbind()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">do.call(rbind, files)
#&gt; # A tibble: 1,704 × 5
#&gt; country continent lifeExp pop gdpPercap
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan Asia 28.8 8425333 779.
#&gt; 2 Albania Europe 55.2 1282697 1601.
#&gt; 3 Algeria Africa 43.1 9279525 2449.
#&gt; 4 Angola Africa 30.0 4232095 3521.
#&gt; 5 Argentina Americas 62.5 17876956 5911.
#&gt; 6 Australia Oceania 69.1 8691212 10040.
#&gt; # … with 1,698 more rows</pre>
</div>
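<p>If <code>do.call()</code> is new to you, the following sketch may help. It calls a function with the elements of a list as its arguments; <code>pieces</code> and its contents are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">pieces &lt;- list(
  data.frame(id = 1:2, value = c("a", "b")),
  data.frame(id = 3L, value = "c")
)

# Each list element becomes a separate argument to rbind(), so this is
# the same as writing rbind(pieces[[1]], pieces[[2]])
combined &lt;- do.call(rbind, pieces)
nrow(combined)
#&gt; [1] 3
identical(combined, rbind(pieces[[1]], pieces[[2]]))
#&gt; [1] TRUE</pre>
</div>
<p>The advantage of <code>do.call()</code> is that it works without knowing in advance how many elements the list has.</p>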
<p>Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">out &lt;- NULL
for (path in paths) {
out &lt;- rbind(out, readxl::read_excel(path))
}</pre>
</div>
<p>We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that <code>for</code> loops are slow: they’re not, but iteratively growing a vector is.</p>
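<p>The same contrast applies to ordinary vectors. The sketch below builds a vector of squares both ways, with an arbitrary small <code>n</code> chosen for illustration; the results are identical, but the first version copies the accumulated values on every iteration, a cost that grows quickly as <code>n</code> increases:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">n &lt;- 5

# Growing: each c() call copies everything accumulated so far
grown &lt;- NULL
for (i in seq_len(n)) {
  grown &lt;- c(grown, i^2)
}

# Preallocating: create the full-length output once, then fill it in
filled &lt;- vector("double", n)
for (i in seq_len(n)) {
  filled[[i]] &lt;- i^2
}

identical(grown, filled)
#&gt; [1] TRUE</pre>
</div>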
</section>
<section id="plots" data-type="sect1">
<h1>
Plots</h1>
<p>Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise: it’s very little typing to do a basic exploratory plot.</p>
<p>There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with <code>plot()</code> and <code>hist()</code> respectively. Here’s a quick example from the diamonds dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">hist(diamonds$carat)
plot(diamonds$carat, diamonds$price)</pre>
<div class="cell-output-display">
<p><img src="base-R_files/figure-html/unnamed-chunk-39-1.png" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="base-R_files/figure-html/unnamed-chunk-39-2.png" width="576"/></p>
</div>
</div>
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
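<p>As a rough sketch of what “some other technique” might look like, the example below uses <code>with()</code> to avoid repeating the data frame name, and adds axis labels through the standard <code>main</code>, <code>xlab</code>, and <code>ylab</code> arguments. It uses the built-in <code>mtcars</code> dataset, an assumption made so that it runs without ggplot2 loaded:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Histogram of a single numeric vector, with a title and axis label
hist(mtcars$mpg, main = "Fuel economy", xlab = "Miles per gallon")

# Scatterplot of two numeric vectors
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# with() lets you refer to the columns by name
with(mtcars, plot(wt, mpg))</pre>
</div>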
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
<p>This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book.</p>
</section>
</section>

@ -0,0 +1,624 @@
<section data-type="chapter" id="chp-communicate-plots">
<h1><span id="sec-graphics-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Graphics for communication</span></span></h1><div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don't recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know, even before looking, which variables the plot will display. You make each plot for a purpose, look at it quickly, and then move on to the next plot. In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.</p>
<p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you'll learn some of the tools that ggplot2 provides to do so.</p>
<p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="#chp-https://www.amazon.com/gp/product/0321934075/" data-type="xref"><em>The Truthful Art</em></a>, by Alberto Cairo. It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including <strong>ggrepel</strong> and <strong>patchwork</strong>. Rather than loading those extensions here, we'll refer to their functions explicitly, using the <code>::</code> notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don't forget you'll need to install those packages with <code><a href="#chp-https://rdrr.io/r/utils/install.packages" data-type="xref">install.packages()</a></code> if you don't already have them.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="label" data-type="sect1">
<h1>
Label</h1>
<p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">labs()</a></code> function. This example adds a plot title:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-3-1.png" width="576"/></p>
</div>
</div>
<p>The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.</p>
<p>If you need to add more text, there are two other useful labels that you can use in ggplot2 2.2.0 and above:</p>
<ul><li><p><code>subtitle</code> adds additional detail in a smaller font beneath the title.</p></li>
<li><p><code>caption</code> adds text at the bottom right of the plot, often used to describe the source of the data.</p></li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-4-1.png" width="576"/></p>
</div>
</div>
<p>You can also use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">labs()</a></code> to replace the axis and legend titles. It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    colour = "Car type"
  )</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-5-1.png" width="576"/></p>
</div>
</div>
<p>It's possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="#chp-https://rdrr.io/r/base/substitute" data-type="xref">quote()</a></code> and read about the available options in <code><a href="#chp-https://rdrr.io/r/grDevices/plotmath" data-type="xref">?plotmath</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
  x = runif(10),
  y = runif(10)
)

ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-6-1.png" style="width:50.0%"/></p>
</div>
</div>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create one plot on the fuel economy data with customized <code>title</code>, <code>subtitle</code>, <code>caption</code>, <code>x</code>, <code>y</code>, and <code>colour</code> labels.</p></li>
<li>
<p>Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-7-1.png" width="576"/></p>
</div>
</div>
</li>
<li><p>Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.</p></li>
</ol></section>
</section>
<section id="annotations" data-type="sect1">
<h1>
Annotations</h1>
<p>In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_text()</a></code>. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_text()</a></code> is similar to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">geom_point()</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p>
<p>There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">best_in_class &lt;- mpg |&gt;
  group_by(class) |&gt;
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-8-1.png" width="576"/></p>
</div>
</div>
<p>This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_label()</a></code>, which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-9-1.png" width="576"/></p>
</div>
</div>
<p>That helps a bit, but if you look closely at the top left-hand corner, you'll notice that there are two labels practically on top of each other. This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same. There's no way that we can fix these by applying the same transformation for every label. Instead, we can use the <strong>ggrepel</strong> package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don't overlap:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-10-1.png" width="576"/></p>
</div>
</div>
<p>Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.</p>
<p>You can sometimes use the same idea to replace the legend with labels placed directly on the plot. It's not wonderful for this plot, but it isn't too bad. (<code>theme(legend.position = "none")</code> turns the legend off; we'll talk about it more shortly.)</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">class_avg &lt;- mpg |&gt;
  group_by(class) |&gt;
  summarise(
    displ = median(displ),
    hwy = median(hwy)
  )

ggplot(mpg, aes(displ, hwy, colour = class)) +
  ggrepel::geom_label_repel(aes(label = class),
    data = class_avg,
    size = 6,
    label.size = 0,
    segment.color = NA
  ) +
  geom_point() +
  theme(legend.position = "none")</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-11-1.png" width="576"/></p>
</div>
</div>
<p>Alternatively, you might just want to add a single label to the plot, but you'll still need to create a data frame. Often, you want the label in the corner of the plot, so it's convenient to create a new data frame using <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarise()</a></code> to compute the maximum values of x and y.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">label_info &lt;- mpg |&gt;
  summarise(
    displ = max(displ),
    hwy = max(hwy),
    label = "Increasing engine size is \nrelated to decreasing fuel economy."
  )

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-12-1.png" width="576"/></p>
</div>
</div>
<p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we're no longer computing the positions from <code>mpg</code>, we can use <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">tibble()</a></code> to create the data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">label_info &lt;- tibble(
  displ = Inf,
  hwy = Inf,
  label = "Increasing engine size is \nrelated to decreasing fuel economy."
)

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-13-1.png" width="576"/></p>
</div>
</div>
<p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_wrap" data-type="xref">str_wrap()</a></code> to automatically add line breaks, given the number of characters you want per line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">"Increasing engine size is related to decreasing fuel economy." |&gt;
  str_wrap(width = 40) |&gt;
  writeLines()
#&gt; Increasing engine size is related to
#&gt; decreasing fuel economy.</pre>
</div>
<p>Note the use of <code>hjust</code> and <code>vjust</code> to control the alignment of the label. <a href="#fig-just" data-type="xref">#fig-just</a> shows all nine possible combinations.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-just"><p><img src="communicate-plots_files/figure-html/fig-just-1.png" style="width:60.0%"/></p>
<figcaption>Figure 28.1: All nine combinations of <code>hjust</code> and <code>vjust</code>.</figcaption>
</figure>
</div>
</div>
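<p>A figure like this can be drawn with a sketch along the following lines. It is not necessarily the code behind the figure above: the names <code>df</code>, <code>hj</code>, and <code>vj</code> are made up for illustration, and it relies on <code>geom_text()</code> accepting the strings <code>"left"</code>, <code>"center"</code>, <code>"right"</code>, <code>"bottom"</code>, and <code>"top"</code> for <code>hjust</code> and <code>vjust</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Numeric positions for each named justification
vjust &lt;- c(bottom = 0, center = 0.5, top = 1)
hjust &lt;- c(left = 0, center = 0.5, right = 1)

# One row per combination of horizontal and vertical justification
df &lt;- crossing(hj = names(hjust), vj = names(vjust)) |&gt;
  mutate(
    x = unname(hjust[hj]),
    y = unname(vjust[vj]),
    label = paste0("hjust = '", hj, "'\n", "vjust = '", vj, "'")
  )

ggplot(df, aes(x, y)) +
  geom_point(colour = "grey70", size = 5) +
  geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4) +
  labs(x = NULL, y = NULL)</pre>
</div>
<p>Each grey point marks the anchor position, so you can see on which side of the anchor the text is drawn for every combination.</p>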
<p>Remember, in addition to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_text()</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A few ideas:</p>
<ul><li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_abline" data-type="xref">geom_hline()</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_abline" data-type="xref">geom_vline()</a></code> to add reference lines. We often make them thick (<code>size = 2</code>) and white (<code>colour = "white"</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">geom_rect()</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_segment" data-type="xref">geom_segment()</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li>
</ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p>
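<p>As a rough sketch, the three ideas above can be combined in one plot. The helper data frames <code>box</code> and <code>arrow_df</code>, the padding around the rectangle, and the arrow coordinates are arbitrary choices made for illustration. Recent ggplot2 releases prefer <code>linewidth</code> over <code>size</code> for line thickness, so <code>size = 2</code> may produce a deprecation warning there.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Bounding box around the two-seater sports cars, with a little padding
box &lt;- mpg |&gt;
  filter(class == "2seater") |&gt;
  summarise(
    xmin = min(displ) - 0.2, xmax = max(displ) + 0.2,
    ymin = min(hwy) - 1, ymax = max(hwy) + 1
  )

# Start and end of an arrow pointing at the box (hand-picked positions)
arrow_df &lt;- tibble(x = 5, y = 35, xend = 6, yend = 27.5)

ggplot(mpg, aes(displ, hwy)) +
  # Reference line first, so it sits underneath the data
  geom_hline(yintercept = median(mpg$hwy), size = 2, colour = "white") +
  geom_point() +
  geom_rect(
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    data = box, inherit.aes = FALSE,
    fill = NA, colour = "red"
  ) +
  geom_segment(
    aes(x = x, y = y, xend = xend, yend = yend),
    data = arrow_df,
    arrow = arrow(length = unit(0.25, "cm"))
  )</pre>
</div>
<p><code>inherit.aes = FALSE</code> is needed for the rectangle layer because <code>box</code> has no <code>displ</code> or <code>hwy</code> columns for the plot-level mapping to use.</p>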
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_text()</a></code> with infinite positions to place text at the four corners of the plot.</p></li>
<li><p>Read the documentation for <code><a href="#chp-https://ggplot2.tidyverse.org/reference/annotate" data-type="xref">annotate()</a></code>. How can you use it to add a text label to a plot without having to create a tibble?</p></li>
<li><p>How do labels with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_text()</a></code> interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)</p></li>
<li><p>What arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">geom_label()</a></code> control the appearance of the background box?</p></li>
<li><p>What are the four arguments to <code><a href="#chp-https://rdrr.io/r/grid/arrow" data-type="xref">arrow()</a></code>? How do they work? Create a series of plots that demonstrate the most important options.</p></li>
</ol></section>
</section>
<section id="scales" data-type="sect1">
<h1>
Scales</h1>
<p>The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive. Normally, ggplot2 automatically adds scales for you. For example, when you type:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))</pre>
</div>
<p>ggplot2 automatically adds default scales behind the scenes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()</pre>
</div>
<p>Note the naming scheme for scales: <code>scale_</code> followed by the name of the aesthetic, then <code>_</code>, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. There are lots of non-default scales which you'll learn about below.</p>
<p>The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:</p>
<ul><li><p>You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.</p></li>
<li><p>You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.</p></li>
</ul>
<section id="axis-ticks-and-legend-keys" data-type="sect2">
<h2>
Axis ticks and legend keys</h2>
<p>There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: <code>breaks</code> and <code>labels</code>. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of <code>breaks</code> is to override the default choice:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-18-1.png" width="576"/></p>
</div>
</div>
<p>You can use <code>labels</code> in the same way (a character vector the same length as <code>breaks</code>), but you can also set it to <code>NULL</code> to suppress the labels altogether. This is useful for maps, or for publishing plots where you can't share the absolute numbers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-19-1.png" width="576"/></p>
</div>
</div>
<p>You can also use <code>breaks</code> and <code>labels</code> to control the appearance of legends. Collectively, axes and legends are called <strong>guides</strong>. Axes are used for the x and y aesthetics; legends are used for everything else.</p>
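<p>For instance, a minimal sketch of relabelling legend keys might look like this. The replacement labels are our own wording for the three values of <code>drv</code> in <code>mpg</code> (<code>"4"</code>, <code>"f"</code>, and <code>"r"</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = drv)) +
  scale_colour_discrete(
    breaks = c("4", "f", "r"),              # which keys appear, and in what order
    labels = c("4-wheel", "front", "rear")  # the text shown for each key
  )</pre>
</div>
<p>Because <code>labels</code> is matched to <code>breaks</code> by position, supplying both makes it explicit which label belongs to which value.</p>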
<p>Another use of <code>breaks</code> is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">presidential |&gt;
  mutate(id = 33 + row_number()) |&gt;
  ggplot(aes(start, id)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  scale_x_date(NULL, breaks = presidential$start, date_labels = "'%y")</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-20-1.png" width="576"/></p>
</div>
</div>
<p>Note that the specification of breaks and labels for date and datetime scales is a little different:</p>
<ul><li><p><code>date_labels</code> takes a format specification, in the same form as <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">parse_datetime()</a></code>.</p></li>
<li><p><code>date_breaks</code> (not shown here) takes a string like “2 days” or “1 month”.</p></li>
</ul></section>
<section id="legend-layout" data-type="sect2">
<h2>
Legend layout</h2>
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p>
<p>To control the overall position of the legend, you need to use a <code><a href="#chp-https://ggplot2.tidyverse.org/reference/theme" data-type="xref">theme()</a></code> setting. We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">base &lt;- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

base + theme(legend.position = "left")
base + theme(legend.position = "top")
base + theme(legend.position = "bottom")
base + theme(legend.position = "right") # the default</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-2.png" width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-3.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-21-4.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p>
<p>To control the display of individual legends, use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/guides" data-type="xref">guides()</a></code> along with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/guide_legend" data-type="xref">guide_legend()</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/guide_colourbar" data-type="xref">guide_colourbar()</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  theme(legend.position = "bottom") +
  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))
#&gt; `geom_smooth()` using method = 'loess' and formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-22-1.png" width="576"/></p>
</div>
</div>
</section>
<section id="replacing-a-scale" data-type="sect2">
<h2>
Replacing a scale</h2>
<p>Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you're most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you've mastered position and colour, you'll be able to quickly pick up other scale replacements.</p>
<p>It's very useful to plot transformations of your variable. For example, as we've seen in <a href="#chp-diamond-prices" data-type="xref">#chp-diamond-prices</a>, it's easier to see the precise relationship between <code>carat</code> and <code>price</code> if we log transform them:</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(diamonds, aes(carat, price)) +
  geom_bin2d()

ggplot(diamonds, aes(log10(carat), log10(price))) +
  geom_bin2d()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-23-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-23-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(diamonds, aes(carat, price)) +
  geom_bin2d() +
  scale_x_log10() +
  scale_y_log10()</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-24-1.png" width="576"/></p>
</div>
</div>
<p>Another scale that is frequently customized is colour. The default categorical scale picks colors that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv))

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  scale_colour_brewer(palette = "Set1")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-25-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-25-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>Don't forget simpler techniques. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv, shape = drv)) +
  scale_colour_brewer(palette = "Set1")</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-26-1.png" width="576"/></p>
</div>
</div>
<p>The ColorBrewer scales are documented online at <a href="https://colorbrewer2.org/" class="uri">https://colorbrewer2.org/</a> and made available in R via the <strong>RColorBrewer</strong> package, by Erich Neuwirth. <a href="#fig-brewer" data-type="xref">#fig-brewer</a> shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you've used <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">cut()</a></code> to make a continuous variable into a categorical variable.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="communicate-plots_files/figure-html/fig-brewer-1.png" width="576"/></p>
<figcaption class="figure-caption">Figure 28.2: All ColorBrewer scales.</figcaption>
</figure>
</div>
</div>
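<p>To illustrate the ordered-category case, here is a small sketch that bins city fuel economy with <code>cut()</code> and colours the bins with the sequential <code>"Blues"</code> palette. The column name <code>cty_group</code> and the break points are arbitrary choices for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

mpg |&gt;
  # Turn the continuous cty variable into four ordered bins
  mutate(cty_group = cut(cty, breaks = c(0, 15, 20, 25, 40))) |&gt;
  ggplot(aes(displ, hwy)) +
  geom_point(aes(colour = cty_group)) +
  scale_colour_brewer(palette = "Blues")</pre>
</div>
<p>Because the bins have a natural order, a palette that runs from light to dark conveys that order in a way the default categorical colours would not.</p>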
<p>When you have a predefined mapping between values and colors, use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_manual" data-type="xref">scale_colour_manual()</a></code>. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">presidential |&gt;
  mutate(id = 33 + row_number()) |&gt;
  ggplot(aes(start, id, colour = party)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  scale_colour_manual(values = c(Republican = "red", Democratic = "blue"))</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-28-1.png" width="576"/></p>
</div>
</div>
<p>For continuous colour, you can use the built-in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_gradient" data-type="xref">scale_colour_gradient()</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_gradient" data-type="xref">scale_fill_gradient()</a></code>. If you have a diverging scale, you can use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_gradient" data-type="xref">scale_colour_gradient2()</a></code>. That allows you to give, for example, positive and negative values different colors. That's sometimes also useful if you want to distinguish points above or below the mean.</p>
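<p>As a sketch of the diverging case, the following colours each car by how far its highway mileage sits above or below the mean. The column name <code>hwy_diff</code> and the three colours are illustrative choices, not recommendations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

mpg |&gt;
  # Positive values are above the mean, negative values below it
  mutate(hwy_diff = hwy - mean(hwy)) |&gt;
  ggplot(aes(displ, hwy)) +
  geom_point(aes(colour = hwy_diff)) +
  scale_colour_gradient2(
    low = "blue", mid = "grey80", high = "red",
    midpoint = 0
  )</pre>
</div>
<p>Setting <code>midpoint = 0</code> pins the neutral colour to the mean, so the two hues diverge from it in either direction.</p>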
<p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
x = rnorm(10000),
y = rnorm(10000)
)
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
labs(title = "Default, continuous")
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
scale_fill_viridis_c() +
labs(title = "Viridis, continuous")
ggplot(df, aes(x, y)) +
geom_hex() +
coord_fixed() +
scale_fill_viridis_b() +
labs(title = "Viridis, binned")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-29-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-29-2.png" width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-29-3.png" width="384"/></p>
</div>
</div>
</div>
</div>
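<p>The plots above cover the continuous and binned variants. For a discrete variable, a comparable sketch (using <code>class</code> from <code>mpg</code> as an arbitrary example) would be:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# class is categorical, so we use the discrete (_d) viridis scale
p_viridis_d &lt;- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  scale_colour_viridis_d() +
  labs(title = "Viridis, discrete")
p_viridis_d</pre>
</div>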
<p>Note that all colour scales come in two varieties: <code>scale_colour_x()</code> and <code>scale_fill_x()</code> for the <code>colour</code> and <code>fill</code> aesthetics respectively (the colour scales are available in both UK and US spellings).</p>
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Why doesn’t the following code override the default scale?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(df, aes(x, y)) +
geom_hex() +
scale_colour_gradient(low = "white", high = "red") +
coord_fixed()</pre>
</div>
</li>
<li><p>What is the first argument to every scale? How does it compare to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/labs</a></code>?</p></li>
<li>
<p>Change the display of the presidential terms by:</p>
<ol type="a"><li>Combining the two variants shown above.</li>
<li>Improving the display of the y axis.</li>
<li>Labelling each term with the name of the president.</li>
<li>Adding informative plot labels.</li>
<li>Placing breaks every 4 years (this is trickier than it seems!).</li>
</ol></li>
<li>
<p>Use <code>override.aes</code> to make the legend on the following plot easier to see.</p>
<div class="cell" data-fig.format="png">
<pre data-type="programlisting" data-code-language="downlit">ggplot(diamonds, aes(carat, price)) +
geom_point(aes(colour = cut), alpha = 1/20)</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-31-1.png" style="width:50.0%"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="zooming" data-type="sect1">
<h1>
Zooming</h1>
<p>There are three ways to control the plot limits:</p>
<ol type="1"><li>Adjusting what data are plotted</li>
<li>Setting the limits in each scale</li>
<li>Setting <code>xlim</code> and <code>ylim</code> in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code>
</li>
</ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code>. Compare the following two plots:</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
mpg |&gt;
filter(displ &gt;= 5, displ &lt;= 7, hwy &gt;= 10, hwy &lt;= 30) |&gt;
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-32-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-32-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>You can also set the <code>limits</code> on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to <em>expand</em> the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">suv &lt;- mpg |&gt; filter(class == "suv")
compact &lt;- mpg |&gt; filter(class == "compact")
ggplot(suv, aes(displ, hwy, colour = drv)) +
geom_point()
ggplot(compact, aes(displ, hwy, colour = drv)) +
geom_point()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-33-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-33-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>One way to overcome this problem is to share scales across multiple plots, training the scales with the <code>limits</code> of the full data.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">x_scale &lt;- scale_x_continuous(limits = range(mpg$displ))
y_scale &lt;- scale_y_continuous(limits = range(mpg$hwy))
col_scale &lt;- scale_colour_discrete(limits = unique(mpg$drv))
ggplot(suv, aes(displ, hwy, colour = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale
ggplot(compact, aes(displ, hwy, colour = drv)) +
geom_point() +
x_scale +
y_scale +
col_scale</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-34-1.png" width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-34-2.png" width="384"/></p>
</div>
</div>
</div>
</div>
<p>In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.</p>
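<p>For comparison, a faceted version of the same two classes might be sketched as follows. By default, <code>facet_wrap()</code> uses the same x, y, and colour scales in every panel, so no manual limits are needed.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Keep both classes in one data frame and let faceting share the scales
p_facets &lt;- mpg |&gt;
  filter(class %in% c("suv", "compact")) |&gt;
  ggplot(aes(displ, hwy, colour = drv)) +
  geom_point() +
  facet_wrap(~class)
p_facets</pre>
</div>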
</section>
<section id="themes" data-type="sect1">
<h1>
Themes</h1>
<p>Finally, you can customize the non-data elements of your plot with a theme:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()</pre>
<div class="cell-output-display">
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-35-1.png" width="576"/></p>
</div>
</div>
<p>ggplot2 includes eight themes by default, as shown in <a href="#fig-themes" data-type="xref">#fig-themes</a>. Many more are included in add-on packages like <strong>ggthemes</strong> (<a href="https://jrnold.github.io/ggthemes" class="uri">https://jrnold.github.io/ggthemes</a>), by Jeffrey Arnold.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="images/visualization-themes.png" alt="Eight barplots created with ggplot2, each with one of the eight built-in themes: theme_bw() - White background with grid lines, theme_light() - Light axes and grid lines, theme_classic() - Classic theme, axes but no grid lines, theme_linedraw() - Only black lines, theme_dark() - Dark background for contrast, theme_minimal() - Minimal theme, no background, theme_gray() - Gray background (default theme), theme_void() - Empty theme, only geoms are visible." width="1600"/></p>
<figcaption class="figure-caption">Figure 28.3: The eight themes built into ggplot2.</figcaption>
</figure>
</div>
</div>
<p>Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.</p>
<p>It’s also possible to control individual components of each theme, like the size and colour of the font used for the y axis. Unfortunately, this level of detail is outside the scope of this book, so you’ll need to read the <a href="#chp-https://ggplot2-book.org/" data-type="xref">ggplot2 book</a> for the full details. You can also create your own themes, if you are trying to match a particular corporate or journal style.</p>
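<p>To give a flavour of what controlling individual components looks like, here is a small sketch that starts from <code>theme_bw()</code> and then adjusts two elements with <code>theme()</code>. The specific sizes, colours, and legend position are arbitrary values chosen for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

p_theme &lt;- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  theme_bw() +
  theme(
    # Larger, dark grey tick labels on the y axis
    axis.text.y = element_text(size = 12, colour = "grey30"),
    # Move the legend below the plot
    legend.position = "bottom"
  )
p_theme</pre>
</div>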
</section>
<section id="sec-ggsave" data-type="sect1">
<h1>
Saving your plots</h1>
<p>There are two main ways to get your plots out of R and into your final write-up: <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> and knitr. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> will save the most recent plot to disk:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
#&gt; Saving 6 x 4 in image</pre>
</div>
<p>If you don’t specify the <code>width</code> and <code>height</code> they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.</p>
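<p>A minimal sketch of the reproducible form is shown below. The 6 x 4 inch size and the object name <code>p</code> are illustrative. Passing the plot explicitly also means the result doesn’t depend on which plot happened to be drawn most recently.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

p &lt;- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Explicit dimensions, so the file is the same on every machine
ggsave("my-plot.pdf", plot = p, width = 6, height = 4, units = "in")</pre>
</div>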
<p>Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics. You can learn more about <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> in the documentation.</p>
</section>
<section id="learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>The absolute best place to learn more is the ggplot2 book: <a href="#chp-https://ggplot2-book.org/" data-type="xref">#chp-https://ggplot2-book.org/</a>. It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.</p>
<p>Another great resource is the ggplot2 extensions gallery <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a>. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.</p>
</section>
</section>

12
oreilly/communicate.html Normal file

@ -0,0 +1,12 @@
<div data-type="part">
<h1><span id="sec-communicate-intro" class="quarto-section-identifier d-none d-lg-block">Communicate</span></h1><p>So far, you’ve learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualization. However, it doesn’t matter how great your analysis is unless you can explain it to others: you need to <strong>communicate</strong> your results.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-communicate"><p><img src="diagrams/data-science/communicate.png" alt="A diagram displaying the data science cycle with visualize and communicate highlighted in blue. " width="535"/></p>
<figcaption>Figure 1: Communication is the final part of the data science process; if you can’t communicate your results to other humans, it doesn’t matter how great your analysis is.</figcaption>
</figure>
</div>
</div><p>Communication is the theme of the following three chapters:</p><ul><li><p>In <a href="#chp-quarto" data-type="xref">#chp-quarto</a>, you will learn about Quarto, a tool for integrating prose, code, and results. You can use Quarto for analyst-to-analyst communication as well as analyst-to-decision-maker communication. Thanks to the power of Quarto formats, you can even use the same document for both purposes.</p></li>
<li><p>In <a href="#chp-quarto-formats" data-type="xref">#chp-quarto-formats</a>, you’ll learn a little about the many other varieties of outputs you can produce using Quarto, including dashboards, websites, and books.</p></li>
<li><p>We’ll finish up with <a href="#chp-quarto-workflow" data-type="xref">#chp-quarto-workflow</a>, where you’ll learn about the “analysis notebook” and how to systematically record your successes and failures so that you can learn from them.</p></li>
</ul><p>These chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans. However, there are a lot of other great books about communication, which we’ll point you to at the end of each chapter.</p></div>

BIN
oreilly/cover.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 334 KiB

602
oreilly/data-import.html Normal file
View File

@ -0,0 +1,602 @@
<section data-type="chapter" id="chp-data-import">
<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you’ll learn how to read plain-text rectangular files into R.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, you’ll learn how to load flat files in R with the <strong>readr</strong> package, which is part of the core tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="reading-data-from-a-file" data-type="sect1">
<h1>
Reading data from a file</h1>
<p>To begin, we’ll focus on the most common rectangular data file type: the CSV, short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows give the data.</p>
<div class="cell">
<pre><code>#&gt; Student ID,Full Name,favourite.food,mealPlan,AGE
#&gt; 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
#&gt; 2,Barclay Lynn,French fries,Lunch only,5
#&gt; 3,Jayendra Lyne,N/A,Breakfast and lunch,7
#&gt; 4,Leon Rossini,Anchovies,Lunch only,
#&gt; 5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
#&gt; 6,Güvenç Attila,Ice cream,Lunch only,6</code></pre>
</div>
<p><a href="#tbl-students-table" data-type="xref">#tbl-students-table</a> shows a representation of the same data as a table.</p>
<div class="cell">
<div class="cell-output-display">
<div id="tbl-students-table" class="anchored">
<table class="table table-sm table-striped"><caption>Table 8.1: Data from the students.csv file as a table.</caption>
<colgroup><col style="width: 15%"/><col style="width: 23%"/><col style="width: 26%"/><col style="width: 27%"/><col style="width: 6%"/></colgroup><thead><tr class="header"><th style="text-align: right;">Student ID</th>
<th style="text-align: left;">Full Name</th>
<th style="text-align: left;">favourite.food</th>
<th style="text-align: left;">mealPlan</th>
<th style="text-align: left;">AGE</th>
</tr></thead><tbody><tr class="odd"><td style="text-align: right;">1</td>
<td style="text-align: left;">Sunil Huffmann</td>
<td style="text-align: left;">Strawberry yoghurt</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">4</td>
</tr><tr class="even"><td style="text-align: right;">2</td>
<td style="text-align: left;">Barclay Lynn</td>
<td style="text-align: left;">French fries</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">5</td>
</tr><tr class="odd"><td style="text-align: right;">3</td>
<td style="text-align: left;">Jayendra Lyne</td>
<td style="text-align: left;">N/A</td>
<td style="text-align: left;">Breakfast and lunch</td>
<td style="text-align: left;">7</td>
</tr><tr class="even"><td style="text-align: right;">4</td>
<td style="text-align: left;">Leon Rossini</td>
<td style="text-align: left;">Anchovies</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">NA</td>
</tr><tr class="odd"><td style="text-align: right;">5</td>
<td style="text-align: left;">Chidiegwu Dunkel</td>
<td style="text-align: left;">Pizza</td>
<td style="text-align: left;">Breakfast and lunch</td>
<td style="text-align: left;">five</td>
</tr><tr class="even"><td style="text-align: right;">6</td>
<td style="text-align: left;">Güvenç Attila</td>
<td style="text-align: left;">Ice cream</td>
<td style="text-align: left;">Lunch only</td>
<td style="text-align: left;">6</td>
</tr></tbody></table></div>
</div>
</div>
<p>We can read this file into R using <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>. The first argument is the most important: it’s the path to the file.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_csv("data/students.csv")
#&gt; Rows: 6 Columns: 5
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (4): Full Name, favourite.food, mealPlan, AGE
#&gt; dbl (1): Student ID
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
<p>When you run <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr and we’ll come back to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
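<p>The two hints in that message can be tried out directly. The sketch below uses a tiny made-up CSV supplied as a literal string (a stand-in for a real file, with hypothetical column names) so that it runs on its own: <code>show_col_types = FALSE</code> silences the message, and <code>spec()</code> retrieves the column specification afterwards.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# Hypothetical data; a string containing newlines is read as literal CSV
students_csv &lt;- "id,name,age
1,Sunil,4
2,Barclay,five"

# No column specification message is printed
quiet &lt;- read_csv(students_csv, show_col_types = FALSE)

# Inspect the types that readr guessed for each column
spec(quiet)</pre>
</div>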
<section id="practical-advice" data-type="sect2">
<h2>
Practical advice</h2>
<p>Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the <code>students</code> data with that in mind.</p>
<p>In the <code>favourite.food</code> column, there are a bunch of food items and then the character string <code>N/A</code>, which should have been a real <code>NA</code> that R will recognize as “not available”. This is something we can address using the <code>na</code> argument.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_csv("data/students.csv", na = c("N/A", ""))
students
#&gt; # A tibble: 6 × 5
#&gt; `Student ID` `Full Name` favourite.food mealPlan AGE
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>You might also notice that the <code>Student ID</code> and <code>Full Name</code> columns are surrounded by back ticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those back ticks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students |&gt;
rename(
student_id = `Student ID`,
full_name = `Full Name`
)
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite.food mealPlan AGE
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>An alternative approach is to use <code><a href="#chp-https://rdrr.io/pkg/janitor/man/clean_names" data-type="xref">#chp-https://rdrr.io/pkg/janitor/man/clean_names</a></code>, which applies some heuristics to turn them all into snake case at once<span data-type="footnote">The <a href="#chp-http://sfirke.github.io/janitor/" data-type="xref">#chp-http://sfirke.github.io/janitor/</a> package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use <code>|&gt;</code>.</span>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students |&gt; janitor::clean_names()
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>Another common task after reading in data is to consider variable types. For example, <code>meal_plan</code> is a categorical variable with a known set of possible values, which in R should be represented as a factor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students |&gt;
janitor::clean_names() |&gt;
mutate(
meal_plan = factor(meal_plan)
)
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>Note that the values in the <code>meal_plan</code> variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<code>&lt;chr&gt;</code>) to factor (<code>&lt;fct&gt;</code>). You’ll learn more about factors in <a href="#chp-factors" data-type="xref">#chp-factors</a>.</p>
<p>Before you move on to analyzing these data, you’ll probably want to fix the <code>age</code> column as well: currently it’s a character variable because of the one observation that is typed out as <code>five</code> instead of a numeric <code>5</code>. We discuss the details of fixing this issue in <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- students |&gt;
janitor::clean_names() |&gt;
mutate(
meal_plan = factor(meal_plan),
age = parse_number(if_else(age == "five", "5", age))
)
students
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</section>
<section id="other-arguments" data-type="sect2">
<h2>
Other arguments</h2>
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> can read csv files that you’ve created in a string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv(
"a,b,c
1,2,3
4,5,6"
)
#&gt; # A tibble: 2 × 3
#&gt; a b c
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>Usually <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> uses the first line of the data for the column names, which is a very common convention. But sometimes there are a few lines of metadata at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv(
"The first line of metadata
The second line of metadata
x,y,z
1,2,3",
skip = 2
)
#&gt; # A tibble: 1 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
read_csv(
"# A comment I want to skip
x,y,z
1,2,3",
comment = "#"
)
#&gt; # A tibble: 1 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3</pre>
</div>
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> not to treat the first row as headings, and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv(
"1,2,3
4,5,6",
col_names = FALSE
)
#&gt; # A tibble: 2 × 3
#&gt; X1 X2 X3
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>Alternatively you can pass <code>col_names</code> a character vector which will be used as the column names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv(
"1,2,3
4,5,6",
col_names = c("x", "y", "z")
)
#&gt; # A tibble: 2 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3
#&gt; 2 4 5 6</pre>
</div>
<p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and carefully read the documentation for <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>’s many other arguments.)</p>
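<p>These arguments can also be combined in a single call. As a sketch, the made-up file below has one line of metadata, no header row, and <code>N/A</code> for a missing value. The contents and the column names <code>x</code>, <code>y</code>, and <code>z</code> are invented for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

messy &lt;- read_csv(
  "Exported on 2022-11-18
1,2,N/A
4,5,6",
  skip = 1,                      # drop the metadata line
  col_names = c("x", "y", "z"),  # supply names, since there is no header
  na = "N/A"                     # treat the string N/A as missing
)
messy</pre>
</div>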
</section>
<section id="other-file-types" data-type="sect2">
<h2>
Other file types</h2>
<p>Once you’ve mastered <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_csv()</a></code>, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:</p>
<ul><li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_csv2()</a></code> reads semicolon-separated files. These use <code>;</code> instead of <code>,</code> to separate fields, and are common in countries that use <code>,</code> as the decimal marker.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_tsv()</a></code> reads tab-delimited files.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">read_fwf()</a></code> reads fixed-width files. You can specify fields either by their widths with <code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">fwf_widths()</a></code> or by their positions with <code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">fwf_positions()</a></code>.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_table" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_table</a></code> reads a common variation of fixed width files where columns are separated by white space.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_log" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_log</a></code> reads Apache style log files.</p></li>
</ul></section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What function would you use to read a file where fields were separated with “|”?</p></li>
<li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_csv()</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_tsv()</a></code> have in common?</p></li>
<li><p>What are the most important arguments to <code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_fwf</a></code>?</p></li>
<li>
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> assumes that the quoting character will be <code>"</code>. What argument to <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> do you need to specify to read the following text into a data frame?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">"x,y\n1,'a,b'"</pre>
</div>
</li>
<li>
<p>Identify what is wrong with each of the following inline CSV files. What happens when you run the code?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv("a,b\n1,2,3\n4,5,6")
read_csv("a,b,c\n1,2\n1,2,3,4")
read_csv("a,b\n\"1")
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")</pre>
</div>
</li>
<li>
<p>Practice referring to non-syntactic names in the following data frame by:</p>
<ol type="a"><li>Extracting the variable called <code>1</code>.</li>
<li>Plotting a scatterplot of <code>1</code> vs <code>2</code>.</li>
<li>Creating a new column called <code>3</code> which is <code>2</code> divided by <code>1</code>.</li>
<li>Renaming the columns to <code>one</code>, <code>two</code> and <code>three</code>.</li>
</ol><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">annoying &lt;- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)</pre>
</div>
</li>
</ol></section>
</section>
<section id="sec-col-types" data-type="sect1">
<h1>
Controlling column types</h1>
<p>A CSV file doesn’t contain any information about the type of each variable (i.e., whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a couple of general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.</p>
<section id="guessing-types" data-type="sect2">
<h2>
Guessing types</h2>
<p>readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000<span data-type="footnote">You can override the default of 1,000 with the <code>guess_max</code> argument.</span> rows spaced evenly from the first row to the last, ignoring any missing values. It then works through the following questions:</p>
<ul><li>Does it contain only <code>F</code>, <code>T</code>, <code>FALSE</code>, or <code>TRUE</code> (ignoring case)? If so, it’s a logical.</li>
<li>Does it contain only numbers (e.g. <code>1</code>, <code>-4.5</code>, <code>5e6</code>, <code>Inf</code>)? If so, it’s a number.</li>
<li>Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll come back to date-times in more detail in <a href="#sec-creating-datetimes" data-type="xref">#sec-creating-datetimes</a>.)</li>
<li>Otherwise, it must be a string.</li>
</ul><p>You can see that behavior in action in this simple example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv("
logical,numeric,date,string
TRUE,1,2021-01-15,abc
false,4.5,2021-02-15,def
T,Inf,2021-02-16,ghi"
)
#&gt; Rows: 3 Columns: 4
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): string
#&gt; dbl (1): numeric
#&gt; lgl (1): logical
#&gt; date (1): date
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.
#&gt; # A tibble: 3 × 4
#&gt; logical numeric date string
#&gt; &lt;lgl&gt; &lt;dbl&gt; &lt;date&gt; &lt;chr&gt;
#&gt; 1 TRUE 1 2021-01-15 abc
#&gt; 2 FALSE 4.5 2021-02-15 def
#&gt; 3 TRUE Inf 2021-02-16 ghi</pre>
</div>
<p>This heuristic works well if you have a clean dataset, but in real life you’ll encounter a selection of weird and wonderful failures.</p>
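<p>As a minimal sketch of one such failure (the inline data and column names here are made up for illustration), a single date written in a non-ISO8601 style is enough to make readr fall back to a character column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical data: the second date is day/month/year, not ISO8601
df &lt;- read_csv(
  "id,visited
1,2021-01-15
2,15/01/2021",
  show_col_types = FALSE
)

# `id` is still guessed as a number, but `visited` is read as a string
class(df$id)
class(df$visited)</pre>
</div>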
</section>
<section id="missing-values-column-types-and-problems" data-type="sect2">
<h2>
Missing values, column types, and problems</h2>
<p>The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type. One of the most common causes for this is a missing value recorded using something other than the <code>NA</code> that readr expects.</p>
<p>Take this simple one-column CSV file as an example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
x
10
.
20
30"</pre>
</div>
<p>If we read it without any additional arguments, <code>x</code> becomes a character column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- read_csv(csv)
#&gt; Rows: 4 Columns: 1
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): x
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
<p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled amongst them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- read_csv(csv, col_types = list(x = col_double()))
#&gt; Warning: One or more parsing issues, call `problems()` on your data frame for details,
#&gt; e.g.:
#&gt; dat &lt;- vroom(...)
#&gt; problems(dat)</pre>
</div>
<p>Now <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_csv()</a></code> reports that there was a problem, and tells us we can find out more with <code><a href="#chp-https://readr.tidyverse.org/reference/problems" data-type="xref">problems()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">problems(df)
#&gt; # A tibble: 1 × 5
#&gt; row col expected actual file
#&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 3 1 a double . /private/tmp/Rtmp43JYhG/file7cf337a06034</pre>
</div>
<p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So when we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- read_csv(csv, na = ".")
#&gt; Rows: 4 Columns: 1
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; dbl (1): x
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div>
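<p>One caveat worth knowing: supplying <code>na</code> replaces readr’s default set of missing-value strings (<code>""</code> and <code>"NA"</code>) rather than adding to it. If a file might mix conventions, one option is to list every marker explicitly. The following is a sketch with invented data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical file that mixes ".", "NA", and empty fields for missing values
csv &lt;- "x,y
10,a
.,b
NA,
30,d"

df &lt;- read_csv(csv, na = c("", "NA", "."), show_col_types = FALSE)

# `x` is guessed as a double with two missing values; `y` has one
df</pre>
</div>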
</section>
<section id="column-types" data-type="sect2">
<h2>
Column types</h2>
<p>readr provides a total of nine column types for you to use:</p>
<ul><li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">col_logical()</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">col_double()</a></code> read logicals and real numbers. They’re relatively rarely needed (except as above), since readr will usually guess them for you.</li>
<li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">col_integer()</a></code> reads integers. We seldom distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
<li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">col_character()</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e., a long series of digits that identifies some object, but that it doesn’t make sense to (e.g.) divide in half.</li>
<li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_factor" data-type="xref">col_factor()</a></code>, <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">col_date()</a></code>, and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">col_datetime()</a></code> create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
<li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">col_number()</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li>
<li>
<code><a href="#chp-https://readr.tidyverse.org/reference/col_skip" data-type="xref">col_skip()</a></code> skips a column so it’s not included in the result.</li>
</ul><p>It’s also possible to override the default column type by switching from <code><a href="#chp-https://rdrr.io/r/base/list" data-type="xref">list()</a></code> to <code><a href="#chp-https://readr.tidyverse.org/reference/cols" data-type="xref">cols()</a></code> and specifying <code>.default</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
x,y,z
1,2,3"
read_csv(csv, col_types = cols(.default = col_character()))
#&gt; # A tibble: 1 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 2 3</pre>
</div>
<p>Another useful helper is <code><a href="#chp-https://readr.tidyverse.org/reference/cols" data-type="xref">cols_only()</a></code>, which will read in only the columns you specify:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv(
"x,y,z
1,2,3",
col_types = cols_only(x = col_character())
)
#&gt; # A tibble: 1 × 1
#&gt; x
#&gt; &lt;chr&gt;
#&gt; 1 1</pre>
</div>
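<p>Several of these column types can be combined in a single specification. The following is a sketch with invented data, not an example taken from readr’s documentation: it keeps a zero-padded identifier as a string, parses a currency amount with <code>col_number()</code>, and drops a column with <code>col_skip()</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical order data: `id` has leading zeros, `price` is formatted currency
csv &lt;- 'id,price,note
00123,"$1,250.50",internal
00456,"$80",internal'

orders &lt;- read_csv(
  csv,
  col_types = cols(
    id = col_character(),   # keep leading zeros
    price = col_number(),   # ignore "$" and the grouping mark
    note = col_skip()       # leave this column out of the result
  )
)
orders</pre>
</div>
<p>Guessing would have turned <code>id</code> into a number and lost the leading zeros, which is why identifiers like this are usually worth declaring as strings.</p>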
</section>
</section>
<section id="sec-readr-directory" data-type="sect1">
<h1>
Reading data from multiple files</h1>
<p>Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: <code>01-sales.csv</code> for January, <code>02-sales.csv</code> for February, and <code>03-sales.csv</code> for March. With <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_csv()</a></code> you can read these data in at once and stack them on top of each other in a single data frame.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
#&gt; Rows: 19 Columns: 6
#&gt; ── Column specification ────────────────────────────────────────────────────────
#&gt; Delimiter: ","
#&gt; chr (1): month
#&gt; dbl (4): year, brand, item, n
#&gt;
#&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.
#&gt; # A tibble: 19 × 6
#&gt; file month year brand item n
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 data/01-sales.csv January 2019 1 1234 3
#&gt; 2 data/01-sales.csv January 2019 1 8721 9
#&gt; 3 data/01-sales.csv January 2019 1 1822 2
#&gt; 4 data/01-sales.csv January 2019 2 3333 1
#&gt; 5 data/01-sales.csv January 2019 2 2156 9
#&gt; 6 data/01-sales.csv January 2019 2 3987 6
#&gt; # … with 13 more rows</pre>
</div>
<p>With the additional <code>id</code> parameter we have added a new column called <code>file</code> to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.</p>
<p>If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base <code><a href="#chp-https://rdrr.io/r/base/list.files" data-type="xref">list.files()</a></code> function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files
#&gt; [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"</pre>
</div>
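<p>Putting the two steps together, the following self-contained sketch writes a couple of small CSV files to a temporary directory (the file contents are made up for illustration), finds them with <code>list.files()</code>, and reads them in a single call:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Create a throwaway directory holding two hypothetical monthly files
dir &lt;- tempfile("sales-")
dir.create(dir)
write_csv(data.frame(month = "January", n = c(3, 9)), file.path(dir, "01-sales.csv"))
write_csv(data.frame(month = "February", n = 5), file.path(dir, "02-sales.csv"))

# Find every file whose name ends in "sales.csv", then read and stack them
sales_files &lt;- list.files(dir, pattern = "sales\\.csv$", full.names = TRUE)
sales &lt;- read_csv(sales_files, id = "file", show_col_types = FALSE)
sales</pre>
</div>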
</section>
<section id="sec-writing-to-a-file" data-type="sect1">
<h1>
Writing to a file</h1>
<p>readr also comes with two useful functions for writing data back to disk: <code><a href="#chp-https://readr.tidyverse.org/reference/write_delim" data-type="xref">write_csv()</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/write_delim" data-type="xref">write_tsv()</a></code>. Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.</p>
<p>The most important arguments are <code>x</code> (the data frame to save), and <code>file</code> (the location to save it). You can also specify how missing values are written with <code>na</code>, and if you want to <code>append</code> to an existing file.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">write_csv(students, "students.csv")</pre>
</div>
<p>Now let’s read that CSV file back in. Note that the type information is lost when you save to CSV:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load the data. There are two main alternatives:</p>
<ol type="1"><li>
<p><code><a href="#chp-https://readr.tidyverse.org/reference/read_rds" data-type="xref">write_rds()</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/read_rds" data-type="xref">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="#chp-https://rdrr.io/r/base/readRDS" data-type="xref">readRDS()</a></code> and <code><a href="#chp-https://rdrr.io/r/base/readRDS" data-type="xref">saveRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">write_rds(students, "students.rds")
read_rds("students.rds")
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
<li>
<p>The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(arrow)
write_parquet(students, "students.parquet")
read_parquet("students.parquet")
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne NA Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
</ol><p>Parquet tends to be much faster than RDS and is usable outside of R, but it does require you to install the arrow package.</p>
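<p>If you need to stay with CSV, recreating the column specification on load might look like the sketch below. The two-row <code>students_small</code> tibble is a hypothetical stand-in for <code>students</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)
library(tibble)

# Hypothetical stand-in for `students`, with an integer and a factor column
students_small &lt;- tibble(
  student_id = 1:2,
  meal_plan = factor(c("Lunch only", "Breakfast and lunch"))
)

path &lt;- tempfile(fileext = ".csv")
write_csv(students_small, path)

# Supplying the types restores the integer and factor columns on the way back in
restored &lt;- read_csv(
  path,
  col_types = cols(
    student_id = col_integer(),
    meal_plan = col_factor()
  )
)
restored</pre>
</div>
<p>Note that <code>col_factor()</code> without a <code>levels</code> argument takes the levels in order of appearance, so the level order may differ from the original factor.</p>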
</section>
<section id="data-entry" data-type="sect1">
<h1>
Data entry</h1>
<p>Sometimes you’ll need to assemble a tibble “by hand” by doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">tibble()</a></code> works by column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tibble(
x = c(1, 2, 5),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.60)
)
#&gt; # A tibble: 3 × 3
#&gt; x y z
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 1 h 0.08
#&gt; 2 2 m 0.83
#&gt; 3 5 g 0.6</pre>
</div>
<p>Note that every column in a tibble must be the same size, so you’ll get an error if they’re not:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tibble(
x = c(1, 2),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.6)
)
#&gt; Error:
#&gt; ! Tibble columns must have compatible sizes.
#&gt; • Size 2: Existing data.
#&gt; • Size 3: Column `y`.
#&gt; Only values of size one are recycled.</pre>
</div>
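<p>The last line of that error points to the one exception: a value of length one is recycled to match the other columns. A minimal sketch (the <code>group</code> column is invented for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tibble)

# `group` has length one, so it is repeated for every row
df &lt;- tibble(
  x = c(1, 2, 5),
  group = "a"
)
df</pre>
</div>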
<p>Laying out the data by column can make it hard to see how the rows are related, so an alternative is <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">tribble()</a></code>, short for <strong>tr</strong>ansposed t<strong>ibble</strong>, which lets you lay out your data row by row. <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">tribble()</a></code> is customized for data entry in code: column headings start with <code>~</code> and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tribble(
~x, ~y, ~z,
"h", 1, 0.08,
"m", 2, 0.83,
"g", 5, 0.60,
)
#&gt; # A tibble: 3 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 h 1 0.08
#&gt; 2 m 2 0.83
#&gt; 3 g 5 0.6</pre>
</div>
<p>We’ll use <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">tibble()</a></code> and <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">tribble()</a></code> later in the book to construct small examples to demonstrate how various functions work.</p>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">read_csv()</a></code> and to do your own data entry with <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">tibble()</a></code> and <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> from Excel and Google Sheets, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> from JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> from websites.</p>
<p>Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p>
</section>
</section>

867
oreilly/data-tidy.html Normal file

@ -0,0 +1,867 @@
<section data-type="chapter" id="chp-data-tidy">
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proofreading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
<p>“Happy families are all alike; every unhappy family is unhappy in its own way.”<br/>
— Leo Tolstoy</p>
</blockquote>
<blockquote class="blockquote">
<p>“Tidy datasets are all alike, but every messy dataset is messy in its own way.”<br/>
— Hadley Wickham</p>
</blockquote>
<p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the main tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish up with a discussion of usefully untidy data, and how you can create it if needed.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
<p>From this chapter on, we’ll suppress the loading message from <code><a href="#chp-https://tidyverse.tidyverse" data-type="xref">library(tidyverse)</a></code>.</p>
</section>
</section>
<section id="sec-tidy-data" data-type="sect1">
<h1>
Tidy data</h1>
<p>You can represent the same underlying data in multiple ways. The example below shows the same data organized in four different ways. Each dataset shows the same values of four variables: <em>country</em>, <em>year</em>, <em>population</em>, and <em>cases</em> of TB (tuberculosis), but each dataset organizes the values in a different way.</p>
<!-- TODO redraw as tables -->
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">table1
#&gt; # A tibble: 6 × 4
#&gt; country year cases population
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Afghanistan 1999 745 19987071
#&gt; 2 Afghanistan 2000 2666 20595360
#&gt; 3 Brazil 1999 37737 172006362
#&gt; 4 Brazil 2000 80488 174504898
#&gt; 5 China 1999 212258 1272915272
#&gt; 6 China 2000 213766 1280428583
table2
#&gt; # A tibble: 12 × 4
#&gt; country year type count
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 Afghanistan 1999 cases 745
#&gt; 2 Afghanistan 1999 population 19987071
#&gt; 3 Afghanistan 2000 cases 2666
#&gt; 4 Afghanistan 2000 population 20595360
#&gt; 5 Brazil 1999 cases 37737
#&gt; 6 Brazil 1999 population 172006362
#&gt; # … with 6 more rows
table3
#&gt; # A tibble: 6 × 3
#&gt; country year rate
#&gt; * &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 Afghanistan 1999 745/19987071
#&gt; 2 Afghanistan 2000 2666/20595360
#&gt; 3 Brazil 1999 37737/172006362
#&gt; 4 Brazil 2000 80488/174504898
#&gt; 5 China 1999 212258/1272915272
#&gt; 6 China 2000 213766/1280428583
# Spread across two tibbles
table4a # cases
#&gt; # A tibble: 3 × 3
#&gt; country `1999` `2000`
#&gt; * &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Afghanistan 745 2666
#&gt; 2 Brazil 37737 80488
#&gt; 3 China 212258 213766
table4b # population
#&gt; # A tibble: 3 × 3
#&gt; country `1999` `2000`
#&gt; * &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Afghanistan 19987071 20595360
#&gt; 2 Brazil 172006362 174504898
#&gt; 3 China 1272915272 1280428583</pre>
</div>
<p>These are all representations of the same underlying data, but they are not equally easy to use. One of them, <code>table1</code>, will be much easier to work with inside the tidyverse because it’s tidy.</p>
<p>There are three interrelated rules that make a dataset tidy:</p>
<ol type="1"><li>Each variable is a column; each column is a variable.</li>
<li>Each observation is a row; each row is an observation.</li>
<li>Each value is a cell; each cell is a single value.</li>
</ol><p><a href="#fig-tidy-structure" data-type="xref">#fig-tidy-structure</a> shows the rules visually.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-pivot-names-and-values"><p><img src="images/tidy-1.png" alt="Three panels, each representing a tidy data frame. The first panel shows that each variable is a column. The second panel shows that each observation is a row. The third panel shows that each value is a cell." width="683"/></p>
<figcaption>Figure 6.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.</figcaption>
</figure>
</div>
</div>
<p>Why ensure that your data is tidy? There are two main advantages:</p>
<ol type="1"><li><p>There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.</p></li>
<li><p>There’s a specific advantage to placing variables in columns because it allows R’s vectorized nature to shine. As you learned in <a href="#sec-mutate" data-type="xref">#sec-mutate</a> and <a href="#sec-summarize" data-type="xref">#sec-summarize</a>, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.</p></li>
</ol><p>dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a couple of small examples showing how you might work with <code>table1</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Compute rate per 10,000
table1 |&gt;
mutate(
rate = cases / population * 10000
)
#&gt; # A tibble: 6 × 5
#&gt; country year cases population rate
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1999 745 19987071 0.373
#&gt; 2 Afghanistan 2000 2666 20595360 1.29
#&gt; 3 Brazil 1999 37737 172006362 2.19
#&gt; 4 Brazil 2000 80488 174504898 4.61
#&gt; 5 China 1999 212258 1272915272 1.67
#&gt; 6 China 2000 213766 1280428583 1.67
# Compute cases per year
table1 |&gt;
count(year, wt = cases)
#&gt; # A tibble: 2 × 2
#&gt; year n
#&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 1999 250740
#&gt; 2 2000 296920
# Visualise changes over time
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), color = "grey50") +
geom_point(aes(color = country, shape = country)) +
scale_x_continuous(breaks = c(1999, 2000))</pre>
<div class="cell-output-display">
<p><img src="data-tidy_files/figure-html/unnamed-chunk-5-1.png" alt="This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale." width="480"/></p>
</div>
</div>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Using prose, describe how the variables and observations are organized in each of the sample tables.</p></li>
<li>
<p>Sketch out the process you’d use to calculate the <code>rate</code> for <code>table2</code> and <code>table4a</code> + <code>table4b</code>. You will need to perform four operations:</p>
<ol type="a"><li>Extract the number of TB cases per country per year.</li>
<li>Extract the matching population per country per year.</li>
<li>Divide cases by population, and multiply by 10000.</li>
<li>Store back in the appropriate place.</li>
</ol><p>You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.</p>
</li>
<li><p>Recreate the plot showing change in cases over time using <code>table2</code> instead of <code>table1</code>. What do you need to do first?</p></li>
</ol></section>
</section>
<section id="sec-pivoting" data-type="sect1">
<h1>
Pivoting</h1>
<p>The principles of tidy data might seem so obvious that you wonder if youll ever encounter a dataset that isnt tidy. Unfortunately, however, most real data is untidy. There are two main reasons:</p>
<ol type="1"><li><p>Data is often organised to facilitate some goal other than analysis. For example, its common for data to be structured to make data entry, not analysis, easy.</p></li>
<li><p>Most people arent familiar with the principles of tidy data, and its hard to derive them yourself unless you spend a lot of time working with data.</p></li>
</ol><p>This means that most real analyses will require at least a little tidying. Youll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times youll need to consult with the people who originally generated the data. Next, youll <strong>pivot</strong> your data into a tidy form, with variables in the columns and observations in the rows.</p>
<p>tidyr provides two functions for pivoting data: <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>, which makes datasets <strong>longer</strong> by increasing rows and reducing columns, and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which makes datasets <strong>wider</strong> by increasing columns and reducing rows. The following sections work through the use of <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to tackle a wide range of realistic datasets. These examples are drawn from <code><a href="https://tidyr.tidyverse.org/articles/pivot.html">vignette("pivot", package = "tidyr")</a></code>, which you should check out if you want to see more variations and more challenging problems.</p>
<p>Let’s dive in.</p>
<section id="sec-billboard" data-type="sect2">
<h2>
Data in column names</h2>
<p>The <code>billboard</code> dataset records the billboard rank of songs in the year 2000:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard
#&gt; # A tibble: 317 × 79
#&gt; artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA
#&gt; 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA
#&gt; 3 3 Door… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51
#&gt; 4 3 Door… Loser 2000-10-21 76 76 72 69 67 65 55 59 62
#&gt; 5 504 Bo… Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53
#&gt; 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3
#&gt; # … with 311 more rows, 67 more variables: wk10 &lt;dbl&gt;, wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;,
#&gt; # wk13 &lt;dbl&gt;, wk14 &lt;dbl&gt;, wk15 &lt;dbl&gt;, wk16 &lt;dbl&gt;, wk17 &lt;dbl&gt;, wk18 &lt;dbl&gt;,
#&gt; # wk19 &lt;dbl&gt;, wk20 &lt;dbl&gt;, wk21 &lt;dbl&gt;, wk22 &lt;dbl&gt;, wk23 &lt;dbl&gt;, wk24 &lt;dbl&gt;,
#&gt; # wk25 &lt;dbl&gt;, wk26 &lt;dbl&gt;, wk27 &lt;dbl&gt;, wk28 &lt;dbl&gt;, wk29 &lt;dbl&gt;, wk30 &lt;dbl&gt;,
#&gt; # wk31 &lt;dbl&gt;, wk32 &lt;dbl&gt;, wk33 &lt;dbl&gt;, wk34 &lt;dbl&gt;, wk35 &lt;dbl&gt;, wk36 &lt;dbl&gt;,
#&gt; # wk37 &lt;dbl&gt;, wk38 &lt;dbl&gt;, wk39 &lt;dbl&gt;, wk40 &lt;dbl&gt;, wk41 &lt;dbl&gt;, wk42 &lt;dbl&gt;,
#&gt; # wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, wk46 &lt;dbl&gt;, wk47 &lt;dbl&gt;, wk48 &lt;dbl&gt;, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
<ul><li>
<code>cols</code> specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> so here we could use <code>!c(artist, track, date.entered)</code> or <code>starts_with("wk")</code>.</li>
<li>
<code>names_to</code> names the variable stored in the column names, here <code>"week"</code>.</li>
<li>
<code>values_to</code> names the variable stored in the cell values, here <code>"rank"</code>.</li>
</ul><p>That gives the following call:</p>
<div class="cell" data-r.options="{&quot;pillar.print_min&quot;:10}">
<pre data-type="programlisting" data-code-language="downlit">billboard |&gt;
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank"
)
#&gt; # A tibble: 24,092 × 5
#&gt; artist track date.entered week rank
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#&gt; 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#&gt; 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#&gt; 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#&gt; 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#&gt; 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#&gt; 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
#&gt; 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
#&gt; 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
#&gt; 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
#&gt; # … with 24,082 more rows</pre>
</div>
<p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard |&gt;
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
#&gt; # A tibble: 5,307 × 5
#&gt; artist track date.entered week rank
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#&gt; 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#&gt; 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#&gt; 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#&gt; 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#&gt; 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#&gt; # … with 5,301 more rows</pre>
</div>
<p>You might also wonder what happens if a song is in the top 100 for more than 76 weeks. We can’t tell from this data, but you might guess that additional columns <code>wk77</code>, <code>wk78</code>, … would be added to the dataset.</p>
<p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code>. <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard_tidy &lt;- billboard |&gt;
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) |&gt;
mutate(
week = parse_number(week)
)
billboard_tidy
#&gt; # A tibble: 5,307 × 5
#&gt; artist track date.entered week rank
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;date&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#&gt; 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#&gt; 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#&gt; 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#&gt; 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#&gt; 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#&gt; # … with 5,301 more rows</pre>
</div>
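<p>To get a feel for what <code>parse_number()</code> does to these column names, the short sketch below applies it to a few strings shaped like the ones in this chapter. It assumes only that readr is installed; the input strings and the name <code>weeks</code> are illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# parse_number() keeps the first number it finds and drops the surrounding text
weeks &lt;- parse_number(c("wk1", "wk10", "wk76", "child2"))
weeks
#&gt; [1]  1 10 76  2</pre>
</div>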
<p>Now we’re in a good position to look at how song ranks vary over time by drawing a plot. The code is shown below and the result is <a href="#fig-billboard-ranks" data-type="xref">#fig-billboard-ranks</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard_tidy |&gt;
ggplot(aes(week, rank, group = track)) +
geom_line(alpha = 1/3) +
scale_y_reverse()</pre>
<div class="cell-output-display">
<figure class="figure"><p><img src="data-tidy_files/figure-html/fig-billboard-ranks-1.png" alt="A line plot with week on the x-axis and rank on the y-axis, where each line represents a song. Most songs appear to start at a high rank, rapidly accelerate to a low rank, and then decay again. There are surprisingly few tracks in the region when week is &gt;20 and rank is &gt;50." width="576"/></p>
<figcaption class="figure-caption">Figure 6.2: A line plot showing how the rank of a song changes over time.</figcaption>
</figure>
</div>
</div>
</section>
<section id="how-does-pivoting-work" data-type="sect2">
<h2>
How does pivoting work?</h2>
<p>Now that you’ve seen what pivoting can do for you, it’s worth taking a little time to gain some intuition about what it does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~var, ~col1, ~col2,
"A", 1, 2,
"B", 3, 4,
"C", 5, 6
)</pre>
</div>
<p>Here we’ll say there are three variables: <code>var</code> (already in a column), <code>name</code> (stored in the column names), and <code>value</code> (the cell values). So we can tidy it with:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
pivot_longer(
cols = col1:col2,
names_to = "name",
values_to = "value"
)
#&gt; # A tibble: 6 × 3
#&gt; var name value
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 A col1 1
#&gt; 2 A col2 2
#&gt; 3 B col1 3
#&gt; 4 B col2 4
#&gt; 5 C col1 5
#&gt; 6 C col2 6</pre>
</div>
<p>How does this transformation take place? It’s easier to see if we take it component by component. Columns that are already variables need to be repeated, once for each column in <code>cols</code>, as shown in <a href="#fig-pivot-variables" data-type="xref">#fig-pivot-variables</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/tidy-data/variables.png" alt="A diagram showing how `pivot_longer()` transforms a simple dataset, using color to highlight how the values in the `var` column (&quot;A&quot;, &quot;B&quot;, &quot;C&quot;) are each repeated twice in the output because there are two columns being pivoted (&quot;col1&quot; and &quot;col2&quot;)." width="469"/></p>
<figcaption class="figure-caption">Figure 6.3: Columns that are already variables need to be repeated, once for each column that is pivoted.</figcaption>
</figure>
</div>
</div>
<p>The column names become values in a new variable, whose name is given by <code>names_to</code>, as shown in <a href="#fig-pivot-names" data-type="xref">#fig-pivot-names</a>. They need to be repeated once for each row in the original dataset.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/tidy-data/column-names.png" alt="A diagram showing how `pivot_longer()` transforms a simple data set, using color to highlight how column names (&quot;col1&quot; and &quot;col2&quot;) become the values in a new `name` column. They are repeated three times because there were three rows in the input." width="469"/></p>
<figcaption class="figure-caption">Figure 6.4: The column names of pivoted columns become a new column.</figcaption>
</figure>
</div>
</div>
<p>The cell values also become values in a new variable, with a name given by <code>values_to</code>. They are unwound row by row. <a href="#fig-pivot-values" data-type="xref">#fig-pivot-values</a> illustrates the process.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/tidy-data/cell-values.png" alt="A diagram showing how `pivot_longer()` transforms data, using color to highlight how the cell values (the numbers 1 to 6) become the values in a new `value` column. They are unwound row-by-row, so the original rows (1,2), then (3,4), then (5,6), become a column running from 1 to 6." width="469"/></p>
<figcaption class="figure-caption">Figure 6.5: The number of values is preserved (not repeated), but unwound row-by-row.</figcaption>
</figure>
</div>
</div>
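<p>The three pieces described above can be sketched by hand. This is only an illustration of the idea, not how tidyr implements it: it rebuilds the long form of the small example with <code>rep()</code> and a matrix transpose, and the names <code>cols</code> and <code>manual</code> are made up for the sketch.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tibble)

df &lt;- tribble(
  ~var, ~col1, ~col2,
  "A",      1,     2,
  "B",      3,     4,
  "C",      5,     6
)
cols &lt;- c("col1", "col2")

manual &lt;- tibble(
  # existing variables: each value is repeated once per pivoted column
  var = rep(df$var, each = length(cols)),
  # column names: the whole set is repeated once per input row
  name = rep(cols, times = nrow(df)),
  # cell values: t() flips the matrix so as.vector() reads it row by row
  value = as.vector(t(as.matrix(df[cols])))
)
manual</pre>
</div>
<p>If tidyr is loaded, <code>pivot_longer(df, cols = col1:col2)</code> should give the same rows, since <code>"name"</code> and <code>"value"</code> are its default output column names.</p>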
</section>
<section id="many-variables-in-column-names" data-type="sect2">
<h2>
Many variables in column names</h2>
<p>A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the <code>who2</code> dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">who2
#&gt; # A tibble: 7,240 × 58
#&gt; country year sp_m_…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_…⁶ sp_m_65 sp_f_…⁷
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghani… 1980 NA NA NA NA NA NA NA NA
#&gt; 2 Afghani… 1981 NA NA NA NA NA NA NA NA
#&gt; 3 Afghani… 1982 NA NA NA NA NA NA NA NA
#&gt; 4 Afghani… 1983 NA NA NA NA NA NA NA NA
#&gt; 5 Afghani… 1984 NA NA NA NA NA NA NA NA
#&gt; 6 Afghani… 1985 NA NA NA NA NA NA NA NA
#&gt; # … with 7,234 more rows, 48 more variables: sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;,
#&gt; # sp_f_3544 &lt;dbl&gt;, sp_f_4554 &lt;dbl&gt;, sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;,
#&gt; # sn_m_014 &lt;dbl&gt;, sn_m_1524 &lt;dbl&gt;, sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;,
#&gt; # sn_m_4554 &lt;dbl&gt;, sn_m_5564 &lt;dbl&gt;, sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;,
#&gt; # sn_f_1524 &lt;dbl&gt;, sn_f_2534 &lt;dbl&gt;, sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;,
#&gt; # sn_f_5564 &lt;dbl&gt;, sn_f_65 &lt;dbl&gt;, ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;,
#&gt; # ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;, ep_m_4554 &lt;dbl&gt;, ep_m_5564 &lt;dbl&gt;, …</pre>
</div>
<p>This dataset records tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>sn</code>/<code>ep</code>/<code>rel</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the original column name up into pieces:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">who2 |&gt;
pivot_longer(
cols = !(country:year),
names_to = c("diagnosis", "gender", "age"),
names_sep = "_",
values_to = "count"
)
#&gt; # A tibble: 405,440 × 6
#&gt; country year diagnosis gender age count
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Afghanistan 1980 sp m 014 NA
#&gt; 2 Afghanistan 1980 sp m 1524 NA
#&gt; 3 Afghanistan 1980 sp m 2534 NA
#&gt; 4 Afghanistan 1980 sp m 3544 NA
#&gt; 5 Afghanistan 1980 sp m 4554 NA
#&gt; 6 Afghanistan 1980 sp m 5564 NA
#&gt; # … with 405,434 more rows</pre>
</div>
<p>An alternative to <code>names_sep</code> is <code>names_pattern</code>, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
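<p>As a preview, here is a minimal sketch of <code>names_pattern</code> on a tiny made-up table whose column names follow the same scheme as <code>who2</code>. The table <code>tb</code>, its values, and the particular regular expression are assumptions chosen for illustration; each parenthesised group in the pattern supplies one of the variables listed in <code>names_to</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# A hypothetical table with who2-style column names
tb &lt;- tribble(
  ~country, ~year, ~sp_m_014, ~sp_f_014, ~rel_m_65,
  "A",       2000,         1,         2,         3
)

tb_long &lt;- tb |&gt;
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    # three capture groups, one per entry in names_to
    names_pattern = "([a-z]+)_([mf])_([0-9]+)",
    values_to = "count"
  )
tb_long</pre>
</div>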
<p>Conceptually, this is only a minor variation on the simpler case you’ve already seen. <a href="#fig-pivot-multiple-names" data-type="xref">#fig-pivot-multiple-names</a> shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that gives better performance.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/tidy-data/multiple-names.png" alt="A diagram that uses color to illustrate how supplying `names_sep` and multiple `names_to` creates multiple variables in the output. The input has variable names &quot;x_1&quot; and &quot;y_2&quot; which are split up by &quot;_&quot; to create name and number columns in the output. This is similar to the case with a single `names_to`, but what would have been a single output variable is now separated into multiple variables." width="600"/></p>
<figcaption class="figure-caption">Figure 6.6: Pivoting with many variables in the column names means that each column name now fills in values in multiple output columns.</figcaption>
</figure>
</div>
</div>
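<p>The two-step mental model can be written out explicitly. The sketch below uses a small made-up table (<code>tb</code>) and tidyr’s <code>separate()</code> for the second step; it is meant only to show that pivoting and then splitting arrives at the same place as the single call with <code>names_sep</code>, not to suggest that this is what happens internally.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# A hypothetical table with who2-style column names
tb &lt;- tribble(
  ~country, ~sp_m_014, ~sp_f_014,
  "A",              1,         2
)

# Step 1: pivot the column names into a single column.
# Step 2: split that column into its three pieces.
two_step &lt;- tb |&gt;
  pivot_longer(cols = !country, names_to = "name", values_to = "count") |&gt;
  separate(name, into = c("diagnosis", "gender", "age"), sep = "_")

# The single-step version used in the text
one_step &lt;- tb |&gt;
  pivot_longer(
    cols = !country,
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )</pre>
</div>
<p>Both <code>two_step</code> and <code>one_step</code> should contain the columns <code>country</code>, <code>diagnosis</code>, <code>gender</code>, <code>age</code>, and <code>count</code>, with one row per original column.</p>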
</section>
<section id="data-and-variable-names-in-the-column-headers" data-type="sect2">
<h2>
Data and variable names in the column headers</h2>
<p>The next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the <code>household</code> dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">household
#&gt; # A tibble: 5 × 5
#&gt; family dob_child1 dob_child2 name_child1 name_child2
#&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 1998-11-26 2000-01-29 Susan Jose
#&gt; 2 2 1996-06-22 NA Mark &lt;NA&gt;
#&gt; 3 3 2002-07-11 2004-04-05 Sam Seth
#&gt; 4 4 2004-10-10 2009-08-27 Craig Khai
#&gt; 5 5 2000-12-05 2005-02-28 Parker Gracie</pre>
</div>
<p>This dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (<code>dob</code>, <code>name</code>) and the values of another (<code>child</code>, with values 1 and 2). To solve this problem we again need to supply a vector to <code>names_to</code> but this time we use the special <code>".value"</code> sentinel. This overrides the usual <code>values_to</code> argument to use the first component of the pivoted column name as a variable name in the output.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">household |&gt;
pivot_longer(
cols = !family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
) |&gt;
mutate(
child = parse_number(child)
)
#&gt; # A tibble: 9 × 4
#&gt; family child dob name
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;date&gt; &lt;chr&gt;
#&gt; 1 1 1 1998-11-26 Susan
#&gt; 2 1 2 2000-01-29 Jose
#&gt; 3 2 1 1996-06-22 Mark
#&gt; 4 3 1 2002-07-11 Sam
#&gt; 5 3 2 2004-04-05 Seth
#&gt; 6 4 1 2004-10-10 Craig
#&gt; # … with 3 more rows</pre>
</div>
<p>We again use <code>values_drop_na = TRUE</code>, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> to convert (e.g.) <code>child1</code> into 1.</p>
<p><a href="#fig-pivot-names-and-values" data-type="xref">#fig-pivot-names-and-values</a> illustrates the basic idea with a simpler example. When you use <code>".value"</code> in <code>names_to</code>, the column names in the input contribute to both values and variable names in the output.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/tidy-data/names-and-values.png" alt="A diagram that uses color to illustrate how the special &quot;.value&quot; sentinel works. The input has names &quot;x_1&quot;, &quot;x_2&quot;, &quot;y_1&quot;, and &quot;y_2&quot;, and we want to use the first component (&quot;x&quot;, &quot;y&quot;) as a variable name and the second (&quot;1&quot;, &quot;2&quot;) as the value for a new &quot;id&quot; column." width="540"/></p>
<figcaption class="figure-caption">Figure 6.7: Pivoting with <code>names_to = c(".value", "id")</code> splits the column names into two components: the first part determines the output column name (<code>x</code> or <code>y</code>), and the second part determines the value of the <code>id</code> column.</figcaption>
</figure>
</div>
</div>
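<p>The figure’s example can be reproduced with a few lines of code. This is a sketch on made-up data: the <code>key</code> column and the cell values are assumptions added so that there is something to pivot around, while the <code>x_1</code>, <code>x_2</code>, <code>y_1</code>, <code>y_2</code> names follow the figure.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

df &lt;- tribble(
  ~key, ~x_1, ~x_2, ~y_1, ~y_2,
  "A",     1,    2,    3,    4,
  "B",     5,    6,    7,    8
)

df_long &lt;- df |&gt;
  pivot_longer(
    cols = !key,
    # the part before "_" becomes an output column name (x or y);
    # the part after "_" becomes a value in the new id column
    names_to = c(".value", "id"),
    names_sep = "_"
  )
df_long</pre>
</div>
<p>Each input row turns into two output rows, one per <code>id</code>, and the output has columns <code>key</code>, <code>id</code>, <code>x</code>, and <code>y</code>.</p>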
</section>
<section id="widening-data" data-type="sect2">
<h2>
Widening data</h2>
<p>So far we’ve used <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p>
<p>We’ll start by looking at <code>cms_patient_experience</code>, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience
#&gt; # A tibble: 500 × 5
#&gt; org_pac_id org_nm measure_cd measure_title prf_r…¹
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS SSM… 63
#&gt; 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS SSM… 87
#&gt; 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS SSM… 86
#&gt; 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS SSM… 57
#&gt; 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS SSM… 85
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM… 24
#&gt; # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
</div>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
distinct(measure_cd, measure_title)
#&gt; # A tibble: 6 × 2
#&gt; measure_cd measure_title
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor…
#&gt; 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#&gt; 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#&gt; 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#&gt; 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre>
</div>
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> has the opposite interface to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
pivot_wider(
names_from = measure_cd,
values_from = prf_rate
)
#&gt; # A tibble: 500 × 9
#&gt; org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE M… CAHPS … 63 NA NA NA NA NA
#&gt; 2 0446157747 USC CARE M… CAHPS … NA 87 NA NA NA NA
#&gt; 3 0446157747 USC CARE M… CAHPS … NA NA 86 NA NA NA
#&gt; 4 0446157747 USC CARE M… CAHPS … NA NA NA 57 NA NA
#&gt; 5 0446157747 USC CARE M… CAHPS … NA NA NA NA 85 NA
#&gt; 6 0446157747 USC CARE M… CAHPS … NA NA NA NA NA 24
#&gt; # … with 494 more rows, and abbreviated variable names ¹measure_title,
#&gt; # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
#&gt; # ⁷CAHPS_GRP_12</pre>
</div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organisation. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organisation. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
pivot_wider(
id_cols = starts_with("org"),
names_from = measure_cd,
values_from = prf_rate
)
#&gt; # A tibble: 95 × 8
#&gt; org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0446157747 USC CARE MEDICAL G… 63 87 86 57 85 24
#&gt; 2 0446162697 ASSOCIATION OF UNI… 59 85 83 63 88 22
#&gt; 3 0547164295 BEAVER MEDICAL GRO… 49 NA 75 44 73 12
#&gt; 4 0749333730 CAPE PHYSICIANS AS… 67 84 85 65 82 24
#&gt; 5 0840104360 ALLIANCE PHYSICIAN… 66 87 87 64 87 28
#&gt; 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
#&gt; # … with 89 more rows, and abbreviated variable names ¹CAHPS_GRP_1,
#&gt; # ²CAHPS_GRP_2, ³CAHPS_GRP_3, ⁴CAHPS_GRP_5, ⁵CAHPS_GRP_8, ⁶CAHPS_GRP_12</pre>
</div>
<p>This gives us the output that we’re looking for.</p>
</section>
<section id="how-does-pivot_wider-work" data-type="sect2">
<h2>
How does <code>pivot_wider()</code> work?</h2>
<p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~id, ~name, ~value,
"A", "x", 1,
"B", "y", 2,
"B", "x", 3,
"A", "y", 4,
"A", "z", 5,
)</pre>
</div>
<p>We’ll take the values from the <code>value</code> column and the names from the <code>name</code> column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
pivot_wider(
names_from = name,
values_from = value
)
#&gt; # A tibble: 2 × 4
#&gt; id x y z
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 A 1 4 5
#&gt; 2 B 3 2 NA</pre>
</div>
<p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p>
<p>To begin the process, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> first needs to figure out what will go in the rows and columns. Finding the column names is easy: it’s just the values of <code>name</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
distinct(name)
#&gt; # A tibble: 3 × 1
#&gt; name
#&gt; &lt;chr&gt;
#&gt; 1 x
#&gt; 2 y
#&gt; 3 z</pre>
</div>
<p>By default, the rows in the output are formed by all the variables that aren’t going into the names or values. These are called the <code>id_cols</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
select(-name, -value) |&gt;
distinct()
#&gt; # A tibble: 2 × 1
#&gt; id
#&gt; &lt;chr&gt;
#&gt; 1 A
#&gt; 2 B</pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> then combines these results to generate an empty data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
select(-name, -value) |&gt;
distinct() |&gt;
mutate(x = NA, y = NA, z = NA)
#&gt; # A tibble: 2 × 4
#&gt; id x y z
#&gt; &lt;chr&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 A NA NA NA
#&gt; 2 B NA NA NA</pre>
</div>
<p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input, as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
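<p>If an <code>NA</code> isn’t what you want in those cells, <code>pivot_wider()</code> has a <code>values_fill</code> argument that supplies a replacement. The sketch below reuses the small example and fills with 0; whether a zero is a sensible stand-in depends entirely on what the values mean in your data, so treat the choice here as an assumption made for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

df &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "B", "y", 2,
  "B", "x", 3,
  "A", "y", 4,
  "A", "z", 5
)

df_filled &lt;- df |&gt;
  pivot_wider(
    names_from = name,
    values_from = value,
    values_fill = 0  # used for the id "B" / name "z" cell, which has no input row
  )
df_filled</pre>
</div>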
<p>You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~id, ~name, ~value,
"A", "x", 1,
"A", "x", 2,
"A", "y", 3,
"B", "x", 4,
"B", "y", 5,
)</pre>
</div>
<p>If we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; pivot_wider(
names_from = name,
values_from = value
)
#&gt; Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#&gt; • Use `values_fn = list` to suppress this warning.
#&gt; • Use `values_fn = {summary_fun}` to summarise duplicates.
#&gt; • Use the following dplyr code to identify duplicates.
#&gt; {data} %&gt;%
#&gt; dplyr::group_by(id, name) %&gt;%
#&gt; dplyr::summarise(n = dplyr::n(), .groups = "drop") %&gt;%
#&gt; dplyr::filter(n &gt; 1L)
#&gt; # A tibble: 2 × 3
#&gt; id x y
#&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 A &lt;dbl [2]&gt; &lt;dbl [1]&gt;
#&gt; 2 B &lt;dbl [1]&gt; &lt;dbl [1]&gt;</pre>
</div>
<p>Since you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
group_by(id, name) |&gt;
summarize(n = n(), .groups = "drop") |&gt;
filter(n &gt; 1L)
#&gt; # A tibble: 1 × 3
#&gt; id name n
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 A x 2</pre>
</div>
<p>It’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.</p>
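<p>As a minimal sketch of the second option, the code below assumes (purely for illustration) that the duplicated values are repeat measurements that can reasonably be averaged. It shows two routes that should give the same table: summarizing before pivoting, or passing a summary function to the <code>values_fn</code> argument of <code>pivot_wider()</code>. Whether a mean is the right summary depends entirely on your data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

df &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "A", "x", 2,
  "A", "y", 3,
  "B", "x", 4,
  "B", "y", 5,
)

# Route 1: collapse duplicates first, then pivot.
# Averaging is an assumption made for this example only.
by_summary &lt;- df |&gt;
  group_by(id, name) |&gt;
  summarize(value = mean(value), .groups = "drop") |&gt;
  pivot_wider(names_from = name, values_from = value)

# Route 2: let pivot_wider() summarize duplicates while it pivots.
by_values_fn &lt;- df |&gt;
  pivot_wider(names_from = name, values_from = value, values_fn = mean)

by_summary
by_values_fn</pre>
</div>
<p>Either way, each combination of <code>id</code> and <code>name</code> now contributes a single value, so the output contains ordinary numeric columns instead of list-columns.</p>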
</section>
</section>
<section id="untidy-data" data-type="sect1">
<h1>
Untidy data</h1>
<p>While <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> is occasionally useful for making tidy data, its real strength is making <strong>untidy</strong> data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.</p>
<p>The following sections will show a few examples of <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.</p>
<section id="presenting-data-to-humans" data-type="sect2">
<h2>
Presenting data to humans</h2>
<p>As you’ve seen, <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(clarity, color)
#&gt; # A tibble: 56 × 3
#&gt; clarity color n
#&gt; &lt;ord&gt; &lt;ord&gt; &lt;int&gt;
#&gt; 1 I1 D 42
#&gt; 2 I1 E 102
#&gt; 3 I1 F 143
#&gt; 4 I1 G 150
#&gt; 5 I1 H 162
#&gt; 6 I1 I 92
#&gt; # … with 50 more rows</pre>
</div>
<p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> to create a form more suitable for display to other humans:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(clarity, color) |&gt;
pivot_wider(
names_from = color,
values_from = n
)
#&gt; # A tibble: 8 × 8
#&gt; clarity D E F G H I J
#&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 42 102 143 150 162 92 50
#&gt; 2 SI2 1370 1713 1609 1548 1563 912 479
#&gt; 3 SI1 2083 2426 2131 1976 2275 1424 750
#&gt; 4 VS2 1697 2470 2201 2347 1643 1169 731
#&gt; 5 VS1 705 1281 1364 2148 1169 962 542
#&gt; 6 VVS2 553 991 975 1443 608 365 131
#&gt; # … with 2 more rows</pre>
</div>
<p>This display also makes it easy to compare in two directions, horizontally and vertically, much like <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code>.</p>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like <a href="#chp-https://gt.rstudio" data-type="xref">#chp-https://gt.rstudio</a>. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.</p>
</section>
<section id="multivariate-statistics" data-type="sect2">
<h2>
Multivariate statistics</h2>
<p>Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable. Sometimes these formats have substantial performance or space advantages, or sometimes they’re just necessary to get closer to the underlying matrix mathematics.</p>
<p>We’re not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need. For example, let’s imagine you wanted to cluster the gapminder data to find countries that had a similar progression of <code>gdpPercap</code> over time. To do this, we need one row for each country and one column for each year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(gapminder)
col_year &lt;- gapminder |&gt;
mutate(gdpPercap = log10(gdpPercap)) |&gt;
pivot_wider(
id_cols = country,
names_from = year,
values_from = gdpPercap
)
col_year
#&gt; # A tibble: 142 × 13
#&gt; country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Afghani… 2.89 2.91 2.93 2.92 2.87 2.90 2.99 2.93 2.81 2.80
#&gt; 2 Albania 3.20 3.29 3.36 3.44 3.52 3.55 3.56 3.57 3.40 3.50
#&gt; 3 Algeria 3.39 3.48 3.41 3.51 3.62 3.69 3.76 3.75 3.70 3.68
#&gt; 4 Angola 3.55 3.58 3.63 3.74 3.74 3.48 3.44 3.39 3.42 3.36
#&gt; 5 Argenti… 3.77 3.84 3.85 3.91 3.98 4.00 3.95 3.96 3.97 4.04
#&gt; 6 Austral… 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37 4.43
#&gt; # … with 136 more rows, and 2 more variables: `2002` &lt;dbl&gt;, `2007` &lt;dbl&gt;</pre>
</div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">col_year &lt;- col_year |&gt;
column_to_rownames("country")
head(col_year)
#&gt; 1952 1957 1962 1967 1972 1977 1982
#&gt; Afghanistan 2.891786 2.914265 2.931000 2.922309 2.869221 2.895485 2.990344
#&gt; Albania 3.204407 3.288313 3.364155 3.440940 3.520277 3.548144 3.560012
#&gt; Algeria 3.388990 3.479140 3.406679 3.511481 3.621453 3.691118 3.759302
#&gt; Angola 3.546618 3.582965 3.630354 3.742157 3.738248 3.478371 3.440429
#&gt; Argentina 3.771684 3.836125 3.853282 3.905955 3.975112 4.003419 3.954141
#&gt; Australia 4.001716 4.039400 4.086973 4.162150 4.225015 4.263262 4.289522
#&gt; 1987 1992 1997 2002 2007
#&gt; Afghanistan 2.930641 2.812473 2.803007 2.861376 2.988818
#&gt; Albania 3.572748 3.397495 3.504206 3.663155 3.773569
#&gt; Algeria 3.754452 3.700982 3.680996 3.723295 3.794025
#&gt; Angola 3.385644 3.419600 3.357390 3.442995 3.680991
#&gt; Argentina 3.960931 3.968876 4.040099 3.944366 4.106510
#&gt; Australia 4.340224 4.369675 4.431331 4.486965 4.537005</pre>
</div>
<p>This makes a data frame, because tibbles don’t support row names<span data-type="footnote">Tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.</span>.</p>
<p>We’re now ready to cluster with (e.g.) <code><a href="#chp-https://rdrr.io/r/stats/kmeans" data-type="xref">#chp-https://rdrr.io/r/stats/kmeans</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cluster &lt;- stats::kmeans(col_year, centers = 6)</pre>
</div>
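<p>One practical caveat: k-means begins from randomly chosen centers, so the cluster numbers (and sometimes the memberships) can differ from run to run. The sketch below shows one way to make the result repeatable by setting a seed first. The seed value and the <code>nstart = 20</code> setting are arbitrary choices for illustration, not settings taken from the analysis in this chapter, so the cluster ids it produces need not match the output shown here.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(gapminder)

# Rebuild the country-by-year table so this example stands alone.
col_year &lt;- gapminder |&gt;
  mutate(gdpPercap = log10(gdpPercap)) |&gt;
  pivot_wider(
    id_cols = country,
    names_from = year,
    values_from = gdpPercap
  ) |&gt;
  column_to_rownames("country")

# Fix the random number generator so repeated runs agree.
# Both the seed and nstart are illustrative assumptions.
set.seed(1014)
cluster &lt;- stats::kmeans(col_year, centers = 6, nstart = 20)

# How many countries ended up in each cluster?
table(cluster$cluster)</pre>
</div>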
<p>Extracting the data out of this object into a form you can work with is a challenge you’ll need to come back to later in the book, once you’ve learned more about lists. But for now, you can get the clustering membership out with this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cluster_id &lt;- cluster$cluster |&gt;
enframe() |&gt;
rename(country = name, cluster_id = value)
cluster_id
#&gt; # A tibble: 142 × 2
#&gt; country cluster_id
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 Afghanistan 4
#&gt; 2 Albania 2
#&gt; 3 Algeria 6
#&gt; 4 Angola 2
#&gt; 5 Argentina 5
#&gt; 6 Australia 1
#&gt; # … with 136 more rows</pre>
</div>
<p>You could then combine this back with the original data using one of the joins you’ll learn about in <a href="#chp-joins" data-type="xref">#chp-joins</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gapminder |&gt; left_join(cluster_id)
#&gt; Joining with `by = join_by(country)`
#&gt; # A tibble: 1,704 × 7
#&gt; country continent year lifeExp pop gdpPercap cluster_id
#&gt; &lt;chr&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 Afghanistan Asia 1952 28.8 8425333 779. 4
#&gt; 2 Afghanistan Asia 1957 30.3 9240934 821. 4
#&gt; 3 Afghanistan Asia 1962 32.0 10267083 853. 4
#&gt; 4 Afghanistan Asia 1967 34.0 11537966 836. 4
#&gt; 5 Afghanistan Asia 1972 36.1 13079460 740. 4
#&gt; 6 Afghanistan Asia 1977 38.4 14880372 786. 4
#&gt; # … with 1,698 more rows</pre>
</div>
</section>
<section id="pragmatic-computation" data-type="sect2">
<h2>
Pragmatic computation</h2>
<p>Sometimes it’s just easier to answer a question using untidy data. For example, if you’re interested in just the number of missing values for each organization in <code>cms_patient_experience</code>, it’s easier to work with the untidy form:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
group_by(org_pac_id) |&gt;
summarize(
n_miss = sum(is.na(prf_rate)),
n = n(),
)
#&gt; # A tibble: 95 × 3
#&gt; org_pac_id n_miss n
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0446157747 0 6
#&gt; 2 0446162697 0 6
#&gt; 3 0547164295 1 6
#&gt; 4 0749333730 0 6
#&gt; 5 0840104360 0 6
#&gt; 6 0840109864 0 6
#&gt; # … with 89 more rows</pre>
</div>
<p>This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.</p>
<p>So if you’re stuck figuring out how to do some computation, maybe it’s time to switch up the organisation of your data. For computations involving a fixed number of values (like computing differences or ratios), it’s usually easier if the data is in columns; for those with a variable number of values (like sums or means) it’s usually easier in rows. Don’t be afraid to untidy, transform, and re-tidy if needed.</p>
<p>Let’s explore this idea by looking at <code>cms_patient_care</code>, which has a similar structure to <code>cms_patient_experience</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care
#&gt; # A tibble: 252 × 5
#&gt; ccn facility_name measure_abbr score type
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 011500 BAPTIST HOSPICE beliefs_addressed 202 denominator
#&gt; 2 011500 BAPTIST HOSPICE beliefs_addressed 100 observed
#&gt; 3 011500 BAPTIST HOSPICE composite_process 202 denominator
#&gt; 4 011500 BAPTIST HOSPICE composite_process 88.1 observed
#&gt; 5 011500 BAPTIST HOSPICE dyspena_treatment 110 denominator
#&gt; 6 011500 BAPTIST HOSPICE dyspena_treatment 99.1 observed
#&gt; # … with 246 more rows</pre>
</div>
<p>It contains information about 9 measures (<code>beliefs_addressed</code>, <code>composite_process</code>, <code>dyspena_treatment</code>, …) on 14 different facilities (identified by <code>ccn</code> with a name given by <code>facility_name</code>). Compared to <code>cms_patient_experience</code>, however, each measurement is recorded in two rows, distinguished by <code>type</code>: an <code>observed</code> score, the percentage of patients who answered yes to the survey question, and a <code>denominator</code>, the number of patients that the question applies to. Depending on what you want to do next, you may find any of the following three structures useful:</p>
<ul><li>
<p>If you want to compute the number of patients that answered yes to the question, you may pivot <code>type</code> into the columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care |&gt;
pivot_wider(
names_from = type,
values_from = score
) |&gt;
mutate(
numerator = round(observed / 100 * denominator)
)
#&gt; # A tibble: 126 × 6
#&gt; ccn facility_name measure_abbr denominator observed numerator
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 011500 BAPTIST HOSPICE beliefs_addressed 202 100 202
#&gt; 2 011500 BAPTIST HOSPICE composite_process 202 88.1 178
#&gt; 3 011500 BAPTIST HOSPICE dyspena_treatment 110 99.1 109
#&gt; 4 011500 BAPTIST HOSPICE dyspnea_screening 202 100 202
#&gt; 5 011500 BAPTIST HOSPICE opioid_bowel 61 100 61
#&gt; 6 011500 BAPTIST HOSPICE pain_assessment 107 100 107
#&gt; # … with 120 more rows</pre>
</div>
</li>
<li>
<p>If you want to display the distribution of each metric, you may keep it as is so you could facet by <code>measure_abbr</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care |&gt;
filter(type == "observed") |&gt;
ggplot(aes(score)) +
geom_histogram(binwidth = 2) +
facet_wrap(vars(measure_abbr))
#&gt; Warning: Removed 1 rows containing non-finite values (`stat_bin()`).</pre>
</div>
</li>
<li>
<p>If you want to explore how different metrics are related, you may put the measure names in the columns so you could compare them in scatterplots.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_care |&gt;
filter(type == "observed") |&gt;
select(-type) |&gt;
pivot_wider(
names_from = measure_abbr,
values_from = score
) |&gt;
ggplot(aes(dyspnea_screening, dyspena_treatment)) +
geom_point() +
coord_equal()</pre>
</div>
</li>
</ul></section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem so we also showed you some places where you might want to deliberately untidy your data in order to present to humans, feed into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the <a href="#chp-https://www.jstatsoft.org/article/view/v059i10" data-type="xref">#chp-https://www.jstatsoft.org/article/view/v059i10</a> paper published in the Journal of Statistical Software.</p>
<p>In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code “tidy” (ha!) in order to make it easy for you and others to read and understand your code.</p>
</section>
</section>

890
oreilly/data-transform.html Normal file

@ -0,0 +1,890 @@
<section data-type="chapter" id="chp-data-transform">
<h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the <strong>dplyr</strong> package and a new dataset on flights that departed New York City in 2013.</p>
<p>The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll come back to these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
library(tidyverse)
#&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
#&gt; ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
#&gt; ✔ tidyr 1.2.1.9001 ✔ stringr 1.4.1.9000
#&gt; ✔ readr 2.1.3 ✔ forcats 0.5.2
#&gt; ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div>
<p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="#chp-https://rdrr.io/r/stats/filter" data-type="xref">#chp-https://rdrr.io/r/stats/filter</a></code> and <code><a href="#chp-https://rdrr.io/r/stats/lag" data-type="xref">#chp-https://rdrr.io/r/stats/lag</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
</section>
<section id="nycflights13" data-type="sect2">
<h2>
nycflights13</h2>
<p>To explore the basic dplyr verbs, we’re going to use <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="#chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0" data-type="xref">#chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0</a>, and is documented in <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a <strong>tibble</strong>, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. To see all the columns, you can use <code>print(flights, width = Inf)</code> in the console, but it’s generally more convenient to instead use <code>View(flights)</code> to open the dataset in the scrollable RStudio viewer.</p>
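<p>As a small sketch of the console options, the calls below widen the printout, ask for more rows, and use <code>glimpse()</code> (a dplyr function not otherwise discussed in this section) to list every column with its type and first few values. The choice of 20 rows is arbitrary.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
library(tidyverse)

# Show every column, wrapping across lines as needed.
print(flights, width = Inf)

# Show more rows than the default (20 is an arbitrary choice).
print(flights, n = 20)

# One line per column: name, type, and the first few values.
glimpse(flights)</pre>
</div>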
<p>You might have noticed the short abbreviations that follow each column name. These tell you the type of each variable: <code>&lt;int&gt;</code> is short for integer, <code>&lt;dbl&gt;</code> is short for double (aka real numbers), <code>&lt;chr&gt;</code> for character (aka strings), and <code>&lt;dttm&gt;</code> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.</p>
</section>
<section id="dplyr-basics" data-type="sect2">
<h2>
dplyr basics</h2>
<p>You’re about to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:</p>
<ol type="1"><li><p>The first argument is always a data frame.</p></li>
<li><p>The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).</p></li>
<li><p>The result is always a new data frame.</p></li>
</ol><p>Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, <code>|&gt;</code>. The pipe takes the thing on its left and passes it along to the function on its right so that <code>x |&gt; f(y)</code> is equivalent to <code>f(x, y)</code>, and <code>x |&gt; f(y) |&gt; g(z)</code> is equivalent to <code>g(f(x, y), z)</code>. The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dest == "IAH") |&gt;
group_by(year, month, day) |&gt;
summarize(
arr_delay = mean(arr_delay, na.rm = TRUE)
)</pre>
</div>
<p>The code starts with the <code>flights</code> dataset, then filters it, then groups it, then summarizes it. We’ll come back to the pipe and its alternatives in <a href="#sec-pipes" data-type="xref">#sec-pipes</a>.</p>
<p>dplyr’s verbs are organised into four groups based on what they operate on: <strong>rows</strong>, <strong>columns</strong>, <strong>groups</strong>, or <strong>tables</strong>. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to the verbs that work on tables in <a href="#chp-joins" data-type="xref">#chp-joins</a>. Let’s dive in!</p>
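<p>To make the equivalence between piped and nested calls concrete, here is a sketch that writes the same computation both ways and compares the results. The <code>.groups = "drop"</code> argument is an addition for this example (it removes the leftover grouping and the accompanying message); the object names <code>piped</code> and <code>nested</code> are also just illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
library(tidyverse)

# Piped form: read top to bottom, saying "then" at each pipe.
piped &lt;- flights |&gt;
  filter(dest == "IAH") |&gt;
  group_by(year, month, day) |&gt;
  summarize(
    arr_delay = mean(arr_delay, na.rm = TRUE),
    .groups = "drop"
  )

# Nested form: the same calls, read from the inside out.
nested &lt;- summarize(
  group_by(filter(flights, dest == "IAH"), year, month, day),
  arr_delay = mean(arr_delay, na.rm = TRUE),
  .groups = "drop"
)

# Compare the two results.
identical(piped, nested)</pre>
</div>
<p>The nested form computes the same thing, but you have to find the innermost call first and work outwards, which is why the piped form tends to be easier to read.</p>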
</section>
</section>
<section id="rows" data-type="sect1">
<h1>
Rows</h1>
<p>The most important verbs that operate on rows are <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, which changes which rows are present without changing their order, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.</p>
<section id="filter" data-type="sect2">
<h2>
<code>filter()</code>
</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(arr_delay &gt; 120)
#&gt; # A tibble: 10,034 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 811 630 101 1047 830 137 MQ
#&gt; 2 2013 1 1 848 1835 853 1001 1950 851 MQ
#&gt; 3 2013 1 1 957 733 144 1056 853 123 UA
#&gt; 4 2013 1 1 1114 900 134 1447 1222 145 UA
#&gt; 5 2013 1 1 1505 1310 115 1638 1431 127 EV
#&gt; 6 2013 1 1 1525 1340 105 1831 1626 125 B6
#&gt; # … with 10,028 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>As well as <code>&gt;</code> (greater than), you can use <code>&gt;=</code> (greater than or equal to), <code>&lt;</code> (less than), <code>&lt;=</code> (less than or equal to), <code>==</code> (equal to), and <code>!=</code> (not equal to). You can also use <code>&amp;</code> (and) or <code>|</code> (or) to combine multiple conditions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Flights that departed on January 1
flights |&gt;
filter(month == 1 &amp; day == 1)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 836 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Flights that departed in January or February
flights |&gt;
filter(month == 1 | month == 2)
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 51,949 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>There’s a useful shortcut when you’re combining <code>|</code> and <code>==</code>: <code>%in%</code>. It keeps rows where the variable equals one of the values on the right:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A shorter way to select flights that departed in January or February
flights |&gt;
filter(month %in% c(1, 2))
#&gt; # A tibble: 51,955 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 51,949 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
<p>When you run <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">jan1 &lt;- flights |&gt;
filter(month == 1 &amp; day == 1)</pre>
</div>
</section>
<section id="common-mistakes" data-type="sect2">
<h2>
Common mistakes</h2>
<p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> will let you know when this happens:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month = 1)
#&gt; Error in `filter()`:
#&gt; ! We detected a named input.
#&gt; This usually means that you've used `=` instead of `==`.
#&gt; Did you mean `month == 1`?</pre>
</div>
<p>Another mistake is to write “or” statements like you would in English:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == 1 | 2)</pre>
</div>
<p>This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in <a href="#sec-boolean-operations" data-type="xref">#sec-boolean-operations</a>.</p>
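<p>If the goal is to find rows from January or February, one way to write the condition is to repeat the full comparison on each side of <code>|</code>; <code>%in%</code> is a more compact equivalent. The sketch below checks this on a small made-up tibble (<code>df</code> and its columns are hypothetical, not part of <code>flights</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Made-up data used only for illustration
df &lt;- tibble(month = c(1, 2, 3), flight = c("a", "b", "c"))

# Spell out a complete comparison on each side of | ...
either &lt;- df |&gt;
  filter(month == 1 | month == 2)

# ... or use %in% as a shorthand for the same condition
shorthand &lt;- df |&gt;
  filter(month %in% c(1, 2))

identical(either, shorthand)
#&gt; [1] TRUE</pre>
</div>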
</section>
<section id="arrange" data-type="sect2">
<h2>
<code>arrange()</code>
</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  arrange(year, month, day, dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
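<p>To see the tie-breaking rule in isolation, here is a small sketch on a made-up tibble (<code>df</code> is an assumption for illustration). The second column only affects the order of rows that share a value in the first:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Made-up data: two rows share month 1, so day decides their order
df &lt;- tibble(
  month = c(2, 1, 1),
  day = c(1, 2, 1)
)

# Rows are ordered by month first; day only breaks ties within a month
sorted &lt;- df |&gt;
  arrange(month, day)

sorted$month
#&gt; [1] 1 1 2
sorted$day
#&gt; [1] 1 2 1</pre>
</div>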
<p>You can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/desc" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/desc</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  arrange(desc(dep_delay))
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 9 641 900 1301 1242 1530 1272 HA
#&gt; 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
#&gt; 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
#&gt; 4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
#&gt; 5 2013 7 22 845 1600 1005 1044 1815 989 MQ
#&gt; 6 2013 4 10 1100 1900 960 1342 2211 931 DL
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>You can combine <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> to solve more complex problems. For example, we could look for the flights that left roughly on time but were most delayed on arrival:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt;
  arrange(desc(arr_delay))
#&gt; # A tibble: 239,109 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 11 1 658 700 -2 1329 1015 194 VX
#&gt; 2 2013 4 18 558 600 -2 1149 850 179 AA
#&gt; 3 2013 7 7 1659 1700 -1 2050 1823 147 US
#&gt; 4 2013 7 22 1606 1615 -9 2056 1831 145 DL
#&gt; 5 2013 9 19 648 641 7 1035 810 145 UA
#&gt; 6 2013 4 18 655 700 -5 1213 950 143 AA
#&gt; # … with 239,103 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Find all flights that</p>
<ol type="a"><li>Had an arrival delay of two or more hours</li>
<li>Flew to Houston (<code>IAH</code> or <code>HOU</code>)</li>
<li>Were operated by United, American, or Delta</li>
<li>Departed in summer (July, August, and September)</li>
<li>Arrived more than two hours late, but didn’t leave late</li>
<li>Were delayed by at least an hour, but made up over 30 minutes in flight</li>
</ol></li>
<li><p>Sort <code>flights</code> to find the flights with the longest departure delays. Find the flights that left earliest in the morning.</p></li>
<li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li>
<li><p>Which flights traveled the farthest? Which traveled the shortest distance?</p></li>
<li><p>Does it matter what order you use <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
</ol></section>
</section>
<section id="columns" data-type="sect1">
<h1>
Columns</h1>
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code>. <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> creates new columns that are functions of the existing columns; <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> change which columns are present, their names, or their positions.</p>
<section id="sec-mutate" data-type="sect2">
<h2>
<code>mutate()</code>
</h2>
<p>The job of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#&gt; # A tibble: 336,776 × 21
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 11 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, gain &lt;dbl&gt;, speed &lt;dbl&gt;, and abbreviated
#&gt; # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay</pre>
</div>
<p>By default, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
#&gt; # A tibble: 336,776 × 21
#&gt; gain speed year month day dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 -9 370. 2013 1 1 517 515 2 830 819 11
#&gt; 2 -16 374. 2013 1 1 533 529 4 850 830 20
#&gt; 3 -31 408. 2013 1 1 542 540 2 923 850 33
#&gt; 4 17 517. 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 19 394. 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 -16 288. 2013 1 1 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use a variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
#&gt; # A tibble: 336,776 × 21
#&gt; year month day gain speed dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 -9 370. 517 515 2 830 819 11
#&gt; 2 2013 1 1 -16 374. 533 529 4 850 830 20
#&gt; 3 2013 1 1 -31 408. 542 540 2 923 850 33
#&gt; 4 2013 1 1 17 517. 544 545 -1 1004 1022 -18
#&gt; 5 2013 1 1 19 394. 554 600 -6 812 837 -25
#&gt; 6 2013 1 1 -16 288. 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which keeps only the inputs and outputs of your calculations so that you can see them side by side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#&gt; # A tibble: 336,776 × 6
#&gt; dep_delay arr_delay air_time gain hours gain_per_hour
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2 11 227 -9 3.78 -2.38
#&gt; 2 4 20 227 -16 3.78 -4.23
#&gt; 3 2 33 160 -31 2.67 -11.6
#&gt; 4 -1 -18 183 17 3.05 5.57
#&gt; 5 -6 -25 116 19 1.93 9.83
#&gt; 6 -4 12 150 -16 2.5 -6.4
#&gt; # … with 336,770 more rows</pre>
</div>
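<p>Besides <code>"used"</code>, <code>.keep</code> also accepts <code>"all"</code> (the default), <code>"unused"</code>, and <code>"none"</code>. The following sketch compares them on a small made-up tibble; <code>df</code>, its columns, and <code>total</code> are assumptions chosen for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Made-up data: x and y are used in the calculation, z is not
df &lt;- tibble(x = 1:3, y = 4:6, z = 7:9)

# "used" keeps the inputs and the new column
used &lt;- df |&gt;
  mutate(total = x + y, .keep = "used")

# "unused" keeps the columns that were not inputs, plus the new column
unused &lt;- df |&gt;
  mutate(total = x + y, .keep = "unused")

# "none" keeps only the new column
none &lt;- df |&gt;
  mutate(total = x + y, .keep = "none")

names(used)    # x, y, total
names(unused)  # z, total
names(none)    # total</pre>
</div>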
</section>
<section id="sec-select" data-type="sect2">
<h2>
<code>select()</code>
</h2>
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Select columns by name
flights |&gt;
  select(year, month, day)
#&gt; # A tibble: 336,776 × 3
#&gt; year month day
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1
#&gt; 2 2013 1 1
#&gt; 3 2013 1 1
#&gt; 4 2013 1 1
#&gt; 5 2013 1 1
#&gt; 6 2013 1 1
#&gt; # … with 336,770 more rows
# Select all columns between year and day (inclusive)
flights |&gt;
  select(year:day)
#&gt; # A tibble: 336,776 × 3
#&gt; year month day
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1
#&gt; 2 2013 1 1
#&gt; 3 2013 1 1
#&gt; 4 2013 1 1
#&gt; 5 2013 1 1
#&gt; 6 2013 1 1
#&gt; # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
flights |&gt;
  select(!year:day)
#&gt; # A tibble: 336,776 × 16
#&gt; dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin
#&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 517 515 2 830 819 11 UA 1545 N14228 EWR
#&gt; 2 533 529 4 850 830 20 UA 1714 N24211 LGA
#&gt; 3 542 540 2 923 850 33 AA 1141 N619AA JFK
#&gt; 4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
#&gt; 5 554 600 -6 812 837 -25 DL 461 N668DN LGA
#&gt; 6 554 558 -4 740 728 12 UA 1696 N39463 EWR
#&gt; # … with 336,770 more rows, 6 more variables: dest &lt;chr&gt;, air_time &lt;dbl&gt;,
#&gt; # distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated
#&gt; # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay
# Select all columns that are characters
flights |&gt;
  select(where(is.character))
#&gt; # A tibble: 336,776 × 4
#&gt; carrier tailnum origin dest
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 UA N14228 EWR IAH
#&gt; 2 UA N24211 LGA IAH
#&gt; 3 AA N619AA JFK MIA
#&gt; 4 B6 N804JB JFK BQN
#&gt; 5 DL N668DN LGA ATL
#&gt; 6 UA N39463 EWR ORD
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are a number of helper functions you can use within <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>:</p>
<ul><li>
<code>starts_with("abc")</code>: matches names that begin with “abc”.</li>
<li>
<code>ends_with("xyz")</code>: matches names that end with “xyz”.</li>
<li>
<code>contains("ijk")</code>: matches names that contain “ijk”.</li>
<li>
<code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li>
</ul><p>See <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code>matches()</code> (documented at <code><a href="#chp-https://tidyselect.r-lib.org/reference/starts_with" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/starts_with</a></code>) to select variables that match a pattern.</p>
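<p>To make the helpers listed above concrete, here is a small sketch. The tibble <code>df</code> and its column names are made up purely so that each helper has something to match:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Made-up column names chosen to exercise each helper
df &lt;- tibble(abc_1 = 1, total_xyz = 2, x1 = 3, x2 = 4, x3 = 5, hijkl = 6)

df |&gt; select(starts_with("abc")) |&gt; names()   # abc_1
df |&gt; select(ends_with("xyz")) |&gt; names()     # total_xyz
df |&gt; select(contains("ijk")) |&gt; names()      # hijkl
df |&gt; select(num_range("x", 1:3)) |&gt; names()  # x1, x2, x3</pre>
</div>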
<p>You can rename variables as you <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  select(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 1
#&gt; tail_num
#&gt; &lt;chr&gt;
#&gt; 1 N14228
#&gt; 2 N24211
#&gt; 3 N619AA
#&gt; 4 N804JB
#&gt; 5 N668DN
#&gt; 6 N39463
#&gt; # … with 336,770 more rows</pre>
</div>
</section>
<section id="rename" data-type="sect2">
<h2>
<code>rename()</code>
</h2>
<p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code> instead of <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  rename(tail_num = tailnum)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tail_num &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>It works exactly the same way as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="#chp-https://rdrr.io/pkg/janitor/man/clean_names" data-type="xref">#chp-https://rdrr.io/pkg/janitor/man/clean_names</a></code>, which provides some useful automated cleaning.</p>
</section>
<section id="relocate" data-type="sect2">
<h2>
<code>relocate()</code>
</h2>
<p>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> moves variables to the front:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  relocate(time_hour, air_time)
#&gt; # A tibble: 336,776 × 19
#&gt; time_hour air_time year month day dep_t…¹ sched…² dep_d…³ arr_t…⁴
#&gt; &lt;dttm&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013-01-01 05:00:00 227 2013 1 1 517 515 2 830
#&gt; 2 2013-01-01 05:00:00 227 2013 1 1 533 529 4 850
#&gt; 3 2013-01-01 05:00:00 160 2013 1 1 542 540 2 923
#&gt; 4 2013-01-01 05:00:00 183 2013 1 1 544 545 -1 1004
#&gt; 5 2013-01-01 06:00:00 116 2013 1 1 554 600 -6 812
#&gt; 6 2013-01-01 05:00:00 150 2013 1 1 554 558 -4 740
#&gt; # … with 336,770 more rows, 10 more variables: sched_arr_time &lt;int&gt;,
#&gt; # arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;,
#&gt; # dest &lt;chr&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, and abbreviated
#&gt; # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time</pre>
</div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to choose where to put them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  relocate(year:dep_time, .after = time_hour)
#&gt; # A tibble: 336,776 × 19
#&gt; sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 515 2 830 819 11 UA 1545 N14228 EWR IAH
#&gt; 2 529 4 850 830 20 UA 1714 N24211 LGA IAH
#&gt; 3 540 2 923 850 33 AA 1141 N619AA JFK MIA
#&gt; 4 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
#&gt; 5 600 -6 812 837 -25 DL 461 N668DN LGA ATL
#&gt; 6 558 -4 740 728 12 UA 1696 N39463 EWR ORD
#&gt; # … with 336,770 more rows, 9 more variables: air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, year &lt;int&gt;, month &lt;int&gt;,
#&gt; # day &lt;int&gt;, dep_time &lt;int&gt;, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights |&gt;
  relocate(starts_with("arr"), .before = dep_time)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day arr_time arr_delay dep_time sched_…¹ dep_d…² sched…³ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 830 11 517 515 2 819 UA
#&gt; 2 2013 1 1 850 20 533 529 4 830 UA
#&gt; 3 2013 1 1 923 33 542 540 2 850 AA
#&gt; 4 2013 1 1 1004 -18 544 545 -1 1022 B6
#&gt; 5 2013 1 1 812 -25 554 600 -6 837 DL
#&gt; 6 2013 1 1 740 12 554 558 -4 728 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³sched_arr_time</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Compare <code>air_time</code> with <code>arr_time - dep_time</code>. What do you expect to see? What do you see? What do you need to do to fix it?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code>, and <code>dep_delay</code>. How would you expect those three numbers to be related?</p></li>
<li><p>Brainstorm as many ways as possible to select <code>dep_time</code>, <code>dep_delay</code>, <code>arr_time</code>, and <code>arr_delay</code> from <code>flights</code>.</p></li>
<li><p>What happens if you include the name of a variable multiple times in a <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> call?</p></li>
<li>
<p>What does the <code><a href="#chp-https://tidyselect.r-lib.org/reference/all_of" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/all_of</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">variables &lt;- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
</div>
</li>
<li>
<p>Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">select(flights, contains("TIME"))</pre>
</div>
</li>
</ol></section>
</section>
<section id="groups" data-type="sect1">
<h1>
Groups</h1>
<p>So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, and the slice family of functions.</p>
<section id="group_by" data-type="sect2">
<h2>
<code>group_by()</code>
</h2>
<p>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(month)
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: month [12]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”.</p>
</section>
<section id="sec-summarize" data-type="sect2">
<h2>
<code>summarize()</code>
</h2>
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slight simplification; later on you’ll learn how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(month) |&gt;
  summarize(
    delay = mean(dep_delay)
  )
#&gt; # A tibble: 12 × 2
#&gt; month delay
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 NA
#&gt; 2 2 NA
#&gt; 3 3 NA
#&gt; 4 4 NA
#&gt; 5 5 NA
#&gt; 6 6 NA
#&gt; # … with 6 more rows</pre>
</div>
<p>Uh-oh! Something has gone wrong and all of our results are <code>NA</code> (pronounced “N-A”), R’s symbol for a missing value. We’ll come back to discuss missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>, but for now we’ll remove them by using <code>na.rm = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(month) |&gt;
  summarize(
    delay = mean(dep_delay, na.rm = TRUE)
  )
#&gt; # A tibble: 12 × 2
#&gt; month delay
#&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 10.0
#&gt; 2 2 10.8
#&gt; 3 3 13.2
#&gt; 4 4 13.9
#&gt; 5 5 13.0
#&gt; 6 6 20.8
#&gt; # … with 6 more rows</pre>
</div>
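<p>The effect of <code>na.rm = TRUE</code> is easiest to see on a short vector. In this sketch, <code>x</code> is a made-up vector with one missing value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Made-up values with one missing entry
x &lt;- c(10, NA, 20)

# A single missing value makes the mean missing
mean(x)
#&gt; [1] NA
# na.rm = TRUE drops missing values before averaging
mean(x, na.rm = TRUE)
#&gt; [1] 15</pre>
</div>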
<p>You can create any number of summaries in a single call to <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code>n()</code> (documented at <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code>), which returns the number of rows in each group:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(month) |&gt;
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  )
#&gt; # A tibble: 12 × 3
#&gt; month delay n
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 1 10.0 27004
#&gt; 2 2 10.8 24951
#&gt; 3 3 13.2 28834
#&gt; 4 4 13.9 28330
#&gt; 5 5 13.0 28796
#&gt; 6 6 20.8 28243
#&gt; # … with 6 more rows</pre>
</div>
<p>Means and counts can get you a surprisingly long way in data science!</p>
</section>
<section id="the-slice_-functions" data-type="sect2">
<h2>
The <code>slice_</code> functions</h2>
<p>There are five handy functions that allow you to pick off specific rows within each group:</p>
<ul><li>
<code>df |&gt; slice_head(n = 1)</code> takes the first row from each group.</li>
<li>
<code>df |&gt; slice_tail(n = 1)</code> takes the last row in each group.</li>
<li>
<code>df |&gt; slice_min(x, n = 1)</code> takes the row with the smallest value of <code>x</code>.</li>
<li>
<code>df |&gt; slice_max(x, n = 1)</code> takes the row with the largest value of <code>x</code>.</li>
<li>
<code>df |&gt; slice_sample(n = 1)</code> takes one random row.</li>
</ul><p>You can vary <code>n</code> to select more than one row, or instead of <code>n =</code>, you can use <code>prop = 0.1</code> to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(dest) |&gt;
  slice_max(arr_delay, n = 1)
#&gt; # A tibble: 108 × 19
#&gt; # Groups: dest [105]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 7 22 2145 2007 98 132 2259 153 B6
#&gt; 2 2013 7 23 1139 800 219 1250 909 221 B6
#&gt; 3 2013 1 25 123 2000 323 229 2101 328 EV
#&gt; 4 2013 8 17 1740 1625 75 2042 2003 39 UA
#&gt; 5 2013 7 22 2257 759 898 121 1026 895 DL
#&gt; 6 2013 7 10 2056 1505 351 2347 1758 349 UA
#&gt; # … with 102 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>This is similar to computing the max delay with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, but you get the whole row instead of the single summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
summarize(max_delay = max(arr_delay, na.rm = TRUE))
#&gt; Warning: There was 1 warning in `summarize()`.
#&gt; In argument `max_delay = max(arr_delay, na.rm = TRUE)`.
#&gt; In group 52: `dest = "LGA"`.
#&gt; Caused by warning in `max()`:
#&gt; ! no non-missing arguments to max; returning -Inf
#&gt; # A tibble: 105 × 2
#&gt; dest max_delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ABQ 153
#&gt; 2 ACK 221
#&gt; 3 ALB 328
#&gt; 4 ANC 39
#&gt; 5 ATL 895
#&gt; 6 AUS 349
#&gt; # … with 99 more rows</pre>
</div>
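<p>You might have noticed that the <code>slice_max()</code> result earlier in this section has 108 rows but only 105 groups. The likely explanation is ties: by default, <code>slice_min()</code> and <code>slice_max()</code> keep every row that ties for the extreme value, so <code>n = 1</code> can return more than one row per group. Here’s a minimal sketch using a small made-up tibble (<code>toy</code> and its values are hypothetical, not taken from <code>flights</code>) that compares the default with <code>with_ties = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data: two flights to "A" tie for the largest delay
toy &lt;- tibble(
  dest = c("A", "A", "A", "B", "B"),
  delay = c(10, 30, 30, 5, 20)
)

# Default: tied rows are all kept, so "A" contributes two rows
with_ties &lt;- toy |&gt;
  group_by(dest) |&gt;
  slice_max(delay, n = 1)
nrow(with_ties)
#&gt; [1] 3

# with_ties = FALSE returns exactly one row per group
no_ties &lt;- toy |&gt;
  group_by(dest) |&gt;
  slice_max(delay, n = 1, with_ties = FALSE)
nrow(no_ties)
#&gt; [1] 2</pre>
</div>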
</section>
<section id="grouping-by-multiple-variables" data-type="sect2">
<h2>
Grouping by multiple variables</h2>
<p>You can create groups using more than one variable. For example, we could make a group for each day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">daily &lt;- flights |&gt;
group_by(year, month, day)
daily
#&gt; # A tibble: 336,776 × 19
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">daily_flights &lt;- daily |&gt;
summarize(
n = n()
)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using the
#&gt; `.groups` argument.</pre>
</div>
<p>If you’re happy with this behavior, you can explicitly request it in order to suppress the message:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">daily_flights &lt;- daily |&gt;
summarize(
n = n(),
.groups = "drop_last"
)</pre>
</div>
<p>Alternatively, change the default behavior by setting a different value, e.g. <code>"drop"</code> to drop all grouping or <code>"keep"</code> to preserve the same groups.</p>
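<p>To see what each <code>.groups</code> value leaves behind, you can inspect the result with <code>group_vars()</code>. The following is a small sketch on a made-up tibble (<code>toy</code> and its values are hypothetical stand-ins for <code>flights</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical data standing in for flights
toy &lt;- tibble(
  year = c(2013, 2013, 2013, 2013),
  month = c(1, 1, 2, 2),
  day = c(1, 2, 1, 1)
)
by_day &lt;- toy |&gt;
  group_by(year, month, day)

# "drop_last" peels off the last grouping variable (day)
dropped_last &lt;- by_day |&gt;
  summarize(n = n(), .groups = "drop_last")
group_vars(dropped_last)
#&gt; [1] "year"  "month"

# "drop" removes all grouping
dropped_all &lt;- by_day |&gt;
  summarize(n = n(), .groups = "drop")
group_vars(dropped_all)
#&gt; character(0)

# "keep" preserves the original grouping
kept &lt;- by_day |&gt;
  summarize(n = n(), .groups = "keep")
group_vars(kept)
#&gt; [1] "year"  "month" "day"</pre>
</div>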
</section>
<section id="ungrouping" data-type="sect2">
<h2>
Ungrouping</h2>
<p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">daily |&gt;
ungroup() |&gt;
summarize(
delay = mean(dep_delay, na.rm = TRUE),
flights = n()
)
#&gt; # A tibble: 1 × 2
#&gt; delay flights
#&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 12.6 336776</pre>
</div>
<p>As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.</p>
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
<li><p>Find the most delayed flight to each destination.</p></li>
<li><p>How do delays vary over the course of the day? Illustrate your answer with a plot.</p></li>
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code> and friends?</p></li>
<li><p>Explain what <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> does in terms of the dplyr verbs you just learned. What does the <code>sort</code> argument to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> do?</p></li>
</ol></section>
</section>
<section id="sec-sample-size" data-type="sect1">
<h1>
Case study: aggregates and sample size</h1>
<p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">delays &lt;- flights |&gt;
filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
group_by(tailnum) |&gt;
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(delays, aes(delay)) +
geom_freqpoly(binwidth = 10)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-36-1.png" class="img-fluid" alt="A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours." width="576"/></p>
</div>
</div>
<p>Wow, there are some planes that have an <em>average</em> delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(delays, aes(n, delay)) +
geom_point(alpha = 1/10)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="A scatterplot showing number of flights versus average delay. Delays for planes with a very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases." width="576"/></p>
</div>
</div>
<p>Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases<span data-type="footnote">*cough* the law of large numbers *cough*.</span>.</p>
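<p>You can reproduce this shrinking variation with a short simulation. The sketch below is only an illustration: the delay distribution (normal, with a mean of 10 and a standard deviation of 40) and the helper <code>sd_of_mean()</code> are assumptions made up for this example, not something estimated from <code>flights</code>. For independent draws, theory says the standard deviation of a mean of <code>n</code> values is the standard deviation of a single value divided by <code>sqrt(n)</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">set.seed(1014)

# Hypothetical helper: the standard deviation of the mean of n simulated
# delays, estimated from many repeated samples
sd_of_mean &lt;- function(n, reps = 2000) {
  means &lt;- replicate(reps, mean(rnorm(n, mean = 10, sd = 40)))
  sd(means)
}

small &lt;- sd_of_mean(5)    # a "plane" with only 5 flights
large &lt;- sd_of_mean(500)  # a "plane" with 500 flights

# Theory predicts roughly 40 / sqrt(n) for each
c(small = small, large = large)</pre>
</div>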
<p>When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">delays |&gt;
filter(n &gt; 25) |&gt;
ggplot(aes(n, delay)) +
geom_point(alpha = 1/10) +
geom_smooth(se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-38-1.png" class="img-fluid" alt="Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights." width="576"/></p>
</div>
</div>
<p>Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from <code>|&gt;</code> to <code>+</code>, but it’s not too much of a hassle once you get the hang of it.</p>
<p>There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the <strong>Lahman</strong> package to compare what proportion of times a player hits the ball vs. the number of attempts they take:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">batters &lt;- Lahman::Batting |&gt;
group_by(playerID) |&gt;
summarize(
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
n = sum(AB, na.rm = TRUE)
)
batters
#&gt; # A tibble: 20,166 × 3
#&gt; playerID perf n
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 aardsda01 0 4
#&gt; 2 aaronha01 0.305 12364
#&gt; 3 aaronto01 0.229 944
#&gt; 4 aasedo01 0 5
#&gt; 5 abadan01 0.0952 21
#&gt; 6 abadfe01 0.111 9
#&gt; # … with 20,160 more rows</pre>
</div>
<p>When we plot the skill of the batter (measured by the batting average, <code>perf</code>) against the number of opportunities to hit the ball (measured by times at bat, <code>n</code>), we see two patterns:</p>
<ol type="1"><li><p>As above, the variation in our aggregate decreases as we get more data points.</p></li>
<li><p>There’s a positive correlation between skill (<code>perf</code>) and opportunities to hit the ball (<code>n</code>) because obviously teams want to give their best batters the most opportunities to hit the ball.</p></li>
</ol><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">batters |&gt;
filter(n &gt; 100) |&gt;
ggplot(aes(n, perf)) +
geom_point(alpha = 1 / 10) +
geom_smooth(se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-transform_files/figure-html/unnamed-chunk-40-1.png" class="img-fluid" alt="A scatterplot of number of batting opportunities vs batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope reaching ~0.3 when n is ~15,000." width="576"/></p>
</div>
</div>
<p>This also has important implications for ranking. If you naively sort on <code>desc(perf)</code>, the people with the best batting averages are clearly lucky, not skilled:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">batters |&gt;
arrange(desc(perf))
#&gt; # A tibble: 20,166 × 3
#&gt; playerID perf n
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 abramge01 1 1
#&gt; 2 alberan01 1 1
#&gt; 3 banisje01 1 1
#&gt; 4 bartocl01 1 1
#&gt; 5 bassdo01 1 1
#&gt; 6 birasst01 1 2
#&gt; # … with 20,160 more rows</pre>
</div>
<p>You can find a good explanation of this problem and how to overcome it at <a href="http://varianceexplained.org/r/empirical_bayes_baseball/" class="uri">http://varianceexplained.org/r/empirical_bayes_baseball/</a> and <a href="https://www.evanmiller.org/how-not-to-sort-by-average-rating.html" class="uri">https://www.evanmiller.org/how-not-to-sort-by-average-rating.html</a>.</p>
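<p>To give a flavor of the idea, here is a deliberately simplified sketch of shrinking each average toward a typical value before ranking. Everything in it is an assumption for illustration: the three players in <code>batters_toy</code> are made up, and the pseudo-counts <code>prior_hits</code> and <code>prior_at_bats</code> are hand-picked, whereas the first article above estimates the prior from the data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical players: one lucky at bat, one long solid career, one weak hitter
batters_toy &lt;- tibble(
  playerID = c("lucky01", "solid01", "weak01"),
  hits = c(1, 300, 20),
  at_bats = c(1, 1000, 100)
)

# Hand-picked pseudo-counts (an assumption): act as if every player started
# with 26 hits in 100 at bats
prior_hits &lt;- 26
prior_at_bats &lt;- 100

ranked &lt;- batters_toy |&gt;
  mutate(
    perf = hits / at_bats,
    shrunk = (hits + prior_hits) / (at_bats + prior_at_bats)
  ) |&gt;
  arrange(desc(shrunk))

ranked$playerID
#&gt; [1] "solid01" "lucky01" "weak01"</pre>
</div>
<p>The player with a single lucky hit still has the highest raw <code>perf</code>, but no longer tops the ranking because one at bat barely moves the shrunken estimate.</p>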
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
<p>For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, <code>|&gt;</code>, why we recommend it, and a little of the history that led from magrittr’s <code>%&gt;%</code> to base R’s <code>|&gt;</code>.</p>
</section>
</section>

838
oreilly/data-visualize.html Normal file

@ -0,0 +1,838 @@
<section data-type="chapter" id="chp-data-visualize">
<h1><span id="sec-data-visualisation" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data visualization</span></span></h1>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
<p>“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey</p>
</blockquote>
<p>This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the <strong>grammar of graphics</strong>, a coherent system for describing and building graphs. With ggplot2, you can do more, faster, by learning one system and applying it in many places.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
#&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#&gt; ✔ ggplot2 3.4.0.9000 ✔ purrr 0.9000.0.9000
#&gt; ✔ tibble 3.1.8 ✔ dplyr 1.0.99.9000
#&gt; ✔ tidyr 1.2.1.9001 ✔ stringr 1.4.1.9000
#&gt; ✔ readr 2.1.3 ✔ forcats 0.5.2
#&gt; ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div>
<p>That one line of code loads the core tidyverse: packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).</p>
<p>If you run this code and get the error message “there is no package called ‘tidyverse’”, you’ll need to first install it, then run <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> once again.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse")
library(tidyverse)</pre>
</div>
<p>You only need to install a package once, but you need to reload it every time you start a new session.</p>
</section>
</section>
<section id="first-steps" data-type="sect1">
<h1>
First steps</h1>
<p>Let’s use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?</p>
<section id="the-mpg-data-frame" data-type="sect2">
<h2>
The <code>mpg</code> data frame</h2>
<p>You can test your answer with the <code>mpg</code> <strong>data frame</strong> found in ggplot2 (a.k.a. <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">ggplot2::mpg</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>mpg</code> contains observations collected by the US Environmental Protection Agency on 38 car models.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mpg
#&gt; # A tibble: 234 × 11
#&gt; manufacturer model displ year cyl trans drv cty hwy fl class
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
#&gt; 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
#&gt; 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
#&gt; 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
#&gt; 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
#&gt; 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
#&gt; # … with 228 more rows</pre>
</div>
<p>Among the variables in <code>mpg</code> are:</p>
<ol type="1"><li><p><code>displ</code>, a car’s engine size, in liters.</p></li>
<li><p><code>hwy</code>, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.</p></li>
</ol><p>To learn more about <code>mpg</code>, open its help page by running <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code>.</p>
</section>
<section id="creating-a-ggplot" data-type="sect2">
<h2>
Creating a ggplot</h2>
<p>To plot <code>mpg</code>, run this code to put <code>displ</code> on the x-axis and <code>hwy</code> on the y-axis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-5-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association." width="576"/></p>
</div>
</div>
<p>The plot shows a negative relationship between engine size (<code>displ</code>) and fuel efficiency (<code>hwy</code>). In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. Does this confirm or refute your hypothesis about fuel efficiency and engine size?</p>
<p>With ggplot2, you begin a plot with the function <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> creates a coordinate system that you can add layers to. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> is the dataset to use in the graph. So <code>ggplot(data = mpg)</code> creates an empty graph, but it’s not very interesting so we won’t show it here.</p>
<p>You complete your graph by adding one or more layers to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. The function <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code> adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.</p>
<p>Each geom function in ggplot2 takes a <code>mapping</code> argument. This defines how variables in your dataset are mapped to visual properties of your plot. The <code>mapping</code> argument is always paired with <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>, and the <code>x</code> and <code>y</code> arguments of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the <code>data</code> argument, in this case, <code>mpg</code>.</p>
</section>
<section id="a-graphing-template" data-type="sect2">
<h2>
A graphing template</h2>
<p>Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(data = &lt;DATA&gt;) +
&lt;GEOM_FUNCTION&gt;(mapping = aes(&lt;MAPPINGS&gt;))</pre>
</div>
<p>The rest of this chapter will show you how to complete and extend this template to make different types of graphs. We will begin with the <code>&lt;MAPPINGS&gt;</code> component.</p>
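<p>As a quick illustration of filling in the template, here is one possible completion. The choice of variables is ours, made only for this sketch: <code>mpg</code> stands in for <code>&lt;DATA&gt;</code>, <code>geom_point</code> for <code>&lt;GEOM_FUNCTION&gt;</code>, and city versus highway mileage (<code>cty</code> and <code>hwy</code>) for <code>&lt;MAPPINGS&gt;</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# One way to fill in the template: city vs. highway fuel efficiency
p &lt;- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

# Printing the plot object draws it
p</pre>
</div>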
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Run <code>ggplot(data = mpg)</code>. What do you see?</p></li>
<li><p>How many rows are in <code>mpg</code>? How many columns?</p></li>
<li><p>What does the <code>drv</code> variable describe? Read the help for <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code> to find out.</p></li>
<li><p>Make a scatterplot of <code>hwy</code> vs <code>cyl</code>.</p></li>
<li><p>What happens if you make a scatterplot of <code>class</code> vs <code>drv</code>? Why is the plot not useful?</p></li>
</ol></section>
</section>
<section id="aesthetic-mappings" data-type="sect1">
<h1>
Aesthetic mappings</h1>
<blockquote class="blockquote">
<p>“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey</p>
</blockquote>
<p>In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher fuel efficiency than you might expect. That is, they have a higher miles per gallon than other cars with similar engine sizes. How can you explain these cars?</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-7-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. Cars with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon stand out from the rest of the data and are highlighted in red." width="576"/></p>
</div>
</div>
<p>Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the <code>class</code> value for each car. The <code>class</code> variable of the <code>mpg</code> dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).</p>
<p>You can add a third variable, like <code>class</code>, to a two-dimensional scatterplot by mapping it to an <strong>aesthetic</strong>. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-8-1.png" alt="Diagram that shows four plotting characters next to each other. The first is a large circle, the second is a small circle, the third is a triangle, and the fourth is a blue circle." width="768"/></p>
</div>
</div>
<p>You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the <code>class</code> variable to reveal the class of each car.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-9-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. The points representing each car are colored according to the class of the car. The legend on the right of the plot shows the mapping between colors and levels of the class variable: 2seater, compact, midsize, minivan, pickup, or suv." width="576"/></p>
</div>
</div>
<p>(If you prefer British English, like Hadley, you can use <code>colour</code> instead of <code>color</code>.)</p>
<p>To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as <strong>scaling</strong>. ggplot2 will also add a legend that explains which levels correspond to which values.</p>
<p>The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.</p>
<p>In the above example, we mapped <code>class</code> to the color aesthetic, but we could have mapped <code>class</code> to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a <em>warning</em> here: mapping an unordered variable (<code>class</code>) to an ordered aesthetic (<code>size</code>) is generally not a good idea because it implies a ranking that does not in fact exist.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
#&gt; Warning: Using size for a discrete variable is not advised.</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-10-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. The points representing each car are sized according to the class of the car. The legend on the right of the plot shows the mapping between sizes and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, or suv." width="576"/></p>
</div>
</div>
<p>Similarly, we could have mapped <code>class</code> to the <em>alpha</em> aesthetic, which controls the transparency of the points, or to the <em>shape</em> aesthetic, which controls the shape of the points.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit"># Left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
# Right
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-11-1.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. In the plot on the right class is mapped to the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-11-2.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. In the plot on the right class is mapped to the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable." width="384"/></p>
</div>
</div>
</div>
</div>
<p>What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.</p>
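<p>If you really do need a distinct shape for every class, one option is to supply the shapes yourself with <code>scale_shape_manual()</code>. The sketch below is one possible workaround rather than a recommendation: the choice of shapes <code>0:6</code> is arbitrary, picked only because <code>class</code> has seven distinct values and so needs seven shapes.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# class has seven distinct values, one more than the default shape palette covers
length(unique(mpg$class))
#&gt; [1] 7

# Supply seven shapes by hand (0:6 is an arbitrary choice) so no class is dropped
p &lt;- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
p</pre>
</div>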
<p>For each aesthetic, you use <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> to associate the name of the aesthetic with a variable to display. The <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. The syntax highlights a useful insight about <code>x</code> and <code>y</code>: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.</p>
<p>Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.</p>
<p>You can also <em>set</em> the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-12-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue." width="576"/></p>
</div>
</div>
<p>Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function. In other words, it goes <em>outside</em> of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>. You’ll need to pick a value that makes sense for that aesthetic:</p>
<ul><li>The name of a color as a character string.</li>
<li>The size of a point in mm.</li>
<li>The shape of a point as a number, as shown in <a href="#fig-shapes" data-type="xref">#fig-shapes</a>.</li>
</ul><div class="cell" data-layout-align="center">
<div class="cell-output-display">
<figure id="fig-vis-stat-bar"><p><img src="data-visualize_files/figure-html/fig-shapes-1.png" alt="Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue." width="576"/></p>
<figcaption>Figure 2.1: R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the <code>color</code> and <code>fill</code> aesthetics. The hollow shapes (0–14) have a border determined by <code>color</code>; the solid shapes (15–20) are filled with <code>color</code>; the filled shapes (21–24) have a border of <code>color</code> and are filled with <code>fill</code>.</figcaption>
</figure>
</div>
</div>
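<p>As a minimal sketch of setting several aesthetics at once, the following code sets <code>color</code>, <code>size</code>, and <code>shape</code> outside of <code>aes()</code>. The specific values (a size of 3 mm and shape 17, the filled triangle) are illustrative choices rather than values taken from the text:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Every point gets the same color, size (in mm), and shape (17 = filled triangle).
# None of these values depend on a variable, so they go outside aes().
p &lt;- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue",
    size = 3,
    shape = 17
  )
p</pre>
</div>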
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>What's gone wrong with this code? Why are the points not blue?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-14-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are red and the legend shows a red point that is mapped to the word blue." width="576"/></p>
</div>
</div>
</li>
<li><p>Which variables in <code>mpg</code> are categorical? Which variables are continuous? (Hint: type <code>?mpg</code> to read the documentation for the dataset.) How can you see this information when you run <code>mpg</code>?</p></li>
<li><p>Map a continuous variable to <code>color</code>, <code>size</code>, and <code>shape</code>. How do these aesthetics behave differently for categorical vs. continuous variables?</p></li>
<li><p>What happens if you map the same variable to multiple aesthetics?</p></li>
<li><p>What does the <code>stroke</code> aesthetic do? What shapes does it work with? (Hint: use <code>?geom_point</code>.)</p></li>
<li><p>What happens if you map an aesthetic to something other than a variable name, like <code>aes(color = displ &lt; 5)</code>? Note, you'll also need to specify x and y.</p></li>
</ol></section>
</section>
<section id="common-problems" data-type="sect1">
<h1>
Common problems</h1>
<p>As you start to run R code, you're likely to run into problems. Don't worry; it happens to everyone. We have all been writing R code for years, but every day we still write code that doesn't work!</p>
<p>Start by carefully comparing the code that you're running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every <code>(</code> is matched with a <code>)</code> and every <code>"</code> is paired with another <code>"</code>. Sometimes you'll run the code and nothing happens. Check the left-hand side of your console: if it's a <code>+</code>, it means that R doesn't think you've typed a complete expression and it's waiting for you to finish it. In this case, it's usually easiest to start from scratch by pressing ESCAPE to abort processing the current command.</p>
<p>One common problem when creating ggplot2 graphics is to put the <code>+</code> in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven't accidentally written code like this:</p>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg)
+ geom_point(mapping = aes(x = displ, y = hwy))</pre>
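<p>For comparison, a corrected version might look like the following minimal sketch. Here the <code>+</code> ends the first line, so R knows that the expression continues on the next one. The <code>library()</code> call and the name <code>p</code> are only there to make the snippet self-contained:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# The + comes at the end of the line, so the geom is added to the plot
# instead of being evaluated as a separate, incomplete expression.
p &lt;- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p</pre>
</div>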
<p>If you're still stuck, try the help. You can get help about any R function by running <code>?function_name</code> in the console, or selecting the function name and pressing F1 in RStudio. Don't worry if the help doesn't seem that helpful; instead, skip down to the examples and look for code that matches what you're trying to do.</p>
<p>If that doesn't help, carefully read the error message. Sometimes the answer will be buried there! But when you're new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it's likely someone else has had the same problem and has gotten help online.</p>
</section>
<section id="facets" data-type="sect1">
<h1>
Facets</h1>
<p>One way to add additional variables to a plot is by mapping them to an aesthetic. Another way, which is particularly useful for categorical variables, is to split your plot into <strong>facets</strong>, subplots that each display one subset of the data.</p>
<p>To facet your plot by a single variable, use <code>facet_wrap()</code>. The first argument of <code>facet_wrap()</code> is a formula<span data-type="footnote">Here “formula” is the name of the type of thing created by <code>~</code>, not a synonym for “equation”.</span>, which you create with <code>~</code> followed by a variable name. The variable that you pass to <code>facet_wrap()</code> should be discrete.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~cyl)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by number of cylinders, with facets spanning two rows." width="576"/></p>
</div>
</div>
<p>To facet your plot with the combination of two variables, switch from <code>facet_wrap()</code> to <code>facet_grid()</code>. The first argument of <code>facet_grid()</code> is also a formula, but now it's a double-sided formula: <code>rows ~ cols</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-16-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive." width="576"/></p>
</div>
</div>
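<p><code>facet_grid()</code> also accepts its row and column variables through the <code>rows</code> and <code>cols</code> arguments, each wrapped in <code>vars()</code>. The following is a sketch of a call that should produce the same layout as the formula above, assuming a version of ggplot2 recent enough to provide <code>vars()</code> (3.0.0 or later):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Same layout as facet_grid(drv ~ cyl):
# drv defines the rows and cyl defines the columns.
p &lt;- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv), cols = vars(cyl))
p</pre>
</div>
<p>Naming the arguments makes it explicit which variable ends up on which side of the grid, which can be easier to read than remembering the order in <code>rows ~ cols</code>.</p>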
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens if you facet on a continuous variable?</p></li>
<li>
<p>What do the empty cells in the plot with <code>facet_grid(drv ~ cyl)</code> mean? How do they relate to this plot?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-17-1.png" alt="Scatterplot of number of cylinders versus type of drive train of cars. The plot shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are rear wheel drive." width="576"/></p>
</div>
</div>
</li>
<li>
<p>What plots does the following code make? What does <code>.</code> do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)</pre>
</div>
</li>
<li>
<p>Take the following faceted plot:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)</pre>
</div>
<p>What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?</p>
</li>
<li><p>Read <code>?facet_wrap</code>. What does <code>nrow</code> do? What does <code>ncol</code> do? What other options control the layout of the individual panels? Why doesn't <code>facet_grid()</code> have <code>nrow</code> and <code>ncol</code> arguments?</p></li>
<li>
<p>Which of the following two plots makes it easier to compare engine size (<code>displ</code>) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-20-1.png" alt="Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the top plot, facets are organized across rows and in the second, across columns." width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-20-2.png" alt="Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the top plot, facets are organized across rows and in the second, across columns." width="576"/></p>
</div>
</div>
</li>
<li>
<p>Recreate this plot using <code>facet_wrap()</code> instead of <code>facet_grid()</code>. How do the positions of the facet labels change?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-21-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows." width="576"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="geometric-objects" data-type="sect1">
<h1>
Geometric objects</h1>
<p>How are these two plots similar?</p>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-22-1.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-22-2.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="384"/></p>
</div>
</div>
</div>
<p>Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different <strong>geoms</strong>.</p>
<p>A <strong>geom</strong> is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.</p>
<p>To change the geom in your plot, change the geom function that you add to <code>ggplot()</code>. For instance, to make the plots above, you can use this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Left
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Right
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))</pre>
</div>
<p>Every geom function in ggplot2 takes a <code>mapping</code> argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn't set the “shape” of a line. On the other hand, you <em>could</em> set the linetype of a line. <code>geom_smooth()</code> will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-24-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed." width="576"/></p>
</div>
</div>
<p>Here, <code>geom_smooth()</code> separates the cars into three lines based on their <code>drv</code> value, which describes a car's drive train. One line describes all of the points that have a <code>4</code> value, one line describes all of the points that have an <code>f</code> value, and one line describes all of the points that have an <code>r</code> value. Here, <code>4</code> stands for four-wheel drive, <code>f</code> for front-wheel drive, and <code>r</code> for rear-wheel drive.</p>
<p>If this sounds strange, we can make it clearer by overlaying the lines on top of the raw data and then coloring everything according to <code>drv</code>.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-25-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with points (colored by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed." width="576"/></p>
</div>
</div>
<p>Notice that this plot contains two geoms in the same graph! If this makes you excited, buckle up. You will learn how to place multiple geoms in the same plot very soon.</p>
<p>ggplot2 provides more than 40 geoms, and extension packages provide even more (see <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a> for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at <a href="https://rstudio.com/resources/cheatsheets" class="uri">https://rstudio.com/resources/cheatsheets</a>. To learn more about any single geom, use the help (e.g., <code>?geom_smooth</code>).</p>
<p>Many geoms, like <code>geom_smooth()</code>, use a single geometric object to display multiple rows of data. For these geoms, you can set the <code>group</code> aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the <code>linetype</code> example). It is convenient to rely on this feature because the <code>group</code> aesthetic by itself does not add a legend or distinguishing features to the geoms.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-26-1.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-26-2.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 33.3%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-26-3.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="288"/></p>
</div>
</div>
</div>
</div>
<p>To display multiple geoms in the same plot, add multiple geom functions to <code>ggplot()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed." width="576"/></p>
</div>
</div>
<p>This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display <code>cty</code> instead of <code>hwy</code>. You'd need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to <code>ggplot()</code>. ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()</pre>
</div>
<p>If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings <em>for that layer only</em>. This makes it possible to display different aesthetics in different layers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-29-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it." width="576"/></p>
</div>
</div>
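<p>A local mapping can also remove a global one. As a minimal sketch (an illustrative variation, not one of the plots above), suppose <code>color = drv</code> is mapped globally. Setting <code>color = NULL</code> inside <code>geom_smooth()</code> drops the color mapping for that layer only, so the points stay colored while a single curve is fitted to all of the cars:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Global mappings: x, y, and color apply to every layer by default.
p &lt;- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  # Local override: remove color for this layer only, so the data are
  # no longer split by drv and one smooth curve is drawn.
  geom_smooth(mapping = aes(color = NULL))
p</pre>
</div>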
<p>You can use the same idea to specify different <code>data</code> for each layer. Here, our smooth line displays just a subset of the <code>mpg</code> dataset, the subcompact cars. The local data argument in <code>geom_smooth()</code> overrides the global data argument in <code>ggplot()</code> for that layer only.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-30-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid, without a confidence interval." width="576"/></p>
</div>
</div>
<p>(You'll learn how <code>filter()</code> works in the chapter on data transformations: for now, just know that this command selects only the subcompact cars.)</p>
<section id="exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?</p></li>
<li>
<p>Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)</pre>
</div>
</li>
<li>
<p>Earlier in this chapter we used <code>show.legend</code> without explaining it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )</pre>
</div>
<p>What does <code>show.legend = FALSE</code> do here? What happens if you remove it? Why do you think we used it earlier?</p>
</li>
<li><p>What does the <code>se</code> argument to <code>geom_smooth()</code> do?</p></li>
<li>
<p>Will these two graphs look different? Why/why not?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))</pre>
</div>
</li>
<li>
<p>Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it's <code>drv</code>.</p>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-1.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-2.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-3.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-4.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
</div>
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-5.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-34-6.png" alt="There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars are on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. On the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and a separate smooth curve with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border." width="384"/></p>
</div>
</div>
</div>
</li>
</ol></section>
</section>
<section id="statistical-transformations" data-type="sect1">
<h1>
Statistical transformations</h1>
<p>Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_col()</a></code>. The following chart displays the total number of diamonds in the <code>diamonds</code> dataset, grouped by <code>cut</code>. The <code>diamonds</code> dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the <code>price</code>, <code>carat</code>, <code>color</code>, <code>clarity</code>, and <code>cut</code> of each diamond. The chart shows that more diamonds are available with high-quality cuts than with low-quality cuts.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-35-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
</div>
</div>
<p>On the x-axis, the chart displays <code>cut</code>, a variable from <code>diamonds</code>. On the y-axis, it displays count, but count is not a variable in <code>diamonds</code>! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:</p>
<ul><li><p>bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.</p></li>
<li><p>smoothers fit a model to your data and then plot predictions from the model.</p></li>
<li><p>boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.</p></li>
</ul><p>The algorithm used to calculate new values for a graph is called a <strong>stat</strong>, short for statistical transformation. <a href="#fig-vis-stat-bar" data-type="xref">#fig-vis-stat-bar</a> shows how this process works with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="images/visualization-stat-bar.png" style="width:100.0%" alt="A figure demonstrating three steps of creating a bar chart. Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar() transforms the data with the count stat, which returns a data set of cut values and counts. Step 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis."/></p>
<figcaption class="figure-caption">Figure 2.2: When creating a bar chart, we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.</figcaption>
</figure>
</div>
</div>
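<p>As a rough sketch of what the count stat does behind the scenes, you can reproduce the aggregation step by hand. This assumes the dplyr package is loaded alongside ggplot2; <code>count()</code> is used here only for illustration and is not how ggplot2 implements the stat internally.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2) # provides the diamonds dataset
library(dplyr)   # provides count(); an assumption of this sketch

# Count the number of rows for each level of cut, mimicking what the
# count stat computes before geom_bar() draws the bars.
count(diamonds, cut)</pre>
</div>
<p>The resulting table has one row per cut and a column <code>n</code> holding the values that become the bar heights.</p>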
<p>You can learn which stat a geom uses by inspecting the default value for the <code>stat</code> argument. For example, <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">?geom_bar</a></code> shows that the default value for <code>stat</code> is “count”, which means that <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code> uses <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">stat_count()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">stat_count()</a></code> is documented on the same page as <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code>. If you scroll down, the section called “Computed variables” explains that it computes two new variables: <code>count</code> and <code>prop</code>.</p>
<p>You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">stat_count()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-37-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
</div>
</div>
<p>This works because every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:</p>
<ol type="1"><li>
<p>You might want to override the default stat. In the code below, we change the stat of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code> from count (the default) to identity. This lets us map the height of the bars to the raw values of a <span class="math inline">\(y\)</span> variable. Unfortunately, when people talk about bar charts casually, they might be referring to either this type of bar chart, where the height of the bar is already present in the data, or the previous type, where the height of the bar is generated by counting rows.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">demo &lt;- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-38-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
</div>
</div>
<p>(Don’t worry that you haven’t seen <code>&lt;-</code> or <code><a href="https://tibble.tidyverse.org/reference/tribble">tribble()</a></code> before. You might be able to guess their meaning from the context, and you’ll learn exactly what they do soon!)</p>
</li>
<li>
<p>You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-39-1.png" alt="Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and Ideal 0.40." width="576"/></p>
</div>
</div>
<p>To find the variables computed by the stat, look for the section titled “Computed variables” in the help for <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code>.</p>
</li>
<li>
<p>You might want to draw greater attention to the statistical transformation in your code. For example, you might use <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary">stat_summary()</a></code>, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-40-1.png" alt="A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point." width="576"/></p>
</div>
</div>
</li>
</ol><p>ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram">?stat_bin</a></code>. To see a complete list of stats, try the <a href="https://rstudio.com/resources/cheatsheets">ggplot2 cheatsheet</a>.</p>
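<p>If you would rather inspect the computed variables directly than read about them, one option is ggplot2’s <code>layer_data()</code> helper, which returns the data frame a layer produces after its stat has run. The following is a minimal sketch; the names <code>p</code> and <code>computed</code> are just illustrative choices.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Build the bar chart, but store it instead of printing it.
p &lt;- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Extract the data computed for the first (and only) layer. The result
# includes the count and prop variables alongside positional columns.
computed &lt;- layer_data(p, 1)
computed[, c("x", "count", "prop")]</pre>
</div>
<p>Comparing this output with and without <code>group = 1</code> in the aesthetic mapping is a quick way to see how grouping changes what a stat computes.</p>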
<section id="exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is the default geom associated with <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary">stat_summary()</a></code>? How could you rewrite the previous plot to use that geom function instead of the stat function?</p></li>
<li><p>What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_col()</a></code> do? How is it different from <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar">geom_bar()</a></code>?</p></li>
<li><p>Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?</p></li>
<li><p>What variables does <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth">stat_smooth()</a></code> compute? What parameters control its behavior?</p></li>
<li>
<p>In our proportion bar chart, we need to set <code>group = 1</code>. Why? In other words, what is the problem with these two graphs?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))</pre>
</div>
</li>
</ol></section>
</section>
<section id="position-adjustments" data-type="sect1">
<h1>
Position adjustments</h1>
<p>There’s one more piece of magic associated with bar charts. You can color a bar chart using either the <code>color</code> aesthetic, or, more usefully, <code>fill</code>:</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-42-1.png" alt="Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-42-2.png" alt="Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category." width="384"/></p>
</div>
</div>
</div>
</div>
<p>Note what happens if you map the fill aesthetic to another variable, like <code>clarity</code>: the bars are automatically stacked. Each colored rectangle represents a combination of <code>cut</code> and <code>clarity</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-43-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level." width="576"/></p>
</div>
</div>
<p>The stacking is performed automatically using the <strong>position adjustment</strong> specified by the <code>position</code> argument. If you don’t want a stacked bar chart, you can use one of three other options: <code>"identity"</code>, <code>"dodge"</code> or <code>"fill"</code>.</p>
<ul><li>
<p><code>position = "identity"</code> will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting <code>alpha</code> to a small value, or completely transparent by setting <code>fill = NA</code>.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = "identity")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-44-1.png" alt="Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-44-2.png" alt="Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors." width="384"/></p>
</div>
</div>
</div>
</div>
<p>The identity position adjustment is more useful for 2d geoms, like points, where it is the default.</p>
</li>
<li>
<p><code>position = "fill"</code> works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-45-1.png" alt="Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Height of each bar is 1 and heights of the colored segments are proportional to the proportion of diamonds with a given clarity level within a given cut level." width="576"/></p>
</div>
</div>
</li>
<li>
<p><code>position = "dodge"</code> places overlapping objects directly <em>beside</em> one another. This makes it easier to compare individual values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-46-1.png" alt="Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity." width="576"/></p>
</div>
</div>
</li>
</ul><p>There’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-47-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association." width="576"/></p>
</div>
</div>
<p>The underlying values of <code>hwy</code> and <code>displ</code> are rounded so the points appear on a grid and many points overlap each other. This problem is known as <strong>overplotting</strong>. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of <code>hwy</code> and <code>displ</code> that contains 109 values?</p>
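<p>One way to gauge how much overplotting there is, sketched here under the assumption that the dplyr package is available, is to count how many observations share each combination of <code>displ</code> and <code>hwy</code>. The name <code>overlaps</code> is purely illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2) # provides the mpg dataset
library(dplyr)   # provides count(); an assumption of this sketch

# Tally observations per (displ, hwy) pair, most crowded locations first.
overlaps &lt;- count(mpg, displ, hwy, sort = TRUE)

nrow(overlaps) # number of distinct point locations
nrow(mpg)      # number of observations
head(overlaps) # the most heavily overplotted locations</pre>
</div>
<p>Any row with <code>n</code> greater than 1 marks a location where several observations are drawn on top of each other.</p>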
<p>You can avoid this gridding by setting the position adjustment to “jitter”. <code>position = "jitter"</code> adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-48-1.png" alt="Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association." width="576"/></p>
</div>
</div>
<p>Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph <em>more</em> revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for <code>geom_point(position = "jitter")</code>: <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter">geom_jitter()</a></code>.</p>
<p>To learn more about a position adjustment, look up the help page associated with each adjustment: <code><a href="https://ggplot2.tidyverse.org/reference/position_dodge">?position_dodge</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_stack">?position_fill</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_identity">?position_identity</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_jitter">?position_jitter</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/position_stack">?position_stack</a></code>.</p>
<section id="exercises-5" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>What is the problem with this plot? How could you improve it?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-49-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset." width="576"/></p>
</div>
</div>
</li>
<li><p>What parameters to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter">geom_jitter()</a></code> control the amount of jittering?</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter">geom_jitter()</a></code> with <code><a href="https://ggplot2.tidyverse.org/reference/geom_count">geom_count()</a></code>.</p></li>
<li><p>What’s the default position adjustment for <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot">geom_boxplot()</a></code>? Create a visualization of the <code>mpg</code> dataset that demonstrates it.</p></li>
</ol></section>
</section>
<section id="coordinate-systems" data-type="sect1">
<h1>
Coordinate systems</h1>
<p>Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are three other coordinate systems that are occasionally helpful.</p>
<ul><li>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_flip">coord_flip()</a></code> switches the x and y axes. This is useful, for example, if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-50-1.png" alt="Two side-by-side box plots of highway fuel efficiency of cars. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. The second plot makes it easier to read the names of the levels of class since they are listed down the y-axis, avoiding overlap." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-50-2.png" alt="Two side-by-side box plots of highway fuel efficiency of cars. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. The second plot makes it easier to read the names of the levels of class since they are listed down the y-axis, avoiding overlap." width="384"/></p>
</div>
</div>
</div>
</div>
<p>However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(y = class, x = hwy)) +
  geom_boxplot()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-51-1.png" alt="Side-by-side box plots of highway fuel efficiency of cars. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
</div>
</div>
</li>
<li>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_map">coord_quickmap()</a></code> sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the <a href="https://ggplot2-book.org/maps">Maps chapter</a> of <em>ggplot2: Elegant graphics for data analysis</em>.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">nz &lt;- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-52-1.png" alt="Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-52-2.png" alt="Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct." width="384"/></p>
</div>
</div>
</div>
</div>
</li>
<li>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_polar">coord_polar()</a></code> uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">bar &lt;- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-53-1.png" alt="There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-53-2.png" alt="There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data." width="384"/></p>
</div>
</div>
</div>
</div>
</li>
</ul>
<section id="exercises-6" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Turn a stacked bar chart into a pie chart using <code><a href="https://ggplot2.tidyverse.org/reference/coord_polar">coord_polar()</a></code>.</p></li>
<li><p>What does <code><a href="https://ggplot2.tidyverse.org/reference/labs">labs()</a></code> do? Read the documentation.</p></li>
<li><p>What’s the difference between <code><a href="https://ggplot2.tidyverse.org/reference/coord_map">coord_quickmap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/coord_map">coord_map()</a></code>?</p></li>
<li>
<p>What does the plot below tell you about the relationship between city and highway mpg? Why is <code><a href="https://ggplot2.tidyverse.org/reference/coord_fixed">coord_fixed()</a></code> important? What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline">geom_abline()</a></code> do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()</pre>
<div class="cell-output-display">
<p><img src="data-visualize_files/figure-html/unnamed-chunk-54-1.png" alt="Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The plot also has a straight line that follows the trend of the relationship between the variables but does not go through the cloud of points; it is beneath it." width="576"/></p>
</div>
</div>
</li>
</ol></section>
</section>
<section id="the-layered-grammar-of-graphics" data-type="sect1">
<h1>
The layered grammar of graphics</h1>
<p>In the previous sections, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make <em>any</em> type of plot with ggplot2. To see this, let’s add position adjustments, stats, coordinate systems, and faceting to our code template:</p>
<pre><code>ggplot(data = &lt;DATA&gt;) +
  &lt;GEOM_FUNCTION&gt;(
    mapping = aes(&lt;MAPPINGS&gt;),
    stat = &lt;STAT&gt;,
    position = &lt;POSITION&gt;
  ) +
  &lt;COORDINATE_FUNCTION&gt; +
  &lt;FACET_FUNCTION&gt;</code></pre>
<p>Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.</p>
<p>The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe <em>any</em> plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.</p>
<p>To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="images/visualization-grammar-1.png" alt="A figure demonstrating the steps for going from raw data to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Steps 1 and 2 are annotated. Step 1. Begin with the diamonds dataset. Step 2. Compute counts for each cut value with stat_count()." width="1400"/></p>
</div>
</div>
<p>Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="images/visualization-grammar-2.png" alt="A figure demonstrating the steps for going from raw data to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Each level is also mapped to a color. Steps 3 and 4 are annotated. Step 3. Represent each observation with a bar. Step 4. Map the fill of each bar to the ..count.. variable." width="1400"/></p>
</div>
</div>
<p>You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="images/visualization-grammar-3.png" alt="A figure demonstrating the steps for going from raw data to bar chart where each bar represents one level of cut and filled in with a different color. Steps 5 and 6 are annotated. Step 5. Place geoms in a Cartesian coordinate system. Step 6. Map the y values to ..count.. and the x values to cut." width="1400"/></p>
</div>
</div>
<p>You could use this method to build <em>any</em> plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.</p>
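<p>As a rough illustration of the template, the sketch below fills in all seven parameters for a bar chart of diamond cuts similar to the one in the figures above. Every value shown is either a ggplot2 default for <code>geom_bar()</code> or a no-op, so it is spelled out only to show where each piece goes; the object name <code>p</code> is our own choice.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Each argument fills one slot of the template. All of them are written
# out only for illustration, since ggplot2 would supply them anyway.
p &lt;- ggplot(data = diamonds) +          # &lt;DATA&gt;
  geom_bar(                             # &lt;GEOM_FUNCTION&gt;
    mapping = aes(x = cut, fill = cut), # &lt;MAPPINGS&gt;
    stat = "count",                     # &lt;STAT&gt;
    position = "stack"                  # &lt;POSITION&gt;
  ) +
  coord_cartesian() +                   # &lt;COORDINATE_FUNCTION&gt;
  facet_null()                          # &lt;FACET_FUNCTION&gt;

p</pre>
</div>
<p>Dropping the <code>stat</code>, <code>position</code>, coordinate, and facet pieces should give the same plot, which is why you rarely need to write them in practice.</p>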
<p>If you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “<a href="#chp-https://vita.had.co.nz/papers/layered-grammar" data-type="xref">The Layered Grammar of Graphics</a>”, the scientific paper that describes the theory of ggplot2 in detail.</p>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, colour, size and shape. You then learned about facets, which allow you to create small multiples, where each panel contains a subgroup from your data. We then gave you a whirlwind tour of the geoms and stats which control the “type” of graph you get, whether it’s a scatterplot, line plot, histogram, or something else. Position adjustments control the fine details of position when geoms might otherwise overlap, and coordinate systems allow you to fundamentally change what <code>x</code> and <code>y</code> mean.</p>
<p>We’ll use visualizations again and again throughout this book, introducing new techniques as we need them. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, <a href="#chp-https://ggplot2-book" data-type="xref"><em>ggplot2: Elegant graphics for data analysis</em></a>. Other useful resources are the <a href="#chp-https://r-graphics" data-type="xref"><em>R Graphics Cookbook</em></a> by Winston Chang and <a href="#chp-https://clauswilke.com/dataviz/" data-type="xref"><em>Fundamentals of Data Visualization</em></a> by Claus Wilke.</p>
<p>With the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organized as you write increasing amounts of R code.</p>
</section>
</section>

770
oreilly/databases.html Normal file
View File

@ -0,0 +1,770 @@
<section data-type="chapter" id="chp-databases">
<h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
</div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a <code>.csv</code> for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.</p>
<p>In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL<span data-type="footnote">SQL is either pronounced “s”-“q”-“l” or “sequel”.</span> query. <strong>SQL</strong>, short for <strong>s</strong>tructured <strong>q</strong>uery <strong>l</strong>anguage, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries and then executes them with DBI.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(DBI)
library(dbplyr)
library(tidyverse)</pre>
</div>
</section>
</section>
<section id="database-basics" data-type="sect1">
<h1>
Database basics</h1>
<p>At the simplest level, you can think about a database as a collection of data frames, called <strong>tables</strong> in database terminology. Like a data.frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high level differences between data frames and database tables:</p>
<ul><li><p>Database tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).</p></li>
<li><p>Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.</p></li>
<li><p>Most classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called <strong>row-oriented</strong> because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of <strong>column-oriented</strong> databases that make analyzing the existing data much faster.</p></li>
</ul><p>Databases are run by database management systems (<strong>DBMS</strong>s for short), which come in three basic forms:</p>
<ul><li>
<strong>Client-server</strong> DBMSs run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMSs include PostgreSQL, MariaDB, SQL Server, and Oracle.</li>
<li>
<strong>Cloud</strong> DBMSs, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMSs, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.</li>
<li>
<strong>In-process</strong> DBMSs, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.</li>
</ul></section>
<section id="connecting-to-a-database" data-type="sect1">
<h1>
Connecting to a database</h1>
<p>To connect to the database from R, you’ll use a pair of packages:</p>
<ul><li><p>You’ll always use DBI (<strong>d</strong>ata<strong>b</strong>ase <strong>i</strong>nterface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.</p></li>
<li><p>You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for Postgres and RMariaDB for MySQL.</p></li>
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMSs. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
<p>Concretely, you create a database connection using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbConnect" data-type="xref">dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(
  RMariaDB::MariaDB(),
  username = "foo"
)
con &lt;- DBI::dbConnect(
  RPostgres::Postgres(),
  host = "databases.mycompany.com",
  port = 1234
)</pre>
</div>
<p>The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (<strong>d</strong>ata<strong>b</strong>ase <strong>a</strong>dministrator). The initial setup will often take a little fiddling (and maybe some googling) to get right, but you’ll generally only need to do it once.</p>
<section id="in-this-book" data-type="sect2">
<h2>
In this book</h2>
<p>Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.</p>
<p>Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(duckdb::duckdb())</pre>
</div>
<p>duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the <code>dbdir</code> argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (<a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>), it’s reasonable to store it in the <code>duckdb</code> directory of the current project:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
</div>
</section>
<section id="sec-load-data" data-type="sect2">
<h2>
Load some data</h2>
<p>Since this is a new database, we need to start by adding some data. Here we’ll add the <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">dbWriteTable()</a></code>. The simplest usage of <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
</div>
<p>If you’re using duckdb in a real project, we highly recommend learning about <code>duckdb_read_csv()</code> and <code>duckdb_register_arrow()</code>. These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.</p>
<p>We’ll also show off a useful technique for loading multiple files into a database in <a href="#sec-save-database" data-type="xref">#sec-save-database</a>.</p>
</section>
</section>
<section id="dbi-basics" data-type="sect1">
<h1>
DBI basics</h1>
<p>Now that we’ve connected to a database with some data in it, let’s perform some basic operations with DBI.</p>
<section id="whats-there" data-type="sect2">
<h2>
What’s there?</h2>
<p>The most important database objects for data scientists are tables. DBI provides two useful functions to either list all the tables in the database<span data-type="footnote">At least, all the tables that you have permission to see.</span> or to check if a specific table already exists:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dbListTables(con)
#&gt; [1] "diamonds" "mpg"
dbExistsTable(con, "foo")
#&gt; [1] FALSE</pre>
</div>
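<p>One place these two functions come in handy is guarding a write, since <code>dbWriteTable()</code> errors by default if the table already exists. The following is a minimal sketch under our own assumptions: it opens a fresh temporary duckdb connection so that it runs on its own, and the names <code>has_mpg</code> and <code>fields</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(DBI)

# A throwaway in-memory database, separate from the one used in the text.
con &lt;- dbConnect(duckdb::duckdb())

# Only create the table if it is not already there.
if (!dbExistsTable(con, "mpg")) {
  dbWriteTable(con, "mpg", ggplot2::mpg)
}

has_mpg &lt;- dbExistsTable(con, "mpg")  # should now be TRUE
fields &lt;- dbListFields(con, "mpg")    # column names of the new table

dbDisconnect(con, shutdown = TRUE)</pre>
</div>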
</section>
<section id="extract-some-data" data-type="sect2">
<h2>
Extract some data</h2>
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">dbReadTable()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con |&gt;
  dbReadTable("diamonds") |&gt;
  as_tibble()
#&gt; # A tibble: 53,940 × 10
#&gt; carat cut color clarity depth table price x y z
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#&gt; 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#&gt; 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#&gt; 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#&gt; 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with 53,934 more rows</pre>
</div>
<p><code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">dbReadTable()</a></code> returns a <code>data.frame</code> so we use <code><a href="#chp-https://tibble.tidyverse.org/reference/as_tibble" data-type="xref">as_tibble()</a></code> to convert it into a tibble so that it prints nicely.</p>
<p>In real life, it’s rare that you’ll use <code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">dbReadTable()</a></code> because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.</p>
</section>
<section id="sec-dbGetQuery" data-type="sect2">
<h2>
Run a query</h2>
<p>The way you’ll usually retrieve data is with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sql &lt;- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price &gt; 15000
"
as_tibble(dbGetQuery(con, sql))
#&gt; # A tibble: 1,655 × 5
#&gt; carat cut clarity color price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium VS2 E 15002
#&gt; 2 1.19 Ideal VVS1 F 15005
#&gt; 3 2.1 Premium SI1 I 15007
#&gt; 4 1.69 Ideal SI1 D 15011
#&gt; 5 1.5 Very Good VVS2 G 15013
#&gt; 6 1.73 Very Good VS1 G 15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where <code>price</code> is greater than 15,000.</p>
<p>You’ll need to be a little careful with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">dbGetQuery()</a></code> since it can potentially return more data than fits in memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbSendQuery" data-type="xref">dbSendQuery()</a></code> to get a “result set” which you can page through by calling <code><a href="#chp-https://dbi.r-dbi.org/reference/dbFetch" data-type="xref">dbFetch()</a></code> until <code><a href="#chp-https://dbi.r-dbi.org/reference/dbHasCompleted" data-type="xref">dbHasCompleted()</a></code> returns <code>TRUE</code>.</p>
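<p>For the curious, the paging pattern might look roughly like the sketch below. It is an illustration rather than a recommended recipe: the page size of 500 is arbitrary, the counter <code>n_rows</code> stands in for whatever processing you would do on each page, and the code opens its own temporary duckdb connection so that it runs on its own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(DBI)

con &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)

# Send the query; this creates a result set to fetch from.
res &lt;- dbSendQuery(
  con,
  "SELECT carat, price FROM diamonds WHERE price &gt; 15000"
)

n_rows &lt;- 0
while (!dbHasCompleted(res)) {
  page &lt;- dbFetch(res, n = 500)  # at most 500 rows per page
  # Process `page` here; this sketch only counts its rows.
  n_rows &lt;- n_rows + nrow(page)
}

# Release the result set once you are done with it.
dbClearResult(res)
dbDisconnect(con, shutdown = TRUE)

n_rows</pre>
</div>
<p>Because only one page is held in R at a time, memory use is bounded by the page size rather than by the size of the full result.</p>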
</section>
<section id="other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">dbWriteTable()</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
</section>
</section>
<section id="dbplyr-basics" data-type="sect1">
<h1>
dbplyr basics</h1>
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="#chp-https://dtplyr.tidyverse" data-type="xref">dtplyr</a> which translates to <a href="#chp-https://r-datatable" data-type="xref">data.table</a>, and <a href="#chp-https://multidplyr.tidyverse" data-type="xref">multidplyr</a> which executes your code on multiple cores.</p>
<p>To use dbplyr, you must first use <code><a href="#chp-https://dplyr.tidyverse.org/reference/tbl" data-type="xref">tbl()</a></code> to create an object that represents a database table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, "diamonds")
diamonds_db
#&gt; # Source: table&lt;diamonds&gt; [?? x 10]
#&gt; # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; carat cut color clarity depth table price x y z
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#&gt; 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#&gt; 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#&gt; 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#&gt; 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with more rows</pre>
</div>
<div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
<p>This object is <strong>lazy</strong>; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db &lt;- diamonds_db |&gt;
  filter(price &gt; 15000) |&gt;
  select(carat:clarity, price)
big_diamonds_db
#&gt; # Source: SQL [?? x 5]
#&gt; # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; carat cut color clarity price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium E VS2 15002
#&gt; 2 1.19 Ideal F VVS1 15005
#&gt; 3 2.1 Premium I SI1 15007
#&gt; 4 1.69 Ideal D SI1 15011
#&gt; 5 1.5 Very Good G VVS2 15013
#&gt; 6 1.73 Very Good G VS1 15014
#&gt; # … with more rows</pre>
</div>
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
<p>You can see the SQL code generated by the dbplyr function <code><a href="#chp-https://dplyr.tidyverse.org/reference/explain" data-type="xref">show_query()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db |&gt;
  show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT carat, cut, color, clarity, price
#&gt; FROM diamonds
#&gt; WHERE (price &gt; 15000.0)</pre>
</div>
<p>To get all the data back into R, you call <code><a href="#chp-https://dplyr.tidyverse.org/reference/compute" data-type="xref">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds &lt;- big_diamonds_db |&gt;
  collect()
big_diamonds
#&gt; # A tibble: 1,655 × 5
#&gt; carat cut color clarity price
#&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 1.54 Premium E VS2 15002
#&gt; 2 1.19 Ideal F VVS1 15005
#&gt; 3 2.1 Premium I SI1 15007
#&gt; 4 1.69 Ideal D SI1 15011
#&gt; 5 1.5 Very Good G VVS2 15013
#&gt; 6 1.73 Very Good G VS1 15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="#chp-https://dplyr.tidyverse.org/reference/compute" data-type="xref">collect()</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p>
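<p>A small sketch of that workflow, under our own assumptions, is shown below: the aggregation runs in the database and only the summary rows come back to R. The summary columns <code>n</code> and <code>avg_price</code> and the object name <code>price_by_cut</code> are illustrative, and the code opens its own temporary duckdb connection so that it runs on its own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(DBI)
library(dplyr)
library(dbplyr)

con &lt;- dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)

# Aggregate in the database: only one row per cut travels back to R.
price_by_cut &lt;- tbl(con, "diamonds") |&gt;
  group_by(cut) |&gt;
  summarize(n = n(), avg_price = mean(price, na.rm = TRUE)) |&gt;
  collect()

dbDisconnect(con, shutdown = TRUE)

# From here on, price_by_cut is an ordinary in-memory tibble.
price_by_cut |&gt;
  arrange(desc(avg_price))</pre>
</div>
<p>Calling <code>collect()</code> as late as possible keeps the heavy lifting in the database and the amount of data transferred small.</p>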
</section>
<section id="sql" data-type="sect1">
<h1>
SQL</h1>
<p>The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.</p>
<p>We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: <code>flights</code> and <code>planes</code>. These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dbplyr::copy_nycflights13(con)
#&gt; Creating table: airlines
#&gt; Creating table: airports
#&gt; Creating table: flights
#&gt; Creating table: planes
#&gt; Creating table: weather
flights &lt;- tbl(con, "flights")
planes &lt;- tbl(con, "planes")</pre>
</div>
<section id="sql-basics" data-type="sect2">
<h2>
SQL basics</h2>
<p>The top-level components of SQL are called <strong>statements</strong>. Common statements include <code>CREATE</code> for defining new tables, <code>INSERT</code> for adding data, and <code>SELECT</code> for retrieving data. We will focus on <code>SELECT</code> statements, also called <strong>queries</strong>, because they are almost exclusively what you’ll use as a data scientist.</p>
<p>A query is made up of <strong>clauses</strong>. There are five important clauses: <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>ORDER BY</code>, and <code>GROUP BY</code>. Every query must have the <code>SELECT</code><span data-type="footnote">Confusingly, depending on the context, <code>SELECT</code> is either a statement or a clause. To avoid this confusion, we’ll generally say “query” instead of “<code>SELECT</code> statement”.</span> and <code>FROM</code><span data-type="footnote">Ok, technically, only the <code>SELECT</code> is required, since you can write queries like <code>SELECT 1+1</code> to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a <code>FROM</code> clause.</span> clauses and the simplest query is <code>SELECT * FROM table</code>, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
planes |&gt; show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM planes</pre>
</div>
<p><code>WHERE</code> and <code>ORDER BY</code> control which rows are included and how they are ordered:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dest == "IAH") |&gt;
arrange(dep_delay) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest = 'IAH')
#&gt; ORDER BY dep_delay</pre>
</div>
<p><code>GROUP BY</code> converts the query to a summary, causing aggregation to happen:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT dest, AVG(dep_delay) AS dep_delay
#&gt; FROM flights
#&gt; GROUP BY dest</pre>
</div>
<p>There are two important differences between dplyr verbs and SELECT clauses:</p>
<ul><li>In SQL, case doesn’t matter: you can write <code>select</code>, <code>SELECT</code>, or even <code>SeLeCt</code>. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.</li>
<li>In SQL, order matters: you must always write the clauses in the order <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>GROUP BY</code>, <code>ORDER BY</code>. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first <code>FROM</code>, then <code>WHERE</code>, <code>GROUP BY</code>, <code>SELECT</code>, and <code>ORDER BY</code>.</li>
</ul><p>The following sections explore each clause in more detail.</p>
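<p>To see the clause ordering in action before looking at each clause individually, here is a small sketch that uses four verbs at once. It relies on <code>lazy_frame()</code> and <code>simulate_dbi()</code>, two dbplyr helpers not otherwise used in this chapter, to generate SQL without a database; the name <code>flights_sketch</code> and its single row of values are made up, and the exact SQL text (including how identifiers are quoted) will vary with your dbplyr version.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(dbplyr)

# lazy_frame() builds a table that exists only for SQL generation, so no
# database is needed. The column names mimic flights; the values are irrelevant.
flights_sketch &lt;- lazy_frame(
  dest = "IAH", dep_delay = 0,
  con = simulate_dbi(),
  .name = "flights"
)

# The dplyr verbs are written in the order the operations happen ...
delay_query &lt;- flights_sketch |&gt;
  filter(!is.na(dep_delay)) |&gt;                            # WHERE
  group_by(dest) |&gt;                                       # GROUP BY
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |&gt; # SELECT
  arrange(dest)                                           # ORDER BY

# ... while the generated SQL lists its clauses in SQL's fixed order.
delay_query |&gt; show_query()</pre>
</div>
<p>Whatever order you write the verbs in, the printed query should keep <code>SELECT</code> first and <code>ORDER BY</code> last.</p>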
<div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, in_schema("sales", "diamonds"))
diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">on GitHub</a> to help us do better.</p>
</div>
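<p>One way to get a feel for these per-database translations is to ask dbplyr for the SQL it would send to different backends. The sketch below uses dbplyr’s simulated connections (<code>simulate_postgres()</code> and <code>simulate_mssql()</code>), which this chapter doesn’t otherwise cover; <code>spread_sketch()</code> is a hypothetical helper written for this example. We would expect the two queries to use different function names for the standard deviation, but the exact text depends on your dbplyr version, and identifier quoting in simulated output may not match what a live connection produces.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(dbplyr)

# Hypothetical helper: the same dplyr code, translated for a given backend.
spread_sketch &lt;- function(con) {
  lazy_frame(dest = "IAH", dep_delay = 0, con = con, .name = "flights") |&gt;
    group_by(dest) |&gt;
    summarise(spread = sd(dep_delay, na.rm = TRUE))
}

# PostgreSQL dialect
spread_sketch(simulate_postgres()) |&gt; show_query()

# SQL Server dialect
spread_sketch(simulate_mssql()) |&gt; show_query()</pre>
</div>
<p>Because no real database is involved, this is a cheap way to check a translation before running it against a production system.</p>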
</section>
<section id="select" data-type="sect2">
<h2>
SELECT</h2>
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">select()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">rename()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">relocate()</a></code>, and, as you’ll learn below, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarise()</a></code>.</p>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">select()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">rename()</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, "type", manufacturer, model, "year"
#&gt; FROM planes
planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
rename(year_built = year) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, "type", manufacturer, model, "year" AS year_built
#&gt; FROM planes
planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
relocate(manufacturer, model, .before = type) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT tailnum, manufacturer, model, "type", "year"
#&gt; FROM planes</pre>
</div>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code>, the old name is on the left and the new name is on the right.</p>
<div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted, because only a handful of client packages, like duckdb, know what all the reserved words are; the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
<p>The translations for <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
speed = distance / (air_time / 60)
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *, distance / (air_time / 60.0) AS speed
#&gt; FROM flights</pre>
</div>
<p>We’ll come back to the translation of individual components (like <code>/</code>) in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="from" data-type="sect2">
<h2>
FROM</h2>
<p>The <code>FROM</code> clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.</p>
</section>
<section id="group-by" data-type="sect2">
<h2>
GROUP BY</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt;
group_by(cut) |&gt;
summarise(
n = n(),
avg_price = mean(price, na.rm = TRUE)
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price
#&gt; FROM diamonds
#&gt; GROUP BY cut</pre>
</div>
<p>We’ll come back to what’s happening with the translation of <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">n()</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">mean()</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="where" data-type="sect2">
<h2>
WHERE</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dest == "IAH" | dest == "HOU") |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest = 'IAH' OR dest = 'HOU')
flights |&gt;
filter(arr_delay &gt; 0 &amp; arr_delay &lt; 20) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (arr_delay &gt; 0.0 AND arr_delay &lt; 20.0)</pre>
</div>
<p>There are a few important details to note here:</p>
<ul><li>
<code>|</code> becomes <code>OR</code> and <code>&amp;</code> becomes <code>AND</code>.</li>
<li>SQL uses <code>=</code> for comparison, not <code>==</code>. SQL doesn’t have assignment, so there’s no potential for confusion there.</li>
<li>SQL uses only <code>''</code> for strings, not <code>""</code>. In SQL, <code>""</code> is used to identify variables, like R’s <code>``</code>.</li>
</ul><p>Another useful SQL operator is <code>IN</code>, which is very close to R’s <code>%in%</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dest %in% c("IAH", "HOU")) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (dest IN ('IAH', 'HOU'))</pre>
</div>
<p>SQL uses <code>NULL</code> instead of <code>NA</code>. <code>NULL</code>s behave similarly to <code>NA</code>s. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
summarise(delay = mean(arr_delay))
#&gt; Warning: Missing values are always removed in SQL aggregation functions.
#&gt; Use `na.rm = TRUE` to silence this warning
#&gt; This warning is displayed once every 8 hours.
#&gt; # Source: SQL [?? x 2]
#&gt; # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
#&gt; dest delay
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ATL 11.3
#&gt; 2 ORD 5.88
#&gt; 3 RDU 10.1
#&gt; 4 IAD 13.9
#&gt; 5 DTW 5.43
#&gt; 6 LAX 0.547
#&gt; # … with more rows</pre>
</div>
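<p>The contrast between “infectious” and “silently dropped” can also be checked by sending SQL straight to the database. The following is a minimal sketch, assuming the duckdb package is installed; the connection <code>null_con</code> and the three-row <code>delays</code> table are made up for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(DBI)

null_con &lt;- DBI::dbConnect(duckdb::duckdb())

# NULL is "infectious": arithmetic and comparisons that involve NULL yield
# NULL, which arrives in R as NA.
infectious &lt;- DBI::dbGetQuery(
  null_con,
  "SELECT 1 + NULL AS arithmetic, (NULL = 1) AS comparison"
)
infectious

# Aggregates skip NULLs instead of propagating them.
DBI::dbWriteTable(null_con, "delays", data.frame(delay = c(10, NA, 20)))
aggregated &lt;- DBI::dbGetQuery(
  null_con,
  "SELECT AVG(delay) AS avg_delay, COUNT(delay) AS n_non_null, COUNT(*) AS n_rows
   FROM delays"
)
aggregated

# For comparison, base R propagates the NA unless you ask it not to.
mean(c(10, NA, 20))

DBI::dbDisconnect(null_con, shutdown = TRUE)</pre>
</div>
<p>Here <code>AVG()</code> averages only the two non-missing delays, and <code>COUNT(delay)</code> counts two values while <code>COUNT(*)</code> counts all three rows.</p>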
<p>If you want to learn more about how <code>NULL</code>s work, you might enjoy “<a href="#chp-https://modern-sql.com/concept/three-valued-logic" data-type="xref">Three valued logic</a>” by Markus Winand.</p>
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(!is.na(dep_delay)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; WHERE (NOT((dep_delay IS NULL)))</pre>
</div>
<p>This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:</p>
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
<p>Note that if you <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">filter()</a></code> a variable that you created using a summary, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL: <code>WHERE</code> is evaluated before <code>SELECT</code> and <code>GROUP BY</code>, so SQL needs another clause that’s evaluated after the groups are formed.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt;
group_by(cut) |&gt;
summarise(n = n()) |&gt;
filter(n &gt; 100) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT cut, COUNT(*) AS n
#&gt; FROM diamonds
#&gt; GROUP BY cut
#&gt; HAVING (COUNT(*) &gt; 100.0)</pre>
</div>
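<p>To make the distinction concrete, the sketch below places the same kind of filter on either side of the summary. It uses <code>lazy_frame()</code> and <code>simulate_dbi()</code> from dbplyr so that no database is needed; <code>diamonds_sketch</code> and its values are invented, and older versions of dbplyr may express the second query with a subquery and <code>WHERE</code> rather than <code>HAVING</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(dbplyr)

# A stand-in for diamonds_db that exists only for SQL generation.
diamonds_sketch &lt;- lazy_frame(
  cut = "Ideal", price = 326L,
  con = simulate_dbi(),
  .name = "diamonds"
)

# Filtering individual rows before summarising: a WHERE clause.
rows_first &lt;- diamonds_sketch |&gt;
  filter(price &gt; 1000) |&gt;
  group_by(cut) |&gt;
  summarise(n = n())
rows_first |&gt; show_query()

# Filtering on the summary itself: a HAVING clause.
groups_first &lt;- diamonds_sketch |&gt;
  group_by(cut) |&gt;
  summarise(n = n()) |&gt;
  filter(n &gt; 100)
groups_first |&gt; show_query()</pre>
</div>
<p>The first query discards cheap diamonds and then counts; the second counts everything and then discards small groups, so the two generally return different results.</p>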
</section>
<section id="order-by" data-type="sect2">
<h2>
ORDER BY</h2>
<p>Ordering rows involves a straightforward translation from <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(year, month, day, desc(dep_delay)) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM flights
#&gt; ORDER BY "year", "month", "day", dep_delay DESC</pre>
</div>
<p>Notice how <code><a href="#chp-https://dplyr.tidyverse.org/reference/desc" data-type="xref">desc()</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p>
</section>
<section id="subqueries" data-type="sect2">
<h2>
Subqueries</h2>
<p>Sometimes it’s not possible to translate a dplyr pipeline into a single <code>SELECT</code> statement and you need to use a subquery. A <strong>subquery</strong> is just a query used as a data source in the <code>FROM</code> clause, instead of the usual table.</p>
<p>dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the <code>SELECT</code> clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes <code>year1</code> and then the second (outer) query can compute <code>year2</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
year1 = year + 1,
year2 = year1 + 1
) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *, year1 + 1.0 AS year2
#&gt; FROM (
#&gt; SELECT *, "year" + 1.0 AS year1
#&gt; FROM flights
#&gt; ) q01</pre>
</div>
<p>You’ll also see this if you attempt to <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(year1 = year + 1) |&gt;
filter(year1 == 2014) |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT *
#&gt; FROM (
#&gt; SELECT *, "year" + 1.0 AS year1
#&gt; FROM flights
#&gt; ) q01
#&gt; WHERE (year1 = 2014.0)</pre>
</div>
<p>Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.</p>
</section>
<section id="joins" data-type="sect2">
<h2>
Joins</h2>
<p>If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
left_join(planes |&gt; rename(year_built = year), by = "tailnum") |&gt;
show_query()
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; flights.*,
#&gt; planes."year" AS year_built,
#&gt; "type",
#&gt; manufacturer,
#&gt; model,
#&gt; engines,
#&gt; seats,
#&gt; speed,
#&gt; engine
#&gt; FROM flights
#&gt; LEFT JOIN planes
#&gt; ON (flights.tailnum = planes.tailnum)</pre>
</div>
<p>The main thing to notice here is the syntax: SQL joins use sub-clauses of the <code>FROM</code> clause to bring in additional tables, using <code>ON</code> to define how the tables are related.</p>
<p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">inner_join()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">right_join()</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">full_join()</a></code>:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum)

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)

SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)</pre>
<p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, and to build a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="#chp-https://cynkra.github.io/dm/" data-type="xref">dm package</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p>
</section>
<section id="other-verbs" data-type="sect2">
<h2>
Other verbs</h2>
<p>dbplyr also translates other verbs like <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">distinct()</a></code>, <code>slice_*()</code>, and <code><a href="#chp-https://generics.r-lib.org/reference/setops" data-type="xref">intersect()</a></code>, and a growing selection of tidyr functions like <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">pivot_longer()</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">pivot_wider()</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">distinct()</a></code> translated to? How about <code><a href="#chp-https://rdrr.io/r/utils/head" data-type="xref">head()</a></code>?</p></li>
<li>
<p>Explain what each of the following SQL queries does and try to recreate them using dbplyr.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT *
FROM flights
WHERE dep_delay &lt; arr_delay

SELECT *, distance / (air_time / 60) AS speed
FROM flights</pre>
</li>
</ol></section>
</section>
<section id="sec-sql-expressions" data-type="sect1">
<h1>
Function translations</h1>
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarise()</a></code>?</p>
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarise()</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summarize_query &lt;- function(df, ...) {
  df |&gt;
    summarise(...) |&gt;
    show_query()
}

mutate_query &lt;- function(df, ...) {
  df |&gt;
    mutate(..., .keep = "none") |&gt;
    show_query()
}</pre>
</div>
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">mean()</a></code>, have a relatively simple translation while others, like <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
summarize_query(
mean = mean(arr_delay, na.rm = TRUE),
median = median(arr_delay, na.rm = TRUE)
)
#&gt; `summarise()` has grouped output by "year" and "month". You can override using
#&gt; the `.groups` argument.
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",
#&gt; "month",
#&gt; "day",
#&gt; AVG(arr_delay) AS mean,
#&gt; PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median
#&gt; FROM flights
#&gt; GROUP BY "year", "month", "day"</pre>
</div>
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
mutate_query(
mean = mean(arr_delay, na.rm = TRUE),
)
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; "year",
#&gt; "month",
#&gt; "day",
#&gt; AVG(arr_delay) OVER (PARTITION BY "year", "month", "day") AS mean
#&gt; FROM flights</pre>
</div>
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p>
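<p>The practical difference between the two forms is the number of rows that come back: an aggregate collapses each group to one row, while a window function keeps every row and repeats the group’s value. Here is a small sketch of that behaviour, assuming the duckdb package is installed; <code>win_con</code> and the three-row <code>delays</code> table are invented for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(DBI)
library(dplyr)
library(dbplyr)

win_con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(
  win_con, "delays",
  data.frame(dest = c("IAH", "IAH", "HOU"), arr_delay = c(10, 30, 5))
)
delays_db &lt;- tbl(win_con, "delays")

# GROUP BY + aggregate: one row per destination.
by_dest &lt;- delays_db |&gt;
  group_by(dest) |&gt;
  summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) |&gt;
  collect()

# OVER (PARTITION BY ...): every original row, plus its group's mean.
with_mean &lt;- delays_db |&gt;
  group_by(dest) |&gt;
  mutate(mean_delay = mean(arr_delay, na.rm = TRUE)) |&gt;
  collect()

nrow(by_dest)
nrow(with_mean)

DBI::dbDisconnect(win_con, shutdown = TRUE)</pre>
</div>
<p>With two destinations and three flights, the summary has two rows and the windowed result has three.</p>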
<p>Window functions include all functions that look forward or backwards, like <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">lead()</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">lag()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
arrange(time_hour) |&gt;
mutate_query(
lead = lead(arr_delay),
lag = lag(arr_delay)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT
#&gt; dest,
#&gt; LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,
#&gt; LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
#&gt; FROM flights
#&gt; ORDER BY time_hour</pre>
</div>
<p>Here it’s important to <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">arrange()</a></code> you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
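<p>If you only need an ordering inside the window and don’t care about the order of the final result, dbplyr also offers <code>window_order()</code>, a function this chapter doesn’t otherwise use. The sketch below is one way to apply it, again using <code>lazy_frame()</code> and <code>simulate_dbi()</code> so that no database is required; <code>flights_win</code> and its values are made up, and the exact SQL depends on your dbplyr version.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(dbplyr)

# A stand-in for flights that exists only for SQL generation.
flights_win &lt;- lazy_frame(
  dest = "IAH", time_hour = 1, arr_delay = 0,
  con = simulate_dbi(),
  .name = "flights"
)

# window_order() sets the ORDER BY used inside OVER (...) without asking
# the database to sort the rows of the final result.
lagged &lt;- flights_win |&gt;
  group_by(dest) |&gt;
  window_order(time_hour) |&gt;
  mutate(lag = lag(arr_delay))

lagged |&gt; show_query()</pre>
</div>
<p>Skipping the outer sort can save work on large tables, at the cost of rows coming back in whatever order the database chooses.</p>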
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">if_else()</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">case_when()</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate_query(
description = if_else(arr_delay &gt; 0, "delayed", "on-time")
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE WHEN (arr_delay &gt; 0.0) THEN 'delayed' WHEN NOT (arr_delay &gt; 0.0) THEN 'on-time' END AS description
#&gt; FROM flights
flights |&gt;
mutate_query(
description =
case_when(
arr_delay &lt; -5 ~ "early",
arr_delay &lt; 5 ~ "on-time",
arr_delay &gt;= 5 ~ "late"
)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE
#&gt; WHEN (arr_delay &lt; -5.0) THEN 'early'
#&gt; WHEN (arr_delay &lt; 5.0) THEN 'on-time'
#&gt; WHEN (arr_delay &gt;= 5.0) THEN 'late'
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">cut()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate_query(
description = cut(
arr_delay,
breaks = c(-Inf, -5, 5, Inf),
labels = c("early", "on-time", "late")
)
)
#&gt; &lt;SQL&gt;
#&gt; SELECT CASE
#&gt; WHEN (arr_delay &lt;= -5.0) THEN 'early'
#&gt; WHEN (arr_delay &lt;= 5.0) THEN 'on-time'
#&gt; WHEN (arr_delay &gt; 5.0) THEN 'late'
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="#chp-https://dbplyr.tidyverse.org/articles/translation-function" data-type="xref">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
<section id="learning-more" data-type="sect2">
<h2>
Learning more</h2>
<p>If you've finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<ul><li>
<a href="#chp-https://sqlfordatascientists" data-type="xref">SQL for Data Scientists</a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.</li>
<li>
<a href="#chp-https://www.practicalsql" data-type="xref">Practical SQL</a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
</ul></section>
</section>
</section>

771
oreilly/datetimes.html Normal file

@ -0,0 +1,771 @@
<section data-type="chapter" id="chp-datetimes">
<h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don't seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!</p>
<p>To warm up, think about how many days there are in a year, and how many hours there are in a day. You probably remembered that most years have 365 days, but leap years have 366. Do you know the full rule for determining if a year is a leap year<span data-type="footnote">A year is a leap year if it's divisible by 4, unless it's also divisible by 100, except if it's also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.</span>? The number of hours in a day is a little less obvious: most days have 24 hours, but in places that use daylight saving time (DST), one day each year has 23 hours and another has 25.</p>
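<p>As a rough sketch, the leap year rule in the footnote can be written as a small base R function. The name <code>is_leap()</code> is our own, chosen purely for illustration; in practice you would more likely reach for lubridate's <code>leap_year()</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># is_leap() is a hypothetical helper, not part of any package
is_leap &lt;- function(year) {
  (year %% 4 == 0 &amp; year %% 100 != 0) | year %% 400 == 0
}
is_leap(c(1900, 2000, 2023, 2024))
#&gt; [1] FALSE  TRUE FALSE  TRUE
# Count the leap years in one 400-year cycle
sum(is_leap(1:400))
#&gt; [1] 97</pre>
</div>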
<p>Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won't teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.</p>
<p>We'll begin by showing you how to create date-times from various inputs, and then once you've got a date-time, how you can extract components like year, month, and day. We'll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you're trying to do. We'll conclude with a brief discussion of the additional challenges posed by time zones.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter will focus on the <strong>lubridate</strong> package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you're working with dates/times. We will also need nycflights13 for practice data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(lubridate)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="sec-creating-datetimes" data-type="sect1">
<h1>
Creating date/times</h1>
<p>There are three types of date/time data that refer to an instant in time:</p>
<ul><li><p>A <strong>date</strong>. Tibbles print this as <code>&lt;date&gt;</code>.</p></li>
<li><p>A <strong>time</strong> within a day. Tibbles print this as <code>&lt;time&gt;</code>.</p></li>
<li><p>A <strong>date-time</strong> is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <code>&lt;dttm&gt;</code>. Base R calls these POSIXct, but that doesn't exactly trip off the tongue.</p></li>
</ul><p>In this chapter we are going to focus on dates and date-times as R doesn't have a native class for storing times. If you need one, you can use the <strong>hms</strong> package.</p>
<p>You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we'll come back to at the end of the chapter.</p>
<p>To get the current date or date-time you can use <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">today()</a></code> or <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">now()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">today()
#&gt; [1] "2022-11-18"
now()
#&gt; [1] "2022-11-18 10:21:36 CST"</pre>
</div>
<p>Otherwise, the following sections describe the four ways you're likely to create a date/time:</p>
<ul><li>While reading a file with readr.</li>
<li>From a string.</li>
<li>From individual date-time components.</li>
<li>From an existing date/time object.</li>
</ul>
<section id="during-import" data-type="sect2">
<h2>
During import</h2>
<p>If your CSV contains an ISO8601 date or date-time, you don't need to do anything; readr will automatically recognize it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
date,datetime
2022-01-02,2022-01-02 05:12
"
read_csv(csv)
#&gt; # A tibble: 1 × 2
#&gt; date datetime
#&gt; &lt;date&gt; &lt;dttm&gt;
#&gt; 1 2022-01-02 2022-01-02 05:12:00</pre>
</div>
<p>If you haven't heard of <strong>ISO8601</strong> before, it's an international standard<span data-type="footnote"><a href="https://xkcd.com/1179/" class="uri">https://xkcd.com/1179/</a></span> for writing dates where the components of a date are organised from biggest to smallest separated by <code>-</code>. For example, in ISO8601 May 3 2022 is <code>2022-05-03</code>. ISO8601 dates can also include times, where hour, minute, and second are separated by <code>:</code>, and the date and time components are separated by either a <code>T</code> or a space. For example, you could write 4:26pm on May 3 2022 as either <code>2022-05-03 16:26</code> or <code>2022-05-03T16:26</code>.</p>
<p>For other date-time formats, you'll need to use <code>col_types</code> plus <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">col_date()</a></code> or <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">col_datetime()</a></code> along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a <code>%</code> followed by a single character. For example, <code>%Y-%m-%d</code> specifies a date that's a year, <code>-</code>, month (as number), <code>-</code>, day. Table <a href="#tbl-date-formats" data-type="xref">#tbl-date-formats</a> lists all the options.</p>
<div id="tbl-date-formats" class="anchored">
<table class="table"><caption>Table 17.1: All date formats understood by readr</caption>
<thead><tr class="header"><th>Type</th>
<th>Code</th>
<th>Meaning</th>
<th>Example</th>
</tr></thead><tbody><tr class="odd"><td>Year</td>
<td><code>%Y</code></td>
<td>4 digit year</td>
<td>2021</td>
</tr><tr class="even"><td/>
<td><code>%y</code></td>
<td>2 digit year</td>
<td>21</td>
</tr><tr class="odd"><td>Month</td>
<td><code>%m</code></td>
<td>Number</td>
<td>2</td>
</tr><tr class="even"><td/>
<td><code>%b</code></td>
<td>Abbreviated name</td>
<td>Feb</td>
</tr><tr class="odd"><td/>
<td><code>%B</code></td>
<td>Full name</td>
<td>February</td>
</tr><tr class="even"><td>Day</td>
<td><code>%d</code></td>
<td>Two digits</td>
<td>02</td>
</tr><tr class="odd"><td/>
<td><code>%e</code></td>
<td>One or two digits</td>
<td>2</td>
</tr><tr class="even"><td>Time</td>
<td><code>%H</code></td>
<td>24-hour hour</td>
<td>13</td>
</tr><tr class="odd"><td/>
<td><code>%I</code></td>
<td>12-hour hour</td>
<td>1</td>
</tr><tr class="even"><td/>
<td><code>%p</code></td>
<td>AM/PM</td>
<td>pm</td>
</tr><tr class="odd"><td/>
<td><code>%M</code></td>
<td>Minutes</td>
<td>35</td>
</tr><tr class="even"><td/>
<td><code>%S</code></td>
<td>Seconds</td>
<td>45</td>
</tr><tr class="odd"><td/>
<td><code>%OS</code></td>
<td>Seconds with decimal component</td>
<td>45.35</td>
</tr><tr class="even"><td/>
<td><code>%Z</code></td>
<td>Time zone name</td>
<td>America/Chicago</td>
</tr><tr class="odd"><td/>
<td><code>%z</code></td>
<td>Offset from UTC</td>
<td>+0800</td>
</tr><tr class="even"><td>Other</td>
<td><code>%.</code></td>
<td>Skip one non-digit</td>
<td>:</td>
</tr><tr class="odd"><td/>
<td><code>%*</code></td>
<td>Skip any number of non-digits</td>
<td/>
</tr></tbody></table></div>
<p>This code shows a few options applied to a very ambiguous date:</p>
<div class="cell" data-messages="false">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
date
01/02/15
"
read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
#&gt; # A tibble: 1 × 1
#&gt; date
#&gt; &lt;date&gt;
#&gt; 1 2015-01-02
read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
#&gt; # A tibble: 1 × 1
#&gt; date
#&gt; &lt;date&gt;
#&gt; 1 2015-02-01
read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
#&gt; # A tibble: 1 × 1
#&gt; date
#&gt; &lt;date&gt;
#&gt; 1 2001-02-15</pre>
</div>
<p>Note that no matter how you specify the date format, it's always displayed the same way once you get it into R.</p>
<p>If you're using <code>%b</code> or <code>%B</code> and working with non-English dates, you'll also need to provide a <code><a href="#chp-https://readr.tidyverse.org/reference/locale" data-type="xref">locale()</a></code>. See the list of built-in languages in <code><a href="#chp-https://readr.tidyverse.org/reference/date_names" data-type="xref">date_names_langs()</a></code>, or create your own with <code><a href="#chp-https://readr.tidyverse.org/reference/date_names" data-type="xref">date_names()</a></code>.</p>
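<p>For instance, a French date could be parsed along these lines. This is a small sketch that assumes the month name is spelled the way readr's built-in <code>"fr"</code> locale expects; the variable name <code>french</code> is only illustrative:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)
# %d = day, %B = full month name in the supplied locale, %Y = 4 digit year
french &lt;- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
french
#&gt; [1] "2015-01-01"</pre>
</div>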
</section>
<section id="from-strings" data-type="sect2">
<h2>
From strings</h2>
<p>The date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate's helpers, which attempt to automatically determine the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ymd("2017-01-31")
#&gt; [1] "2017-01-31"
mdy("January 31st, 2017")
#&gt; [1] "2017-01-31"
dmy("31-Jan-2017")
#&gt; [1] "2017-01-31"</pre>
</div>
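<p>As far as we're aware, these helpers also accept unquoted numbers, which is the form used with <code>ymd(20130102)</code> later in this chapter. A minimal check:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)
# A number is read digit by digit in the same y-m-d order as a string
ymd(20170131)
#&gt; [1] "2017-01-31"
identical(ymd(20170131), ymd("2017-01-31"))
#&gt; [1] TRUE</pre>
</div>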
<p><code><a href="#chp-https://lubridate.tidyverse.org/reference/ymd" data-type="xref">ymd()</a></code> and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ymd_hms("2017-01-31 20:11:59")
#&gt; [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
#&gt; [1] "2017-01-31 08:01:00 UTC"</pre>
</div>
<p>You can also force the creation of a date-time from a date by supplying a timezone:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ymd("2017-01-31", tz = "UTC")
#&gt; [1] "2017-01-31 UTC"</pre>
</div>
</section>
<section id="from-individual-components" data-type="sect2">
<h2>
From individual components</h2>
<p>Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns. This is what we have in the <code>flights</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  select(year, month, day, hour, minute)
#&gt; # A tibble: 336,776 × 5
#&gt; year month day hour minute
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 5 15
#&gt; 2 2013 1 1 5 29
#&gt; 3 2013 1 1 5 40
#&gt; 4 2013 1 1 5 45
#&gt; 5 2013 1 1 6 0
#&gt; 6 2013 1 1 5 58
#&gt; # … with 336,770 more rows</pre>
</div>
<p>To create a date/time from this sort of input, use <code><a href="#chp-https://lubridate.tidyverse.org/reference/make_datetime" data-type="xref">make_date()</a></code> for dates, or <code><a href="#chp-https://lubridate.tidyverse.org/reference/make_datetime" data-type="xref">make_datetime()</a></code> for date-times:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  select(year, month, day, hour, minute) |&gt;
  mutate(departure = make_datetime(year, month, day, hour, minute))
#&gt; # A tibble: 336,776 × 6
#&gt; year month day hour minute departure
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dttm&gt;
#&gt; 1 2013 1 1 5 15 2013-01-01 05:15:00
#&gt; 2 2013 1 1 5 29 2013-01-01 05:29:00
#&gt; 3 2013 1 1 5 40 2013-01-01 05:40:00
#&gt; 4 2013 1 1 5 45 2013-01-01 05:45:00
#&gt; 5 2013 1 1 6 0 2013-01-01 06:00:00
#&gt; 6 2013 1 1 5 58 2013-01-01 05:58:00
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Let's do the same thing for each of the four time columns in <code>flights</code>. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we've created the date-time variables, we focus in on the variables we'll explore in the rest of the chapter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">make_datetime_100 &lt;- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt &lt;- flights |&gt;
  filter(!is.na(dep_time), !is.na(arr_time)) |&gt;
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |&gt;
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
#&gt; # A tibble: 328,063 × 9
#&gt; origin dest dep_delay arr_delay dep_time sched_dep_time
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
#&gt; 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
#&gt; 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
#&gt; 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
#&gt; 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
#&gt; 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
#&gt; # … with 328,057 more rows, and 3 more variables: arr_time &lt;dttm&gt;,
#&gt; # sched_arr_time &lt;dttm&gt;, air_time &lt;dbl&gt;</pre>
</div>
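<p>To see what the integer division and remainder inside <code>make_datetime_100()</code> are doing, here is the arithmetic on a single value. The value <code>517</code> is just an illustrative input standing for 5:17am:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">time &lt;- 517
time %/% 100 # integer division keeps the hour
#&gt; [1] 5
time %% 100  # the remainder keeps the minute
#&gt; [1] 17</pre>
</div>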
<p>With this data, we can visualize the distribution of departure times across the year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. There are a few days that stand out as having surprisingly few flights in early February, early July, late November, and late December." width="576"/></p>
</div>
</div>
<p>Or within a single day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  filter(dep_time &lt; ymd(20130102)) |&gt;
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-13-1.png" class="img-fluid" alt="A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm." width="576"/></p>
</div>
</div>
<p>Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.</p>
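<p>One way to check this, sketched here with a day that falls exactly one day after 1970-01-01 (the origin R counts from), is to convert a date-time and a date to plain numbers:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)
as.numeric(ymd_hms("1970-01-02 00:00:00")) # date-time: counted in seconds
#&gt; [1] 86400
as.numeric(ymd("1970-01-02"))              # date: counted in days
#&gt; [1] 1</pre>
</div>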
</section>
<section id="from-other-types" data-type="sect2">
<h2>
From other types</h2>
<p>You may want to switch between a date-time and a date. That's the job of <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">as_datetime()</a></code> and <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">as_date()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">as_datetime(today())
#&gt; [1] "2022-11-18 UTC"
as_date(now())
#&gt; [1] "2022-11-18"</pre>
</div>
<p>Sometimes you'll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">as_datetime()</a></code>; if it's in days, use <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">as_date()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">as_datetime(60 * 60 * 10)
#&gt; [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)
#&gt; [1] "1980-01-01"</pre>
</div>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>What happens if you parse a string that contains invalid dates?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ymd(c("2010-10-10", "bananas"))</pre>
</div>
</li>
<li><p>What does the <code>tzone</code> argument to <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">today()</a></code> do? Why is it important?</p></li>
<li>
<p>For each of the following date-times, show how you'd parse it using a readr column-specification and a lubridate function.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">d1 &lt;- "January 1, 2010"
d2 &lt;- "2015-Mar-07"
d3 &lt;- "06-Jun-2017"
d4 &lt;- c("August 19 (2015)", "July 1 (2015)")
d5 &lt;- "12/30/14" # Dec 30, 2014
t1 &lt;- "1705"
t2 &lt;- "11:15:10.12 PM"</pre>
</div>
</li>
</ol></section>
</section>
<section id="date-time-components" data-type="sect1">
<h1>
Date-time components</h1>
<p>Now that you know how to get date-time data into R's date-time data structures, let's explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.</p>
<section id="getting-components" data-type="sect2">
<h2>
Getting components</h2>
<p>You can pull out individual parts of the date with the accessor functions <code><a href="#chp-https://lubridate.tidyverse.org/reference/year" data-type="xref">year()</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/month" data-type="xref">month()</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">mday()</a></code> (day of the month), <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">yday()</a></code> (day of the year), <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">wday()</a></code> (day of the week), <code><a href="#chp-https://lubridate.tidyverse.org/reference/hour" data-type="xref">hour()</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/minute" data-type="xref">minute()</a></code>, and <code><a href="#chp-https://lubridate.tidyverse.org/reference/second" data-type="xref">second()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">datetime &lt;- ymd_hms("2026-07-08 12:34:56")
year(datetime)
#&gt; [1] 2026
month(datetime)
#&gt; [1] 7
mday(datetime)
#&gt; [1] 8
yday(datetime)
#&gt; [1] 189
wday(datetime)
#&gt; [1] 4</pre>
</div>
<p>For <code><a href="#chp-https://lubridate.tidyverse.org/reference/month" data-type="xref">month()</a></code> and <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">wday()</a></code> you can set <code>label = TRUE</code> to return the abbreviated name of the month or day of the week. Set <code>abbr = FALSE</code> to return the full name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">month(datetime, label = TRUE)
#&gt; [1] Jul
#&gt; 12 Levels: Jan &lt; Feb &lt; Mar &lt; Apr &lt; May &lt; Jun &lt; Jul &lt; Aug &lt; Sep &lt; ... &lt; Dec
wday(datetime, label = TRUE, abbr = FALSE)
#&gt; [1] Wednesday
#&gt; 7 Levels: Sunday &lt; Monday &lt; Tuesday &lt; Wednesday &lt; Thursday &lt; ... &lt; Saturday</pre>
</div>
<p>We can use <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">wday()</a></code> to see that more flights depart during the week than on the weekend:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  mutate(wday = wday(dep_time, label = TRUE)) |&gt;
  ggplot(aes(x = wday)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000)." width="576"/></p>
</div>
</div>
<p>There's an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  mutate(minute = minute(dep_time)) |&gt;
  group_by(minute) |&gt;
  summarise(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()) |&gt;
  ggplot(aes(minute, avg_delay)) +
  geom_line()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour and 9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9)." width="576"/></p>
</div>
</div>
<p>Interestingly, if we look at the <em>scheduled</em> departure time we don't see such a strong pattern:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sched_dep &lt;- flights_dt |&gt;
  mutate(minute = minute(sched_dep_time)) |&gt;
  group_by(minute) |&gt;
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n())
ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16). There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour." width="576"/></p>
</div>
</div>
<p>So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there's a strong bias towards flights leaving at “nice” departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(sched_dep, aes(minute, n)) +
  geom_line()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A line plot with departure minute (0-60) on the x-axis and number of flights (0-60000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes." width="576"/></p>
</div>
</div>
</section>
<section id="rounding" data-type="sect2">
<h2>
Rounding</h2>
<p>An alternative approach to plotting individual components is to round the date to a nearby unit of time, with <code><a href="#chp-https://lubridate.tidyverse.org/reference/round_date" data-type="xref">floor_date()</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/round_date" data-type="xref">round_date()</a></code>, and <code><a href="#chp-https://lubridate.tidyverse.org/reference/round_date" data-type="xref">ceiling_date()</a></code>. Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to the nearest (round). This, for example, allows us to plot the number of flights per week:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  count(week = floor_date(dep_time, "week")) |&gt;
  ggplot(aes(week, n)) +
  geom_line() +
  geom_point()</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-24-1.png" class="img-fluid" alt="A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights)." width="576"/></p>
</div>
</div>
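<p>To make the difference between the three rounding functions concrete, here is a small sketch that applies each one to a single date-time, rounding to the hour. The variable <code>x</code> is just an illustrative value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)
x &lt;- ymd_hms("2026-07-08 12:34:56")
floor_date(x, "hour")   # always rounds down
#&gt; [1] "2026-07-08 12:00:00 UTC"
round_date(x, "hour")   # rounds to the nearest hour
#&gt; [1] "2026-07-08 13:00:00 UTC"
ceiling_date(x, "hour") # always rounds up
#&gt; [1] "2026-07-08 13:00:00 UTC"</pre>
</div>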
<p>You can use rounding to show the distribution of flights across the course of a day by computing the difference between <code>dep_time</code> and the earliest instant of that day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |&gt;
  ggplot(aes(dep_hour)) +
  geom_freqpoly(binwidth = 60 * 30)
#&gt; Don't know how to automatically pick scale for object of type &lt;difftime&gt;.
#&gt; Defaulting to continuous.</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="A line plot with departure time on the x-axis. This is units of seconds since midnight so it's hard to interpret." width="576"/></p>
</div>
</div>
<p>Computing the difference between a pair of date-times yields a difftime (more on that in <a href="#sec-intervals" data-type="xref">#sec-intervals</a>). We can convert that to an <code>hms</code> object to get a more useful x-axis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
  mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |&gt;
  ggplot(aes(dep_hour)) +
  geom_freqpoly(binwidth = 60 * 30)</pre>
<div class="cell-output-display">
<p><img src="datetimes_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (&lt;100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again." width="576"/></p>
</div>
</div>
</section>
<section id="modifying-components" data-type="sect2">
<h2>
Modifying components</h2>
<p>You can also use each accessor function to modify the components of a date/time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">(datetime &lt;- ymd_hms("2026-07-08 12:34:56"))
#&gt; [1] "2026-07-08 12:34:56 UTC"
year(datetime) &lt;- 2030
datetime
#&gt; [1] "2030-07-08 12:34:56 UTC"
month(datetime) &lt;- 01
datetime
#&gt; [1] "2030-01-08 12:34:56 UTC"
hour(datetime) &lt;- hour(datetime) + 1
datetime
#&gt; [1] "2030-01-08 13:34:56 UTC"</pre>
</div>
<p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="#chp-https://rdrr.io/r/stats/update" data-type="xref">update()</a></code>. This also allows you to set multiple values in one step:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
#&gt; [1] "2030-02-02 02:34:56 UTC"</pre>
</div>
<p>If values are too big, they will roll-over:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">update(ymd("2023-02-01"), mday = 30)
#&gt; [1] "2023-03-02"
update(ymd("2023-02-01"), hour = 400)
#&gt; [1] "2023-02-17 16:00:00 UTC"</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How does the distribution of flight times within a day change over the course of the year?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code> and <code>dep_delay</code>. Are they consistent? Explain your findings.</p></li>
<li><p>Compare <code>air_time</code> with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)</p></li>
<li><p>How does the average delay time change over the course of a day? Should you use <code>dep_time</code> or <code>sched_dep_time</code>? Why?</p></li>
<li><p>On what day of the week should you leave if you want to minimise the chance of a delay?</p></li>
<li><p>What makes the distribution of <code>diamonds$carat</code> and <code>flights$sched_dep_time</code> similar?</p></li>
<li><p>Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.</p></li>
</ol></section>
</section>
<section id="time-spans" data-type="sect1">
<h1>
Time spans</h1>
<p>Next you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:</p>
<ul><li>
<strong>Durations</strong>, which represent an exact number of seconds.</li>
<li>
<strong>Periods</strong>, which represent human units like weeks and months.</li>
<li>
<strong>Intervals</strong>, which represent a starting and ending point.</li>
</ul><p>How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.</p>
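<p>To make the comparison concrete, here is a minimal sketch that describes the same span (the leap year 2024, chosen only for illustration) with each of the three classes. It assumes lubridate is attached and uses functions that are introduced in the sections that follow.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)

start &lt;- ymd("2024-01-01")
end &lt;- ymd("2025-01-01")

# Interval: anchored to a specific start and end
span &lt;- start %--% end

# Duration: the exact number of seconds between the two points,
# which for a leap year is 366 days' worth
as.duration(span) == ddays(366)
#&gt; [1] TRUE

# Period: the same span expressed in human units
as.period(span)</pre>
</div>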
<section id="durations" data-type="sect2">
<h2>
Durations</h2>
<p>In R, when you subtract two dates, you get a difftime object:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># How old is Hadley?
h_age &lt;- today() - ymd("1979-10-14")
h_age
#&gt; Time difference of 15741 days</pre>
</div>
<p>A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the <strong>duration</strong>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">as.duration(h_age)
#&gt; [1] "1360022400s (~43.1 years)"</pre>
</div>
<p>Durations come with a bunch of convenient constructors:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dseconds(15)
#&gt; [1] "15s"
dminutes(10)
#&gt; [1] "600s (~10 minutes)"
dhours(c(12, 24))
#&gt; [1] "43200s (~12 hours)" "86400s (~1 days)"
ddays(0:5)
#&gt; [1] "0s" "86400s (~1 days)" "172800s (~2 days)"
#&gt; [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
dweeks(3)
#&gt; [1] "1814400s (~3 weeks)"
dyears(1)
#&gt; [1] "31557600s (~1 years)"</pre>
</div>
<p>Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, and weeks to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.</p>
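<p>These conversion rules can be verified directly. The sketch below assumes lubridate is attached and relies only on the duration constructors shown above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)

# Each constructor is a fixed multiple of seconds
as.numeric(dminutes(1))
#&gt; [1] 60
as.numeric(dhours(1))
#&gt; [1] 3600
dweeks(1) == ddays(7)
#&gt; [1] TRUE

# A duration year is 365.25 days, not 365
dyears(1) == ddays(365.25)
#&gt; [1] TRUE</pre>
</div>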
<p>You can add and multiply durations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">2 * dyears(1)
#&gt; [1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
#&gt; [1] "38869200s (~1.23 years)"</pre>
</div>
<p>You can add and subtract durations to and from days:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tomorrow &lt;- today() + ddays(1)
last_year &lt;- today() - dyears(1)</pre>
</div>
<p>However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">one_pm &lt;- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")
one_pm
#&gt; [1] "2026-03-07 13:00:00 EST"
one_pm + ddays(1)
#&gt; [1] "2026-03-08 14:00:00 EDT"</pre>
</div>
<p>Why is one day after 1pm March 7, 2pm March 8? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.</p>
</section>
<section id="periods" data-type="sect2">
<h2>
Periods</h2>
<p>To solve this problem, lubridate provides <strong>periods</strong>. Periods are time spans but don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">one_pm
#&gt; [1] "2026-03-07 13:00:00 EST"
one_pm + days(1)
#&gt; [1] "2026-03-08 13:00:00 EDT"</pre>
</div>
<p>Like durations, periods can be created with a number of friendly constructor functions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">hours(c(12, 24))
#&gt; [1] "12H 0M 0S" "24H 0M 0S"
days(7)
#&gt; [1] "7d 0H 0M 0S"
months(1:6)
#&gt; [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
#&gt; [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"</pre>
</div>
<p>You can add and multiply periods:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">10 * (months(6) + days(1))
#&gt; [1] "60m 10d 0H 0M 0S"
days(50) + hours(25) + minutes(2)
#&gt; [1] "50d 25H 2M 0S"</pre>
</div>
<p>And of course, add them to dates. Compared to durations, periods are more likely to do what you expect:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A leap year
ymd("2024-01-01") + dyears(1)
#&gt; [1] "2024-12-31 06:00:00 UTC"
ymd("2024-01-01") + years(1)
#&gt; [1] "2025-01-01"
# Daylight Saving Time
one_pm + ddays(1)
#&gt; [1] "2026-03-08 14:00:00 EDT"
one_pm + days(1)
#&gt; [1] "2026-03-08 13:00:00 EDT"</pre>
</div>
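<p>The gap between the two kinds of arithmetic can also be measured directly. The sketch below is self-contained and assumes lubridate is attached; the variable names are ours, and the date is chosen because US daylight saving time begins on 2026-03-08.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)

# 1pm on the day before clocks spring forward in New York
before_dst &lt;- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")

exact &lt;- before_dst + ddays(1)  # exactly 86,400 seconds later
human &lt;- before_dst + days(1)   # same clock time on the next calendar day

hour(exact)
#&gt; [1] 14
hour(human)
#&gt; [1] 13

# The duration result is one hour ahead of the period result
as.numeric(exact - human, units = "hours")
#&gt; [1] 1</pre>
</div>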
<p>Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination <em>before</em> they departed from New York City.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
filter(arr_time &lt; dep_time)
#&gt; # A tibble: 10,640 × 9
#&gt; origin dest dep_delay arr_delay dep_time sched_dep_time
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 EWR BQN 9 -4 2013-01-01 19:29:00 2013-01-01 19:20:00
#&gt; 2 JFK DFW 59 NA 2013-01-01 19:39:00 2013-01-01 18:40:00
#&gt; 3 EWR TPA -2 9 2013-01-01 20:58:00 2013-01-01 21:00:00
#&gt; 4 EWR SJU -6 -12 2013-01-01 21:02:00 2013-01-01 21:08:00
#&gt; 5 EWR SFO 11 -14 2013-01-01 21:08:00 2013-01-01 20:57:00
#&gt; 6 LGA FLL -10 -2 2013-01-01 21:20:00 2013-01-01 21:30:00
#&gt; # … with 10,634 more rows, and 3 more variables: arr_time &lt;dttm&gt;,
#&gt; # sched_arr_time &lt;dttm&gt;, air_time &lt;dbl&gt;</pre>
</div>
<p>These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding <code>days(1)</code> to the arrival time of each overnight flight.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt &lt;- flights_dt |&gt;
mutate(
overnight = arr_time &lt; dep_time,
arr_time = arr_time + days(overnight * 1),
sched_arr_time = sched_arr_time + days(overnight * 1)
)</pre>
</div>
<p>Now all of our flights obey the laws of physics.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
filter(overnight, arr_time &lt; dep_time)
#&gt; # A tibble: 0 × 10
#&gt; # … with 10 variables: origin &lt;chr&gt;, dest &lt;chr&gt;, dep_delay &lt;dbl&gt;,
#&gt; #   arr_delay &lt;dbl&gt;, dep_time &lt;dttm&gt;, sched_dep_time &lt;dttm&gt;,
#&gt; #   arr_time &lt;dttm&gt;, sched_arr_time &lt;dttm&gt;, air_time &lt;dbl&gt;, overnight &lt;lgl&gt;</pre>
</div>
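<p>To see the overnight adjustment in isolation, here is a toy version with two made-up flights. The tibble <code>toy</code> and its values are purely illustrative and are not taken from <code>flights_dt</code>. Only the flight whose recorded arrival precedes its departure should be shifted to the next day.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(lubridate)

# Hypothetical data: the first flight lands after midnight
toy &lt;- tibble(
  dep_time = ymd_hm(c("2013-01-01 23:30", "2013-01-01 09:00")),
  arr_time = ymd_hm(c("2013-01-01 00:45", "2013-01-01 11:15"))
)

fixed &lt;- toy |&gt;
  mutate(
    overnight = arr_time &lt; dep_time,
    arr_time = arr_time + days(overnight * 1)
  )

fixed$overnight
#&gt; [1]  TRUE FALSE

# After the adjustment every arrival follows its departure
all(fixed$arr_time &gt; fixed$dep_time)
#&gt; [1] TRUE</pre>
</div>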
</section>
<section id="sec-intervals" data-type="sect2">
<h2>
Intervals</h2>
<p>What does <code>dyears(1) / ddays(365)</code> return? It doesn’t quite return one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365.25 days’ worth of seconds.</p>
<p>What should <code>years(1) / days(1)</code> return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">years(1) / days(1)
#&gt; [1] 365.25</pre>
</div>
<p>If you want a more accurate measurement, you’ll have to use an <strong>interval</strong>. An interval is a pair of starting and ending date-times, or you can think of it as a duration with a starting point.</p>
<p>You can create an interval by writing <code>start %--% end</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2023 &lt;- ymd("2023-01-01") %--% ymd("2024-01-01")
y2024 &lt;- ymd("2024-01-01") %--% ymd("2025-01-01")
y2023
#&gt; [1] 2023-01-01 UTC--2024-01-01 UTC
y2024
#&gt; [1] 2024-01-01 UTC--2025-01-01 UTC</pre>
</div>
<p>You could then divide it by <code><a href="#chp-https://lubridate.tidyverse.org/reference/period" data-type="xref">days()</a></code> to find out how many days fit in the year:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2023 / days(1)
#&gt; [1] 365
y2024 / days(1)
#&gt; [1] 366</pre>
</div>
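<p>Intervals also support integer division, which is handy when you only want whole units. A small sketch, assuming lubridate is attached and recreating <code>y2024</code> so that it runs on its own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)

y2024 &lt;- ymd("2024-01-01") %--% ymd("2025-01-01")

# Exact length of the interval in seconds: 366 days of 86,400 seconds
int_length(y2024)
#&gt; [1] 31622400

# Number of complete weeks that fit in the interval
y2024 %/% weeks(1)
#&gt; [1] 52</pre>
</div>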
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain <code>days(overnight * 1)</code> to someone who has just started learning R. How does it work?</p></li>
<li><p>Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the <em>current</em> year.</p></li>
<li><p>Write a function that given your birthday (as a date), returns how old you are in years.</p></li>
<li><p>Why can’t <code>(today() %--% (today() + years(1))) / months(1)</code> work?</p></li>
</ol></section>
</section>
<section id="time-zones" data-type="sect1">
<h1>
Time zones</h1>
<p>Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all important for data analysis, but there are a few challenges we’ll need to tackle head on.</p>
<!--# https://www.ietf.org/timezones/tzdb-2018a/theory.html -->
<p>The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme <code>{area}/{location}</code>, typically in the form <code>{continent}/{city}</code> or <code>{ocean}/{city}</code>. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.</p>
<p>You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades’ worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. Both cities currently use Eastern Standard Time, but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at <a href="https://www.iana.org/time-zones" class="uri">https://www.iana.org/time-zones</a>) just to read some of these stories!</p>
<p>You can find out what R thinks your current time zone is with <code><a href="#chp-https://rdrr.io/r/base/timezones" data-type="xref">Sys.timezone()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">Sys.timezone()
#&gt; [1] "America/Chicago"</pre>
</div>
<p>(If R doesn’t know, you’ll get an <code>NA</code>.)</p>
<p>And see the complete list of all time zone names with <code><a href="#chp-https://rdrr.io/r/base/timezones" data-type="xref">OlsonNames()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">length(OlsonNames())
#&gt; [1] 595
head(OlsonNames())
#&gt; [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
#&gt; [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"</pre>
</div>
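<p>Because <code>OlsonNames()</code> returns an ordinary character vector, you can search it with the usual tools. The sketch below uses only base R. The exact set of names depends on the time zone database installed on your system, so no output is shown for the listing.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Check that a name is valid before using it
"Pacific/Auckland" %in% OlsonNames()
#&gt; [1] TRUE

# List the zones in one area, here the Pacific
pacific &lt;- grep("^Pacific/", OlsonNames(), value = TRUE)
head(pacific)</pre>
</div>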
<p>In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
x1
#&gt; [1] "2024-06-01 12:00:00 EDT"
x2 &lt;- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
x2
#&gt; [1] "2024-06-01 18:00:00 CEST"
x3 &lt;- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
x3
#&gt; [1] "2024-06-02 04:00:00 NZST"</pre>
</div>
<p>You can verify that they’re the same time using subtraction:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 - x2
#&gt; Time difference of 0 secs
x1 - x3
#&gt; Time difference of 0 secs</pre>
</div>
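<p>Another way to convince yourself, sketched here under the assumption that lubridate is attached: the underlying number of seconds since 1970-01-01 UTC is identical, and only the time zone attribute, which you can read with <code>tz()</code>, differs.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate)

x1 &lt;- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
x2 &lt;- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")

# Same instant: identical seconds since the Unix epoch
as.numeric(x1) == as.numeric(x2)
#&gt; [1] TRUE

# Different display: the time zone is stored as an attribute
tz(x1)
#&gt; [1] "America/New_York"
tz(x2)
#&gt; [1] "Europe/Copenhagen"</pre>
</div>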
<p>Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">c()</a></code>, will often drop the time zone. In that case, the date-times will display in the time zone of the first element:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x4 &lt;- c(x1, x2, x3)
x4
#&gt; [1] "2024-06-01 12:00:00 EDT" "2024-06-01 12:00:00 EDT"
#&gt; [3] "2024-06-01 12:00:00 EDT"</pre>
</div>
<p>You can change the time zone in two ways:</p>
<ul><li>
<p>Keep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x4a &lt;- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a
#&gt; [1] "2024-06-02 02:30:00 +1030" "2024-06-02 02:30:00 +1030"
#&gt; [3] "2024-06-02 02:30:00 +1030"
x4a - x4
#&gt; Time differences in secs
#&gt; [1] 0 0 0</pre>
</div>
<p>(This also illustrates another challenge of time zones: they’re not all integer hour offsets!)</p>
</li>
<li>
<p>Change the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x4b &lt;- force_tz(x4, tzone = "Australia/Lord_Howe")
x4b
#&gt; [1] "2024-06-01 12:00:00 +1030" "2024-06-01 12:00:00 +1030"
#&gt; [3] "2024-06-01 12:00:00 +1030"
x4b - x4
#&gt; Time differences in hours
#&gt; [1] -14.5 -14.5 -14.5</pre>
</div>
</li>
</ul></section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why: date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight saving boundary or involves a leap year, the functions need to be able to handle it.</p>
<p>The next chapter gives a round up of missing values. You’ve seen them in a few places and have no doubt encountered them in your own analysis, and it’s now time to provide a grab bag of useful techniques for dealing with them.</p>
</section>
</section>

446
oreilly/factors.html Normal file
View File

@ -0,0 +1,446 @@
<section data-type="chapter" id="chp-factors">
<h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.</p>
<p>We’ll start by motivating why factors are needed for data analysis and how you can create them with <code><a href="#chp-https://rdrr.io/r/base/factor" data-type="xref">factor()</a></code>. We’ll then introduce you to the <code>gss_cat</code> dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the <strong>forcats</strong> package, which is part of the core tidyverse. It provides tools for dealing with <strong>cat</strong>egorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="factor-basics" data-type="sect1">
<h1>
Factor basics</h1>
<p>Imagine that you have a variable that records month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- c("Dec", "Apr", "Jan", "Mar")</pre>
</div>
<p>Using a string to record this variable has two problems:</p>
<ol type="1"><li>
<p>There are only twelve possible months, and there’s nothing saving you from typos:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x2 &lt;- c("Dec", "Apr", "Jam", "Mar")</pre>
</div>
</li>
<li>
<p>It doesn’t sort in a useful way:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sort(x1)
#&gt; [1] "Apr" "Dec" "Jan" "Mar"</pre>
</div>
</li>
</ol><p>You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid <strong>levels</strong>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">month_levels &lt;- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)</pre>
</div>
<p>Now you can create a factor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y1 &lt;- factor(x1, levels = month_levels)
y1
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
#&gt; [1] Jan Mar Apr Dec
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
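<p>Sorting works because a factor stores each value as an integer position in its levels. A brief base R sketch that recreates <code>y1</code> so it runs on its own. It uses <code>month.abb</code>, base R’s built-in vector of English month abbreviations, which has the same contents as <code>month_levels</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># month.abb is "Jan", "Feb", ..., "Dec", the same values as month_levels
y1 &lt;- factor(c("Dec", "Apr", "Jan", "Mar"), levels = month.abb)

# Each value is stored as its position in the levels
as.integer(y1)
#&gt; [1] 12  4  1  3

# sort() orders by those positions, not alphabetically
as.character(sort(y1))
#&gt; [1] "Jan" "Mar" "Apr" "Dec"</pre>
</div>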
<p>And any values not in the levels will be silently converted to NA:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2 &lt;- factor(x2, levels = month_levels)
y2
#&gt; [1] Dec Apr &lt;NA&gt; Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
<p>This seems risky, so you might want to use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct" data-type="xref">fct()</a></code> instead:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2 &lt;- fct(x2, levels = month_levels)
#&gt; Error in `fct()`:
#&gt; ! All values of `x` must appear in `levels` or `na`
#&gt; Missing level: "Jam"</pre>
</div>
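<p>If you would rather inspect the offending values yourself, one option (a base R sketch, not something forcats requires) is to compare the data against the allowed levels before converting. Here <code>month.abb</code>, base R’s built-in vector of month abbreviations, stands in for <code>month_levels</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x2 &lt;- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data but missing from the allowed levels
setdiff(x2, month.abb)
#&gt; [1] "Jam"</pre>
</div>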
<p>If you omit the levels, they’ll be taken from the data in alphabetical order:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">factor(x1)
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Apr Dec Jan Mar</pre>
</div>
<p>Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to <code>unique(x)</code>, or after the fact, with <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_inorder" data-type="xref">fct_inorder()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">f1 &lt;- factor(x1, levels = unique(x1))
f1
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Dec Apr Jan Mar
f2 &lt;- x1 |&gt; factor() |&gt; fct_inorder()
f2
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Dec Apr Jan Mar</pre>
</div>
<p>If you ever need to access the set of valid levels directly, you can do so with <code><a href="#chp-https://rdrr.io/r/base/levels" data-type="xref">levels()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">levels(f2)
#&gt; [1] "Dec" "Apr" "Jan" "Mar"</pre>
</div>
<p>You can also create a factor when reading your data with readr, using <code><a href="#chp-https://readr.tidyverse.org/reference/parse_factor" data-type="xref">col_factor()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
month,value
Jan,12
Feb,56
Mar,12"
df &lt;- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
df$month
#&gt; [1] Jan Feb Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
</section>
<section id="general-social-survey" data-type="sect1">
<h1>
General Social Survey</h1>
<p>For the rest of this chapter, we’re going to use <code><a href="#chp-https://forcats.tidyverse.org/reference/gss_cat" data-type="xref">forcats::gss_cat</a></code>. It’s a sample of data from the <a href="#chp-https://gss.norc" data-type="xref">General Social Survey</a>, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in <code>gss_cat</code> Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat
#&gt; # A tibble: 21,483 × 9
#&gt; year marital age race rincome partyid relig denom tvhours
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,near r… Prot… Sout… 12
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str re… Prot… Bapt… NA
#&gt; 3 2000 Widowed 67 White Not applicable Independent Prot… No d… 2
#&gt; 4 2000 Never married 39 White Not applicable Ind,near r… Orth… Not … 4
#&gt; 5 2000 Divorced 25 White Not applicable Not str de… None Not … 1
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA
#&gt; # … with 21,477 more rows</pre>
</div>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="#chp-https://forcats.tidyverse.org/reference/gss_cat" data-type="xref">?gss_cat</a></code>.)</p>
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
count(race)
#&gt; # A tibble: 3 × 2
#&gt; race n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Other 1959
#&gt; 2 Black 3129
#&gt; 3 White 16395</pre>
</div>
<p>Or with a bar chart:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(gss_cat, aes(race)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A bar chart showing the distribution of race. There are ~2000 records with race &quot;Other&quot;, 3000 with race &quot;Black&quot;, and over 15,000 with race &quot;White&quot;." width="576"/></p>
</div>
</div>
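<p>Neither view shows levels that never occur in the data. As a sketch, assuming the tidyverse is attached, you can list every level with <code>levels()</code>, or keep empty groups in the tally with the <code>.drop = FALSE</code> argument of <code>count()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# All levels, including any with zero observations
levels(gss_cat$race)

# Keep empty levels in the tally instead of dropping them
gss_cat |&gt;
  count(race, .drop = FALSE)</pre>
</div>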
<p>When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.</p>
<section id="exercise" data-type="sect2">
<h2>
Exercise</h2>
<ol type="1"><li><p>Explore the distribution of <code>rincome</code> (reported income). What makes the default bar chart hard to understand? How could you improve the plot?</p></li>
<li><p>What is the most common <code>relig</code> in this survey? What’s the most common <code>partyid</code>?</p></li>
<li><p>Which <code>relig</code> does <code>denom</code> (denomination) apply to? How can you find out with a table? How can you find out with a visualization?</p></li>
</ol></section>
</section>
<section id="modifying-factor-order" data-type="sect1">
<h1>
Modifying factor order</h1>
<p>It’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">relig_summary &lt;- gss_cat |&gt;
group_by(relig) |&gt;
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
</div>
</div>
<p>It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of <code>relig</code> using <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_reorder" data-type="xref">fct_reorder()</a></code>. <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_reorder" data-type="xref">fct_reorder()</a></code> takes three arguments:</p>
<ul><li>
<code>.f</code>, the factor whose levels you want to modify.</li>
<li>
<code>.x</code>, a numeric vector that you want to use to reorder the levels.</li>
<li>Optionally, <code>.fun</code>, a function that’s used if there are multiple values of <code>.x</code> for each value of <code>.f</code>. The default value is <code>median</code>.</li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
</div>
</div>
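<p>The effect of the summary function is easiest to see on a tiny made-up example (the vectors <code>f</code> and <code>x</code> below are illustrative only). Note that in forcats the summary function is passed by name as <code>.fun</code>. With the default median, level “c” sorts last because of its one large value; with <code>min</code> it moves ahead of “a”.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(forcats)

f &lt;- factor(c("a", "a", "b", "b", "c", "c"))
x &lt;- c(10, 12, 1, 3, 5, 100)

# Default: order levels by the median of x (a = 11, b = 2, c = 52.5)
levels(fct_reorder(f, x))
#&gt; [1] "b" "a" "c"

# Order levels by the minimum of x instead (a = 10, b = 1, c = 5)
levels(fct_reorder(f, x, .fun = min))
#&gt; [1] "b" "c" "a"</pre>
</div>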
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism &amp; Other Eastern religions watch much less.</p>
<p>As you start making more complicated transformations, we recommend moving them out of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">aes()</a></code> and into a separate <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code> step. For example, you could rewrite the plot above as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">relig_summary |&gt;
mutate(
relig = fct_reorder(relig, tvhours)
) |&gt;
ggplot(aes(tvhours, relig)) +
geom_point()</pre>
</div>
<p>What if we create a similar plot looking at how average age varies across reported income level?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rincome_summary &lt;- gss_cat |&gt;
group_by(rincome) |&gt;
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
</div>
</div>
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_reorder" data-type="xref">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
<p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_relevel" data-type="xref">fct_relevel()</a></code>. It takes a factor, <code>.f</code>, and then any number of levels that you want to move to the front of the line.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
</div>
</div>
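<p>On a small made-up factor (illustrative only), the effect of <code>fct_relevel()</code> is easy to see: the named level moves to the front and the remaining levels keep their existing order.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(forcats)

f &lt;- factor(c("a", "b", "c"))
levels(f)
#&gt; [1] "a" "b" "c"

# Move "c" to the front; "a" and "b" keep their relative order
levels(fct_relevel(f, "c"))
#&gt; [1] "c" "a" "b"</pre>
</div>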
<p>Why do you think the average age for “Not applicable” is so high?</p>
<p>Another type of reordering is useful when you are coloring the lines on a plot. <code>fct_reorder2(f, x, y)</code> reorders the factor <code>f</code> by the <code>y</code> values associated with the largest <code>x</code> values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit"># Rearranging the legend makes the plot easier to read because the
# legend colours now match the order of the lines on the far right
# of the plot. You can see some unsurprising patterns: the proportion
# never married decreases with age, married forms an upside down U
# shape, and widowed starts off low but increases steeply after age
# 60.
by_age &lt;- gss_cat |&gt;
filter(!is.na(age)) |&gt;
count(age, marital) |&gt;
group_by(age) |&gt;
mutate(
prop = n / sum(n)
)
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-22-2.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot." width="384"/></p>
</div>
</div>
</div>
</div>
<p>Finally, for bar plots, you can use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_inorder" data-type="xref">fct_infreq()</a></code> to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_rev" data-type="xref">fct_rev()</a></code> if you want them in increasing frequency so that in the bar plot the largest values are on the right, not the left.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
  mutate(marital = marital |&gt; fct_infreq() |&gt; fct_rev()) |&gt;
  ggplot(aes(marital)) +
  geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
</div>
</div>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>There are some suspiciously high numbers in <code>tvhours</code>. Is the mean a good summary?</p></li>
<li><p>For each factor in <code>gss_cat</code> identify whether the order of the levels is arbitrary or principled.</p></li>
<li><p>Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?</p></li>
</ol></section>
</section>
<section id="modifying-factor-levels" data-type="sect1">
<h1>
Modifying factor levels</h1>
<p>More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_recode" data-type="xref">fct_recode()</a></code>. It allows you to recode, or change, the value of each level. For example, take <code>gss_cat$partyid</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; count(partyid)
#&gt; # A tibble: 10 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 No answer 154
#&gt; 2 Don't know 1
#&gt; 3 Other party 393
#&gt; 4 Strong republican 2314
#&gt; 5 Not str republican 3032
#&gt; 6 Ind,near rep 1791
#&gt; # … with 4 more rows</pre>
</div>
<p>The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) |&gt;
  count(partyid)
#&gt; # A tibble: 10 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 No answer 154
#&gt; 2 Don't know 1
#&gt; 3 Other party 393
#&gt; 4 Republican, strong 2314
#&gt; 5 Republican, weak 3032
#&gt; 6 Independent, near rep 1791
#&gt; # … with 4 more rows</pre>
</div>
<p><code><a href="#chp-https://forcats.tidyverse.org/reference/fct_recode" data-type="xref">fct_recode()</a></code> will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.</p>
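<p>As a small illustration of both behaviors, the sketch below recodes a made-up factor (<code>f</code> is hypothetical) and deliberately refers to a level, <code>"z"</code>, that doesn’t exist. You should see a warning about the unknown level, while the unmentioned level <code>"b"</code> is left untouched:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(forcats)

# A toy factor with levels "a" and "b" (hypothetical data)
f &lt;- factor(c("a", "b", "a"))

# "a" is recoded, "b" is left as is, and "z" triggers a warning
# because it is not a level of f
g &lt;- fct_recode(f, "A" = "a", "Z" = "z")
levels(g)
#&gt; [1] "A" "b"</pre>
</div>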
<p>To combine groups, you can assign multiple old levels to the same new level:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  ) |&gt;
  count(partyid)
#&gt; # A tibble: 8 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Other 548
#&gt; 2 Republican, strong 2314
#&gt; 3 Republican, weak 3032
#&gt; 4 Independent, near rep 1791
#&gt; 5 Independent 4119
#&gt; 6 Independent, near dem 2499
#&gt; # … with 2 more rows</pre>
</div>
<p>Use this technique with care: if you group together categories that are truly different you will end up with misleading results.</p>
<p>If you want to collapse a lot of levels, <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_collapse" data-type="xref">fct_collapse()</a></code> is a useful variant of <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_recode" data-type="xref">fct_recode()</a></code>. For each new level, you can provide a vector of old levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |&gt;
  count(partyid)
#&gt; # A tibble: 4 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 other 548
#&gt; 2 rep 5346
#&gt; 3 ind 8409
#&gt; 4 dem 7180</pre>
</div>
<p>Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the <code>fct_lump_*()</code> family of functions. <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">fct_lump_lowfreq()</a></code> is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
  mutate(relig = fct_lump_lowfreq(relig)) |&gt;
  count(relig)
#&gt; # A tibble: 2 × 2
#&gt; relig n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Protestant 10846
#&gt; 2 Other 10637</pre>
</div>
<p>In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">fct_lump_n()</a></code> to specify that we want exactly 10 groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
  mutate(relig = fct_lump_n(relig, n = 10)) |&gt;
  count(relig, sort = TRUE) |&gt;
  print(n = Inf)
#&gt; # A tibble: 10 × 2
#&gt; relig n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 Protestant 10846
#&gt; 2 Catholic 5124
#&gt; 3 None 3523
#&gt; 4 Christian 689
#&gt; 5 Other 458
#&gt; 6 Jewish 388
#&gt; 7 Buddhism 147
#&gt; 8 Inter-nondenominational 109
#&gt; 9 Moslem/islam 104
#&gt; 10 Orthodox-christian 95</pre>
</div>
<p>Read the documentation to learn about <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">fct_lump_min()</a></code> and <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">fct_lump_prop()</a></code>, which are useful in other cases.</p>
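<p>As a rough sketch of two other members of the family, <code>fct_lump_min()</code> lumps levels that appear fewer than <code>min</code> times, and <code>fct_lump_prop()</code> lumps levels that make up no more than a proportion <code>prop</code> of the values. The factor <code>x</code> below is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(forcats)

# Toy data: "a" appears 3 times, "b" twice, "c" and "d" once each
x &lt;- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep levels that appear at least twice
levels(fct_lump_min(x, min = 2))
#&gt; [1] "a"     "b"     "Other"

# Keep levels that make up more than 20% of the values
levels(fct_lump_prop(x, prop = 0.2))
#&gt; [1] "a"     "b"     "Other"</pre>
</div>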
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?</p></li>
<li><p>How could you collapse <code>rincome</code> into a small set of categories?</p></li>
<li><p>Notice there are 9 groups (excluding other) in the <code>fct_lump_n()</code> example above. Why not 10? (Hint: type <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">?fct_lump</a></code>, and find the default for the argument <code>other_level</code> is “Other”.)</p></li>
</ol></section>
</section>
<section id="ordered-factors" data-type="sect1">
<h1>
Ordered factors</h1>
<p>Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with <code><a href="#chp-https://rdrr.io/r/base/factor" data-type="xref">ordered()</a></code>, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use <code>&lt;</code> between the factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ordered(c("a", "b", "c"))
#&gt; [1] a b c
#&gt; Levels: a &lt; b &lt; c</pre>
</div>
<p>In practice, <code><a href="#chp-https://rdrr.io/r/base/factor" data-type="xref">ordered()</a></code> factors behave very similarly to regular factors. There are only two places where you might notice different behavior:</p>
<ul><li>If you map an ordered factor to color or fill in ggplot2, it will default to <code>scale_color_viridis_d()</code>/<code>scale_fill_viridis_d()</code>, a color scale that implies a ranking.</li>
<li>If you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend <code>vignette("contrasts", package = "faux")</code> by Lisa DeBruine.</li>
</ul><p>Given the arguable utility of these differences, we don’t generally recommend using ordered factors.</p>
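<p>If you’d like to see the second difference for yourself, the following base R sketch (with a made-up factor, <code>sizes</code>) compares the default contrasts of an ordered factor with those of a regular factor. It also shows that ordered factors support comparisons such as <code>&lt;</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A toy ordered factor (hypothetical data)
sizes &lt;- ordered(c("low", "high", "mid"), levels = c("low", "mid", "high"))

# Comparisons respect the order of the levels
sizes &lt; "high"
#&gt; [1]  TRUE FALSE  TRUE

# Ordered factors get polynomial contrasts (linear, quadratic, ...)
colnames(contrasts(sizes))
#&gt; [1] ".L" ".Q"

# Regular factors get treatment contrasts, one column per non-baseline level
colnames(contrasts(factor(sizes, ordered = FALSE)))
#&gt; [1] "mid"  "high"</pre>
</div>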
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter introduced you to the handy forcats package for working with factors, and to its most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a challenge involving factors that you haven’t encountered before, we highly recommend skimming the <a href="#chp-https://forcats.tidyverse.org/reference/index" data-type="xref">reference index</a> to see if there’s a canned function that can help solve your problem.</p>
<p>If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, <a href="#chp-https://peerj.com/preprints/3163/" data-type="xref"><em>Wrangling categorical data in R</em></a>. This paper lays out some of the history discussed in <a href="#chp-https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/" data-type="xref"><em>stringsAsFactors: An unauthorized biography</em></a> and <a href="#chp-https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh" data-type="xref"><em>stringsAsFactors = &lt;sigh&gt;</em></a>, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia &amp; Nick!</p>
<p>In the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!</p>
</section>
</section>

932
oreilly/functions.html Normal file
View File

@ -0,0 +1,932 @@
<section data-type="chapter" id="chp-functions">
<h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:</p>
<ol type="1"><li><p>You can give a function an evocative name that makes your code easier to understand.</p></li>
<li><p>As requirements change, you only need to update code in one place, instead of many.</p></li>
<li><p>You eliminate the chance of making incidental mistakes when you copy and paste (e.g., updating a variable name in one place, but not in another).</p></li>
</ol><p>A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you’ll learn about three useful types of functions:</p>
<ul><li>Vector functions take one or more vectors as input and return a vector as output.</li>
<li>Data frame functions take a data frame as input and return a data frame as output.</li>
<li>Plot functions take a data frame as input and return a plot as output.</li>
</ul><p>Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for <a href="#chp-https://twitter.com/hadleywickham/status/1571603361350164486" data-type="xref">general functions</a> and <a href="#chp-https://twitter.com/hadleywickham/status/1574373127349575680" data-type="xref">plotting functions</a> to see even more functions.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>We’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="vector-functions" data-type="sect1">
<h1>
Vector functions</h1>
<p>We’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)
df |&gt; mutate(
  a = (a - min(a, na.rm = TRUE)) /
    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) /
    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) /
    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) /
    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
#&gt; # A tibble: 5 × 4
#&gt; a b c d
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.339 2.59 0.291 0
#&gt; 2 0.880 0 0.611 0.557
#&gt; 3 0 1.37 1 0.752
#&gt; 4 0.795 1.37 0 1
#&gt; 5 1 1.34 0.580 0.394</pre>
</div>
<p>You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an <code>a</code> to a <code>b</code>. Preventing this type of mistake is one very good reason to learn how to write functions.</p>
<section id="writing-a-function" data-type="sect2">
<h2>
Writing a function</h2>
<p>To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary. If we take the code above and pull it outside of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code>, it’s a little easier to see the pattern because each repetition is now one line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)) </pre>
</div>
<p>To make this a bit clearer we can replace the bit that varies with <code>█</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))</pre>
</div>
<p>To turn this into a function you need three things:</p>
<ol type="1"><li><p>A <strong>name</strong>. Here we’ll use <code>rescale01</code> because this function rescales a vector to lie between 0 and 1.</p></li>
<li><p>The <strong>arguments</strong>. The arguments are things that vary across calls, and our analysis above tells us that we have just one. We’ll call it <code>x</code> because this is the conventional name for a numeric vector.</p></li>
<li><p>The <strong>body</strong>. The body is the code that’s repeated across all the calls.</p></li>
</ol><p>Then you create a function by following the template:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">name &lt;- function(arguments) {
  body
}</pre>
</div>
<p>For this case that leads to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}</pre>
</div>
<p>At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01(c(-10, 0, 10))
#&gt; [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
#&gt; [1] 0.00 0.25 0.50 NA 1.00</pre>
</div>
<p>Then you can rewrite the call to <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code> as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
#&gt; # A tibble: 5 × 4
#&gt; a b c d
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.339 1 0.291 0
#&gt; 2 0.880 0 0.611 0.557
#&gt; 3 0 0.530 1 0.752
#&gt; 4 0.795 0.531 0 1
#&gt; 5 1 0.518 0.580 0.394</pre>
</div>
<p>(In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you’ll learn how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">across()</a></code> to reduce the duplication even further so all you need is <code>df |&gt; mutate(across(a:d, rescale01))</code>.)</p>
</section>
<section id="improving-our-function" data-type="sect2">
<h2>
Improving our function</h2>
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work: instead of computing <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">min()</a></code> twice and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">max()</a></code> once, we could compute both the minimum and maximum in one step with <code><a href="#chp-https://rdrr.io/r/base/range" data-type="xref">range()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
  rng &lt;- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}</pre>
</div>
<p>Or you might try this function on a vector that includes an infinite value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1:10, Inf)
rescale01(x)
#&gt; [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre>
</div>
<p>That result is not particularly useful, so we could ask <code><a href="#chp-https://rdrr.io/r/base/range" data-type="xref">range()</a></code> to ignore infinite values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
  rng &lt;- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
#&gt; [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
#&gt; [8] 0.7777778 0.8888889 1.0000000 Inf</pre>
</div>
<p>These changes illustrate an important benefit of functions: because we’ve moved the repeated code into a function, we only need to make the change in one place.</p>
</section>
<section id="mutate-functions" data-type="sect2">
<h2>
Mutate functions</h2>
<p>Now that you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">filter()</a></code> because they return an output of the same length as the input.</p>
<p>Let’s start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">z_score &lt;- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}</pre>
</div>
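<p>As with <code>rescale01()</code>, it’s worth trying a simple input where you can work out the answer by hand. The sketch below repeats the definition so that it runs on its own; the input <code>c(1, 2, 3)</code> has a mean of 2 and a standard deviation of 1, so the Z-scores should be -1, 0, and 1:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Same definition as above, repeated so this example is self-contained
z_score &lt;- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

z_score(c(1, 2, 3))
#&gt; [1] -1  0  1</pre>
</div>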
<p>Or maybe you want to wrap up a straightforward <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">case_when()</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">clamp &lt;- function(x, min, max) {
  case_when(
    x &lt; min ~ min,
    x &gt; max ~ max,
    .default = x
  )
}
clamp(1:10, min = 3, max = 7)
#&gt; [1] 3 3 3 4 5 6 7 7 7 7</pre>
</div>
<p>Or maybe you’d rather mark those values as <code>NA</code>s:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">na_outside &lt;- function(x, min, max) {
  case_when(
    x &lt; min ~ NA,
    x &gt; max ~ NA,
    .default = x
  )
}
na_outside(1:10, min = 3, max = 7)
#&gt; [1] NA NA 3 4 5 6 7 NA NA NA</pre>
</div>
<p>Of course functions don’t just need to work with numeric variables. You might want to extract out some repeated string manipulation. Maybe you need to make the first character upper case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">first_upper &lt;- function(x) {
  str_sub(x, 1, 1) &lt;- str_to_upper(str_sub(x, 1, 1))
  x
}
first_upper("hello")
#&gt; [1] "Hello"</pre>
</div>
<p>Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/NVlabormarket/status/1571939851922198530
clean_number &lt;- function(x) {
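  # Remember which inputs were percentages before stripping the sign
  is_pct &lt;- str_detect(x, "%")
  num &lt;- x |&gt;
    str_remove_all("%") |&gt;
    str_remove_all(",") |&gt;
    str_remove_all(fixed("$")) |&gt;
    as.numeric()
  # Percentages are converted to proportions
  if_else(is_pct, num / 100, num)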
}
clean_number("$12,300")
#&gt; [1] 12300
clean_number("45%")
#&gt; [1] 0.45</pre>
</div>
<p>Sometimes your functions will be highly specialized for one data analysis. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">fix_na &lt;- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}</pre>
</div>
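<p>A quick check on made-up values shows the sentinel codes being replaced. This sketch repeats the definition so it runs on its own, and it assumes dplyr 1.1.0 or later, where <code>if_else()</code> accepts a bare <code>NA</code> alongside a numeric vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Same definition as above, repeated so this example is self-contained
fix_na &lt;- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}

# Hypothetical survey responses where 997 and 999 mean "missing"
fix_na(c(23, 997, 41, 999))
#&gt; [1] 23 NA 41 NA</pre>
</div>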
<p>We’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs. For example, you might want to compute the distance between two locations on the globe using the haversine formula. This requires four vectors:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 &lt;- long1 * pi / 180
  lat1 &lt;- lat1 * pi / 180
  long2 &lt;- long2 * pi / 180
  lat2 &lt;- lat2 * pi / 180

  R &lt;- 6371 # Earth mean radius in km
  a &lt;- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  d &lt;- R * 2 * asin(sqrt(a))

  round(d, round)
}</pre>
</div>
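<p>One way to sanity check a function like this is to feed it a case with a known answer. In the sketch below (which repeats the definition so it runs on its own), one degree of latitude along a meridian should be roughly 111 km, and the distance from a point to itself should be zero. The coordinates are made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Same definition as above, repeated so this example is self-contained
haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
  # convert to radians
  long1 &lt;- long1 * pi / 180
  lat1 &lt;- lat1 * pi / 180
  long2 &lt;- long2 * pi / 180
  lat2 &lt;- lat2 * pi / 180

  R &lt;- 6371 # Earth mean radius in km
  a &lt;- sin((lat2 - lat1) / 2)^2 +
    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
  d &lt;- R * 2 * asin(sqrt(a))

  round(d, round)
}

# One degree of latitude along the same meridian, in km
haversine(long1 = 0, lat1 = 0, long2 = 0, lat2 = 1)
#&gt; [1] 111.195

# The function is vectorised: each position is one pair of locations
haversine(
  long1 = c(0, 10), lat1 = c(0, 20),
  long2 = c(0, 10), lat2 = c(1, 20)
)
#&gt; [1] 111.195   0.000</pre>
</div>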
</section>
<section id="summary-functions" data-type="sect2">
<h2>
Summary functions</h2>
<p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarize()</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">commas &lt;- function(x) {
  str_flatten(x, collapse = ", ", last = " and ")
}
commas(c("cat", "dog", "pigeon"))
#&gt; [1] "cat, dog and pigeon"</pre>
</div>
<p>Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cv &lt;- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
cv(runif(100, min = 0, max = 50))
#&gt; [1] 0.5196276
cv(runif(100, min = 0, max = 500))
#&gt; [1] 0.5652554</pre>
</div>
<p>Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/gbganalyst/status/1571619641390252033
n_missing &lt;- function(x) {
  sum(is.na(x))
}</pre>
</div>
<p>You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute percentage error to help you compare model predictions with actual values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/neilgcurrie/status/1571607727255834625
mape &lt;- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual)
}</pre>
</div>
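<p>To make the calculation concrete, here is a small worked example with made-up numbers. Both predictions are off by 10% of the actual value, so the result should be 0.1. The definition is repeated so the sketch runs on its own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Same definition as above, repeated so this example is self-contained
mape &lt;- function(actual, predicted) {
  sum(abs((actual - predicted) / actual)) / length(actual)
}

# |100 - 110| / 100 = 0.1 and |200 - 180| / 200 = 0.1, so the mean is 0.1
mape(actual = c(100, 200), predicted = c(110, 180))
#&gt; [1] 0.1</pre>
</div>
<p>Note that this version returns <code>NA</code> if either input contains a missing value, and it divides by <code>actual</code>, so it isn’t suitable when actual values can be zero.</p>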
<div data-type="note"><h1>
RStudio
</h1>
<p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you’ve written, place the cursor on the name of the function and press <code>F2</code>.</p></li>
<li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li>
</ul></div>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mean(is.na(x))
mean(is.na(y))
mean(is.na(z))
x / sum(x, na.rm = TRUE)
y / sum(y, na.rm = TRUE)
z / sum(z, na.rm = TRUE)
round(x / sum(x, na.rm = TRUE) * 100, 1)
round(y / sum(y, na.rm = TRUE) * 100, 1)
round(z / sum(z, na.rm = TRUE) * 100, 1)</pre>
</div>
</li>
<li><p>In the second variant of <code>rescale01()</code>, infinite values are left unchanged. Can you rewrite <code>rescale01()</code> so that <code>-Inf</code> is mapped to 0, and <code>Inf</code> is mapped to 1?</p></li>
<li><p>Given a vector of birthdates, write a function to compute the age in years.</p></li>
<li><p>Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as <span class="math display">\[
\mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x}) ^2 \text{,}
\]</span> where <span class="math inline">\(\bar{x} = (\sum_{i=1}^n x_i) / n\)</span> is the sample mean. Skewness is defined as <span class="math display">\[
\mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
\]</span></p></li>
<li><p>Write <code>both_na()</code>, a summary function that takes two vectors of the same length and returns the number of positions that have an <code>NA</code> in both vectors.</p></li>
<li>
<p>Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">is_directory &lt;- function(x) file.info(x)$isdir
is_readable &lt;- function(x) file.access(x, 4) == 0</pre>
</div>
</li>
</ol></section>
</section>
<section id="data-frame-functions" data-type="sect1">
<h1>
Data frame functions</h1>
<p>Vector functions are useful for pulling out code that's repeated within a dplyr verb. But you'll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or vector.</p>
<p>To let you write a function that uses dplyr verbs, we'll first introduce you to the challenge of indirection and how you can overcome it with embracing, <code>{{ }}</code>. With this theory under your belt, we'll then show you a bunch of examples to illustrate what you might do with it.</p>
<section id="indirection-and-tidy-evaluation" data-type="sect2">
<h2>
Indirection and tidy evaluation</h2>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let's illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> the unique (distinct) values of a variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">pull_unique &lt;- function(df, var) {
  df |&gt;
    distinct(var) |&gt;
    pull(var)
}</pre>
</div>
<p>If we try and use it, we get an error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; pull_unique(clarity)
#&gt; Error in `distinct()` at dplyr/R/pull.R:38:2:
#&gt; ! Must use existing variables.
#&gt; ✖ `var` not found in `.data`.</pre>
</div>
<p>To make the problem a bit more clear we can use a made up data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(var = "var", x = "x", y = "y")
df |&gt; pull_unique(x)
#&gt; [1] "var"
df |&gt; pull_unique(y)
#&gt; [1] "var"</pre>
</div>
<p>Regardless of how we call <code>pull_unique()</code> it always does <code>df |&gt; distinct(var) |&gt; pull(var)</code>, instead of <code>df |&gt; distinct(x) |&gt; pull(x)</code> or <code>df |&gt; distinct(y) |&gt; pull(y)</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> not to treat <code>var</code> as the name of a variable, but instead look inside <code>var</code> for the variable we actually want to use.</p>
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what's happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
<p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">pull_unique &lt;- function(df, var) {
  df |&gt;
    distinct({{ var }}) |&gt;
    pull({{ var }})
}
diamonds |&gt; pull_unique(clarity)
#&gt; [1] SI2 SI1 VS1 VS2 VVS2 VVS1 I1 IF
#&gt; Levels: I1 &lt; SI2 &lt; SI1 &lt; VS2 &lt; VS1 &lt; VVS2 &lt; VVS1 &lt; IF</pre>
</div>
<p>Success!</p>
</section>
<section id="sec-embracing" data-type="sect2">
<h2>
When to embrace?</h2>
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:</p>
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> that compute with variables.</p></li>
<li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
</ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p>
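<p>As a minimal sketch of that distinction, the two helpers below embrace one argument each. The names <code>keep_cols()</code> and <code>add_ratio()</code> are hypothetical and only for illustration: the first passes its argument to a tidy-selection function, the second to a data-masking function.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical helper: select() uses tidy-selection, so `cols` can be
# anything you could type inside select(), such as a range of columns.
keep_cols &lt;- function(df, cols) {
  df |&gt; select({{ cols }})
}

# Hypothetical helper: mutate() uses data-masking, so `expr` can be
# any computation on the columns of `df`.
add_ratio &lt;- function(df, expr) {
  df |&gt; mutate(ratio = {{ expr }})
}

mtcars |&gt; keep_cols(mpg:disp) |&gt; names()
#&gt; [1] "mpg"  "cyl"  "disp"
mtcars |&gt; add_ratio(hp / wt) |&gt; head(3)</pre>
</div>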
<p>In the following sections we'll explore the sorts of handy functions you might write once you understand embracing.</p>
</section>
<section id="common-use-cases" data-type="sect2">
<h2>
Common use cases</h2>
<p>If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summary6 &lt;- function(data, var) {
  data |&gt; summarise(
    min = min({{ var }}, na.rm = TRUE),
    mean = mean({{ var }}, na.rm = TRUE),
    median = median({{ var }}, na.rm = TRUE),
    max = max({{ var }}, na.rm = TRUE),
    n = n(),
    n_miss = sum(is.na({{ var }})),
    .groups = "drop"
  )
}
diamonds |&gt; summary6(carat)
#&gt; # A tibble: 1 × 6
#&gt; min mean median max n n_miss
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.2 0.798 0.7 5.01 53940 0</pre>
</div>
<p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> in a helper, we think it's good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is that, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, you can use it on grouped data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  group_by(cut) |&gt;
  summary6(carat)
#&gt; # A tibble: 5 × 7
#&gt; cut min mean median max n n_miss
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Fair 0.22 1.05 1 5.01 1610 0
#&gt; 2 Good 0.23 0.849 0.82 3.01 4906 0
#&gt; 3 Very Good 0.2 0.806 0.71 4 12082 0
#&gt; 4 Premium 0.2 0.892 0.86 4.01 13791 0
#&gt; 5 Ideal 0.2 0.703 0.54 3.5 21551 0</pre>
</div>
<p>Because the arguments to <code>summarise()</code> are data-masking, the <code>var</code> argument to <code>summary6()</code> is data-masking too. That means you can also summarize computed variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  group_by(cut) |&gt;
  summary6(log10(carat))
#&gt; # A tibble: 5 × 7
#&gt; cut min mean median max n n_miss
#&gt; &lt;ord&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Fair -0.658 -0.0273 0 0.700 1610 0
#&gt; 2 Good -0.638 -0.133 -0.0862 0.479 4906 0
#&gt; 3 Very Good -0.699 -0.164 -0.149 0.602 12082 0
#&gt; 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0
#&gt; 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre>
</div>
<p>To summarize multiple variables you'll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you'll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/Diabb6/status/1571635146658402309
count_prop &lt;- function(df, var, sort = FALSE) {
  df |&gt;
    count({{ var }}, sort = sort) |&gt;
    mutate(prop = n / sum(n))
}
diamonds |&gt; count_prop(clarity)
#&gt; # A tibble: 8 × 3
#&gt; clarity n prop
#&gt; &lt;ord&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 I1 741 0.0137
#&gt; 2 SI2 9194 0.170
#&gt; 3 SI1 13065 0.242
#&gt; 4 VS2 12258 0.227
#&gt; 5 VS1 8171 0.151
#&gt; 6 VVS2 5066 0.0939
#&gt; # … with 2 more rows</pre>
</div>
<p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>, and only <code>var</code> needs to be embraced because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> which uses data-masking for all variables in <code>...</code>.</p>
<p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:</p>
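<p>To make the difference between the arguments concrete, here is a short usage sketch. It repeats the definition of <code>count_prop()</code> so that it runs on its own, and assumes only that dplyr and ggplot2 (for <code>diamonds</code>) are installed.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(ggplot2) # only needed for the diamonds dataset

count_prop &lt;- function(df, var, sort = FALSE) {
  df |&gt;
    count({{ var }}, sort = sort) |&gt;
    mutate(prop = n / sum(n))
}

# `sort` is an ordinary argument, so it is passed along without embracing
diamonds |&gt; count_prop(cut, sort = TRUE)

# `var` is data-masking, so a computed expression is accepted too
diamonds |&gt; count_prop(carat &gt; 1)</pre>
</div>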
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">unique_where &lt;- function(df, condition, var) {
  df |&gt;
    filter({{ condition }}) |&gt;
    distinct({{ var }}) |&gt;
    arrange({{ var }}) |&gt;
    pull({{ var }})
}
# Find all the destinations in December
flights |&gt; unique_where(month == 12, dest)
#&gt; [1] "ABQ" "ALB" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA" "BOS" "BQN" "BTV"
#&gt; [13] "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHS" "CLE" "CLT" "CMH" "CVG" "DAY"
#&gt; [25] "DCA" "DEN" "DFW" "DSM" "DTW" "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN"
#&gt; [37] "HNL" "HOU" "IAD" "IAH" "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LGB" "MCI"
#&gt; [49] "MCO" "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "OAK" "OKC"
#&gt; [61] "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE" "PSP" "PVD" "PWM"
#&gt; [73] "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN" "SDF" "SEA" "SFO" "SJC"
#&gt; [85] "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT" "SYR" "TPA" "TUL" "TYS" "XNA"
# Which months did plane N14228 fly in?
flights |&gt; unique_where(tailnum == "N14228", month)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 12</pre>
</div>
<p>Here we embrace <code>condition</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it's passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>.</p>
<p>We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_sub &lt;- function(rows, cols) {
  flights |&gt;
    filter({{ rows }}) |&gt;
    select(time_hour, carrier, flight, {{ cols }})
}
flights_sub(dest == "IAH", contains("time"))
#&gt; # A tibble: 7,198 × 8
#&gt; time_hour carrier flight dep_time sched_de…¹ arr_t…² sched…³ air_t…⁴
#&gt; &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 2013-01-01 05:00:00 UA 1545 517 515 830 819 227
#&gt; 2 2013-01-01 05:00:00 UA 1714 533 529 850 830 227
#&gt; 3 2013-01-01 06:00:00 UA 496 623 627 933 932 229
#&gt; 4 2013-01-01 07:00:00 UA 473 728 732 1041 1038 238
#&gt; 5 2013-01-01 07:00:00 UA 1479 739 739 1104 1038 249
#&gt; 6 2013-01-01 09:00:00 UA 1220 908 908 1228 1219 233
#&gt; # … with 7,192 more rows, and abbreviated variable names ¹sched_dep_time,
#&gt; # ²arr_time, ³sched_arr_time, ⁴air_time</pre>
</div>
</section>
<section id="data-masking-vs-tidy-selection" data-type="sect2">
<h2>
Data-masking vs tidy-selection</h2>
<p>Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a <code>count_missing()</code> that counts the number of missing observations in rows. You might try writing something like:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by({{ group_vars }}) |&gt;
    summarise(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
  count_missing(c(year, month, day), dep_time)
#&gt; Error in `group_by()` at dplyr/R/summarise.R:127:2:
#&gt; In argument: `..1 = c(year, month, day)`.
#&gt; Caused by error:
#&gt; ! `..1` must be size 336776 or 1, not 1010328.</pre>
</div>
<p>This doesn't work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code>, which allows you to use tidy-selection inside data-masking functions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by(pick({{ group_vars }})) |&gt;
    summarise(n_miss = sum(is.na({{ x_var }})))
}
flights |&gt;
  count_missing(c(year, month, day), dep_time)
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using the
#&gt; `.groups` argument.
#&gt; # A tibble: 365 × 4
#&gt; # Groups: year, month [12]
#&gt; year month day n_miss
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 4
#&gt; 2 2013 1 2 8
#&gt; 3 2013 1 3 10
#&gt; 4 2013 1 4 6
#&gt; 5 2013 1 5 3
#&gt; 6 2013 1 6 1
#&gt; # … with 359 more rows</pre>
</div>
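<p>The output above still carries the grouping message and comes back grouped by <code>year</code> and <code>month</code>. One possible variation, following the earlier advice about helpers that wrap <code>summarise()</code>, is to set <code>.groups = "drop"</code>. This is a sketch rather than the only reasonable design, and it assumes a version of dplyr that provides <code>pick()</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(nycflights13)

count_missing &lt;- function(df, group_vars, x_var) {
  df |&gt;
    group_by(pick({{ group_vars }})) |&gt;
    summarise(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop" # no message, and the result is ungrouped
    )
}

flights |&gt;
  count_missing(c(year, month, day), dep_time)</pre>
</div>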
<p>Another convenient use of <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> is to make a 2d table of counts. Here we count using all the variables in the <code>rows</code> and <code>columns</code>, then use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to rearrange the counts into a grid:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/pollicipes/status/1571606508944719876
count_wide &lt;- function(data, rows, cols) {
  data |&gt;
    count(pick(c({{ rows }}, {{ cols }}))) |&gt;
    pivot_wider(
      names_from = {{ cols }},
      values_from = n,
      names_sort = TRUE,
      values_fill = 0
    )
}
diamonds |&gt; count_wide(clarity, cut)
#&gt; # A tibble: 8 × 6
#&gt; clarity Fair Good `Very Good` Premium Ideal
#&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 210 96 84 205 146
#&gt; 2 SI2 466 1081 2100 2949 2598
#&gt; 3 SI1 408 1560 3240 3575 4282
#&gt; 4 VS2 261 978 2591 3357 5071
#&gt; 5 VS1 170 648 1775 1989 3589
#&gt; 6 VVS2 69 286 1235 870 2606
#&gt; # … with 2 more rows
diamonds |&gt; count_wide(c(clarity, color), cut)
#&gt; # A tibble: 56 × 7
#&gt; clarity color Fair Good `Very Good` Premium Ideal
#&gt; &lt;ord&gt; &lt;ord&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 I1 D 4 8 5 12 13
#&gt; 2 I1 E 9 23 22 30 18
#&gt; 3 I1 F 35 19 13 34 42
#&gt; 4 I1 G 53 19 16 46 16
#&gt; 5 I1 H 52 14 12 46 38
#&gt; 6 I1 I 34 9 8 24 17
#&gt; # … with 50 more rows</pre>
</div>
<p>While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> docs you can see that <code>names_from</code> uses tidy-selection.</p>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Using the datasets from nycflights13, write functions that:</p>
<ol type="1"><li>
<p>Find all flights that were cancelled (i.e. <code>is.na(arr_time)</code>) or delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; filter_severe()</pre>
</div>
</li>
<li>
<p>Counts the number of cancelled flights and the number of flights delayed by more than an hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; group_by(dest) |&gt; summarise_severe()</pre>
</div>
</li>
<li>
<p>Finds all flights that were cancelled or delayed by more than a user supplied number of hours:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; filter_severe(hours = 2)</pre>
</div>
</li>
<li>
<p>Summarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">weather |&gt; summarise_weather(temp)</pre>
</div>
</li>
<li>
<p>Converts the user supplied variable that uses clock time (e.g. <code>dep_time</code>, <code>arr_time</code>, etc) into a decimal time (i.e. hours + minutes / 60).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; standardise_time(sched_dep_time)</pre>
</div>
</li>
</ol></li>
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
<li>
<p>Generalize the following function so that you can supply any number of variables to count.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_prop &lt;- function(df, var, sort = FALSE) {
  df |&gt;
    count({{ var }}, sort = sort) |&gt;
    mutate(prop = n / sum(n))
}</pre>
</div>
</li>
</ol></section>
</section>
<section id="plot-functions" data-type="sect1">
<h1>
Plot functions</h1>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you're making a lot of histograms:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.1)
diamonds |&gt;
  ggplot(aes(carat)) +
  geom_histogram(binwidth = 0.05)</pre>
</div>
<p>Wouldn't it be nice if you could wrap this up into a histogram function? This is easy once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function, which means you need to embrace:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth = NULL) {
  df |&gt;
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth)
}
diamonds |&gt; histogram(carat, 0.1)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-46-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Note that <code>histogram()</code> returns a ggplot2 plot, so that you can still add on additional components if you want. Just remember to switch from <code>|&gt;</code> to <code>+</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
  histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-47-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<section id="more-variables" data-type="sect2">
<h2>
More variables</h2>
<p>It's straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/tyler_js_smith/status/1574377116988104704
linearity_check &lt;- function(df, x, y) {
  df |&gt;
    ggplot(aes({{ x }}, {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", color = "red", se = FALSE) +
    geom_smooth(method = "lm", color = "blue", se = FALSE)
}
starwars |&gt;
  filter(mass &lt; 1000) |&gt;
  linearity_check(mass, height)
#&gt; `geom_smooth()` using formula = 'y ~ x'
#&gt; `geom_smooth()` using formula = 'y ~ x'</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-48-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/ppaxisa/status/1574398423175921665
hex_plot &lt;- function(df, x, y, z, bins = 20, fun = "mean") {
  df |&gt;
    ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) +
    stat_summary_hex(
      aes(colour = after_scale(fill)), # make border same colour as fill
      bins = bins,
      fun = fun
    )
}
diamonds |&gt; hex_plot(carat, price, depth)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-49-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
</section>
<section id="combining-with-dplyr" data-type="sect2">
<h2>
Combining with dplyr</h2>
<p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code>. Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sorted_bars &lt;- function(df, var) {
  df |&gt;
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |&gt;
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}
diamonds |&gt; sorted_bars(cut)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-50-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Or maybe you want to make it easy to draw a bar plot just for a subset of the data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">conditional_bars &lt;- function(df, condition, var) {
  df |&gt;
    filter({{ condition }}) |&gt;
    ggplot(aes({{ var }})) +
    geom_bar()
}
diamonds |&gt; conditional_bars(cut == "Good", clarity)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-51-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
fancy_ts &lt;- function(df, val, group) {
  labs &lt;- df |&gt;
    group_by({{ group }}) |&gt;
    summarize(breaks = max({{ val }}))
  df |&gt;
    ggplot(aes(date, {{ val }}, group = {{ group }}, color = {{ group }})) +
    geom_path() +
    scale_y_continuous(
      breaks = labs$breaks,
      labels = scales::label_comma(),
      minor_breaks = NULL,
      guide = guide_axis(position = "right")
    )
}
df &lt;- tibble(
  dist1 = sort(rnorm(50, 5, 2)),
  dist2 = sort(rnorm(50, 8, 3)),
  dist4 = sort(rnorm(50, 15, 1)),
  date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
)
df &lt;- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
fancy_ts(df, value, dist_name)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-52-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>Next we'll discuss two more complicated cases: faceting and automatic labeling.</p>
</section>
<section id="faceting" data-type="sect2">
<h2>
Faceting</h2>
<p>Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/sharoz/status/1574376332821204999
foo &lt;- function(x) {
  ggplot(mtcars, aes(mpg, disp)) +
    geom_point() +
    facet_wrap(vars({{ x }}))
}
foo(cyl)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-53-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
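<p>The same <code>vars()</code> syntax should carry over to <code>facet_grid()</code>, which takes separate <code>rows</code> and <code>cols</code> arguments. The helper below is a hypothetical sketch (the name <code>facet_scatter()</code> and its arguments are ours, not part of ggplot2):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Hypothetical helper: facet by one variable in rows and another in columns
facet_scatter &lt;- function(df, x, y, rows, cols) {
  df |&gt;
    ggplot(aes({{ x }}, {{ y }})) +
    geom_point() +
    facet_grid(rows = vars({{ rows }}), cols = vars({{ cols }}))
}

mtcars |&gt; facet_scatter(mpg, disp, cyl, am)</pre>
</div>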
<p>As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of <code>carat</code> from the diamonds dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/yutannihilat_en/status/1574387230025875457
density &lt;- function(colour, facets, binwidth = 0.1) {
  diamonds |&gt;
    ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}
density()
density(cut)
density(cut, clarity)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-1.png" class="img-fluid" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-2.png" class="img-fluid" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-54-3.png" class="img-fluid" width="576"/></p>
</div>
</div>
</section>
<section id="labeling" data-type="sect2">
<h2>
Labeling</h2>
<p>Remember the histogram function we showed you earlier?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth = NULL) {
  df |&gt;
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth)
}</pre>
</div>
<p>Wouldn't it be nice if we could label the output with the variable and the bin width that was used? To do so, we're going to have to go under the covers of tidy evaluation and use a function from a package we haven't talked about before: rlang. rlang is a low-level package that's used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth) {
  label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
  df |&gt;
    ggplot(aes({{ var }})) +
    geom_histogram(binwidth = binwidth) +
    labs(title = label)
}
diamonds |&gt; histogram(carat, 0.1)</pre>
<div class="cell-output-display">
<p><img src="functions_files/figure-html/unnamed-chunk-56-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<p>You can use the same approach any other place that you might supply a string in a ggplot2 plot.</p>
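<p>For instance, a scatterplot helper might build its title and an axis label in the same way. This is a small sketch under our own naming (<code>labelled_scatter()</code> is hypothetical); the only rlang function it relies on is <code>englue()</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Hypothetical helper: reuse englue() for the title and the x-axis label
labelled_scatter &lt;- function(df, x, y) {
  title &lt;- rlang::englue("{{ y }} versus {{ x }}")
  x_label &lt;- rlang::englue("{{ x }} (as recorded)")
  df |&gt;
    ggplot(aes({{ x }}, {{ y }})) +
    geom_point() +
    labs(title = title, x = x_label)
}

mtcars |&gt; labelled_scatter(wt, mpg)</pre>
</div>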
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Build up a rich plotting function by incrementally implementing each of the steps below.
<ol type="1"><li><p>Draw a scatterplot given dataset and <code>x</code> and <code>y</code> variables.</p></li>
<li><p>Add a line of best fit (i.e. a linear model with no standard errors).</p></li>
<li><p>Add a title.</p></li>
</ol></li>
</ol></section>
</section>
<section id="style" data-type="sect1">
<h1>
Style</h1>
<p>R doesn't care what your function or arguments are called, but the names make a big difference for humans. Ideally, the name of your function will be short but clearly evoke what the function does. That's hard! But it's better to be clear than short, as RStudio's autocomplete makes it easy to type long names.</p>
<p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">mean()</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="#chp-https://rdrr.io/r/stats/coef" data-type="xref">coef()</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don't be afraid to rename a function if you figure out a better name later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Too short
f()
# Not a verb, or descriptive
my_awesome_function()
# Long, but clear
impute_missing()
collapse_years()</pre>
</div>
<p>R also doesn't care about how you use white space in your functions, but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Missing extra two spaces
pull_unique &lt;- function(df, var) {
df |&gt;
  distinct({{ var }}) |&gt;
  pull({{ var }})
}
# Pipe indented incorrectly
pull_unique &lt;- function(df, var) {
  df |&gt;
  distinct({{ var }}) |&gt;
  pull({{ var }})
}
# Missing {} and all one line
pull_unique &lt;- function(df, var) df |&gt; distinct({{ var }}) |&gt; pull({{ var }})</pre>
</div>
<p>As you can see, we recommend putting extra spaces inside of <code>{{ }}</code>. This makes it very obvious that something unusual is happening.</p>
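<p>For contrast, here is a sketch of how the same <code>pull_unique()</code> function might look when it follows these rules: the body sits inside <code>{}</code>, is indented by two spaces, and each continuation of the pipe is indented by a further two spaces.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Body inside {}, indented two spaces; pipe steps indented two more
pull_unique &lt;- function(df, var) {
  df |&gt;
    distinct({{ var }}) |&gt;
    pull({{ var }})
}</pre>
</div>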
<section id="exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">f1 &lt;- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}
f3 &lt;- function(x, y) {
  rep(y, length.out = length(x))
}</pre>
</div>
</li>
<li><p>Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li>
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc. would be better than <code><a href="#chp-https://rdrr.io/r/stats/Normal" data-type="xref">rnorm()</a></code>, <code><a href="#chp-https://rdrr.io/r/stats/Normal" data-type="xref">dnorm()</a></code>. Make a case for the opposite.</p></li>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully got your creative juices flowing and gave you some ideas for where functions might help your analysis code.</p>
<p>We have only shown you the bare minimum to get started with functions and there's much more to learn. A few places to learn more are:</p>
<ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="#chp-https://dplyr.tidyverse.org/articles/programming" data-type="xref">programming with dplyr</a> and <a href="#chp-https://tidyr.tidyverse.org/articles/programming" data-type="xref">programming with tidyr</a> and learn more about the theory in <a href="#chp-https://rlang.r-lib.org/reference/topic-data-mask" data-type="xref">the rlang topic on data masking</a>.</li>
<li>To learn more about reducing duplication in your ggplot2 code, read the <a href="#chp-https://ggplot2-book.org/programming" class="uri" data-type="xref">Programming with ggplot2</a> chapter of the ggplot2 book.</li>
<li>For more advice on function style, see the <a href="#chp-https://style.tidyverse.org/functions" class="uri" data-type="xref">tidyverse style guide</a>.</li>
</ul><p>In the next chapter, we'll dive into some of the details of R's vector data structures that we've omitted so far. These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration, which gives you further tools for reducing code duplication.</p>
</section>
</section>

297
oreilly/intro.html Normal file

File diff suppressed because one or more lines are too long

1092
oreilly/iteration.html Normal file

File diff suppressed because it is too large Load Diff

972
oreilly/joins.html Normal file
View File

@ -0,0 +1,972 @@
<section data-type="chapter" id="chp-joins">
<h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>It's rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must <strong>join</strong> them together to answer the questions that you're interested in. This chapter will introduce you to two important types of joins:</p>
<ul><li>Mutating joins, which add new variables to one data frame from matching observations in another.</li>
<li>Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.</li>
</ul><p>We'll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we'll discuss how joins work, focusing on their action on the rows. We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we'll explore the five related datasets from nycflights13 using the join functions from dplyr.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="keys" data-type="sect1">
<h1>
Keys</h1>
<p>To understand joins, you need to first understand how two tables can be connected through a pair of keys, with one on each table. In this section, you'll learn about the two types of key and see examples of both in the datasets of the nycflights13 package. You'll also learn how to check that your keys are valid, and what to do if your table lacks a key.</p>
<section id="primary-and-foreign-keys" data-type="sect2">
<h2>
Primary and foreign keys</h2>
<p>Every join involves a pair of keys: a primary key and a foreign key. A <strong>primary key</strong> is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a <strong>compound key</strong>. For example, in nycflights13:</p>
<ul><li>
<p><code>airlines</code> records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two-letter carrier code, making <code>carrier</code> the primary key.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airlines
#&gt; # A tibble: 16 × 2
#&gt; carrier name
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 9E Endeavor Air Inc.
#&gt; 2 AA American Airlines Inc.
#&gt; 3 AS Alaska Airlines Inc.
#&gt; 4 B6 JetBlue Airways
#&gt; 5 DL Delta Air Lines Inc.
#&gt; 6 EV ExpressJet Airlines Inc.
#&gt; # … with 10 more rows</pre>
</div>
</li>
<li>
<p><code>airports</code> records data about each airport. You can identify each airport by its three-letter airport code, making <code>faa</code> the primary key.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports
#&gt; # A tibble: 1,458 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/Ne…
#&gt; 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Ch…
#&gt; 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Ch…
#&gt; 4 06N Randall Airport 41.4 -74.4 523 -5 A America/Ne…
#&gt; 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/Ne…
#&gt; 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/Ne…
#&gt; # … with 1,452 more rows</pre>
</div>
</li>
<li>
<p><code>planes</code> records data about each plane. You can identify a plane by its tail number, making <code>tailnum</code> the primary key.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes
#&gt; # A tibble: 3,322 × 9
#&gt; tailnum year type manuf…¹ model engines seats speed engine
#&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 2 N102UW 1998 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 3 N103US 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 4 N104UW 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-… 2 55 NA Turbo…
#&gt; 6 N105UW 1999 Fixed wing multi engine AIRBUS… A320… 2 182 NA Turbo…
#&gt; # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre>
</div>
</li>
<li>
<p><code>weather</code> records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making <code>origin</code> and <code>time_hour</code> the compound primary key.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">weather
#&gt; # A tibble: 26,115 × 15
#&gt; origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
#&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
#&gt; 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
#&gt; 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
#&gt; 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
#&gt; 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
#&gt; 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
#&gt; # … with 26,109 more rows, and 4 more variables: precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
</div>
</li>
</ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p>
<ul><li>
<code>flights$tailnum</code> is a foreign key that corresponds to the primary key <code>planes$tailnum</code>.</li>
<li>
<code>flights$carrier</code> is a foreign key that corresponds to the primary key <code>airlines$carrier</code>.</li>
<li>
<code>flights$origin</code> is a foreign key that corresponds to the primary key <code>airports$faa</code>.</li>
<li>
<code>flights$dest</code> is a foreign key that corresponds to the primary key <code>airports$faa</code> .</li>
<li>
<code>flights$origin</code>-<code>flights$time_hour</code> is a compound foreign key that corresponds to the compound primary key <code>weather$origin</code>-<code>weather$time_hour</code>.</li>
</ul><p>These relationships are summarized visually in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-join-closest"><p><img src="diagrams/relational.png" alt="The relationships between airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa connected to the flights$origin and flights$dest. planes$tailnum is connected to the flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between airports, planes, airlines, and weather data frames." width="502"/></p>
<figcaption>Figure 19.1: Connections between all five data frames in the nycflights13 package. Variables making up a primary key are coloured grey, and are connected to their corresponding foreign keys with arrows.</figcaption>
</figure>
</div>
</div>
<p>You'll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you'll see shortly, will make your joining life much easier. It's also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There's only one exception: <code>year</code> means year of departure in <code>flights</code> and year of manufacture in <code>planes</code>. This will become important when we start actually joining tables together.</p>
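<p>To make the primary/foreign key relationship concrete, here is a small sketch using two made-up tables (<code>airlines_toy</code> and <code>flights_toy</code> are hypothetical and not part of nycflights13). Every value of a foreign key should appear in the corresponding primary key; the filter below lists the rows where that is not the case:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical toy data: carrier is the primary key of airlines_toy
airlines_toy &lt;- tibble(
  carrier = c("AA", "UA"),
  name = c("American", "United")
)
# ... and a foreign key in flights_toy
flights_toy &lt;- tibble(
  flight = c(1, 2, 3),
  carrier = c("UA", "AA", "ZZ")
)

# Rows whose foreign key has no matching primary key
unmatched &lt;- flights_toy |&gt;
  filter(!carrier %in% airlines_toy$carrier)
unmatched</pre>
</div>
<p>Here only the flight with the made-up carrier code <code>"ZZ"</code> is returned, because that code does not appear in <code>airlines_toy</code>.</p>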
</section>
<section id="checking-primary-keys" data-type="sect2">
<h2>
Checking primary keys</h2>
<p>Now that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">count()</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt;
  count(tailnum) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 0 × 2
#&gt; # … with 2 variables: tailnum &lt;chr&gt;, n &lt;int&gt;
weather |&gt;
  count(time_hour, origin) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 0 × 3
#&gt; # … with 3 variables: time_hour &lt;dttm&gt;, origin &lt;chr&gt;, n &lt;int&gt;</pre>
</div>
<p>You should also check for missing values in your primary keys: if a value is missing then it can't identify an observation!</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt;
  filter(is.na(tailnum))
#&gt; # A tibble: 0 × 9
#&gt; # … with 9 variables: tailnum &lt;chr&gt;, year &lt;int&gt;, type &lt;chr&gt;,
#&gt; # manufacturer &lt;chr&gt;, model &lt;chr&gt;, engines &lt;int&gt;, seats &lt;int&gt;, speed &lt;int&gt;,
#&gt; # engine &lt;chr&gt;
weather |&gt;
  filter(is.na(time_hour) | is.na(origin))
#&gt; # A tibble: 0 × 15
#&gt; # … with 15 variables: origin &lt;chr&gt;, year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;,
#&gt; # hour &lt;int&gt;, temp &lt;dbl&gt;, dewp &lt;dbl&gt;, humid &lt;dbl&gt;, wind_dir &lt;dbl&gt;,
#&gt; # wind_speed &lt;dbl&gt;, wind_gust &lt;dbl&gt;, precip &lt;dbl&gt;, pressure &lt;dbl&gt;,
#&gt; # visib &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
</div>
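<p>If you find yourself repeating these two checks, they can be wrapped into a single function. The following is only a sketch: the name <code>has_valid_key()</code> is hypothetical, and it assumes the data frame has no existing column called <code>n</code>, since <code>count()</code> would then pick a different name for its count column.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical helper: TRUE if the given columns have no duplicated
# combinations and no missing values
has_valid_key &lt;- function(df, ...) {
  # Key combinations that occur more than once
  duplicated_keys &lt;- df |&gt;
    count(...) |&gt;
    filter(n &gt; 1)
  # Any missing value in any of the key columns
  key_has_na &lt;- any(is.na(select(df, ...)))

  nrow(duplicated_keys) == 0 &amp;&amp; !key_has_na
}

toy &lt;- tibble(id = c(1, 2, 3), grp = c("a", "a", NA))
has_valid_key(toy, id)
has_valid_key(toy, grp)</pre>
</div>
<p>In the toy data, <code>id</code> passes both checks while <code>grp</code> fails both (a repeated <code>"a"</code> and a missing value). Given the results above, a call such as <code>has_valid_key(planes, tailnum)</code> would be expected to return <code>TRUE</code>.</p>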
</section>
<section id="surrogate-keys" data-type="sect2">
<h2>
Surrogate keys</h2>
<p>So far we haven't talked about the primary key for <code>flights</code>. It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if we have some way to describe them to others.</p>
<p>After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  count(time_hour, carrier, flight) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 0 × 4
#&gt; # … with 4 variables: time_hour &lt;dttm&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;, n &lt;int&gt;</pre>
</div>
<p>Does the absence of duplicates automatically make <code>time_hour</code>-<code>carrier</code>-<code>flight</code> a primary key? It's certainly a good start, but it doesn't guarantee it. For example, are altitude and latitude a good primary key for <code>airports</code>?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports |&gt;
  count(alt, lat) |&gt;
  filter(n &gt; 1)
#&gt; # A tibble: 1 × 3
#&gt; alt lat n
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 13 40.6 2</pre>
</div>
<p>Identifying an airport by its altitude and latitude is clearly a bad idea, and in general it's not possible to know from the data alone whether or not a combination of variables makes a good primary key. But for flights, the combination of <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.</p>
<p>That said, we might be better off introducing a simple numeric surrogate key using the row number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 &lt;- flights |&gt;
  mutate(id = row_number(), .before = 1)
flights2
#&gt; # A tibble: 336,776 × 20
#&gt; id year month day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 1 2013 1 1 517 515 2 830 819 11
#&gt; 2 2 2013 1 1 533 529 4 850 830 20
#&gt; 3 3 2013 1 1 542 540 2 923 850 33
#&gt; 4 4 2013 1 1 544 545 -1 1004 1022 -18
#&gt; 5 5 2013 1 1 554 600 -6 812 837 -25
#&gt; 6 6 2013 1 1 554 558 -4 740 728 12
#&gt; # … with 336,770 more rows, 10 more variables: carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>Surrogate keys can be particularly useful when communicating with other humans: it's much easier to tell someone to take a look at flight 2001 than to say look at UA430, which departed at 9am on 2013-01-03.</p>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>We forgot to draw the relationship between <code>weather</code> and <code>airports</code> in <a href="#fig-flights-relationships" data-type="xref">#fig-flights-relationships</a>. What is the relationship and how should it appear in the diagram?</p></li>
<li><p><code>weather</code> only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to <code>flights</code>?</p></li>
<li><p>The <code>year</code>, <code>month</code>, <code>day</code>, <code>hour</code>, and <code>origin</code> variables almost form a compound key for <code>weather</code>, but there's one hour that has duplicate observations. Can you figure out what's special about that hour?</p></li>
<li><p>We know that some days of the year are special and fewer people than usual fly on them (e.g. Christmas Eve and Christmas Day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?</p></li>
<li><p>Draw a diagram illustrating the connections between the <code>Batting</code>, <code>People</code>, and <code>Salaries</code> data frames in the Lahman package. Draw another diagram that shows the relationship between <code>People</code>, <code>Managers</code>, <code>AwardsManagers</code>. How would you characterise the relationship between the <code>Batting</code>, <code>Pitching</code>, and <code>Fielding</code> data frames?</p></li>
</ol></section>
</section>
<section id="sec-mutating-joins" data-type="sect1">
<h1>
Basic joins</h1>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">inner_join()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">right_join()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">full_join()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">semi_join()</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, you'll learn how to use one mutating join, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code>, and two filtering joins, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">semi_join()</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">anti_join()</a></code>. In the next section, you'll learn exactly how these functions work, and about the remaining <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">inner_join()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">right_join()</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">full_join()</a></code>.</p>
<section id="mutating-joins" data-type="sect2">
<h2>
Mutating joins</h2>
<p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code>, the join functions add variables to the right, so if your dataset has many variables, you won't see the new ones. For these examples, we'll make it easier to see what's going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">View()</a></code> to avoid this problem.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 &lt;- flights |&gt;
  select(year, time_hour, origin, dest, tailnum, carrier)
flights2
#&gt; # A tibble: 336,776 × 6
#&gt; year time_hour origin dest tailnum carrier
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
#&gt; # … with 336,770 more rows</pre>
</div>
<p>There are four types of mutating join, but there's one that you'll use almost all of the time: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code>. It's special because the output will always have the same rows as <code>x</code><span data-type="footnote">That's not 100% true, but you'll get a warning whenever it isn't.</span>. The primary use of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code> is to add in additional metadata. For example, we can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code> to add the full airline name to the <code>flights2</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
  left_join(airlines)
#&gt; Joining with `by = join_by(carrier)`
#&gt; # A tibble: 336,776 × 7
#&gt; year time_hour origin dest tailnum carrier name
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc.
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc.
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc.
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc.
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Or we could find out the temperature and wind speed when each plane departed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
  left_join(weather |&gt; select(origin, time_hour, temp, wind_speed))
#&gt; Joining with `by = join_by(time_hour, origin)`
#&gt; # A tibble: 336,776 × 8
#&gt; year time_hour origin dest tailnum carrier temp wind_speed
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 39.0 12.7
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 39.9 15.0
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 39.0 15.0
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 39.0 15.0
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 39.9 16.1
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 39.0 12.7
#&gt; # … with 336,770 more rows</pre>
</div>
<p>Or what size of plane was flying:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
  left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 336,776 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed wi… 2 149
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed wi… 2 149
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed wi… 2 178
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed wi… 2 200
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed wi… 2 178
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191
#&gt; # … with 336,770 more rows</pre>
</div>
<p>When <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there's no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
  filter(tailnum == "N3ALAA") |&gt;
  left_join(planes |&gt; select(tailnum, type, engines, seats))
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 63 × 9
#&gt; year time_hour origin dest tailnum carrier type engines seats
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 2 2013 2013-01-02 18:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 3 2013 2013-01-03 06:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 4 2013 2013-01-07 19:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 5 2013 2013-01-08 17:00:00 JFK ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; 6 2013 2013-01-16 06:00:00 LGA ORD N3ALAA AA &lt;NA&gt; NA NA
#&gt; # … with 57 more rows</pre>
</div>
<p>We'll come back to this problem a few times in the rest of the chapter.</p>
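<p>The same behaviour is easy to see on a small scale. The sketch below uses two made-up tables (<code>x</code> and <code>y</code> are hypothetical toy data): every row of <code>x</code> is kept, and the row with no match in <code>y</code> gets a missing value in the new column.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical toy data: key 3 appears in x but not in y
x &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y &lt;- tibble(key = c(1, 2), val_y = c("y1", "y2"))

joined &lt;- x |&gt;
  left_join(y, join_by(key))

# All three rows of x are kept; val_y is NA where there was no match
joined</pre>
</div>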
</section>
<section id="specifying-join-keys" data-type="sect2">
<h2>
Specifying join keys</h2>
<p>By default, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">left_join()</a></code> will use all variables that appear in both data frames as the join key, the so-called <strong>natural</strong> join. This is a useful heuristic, but it doesn't always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
  left_join(planes)
#&gt; Joining with `by = join_by(year, tailnum)`
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier type manufactu…¹ model
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre>
</div>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is the year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">join_by()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes, join_by(tailnum))
#&gt; # A tibble: 336,776 × 14
#&gt; year.x time_hour origin dest tailnum carrier year.y type manuf…¹
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed … BOEING
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed … BOEING
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed … BOEING
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed … AIRBUS
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed … BOEING
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed … BOEING
#&gt; # … with 336,770 more rows, 5 more variables: model &lt;chr&gt;, engines &lt;int&gt;,
#&gt; # seats &lt;int&gt;, speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name
#&gt; # ¹manufacturer</pre>
</div>
<p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p>
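<p>As a minimal sketch of the <code>suffix</code> argument, the toy tibbles below share a <code>year</code> column with two different meanings, and the join labels each copy explicitly. The names <code>flights_toy</code> and <code>planes_toy</code> are made-up stand-ins rather than part of nycflights13, and the code assumes dplyr 1.1.0 or later for <code>join_by()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical miniature versions of flights2 and planes
flights_toy &lt;- tibble(year = 2013L, tailnum = c("N14228", "N24211"))
planes_toy &lt;- tibble(year = c(1999L, 1998L), tailnum = c("N14228", "N24211"))

joined &lt;- flights_toy |&gt;
  left_join(planes_toy, join_by(tailnum), suffix = c("_flight", "_plane"))

# The clashing columns are now labelled by their source table
names(joined)
#&gt; [1] "year_flight" "tailnum"     "year_plane"</pre>
</div>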
<p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It's important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That's why this type of join is often called an <strong>equi-join</strong>. You'll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
<p>Secondly, it's how you specify different join keys in each table. For example, there are two ways to join the <code>flights2</code> and <code>airports</code> tables: either by <code>dest</code> or <code>origin</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(airports, join_by(dest == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon alt
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Geor… 30.0 -95.3 97
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Geor… 30.0 -95.3 97
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miam… 25.8 -80.3 8
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 &lt;NA&gt; NA NA NA
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hart… 33.6 -84.4 1026
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chic… 42.0 -87.9 668
#&gt; # … with 336,770 more rows, and 3 more variables: tz &lt;dbl&gt;, dst &lt;chr&gt;,
#&gt; # tzone &lt;chr&gt;
flights2 |&gt;
left_join(airports, join_by(origin == faa))
#&gt; # A tibble: 336,776 × 13
#&gt; year time_hour origin dest tailnum carrier name lat lon alt
#&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Newa… 40.7 -74.2 18
#&gt; 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA La G… 40.8 -73.9 22
#&gt; 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA John… 40.6 -73.8 13
#&gt; 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 John… 40.6 -73.8 13
#&gt; 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL La G… 40.8 -73.9 22
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Newa… 40.7 -74.2 18
#&gt; # … with 336,770 more rows, and 3 more variables: tz &lt;dbl&gt;, dst &lt;chr&gt;,
#&gt; # tzone &lt;chr&gt;</pre>
</div>
<p>In older code you might see a different way of specifying the join keys, using a character vector:</p>
<ul><li>
<code>by = "x"</code> corresponds to <code>join_by(x)</code>.</li>
<li>
<code>by = c("a" = "x")</code> corresponds to <code>join_by(a == x)</code>.</li>
</ul><p>Now that it exists, we prefer <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code> since it provides a clearer and more flexible specification.</p>
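<p>The correspondence between the two styles can be checked on a small example. This is only a sketch: <code>df_a</code> and <code>df_x</code> are hypothetical tables invented for illustration, and <code>join_by()</code> requires dplyr 1.1.0 or later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical tables whose key columns have different names
df_a &lt;- tibble(a = c(1, 2, 3), val_a = c("a1", "a2", "a3"))
df_x &lt;- tibble(x = c(1, 2, 4), val_x = c("x1", "x2", "x4"))

# Older character-vector specification
old_style &lt;- df_a |&gt; left_join(df_x, by = c("a" = "x"))

# Equivalent join_by() specification
new_style &lt;- df_a |&gt; left_join(df_x, join_by(a == x))

# Both should describe the same join
identical(old_style, new_style)</pre>
</div>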
</section>
<section id="filtering-joins" data-type="sect2">
<h2>
Filtering joins</h2>
<p>As you might guess, the primary action of a <strong>filtering join</strong> is to filter the rows. There are two types: semi-joins and anti-joins. <strong>Semi-joins</strong> keep all rows in <code>x</code> that have a match in <code>y</code>. For example, we could use a semi-join to filter the <code>airports</code> dataset to show just the origin airports:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports |&gt;
semi_join(flights2, join_by(faa == origin))
#&gt; # A tibble: 3 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York
#&gt; 2 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York
#&gt; 3 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York</pre>
</div>
<p>Or just the destinations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports |&gt;
semi_join(flights2, join_by(faa == dest))
#&gt; # A tibble: 101 × 8
#&gt; faa name lat lon alt tz dst tzone
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 ABQ Albuquerque International Sunport 35.0 -107. 5355 -7 A Americ…
#&gt; 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Americ…
#&gt; 3 ALB Albany Intl 42.7 -73.8 285 -5 A Americ…
#&gt; 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Americ…
#&gt; 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Americ…
#&gt; 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Americ…
#&gt; # … with 95 more rows</pre>
</div>
<p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don't have a match in <code>y</code>. They're useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don't show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don't have a matching destination airport:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
anti_join(airports, join_by(dest == faa)) |&gt;
distinct(dest)
#&gt; # A tibble: 4 × 1
#&gt; dest
#&gt; &lt;chr&gt;
#&gt; 1 BQN
#&gt; 2 SJU
#&gt; 3 STT
#&gt; 4 PSE</pre>
</div>
<p>Or we can find which <code>tailnum</code>s are missing from <code>planes</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
anti_join(planes, join_by(tailnum)) |&gt;
distinct(tailnum)
#&gt; # A tibble: 722 × 1
#&gt; tailnum
#&gt; &lt;chr&gt;
#&gt; 1 N3ALAA
#&gt; 2 N3DUAA
#&gt; 3 N542MQ
#&gt; 4 N730MQ
#&gt; 5 N9EAMQ
#&gt; 6 N532UA
#&gt; # … with 716 more rows</pre>
</div>
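<p>The same two filtering joins can be tried on a toy example small enough to check by eye. The tables <code>flights_toy</code> and <code>airports_toy</code> are hypothetical, made up for this sketch:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical tables: three flights and a lookup that lacks one destination
flights_toy &lt;- tibble(dest = c("IAH", "BQN", "IAH"))
airports_toy &lt;- tibble(faa = c("IAH", "ATL"))

# Semi-join: flights whose destination appears in the lookup
known &lt;- flights_toy |&gt;
  semi_join(airports_toy, join_by(dest == faa))
nrow(known)
#&gt; [1] 2

# Anti-join: destinations that are implicitly missing from the lookup
missing_dest &lt;- flights_toy |&gt;
  anti_join(airports_toy, join_by(dest == faa)) |&gt;
  distinct(dest)
missing_dest$dest
#&gt; [1] "BQN"</pre>
</div>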
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the <code>weather</code> data. Can you see any patterns?</p></li>
<li>
<p>Imagine you've found the top 10 most popular destinations using this code:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">top_dest &lt;- flights2 |&gt;
count(dest, sort = TRUE) |&gt;
head(10)</pre>
</div>
<p>How can you find all flights to those destinations?</p>
</li>
<li><p>Does every departing flight have corresponding weather data for that hour?</p></li>
<li><p>What do the tail numbers that don't have a matching record in <code>planes</code> have in common? (Hint: one variable explains ~90% of the problems.)</p></li>
<li><p>Add a column to <code>planes</code> that lists every <code>carrier</code> that has flown that plane. You might expect that there's an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you've learned in previous chapters.</p></li>
<li><p>Add the latitude and the longitude of the origin <em>and</em> destination airport to <code>flights</code>. Is it easier to rename the columns before or after the join?</p></li>
<li>
<p>Compute the average delay by destination, then join on the <code>airports</code> data frame so you can show the spatial distribution of delays. Here's an easy way to draw a map of the United States:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">airports |&gt;
semi_join(flights, join_by(faa == dest)) |&gt;
ggplot(aes(lon, lat)) +
borders("state") +
geom_point() +
coord_quickmap()</pre>
</div>
<p>You might want to use the <code>size</code> or <code>colour</code> of the points to display the average delay for each airport.</p>
</li>
<li><p>What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.</p></li>
</ol></section>
</section>
<section id="how-do-joins-work" data-type="sect1">
<h1>
How do joins work?</h1>
<p>Now that you've used joins a few times it's time to learn more about how they work, focusing on how each row in <code>x</code> matches rows in <code>y</code>. We'll begin by using <a href="#fig-join-setup" data-type="xref">#fig-join-setup</a> to introduce a visual representation of the two simple tibbles defined below. In these examples we'll use a single key called <code>key</code> and a single value column (<code>val_x</code> and <code>val_y</code>), but the ideas all generalize to multiple keys and multiple values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y &lt;- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)</pre>
</div>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/setup.png" alt="x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are coloured: 1 is green, 2 is purple, 3 is orange, and 4 is yellow." width="160"/></p>
<figcaption class="figure-caption">Figure 19.2: Graphical representation of two simple tables. The coloured <code>key</code> columns map background colour to key value. The grey columns represent the “value” columns that are carried along for the ride.</figcaption>
</figure>
</div>
</div>
<p><a href="#fig-join-setup2" data-type="xref">#fig-join-setup2</a> shows all potential matches between <code>x</code> and <code>y</code> as the intersection between lines drawn from each row of <code>x</code> and each row of <code>y</code>. The rows and columns in the output are primarily determined by <code>x</code>, so the <code>x</code> table is horizontal and lines up with the output.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/setup2.png" alt="x and y are placed at right-angles, with horizontal lines extending from x and vertical lines extending from y. There are 3 rows in x and 3 rows in y, which leads to nine intersections representing nine potential matches." width="170"/></p>
<figcaption class="figure-caption">Figure 19.3: To understand how joins work, it's useful to think of every possible match. Here we show that with a grid of connecting lines.</figcaption>
</figure>
</div>
</div>
<p>In an actual join, matches will be indicated with dots, as in <a href="#fig-join-inner" data-type="xref">#fig-join-inner</a>. The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values. The join shown here is a so-called <strong>equi</strong> <strong>inner join</strong>, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both <code>x</code> and <code>y</code>. Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just call it an inner join. We'll come back to non-equi joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/inner.png" alt="x and y are placed at right-angles with lines forming a grid of potential matches. Keys 1 and 2 appear in both x and y, so we get a match, indicated by a dot. Each dot corresponds to a row in the output, so the resulting joined data frame has two rows." width="363"/></p>
<figcaption class="figure-caption">Figure 19.4: An inner join matches each row in <code>x</code> to the row in <code>y</code> that has the same value of <code>key</code>. Each match becomes a row in the output.</figcaption>
</figure>
</div>
</div>
<p>An <strong>outer join</strong> keeps observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with <code>NA</code>. There are three types of outer joins:</p>
<ul><li>
<p>A <strong>left join</strong> keeps all observations in <code>x</code>, <a href="#fig-join-left" data-type="xref">#fig-join-left</a>. Every row of <code>x</code> is preserved in the output because it can fall back to matching a row of <code>NA</code>s in <code>y</code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/left.png" alt="Compared to the previous diagram showing an inner join, the y table gets a new virtual row containing NA that will match any row in x that didn't otherwise match. This means that the output now has three rows. For key = 3, which matches this virtual row, val_y takes value NA." width="385"/></p>
<figcaption class="figure-caption">Figure 19.5: A visual representation of the left join where every row in <code>x</code> appears in the output.</figcaption>
</figure>
</div>
</div>
</li>
<li>
<p>A <strong>right join</strong> keeps all observations in <code>y</code>, <a href="#fig-join-right" data-type="xref">#fig-join-right</a>. Every row of <code>y</code> is preserved in the output because it can fall back to matching a row of <code>NA</code>s in <code>x</code>. The output still matches <code>x</code> as much as possible; any extra rows from <code>y</code> are added to the end.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/right.png" alt="Compared to the previous diagram showing a left join, the x table now gains a virtual row so that every row in y gets a match in x. val_x contains NA for the row in y that didn't match x." width="380"/></p>
<figcaption class="figure-caption">Figure 19.6: A visual representation of the right join where every row of <code>y</code> appears in the output.</figcaption>
</figure>
</div>
</div>
</li>
<li>
<p>A <strong>full join</strong> keeps all observations that appear in <code>x</code> or <code>y</code>, <a href="#fig-join-full" data-type="xref">#fig-join-full</a>. Every row of <code>x</code> and <code>y</code> is included in the output because both <code>x</code> and <code>y</code> have a fallback row of <code>NA</code>s. Again, the output starts with all rows from <code>x</code>, followed by the remaining unmatched <code>y</code> rows.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/full.png" alt="Now both x and y have a virtual row that always matches. The result has 4 rows: keys 1, 2, 3, and 4 with all values from val_x and val_y, however key 3, val_y and key 4, val_x are NAs since those keys don't have a match in the other data frame." width="388"/></p>
<figcaption class="figure-caption">Figure 19.7: A visual representation of the full join where every row in <code>x</code> and <code>y</code> appears in the output.</figcaption>
</figure>
</div>
</div>
</li>
</ul><p>Another way to show how the types of outer join differ is with a Venn diagram, as in <a href="#fig-join-venn" data-type="xref">#fig-join-venn</a>. However, this is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what's happening with the columns.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/venn.png" alt="Venn diagrams for inner, full, left, and right joins. Each join represented with two intersecting circles representing data frames x and y, with x on the right and y on the left. Shading indicates the result of the join." width="385"/></p>
<figcaption class="figure-caption">Figure 19.8: Venn diagrams showing the difference between inner, left, right, and full joins.</figcaption>
</figure>
</div>
</div>
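<p>To connect the diagrams to code, the sketch below runs each of the four mutating joins on the <code>x</code> and <code>y</code> tibbles defined earlier and counts the output rows. The name <code>n_rows</code> is introduced here only for illustration, and <code>join_by()</code> assumes dplyr 1.1.0 or later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

x &lt;- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y &lt;- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Number of output rows for each mutating join
n_rows &lt;- c(
  inner = nrow(inner_join(x, y, join_by(key))),
  left  = nrow(left_join(x, y, join_by(key))),
  right = nrow(right_join(x, y, join_by(key))),
  full  = nrow(full_join(x, y, join_by(key)))
)
n_rows
#&gt; inner  left right  full 
#&gt;     2     3     3     4

# In a full join, rows from x come first, then unmatched rows from y
full_join(x, y, join_by(key))$key
#&gt; [1] 1 2 3 4</pre>
</div>
<p>Only keys 1 and 2 appear in both tables, so the inner join has two rows; each outer join adds back the unmatched rows from one or both sides.</p>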
<section id="row-matching" data-type="sect2">
<h2>
Row matching</h2>
<p>So far we've explored what happens if a row in <code>x</code> matches zero or one rows in <code>y</code>. What happens if it matches more than one row? To understand what's going on, let's first narrow our focus to <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> and then draw a picture, <a href="#fig-join-match-types" data-type="xref">#fig-join-match-types</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/match-types.png" alt="A join diagram where x has key values 1, 2, and 3, and y has key values 1, 2, 2. The output has three rows because key 1 matches one row, key 2 matches two rows, and key 3 matches zero rows." width="348"/></p>
<figcaption class="figure-caption">Figure 19.9: The three ways a row in <code>x</code> can match. <code>x1</code> matches one row in <code>y</code>, <code>x2</code> matches two rows in <code>y</code>, <code>x3</code> matches zero rows in <code>y</code>. Note that while there are three rows in <code>x</code> and three rows in the output, there isn't a direct correspondence between the rows.</figcaption>
</figure>
</div>
</div>
<p>There are three possible outcomes for a row in <code>x</code>:</p>
<ul><li>If it doesn't match anything, it's dropped.</li>
<li>If it matches 1 row in <code>y</code>, it's preserved.</li>
<li>If it matches more than 1 row in <code>y</code>, it's duplicated once for each match.</li>
</ul><p>In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in <code>x</code>:</p>
<ul><li>There might be fewer rows if some rows in <code>x</code> don't match any rows in <code>y</code>.</li>
<li>There might be more rows if some rows in <code>x</code> match multiple rows in <code>y</code>.</li>
<li>There might be the same number of rows if every row in <code>x</code> matches one row in <code>y</code>.</li>
<li>There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in <code>y</code>!</li>
</ul><p>Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it. To avoid this problem, dplyr will warn whenever there are multiple matches:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 &lt;- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
df1 |&gt;
inner_join(df2, join_by(key))
#&gt; Warning in inner_join(df1, df2, join_by(key)): Each row in `x` is expected to match at most 1 row in `y`.
#&gt; Row 2 of `x` matches multiple rows.
#&gt; If multiple matches are expected, set `multiple = "all"` to silence this
#&gt; warning.
#&gt; # A tibble: 3 × 3
#&gt; key val_x val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 x1 y1
#&gt; 2 2 x2 y2
#&gt; 3 2 x2 y3</pre>
</div>
<p>This is one reason we like <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>: if it runs without warning, you know that each row of the output matches the row in the same position in <code>x</code>.</p>
<p>You can gain further control over row matching with two arguments:</p>
<ul><li>
<code>unmatched</code> controls what happens when a row in <code>x</code> fails to match any rows in <code>y</code>. It defaults to <code>"drop"</code> which will silently drop any unmatched rows.</li>
<li>
<code>multiple</code> controls what happens when a row in <code>x</code> matches more than one row in <code>y</code>. For equi-joins, it defaults to <code>"warn"</code> which emits a warning message if any rows have multiple matches.</li>
</ul><p>There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.</p>
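<p>Independently of these arguments, one way to anticipate row expansion before joining is to check whether the key is unique in <code>y</code>. The sketch below reuses the <code>df2</code> example from above; <code>dup_keys</code> is a name introduced here for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

df2 &lt;- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# Keys that appear more than once in y will duplicate matching rows of x
dup_keys &lt;- df2 |&gt;
  count(key) |&gt;
  filter(n &gt; 1)
dup_keys$key
#&gt; [1] 2</pre>
</div>
<p>If this check returns zero rows, no row of <code>x</code> can match more than one row of <code>y</code> in an equi-join on that key.</p>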
</section>
<section id="one-to-one-mapping" data-type="sect2">
<h2>
One-to-one mapping</h2>
<p>Both <code>unmatched</code> and <code>multiple</code> can take the value <code>"error"</code>, which means that the join will fail unless each row in <code>x</code> matches exactly one row in <code>y</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 &lt;- tibble(x = 1)
df2 &lt;- tibble(x = c(1, 1))
df3 &lt;- tibble(x = 3)
df1 |&gt;
inner_join(df2, join_by(x), unmatched = "error", multiple = "error")
#&gt; Error in `inner_join()`:
#&gt; ! Each row in `x` must match at most 1 row in `y`.
#&gt; Row 1 of `x` matches multiple rows.
df1 |&gt;
inner_join(df3, join_by(x), unmatched = "error", multiple = "error")
#&gt; Error in `inner_join()`:
#&gt; ! Each row of `x` must have a match in `y`.
#&gt; Row 1 of `x` does not have a match.</pre>
</div>
<p>Note that <code>unmatched = "error"</code> is not useful with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> because, as described above, every row in <code>x</code> has a fallback match to a virtual row in <code>y</code>.</p>
</section>
<section id="allow-multiple-rows" data-type="sect2">
<h2>
Allow multiple rows</h2>
<p>Sometimes it's useful to deliberately expand the number of rows in the output. This can come about naturally if you “flip” the direction of the question you're asking. For example, as we've seen above, it's natural to supplement the <code>flights</code> data with information about the plane that flew each flight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes, by = "tailnum")</pre>
</div>
<p>But it's also reasonable to ask which flights each plane flew:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">plane_flights &lt;- planes |&gt;
select(tailnum, type, engines, seats) |&gt;
left_join(flights2, by = "tailnum")
#&gt; Warning in left_join(select(planes, tailnum, type, engines, seats), flights2, : Each row in `x` is expected to match at most 1 row in `y`.
#&gt; Row 1 of `x` matches multiple rows.
#&gt; If multiple matches are expected, set `multiple = "all"` to silence this
#&gt; warning.</pre>
</div>
<p>Since this duplicates rows in <code>x</code> (the planes), we need to explicitly say that we're OK with the multiple matches by setting <code>multiple = "all"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">plane_flights &lt;- planes |&gt;
select(tailnum, type, engines, seats) |&gt;
left_join(flights2, by = "tailnum", multiple = "all")
plane_flights
#&gt; # A tibble: 284,170 × 9
#&gt; tailnum type engines seats year time_hour origin dest carrier
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dttm&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 N10156 Fixed wi… 2 55 2013 2013-01-10 06:00:00 EWR PIT EV
#&gt; 2 N10156 Fixed wi… 2 55 2013 2013-01-10 10:00:00 EWR CHS EV
#&gt; 3 N10156 Fixed wi… 2 55 2013 2013-01-10 15:00:00 EWR MSP EV
#&gt; 4 N10156 Fixed wi… 2 55 2013 2013-01-11 06:00:00 EWR CMH EV
#&gt; 5 N10156 Fixed wi… 2 55 2013 2013-01-11 11:00:00 EWR MCI EV
#&gt; 6 N10156 Fixed wi… 2 55 2013 2013-01-11 18:00:00 EWR PWM EV
#&gt; # … with 284,164 more rows</pre>
</div>
</section>
<section id="sec-non-equi-joins" data-type="sect2">
<h2>
Filtering joins</h2>
<p>The number of matches also determines the behavior of the filtering joins. The semi-join keeps rows in <code>x</code> that have one or more matches in <code>y</code>, as in <a href="#fig-join-semi" data-type="xref">#fig-join-semi</a>. The anti-join keeps rows in <code>x</code> that match zero rows in <code>y</code>, as in <a href="#fig-join-anti" data-type="xref">#fig-join-anti</a>. In both cases, only the existence of a match is important; it doesn't matter how many times it matches. This means that filtering joins never duplicate rows like mutating joins do.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/semi.png" alt="A join diagram with old friends x and y. In a semi join, only the presence of a match matters so the output contains the same columns as x." width="318"/></p>
<figcaption class="figure-caption">Figure 19.10: In a semi-join it only matters that there is a match; otherwise values in <code>y</code> don't affect the output.</figcaption>
</figure>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/anti.png" alt="An anti-join is the inverse of a semi-join so matches are drawn with red lines indicating that they will be dropped from the output." width="317"/></p>
<figcaption class="figure-caption">Figure 19.11: An anti-join is the inverse of a semi-join, dropping rows from <code>x</code> that have a match in <code>y</code>.</figcaption>
</figure>
</div>
</div>
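<p>A small sketch makes the "no duplication" property concrete. It reuses the <code>df1</code> and <code>df2</code> tibbles from the row-matching example, where key 2 appears twice in <code>df2</code>; the names <code>kept</code> and <code>dropped</code> are introduced here for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

df1 &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 &lt;- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# Key 2 matches twice in df2, but a semi-join keeps the x row only once
kept &lt;- df1 |&gt; semi_join(df2, join_by(key))
kept$val_x
#&gt; [1] "x1" "x2"

# The anti-join returns the complement: rows of x with no match in y
dropped &lt;- df1 |&gt; anti_join(df2, join_by(key))
dropped$val_x
#&gt; [1] "x3"</pre>
</div>
<p>Together the two results partition <code>df1</code>: every row of <code>x</code> lands in exactly one of them, and neither result gains columns from <code>y</code>.</p>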
</section>
</section>
<section id="non-equi-joins" data-type="sect1">
<h1>
Non-equi joins</h1>
<p>So far you've only seen equi-joins, joins where the rows match if the <code>x</code> key equals the <code>y</code> key. Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.</p>
<p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> and <code>y</code> keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x |&gt; left_join(y, by = "key", keep = TRUE)
#&gt; # A tibble: 3 × 4
#&gt; key.x val_x key.y val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 1 x1 1 y1
#&gt; 2 2 x2 2 y2
#&gt; 3 3 x3 NA &lt;NA&gt;</pre>
</div>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/inner-both.png" alt="A join diagram showing an inner join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. The values of key.x and key.y are identical, which is why we usually only show one. " width="415"/></p>
<figcaption class="figure-caption">Figure 19.12: A left join showing both <code>x</code> and <code>y</code> keys in the output.</figcaption>
</figure>
</div>
</div>
<p>When we move away from equi-joins we'll always show the keys, because the key values will often be different. For example, instead of matching only when the <code>x$key</code> and <code>y$key</code> are equal, we could match whenever the <code>x$key</code> is greater than or equal to the <code>y$key</code>, leading to <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a>. dplyr's join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/gte.png" alt="A join diagram illustrating join_by(key &gt;= key). The first row of x matches one row of y and the second and third rows each match two rows. This means the output has five rows containing each of the following (key.x, key.y) pairs: (1, 1), (2, 1), (2, 2), (3, 1), (3, 2)." width="385"/></p>
<figcaption class="figure-caption">Figure 19.13: A non-equi join where the <code>x</code> key must be greater than or equal to the <code>y</code> key. Many rows generate multiple matches.</figcaption>
</figure>
</div>
</div>
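<p>The greater-than-or-equal join in the diagram can be sketched in code using the same key values as <code>x</code> and <code>y</code> above. The result name <code>gte</code> is introduced here for illustration, and the code assumes dplyr 1.1.0 or later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

x &lt;- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y &lt;- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# The left side of the inequality refers to x, the right side to y
gte &lt;- x |&gt; inner_join(y, join_by(key &gt;= key))

# Both keys are kept because they can differ
names(gte)
#&gt; [1] "key.x" "val_x" "key.y" "val_y"
nrow(gte)
#&gt; [1] 5</pre>
</div>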
<p>Non-equi-join isn't a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi-join:</p>
<ul><li>
<strong>Cross joins</strong> match every pair of rows.</li>
<li>
<strong>Inequality joins</strong> use <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, and <code>&gt;=</code> instead of <code>==</code>.</li>
<li>
<strong>Rolling joins</strong> are similar to inequality joins but only find the closest match.</li>
<li>
<strong>Overlap joins</strong> are a special type of inequality join designed to work with ranges.</li>
</ul><p>Each of these is described in more detail in the following sections.</p>
<section id="cross-joins" data-type="sect2">
<h2>
Cross joins</h2>
<p>A cross join matches everything, as in <a href="#fig-join-cross" data-type="xref">#fig-join-cross</a>, generating the Cartesian product of rows. This means the output will have <code>nrow(x) * nrow(y)</code> rows.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/cross.png" alt="A join diagram showing a dot for every combination of x and y." width="155"/></p>
<figcaption class="figure-caption">Figure 19.14: A cross join matches each row in <code>x</code> with every row in <code>y</code>.</figcaption>
</figure>
</div>
</div>
<p>Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we're joining <code>df</code> to itself, this is sometimes called a <strong>self-join</strong>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(name = c("John", "Simon", "Tracy", "Max"))
df |&gt; left_join(df, join_by())
#&gt; # A tibble: 16 × 2
#&gt; name.x name.y
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 John John
#&gt; 2 John Simon
#&gt; 3 John Tracy
#&gt; 4 John Max
#&gt; 5 Simon John
#&gt; 6 Simon Simon
#&gt; # … with 10 more rows</pre>
</div>
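<p>The row-count claim can be checked directly. The sketch below assumes dplyr 1.1.0 or later, where a dedicated <code>cross_join()</code> function is available; <code>pairs</code> is a name introduced here for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

df &lt;- tibble(name = c("John", "Simon", "Tracy", "Max"))

# Every row of df paired with every row of df
pairs &lt;- cross_join(df, df)

nrow(pairs) == nrow(df) * nrow(df)
#&gt; [1] TRUE
names(pairs)
#&gt; [1] "name.x" "name.y"</pre>
</div>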
</section>
<section id="inequality-joins" data-type="sect2">
<h2>
Inequality joins</h2>
<p>Inequality joins use <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, or <code>&gt;</code> to restrict the set of possible matches, as in <a href="#fig-join-gte" data-type="xref">#fig-join-gte</a> and <a href="#fig-join-lt" data-type="xref">#fig-join-lt</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/lt.png" width="185"/></p>
<figcaption class="figure-caption">Figure 19.15: An inequality join where <code>x</code> is joined to <code>y</code> on rows where the key of <code>x</code> is less than the key of <code>y</code>. This makes a triangular shape in the top-left corner.</figcaption>
</figure>
</div>
</div>
<p>Inequality joins are extremely general, so general that it's hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
df |&gt; inner_join(df, join_by(id &lt; id))
#&gt; # A tibble: 6 × 4
#&gt; id.x name.x id.y name.y
#&gt; &lt;int&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1 John 2 Simon
#&gt; 2 1 John 3 Tracy
#&gt; 3 1 John 4 Max
#&gt; 4 2 Simon 3 Tracy
#&gt; 5 2 Simon 4 Max
#&gt; 6 3 Tracy 4 Max</pre>
</div>
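<p>The condition <code>id.x &lt; id.y</code> keeps each unordered pair exactly once, so four rows can produce at most <code>choose(4, 2)</code> = 6 matched pairs. A small base R sketch, independent of dplyr, confirms that count; <code>ids</code> and <code>id_pairs</code> are illustrative names.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ids &lt;- 1:4
# combn() returns one column per unordered pair; transpose to one row per pair
id_pairs &lt;- t(combn(ids, 2))
nrow(id_pairs)
#&gt; [1] 6
choose(length(ids), 2)
#&gt; [1] 6</pre>
</div>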
</section>
<section id="rolling-joins" data-type="sect2">
<h2>
Rolling joins</h2>
<p>Rolling joins are a special type of inequality join where instead of getting <em>every</em> row that satisfies the inequality, you get just the closest row, as in <a href="#fig-join-closest" data-type="xref">#fig-join-closest</a>. You can turn any inequality join into a rolling join by adding <code>closest()</code>. For example, <code>join_by(closest(x &lt;= y))</code> matches the smallest <code>y</code> that’s greater than or equal to <code>x</code>, and <code>join_by(closest(x &gt; y))</code> matches the biggest <code>y</code> that’s less than <code>x</code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/join/closest.png" alt="A rolling join is a subset of an inequality join so some matches are grayed out indicating that they're not used because they're not the &quot;closest&quot;." width="262"/></p>
<figcaption class="figure-caption">Figure 19.16: A rolling join is similar to a greater-than-or-equal inequality join but only matches the first value.</figcaption>
</figure>
</div>
</div>
<p>Rolling joins are particularly useful when you have two tables of dates that don’t perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.</p>
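<p>To see what <code>closest()</code> changes, here is a minimal sketch on two made-up tables. It assumes dplyr 1.1.0 or later; the tibbles <code>a</code> and <code>b</code> and their values are invented for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

a &lt;- tibble(x = c(2, 5, 9))
b &lt;- tibble(y = c(1, 4, 8))

# Plain inequality join: every y that is less than or equal to x
all_matches &lt;- a |&gt; inner_join(b, join_by(x &gt;= y))
nrow(all_matches)
#&gt; [1] 6

# Rolling join: only the largest y that is less than or equal to x
rolling &lt;- a |&gt; inner_join(b, join_by(closest(x &gt;= y)))
rolling$y
#&gt; [1] 1 4 8</pre>
</div>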
<p>For example, imagine that you’re in charge of the party planning commission for your office. Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">parties &lt;- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)</pre>
</div>
<p>Now imagine that you have a table of employee birthdays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">employees &lt;- tibble(
name = wakefield::name(100),
birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
)
employees
#&gt; # A tibble: 100 × 2
#&gt; name birthday
#&gt; &lt;variable&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11
#&gt; 2 Santania 2022-03-01
#&gt; 3 Gardell 2022-03-04
#&gt; 4 Cyrille 2022-11-15
#&gt; 5 Kynli 2022-07-09
#&gt; 6 Sever 2022-02-03
#&gt; # … with 94 more rows</pre>
</div>
<p>For each employee we want to find the most recent party date that falls on or before their birthday. We can express that with a rolling join:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">employees |&gt;
left_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 100 × 4
#&gt; name birthday q party
#&gt; &lt;variable&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11 3 2022-07-11
#&gt; 2 Santania 2022-03-01 1 2022-01-10
#&gt; 3 Gardell 2022-03-04 1 2022-01-10
#&gt; 4 Cyrille 2022-11-15 4 2022-10-03
#&gt; 5 Kynli 2022-07-09 2 2022-04-04
#&gt; 6 Sever 2022-02-03 1 2022-01-10
#&gt; # … with 94 more rows</pre>
</div>
<p>There is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">employees |&gt;
anti_join(parties, join_by(closest(birthday &gt;= party)))
#&gt; # A tibble: 4 × 2
#&gt; name birthday
#&gt; &lt;variable&gt; &lt;date&gt;
#&gt; 1 Janeida 2022-01-04
#&gt; 2 Aires 2022-01-07
#&gt; 3 Mikalya 2022-01-06
#&gt; 4 Carlynn 2022-01-08</pre>
</div>
<p>To resolve that issue we’ll need to tackle the problem a different way, with overlap joins.</p>
</section>
<section id="overlap-joins" data-type="sect2">
<h2>
Overlap joins</h2>
<p>Overlap joins provide three helpers that use inequality joins to make it easier to work with intervals:</p>
<ul><li>
<code>between(x, y_lower, y_upper)</code> is short for <code>x &gt;= y_lower, x &lt;= y_upper</code>.</li>
<li>
<code>within(x_lower, x_upper, y_lower, y_upper)</code> is short for <code>x_lower &gt;= y_lower, x_upper &lt;= y_upper</code>.</li>
<li>
<code>overlaps(x_lower, x_upper, y_lower, y_upper)</code> is short for <code>x_lower &lt;= y_upper, x_upper &gt;= y_lower</code>.</li>
</ul><p>Let’s continue the birthday example to see how you might use them. There’s one problem with the strategy we used above: there’s no party preceding the birthdays from January 1 to 9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">parties &lt;- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
)
parties
#&gt; # A tibble: 4 × 4
#&gt; q party start end
#&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 2 2 2022-04-04 2022-04-04 2022-07-11
#&gt; 3 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 4 4 2022-10-03 2022-10-03 2022-12-31</pre>
</div>
<p>Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don’t overlap. One way to do this is to use a self-join to check whether any start-end interval overlaps with another:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">parties |&gt;
inner_join(parties, join_by(overlaps(start, end, start, end), q &lt; q)) |&gt;
select(start.x, end.x, start.y, end.y)
#&gt; # A tibble: 1 × 4
#&gt; start.x end.x start.y end.y
#&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 2022-04-04 2022-07-11 2022-07-11 2022-10-02</pre>
</div>
<p>Oops, there is an overlap, so let’s fix that problem and continue:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">parties &lt;- tibble(
q = 1:4,
party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)</pre>
</div>
<p>Now we can match each employee to their party. This is a good place to use <code>unmatched = "error"</code> because we want to quickly find out if any employees didn’t get assigned a party.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">employees |&gt;
inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
#&gt; # A tibble: 100 × 6
#&gt; name birthday q party start end
#&gt; &lt;variable&gt; &lt;date&gt; &lt;int&gt; &lt;date&gt; &lt;date&gt; &lt;date&gt;
#&gt; 1 Lindzy 2022-08-11 3 2022-07-11 2022-07-11 2022-10-02
#&gt; 2 Santania 2022-03-01 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 3 Gardell 2022-03-04 1 2022-01-10 2022-01-01 2022-04-03
#&gt; 4 Cyrille 2022-11-15 4 2022-10-03 2022-10-03 2022-12-31
#&gt; 5 Kynli 2022-07-09 2 2022-04-04 2022-04-04 2022-07-10
#&gt; 6 Sever 2022-02-03 1 2022-01-10 2022-01-01 2022-04-03
#&gt; # … with 94 more rows</pre>
</div>
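<p>To make the helper shorthand concrete, the sketch below compares <code>between()</code> with the two inequalities it abbreviates. It assumes dplyr 1.1.0 or later; the tibbles <code>pts</code> and <code>rng</code> and their values are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

pts &lt;- tibble(x = c(1, 5, 12))
rng &lt;- tibble(lower = c(0, 10), upper = c(4, 15))

# Using the helper
with_helper &lt;- pts |&gt; inner_join(rng, join_by(between(x, lower, upper)))
# Spelling out the same pair of inequalities by hand
spelled_out &lt;- pts |&gt; inner_join(rng, join_by(x &gt;= lower, x &lt;= upper))

with_helper$x
#&gt; [1] 1 12
spelled_out$x
#&gt; [1] 1 12</pre>
</div>
<p>The value 5 falls in neither range, so an inner join drops it in both versions.</p>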
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Can you explain what’s happening with the keys in this equi-join? Why are they different?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x |&gt; full_join(y, by = "key")
#&gt; # A tibble: 4 × 3
#&gt; key val_x val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 x1 y1
#&gt; 2 2 x2 y2
#&gt; 3 3 x3 &lt;NA&gt;
#&gt; 4 4 &lt;NA&gt; y3
x |&gt; full_join(y, by = "key", keep = TRUE)
#&gt; # A tibble: 4 × 4
#&gt; key.x val_x key.y val_y
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 1 x1 1 y1
#&gt; 2 2 x2 2 y2
#&gt; 3 3 x3 NA &lt;NA&gt;
#&gt; 4 NA &lt;NA&gt; 4 y3</pre>
</div>
</li>
<li><p>When checking whether any party period overlapped with another, we used <code>q &lt; q</code> in <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>. Why? What happens if you remove this inequality?</p></li>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.</p>
<p>This chapter concludes the “Transform” part of the book where the focus was on the tools you could use with individual columns and tibbles. You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.</p>
<p>In the next part of the book, you’ll learn more about getting various types of data into R in a tidy form.</p>
</section>
</section>

633
oreilly/logicals.html Normal file

@ -0,0 +1,633 @@
<section data-type="chapter" id="chp-logicals">
<h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p>
<p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre>
</div>
<p>However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 3, 5, 7, 11, 13)
x * 2
#&gt; [1] 2 4 6 10 14 22 26</pre>
</div>
<p>This makes it easier to explain individual functions at the cost of making it harder to see how they might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and friends.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x)
df |&gt;
mutate(y = x * 2)
#&gt; # A tibble: 7 × 2
#&gt; x y
#&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2
#&gt; 2 2 4
#&gt; 3 3 6
#&gt; 4 5 10
#&gt; 5 7 14
#&gt; 6 11 22
#&gt; # … with 1 more row</pre>
</div>
</section>
</section>
<section id="comparisons" data-type="sect1">
<h1>
Comparisons</h1>
<p>A very common way to create a logical vector is via a numeric comparison with <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code>, and <code>==</code>. So far, we’ve mostly created logical variables transiently within <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>: they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that arrive roughly on time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time &gt; 600 &amp; dep_time &lt; 2000 &amp; abs(arr_delay) &lt; 20)
#&gt; # A tibble: 172,286 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 601 600 1 844 850 -6 B6
#&gt; 2 2013 1 1 602 610 -8 812 820 -8 DL
#&gt; 3 2013 1 1 602 605 -3 821 805 16 MQ
#&gt; 4 2013 1 1 606 610 -4 858 910 -12 AA
#&gt; 5 2013 1 1 606 610 -4 837 845 -8 DL
#&gt; 6 2013 1 1 607 607 0 858 915 -17 UA
#&gt; # … with 172,280 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
daytime = dep_time &gt; 600 &amp; dep_time &lt; 2000,
approx_ontime = abs(arr_delay) &lt; 20,
.keep = "used"
)
#&gt; # A tibble: 336,776 × 4
#&gt; dep_time arr_delay daytime approx_ontime
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 517 11 FALSE TRUE
#&gt; 2 533 20 FALSE FALSE
#&gt; 3 542 33 FALSE FALSE
#&gt; 4 544 -18 FALSE TRUE
#&gt; 5 554 -25 FALSE FALSE
#&gt; 6 554 12 FALSE TRUE
#&gt; # … with 336,770 more rows</pre>
</div>
<p>This is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.</p>
<p>All up, the initial filter is equivalent to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
daytime = dep_time &gt; 600 &amp; dep_time &lt; 2000,
approx_ontime = abs(arr_delay) &lt; 20,
) |&gt;
filter(daytime &amp; approx_ontime)</pre>
</div>
<section id="sec-fp-comparison" data-type="sect2">
<h2>
Floating point comparison</h2>
<p>Beware of using <code>==</code> with numbers. For example, it looks like this vector contains the numbers 1 and 2:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1 / 49 * 49, sqrt(2) ^ 2)
x
#&gt; [1] 1 2</pre>
</div>
<p>But if you test them for equality, you get <code>FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x == c(1, 2)
#&gt; [1] FALSE FALSE</pre>
</div>
<p>What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code> and subsequent computations will be very slightly off. We can see the exact values by calling <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> with the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">print(x, digits = 16)
#&gt; [1] 0.9999999999999999 2.0000000000000004</pre>
</div>
<p>You can see why R defaults to rounding these numbers; they really are very close to what you expect.</p>
<p>Now that you’ve seen why <code>==</code> is failing, what can you do about it? One option is to use <code><a href="https://dplyr.tidyverse.org/reference/near.html">near()</a></code>, which ignores small differences:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">near(x, c(1, 2))
#&gt; [1] TRUE TRUE</pre>
</div>
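<p>The same pitfall shows up with simple decimal fractions, and base R offers its own tolerance-based check. The sketch below uses only base functions; note that <code>all.equal()</code> compares whole vectors rather than element by element, and the tolerance in the last line is an arbitrary value picked for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Exact comparison fails because 0.1, 0.2, and 0.3 have no exact binary form
0.1 + 0.2 == 0.3
#&gt; [1] FALSE

# all.equal() allows a small tolerance; wrap it in isTRUE() to get a plain logical
isTRUE(all.equal(0.1 + 0.2, 0.3))
#&gt; [1] TRUE

# A hand-rolled, element-wise tolerance check
abs((0.1 + 0.2) - 0.3) &lt; 1e-8
#&gt; [1] TRUE</pre>
</div>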
</section>
<section id="sec-na-comparison" data-type="sect2">
<h2>
Missing values</h2>
<p>Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">NA &gt; 5
#&gt; [1] NA
10 == NA
#&gt; [1] NA</pre>
</div>
<p>The most confusing result is this one:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">NA == NA
#&gt; [1] NA</pre>
</div>
<p>It’s easiest to understand why this is true if we artificially supply a little more context:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Let x be Mary's age. We don't know how old she is.
x &lt;- NA
# Let y be John's age. We don't know how old he is.
y &lt;- NA
# Are John and Mary the same age?
x == y
#&gt; [1] NA
# We don't know!</pre>
</div>
<p>So if you want to find all flights where <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield <code>NA</code> for every single row, and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> automatically drops missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time == NA)
#&gt; # A tibble: 0 × 19
#&gt; # … with 19 variables: year &lt;int&gt;, month &lt;int&gt;, day &lt;int&gt;, dep_time &lt;int&gt;,
#&gt; # sched_dep_time &lt;int&gt;, dep_delay &lt;dbl&gt;, arr_time &lt;int&gt;,
#&gt; # sched_arr_time &lt;int&gt;, arr_delay &lt;dbl&gt;, carrier &lt;chr&gt;, flight &lt;int&gt;,
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
</div>
<p>Instead we’ll need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
</section>
<section id="is.na" data-type="sect2">
<h2>
<code>is.na()</code>
</h2>
<p><code>is.na(x)</code> works with any type of vector and returns <code>TRUE</code> for missing values and <code>FALSE</code> for everything else:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">is.na(c(TRUE, NA, FALSE))
#&gt; [1] FALSE TRUE FALSE
is.na(c(1, NA, 3))
#&gt; [1] FALSE TRUE FALSE
is.na(c("a", NA, "b"))
#&gt; [1] FALSE TRUE FALSE</pre>
</div>
<p>We can use <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> to find all the rows with a missing <code>dep_time</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(is.na(dep_time))
#&gt; # A tibble: 8,255 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 2 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 3 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 4 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; 5 2013 1 2 NA 1540 NA NA 1747 NA EV
#&gt; 6 2013 1 2 NA 1620 NA NA 1746 NA EV
#&gt; # … with 8,249 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == 1, day == 1) |&gt;
arrange(dep_time)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 836 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights |&gt;
filter(month == 1, day == 1) |&gt;
arrange(desc(is.na(dep_time)), dep_time)
#&gt; # A tibble: 842 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 2 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 3 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 4 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; 5 2013 1 1 517 515 2 830 819 11 UA
#&gt; 6 2013 1 1 533 529 4 850 830 20 UA
#&gt; # … with 836 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>We’ll come back to cover missing values in more depth in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">near()</a></code> work? Type <code>near</code> to see the source code.</li>
<li>Use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> together to describe how the missing values in <code>dep_time</code>, <code>sched_dep_time</code> and <code>dep_delay</code> are connected.</li>
</ol></section>
</section>
<section id="boolean-algebra" data-type="sect1">
<h1>
Boolean algebra</h1>
<p>Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, <code>&amp;</code> is “and”, <code>|</code> is “or”, <code>!</code> is “not”, and <code><a href="https://rdrr.io/r/base/Logic.html">xor()</a></code> is exclusive or<span data-type="footnote">That is, <code>xor(x, y)</code> is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.</span>. <a href="#fig-bool-ops" data-type="xref">#fig-bool-ops</a> shows the complete set of Boolean operations and how they work.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-prop-delayed-dist"><p><img src="diagrams/transform.png" alt="Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y &amp; !x is y but none of x; x &amp; y is the intersection of x and y; x &amp; !y is x but none of y; x is all of x none of y; xor(x, y) is everything except the intersection of x and y; y is all of y and none of x; and x | y is everything." width="395"/></p>
<figcaption>Figure 12.1: The complete set of Boolean operations. <code>x</code> is the left-hand circle, <code>y</code> is the right-hand circle, and the shaded regions show which parts each operator selects.</figcaption>
</figure>
</div>
</div>
<p>As well as <code>&amp;</code> and <code>|</code>, R also has <code>&amp;&amp;</code> and <code>||</code>. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single <code>TRUE</code> or <code>FALSE</code>. They’re important for programming, not data science.</p>
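<p>A short base R sketch of the difference: <code>&amp;</code> works element by element, while <code>&amp;&amp;</code> and <code>||</code> take single values and stop evaluating as soon as the answer is known. The <code>stop()</code> calls are there only to show that the right-hand side is never run.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Vectorized: one result per element
c(TRUE, FALSE, TRUE) &amp; c(TRUE, TRUE, FALSE)
#&gt; [1] TRUE FALSE FALSE

# Short-circuiting: a single result from single inputs
TRUE &amp;&amp; FALSE
#&gt; [1] FALSE

# The right-hand side is skipped once the left-hand side decides the answer
FALSE &amp;&amp; stop("never evaluated")
#&gt; [1] FALSE
TRUE || stop("never evaluated")
#&gt; [1] TRUE</pre>
</div>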
<section id="sec-na-boolean" data-type="sect2">
<h2>
Missing values</h2>
<p>The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = c(TRUE, FALSE, NA))
df |&gt;
mutate(
and = x &amp; NA,
or = x | NA
)
#&gt; # A tibble: 3 × 3
#&gt; x and or
#&gt; &lt;lgl&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 TRUE NA TRUE
#&gt; 2 FALSE FALSE NA
#&gt; 3 NA NA NA</pre>
</div>
<p>To understand what’s going on, think about <code>NA | TRUE</code>. A missing value in a logical vector means that the value could either be <code>TRUE</code> or <code>FALSE</code>. <code>TRUE | TRUE</code> and <code>FALSE | TRUE</code> are both <code>TRUE</code>, so <code>NA | TRUE</code> must also be <code>TRUE</code>. Similar reasoning applies with <code>NA &amp; FALSE</code>.</p>
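<p>The four cases can be checked directly at the console. The result is known only when the other operand settles the answer regardless of what the missing value might be:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># TRUE wins an "or" no matter what the unknown value is
NA | TRUE
#&gt; [1] TRUE
# FALSE wins an "and" no matter what the unknown value is
NA &amp; FALSE
#&gt; [1] FALSE
# In the remaining cases the answer depends on the unknown value
NA | FALSE
#&gt; [1] NA
NA &amp; TRUE
#&gt; [1] NA</pre>
</div>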
</section>
<section id="order-of-operations" data-type="sect2">
<h2>
Order of operations</h2>
<p>Note that the order of operations doesn’t work like English. Take the following code, which finds all flights that departed in November or December:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == 11 | month == 12)</pre>
</div>
<p>You might be tempted to write it like you’d say in English: “find all flights that departed in November or December”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == 11 | 12)
#&gt; # A tibble: 336,776 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 533 529 4 850 830 20 UA
#&gt; 3 2013 1 1 542 540 2 923 850 33 AA
#&gt; 4 2013 1 1 544 545 -1 1004 1022 -18 B6
#&gt; 5 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 6 2013 1 1 554 558 -4 740 728 12 UA
#&gt; # … with 336,770 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>This code doesn’t error but it also doesn’t seem to have worked. What’s going on? Here, R first evaluates <code>month == 11</code>, creating a logical vector, which we call <code>nov</code>. It then computes <code>nov | 12</code>. When you use a number with a logical operator, R converts everything apart from 0 to <code>TRUE</code>, so this is equivalent to <code>nov | TRUE</code>, which will always be <code>TRUE</code>, so every row will be selected:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
nov = month == 11,
final = nov | 12,
.keep = "used"
)
#&gt; # A tibble: 336,776 × 3
#&gt; month nov final
#&gt; &lt;int&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 1 FALSE TRUE
#&gt; 2 1 FALSE TRUE
#&gt; 3 1 FALSE TRUE
#&gt; 4 1 FALSE TRUE
#&gt; 5 1 FALSE TRUE
#&gt; 6 1 FALSE TRUE
#&gt; # … with 336,770 more rows</pre>
</div>
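<p>The coercion rule behind this result is easy to verify on its own in base R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Zero becomes FALSE; every other number becomes TRUE
as.logical(c(0, 1, 12, -3))
#&gt; [1] FALSE TRUE TRUE TRUE

# So combining anything with 12 using | is always TRUE
FALSE | 12
#&gt; [1] TRUE
# whereas 0 behaves like FALSE
FALSE | 0
#&gt; [1] FALSE</pre>
</div>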
</section>
<section id="in" data-type="sect2">
<h2>
<code>%in%</code>
</h2>
<p>An easy way to avoid the problem of getting your <code>==</code>s and <code>|</code>s in the right order is to use <code>%in%</code>. <code>x %in% y</code> returns a logical vector the same length as <code>x</code> that is <code>TRUE</code> whenever a value in <code>x</code> is anywhere in <code>y</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">1:12 %in% c(1, 5, 11)
#&gt; [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
letters[1:10] %in% c("a", "e", "i", "o", "u")
#&gt; [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE</pre>
</div>
<p>So to find all flights in November and December we could write:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month %in% c(11, 12))</pre>
</div>
<p>Note that <code>%in%</code> obeys different rules for <code>NA</code> than <code>==</code>, as <code>NA %in% NA</code> is <code>TRUE</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">c(1, 2, NA) == NA
#&gt; [1] NA NA NA
c(1, 2, NA) %in% NA
#&gt; [1] FALSE FALSE TRUE</pre>
</div>
<p>This can make for a useful shortcut:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time %in% c(NA, 0800))
#&gt; # A tibble: 8,803 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 800 800 0 1022 1014 8 DL
#&gt; 2 2013 1 1 800 810 -10 949 955 -6 MQ
#&gt; 3 2013 1 1 NA 1630 NA NA 1815 NA EV
#&gt; 4 2013 1 1 NA 1935 NA NA 2240 NA AA
#&gt; 5 2013 1 1 NA 1500 NA NA 1825 NA AA
#&gt; 6 2013 1 1 NA 600 NA NA 901 NA B6
#&gt; # … with 8,797 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Find all flights where <code>arr_delay</code> is missing but <code>dep_delay</code> is not. Find all flights where neither <code>arr_time</code> nor <code>sched_arr_time</code> is missing, but <code>arr_delay</code> is.</li>
<li>How many flights have a missing <code>dep_time</code>? What other variables are missing in these rows? What might these rows represent?</li>
<li>Assuming that a missing <code>dep_time</code> implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and average delay of non-cancelled flights?</li>
</ol></section>
</section>
<section id="sec-logical-summaries" data-type="sect1">
<h1>
Summaries</h1>
<p>The following sections describe some useful techniques for summarizing logical vectors. As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.</p>
<section id="logical-summaries" data-type="sect2">
<h2>
Logical summaries</h2>
<p>There are two main logical summaries: <code><a href="#chp-https://rdrr.io/r/base/any" data-type="xref">#chp-https://rdrr.io/r/base/any</a></code> and <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code>. <code>any(x)</code> is the equivalent of <code>|</code>; it’ll return <code>TRUE</code> if there are any <code>TRUE</code>s in <code>x</code>. <code>all(x)</code> is the equivalent of <code>&amp;</code>; it’ll return <code>TRUE</code> only if all values of <code>x</code> are <code>TRUE</code>s. Like other summary functions, they can return <code>NA</code> when missing values are present (whenever a missing value could change the answer), and as usual you can make the missing values go away with <code>na.rm = TRUE</code>.</p>
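<p>One subtlety is worth seeing on a toy example: the result is <code>NA</code> only when the missing value could change the answer. The vectors below are made up for illustration and are not part of the <code>flights</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A single TRUE settles any(); a single FALSE settles all()
any(c(TRUE, NA))
#&gt; [1] TRUE
all(c(FALSE, NA))
#&gt; [1] FALSE

# Otherwise the missing value makes the answer unknown
any(c(FALSE, NA))
#&gt; [1] NA
all(c(TRUE, NA))
#&gt; [1] NA

# na.rm = TRUE drops the missing values before summarizing
any(c(FALSE, NA), na.rm = TRUE)
#&gt; [1] FALSE
all(c(TRUE, NA), na.rm = TRUE)
#&gt; [1] TRUE</pre>
</div>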
<p>For example, we could use <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code> to find out if there were days where every flight was delayed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
summarise(
all_delayed = all(arr_delay &gt;= 0, na.rm = TRUE),
any_delayed = any(arr_delay &gt;= 0, na.rm = TRUE),
.groups = "drop"
)
#&gt; # A tibble: 365 × 5
#&gt; year month day all_delayed any_delayed
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;lgl&gt; &lt;lgl&gt;
#&gt; 1 2013 1 1 FALSE TRUE
#&gt; 2 2013 1 2 FALSE TRUE
#&gt; 3 2013 1 3 FALSE TRUE
#&gt; 4 2013 1 4 FALSE TRUE
#&gt; 5 2013 1 5 FALSE TRUE
#&gt; 6 2013 1 6 FALSE TRUE
#&gt; # … with 359 more rows</pre>
</div>
<p>In most cases, however, <code><a href="#chp-https://rdrr.io/r/base/any" data-type="xref">#chp-https://rdrr.io/r/base/any</a></code> and <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code> are a little too crude, and it would be nice to be able to get a little more detail about how many values are <code>TRUE</code> or <code>FALSE</code>. That leads us to the numeric summaries.</p>
</section>
<section id="numeric-summaries-of-logical-vectors" data-type="sect2">
<h2>
Numeric summaries of logical vectors</h2>
<p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
summarise(
prop_delayed = mean(arr_delay &gt; 0, na.rm = TRUE),
.groups = "drop"
) |&gt;
ggplot(aes(prop_delayed)) +
geom_histogram(binwidth = 0.05)</pre>
<div class="cell-output-display">
<figure class="figure"><p><img src="logicals_files/figure-html/fig-prop-delayed-dist-1.png" alt="The distribution is unimodal and mildly right skewed. The distribution peaks around 30% delayed flights." width="576"/></p>
<figcaption class="figure-caption">Figure 12.2: A histogram showing the proportion of delayed flights each day.</figcaption>
</figure>
</div>
</div>
<p>Or we could ask how many flights left before 5am, which are often flights that were delayed from the previous day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
summarise(
n_early = sum(dep_time &lt; 500, na.rm = TRUE),
.groups = "drop"
) |&gt;
arrange(desc(n_early))
#&gt; # A tibble: 365 × 4
#&gt; year month day n_early
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 6 28 32
#&gt; 2 2013 4 10 30
#&gt; 3 2013 7 28 30
#&gt; 4 2013 3 18 29
#&gt; 5 2013 7 7 29
#&gt; 6 2013 7 10 29
#&gt; # … with 359 more rows</pre>
</div>
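<p>If the coercion feels abstract, a small vector makes the arithmetic concrete. This is only a sketch; <code>delay</code> and <code>late</code> are hypothetical names with made-up values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical delays in minutes; NA stands for a cancelled flight
delay &lt;- c(12, -3, NA, 45, 0)
late &lt;- delay &gt; 0
late
#&gt; [1] TRUE FALSE NA TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so sum() counts the late flights
n_late &lt;- sum(late, na.rm = TRUE)
n_late
#&gt; [1] 2

# mean() gives the proportion of non-missing flights that were late
prop_late &lt;- mean(late, na.rm = TRUE)
prop_late
#&gt; [1] 0.5</pre>
</div>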
</section>
<section id="logical-subsetting" data-type="sect2">
<h2>
Logical subsetting</h2>
<p>There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base <code>[</code> (pronounced “subset”) operator, which you’ll learn more about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>.</p>
<p>Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(arr_delay &gt; 0) |&gt;
group_by(year, month, day) |&gt;
summarise(
behind = mean(arr_delay),
n = n(),
.groups = "drop"
)
#&gt; # A tibble: 365 × 5
#&gt; year month day behind n
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013 1 1 32.5 461
#&gt; 2 2013 1 2 32.0 535
#&gt; 3 2013 1 3 27.7 460
#&gt; 4 2013 1 4 28.3 297
#&gt; 5 2013 1 5 22.6 238
#&gt; 6 2013 1 6 24.4 381
#&gt; # … with 359 more rows</pre>
</div>
<p>This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together<span data-type="footnote">We’ll cover this in <a href="#chp-joins" data-type="xref">#chp-joins</a>.</span>. Instead you could use <code>[</code> to perform inline filtering: <code>arr_delay[arr_delay &gt; 0]</code> will yield only the positive arrival delays.</p>
<p>This leads to:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
summarise(
behind = mean(arr_delay[arr_delay &gt; 0], na.rm = TRUE),
ahead = mean(arr_delay[arr_delay &lt; 0], na.rm = TRUE),
n = n(),
.groups = "drop"
)
#&gt; # A tibble: 365 × 6
#&gt; year month day behind ahead n
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 2013 1 1 32.5 -12.5 842
#&gt; 2 2013 1 2 32.0 -14.3 943
#&gt; 3 2013 1 3 27.7 -18.2 914
#&gt; 4 2013 1 4 28.3 -17.0 915
#&gt; 5 2013 1 5 22.6 -14.0 720
#&gt; 6 2013 1 6 24.4 -13.6 832
#&gt; # … with 359 more rows</pre>
</div>
<p>Also note the difference in the group size: in the first chunk <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> gives the number of delayed flights per day; in the second, <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> gives the total number of flights.</p>
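<p>To see what inline subsetting does outside of a data frame, here is a minimal sketch on a made-up vector (<code>delay</code> is a hypothetical name). Note that a missing value in the condition produces a missing value in the subset, which is why the summaries above still need <code>na.rm = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical arrival delays in minutes, with one missing value
delay &lt;- c(10, -5, NA, 30, -20)

# The NA in the condition is carried into the result
delay[delay &gt; 0]
#&gt; [1] 10 NA 30

behind &lt;- mean(delay[delay &gt; 0], na.rm = TRUE)
behind
#&gt; [1] 20
ahead &lt;- mean(delay[delay &lt; 0], na.rm = TRUE)
ahead
#&gt; [1] -12.5

# The full vector is untouched, so counts still cover every value
length(delay)
#&gt; [1] 5</pre>
</div>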
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
<li>What does <code><a href="#chp-https://rdrr.io/r/base/prod" data-type="xref">#chp-https://rdrr.io/r/base/prod</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? What does <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> return applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.</li>
</ol></section>
</section>
<section id="conditional-transformations" data-type="sect1">
<h1>
Conditional transformations</h1>
<p>One of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>.</p>
<section id="if_else" data-type="sect2">
<h2>
<code>if_else()</code>
</h2>
<p>If you want to use one value when a condition is <code>TRUE</code> and another value when it’s <code>FALSE</code>, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code><span data-type="footnote">dplyr’s <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> is very similar to base R’s <code><a href="#chp-https://rdrr.io/r/base/ifelse" data-type="xref">#chp-https://rdrr.io/r/base/ifelse</a></code>. There are two main advantages of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> over <code><a href="#chp-https://rdrr.io/r/base/ifelse" data-type="xref">#chp-https://rdrr.io/r/base/ifelse</a></code>: you can choose what should happen to missing values, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>. The first argument, <code>condition</code>, is a logical vector; the second, <code>true</code>, gives the output when the condition is true; and the third, <code>false</code>, gives the output when the condition is false.</p>
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(-3:3, NA)
if_else(x &gt; 0, "+ve", "-ve")
#&gt; [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA</pre>
</div>
<p>There’s an optional fourth argument, <code>missing</code>, which will be used if the input is <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">if_else(x &gt; 0, "+ve", "-ve", "???")
#&gt; [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>You can also use vectors for the <code>true</code> and <code>false</code> arguments. For example, this allows us to create a minimal implementation of <code><a href="#chp-https://rdrr.io/r/base/MathFun" data-type="xref">#chp-https://rdrr.io/r/base/MathFun</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">if_else(x &lt; 0, -x, x)
#&gt; [1] 3 2 1 0 1 2 3 NA</pre>
</div>
<p>So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code> like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- c(NA, 1, 2, NA)
y1 &lt;- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
#&gt; [1] 3 1 2 6</pre>
</div>
<p>You might have noticed a small infelicity in our labeling: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">if_else(x == 0, "0", if_else(x &lt; 0, "-ve", "+ve"), "???")
#&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>.</p>
</section>
<section id="case_when" data-type="sect2">
<h2>
<code>case_when()</code>
</h2>
<p>dplyr’s <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
<p>This means we could recreate our previous nested <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> as follows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">case_when(
x == 0 ~ "0",
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve",
is.na(x) ~ "???"
)
#&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>This is more code, but it’s also more explicit.</p>
<p>To explain how <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> works, let’s explore some simpler cases. If none of the cases match, the output gets an <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">case_when(
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve"
)
#&gt; [1] "-ve" "-ve" "-ve" NA "+ve" "+ve" "+ve" NA</pre>
</div>
<p>If you want to create a “default”/catch-all value, use <code>TRUE</code> on the left-hand side:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">case_when(
x &lt; 0 ~ "-ve",
x &gt; 0 ~ "+ve",
TRUE ~ "???"
)
#&gt; [1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"</pre>
</div>
<p>And note that if multiple conditions match, only the first will be used:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">case_when(
x &gt; 0 ~ "+ve",
x &gt; 2 ~ "big"
)
#&gt; [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre>
</div>
<p>Just like with <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>, you can use variables on both sides of the <code>~</code>, and you can mix and match variables as needed for your problem. For example, we could use <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> to provide some human-readable labels for the arrival delay:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
status = case_when(
is.na(arr_delay) ~ "cancelled",
arr_delay &gt; 60 ~ "very late",
arr_delay &gt; 15 ~ "late",
abs(arr_delay) &lt;= 15 ~ "on time",
arr_delay &lt; -30 ~ "very early",
arr_delay &lt; -15 ~ "early",
),
.keep = "used"
)
#&gt; # A tibble: 336,776 × 2
#&gt; arr_delay status
#&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 11 on time
#&gt; 2 20 late
#&gt; 3 33 late
#&gt; 4 -18 early
#&gt; 5 -25 early
#&gt; 6 12 on time
#&gt; # … with 336,770 more rows</pre>
</div>
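<p>Because only the first matching condition is used, the order of overlapping conditions matters: the narrower test has to come before the broader one. The following sketch uses a made-up vector (<code>delay</code> is a hypothetical name) to show both orderings:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical arrival delays in minutes
delay &lt;- c(-45, -20, 0, 20, 90)

# Narrower condition first: every label can be reached
narrow_first &lt;- case_when(
  delay &lt; -30 ~ "very early",
  delay &lt; -15 ~ "early",
  delay &lt;= 15 ~ "on time",
  delay &lt;= 60 ~ "late",
  TRUE ~ "very late"
)
narrow_first
#&gt; [1] "very early" "early" "on time" "late" "very late"

# Broader condition first: "very early" can never match,
# because anything below -30 is also below -15
broad_first &lt;- case_when(
  delay &lt; -15 ~ "early",
  delay &lt; -30 ~ "very early",
  TRUE ~ "other"
)
broad_first
#&gt; [1] "early" "early" "other" "other" "other"</pre>
</div>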
</section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, <code>==</code>, <code>!=</code>, and <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="#chp-https://rdrr.io/r/base/any" data-type="xref">#chp-https://rdrr.io/r/base/any</a></code>, <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code>, <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>, and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>. You also learned the powerful <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> that allow you to return values depending on the value of a logical vector.</p>
<p>We’ll see logical vectors again in the following chapters. For example, in <a href="#chp-strings" data-type="xref">#chp-strings</a> you’ll learn about <code>str_detect(x, pattern)</code>, which returns a logical vector that’s <code>TRUE</code> for the elements of <code>x</code> that match the <code>pattern</code>, and in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.</p>
</section>
</section>

342
oreilly/missing-values.html Normal file

@ -0,0 +1,342 @@
<section data-type="chapter" id="chp-missing-values">
<h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>You’ve already learned the basics of missing values earlier in the book. You first saw them in <a href="#sec-summarize" data-type="xref">#sec-summarize</a> where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in <a href="#sec-na-comparison" data-type="xref">#sec-na-comparison</a>. Now we’ll come back to them in more depth, so you can learn more of the details.</p>
<p>We’ll start by discussing some general tools for working with missing values recorded as <code>NA</code>s. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
</section>
</section>
<section id="explicit-missing-values" data-type="sect1">
<h1>
Explicit missing values</h1>
<p>To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an <code>NA</code>.</p>
<section id="last-observation-carried-forward" data-type="sect2">
<h2>
Last observation carried forward</h2>
<p>A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">treatment &lt;- tribble(
~person, ~treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, NA,
"Katherine Burke", 1, 4
)</pre>
</div>
<p>You can fill in these missing values with <code><a href="#chp-https://tidyr.tidyverse.org/reference/fill" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/fill</a></code>. It works like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, taking a set of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">treatment |&gt;
fill(everything())
#&gt; # A tibble: 4 × 3
#&gt; person treatment response
#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 Derrick Whitmore 1 7
#&gt; 2 Derrick Whitmore 2 10
#&gt; 3 Derrick Whitmore 3 10
#&gt; 4 Katherine Burke 1 4</pre>
</div>
<p>This treatment is sometimes called “last observation carried forward”, or <strong>locf</strong> for short. You can use the <code>.direction</code> argument to fill in missing values that have been generated in more exotic ways.</p>
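<p>As one example of a more exotic layout, imagine a table where each label was typed on the last row of its group rather than the first. The data below is made up for illustration (<code>totals</code>, <code>region</code>, and <code>amount</code> are hypothetical names); filling upwards recovers the labels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# Hypothetical data: the region name appears only on the last row of each group
totals &lt;- tibble(
  region = c(NA, NA, "North", NA, "South"),
  amount = c(5, 7, 12, 3, 4)
)

# .direction = "up" carries the next observed value backwards
filled &lt;- totals |&gt;
  fill(region, .direction = "up")
filled$region
#&gt; [1] "North" "North" "North" "South" "South"</pre>
</div>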
</section>
<section id="fixed-values" data-type="sect2">
<h2>
Fixed values</h2>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code> to replace them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, NA)
coalesce(x, 0)
#&gt; [1] 1 4 5 7 0</pre>
</div>
<p>Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/na_if" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/na_if</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, -99)
na_if(x, -99)
#&gt; [1] 1 4 5 7 NA</pre>
</div>
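<p>For comparison, handling the sentinel at read time might look like the sketch below. It assumes readr 2.0 or later, where <code>I()</code> marks a string as literal data, and the inline CSV is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical CSV in which -99 stands for a missing reading.
# Supplying na replaces the defaults, so "" and "NA" are listed again.
readings &lt;- read_csv(
  I("id,value\n1,4\n2,-99\n3,7"),
  na = c("", "NA", "-99"),
  show_col_types = FALSE
)
is.na(readings$value)
#&gt; [1] FALSE TRUE FALSE</pre>
</div>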
</section>
<section id="nan" data-type="sect2">
<h2>
NaN</h2>
<p>Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a <code>NaN</code> (pronounced “nan”), or <strong>n</strong>ot <strong>a</strong> <strong>n</strong>umber. It’s not that important to know about because it generally behaves just like <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(NA, NaN)
x * 10
#&gt; [1] NA NaN
x == 1
#&gt; [1] NA NA
is.na(x)
#&gt; [1] TRUE TRUE</pre>
</div>
<p>In the rare case you need to distinguish an <code>NA</code> from a <code>NaN</code>, you can use <code>is.nan(x)</code>.</p>
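<p>A quick check on a toy vector shows the difference between the two tests:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(NA, NaN, 1)

# is.na() is TRUE for both kinds of missing value
is.na(x)
#&gt; [1] TRUE TRUE FALSE

# is.nan() is TRUE only for NaN
is.nan(x)
#&gt; [1] FALSE TRUE FALSE</pre>
</div>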
<p>You’ll generally encounter a <code>NaN</code> when you perform a mathematical operation that has an indeterminate result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">0 / 0
#&gt; [1] NaN
0 * Inf
#&gt; [1] NaN
Inf - Inf
#&gt; [1] NaN
sqrt(-1)
#&gt; Warning in sqrt(-1): NaNs produced
#&gt; [1] NaN</pre>
</div>
</section>
</section>
<section id="sec-missing-implicit" data-type="sect1">
<h1>
Implicit missing values</h1>
<p>So far we’ve talked about missing values that are <strong>explicitly</strong> missing, i.e. you can see an <code>NA</code> in your data. But missing values can also be <strong>implicitly</strong> missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks &lt;- tibble(
year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)</pre>
</div>
<p>This dataset has two missing observations:</p>
<ul><li><p>The <code>price</code> in the fourth quarter of 2020 is explicitly missing, because its value is <code>NA</code>.</p></li>
<li><p>The <code>price</code> for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.</p></li>
</ul><p>One way to think about the difference is with this Zen-like koan:</p>
<blockquote class="blockquote">
<p>An explicit missing value is the presence of an absence.<br/></p>
<p>An implicit missing value is the absence of a presence.</p>
</blockquote>
<p>Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.</p>
<section id="pivoting" data-type="sect2">
<h2>
Pivoting</h2>
<p>You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot <code>stocks</code> to put the quarter (<code>qtr</code>) in the columns, both missing values become explicit:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
pivot_wider(
names_from = qtr,
values_from = price
)
#&gt; # A tibble: 2 × 5
#&gt; year `1` `2` `3` `4`
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2020 1.88 0.59 0.35 NA
#&gt; 2 2021 NA 0.92 0.17 2.66</pre>
</div>
<p>By default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting <code>values_drop_na = TRUE</code>. See the examples in <a href="#sec-tidy-data" data-type="xref">#sec-tidy-data</a> for more details.</p>
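<p>As a rough sketch of the reverse direction, we can widen <code>stocks</code> and then lengthen it again with <code>values_drop_na = TRUE</code>. The <code>stocks</code> data is repeated so the chunk runs on its own; <code>wide</code> and <code>long</code> are names chosen for this example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

stocks &lt;- tibble(
  year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr = c(1, 2, 3, 4, 2, 3, 4),
  price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# Widening creates 2 x 4 = 8 cells, two of which are NA
wide &lt;- stocks |&gt;
  pivot_wider(names_from = qtr, values_from = price)

# Lengthening with values_drop_na = TRUE removes both NA cells,
# so both missing values are now implicit
long &lt;- wide |&gt;
  pivot_longer(
    cols = !year,
    names_to = "qtr",
    values_to = "price",
    values_drop_na = TRUE
  )
nrow(long)
#&gt; [1] 6</pre>
</div>
<p>Note that <code>qtr</code> comes back as a character column, because it was stored in the column names.</p>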
</section>
<section id="complete" data-type="sect2">
<h2>
Complete</h2>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
complete(year, qtr)
#&gt; # A tibble: 8 × 3
#&gt; year qtr price
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2020 1 1.88
#&gt; 2 2020 2 0.59
#&gt; 3 2020 3 0.35
#&gt; 4 2020 4 NA
#&gt; 5 2021 1 NA
#&gt; 6 2021 2 0.92
#&gt; # … with 2 more rows</pre>
</div>
<p>Typically, you’ll call <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
complete(year = 2019:2021, qtr)
#&gt; # A tibble: 12 × 3
#&gt; year qtr price
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2019 1 NA
#&gt; 2 2019 2 NA
#&gt; 3 2019 3 NA
#&gt; 4 2019 4 NA
#&gt; 5 2020 1 1.88
#&gt; 6 2020 2 0.59
#&gt; # … with 6 more rows</pre>
</div>
<p>If the range of a variable is correct, but not all values are present, you could use <code>full_seq(x, 1)</code> to generate all values from <code>min(x)</code> to <code>max(x)</code> spaced out by 1.</p>
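<p>Here is a small sketch of that idea, using a made-up <code>sales</code> table (a hypothetical name) in which the year 2020 never appears:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# full_seq() fills the gaps between the smallest and largest value
full_seq(c(2019, 2021), 1)
#&gt; [1] 2019 2020 2021

# Hypothetical data with no rows at all for 2020
sales &lt;- tibble(
  year = c(2019, 2019, 2021),
  qtr = c(1, 2, 1),
  amount = c(10, 12, 9)
)

# 3 years x 2 quarters = 6 rows after completing
completed &lt;- sales |&gt;
  complete(year = full_seq(year, 1), qtr)
nrow(completed)
#&gt; [1] 6</pre>
</div>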
<p>In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>.</p>
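<p>The manual approach might look something like the sketch below. It assumes, purely for illustration, that reporting began in the third quarter of 2019, so the expected rows are not a full grid; <code>expected</code> and <code>combined</code> are hypothetical names, and the <code>stocks</code> data is repeated so the chunk runs on its own:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(tidyr)

stocks &lt;- tibble(
  year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr = c(1, 2, 3, 4, 2, 3, 4),
  price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)

# Build every year/quarter pair, then keep only those on or after 2019 Q3
expected &lt;- expand_grid(
  year = c(2019, 2020, 2021),
  qtr = c(1, 2, 3, 4)
) |&gt;
  filter(year &gt; 2019 | qtr &gt;= 3)

# Rows in expected with no match in stocks get an NA price
combined &lt;- expected |&gt;
  full_join(stocks, by = c("year", "qtr"))
nrow(combined)
#&gt; [1] 10
sum(is.na(combined$price))
#&gt; [1] 4</pre>
</div>
<p>Three of the four missing prices come from rows that were absent from <code>stocks</code>; the fourth was already an explicit <code>NA</code>.</p>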
</section>
<section id="joins" data-type="sect2">
<h2>
Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
flights |&gt;
distinct(faa = dest) |&gt;
anti_join(airports)
#&gt; Joining with `by = join_by(faa)`
#&gt; # A tibble: 4 × 1
#&gt; faa
#&gt; &lt;chr&gt;
#&gt; 1 BQN
#&gt; 2 SJU
#&gt; 3 STT
#&gt; 4 PSE
flights |&gt;
distinct(tailnum) |&gt;
anti_join(planes)
#&gt; Joining with `by = join_by(tailnum)`
#&gt; # A tibble: 722 × 1
#&gt; tailnum
#&gt; &lt;chr&gt;
#&gt; 1 N3ALAA
#&gt; 2 N3DUAA
#&gt; 3 N542MQ
#&gt; 4 N730MQ
#&gt; 5 N9EAMQ
#&gt; 6 N532UA
#&gt; # … with 716 more rows</pre>
</div>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Can you find any relationship between the carrier and the rows that appear to be missing from <code>planes</code>?</li>
</ol></section>
</section>
<section id="factors-and-empty-groups" data-type="sect1">
<h1>
Factors and empty groups</h1>
<p>A final type of missingness is the empty group, a group that doesn't contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health &lt;- tibble(
name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
age = c(34L, 88L, 75L, 47L, 56L),
)</pre>
</div>
<p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker)
#&gt; # A tibble: 1 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 no 5</pre>
</div>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can ask <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker, .drop = FALSE)
#&gt; # A tibble: 2 × 2
#&gt; smoker n
#&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 yes 0
#&gt; 2 no 5</pre>
</div>
<p>The same principle applies to ggplot2's discrete axes, which will also drop levels that don't have any values. You can force them to display by supplying <code>drop = FALSE</code> to the appropriate discrete axis:</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(health, aes(smoker)) +
geom_bar() +
scale_x_discrete()
ggplot(health, aes(smoker)) +
geom_bar() +
scale_x_discrete(drop = FALSE)</pre>
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="missing-values_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A bar chart with a single value on the x-axis, &quot;no&quot;." width="288"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="missing-values_files/figure-html/unnamed-chunk-17-2.png" class="img-fluid" alt="The same bar chart as the last plot, but now with two values on the x-axis, &quot;yes&quot; and &quot;no&quot;. There is no bar for the &quot;yes&quot; category." width="288"/></p>
</div>
</div>
</div>
</div>
<p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt;
group_by(smoker, .drop = FALSE) |&gt;
summarise(
n = n(),
mean_age = mean(age),
min_age = min(age),
max_age = max(age),
sd_age = sd(age)
)
#&gt; Warning: There were 2 warnings in `summarise()`.
#&gt; The first warning was:
#&gt; In argument `min_age = min(age)`.
#&gt; In group 1: `smoker = yes`.
#&gt; Caused by warning in `min()`:
#&gt; ! no non-missing arguments to min; returning Inf
#&gt; Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#&gt; # A tibble: 2 × 6
#&gt; smoker n mean_age min_age max_age sd_age
#&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 yes 0 NaN Inf -Inf NA
#&gt; 2 no 5 60 34 88 21.6</pre>
</div>
<p>We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There's an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A vector containing two missing values
x1 &lt;- c(NA, NA)
length(x1)
#&gt; [1] 2
# A vector containing nothing
x2 &lt;- numeric()
length(x2)
#&gt; [1] 0</pre>
</div>
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute you'll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
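<p>A small sketch makes the identity in the footnote concrete. The vectors <code>x</code> and <code>y</code> are made up for illustration, and we suppress the warning that <code>min()</code> emits for an empty input:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- numeric()   # an empty vector
y &lt;- c(3, 1, 2)  # some new data

# min() of an empty vector warns and returns Inf
min_x &lt;- suppressWarnings(min(x))
min_x
#&gt; [1] Inf

# Combining the per-piece minimums gives the minimum of all the data
min(min_x, min(y)) == min(c(x, y))
#&gt; [1] TRUE</pre>
</div>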
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt;
group_by(smoker) |&gt;
summarise(
n = n(),
mean_age = mean(age),
min_age = min(age),
max_age = max(age),
sd_age = sd(age)
) |&gt;
complete(smoker)
#&gt; # A tibble: 2 × 6
#&gt; smoker n mean_age min_age max_age sd_age
#&gt; &lt;fct&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 yes NA NA NA NA NA
#&gt; 2 no 5 60 34 88 21.6</pre>
</div>
<p>The main drawback of this approach is that you get an <code>NA</code> for the count, even though you know that it should be zero.</p>
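<p>One possible workaround, sketched here with a reduced set of summaries, is the <code>fill</code> argument of <code>complete()</code>, which lets you supply a replacement value for specific columns in the rows it adds:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(tidyr)

health &lt;- tibble(
  name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
  age = c(34L, 88L, 75L, 47L, 56L)
)

counts &lt;- health |&gt;
  group_by(smoker) |&gt;
  summarise(n = n(), mean_age = mean(age)) |&gt;
  # Fill the count with 0 in newly created rows; other columns stay NA
  complete(smoker, fill = list(n = 0L))
counts</pre>
</div>
<p>The row for <code>yes</code> now has <code>n</code> equal to 0, while <code>mean_age</code> is still <code>NA</code>, which is arguably the right answer for a group with no observations.</p>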
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>Missing values are weird! Sometimes they're recorded as an explicit <code>NA</code> but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit can become explicit and vice versa.</p>
<p>In the next chapter, we tackle the final chapter in this part of the book: joins. This is a bit of a change from the chapters so far because we're going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.</p>
</section>
</section>

838
oreilly/numbers.html Normal file

@ -0,0 +1,838 @@
<section data-type="chapter" id="chp-numbers">
<h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Numeric vectors are the backbone of data science, and you've already used them a bunch of times earlier in the book. Now it's time to systematically survey what you can do with them in R, ensuring that you're well situated to tackle any future problem involving numeric vectors.</p>
<p>We'll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we'll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We'll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we'll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we'll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre>
</div>
</section>
</section>
<section id="making-numbers" data-type="sect1">
<h1>
Making numbers</h1>
<p>In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double. In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or something has gone wrong in your data import process.</p>
<p>readr provides two useful functions for parsing strings into numbers: <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code>. Use <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> when you have numbers that have been written as strings:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("1.2", "5.6", "1e3")
parse_double(x)
#&gt; [1] 1.2 5.6 1000.0</pre>
</div>
<p>Use <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("$1,234", "USD 3,513", "59%")
parse_number(x)
#&gt; [1] 1234 3513 59</pre>
</div>
</section>
<section id="counts" data-type="sect1">
<h1>
Counts</h1>
<p>It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. This function is great for quick exploration and checks during analysis:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(dest)
#&gt; # A tibble: 105 × 2
#&gt; dest n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ABQ 254
#&gt; 2 ACK 265
#&gt; 3 ALB 439
#&gt; 4 ANC 8
#&gt; 5 ATL 17215
#&gt; 6 AUS 2439
#&gt; # … with 99 more rows</pre>
</div>
<p>(Despite the advice in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, we usually put <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> on a single line because it's usually used at the console for a quick check that a calculation is working as expected.)</p>
<p>If you want to see the most common values add <code>sort = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(dest, sort = TRUE)
#&gt; # A tibble: 105 × 2
#&gt; dest n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ORD 17283
#&gt; 2 ATL 17215
#&gt; 3 LAX 16174
#&gt; 4 BOS 15508
#&gt; 5 MCO 14082
#&gt; 6 CLT 14064
#&gt; # … with 99 more rows</pre>
</div>
<p>And remember that if you want to see all the values, you can use <code>|&gt; View()</code> or <code>|&gt; print(n = Inf)</code>.</p>
<p>You can perform the same computation “by hand” with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>. This is useful because it allows you to compute other summaries at the same time:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
summarise(
n = n(),
delay = mean(arr_delay, na.rm = TRUE)
)
#&gt; # A tibble: 105 × 3
#&gt; dest n delay
#&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt;
#&gt; 1 ABQ 254 4.38
#&gt; 2 ACK 265 4.85
#&gt; 3 ALB 439 14.4
#&gt; 4 ANC 8 -2.5
#&gt; 5 ATL 17215 11.3
#&gt; 6 AUS 2439 6.02
#&gt; # … with 99 more rows</pre>
</div>
<p><code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> is a special summary function that doesn't take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">n()
#&gt; Error in `n()`:
#&gt; ! Must only be used inside data-masking verbs like `mutate()`,
#&gt; `filter()`, and `group_by()`.</pre>
</div>
<p>There are a couple of variants of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> that you might find useful:</p>
<ul><li>
<p><code>n_distinct(x)</code> counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
summarise(
carriers = n_distinct(carrier)
) |&gt;
arrange(desc(carriers))
#&gt; # A tibble: 105 × 2
#&gt; dest carriers
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ATL 7
#&gt; 2 BOS 7
#&gt; 3 CLT 7
#&gt; 4 ORD 7
#&gt; 5 TPA 7
#&gt; 6 AUS 6
#&gt; # … with 99 more rows</pre>
</div>
</li>
<li>
<p>A weighted count is a sum. For example, you could “count” the number of miles each plane flew:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(tailnum) |&gt;
summarise(miles = sum(distance))
#&gt; # A tibble: 4,044 × 2
#&gt; tailnum miles
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 D942DN 3418
#&gt; 2 N0EGMQ 250866
#&gt; 3 N10156 115966
#&gt; 4 N102UW 25722
#&gt; 5 N103US 24619
#&gt; 6 N104UW 25157
#&gt; # … with 4,038 more rows</pre>
</div>
<p>Weighted counts are a common problem so <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> has a <code>wt</code> argument that does the same thing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(tailnum, wt = distance)
#&gt; # A tibble: 4,044 × 2
#&gt; tailnum n
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 D942DN 3418
#&gt; 2 N0EGMQ 250866
#&gt; 3 N10156 115966
#&gt; 4 N102UW 25722
#&gt; 5 N103US 24619
#&gt; 6 N104UW 25157
#&gt; # … with 4,038 more rows</pre>
</div>
</li>
<li>
<p>You can count missing values by combining <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>. In the <code>flights</code> dataset this represents flights that are cancelled:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
summarise(n_cancelled = sum(is.na(dep_time)))
#&gt; # A tibble: 105 × 2
#&gt; dest n_cancelled
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 ABQ 0
#&gt; 2 ACK 0
#&gt; 3 ALB 20
#&gt; 4 ANC 0
#&gt; 5 ATL 317
#&gt; 6 AUS 21
#&gt; # … with 99 more rows</pre>
</div>
</li>
</ul>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How can you use <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to count the number of rows with a missing value for a given variable?</li>
<li>Expand the following calls to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to instead use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>:
<ol type="1"><li><p><code>flights |&gt; count(dest, sort = TRUE)</code></p></li>
<li><p><code>flights |&gt; count(tailnum, wt = distance)</code></p></li>
</ol></li>
</ol></section>
</section>
<section id="numeric-transformations" data-type="sect1">
<h1>
Numeric transformations</h1>
<p>Transformation functions work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It's impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don't list them here because they're rarely needed for data science.</p>
<section id="sec-recycling" data-type="sect2">
<h2>
Arithmetic and recycling rules</h2>
<p>We introduced the basics of arithmetic (<code>+</code>, <code>-</code>, <code>*</code>, <code>/</code>, <code>^</code>) in <a href="#chp-workflow-basics" data-type="xref">#chp-workflow-basics</a> and have used them a bunch since. These functions don't need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the <strong>recycling rules</strong> which determine what happens when the left and right hand sides have different lengths. This is important for operations like <code>flights |&gt; mutate(air_time = air_time / 60)</code> because there are 336,776 numbers on the left of <code>/</code> but only one on the right.</p>
<p>R handles mismatched lengths by <strong>recycling,</strong> or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 10, 20)
x / 5
#&gt; [1] 0.2 0.4 2.0 4.0
# is shorthand for
x / c(5, 5, 5, 5)
#&gt; [1] 0.2 0.4 2.0 4.0</pre>
</div>
<p>Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn't a multiple of the shorter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x * c(1, 2)
#&gt; [1] 1 4 10 40
x * c(1, 2, 3)
#&gt; Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
#&gt; object length
#&gt; [1] 1 4 30 20</pre>
</div>
<p>These recycling rules are also applied to logical comparisons (<code>==</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code>) and can lead to a surprising result if you accidentally use <code>==</code> instead of <code>%in%</code> and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == c(1, 2))
#&gt; # A tibble: 25,977 × 19
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 542 540 2 923 850 33 AA
#&gt; 3 2013 1 1 554 600 -6 812 837 -25 DL
#&gt; 4 2013 1 1 555 600 -5 913 854 19 B6
#&gt; 5 2013 1 1 557 600 -3 838 846 -8 B6
#&gt; 6 2013 1 1 558 600 -2 849 851 -2 B6
#&gt; # … with 25,971 more rows, 9 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
<p>The code runs without error, but it doesn't return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there's no warning because <code>flights</code> has an even number of rows.</p>
<p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn't help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
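<p>To see the difference between <code>==</code> and <code>%in%</code> on a small scale, here is a sketch using a made-up vector of months rather than the full <code>flights</code> data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">month &lt;- c(1, 2, 2, 1, 3, 2)

# `==` recycles c(1, 2) to 1, 2, 1, 2, 1, 2 and compares position by position
eq &lt;- month == c(1, 2)
eq
#&gt; [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE

# `%in%` asks whether each element matches any value on the right
isin &lt;- month %in% c(1, 2)
isin
#&gt; [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE</pre>
</div>
<p>So <code>flights |&gt; filter(month %in% c(1, 2))</code> is the form that actually finds all flights in January and February.</p>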
</section>
<section id="minimum-and-maximum" data-type="sect2">
<h2>
Minimum and maximum</h2>
<p>The arithmetic functions work with pairs of variables. Two closely related functions are <code><a href="https://rdrr.io/r/base/Extremes.html">pmin()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">pmax()</a></code>, which when given two or more variables will return the smallest or largest value in each row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~x, ~y,
1, 3,
5, 2,
7, NA,
)
df |&gt;
mutate(
min = pmin(x, y, na.rm = TRUE),
max = pmax(x, y, na.rm = TRUE)
)
#&gt; # A tibble: 3 × 4
#&gt; x y min max
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 3 1 3
#&gt; 2 5 2 2 5
#&gt; 3 7 NA 7 7</pre>
</div>
<p>Note that these are different to the summary functions <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> which take multiple observations and return a single value. You can tell that you've used the wrong form when all the minimums and all the maximums have the same value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
mutate(
min = min(x, y, na.rm = TRUE),
max = max(x, y, na.rm = TRUE)
)
#&gt; # A tibble: 3 × 4
#&gt; x y min max
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 3 1 7
#&gt; 2 5 2 1 7
#&gt; 3 7 NA 1 7</pre>
</div>
</section>
<section id="modular-arithmetic" data-type="sect2">
<h2>
Modular arithmetic</h2>
<p>Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder. In R, <code>%/%</code> does integer division and <code>%%</code> computes the remainder:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">1:10 %/% 3
#&gt; [1] 0 0 1 1 1 2 2 2 3 3
1:10 %% 3
#&gt; [1] 1 2 0 1 2 0 1 2 0 1</pre>
</div>
<p>Modular arithmetic is handy for the flights dataset, because we can use it to unpack the <code>sched_dep_time</code> variable into <code>hour</code> and <code>minute</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
hour = sched_dep_time %/% 100,
minute = sched_dep_time %% 100,
.keep = "used"
)
#&gt; # A tibble: 336,776 × 3
#&gt; sched_dep_time hour minute
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 515 5 15
#&gt; 2 529 5 29
#&gt; 3 540 5 40
#&gt; 4 545 5 45
#&gt; 5 600 6 0
#&gt; 6 558 5 58
#&gt; # … with 336,770 more rows</pre>
</div>
<p>We can combine that with the <code>mean(is.na(x))</code> trick from <a href="#sec-logical-summaries" data-type="xref">#sec-logical-summaries</a> to see how the proportion of cancelled flights varies over the course of the day. The results are shown in <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(hour = sched_dep_time %/% 100) |&gt;
summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |&gt;
filter(hour &gt; 1) |&gt;
ggplot(aes(hour, prop_cancelled)) +
geom_line(color = "grey50") +
geom_point(aes(size = n))</pre>
<div class="cell-output-display">
<figure id="fig-flights-dist-daily"><p><img src="numbers_files/figure-html/fig-prop-cancelled-1.png" alt="A line plot showing how proportion of cancelled flights changes over the course of the day. The proportion starts low at around 0.5% at 6am, then steadily increases over the course of the day until peaking at 4% at 7pm. The proportion of cancelled flights then drops rapidly getting down to around 1% by midnight." width="576"/></p>
<figcaption>Figure 13.1: A line plot with scheduled departure hour on the x-axis, and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.</figcaption>
</figure>
</div>
</div>
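<p>The same two operators can also be combined to turn a clock time stored as <code>HHMM</code> into a continuous number. As a sketch, using a few made-up departure times, here is how you might compute minutes since midnight:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical scheduled departure times in HHMM format
sched_dep_time &lt;- c(515, 600, 1359, 2359)

hour &lt;- sched_dep_time %/% 100   # whole hours
minute &lt;- sched_dep_time %% 100  # leftover minutes

minutes_since_midnight &lt;- hour * 60 + minute
minutes_since_midnight
#&gt; [1]  315  360  839 1439</pre>
</div>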
</section>
<section id="logarithms" data-type="sect2">
<h2>
Logarithms</h2>
<p>Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert exponential growth to linear growth. For example, take compounding interest — the amount of money you have at <code>year + 1</code> is the amount of money you had at <code>year</code> multiplied by the interest rate. That gives a formula like <code>money = starting * interest ^ year</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">starting &lt;- 100
interest &lt;- 1.05
money &lt;- tibble(
year = 2000 + 1:50,
money = starting * interest^(1:50)
)</pre>
</div>
<p>If you plot this data, youll get an exponential curve:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(money, aes(year, money)) +
geom_line()</pre>
<div class="cell-output-display">
<p><img src="numbers_files/figure-html/unnamed-chunk-22-1.png" width="576"/></p>
</div>
</div>
<p>Log transforming the y-axis gives a straight line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(money, aes(year, money)) +
geom_line() +
scale_y_log10()</pre>
<div class="cell-output-display">
<p><img src="numbers_files/figure-html/unnamed-chunk-23-1.png" width="576"/></p>
</div>
</div>
<p>This is a straight line because a little algebra reveals that <code>log(money) = log(starting) + year * log(interest)</code>, which matches the pattern for a line, <code>y = m * x + b</code>. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.</p>
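<p>You can check this numerically. In this sketch we rebuild the same series as plain vectors and confirm that each extra year adds the same amount, <code>log10(interest)</code>, on the log scale:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">starting &lt;- 100
interest &lt;- 1.05
money &lt;- starting * interest^(1:50)

# Year-to-year change on the log scale; a straight line has a constant slope
slopes &lt;- diff(log10(money))
all.equal(slopes, rep(log10(interest), 49))
#&gt; [1] TRUE</pre>
</div>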
<p>If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> (the natural log, base e), <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> (base 2), and <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> (base 10). We recommend using <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code>. <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> is easy to back-transform because (e.g.) 3 is 10^3 = 1000.</p>
<p>The inverse of <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> is <code><a href="https://rdrr.io/r/base/Log.html">exp()</a></code>; to compute the inverse of <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> you'll need to use <code>2^</code> or <code>10^</code>.</p>
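<p>A quick sketch of the three round trips, using an arbitrary vector <code>x</code> and <code>all.equal()</code> to allow for tiny floating point differences:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 10, 1000)

all.equal(exp(log(x)), x)    # natural log and exp()
#&gt; [1] TRUE
all.equal(2^log2(x), x)      # base 2
#&gt; [1] TRUE
all.equal(10^log10(x), x)    # base 10
#&gt; [1] TRUE</pre>
</div>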
</section>
<section id="sec-rounding" data-type="sect2">
<h2>
Rounding</h2>
<p>Use <code>round(x)</code> to round a number to the nearest integer:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">round(123.456)
#&gt; [1] 123</pre>
</div>
<p>You can control the precision of the rounding with the second argument, <code>digits</code>. <code>round(x, digits)</code> rounds to the nearest <code>10^-digits</code>, so <code>digits = 2</code> will round to the nearest 0.01. This definition is useful because it implies that negative values of <code>digits</code> round to the left of the decimal point, so <code>round(x, -3)</code> will round to the nearest thousand, which indeed it does:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">round(123.456, 2) # two digits
#&gt; [1] 123.46
round(123.456, 1) # one digit
#&gt; [1] 123.5
round(123.456, -1) # round to nearest ten
#&gt; [1] 120
round(123.456, -2) # round to nearest hundred
#&gt; [1] 100</pre>
</div>
<p>There’s one weirdness with <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">round()</a></code> that seems surprising at first glance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">round(c(1.5, 2.5))
#&gt; [1] 2 2</pre>
</div>
<p><code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">round()</a></code> uses what’s known as “round half to even” or banker’s rounding: if a number is halfway between two integers, it will be rounded to the <strong>even</strong> integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.</p>
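<p>To see why this keeps rounding unbiased, here is a small sketch that compares it with an “always round halves up” rule. That alternative rule is written by hand with <code>floor(x + 0.5)</code> purely for illustration; it is not something <code>round()</code> offers.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(0.5, 1.5, 2.5, 3.5)
mean(x)
#&gt; [1] 2
# Round half to even: two values go down, two go up, so the mean is preserved
round(x)
#&gt; [1] 0 2 2 4
mean(round(x))
#&gt; [1] 2
# Hand-written "round half up": every value goes up, which inflates the mean
floor(x + 0.5)
#&gt; [1] 1 2 3 4
mean(floor(x + 0.5))
#&gt; [1] 2.5</pre>
</div>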
<p><code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">round()</a></code> is paired with <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">floor()</a></code>, which always rounds down, and <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">ceiling()</a></code>, which always rounds up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 123.456
floor(x)
#&gt; [1] 123
ceiling(x)
#&gt; [1] 124</pre>
</div>
<p>These functions don’t have a <code>digits</code> argument, so you can instead scale down, round, and then scale back up:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Round down to nearest two digits
floor(x / 0.01) * 0.01
#&gt; [1] 123.45
# Round up to nearest two digits
ceiling(x / 0.01) * 0.01
#&gt; [1] 123.46</pre>
</div>
<p>You can use the same technique if you want to <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">round()</a></code> to a multiple of some other number:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Round to nearest multiple of 4
round(x / 4) * 4
#&gt; [1] 124
# Round to nearest 0.25
round(x / 0.25) * 0.25
#&gt; [1] 123.5</pre>
</div>
</section>
<section id="cutting-numbers-into-ranges" data-type="sect2">
<h2>
Cutting numbers into ranges</h2>
<p>Use <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">cut()</a></code><span data-type="footnote">ggplot2 provides some helpers for common cases in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">cut_interval()</a></code>, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">cut_number()</a></code>, and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">cut_width()</a></code>. ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.</span> to break up a numeric vector into discrete buckets:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20))
#&gt; [1] (0,5] (0,5] (0,5] (5,10] (10,15] (15,20]
#&gt; Levels: (0,5] (5,10] (10,15] (15,20]</pre>
</div>
<p>The breaks don’t need to be evenly spaced:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cut(x, breaks = c(0, 5, 10, 100))
#&gt; [1] (0,5] (0,5] (0,5] (5,10] (10,100] (10,100]
#&gt; Levels: (0,5] (5,10] (10,100]</pre>
</div>
<p>You can optionally supply your own <code>labels</code>. Note that there should be one fewer <code>labels</code> than <code>breaks</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cut(x,
  breaks = c(0, 5, 10, 15, 20),
  labels = c("sm", "md", "lg", "xl")
)
#&gt; [1] sm sm sm md lg xl
#&gt; Levels: sm md lg xl</pre>
</div>
<p>Any values outside of the range of the breaks will become <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y &lt;- c(NA, -10, 5, 10, 30)
cut(y, breaks = c(0, 5, 10, 15, 20))
#&gt; [1] &lt;NA&gt; &lt;NA&gt; (0,5] (5,10] &lt;NA&gt;
#&gt; Levels: (0,5] (5,10] (10,15] (15,20]</pre>
</div>
<p>See the documentation for other useful arguments like <code>right</code> and <code>include.lowest</code>, which control if the intervals are <code>[a, b)</code> or <code>(a, b]</code> and if the lowest interval should be <code>[a, b]</code>.</p>
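<p>As a rough illustration of those two arguments, the sketch below cuts a vector whose values sit exactly on the breaks. The vector <code>z</code> is made up for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">z &lt;- c(0, 5, 10)
# Default: intervals are (a, b], so 0 falls outside the lowest interval
cut(z, breaks = c(0, 5, 10))
#&gt; [1] &lt;NA&gt; (0,5] (5,10]
#&gt; Levels: (0,5] (5,10]
# include.lowest = TRUE closes the lowest interval, making it [0, 5]
cut(z, breaks = c(0, 5, 10), include.lowest = TRUE)
#&gt; [1] [0,5] [0,5] (5,10]
#&gt; Levels: [0,5] (5,10]
# right = FALSE switches to [a, b), so now 10 falls outside the highest interval
cut(z, breaks = c(0, 5, 10), right = FALSE)
#&gt; [1] [0,5) [5,10) &lt;NA&gt;
#&gt; Levels: [0,5) [5,10)</pre>
</div>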
</section>
<section id="cumulative-and-rolling-aggregates" data-type="sect2">
<h2>
Cumulative and rolling aggregates</h2>
<p>Base R provides <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">cumsum()</a></code>, <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">cumprod()</a></code>, <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">cummin()</a></code>, and <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">cummax()</a></code> for running, or cumulative, sums, products, mins and maxes. dplyr provides <code><a href="#chp-https://dplyr.tidyverse.org/reference/cumall" data-type="xref">cummean()</a></code> for cumulative means. Cumulative sums tend to come up the most in practice:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 1:10
cumsum(x)
#&gt; [1] 1 3 6 10 15 21 28 36 45 55</pre>
</div>
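<p>The other cumulative functions work the same way. Here is a brief sketch on a made-up vector <code>y</code>; the last line checks the assumption that a cumulative mean is simply the cumulative sum divided by the number of elements seen so far.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y &lt;- c(3, 1, 4, 1, 5)
cumprod(y)
#&gt; [1] 3 3 12 12 60
cummin(y)
#&gt; [1] 3 1 1 1 1
cummax(y)
#&gt; [1] 3 3 4 4 5
# A cumulative mean is the running sum divided by the running count
all.equal(dplyr::cummean(y), cumsum(y) / seq_along(y))
#&gt; [1] TRUE</pre>
</div>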
<p>If you need more complex rolling or sliding aggregates, try the <a href="#chp-https://davisvaughan.github.io/slider/" data-type="xref">slider</a> package by Davis Vaughan. The following example illustrates some of its features.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(slider)
# Same as a cumulative sum
slide_vec(x, sum, .before = Inf)
#&gt; [1] 1 3 6 10 15 21 28 36 45 55
# Sum the current element and the one before it
slide_vec(x, sum, .before = 1)
#&gt; [1] 1 3 5 7 9 11 13 15 17 19
# Sum the current element and the two before and after it
slide_vec(x, sum, .before = 2, .after = 2)
#&gt; [1] 6 10 15 20 25 30 35 40 34 27
# Only compute if the window is complete
slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
#&gt; [1] NA NA 15 20 25 30 35 40 NA NA</pre>
</div>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain in words what each line of the code used to generate <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a> does.</p></li>
<li><p>What trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?</p></li>
<li>
<p>Currently <code>dep_time</code> and <code>sched_dep_time</code> are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem in this plot: there’s a gap between each hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  filter(month == 1, day == 1) |&gt;
  ggplot(aes(sched_dep_time, dep_delay)) +
  geom_point()
#&gt; Warning: Removed 4 rows containing missing values (`geom_point()`).</pre>
<div class="cell-output-display">
<p><img src="numbers_files/figure-html/unnamed-chunk-36-1.png" width="576"/></p>
</div>
</div>
<p>Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).</p>
</li>
</ol></section>
</section>
<section id="general-transformations" data-type="sect1">
<h1>
General transformations</h1>
<p>The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.</p>
<section id="ranks" data-type="sect2">
<h2>
Ranks</h2>
<p>dplyr provides a number of ranking functions inspired by SQL, but you should always start with <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">dplyr::min_rank()</a></code>. It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 2, 3, 4, NA)
min_rank(x)
#&gt; [1] 1 2 2 4 5 NA</pre>
</div>
<p>Note that the smallest values get the lowest ranks; use <code>desc(x)</code> to give the largest values the smallest ranks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">min_rank(desc(x))
#&gt; [1] 5 3 3 2 1 NA</pre>
</div>
<p>If <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">min_rank()</a></code> doesn’t do what you need, look at the variants <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">dplyr::row_number()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">dplyr::dense_rank()</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/percent_rank" data-type="xref">dplyr::percent_rank()</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/percent_rank" data-type="xref">dplyr::cume_dist()</a></code>. See the documentation for details.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = x)
df |&gt;
  mutate(
    row_number = row_number(x),
    dense_rank = dense_rank(x),
    percent_rank = percent_rank(x),
    cume_dist = cume_dist(x)
  )
#&gt; # A tibble: 6 × 5
#&gt; x row_number dense_rank percent_rank cume_dist
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 1 1 0 0.2
#&gt; 2 2 2 2 0.25 0.6
#&gt; 3 2 3 2 0.25 0.6
#&gt; 4 3 4 3 0.75 0.8
#&gt; 5 4 5 4 1 1
#&gt; 6 NA NA NA NA NA</pre>
</div>
<p>You can achieve many of the same results by picking the appropriate <code>ties.method</code> argument to base R’s <code><a href="#chp-https://rdrr.io/r/base/rank" data-type="xref">rank()</a></code>; you’ll probably also want to set <code>na.last = "keep"</code> to keep <code>NA</code>s as <code>NA</code>.</p>
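<p>For instance, the following sketch reproduces two of the dplyr rankings with base R alone, reusing the example vector from the start of this section:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 2, 3, 4, NA)
# Ties share the smallest rank, like min_rank(x)
rank(x, ties.method = "min", na.last = "keep")
#&gt; [1] 1 2 2 4 5 NA
# Ties are broken by position, like row_number(x)
rank(x, ties.method = "first", na.last = "keep")
#&gt; [1] 1 2 3 4 5 NA</pre>
</div>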
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">row_number()</a></code> can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with <code>%%</code> or <code>%/%</code> this can be a useful tool for dividing data into similarly sized groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = runif(10))
df |&gt;
  mutate(
    row0 = row_number() - 1,
    three_groups = row0 %% 3,
    three_in_each_group = row0 %/% 3,
  )
#&gt; # A tibble: 10 × 4
#&gt; x row0 three_groups three_in_each_group
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.0808 0 0 0
#&gt; 2 0.834 1 1 0
#&gt; 3 0.601 2 2 0
#&gt; 4 0.157 3 0 1
#&gt; 5 0.00740 4 1 1
#&gt; 6 0.466 5 2 1
#&gt; # … with 4 more rows</pre>
</div>
</section>
<section id="offsets" data-type="sect2">
<h2>
Offsets</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">dplyr::lead()</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">dplyr::lag()</a></code> allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with <code>NA</code>s at the start or end:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(2, 5, 11, 11, 19, 35)
lag(x)
#&gt; [1] NA 2 5 11 11 19
lead(x)
#&gt; [1] 5 11 11 19 35 NA</pre>
</div>
<ul><li>
<p><code>x - lag(x)</code> gives you the difference between the current and previous value.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x - lag(x)
#&gt; [1] NA 3 6 0 8 16</pre>
</div>
</li>
<li>
<p><code>x == lag(x)</code> tells you whether the current value is the same as the previous one, so <code>FALSE</code> marks a change.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x == lag(x)
#&gt; [1] NA FALSE FALSE TRUE FALSE FALSE</pre>
</div>
</li>
</ul><p>You can lead or lag by more than one position by using the second argument, <code>n</code>.</p>
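<p>A short sketch of the <code>n</code> argument, reusing the vector from above (the explicit <code>library(dplyr)</code> call is only there to make the example self-contained):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
x &lt;- c(2, 5, 11, 11, 19, 35)
# Shift by two positions instead of one; two NAs pad the start
lag(x, n = 2)
#&gt; [1] NA NA 2 5 11 11
# Same idea in the other direction; two NAs pad the end
lead(x, n = 2)
#&gt; [1] 11 11 19 35 NA NA</pre>
</div>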
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">min_rank()</a></code>.</p></li>
<li><p>Which plane (<code>tailnum</code>) has the worst on-time record?</p></li>
<li><p>What time of day should you fly if you want to avoid delays as much as possible?</p></li>
<li><p>What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number() &lt; 4)</code> do? What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number(dep_delay) &lt; 4)</code> do?</p></li>
<li><p>For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.</p></li>
<li>
<p>Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">lag()</a></code>, explore how the average flight delay for an hour is related to the average delay for the previous hour.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  mutate(hour = dep_time %/% 100) |&gt;
  group_by(year, month, day, hour) |&gt;
  summarise(
    dep_delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |&gt;
  filter(n &gt; 5)</pre>
</div>
</li>
<li><p>Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?</p></li>
<li><p>Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.</p></li>
</ol></section>
</section>
<section id="numeric-summaries" data-type="sect1">
<h1>
Numeric summaries</h1>
<p>Just using the counts, means, and sums that we’ve introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.</p>
<section id="center" data-type="sect2">
<h2>
Center</h2>
<p>So far, we’ve mostly used <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">mean()</a></code> to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">median()</a></code>, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.</p>
<p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean and the median departure delay for each day. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(year, month, day) |&gt;
  summarise(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |&gt;
  ggplot(aes(mean, median)) +
  geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) +
  geom_point()</pre>
<div class="cell-output-display">
<figure class="figure"><p><img src="numbers_files/figure-html/fig-mean-vs-median-1.png" alt="All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55." width="576"/></p>
<figcaption class="figure-caption">Figure 13.2: A scatterplot showing the differences of summarising daily departure delay with median instead of mean.</figcaption>
</figure>
</div>
</div>
<p>You might also wonder about the <strong>mode</strong>, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn’t work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there’s no mode function included in base R<span data-type="footnote">The <code><a href="#chp-https://rdrr.io/r/base/mode" data-type="xref">mode()</a></code> function does something quite different!</span>.</p>
</section>
<section id="sec-min-max-summary" data-type="sect2">
<h2>
Minimum, maximum, and quantiles</h2>
<p>What if you’re interested in locations other than the center? <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">min()</a></code> and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">max()</a></code> will give you the smallest and largest values. Another powerful tool is <code><a href="#chp-https://rdrr.io/r/stats/quantile" data-type="xref">quantile()</a></code>, which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find a value that’s greater than 95% of the values.</p>
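<p>A minimal sketch on a small made-up vector shows how these pieces relate. Note that <code>quantile()</code> also accepts several probabilities at once:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 1:9
quantile(x, c(0.25, 0.5, 0.75))
#&gt; 25% 50% 75%
#&gt;   3   5   7
# The 50% quantile matches the median
median(x)
#&gt; [1] 5</pre>
</div>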
<p>For the <code>flights</code> data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(year, month, day) |&gt;
  summarise(
    max = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = "drop"
  )
#&gt; # A tibble: 365 × 5
#&gt; year month day max q95
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 2013 1 1 853 70.1
#&gt; 2 2013 1 2 379 85
#&gt; 3 2013 1 3 291 68
#&gt; 4 2013 1 4 288 60
#&gt; 5 2013 1 5 327 41
#&gt; 6 2013 1 6 202 51
#&gt; # … with 359 more rows</pre>
</div>
</section>
<section id="spread" data-type="sect2">
<h2>
Spread</h2>
<p>Sometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, <code>sd(x)</code>, and the inter-quartile range, <code><a href="#chp-https://rdrr.io/r/stats/IQR" data-type="xref">IQR()</a></code>. We won’t explain <code><a href="#chp-https://rdrr.io/r/stats/sd" data-type="xref">sd()</a></code> here since you’re probably already familiar with it, but <code><a href="#chp-https://rdrr.io/r/stats/IQR" data-type="xref">IQR()</a></code> might be new: it’s <code>quantile(x, 0.75) - quantile(x, 0.25)</code> and gives you the range that contains the middle 50% of the data.</p>
<p>We can use this to reveal a small oddity in the <code>flights</code> data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it look like one airport, <a href="#chp-https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport" data-type="xref">EGE</a>, might have moved.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(origin, dest) |&gt;
  summarise(
    distance_iqr = IQR(distance),
    n = n(),
    .groups = "drop"
  ) |&gt;
  filter(distance_iqr &gt; 0)
#&gt; # A tibble: 2 × 4
#&gt; origin dest distance_iqr n
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 EWR EGE 1 110
#&gt; 2 JFK EGE 1 103</pre>
</div>
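<p>The definition of the inter-quartile range given earlier in this section can be checked directly. The sketch below uses a made-up vector with one extreme value, and also hints at why the IQR is less sensitive to outliers than the standard deviation:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1:9, 1000)
IQR(x)
#&gt; [1] 4.5
# Same value computed from the two quartiles
unname(quantile(x, 0.75) - quantile(x, 0.25))
#&gt; [1] 4.5
# Without the outlier the IQR barely changes...
IQR(1:9)
#&gt; [1] 4
# ...while the standard deviation changes by two orders of magnitude
sd(1:9) &lt; 3
#&gt; [1] TRUE
sd(x) &gt; 300
#&gt; [1] TRUE</pre>
</div>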
</section>
<section id="distributions" data-type="sect2">
<h2>
Distributions</h2>
<p>It’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.</p>
<p><a href="#fig-flights-dist" data-type="xref">#fig-flights-dist</a> shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  ggplot(aes(dep_delay)) +
  geom_histogram(binwidth = 15)
#&gt; Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).
flights |&gt;
  filter(dep_delay &lt; 120) |&gt;
  ggplot(aes(dep_delay)) +
  geom_histogram(binwidth = 5)</pre>
<div id="fig-flights-dist" class="cell quarto-layout-panel">
<figure class="figure"><div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell quarto-layout-cell-subref" style="flex-basis: 50.0%;justify-content: center;">
<figure class="figure"><p><img src="numbers_files/figure-html/fig-flights-dist-1.png" alt="Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that. " data-ref-parent="fig-flights-dist" width="384"/></p>
<figcaption class="figure-caption">(a) Histogram shows the full range of delays.</figcaption>
</figure>
</div>
<div class="cell-output-display quarto-layout-cell quarto-layout-cell-subref" style="flex-basis: 50.0%;justify-content: center;">
<figure class="figure"><p><img src="numbers_files/figure-html/fig-flights-dist-2.png" alt="Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that. " data-ref-parent="fig-flights-dist" width="384"/></p>
<figcaption class="figure-caption">(b) Histogram is zoomed in to show delays less than 2 hours.</figcaption>
</figure>
</div>
</div>
<figcaption class="figure-caption">Figure 13.3: The distribution of <code>dep_delay</code> appears highly skewed to the right in both histograms.</figcaption>
</figure></div>
</div>
<p>It’s also a good idea to check that distributions for subgroups resemble the whole. <a href="#fig-flights-dist-daily" data-type="xref">#fig-flights-dist-daily</a> overlays a frequency polygon for each day. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  filter(dep_delay &lt; 120) |&gt;
  ggplot(aes(dep_delay, group = interaction(day, month))) +
  geom_freqpoly(binwidth = 5, alpha = 1/5)</pre>
<div class="cell-output-display">
<figure class="figure"><p><img src="numbers_files/figure-html/fig-flights-dist-daily-1.png" alt="The distribution of `dep_delay` is highly right skewed with a strong peak slightly less than 0. The 365 frequency polygons are mostly overlapping, forming a thick black band." width="576"/></p>
<figcaption class="figure-caption">Figure 13.4: 365 frequency polygons of <code>dep_delay</code>, one for each day. The frequency polygons appear to have the same shape, suggesting that it’s reasonable to compare days by looking at just a few summary statistics.</figcaption>
</figure>
</div>
</div>
<p>Don’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in <a href="#sec-sample-size" data-type="xref">#sec-sample-size</a>: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.</p>
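<p>As one possible sketch of such a custom summary, the code below summarizes early and late departures separately and keeps the group size alongside. It uses a tiny made-up tibble, <code>delays</code>, rather than <code>flights</code>, and the column names <code>prop_late</code>, <code>median_late</code>, and <code>median_early</code> are just illustrative choices.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical data: three departures on each of two days
delays &lt;- tibble(
  day = c(1, 1, 1, 2, 2, 2),
  dep_delay = c(-5, 10, 120, -2, -3, 30)
)

delays |&gt;
  group_by(day) |&gt;
  summarise(
    n = n(),                                              # always keep the group size
    prop_late = mean(dep_delay &gt; 0),                      # share of late departures
    median_late = median(dep_delay[dep_delay &gt; 0]),       # typical delay when late
    median_early = median(dep_delay[dep_delay &lt;= 0])      # typical delay when on time or early
  )</pre>
</div>
<p>With real data you would probably also need <code>na.rm = TRUE</code> in each summary, since cancelled flights have a missing <code>dep_delay</code>.</p>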
</section>
<section id="positions" data-type="sect2">
<h2>
Positions</h2>
<p>There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position. You can do this with the base R <code>[</code> function, but we’re not going to cover it in detail until <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>, because it’s a very powerful and general function. For now we’ll introduce three specialized functions that you can use to extract values at a specified position: <code>first(x)</code>, <code>last(x)</code>, and <code>nth(x, n)</code>.</p>
<p>For example, we can find the first, fifth, and last departure for each day:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(year, month, day) |&gt;
  summarise(
    first_dep = first(dep_time),
    fifth_dep = nth(dep_time, 5),
    last_dep = last(dep_time)
  )
#&gt; `summarise()` has grouped output by 'year', 'month'. You can override using the
#&gt; `.groups` argument.
#&gt; # A tibble: 365 × 6
#&gt; # Groups: year, month [12]
#&gt; year month day first_dep fifth_dep last_dep
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2013 1 1 517 554 NA
#&gt; 2 2013 1 2 42 535 NA
#&gt; 3 2013 1 3 32 520 NA
#&gt; 4 2013 1 4 25 531 NA
#&gt; 5 2013 1 5 14 534 NA
#&gt; 6 2013 1 6 16 555 NA
#&gt; # … with 359 more rows</pre>
</div>
<p>(These functions currently lack an <code>na.rm</code> argument but will hopefully be fixed by the time you read this book: <a href="https://github.com/tidyverse/dplyr/issues/6242" class="uri">https://github.com/tidyverse/dplyr/issues/6242</a>).</p>
<p>If you’re familiar with <code>[</code>, you might wonder if you ever need these functions. There are two main reasons: the <code>default</code> argument and the <code>order_by</code> argument. <code>default</code> allows you to set a default value that’s used if the requested position doesn’t exist, e.g. you’re trying to get the 3rd element from a two element group. <code>order_by</code> lets you locally override the existing ordering of the rows, so you can get the element at the given position in the ordering defined by <code><a href="#chp-https://dplyr.tidyverse.org/reference/order_by" data-type="xref">order_by()</a></code>.</p>
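<p>Here is a minimal sketch of both arguments on a made-up two-element vector; the ordering vector <code>c(2, 1)</code> is an arbitrary illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
x &lt;- c(10, 20)
# Position 3 doesn't exist, so nth() falls back to a missing value...
nth(x, 3)
#&gt; [1] NA
# ...unless you supply a default
nth(x, 3, default = 0)
#&gt; [1] 0
# order_by: take the first element after ordering by another vector
first(x, order_by = c(2, 1))
#&gt; [1] 20</pre>
</div>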
<p>Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(year, month, day) |&gt;
  mutate(r = min_rank(desc(sched_dep_time))) |&gt;
  filter(r %in% c(1, max(r)))
#&gt; # A tibble: 1,195 × 20
#&gt; # Groups: year, month, day [365]
#&gt; year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 2013 1 1 517 515 2 830 819 11 UA
#&gt; 2 2013 1 1 2353 2359 -6 425 445 -20 B6
#&gt; 3 2013 1 1 2353 2359 -6 418 442 -24 B6
#&gt; 4 2013 1 1 2356 2359 -3 425 437 -12 B6
#&gt; 5 2013 1 2 42 2359 43 518 442 36 B6
#&gt; 6 2013 1 2 458 500 -2 703 650 13 US
#&gt; # … with 1,189 more rows, 10 more variables: flight &lt;int&gt;, tailnum &lt;chr&gt;,
#&gt; # origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;,
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, r &lt;int&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div>
</section>
<section id="with-mutate" data-type="sect2">
<h2>
With <code>mutate()</code>
</h2>
<p>As the names suggest, the summary functions are typically paired with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">summarise()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">mutate()</a></code>, particularly when you want to do some sort of group standardization. For example:</p>
<ul><li>
<code>x / sum(x)</code> calculates the proportion of a total.</li>
<li>
<code>(x - mean(x)) / sd(x)</code> computes a Z-score (standardized to mean 0 and sd 1).</li>
<li>
<code>x / first(x)</code> computes an index based on the first observation.</li>
</ul></section>
<section id="exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:</p>
<ul><li>A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.</li>
<li>A flight is always 10 minutes late.</li>
<li>A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.</li>
<li>99% of the time a flight is on time. 1% of the time it’s 2 hours late.</li>
</ul><p>Which do you think is more important: arrival delay or departure delay?</p>
</li>
<li><p>Which destinations show the greatest variation in air speed?</p></li>
<li><p>Create a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations?</p></li>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>You’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations, like ranks and offsets, that are commonly, but not exclusively, applied to numeric vectors. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.</p>
<p>Over the next two chapters, we’ll dive into working with strings with the stringr package. Strings are a big topic so they get two chapters, one on the fundamentals of strings and one on regular expressions.</p>
</section>
</section>

19
oreilly/preface-2e.html Normal file
View File

@ -0,0 +1,19 @@
<section data-type="chapter" id="chp-preface-2e">
<h1>Preface to the second edition</h1><p>Welcome to the second edition of “R for Data Science”.</p>
<section id="major-changes" data-type="sectNA">
<h1>Major changes</h1>
<ul><li><p>The first part is renamed to “whole game” to reflect the entire data science cycle. It gains a new chapter that briefly introduces the basics of reading data from csv files.</p></li>
<li><p>The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.</p></li>
<li><p>We’ve added new chapters on column-wise and row-wise operations.</p></li>
<li><p>We’ve added a new set of chapters on import that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and scraping data from the web.</p></li>
<li><p>The modeling part has been removed. For modeling, we recommend using packages from <a href="#chp-https://www.tidymodels.org/" data-type="xref">#chp-https://www.tidymodels.org/</a> and reading <a href="#chp-https://www.tmwr.org/" data-type="xref">#chp-https://www.tmwr.org/</a> by Max Kuhn and Julia Silge to learn more about them.</p></li>
<li><p>We’ve switched from the magrittr pipe to the base pipe.</p></li>
</ul></section>
<section id="acknowledgements" data-type="sectNA">
<h1>Acknowledgements</h1>
<p><em>TO DO: Add acknowledgements.</em></p>
</section>
</section>

18
oreilly/program.html Normal file
View File

@ -0,0 +1,18 @@
<div data-type="part">
<h1><span id="sec-program-intro" class="quarto-section-identifier d-none d-lg-block">Program</span></h1><p>In this part of the book, you’ll improve your programming skills. Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-program"><p><img src="diagrams/data-science/program.png" alt="Our model of the data science process with program (import, tidy, transform, visualize, model, and communicate, i.e. everything) highlighted in blue." width="535"/></p>
<figcaption>Figure 1: Programming is the water in which all other components of the data science process swim.</figcaption>
</figure>
</div>
</div><p>Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.</p><p>Writing code is similar in many ways to writing prose. One parallel which we find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it’s often worth looking at your code and thinking about whether or not it’s obvious what you’ve done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn’t mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)</p><p>In the following three chapters, you’ll learn skills to improve your programming:</p><ol type="1"><li><p>Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. 
Instead, in <a href="#chp-functions" data-type="xref">#chp-functions</a>, you’ll learn how to write <strong>functions</strong> which let you extract out repeated code so that it can be easily reused.</p></li>
<li><p>Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for <strong>iteration</strong> that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p></li>
<li><p>As you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In <a href="#chp-base-R" data-type="xref">#chp-base-R</a>, you’ll learn some of the most important base R functions that you’ll see in the wild. These functions tend to be designed to use individual vectors, rather than data frames, often making them a good fit for your programming needs.</p></li>
</ol><section id="chp-program" class="level2">
<h1>Learning more</h1>
<p>The goal of these chapters is to teach you the minimum about programming that you need to practice data science. Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills. Learning more about programming is a long-term investment: it won’t pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.</p>
<p>To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:</p>
<ul><li><p><a href="#chp-https://rstudio-education.github.io/hopr/" data-type="xref">#chp-https://rstudio-education.github.io/hopr/</a>, by Garrett Grolemund. This is an introduction to R as a programming language and is a great place to start if R is your first programming language. It covers similar material to these chapters, but with a different style and different motivating examples (based in the casino). It’s a useful complement if you find that these chapters go by too quickly.</p></li>
<li><p><a href="#chp-https://adv-r.hadley.nz/" data-type="xref">#chp-https://adv-r.hadley.nz/</a> by Hadley Wickham. This dives into the details of R the programming language. This is a great place to start if you have existing programming experience. It’s also a great next step once you’ve internalized the ideas in these chapters.</p></li>
</ul></section></div>

293
oreilly/quarto-formats.html Normal file
View File

@ -0,0 +1,293 @@
<section data-type="chapter" id="chp-quarto-formats">
<h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.</p>
<p>There are two ways to set the output of a document:</p>
<ol type="1"><li>
<p>Permanently, by modifying the YAML header:</p>
<pre data-type="programlisting" data-code-language="yaml">title: "Diamond sizes"
format: html</pre>
</li>
<li>
<p>Transiently, by calling <code>quarto::quarto_render()</code> by hand:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">quarto::quarto_render("diamond-sizes.qmd", output_format = "docx")</pre>
</div>
<p>This is useful if you want to programmatically produce multiple types of output since the <code>output_format</code> argument can also take a list of values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">quarto::quarto_render("diamond-sizes.qmd", output_format = c("docx", "pdf"))</pre>
</div>
</li>
</ol></section>
<section id="output-options" data-type="sect1">
<h1>
Output options</h1>
<p>Quarto offers a wide range of output formats. You can find the complete list at <a href="https://quarto.org/docs/output-formats/all-formats.html" class="uri">https://quarto.org/docs/output-formats/all-formats.html</a>. Many formats share some output options (e.g., <code>toc: true</code> for including a table of contents), but others have options that are format specific (e.g., <code>code-fold: true</code> collapses code chunks into a <code>&lt;details&gt;</code> tag for HTML output so the user can display it on demand; it’s not applicable in a PDF or Word document).</p>
<p>To override the default options, you need to use an expanded <code>format</code> field. For example, if you wanted to render an <code>html</code> document with a floating table of contents, you’d use:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    toc: true
    toc_float: true</pre>
<p>You can even render to multiple outputs by supplying a list of formats:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    toc: true
    toc_float: true
  pdf: default
  docx: default</pre>
<p>Note the special syntax (<code>pdf: default</code>) if you don’t want to override any of the default options.</p>
<p>To render to all formats specified in the YAML of a document, you can use <code>output_format = "all"</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">quarto::quarto_render("diamond-sizes.qmd", output_format = "all")</pre>
</div>
</section>
<section id="documents" data-type="sect1">
<h1>
Documents</h1>
<p>The previous chapter focused on the default <code>html</code> output. There are a number of basic variations on that theme, generating different types of documents. For example:</p>
<ul><li><p><code>pdf</code> makes a PDF with LaTeX (an open source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.</p></li>
<li><p><code>docx</code> for Microsoft Word (<code>.docx</code>) documents.</p></li>
<li><p><code>odt</code> for OpenDocument Text (<code>.odt</code>) documents.</p></li>
<li><p><code>rtf</code> for Rich Text Format (<code>.rtf</code>) documents.</p></li>
<li><p><code>gfm</code> for a GitHub Flavored Markdown (<code>.md</code>) document.</p></li>
<li><p><code>ipynb</code> for Jupyter Notebooks (<code>.ipynb</code>).</p></li>
</ul><p>Remember, when generating a document to share with decision makers, you can turn off the default display of code by setting global options in document YAML:</p>
<pre data-type="programlisting" data-code-language="yaml">execute:
  echo: false</pre>
<p>For <code>html</code> documents another option is to make the code chunks hidden by default, but visible with a click:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    code-fold: true</pre>
</section>
<section id="presentations" data-type="sect1">
<h1>
Presentations</h1>
<p>You can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (<code>##</code>) level header. Additionally, first (<code>#</code>) level headers can be used to indicate the beginning of a new section with a section title slide that is by default centered in the middle.</p>
<p>Quarto supports a variety of presentation formats, including:</p>
<ol type="1"><li><p><code>revealjs</code> - HTML presentation with revealjs</p></li>
<li><p><code>pptx</code> - PowerPoint presentation</p></li>
<li><p><code>beamer</code> - PDF presentation with LaTeX Beamer.</p></li>
</ol><p>You can read more about creating presentations with Quarto at <a href="https://quarto.org/docs/presentations/">https://quarto.org/docs/presentations</a>.</p>
</section>
<section id="dashboards" data-type="sect1">
<h1>
Dashboards</h1>
<p>Dashboards are a useful way to communicate large amounts of information visually and quickly. A dashboard-like look can be achieved with Quarto using document layout options like sidebars, tabsets, multi-column layouts, etc.</p>
<p>For example, you can produce this dashboard:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/quarto-dashboard.png" class="img-fluid" alt="Quarto dashboard with the title &quot;Diamonds dashboard&quot;. The first tab shows four plots of the diamonds dataset. The second tab shows summary statistics for price and carat of diamonds. The third tab shows an interactive data table of the first 100 diamonds." width="540"/></p>
</div>
</div>
<p>Using this code:</p>
<div class="cell">
<pre><code>---
title: "💍 Diamonds dashboard"
format: html
execute:
echo: false
---
```{r}
#| label: setup
#| include: false
library(tidyverse)
library(gt)
```
::: panel-tabset
## Plots
```{r}
#| layout: [[30,-5, 30, -5, 30], [100]]
ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.1)
ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500)
ggplot(diamonds, aes(x = cut, color = cut)) + geom_bar()
ggplot(diamonds, aes(x = carat, y = price, color = cut)) + geom_point()
```
## Summaries
```{r}
diamonds |&gt;
  select(price, carat, cut) |&gt;
  group_by(cut) |&gt;
  summarize(
    across(where(is.numeric), list(mean = mean, median = median, sd = sd, IQR = IQR))
  ) |&gt;
  pivot_longer(cols = -cut) |&gt;
  pivot_wider(names_from = cut, values_from = value) |&gt;
  separate(name, into = c("var", "stat")) |&gt;
  mutate(
    var = str_to_title(var),
    stat = str_to_title(stat),
    stat = if_else(stat == "Iqr", "IQR", stat)
  ) |&gt;
  group_by(var) |&gt;
  gt() |&gt;
  fmt_currency(columns = -stat, rows = 1:4, decimals = 0) |&gt;
  fmt_number(columns = -stat, rows = 5:8) |&gt;
  cols_align(columns = -stat, align = "center") |&gt;
  cols_label(stat = "")
```
## Data
```{r}
diamonds |&gt;
  arrange(desc(carat)) |&gt;
  slice_head(n = 100) |&gt;
  select(price, carat, cut) |&gt;
  DT::datatable()
```
:::</code></pre>
</div>
<p>To learn more about Quarto component layouts, visit <a href="https://quarto.org/docs/interactive/layout.html" class="uri">https://quarto.org/docs/interactive/layout.html</a>.</p>
</section>
<section id="interactivity" data-type="sect1">
<h1>
Interactivity</h1>
<p>Any HTML document can contain interactive components.</p>
<section id="htmlwidgets" data-type="sect2">
<h2>
htmlwidgets</h2>
<p>HTML is an interactive format, and you can take advantage of that interactivity with <strong>htmlwidgets</strong>, R functions that produce interactive HTML visualizations. For example, take the <strong>leaflet</strong> map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(leaflet)
leaflet() |&gt;
  setView(174.764, -36.877, zoom = 16) |&gt;
  addTiles() |&gt;
  addMarkers(174.764, -36.877, popup = "Maungawhau")</pre>
<div class="cell-output-display">
<div id="htmlwidget-ac96cb3ee4656e2e9ec3" style="width:100%;height:433px;" class="leaflet html-widget"/>
<script type="application/json" data-for="htmlwidget-ac96cb3ee4656e2e9ec3"><![CDATA[{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"setView":[[-36.877,174.764],16,[]],"calls":[{"method":"addTiles","args":["https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"&copy; <a href=\"https://openstreetmap.org\">OpenStreetMap<\/a> contributors, <a href=\"https://creativecommons.org/licenses/by-sa/2.0/\">CC-BY-SA<\/a>"}]},{"method":"addMarkers","args":[-36.877,174.764,null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},"Maungawhau",null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]}],"limits":{"lat":[-36.877,-36.877],"lng":[174.764,174.764]}},"evals":[],"jsHooks":[]}]]></script></div>
</div>
<p>The great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about them.</p>
<p>There are many packages that provide htmlwidgets, including:</p>
<ul><li><p><strong>dygraphs</strong>, <a href="https://rstudio.github.io/dygraphs/" class="uri">https://rstudio.github.io/dygraphs</a>, for interactive time series visualisations.</p></li>
<li><p><strong>DT</strong>, <a href="https://rstudio.github.io/DT" class="uri">https://rstudio.github.io/DT/</a>, for interactive tables.</p></li>
<li><p><strong>threejs</strong>, <a href="https://bwlewis.github.io/rthreejs/" class="uri">https://bwlewis.github.io/rthreejs</a> for interactive 3d plots.</p></li>
<li><p><strong>DiagrammeR</strong>, <a href="https://rich-iannone.github.io/DiagrammeR" class="uri">https://rich-iannone.github.io/DiagrammeR</a> for diagrams (like flow charts and simple node-link diagrams).</p></li>
</ul><p>To learn more about htmlwidgets and see a more complete list of packages that provide them visit <a href="https://www.htmlwidgets.org" class="uri">https://www.htmlwidgets.org</a>.</p>
</section>
<section id="shiny" data-type="sect2">
<h2>
Shiny</h2>
<p>htmlwidgets provide <strong>client-side</strong> interactivity — all the interactivity happens in the browser, independently of R. On one hand, thats great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use <strong>shiny</strong>, a package that allows you to create interactivity using R code, not JavaScript.</p>
<p>To call Shiny code from a Quarto document, add <code>server: shiny</code> to the YAML header:</p>
<pre data-type="programlisting" data-code-language="yaml">title: "Shiny Web App"
format: html
server: shiny</pre>
<p>Then you can use the “input” functions to add interactive components to the document:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(shiny)
textInput("name", "What is your name?")
numericInput("age", "How old are you?", NA, min = 0, max = 150)</pre>
</div>
<p>And you also need a code chunk with chunk option <code>context: server</code> which contains the code that needs to run in a Shiny server.</p>
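<p>As a rough sketch of what that server-side code might look like, the function below reacts to the two inputs defined above. The output id <code>greeting</code> and the message text are hypothetical, chosen only for illustration. In a Quarto document, the body of this function would go in the chunk marked <code>context: server</code>; it is wrapped in an ordinary function here so that it can be run on its own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(shiny)

# Hypothetical server logic for the "name" and "age" inputs defined above
server &lt;- function(input, output, session) {
  # Re-runs automatically whenever input$name or input$age changes
  output$greeting &lt;- renderText({
    req(input$name, input$age) # wait until both inputs have values
    paste0("Hello ", input$name, "! You are ", input$age, " years old.")
  })
}</pre>
</div>
<p>To display the result, a regular chunk would also need a matching <code>textOutput("greeting")</code> call alongside the input functions.</p>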
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/quarto-shiny.png" class="img-fluid" alt="Two input boxes on top of each other. Top one says &quot;What is your name?&quot;, the bottom one &quot;How old are you?&quot;." width="650"/></p>
</div>
</div>
<p>You can then refer to the values with <code>input$name</code> and <code>input$age</code>, and the code that uses them will be automatically re-run whenever they change.</p>
<p>We can’t show you a live shiny app here because shiny interactions occur on the <strong>server-side</strong>. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.</p>
<p>To learn more about Shiny, we recommend reading Mastering Shiny by Hadley Wickham, <a href="https://mastering-shiny.org/">https://mastering-shiny.org</a>.</p>
</section>
</section>
<section id="websites-and-books" data-type="sect1">
<h1>
Websites and books</h1>
<p>With a little additional infrastructure you can use Quarto to generate a complete website:</p>
<ul><li><p>Put your <code>.qmd</code> files in a single directory. <code>index.qmd</code> will become the home page.</p></li>
<li>
<p>Add a YAML file named <code>_quarto.yml</code> that provides the navigation for the site. In this file, set the <code>project</code> type:</p>
<ul><li>For a book, set <code>type: book</code>:</li>
</ul><pre data-type="programlisting" data-code-language="yaml">project:
  type: book</pre>
<ul><li>For a website, set <code>type: website</code>:</li>
</ul><pre data-type="programlisting" data-code-language="yaml">project:
  type: website</pre>
</li>
</ul><p>For example, the following <code>_quarto.yml</code> file creates a website from three source files: <code>index.qmd</code> (the home page), <code>viridis-colors.qmd</code>, and <code>terrain-colors.qmd</code>.</p>
<div class="cell">
<pre><code>project:
  type: website
website:
  title: "A website on color scales"
  navbar:
    left:
      - href: index.qmd
        text: Home
      - href: viridis-colors.qmd
        text: Viridis colors
      - href: terrain-colors.qmd
        text: Terrain colors</code></pre>
</div>
<p>The <code>_quarto.yml</code> file you need for a book is very similarly structured. The following example shows how you can create a book with four chapters that renders to three different outputs (<code>html</code>, <code>pdf</code>, and <code>epub</code>). Once again, the source files are <code>.qmd</code> files.</p>
<div class="cell">
<pre><code>project:
  type: book
book:
  title: "A book on color scales"
  author: "Jane Coloriste"
  chapters:
    - index.qmd
    - intro.qmd
    - viridis-colors.qmd
    - terrain-colors.qmd
format:
  html:
    theme: cosmo
  pdf: default
  epub: default</code></pre>
</div>
<p>We recommend that you use an RStudio project for your websites and books. Based on the <code>_quarto.yml</code> file, RStudio will recognize the type of project you’re working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. Both websites and books can also be rendered using <code>quarto::quarto_render()</code>.</p>
<p>Read more at <a href="https://quarto.org/docs/websites" class="uri">https://quarto.org/docs/websites</a> about Quarto websites and <a href="https://quarto.org/docs/books" class="uri">https://quarto.org/docs/books</a> about books.</p>
</section>
<section id="other-formats" data-type="sect1">
<h1>
Other formats</h1>
<p>Quarto offers even more output formats:</p>
<ul><li><p>You can write journal articles using Quarto Journal Templates: <a href="https://quarto.org/docs/journals/templates.html" class="uri">https://quarto.org/docs/journals/templates.html</a>.</p></li>
<li><p>You can output Quarto documents to Jupyter Notebooks with <code>format: ipynb</code>: <a href="https://quarto.org/docs/reference/formats/ipynb.html" class="uri">https://quarto.org/docs/reference/formats/ipynb.html</a>.</p></li>
</ul><p>See <a href="https://quarto.org/docs/output-formats/all-formats.html" class="uri">https://quarto.org/docs/output-formats/all-formats.html</a> for a list of even more formats.</p>
</section>
<section id="learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>To learn more about effective communication in these different formats we recommend the following resources:</p>
<ul><li><p>To improve your presentation skills, try <a href="#chp-https://amzn.com/0321820800" data-type="xref">#chp-https://amzn.com/0321820800</a>, by Neal Ford, Matthew McCollough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.</p></li>
<li><p>If you give academic talks, you might like the <a href="#chp-https://github.com/jtleek/talkguide" data-type="xref">#chp-https://github.com/jtleek/talkguide</a>.</p></li>
<li><p>We haven’t taken it ourselves, but we’ve heard good things about Matt McGarrity’s online course on public speaking: <a href="https://www.coursera.org/learn/public-speaking" class="uri">https://www.coursera.org/learn/public-speaking</a>.</p></li>
<li><p>If you are creating a lot of dashboards, make sure to read Stephen Few’s <a href="#chp-https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167" data-type="xref">#chp-https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167</a>. It will help you create dashboards that are truly useful, not just pretty to look at.</p></li>
<li><p>Effectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams’ <a href="#chp-https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151" data-type="xref">#chp-https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151</a> is a great place to start.</p></li>
</ul></section>
</section>

View File

@ -0,0 +1,25 @@
<section data-type="chapter" id="chp-quarto-workflow">
<h1><span id="sec-quarto-workflow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto workflow</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the <em>console</em>, then capture what works in the <em>script editor</em>. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.</p><p>Quarto is also important because it so tightly integrates prose and code. This makes it a great <strong>analysis notebook</strong> because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:</p><ul><li><p>Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!</p></li>
<li><p>Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.</p></li>
<li><p>Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.</p></li>
</ul><p>Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (<a href="https://colinpurrington.com/tips/lab-notebooks" class="uri">https://colinpurrington.com/tips/lab-notebooks</a>) to come up with the following tips:</p><ul><li><p>Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.</p></li>
<li>
<p>Use the YAML header date field to record the date you started working on the notebook:</p>
<pre data-type="programlisting" data-code-language="yaml">date: 2016-08-23</pre>
<p>Use ISO8601 YYYY-MM-DD format so that there’s no ambiguity. Use it even if you don’t normally write dates that way!</p>
</li>
<li><p>If you spend a lot of time on an analysis idea and it turns out to be a dead end, don’t delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.</p></li>
<li><p>Generally, you’re better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code>.</p></li>
<li><p>If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.</p></li>
<li><p>Before you finish for the day, make sure you can render the notebook. If you’re using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.</p></li>
<li><p>If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you’ll need to track the versions of the packages that your code uses. A rigorous approach is to use <strong>renv</strong>, <a href="https://rstudio.github.io/renv/index.html" class="uri">https://rstudio.github.io/renv/index.html</a>, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs <code><a href="#chp-https://rdrr.io/r/utils/sessionInfo" data-type="xref">#chp-https://rdrr.io/r/utils/sessionInfo</a></code> — that won’t let you easily recreate your packages as they are today, but at least you’ll know what they were.</p></li>
<li><p>You are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.</p></li>
</ul></section>

682
oreilly/quarto.html Normal file
View File

@ -0,0 +1,682 @@
<section data-type="chapter" id="chp-quarto">
<h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.</p>
<p>Quarto files are designed to be used in three ways:</p>
<ol type="1"><li><p>For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.</p></li>
<li><p>For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).</p></li>
<li><p>As an environment in which to <em>do</em> data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.</p></li>
</ol><p>Quarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through <code>?</code>. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation page at <a href="https://quarto.org/" class="uri">https://quarto.org</a> for help.</p>
<p>If youre an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. Youre not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system as well as extends it with native support for multiple programming languages like Python and Julia in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>You need the Quarto command line interface (Quarto CLI), but you dont need to explicitly install it or load it, as RStudio automatically does both when needed.</p>
</section>
</section>
<section id="quarto-basics" data-type="sect1">
<h1>
Quarto basics</h1>
<p>This is a Quarto file, a plain text file that has the extension <code>.qmd</code>:</p>
<div class="cell">
<pre><code>---
title: "Diamond sizes"
date: 2022-09-12
format: html
---

```{r}
#| label: setup
#| include: false

library(tidyverse)

smaller &lt;- diamonds |&gt;
  filter(carat &lt;= 2.5)
```

We have data about `r nrow(diamonds)` diamonds.
Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.
The distribution of the remainder is shown below:

```{r}
#| label: plot-smaller-diamonds
#| echo: false

smaller |&gt;
  ggplot(aes(carat)) +
  geom_freqpoly(binwidth = 0.01)
```</code></pre>
</div>
<p>It contains three important types of content:</p>
<ol type="1"><li>An (optional) <strong>YAML header</strong> surrounded by <code>---</code>s.</li>
<li>
<strong>Chunks</strong> of R code surrounded by <code>```</code>.</li>
<li>Text mixed with simple text formatting like <code># heading</code> and <code>_italics_</code>.</li>
</ol><p>When you open a <code>.qmd</code>, you get a notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/diamond-sizes-notebook.png" class="img-fluid" style="width:90.0%" alt="RStudio window with a Quarto document titled &quot;diamond-sizes.qmd&quot; on the left and a blank Viewer window on the right. The Quarto document has a code chunk that creates a frequency plot of diamonds that weigh less than 2.5 carats. The plot shows that the frequency decreases as the weight increases."/></p>
</div>
</div>
<p>If you don’t like seeing your plots and output in your document and would rather make use of RStudio’s console and plot panes, you can click on the gear icon next to “Render” and switch to “Chunk Output in Console”.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/diamond-sizes-console-output.png" class="img-fluid" style="width:90.0%" alt="RStudio window with a Quarto document titled &quot;diamond-sizes.qmd&quot; on the left and the Plot pane on the bottom right. The Quarto document has a code chunk that creates a frequency plot of diamonds that weigh less than 2.5 carats. The plot is displayed in the Plot pane and shows that the frequency decreases as the weight increases. The RStudio option to show Chunk Output in Console is also highlighted."/></p>
</div>
</div>
<p>To produce a complete report containing all text, code, and results, click “Render” or press Cmd/Ctrl + Shift + K. You can also do this programmatically with <code>quarto::quarto_render("diamond-sizes.qmd")</code>. This will display the report in the viewer pane and create an HTML file.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/diamond-sizes-report.png" class="img-fluid" style="width:90.0%" alt="RStudio window with a Quarto document titled &quot;diamond-sizes.qmd&quot; on the left and the Plot pane on the bottom right. The rendered document does not show any of the code, but the code is visible in the source document."/></p>
</div>
</div>
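<p>As a hedged sketch of the programmatic route, the call below renders the example document from R. It assumes that the quarto R package and the Quarto CLI are installed and that <code>diamond-sizes.qmd</code> sits in the working directory. The <code>output_format</code> argument is optional and is included only to make the target explicit.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Assumption: diamond-sizes.qmd is in the current working directory.
# The guard avoids an error if the file is missing.
if (file.exists("diamond-sizes.qmd")) {
  quarto::quarto_render("diamond-sizes.qmd", output_format = "html")
}</pre>
</div>
<p>Rendering from code like this can be handy when you want to rebuild a report from a script rather than by clicking “Render”.</p>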
<p>When you render the document, Quarto sends the <code>.qmd</code> file to <strong>knitr</strong>, <a href="https://yihui.name/knitr/" class="uri">https://yihui.name/knitr</a>, which executes all of the code chunks and creates a new markdown (<code>.md</code>) document which includes the code and its output. The markdown file generated by knitr is then processed by <strong>pandoc</strong>, <a href="https://pandoc.org/" class="uri">https://pandoc.org</a>, which is responsible for creating the finished file. The advantage of this two-step workflow is that you can create a very wide range of output formats, as you’ll learn about in <a href="#chp-quarto-formats" data-type="xref">#chp-quarto-formats</a>.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="images/quarto-flow.png" class="img-fluid" style="width:75.0%" alt="Workflow diagram starting with a qmd file, then knitr, then md, then pandoc, then PDF, MS Word, or HTML."/></p>
</div>
</div>
<p>To get started with your own <code>.qmd</code> file, select <em>File &gt; New File &gt; Quarto Document…</em> in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.</p>
<p>The following sections dive into the three components of a Quarto document in more detail: the markdown text, the code chunks, and the YAML header.</p>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Create a new Quarto document using <em>File &gt; New File &gt; Quarto Document</em>. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.</p></li>
<li><p>Create one new Quarto document for each of the three built-in formats: HTML, PDF and Word. Render each of the three documents. How do the outputs differ? How do the inputs differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)</p></li>
</ol></section>
</section>
<section id="visual-editor" data-type="sect1">
<h1>
Visual editor</h1>
<p>The Visual editor in RStudio provides a <a href="https://en.wikipedia.org/wiki/WYSIWYM">WYSIWYM</a> interface for authoring Quarto documents. Under the hood, prose in Quarto documents (<code>.qmd</code> files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in <a href="#sec-source-editor" data-type="xref">#sec-source-editor</a>, it still requires learning new syntax. Therefore, if you’re new to computational documents like <code>.qmd</code> files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.</p>
<p>In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all <kbd>⌘ /</kbd> shortcut to insert just about anything. If you are at the beginning of a line (as shown below), you can also enter just <kbd>/</kbd> to invoke the shortcut.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto/quarto-visual-editor.png" class="img-fluid" style="width:75.0%" alt="A Quarto document displaying various features of the visual editor such as text formatting (italic, bold, underline, small caps, code, superscript, and subscript), first through third level headings, bulleted and numbered lists, links, linked phrases, and images (along with a pop-up window for customizing image size, adding a caption and alt text, etc.), tables with a header row, and the insert anything tool with options to insert an R code chunk, a Python code chunk, a div, a bullet list, a numbered list, or a first level heading (the top few choices in the tool)."/></p>
</div>
</div>
<p>Inserting images and customizing how they are displayed is also facilitated with the visual editor. You can either paste an image from your clipboard directly into the visual editor (and RStudio will place a copy of that image in the project directory and link to it) or you can use the visual editor’s Insert &gt; Figure / Image menu to browse to the image you want to insert or paste its URL. In addition, using the same menu you can resize the image as well as add a caption, alternative text, and a link.</p>
<p>The visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.</p>
<p>Most importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.</p>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises. -->
</section>
</section>
<section id="sec-source-editor" data-type="sect1">
<h1>
Source editor</h1>
<p>You can also edit Quarto documents using the Source editor in RStudio, without the assistance of the Visual editor. While the Visual editor will feel familiar to those with experience writing in tools like Google Docs, the Source editor will feel familiar to those with experience writing R scripts or R Markdown documents. The Source editor can also be useful for debugging any Quarto syntax errors since it’s often easier to catch these in plain text.</p>
<p>The guide below shows how to use Pandoc’s Markdown for authoring Quarto documents in the source editor.</p>
<div class="cell">
<pre><code>## Text formatting

*italic* **bold** [underline]{.underline} ~~strikeout~~ [small caps]{.smallcaps} `code` superscript^2^ and subscript~2~

## Headings

# 1st Level Header

## 2nd Level Header

### 3rd Level Header

## Lists

-   Bulleted list item 1

-   Item 2

    -   Item 2a

    -   Item 2b

1.  Numbered list item 1

2.  Item 2.
    The numbers are incremented automatically in the output.

## Links and images

&lt;http://example.com&gt;

[linked phrase](http://example.com)

![optional caption text](quarto.png){fig-alt="Quarto logo and the word quarto spelled in small case letters"}

## Tables

| First Header | Second Header |
|--------------|---------------|
| Content Cell | Content Cell  |
| Content Cell | Content Cell  |
</code></pre>
</div>
<p>The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with <em>Help &gt; Markdown Quick Reference</em>.</p>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.</p></li>
<li>
<p>Using the visual editor, figure out how to:</p>
<ol type="a"><li>Add a footnote.</li>
<li>Add a horizontal rule.</li>
<li>Add a block quote.</li>
</ol></li>
<li>
<p>Now, using the source editor and the Markdown quick reference, figure out how to:</p>
<ol type="a"><li>Add a footnote.</li>
<li>Add a horizontal rule.</li>
<li>Add a block quote.</li>
</ol></li>
<li><p>Copy and paste the contents of <code>diamond-sizes.qmd</code> from <a href="https://github.com/hadley/r4ds/tree/main/quarto" class="uri">https://github.com/hadley/r4ds/tree/main/quarto</a> into a local Quarto document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.</p></li>
</ol></section>
</section>
<section id="code-chunks" data-type="sect1">
<h1>
Code chunks</h1>
<p>To run code inside a Quarto document, you need to insert a chunk. There are three ways to do so:</p>
<ol type="1"><li><p>The keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.</p></li>
<li><p>The “Insert” button icon in the editor toolbar.</p></li>
<li><p>By manually typing the chunk delimiters <code>```{r}</code> and <code>```</code>.</p></li>
</ol><p>We’d recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!</p>
<p>You can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focused around a single task.</p>
<p>The following sections describe the chunk header which consists of <code>```{r}</code>, followed by an optional chunk label and various other chunk options, each on their own line, marked by <code>#|</code>.</p>
<section id="chunk-label" data-type="sect2">
<h2>
Chunk label</h2>
<p>Chunks can be given an optional label, e.g.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: simple-addition
1 + 1
```</pre>
<pre><code>#&gt; [1] 2</code></pre>
</div>
<p>This has three advantages:</p>
<ol type="1"><li>
<p>You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/quarto-chunk-nav.png" class="img-fluid" style="width:30.0%" alt="Snippet of RStudio IDE showing only the drop-down code navigator which shows three chunks. Chunk 1 is setup. Chunk 2 is cars and it is in a section called Quarto. Chunk 3 is pressure and it is in a section called Including plots."/></p>
</div>
</div>
</li>
<li><p>Graphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in <a href="#sec-figures" data-type="xref">#sec-figures</a>.</p></li>
<li><p>You can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that in <a href="#sec-caching" data-type="xref">#sec-caching</a>.</p></li>
</ol><p>Your chunk labels should be short but evocative and should not contain spaces. We recommend using dashes (<code>-</code>) to separate words (instead of underscores, <code>_</code>) and avoiding other special characters in chunk labels.</p>
<p>You are generally free to label your chunk however you like, but there is one chunk name that imbues special behavior: <code>setup</code>. When you’re in notebook mode, the chunk named <code>setup</code> will be run automatically once, before any other code is run.</p>
<p>Additionally, chunk labels cannot be duplicated. Each chunk label must be unique.</p>
</section>
<section id="chunk-options" data-type="sect2">
<h2>
Chunk options</h2>
<p>Chunk output can be customized with <strong>options</strong>, fields supplied to the chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at <a href="https://yihui.name/knitr/options/" class="uri">https://yihui.name/knitr/options</a>.</p>
<p>The most important set of options controls if your code block is executed and what results are inserted in the finished report:</p>
<ul><li><p><code>eval: false</code> prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.</p></li>
<li><p><code>include: false</code> runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.</p></li>
<li><p><code>echo: false</code> prevents the code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.</p></li>
<li><p><code>message: false</code> or <code>warning: false</code> prevents messages or warnings from appearing in the finished file.</p></li>
<li><p><code>results: hide</code> hides printed output; <code>fig-show: hide</code> hides plots.</p></li>
<li><p><code>error: true</code> causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your <code>.qmd</code>. It’s also useful if you’re teaching R and want to deliberately include an error. The default, <code>error: false</code>, causes rendering to fail if there is a single error in the document.</p></li>
</ul><p>Each of these chunk options gets added to the header of the chunk, following <code>#|</code>. For example, in the following chunk the result is not printed since <code>eval</code> is set to false.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: simple-multiplication
#| eval: false
2 * 2
```</pre>
</div>
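<p>As a further illustration, here is a minimal sketch of <code>error: true</code>. The chunk label is a hypothetical name chosen for this example, and the call deliberately fails because <code>log()</code> needs a numeric argument. With this option set, rendering should carry on and show the error message in the output rather than stopping.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: deliberate-error
#| error: true
log("a")
```</pre>
</div>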
<p>The following table summarizes which types of output each option suppresses:</p>
<table class="table"><colgroup><col style="width: 24%"/><col style="width: 13%"/><col style="width: 14%"/><col style="width: 10%"/><col style="width: 9%"/><col style="width: 13%"/><col style="width: 13%"/></colgroup><thead><tr class="header"><th>Option</th>
<th style="text-align: center;">Run code</th>
<th style="text-align: center;">Show code</th>
<th style="text-align: center;">Output</th>
<th style="text-align: center;">Plots</th>
<th style="text-align: center;">Messages</th>
<th style="text-align: center;">Warnings</th>
</tr></thead><tbody><tr class="odd"><td><code>eval: false</code></td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
</tr><tr class="even"><td><code>include: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
</tr><tr class="odd"><td><code>echo: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
</tr><tr class="even"><td><code>results: hide</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
</tr><tr class="odd"><td><code>fig-show: hide</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
</tr><tr class="even"><td><code>message: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
<td style="text-align: center;"/>
</tr><tr class="odd"><td><code>warning: false</code></td>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;"/>
<td style="text-align: center;">-</td>
</tr></tbody></table></section>
<section id="global-options" data-type="sect2">
<h2>
Global options</h2>
<p>As you work more with knitr, you will discover that some of the default chunk options don’t fit your needs and you want to change them.</p>
<p>You can do this by adding the preferred options in the document YAML, under <code>execute</code>. For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set <code>echo: false</code> at the document level. That will hide the code by default, showing only the chunks you deliberately choose to show (with <code>echo: true</code>). You might consider setting <code>message: false</code> and <code>warning: false</code>, but that would make it harder to debug problems because you wouldn’t see any messages in the final document.</p>
<pre data-type="programlisting" data-code-language="yaml">title: "My report"
execute:
  echo: false</pre>
<p>Since Quarto is designed to be multi-lingual (it works with R as well as other languages like Python, Julia, etc.), not all of the knitr options are available at the document execution level, because some of them only work with knitr and not with the other engines Quarto uses for running code in other languages (e.g., Jupyter). You can, however, still set these as global options for your document under the <code>knitr</code> field, under <code>opts_chunk</code>. For example, when writing books and tutorials we set:</p>
<pre data-type="programlisting" data-code-language="yaml">title: "Tutorial"
knitr:
  opts_chunk:
    comment: "#&gt;"
    collapse: true</pre>
<p>This uses our preferred comment formatting and ensures that the code and output are kept closely entwined.</p>
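<p>Returning to the document-level <code>echo: false</code> setting described earlier in this section, an individual chunk can opt back in to showing its code. The sketch below assumes the <code>smaller</code> data frame from the example document at the start of the chapter. The chunk label is a hypothetical name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: count-by-cut
#| echo: true
smaller |&gt;
  count(cut)
```</pre>
</div>
<p>Chunk-level options take precedence over the document-level defaults, so only this chunk shows its code.</p>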
</section>
<section id="inline-code" data-type="sect2">
<h2>
Inline code</h2>
<p>There is one other way to embed R code into a Quarto document: directly into the text, with: <code>`r `</code>. This can be very useful if you mention properties of your data in the text. For example, the example document used at the start of the chapter had:</p>
<blockquote class="blockquote">
<p>We have data about <code>`r nrow(diamonds)`</code> diamonds. Only <code>`r nrow(diamonds) - nrow(smaller)`</code> are larger than 2.5 carats. The distribution of the remainder is shown below:</p>
</blockquote>
<p>When the report is rendered, the results of these computations are inserted into the text:</p>
<blockquote class="blockquote">
<p>We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:</p>
</blockquote>
<p>When inserting numbers into text, <code><a href="https://rdrr.io/r/base/format.html">format()</a></code> is your friend. It allows you to set the number of <code>digits</code> so you don’t print to a ridiculous degree of accuracy, and a <code>big.mark</code> to make numbers easier to read. You might combine these into a helper function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">comma &lt;- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345)
#&gt; [1] "3,452,345"
comma(.12358124331)
#&gt; [1] "0.12"</pre>
</div>
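<p>To see how such a helper might feed into prose, the sketch below builds the sentence from the example document in plain R. The counts are hard-coded from the rendered output shown above instead of being computed from <code>diamonds</code>, and the variable names are placeholders chosen for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Helper from the text: at least two significant digits, comma as thousands separator
comma &lt;- function(x) format(x, digits = 2, big.mark = ",")

# Counts copied from the rendered report above (hypothetical variable names)
n_total &lt;- 53940
n_large &lt;- 126

sentence &lt;- paste0(
  "We have data about ", comma(n_total), " diamonds. ",
  "Only ", comma(n_large), " are larger than 2.5 carats."
)
sentence
#&gt; [1] "We have data about 53,940 diamonds. Only 126 are larger than 2.5 carats."</pre>
</div>
<p>In a Quarto document you would more likely write the same thing as inline code, e.g. <code>`r comma(nrow(diamonds))`</code>, so that the text updates whenever the data changes.</p>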
</section>
<section id="exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Add a section that explores how diamond sizes vary by cut, color, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting <code>echo: false</code> on each chunk, set a global option.</p></li>
<li><p>Download <code>diamond-sizes.qmd</code> from <a href="https://github.com/hadley/r4ds/tree/main/quarto" class="uri">https://github.com/hadley/r4ds/tree/main/quarto</a>. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.</p></li>
<li><p>Modify <code>diamond-sizes.qmd</code> to use <code>label_comma()</code> to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats.</p></li>
</ol></section>
</section>
<section id="sec-figures" data-type="sect1">
<h1>
Figures</h1>
<p>The figures in a Quarto document can be embedded (e.g., a PNG or JPEG file) or generated as a result of a code chunk.</p>
<p>To embed an image from an external file, you can use the Insert menu in RStudio and select Figure / Image. This will pop open a menu where you can browse to the image you want to insert as well as add alternative text or caption to it and adjust its size. In the visual editor you can also simply paste an image from your clipboard into your document and RStudio will place a copy of that image in your project folder.</p>
<p>If you include a code chunk that generates a figure (e.g., includes a <code>ggplot()</code> call), the resulting figure will be automatically included in your Quarto document.</p>
<section id="figure-sizing" data-type="sect2">
<h2>
Figure sizing</h2>
<p>The biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: <code>fig-width</code>, <code>fig-height</code>, <code>fig-asp</code>, <code>out-width</code> and <code>out-height</code>. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).</p>
<!-- TODO: https://www.tidyverse.org/blog/2020/08/taking-control-of-plot-scaling/ -->
<p>We recommend three of the five options:</p>
<ul><li><p>Plots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set <code>fig-width: 6</code> (6”) and <code>fig-asp: 0.618</code> (the golden ratio) in the defaults. Then in individual chunks, only adjust <code>fig-asp</code>.</p></li>
<li><p>Control the output size with <code>out-width</code> and set it to a percentage of the line width. We suggest <code>out-width: "70%"</code> and <code>fig-align: "center"</code>. That gives plots room to breathe, without taking up too much space.</p></li>
<li><p>To put multiple plots in a single row, set the <code>out-width</code> to <code>50%</code> for two plots, <code>33%</code> for three plots, or <code>25%</code> for four plots, and set <code>fig-align: "default"</code>. Depending on what you’re trying to illustrate (e.g. show data or show plot variations), you might also tweak <code>fig-width</code>, as discussed below.</p></li>
</ul><p>If you find that you’re having to squint to read the text in your plot, you need to tweak <code>fig-width</code>. If <code>fig-width</code> is larger than the size at which the figure is rendered in the final document, the text will be too small; if <code>fig-width</code> is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the <code>fig-width</code> and the eventual width in your document. To illustrate the principle, the following three plots have <code>fig-width</code> of 4, 6, and 8 respectively:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto_files/figure-html/unnamed-chunk-15-1.png" class="img-fluid" width="384"/></p>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" width="576"/></p>
</div>
</div>
<div class="cell">
<div class="cell-output-display">
<p><img src="quarto_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" width="768"/></p>
</div>
</div>
<p>If you want to make sure the font size is consistent across all your figures, whenever you set <code>out-width</code>, you’ll also need to adjust <code>fig-width</code> to maintain the same ratio with your default <code>out-width</code>. For example, if your default <code>fig-width</code> is 6 and <code>out-width</code> is 70%, when you set <code>out-width: "50%"</code> you’ll need to set <code>fig-width</code> to 4.3 (6 * 0.5 / 0.7).</p>
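<p>The scaling rule can be written as a small helper. This is only a sketch: <code>scaled_fig_width()</code> is a hypothetical function, not part of knitr or Quarto, and its defaults simply mirror the recommended <code>fig-width</code> of 6 and <code>out-width</code> of 70%.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical helper: keep fig-width / out-width at the default ratio
# so that text appears at the same size in every figure.
scaled_fig_width &lt;- function(out_width,
                             default_fig_width = 6,
                             default_out_width = 0.7) {
  default_fig_width * out_width / default_out_width
}

# fig-width for two, three, and four plots per row
round(scaled_fig_width(c(0.5, 0.33, 0.25)), 1)
#&gt; [1] 4.3 2.8 2.1</pre>
</div>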
</section>
<section id="other-important-options" data-type="sect2">
<h2>
Other important options</h2>
<p>When mingling code and text, like in this book, you can set <code>fig-show: "hold"</code> so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.</p>
<p>To add a caption to the plot, use <code>fig-cap</code>. In Quarto this will change the figure from inline to “floating”.</p>
<p>If you’re producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set <code>fig-format: "png"</code> to force the use of PNGs. They are slightly lower quality, but will be much more compact.</p>
<p>It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse in other circumstances (i.e. if you want to quickly drop a single plot into an email or a tweet).</p>
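<p>Putting a few of these options together, a figure chunk might look like the sketch below. The label and caption text are hypothetical, and the chunk assumes the <code>smaller</code> data frame from the example document at the start of the chapter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| label: carat-freqpoly
#| fig-cap: "Distribution of carat for diamonds of at most 2.5 carats."
#| fig-width: 6
#| fig-asp: 0.618
#| out-width: "70%"
#| fig-align: center
smaller |&gt;
  ggplot(aes(carat)) +
  geom_freqpoly(binwidth = 0.01)
```</pre>
</div>
<p>Because the chunk is labeled, the image file written to disk should carry the label in its name, which makes it easier to find later.</p>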
</section>
<section id="exercises-4" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises -->
</section>
</section>
<section id="tables" data-type="sect1">
<h1>
Tables</h1>
<p>Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.</p>
<p>By default, Quarto prints data frames and matrices as you’d see them in the console:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mtcars[1:5, ]
#&gt;                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#&gt; Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#&gt; Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#&gt; Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#&gt; Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#&gt; Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2</pre>
</div>
<p>If you prefer that data be displayed with additional formatting you can use the <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">knitr::kable()</a></code> function. The code below generates <a href="#tbl-kable" data-type="xref">#tbl-kable</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">knitr::kable(mtcars[1:5, ])</pre>
<div class="cell-output-display">
<div id="tbl-kable" class="anchored">
<table class="table table-sm table-striped"><caption>Table 27.1: A knitr kable.</caption>
<colgroup><col style="width: 26%"/><col style="width: 7%"/><col style="width: 5%"/><col style="width: 7%"/><col style="width: 5%"/><col style="width: 7%"/><col style="width: 8%"/><col style="width: 8%"/><col style="width: 4%"/><col style="width: 4%"/><col style="width: 7%"/><col style="width: 7%"/></colgroup><thead><tr class="header"><th style="text-align: left;"/>
<th style="text-align: right;">mpg</th>
<th style="text-align: right;">cyl</th>
<th style="text-align: right;">disp</th>
<th style="text-align: right;">hp</th>
<th style="text-align: right;">drat</th>
<th style="text-align: right;">wt</th>
<th style="text-align: right;">qsec</th>
<th style="text-align: right;">vs</th>
<th style="text-align: right;">am</th>
<th style="text-align: right;">gear</th>
<th style="text-align: right;">carb</th>
</tr></thead><tbody><tr class="odd"><td style="text-align: left;">Mazda RX4</td>
<td style="text-align: right;">21.0</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">160</td>
<td style="text-align: right;">110</td>
<td style="text-align: right;">3.90</td>
<td style="text-align: right;">2.620</td>
<td style="text-align: right;">16.46</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">4</td>
</tr><tr class="even"><td style="text-align: left;">Mazda RX4 Wag</td>
<td style="text-align: right;">21.0</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">160</td>
<td style="text-align: right;">110</td>
<td style="text-align: right;">3.90</td>
<td style="text-align: right;">2.875</td>
<td style="text-align: right;">17.02</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">4</td>
</tr><tr class="odd"><td style="text-align: left;">Datsun 710</td>
<td style="text-align: right;">22.8</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">108</td>
<td style="text-align: right;">93</td>
<td style="text-align: right;">3.85</td>
<td style="text-align: right;">2.320</td>
<td style="text-align: right;">18.61</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">4</td>
<td style="text-align: right;">1</td>
</tr><tr class="even"><td style="text-align: left;">Hornet 4 Drive</td>
<td style="text-align: right;">21.4</td>
<td style="text-align: right;">6</td>
<td style="text-align: right;">258</td>
<td style="text-align: right;">110</td>
<td style="text-align: right;">3.08</td>
<td style="text-align: right;">3.215</td>
<td style="text-align: right;">19.44</td>
<td style="text-align: right;">1</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">3</td>
<td style="text-align: right;">1</td>
</tr><tr class="odd"><td style="text-align: left;">Hornet Sportabout</td>
<td style="text-align: right;">18.7</td>
<td style="text-align: right;">8</td>
<td style="text-align: right;">360</td>
<td style="text-align: right;">175</td>
<td style="text-align: right;">3.15</td>
<td style="text-align: right;">3.440</td>
<td style="text-align: right;">17.02</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">0</td>
<td style="text-align: right;">3</td>
<td style="text-align: right;">2</td>
</tr></tbody></table></div>
</div>
</div>
<p>Read the documentation for <code><a href="#chp-https://rdrr.io/pkg/knitr/man/kable" data-type="xref">#chp-https://rdrr.io/pkg/knitr/man/kable</a></code> to see the other ways in which you can customize the table. For even deeper customization, consider the <strong>gt</strong>, <strong>huxtable</strong>, <strong>reactable</strong>, <strong>kableExtra</strong>, <strong>xtable</strong>, <strong>stargazer</strong>, <strong>pander</strong>, <strong>tables</strong>, and <strong>ascii</strong> packages. Each provides a set of tools for returning formatted tables from R code.</p>
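<p>As a minimal sketch of that kind of customization, the call below uses three documented <code>kable()</code> arguments: <code>digits</code>, <code>col.names</code>, and <code>caption</code>. The column subset, the labels, and the caption text are our own choices for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Keep three columns, round to one decimal place, and relabel the headers
tbl &lt;- knitr::kable(
  mtcars[1:5, c("mpg", "cyl", "wt")],
  digits = 1,
  col.names = c("Miles per gallon", "Cylinders", "Weight (1000 lbs)"),
  caption = "The first five rows of mtcars."
)
tbl</pre>
</div>
<p>Because <code>mtcars</code> has row names, <code>kable()</code> adds them as an unlabelled first column, so <code>col.names</code> only needs one label per data column.</p>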
<p>There is also a rich set of options for controlling how figures are embedded. You’ll learn about these in <a href="#chp-communicate-plots" data-type="xref">#chp-communicate-plots</a>.</p>
<section id="exercises-5" data-type="sect2">
<h2>
Exercises</h2>
<!--# TO DO: Add exercises -->
</section>
</section>
<section id="sec-caching" data-type="sect1">
<h1>
Caching</h1>
<p>Normally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is <code>cache: true</code>.</p>
<p>You can enable the Knitr cache at the document level for caching the results of all computations in a document using standard YAML options:</p>
<pre data-type="programlisting" data-code-language="yaml">---
title: "My Document"
execute:
  cache: true
---</pre>
<p>You can also enable caching at the chunk level for caching the results of computation in a specific chunk:</p>
<div class="cell" data-hash="quarto_cache/html/unnamed-chunk-20_0ece1c5566ef654926248351b9afb313">
<pre data-type="programlisting" data-code-language="markdown">```{r}
#| cache: true
# code for lengthy computation...
```</pre>
</div>
<p>When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.</p>
<p>The caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the <code>processed-data</code> chunk depends on the <code>raw-data</code> chunk:</p>
<pre><code>```{r}
#| label: raw-data
rawdata &lt;- readr::read_csv("a_very_large_file.csv")
```

```{r}
#| label: processed-data
#| cache: true
processed_data &lt;- rawdata |&gt;
  filter(!is.na(import_var)) |&gt;
  mutate(new_variable = complicated_transformation(x, y, z))
```</code></pre>
<p>Caching the <code>processed-data</code> chunk means that it will get rerun if the dplyr pipeline is changed, but it won’t get rerun if the <code>read_csv()</code> call changes. You can avoid that problem with the <code>dependson</code> chunk option:</p>
<pre><code>```{r}
#| label: processed-data
#| cache: true
#| dependson: "raw-data"
processed_data &lt;- rawdata |&gt;
  filter(!is.na(import_var)) |&gt;
  mutate(new_variable = complicated_transformation(x, y, z))
```</code></pre>
<p><code>dependson</code> should contain a character vector of <em>every</em> chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.</p>
<p>Note that the chunks won’t update if <code>a_very_large_file.csv</code> changes, because knitr caching only tracks changes within the <code>.qmd</code> file. If you want to also track changes to that file you can use the <code>cache.extra</code> option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is <code><a href="#chp-https://rdrr.io/r/base/file.info" data-type="xref">#chp-https://rdrr.io/r/base/file.info</a></code>: it returns a bunch of information about the file including when it was last modified. Then you can write:</p>
<pre><code>```{r}
#| label: raw-data
#| cache.extra: file.info("a_very_large_file.csv")
rawdata &lt;- readr::read_csv("a_very_large_file.csv")
```</code></pre>
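<p>To see why the result of <code>file.info()</code> works as a cache key, the sketch below inspects a file before and after it changes. It uses a small temporary file as a stand-in for <code>a_very_large_file.csv</code>, an assumption made only so that the code runs anywhere.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A temporary file standing in for a_very_large_file.csv
path &lt;- tempfile(fileext = ".csv")
writeLines("x,y\n1,2", path)

# file.info() returns a one-row data frame of metadata
before &lt;- file.info(path)
before[, c("size", "mtime")]

# Appending a row changes the size (and normally the modification time),
# so an expression built on file.info() no longer matches the cached value
cat("3,4\n", file = path, append = TRUE)
after &lt;- file.info(path)

identical(before$size, after$size)
#&gt; [1] FALSE</pre>
</div>
<p>If modification times are unreliable on your system, a content-based checksum such as <code>tools::md5sum()</code> is one possible alternative, at the cost of reading the whole file on every render.</p>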
<p>As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with <code><a href="#chp-https://rdrr.io/pkg/knitr/man/clean_cache" data-type="xref">#chp-https://rdrr.io/pkg/knitr/man/clean_cache</a></code>.</p>
<p>We’ve followed the advice of <a href="#chp-https://twitter.com/drob/status/738786604731490304" data-type="xref">#chp-https://twitter.com/drob/status/738786604731490304</a> to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the <code>dependson</code> specification.</p>
<section id="exercises-6" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Set up a network of chunks where <code>d</code> depends on <code>c</code> and <code>b</code>, and both <code>b</code> and <code>c</code> depend on <code>a</code>. Have each chunk print <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/now</a></code>, set <code>cache: true</code>, then verify your understanding of caching.</li>
</ol></section>
</section>
<section id="troubleshooting" data-type="sect1">
<h1>
Troubleshooting</h1>
<p>Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.</p>
<p>One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.</p>
<p>If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu (under Run region) or with the keyboard shortcut Ctrl + Alt + R. If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.</p>
<p>If that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including <code><a href="#chp-https://rdrr.io/r/base/getwd" data-type="xref">#chp-https://rdrr.io/r/base/getwd</a></code> in a chunk.</p>
<p>Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set <code>error: true</code> on the chunk causing the problem, then use <code><a href="#chp-https://rdrr.io/r/base/print" data-type="xref">#chp-https://rdrr.io/r/base/print</a></code> and <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code> to check that settings are as you expect.</p>
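<p>One way to run that comparison is sketched below, using only base R. The particular settings inspected here are our own suggestions, not an exhaustive list. Run the same code once in the console and once inside the problem chunk, then compare the two outputs.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Where is the code running, and which package libraries can it see?
print(getwd())
print(.libPaths())

# A few settings that commonly differ between sessions
str(list(
  locale = Sys.getlocale(),
  timezone = Sys.timezone(),
  digits = getOption("digits"),
  loaded = loadedNamespaces()
))</pre>
</div>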
</section>
<section id="yaml-header" data-type="sect1">
<h1>
YAML header</h1>
<p>You can control many other “whole document” settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it’s “YAML Ain’t Markup Language”, which is designed for representing hierarchical data in a way that’s easy for humans to read and write. Quarto uses it to control many details of the output. Here we’ll discuss three: self-contained documents, document parameters, and bibliographies.</p>
<section id="self-contained" data-type="sect2">
<h2>
Self-contained</h2>
<p>HTML documents typically have a number of external dependencies (e.g. images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a <code>_files</code> folder in the same directory as your <code>.qmd</code> file. If you publish the HTML file on a hosting platform (e.g., Quarto Pub, <a href="https://quartopub.com/" class="uri">https://quartopub.com/</a>), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained HTML document that embeds all of its dependencies. You can do this by specifying the <code>embed-resources</code> option:</p>
<pre data-type="programlisting" data-code-language="yaml">format:
  html:
    embed-resources: true</pre>
<p>The resulting file will be self-contained, such that it will need no external files and no internet access to be displayed properly by a browser.</p>
</section>
<section id="parameters" data-type="sect2">
<h2>
Parameters</h2>
<p>Quarto documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the <code>params</code> field.</p>
<p>This example uses a <code>my_class</code> parameter to determine which class of cars to display:</p>
<div class="cell">
<pre><code>---
format: html
params:
  my_class: "suv"
---
```{r}
#| label: setup
#| include: false
library(tidyverse)
class &lt;- mpg |&gt; filter(class == params$my_class)
```
# Fuel economy for `r params$my_class`s
```{r}
#| message: false
ggplot(class, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
```</code></pre>
</div>
<p>As you can see, parameters are available within the code chunks as a read-only list named <code>params</code>.</p>
<p>You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with <code>!r</code>. This is a good way to specify date/time parameters.</p>
<pre data-type="programlisting" data-code-language="yaml">params:
  start: !r lubridate::ymd("2015-01-01")
  snapshot: !r lubridate::ymd_hms("2015-01-01 12:30:00")</pre>
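<p>The reason <code>!r</code> helps with dates is that a bare YAML value such as <code>"2015-01-01"</code> reaches R as a character string. The short sketch below, which assumes the lubridate package is installed, shows the classes of the two expressions used above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A quoted value is just text
class("2015-01-01")
#&gt; [1] "character"

# The same expressions used in the params field produce real date types
start &lt;- lubridate::ymd("2015-01-01")
snapshot &lt;- lubridate::ymd_hms("2015-01-01 12:30:00")

class(start)
#&gt; [1] "Date"
class(snapshot)
#&gt; [1] "POSIXct" "POSIXt"</pre>
</div>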
</section>
<section id="bibliographies-and-citations" data-type="sect2">
<h2>
Bibliographies and Citations</h2>
<p>Quarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.</p>
<p>To add a citation using the visual editor, go to Insert &gt; Citation. Citations can be inserted from a variety of sources:</p>
<ol type="1"><li><p><a href="#citations-from-dois" data-type="xref">#citations-from-dois</a> (Document Object Identifier) references.</p></li>
<li><p><a href="#citations-from-zotero" data-type="xref">#citations-from-zotero</a> personal or group libraries.</p></li>
<li><p>Searches of <a href="#chp-https://www.crossref.org/" data-type="xref">#chp-https://www.crossref.org/</a>, <a href="#chp-https://datacite.org/" data-type="xref">#chp-https://datacite.org/</a>, or <a href="#chp-https://pubmed.ncbi.nlm.nih.gov/" data-type="xref">#chp-https://pubmed.ncbi.nlm.nih.gov/</a>.</p></li>
<li><p>Your document bibliography (a <code>.bib</code> file in the directory of your document)</p></li>
</ol><p>Under the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g. <code>[@citation]</code>).</p>
<p>If you add a citation using one of the first three methods, the visual editor will automatically create a <code>bibliography.bib</code> file for you and add the reference to it. It will also add a <code>bibliography</code> field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats including BibLaTeX, BibTeX, EndNote, and Medline.</p>
<p>To create a citation within your .qmd file in the source editor, use a key composed of @ + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:</p>
<pre data-type="programlisting" data-code-language="markdown">Separate multiple citations with a `;`: Blah blah [@smith04; @doe99].
You can add arbitrary comments inside the square brackets:
Blah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].
Remove the square brackets to create an in-text citation: @smith04
says blah, or @smith04 [p. 33] says blah.
Add a `-` before the citation to suppress the author's name:
Smith says blah [-@smith04].</pre>
<p>When Quarto renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as <code># References</code> or <code># Bibliography</code>.</p>
<p>You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the <code>csl</code> field:</p>
<pre data-type="programlisting" data-code-language="yaml">bibliography: rmarkdown.bib
csl: apa.csl</pre>
<p>As with the <code>bibliography</code> field, the <code>csl</code> field should contain a path to the file. Here we assume that the CSL file is in the same directory as the .qmd file. A good place to find CSL style files for common bibliography styles is <a href="https://github.com/citation-style-language/styles" class="uri">https://github.com/citation-style-language/styles</a>.</p>
</section>
</section>
<section id="learning-more" data-type="sect1">
<h1>
Learning more</h1>
<p>Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: <a href="https://quarto.org/" class="uri">https://quarto.org</a>.</p>
<p>There are two important topics that we haven’t covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user-friendly introduction to Git and GitHub for R users, by Jenny Bryan. The book is freely available online: <a href="https://happygitwithr.com" class="uri">https://happygitwithr.com</a>.</p>
<p>We have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either <a href="#chp-https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416" data-type="xref">#chp-https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416</a> by Joseph M. Williams &amp; Joseph Bizup, or <a href="#chp-https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327" data-type="xref">#chp-https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327</a> by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing clearer. (These books are rather expensive if purchased new, but they’re used by many English classes so there are plenty of cheap second-hand copies.) George Gopen also has a number of short articles on writing at <a href="https://www.georgegopen.com/the-litigation-articles.html" class="uri">https://www.georgegopen.com/the-litigation-articles.html</a>. They are aimed at lawyers, but almost everything applies to data scientists too.</p>
</section>
</section>

1148
oreilly/rectangling.html Normal file

File diff suppressed because it is too large

1032
oreilly/regexps.html Normal file

File diff suppressed because it is too large

539
oreilly/spreadsheets.html Normal file

@ -0,0 +1,539 @@
<section data-type="chapter" id="chp-spreadsheets">
<h1><span id="sec-import-spreadsheets" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Spreadsheets</span></span></h1><div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far you have learned about importing data from plain text files, e.g. <code>.csv</code> and <code>.tsv</code> files. Sometimes you need to analyze data that lives in a spreadsheet. In this chapter we will introduce you to tools for working with data in Excel spreadsheets and Google Sheets. This will build on much of what you’ve learned in <a href="#chp-data-import" data-type="xref">#chp-data-import</a> but we will also discuss additional considerations and complexities when working with data from spreadsheets.</p>
<p>If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper “Data Organization in Spreadsheets” by Karl Broman and Kara Woo: <a href="https://doi.org/10.1080/00031305.2017.1375989" class="uri">https://doi.org/10.1080/00031305.2017.1375989</a>. The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyse and visualise.</p>
</section>
<section id="excel" data-type="sect1">
<h1>
Excel</h1>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, you’ll learn how to load data from Excel spreadsheets in R with the <strong>readxl</strong> package. This package is not part of the core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)
library(tidyverse)</pre>
</div>
<p><strong>xlsx</strong> and <strong>XLConnect</strong> can be used for reading data from and writing data to Excel spreadsheets. However, these two packages require Java to be installed on your machine, along with the rJava package. Due to potential challenges with installation, we recommend using the alternative packages we’ve introduced in this chapter.</p>
</section>
<section id="getting-started" data-type="sect2">
<h2>
Getting started</h2>
<p>Most of readxl’s functions allow you to load Excel spreadsheets into R:</p>
<ul><li>
<code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">read_xls()</a></code> reads Excel files with <code>xls</code> format.</li>
<li>
<code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">read_xlsx()</a></code> reads Excel files with <code>xlsx</code> format.</li>
<li>
<code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">read_excel()</a></code> can read files with both <code>xls</code> and <code>xlsx</code> format. It guesses the file type based on the input.</li>
</ul><p>These functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g. <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>, <code><a href="#chp-https://readr.tidyverse.org/reference/read_table" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_table</a></code>, etc. For the rest of the chapter we will focus on using <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>.</p>
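<p>As a rough illustration of that guessing step, the sketch below uses the example workbooks that ship with readxl (located with <code>readxl_example()</code>) rather than files from this chapter. <code>excel_format()</code> reports which format readxl detects for each path, and <code>read_excel()</code> then reads either file without being told its type.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)

# Example workbooks bundled with readxl, one in each format
xls_path &lt;- readxl_example("datasets.xls")
xlsx_path &lt;- readxl_example("datasets.xlsx")

# Ask readxl which format it detects for each file
excel_format(c(xls_path, xlsx_path))
#&gt; [1] "xls"  "xlsx"

# read_excel() handles both without a format argument
from_xls &lt;- read_excel(xls_path)
from_xlsx &lt;- read_excel(xlsx_path)
identical(dim(from_xls), dim(from_xlsx))</pre>
</div>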
</section>
<section id="sec-reading-spreadsheets" data-type="sect2">
<h2>
Reading spreadsheets</h2>
<p><a href="#fig-students-excel" data-type="xref">#fig-students-excel</a> shows what the spreadsheet we’re going to read into R looks like in Excel.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-penguins-species"><p><img src="images/import-spreadsheets-students.png" alt="A look at the students spreadsheet in Excel. The spreadsheet contains information on 6 students, their ID, full name, favourite food, meal plan, and age." width="1200"/></p>
<figcaption>Figure 20.1: Spreadsheet called students.xlsx in Excel.</figcaption>
</figure>
</div>
</div>
<p>The first argument to <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> is the path to the file to read.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_excel("data/students.xlsx")</pre>
</div>
<p><code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> will read the file in as a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students
#&gt; # A tibble: 6 × 5
#&gt; `Student ID` `Full Name` favourite.food mealPlan AGE
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne N/A Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>We have six students in the data and five variables on each student. However, there are a few things we might want to address in this dataset:</p>
<ol type="1"><li>
<p>The column names are all over the place. You can provide column names that follow a consistent format with the <code>col_names</code> argument; we recommend <code>snake_case</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
)
#&gt; # A tibble: 7 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Student ID Full Name favourite.food mealPlan AGE
#&gt; 2 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 3 2 Barclay Lynn French fries Lunch only 5
#&gt; 4 3 Jayendra Lyne N/A Breakfast and lunch 7
#&gt; 5 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 6 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; # … with 1 more row</pre>
</div>
<p>Unfortunately, this didn’t quite do the trick. You now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the <code>skip</code> argument.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1
)
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne N/A Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
<li>
<p>In the <code>favourite_food</code> column, one of the observations is <code>N/A</code>, which stands for “not available” but it’s currently not recognized as an <code>NA</code> (note the contrast between this <code>N/A</code> and the age of the fourth student in the list). You can specify which character strings should be recognized as <code>NA</code>s with the <code>na</code> argument. By default, only <code>""</code> (empty string, or, in the case of reading from a spreadsheet, an empty cell) is recognized as an <code>NA</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1,
  na = c("", "N/A")
)
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only &lt;NA&gt;
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
<li>
<p>One other remaining issue is that <code>age</code> is read in as a character variable, but it really should be numeric. Just like with <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> and friends for reading data from flat files, you can supply a <code>col_types</code> argument to <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are <code>"skip"</code>, <code>"guess"</code>, <code>"logical"</code>, <code>"numeric"</code>, <code>"date"</code>, <code>"text"</code> or <code>"list"</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1,
  na = c("", "N/A"),
  col_types = c("numeric", "text", "text", "text", "numeric")
)
#&gt; Warning: Expecting numeric in E6 / R6C5: got 'five'
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch NA
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
<p>However, this didn’t quite produce the desired result either. By specifying that <code>age</code> should be numeric, we have turned the one cell with the non-numeric entry (which had the value <code>five</code>) into an <code>NA</code>. In this case, we should read age in as <code>"text"</code> and then make the change once the data is loaded in R.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_excel(
  "data/students.xlsx",
  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
  skip = 1,
  na = c("", "N/A"),
  col_types = c("numeric", "text", "text", "text", "text")
)
students &lt;- students |&gt;
  mutate(
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )
students
#&gt; # A tibble: 6 × 5
#&gt; student_id full_name favourite_food meal_plan age
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#&gt; 2 2 Barclay Lynn French fries Lunch only 5
#&gt; 3 3 Jayendra Lyne &lt;NA&gt; Breakfast and lunch 7
#&gt; 4 4 Leon Rossini Anchovies Lunch only NA
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div>
</li>
</ol><p>It took us multiple steps and trial-and-error to load the data in exactly the format we want, and this is not unexpected. Data science is an iterative process. There is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. That might be tempting, but we strongly recommend against it. <!--# TO DO: Provide reason why it's not recommended. --> Instead, you should not be afraid of doing what we did here: load the data, take a peek, make adjustments to your code, load it again, and repeat until you’re happy with the result.</p>
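<p>If you want to experiment with the final recoding step without the spreadsheet at hand, the sketch below applies the same <code>if_else()</code> and <code>parse_number()</code> calls to a small tibble typed in by hand. The <code>ages</code> tibble is a hypothetical stand-in for the <code>age</code> column as it arrives when read in as text.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

# A hand-typed stand-in for the age column read in as text
ages &lt;- tibble(age = c("4", "5", "7", NA, "five", "6"))

fixed &lt;- ages |&gt;
  mutate(
    age = if_else(age == "five", "5", age), # recode the one spelled-out value
    age = parse_number(age)                 # then convert the text to numbers
  )

fixed$age
#&gt; [1]  4  5  7 NA  5  6</pre>
</div>
<p>Note that the missing value stays missing: <code>if_else()</code> returns <code>NA</code> when its condition is <code>NA</code>, and <code>parse_number()</code> passes it through.</p>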
</section>
<section id="reading-individual-sheets" data-type="sect2">
<h2>
Reading individual sheets</h2>
<p>An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets. <a href="#fig-penguins-islands" data-type="xref">#fig-penguins-islands</a> shows an Excel spreadsheet with multiple sheets. The data come from the <strong>palmerpenguins</strong> package. Each sheet contains information on penguins from a different island where data were collected.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="images/import-spreadsheets-penguins-islands.png" alt="A look at the penguins spreadsheet in Excel. The spreadsheet has three sheets: Torgersen Island, Biscoe Island, and Dream Island." width="1514"/></p>
<figcaption class="figure-caption">Figure 20.2: Spreadsheet called penguins.xlsx in Excel.</figcaption>
</figure>
</div>
</div>
<p>You can read a single sheet from a spreadsheet with the <code>sheet</code> argument in <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipp…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.399999999… 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.299999999999997 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA NA 2007
#&gt; 5 Adelie Torgersen 36.700000000000003 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.299999999999997 20.6 190 3650 male 2007
#&gt; # … with 46 more rows, and abbreviated variable names ¹flipper_length_mm,
#&gt; # ²body_mass_g</pre>
</div>
<p>Some variables that appear to contain numerical data are read in as characters due to the character string <code>"NA"</code> not being recognized as a true <code>NA</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_torgersen &lt;- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
penguins_torgersen
#&gt; # A tibble: 52 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA &lt;NA&gt; 2007
#&gt; 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#&gt; # … with 46 more rows, and abbreviated variable names ¹flipper_length_mm,
#&gt; # ²body_mass_g</pre>
</div>
<p>However, we cheated here a bit. We looked inside the Excel spreadsheet, which is not a recommended workflow. Instead, you can use <code><a href="#chp-https://readxl.tidyverse.org/reference/excel_sheets" data-type="xref">#chp-https://readxl.tidyverse.org/reference/excel_sheets</a></code> to get information on all sheets in an Excel spreadsheet, and then read the one(s) you're interested in.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">excel_sheets("data/penguins.xlsx")
#&gt; [1] "Torgersen Island" "Biscoe Island" "Dream Island"</pre>
</div>
<p>Once you know the names of the sheets, you can read them in individually with <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_biscoe &lt;- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
penguins_dream &lt;- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")</pre>
</div>
<p>In this case the full penguins dataset is spread across three sheets in the spreadsheet. Each sheet has the same number of columns but different numbers of rows.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dim(penguins_torgersen)
#&gt; [1] 52 8
dim(penguins_biscoe)
#&gt; [1] 168 8
dim(penguins_dream)
#&gt; [1] 124 8</pre>
</div>
<p>We can put them together with <code><a href="#chp-https://dplyr.tidyverse.org/reference/bind_rows" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/bind_rows</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
#&gt; # A tibble: 344 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#&gt; 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#&gt; 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#&gt; 4 Adelie Torgersen NA NA NA NA &lt;NA&gt; 2007
#&gt; 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#&gt; 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#&gt; # … with 338 more rows, and abbreviated variable names ¹flipper_length_mm,
#&gt; # ²body_mass_g</pre>
</div>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a> we'll talk about ways of doing this sort of task without repetitive code.</p>
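<p>As a rough preview, one way to avoid the repetition is to loop over the sheet names returned by <code>excel_sheets()</code>. The sketch below is not necessarily the approach taken in the iteration chapter. To stay self-contained it first writes a small three-sheet workbook to a temporary file; the <code>toy_sheets</code> data and the temporary path are made up for illustration, and the readxl, writexl, and dplyr packages are assumed to be installed.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)
library(writexl)
library(dplyr)

# Hypothetical stand-in for penguins.xlsx: a named list of data frames
# becomes one sheet per element when written with write_xlsx()
toy_sheets &lt;- list(
  "Torgersen Island" = data.frame(species = "Adelie", body_mass_g = c(3750, 3800)),
  "Biscoe Island" = data.frame(species = "Gentoo", body_mass_g = c(4500, 5700, 4450)),
  "Dream Island" = data.frame(species = "Chinstrap", body_mass_g = 3500)
)
path &lt;- tempfile(fileext = ".xlsx")
write_xlsx(toy_sheets, path)

# Read every sheet into a list of tibbles, then stack them
sheets &lt;- excel_sheets(path)
all_sheets &lt;- lapply(sheets, function(s) read_excel(path, sheet = s, na = "NA"))
combined &lt;- bind_rows(all_sheets)
dim(combined)</pre>
</div>
<p>Because the sheet names are discovered at run time, the same code keeps working if another island's sheet is added to the workbook.</p>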
</section>
<section id="reading-part-of-a-sheet" data-type="sect2">
<h2>
Reading part of a sheet</h2>
<p>Since many use Excel spreadsheets for presentation as well as for data storage, it's quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R. <a href="#fig-deaths-excel" data-type="xref">#fig-deaths-excel</a> shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="images/import-spreadsheets-deaths.png" alt="A look at the deaths spreadsheet in Excel. The spreadsheet has four rows on top that contain non-data information; the text 'For the sake of consistency in the data layout, which is really a beautiful thing, I will keep making notes up here.' is spread across cells in these top four rows. Then, there is a data frame that includes information on deaths of 10 famous people, including their names, professions, ages, whether they have kids or not, date of birth and death. At the bottom, there are four more rows of non-data information; the text 'This has been really fun, but we're signing off now!' is spread across cells in these bottom four rows." width="1614"/></p>
<figcaption class="figure-caption">Figure 20.3: Spreadsheet called deaths.xlsx in Excel.</figcaption>
</figure>
</div>
</div>
<p>This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the <code><a href="#chp-https://readxl.tidyverse.org/reference/readxl_example" data-type="xref">#chp-https://readxl.tidyverse.org/reference/readxl_example</a></code> function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> as usual.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">deaths_path &lt;- readxl_example("deaths.xlsx")
deaths &lt;- read_excel(deaths_path)
#&gt; New names:
#&gt; • `` -&gt; `...2`
#&gt; • `` -&gt; `...3`
#&gt; • `` -&gt; `...4`
#&gt; • `` -&gt; `...5`
#&gt; • `` -&gt; `...6`
deaths
#&gt; # A tibble: 18 × 6
#&gt; `Lots of people` ...2 ...3 ...4 ...5 ...6
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 simply cannot resist writing &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; some not…
#&gt; 2 at the top &lt;NA&gt; of their sp…
#&gt; 3 or merging &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; cells
#&gt; 4 Name Profession Age Has kids Date of birth Date of …
#&gt; 5 David Bowie musician 69 TRUE 17175 42379
#&gt; 6 Carrie Fisher actor 60 TRUE 20749 42731
#&gt; # … with 12 more rows</pre>
</div>
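<p>The top three rows and the bottom four rows of this tibble are not part of the data, and the real column names have ended up in the fourth row.</p>
<p>We could skip the extraneous rows at the top with <code>skip</code>. Note that we set <code>skip = 4</code>, not 3: the spreadsheet has four rows of notes above the table (the first of which readxl used as column names above), so the fifth row of the sheet, which contains the real column names, becomes the header.</p>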
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4)
#&gt; # A tibble: 14 × 6
#&gt; Name Profession Age `Has kids` `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;dttm&gt; &lt;chr&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 42379
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 42731
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 42812
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 42791
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 42481
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 42383
#&gt; # … with 8 more rows</pre>
</div>
<p>We could also set <code>n_max</code> to omit the extraneous rows at the bottom.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4, n_max = 10)
#&gt; # A tibble: 10 × 6
#&gt; Name Profession Age Has k…¹ `Date of birth` `Date of death`
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;lgl&gt; &lt;dttm&gt; &lt;dttm&gt;
#&gt; 1 David Bowie musician 69 TRUE 1947-01-08 00:00:00 2016-01-10 00:00:00
#&gt; 2 Carrie Fisher actor 60 TRUE 1956-10-21 00:00:00 2016-12-27 00:00:00
#&gt; 3 Chuck Berry musician 90 TRUE 1926-10-18 00:00:00 2017-03-18 00:00:00
#&gt; 4 Bill Paxton actor 61 TRUE 1955-05-17 00:00:00 2017-02-25 00:00:00
#&gt; 5 Prince musician 57 TRUE 1958-06-07 00:00:00 2016-04-21 00:00:00
#&gt; 6 Alan Rickman actor 69 FALSE 1946-02-21 00:00:00 2016-01-14 00:00:00
#&gt; # … with 4 more rows, and abbreviated variable name ¹​`Has kids`</pre>
</div>
<p>Another approach is using cell ranges. In Excel, the top left cell is <code>A1</code>. As you move across columns to the right, the cell label moves down the alphabet, i.e. <code>B1</code>, <code>C1</code>, etc. And as you move down a column, the number in the cell label increases, i.e. <code>A2</code>, <code>A3</code>, etc.</p>
<p>The data we want to read in starts in cell <code>A5</code> and ends in cell <code>F15</code>. In spreadsheet notation, this is <code>A5:F15</code>.</p>
<ul><li>
<p>Supply this information to the <code>range</code> argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = "A5:F15")</pre>
</div>
</li>
<li>
<p>Specify rows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = cell_rows(c(5, 15)))</pre>
</div>
</li>
<li>
<p>Specify cells that mark the top-left and bottom-right corners of the data: the top-left corner, <code>A5</code>, translates to <code>c(5, 1)</code> (5th row down, 1st column) and the bottom-right corner, <code>F15</code>, translates to <code>c(15, 6)</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))</pre>
</div>
</li>
</ul><p>If you have control over the sheet, an even better way is to create a “named range”. This is useful within Excel because named ranges make repeated formulas easier to create, and they have some useful properties for creating dynamic charts and graphs as well. Even if you're not working in Excel, named ranges can be useful for identifying which cells to read into R. In the example above, the table we're reading in is named <code>Table1</code>, so we can read it in with the following.</p>
<p><strong>TO DO:</strong> Add this once reading in named ranges are implemented in readxl.</p>
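<p>Returning to the cell-range specifications shown earlier in this section, a quick sanity check is to confirm that two of them select the same rectangle. This is only a sketch: it assumes readxl is installed and uses the example file that ships with the package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)

deaths_path &lt;- readxl_example("deaths.xlsx")

# The same rectangle, written in spreadsheet notation and as row/column limits
by_a1 &lt;- read_excel(deaths_path, range = "A5:F15")
by_limits &lt;- read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))

# Both calls target the same cells, so the results should match
identical(by_a1, by_limits)
dim(by_a1)</pre>
</div>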
</section>
<section id="data-types" data-type="sect2">
<h2>
Data types</h2>
<p>In CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.</p>
<p>The underlying data in Excel spreadsheets is more complex. A cell can be one of five things:</p>
<ul><li><p>A logical, like TRUE / FALSE</p></li>
<li><p>A number, like “10” or “10.5”</p></li>
<li><p>A date, which can also include time like “11/1/21” or “11/1/21 3:00 PM”</p></li>
<li><p>A string, like “ten”</p></li>
<li><p>A currency, which allows numeric values in a limited range and four decimal digits of fixed precision</p></li>
</ul><p>When working with spreadsheet data, it's important to keep in mind that how the underlying data is stored can be very different from what you see in the cell. For example, Excel has no notion of an integer. All numbers are stored as floating-point values, but you can choose to display the data with a customizable number of decimal places. Similarly, dates are actually stored as numbers, specifically the number of days since a fixed origin (in Excel's default date system, day 1 is January 1, 1900), with any time of day stored as a fraction of a day. You can customize how you display the date by applying formatting in Excel. Confusingly, it's also possible to have something that looks like a number but is actually a string (e.g. type <code>'10</code> into a cell in Excel).</p>
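<p>To make the date storage concrete: Excel date serial numbers count days, which is why values such as <code>17175</code> and <code>42379</code> appeared in the deaths spreadsheet when its date columns were read as text. The sketch below converts them by hand with base R. It assumes the workbook uses Excel's default 1900 date system, for which <code>"1899-12-30"</code> is the origin that reproduces Excel's dates in R; workbooks saved with the 1904 date system would need <code>"1904-01-01"</code> instead.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Serial numbers taken from the deaths spreadsheet output shown earlier
serials &lt;- c(17175, 42379)

# Assumption: default 1900 date system, so the origin is 1899-12-30
converted &lt;- as.Date(serials, origin = "1899-12-30")
converted
#&gt; [1] "1947-01-08" "2016-01-10"</pre>
</div>
<p>These match David Bowie's dates of birth and death in the correctly parsed output, so you rarely need to do this conversion yourself when readxl guesses the column type as a date.</p>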
<p>These differences between how the underlying data are stored vs. how they're displayed can cause surprises when the data are loaded into R. By default readxl will guess the data type in a given column. A recommended workflow is to let readxl guess the column types, confirm that you're happy with the guessed column types, and if not, go back and re-import specifying <code>col_types</code> as shown in <a href="#sec-reading-spreadsheets" data-type="xref">#sec-reading-spreadsheets</a>.</p>
<p>Another challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g. some cells are numeric, others text, others dates. When importing the data into R, readxl has to make some decisions. In these cases you can set the type for this column to <code>"list"</code>, which will load the column as a list of length 1 vectors, where the type of each element of the vector is guessed.</p>
</section>
<section id="data-not-in-cell-values" data-type="sect2">
<h2>
Data not in cell values</h2>
<p><strong>tidyxl</strong> is useful for importing non-tabular data from Excel files into R. For example, tidyxl doesn't coerce a pivot table into a data frame. See <a href="https://nacnudus.github.io/spreadsheet-munging-strategies/" class="uri">https://nacnudus.github.io/spreadsheet-munging-strategies/</a> for more on strategies for working with non-tabular data from Excel.</p>
</section>
<section id="writing-to-excel" data-type="sect2">
<h2>
Writing to Excel</h2>
<p>Let's create a small data frame that we can then write out. Note that <code>item</code> is a factor and <code>quantity</code> is a double.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">bake_sale &lt;- tibble(
item = factor(c("brownie", "cupcake", "cookie")),
quantity = c(10, 5, 8)
)
bake_sale
#&gt; # A tibble: 3 × 2
#&gt; item quantity
#&gt; &lt;fct&gt; &lt;dbl&gt;
#&gt; 1 brownie 10
#&gt; 2 cupcake 5
#&gt; 3 cookie 8</pre>
</div>
<p>You can write data back to disk as an Excel file using <code><a href="#chp-https://docs.ropensci.org/writexl/reference/write_xlsx" data-type="xref">#chp-https://docs.ropensci.org/writexl/reference/write_xlsx</a></code> from the <strong>writexl</strong> package.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(writexl)
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
</div>
<p><a href="#fig-bake-sale-excel" data-type="xref">#fig-bake-sale-excel</a> shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting the <code>col_names</code> and <code>format_headers</code> arguments to <code>FALSE</code>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="images/import-spreadsheets-bake-sale.png" alt="Bake sale data frame created earlier in Excel." width="917"/></p>
<figcaption class="figure-caption">Figure 20.4: Spreadsheet called bake-sale.xlsx in Excel.</figcaption>
</figure>
</div>
</div>
<p>Just like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see <a href="#sec-writing-to-a-file" data-type="xref">#sec-writing-to-a-file</a>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/bake-sale.xlsx")
#&gt; # A tibble: 3 × 2
#&gt; item quantity
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 brownie 10
#&gt; 2 cupcake 5
#&gt; 3 cookie 8</pre>
</div>
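<p>If you do need to round-trip through Excel, you have to restore the lost types yourself after reading. The following is a minimal sketch of that idea, assuming readxl and writexl are installed; it writes to a temporary file rather than <code>data/bake-sale.xlsx</code> so it can be run on its own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)
library(writexl)

bake_sale &lt;- data.frame(
  item = factor(c("brownie", "cupcake", "cookie")),
  quantity = c(10, 5, 8)
)
path &lt;- tempfile(fileext = ".xlsx")
write_xlsx(bake_sale, path)

# The factor comes back as a character vector...
bake_sale_back &lt;- read_excel(path)
class(bake_sale_back$item)

# ...so convert it again if downstream code relies on the type
bake_sale_back$item &lt;- factor(bake_sale_back$item)
class(bake_sale_back$item)</pre>
</div>
<p>Note that this only recovers the default, alphabetical level order; any custom ordering of the levels is not stored in the spreadsheet and would have to be specified again.</p>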
</section>
<section id="formatted-output" data-type="sect2">
<h2>
Formatted output</h2>
<p>The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you're interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the <strong>openxlsx</strong> package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can't be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.</p>
<p>Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the <code>penguins</code> data frame.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(openxlsx)
library(palmerpenguins)
# Create a workbook (spreadsheet)
penguins_species &lt;- createWorkbook()
# Add three sheets to the spreadsheet
addWorksheet(penguins_species, sheetName = "Adelie")
addWorksheet(penguins_species, sheetName = "Gentoo")
addWorksheet(penguins_species, sheetName = "Chinstrap")
# Write data to each sheet
writeDataTable(
penguins_species,
sheet = "Adelie",
x = penguins |&gt; filter(species == "Adelie")
)
writeDataTable(
penguins_species,
sheet = "Gentoo",
x = penguins |&gt; filter(species == "Gentoo")
)
writeDataTable(
penguins_species,
sheet = "Chinstrap",
x = penguins |&gt; filter(species == "Chinstrap")
)</pre>
</div>
<p>This creates a workbook object:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_species
#&gt; A Workbook object.
#&gt;
#&gt; Worksheets:
#&gt; Sheet 1: "Adelie"
#&gt;
#&gt;
#&gt; Sheet 2: "Gentoo"
#&gt;
#&gt;
#&gt; Sheet 3: "Chinstrap"
#&gt;
#&gt;
#&gt;
#&gt; Worksheet write order: 1, 2, 3
#&gt; Active Sheet 1: "Adelie"
#&gt; Position: 1</pre>
</div>
<p>And we can write this to disk with <code><a href="#chp-https://rdrr.io/pkg/openxlsx/man/saveWorkbook" data-type="xref">#chp-https://rdrr.io/pkg/openxlsx/man/saveWorkbook</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre>
</div>
<p>The resulting spreadsheet is shown in <a href="#fig-penguins-species" data-type="xref">#fig-penguins-species</a>. By default, openxlsx formats the data as an Excel table.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="images/import-spreadsheets-penguins-species.png" alt="A look at the penguins spreadsheet in Excel. The spreadsheet has three sheets: Adelie, Gentoo, and Chinstrap." width="1106"/></p>
<figcaption class="figure-caption">Figure 20.5: Spreadsheet called penguins-species.xlsx in Excel.</figcaption>
</figure>
</div>
</div>
<p>See <a href="https://ycphs.github.io/openxlsx/articles/Formatting.html" class="uri">https://ycphs.github.io/openxlsx/articles/Formatting.html</a> for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.</p>
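<p>The three nearly identical <code>addWorksheet()</code> and <code>writeDataTable()</code> calls in this section can also be written as a loop. The sketch below is one possible way to do that, not the only one; it assumes openxlsx and palmerpenguins are installed, uses base R subsetting in place of <code>filter()</code>, and saves to a temporary file so that it does not overwrite <code>data/penguins-species.xlsx</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(openxlsx)
library(palmerpenguins)

penguins_species &lt;- createWorkbook()

# One sheet per species, created and filled inside the loop
for (sp in c("Adelie", "Gentoo", "Chinstrap")) {
  addWorksheet(penguins_species, sheetName = sp)
  writeDataTable(
    penguins_species,
    sheet = sp,
    x = penguins[penguins$species == sp, ]
  )
}

# tempfile() keeps the sketch from touching the file created above
saveWorkbook(penguins_species, tempfile(fileext = ".xlsx"))</pre>
</div>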
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Recreate the <code>bake_sale</code> data frame, write it out to an Excel file using the <code><a href="#chp-https://rdrr.io/pkg/openxlsx/man/write.xlsx" data-type="xref">#chp-https://rdrr.io/pkg/openxlsx/man/write.xlsx</a></code> function from the openxlsx package.</li>
<li>What happens if you try to read in a file with <code>.xlsx</code> extension with <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>?</li>
</ol><!--# Need moar exercises --></section>
</section>
<section id="google-sheets" data-type="sect1">
<h1>
Google Sheets</h1>
<!--# TO DO: Write section. -->
<section id="prerequisites-1" data-type="sect2">
<h2>
Prerequisites</h2>
<p>TO DO:</p>
<ul><li>use googlesheets4</li>
<li>why 4?</li>
</ul></section>
<section id="getting-started-1" data-type="sect2">
<h2>
Getting started</h2>
<p>TO DO:</p>
<ul><li>reading from public sheet with <code>read_sheet()</code> and <code>read_range()</code>
</li>
</ul></section>
<section id="authentication" data-type="sect2">
<h2>
Authentication</h2>
</section>
<section id="read-sheets" data-type="sect2">
<h2>
Read sheets</h2>
</section>
<section id="write-sheets" data-type="sect2">
<h2>
Write sheets</h2>
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
</section>
</section>
</section>

752
oreilly/strings.html Normal file
View File

@ -0,0 +1,752 @@
<section data-type="chapter" id="chp-strings">
<h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>So far, you've used a bunch of strings without learning much about the details. Now it's time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tools you have at your disposal.</p>
<p>We'll begin with the details of creating strings and character vectors. You'll then dive into creating strings from data, then the opposite: extracting strings from data. We'll then discuss tools that work with individual letters. The chapter finishes with a brief discussion of where your expectations from English might steer you wrong when working with other languages.</p>
<p>We'll keep working with strings in the next chapter, where you'll learn more about the power of regular expressions.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in stringr 1.5.0 and tidyr 1.3.0 which are still in development. If you want to live life on the edge you can get the dev versions with <code>devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr"))</code>.</p></div>
<p>In this chapter, we'll use functions from the stringr package which is part of the core tidyverse. We'll also use the babynames data since it provides some fun strings to manipulate.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(babynames)</pre>
</div>
<p>You can easily tell when you're using a stringr function because all stringr functions start with <code>str_</code>. This is particularly useful if you use RStudio, because typing <code>str_</code> will trigger autocomplete, allowing you to jog your memory of which functions are available.</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/stringr-autocomplete.png" class="img-fluid" width="678"/></p>
</div>
</div>
</section>
</section>
<section id="creating-a-string" data-type="sect1">
<h1>
Creating a string</h1>
<p>We've created strings in passing earlier in the book, but didn't discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There's no difference in behavior between the two so in the interests of consistency the <a href="#character-vectors" data-type="xref">#character-vectors</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">string1 &lt;- "This is a string"
string2 &lt;- 'If I want to include a "quote" inside a string, I use single quotes'</pre>
</div>
<p>If you forget to close a quote, you'll see <code>+</code>, the continuation character:</p>
<pre><code>&gt; "This is a string without a closing quote
+
+
+ HELP I'M STUCK IN A STRING</code></pre>
<p>If this happens to you and you can't figure out which quote you need to close, press Escape to cancel, and try again.</p>
<section id="escapes" data-type="sect2">
<h2>
Escapes</h2>
<p>To include a literal single or double quote in a string you can use <code>\</code> to “escape” it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">double_quote &lt;- "\"" # or '"'
single_quote &lt;- '\'' # or "'"</pre>
</div>
<p>So if you want to include a literal backslash in your string, you'll need to escape it: <code>"\\"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">backslash &lt;- "\\"</pre>
</div>
<p>Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code><span data-type="footnote">Or use the base R function <code><a href="#chp-https://rdrr.io/r/base/writeLines" data-type="xref">#chp-https://rdrr.io/r/base/writeLines</a></code>.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(single_quote, double_quote, backslash)
x
#&gt; [1] "'" "\"" "\\"
str_view(x)
#&gt; [1] │ '
#&gt; [2] │ "
#&gt; [3] │ \</pre>
</div>
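<p>The same distinction can be checked with base R alone. This short sketch uses <code>writeLines()</code> to show the raw contents and <code>nchar()</code> to confirm that each escape sequence stands for a single character; the vector <code>escaped</code> is just an illustrative name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">escaped &lt;- c("'", "\"", "\\")

# writeLines() shows the raw contents, one element per line
writeLines(escaped)
#&gt; '
#&gt; "
#&gt; \

# Each string is one character long, however many characters you typed
nchar(escaped)
#&gt; [1] 1 1 1</pre>
</div>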
</section>
<section id="sec-raw-strings" data-type="sect2">
<h2>
Raw strings</h2>
<p>Creating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let's create a string that contains the contents of the code block where we define the <code>double_quote</code> and <code>single_quote</code> variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tricky &lt;- "double_quote &lt;- \"\\\"\" # or '\"'
single_quote &lt;- '\\'' # or \"'\""
str_view(tricky)
#&gt; [1] │ double_quote &lt;- "\"" # or '"'
#&gt; │ single_quote &lt;- '\'' # or "'"</pre>
</div>
<p>That's a lot of backslashes! (This is sometimes called <a href="#chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome" data-type="xref">#chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome</a>.) To eliminate the escaping you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tricky &lt;- r"(double_quote &lt;- "\"" # or '"'
single_quote &lt;- '\'' # or "'")"
str_view(tricky)
#&gt; [1] │ double_quote &lt;- "\"" # or '"'
#&gt; │ single_quote &lt;- '\'' # or "'"</pre>
</div>
<p>A raw string usually starts with <code>r"(</code> and finishes with <code>)"</code>. But if your string contains <code>)"</code> you can instead use <code>r"[]"</code> or <code>r"{}"</code>, and if that's still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. <code>r"--()--"</code>, <code>r"---()---"</code>, etc. Raw strings are flexible enough to handle any text.</p>
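<p>As a small illustration of the alternative delimiters, the sketch below builds three raw strings. The contents are invented for the example, and it requires R 4.0.0 or later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Backslashes need no escaping inside a raw string
win_path &lt;- r"(C:\Users\me\Documents)"

# This text contains )" so square brackets are used as the delimiter
brackets &lt;- r"[He typed ")" and moved on]"

# Dashes make the delimiter unique when several closing pairs appear
dashes &lt;- r"--(mix )" and ]" freely)--"

writeLines(c(win_path, brackets, dashes))
#&gt; C:\Users\me\Documents
#&gt; He typed ")" and moved on
#&gt; mix )" and ]" freely</pre>
</div>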
</section>
<section id="other-special-characters" data-type="sect2">
<h2>
Other special characters</h2>
<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code> there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You'll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="#chp-https://rdrr.io/r/base/Quotes" data-type="xref">#chp-https://rdrr.io/r/base/Quotes</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x
#&gt; [1] "one\ntwo" "one\ttwo" "µ" "😄"
str_view(x)
#&gt; [1] │ one
#&gt; │ two
#&gt; [2] │ one{\t}two
#&gt; [3] │ µ
#&gt; [4] │ 😄</pre>
</div>
<p>Note that <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there's a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.</p>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Create strings that contain the following values:</p>
<ol type="1"><li><p><code>He said "That's amazing!"</code></p></li>
<li><p><code>\a\b\c\d</code></p></li>
<li><p><code>\\\\\\</code></p></li>
</ol></li>
<li>
<p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- "This\u00a0is\u00a0tricky"</pre>
</div>
</li>
</ol></section>
</section>
<section id="creating-many-strings-from-data" data-type="sect1">
<h1>
Creating many strings from data</h1>
<p>Now that you've learned the basics of creating a string or two by “hand”, we'll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. For example, to create a greeting you might combine “Hello” with a <code>name</code> variable. We'll show you how to do this with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> and how you can use them with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. That naturally raises the question of what string functions you might use with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, so we'll finish this section with a discussion of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code> which is a summary function for strings.</p>
<section id="str_c" data-type="sect2">
<h2>
<code>str_c()</code>
</h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code><span data-type="footnote"><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> is very similar to the base <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code>. There are two main reasons we recommend it: it propagates <code>NA</code>s (rather than converting them to <code>"NA"</code>) and it uses the tidyverse recycling rules.</span> takes any number of vectors as arguments and returns a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_c("x", "y")
#&gt; [1] "xy"
str_c("x", "y", "z")
#&gt; [1] "xyz"
str_c("Hello ", c("John", "Susan"))
#&gt; [1] "Hello John" "Hello Susan"</pre>
</div>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> is designed to be used with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> so it obeys the usual rules for recycling and missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">set.seed(1410)
df &lt;- tibble(name = c(wakefield::name(3), NA))
df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Ilena Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento!
#&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; &lt;NA&gt;</pre>
</div>
<p>If you want missing values to display in some other way, use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code>. Depending on what you want, you might use it either inside or outside of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
mutate(
greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
)
#&gt; # A tibble: 4 × 3
#&gt; name greeting1 greeting2
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Ilena Hi Ilena! Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento! Hi Sacramento!
#&gt; 3 Graylon Hi Graylon! Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi you! Hi!</pre>
</div>
</section>
<section id="sec-glue" data-type="sect2">
<h2>
<code>str_glue()</code>
</h2>
<p>If you are mixing many fixed and variable strings with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="#chp-https://glue.tidyverse" data-type="xref">#chp-https://glue.tidyverse</a> package via <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="#chp-https://glue.tidyverse.org/reference/glue" data-type="xref">#chp-https://glue.tidyverse.org/reference/glue</a></code>.</span>. You give it a single string that has a special feature: anything inside <code>{}</code> will be evaluated like it’s outside of the quotes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(greeting = str_glue("Hi {name}!"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Ilena Hi Ilena!
#&gt; 2 Sacramento Hi Sacramento!
#&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi NA!</pre>
</div>
<p>As you can see, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>.</p>
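<p>As a minimal sketch of one possible workaround (our own illustration, not something built into glue), you can call <code>coalesce()</code> inside the braces so that the missing value is replaced before it is turned into text. The <code>greetings</code> tibble is hypothetical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(stringr)

# Hypothetical data with one missing name
greetings &lt;- tibble(name = c("Ilena", NA))

# coalesce() is evaluated inside the braces, so the NA is replaced
# before str_glue() converts the value to text
greetings |&gt;
  mutate(greeting = str_glue("Hi {coalesce(name, 'you')}!"))</pre>
</div>
<p>With this approach the second greeting should read “Hi you!” rather than “Hi NA!”.</p>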
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. If you guess that you’ll need to somehow escape it, you’re on the right track. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}"))
#&gt; # A tibble: 4 × 2
#&gt; name greeting
#&gt; &lt;chr&gt; &lt;glue&gt;
#&gt; 1 Ilena {Hi Ilena!}
#&gt; 2 Sacramento {Hi Sacramento!}
#&gt; 3 Graylon {Hi Graylon!}
#&gt; 4 &lt;NA&gt; {Hi NA!}</pre>
</div>
</section>
<section id="str_flatten" data-type="sect2">
<h2>
<code>str_flatten()</code>
</h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code>str_glue()</code> work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, i.e. something that always returns a single string? That’s the job of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code><span data-type="footnote">The base R equivalent is <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_flatten(c("x", "y", "z"))
#&gt; [1] "xyz"
str_flatten(c("x", "y", "z"), ", ")
#&gt; [1] "x, y, z"
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
#&gt; [1] "x, y, and z"</pre>
</div>
<p>This makes it work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~ name, ~ fruit,
"Carmen", "banana",
"Carmen", "apple",
"Marvin", "nectarine",
"Terence", "cantaloupe",
"Terence", "papaya",
"Terence", "madarine"
)
df |&gt;
group_by(name) |&gt;
summarise(fruits = str_flatten(fruit, ", "))
#&gt; # A tibble: 3 × 2
#&gt; name fruits
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 Carmen banana, apple
#&gt; 2 Marvin nectarine
#&gt; 3 Terence cantaloupe, papaya, madarine</pre>
</div>
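<p>The footnote in this section mentions <code>paste()</code> with <code>collapse</code> as the base R equivalent. The following sketch, using a made-up vector, compares the two. It also shows the one difference we’d expect: <code>str_flatten()</code> should propagate a missing value, in the same way as <code>str_c()</code>, whereas <code>paste()</code> converts it to the text <code>"NA"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

x &lt;- c("x", "y", "z")

# Without missing values, both approaches build the same single string
str_flatten(x, ", ")
paste(x, collapse = ", ")

# With a missing value the results differ: str_flatten() returns NA,
# while paste() embeds the text "NA" in the result
str_flatten(c("x", NA), ", ")
paste(c("x", NA), collapse = ", ")</pre>
</div>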
</section>
<section id="exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
<p>Compare and contrast the results of <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code> with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> for the following inputs:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])</pre>
</div>
</li>
<li>
<p>Convert the following expressions from <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> or vice versa:</p>
<ol type="a"><li><p><code>str_c("The price of ", food, " is ", price)</code></p></li>
<li><p><code>str_glue("I'm {age} years old and live in {country}")</code></p></li>
<li><p><code>str_c("\\section{", title, "}")</code></p></li>
</ol></li>
</ol></section>
</section>
<section id="extracting-data-from-strings" data-type="sect1">
<h1>
Extracting data from strings</h1>
<p>It’s very common for multiple variables to be crammed together into a single string. In this section you’ll learn how to use four tidyr functions to extract them:</p>
<ul><li><code>df |&gt; separate_longer_delim(col, delim)</code></li>
<li><code>df |&gt; separate_longer_position(col, width)</code></li>
<li><code>df |&gt; separate_wider_delim(col, delim, names)</code></li>
<li><code>df |&gt; separate_wider_position(col, widths)</code></li>
</ul><p>If you look closely you can see there’s a common pattern here: <code>separate_</code>, then <code>longer</code> or <code>wider</code>, then <code>_</code>, then <code>delim</code> or <code>position</code>. That’s because these four functions are composed from two simpler primitives:</p>
<ul><li>
<code>longer</code> makes the input data frame longer, creating new rows; <code>wider</code> makes the input data frame wider, generating new columns.</li>
<li>
<code>delim</code> splits up a string with a delimiter like <code>", "</code> or <code>" "</code>; <code>position</code> splits at specified widths, like <code>c(3, 5, 2)</code>.</li>
</ul><p>We’ll come back to the last member of this family, <code>separate_wider_regex()</code>, in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>. It’s the most flexible of the <code>wider</code> functions but you need to know something about regular expressions before you can use it.</p>
<p>The next two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. We’ll finish off by discussing the tools that the <code>wider</code> functions give you to diagnose problems.</p>
<section id="separating-into-rows" data-type="sect2">
<h2>
Separating into rows</h2>
<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case is requiring <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim</a></code> to split based on a delimiter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 &lt;- tibble(x = c("a,b,c", "d,e", "f"))
df1 |&gt;
separate_longer_delim(x, delim = ",")
#&gt; # A tibble: 6 × 1
#&gt; x
#&gt; &lt;chr&gt;
#&gt; 1 a
#&gt; 2 b
#&gt; 3 c
#&gt; 4 d
#&gt; 5 e
#&gt; 6 f</pre>
</div>
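<p>In practice the column you separate usually sits next to other columns. As a small sketch with a hypothetical <code>orders</code> tibble, the values in the other columns are repeated for each new row:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# Hypothetical data: each order lists a varying number of items
orders &lt;- tibble(
  id = 1:2,
  items = c("apple,pear", "fig")
)

# Each item gets its own row; id is repeated to match
orders |&gt;
  separate_longer_delim(items, delim = ",")</pre>
</div>
<p>Here the first order should produce two rows and the second order one row, for three rows in total.</p>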
<p>It’s rarer to see <code>separate_longer_position()</code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df2 &lt;- tibble(x = c("1211", "131", "21"))
df2 |&gt;
separate_longer_position(x, width = 1)
#&gt; # A tibble: 9 × 1
#&gt; x
#&gt; &lt;chr&gt;
#&gt; 1 1
#&gt; 2 2
#&gt; 3 1
#&gt; 4 1
#&gt; 5 1
#&gt; 6 3
#&gt; # … with 3 more rows</pre>
</div>
</section>
<section id="sec-string-columns" data-type="sect2">
<h2>
Separating into columns</h2>
<p>Separating a string into columns tends to be most useful when there is a fixed number of components in each string, and you want to spread them into columns. The <code>wider</code> functions are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> we supply the delimiter and the names in two arguments:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df3 &lt;- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |&gt;
separate_wider_delim(
x,
delim = ".",
names = c("code", "edition", "year")
)
#&gt; # A tibble: 3 × 3
#&gt; code edition year
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 a10 1 2022
#&gt; 2 b10 2 2011
#&gt; 3 e15 1 2015</pre>
</div>
<p>If a specific piece is not useful you can use an <code>NA</code> name to omit it from the results:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df3 |&gt;
separate_wider_delim(
x,
delim = ".",
names = c("code", NA, "year")
)
#&gt; # A tibble: 3 × 2
#&gt; code year
#&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 a10 2022
#&gt; 2 b10 2011
#&gt; 3 e15 2015</pre>
</div>
<p><code>separate_wider_position()</code> works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |&gt;
separate_wider_position(
x,
widths = c(year = 4, age = 2, state = 2)
)
#&gt; # A tibble: 3 × 3
#&gt; year age state
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 2022 15 TX
#&gt; 2 2021 22 LA
#&gt; 3 2023 25 CA</pre>
</div>
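<p>The example above names every width, so nothing is dropped. As a short sketch of the omission described in the text, leaving the middle width unnamed should skip those two characters and return only <code>year</code> and <code>state</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The unnamed width (2) still consumes two characters,
# but no column is created for them
df4 |&gt;
  separate_wider_position(
    x,
    widths = c(year = 4, 2, state = 2)
  )</pre>
</div>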
</section>
<section id="diagnosing-widening-problems" data-type="sect2">
<h2>
Diagnosing widening problems</h2>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code><span data-type="footnote">The same principles apply to <code>separate_wider_position()</code> and <code>separate_wider_regex()</code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z")
)
#&gt; Error in `separate_wider_delim()`:
#&gt; ! Expected 3 pieces in each element of `x`.
#&gt; ! 2 values were too short.
#&gt; Use `too_few = "debug"` to diagnose the problem.
#&gt; Use `too_few = "align_start"/"align_end"` to silence this message.</pre>
</div>
<p>You’ll notice that we get an error, but the error gives us some suggestions on how to proceed. Let’s start by debugging the problem:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">debug &lt;- df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "debug"
)
#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#&gt; `x_remainder`.
debug
#&gt; # A tibble: 5 × 6
#&gt; x y z x_ok x_pieces x_remainder
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1-1-1 1 1 TRUE 3 ""
#&gt; 2 1-1-2 1 2 TRUE 3 ""
#&gt; 3 1-3 3 &lt;NA&gt; FALSE 2 ""
#&gt; 4 1-3-2 3 2 TRUE 3 ""
#&gt; 5 1 &lt;NA&gt; &lt;NA&gt; FALSE 1 ""</pre>
</div>
<p>When you use the debug mode you get three extra columns added to the output: <code>x_ok</code>, <code>x_pieces</code>, and <code>x_remainder</code> (if you separate a variable with a different name, you’ll get a different prefix). Here, <code>x_ok</code> lets you quickly find the inputs that failed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">debug |&gt; filter(!x_ok)
#&gt; # A tibble: 2 × 6
#&gt; x y z x_ok x_pieces x_remainder
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1-3 3 &lt;NA&gt; FALSE 2 ""
#&gt; 2 1 &lt;NA&gt; &lt;NA&gt; FALSE 1 ""</pre>
</div>
<p><code>x_pieces</code> tells us how many pieces were found, compared to the expected 3 (the length of <code>names</code>). <code>x_remainder</code> isn’t useful when there are too few pieces, but we’ll see it again shortly.</p>
<p>Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove <code>too_few = "debug"</code> to ensure that new problems become errors.</p>
<p>In other cases you may just want to fill in the missing pieces with <code>NA</code>s and move on. That’s the job of <code>too_few = "align_start"</code> and <code>too_few = "align_end"</code>, which allow you to control where the <code>NA</code>s should go:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "align_start"
)
#&gt; # A tibble: 5 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 1 1
#&gt; 2 1 1 2
#&gt; 3 1 3 &lt;NA&gt;
#&gt; 4 1 3 2
#&gt; 5 1 &lt;NA&gt; &lt;NA&gt;</pre>
</div>
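<p>For comparison, here is a sketch of the other option, <code>too_few = "align_end"</code>, applied to the same data. We’d expect the pieces that were found to be pushed into the last columns, with the <code>NA</code>s appearing at the start instead:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

# Short rows are padded with NA on the left rather than the right
df |&gt;
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_end"
  )</pre>
</div>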
<p>The same principles apply if you have too many pieces:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z")
)
#&gt; Error in `separate_wider_delim()`:
#&gt; ! Expected 3 pieces in each element of `x`.
#&gt; ! 2 values were too long.
#&gt; Use `too_many = "debug"` to diagnose the problem.
#&gt; Use `too_many = "drop"/"merge"` to silence this message.</pre>
</div>
<p>But now when we debug the result, you can see the purpose of <code>x_remainder</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">debug &lt;- df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "debug"
)
#&gt; Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#&gt; `x_remainder`.
debug |&gt; filter(!x_ok)
#&gt; # A tibble: 2 × 6
#&gt; x y z x_ok x_pieces x_remainder
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1-3-5-6 3 5 FALSE 4 -6
#&gt; 2 1-3-5-7-9 3 5 FALSE 5 -7-9</pre>
</div>
<p>You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "drop"
)
#&gt; # A tibble: 5 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 1 1
#&gt; 2 1 1 2
#&gt; 3 1 3 5
#&gt; 4 1 3 2
#&gt; 5 1 3 5
df |&gt;
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "merge"
)
#&gt; # A tibble: 5 × 3
#&gt; x y z
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 1 1
#&gt; 2 1 1 2
#&gt; 3 1 3 5-6
#&gt; 4 1 3 2
#&gt; 5 1 3 5-7-9</pre>
</div>
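<p>Real data can contain both kinds of problem at once. As a sketch, assuming a hypothetical <code>mixed</code> tibble that has one row that is too short and one that is too long, the two arguments can be supplied together:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# Hypothetical input mixing complete, too-short, and too-long rows
mixed &lt;- tibble(x = c("1-1-1", "1-3", "1-3-5-6"))

# Pad short rows with NA at the end and merge any extra pieces into z
mixed |&gt;
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start",
    too_many = "merge"
  )</pre>
</div>
<p>Silencing both problems like this is only sensible once you have inspected the failures with the debug mode and decided that padding and merging are what you want.</p>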
</section>
</section>
<section id="letters" data-type="sect1">
<h1>
Letters</h1>
<p>In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.</p>
<section id="length" data-type="sect2">
<h2>
Length</h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> tells you the number of letters in the string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_length(c("a", "R for data science", NA))
#&gt; [1] 1 18 NA</pre>
</div>
<p>You could use this with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to find the distribution of lengths of US babynames, and then with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
count(length = str_length(name), wt = n)
#&gt; # A tibble: 14 × 2
#&gt; length n
#&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 2 338150
#&gt; 2 3 8589596
#&gt; 3 4 48506739
#&gt; 4 5 87011607
#&gt; 5 6 90749404
#&gt; 6 7 72120767
#&gt; # … with 8 more rows
babynames |&gt;
filter(str_length(name) == 15) |&gt;
count(name, wt = n, sort = TRUE)
#&gt; # A tibble: 34 × 2
#&gt; name n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 Franciscojavier 123
#&gt; 2 Christopherjohn 118
#&gt; 3 Johnchristopher 118
#&gt; 4 Christopherjame 108
#&gt; 5 Christophermich 52
#&gt; 6 Ryanchristopher 45
#&gt; # … with 28 more rows</pre>
</div>
</section>
<section id="subsetting" data-type="sect2">
<h2>
Subsetting</h2>
<p>You can extract parts of a string using <code>str_sub(string, start, end)</code>, where <code>start</code> and <code>end</code> are the letters where the substring should start and end. The <code>start</code> and <code>end</code> arguments are inclusive, so the length of the returned string will be <code>end - start + 1</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#&gt; [1] "App" "Ban" "Pea"</pre>
</div>
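<p>As a quick check of the <code>end - start + 1</code> rule, this sketch uses the same vector with hypothetical <code>start</code> and <code>end</code> values and compares the length of each substring with the formula:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

x &lt;- c("Apple", "Banana", "Pear")
start &lt;- 2
end &lt;- 4

# Letters 2 through 4 of each string
str_sub(x, start, end)
#&gt; [1] "ppl" "ana" "ear"

# Every substring has end - start + 1 = 3 letters
str_length(str_sub(x, start, end)) == end - start + 1
#&gt; [1] TRUE TRUE TRUE</pre>
</div>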
<p>You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_sub(x, -3, -1)
#&gt; [1] "ple" "ana" "ear"</pre>
</div>
<p>Note that <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> won’t fail if the string is too short: it will just return as much as possible:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_sub("a", 1, 5)
#&gt; [1] "a"</pre>
</div>
<p>We could use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to find the first and last letter of each name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
mutate(
first = str_sub(name, 1, 1),
last = str_sub(name, -1, -1)
)
#&gt; # A tibble: 1,924,665 × 7
#&gt; year sex name n prop first last
#&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1880 F Mary 7065 0.0724 M y
#&gt; 2 1880 F Anna 2604 0.0267 A a
#&gt; 3 1880 F Emma 2003 0.0205 E a
#&gt; 4 1880 F Elizabeth 1939 0.0199 E h
#&gt; 5 1880 F Minnie 1746 0.0179 M e
#&gt; 6 1880 F Margaret 1578 0.0162 M t
#&gt; # … with 1,924,659 more rows</pre>
</div>
</section>
<section id="long-strings" data-type="sect2">
<h2>
Long strings</h2>
<p>Sometimes the reason you care about the length of a string is because you’re trying to fit it into a label on a plot or in a table. stringr provides two useful tools for cases where your string is too long:</p>
<ul><li><p><code>str_trunc(x, 30)</code> ensures that no string is longer than 30 characters, replacing the end of any longer string with <code>...</code>.</p></li>
<li><p><code>str_wrap(x, 30)</code> wraps a string, introducing new lines so that each line is at most 30 characters (it doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line).</p></li>
</ul><p>The following code shows these functions in action with a made up string:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
str_view(str_trunc(x, 30))
#&gt; [1] │ Lorem ipsum dolor sit amet,...
str_view(str_wrap(x, 30))
#&gt; [1] │ Lorem ipsum dolor sit amet,
#&gt; │ consectetur adipiscing
#&gt; │ elit, sed do eiusmod tempor
#&gt; │ incididunt ut labore et dolore
#&gt; │ magna aliqua. Ut enim ad
#&gt; │ minim veniam, quis nostrud
#&gt; │ exercitation ullamco laboris
#&gt; │ nisi ut aliquip ex ea commodo
#&gt; │ consequat.</pre>
</div>
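<p>Judging by the output above, the <code>...</code> added by <code>str_trunc()</code> counts toward the limit. The following sketch, using hypothetical labels, checks that no result exceeds the requested width and that strings already short enough are left alone:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

# Hypothetical labels: one too long for the limit, one short enough
labels &lt;- c("R for data science", "stringr")
short &lt;- str_trunc(labels, 10)
short

# No result is longer than the requested width of 10
str_length(short) &lt;= 10

# The label that already fits is returned unchanged
short[[2]] == labels[[2]]</pre>
</div>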
</section>
<section id="exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>Use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
<li>Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?</li>
</ol></section>
</section>
<section id="sec-other-languages" data-type="sect1">
<h1>
Non-English text</h1>
<p>So far, we’ve focused on English language text, which is particularly easy to work with for two reasons. Firstly, the English alphabet is fairly simple: there are just 26 letters. Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately we don’t have room for a full treatment of non-English languages, but we want to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.</p>
<section id="encoding" data-type="sect2">
<h2>
Encoding</h2>
<p>When working with non-English text the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="#chp-https://rdrr.io/r/base/rawConversion" data-type="xref">#chp-https://rdrr.io/r/base/rawConversion</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">charToRaw("Hadley")
#&gt; [1] 48 61 64 6c 65 79</pre>
</div>
<p>Each of these six hexadecimal numbers represents one letter: <code>48</code> is H, <code>61</code> is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the <strong>American</strong> Standard Code for Information Interchange.</p>
<p>Things aren’t so easy for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte <code>b1</code> is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emojis.</p>
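<p>To make the idea of competing encodings concrete, here is a small sketch that looks at the bytes behind the same character, “ñ”, in two encodings. It relies on base R’s <code>iconv()</code>, which is our choice for illustration, and the variable name <code>enye</code> is hypothetical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># "\u00f1" is the letter ñ; strings written with \u escapes are stored as UTF-8
enye &lt;- "\u00f1"

# In UTF-8 the letter takes two bytes
charToRaw(enye)
#&gt; [1] c3 b1

# Converted to Latin1 it takes a single byte
charToRaw(iconv(enye, from = "UTF-8", to = "latin1"))
#&gt; [1] f1</pre>
</div>
<p>The same letter is stored as different bytes depending on the encoding, which is why software has to know which encoding a file uses before it can display the text correctly.</p>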
<p>readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish. For example, here are two inline CSVs with unusual encodings<span data-type="footnote">Here we’re using the special <code>\x</code> to encode binary data directly into a string.</span>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)
#&gt; # A tibble: 1 × 1
#&gt; text
#&gt; &lt;chr&gt;
#&gt; 1 "El Ni\xf1o was particularly bad this year"
x2 &lt;- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
read_csv(x2)
#&gt; # A tibble: 1 × 1
#&gt; text
#&gt; &lt;chr&gt;
#&gt; 1 "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"</pre>
</div>
<p>To read these correctly you specify the encoding via the <code>locale</code> argument:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv(x1, locale = locale(encoding = "Latin1"))
#&gt; # A tibble: 1 × 1
#&gt; text
#&gt; &lt;chr&gt;
#&gt; 1 El Niño was particularly bad this year
read_csv(x2, locale = locale(encoding = "Shift-JIS"))
#&gt; # A tibble: 1 × 1
#&gt; text
#&gt; &lt;chr&gt;
#&gt; 1 こんにちは</pre>
</div>
<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="#chp-https://readr.tidyverse.org/reference/encoding" data-type="xref">#chp-https://readr.tidyverse.org/reference/encoding</a></code> to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">guess_encoding(x1)
#&gt; # A tibble: 1 × 2
#&gt; encoding confidence
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 ISO-8859-1 0.41
guess_encoding(x2)
#&gt; # A tibble: 1 × 2
#&gt; encoding confidence
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 KOI8-R 0.27</pre>
</div>
<p>Encodings are a rich and complex topic, and we’ve only scratched the surface here. If you’d like to learn more we recommend reading the detailed explanation at <a href="http://kunststube.net/encoding/" class="uri">http://kunststube.net/encoding/</a>.</p>
</section>
<section id="letter-variations" data-type="sect2">
<h2>
Letter variations</h2>
<p>If you’re working with individual letters (e.g. with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code>), there’s an important challenge in languages that use accents: an accented letter might be represented as a single character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">u &lt;- c("\u00fc", "u\u0308")
str_view(u)
#&gt; [1] │ ü
#&gt; [2] │ ü</pre>
</div>
<p>But they have different lengths and the first characters are different:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_length(u)
#&gt; [1] 1 2
str_sub(u, 1, 1)
#&gt; [1] "ü" "u"</pre>
</div>
<p>Finally, note that comparing these strings with <code>==</code> reports them as different, whereas stringr’s handy <code><a href="#chp-https://stringr.tidyverse.org/reference/str_equal" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_equal</a></code> function recognizes that both have the same appearance:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">u[[1]] == u[[2]]
#&gt; [1] FALSE
str_equal(u[[1]], u[[2]])
#&gt; [1] TRUE</pre>
</div>
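<p>If you need the two representations to behave identically in every later operation, one option is to normalize them first. This is a sketch that goes beyond what the chapter covers: it assumes the stringi package (which stringr is built on) is installed, and uses its <code>stri_trans_nfc()</code> function to convert both strings to the composed (NFC) form.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringi)

u &lt;- c("\u00fc", "u\u0308")

# Normalize both strings to NFC, the composed form
u_nfc &lt;- stri_trans_nfc(u)

# After normalization both strings are a single character and compare equal
nchar(u_nfc)
#&gt; [1] 1 1
u_nfc[[1]] == u_nfc[[2]]
#&gt; [1] TRUE</pre>
</div>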
</section>
<section id="locale-dependent-function" data-type="sect2">
<h2>
Locale-dependent functions</h2>
<p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="#chp-https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes" data-type="xref">#chp-https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="#chp-https://rdrr.io/pkg/stringi/man/stri_locale_list" data-type="xref">#chp-https://rdrr.io/pkg/stringi/man/stri_locale_list</a></code>.</p>
<p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to using English rules, by using the “en” locale, and requires you to specify the <code>locale</code> argument to override it. Fortunately, there are only two sets of functions where the locale really matters: changing case and sorting.</p>
<p>The rules for changing case are not the same in every language. For example, Turkish has two i’s: with and without a dot, and it capitalizes them in a different way to English:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_to_upper(c("i", "ı"))
#&gt; [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#&gt; [1] "İ" "I"</pre>
</div>
<p>Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language<span data-type="footnote">Sorting in languages that don’t have an alphabet, like Chinese, is more complicated still.</span>! Here’s an example: in Czech, “ch” is a compound letter that appears after <code>h</code> in the alphabet.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_sort(c("a", "c", "ch", "h", "z"))
#&gt; [1] "a" "c" "ch" "h" "z"
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
#&gt; [1] "a" "c" "h" "ch" "z"</pre>
</div>
<p>This also comes up when sorting strings with <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, which is why it also has a <code>.locale</code> argument.</p>
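<p>As a rough sketch of what that looks like in a data frame, the example below sorts the same five strings with dplyr. It assumes a recent dplyr (1.1.0 or later), where the argument is spelled <code>.locale</code>, and that stringi is installed; the tibble <code>df</code> is made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# A made-up data frame holding the same strings as the str_sort() example
df &lt;- tibble(x = c("a", "c", "ch", "h", "z"))

# .locale = "cs" asks for Czech ordering, so "ch" sorts after "h"
df |&gt;
  arrange(x, .locale = "cs") |&gt;
  pull(x)
#&gt; [1] "a"  "c"  "h"  "ch" "z"</pre>
</div>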
</section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you’ve learned about some of the power of the stringr package: you learned how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise, but very expressive, language for describing patterns within strings, and are the topic of the next chapter.</p>
</section>
</section>

17
oreilly/transform.html Normal file

@ -0,0 +1,17 @@
<div data-type="part">
<h1><span id="sec-transform-intro" class="quarto-section-identifier d-none d-lg-block">Transform</span></h1><p>After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science. Now it’s time to start diving into the details. In this part of the book, you’ll learn about the most important types of variables that you’ll encounter inside a data frame and learn the tools you can use to work with them.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-transform"><p><img src="diagrams/data-science/transform.png" alt="Our data science model transform, highlighted in blue. " width="535"/></p>
<figcaption>Figure 1: The options for data transformation depend heavily on the type of data involved, the subject of this part of the book.</figcaption>
</figure>
</div>
</div><p>You can read these chapters as you need them; they’re designed to be largely standalone so that they can be read out of order.</p><ul><li><p><a href="#chp-logicals" data-type="xref">#chp-logicals</a> teaches you about logical vectors. These are the simplest type of vector, but are extremely powerful. You’ll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for conditional transformations.</p></li>
<li><p><a href="#chp-numbers" data-type="xref">#chp-numbers</a> dives into tools for vectors of numbers, the powerhouse of data science. You’ll learn more about counting and a bunch of important transformation and summary functions.</p></li>
<li><p><a href="#chp-strings" data-type="xref">#chp-strings</a> will give you the tools to work with strings: you’ll slice them, you’ll dice them, and you’ll stick them back together again. This chapter mostly focuses on the stringr package, but you’ll also learn some more tidyr functions devoted to extracting data from strings.</p></li>
<li><p><a href="#chp-regexps" data-type="xref">#chp-regexps</a> introduces you to regular expressions, a powerful tool for manipulating strings. This chapter will take you from thinking that a cat walked over your keyboard to reading and writing complex string patterns.</p></li>
<li><p><a href="#chp-factors" data-type="xref">#chp-factors</a> introduces factors: the data type that R uses to store categorical data. You use a factor when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.</p></li>
<li><p><a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> will give you the key tools for working with dates and date-times. Unfortunately, the more you learn about date-times, the more complicated they seem to get, but with the help of the lubridate package, you’ll learn how to overcome the most common challenges.</p></li>
<li><p><a href="#chp-missing-values" data-type="xref">#chp-missing-values</a> discusses missing values in depth. We’ve discussed them a couple of times in isolation, but now it’s time to discuss them holistically, helping you come to grips with the difference between implicit and explicit missing values, and how and why you might convert between them.</p></li>
<li><p><a href="#chp-joins" data-type="xref">#chp-joins</a> finishes up this part of the book by giving you tools to join two (or more) data frames together. Learning about joins will force you to grapple with the idea of keys, and think about how you identify each row in a dataset.</p></li>
</ul></div>

10
oreilly/webscraping.html Normal file

@ -0,0 +1,10 @@
<section data-type="chapter" id="chp-webscraping">
<h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><div data-type="important"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
</section>

14
oreilly/whole-game.html Normal file

@ -0,0 +1,14 @@
<div data-type="part">
<h1><span id="sec-whole-game-intro" class="quarto-section-identifier d-none d-lg-block">Whole game</span></h1><p>Our goal in this part of the book is to give you a rapid overview of the main tools of data science: <strong>importing</strong>, <strong>tidying</strong>, <strong>transforming</strong>, and <strong>visualizing data</strong>, as shown in <a href="#fig-ds-whole-game" data-type="xref">#fig-ds-whole-game</a>. We want to show you the “whole game” of data science, giving you just enough of all the major pieces so that you can tackle real, if simple, data sets. The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-whole-game"><p><img src="diagrams/data-science/whole-game.png" alt="A diagram displaying the data science cycle: Import -&gt; Tidy -&gt; Understand (which has the phases Transform -&gt; Visualize -&gt; Model in a cycle) -&gt; Communicate. Surrounding all of these is Program. Import, Tidy, Transform, and Visualize are highlighted." width="535"/></p>
<figcaption>Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.</figcaption>
</figure>
</div>
</div><p>Five chapters focus on the tools of data science:</p><ul><li><p>Visualization is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data. In <a href="#chp-data-visualize" data-type="xref">#chp-data-visualize</a> you’ll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.</p></li>
<li><p>Visualization alone is typically not enough, so in <a href="#chp-data-transform" data-type="xref">#chp-data-transform</a>, you’ll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.</p></li>
<li><p>In <a href="#chp-data-tidy" data-type="xref">#chp-data-tidy</a>, you’ll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier. You’ll learn the underlying principles, and how to get your data into a tidy form.</p></li>
<li><p>Before you can transform and visualize your data, you need to first get your data into R. In <a href="#chp-data-import" data-type="xref">#chp-data-import</a> you’ll learn the basics of getting <code>.csv</code> files into R.</p></li>
<li><p>Finally, in <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you’ll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.</p></li>
</ul><p>Nestled among these chapters are five other chapters that focus on your R workflow. In <a href="#chp-workflow-basics" data-type="xref">#chp-workflow-basics</a>, <a href="#chp-workflow-pipes" data-type="xref">#chp-workflow-pipes</a>, <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, and <a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>, you’ll learn good workflow practices for writing and organizing your R code. These will set you up for success in the long run, as they’ll give you the tools to stay organised when you tackle real projects.</p></div>


@ -0,0 +1,161 @@
<section data-type="chapter" id="chp-workflow-basics">
<h1><span id="sec-workflow-basics" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: basics</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in the fact that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.</p><p>Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.</p>
<section id="coding-basics" data-type="sect1">
<h1>
Coding basics</h1>
<p>Let’s review some basics we’ve so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">1 / 200 * 30
#&gt; [1] 0.15
(59 + 73 + 2) / 3
#&gt; [1] 44.66667
sin(pi / 2)
#&gt; [1] 1</pre>
</div>
<p>You can create new objects with the assignment operator <code>&lt;-</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 3 * 4</pre>
</div>
<p>You can <strong>c</strong>ombine multiple elements into a vector with <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">#chp-https://rdrr.io/r/base/c</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">primes &lt;- c(2, 3, 5, 7, 11, 13)</pre>
</div>
<p>And basic arithmetic is applied to every element of the vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">primes * 2
#&gt; [1] 4 6 10 14 22 26
primes - 1
#&gt; [1] 1 2 4 6 10 12</pre>
</div>
<p>All R statements where you create objects, <strong>assignment</strong> statements, have the same form:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">object_name &lt;- value</pre>
</div>
<p>When reading that code, say “object name gets value” in your head.</p>
<p>You will make lots of assignments and <code>&lt;-</code> is a pain to type. You can save time with RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automatically surrounds <code>&lt;-</code> with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.</p>
</section>
<section id="comments" data-type="sect1">
<h1>
Comments</h1>
<p>R will ignore any text after <code>#</code>. This allows you to write <strong>comments</strong>, text that is ignored by R but read by other humans. We’ll sometimes include comments in examples explaining what’s happening with the code.</p>
<p>Comments can be helpful for briefly describing what the subsequent code does.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># define primes
primes &lt;- c(2, 3, 5, 7, 11, 13)
# multiply primes by 2
primes * 2
#&gt; [1] 4 6 10 14 22 26</pre>
</div>
<p>With short pieces of code like this, it might not be necessary to leave a comment for every single line of code. But as the code you’re writing gets more complex, comments can save you (and your collaborators) a lot of time in figuring out what was done in the code.</p>
<p>Use comments to explain the <em>why</em> of your code, not the <em>how</em> or the <em>what</em>. The <em>what</em> and <em>how</em> of your code are always possible to figure out, even if it might be tedious, by carefully reading the code. But if you describe the “what” in your comments and your code, you’ll have to remember to carefully update the comment and code in tandem. If you change the code and forget to update the comment, they’ll be inconsistent, which will lead to confusion when you come back to your code in the future.</p>
<p>Figuring out <em>why</em> something was done is much more difficult, if not impossible. For example, <code>geom_smooth()</code> has an argument called <code>span</code>, which controls the smoothness of the curve, with larger values yielding a smoother curve. Suppose you decide to change the value of <code>span</code> from its default of 0.75 to 0.3: it’s easy for a future reader to understand <em>what</em> is happening, but unless you note your thinking in a comment, no one will understand <em>why</em> you changed the default.</p>
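<p>A minimal sketch of such a comment might look like the following. The plot uses the <code>mpg</code> data that ships with ggplot2, and the stated reason for changing <code>span</code> is hypothetical, invented only to show the style of comment.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# Hypothetical reason, for illustration only: use a smaller span than the
# default (0.75) so the smoother follows the rise in hwy at large displ
# instead of flattening it out.
p &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", span = 0.3)
p</pre>
</div>
<p>The comment says nothing about <em>what</em> <code>span = 0.3</code> does, since the code already shows that; it only records the reasoning that would otherwise be lost.</p>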
<p>For data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them. There’s no way to re-capture this knowledge from the code itself.</p>
</section>
<section id="sec-whats-in-a-name" data-type="sect1">
<h1>
What’s in a name?</h1>
<p>Object names must start with a letter, and can only contain letters, numbers, <code>_</code> and <code>.</code>. You want your object names to be descriptive, so you’ll need to adopt a convention for multiple words. We recommend <strong>snake_case</strong> where you separate lowercase words with <code>_</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention</pre>
</div>
<p>We’ll come back to names again when we talk more about code style in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>.</p>
<p>You can inspect an object by typing its name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x
#&gt; [1] 12</pre>
</div>
<p>Make another assignment:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">this_is_a_really_long_name &lt;- 2.5</pre>
</div>
<p>To inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.</p>
<p>Ooops, you made a mistake! The value of <code>this_is_a_really_long_name</code> should be 3.5, not 2.5. Use another keyboard shortcut to help you fix it. Type “this” then press Cmd/Ctrl + ↑. Doing so will list all the commands you’ve typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.</p>
<p>Make yet another assignment:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">r_rocks &lt;- 2 ^ 3</pre>
</div>
<p>Let’s try to inspect it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">r_rock
#&gt; Error: object 'r_rock' not found
R_rocks
#&gt; Error: object 'R_rocks' not found</pre>
</div>
<p>This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions. Typos matter; R can’t read your mind and say “oh, they probably meant <code>r_rocks</code> when they typed <code>r_rock</code>”. Case matters; similarly R can’t read your mind and say “oh, they probably meant <code>r_rocks</code> when they typed <code>R_rocks</code>”.</p>
</section>
<section id="calling-functions" data-type="sect1">
<h1>
Calling functions</h1>
<p>R has a large collection of built-in functions that are called like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">function_name(arg1 = val1, arg2 = val2, ...)</pre>
</div>
<p>Let’s try using <code><a href="#chp-https://rdrr.io/r/base/seq" data-type="xref">#chp-https://rdrr.io/r/base/seq</a></code>, which makes regular <strong>seq</strong>uences of numbers and, while we’re at it, learn more helpful features of RStudio. Type <code>se</code> and hit TAB. A popup shows you possible completions. Specify <code><a href="#chp-https://rdrr.io/r/base/seq" data-type="xref">#chp-https://rdrr.io/r/base/seq</a></code> by typing more (a <code>q</code>) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.</p>
<p>When you’ve selected the function you want, press TAB again. RStudio will add matching opening (<code>(</code>) and closing (<code>)</code>) parentheses for you. Type the arguments <code>1, 10</code> and hit return.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">seq(1, 10)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10</pre>
</div>
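<p>The call above relies on argument position, while the general form <code>function_name(arg1 = val1, arg2 = val2, ...)</code> names each argument. As a small sketch using only base R’s <code>seq()</code> and its documented <code>from</code>, <code>to</code>, <code>by</code>, and <code>length.out</code> arguments, the two styles compare like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Positional matching: the first two arguments are `from` and `to`
by_position &lt;- seq(1, 10)

# Naming the arguments makes the intent explicit
by_name &lt;- seq(from = 1, to = 10, by = 2)
by_name
#&gt; [1] 1 3 5 7 9

# Named arguments can be supplied in any order
any_order &lt;- seq(to = 10, from = 1, length.out = 4)
any_order
#&gt; [1]  1  4  7 10</pre>
</div>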
<p>Type this code and notice that RStudio provides similar assistance with the paired quotation marks:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- "hello world"</pre>
</div>
<p>Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:</p>
<pre><code>&gt; x &lt;- "hello
+</code></pre>
<p>The <code>+</code> tells you that R is waiting for more input; it doesn’t think you’re done yet. Usually, this means you’ve forgotten either a <code>"</code> or a <code>)</code>. Either add the missing pair, or press ESCAPE to abort the expression and try again.</p>
<p>Note that the environment tab in the upper right pane displays all of the objects that you’ve created:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-env.png" class="img-fluid" alt="Environment tab of RStudio which shows r_rocks, this_is_a_really_long_name, x, and y in the Global Environment." width="597"/></p>
</div>
</div>
</section>
<section id="exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li>
<p>Why does this code not work?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">my_variable &lt;- 10
my_varıable
#&gt; Error in eval(expr, envir, enclos): object 'my_varıable' not found</pre>
</div>
<p>Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)</p>
</li>
<li>
<p>Tweak each of the following R commands so that they run correctly:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">libary(tidyverse)
ggplot(dota = mpg) +
geom_point(maping = aes(x = displ, y = hwy))</pre>
</div>
</li>
<li><p>Press Alt + Shift + K. What happens? How can you get to the same place using the menus?</p></li>
</ol></section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>You’ve now learned a little more about how R code works, along with some tips to help you understand your code when you come back to it in the future. In the next chapter, we’ll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it’s selecting important variables, filtering down to rows of interest, or computing summary statistics.</p>
</section>
</section>


@ -0,0 +1,81 @@
<section data-type="chapter" id="chp-workflow-help">
<h1><span id="sec-workflow-getting-help" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Getting help</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.</p>
<section id="google-is-your-friend" data-type="sect1">
<h1>
Google is your friend</h1>
<p>If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run <code>Sys.setenv(LANGUAGE = "en")</code> and re-run the code; you’re more likely to find help for English error messages.)</p>
<p>If Google doesn’t help, try <a href="#chp-https://stackoverflow" data-type="xref">#chp-https://stackoverflow</a>. Start by spending a little time searching for an existing answer, including <code>[R]</code> to restrict your search to questions and answers that use R.</p>
</section>
<section id="making-a-reprex" data-type="sect1">
<h1>
Making a reprex</h1>
<p>If your googling doesn’t find anything useful, it’s a really good idea to prepare a <strong>reprex,</strong> short for minimal <strong>repr</strong>oducible <strong>ex</strong>ample. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:</p>
<ul><li><p>First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code> calls and create all necessary objects. The easiest way to make sure you’ve done this is to use the reprex package.</p></li>
<li><p>Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.</p></li>
</ul><p>That sounds like a lot of work! And it can be, but it has a great payoff:</p>
<ul><li><p>80% of the time, creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.</p></li>
<li><p>The other 20% of the time you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!</p></li>
</ul><p>When creating a reprex by hand, it’s easy to accidentally miss something that means your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package, which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y &lt;- 1:4
mean(y)</pre>
</div>
<p>Then call <code>reprex()</code>, where the default target venue is GitHub:</p>
<pre data-type="programlisting" data-code-language="downlit">reprex::reprex()</pre>
<p>A nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The relevant bit of GitHub-flavored Markdown is ready to be pasted from your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):</p>
<pre><code>``` r
y &lt;- 1:4
mean(y)
#&gt; [1] 2.5
```</code></pre>
<p>Here’s what that Markdown would look like rendered in a GitHub issue:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y &lt;- 1:4
mean(y)
#&gt; [1] 2.5</pre>
</div>
<p>Anyone else can copy, paste, and run this immediately.</p>
<p>There are three things you need to include to make your example reproducible: required packages, data, and code.</p>
<ol type="1"><li><p><strong>Packages</strong> should be loaded at the top of the script, so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; it’s possible you’ve discovered a bug that’s been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run <code>tidyverse_update()</code>.</p></li>
<li>
<p>The easiest way to include <strong>data</strong> is to use <code><a href="#chp-https://rdrr.io/r/base/dput" data-type="xref">#chp-https://rdrr.io/r/base/dput</a></code> to generate the R code needed to recreate it. For example, to recreate the <code>mtcars</code> dataset in R, perform the following steps:</p>
<ol type="1"><li>Run <code>dput(mtcars)</code> in R</li>
<li>Copy the output</li>
<li>In reprex, type <code>mtcars &lt;-</code> then paste.</li>
</ol><p>Try and find the smallest subset of your data that still reveals the problem.</p>
</li>
<li>
<p>Spend a little bit of time ensuring that your <strong>code</strong> is easy for others to read:</p>
<ul><li><p>Make sure you’ve used spaces and your variable names are concise, yet informative.</p></li>
<li><p>Use comments to indicate where your problem lies.</p></li>
<li><p>Do your best to remove everything that is not related to the problem.</p></li>
</ul><p>The shorter your code is, the easier it is to understand, and the easier it is to fix.</p>
</li>
</ol><p>Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.</p>
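<p>The <code>dput()</code> steps for including data can be sketched as follows. The three-row, two-column subset and the object names <code>small</code> and <code>recreated</code> are assumptions made for illustration; in practice you would paste whatever <code>dput()</code> prints for your own data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Keep only the rows and columns needed to show the problem
small &lt;- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that rebuilds the object; copy that output
dput(small)

# In the reprex, assign the pasted dput() output to a name
recreated &lt;- structure(
  list(mpg = c(21, 21, 22.8), cyl = c(6, 6, 4)),
  row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"),
  class = "data.frame"
)

# The rebuilt object matches the original subset
all.equal(small, recreated)
#&gt; [1] TRUE</pre>
</div>
<p>Subsetting before calling <code>dput()</code> keeps the pasted code short, which makes the reprex easier for others to read.</p>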
</section>
<section id="investing-in-yourself" data-type="sect1">
<h1>
Investing in yourself</h1>
<p>You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the <a href="#chp-https://www.tidyverse.org/blog/" data-type="xref">#chp-https://www.tidyverse.org/blog/</a>. To keep up with the R community more broadly, we recommend reading <a href="#chp-https://rweekly" data-type="xref">#chp-https://rweekly</a>: it’s a community effort to aggregate the most interesting news in the R community each week.</p>
<p>If you’re an active Twitter user, you might also want to follow Hadley (<a href="#chp-https://twitter.com/hadleywickham" data-type="xref">#chp-https://twitter.com/hadleywickham</a>), Mine (<a href="#chp-https://twitter.com/minebocek" data-type="xref">#chp-https://twitter.com/minebocek</a>), Garrett (<a href="#chp-https://twitter.com/statgarrett" data-type="xref">#chp-https://twitter.com/statgarrett</a>), or follow <a href="#chp-https://twitter.com/rstudiotips" data-type="xref">#chp-https://twitter.com/rstudiotips</a> to keep up with new features in the IDE. If you want the full fire hose of new developments, you can also read the (<a href="#chp-https://twitter.com/search?q=%23rstats" data-type="xref">#chp-https://twitter.com/search?q=%23rstats</a>) hashtag. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.</p>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter concludes the Whole Game part of the book. You’ve now seen the most important parts of the data science process: visualization, transformation, tidying and importing. Now you’ve got a holistic view of the whole process, and we can start to get into the details of small pieces.</p>
<p>The next part of the book, Transform, goes into depth on the different types of variables that you might encounter: logical vectors, numbers, strings, factors, and date-times, and covers important related topics like tibbles, regular expressions, missing values, and joins. There’s no need to read these chapters in order; dip in and out as needed for the specific data that you’re working with.</p>
</section>
</section>

106
oreilly/workflow-pipes.html Normal file

@ -0,0 +1,106 @@
<section data-type="chapter" id="chp-workflow-pipes">
<h1><span id="sec-workflow-pipes" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Pipes</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>The pipe, <code>|&gt;</code>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going much further, we want to give a few more details and discuss <code>%&gt;%</code>, a predecessor to <code>|&gt;</code>.</p><p>To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use <code>|&gt;</code> instead of <code>%&gt;%</code>, as shown in <a href="#fig-pipe-options" data-type="xref">#fig-pipe-options</a>; more on <code>%&gt;%</code> shortly.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-pipe-options"><p><img src="screenshots/rstudio-pipe-options.png" alt="Screenshot showing the &quot;Use native pipe operator&quot; option which can be found on the &quot;Editing&quot; panel of the &quot;Code&quot; options." width="616"/></p>
<figcaption>Figure 5.1: To insert <code>|&gt;</code>, make sure the “Use native pipe operator” option is checked.</figcaption>
</figure>
</div>
</div>
<section id="why-use-a-pipe" data-type="sect1">
<h1>
Why use a pipe?</h1>
<p>Each individual dplyr verb is quite simple, so solving complex problems typically requires combining multiple verbs. For example, the last chapter finished with a moderately complex pipe:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
  group_by(tailnum) |&gt;
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )</pre>
</div>
<p>Even though this pipe has four steps, it’s easy to skim because the verbs come at the start of each line: start with the <code>flights</code> data, then filter, then group, then summarize.</p>
<p>What would happen if we didn’t have the pipe? We could nest each function call inside the previous call:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summarise(
  group_by(
    filter(
      flights,
      !is.na(arr_delay), !is.na(tailnum)
    ),
    tailnum
  ),
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
)</pre>
</div>
<p>Or we could use a bunch of intermediate variables:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights1 &lt;- filter(flights, !is.na(arr_delay), !is.na(tailnum))
flights2 &lt;- group_by(flights1, tailnum)
flights3 &lt;- summarise(flights2,
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
)</pre>
</div>
<p>While both of these forms have their time and place, the pipe generally produces data analysis code that’s both easier to write and easier to read.</p>
</section>
<section id="magrittr-and-the-pipe" data-type="sect1">
<h1>
magrittr and the <code>%&gt;%</code> pipe</h1>
<p>If you’ve been using the tidyverse for a while, you might be familiar with the <code>%&gt;%</code> pipe provided by the <strong>magrittr</strong> package. The magrittr package is included in the core tidyverse, so you can use <code>%&gt;%</code> whenever you load the tidyverse:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
mtcars %&gt;%
  group_by(cyl) %&gt;%
  summarise(n = n())
#&gt; # A tibble: 3 × 2
#&gt;     cyl     n
#&gt;   &lt;dbl&gt; &lt;int&gt;
#&gt; 1     4    11
#&gt; 2     6     7
#&gt; 3     8    14</pre>
</div>
<p>For simple cases, <code>|&gt;</code> and <code>%&gt;%</code> behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, <code>|&gt;</code> is quite a bit simpler than <code>%&gt;%</code>: in the time between the invention of <code>%&gt;%</code> in 2014 and the inclusion of <code>|&gt;</code> in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features.</p>
</section>
<section id="vs" data-type="sect1">
<h1>
<code>|&gt;</code> vs <code>%&gt;%</code>
</h1>
<p>While <code>|&gt;</code> and <code>%&gt;%</code> behave identically for simple cases, there are a few important differences. These are most likely to affect you if you’re a long-term user of <code>%&gt;%</code> who has taken advantage of some of the more advanced features. But they’re still good to know about even if you’ve never used <code>%&gt;%</code> because you’re likely to encounter some of them when reading wild-caught code.</p>
<ul><li><p>By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. <code>%&gt;%</code> allows you to change the placement with a <code>.</code> placeholder. For example, <code>x %&gt;% f(1)</code> is equivalent to <code>f(x, 1)</code> but <code>x %&gt;% f(1, .)</code> is equivalent to <code>f(1, x)</code>. R 4.2.0 added a <code>_</code> placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, <code>x |&gt; f(1, y = _)</code> is equivalent to <code>f(1, y = x)</code>.</p></li>
<li>
<p>The <code>|&gt;</code> placeholder is deliberately simple and can’t replicate many features of the <code>%&gt;%</code> placeholder: you can’t pass it to multiple arguments, and it doesn’t have any special behavior when the placeholder is used inside another function. With <code>%&gt;%</code>, by contrast, <code>df %&gt;% split(.$var)</code> is equivalent to <code>split(df, df$var)</code> and <code>df %&gt;% {split(.$x, .$y)}</code> is equivalent to <code>split(df$x, df$y)</code>.</p>
<p>With <code>%&gt;%</code> you can use <code>.</code> on the left-hand side of operators like <code>$</code>, <code>[[</code>, <code>[</code> (which you’ll learn about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>), so you can extract a single column from a data frame with (e.g.) <code>mtcars %&gt;% .$cyl</code>. A future version of R may add similar support for <code>|&gt;</code> and <code>_</code>. For the special case of extracting a column out of a data frame, you can also use <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">pull()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mtcars |&gt; pull(cyl)
#&gt; [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4</pre>
</div>
</li>
<li><p><code>%&gt;%</code> allows you to drop the parentheses when calling a function with no other arguments; <code>|&gt;</code> always requires the parentheses.</p></li>
<li><p><code>%&gt;%</code> allows you to start a pipe with <code>.</code> to create a function rather than immediately executing the pipe; this is not supported by the base pipe.</p></li>
</ul><p>Luckily there’s no need to commit entirely to one pipe or the other: you can use the base pipe for the majority of cases where it’s sufficient, and use the magrittr pipe when you really need its special features.</p>
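<p>The placeholder rules from the list above can be tried out with a small sketch. The helper <code>f()</code> and the value <code>x</code> are hypothetical names invented for this illustration, and the <code>_</code> placeholder assumes R 4.2.0 or later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper, defined only for this illustration
f &lt;- function(a, y) paste(a, y)

x &lt;- "x"

# By default the left-hand side becomes the first argument: f(x, 1)
first_arg &lt;- x |&gt; f(1)

# The base placeholder must be passed to a named argument: f(1, y = x)
named_arg &lt;- x |&gt; f(1, y = _)

# A common use: supplying the data argument of a modelling function
fit &lt;- mtcars |&gt; lm(mpg ~ wt, data = _)</pre>
</div>
<p>The last call is probably the most frequent reason to reach for the placeholder: <code>lm()</code> takes the formula first, so the data frame has to be routed to the named <code>data</code> argument.</p>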
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned more about the pipe: why we recommend it and some of the history that led to <code>|&gt;</code>. The pipe is important because you’ll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.</p>
<p>In the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy and most datasets that you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.</p>
</section>
</section>


@ -0,0 +1,236 @@
<section data-type="chapter" id="chp-workflow-scripts">
<h1><span id="sec-workflow-scripts-projects" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: scripts and projects</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>This chapter will introduce you to two very important tools for organizing your code: scripts and projects.</p>
<section id="scripts" data-type="sect1">
<h1>
Scripts</h1>
<p>So far, you have used the console to run code. That’s a great place to start, but you’ll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it up by clicking the File menu, selecting New File, then R script, or by using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, as in <a href="#fig-rstudio-script" data-type="xref">#fig-rstudio-script</a>. The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-new-project-3"><p><img src="diagrams/rstudio/script.png" alt="RStudio IDE with Editor, Console, and Output highlighted." width="521"/></p>
<figcaption>Figure 9.1: Opening the script editor adds a new pane at the top-left of the IDE.</figcaption>
</figure>
</div>
</div>
<section id="running-code" data-type="sect2">
<h2>
Running code</h2>
<p>The script editor is a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below. If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates <code>not_cancelled</code>. It will also move the cursor to the next statement (beginning with <code>not_cancelled |&gt;</code>). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

not_cancelled &lt;- flights |&gt;
  filter(!is.na(dep_delay)█, !is.na(arr_delay))

not_cancelled |&gt;
  group_by(year, month, day) |&gt;
  summarize(mean = mean(dep_delay))</pre>
</div>
<p>Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you’ve captured all the important parts of your code in the script.</p>
<p>We recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include <code><a href="#chp-https://rdrr.io/r/utils/install.packages" data-type="xref">install.packages()</a></code> in a script that you share. It’s very antisocial to change settings on someone else’s computer!</p>
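<p>One way to act on this advice is sketched below, under stated assumptions: the package list is illustrative, and <code>is_installed</code> and <code>not_installed</code> are names invented for this example. The script reports which packages a collaborator still needs, without installing anything on their behalf.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Packages this script depends on (illustrative list)
needed &lt;- c("dplyr", "nycflights13")

# requireNamespace() returns FALSE when a package is not installed;
# it never installs anything
is_installed &lt;- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed &lt;- needed[!is_installed]

# Tell the reader what to install instead of installing it for them
if (length(not_installed) &gt; 0) {
  message("Please install: ", paste(not_installed, collapse = ", "))
}</pre>
</div>
<p>For most scripts the plain <code>library()</code> calls at the top are enough; a check like this is only worth adding when you expect readers who may be missing several packages.</p>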
<p>When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.</p>
</section>
<section id="rstudio-diagnostics" data-type="sect2">
<h2>
RStudio diagnostics</h2>
<p>In the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-diagnostic.png" alt="Script editor with the script x y &lt;- 10. A red X indicates that there is a syntax error. The syntax error is also highlighted with a red squiggly line." width="148"/></p>
</div>
</div>
<p>Hover over the cross to see what the problem is:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-diagnostic-tip.png" alt="Script editor with the script x y &lt;- 10. A red X indicates that there is a syntax error. The syntax error is also highlighted with a red squiggly line. Hovering over the X shows a text box with the text unexpected token y and unexpected token &lt;-." width="232"/></p>
</div>
</div>
<p>RStudio will also let you know about potential problems:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-diagnostic-warn.png" alt="Script editor with the script 3 == NA. A yellow exclamation mark indicates that there may be a potential problem. Hovering over the exclamation mark shows a text box with the text use is.na to check whether expression evaluates to NA." width="439"/></p>
</div>
</div>
</section>
<section id="saving-and-naming" data-type="sect2">
<h2>
Saving and naming</h2>
<p>RStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it’s a good idea to avoid Untitled1, Untitled2, Untitled3, and so on, and instead save your scripts and give them informative names.</p>
<p>It might be tempting to name your files <code>code.R</code> or <code>myscript.R</code>, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:</p>
<ol type="1"><li>File names should be <strong>machine</strong> readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.</li>
<li>File names should be <strong>human</strong> readable: use file names to describe what’s in the file.</li>
<li>File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.</li>
</ol><p>For example, suppose you have the following files in a project folder.</p>
<pre><code>alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt</code></pre>
<p>There are a variety of problems here: it’s hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (<code>finalreport</code> vs. <code>FinalReport</code><span data-type="footnote">Not to mention that you’re tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a <a href="https://phdcomics.com/comics/archive.php?comicid=1531">strip on this theme</a>.</span>), and some names don’t describe their contents (<code>run-first</code> and <code>temp</code>).</p>
<p>Here’s a better way of naming and organizing the same set of files:</p>
<pre><code>01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt</code></pre>
<p>Numbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and <code>temp</code> is renamed to <code>report-draft-notes</code> to better describe its contents.</p>
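<p>The machine-readable and ordering principles can be checked mechanically. The following is a minimal sketch in base R, using a few file names from the listings above; the variable names and the exact set of allowed characters are assumptions made for this illustration, not a rule from the book.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Example file names taken from the listings above
files &lt;- c("finalreport.qmd", "FinalReport.qmd", "fig 1.png", "01-load-data.R")

# Machine readable: only letters, digits, ".", "_" and "-" (no spaces or symbols)
safe_chars &lt;- grepl("^[A-Za-z0-9._-]+$", files)

# Names that differ only by capitalization
lower &lt;- tolower(files)
case_clash &lt;- lower %in% lower[duplicated(lower)]

# Default ordering: numbered prefixes sort scripts in the order they are run
scripts &lt;- sort(c("03-model-approach-1.R", "01-load-data.R", "02-exploratory-analysis.R"))</pre>
</div>
<p>Here <code>safe_chars</code> is <code>FALSE</code> only for <code>fig 1.png</code>, and <code>case_clash</code> flags the two report files that differ only in capitalization.</p>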
</section>
</section>
<section id="projects" data-type="sect1">
<h1>
Projects</h1>
<p>One day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously and you want to keep them separate. One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.</p>
<p>To handle these real-life situations, you need to make two decisions:</p>
<ol type="1"><li><p>What is the source of truth? What will you save as your lasting record of what happened?</p></li>
<li><p>Where does your analysis live?</p></li>
</ol>
<section id="what-is-the-source-of-truth" data-type="sect2">
<h2>
What is the source of truth?</h2>
<p>As a beginning R user, it’s OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis. However, in the long run, you’ll be much better off if you ensure that your R scripts are the source of truth. With your R scripts (and your data files), you can recreate the environment. With only your environment, it’s much harder to recreate your R scripts: you’ll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you’ll have to carefully mine your R history.</p>
<p>To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running <code><a href="#chp-https://usethis.r-lib.org/reference/use_blank_slate" data-type="xref">usethis::use_blank_slate()</a></code><span data-type="footnote">If you don’t have usethis installed, you can install it with <code>install.packages("usethis")</code>.</span> or by mimicking the options shown in <a href="#fig-blank-slate" data-type="xref">#fig-blank-slate</a>. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time. But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of an important calculation in your workspace, not the calculation itself in your code.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="diagrams/rstudio/clean-slate.png" alt="RStudio preferences window where the option Restore .RData into workspace at startup is not checked. Also, the option Save workspace to .RData on exit is set to Never. " width="523"/></p>
<figcaption class="figure-caption">Figure 9.2: Copy these options in your RStudio options to always start your RStudio session with a clean slate.</figcaption>
</figure>
</div>
</div>
<p>There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:</p>
<ol type="1"><li>Press Cmd/Ctrl + Shift + F10 to restart R.</li>
<li>Press Cmd/Ctrl + Shift + S to re-run the current script.</li>
</ol><p>We collectively use this pattern hundreds of times a week.</p>
<div data-type="note"><h1>
RStudio server
</h1><p>If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.</p></div>
</section>
<section id="where-does-your-analysis-live" data-type="sect2">
<h2>
Where does your analysis live?</h2>
<p>R has a powerful notion of the <strong>working directory</strong>. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-wd.png" alt="The Console tab shows the current working directory as ~/Documents/r4ds/r4ds. " width="321"/></p>
</div>
</div>
<p>And you can print this out in R code by running <code><a href="#chp-https://rdrr.io/r/base/getwd" data-type="xref">getwd()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">getwd()
#&gt; [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>
</div>
<p>As a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re nine chapters into this book, and you’re no longer a rank beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, set R’s working directory to the associated directory.</p>
<p>You can set the working directory from within R but <strong>we do not recommend it</strong>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">setwd("/path/to/my/CoolProject")</pre>
</div>
<p>There’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the <strong>RStudio project</strong>.</p>
</section>
<section id="rstudio-projects" data-type="sect2">
<h2>
RStudio projects</h2>
<p>Keeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via <strong>projects</strong>. Let’s make a project for you to use while you’re working through the rest of this book. Click File &gt; New Project, then follow the steps shown in <a href="#fig-new-project" data-type="xref">#fig-new-project</a>.</p>
<figure class="figure"><div class="cell-output-display">
<div class="quarto-figure quarto-figure-center anchored">
<figure class="figure"><p><img src="screenshots/rstudio-project-1.png" alt="Three screenshots of the New Project menu. In the first screenshot, the Create Project window is shown and New Directory is selected. In the second screenshot, the Project Type window is shown and Empty Project is selected. In the third screenshot, the Create New Project window is shown and the directory name is given as r4ds and the project is being created as subdirectory of the Desktop. " data-ref-parent="fig-new-project" width="542"/></p>
<figcaption class="figure-caption">(a) First click New Directory.</figcaption>
</figure></div>
</div>
<div class="cell-output-display">
<div class="quarto-figure quarto-figure-center anchored">
<figure class="figure"><p><img src="screenshots/rstudio-project-2.png" alt="Three screenshots of the New Project menu. In the first screenshot, the Create Project window is shown and New Directory is selected. In the second screenshot, the Project Type window is shown and Empty Project is selected. In the third screenshot, the Create New Project window is shown and the directory name is given as r4ds and the project is being created as subdirectory of the Desktop. " data-ref-parent="fig-new-project" width="545"/></p>
<figcaption class="figure-caption">(b) Then click New Project.</figcaption>
</figure></div>
</div>
<div class="cell-output-display">
<div class="quarto-figure quarto-figure-center anchored">
<figure class="figure"><p><img src="screenshots/rstudio-project-3.png" alt="Three screenshots of the New Project menu. In the first screenshot, the Create Project window is shown and New Directory is selected. In the second screenshot, the Project Type window is shown and Empty Project is selected. In the third screenshot, the Create New Project window is shown and the directory name is given as r4ds and the project is being created as subdirectory of the Desktop. " data-ref-parent="fig-new-project" width="548"/></p>
<figcaption class="figure-caption">(c) Finally, fill in the directory (project) name, choose a good subdirectory for its home and click Create Project.</figcaption>
</figure></div>
</div>
<figcaption class="figure-caption">Figure 9.3: Create a new project by following these three steps.</figcaption>
</figure>
<p>Call your project <code>r4ds</code> and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!</p>
<p>Once this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">getwd()
#&gt; [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>
</div>
<p>Now enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Next, run the complete script, which will save a PDF and CSV file into your project directory. Don’t worry about the details; you’ll learn them later in the book.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")
write_csv(diamonds, "diamonds.csv")</pre>
</div>
<p>Quit RStudio. Inspect the folder associated with your project and notice the <code>.Rproj</code> file. Double-click that file to re-open the project. Notice you get back to where you left off: it’s the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you’re starting with a clean slate.</p>
<p>In your favorite OS-specific way, search your computer for <code>diamonds.pdf</code> and you will find the PDF (no surprise) but <em>also the script that created it</em> (<code>diamonds.R</code>). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files <strong>with R code</strong> and never with the mouse or the clipboard, you will be able to reproduce old work with ease!</p>
</section>
<section id="relative-and-absolute-paths" data-type="sect2">
<h2>
Relative and absolute paths</h2>
<p>Once you’re inside a project, you should only ever use relative paths, not absolute paths. What’s the difference? A relative path is <strong>relative</strong> to the working directory, i.e. the project’s home. When Hadley wrote <code>diamonds.R</code> above it was a shortcut for <code>/Users/hadley/Documents/r4ds/r4ds/diamonds.R</code>. But importantly, if Mine ran this code on her computer, it would point to <code>/Users/Mine/Documents/r4ds/r4ds/diamonds.R</code>. This is why relative paths are important: they’ll work regardless of where the project ends up.</p>
<p>Absolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g. <code>C:</code>) or two backslashes (e.g. <code>\\servername</code>) and on Mac/Linux they start with a slash “/” (e.g. <code>/users/hadley</code>). You should <strong>never</strong> use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.</p>
<p>There’s another important difference between operating systems: how you separate the components of the path. Mac and Linux use slashes (e.g. <code>plots/diamonds.pdf</code>) and Windows uses backslashes (e.g. <code>plots\diamonds.pdf</code>). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.</p>
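<p>To make the path rules concrete, here is a minimal sketch in base R. The <code>plots</code> directory and all variable names are illustrative assumptions; nothing is read from or written to disk.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Build a relative path with forward slashes; this works on every platform
pdf_path &lt;- file.path("plots", "diamonds.pdf")

# The Windows-style spelling needs a doubled backslash in R source code,
# but the string itself contains only one backslash character
win_path &lt;- "plots\\diamonds.pdf"
n_chars &lt;- nchar(win_path)

# A relative path is resolved against the current working directory
full_path &lt;- file.path(getwd(), pdf_path)</pre>
</div>
<p>Both <code>pdf_path</code> and <code>win_path</code> are 18 characters long, which shows that the doubled backslash is only an escape in the source code. <code>full_path</code> will differ from one computer to the next, which is exactly why scripts should store only the relative part.</p>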
</section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In summary, scripts and projects give you a solid workflow that will serve you well in the future:</p>
<ul><li>Create one RStudio project for each data analysis project.</li>
<li>Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you’ve captured everything in your scripts.</li>
<li>Only ever use relative paths, not absolute paths.</li>
</ul><p>Then everything you need is in one place and cleanly separated from all the other projects that you are working on.</p>
</section>
<section id="exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li><p>Go to the RStudio Tips Twitter account, <a href="https://twitter.com/rstudiotips" class="uri">https://twitter.com/rstudiotips</a> and find one tip that looks interesting. Practice using it!</p></li>
<li><p>What other common mistakes will RStudio diagnostics report? Read <a href="https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics" class="uri">https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics</a> to find out.</p></li>
</ol></section>
<section id="summary-1" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you’ll learn to appreciate how a little up front organisation can save you a bunch of time down the road.</p>
<p>Next up, we’ll switch back to data science tooling to talk about exploratory data analysis (or EDA for short), a philosophy and set of tools that you can use with your data to start to get a sense of what’s going on.</p>
</section>
</section>

211
oreilly/workflow-style.html Normal file

@ -0,0 +1,211 @@
<section data-type="chapter" id="chp-workflow-style">
<h1><span id="sec-workflow-style" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: code style</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the <a href="#chp-https://style.tidyverse" data-type="xref">#chp-https://style.tidyverse</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="#chp-https://styler.r-lib" data-type="xref">#chp-https://styler.r-lib</a> package by Lorenz Walthert. Once you’ve installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio’s <strong>command palette</strong>. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-rstudio-sections"><p><img src="screenshots/rstudio-palette.png" alt="A screenshot showing the command palette after typing &quot;styler&quot;, showing the four styling tools provided by the package." width="638"/></p>
<figcaption>Figure 7.1: RStudio’s command palette makes it easy to access every RStudio command using only the keyboard.</figcaption>
</figure>
</div>
</div><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre>
</div>
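<p>If you prefer the console to the command palette, styler can also be called directly from code. The following is a minimal sketch, assuming styler is installed; <code>messy_code</code> and <code>restyled</code> are hypothetical names used only for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Assumes styler is installed: install.packages("styler")
# `messy_code` is a hypothetical example string, not code from this book
messy_code &lt;- "z&lt;-( a + b ) ^ 2/d"

# style_text() takes code as a string and returns the restyled code,
# using the tidyverse style by default
restyled &lt;- styler::style_text(messy_code)
restyled</pre>
</div>
<p>Comparing <code>restyled</code> with <code>messy_code</code> is a quick way to see what the style guide changes without touching any files.</p>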
<section id="names" data-type="sect1">
<h1>
Names</h1>
<p>We talked briefly about names in <a href="#sec-whats-in-a-name" data-type="xref">#sec-whats-in-a-name</a>. Remember that variable names (those created by <code>&lt;-</code> and those created by <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>) should use only lowercase letters, numbers, and <code>_</code>. Use <code>_</code> to separate words within a name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for:
short_flights &lt;- flights |&gt; filter(air_time &lt; 60)
# Avoid:
SHORTFLIGHTS &lt;- flights |&gt; filter(air_time &lt; 60)</pre>
</div>
<p>As a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand, rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.</p>
<p>If you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme, you’re better off giving them a common prefix, rather than a common suffix, because autocomplete works best on the start of a variable.</p>
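<p>As a small illustration of the common-prefix advice, the sketch below gives three related summaries a shared <code>delay_</code> prefix. The names <code>delay_summary</code>, <code>delay_mean</code>, <code>delay_median</code>, and <code>delay_n</code> are hypothetical and chosen only for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)

# Hypothetical names: every delay-related column starts with "delay_",
# so typing "delay_" and pressing Tab lists all of them together
delay_summary &lt;- flights |&gt;
  group_by(dest) |&gt;
  summarize(
    delay_mean = mean(arr_delay, na.rm = TRUE),
    delay_median = median(arr_delay, na.rm = TRUE),
    delay_n = sum(!is.na(arr_delay))
  )</pre>
</div>
<p>Had the columns been called <code>mean_delay</code>, <code>median_delay</code>, and <code>n_delay</code>, autocomplete would scatter them among every other name that starts with <code>mean</code>, <code>median</code>, or <code>n</code>.</p>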
</section>
<section id="spaces" data-type="sect1">
<h1>
Spaces</h1>
<p>Put spaces on either side of mathematical operators apart from <code>^</code> (i.e., <code>+</code>, <code>-</code>, <code>==</code>, <code>&lt;</code>, …), and around the assignment operator (<code>&lt;-</code>).</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for
z &lt;- (a + b)^2 / d
# Avoid
z&lt;-( a + b ) ^ 2/d</pre>
</div>
<p>Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in regular English.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for
mean(x, na.rm = TRUE)
# Avoid
mean (x ,na.rm=TRUE)</pre>
</div>
<p>It’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, you might want to add spaces so that all the <code>=</code> line up. This makes it easier to skim the code.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  mutate(
    speed      = distance / air_time,
    dep_hour   = dep_time %/% 100,
    dep_minute = dep_time %%  100
  )</pre>
</div>
</section>
<section id="sec-pipes" data-type="sect1">
<h1>
Pipes</h1>
<p><code>|&gt;</code> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 ft view by skimming the verbs on the left-hand side.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for
flights |&gt;
  filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
  count(dest)

# Avoid
flights|&gt;filter(!is.na(arr_delay), !is.na(tailnum))|&gt;count(dest)</pre>
</div>
<p>If the function you’re piping into has named arguments (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>), put each argument on a new line. If the function doesn’t have named arguments (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for
flights |&gt;
  group_by(tailnum) |&gt;
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# Avoid
flights |&gt;
  group_by(
    tailnum
  ) |&gt;
  summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())</pre>
</div>
<p>After the first step of the pipeline, indent each line by two spaces. If you’re putting each argument on its own line, indent by an extra two spaces. Make sure <code>)</code> is on its own line, and un-indented to match the horizontal position of the function name.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for
flights |&gt;
  group_by(tailnum) |&gt;
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# Avoid: arguments and closing parenthesis over-indented
flights|&gt;
  group_by(tailnum) |&gt;
  summarize(
             delay = mean(arr_delay, na.rm = TRUE),
             n = n()
           )

# Avoid: arguments not indented relative to the function name
flights|&gt;
  group_by(tailnum) |&gt;
  summarize(
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
  )</pre>
</div>
<p>It’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># This fits compactly on one line
df |&gt; mutate(y = x + 1)

# While this takes up 4x as many lines, it's easily extended to
# more variables and more steps in the future
df |&gt;
  mutate(
    y = x + 1
  )</pre>
</div>
<p>Finally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what’s happening and make it easier to check that intermediate results are as expected. Whenever you can give something an informative name, you should do so, which means breaking up long pipelines wherever an intermediate state can be given a good name. Don’t expect to get it right the first time!</p>
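<p>One way to apply this advice is sketched below: a single pipeline is split into three named steps. The intermediate names (<code>arrived_flights</code>, <code>delay_by_plane</code>, <code>frequent_planes</code>) are hypothetical, and the cut-off of 10 flights is an arbitrary choice for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)

# Step 1: keep only flights with a recorded arrival delay
arrived_flights &lt;- flights |&gt;
  filter(!is.na(arr_delay))

# Step 2: summarize the delay for each plane
delay_by_plane &lt;- arrived_flights |&gt;
  group_by(tailnum) |&gt;
  summarize(
    delay = mean(arr_delay),
    n = n()
  )

# Step 3: keep planes with more than 10 flights (arbitrary cut-off)
frequent_planes &lt;- delay_by_plane |&gt;
  filter(n &gt; 10)</pre>
</div>
<p>Each intermediate object can be printed and checked on its own, which is harder to do in the middle of one long pipe.</p>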
</section>
<section id="ggplot2" data-type="sect1">
<h1>
ggplot2</h1>
<p>The same basic rules that apply to the pipe also apply to ggplot2; just treat <code>+</code> the same way as <code>|&gt;</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(month) |&gt;
  summarize(
    delay = mean(arr_delay, na.rm = TRUE)
  ) |&gt;
  ggplot(aes(month, delay)) +
  geom_point() +
  geom_line()</pre>
</div>
<p>Again, if you can’t fit all of the arguments to a function onto a single line, put each argument on its own line:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
  group_by(dest) |&gt;
  summarize(
    distance = mean(distance),
    speed = mean(distance / air_time, na.rm = TRUE)
  ) |&gt;
  ggplot(aes(distance, speed)) +
  geom_smooth(
    method = "loess",
    span = 0.5,
    se = FALSE,
    color = "white",
    linewidth = 4
  ) +
  geom_point()</pre>
</div>
</section>
<section id="sectioning-comments" data-type="sect1">
<h1>
Sectioning comments</h1>
<p>As your scripts get longer, you can use <strong>sectioning</strong> comments to break up your file into manageable pieces:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Load data --------------------------------------
# Plot data --------------------------------------</pre>
</div>
<p>RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in <a href="#fig-rstudio-sections" data-type="xref">#fig-rstudio-sections</a>.</p>
<div class="cell">
<div class="cell-output-display">
<figure class="figure"><p><img src="screenshots/rstudio-nav.png" width="125"/></p>
<figcaption class="figure-caption">Figure 7.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.</figcaption>
</figure>
</div>
</div>
</section>
<section id="exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li>
<p>Restyle the following pipelines following the guidelines above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights|&gt;filter(dest=="IAH")|&gt;group_by(year,month,day)|&gt;summarize(n=n(),delay=mean(arr_delay,na.rm=TRUE))|&gt;filter(n&gt;10)
flights|&gt;filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time&gt;0900,sched_arr_time&lt;2000)|&gt;group_by(flight)|&gt;summarize(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|&gt;filter(n&gt;10)</pre>
</div>
</li>
</ol></section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code, and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.</p>
<p>So far, we’ve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data won’t be available in this way. So in the next chapter, you’re going to learn how to load data from disk into your R session using the readr package.</p>
</section>
</section>

16
oreilly/wrangle.html Normal file
View File

@ -0,0 +1,16 @@
<div data-type="part">
<h1><span id="sec-wrangle" class="quarto-section-identifier d-none d-lg-block">Wrangle</span></h1><p>In this part of the book, you’ll learn about data wrangling, the art of getting your data into R in a useful form for further work. In some cases, this is a relatively simple application of a package that does data import. But in more complex cases it encompasses both tidying and transformation, as the native structure of the data might be quite far from the tidy rectangle you’d prefer to work with.</p><div class="cell">
<div class="cell-output-display">
<figure id="fig-ds-wrangle"><p><img src="diagrams/data-science/wrangle.png" alt="Our data science model with import, tidy, and transform, highlighted in blue and labelled with &quot;wrangle&quot;. " width="535"/></p>
<figcaption>Figure 1: Data wrangling is the combination of importing, tidying, and transforming.</figcaption>
</figure>
</div>
</div><p>This part of the book proceeds as follows:</p><ul><li><p>In <a href="#chp-data-import" data-type="xref">#chp-data-import</a>, you’ll learn how to get plain-text data in rectangular formats from disk and into R.</p></li>
<li><p>In <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a>, you’ll learn how to get data from Excel spreadsheets and Google Sheets into R.</p></li>
<li><p>In <a href="#chp-databases" data-type="xref">#chp-databases</a>, you’ll learn about getting data into R from databases.</p></li>
<li><p>In <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a>, you’ll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.</p></li>
<li><p>In <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a>, you’ll learn about harvesting data off the web and getting it into R.</p></li>
</ul><p>Some other types of data are not covered in this book:</p><ul><li><p><strong>haven</strong> reads SPSS, Stata, and SAS files.</p></li>
<li><p><strong>xml2</strong> reads XML files.</p></li>
</ul><p>For other file types, try the <a href="#chp-https://cran.r-project.org/doc/manuals/r-release/R-data" data-type="xref">#chp-https://cran.r-project.org/doc/manuals/r-release/R-data</a> and the <a href="#chp-https://github.com/leeper/rio" data-type="xref">#chp-https://github.com/leeper/rio</a> package.</p></div>