Don't transform non-crossref links

Hadley Wickham 2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions

@ -66,7 +66,7 @@ Visualizing distributions</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (apprximately 12500), Premium, (approximately 14000), and Ideal (approximately 21500)." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (apprximately 12500), Premium, (approximately 14000), and Ideal (approximately 21500)." width="576"/></p>
</div> </div>
</div> </div>
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>:</p> <p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(cut) count(cut)
@ -87,7 +87,7 @@ Visualizing distributions</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
</div> </div>
</div> </div>
<p>You can compute this by hand by combining <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>:</p> <p>You can compute this by hand by combining <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(cut_width(carat, 0.5)) count(cut_width(carat, 0.5))
@ -114,7 +114,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
<p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), resulting in many bars. The distribution is right skewed but there are lots of ups and downs in the heights of the bins, creating a jagged outline." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), resulting in many bars. The distribution is right skewed but there are lots of ups and downs in the heights of the bins, creating a jagged outline." width="576"/></p>
</div> </div>
</div> </div>
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> instead of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> performs the same calculation as <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>, but instead of displaying the counts with bars, uses lines instead. It's much easier to understand overlapping lines than bars.</p> <p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, but instead of displaying the counts with bars, uses lines instead. It's much easier to understand overlapping lines than bars.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1, size = 0.75) geom_freqpoly(binwidth = 0.1, size = 0.75)
@ -173,7 +173,7 @@ Unusual values</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
</div> </div>
</div> </div>
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code>:</p> <p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
geom_histogram(binwidth = 0.5) + geom_histogram(binwidth = 0.5) +
@ -182,7 +182,7 @@ Unusual values</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p>
</div> </div>
</div> </div>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code> also has an <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> functions that work slightly differently: they throw away the data outside the limits.</p> <p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> also has an <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
<p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p> <p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">unusual &lt;- diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">unusual &lt;- diamonds |&gt;
@ -213,7 +213,7 @@ Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li> <ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
<li><p>Explore the distribution of <code>price</code>. Do you discover anything unusual or surprising? (Hint: Carefully think about the <code>binwidth</code> and make sure you try a wide range of values.)</p></li> <li><p>Explore the distribution of <code>price</code>. Do you discover anything unusual or surprising? (Hint: Carefully think about the <code>binwidth</code> and make sure you try a wide range of values.)</p></li>
<li><p>How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?</p></li> <li><p>How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?</p></li>
<li><p>Compare and contrast <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code> vs <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try and zoom so only half a bar shows?</p></li> <li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try and zoom so only half a bar shows?</p></li>
</ol></section> </ol></section>
</section> </section>
@ -230,13 +230,13 @@ Missing values</h1>
<p>We don't recommend this option because just because one measurement is invalid, doesn't mean all the measurements are. Additionally, if you have low quality data, by time that you've applied this approach to every variable you might find that you don't have any data left!</p> <p>We don't recommend this option because just because one measurement is invalid, doesn't mean all the measurements are. Additionally, if you have low quality data, by time that you've applied this approach to every variable you might find that you don't have any data left!</p>
</li> </li>
<li> <li>
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to replace the variable with a modified copy. You can use the <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> function to replace unusual values with <code>NA</code>:</p> <p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to replace the variable with a modified copy. You can use the <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> function to replace unusual values with <code>NA</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds2 &lt;- diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds2 &lt;- diamonds |&gt;
mutate(y = if_else(y &lt; 3 | y &gt; 20, NA, y))</pre> mutate(y = if_else(y &lt; 3 | y &gt; 20, NA, y))</pre>
</div> </div>
</li> </li>
</ol><p><code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> has three arguments. The first argument <code>test</code> should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is false. Alternatively to <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>, use <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>. <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> statements nested inside one another.</p> </ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three arguments. The first argument <code>test</code> should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is false. Alternatively to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside mutate when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another.</p>
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:</p> <p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It's not obvious where you should plot missing values, so ggplot2 doesn't include them in the plot, but it does warn that they've been removed:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
@ -251,7 +251,7 @@ Missing values</h1>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)</pre> geom_point(na.rm = TRUE)</pre>
</div> </div>
<p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code><span data-type="footnote">Remember that when need to be explicit about where a function (or dataset) comes from, we'll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable with <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>.</p> <p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code><span data-type="footnote">Remember that when need to be explicit about where a function (or dataset) comes from, we'll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">nycflights13::flights |&gt; <pre data-type="programlisting" data-code-language="downlit">nycflights13::flights |&gt;
mutate( mutate(
@ -272,7 +272,7 @@ Missing values</h1>
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li> <ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
<li><p>What does <code>na.rm = TRUE</code> do in <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> and <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>?</p></li> <li><p>What does <code>na.rm = TRUE</code> do in <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> and <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>?</p></li>
</ol></section> </ol></section>
</section> </section>
@ -284,7 +284,7 @@ Covariation</h1>
<section id="sec-cat-cont" data-type="sect2"> <section id="sec-cat-cont" data-type="sect2">
<h2> <h2>
A categorical and continuous variable</h2> A categorical and continuous variable</h2>
<p>It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in the shapes of their distributions. For example, let's explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p> <p>It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in the shapes of their distributions. For example, let's explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)</pre> geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)</pre>
@ -308,7 +308,7 @@ A categorical and continuous variable</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div> </div>
</div> </div>
<p>Note that we're mapping the density the <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes_eval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes_eval</a></code> function to do so.</p> <p>Note that we're mapping the density the <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="https://ggplot2.tidyverse.org/reference/aes_eval.html">after_stat()</a></code> function to do so.</p>
<p>There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.</p> <p>There's something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that's because frequency polygons are a little hard to interpret - there's a lot going on in this plot.</p>
<p>Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A <strong>boxplot</strong> is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:</p> <p>Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A <strong>boxplot</strong> is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:</p>
<ul><li><p>A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.</p></li> <ul><li><p>A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.</p></li>
@ -319,7 +319,7 @@ A categorical and continuous variable</h2>
<p><img src="images/EDA-boxplot.png" class="img-fluid" alt="A diagram depicting how a boxplot is created following the steps outlined above." width="1066"/></p> <p><img src="images/EDA-boxplot.png" class="img-fluid" alt="A diagram depicting how a boxplot is created following the steps outlined above." width="1066"/></p>
</div> </div>
</div> </div>
<p>Let's take a look at the distribution of price by cut using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot</a></code>:</p> <p>Let's take a look at the distribution of price by cut using <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()</pre> geom_boxplot()</pre>
@ -328,7 +328,7 @@ A categorical and continuous variable</h2>
</div> </div>
</div> </div>
<p>We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.</p> <p>We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.</p>
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="#chp-https://rdrr.io/r/stats/reorder.factor" data-type="xref">#chp-https://rdrr.io/r/stats/reorder.factor</a></code> function.</p> <p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://rdrr.io/r/stats/reorder.factor.html">reorder()</a></code> function.</p>
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p> <p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@ -346,7 +346,7 @@ A categorical and continuous variable</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p>
</div> </div>
</div> </div>
<p>If you have long variable names, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p> <p>If you have long variable names, <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) + mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
@ -361,17 +361,17 @@ A categorical and continuous variable</h2>
Exercises</h3> Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li> <ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
<li><p>What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?</p></li> <li><p>What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?</p></li>
<li><p>Instead of exchanging the x and y variables, add <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_flip" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_flip</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li> <li><p>Instead of exchanging the x and y variables, add <code><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html">coord_flip()</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li>
<li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs cut. What do you learn? How do you interpret the plots?</p></li> <li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs cut. What do you learn? How do you interpret the plots?</p></li>
<li><p>Compare and contrast <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_violin" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_violin</a></code> with a faceted <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>, or a coloured <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>. What are the pros and cons of each method?</p></li> <li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_violin.html">geom_violin()</a></code> with a faceted <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, or a coloured <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>. What are the pros and cons of each method?</p></li>
<li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code>. List them and briefly describe what each one does.</p></li> <li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>. List them and briefly describe what each one does.</p></li>
</ol></section> </ol></section>
</section> </section>
<section id="two-categorical-variables" data-type="sect2"> <section id="two-categorical-variables" data-type="sect2">
<h2> <h2>
Two categorical variables</h2> Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_count" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_count</a></code>:</p> <p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
geom_count()</pre> geom_count()</pre>
@ -411,7 +411,7 @@ Two categorical variables</h2>
#&gt; 6 E Fair 224 #&gt; 6 E Fair 224
#&gt; # … with 29 more rows</pre> #&gt; # … with 29 more rows</pre>
</div> </div>
<p>Then visualize with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_tile</a></code> and the fill aesthetic:</p> <p>Then visualize with <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> and the fill aesthetic:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(color, cut) |&gt; count(color, cut) |&gt;
@ -428,7 +428,7 @@ Two categorical variables</h2>
Exercises</h3> Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li> <ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
<li><p>How does the segmented bar chart change if color is mapped to the <code>x</code> aesthetic and <code>cut</code> is mapped to the <code>fill</code> aesthetic? Calculate the counts that fall into each of the segments.</p></li> <li><p>How does the segmented bar chart change if color is mapped to the <code>x</code> aesthetic and <code>cut</code> is mapped to the <code>fill</code> aesthetic? Calculate the counts that fall into each of the segments.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_tile</a></code> together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?</p></li> <li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?</p></li>
<li><p>Why is it slightly better to use <code>aes(x = color, y = cut)</code> rather than <code>aes(x = cut, y = color)</code> in the example above?</p></li> <li><p>Why is it slightly better to use <code>aes(x = color, y = cut)</code> rather than <code>aes(x = cut, y = color)</code> in the example above?</p></li>
</ol></section> </ol></section>
</section> </section>
@ -436,7 +436,7 @@ Exercises</h3>
<section id="two-continuous-variables" data-type="sect2"> <section id="two-continuous-variables" data-type="sect2">
<h2> <h2>
Two continuous variables</h2> Two continuous variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_point</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p> <p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point()</pre> geom_point()</pre>
@ -452,8 +452,8 @@ Two continuous variables</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than in other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p> <p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than in other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
</div> </div>
</div> </div>
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code> to bin in two dimensions.</p> <p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> to bin in two dimensions.</p>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d</a></code> creates rectangular bins. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code>.</p> <p><code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> creates rectangular bins. <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code>.</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_bin2d() geom_bin2d()
@ -474,7 +474,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
</div> </div>
</div> </div>
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p> <p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p>
<p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>:</p> <p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre> geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre>
@ -486,7 +486,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
<section id="exercises-4" data-type="sect3"> <section id="exercises-4" data-type="sect3">
<h3> <h3>
Exercises</h3> Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code> vs <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li> <ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
<li><p>Visualize the distribution of <code>carat</code>, partitioned by <code>price</code>.</p></li> <li><p>Visualize the distribution of <code>carat</code>, partitioned by <code>price</code>.</p></li>
<li><p>How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?</p></li> <li><p>How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?</p></li>
<li><p>Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.</p></li> <li><p>Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.</p></li>
@ -565,7 +565,7 @@ ggplot2 calls</h1>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)</pre> geom_freqpoly(binwidth = 0.25)</pre>
</div> </div>
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p> <p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
<p>Rewriting the previous plot more concisely yields:</p> <p>Rewriting the previous plot more concisely yields:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(faithful, aes(eruptions)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(faithful, aes(eruptions)) +

@ -7,7 +7,7 @@
</div> </div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code> to load packages, to <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p> <p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. 
It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
Prerequisites</h2> Prerequisites</h2>
@ -63,7 +63,7 @@ x %% 2 == 0
x[x %% 2 == 0] x[x %% 2 == 0]
#&gt; [1] 10 NA 8 NA</pre> #&gt; [1] 10 NA 8 NA</pre>
</div> </div>
<p>Note that, unlike <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, <code>NA</code> indices will be included in the output as <code>NA</code>s.</p> <p>Note that, unlike <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, <code>NA</code> indices will be included in the output as <code>NA</code>s.</p>
</li> </li>
<li> <li>
<p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p> <p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p>
@ -145,7 +145,7 @@ df2[, "x"]
dplyr equivalents</h2> dplyr equivalents</h2>
<p>A number of dplyr verbs are special cases of <code>[</code>:</p> <p>A number of dplyr verbs are special cases of <code>[</code>:</p>
<ul><li> <ul><li>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
x = c(2, 3, 1, 1, NA), x = c(2, 3, 1, 1, NA),
@ -157,10 +157,10 @@ df |&gt; filter(x &gt; 1)
# same as # same as
df[!is.na(df$x) &amp; df$x &gt; 1, ]</pre> df[!is.na(df$x) &amp; df$x &gt; 1, ]</pre>
</div> </div>
<p>Another common technique in the wild is to use <code><a href="#chp-https://rdrr.io/r/base/which" data-type="xref">#chp-https://rdrr.io/r/base/which</a></code> for its side-effect of dropping missing values: <code>df[which(df$x &gt; 1), ]</code>.</p> <p>Another common technique in the wild is to use <code><a href="https://rdrr.io/r/base/which.html">which()</a></code> for its side-effect of dropping missing values: <code>df[which(df$x &gt; 1), ]</code>.</p>
</li> </li>
<li> <li>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> is equivalent to subsetting the rows with an integer vector, usually created with <code><a href="#chp-https://rdrr.io/r/base/order" data-type="xref">#chp-https://rdrr.io/r/base/order</a></code>:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> is equivalent to subsetting the rows with an integer vector, usually created with <code><a href="https://rdrr.io/r/base/order.html">order()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; arrange(x, y) <pre data-type="programlisting" data-code-language="downlit">df |&gt; arrange(x, y)
@ -170,7 +170,7 @@ df[order(df$x, df$y), ]</pre>
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to sort individual columns in decreasing order.</p> <p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to sort individual columns in decreasing order.</p>
</li> </li>
<li> <li>
<p>Both <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> are similar to subsetting the columns with a character vector:</p> <p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; select(x, z) <pre data-type="programlisting" data-code-language="downlit">df |&gt; select(x, z)
@ -178,7 +178,7 @@ df[order(df$x, df$y), ]</pre>
df[, c("x", "z")]</pre> df[, c("x", "z")]</pre>
</div> </div>
</li> </li>
</ul><p>Base R also provides a function that combines the features of <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code><a href="#chp-https://tidyselect.r-lib.org/reference/starts_with" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/starts_with</a></code>.</span> called <code><a href="#chp-https://rdrr.io/r/base/subset" data-type="xref">#chp-https://rdrr.io/r/base/subset</a></code>:</p> </ul><p>Base R also provides a function that combines the features of <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code>.</span> called <code><a href="https://rdrr.io/r/base/subset.html">subset()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
filter(x &gt; 1) |&gt; filter(x &gt; 1) |&gt;
@ -209,7 +209,7 @@ Exercises</h2>
<li>Every element except the last value.</li> <li>Every element except the last value.</li>
<li>Only even values (and no missing values).</li> <li>Only even values (and no missing values).</li>
</ol></li> </ol></li>
<li><p>Why is <code>x[-which(x &gt; 0)]</code> not the same as <code>x[x &lt;= 0]</code>? Read the documentation for <code><a href="#chp-https://rdrr.io/r/base/which" data-type="xref">#chp-https://rdrr.io/r/base/which</a></code> and do some experiments to figure it out.</p></li> <li><p>Why is <code>x[-which(x &gt; 0)]</code> not the same as <code>x[x &lt;= 0]</code>? Read the documentation for <code><a href="https://rdrr.io/r/base/which.html">which()</a></code> and do some experiments to figure it out.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -222,7 +222,7 @@ Selecting a single element<code>$</code> and <code>[[</code>
<section id="data-frames" data-type="sect2"> <section id="data-frames" data-type="sect2">
<h2> <h2>
Data frames</h2> Data frames</h2>
<p><code>[[</code> and <code>$</code> can be used like <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pull</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p> <p><code>[[</code> and <code>$</code> can be used like <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">tb &lt;- tibble(
x = 1:4, x = 1:4,
@ -239,7 +239,7 @@ tb[["x"]]
tb$x tb$x
#&gt; [1] 1 2 3 4</pre> #&gt; [1] 1 2 3 4</pre>
</div> </div>
<p>They can also be used to create new columns, the base R equivalent of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>:</p> <p>They can also be used to create new columns, the base R equivalent of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb$z &lt;- tb$x + tb$y <pre data-type="programlisting" data-code-language="downlit">tb$z &lt;- tb$x + tb$y
tb tb
@ -251,8 +251,8 @@ tb
#&gt; 3 3 1 4 #&gt; 3 3 1 4
#&gt; 4 4 21 25</pre> #&gt; 4 4 21 25</pre>
</div> </div>
<p>There are a number of other base approaches to creating new columns, including <code><a href="#chp-https://rdrr.io/r/base/transform" data-type="xref">#chp-https://rdrr.io/r/base/transform</a></code>, <code><a href="#chp-https://rdrr.io/r/base/with" data-type="xref">#chp-https://rdrr.io/r/base/with</a></code>, and <code><a href="#chp-https://rdrr.io/r/base/with" data-type="xref">#chp-https://rdrr.io/r/base/with</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p> <p>There are a number of other base approaches to creating new columns, including <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
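<p>As a rough illustration of how these three base functions differ, here is a minimal sketch on a small made-up data frame (the name <code>df</code> and its values are assumptions for this example, not taken from the gist):</p>

```r
# Hypothetical data, used only for illustration
df <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# with() evaluates an expression using the columns and returns the result,
# here a plain numeric vector rather than a data frame
z_vec <- with(df, x + y)

# transform() returns a copy of the data frame with the new column added
df_transform <- transform(df, z = x + y)

# within() also returns a modified copy, but takes an assignment expression
df_within <- within(df, z <- x + y)
```

<p>Neither <code>transform()</code> nor <code>within()</code> modifies <code>df</code> in place, so you need to assign the result if you want to keep it.</p>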
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there's no need to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>:</p> <p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there's no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">max(diamonds$carat) <pre data-type="programlisting" data-code-language="downlit">max(diamonds$carat)
#&gt; [1] 5.01 #&gt; [1] 5.01
@ -384,9 +384,9 @@ Exercises</h2>
<section id="apply-family" data-type="sect1"> <section id="apply-family" data-type="sect1">
<h1> <h1>
Apply family</h1> Apply family</h1>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> and the map family of functions. In this section, you'll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and maps are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we'll give you a quick overview of this family so you can recognize them in the wild.</p> <p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you'll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and maps are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we'll give you a quick overview of this family so you can recognize them in the wild.</p>
<p>The most important member of this family is <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code>, which is very similar to <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there's an error.</span>. In fact, because we haven't used any of <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code>'s more advanced features, you can replace every <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code>.</p> <p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there's an error.</span>. In fact, because we haven't used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>'s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
<p>There's no exact base R equivalent to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> but you can get close by using <code>[</code> with <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> on a data frame applies the function to each column.</p> <p>There's no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(a = 1, b = 2, c = "a", d = "b", e = 4) <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
@ -404,15 +404,15 @@ df
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; #&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 2 4 a b 8</pre> #&gt; 1 2 4 a b 8</pre>
</div> </div>
<p>The code above uses a new function, <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code>. It's similar to <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don't recommend using it for programming, because the simplification can fail and give you an unexpected type, but it's usually fine for interactive use. purrr has a similar function called <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> that we didn't mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p> <p>The code above uses a new function, <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code>. It's similar to <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don't recommend using it for programming, because the simplification can fail and give you an unexpected type, but it's usually fine for interactive use. purrr has a similar function called <code><a href="https://purrr.tidyverse.org/reference/map.html">map_vec()</a></code> that we didn't mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
<p>Base R provides a stricter version of <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> called <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> call above with this <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> where we specify that we expect <code><a href="#chp-https://rdrr.io/r/base/numeric" data-type="xref">#chp-https://rdrr.io/r/base/numeric</a></code> to return a logical vector of length 1:</p> <p>Base R provides a stricter version of <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> called <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> call above with this <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> where we specify that we expect <code><a href="https://rdrr.io/r/base/numeric.html">is.numeric()</a></code> to return a logical vector of length 1:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">vapply(df, is.numeric, logical(1)) <pre data-type="programlisting" data-code-language="downlit">vapply(df, is.numeric, logical(1))
#&gt; a b c d e #&gt; a b c d e
#&gt; TRUE TRUE FALSE FALSE TRUE</pre> #&gt; TRUE TRUE FALSE FALSE TRUE</pre>
</div> </div>
<p>The distinction between <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> and <code><a href="#chp-https://rdrr.io/r/base/lapply" data-type="xref">#chp-https://rdrr.io/r/base/lapply</a></code> is really important when they're inside a function (because it makes a big difference to the function's robustness to unusual inputs), but it doesn't usually matter in data analysis.</p> <p>The distinction between <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> and <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> is really important when they're inside a function (because it makes a big difference to the function's robustness to unusual inputs), but it doesn't usually matter in data analysis.</p>
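<p>One unusual input where the two functions disagree is a data frame with no columns. The sketch below uses a made-up data frame (<code>no_cols</code> is a hypothetical name) to show the difference in return type:</p>

```r
# Hypothetical input: a data frame with its columns dropped
no_cols <- data.frame(a = 1, b = "x")[0]

# sapply() has nothing to simplify, so it falls back to an empty list
loose <- sapply(no_cols, is.numeric)

# vapply() still honours the declared type and returns logical(0)
strict <- vapply(no_cols, is.numeric, logical(1))

class(loose)   # "list"
class(strict)  # "logical"
```

<p>Code that later treats the result as a logical vector keeps working with the <code>vapply()</code> version but can break with the <code>sapply()</code> version.</p>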
<p>Another important member of the apply family is <code><a href="#chp-https://rdrr.io/r/base/tapply" data-type="xref">#chp-https://rdrr.io/r/base/tapply</a></code> which computes a single grouped summary:</p> <p>Another important member of the apply family is <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> which computes a single grouped summary:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
group_by(cut) |&gt; group_by(cut) |&gt;
@ -430,8 +430,8 @@ tapply(diamonds$price, diamonds$cut, mean)
#&gt; Fair Good Very Good Premium Ideal #&gt; Fair Good Very Good Premium Ideal
#&gt; 4358.758 3928.864 3981.760 4584.258 3457.542</pre> #&gt; 4358.758 3928.864 3981.760 4584.258 3457.542</pre>
</div> </div>
<p>Unfortunately <code><a href="#chp-https://rdrr.io/r/base/tapply" data-type="xref">#chp-https://rdrr.io/r/base/tapply</a></code> returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it's certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use <code><a href="#chp-https://rdrr.io/r/base/tapply" data-type="xref">#chp-https://rdrr.io/r/base/tapply</a></code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques <a href="#chp-https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec" data-type="xref">#chp-https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec</a>.</p> <p>Unfortunately <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it's certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques <a href="https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec">in a gist</a>.</p>
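<p>To give a sense of the gymnastics involved, here is one possible sketch that collects two <code>tapply()</code> summaries into a data frame. It uses the built-in <code>mtcars</code> dataset so that it runs on its own; the column names <code>mean_mpg</code> and <code>n</code> are choices made for this example and are not taken from the gist:</p>

```r
# Two separate grouped summaries, each named by the grouping value
means  <- tapply(mtcars$mpg, mtcars$cyl, mean)
counts <- tapply(mtcars$mpg, mtcars$cyl, length)

# Reassemble by hand: the group labels live in the names, as character
summary_df <- data.frame(
  cyl      = names(means),
  mean_mpg = as.vector(means),
  n        = as.vector(counts)
)
summary_df
```

<p>Note that the grouping variable comes back as character because it was stored in the names, so you would need to convert it yourself if you want the original type.</p>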
<p>The final member of the apply family is the titular <code><a href="#chp-https://rdrr.io/r/base/apply" data-type="xref">#chp-https://rdrr.io/r/base/apply</a></code>, which works with matrices and arrays. In particular, watch out for <code>apply(df, 2, something)</code> which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p> <p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out for <code>apply(df, 2, something)</code> which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
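<p>One reason <code>apply(df, 2, something)</code> can be dangerous is that <code>apply()</code> first converts the data frame to a matrix, which forces every column to a common type. A small sketch with a made-up data frame (<code>mixed</code> is a hypothetical name):</p>

```r
# Hypothetical data frame with one integer and one character column
mixed <- data.frame(a = 1:2, b = c("x", "y"))

# apply() coerces to a character matrix, so no column looks numeric
via_apply <- apply(mixed, 2, is.numeric)

# lapply() works column by column and keeps each column's own type
via_lapply <- lapply(mixed, is.numeric)

via_apply       # a and b are both FALSE
via_lapply$a    # TRUE
```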
</section> </section>
<section id="for-loops" data-type="sect1"> <section id="for-loops" data-type="sect1">
@ -443,7 +443,7 @@ For loops</h1>
# do something with element # do something with element
}</pre> }</pre>
</div> </div>
<p>The most straightforward use of <code>for()</code> loops is to achieve the same effect as <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using walk:</p> <p>The most straightforward use of <code>for()</code> loops is to achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using walk:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; walk(append_file)</pre> <pre data-type="programlisting" data-code-language="downlit">paths |&gt; walk(append_file)</pre>
</div> </div>
@ -458,11 +458,11 @@ For loops</h1>
<pre data-type="programlisting" data-code-language="downlit">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE) <pre data-type="programlisting" data-code-language="downlit">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files &lt;- map(paths, readxl::read_excel)</pre> files &lt;- map(paths, readxl::read_excel)</pre>
</div> </div>
<p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we're going to want a list the same length as <code>paths</code>, which we can create with <code><a href="#chp-https://rdrr.io/r/base/vector" data-type="xref">#chp-https://rdrr.io/r/base/vector</a></code>:</p> <p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we're going to want a list the same length as <code>paths</code>, which we can create with <code><a href="https://rdrr.io/r/base/vector.html">vector()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- vector("list", length(paths))</pre> <pre data-type="programlisting" data-code-language="downlit">files &lt;- vector("list", length(paths))</pre>
</div> </div>
<p>Then instead of iterating over the elements of <code>paths</code>, we'll iterate over their indices, using <code><a href="#chp-https://rdrr.io/r/base/seq" data-type="xref">#chp-https://rdrr.io/r/base/seq</a></code> to generate one index for each element of paths:</p> <p>Then instead of iterating over the elements of <code>paths</code>, we'll iterate over their indices, using <code><a href="https://rdrr.io/r/base/seq.html">seq_along()</a></code> to generate one index for each element of paths:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">seq_along(paths) <pre data-type="programlisting" data-code-language="downlit">seq_along(paths)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre> #&gt; [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre>
@ -473,7 +473,7 @@ files &lt;- map(paths, readxl::read_excel)</pre>
files[[i]] &lt;- readxl::read_excel(paths[[i]]) files[[i]] &lt;- readxl::read_excel(paths[[i]])
}</pre> }</pre>
</div> </div>
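<p>A common alternative to <code>seq_along()</code> is <code>1:length(x)</code>, which behaves differently when the input is empty. The sketch below uses a made-up empty vector (<code>no_paths</code> is a hypothetical name) to show why <code>seq_along()</code> is the safer way to generate indices:</p>

```r
# Hypothetical case: no files were found
no_paths <- character()

# 1:length() counts down from 1 to 0, giving two bogus indices
risky <- 1:length(no_paths)

# seq_along() returns a zero-length integer vector
safe <- seq_along(no_paths)

# With seq_along(), the loop body never runs for empty input
results <- vector("list", length(no_paths))
n_iterations <- 0
for (i in safe) {
  results[[i]] <- toupper(no_paths[[i]])
  n_iterations <- n_iterations + 1
}
```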
<p>To combine the list of tibbles into a single tibble you can use <code><a href="#chp-https://rdrr.io/r/base/do.call" data-type="xref">#chp-https://rdrr.io/r/base/do.call</a></code> + <code><a href="#chp-https://rdrr.io/r/base/cbind" data-type="xref">#chp-https://rdrr.io/r/base/cbind</a></code>:</p> <p>To combine the list of tibbles into a single tibble you can use <code><a href="https://rdrr.io/r/base/do.call.html">do.call()</a></code> + <code><a href="https://rdrr.io/r/base/cbind.html">rbind()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">do.call(rbind, files) <pre data-type="programlisting" data-code-language="downlit">do.call(rbind, files)
#&gt; # A tibble: 1,704 × 5 #&gt; # A tibble: 1,704 × 5
@ -501,7 +501,7 @@ for (path in paths) {
<h1> <h1>
Plots</h1> Plots</h1>
<p>Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they're so concise: it's very little typing to do a basic exploratory plot.</p> <p>Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they're so concise: it's very little typing to do a basic exploratory plot.</p>
<p>There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with <code><a href="#chp-https://rdrr.io/r/graphics/plot.default" data-type="xref">#chp-https://rdrr.io/r/graphics/plot.default</a></code> and <code><a href="#chp-https://rdrr.io/r/graphics/hist" data-type="xref">#chp-https://rdrr.io/r/graphics/hist</a></code> respectively. Here's a quick example from the diamonds dataset:</p> <p>There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here's a quick example from the diamonds dataset:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">hist(diamonds$carat) <pre data-type="programlisting" data-code-language="downlit">hist(diamonds$carat)


@ -13,12 +13,12 @@
Introduction</h1> Introduction</h1>
<p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know, even before looking, which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then moved on to the next plot. In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.</p> <p>In <a href="#chp-EDA" data-type="xref">#chp-EDA</a>, you learned how to use plots as tools for <em>exploration</em>. When you make exploratory plots, you know, even before looking, which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then moved on to the next plot. In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.</p>
<p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you'll learn some of the tools that ggplot2 provides to do so.</p> <p>Now that you understand your data, you need to <em>communicate</em> your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you'll learn some of the tools that ggplot2 provides to do so.</p>
<p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="#chp-https://www.amazon.com/gp/product/0321934075/" data-type="xref">#chp-https://www.amazon.com/gp/product/0321934075/</a>, by Alberto Cairo. It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p> <p>This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like <a href="https://www.amazon.com/gp/product/0321934075/"><em>The Truthful Art</em></a>, by Alberto Cairo. It doesn't teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
Prerequisites</h2> Prerequisites</h2>
<p>In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including <strong>ggrepel</strong> and <strong>patchwork</strong>. Rather than loading those extensions here, we'll refer to their functions explicitly, using the <code>::</code> notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don't forget you'll need to install those packages with <code><a href="#chp-https://rdrr.io/r/utils/install.packages" data-type="xref">#chp-https://rdrr.io/r/utils/install.packages</a></code> if you don't already have them.</p> <p>In this chapter, we'll focus once again on ggplot2. We'll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including <strong>ggrepel</strong> and <strong>patchwork</strong>. Rather than loading those extensions here, we'll refer to their functions explicitly, using the <code>::</code> notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don't forget you'll need to install those packages with <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> if you don't already have them.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre> <pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div> </div>
@ -28,7 +28,7 @@ Prerequisites</h2>
<section id="label" data-type="sect1"> <section id="label" data-type="sect1">
<h1> <h1>
Label</h1> Label</h1>
<p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/labs</a></code> function. This example adds a plot title:</p> <p>The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> function. This example adds a plot title:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) + geom_point(aes(color = class)) +
@ -55,7 +55,7 @@ Label</h1>
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-4-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-4-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>You can also use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/labs</a></code> to replace the axis and legend titles. It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p> <p>You can also use <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> to replace the axis and legend titles. It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) + geom_point(aes(colour = class)) +
@ -69,7 +69,7 @@ Label</h1>
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-5-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-5-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>It's possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="#chp-https://rdrr.io/r/base/substitute" data-type="xref">#chp-https://rdrr.io/r/base/substitute</a></code> and read about the available options in <code><a href="#chp-https://rdrr.io/r/grDevices/plotmath" data-type="xref">#chp-https://rdrr.io/r/grDevices/plotmath</a></code>:</p> <p>It's possible to use mathematical equations instead of text strings. Just switch <code>""</code> out for <code><a href="https://rdrr.io/r/base/substitute.html">quote()</a></code> and read about the available options in <code><a href="https://rdrr.io/r/grDevices/plotmath.html">?plotmath</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
x = runif(10), x = runif(10),
@ -105,7 +105,7 @@ Exercises</h2>
<section id="annotations" data-type="sect1"> <section id="annotations" data-type="sect1">
<h1> <h1>
Annotations</h1> Annotations</h1>
<p>In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code>. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code> is similar to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_point</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p> <p>In addition to labelling major components of your plot, it's often useful to label individual observations or groups of observations. The first tool you have at your disposal is <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> is similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>, but it has an additional aesthetic: <code>label</code>. This makes it possible to add textual labels to your plots.</p>
<p>There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:</p> <p>There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn't terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">best_in_class &lt;- mpg |&gt; <pre data-type="programlisting" data-code-language="downlit">best_in_class &lt;- mpg |&gt;
@ -119,7 +119,7 @@ ggplot(mpg, aes(displ, hwy)) +
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-8-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-8-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code> which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p> <p>This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> which draws a rectangle behind the text. We also use the <code>nudge_y</code> parameter to move the labels slightly above the corresponding points:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) + geom_point(aes(colour = class)) +
@ -161,7 +161,7 @@ ggplot(mpg, aes(displ, hwy, colour = class)) +
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-11-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-11-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>Alternatively, you might just want to add a single label to the plot, but you’ll still need to create a data frame. Often, you want the label in the corner of the plot, so it’s convenient to create a new data frame using <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> to compute the maximum values of x and y.</p> <p>Alternatively, you might just want to add a single label to the plot, but you’ll still need to create a data frame. Often, you want the label in the corner of the plot, so it’s convenient to create a new data frame using <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to compute the maximum values of x and y.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">label_info &lt;- mpg |&gt; <pre data-type="programlisting" data-code-language="downlit">label_info &lt;- mpg |&gt;
summarise( summarise(
@ -177,7 +177,7 @@ ggplot(mpg, aes(displ, hwy)) +
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-12-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-12-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we’re no longer computing the positions from <code>mpg</code>, we can use <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tibble</a></code> to create the data frame:</p> <p>If you want to place the text exactly on the borders of the plot, you can use <code>+Inf</code> and <code>-Inf</code>. Since we’re no longer computing the positions from <code>mpg</code>, we can use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> to create the data frame:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">label_info &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">label_info &lt;- tibble(
displ = Inf, displ = Inf,
@ -192,7 +192,7 @@ ggplot(mpg, aes(displ, hwy)) +
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-13-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-13-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_wrap" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_wrap</a></code> to automatically add line breaks, given the number of characters you want per line:</p> <p>In these examples, we manually broke the label up into lines using <code>"\n"</code>. Another approach is to use <code><a href="https://stringr.tidyverse.org/reference/str_wrap.html">stringr::str_wrap()</a></code> to automatically add line breaks, given the number of characters you want per line:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">"Increasing engine size is related to decreasing fuel economy." |&gt; <pre data-type="programlisting" data-code-language="downlit">"Increasing engine size is related to decreasing fuel economy." |&gt;
str_wrap(width = 40) |&gt; str_wrap(width = 40) |&gt;
@ -209,20 +209,20 @@ ggplot(mpg, aes(displ, hwy)) +
</figure> </figure>
</div> </div>
</div> </div>
<p>Remember, in addition to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A few ideas:</p> <p>Remember, in addition to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code>, you have many other geoms in ggplot2 available to help annotate your plot. A few ideas:</p>
<ul><li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_abline" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_abline</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_abline" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_abline</a></code> to add reference lines. We often make them thick (<code>size = 2</code>) and white (<code>colour = "white"</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li> <ul><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_hline()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_vline()</a></code> to add reference lines. We often make them thick (<code>size = 2</code>) and white (<code>colour = "white"</code>), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_tile</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li> <li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_rect()</a></code> to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics <code>xmin</code>, <code>xmax</code>, <code>ymin</code>, <code>ymax</code>.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_segment" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_segment</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li> <li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_segment.html">geom_segment()</a></code> with the <code>arrow</code> argument to draw attention to a point with an arrow. Use aesthetics <code>x</code> and <code>y</code> to define the starting location, and <code>xend</code> and <code>yend</code> to define the end location.</p></li>
</ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p> </ul><p>The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!</p>
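<p>As a rough sketch of how the annotation geoms listed above can be combined, the following draws white reference lines underneath the points, a rectangle around a region in the upper left of the plot, and an arrow pointing at that region. The reference values, rectangle bounds, and arrow coordinates are arbitrary choices made for illustration, and <code>linewidth</code> assumes ggplot2 3.4.0 or later (older versions use <code>size</code> for line thickness):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  # Reference lines are added first so they sit underneath the data layer
  geom_hline(yintercept = 30, linewidth = 2, colour = "white") +
  geom_vline(xintercept = 4, linewidth = 2, colour = "white") +
  geom_point() +
  # Rectangle around a region of interest (bounds are illustrative)
  geom_rect(
    data = data.frame(xmin = 1.7, xmax = 2.1, ymin = 40, ymax = 45),
    aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
    inherit.aes = FALSE, # don't look for displ and hwy in this data frame
    fill = NA, colour = "red"
  ) +
  # Arrow from (x, y) to (xend, yend), ending at the edge of the rectangle
  geom_segment(
    data = data.frame(x = 3, y = 42.5, xend = 2.15, yend = 42.5),
    aes(x = x, y = y, xend = xend, yend = yend),
    inherit.aes = FALSE,
    arrow = arrow(length = unit(0.2, "cm"))
  )</pre>
</div>
<p>Each annotation layer here gets its own one-row data frame, and <code>inherit.aes = FALSE</code> stops it from inheriting the <code>displ</code> and <code>hwy</code> mappings of the main plot.</p>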
<section id="exercises-1" data-type="sect2"> <section id="exercises-1" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code> with infinite positions to place text at the four corners of the plot.</p></li> <ol type="1"><li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> with infinite positions to place text at the four corners of the plot.</p></li>
<li><p>Read the documentation for <code><a href="#chp-https://ggplot2.tidyverse.org/reference/annotate" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/annotate</a></code>. How can you use it to add a text label to a plot without having to create a tibble?</p></li> <li><p>Read the documentation for <code><a href="https://ggplot2.tidyverse.org/reference/annotate.html">annotate()</a></code>. How can you use it to add a text label to a plot without having to create a tibble?</p></li>
<li><p>How do labels with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code> interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)</p></li> <li><p>How do labels with <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_text()</a></code> interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)</p></li>
<li><p>What arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_text" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_text</a></code> control the appearance of the background box?</p></li> <li><p>What arguments to <code><a href="https://ggplot2.tidyverse.org/reference/geom_text.html">geom_label()</a></code> control the appearance of the background box?</p></li>
<li><p>What are the four arguments to <code><a href="#chp-https://rdrr.io/r/grid/arrow" data-type="xref">#chp-https://rdrr.io/r/grid/arrow</a></code>? How do they work? Create a series of plots that demonstrate the most important options.</p></li> <li><p>What are the four arguments to <code><a href="https://rdrr.io/r/grid/arrow.html">arrow()</a></code>? How do they work? Create a series of plots that demonstrate the most important options.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -283,7 +283,7 @@ Axis ticks and legend keys</h2>
</div> </div>
</div> </div>
<p>Note that the specification of breaks and labels for date and datetime scales is a little different:</p> <p>Note that the specification of breaks and labels for date and datetime scales is a little different:</p>
<ul><li><p><code>date_labels</code> takes a format specification, in the same form as <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code>.</p></li> <ul><li><p><code>date_labels</code> takes a format specification, in the same form as <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">parse_datetime()</a></code>.</p></li>
<li><p><code>date_breaks</code> (not shown here), takes a string like “2 days” or “1 month”.</p></li> <li><p><code>date_breaks</code> (not shown here), takes a string like “2 days” or “1 month”.</p></li>
</ul></section> </ul></section>
@ -291,7 +291,7 @@ Axis ticks and legend keys</h2>
<h2> <h2>
Legend layout</h2> Legend layout</h2>
<p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p> <p>You will most often use <code>breaks</code> and <code>labels</code> to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.</p>
<p>To control the overall position of the legend, you need to use a <code><a href="#chp-https://ggplot2.tidyverse.org/reference/theme" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/theme</a></code> setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p> <p>To control the overall position of the legend, you need to use a <code><a href="https://ggplot2.tidyverse.org/reference/theme.html">theme()</a></code> setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting <code>legend.position</code> controls where the legend is drawn:</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">base &lt;- ggplot(mpg, aes(displ, hwy)) + <pre data-type="programlisting" data-code-language="downlit">base &lt;- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) geom_point(aes(colour = class))
@ -320,7 +320,7 @@ base + theme(legend.position = "right") # the default</pre>
</div> </div>
</div> </div>
<p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p> <p>You can also use <code>legend.position = "none"</code> to suppress the display of the legend altogether.</p>
<p>To control the display of individual legends, use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/guides" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/guides</a></code> along with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/guide_legend" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/guide_legend</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/guide_colourbar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/guide_colourbar</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p> <p>To control the display of individual legends, use <code><a href="https://ggplot2.tidyverse.org/reference/guides.html">guides()</a></code> along with <code><a href="https://ggplot2.tidyverse.org/reference/guide_legend.html">guide_legend()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/guide_colourbar.html">guide_colorbar()</a></code>. The following example shows two important settings: controlling the number of rows the legend uses with <code>nrow</code>, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low <code>alpha</code> to display many points on a plot.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) + geom_point(aes(colour = class)) +
@ -394,7 +394,7 @@ ggplot(mpg, aes(displ, hwy)) +
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-26-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-26-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>The ColorBrewer scales are documented online at <a href="https://colorbrewer2.org/" class="uri">https://colorbrewer2.org/</a> and made available in R via the <strong>RColorBrewer</strong> package, by Erich Neuwirth. <a href="#fig-brewer" data-type="xref">#fig-brewer</a> shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">#chp-https://rdrr.io/r/base/cut</a></code> to make a continuous variable into a categorical variable.</p> <p>The ColorBrewer scales are documented online at <a href="https://colorbrewer2.org/" class="uri">https://colorbrewer2.org/</a> and made available in R via the <strong>RColorBrewer</strong> package, by Erich Neuwirth. <a href="#fig-brewer" data-type="xref">#fig-brewer</a> shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code> to make a continuous variable into a categorical variable.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -403,7 +403,7 @@ ggplot(mpg, aes(displ, hwy)) +
</figure> </figure>
</div> </div>
</div> </div>
<p>When you have a predefined mapping between values and colors, use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_manual" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/scale_manual</a></code>. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:</p> <p>When you have a predefined mapping between values and colors, use <code><a href="https://ggplot2.tidyverse.org/reference/scale_manual.html">scale_colour_manual()</a></code>. For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">presidential |&gt; <pre data-type="programlisting" data-code-language="downlit">presidential |&gt;
mutate(id = 33 + row_number()) |&gt; mutate(id = 33 + row_number()) |&gt;
@ -415,7 +415,7 @@ ggplot(mpg, aes(displ, hwy)) +
<p><img src="communicate-plots_files/figure-html/unnamed-chunk-28-1.png" width="576"/></p> <p><img src="communicate-plots_files/figure-html/unnamed-chunk-28-1.png" width="576"/></p>
</div> </div>
</div> </div>
<p>For continuous colour, you can use the built-in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_gradient" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/scale_gradient</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_gradient" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/scale_gradient</a></code>. If you have a diverging scale, you can use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/scale_gradient" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/scale_gradient</a></code>. That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.</p> <p>For continuous colour, you can use the built-in <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_colour_gradient()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_fill_gradient()</a></code>. If you have a diverging scale, you can use <code><a href="https://ggplot2.tidyverse.org/reference/scale_gradient.html">scale_colour_gradient2()</a></code>. That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.</p>
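<p>To illustrate the diverging case, here is a minimal sketch that colours each car by how far its highway mileage sits from the mean, so points above and below the mean get different hues. The specific colours and the legend title are assumptions chosen for the example, not a recommended palette:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

ggplot(mpg, aes(displ, hwy)) +
  # Centre the colour variable on zero by subtracting the mean
  geom_point(aes(colour = hwy - mean(hwy))) +
  # midpoint = 0 places the neutral colour exactly at the mean
  scale_colour_gradient2(
    low = "blue", mid = "grey80", high = "red",
    midpoint = 0
  ) +
  labs(colour = "hwy minus mean")</pre>
</div>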
<p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p> <p>Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (<code>c</code>), discrete (<code>d</code>), and binned (<code>b</code>) palettes in ggplot2.</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
@ -469,7 +469,7 @@ Exercises</h2>
coord_fixed()</pre> coord_fixed()</pre>
</div> </div>
</li> </li>
<li><p>What is the first argument to every scale? How does it compare to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/labs</a></code>?</p></li> <li><p>What is the first argument to every scale? How does it compare to <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code>?</p></li>
<li> <li>
<p>Change the display of the presidential terms by:</p> <p>Change the display of the presidential terms by:</p>
<ol type="a"><li>Combining the two variants shown above.</li> <ol type="a"><li>Combining the two variants shown above.</li>
@ -497,9 +497,9 @@ Zooming</h1>
<p>There are three ways to control the plot limits:</p> <p>There are three ways to control the plot limits:</p>
<ol type="1"><li>Adjusting what data are plotted</li> <ol type="1"><li>Adjusting what data are plotted</li>
<li>Setting the limits in each scale</li> <li>Setting the limits in each scale</li>
<li>Setting <code>xlim</code> and <code>ylim</code> in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code> <li>Setting <code>xlim</code> and <code>ylim</code> in <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>
</li> </li>
</ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code>. Compare the following two plots:</p> </ol><p>To zoom in on a region of the plot, it’s generally best to use <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>. Compare the following two plots:</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, mapping = aes(displ, hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) + geom_point(aes(color = class)) +
@ -597,26 +597,26 @@ Themes</h1>
</div> </div>
</div> </div>
<p>Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.</p> <p>Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.</p>
<p>It’s also possible to control individual components of each theme, like the size and colour of the font used for the y axis. Unfortunately, this level of detail is outside the scope of this book, so you’ll need to read the <a href="#chp-https://ggplot2-book.org/" data-type="xref">#chp-https://ggplot2-book.org/</a> for the full details. You can also create your own themes, if you are trying to match a particular corporate or journal style.</p> <p>It’s also possible to control individual components of each theme, like the size and colour of the font used for the y axis. Unfortunately, this level of detail is outside the scope of this book, so you’ll need to read the <a href="https://ggplot2-book.org/">ggplot2 book</a> for the full details. You can also create your own themes, if you are trying to match a particular corporate or journal style.</p>
</section> </section>
<section id="sec-ggsave" data-type="sect1"> <section id="sec-ggsave" data-type="sect1">
<h1> <h1>
Saving your plots</h1> Saving your plots</h1>
<p>There are two main ways to get your plots out of R and into your final write-up: <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> and knitr. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> will save the most recent plot to disk:</p> <p>There are two main ways to get your plots out of R and into your final write-up: <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> and knitr. <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> will save the most recent plot to disk:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + geom_point() <pre data-type="programlisting" data-code-language="downlit">ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf") ggsave("my-plot.pdf")
#&gt; Saving 6 x 4 in image</pre> #&gt; Saving 6 x 4 in image</pre>
</div> </div>
<p>If you don’t specify the <code>width</code> and <code>height</code> they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.</p> <p>If you don’t specify the <code>width</code> and <code>height</code> they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.</p>
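<p>A minimal sketch of the reproducible variant, assuming you want the same 6 by 4 inch output that the message above reports. Passing the plot object explicitly, rather than relying on the most recent plot, is an extra precaution added here rather than something the text requires:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

p &lt;- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Fixed dimensions, so the file no longer depends on the current device
ggsave("my-plot.pdf", plot = p, width = 6, height = 4, units = "in")</pre>
</div>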
<p>Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics. You can learn more about <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> in the documentation.</p> <p>Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics. You can learn more about <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> in the documentation.</p>
</section> </section>
<section id="learning-more" data-type="sect1"> <section id="learning-more" data-type="sect1">
<h1> <h1>
Learning more</h1> Learning more</h1>
<p>The absolute best place to learn more is the ggplot2 book: <a href="#chp-https://ggplot2-book.org/" data-type="xref">#chp-https://ggplot2-book.org/</a>. It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.</p> <p>The absolute best place to learn more is the ggplot2 book: <a href="https://ggplot2-book.org/"><em>ggplot2: Elegant graphics for data analysis</em></a>. It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.</p>
<p>Another great resource is the ggplot2 extensions gallery <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a>. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.</p> <p>Another great resource is the ggplot2 extensions gallery <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a>. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.</p>

View File

@ -79,7 +79,7 @@ Reading data from a file</h1>
</tr></tbody></table></div> </tr></tbody></table></div>
</div> </div>
</div> </div>
<p>We can read this file into R using <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>. The first argument is the most important: it’s the path to the file.</p> <p>We can read this file into R using <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>. The first argument is the most important: it’s the path to the file.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_csv("data/students.csv") <pre data-type="programlisting" data-code-language="downlit">students &lt;- read_csv("data/students.csv")
#&gt; Rows: 6 Columns: 5 #&gt; Rows: 6 Columns: 5
@ -91,7 +91,7 @@ Reading data from a file</h1>
#&gt; Use `spec()` to retrieve the full column specification for this data. #&gt; Use `spec()` to retrieve the full column specification for this data.
#&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre> #&gt; Specify the column types or set `show_col_types = FALSE` to quiet this message.</pre>
</div> </div>
<p>When you run <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr and we’ll come back to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p> <p>When you run <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr and we’ll come back to it in <a href="#sec-col-types" data-type="xref">#sec-col-types</a>.</p>
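<p>The two hints in that message can be tried directly. This is a small sketch that assumes the same <code>data/students.csv</code> file is present in your working directory:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# show_col_types = FALSE suppresses the column specification message
students &lt;- read_csv("data/students.csv", show_col_types = FALSE)

# spec() retrieves the full column specification that readr guessed
spec(students)</pre>
</div>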
<section id="practical-advice" data-type="sect2"> <section id="practical-advice" data-type="sect2">
<h2> <h2>
@ -129,7 +129,7 @@ students
#&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five #&gt; 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
#&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre> #&gt; 6 6 Güvenç Attila Ice cream Lunch only 6</pre>
</div> </div>
<p>An alternative approach is to use <code><a href="#chp-https://rdrr.io/pkg/janitor/man/clean_names" data-type="xref">#chp-https://rdrr.io/pkg/janitor/man/clean_names</a></code>, which applies some heuristics to turn them all into snake case at once<span data-type="footnote">The <a href="#chp-http://sfirke.github.io/janitor/" data-type="xref">#chp-http://sfirke.github.io/janitor/</a> package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use <code>|&gt;</code>.</span>.</p> <p>An alternative approach is to use <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code>, which applies some heuristics to turn them all into snake case at once<span data-type="footnote">The <a href="http://sfirke.github.io/janitor/">janitor</a> package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use <code>|&gt;</code>.</span>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students |&gt; janitor::clean_names() <pre data-type="programlisting" data-code-language="downlit">students |&gt; janitor::clean_names()
#&gt; # A tibble: 6 × 5 #&gt; # A tibble: 6 × 5
@ -185,7 +185,7 @@ students
<section id="other-arguments" data-type="sect2"> <section id="other-arguments" data-type="sect2">
<h2> <h2>
Other arguments</h2> Other arguments</h2>
<p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> can read CSV files that you’ve created in a string:</p> <p>There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> can read CSV files that you’ve created in a string:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv( <pre data-type="programlisting" data-code-language="downlit">read_csv(
"a,b,c "a,b,c
@ -198,7 +198,7 @@ Other arguments</h2>
#&gt; 1 1 2 3 #&gt; 1 1 2 3
#&gt; 2 4 5 6</pre> #&gt; 2 4 5 6</pre>
</div> </div>
<p>Usually <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> uses the first line of the data for the column names, which is a very common convention. But sometimes there are a few lines of metadata at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p> <p>Usually <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> uses the first line of the data for the column names, which is a very common convention. But sometimes there are a few lines of metadata at the top of the file. You can use <code>skip = n</code> to skip the first <code>n</code> lines or use <code>comment = "#"</code> to drop all lines that start with (e.g.) <code>#</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv( <pre data-type="programlisting" data-code-language="downlit">read_csv(
"The first line of metadata "The first line of metadata
@ -223,7 +223,7 @@ read_csv(
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; #&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 1 2 3</pre> #&gt; 1 1 2 3</pre>
</div> </div>
<p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> not to treat the first row as headings, and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p> <p>In other cases, the data might not have column names. You can use <code>col_names = FALSE</code> to tell <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> not to treat the first row as headings, and instead label them sequentially from <code>X1</code> to <code>Xn</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv( <pre data-type="programlisting" data-code-language="downlit">read_csv(
"1,2,3 "1,2,3
@ -249,29 +249,29 @@ read_csv(
#&gt; 1 1 2 3 #&gt; 1 1 2 3
#&gt; 2 4 5 6</pre> #&gt; 2 4 5 6</pre>
</div> </div>
<p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and carefully read the documentation for <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>’s many other arguments.)</p> <p>These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your <code>.csv</code> file and carefully read the documentation for <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>’s many other arguments.)</p>
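<p>As a minimal sketch of how these arguments combine, the following reads an inline string that starts with a comment line and has no header row. It assumes readr 2.0 or later, where <code>I()</code> marks a string as literal data; the data and the column names <code>x</code>, <code>y</code>, and <code>z</code> are made up for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical inline data: one comment line, no header row
csv_text &lt;- "# exported by a hypothetical tool
1,2,3
4,5,6"

df &lt;- read_csv(
  I(csv_text),                   # treat the string as literal data, not a path
  comment = "#",                 # drop lines that start with #
  col_names = c("x", "y", "z"),  # supply names instead of using the first row
  show_col_types = FALSE         # quiet the column specification message
)
df</pre>
</div>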
</section> </section>
<section id="other-file-types" data-type="sect2"> <section id="other-file-types" data-type="sect2">
<h2> <h2>
Other file types</h2> Other file types</h2>
<p>Once you’ve mastered <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:</p> <p>Once you’ve mastered <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:</p>
<ul><li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> reads semicolon separated files. These use <code>;</code> instead of <code>,</code> to separate fields, and are common in countries that use <code>,</code> as the decimal marker.</p></li> <ul><li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv2()</a></code> reads semicolon separated files. These use <code>;</code> instead of <code>,</code> to separate fields, and are common in countries that use <code>,</code> as the decimal marker.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> reads tab delimited files.</p></li> <li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> reads tab delimited files.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.</p></li> <li><p><code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_delim()</a></code> reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_fwf</a></code> reads fixed width files. You can specify fields either by their widths with <code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_fwf</a></code> or their position with <code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_fwf</a></code>.</p></li> <li><p><code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code> reads fixed width files. You can specify fields either by their widths with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_widths()</a></code> or their position with <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">fwf_positions()</a></code>.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_table" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_table</a></code> reads a common variation of fixed width files where columns are separated by white space.</p></li> <li><p><code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code> reads a common variation of fixed width files where columns are separated by white space.</p></li>
<li><p><code><a href="#chp-https://readr.tidyverse.org/reference/read_log" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_log</a></code> reads Apache style log files.</p></li> <li><p><code><a href="https://readr.tidyverse.org/reference/read_log.html">read_log()</a></code> reads Apache style log files.</p></li>
</ul></section> </ul></section>
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>What function would you use to read a file where fields were separated with “|”?</p></li> <ol type="1"><li><p>What function would you use to read a file where fields were separated with “|”?</p></li>
<li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> have in common?</p></li> <li><p>Apart from <code>file</code>, <code>skip</code>, and <code>comment</code>, what other arguments do <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_tsv()</a></code> have in common?</p></li>
<li><p>What are the most important arguments to <code><a href="#chp-https://readr.tidyverse.org/reference/read_fwf" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_fwf</a></code>?</p></li> <li><p>What are the most important arguments to <code><a href="https://readr.tidyverse.org/reference/read_fwf.html">read_fwf()</a></code>?</p></li>
<li> <li>
<p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> assumes that the quoting character will be <code>"</code>. What argument to <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> do you need to specify to read the following text into a data frame?</p> <p>Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like <code>"</code> or <code>'</code>. By default, <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> assumes that the quoting character will be <code>"</code>. What argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> do you need to specify to read the following text into a data frame?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">"x,y\n1,'a,b'"</pre> <pre data-type="programlisting" data-code-language="downlit">"x,y\n1,'a,b'"</pre>
</div> </div>
@ -375,7 +375,7 @@ Missing values, column types, and problems</h2>
#&gt; dat &lt;- vroom(...) #&gt; dat &lt;- vroom(...)
#&gt; problems(dat)</pre> #&gt; problems(dat)</pre>
</div> </div>
<p>Now <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> reports that there was a problem, and tells us we can find out more with <code><a href="#chp-https://readr.tidyverse.org/reference/problems" data-type="xref">#chp-https://readr.tidyverse.org/reference/problems</a></code>:</p> <p>Now <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> reports that there was a problem, and tells us we can find out more with <code><a href="https://readr.tidyverse.org/reference/problems.html">problems()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">problems(df) <pre data-type="programlisting" data-code-language="downlit">problems(df)
#&gt; # A tibble: 1 × 5 #&gt; # A tibble: 1 × 5
@ -401,18 +401,18 @@ Missing values, column types, and problems</h2>
Column types</h2> Column types</h2>
<p>readr provides a total of nine column types for you to use:</p> <p>readr provides a total of nine column types for you to use:</p>
<ul><li> <ul><li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> read logicals and real numbers. They’re rarely needed (except as above), since readr will usually guess them for you.</li> <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_logical()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_double()</a></code> read logicals and real numbers. They’re rarely needed (except as above), since readr will usually guess them for you.</li>
<li> <li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> reads integers. We rarely distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li> <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_integer()</a></code> reads integers. We rarely distinguish integers and doubles in this book because they’re functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.</li>
<li> <li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. a long series of digits that identifies some object, but that it doesn’t make sense to (e.g.) divide in half.</li> <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">col_character()</a></code> reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. a long series of digits that identifies some object, but that it doesn’t make sense to (e.g.) divide in half.</li>
<li> <li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_factor" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_factor</a></code>, <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code> create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li> <code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>, <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> create factors, dates, and date-times respectively; you’ll learn more about those when we get to those data types in <a href="#chp-factors" data-type="xref">#chp-factors</a> and <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a>.</li>
<li> <li>
<code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li> <code><a href="https://readr.tidyverse.org/reference/parse_number.html">col_number()</a></code> is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You’ll learn more about it in <a href="#chp-numbers" data-type="xref">#chp-numbers</a>.</li>
<li> <li>
<code><a href="#chp-https://readr.tidyverse.org/reference/col_skip" data-type="xref">#chp-https://readr.tidyverse.org/reference/col_skip</a></code> skips a column so it’s not included in the result.</li> <code><a href="https://readr.tidyverse.org/reference/col_skip.html">col_skip()</a></code> skips a column so it’s not included in the result.</li>
</ul><p>It’s also possible to override the default column type by switching from <code><a href="#chp-https://rdrr.io/r/base/list" data-type="xref">#chp-https://rdrr.io/r/base/list</a></code> to <code><a href="#chp-https://readr.tidyverse.org/reference/cols" data-type="xref">#chp-https://readr.tidyverse.org/reference/cols</a></code>:</p> </ul><p>It’s also possible to override the default column type by switching from <code><a href="https://rdrr.io/r/base/list.html">list()</a></code> to <code><a href="https://readr.tidyverse.org/reference/cols.html">cols()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- " <pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
x,y,z x,y,z
@ -424,7 +424,7 @@ read_csv(csv, col_types = cols(.default = col_character()))
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; #&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1 2 3</pre> #&gt; 1 1 2 3</pre>
</div> </div>
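<p>The same <code>cols()</code> call can also set types column by column. The following is a minimal sketch, assuming readr 2.0 or later and a made-up data set whose <code>id</code> column is a numeric identifier that should stay text so its leading zeros survive; the column names and values are hypothetical.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical data: `id` looks numeric but is really a label
ids &lt;- read_csv(
  I("id,amount\n00123,4.5\n00456,7"),
  col_types = cols(
    id = col_character(),  # keep the identifier as text
    amount = col_double()  # parse the amount as a real number
  )
)
ids$id</pre>
</div>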
<p>Another useful helper is <code><a href="#chp-https://readr.tidyverse.org/reference/cols" data-type="xref">#chp-https://readr.tidyverse.org/reference/cols</a></code> which will read in only the columns you specify:</p> <p>Another useful helper is <code><a href="https://readr.tidyverse.org/reference/cols.html">cols_only()</a></code> which will read in only the columns you specify:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_csv( <pre data-type="programlisting" data-code-language="downlit">read_csv(
"x,y,z "x,y,z
@ -442,7 +442,7 @@ read_csv(csv, col_types = cols(.default = col_character()))
<section id="sec-readr-directory" data-type="sect1"> <section id="sec-readr-directory" data-type="sect1">
<h1> <h1>
Reading data from multiple files</h1> Reading data from multiple files</h1>
<p>Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: <code>01-sales.csv</code> for January, <code>02-sales.csv</code> for February, and <code>03-sales.csv</code> for March. With <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> you can read these data in at once and stack them on top of each other in a single data frame.</p> <p>Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: <code>01-sales.csv</code> for January, <code>02-sales.csv</code> for February, and <code>03-sales.csv</code> for March. With <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> you can read these data in at once and stack them on top of each other in a single data frame.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv") <pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file") read_csv(sales_files, id = "file")
@ -466,7 +466,7 @@ read_csv(sales_files, id = "file")
#&gt; # … with 13 more rows</pre> #&gt; # … with 13 more rows</pre>
</div> </div>
<p>With the additional <code>id</code> parameter we have added a new column called <code>file</code> to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.</p> <p>With the additional <code>id</code> parameter we have added a new column called <code>file</code> to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.</p>
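<p>To try this without the book’s data files, here is a self-contained sketch that writes two small files to a temporary directory and reads them back in one call. It assumes readr 2.0 or later; the file contents and the <code>month</code> and <code>n</code> columns are invented for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Create two hypothetical monthly files in a temporary directory
dir &lt;- tempfile("sales-")
dir.create(dir)
jan &lt;- file.path(dir, "01-sales.csv")
feb &lt;- file.path(dir, "02-sales.csv")
writeLines(c("month,n", "January,1", "January,2"), jan)
writeLines(c("month,n", "February,3"), feb)

# Read both files at once; `id` adds a column holding the source path
sales &lt;- read_csv(c(jan, feb), id = "file", show_col_types = FALSE)
sales</pre>
</div>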
<p>If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base <code><a href="#chp-https://rdrr.io/r/base/list.files" data-type="xref">#chp-https://rdrr.io/r/base/list.files</a></code> function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p> <p>If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in <a href="#chp-regexps" data-type="xref">#chp-regexps</a>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- list.files("data", pattern = "sales\\.csv$", full.names = TRUE) <pre data-type="programlisting" data-code-language="downlit">sales_files &lt;- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
sales_files sales_files
@ -477,7 +477,7 @@ sales_files
<section id="sec-writing-to-a-file" data-type="sect1"> <section id="sec-writing-to-a-file" data-type="sect1">
<h1> <h1>
Writing to a file</h1> Writing to a file</h1>
<p>readr also comes with two useful functions for writing data back to disk: <code><a href="#chp-https://readr.tidyverse.org/reference/write_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/write_delim</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/write_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/write_delim</a></code>. Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.</p> <p>readr also comes with two useful functions for writing data back to disk: <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_csv()</a></code> and <code><a href="https://readr.tidyverse.org/reference/write_delim.html">write_tsv()</a></code>. Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.</p>
<p>The most important arguments are <code>x</code> (the data frame to save), and <code>file</code> (the location to save it). You can also specify how missing values are written with <code>na</code>, and if you want to <code>append</code> to an existing file.</p> <p>The most important arguments are <code>x</code> (the data frame to save), and <code>file</code> (the location to save it). You can also specify how missing values are written with <code>na</code>, and if you want to <code>append</code> to an existing file.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">write_csv(students, "students.csv")</pre> <pre data-type="programlisting" data-code-language="downlit">write_csv(students, "students.csv")</pre>
@ -508,7 +508,7 @@ read_csv("students-2.csv")
</div> </div>
<p>This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two main options:</p> <p>This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load in. There are two main options:</p>
<ol type="1"><li> <ol type="1"><li>
<p><code><a href="#chp-https://readr.tidyverse.org/reference/read_rds" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_rds</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/read_rds" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_rds</a></code> are uniform wrappers around the base functions <code><a href="#chp-https://rdrr.io/r/base/readRDS" data-type="xref">#chp-https://rdrr.io/r/base/readRDS</a></code> and <code><a href="#chp-https://rdrr.io/r/base/readRDS" data-type="xref">#chp-https://rdrr.io/r/base/readRDS</a></code>. These store data in R’s custom binary format called RDS:</p> <p><code><a href="https://readr.tidyverse.org/reference/read_rds.html">write_rds()</a></code> and <code><a href="https://readr.tidyverse.org/reference/read_rds.html">read_rds()</a></code> are uniform wrappers around the base functions <code><a href="https://rdrr.io/r/base/readRDS.html">readRDS()</a></code> and <code><a href="https://rdrr.io/r/base/readRDS.html">saveRDS()</a></code>. These store data in R’s custom binary format called RDS:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">write_rds(students, "students.rds") <pre data-type="programlisting" data-code-language="downlit">write_rds(students, "students.rds")
read_rds("students.rds") read_rds("students.rds")
@ -546,7 +546,7 @@ read_parquet("students.parquet")
<section id="data-entry" data-type="sect1"> <section id="data-entry" data-type="sect1">
<h1> <h1>
Data entry</h1> Data entry</h1>
<p>Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this which differ in whether you lay out the tibble by columns or by rows. <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tibble</a></code> works by column:</p> <p>Sometimes you’ll need to assemble a tibble “by hand” doing a little data entry in your R script. There are two useful functions to help you do this which differ in whether you lay out the tibble by columns or by rows. <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> works by column:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tibble( <pre data-type="programlisting" data-code-language="downlit">tibble(
x = c(1, 2, 5), x = c(1, 2, 5),
@ -573,7 +573,7 @@ Data entry</h1>
#&gt; • Size 3: Column `y`. #&gt; • Size 3: Column `y`.
#&gt; Only values of size one are recycled.</pre> #&gt; Only values of size one are recycled.</pre>
</div> </div>
<p>Laying out the data by column can make it hard to see how the rows are related, so an alternative is <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code>, short for <strong>tr</strong>ansposed t<strong>ibble</strong>, which lets you lay out your data row by row. <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code> is customized for data entry in code: column headings start with <code>~</code> and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:</p> <p>Laying out the data by column can make it hard to see how the rows are related, so an alternative is <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>, short for <strong>tr</strong>ansposed t<strong>ibble</strong>, which lets you lay out your data row by row. <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> is customized for data entry in code: column headings start with <code>~</code> and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy to read form:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tribble( <pre data-type="programlisting" data-code-language="downlit">tribble(
~x, ~y, ~z, ~x, ~y, ~z,
@ -588,13 +588,13 @@ Data entry</h1>
#&gt; 2 m 2 0.83 #&gt; 2 m 2 0.83
#&gt; 3 g 5 0.6</pre> #&gt; 3 g 5 0.6</pre>
</div> </div>
<p>We’ll use <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tibble</a></code> and <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code> later in the book to construct small examples to demonstrate how various functions work.</p> <p>We’ll use <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> later in the book to construct small examples to demonstrate how various functions work.</p>
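<p>As a quick check that the two layouts describe the same data, the sketch below builds one small table both ways and compares the results. The values are made up for illustration, and <code>all.equal()</code> is used here only as a convenient base R comparison.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tibble)

# Column by column
by_col &lt;- tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

# Row by row: headings start with ~, entries are separated by commas
by_row &lt;- tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60
)

all.equal(by_col, by_row)</pre>
</div>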
</section> </section>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>In this chapter, you’ve learned how to load CSV files with <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> and to do your own data entry with <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tibble</a></code> and <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> from Excel and Google Sheets, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> from JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> from websites.</p> <p>In this chapter, you’ve learned how to load CSV files with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and to do your own data entry with <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: <a href="#chp-databases" data-type="xref">#chp-databases</a> will show you how to load data from databases, <a href="#chp-spreadsheets" data-type="xref">#chp-spreadsheets</a> from Excel and Google Sheets, <a href="#chp-rectangling" data-type="xref">#chp-rectangling</a> from JSON, and <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a> from websites.</p>
<p>Now that youre writing a substantial amount of R code, its time to learn more about organizing your code into files and directories. In the next chapter, youll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p> <p>Now that youre writing a substantial amount of R code, its time to learn more about organizing your code into files and directories. In the next chapter, youll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.</p>


@ -29,7 +29,7 @@ Prerequisites</h2>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre> <pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div> </div>
<p>From this chapter on, we’ll suppress the loading message from <code><a href="#chp-https://tidyverse.tidyverse" data-type="xref">#chp-https://tidyverse.tidyverse</a></code>.</p> <p>From this chapter on, we’ll suppress the loading message from <code><a href="https://tidyverse.tidyverse.org">library(tidyverse)</a></code>.</p>
</section> </section>
</section> </section>
@ -164,7 +164,7 @@ Pivoting</h1>
<ol type="1"><li><p>Data is often organised to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.</p></li> <ol type="1"><li><p>Data is often organised to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.</p></li>
<li><p>Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.</p></li> <li><p>Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.</p></li>
</ol><p>This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll <strong>pivot</strong> your data into a tidy form, with variables in the columns and observations in the rows.</p> </ol><p>This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll <strong>pivot</strong> your data into a tidy form, with variables in the columns and observations in the rows.</p>
<p>tidyr provides two functions for pivoting data: <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>, which makes datasets <strong>longer</strong> by increasing rows and reducing columns, and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> which makes datasets <strong>wider</strong> by increasing columns and reducing rows. The following sections work through the use of <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> to tackle a wide range of realistic datasets. These examples are drawn from <code><a href="#chp-https://tidyr.tidyverse.org/articles/pivot" data-type="xref">#chp-https://tidyr.tidyverse.org/articles/pivot</a></code>, which you should check out if you want to see more variations and more challenging problems.</p> <p>tidyr provides two functions for pivoting data: <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>, which makes datasets <strong>longer</strong> by increasing rows and reducing columns, and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which makes datasets <strong>wider</strong> by increasing columns and reducing rows. The following sections work through the use of <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to tackle a wide range of realistic datasets. 
These examples are drawn from <code><a href="https://tidyr.tidyverse.org/articles/pivot.html">vignette("pivot", package = "tidyr")</a></code>, which you should check out if you want to see more variations and more challenging problems.</p>
<p>Let’s dive in.</p> <p>Let’s dive in.</p>
<section id="sec-billboard" data-type="sect2"> <section id="sec-billboard" data-type="sect2">
@ -191,9 +191,9 @@ Data in column names</h2>
#&gt; # wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, wk46 &lt;dbl&gt;, wk47 &lt;dbl&gt;, wk48 &lt;dbl&gt;, …</pre> #&gt; # wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, wk46 &lt;dbl&gt;, wk47 &lt;dbl&gt;, wk48 &lt;dbl&gt;, …</pre>
</div> </div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p> <p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we’ll use <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>. After the data, there are three key arguments:</p> <p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
<ul><li> <ul><li>
<code>cols</code> specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> so here we could use <code>!c(artist, track, date.entered)</code> or <code>starts_with("wk")</code>.</li> <code>cols</code> specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> so here we could use <code>!c(artist, track, date.entered)</code> or <code>starts_with("wk")</code>.</li>
<li> <li>
<code>names_to</code> names the variable stored in the column names, here <code>"week"</code>.</li> <code>names_to</code> names the variable stored in the column names, here <code>"week"</code>.</li>
<li> <li>
@ -221,7 +221,7 @@ Data in column names</h2>
#&gt; 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA #&gt; 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
#&gt; # … with 24,082 more rows</pre> #&gt; # … with 24,082 more rows</pre>
</div> </div>
<p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p> <p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard |&gt; <pre data-type="programlisting" data-code-language="downlit">billboard |&gt;
pivot_longer( pivot_longer(
@ -242,7 +242,7 @@ Data in column names</h2>
#&gt; # … with 5,301 more rows</pre> #&gt; # … with 5,301 more rows</pre>
</div> </div>
<p>You might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns <code>wk77</code>, <code>wk78</code>, … would be added to the dataset.</p> <p>You might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns <code>wk77</code>, <code>wk78</code>, … would be added to the dataset.</p>
<p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code>. <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p> <p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">readr::parse_number()</a></code>. <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p>
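<p>As a minimal sketch of <code>parse_number()</code> on its own, the strings below are made up for illustration rather than taken from <code>billboard</code>, and the code assumes readr is installed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical inputs written in the same style as the week column names
weeks &lt;- c("wk1", "wk10", "wk76")

# parse_number() ignores the non-numeric prefix and returns a numeric vector
week_numbers &lt;- parse_number(weeks)
week_numbers</pre>
</div>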
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard_tidy &lt;- billboard |&gt; <pre data-type="programlisting" data-code-language="downlit">billboard_tidy &lt;- billboard |&gt;
pivot_longer( pivot_longer(
@ -364,7 +364,7 @@ Many variables in column names</h2>
#&gt; # ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;, ep_m_4554 &lt;dbl&gt;, ep_m_5564 &lt;dbl&gt;, …</pre> #&gt; # ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;, ep_m_4554 &lt;dbl&gt;, ep_m_5564 &lt;dbl&gt;, …</pre>
</div> </div>
<p>This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>65</code>, is the <code>age</code> range.</p> <p>This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>65</code>, is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p> <p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">who2 |&gt; <pre data-type="programlisting" data-code-language="downlit">who2 |&gt;
pivot_longer( pivot_longer(
@ -434,7 +434,7 @@ Data and variable names in the column headers</h2>
#&gt; 6 4 1 2004-10-10 Craig #&gt; 6 4 1 2004-10-10 Craig
#&gt; # … with 3 more rows</pre> #&gt; # … with 3 more rows</pre>
</div> </div>
<p>We again use <code>values_drop_na = TRUE</code>, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> to convert (e.g.) <code>child1</code> into 1.</p> <p>We again use <code>values_drop_na = TRUE</code>, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> to convert (e.g.) <code>child1</code> into 1.</p>
<p><a href="#fig-pivot-names-and-values" data-type="xref">#fig-pivot-names-and-values</a> illustrates the basic idea with a simpler example. When you use <code>".value"</code> in <code>names_to</code>, the column names in the input contribute to both values and variable names in the output.</p> <p><a href="#fig-pivot-names-and-values" data-type="xref">#fig-pivot-names-and-values</a> illustrates the basic idea with a simpler example. When you use <code>".value"</code> in <code>names_to</code>, the column names in the input contribute to both values and variable names in the output.</p>
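<p>The same idea can be sketched in code. The <code>families</code> tibble below is hypothetical data invented for this illustration; each column name has the form <code>variable_child</code>, so the first piece becomes output column names via <code>".value"</code> and the second piece becomes the values of a new <code>child</code> column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# Hypothetical input: one row per family, two variables recorded per child
families &lt;- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~name_child1, ~name_child2,
  1,       "2001-03-04", "2003-07-08", "Ana",        "Ben",
  2,       "1999-01-02", NA,           "Cleo",       NA
)

families_long &lt;- families |&gt;
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),  # first piece names the output columns
    names_sep = "_",
    values_drop_na = TRUE             # drop the child that family 2 does not have
  )
families_long</pre>
</div>
<p>The result should have one row per child, with <code>dob</code> and <code>name</code> as ordinary columns next to <code>family</code> and <code>child</code>.</p>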
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -449,7 +449,7 @@ Data and variable names in the column headers</h2>
<section id="widening-data" data-type="sect2"> <section id="widening-data" data-type="sect2">
<h2> <h2>
Widening data</h2> Widening data</h2>
<p>So far we’ve used <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p> <p>So far we’ve used <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p>
<p>We’ll start by looking at <code>cms_patient_experience</code>, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:</p> <p>We’ll start by looking at <code>cms_patient_experience</code>, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience <pre data-type="programlisting" data-code-language="downlit">cms_patient_experience
@ -464,7 +464,7 @@ Widening data</h2>
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM… 24 #&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM… 24
#&gt; # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre> #&gt; # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
</div> </div>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code>:</p> <p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt; <pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
distinct(measure_cd, measure_title) distinct(measure_cd, measure_title)
@ -479,7 +479,7 @@ Widening data</h2>
#&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre> #&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre>
</div> </div>
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p> <p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> has the opposite interface to <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p> <p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> has the opposite interface to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt; <pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
pivot_wider( pivot_wider(
@ -499,7 +499,7 @@ Widening data</h2>
#&gt; # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8, #&gt; # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
#&gt; # ⁷CAHPS_GRP_12</pre> #&gt; # ⁷CAHPS_GRP_12</pre>
</div> </div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organisation. That’s because, by default, <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organisation. To fix this problem we need to tell <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p> <p>The output doesn’t look quite right; we still seem to have multiple rows for each organisation. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organisation. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt; <pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
pivot_wider( pivot_wider(
@ -525,7 +525,7 @@ Widening data</h2>
<section id="how-does-pivot_wider-work" data-type="sect2"> <section id="how-does-pivot_wider-work" data-type="sect2">
<h2> <h2>
How does <code>pivot_wider()</code> work?</h2> How does <code>pivot_wider()</code> work?</h2>
<p>To understand how <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> works, let’s again start with a very simple dataset:</p> <p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~id, ~name, ~value, ~id, ~name, ~value,
@ -549,8 +549,8 @@ How does<code>pivot_wider()</code> work?</h2>
#&gt; 1 A 1 4 5 #&gt; 1 A 1 4 5
#&gt; 2 B 3 2 NA</pre> #&gt; 2 B 3 2 NA</pre>
</div> </div>
<p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p> <p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p>
<p>To begin the process, <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> needs to first figure out what will go in the rows and columns. Finding the column names is easy: it’s just the values of <code>name</code>.</p> <p>To begin the process, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> needs to first figure out what will go in the rows and columns. Finding the column names is easy: it’s just the values of <code>name</code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
distinct(name) distinct(name)
@ -572,7 +572,7 @@ How does<code>pivot_wider()</code> work?</h2>
#&gt; 1 A #&gt; 1 A
#&gt; 2 B</pre> #&gt; 2 B</pre>
</div> </div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> then combines these results to generate an empty data frame:</p> <p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> then combines these results to generate an empty data frame:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
select(-name, -value) |&gt; select(-name, -value) |&gt;
@ -584,7 +584,7 @@ How does<code>pivot_wider()</code> work?</h2>
#&gt; 1 A NA NA NA #&gt; 1 A NA NA NA
#&gt; 2 B NA NA NA</pre> #&gt; 2 B NA NA NA</pre>
</div> </div>
<p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input, as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p> <p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input, as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
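<p>A small sketch of how such a missing cell arises, and of one way to replace it when a default makes sense for your data. The <code>df_small</code> tibble is hypothetical, and <code>values_fill</code> is an optional argument of <code>pivot_wider()</code> that this section does not otherwise use:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyr)
library(tibble)

# Hypothetical input with no row for id "B" and name "z"
df_small &lt;- tribble(
  ~id, ~name, ~value,
  "A", "x",   1,
  "A", "y",   2,
  "A", "z",   3,
  "B", "x",   4,
  "B", "y",   5
)

# Default behaviour: the B/z cell has no source row, so it is filled with NA
wide_na &lt;- df_small |&gt;
  pivot_wider(names_from = name, values_from = value)

# values_fill supplies a value for cells that have no source row
wide_filled &lt;- df_small |&gt;
  pivot_wider(names_from = name, values_from = value, values_fill = 0)
wide_filled</pre>
</div>
<p>Whether 0 (or any other value) is a sensible replacement depends on what the missing cell means in your data, so treat this as an option rather than a default.</p>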
<p>You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:</p> <p>You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
@ -634,13 +634,13 @@ How does<code>pivot_wider()</code> work?</h2>
<section id="untidy-data" data-type="sect1"> <section id="untidy-data" data-type="sect1">
<h1> <h1>
Untidy data</h1> Untidy data</h1>
<p>While <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> is occasionally useful for making tidy data, its real strength is making <strong>untidy</strong> data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.</p> <p>While <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> is occasionally useful for making tidy data, its real strength is making <strong>untidy</strong> data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.</p>
<p>The following sections will show a few examples of <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.</p> <p>The following sections will show a few examples of <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.</p>
<section id="presenting-data-to-humans" data-type="sect2"> <section id="presenting-data-to-humans" data-type="sect2">
<h2> <h2>
Presenting data to humans</h2> Presenting data to humans</h2>
<p>As you’ve seen, <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p> <p>As you’ve seen, <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(clarity, color) count(clarity, color)
@ -655,7 +655,7 @@ Presenting data to humans</h2>
#&gt; 6 I1 I 92 #&gt; 6 I1 I 92
#&gt; # … with 50 more rows</pre> #&gt; # … with 50 more rows</pre>
</div> </div>
<p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> to create a form more suitable for display to other humans:</p> <p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to create a form more suitable for display to other humans:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(clarity, color) |&gt; count(clarity, color) |&gt;
@ -674,8 +674,8 @@ Presenting data to humans</h2>
#&gt; 6 VVS2 553 991 975 1443 608 365 131 #&gt; 6 VVS2 553 991 975 1443 608 365 131
#&gt; # … with 2 more rows</pre> #&gt; # … with 2 more rows</pre>
</div> </div>
<p>This display also makes it easy to compare in two directions, horizontally and vertically, much like <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code>.</p> <p>This display also makes it easy to compare in two directions, horizontally and vertically, much like <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>.</p>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like <a href="#chp-https://gt.rstudio" data-type="xref">#chp-https://gt.rstudio</a>. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.</p> <p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like <a href="https://gt.rstudio.com">gt</a>. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.</p>
</section> </section>
<section id="multivariate-statistics" data-type="sect2"> <section id="multivariate-statistics" data-type="sect2">
@ -705,7 +705,7 @@ col_year
#&gt; 6 Austral… 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37 4.43 #&gt; 6 Austral… 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37 4.43
#&gt; # … with 136 more rows, and 2 more variables: `2002` &lt;dbl&gt;, `2007` &lt;dbl&gt;</pre> #&gt; # … with 136 more rows, and 2 more variables: `2002` &lt;dbl&gt;, `2007` &lt;dbl&gt;</pre>
</div> </div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p> <p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">col_year &lt;- col_year |&gt; <pre data-type="programlisting" data-code-language="downlit">col_year &lt;- col_year |&gt;
column_to_rownames("country") column_to_rownames("country")
@ -727,7 +727,7 @@ head(col_year)
#&gt; Australia 4.340224 4.369675 4.431331 4.486965 4.537005</pre> #&gt; Australia 4.340224 4.369675 4.431331 4.486965 4.537005</pre>
</div> </div>
<p>This makes a data frame, because tibbles don’t support row names<span data-type="footnote">tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.</span>.</p> <p>This makes a data frame, because tibbles don’t support row names<span data-type="footnote">tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.</span>.</p>
<p>We’re now ready to cluster with (e.g.) <code><a href="#chp-https://rdrr.io/r/stats/kmeans" data-type="xref">#chp-https://rdrr.io/r/stats/kmeans</a></code>:</p> <p>We’re now ready to cluster with (e.g.) <code><a href="https://rdrr.io/r/stats/kmeans.html">kmeans()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cluster &lt;- stats::kmeans(col_year, centers = 6)</pre> <pre data-type="programlisting" data-code-language="downlit">cluster &lt;- stats::kmeans(col_year, centers = 6)</pre>
</div> </div>
@ -859,7 +859,7 @@ Pragmatic computation</h2>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>, which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present it to humans, feed it into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the <a href="#chp-https://www.jstatsoft.org/article/view/v059i10" data-type="xref">#chp-https://www.jstatsoft.org/article/view/v059i10</a> paper published in the Journal of Statistical Software.</p> <p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which allow you to tidy up many untidy datasets. 
Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present it to humans, feed it into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the <a href="https://www.jstatsoft.org/article/view/v059i10">Tidy Data</a> paper published in the Journal of Statistical Software.</p>
<p>In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code “tidy” (ha!) in order to make it easy for you and others to read and understand your code.</p> <p>In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code “tidy” (ha!) in order to make it easy for you and others to read and understand your code.</p>


@ -30,13 +30,13 @@ library(tidyverse)
#&gt; ✖ dplyr::filter() masks stats::filter() #&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()</pre> #&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div> </div>
<p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="#chp-https://rdrr.io/r/stats/filter" data-type="xref">#chp-https://rdrr.io/r/stats/filter</a></code> and <code><a href="#chp-https://rdrr.io/r/stats/lag" data-type="xref">#chp-https://rdrr.io/r/stats/lag</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p> <p>Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: <code><a href="https://rdrr.io/r/stats/filter.html">stats::filter()</a></code> and <code><a href="https://rdrr.io/r/stats/lag.html">stats::lag()</a></code>. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: <code>packagename::functionname()</code>.</p>
</section> </section>
<section id="nycflights13" data-type="sect2"> <section id="nycflights13" data-type="sect2">
<h2> <h2>
nycflights13</h2> nycflights13</h2>
<p>To explore the basic dplyr verbs, we’re going to use <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="#chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0" data-type="xref">#chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0</a>, and is documented in <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>.</p> <p>To explore the basic dplyr verbs, we’re going to use <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US <a href="http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&amp;Link=0">Bureau of Transportation Statistics</a>, and is documented in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">?flights</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights <pre data-type="programlisting" data-code-language="downlit">flights
#&gt; # A tibble: 336,776 × 19 #&gt; # A tibble: 336,776 × 19
@ -81,13 +81,13 @@ dplyr basics</h2>
<section id="rows" data-type="sect1"> <section id="rows" data-type="sect1">
<h1> <h1>
Rows</h1> Rows</h1>
<p>The most important verbs that operate on rows are <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, which changes which rows are present without changing their order, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.</p> <p>The most important verbs that operate on rows are <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, which changes which rows are present without changing their order, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.</p>
<section id="filter" data-type="sect2"> <section id="filter" data-type="sect2">
<h2> <h2>
<code>filter()</code> <code>filter()</code>
</h2> </h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> allows you to keep rows based on the values of the columns<span data-type="footnote">Later, you’ll learn about the <code>slice_*()</code> family which allows you to choose rows based on their positions.</span>. The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(arr_delay &gt; 120) filter(arr_delay &gt; 120)
@ -161,7 +161,7 @@ flights |&gt;
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p> <p>We’ll come back to these comparisons and logical operators in more detail in <a href="#chp-logicals" data-type="xref">#chp-logicals</a>.</p>
<p>When you run <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p> <p>When you run <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing <code>flights</code> dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <code>&lt;-</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">jan1 &lt;- flights |&gt; <pre data-type="programlisting" data-code-language="downlit">jan1 &lt;- flights |&gt;
filter(month == 1 &amp; day == 1)</pre> filter(month == 1 &amp; day == 1)</pre>
@ -171,7 +171,7 @@ flights |&gt;
<section id="common-mistakes" data-type="sect2"> <section id="common-mistakes" data-type="sect2">
<h2> <h2>
Common mistakes</h2> Common mistakes</h2>
<p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> will let you know when this happens:</p> <p>When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality. <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> will let you know when this happens:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month = 1) filter(month = 1)
@ -192,7 +192,7 @@ Common mistakes</h2>
<h2> <h2>
<code>arrange()</code> <code>arrange()</code>
</h2> </h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p> <p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(year, month, day, dep_time) arrange(year, month, day, dep_time)
@ -210,7 +210,7 @@ Common mistakes</h2>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>You can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/desc" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/desc</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p> <p>You can use <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> to re-order by a column in descending order. For example, this code shows the most delayed flights:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(desc(dep_delay)) arrange(desc(dep_delay))
@ -228,7 +228,7 @@ Common mistakes</h2>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>You can combine <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> to solve more complex problems. For example, we could look for the flights that left roughly on time but were most delayed on arrival:</p> <p>You can combine <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to solve more complex problems. For example, we could look for the flights that left roughly on time but were most delayed on arrival:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt; filter(dep_delay &lt;= 10 &amp; dep_delay &gt;= -10) |&gt;
@ -264,20 +264,20 @@ Exercises</h2>
<li><p>Sort <code>flights</code> to find the flights with longest departure delays. Find the flights that left earliest in the morning.</p></li> <li><p>Sort <code>flights</code> to find the flights with longest departure delays. Find the flights that left earliest in the morning.</p></li>
<li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li> <li><p>Sort <code>flights</code> to find the fastest flights (Hint: try sorting by a calculation).</p></li>
<li><p>Which flights traveled the farthest? Which traveled the shortest?</p></li> <li><p>Which flights traveled the farthest? Which traveled the shortest?</p></li>
<li><p>Does it matter what order you use <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li> <li><p>Does it matter what order you use <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.</p></li>
</ol></section> </ol></section>
</section> </section>
<section id="columns" data-type="sect1"> <section id="columns" data-type="sect1">
<h1> <h1>
Columns</h1> Columns</h1>
<p>There are four important verbs that affect the columns without changing the rows: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code>. <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> creates new columns that are functions of the existing columns; <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> change which columns are present, their names, or their positions.</p> <p>There are four important verbs that affect the columns without changing the rows: <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>. 
<code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> creates new columns that are functions of the existing columns; <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> change which columns are present, their names, or their positions.</p>
<section id="sec-mutate" data-type="sect2"> <section id="sec-mutate" data-type="sect2">
<h2> <h2>
<code>mutate()</code> <code>mutate()</code>
</h2> </h2>
<p>The job of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p> <p>The job of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the <code>gain</code>, how much time a delayed flight made up in the air, and the <code>speed</code> in miles per hour:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate( mutate(
@ -299,7 +299,7 @@ Columns</h1>
#&gt; # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, #&gt; # variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#&gt; # ⁵arr_delay</pre> #&gt; # ⁵arr_delay</pre>
</div> </div>
<p>By default, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code>.</span>:</p> <p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the <code>.before</code> argument to instead add the variables to the left hand side<span data-type="footnote">Remember that in RStudio, the easiest way to see a dataset with many columns is <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>.</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate( mutate(
@ -369,7 +369,7 @@ Columns</h1>
<h2> <h2>
<code>select()</code> <code>select()</code>
</h2> </h2>
<p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p> <p>It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Select columns by name <pre data-type="programlisting" data-code-language="downlit"># Select columns by name
flights |&gt; flights |&gt;
@ -430,7 +430,7 @@ flights |&gt;
#&gt; 6 UA N39463 EWR ORD #&gt; 6 UA N39463 EWR ORD
#&gt; # … with 336,770 more rows</pre> #&gt; # … with 336,770 more rows</pre>
</div> </div>
<p>There are a number of helper functions you can use within <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>:</p> <p>There are a number of helper functions you can use within <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<ul><li> <ul><li>
<code>starts_with("abc")</code>: matches names that begin with “abc”.</li> <code>starts_with("abc")</code>: matches names that begin with “abc”.</li>
<li> <li>
@ -439,8 +439,8 @@ flights |&gt;
<code>contains("ijk")</code>: matches names that contain “ijk”.</li> <code>contains("ijk")</code>: matches names that contain “ijk”.</li>
<li> <li>
<code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li> <code>num_range("x", 1:3)</code>: matches <code>x1</code>, <code>x2</code> and <code>x3</code>.</li>
</ul><p>See <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="#chp-https://tidyselect.r-lib.org/reference/starts_with" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/starts_with</a></code> to select variables that match a pattern.</p> </ul><p>See <code><a href="https://dplyr.tidyverse.org/reference/select.html">?select</a></code> for more details. Once you know regular expressions (the topic of <a href="#chp-regexps" data-type="xref">#chp-regexps</a>) you’ll also be able to use <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">matches()</a></code> to select variables that match a pattern.</p>
<p>You can rename variables as you <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p> <p>You can rename variables as you <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> them by using <code>=</code>. The new name appears on the left hand side of the <code>=</code>, and the old variable appears on the right hand side:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
select(tail_num = tailnum) select(tail_num = tailnum)
@ -461,7 +461,7 @@ flights |&gt;
<h2> <h2>
<code>rename()</code> <code>rename()</code>
</h2> </h2>
<p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code> instead of <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>:</p> <p>If you want to keep all the existing variables and just rename a few, you can use <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> instead of <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
rename(tail_num = tailnum) rename(tail_num = tailnum)
@ -479,15 +479,15 @@ flights |&gt;
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>It works exactly the same way as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, but keeps all the variables that aren’t explicitly selected.</p> <p>It works exactly the same way as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, but keeps all the variables that aren’t explicitly selected.</p>
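<p>As a quick check of that difference, the following sketch counts the columns each verb returns. It assumes dplyr and nycflights13 are installed, as elsewhere in this chapter, and uses only functions already introduced plus base R’s <code>ncol()</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(nycflights13)

# select() keeps only the column it names (here under a new name)
flights |&gt;
  select(tail_num = tailnum) |&gt;
  ncol()

# rename() keeps every column, so the count matches ncol(flights)
flights |&gt;
  rename(tail_num = tailnum) |&gt;
  ncol()</pre>
</div>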
<p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="#chp-https://rdrr.io/pkg/janitor/man/clean_names" data-type="xref">#chp-https://rdrr.io/pkg/janitor/man/clean_names</a></code> which provides some useful automated cleaning.</p> <p>If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out <code><a href="https://rdrr.io/pkg/janitor/man/clean_names.html">janitor::clean_names()</a></code> which provides some useful automated cleaning.</p>
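<p>A minimal sketch of what that cleaning looks like, assuming the janitor package is installed. The <code>messy</code> tibble and its column names are invented for illustration and are not part of any dataset used in this book.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tibble)

# Hypothetical data frame with inconsistently named columns
messy &lt;- tibble(
  `First Name` = c("Ada", "Grace"),
  `AGE (years)` = c(36, 45)
)

# clean_names() rewrites the column names into a consistent style
# (snake_case by default); the data itself is not modified
messy |&gt;
  janitor::clean_names() |&gt;
  names()</pre>
</div>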
</section> </section>
<section id="relocate" data-type="sect2"> <section id="relocate" data-type="sect2">
<h2> <h2>
<code>relocate()</code> <code>relocate()</code>
</h2> </h2>
<p>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> moves variables to the front:</p> <p>Use <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> to move variables around. You might want to collect related variables together or move important variables to the front. By default <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> moves variables to the front:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
relocate(time_hour, air_time) relocate(time_hour, air_time)
@ -505,7 +505,7 @@ flights |&gt;
#&gt; # dest &lt;chr&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, and abbreviated #&gt; # dest &lt;chr&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;, and abbreviated
#&gt; # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time</pre> #&gt; # variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time</pre>
</div> </div>
<p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to choose where to put them:</p> <p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
relocate(year:dep_time, .after = time_hour) relocate(year:dep_time, .after = time_hour)
@ -549,9 +549,9 @@ Exercises</h2>
<ol type="1"><li><p>Compare <code>air_time</code> with <code>arr_time - dep_time</code>. What do you expect to see? What do you see? What do you need to do to fix it?</p></li> <ol type="1"><li><p>Compare <code>air_time</code> with <code>arr_time - dep_time</code>. What do you expect to see? What do you see? What do you need to do to fix it?</p></li>
<li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code>, and <code>dep_delay</code>. How would you expect those three numbers to be related?</p></li> <li><p>Compare <code>dep_time</code>, <code>sched_dep_time</code>, and <code>dep_delay</code>. How would you expect those three numbers to be related?</p></li>
<li><p>Brainstorm as many ways as possible to select <code>dep_time</code>, <code>dep_delay</code>, <code>arr_time</code>, and <code>arr_delay</code> from <code>flights</code>.</p></li> <li><p>Brainstorm as many ways as possible to select <code>dep_time</code>, <code>dep_delay</code>, <code>arr_time</code>, and <code>arr_delay</code> from <code>flights</code>.</p></li>
<li><p>What happens if you include the name of a variable multiple times in a <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> call?</p></li> <li><p>What happens if you include the name of a variable multiple times in a <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> call?</p></li>
<li> <li>
<p>What does the <code><a href="#chp-https://tidyselect.r-lib.org/reference/all_of" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/all_of</a></code> function do? Why might it be helpful in conjunction with this vector?</p> <p>What does the <code><a href="https://tidyselect.r-lib.org/reference/all_of.html">any_of()</a></code> function do? Why might it be helpful in conjunction with this vector?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">variables &lt;- c("year", "month", "day", "dep_delay", "arr_delay")</pre> <pre data-type="programlisting" data-code-language="downlit">variables &lt;- c("year", "month", "day", "dep_delay", "arr_delay")</pre>
</div> </div>
@ -568,13 +568,13 @@ Exercises</h2>
<section id="groups" data-type="sect1"> <section id="groups" data-type="sect1">
<h1> <h1>
Groups</h1> Groups</h1>
<p>So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, and the slice family of functions.</p> <p>So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, and the slice family of functions.</p>
<section id="group_by" data-type="sect2"> <section id="group_by" data-type="sect2">
<h2> <h2>
<code>group_by()</code> <code>group_by()</code>
</h2> </h2>
<p>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> to divide your dataset into groups meaningful for your analysis:</p> <p>Use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> to divide your dataset into groups meaningful for your analysis:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(month) group_by(month)
@ -593,14 +593,14 @@ Groups</h1>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”.</p> <p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”.</p>
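<p>One way to confirm this, sketched under the assumption that dplyr and nycflights13 are installed: <code>group_vars()</code> reports the grouping variables, while the number of rows stays the same. The name <code>by_month</code> is ours, chosen for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(nycflights13)

by_month &lt;- flights |&gt;
  group_by(month)

# The grouping is stored as metadata on the data frame
group_vars(by_month)

# The rows themselves are untouched by group_by()
nrow(by_month) == nrow(flights)</pre>
</div>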
</section> </section>
<section id="sec-summarize" data-type="sect2"> <section id="sec-summarize" data-type="sect2">
<h2> <h2>
<code>summarize()</code> <code>summarize()</code>
</h2> </h2>
<p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slight simplification; later on you’ll learn how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p> <p>The most important grouped operation is a summary. It collapses each group to a single row<span data-type="footnote">This is a slight simplification; later on you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> to produce multiple summary rows for each group.</span>. Here we compute the average departure delay by month:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(month) |&gt; group_by(month) |&gt;
@ -636,7 +636,7 @@ Groups</h1>
#&gt; 6 6 20.8 #&gt; 6 6 20.8
#&gt; # … with 6 more rows</pre> #&gt; # … with 6 more rows</pre>
</div> </div>
<p>You can create any number of summaries in a single call to <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code>, which returns the number of rows in each group:</p> <p>You can create any number of summaries in a single call to <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>, which returns the number of rows in each group:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(month) |&gt; group_by(month) |&gt;
@ -692,7 +692,7 @@ The<code>slice_</code> functions</h2>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>This is similar to computing the max delay with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, but you get the whole row instead of the single summary:</p> <p>This is similar to computing the max delay with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>, but you get the whole row instead of the single summary:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt; group_by(dest) |&gt;
@ -761,7 +761,7 @@ daily
<section id="ungrouping" data-type="sect2"> <section id="ungrouping" data-type="sect2">
<h2> <h2>
Ungrouping</h2> Ungrouping</h2>
<p>You might also want to remove grouping outside of <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. You can do this with <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>.</p> <p>You might also want to remove grouping outside of <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. You can do this with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">ungroup()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">daily |&gt; <pre data-type="programlisting" data-code-language="downlit">daily |&gt;
ungroup() |&gt; ungroup() |&gt;
@ -783,15 +783,15 @@ Exercises</h2>
<ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li> <ol type="1"><li><p>Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about <code>flights |&gt; group_by(carrier, dest) |&gt; summarize(n())</code>)</p></li>
<li><p>Find the most delayed flight to each destination.</p></li> <li><p>Find the most delayed flight to each destination.</p></li>
<li><p>How do delays vary over the course of the day? Illustrate your answer with a plot.</p></li> <li><p>How do delays vary over the course of the day? Illustrate your answer with a plot.</p></li>
<li><p>What happens if you supply a negative <code>n</code> to <code><a href="#chp-https://dplyr.tidyverse.org/reference/slice" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/slice</a></code> and friends?</p></li> <li><p>What happens if you supply a negative <code>n</code> to <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code> and friends?</p></li>
<li><p>Explain what <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> does in terms of the dplyr verbs you just learned. What does the <code>sort</code> argument to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> do?</p></li> <li><p>Explain what <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> does in terms of the dplyr verbs you just learned. What does the <code>sort</code> argument to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> do?</p></li>
</ol></section> </ol></section>
</section> </section>
<section id="sec-sample-size" data-type="sect1"> <section id="sec-sample-size" data-type="sect1">
<h1> <h1>
Case study: aggregates and sample size</h1> Case study: aggregates and sample size</h1>
<p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p> <p>Whenever you do any aggregation, it’s always a good idea to include a count (<code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">delays &lt;- flights |&gt; <pre data-type="programlisting" data-code-language="downlit">delays &lt;- flights |&gt;
filter(!is.na(arr_delay), !is.na(tailnum)) |&gt; filter(!is.na(arr_delay), !is.na(tailnum)) |&gt;
@ -882,7 +882,7 @@ batters
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>), those that manipulate the columns (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>), and those that manipulate groups (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p> <p>In this chapter, you’ve learned the tools that dplyr provides for working with data frames. 
The tools are roughly grouped into three categories: those that manipulate the rows (like <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>), those that manipulate the columns (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>), and those that manipulate groups (like <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.</p>
<p>For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, <code>|&gt;</code>, why we recommend it, and a little of the history that led from magrittr’s <code>%&gt;%</code> to base R’s <code>|&gt;</code>.</p> <p>For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, <code>|&gt;</code>, why we recommend it, and a little of the history that led from magrittr’s <code>%&gt;%</code> to base R’s <code>|&gt;</code>.</p>


@ -24,7 +24,7 @@ Prerequisites</h2>
#&gt; ✖ dplyr::lag() masks stats::lag()</pre> #&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div> </div>
<p>That one line of code loads the core tidyverse: the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).</p> <p>That one line of code loads the core tidyverse: the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).</p>
<p>If you run this code and get the error message “there is no package called ‘tidyverse’”, you’ll need to first install it, then run <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code> once again.</p> <p>If you run this code and get the error message “there is no package called ‘tidyverse’”, you’ll need to first install it, then run <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> once again.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse") <pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse")
library(tidyverse)</pre> library(tidyverse)</pre>
@ -41,7 +41,7 @@ First steps</h1>
<section id="the-mpg-data-frame" data-type="sect2"> <section id="the-mpg-data-frame" data-type="sect2">
<h2> <h2>
The <code>mpg</code> data frame</h2> The <code>mpg</code> data frame</h2>
<p>You can test your answer with the <code>mpg</code> <strong>data frame</strong> found in ggplot2 (a.k.a. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/mpg" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/mpg</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>mpg</code> contains observations collected by the US Environmental Protection Agency on 38 car models.</p> <p>You can test your answer with the <code>mpg</code> <strong>data frame</strong> found in ggplot2 (a.k.a. <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">ggplot2::mpg</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>mpg</code> contains observations collected by the US Environmental Protection Agency on 38 car models.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mpg <pre data-type="programlisting" data-code-language="downlit">mpg
#&gt; # A tibble: 234 × 11 #&gt; # A tibble: 234 × 11
@ -58,7 +58,7 @@ The<code>mpg</code> data frame</h2>
<p>Among the variables in <code>mpg</code> are:</p> <p>Among the variables in <code>mpg</code> are:</p>
<ol type="1"><li><p><code>displ</code>, a car’s engine size, in liters.</p></li> <ol type="1"><li><p><code>displ</code>, a car’s engine size, in liters.</p></li>
<li><p><code>hwy</code>, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.</p></li> <li><p><code>hwy</code>, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.</p></li>
</ol><p>To learn more about <code>mpg</code>, open its help page by running <code><a href="#chp-https://ggplot2.tidyverse.org/reference/mpg" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/mpg</a></code>.</p> </ol><p>To learn more about <code>mpg</code>, open its help page by running <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code>.</p>
</section> </section>
<section id="creating-a-ggplot" data-type="sect2"> <section id="creating-a-ggplot" data-type="sect2">
@ -73,9 +73,9 @@ Creating a ggplot</h2>
</div> </div>
</div> </div>
<p>The plot shows a negative relationship between engine size (<code>displ</code>) and fuel efficiency (<code>hwy</code>). In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. Does this confirm or refute your hypothesis about fuel efficiency and engine size?</p> <p>The plot shows a negative relationship between engine size (<code>displ</code>) and fuel efficiency (<code>hwy</code>). In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. Does this confirm or refute your hypothesis about fuel efficiency and engine size?</p>
<p>With ggplot2, you begin a plot with the function <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code>. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code> creates a coordinate system that you can add layers to. The first argument of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code> is the dataset to use in the graph. So <code>ggplot(data = mpg)</code> creates an empty graph, but it’s not very interesting so we won’t show it here.</p> <p>With ggplot2, you begin a plot with the function <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> creates a coordinate system that you can add layers to. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> is the dataset to use in the graph. So <code>ggplot(data = mpg)</code> creates an empty graph, but it’s not very interesting so we won’t show it here.</p>
<p>You complete your graph by adding one or more layers to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code>. The function <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_point</a></code> adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.</p> <p>You complete your graph by adding one or more layers to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. The function <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code> adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.</p>
<p>Each geom function in ggplot2 takes a <code>mapping</code> argument. This defines how variables in your dataset are mapped to visual properties of your plot. The <code>mapping</code> argument is always paired with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code>, and the <code>x</code> and <code>y</code> arguments of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the <code>data</code> argument, in this case, <code>mpg</code>.</p> <p>Each geom function in ggplot2 takes a <code>mapping</code> argument. This defines how variables in your dataset are mapped to visual properties of your plot. The <code>mapping</code> argument is always paired with <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>, and the <code>x</code> and <code>y</code> arguments of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the <code>data</code> argument, in this case, <code>mpg</code>.</p>
</section> </section>
<section id="a-graphing-template" data-type="sect2"> <section id="a-graphing-template" data-type="sect2">
@ -94,7 +94,7 @@ A graphing template</h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>Run <code>ggplot(data = mpg)</code>. What do you see?</p></li> <ol type="1"><li><p>Run <code>ggplot(data = mpg)</code>. What do you see?</p></li>
<li><p>How many rows are in <code>mpg</code>? How many columns?</p></li> <li><p>How many rows are in <code>mpg</code>? How many columns?</p></li>
<li><p>What does the <code>drv</code> variable describe? Read the help for <code><a href="#chp-https://ggplot2.tidyverse.org/reference/mpg" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/mpg</a></code> to find out.</p></li> <li><p>What does the <code>drv</code> variable describe? Read the help for <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code> to find out.</p></li>
<li><p>Make a scatterplot of <code>hwy</code> vs <code>cyl</code>.</p></li> <li><p>Make a scatterplot of <code>hwy</code> vs <code>cyl</code>.</p></li>
<li><p>What happens if you make a scatterplot of <code>class</code> vs <code>drv</code>? Why is the plot not useful?</p></li> <li><p>What happens if you make a scatterplot of <code>class</code> vs <code>drv</code>? Why is the plot not useful?</p></li>
</ol></section> </ol></section>
@ -128,7 +128,7 @@ Aesthetic mappings</h1>
</div> </div>
</div> </div>
<p>(If you prefer British English, like Hadley, you can use <code>colour</code> instead of <code>color</code>.)</p> <p>(If you prefer British English, like Hadley, you can use <code>colour</code> instead of <code>color</code>.)</p>
<p>To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code>. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as <strong>scaling</strong>. ggplot2 will also add a legend that explains which levels correspond to which values.</p> <p>To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as <strong>scaling</strong>. ggplot2 will also add a legend that explains which levels correspond to which values.</p>
<p>The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.</p> <p>The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.</p>
<p>In the above example, we mapped <code>class</code> to the color aesthetic, but we could have mapped <code>class</code> to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a <em>warning</em> here: mapping an unordered variable (<code>class</code>) to an ordered aesthetic (<code>size</code>) is generally not a good idea because it implies a ranking that does not in fact exist.</p> <p>In the above example, we mapped <code>class</code> to the color aesthetic, but we could have mapped <code>class</code> to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a <em>warning</em> here: mapping an unordered variable (<code>class</code>) to an ordered aesthetic (<code>size</code>) is generally not a good idea because it implies a ranking that does not in fact exist.</p>
<div class="cell"> <div class="cell">
@ -160,7 +160,7 @@ ggplot(data = mpg) +
</div> </div>
</div> </div>
<p>What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.</p> <p>What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.</p>
<p>For each aesthetic, you use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> to associate the name of the aesthetic with a variable to display. The <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. The syntax highlights a useful insight about <code>x</code> and <code>y</code>: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.</p> <p>For each aesthetic, you use <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> to associate the name of the aesthetic with a variable to display. The <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. The syntax highlights a useful insight about <code>x</code> and <code>y</code>: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.</p>
<p>Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.</p> <p>Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.</p>
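<p>As a minimal sketch of the behavior described above (assuming ggplot2 is installed and loaded; the choice of <code>drv</code> is only illustrative), the following maps <code>drv</code> to color. ggplot2 should pick the color scale and draw a legend for <code>drv</code>, while <code>x</code> and <code>y</code> are explained by axes rather than legends:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(ggplot2)

# drv is mapped inside aes(), so ggplot2 chooses a color scale and adds a legend for it.
# displ and hwy are mapped to x and y, which get axis lines, tick marks, and labels instead.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))</pre>
</div>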
<p>You can also <em>set</em> the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:</p> <p>You can also <em>set</em> the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:</p>
<div class="cell"> <div class="cell">
@ -170,7 +170,7 @@ ggplot(data = mpg) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-12-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-12-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue." width="576"/></p>
</div> </div>
</div> </div>
<p>Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function. In other words, it goes <em>outside</em> of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code>. You’ll need to pick a value that makes sense for that aesthetic:</p> <p>Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function. In other words, it goes <em>outside</em> of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code>. You’ll need to pick a value that makes sense for that aesthetic:</p>
<ul><li>The name of a color as a character string.</li> <ul><li>The name of a color as a character string.</li>
<li>The size of a point in mm.</li> <li>The size of a point in mm.</li>
<li>The shape of a point as a number, as shown in <a href="#fig-shapes" data-type="xref">#fig-shapes</a>.</li> <li>The shape of a point as a number, as shown in <a href="#fig-shapes" data-type="xref">#fig-shapes</a>.</li>
@ -196,10 +196,10 @@ Exercises</h2>
</div> </div>
</div> </div>
</li> </li>
<li><p>Which variables in <code>mpg</code> are categorical? Which variables are continuous? (Hint: type <code><a href="#chp-https://ggplot2.tidyverse.org/reference/mpg" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/mpg</a></code> to read the documentation for the dataset). How can you see this information when you run <code>mpg</code>?</p></li> <li><p>Which variables in <code>mpg</code> are categorical? Which variables are continuous? (Hint: type <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code> to read the documentation for the dataset). How can you see this information when you run <code>mpg</code>?</p></li>
<li><p>Map a continuous variable to <code>color</code>, <code>size</code>, and <code>shape</code>. How do these aesthetics behave differently for categorical vs. continuous variables?</p></li> <li><p>Map a continuous variable to <code>color</code>, <code>size</code>, and <code>shape</code>. How do these aesthetics behave differently for categorical vs. continuous variables?</p></li>
<li><p>What happens if you map the same variable to multiple aesthetics?</p></li> <li><p>What happens if you map the same variable to multiple aesthetics?</p></li>
<li><p>What does the <code>stroke</code> aesthetic do? What shapes does it work with? (Hint: use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_point</a></code>)</p></li> <li><p>What does the <code>stroke</code> aesthetic do? What shapes does it work with? (Hint: use <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">?geom_point</a></code>)</p></li>
<li><p>What happens if you map an aesthetic to something other than a variable name, like <code>aes(color = displ &lt; 5)</code>? Note, you’ll also need to specify x and y.</p></li> <li><p>What happens if you map an aesthetic to something other than a variable name, like <code>aes(color = displ &lt; 5)</code>? Note, you’ll also need to specify x and y.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -220,7 +220,7 @@ Common problems</h1>
<h1> <h1>
Facets</h1> Facets</h1>
<p>One way to add additional variables to a plot is by mapping them to an aesthetic. Another way, which is particularly useful for categorical variables, is to split your plot into <strong>facets</strong>, subplots that each display one subset of the data.</p> <p>One way to add additional variables to a plot is by mapping them to an aesthetic. Another way, which is particularly useful for categorical variables, is to split your plot into <strong>facets</strong>, subplots that each display one subset of the data.</p>
<p>To facet your plot by a single variable, use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code>. The first argument of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code> is a formula<span data-type="footnote">Here “formula” is the name of the type of thing created by <code>~</code>, not a synonym for “equation”.</span>, which you create with <code>~</code> followed by a variable name. The variable that you pass to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code> should be discrete.</p> <p>To facet your plot by a single variable, use <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> is a formula<span data-type="footnote">Here “formula” is the name of the type of thing created by <code>~</code>, not a synonym for “equation”.</span>, which you create with <code>~</code> followed by a variable name. The variable that you pass to <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> should be discrete.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(x = displ, y = hwy)) +
@ -229,7 +229,7 @@ Facets</h1>
<p><img src="data-visualize_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by class, with facets spanning two rows." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-15-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by class, with facets spanning two rows." width="576"/></p>
</div> </div>
</div> </div>
<p>To facet your plot with the combination of two variables, switch from <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code> to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code>. The first argument of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code> is also a formula, but now it’s a double sided formula: <code>rows ~ cols</code>.</p> <p>To facet your plot with the combination of two variables, switch from <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> to <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. The first argument of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> is also a formula, but now it’s a double sided formula: <code>rows ~ cols</code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(x = displ, y = hwy)) +
@ -274,7 +274,7 @@ ggplot(data = mpg) +
</div> </div>
<p>What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?</p> <p>What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?</p>
</li> </li>
<li><p>Read <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code>. What does <code>nrow</code> do? What does <code>ncol</code> do? What other options control the layout of the individual panels? Why doesn’t <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code> have <code>nrow</code> and <code>ncol</code> arguments?</p></li> <li><p>Read <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">?facet_wrap</a></code>. What does <code>nrow</code> do? What does <code>ncol</code> do? What other options control the layout of the individual panels? Why doesn’t <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> have <code>nrow</code> and <code>ncol</code> arguments?</p></li>
<li> <li>
<p>Which of the following two plots makes it easier to compare engine size (<code>displ</code>) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?</p> <p>Which of the following two plots makes it easier to compare engine size (<code>displ</code>) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?</p>
<div class="cell"> <div class="cell">
@ -294,7 +294,7 @@ ggplot(data = mpg) +
</div> </div>
</li> </li>
<li> <li>
<p>Recreate this plot using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code> instead of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code>. How do the positions of the facet labels change?</p> <p>Recreate this plot using <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>. How do the positions of the facet labels change?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(x = displ, y = hwy)) +
@ -323,7 +323,7 @@ Geometric objects</h1>
</div> </div>
<p>Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different <strong>geoms</strong>.</p> <p>Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different <strong>geoms</strong>.</p>
<p>A <strong>geom</strong> is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.</p> <p>A <strong>geom</strong> is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.</p>
<p>To change the geom in your plot, change the geom function that you add to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code>. For instance, to make the plots above, you can use this code:</p> <p>To change the geom in your plot, change the geom function that you add to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. For instance, to make the plots above, you can use this code:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Left <pre data-type="programlisting" data-code-language="downlit"># Left
ggplot(data = mpg) + ggplot(data = mpg) +
@ -333,7 +333,7 @@ ggplot(data = mpg) +
ggplot(data = mpg) + ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))</pre> geom_smooth(mapping = aes(x = displ, y = hwy))</pre>
</div> </div>
<p>Every geom function in ggplot2 takes a <code>mapping</code> argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you <em>could</em> set the linetype of a line. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code> will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.</p> <p>Every geom function in ggplot2 takes a <code>mapping</code> argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you <em>could</em> set the linetype of a line. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))</pre> geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))</pre>
@ -341,7 +341,7 @@ ggplot(data = mpg) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-24-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-24-1.png" alt="A plot of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed." width="576"/></p>
</div> </div>
</div> </div>
<p>Here, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code> separates the cars into three lines based on their <code>drv</code> value, which describes a car’s drive train. One line describes all of the points that have a <code>4</code> value, one line describes all of the points that have an <code>f</code> value, and one line describes all of the points that have an <code>r</code> value. Here, <code>4</code> stands for four-wheel drive, <code>f</code> for front-wheel drive, and <code>r</code> for rear-wheel drive.</p> <p>Here, <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> separates the cars into three lines based on their <code>drv</code> value, which describes a car’s drive train. One line describes all of the points that have a <code>4</code> value, one line describes all of the points that have an <code>f</code> value, and one line describes all of the points that have an <code>r</code> value. Here, <code>4</code> stands for four-wheel drive, <code>f</code> for front-wheel drive, and <code>r</code> for rear-wheel drive.</p>
<p>If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to <code>drv</code>.</p> <p>If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to <code>drv</code>.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -349,8 +349,8 @@ ggplot(data = mpg) +
</div> </div>
</div> </div>
<p>Notice that this plot contains two geoms in the same graph! If this makes you excited, buckle up. You will learn how to place multiple geoms in the same plot very soon.</p> <p>Notice that this plot contains two geoms in the same graph! If this makes you excited, buckle up. You will learn how to place multiple geoms in the same plot very soon.</p>
<p>ggplot2 provides more than 40 geoms, and extension packages provide even more (see <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a> for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at <a href="https://rstudio.com/resources/cheatsheets" class="uri">https://rstudio.com/resources/cheatsheets</a>. To learn more about any single geom, use the help (e.g. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code>).</p> <p>ggplot2 provides more than 40 geoms, and extension packages provide even more (see <a href="https://exts.ggplot2.tidyverse.org/gallery/" class="uri">https://exts.ggplot2.tidyverse.org/gallery/</a> for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at <a href="https://rstudio.com/resources/cheatsheets" class="uri">https://rstudio.com/resources/cheatsheets</a>. To learn more about any single geom, use the help (e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">?geom_smooth</a></code>).</p>
<p>Many geoms, like <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code>, use a single geometric object to display multiple rows of data. For these geoms, you can set the <code>group</code> aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the <code>linetype</code> example). It is convenient to rely on this feature because the <code>group</code> aesthetic by itself does not add a legend or distinguishing features to the geoms.</p> <p>Many geoms, like <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>, use a single geometric object to display multiple rows of data. For these geoms, you can set the <code>group</code> aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the <code>linetype</code> example). It is convenient to rely on this feature because the <code>group</code> aesthetic by itself does not add a legend or distinguishing features to the geoms.</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy)) geom_smooth(mapping = aes(x = displ, y = hwy))
@ -377,7 +377,7 @@ ggplot(data = mpg) +
</div> </div>
</div> </div>
</div> </div>
<p>To display multiple geoms in the same plot, add multiple geom functions to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code>:</p> <p>To display multiple geoms in the same plot, add multiple geom functions to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(x = displ, y = hwy)) +
@ -386,7 +386,7 @@ ggplot(data = mpg) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-27-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed." width="576"/></p>
</div> </div>
</div> </div>
<p>This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display <code>cty</code> instead of <code>hwy</code>. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code>. ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:</p> <p>This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display <code>cty</code> instead of <code>hwy</code>. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code>. ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_point() +
@ -401,7 +401,7 @@ ggplot(data = mpg) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-29-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-29-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it." width="576"/></p>
</div> </div>
</div> </div>
<p>You can use the same idea to specify different <code>data</code> for each layer. Here, our smooth line displays just a subset of the <code>mpg</code> dataset, the subcompact cars. The local data argument in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code> overrides the global data argument in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code> for that layer only.</p> <p>You can use the same idea to specify different <code>data</code> for each layer. Here, our smooth line displays just a subset of the <code>mpg</code> dataset, the subcompact cars. The local data argument in <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> overrides the global data argument in <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> for that layer only.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) + geom_point(mapping = aes(color = class)) +
@ -410,7 +410,7 @@ ggplot(data = mpg) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-30-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-30-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it." width="576"/></p>
</div> </div>
</div> </div>
<p>(You’ll learn how <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> works in the chapter on data transformations: for now, just know that this command selects only the subcompact cars.)</p> <p>(You’ll learn how <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> works in the chapter on data transformations: for now, just know that this command selects only the subcompact cars.)</p>
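<p>For a rough idea of what such a subset looks like on its own, here is a small sketch (assuming the dplyr and ggplot2 packages are installed; the name <code>subcompacts</code> is only illustrative and does not appear elsewhere in the chapter):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Keep only the rows of mpg whose class is "subcompact"
subcompacts &lt;- filter(mpg, class == "subcompact")

# List the distinct class values left in the subset
unique(subcompacts$class)</pre>
</div>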
<section id="exercises-3" data-type="sect2"> <section id="exercises-3" data-type="sect2">
<h2> <h2>
@ -435,7 +435,7 @@ Exercises</h2>
</div> </div>
<p>What does <code>show.legend = FALSE</code> do here? What happens if you remove it? Why do you think we used it earlier?</p> <p>What does <code>show.legend = FALSE</code> do here? What happens if you remove it? Why do you think we used it earlier?</p>
</li> </li>
<li><p>What does the <code>se</code> argument to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code> do?</p></li> <li><p>What does the <code>se</code> argument to <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code> do?</p></li>
<li> <li>
<p>Will these two graphs look different? Why/why not?</p> <p>Will these two graphs look different? Why/why not?</p>
<div class="cell"> <div class="cell">
@ -483,7 +483,7 @@ ggplot() +
<section id="statistical-transformations" data-type="sect1"> <section id="statistical-transformations" data-type="sect1">
<h1> <h1>
Statistical transformations</h1> Statistical transformations</h1>
<p>Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>. The following chart displays the total number of diamonds in the <code>diamonds</code> dataset, grouped by <code>cut</code>. The <code>diamonds</code> dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the <code>price</code>, <code>carat</code>, <code>color</code>, <code>clarity</code>, and <code>cut</code> of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.</p> <p>Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code>. The following chart displays the total number of diamonds in the <code>diamonds</code> dataset, grouped by <code>cut</code>. The <code>diamonds</code> dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the <code>price</code>, <code>carat</code>, <code>color</code>, <code>clarity</code>, and <code>cut</code> of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))</pre> geom_bar(mapping = aes(x = cut))</pre>
@ -495,7 +495,7 @@ Statistical transformations</h1>
<ul><li><p>bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.</p></li> <ul><li><p>bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.</p></li>
<li><p>smoothers fit a model to your data and then plot predictions from the model.</p></li> <li><p>smoothers fit a model to your data and then plot predictions from the model.</p></li>
<li><p>boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.</p></li> <li><p>boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.</p></li>
</ul><p>The algorithm used to calculate new values for a graph is called a <strong>stat</strong>, short for statistical transformation. <a href="#fig-vis-stat-bar" data-type="xref">#fig-vis-stat-bar</a> shows how this process works with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>.</p> </ul><p>The algorithm used to calculate new values for a graph is called a <strong>stat</strong>, short for statistical transformation. <a href="#fig-vis-stat-bar" data-type="xref">#fig-vis-stat-bar</a> shows how this process works with <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -504,8 +504,8 @@ Statistical transformations</h1>
</figure> </figure>
</div> </div>
</div> </div>
<p>You can learn which stat a geom uses by inspecting the default value for the <code>stat</code> argument. For example, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> shows that the default value for <code>stat</code> is “count”, which means that <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> uses <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> is documented on the same page as <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>. If you scroll down, the section called “Computed variables” explains that it computes two new variables: <code>count</code> and <code>prop</code>.</p> <p>You can learn which stat a geom uses by inspecting the default value for the <code>stat</code> argument. For example, <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">?geom_bar</a></code> shows that the default value for <code>stat</code> is “count”, which means that <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> uses <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code> is documented on the same page as <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>. 
If you scroll down, the section called “Computed variables” explains that it computes two new variables: <code>count</code> and <code>prop</code>.</p>
<p>You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> instead of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>:</p> <p>You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">stat_count()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))</pre> stat_count(mapping = aes(x = cut))</pre>
@ -515,7 +515,7 @@ Statistical transformations</h1>
</div> </div>
<p>This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:</p> <p>This works because every geom has a default stat; and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:</p>
<ol type="1"><li> <ol type="1"><li>
<p>You might want to override the default stat. In the code below, we change the stat of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> from count (the default) to identity. This lets me map the height of the bars to the raw values of a <span class="math inline">\(y\)</span> variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.</p> <p>You might want to override the default stat. In the code below, we change the stat of <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code> from count (the default) to identity. This lets me map the height of the bars to the raw values of a <span class="math inline">\(y\)</span> variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">demo &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">demo &lt;- tribble(
~cut, ~freq, ~cut, ~freq,
@ -532,7 +532,7 @@ ggplot(data = demo) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-38-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-38-1.png" alt="Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds." width="576"/></p>
</div> </div>
</div> </div>
<p>(Don’t worry that you haven’t seen <code>&lt;-</code> or <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code> before. You might be able to guess their meaning from the context, and you’ll learn exactly what they do soon!)</p> <p>(Don’t worry that you haven’t seen <code>&lt;-</code> or <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code> before. You might be able to guess their meaning from the context, and you’ll learn exactly what they do soon!)</p>
</li> </li>
<li> <li>
<p>You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:</p> <p>You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:</p>
@ -543,10 +543,10 @@ ggplot(data = demo) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-39-1.png" alt="Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 26, and Ideal 0.40." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-39-1.png" alt="Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 26, and Ideal 0.40." width="576"/></p>
</div> </div>
</div> </div>
<p>To find the variables computed by the stat, look for the section titled “computed variables” in the help for <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>.</p> <p>To find the variables computed by the stat, look for the section titled “computed variables” in the help for <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>.</p>
</li> </li>
<li> <li>
<p>You might want to draw greater attention to the statistical transformation in your code. For example, you might use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/stat_summary" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/stat_summary</a></code>, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:</p> <p>You might want to draw greater attention to the statistical transformation in your code. For example, you might use <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds) +
stat_summary( stat_summary(
@ -560,15 +560,15 @@ ggplot(data = demo) +
</div> </div>
</div> </div>
</li> </li>
</ol><p>ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>. To see a complete list of stats, try the <a href="#chp-https://rstudio.com/resources/cheatsheets" data-type="xref">#chp-https://rstudio.com/resources/cheatsheets</a>.</p> </ol><p>ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">?stat_bin</a></code>. To see a complete list of stats, try the <a href="https://rstudio.com/resources/cheatsheets">ggplot2 cheatsheet</a>.</p>
<section id="exercises-4" data-type="sect2"> <section id="exercises-4" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>What is the default geom associated with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/stat_summary" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/stat_summary</a></code>? How could you rewrite the previous plot to use that geom function instead of the stat function?</p></li> <ol type="1"><li><p>What is the default geom associated with <code><a href="https://ggplot2.tidyverse.org/reference/stat_summary.html">stat_summary()</a></code>? How could you rewrite the previous plot to use that geom function instead of the stat function?</p></li>
<li><p>What does <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code> do? How is it different from <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bar</a></code>?</p></li> <li><p>What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_col()</a></code> do? How is it different from <code><a href="https://ggplot2.tidyverse.org/reference/geom_bar.html">geom_bar()</a></code>?</p></li>
<li><p>Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?</p></li> <li><p>Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?</p></li>
<li><p>What variables does <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_smooth" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_smooth</a></code> compute? What parameters control its behaviour?</p></li> <li><p>What variables does <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">stat_smooth()</a></code> compute? What parameters control its behaviour?</p></li>
<li> <li>
<p>In our proportion bar chart, we need to set <code>group = 1</code>. Why? In other words, what is the problem with these two graphs?</p> <p>In our proportion bar chart, we need to set <code>group = 1</code>. Why? In other words, what is the problem with these two graphs?</p>
<div class="cell"> <div class="cell">
@ -665,8 +665,8 @@ ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) +
<p><img src="data-visualize_files/figure-html/unnamed-chunk-48-1.png" alt="Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association." width="576"/></p> <p><img src="data-visualize_files/figure-html/unnamed-chunk-48-1.png" alt="Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association." width="576"/></p>
</div> </div>
</div> </div>
<p>Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph <em>more</em> revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for <code>geom_point(position = "jitter")</code>: <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code>.</p> <p>Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph <em>more</em> revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for <code>geom_point(position = "jitter")</code>: <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>.</p>
<p>To learn more about a position adjustment, look up the help page associated with each adjustment: <code><a href="#chp-https://ggplot2.tidyverse.org/reference/position_dodge" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/position_dodge</a></code>, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/position_stack" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/position_stack</a></code>, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/position_identity" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/position_identity</a></code>, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/position_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/position_jitter</a></code>, and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/position_stack" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/position_stack</a></code>.</p> <p>To learn more about a position adjustment, look up the help page associated with each adjustment: <code><a href="https://ggplot2.tidyverse.org/reference/position_dodge.html">?position_dodge</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_fill</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_identity.html">?position_identity</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/position_jitter.html">?position_jitter</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/position_stack.html">?position_stack</a></code>.</p>
<section id="exercises-5" data-type="sect2"> <section id="exercises-5" data-type="sect2">
<h2> <h2>
@ -681,9 +681,9 @@ Exercises</h2>
</div> </div>
</div> </div>
</li> </li>
<li><p>What parameters to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code> control the amount of jittering?</p></li> <li><p>What parameters to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> control the amount of jittering?</p></li>
<li><p>Compare and contrast <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code> with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_count" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_count</a></code>.</p></li> <li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> with <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>.</p></li>
<li><p>What’s the default position adjustment for <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot</a></code>? Create a visualization of the <code>mpg</code> dataset that demonstrates it.</p></li> <li><p>What’s the default position adjustment for <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>? Create a visualization of the <code>mpg</code> dataset that demonstrates it.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -692,7 +692,7 @@ Exercises</h2>
Coordinate systems</h1> Coordinate systems</h1>
<p>Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are three other coordinate systems that are occasionally helpful.</p> <p>Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are three other coordinate systems that are occasionally helpful.</p>
<ul><li> <ul><li>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_flip" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_flip</a></code> switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.</p> <p><code><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html">coord_flip()</a></code> switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() geom_boxplot()
@ -720,7 +720,7 @@ ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
</div> </div>
</li> </li>
<li> <li>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_map" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_map</a></code> sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the <a href="#chp-https://ggplot2-book.org/maps" data-type="xref">#chp-https://ggplot2-book.org/maps</a> of <em>ggplot2: Elegant graphics for data analysis</em>.</p> <p><code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the <a href="https://ggplot2-book.org/maps.html">Maps chapter</a> of <em>ggplot2: Elegant graphics for data analysis</em>.</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">nz &lt;- map_data("nz") <pre data-type="programlisting" data-code-language="downlit">nz &lt;- map_data("nz")
@ -743,7 +743,7 @@ ggplot(nz, aes(long, lat, group = group)) +
</div> </div>
</li> </li>
<li> <li>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_polar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_polar</a></code> uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.</p> <p><code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code> uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.</p>
<div> <div>
<pre data-type="programlisting" data-code-language="downlit">bar &lt;- ggplot(data = diamonds) + <pre data-type="programlisting" data-code-language="downlit">bar &lt;- ggplot(data = diamonds) +
geom_bar( geom_bar(
@ -772,11 +772,11 @@ bar + coord_polar()</pre>
<section id="exercises-6" data-type="sect2"> <section id="exercises-6" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>Turn a stacked bar chart into a pie chart using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_polar" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_polar</a></code>.</p></li> <ol type="1"><li><p>Turn a stacked bar chart into a pie chart using <code><a href="https://ggplot2.tidyverse.org/reference/coord_polar.html">coord_polar()</a></code>.</p></li>
<li><p>What does <code><a href="#chp-https://ggplot2.tidyverse.org/reference/labs" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/labs</a></code> do? Read the documentation.</p></li> <li><p>What does <code><a href="https://ggplot2.tidyverse.org/reference/labs.html">labs()</a></code> do? Read the documentation.</p></li>
<li><p>What’s the difference between <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_map" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_map</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_map" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_map</a></code>?</p></li> <li><p>What’s the difference between <code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_quickmap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/coord_map.html">coord_map()</a></code>?</p></li>
<li> <li>
<p>What does the plot below tell you about the relationship between city and highway mpg? Why is <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_fixed" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_fixed</a></code> important? What does <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_abline" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_abline</a></code> do?</p> <p>What does the plot below tell you about the relationship between city and highway mpg? Why is <code><a href="https://ggplot2.tidyverse.org/reference/coord_fixed.html">coord_fixed()</a></code> important? What does <code><a href="https://ggplot2.tidyverse.org/reference/geom_abline.html">geom_abline()</a></code> do?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + <pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() + geom_point() +
@ -823,14 +823,14 @@ The layered grammar of graphics</h1>
</div> </div>
</div> </div>
<p>You could use this method to build <em>any</em> plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.</p> <p>You could use this method to build <em>any</em> plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.</p>
<p>If you’d like to learn more about this theoretical underpinnings of ggplot2, you might enjoy reading “<a href="#chp-https://vita.had.co.nz/papers/layered-grammar" data-type="xref">#chp-https://vita.had.co.nz/papers/layered-grammar</a>”, the scientific paper that describes the theory of ggplot2 in detail.</p> <p>If you’d like to learn more about this theoretical underpinnings of ggplot2, you might enjoy reading “<a href="https://vita.had.co.nz/papers/layered-grammar.pdf">The Layered Grammar of Graphics</a>”, the scientific paper that describes the theory of ggplot2 in detail.</p>
</section> </section>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>In this chapter, you’ve learn the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, colour, size and shape. You then learned about facets, which allow you to create small multiples, where each panel contains a subgroup from your data. We then gave you a whirlwind tour of the geoms and stats which control the “type” of graph you get, whether it’s a scatterplot, line plot, histogram, or something else. Position adjustment control the fine details of position when geoms might otherwise overlap, and coordinate systems allow you fundamentally change what <code>x</code> and <code>y</code> mean.</p> <p>In this chapter, you’ve learn the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, colour, size and shape. You then learned about facets, which allow you to create small multiples, where each panel contains a subgroup from your data. We then gave you a whirlwind tour of the geoms and stats which control the “type” of graph you get, whether it’s a scatterplot, line plot, histogram, or something else. Position adjustment control the fine details of position when geoms might otherwise overlap, and coordinate systems allow you fundamentally change what <code>x</code> and <code>y</code> mean.</p>
<p>We’ll use visualizations again and again through out this book, introducing new techniques as we need them. If you want to get a comprehensive understand of ggplot2, we recommend reading the book, <a href="#chp-https://ggplot2-book" data-type="xref">#chp-https://ggplot2-book</a>. Other useful resources are the <a href="#chp-https://r-graphics" data-type="xref">#chp-https://r-graphics</a> by Winston Chang and <a href="#chp-https://clauswilke.com/dataviz/" data-type="xref">#chp-https://clauswilke.com/dataviz/</a> by Claus Wilke.</p> <p>We’ll use visualizations again and again through out this book, introducing new techniques as we need them. If you want to get a comprehensive understand of ggplot2, we recommend reading the book, <a href="https://ggplot2-book.org"><em>ggplot2: Elegant Graphics for Data Analysis</em></a>. Other useful resources are the <a href="https://r-graphics.org"><em>R Graphics Cookbook</em></a> by Winston Chang and <a href="https://clauswilke.com/dataviz/"><em>Fundamentals of Data Visualization</em></a> by Claus Wilke.</p>
<p>With the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organize as you write increasing amounts of R code.</p> <p>With the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organize as you write increasing amounts of R code.</p>

View File

@ -15,7 +15,7 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre> <pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div> </div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p> <p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year" <p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year` FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
@ -62,7 +62,7 @@ Connecting to a database</h1>
<ul><li><p>You’ll always use DBI (<strong>d</strong>ata<strong>b</strong>ase <strong>i</strong>nterface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.</p></li> <ul><li><p>You’ll always use DBI (<strong>d</strong>ata<strong>b</strong>ase <strong>i</strong>nterface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.</p></li>
<li><p>You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DMBS, e.g. RPostgres for Postgres and RMariaDB for MySQL.</p></li> <li><p>You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DMBS, e.g. RPostgres for Postgres and RMariaDB for MySQL.</p></li>
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p> </ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
<p>Concretely, you create a database connection using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbConnect" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbConnect</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p> <p>Concretely, you create a database connection using <code><a href="https://dbi.r-dbi.org/reference/dbConnect.html">DBI::dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect( <pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(
RMariaDB::MariaDB(), RMariaDB::MariaDB(),
@ -93,7 +93,7 @@ In this book</h2>
<section id="sec-load-data" data-type="sect2"> <section id="sec-load-data" data-type="sect2">
<h2> <h2>
Load some data</h2> Load some data</h2>
<p>Since this is a new database, we need to start by adding some data. Here we’ll add <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbWriteTable</a></code>. The simplest usage of <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbWriteTable</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p> <p>Since this is a new database, we need to start by adding some data. Here we’ll add <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">DBI::dbWriteTable()</a></code>. The simplest usage of <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dbWriteTable(con, "mpg", ggplot2::mpg) <pre data-type="programlisting" data-code-language="downlit">dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre> dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
@ -123,7 +123,7 @@ dbExistsTable(con, "foo")
<section id="extract-some-data" data-type="sect2"> <section id="extract-some-data" data-type="sect2">
<h2> <h2>
Extract some data</h2> Extract some data</h2>
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbReadTable</a></code>:</p> <p>Once you’ve determined a table exists, you can retrieve it with <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con |&gt; <pre data-type="programlisting" data-code-language="downlit">con |&gt;
dbReadTable("diamonds") |&gt; dbReadTable("diamonds") |&gt;
@ -139,14 +139,14 @@ Extract some data</h2>
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 #&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with 53,934 more rows</pre> #&gt; # … with 53,934 more rows</pre>
</div> </div>
<p><code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbReadTable</a></code> returns a <code>data.frame</code> so we use <code><a href="#chp-https://tibble.tidyverse.org/reference/as_tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/as_tibble</a></code> to convert it into a tibble so that it prints nicely.</p> <p><code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> returns a <code>data.frame</code> so we use <code><a href="https://tibble.tidyverse.org/reference/as_tibble.html">as_tibble()</a></code> to convert it into a tibble so that it prints nicely.</p>
<p>In real life, it’s rare that you’ll use <code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbReadTable</a></code> because often database tables are too big to fit in memory, and you want bring back only a subset of the rows and columns.</p> <p>In real life, it’s rare that you’ll use <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> because often database tables are too big to fit in memory, and you want bring back only a subset of the rows and columns.</p>
</section> </section>
<section id="sec-dbGetQuery" data-type="sect2"> <section id="sec-dbGetQuery" data-type="sect2">
<h2> <h2>
Run a query</h2> Run a query</h2>
<p>The way you’ll usually retrieve data is with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbGetQuery</a></code>. It takes a database connection and some SQL code and returns a data frame:</p> <p>The way you’ll usually retrieve data is with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sql &lt;- " <pre data-type="programlisting" data-code-language="downlit">sql &lt;- "
SELECT carat, cut, clarity, color, price SELECT carat, cut, clarity, color, price
@ -166,21 +166,21 @@ as_tibble(dbGetQuery(con, sql))
#&gt; # … with 1,649 more rows</pre> #&gt; # … with 1,649 more rows</pre>
</div> </div>
<p>Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where <code>price</code> is greater than 15,000.</p> <p>Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where <code>price</code> is greater than 15,000.</p>
<p>You’ll need to be a little careful with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbGetQuery</a></code> since it can potentially return more data than you have memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbSendQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbSendQuery</a></code> to get a “result set” which you can page through by calling <code><a href="#chp-https://dbi.r-dbi.org/reference/dbFetch" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbFetch</a></code> until <code><a href="#chp-https://dbi.r-dbi.org/reference/dbHasCompleted" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbHasCompleted</a></code> returns <code>TRUE</code>.</p> <p>You’ll need to be a little careful with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> since it can potentially return more data than you have memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="https://dbi.r-dbi.org/reference/dbSendQuery.html">dbSendQuery()</a></code> to get a “result set” which you can page through by calling <code><a href="https://dbi.r-dbi.org/reference/dbFetch.html">dbFetch()</a></code> until <code><a href="https://dbi.r-dbi.org/reference/dbHasCompleted.html">dbHasCompleted()</a></code> returns <code>TRUE</code>.</p>
</section> </section>
<section id="other-functions" data-type="sect2"> <section id="other-functions" data-type="sect2">
<h2> <h2>
Other functions</h2> Other functions</h2>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbWriteTable</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p> <p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
</section> </section>
</section> </section>
<section id="dbplyr-basics" data-type="sect1"> <section id="dbplyr-basics" data-type="sect1">
<h1> <h1>
dbplyr basics</h1> dbplyr basics</h1>
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this, dbplyr translates to SQL; other backends include <a href="#chp-https://dtplyr.tidyverse" data-type="xref">#chp-https://dtplyr.tidyverse</a> which translates to <a href="#chp-https://r-datatable" data-type="xref">#chp-https://r-datatable</a>, and <a href="#chp-https://multidplyr.tidyverse" data-type="xref">#chp-https://multidplyr.tidyverse</a> which executes your code on multiple cores.</p> <p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this, dbplyr translates to SQL; other backends include <a href="https://dtplyr.tidyverse.org">dtplyr</a> which translates to <a href="https://r-datatable.com">data.table</a>, and <a href="https://multidplyr.tidyverse.org">multidplyr</a> which executes your code on multiple cores.</p>
<p>To use dbplyr, you must first use <code><a href="#chp-https://dplyr.tidyverse.org/reference/tbl" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/tbl</a></code> to create an object that represents a database table:</p> <p>To use dbplyr, you must first use <code><a href="https://dplyr.tidyverse.org/reference/tbl.html">tbl()</a></code> to create an object that represents a database table:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, "diamonds") <pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, "diamonds")
diamonds_db diamonds_db
@ -212,7 +212,7 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre> <pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div> </div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p> <p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year" <p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year` FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
@ -238,7 +238,7 @@ big_diamonds_db
#&gt; # … with more rows</pre> #&gt; # … with more rows</pre>
</div> </div>
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p> <p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
<p>You can see the SQL code generated by the dbplyr function <code><a href="#chp-https://dplyr.tidyverse.org/reference/explain" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/explain</a></code>:</p> <p>You can see the SQL code generated by the dbplyr function <code><a href="https://dplyr.tidyverse.org/reference/explain.html">show_query()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db |&gt; <pre data-type="programlisting" data-code-language="downlit">big_diamonds_db |&gt;
show_query() show_query()
@ -247,7 +247,7 @@ big_diamonds_db
#&gt; FROM diamonds #&gt; FROM diamonds
#&gt; WHERE (price &gt; 15000.0)</pre> #&gt; WHERE (price &gt; 15000.0)</pre>
</div> </div>
<p>To get all the data back into R, you call <code><a href="#chp-https://dplyr.tidyverse.org/reference/compute" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/compute</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbGetQuery</a></code> to get the data, then turns the result into a tibble:</p> <p>To get all the data back into R, you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds &lt;- big_diamonds_db |&gt; <pre data-type="programlisting" data-code-language="downlit">big_diamonds &lt;- big_diamonds_db |&gt;
collect() collect()
@ -263,7 +263,7 @@ big_diamonds
#&gt; 6 1.73 Very Good G VS1 15014 #&gt; 6 1.73 Very Good G VS1 15014
#&gt; # … with 1,649 more rows</pre> #&gt; # … with 1,649 more rows</pre>
</div> </div>
<p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="#chp-https://dplyr.tidyverse.org/reference/compute" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/compute</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p> <p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p>
</section> </section>
<section id="sql" data-type="sect1"> <section id="sql" data-type="sect1">
@ -343,7 +343,7 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre> <pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div> </div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p> <p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year" <p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year` FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
@ -354,8 +354,8 @@ FROM `planes`</pre></div>
<section id="select" data-type="sect2"> <section id="select" data-type="sect2">
<h2> <h2>
SELECT</h2> SELECT</h2>
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code>, and, as you’ll learn in the next section, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>.</p> <p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and, as you’ll learn in the next section, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt; <pre data-type="programlisting" data-code-language="downlit">planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt; select(tailnum, type, manufacturer, model, year) |&gt;
@ -380,7 +380,7 @@ planes |&gt;
#&gt; SELECT tailnum, manufacturer, model, "type", "year" #&gt; SELECT tailnum, manufacturer, model, "type", "year"
#&gt; FROM planes</pre> #&gt; FROM planes</pre>
</div> </div>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, the old name is on the left and the new name is on the right.</p> <p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the old name is on the left and the new name is on the right.</p>
<div data-type="note"><div class="callout-body d-flex"> <div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container"> <div class="callout-icon-container">
<i class="callout-icon"/> <i class="callout-icon"/>
@ -397,13 +397,13 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre> <pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div> </div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that well focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. Its not perfect, but its continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p> <p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that well focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. Its not perfect, but its continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year" <p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year` FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div> FROM `planes`</pre></div>
<p>The translations for <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p> <p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate( mutate(
@ -426,7 +426,7 @@ FROM</h2>
<section id="group-by" data-type="sect2"> <section id="group-by" data-type="sect2">
<h2> <h2>
GROUP BY</h2> GROUP BY</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> is translated to the <code>SELECT</code> clause:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt;
group_by(cut) |&gt; group_by(cut) |&gt;
@ -440,13 +440,13 @@ GROUP BY</h2>
#&gt; FROM diamonds #&gt; FROM diamonds
#&gt; GROUP BY cut</pre> #&gt; GROUP BY cut</pre>
</div> </div>
<p>We’ll come back to what’s happening with the translation of <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p> <p>We’ll come back to what’s happening with the translation of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section> </section>
<section id="where" data-type="sect2"> <section id="where" data-type="sect2">
<h2> <h2>
WHERE</h2> WHERE</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> is translated to the <code>WHERE</code> clause:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dest == "IAH" | dest == "HOU") |&gt; filter(dest == "IAH" | dest == "HOU") |&gt;
@ -499,7 +499,7 @@ flights |&gt;
#&gt; 6 LAX 0.547 #&gt; 6 LAX 0.547
#&gt; # … with more rows</pre> #&gt; # … with more rows</pre>
</div> </div>
<p>If you want to learn more about how NULLs work, you might enjoy “<a href="#chp-https://modern-sql.com/concept/three-valued-logic" data-type="xref">#chp-https://modern-sql.com/concept/three-valued-logic</a>” by Markus Winand.</p> <p>If you want to learn more about how NULLs work, you might enjoy “<a href="https://modern-sql.com/concept/three-valued-logic"><em>Three valued logic</em></a>” by Markus Winand.</p>
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p> <p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
@ -512,7 +512,7 @@ flights |&gt;
</div> </div>
<p>This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:</p> <p>This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:</p>
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre> <pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
<p>Note that if you <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL, created because <code>WHERE</code> is evaluated before <code>SELECT</code>, so it needs another clause that’s evaluated afterwards.</p> <p>Note that if you <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL, created because <code>WHERE</code> is evaluated before <code>SELECT</code>, so it needs another clause that’s evaluated afterwards.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt;
group_by(cut) |&gt; group_by(cut) |&gt;
@ -530,7 +530,7 @@ flights |&gt;
<section id="order-by" data-type="sect2"> <section id="order-by" data-type="sect2">
<h2> <h2>
ORDER BY</h2> ORDER BY</h2>
<p>Ordering rows involves a straightforward translation from <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> to the <code>ORDER BY</code> clause:</p> <p>Ordering rows involves a straightforward translation from <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(year, month, day, desc(dep_delay)) |&gt; arrange(year, month, day, desc(dep_delay)) |&gt;
@ -540,7 +540,7 @@ ORDER BY</h2>
#&gt; FROM flights #&gt; FROM flights
#&gt; ORDER BY "year", "month", "day", dep_delay DESC</pre> #&gt; ORDER BY "year", "month", "day", dep_delay DESC</pre>
</div> </div>
<p>Notice how <code><a href="#chp-https://dplyr.tidyverse.org/reference/desc" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/desc</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p> <p>Notice how <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p>
</section> </section>
<section id="subqueries" data-type="sect2"> <section id="subqueries" data-type="sect2">
@ -562,7 +562,7 @@ Subqueries</h2>
#&gt; FROM flights #&gt; FROM flights
#&gt; ) q01</pre> #&gt; ) q01</pre>
</div> </div>
<p>You’ll also see this if you attempt to <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p> <p>You’ll also see this if you attempt to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(year1 = year + 1) |&gt; mutate(year1 = year + 1) |&gt;
@ -603,7 +603,7 @@ Joins</h2>
#&gt; ON (flights.tailnum = planes.tailnum)</pre> #&gt; ON (flights.tailnum = planes.tailnum)</pre>
</div> </div>
<p>The main thing to notice here is the syntax: SQL joins use sub-clauses of the <code>FROM</code> clause to bring in additional tables, using <code>ON</code> to define how the tables are related.</p> <p>The main thing to notice here is the syntax: SQL joins use sub-clauses of the <code>FROM</code> clause to bring in additional tables, using <code>ON</code> to define how the tables are related.</p>
<p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>:</p> <p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT flights.*, "type", manufacturer, model, engines, seats, speed <pre data-type="programlisting" data-code-language="sql">SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum) INNER JOIN planes ON (flights.tailnum = planes.tailnum)
@ -615,19 +615,19 @@ RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)
SELECT flights.*, "type", manufacturer, model, engines, seats, speed SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)</pre> FULL JOIN planes ON (flights.tailnum = planes.tailnum)</pre>
<p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="#chp-https://cynkra.github.io/dm/" data-type="xref">#chp-https://cynkra.github.io/dm/</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p> <p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="https://cynkra.github.io/dm/">dm package</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p>
</section> </section>
<section id="other-verbs" data-type="sect2"> <section id="other-verbs" data-type="sect2">
<h2> <h2>
Other verbs</h2> Other verbs</h2>
<p>dbplyr also translates other verbs like <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code>, <code>slice_*()</code>, and <code><a href="#chp-https://generics.r-lib.org/reference/setops" data-type="xref">#chp-https://generics.r-lib.org/reference/setops</a></code>, and a growing selection of tidyr functions like <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p> <p>dbplyr also translates other verbs like <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code>slice_*()</code>, and <code><a href="https://generics.r-lib.org/reference/setops.html">intersect()</a></code>, and a growing selection of tidyr functions like <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
</section> </section>
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>What is <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code> translated to? How about <code><a href="#chp-https://rdrr.io/r/utils/head" data-type="xref">#chp-https://rdrr.io/r/utils/head</a></code>?</p></li> <ol type="1"><li><p>What is <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> translated to? How about <code><a href="https://rdrr.io/r/utils/head.html">head()</a></code>?</p></li>
<li> <li>
<p>Explain what each of the following SQL queries does and try to recreate them using dbplyr.</p> <p>Explain what each of the following SQL queries does and try to recreate them using dbplyr.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT * <pre data-type="programlisting" data-code-language="sql">SELECT *
@ -643,8 +643,8 @@ FROM flights</pre>
<section id="sec-sql-expressions" data-type="sect1"> <section id="sec-sql-expressions" data-type="sect1">
<h1> <h1>
Function translations</h1> Function translations</h1>
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>?</p> <p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p> <p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summarize_query &lt;- function(df, ...) { <pre data-type="programlisting" data-code-language="downlit">summarize_query &lt;- function(df, ...) {
df |&gt; df |&gt;
@ -657,7 +657,7 @@ mutate_query &lt;- function(df, ...) {
show_query() show_query()
}</pre> }</pre>
</div> </div>
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>, have a relatively simple translation while others, like <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p> <p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>, have a relatively simple translation while others, like <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt; group_by(year, month, day) |&gt;
@ -677,7 +677,7 @@ mutate_query &lt;- function(df, ...) {
#&gt; FROM flights #&gt; FROM flights
#&gt; GROUP BY "year", "month", "day"</pre> #&gt; GROUP BY "year", "month", "day"</pre>
</div> </div>
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p> <p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt; group_by(year, month, day) |&gt;
@ -693,7 +693,7 @@ mutate_query &lt;- function(df, ...) {
#&gt; FROM flights</pre> #&gt; FROM flights</pre>
</div> </div>
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p> <p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p>
<p>Window functions include all functions that look forward or backwards, like <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code>:</p> <p>Window functions include all functions that look forward or backwards, like <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt; group_by(dest) |&gt;
@ -710,8 +710,8 @@ mutate_query &lt;- function(df, ...) {
#&gt; FROM flights #&gt; FROM flights
#&gt; ORDER BY time_hour</pre> #&gt; ORDER BY time_hour</pre>
</div> </div>
<p>Here it’s important to <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p> <p>Here it’s important to <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p> <p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate_query( mutate_query(
@ -737,7 +737,7 @@ flights |&gt;
#&gt; END AS description #&gt; END AS description
#&gt; FROM flights</pre> #&gt; FROM flights</pre>
</div> </div>
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">#chp-https://rdrr.io/r/base/cut</a></code>:</p> <p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate_query( mutate_query(
@ -755,16 +755,16 @@ flights |&gt;
#&gt; END AS description #&gt; END AS description
#&gt; FROM flights</pre> #&gt; FROM flights</pre>
</div> </div>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="#chp-https://dbplyr.tidyverse.org/articles/translation-function" data-type="xref">#chp-https://dbplyr.tidyverse.org/articles/translation-function</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p> <p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
<section id="learning-more" data-type="sect2"> <section id="learning-more" data-type="sect2">
<h2> <h2>
Learning more</h2> Learning more</h2>
<p>If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p> <p>If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<ul><li> <ul><li>
<a href="#chp-https://sqlfordatascientists" data-type="xref">#chp-https://sqlfordatascientists</a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li> <a href="https://sqlfordatascientists.com"><em>SQL for Data Scientists</em></a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
<li> <li>
<a href="#chp-https://www.practicalsql" data-type="xref">#chp-https://www.practicalsql</a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li> <a href="https://www.practicalsql.com"><em>Practical SQL</em></a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
</ul></section> </ul></section>
</section> </section>
</section> </section>

View File

@ -38,7 +38,7 @@ Creating date/times</h1>
<li><p>A <strong>date-time</strong> is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <code>&lt;dttm&gt;</code>. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.</p></li> <li><p>A <strong>date-time</strong> is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <code>&lt;dttm&gt;</code>. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.</p></li>
</ul><p>In this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the <strong>hms</strong> package.</p> </ul><p>In this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the <strong>hms</strong> package.</p>
<p>You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.</p> <p>You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.</p>
<p>To get the current date or date-time you can use <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/now</a></code> or <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/now</a></code>:</p> <p>To get the current date or date-time you can use <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> or <code><a href="https://lubridate.tidyverse.org/reference/now.html">now()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">today() <pre data-type="programlisting" data-code-language="downlit">today()
#&gt; [1] "2022-11-18" #&gt; [1] "2022-11-18"
@ -67,7 +67,7 @@ read_csv(csv)
#&gt; 1 2022-01-02 2022-01-02 05:12:00</pre> #&gt; 1 2022-01-02 2022-01-02 05:12:00</pre>
</div> </div>
<p>If you haven’t heard of <strong>ISO8601</strong> before, it’s an international standard<span data-type="footnote"><a href="https://xkcd.com/1179/" class="uri">https://xkcd.com/1179/</a></span> for writing dates where the components of a date are organised from biggest to smallest separated by <code>-</code>. For example, in ISO8601 May 3 2022 is <code>2022-05-03</code>. ISO8601 dates can also include times, where hour, minute, and second are separated by <code>:</code>, and the date and time components are separated by either a <code>T</code> or a space. For example, you could write 4:26pm on May 3 2022 as either <code>2022-05-03 16:26</code> or <code>2022-05-03T16:26</code>.</p> <p>If you haven’t heard of <strong>ISO8601</strong> before, it’s an international standard<span data-type="footnote"><a href="https://xkcd.com/1179/" class="uri">https://xkcd.com/1179/</a></span> for writing dates where the components of a date are organised from biggest to smallest separated by <code>-</code>. For example, in ISO8601 May 3 2022 is <code>2022-05-03</code>. ISO8601 dates can also include times, where hour, minute, and second are separated by <code>:</code>, and the date and time components are separated by either a <code>T</code> or a space. For example, you could write 4:26pm on May 3 2022 as either <code>2022-05-03 16:26</code> or <code>2022-05-03T16:26</code>.</p>
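<p>As a quick check of the two forms described above, here is a minimal sketch using readr’s <code>parse_datetime()</code>, which reads ISO8601 input when no format is supplied. It assumes the readr package is installed; both the space and the <code>T</code> separator should give the same result:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Space-separated and T-separated forms of the same date-time
parse_datetime("2022-05-03 16:26")
#&gt; [1] "2022-05-03 16:26:00 UTC"
parse_datetime("2022-05-03T16:26")
#&gt; [1] "2022-05-03 16:26:00 UTC"</pre>
</div>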
<p>For other date-time formats, you’ll need to use <code>col_types</code> plus <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code> or <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code> along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a <code>%</code> followed by a single character. For example, <code>%Y-%m-%d</code> specifies a date that’s a year, <code>-</code>, month (as number), <code>-</code>, day. Table <a href="#tbl-date-formats" data-type="xref">#tbl-date-formats</a> lists all the options.</p> <p>For other date-time formats, you’ll need to use <code>col_types</code> plus <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">col_datetime()</a></code> along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a <code>%</code> followed by a single character. For example, <code>%Y-%m-%d</code> specifies a date that’s a year, <code>-</code>, month (as number), <code>-</code>, day. Table <a href="#tbl-date-formats" data-type="xref">#tbl-date-formats</a> lists all the options.</p>
<div id="tbl-date-formats" class="anchored"> <div id="tbl-date-formats" class="anchored">
<table class="table"><caption>Table 17.1: All date formats understood by readr</caption> <table class="table"><caption>Table 17.1: All date formats understood by readr</caption>
<thead><tr class="header"><th>Type</th> <thead><tr class="header"><th>Type</th>
@ -169,7 +169,7 @@ read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
#&gt; 1 2001-02-15</pre> #&gt; 1 2001-02-15</pre>
</div> </div>
<p>Note that no matter how you specify the date format, it’s always displayed the same way once you get it into R.</p> <p>Note that no matter how you specify the date format, it’s always displayed the same way once you get it into R.</p>
<p>If you’re using <code>%b</code> or <code>%B</code> and working with non-English dates, you’ll also need to provide a <code><a href="#chp-https://readr.tidyverse.org/reference/locale" data-type="xref">#chp-https://readr.tidyverse.org/reference/locale</a></code>. See the list of built-in languages in <code><a href="#chp-https://readr.tidyverse.org/reference/date_names" data-type="xref">#chp-https://readr.tidyverse.org/reference/date_names</a></code>, or create your own with <code><a href="#chp-https://readr.tidyverse.org/reference/date_names" data-type="xref">#chp-https://readr.tidyverse.org/reference/date_names</a></code>.</p> <p>If you’re using <code>%b</code> or <code>%B</code> and working with non-English dates, you’ll also need to provide a <code><a href="https://readr.tidyverse.org/reference/locale.html">locale()</a></code>. See the list of built-in languages in <code><a href="https://readr.tidyverse.org/reference/date_names.html">date_names_langs()</a></code>, or create your own with <code><a href="https://readr.tidyverse.org/reference/date_names.html">date_names()</a></code>.</p>
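<p>For instance, a French month name only parses once readr is told which language to expect. This is a small sketch that assumes readr’s built-in <code>"fr"</code> date names; the input string is made up for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# %B matches the full month name in the language given by locale()
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#&gt; [1] "2015-01-01"</pre>
</div>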
</section> </section>
<section id="from-strings" data-type="sect2"> <section id="from-strings" data-type="sect2">
@ -184,7 +184,7 @@ mdy("January 31st, 2017")
dmy("31-Jan-2017") dmy("31-Jan-2017")
#&gt; [1] "2017-01-31"</pre> #&gt; [1] "2017-01-31"</pre>
</div> </div>
<p><code><a href="#chp-https://lubridate.tidyverse.org/reference/ymd" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/ymd</a></code> and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:</p> <p><code><a href="https://lubridate.tidyverse.org/reference/ymd.html">ymd()</a></code> and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ymd_hms("2017-01-31 20:11:59") <pre data-type="programlisting" data-code-language="downlit">ymd_hms("2017-01-31 20:11:59")
#&gt; [1] "2017-01-31 20:11:59 UTC" #&gt; [1] "2017-01-31 20:11:59 UTC"
@ -216,7 +216,7 @@ From individual components</h2>
#&gt; 6 2013 1 1 5 58 #&gt; 6 2013 1 1 5 58
#&gt; # … with 336,770 more rows</pre> #&gt; # … with 336,770 more rows</pre>
</div> </div>
<p>To create a date/time from this sort of input, use <code><a href="#chp-https://lubridate.tidyverse.org/reference/make_datetime" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/make_datetime</a></code> for dates, or <code><a href="#chp-https://lubridate.tidyverse.org/reference/make_datetime" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/make_datetime</a></code> for date-times:</p> <p>To create a date/time from this sort of input, use <code><a href="https://lubridate.tidyverse.org/reference/make_datetime.html">make_date()</a></code> for dates, or <code><a href="https://lubridate.tidyverse.org/reference/make_datetime.html">make_datetime()</a></code> for date-times:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
select(year, month, day, hour, minute) |&gt; select(year, month, day, hour, minute) |&gt;
@ -286,14 +286,14 @@ flights_dt
<section id="from-other-types" data-type="sect2"> <section id="from-other-types" data-type="sect2">
<h2> <h2>
From other types</h2> From other types</h2>
<p>You may want to switch between a date-time and a date. Thats the job of <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/as_date</a></code> and <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/as_date</a></code>:</p> <p>You may want to switch between a date-time and a date. Thats the job of <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">as_datetime(today()) <pre data-type="programlisting" data-code-language="downlit">as_datetime(today())
#&gt; [1] "2022-11-18 UTC" #&gt; [1] "2022-11-18 UTC"
as_date(now()) as_date(now())
#&gt; [1] "2022-11-18"</pre> #&gt; [1] "2022-11-18"</pre>
</div> </div>
<p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/as_date</a></code>; if it’s in days, use <code><a href="#chp-https://lubridate.tidyverse.org/reference/as_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/as_date</a></code>.</p> <p>Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_datetime()</a></code>; if it’s in days, use <code><a href="https://lubridate.tidyverse.org/reference/as_date.html">as_date()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">as_datetime(60 * 60 * 10) <pre data-type="programlisting" data-code-language="downlit">as_datetime(60 * 60 * 10)
#&gt; [1] "1970-01-01 10:00:00 UTC" #&gt; [1] "1970-01-01 10:00:00 UTC"
@ -311,7 +311,7 @@ Exercises</h2>
<pre data-type="programlisting" data-code-language="downlit">ymd(c("2010-10-10", "bananas"))</pre> <pre data-type="programlisting" data-code-language="downlit">ymd(c("2010-10-10", "bananas"))</pre>
</div> </div>
</li> </li>
<li><p>What does the <code>tzone</code> argument to <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/now</a></code> do? Why is it important?</p></li> <li><p>What does the <code>tzone</code> argument to <code><a href="https://lubridate.tidyverse.org/reference/now.html">today()</a></code> do? Why is it important?</p></li>
<li> <li>
<p>For each of the following date-times show how you’d parse it using a readr column-specification and a lubridate function.</p> <p>For each of the following date-times show how you’d parse it using a readr column-specification and a lubridate function.</p>
<div class="cell"> <div class="cell">
@ -335,7 +335,7 @@ Date-time components</h1>
<section id="getting-components" data-type="sect2"> <section id="getting-components" data-type="sect2">
<h2> <h2>
Getting components</h2> Getting components</h2>
<p>You can pull out individual parts of the date with the accessor functions <code><a href="#chp-https://lubridate.tidyverse.org/reference/year" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/year</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/month" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/month</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/day</a></code> (day of the month), <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/day</a></code> (day of the year), <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/day</a></code> (day of the week), <code><a href="#chp-https://lubridate.tidyverse.org/reference/hour" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/hour</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/minute" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/minute</a></code>, and <code><a href="#chp-https://lubridate.tidyverse.org/reference/second" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/second</a></code>.</p> <p>You can pull out individual parts of the date with the accessor functions <code><a href="https://lubridate.tidyverse.org/reference/year.html">year()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/month.html">month()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/day.html">mday()</a></code> (day of the month), <code><a href="https://lubridate.tidyverse.org/reference/day.html">yday()</a></code> (day of the year), <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> (day of the week), <code><a href="https://lubridate.tidyverse.org/reference/hour.html">hour()</a></code>, <code><a 
href="https://lubridate.tidyverse.org/reference/minute.html">minute()</a></code>, and <code><a href="https://lubridate.tidyverse.org/reference/second.html">second()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">datetime &lt;- ymd_hms("2026-07-08 12:34:56") <pre data-type="programlisting" data-code-language="downlit">datetime &lt;- ymd_hms("2026-07-08 12:34:56")
@ -351,7 +351,7 @@ yday(datetime)
wday(datetime) wday(datetime)
#&gt; [1] 4</pre> #&gt; [1] 4</pre>
</div> </div>
<p>For <code><a href="#chp-https://lubridate.tidyverse.org/reference/month" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/month</a></code> and <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/day</a></code> you can set <code>label = TRUE</code> to return the abbreviated name of the month or day of the week. Set <code>abbr = FALSE</code> to return the full name.</p> <p>For <code><a href="https://lubridate.tidyverse.org/reference/month.html">month()</a></code> and <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> you can set <code>label = TRUE</code> to return the abbreviated name of the month or day of the week. Set <code>abbr = FALSE</code> to return the full name.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">month(datetime, label = TRUE) <pre data-type="programlisting" data-code-language="downlit">month(datetime, label = TRUE)
#&gt; [1] Jul #&gt; [1] Jul
@ -360,7 +360,7 @@ wday(datetime, label = TRUE, abbr = FALSE)
#&gt; [1] Wednesday #&gt; [1] Wednesday
#&gt; 7 Levels: Sunday &lt; Monday &lt; Tuesday &lt; Wednesday &lt; Thursday &lt; ... &lt; Saturday</pre> #&gt; 7 Levels: Sunday &lt; Monday &lt; Tuesday &lt; Wednesday &lt; Thursday &lt; ... &lt; Saturday</pre>
</div> </div>
<p>We can use <code><a href="#chp-https://lubridate.tidyverse.org/reference/day" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/day</a></code> to see that more flights depart during the week than on the weekend:</p> <p>We can use <code><a href="https://lubridate.tidyverse.org/reference/day.html">wday()</a></code> to see that more flights depart during the week than on the weekend:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt; <pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
mutate(wday = wday(dep_time, label = TRUE)) |&gt; mutate(wday = wday(dep_time, label = TRUE)) |&gt;
@ -412,7 +412,7 @@ ggplot(sched_dep, aes(minute, avg_delay)) +
<section id="rounding" data-type="sect2"> <section id="rounding" data-type="sect2">
<h2> <h2>
Rounding</h2> Rounding</h2>
<p>An alternative approach to plotting individual components is to round the date to a nearby unit of time, with <code><a href="#chp-https://lubridate.tidyverse.org/reference/round_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/round_date</a></code>, <code><a href="#chp-https://lubridate.tidyverse.org/reference/round_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/round_date</a></code>, and <code><a href="#chp-https://lubridate.tidyverse.org/reference/round_date" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/round_date</a></code>. Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to. This, for example, allows us to plot the number of flights per week:</p> <p>An alternative approach to plotting individual components is to round the date to a nearby unit of time, with <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">floor_date()</a></code>, <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">round_date()</a></code>, and <code><a href="https://lubridate.tidyverse.org/reference/round_date.html">ceiling_date()</a></code>. Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to. This, for example, allows us to plot the number of flights per week:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt; <pre data-type="programlisting" data-code-language="downlit">flights_dt |&gt;
count(week = floor_date(dep_time, "week")) |&gt; count(week = floor_date(dep_time, "week")) |&gt;
@ -465,7 +465,7 @@ hour(datetime) &lt;- hour(datetime) + 1
datetime datetime
#&gt; [1] "2030-01-08 13:34:56 UTC"</pre> #&gt; [1] "2030-01-08 13:34:56 UTC"</pre>
</div> </div>
<p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="#chp-https://rdrr.io/r/stats/update" data-type="xref">#chp-https://rdrr.io/r/stats/update</a></code>. This also allows you to set multiple values in one step:</p> <p>Alternatively, rather than modifying an existing variable, you can create a new date-time with <code><a href="https://rdrr.io/r/stats/update.html">update()</a></code>. This also allows you to set multiple values in one step:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">update(datetime, year = 2030, month = 2, mday = 2, hour = 2) <pre data-type="programlisting" data-code-language="downlit">update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
#&gt; [1] "2030-02-02 02:34:56 UTC"</pre> #&gt; [1] "2030-02-02 02:34:56 UTC"</pre>
@ -664,7 +664,7 @@ y2023
y2024 y2024
#&gt; [1] 2024-01-01 UTC--2025-01-01 UTC</pre> #&gt; [1] 2024-01-01 UTC--2025-01-01 UTC</pre>
</div> </div>
<p>You could then divide it by <code><a href="#chp-https://lubridate.tidyverse.org/reference/period" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/period</a></code> to find out how many days fit in the year:</p> <p>You could then divide it by <code><a href="https://lubridate.tidyverse.org/reference/period.html">days()</a></code> to find out how many days fit in the year:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2023 / days(1) <pre data-type="programlisting" data-code-language="downlit">y2023 / days(1)
#&gt; [1] 365 #&gt; [1] 365
@ -690,13 +690,13 @@ Time zones</h1>
<!--# https://www.ietf.org/timezones/tzdb-2018a/theory.html --> <!--# https://www.ietf.org/timezones/tzdb-2018a/theory.html -->
<p>The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme <code>{area}/{location}</code>, typically in the form <code>{continent}/{city}</code> or <code>{ocean}/{city}</code>. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.</p> <p>The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme <code>{area}/{location}</code>, typically in the form <code>{continent}/{city}</code> or <code>{ocean}/{city}</code>. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.</p>
<p>You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades’ worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at <a href="https://www.iana.org/time-zones" class="uri">https://www.iana.org/time-zones</a>) just to read some of these stories!</p> <p>You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades’ worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at <a href="https://www.iana.org/time-zones" class="uri">https://www.iana.org/time-zones</a>) just to read some of these stories!</p>
<p>You can find out what R thinks your current time zone is with <code><a href="#chp-https://rdrr.io/r/base/timezones" data-type="xref">#chp-https://rdrr.io/r/base/timezones</a></code>:</p> <p>You can find out what R thinks your current time zone is with <code><a href="https://rdrr.io/r/base/timezones.html">Sys.timezone()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">Sys.timezone() <pre data-type="programlisting" data-code-language="downlit">Sys.timezone()
#&gt; [1] "America/Chicago"</pre> #&gt; [1] "America/Chicago"</pre>
</div> </div>
<p>(If R doesn’t know, you’ll get an <code>NA</code>.)</p> <p>(If R doesn’t know, you’ll get an <code>NA</code>.)</p>
<p>And see the complete list of all time zone names with <code><a href="#chp-https://rdrr.io/r/base/timezones" data-type="xref">#chp-https://rdrr.io/r/base/timezones</a></code>:</p> <p>And see the complete list of all time zone names with <code><a href="https://rdrr.io/r/base/timezones.html">OlsonNames()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">length(OlsonNames()) <pre data-type="programlisting" data-code-language="downlit">length(OlsonNames())
#&gt; [1] 595 #&gt; [1] 595
@ -725,7 +725,7 @@ x3
x1 - x3 x1 - x3
#&gt; Time difference of 0 secs</pre> #&gt; Time difference of 0 secs</pre>
</div> </div>
<p>Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">#chp-https://rdrr.io/r/base/c</a></code>, will often drop the time zone. In that case, the date-times will display in your local time zone:</p> <p>Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>, will often drop the time zone. In that case, the date-times will display in your local time zone:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x4 &lt;- c(x1, x2, x3) <pre data-type="programlisting" data-code-language="downlit">x4 &lt;- c(x1, x2, x3)
x4 x4

View File

@ -12,7 +12,7 @@
<h1> <h1>
Introduction</h1> Introduction</h1>
<p>Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.</p> <p>Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.</p>
<p>We’ll start by motivating why factors are needed for data analysis and how you can create them with <code><a href="#chp-https://rdrr.io/r/base/factor" data-type="xref">#chp-https://rdrr.io/r/base/factor</a></code>. We’ll then introduce you to the <code>gss_cat</code> dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.</p> <p>We’ll start by motivating why factors are needed for data analysis and how you can create them with <code><a href="https://rdrr.io/r/base/factor.html">factor()</a></code>. We’ll then introduce you to the <code>gss_cat</code> dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
@ -70,7 +70,7 @@ y2
#&gt; [1] Dec Apr &lt;NA&gt; Mar #&gt; [1] Dec Apr &lt;NA&gt; Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre> #&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div> </div>
<p>This seems risky, so you might want to use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct</a></code> instead:</p> <p>This seems risky, so you might want to use <code><a href="https://forcats.tidyverse.org/reference/fct.html">fct()</a></code> instead:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2 &lt;- fct(x2, levels = month_levels) <pre data-type="programlisting" data-code-language="downlit">y2 &lt;- fct(x2, levels = month_levels)
#&gt; Error in `fct()`: #&gt; Error in `fct()`:
@ -83,7 +83,7 @@ y2
#&gt; [1] Dec Apr Jan Mar #&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Apr Dec Jan Mar</pre> #&gt; Levels: Apr Dec Jan Mar</pre>
</div> </div>
<p>Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to <code>unique(x)</code>, or after the fact, with <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_inorder" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_inorder</a></code>:</p> <p>Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to <code>unique(x)</code>, or after the fact, with <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_inorder()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">f1 &lt;- factor(x1, levels = unique(x1)) <pre data-type="programlisting" data-code-language="downlit">f1 &lt;- factor(x1, levels = unique(x1))
f1 f1
@ -95,12 +95,12 @@ f2
#&gt; [1] Dec Apr Jan Mar #&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Dec Apr Jan Mar</pre> #&gt; Levels: Dec Apr Jan Mar</pre>
</div> </div>
<p>If you ever need to access the set of valid levels directly, you can do so with <code><a href="#chp-https://rdrr.io/r/base/levels" data-type="xref">#chp-https://rdrr.io/r/base/levels</a></code>:</p> <p>If you ever need to access the set of valid levels directly, you can do so with <code><a href="https://rdrr.io/r/base/levels.html">levels()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">levels(f2) <pre data-type="programlisting" data-code-language="downlit">levels(f2)
#&gt; [1] "Dec" "Apr" "Jan" "Mar"</pre> #&gt; [1] "Dec" "Apr" "Jan" "Mar"</pre>
</div> </div>
<p>You can also create a factor when reading your data with readr with <code><a href="#chp-https://readr.tidyverse.org/reference/parse_factor" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_factor</a></code>:</p> <p>You can also create a factor when reading your data with readr with <code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- " <pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
month,value month,value
@ -118,7 +118,7 @@ df$month
<section id="general-social-survey" data-type="sect1"> <section id="general-social-survey" data-type="sect1">
<h1> <h1>
General Social Survey</h1> General Social Survey</h1>
<p>For the rest of this chapter, we’re going to use <code><a href="#chp-https://forcats.tidyverse.org/reference/gss_cat" data-type="xref">#chp-https://forcats.tidyverse.org/reference/gss_cat</a></code>. It’s a sample of data from the <a href="#chp-https://gss.norc" data-type="xref">#chp-https://gss.norc</a>, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in <code>gss_cat</code> Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.</p> <p>For the rest of this chapter, we’re going to use <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">forcats::gss_cat</a></code>. It’s a sample of data from the <a href="https://gss.norc.org">General Social Survey</a>, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in <code>gss_cat</code> Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat <pre data-type="programlisting" data-code-language="downlit">gss_cat
#&gt; # A tibble: 21,483 × 9 #&gt; # A tibble: 21,483 × 9
@ -132,8 +132,8 @@ General Social Survey</h1>
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA #&gt; 6 2000 Married 25 White $20000 - 24999 Strong dem… Prot… Sout… NA
#&gt; # … with 21,477 more rows</pre> #&gt; # … with 21,477 more rows</pre>
</div> </div>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="#chp-https://forcats.tidyverse.org/reference/gss_cat" data-type="xref">#chp-https://forcats.tidyverse.org/reference/gss_cat</a></code>.)</p> <p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>:</p> <p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
count(race) count(race)
@ -182,7 +182,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p> <p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
</div> </div>
</div> </div>
<p>It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of <code>relig</code> using <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_reorder" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_reorder</a></code>. <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_reorder" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_reorder</a></code> takes three arguments:</p> <p>It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of <code>relig</code> using <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code>. <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> takes three arguments:</p>
<ul><li> <ul><li>
<code>f</code>, the factor whose levels you want to modify.</li> <code>f</code>, the factor whose levels you want to modify.</li>
<li> <li>
@ -196,7 +196,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
</div> </div>
</div> </div>
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism &amp; Other Eastern religions watch much less.</p> <p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism &amp; Other Eastern religions watch much less.</p>
<p>As you start making more complicated transformations, we recommend moving them out of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> and into a separate <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> step. For example, you could rewrite the plot above as:</p> <p>As you start making more complicated transformations, we recommend moving them out of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> and into a separate <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> step. For example, you could rewrite the plot above as:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">relig_summary |&gt; <pre data-type="programlisting" data-code-language="downlit">relig_summary |&gt;
mutate( mutate(
@ -221,8 +221,8 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p> <p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
</div> </div>
</div> </div>
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_reorder" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_reorder</a></code> for factors whose levels are arbitrarily ordered.</p> <p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
<p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_relevel" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_relevel</a></code>. It takes a factor, <code>f</code>, and then any number of levels that you want to move to the front of the line.</p> <p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="https://forcats.tidyverse.org/reference/fct_relevel.html">fct_relevel()</a></code>. It takes a factor, <code>f</code>, and then any number of levels that you want to move to the front of the line.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) + <pre data-type="programlisting" data-code-language="downlit">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()</pre> geom_point()</pre>
@ -265,7 +265,7 @@ ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
</div> </div>
</div> </div>
</div> </div>
<p>Finally, for bar plots, you can use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_inorder" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_inorder</a></code> to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_rev" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_rev</a></code> if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.</p> <p>Finally, for bar plots, you can use <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code> to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with <code><a href="https://forcats.tidyverse.org/reference/fct_rev.html">fct_rev()</a></code> if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
mutate(marital = marital |&gt; fct_infreq() |&gt; fct_rev()) |&gt; mutate(marital = marital |&gt; fct_infreq() |&gt; fct_rev()) |&gt;
@ -288,7 +288,7 @@ Exercises</h2>
<section id="modifying-factor-levels" data-type="sect1"> <section id="modifying-factor-levels" data-type="sect1">
<h1> <h1>
Modifying factor levels</h1> Modifying factor levels</h1>
<p>More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_recode" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_recode</a></code>. It allows you to recode, or change, the value of each level. For example, take the <code>gss_cat$partyid</code>:</p> <p>More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. It allows you to recode, or change, the value of each level. For example, take the <code>gss_cat$partyid</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; count(partyid) <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; count(partyid)
#&gt; # A tibble: 10 × 2 #&gt; # A tibble: 10 × 2
@ -327,7 +327,7 @@ Modifying factor levels</h1>
#&gt; 6 Independent, near rep 1791 #&gt; 6 Independent, near rep 1791
#&gt; # … with 4 more rows</pre> #&gt; # … with 4 more rows</pre>
</div> </div>
<p><code><a href="#chp-https://forcats.tidyverse.org/reference/fct_recode" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_recode</a></code> will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.</p> <p><code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code> will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.</p>
<p>To combine groups, you can assign multiple old levels to the same new level:</p> <p>To combine groups, you can assign multiple old levels to the same new level:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
@ -357,7 +357,7 @@ Modifying factor levels</h1>
#&gt; # … with 2 more rows</pre> #&gt; # … with 2 more rows</pre>
</div> </div>
<p>Use this technique with care: if you group together categories that are truly different you will end up with misleading results.</p> <p>Use this technique with care: if you group together categories that are truly different you will end up with misleading results.</p>
<p>If you want to collapse a lot of levels, <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_collapse" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_collapse</a></code> is a useful variant of <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_recode" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_recode</a></code>. For each new variable, you can provide a vector of old levels:</p> <p>If you want to collapse a lot of levels, <code><a href="https://forcats.tidyverse.org/reference/fct_collapse.html">fct_collapse()</a></code> is a useful variant of <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. For each new variable, you can provide a vector of old levels:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
mutate( mutate(
@ -377,7 +377,7 @@ Modifying factor levels</h1>
#&gt; 3 ind 8409 #&gt; 3 ind 8409
#&gt; 4 dem 7180</pre> #&gt; 4 dem 7180</pre>
</div> </div>
<p>Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the <code>fct_lump_*()</code> family of functions. <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_lump</a></code> is a simple starting point that progressively lumps the smallest groups’ categories into “Other”, always keeping “Other” as the smallest category.</p> <p>Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the <code>fct_lump_*()</code> family of functions. <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_lowfreq()</a></code> is a simple starting point that progressively lumps the smallest groups’ categories into “Other”, always keeping “Other” as the smallest category.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
mutate(relig = fct_lump_lowfreq(relig)) |&gt; mutate(relig = fct_lump_lowfreq(relig)) |&gt;
@ -388,7 +388,7 @@ Modifying factor levels</h1>
#&gt; 1 Protestant 10846 #&gt; 1 Protestant 10846
#&gt; 2 Other 10637</pre> #&gt; 2 Other 10637</pre>
</div> </div>
<p>In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_lump</a></code> to specify that we want exactly 10 groups:</p> <p>In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_n()</a></code> to specify that we want exactly 10 groups:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; <pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
mutate(relig = fct_lump_n(relig, n = 10)) |&gt; mutate(relig = fct_lump_n(relig, n = 10)) |&gt;
@ -408,27 +408,27 @@ Modifying factor levels</h1>
#&gt; 9 Moslem/islam 104 #&gt; 9 Moslem/islam 104
#&gt; 10 Orthodox-christian 95</pre> #&gt; 10 Orthodox-christian 95</pre>
</div> </div>
<p>Read the documentation to learn about <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_lump</a></code> and <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_lump</a></code> which are useful in other cases.</p> <p>Read the documentation to learn about <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_min()</a></code> and <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_prop()</a></code> which are useful in other cases.</p>
<section id="exercises-1" data-type="sect2"> <section id="exercises-1" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?</p></li> <ol type="1"><li><p>How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?</p></li>
<li><p>How could you collapse <code>rincome</code> into a small set of categories?</p></li> <li><p>How could you collapse <code>rincome</code> into a small set of categories?</p></li>
<li><p>Notice there are 9 groups (excluding other) in the <code>fct_lump</code> example above. Why not 10? (Hint: type <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_lump" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_lump</a></code>, and find the default for the argument <code>other_level</code> is “Other”.)</p></li> <li><p>Notice there are 9 groups (excluding other) in the <code>fct_lump</code> example above. Why not 10? (Hint: type <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">?fct_lump</a></code>, and find the default for the argument <code>other_level</code> is “Other”.)</p></li>
</ol></section> </ol></section>
</section> </section>
<section id="ordered-factors" data-type="sect1"> <section id="ordered-factors" data-type="sect1">
<h1> <h1>
Ordered factors</h1> Ordered factors</h1>
<p>Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with <code><a href="#chp-https://rdrr.io/r/base/factor" data-type="xref">#chp-https://rdrr.io/r/base/factor</a></code>, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use <code>&lt;</code> between the factor levels:</p> <p>Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with <code><a href="https://rdrr.io/r/base/factor.html">ordered()</a></code>, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use <code>&lt;</code> between the factor levels:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ordered(c("a", "b", "c")) <pre data-type="programlisting" data-code-language="downlit">ordered(c("a", "b", "c"))
#&gt; [1] a b c #&gt; [1] a b c
#&gt; Levels: a &lt; b &lt; c</pre> #&gt; Levels: a &lt; b &lt; c</pre>
</div> </div>
<p>In practice, <code><a href="#chp-https://rdrr.io/r/base/factor" data-type="xref">#chp-https://rdrr.io/r/base/factor</a></code> factors behave very similarly to regular factors. There are only two places where you might notice different behavior:</p> <p>In practice, <code><a href="https://rdrr.io/r/base/factor.html">ordered()</a></code> factors behave very similarly to regular factors. There are only two places where you might notice different behavior:</p>
<ul><li>If you map an ordered factor to color or fill in ggplot2, it will default to <code>scale_color_viridis_d()</code>/<code>scale_fill_viridis_d()</code>, a color scale that implies a ranking.</li> <ul><li>If you map an ordered factor to color or fill in ggplot2, it will default to <code>scale_color_viridis_d()</code>/<code>scale_fill_viridis_d()</code>, a color scale that implies a ranking.</li>
<li>If you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend <code>vignette("contrasts", package = "faux")</code> by Lisa DeBruine.</li> <li>If you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend <code>vignette("contrasts", package = "faux")</code> by Lisa DeBruine.</li>
</ul><p>Given the arguable utility of these differences, we don’t generally recommend using ordered factors.</p> </ul><p>Given the arguable utility of these differences, we don’t generally recommend using ordered factors.</p>
@ -437,8 +437,8 @@ Ordered factors</h1>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>This chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, I highly recommend skimming the <a href="#chp-https://forcats.tidyverse.org/reference/index" data-type="xref">#chp-https://forcats.tidyverse.org/reference/index</a> to see if there’s a canned function that can help solve your problem.</p> <p>This chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, I highly recommend skimming the <a href="https://forcats.tidyverse.org/reference/index.html">reference index</a> to see if there’s a canned function that can help solve your problem.</p>
<p>If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, <a href="#chp-https://peerj.com/preprints/3163/" data-type="xref">#chp-https://peerj.com/preprints/3163/</a>. This paper lays out some of the history discussed in <a href="#chp-https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/" data-type="xref">#chp-https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/</a> and <a href="#chp-https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh" data-type="xref">#chp-https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh</a>, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia &amp; Nick!</p> <p>If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, <a href="https://peerj.com/preprints/3163/"><em>Wrangling categorical data in R</em></a>. This paper lays out some of the history discussed in <a href="https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/"><em>stringsAsFactors: An unauthorized biography</em></a> and <a href="https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh"><em>stringsAsFactors = &lt;sigh&gt;</em></a>, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia &amp; Nick!</p>
<p>In the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!</p> <p>In the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!</p>


@ -23,7 +23,7 @@ Introduction</h1>
<ul><li>Vector functions take one or more vectors as input and return a vector as output.</li> <ul><li>Vector functions take one or more vectors as input and return a vector as output.</li>
<li>Data frame functions take a data frame as input and return a data frame as output.</li> <li>Data frame functions take a data frame as input and return a data frame as output.</li>
<li>Plot functions that take a data frame as input and return a plot as output.</li> <li>Plot functions that take a data frame as input and return a plot as output.</li>
</ul><p>Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for <a href="#chp-https://twitter.com/hadleywickham/status/1571603361350164486" data-type="xref">#chp-https://twitter.com/hadleywickham/status/1571603361350164486</a> and <a href="#chp-https://twitter.com/hadleywickham/status/1574373127349575680" data-type="xref">#chp-https://twitter.com/hadleywickham/status/1574373127349575680</a> to see even more functions.</p> </ul><p>Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on Twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for <a href="https://twitter.com/hadleywickham/status/1571603361350164486">general functions</a> and <a href="https://twitter.com/hadleywickham/status/1574373127349575680">plotting functions</a> to see even more functions.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
@ -72,7 +72,7 @@ df |&gt; mutate(
<section id="writing-a-function" data-type="sect2"> <section id="writing-a-function" data-type="sect2">
<h2> <h2>
Writing a function</h2> Writing a function</h2>
<p>To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary. If we take the code above and pull it outside of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, it’s a little easier to see the pattern because each repetition is now one line:</p> <p>To write a function you need to first analyse your repeated code to figure out what parts are constant and what parts vary. If we take the code above and pull it outside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, it’s a little easier to see the pattern because each repetition is now one line:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)) <pre data-type="programlisting" data-code-language="downlit">(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE)) (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
@ -106,7 +106,7 @@ Writing a function</h2>
rescale01(c(1, 2, 3, NA, 5)) rescale01(c(1, 2, 3, NA, 5))
#&gt; [1] 0.00 0.25 0.50 NA 1.00</pre> #&gt; [1] 0.00 0.25 0.50 NA 1.00</pre>
</div> </div>
<p>Then you can rewrite the call to <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> as:</p> <p>Then you can rewrite the call to <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> as:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate( <pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(
a = rescale01(a), a = rescale01(a),
@ -123,13 +123,13 @@ rescale01(c(1, 2, 3, NA, 5))
#&gt; 4 0.795 0.531 0 1 #&gt; 4 0.795 0.531 0 1
#&gt; 5 1 0.518 0.580 0.394</pre> #&gt; 5 1 0.518 0.580 0.394</pre>
</div> </div>
<p>(In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you’ll learn how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> to reduce the duplication even further so all you need is <code>df |&gt; mutate(across(a:d, rescale01))</code>).</p> <p>(In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> to reduce the duplication even further so all you need is <code>df |&gt; mutate(across(a:d, rescale01))</code>).</p>
</section> </section>
<section id="improving-our-function" data-type="sect2"> <section id="improving-our-function" data-type="sect2">
<h2> <h2>
Improving our function</h2> Improving our function</h2>
<p>You might notice that the <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> twice and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="#chp-https://rdrr.io/r/base/range" data-type="xref">#chp-https://rdrr.io/r/base/range</a></code>:</p> <p>You might notice that the <code>rescale01()</code> function does some unnecessary work — instead of computing <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> twice and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> once we could instead compute both the minimum and maximum in one step with <code><a href="https://rdrr.io/r/base/range.html">range()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) { <pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE) rng &lt;- range(x, na.rm = TRUE)
@ -142,7 +142,7 @@ Improving our function</h2>
rescale01(x) rescale01(x)
#&gt; [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre> #&gt; [1] 0 0 0 0 0 0 0 0 0 0 NaN</pre>
</div> </div>
<p>That result is not particularly useful so we could ask <code><a href="#chp-https://rdrr.io/r/base/range" data-type="xref">#chp-https://rdrr.io/r/base/range</a></code> to ignore infinite values:</p> <p>That result is not particularly useful so we could ask <code><a href="https://rdrr.io/r/base/range.html">range()</a></code> to ignore infinite values:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) { <pre data-type="programlisting" data-code-language="downlit">rescale01 &lt;- function(x) {
rng &lt;- range(x, na.rm = TRUE, finite = TRUE) rng &lt;- range(x, na.rm = TRUE, finite = TRUE)
@ -158,14 +158,14 @@ rescale01(x)
<section id="mutate-functions" data-type="sect2"> <section id="mutate-functions" data-type="sect2">
<h2> <h2>
Mutate functions</h2> Mutate functions</h2>
<p>Now you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, functions that work well inside functions like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> because they return an output the same length as the input.</p> <p>Now you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, functions that work well inside functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> because they return an output the same length as the input.</p>
<p>Let’s start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p> <p>Let’s start with a simple variation of <code>rescale01()</code>. Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">z_score &lt;- function(x) { <pre data-type="programlisting" data-code-language="downlit">z_score &lt;- function(x) {
(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}</pre> }</pre>
</div> </div>
<p>Or maybe you want to wrap up a straightforward <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p> <p>Or maybe you want to wrap up a straightforward <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> in order to give it a useful name. For example, this <code>clamp()</code> function ensures all values of a vector lie between a minimum and a maximum:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">clamp &lt;- function(x, min, max) { <pre data-type="programlisting" data-code-language="downlit">clamp &lt;- function(x, min, max) {
case_when( case_when(
@ -244,7 +244,7 @@ haversine &lt;- function(long1, lat1, long2, lat2, round = 3) {
<section id="summary-functions" data-type="sect2"> <section id="summary-functions" data-type="sect2">
<h2> <h2>
Summary functions</h2> Summary functions</h2>
<p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p> <p>Another important family of vector functions is summary functions, functions that return a single value for use in <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. Sometimes this can just be a matter of setting a default argument or two:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">commas &lt;- function(x) { <pre data-type="programlisting" data-code-language="downlit">commas &lt;- function(x) {
str_flatten(x, collapse = ", ", last = " and ") str_flatten(x, collapse = ", ", last = " and ")
@ -332,7 +332,7 @@ Data frame functions</h1>
<section id="indirection-and-tidy-evaluation" data-type="sect2"> <section id="indirection-and-tidy-evaluation" data-type="sect2">
<h2> <h2>
Indirection and tidy evaluation</h2> Indirection and tidy evaluation</h2>
<p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pull</a></code> the unique (distinct) values of a variable:</p> <p>When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: <code>pull_unique()</code>. The goal of this function is to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> the unique (distinct) values of a variable:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">pull_unique &lt;- function(df, var) { <pre data-type="programlisting" data-code-language="downlit">pull_unique &lt;- function(df, var) {
df |&gt; df |&gt;
@ -356,7 +356,7 @@ df |&gt; pull_unique(y)
#&gt; [1] "var"</pre> #&gt; [1] "var"</pre>
</div> </div>
<p>Regardless of how we call <code>pull_unique()</code> it always does <code>df |&gt; distinct(var) |&gt; pull(var)</code>, instead of <code>df |&gt; distinct(x) |&gt; pull(x)</code> or <code>df |&gt; distinct(y) |&gt; pull(y)</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p> <p>Regardless of how we call <code>pull_unique()</code> it always does <code>df |&gt; distinct(var) |&gt; pull(var)</code>, instead of <code>df |&gt; distinct(x) |&gt; pull(x)</code> or <code>df |&gt; distinct(y) |&gt; pull(y)</code>. This is a problem of indirection, and it arises because dplyr uses <strong>tidy evaluation</strong> to allow you to refer to the names of variables inside your data frame without any special treatment.</p>
<p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pull</a></code> not to treat <code>var</code> as the name of a variable, but instead look inside <code>var</code> for the variable we actually want to use.</p> <p>Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> not to treat <code>var</code> as the name of a variable, but instead look inside <code>var</code> for the variable we actually want to use.</p>
<p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p> <p>Tidy evaluation includes a solution to this problem called <strong>embracing</strong> 🤗. Embracing a variable means to wrap it in braces so (e.g.) <code>var</code> becomes <code>{{ var }}</code>. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of <code>{{ }}</code> as looking down a tunnel: <code>{{ var }}</code> will make a dplyr function look inside of <code>var</code> rather than looking for a variable called <code>var</code>.</p>
<p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p> <p>So to make <code>pull_unique()</code> work we need to replace <code>var</code> with <code>{{ var }}</code>:</p>
<div class="cell"> <div class="cell">
@ -376,8 +376,8 @@ diamonds |&gt; pull_unique(clarity)
<h2> <h2>
When to embrace?</h2> When to embrace?</h2>
<p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:</p> <p>So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:</p>
<ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> that compute with variables.</p></li> <ul><li><p><strong>Data-masking</strong>: this is used in functions like <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> that compute with variables.</p></li>
<li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code> that select variables.</p></li> <li><p><strong>Tidy-selection</strong>: this is used for functions like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code> that select variables.</p></li>
</ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p> </ul><p>Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. <code>x + 1</code>) or select (e.g. <code>a:x</code>).</p>
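<p>To make the compute-versus-select distinction concrete, here is a minimal sketch with one wrapper of each kind. The helper names <code>keep_rows()</code> and <code>keep_cols()</code> and the tiny data frame are hypothetical, invented only for this illustration; both arguments are embraced, but one is handed to a data-masking function and the other to a tidy-selection function:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# filter() computes with variables (data-masking), so the condition is embraced
keep_rows &lt;- function(df, condition) {
  df |&gt; filter({{ condition }})
}

# select() selects variables (tidy-selection), so the selection is embraced too
keep_cols &lt;- function(df, cols) {
  df |&gt; select({{ cols }})
}

# Made-up data for illustration
df &lt;- tibble(a = 1:3, b = c(10, 20, 30), c = c("x", "y", "z"))

rows &lt;- df |&gt; keep_rows(b &gt; 15)  # a computation on b
cols &lt;- df |&gt; keep_cols(a:b)     # a selection of columns a through b</pre>
</div>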
<p>In the following sections we’ll explore the sorts of handy functions you might write once you understand embracing.</p> <p>In the following sections we’ll explore the sorts of handy functions you might write once you understand embracing.</p>
</section> </section>
@ -404,8 +404,8 @@ diamonds |&gt; summary6(carat)
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; #&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.2 0.798 0.7 5.01 53940 0</pre> #&gt; 1 0.2 0.798 0.7 5.01 53940 0</pre>
</div> </div>
<p>(Whenever you wrap <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> in a helper, we think it’s good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p> <p>(Whenever you wrap <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> in a helper, we think it’s good practice to set <code>.groups = "drop"</code> to both avoid the message and leave the data in an ungrouped state.)</p>
<p>The nice thing about this function is that, because it wraps <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, you can use it on grouped data:</p> <p>The nice thing about this function is that, because it wraps <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, you can use it on grouped data:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
group_by(cut) |&gt; group_by(cut) |&gt;
@ -433,8 +433,8 @@ diamonds |&gt; summary6(carat)
#&gt; 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0 #&gt; 4 Premium -0.699 -0.125 -0.0655 0.603 13791 0
#&gt; 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre> #&gt; 5 Ideal -0.699 -0.225 -0.268 0.544 21551 0</pre>
</div> </div>
<p>To summarize multiple variables you’ll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you’ll learn how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>.</p> <p>To summarize multiple variables you’ll need to wait until <a href="#sec-across" data-type="xref">#sec-across</a>, where you’ll learn how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>.</p>
<p>Another popular <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> helper function is a version of <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> that also computes proportions:</p> <p>Another popular <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> helper function is a version of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> that also computes proportions:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/Diabb6/status/1571635146658402309 <pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/Diabb6/status/1571635146658402309
count_prop &lt;- function(df, var, sort = FALSE) { count_prop &lt;- function(df, var, sort = FALSE) {
@ -454,7 +454,7 @@ diamonds |&gt; count_prop(clarity)
#&gt; 6 VVS2 5066 0.0939 #&gt; 6 VVS2 5066 0.0939
#&gt; # … with 2 more rows</pre> #&gt; # … with 2 more rows</pre>
</div> </div>
<p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>, and only <code>var</code> needs to be embraced because it’s passed to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>, which uses data-masking for all variables in <code>…</code>.</p> <p>This function has three arguments: <code>df</code>, <code>var</code>, and <code>sort</code>, and only <code>var</code> needs to be embraced because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, which uses data-masking for all variables in <code>…</code>.</p>
<p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:</p> <p>Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">unique_where &lt;- function(df, condition, var) { <pre data-type="programlisting" data-code-language="downlit">unique_where &lt;- function(df, condition, var) {
@ -479,7 +479,7 @@ flights |&gt; unique_where(month == 12, dest)
flights |&gt; unique_where(tailnum == "N14228", month) flights |&gt; unique_where(tailnum == "N14228", month)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 12</pre> #&gt; [1] 1 2 3 4 5 6 7 8 9 10 12</pre>
</div> </div>
<p>Here we embrace <code>condition</code> because it’s passed to <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> and <code>var</code> because it’s passed to <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pull</a></code>.</p> <p>Here we embrace <code>condition</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code>var</code> because it’s passed to <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>.</p>
<p>We’ve made all these examples take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p> <p>We’ve made all these examples take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects <code>time_hour</code>, <code>carrier</code>, and <code>flight</code> since they form the compound primary key that allows you to identify a row.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights_sub &lt;- function(rows, cols) { <pre data-type="programlisting" data-code-language="downlit">flights_sub &lt;- function(rows, cols) {
@ -520,7 +520,7 @@ flights |&gt;
#&gt; Caused by error: #&gt; Caused by error:
#&gt; ! `..1` must be size 336776 or 1, not 1010328.</pre> #&gt; ! `..1` must be size 336776 or 1, not 1010328.</pre>
</div> </div>
<p>This doesn’t work because <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="#chp-https://dplyr.tidyverse.org/reference/pick" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pick</a></code>, which allows you to use tidy-selection inside data-masking functions:</p> <p>This doesn’t work because <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> uses data-masking, not tidy-selection. We can work around that problem by using the handy <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code>, which allows you to use tidy-selection inside data-masking functions:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">count_missing &lt;- function(df, group_vars, x_var) { <pre data-type="programlisting" data-code-language="downlit">count_missing &lt;- function(df, group_vars, x_var) {
df |&gt; df |&gt;
@ -543,7 +543,7 @@ flights |&gt;
#&gt; 6 2013 1 6 1 #&gt; 6 2013 1 6 1
#&gt; # … with 359 more rows</pre> #&gt; # … with 359 more rows</pre>
</div> </div>
<p>Another convenient use of <code><a href="#chp-https://dplyr.tidyverse.org/reference/pick" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pick</a></code> is to make a 2d table of counts. Here we count using all the variables in the <code>rows</code> and <code>columns</code>, then use <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> to rearrange the counts into a grid:</p> <p>Another convenient use of <code><a href="https://dplyr.tidyverse.org/reference/pick.html">pick()</a></code> is to make a 2d table of counts. Here we count using all the variables in the <code>rows</code> and <code>columns</code>, then use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to rearrange the counts into a grid:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/pollicipes/status/1571606508944719876 <pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/pollicipes/status/1571606508944719876
count_wide &lt;- function(data, rows, cols) { count_wide &lt;- function(data, rows, cols) {
@ -579,7 +579,7 @@ diamonds |&gt; count_wide(c(clarity, color), cut)
#&gt; 6 I1 I 34 9 8 24 17 #&gt; 6 I1 I 34 9 8 24 17
#&gt; # … with 50 more rows</pre> #&gt; # … with 50 more rows</pre>
</div> </div>
<p>While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> docs you can see that <code>names_from</code> uses tidy-selection.</p> <p>While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> docs you can see that <code>names_from</code> uses tidy-selection.</p>
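<p>Because <code>names_from</code> and <code>values_from</code> use tidy-selection, the same embracing pattern carries over to a tidyr wrapper. The following is only a sketch: the helper name <code>widen()</code> and the small data frame are hypothetical and exist purely for illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)
library(tidyr)

# Hypothetical wrapper: both arguments are forwarded to tidy-select arguments
widen &lt;- function(df, names, values) {
  df |&gt;
    pivot_wider(
      names_from = {{ names }},
      values_from = {{ values }}
    )
}

# Made-up long-format data for illustration
long &lt;- tibble(
  id  = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  val = c(10, 20, 30, 40)
)

wide &lt;- long |&gt; widen(key, val)  # one row per id, one column per key</pre>
</div>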
</section> </section>
<section id="exercises-1" data-type="sect2"> <section id="exercises-1" data-type="sect2">
@ -618,7 +618,7 @@ Exercises</h2>
</div> </div>
</li> </li>
</ol></li> </ol></li>
<li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-select: <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/slice" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/slice</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/slice" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/slice</a></code>.</p></li> <li><p>For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-select: <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_min()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/slice.html">slice_sample()</a></code>.</p></li>
<li> <li>
<p>Generalize the following function so that you can supply any number of variables to count.</p> <p>Generalize the following function so that you can supply any number of variables to count.</p>
<div class="cell"> <div class="cell">
@ -635,7 +635,7 @@ Exercises</h2>
<section id="plot-functions" data-type="sect1"> <section id="plot-functions" data-type="sect1">
<h1> <h1>
Plot functions</h1> Plot functions</h1>
<p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> is a data-masking function. For example, imagine that you’re making a lot of histograms:</p> <p>Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function. For example, imagine that you’re making a lot of histograms:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
ggplot(aes(carat)) + ggplot(aes(carat)) +
@ -645,7 +645,7 @@ diamonds |&gt;
ggplot(aes(carat)) + ggplot(aes(carat)) +
geom_histogram(binwidth = 0.05)</pre> geom_histogram(binwidth = 0.05)</pre>
</div> </div>
<p>Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy once you know that <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> is a data-masking function, so you need to embrace:</p> <p>Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy once you know that <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> is a data-masking function, so you need to embrace:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth = NULL) { <pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth = NULL) {
df |&gt; df |&gt;
@ -714,7 +714,7 @@ diamonds |&gt; hex_plot(carat, price, depth)</pre>
<section id="combining-with-dplyr" data-type="sect2"> <section id="combining-with-dplyr" data-type="sect2">
<h2> <h2>
Combining with dplyr</h2> Combining with dplyr</h2>
<p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to do a vertical bar chart where you automatically sort the bars in frequency order using <code><a href="#chp-https://forcats.tidyverse.org/reference/fct_inorder" data-type="xref">#chp-https://forcats.tidyverse.org/reference/fct_inorder</a></code>. Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:</p> <p>Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to do a vertical bar chart where you automatically sort the bars in frequency order using <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code>. Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sorted_bars &lt;- function(df, var) { <pre data-type="programlisting" data-code-language="downlit">sorted_bars &lt;- function(df, var) {
df |&gt; df |&gt;
@ -780,7 +780,7 @@ fancy_ts(df, value, dist_name)</pre>
<section id="faceting" data-type="sect2"> <section id="faceting" data-type="sect2">
<h2> <h2>
Faceting</h2> Faceting</h2>
<p>Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="#chp-https://ggplot2.tidyverse.org/reference/vars" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/vars</a></code> uses tidy evaluation so you can embrace within it:</p> <p>Unfortunately programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax. When programming with facets, instead of writing <code>~ x</code>, you need to write <code>vars(x)</code> and instead of <code>~ x + y</code> you need to write <code>vars(x, y)</code>. The only advantage of this syntax is that <code><a href="https://ggplot2.tidyverse.org/reference/vars.html">vars()</a></code> uses tidy evaluation so you can embrace within it:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/sharoz/status/1574376332821204999 <pre data-type="programlisting" data-code-language="downlit"># https://twitter.com/sharoz/status/1574376332821204999
@ -831,7 +831,7 @@ Labeling</h2>
}</pre> }</pre>
</div> </div>
<p>Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about before: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p> <p>Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about before: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).</p>
<p>To solve the labeling problem we can use <code><a href="#chp-https://rlang.r-lib.org/reference/englue" data-type="xref">#chp-https://rlang.r-lib.org/reference/englue</a></code>. This works similarly to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code>, so any value wrapped in <code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p> <p>To solve the labeling problem we can use <code><a href="https://rlang.r-lib.org/reference/englue.html">rlang::englue()</a></code>. This works similarly to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>, so any value wrapped in <code><a href="https://rdrr.io/r/base/Paren.html">{ }</a></code> will be inserted into the string. But it also understands <code>{{ }}</code>, which automatically inserts the appropriate variable name:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth) { <pre data-type="programlisting" data-code-language="downlit">histogram &lt;- function(df, var, binwidth) {
label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}") label &lt;- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
@ -865,7 +865,7 @@ Exercises</h2>
<h1> <h1>
Style</h1> Style</h1>
<p>R doesn’t care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.</p> <p>R doesn’t care what your function or arguments are called but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.</p>
<p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="#chp-https://rdrr.io/r/stats/coef" data-type="xref">#chp-https://rdrr.io/r/stats/coef</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.</p> <p>Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> is better than <code>compute_mean()</code>), or accesses some property of an object (e.g. <code><a href="https://rdrr.io/r/stats/coef.html">coef()</a></code> is better than <code>get_coefficients()</code>). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Too short <pre data-type="programlisting" data-code-language="downlit"># Too short
f() f()
@ -877,7 +877,7 @@ my_awesome_function()
impute_missing() impute_missing()
collapse_years()</pre> collapse_years()</pre>
</div> </div>
<p>R also doesnt care about how you use white space in your functions but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p> <p>R also doesnt care about how you use white space in your functions but future readers will. Continue to follow the rules from <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>. Additionally, <code>function()</code> should always be followed by squiggly brackets (<code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># missing extra two spaces <pre data-type="programlisting" data-code-language="downlit"># missing extra two spaces
pull_unique &lt;- function(df, var) { pull_unique &lt;- function(df, var) {
@ -913,7 +913,7 @@ f3 &lt;- function(x, y) {
</div> </div>
</li> </li>
<li><p>Take a function that youve written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li> <li><p>Take a function that youve written recently and spend 5 minutes brainstorming a better name for it and its arguments.</p></li>
<li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc would be better than <code><a href="#chp-https://rdrr.io/r/stats/Normal" data-type="xref">#chp-https://rdrr.io/r/stats/Normal</a></code>, <code><a href="#chp-https://rdrr.io/r/stats/Normal" data-type="xref">#chp-https://rdrr.io/r/stats/Normal</a></code>. Make a case for the opposite.</p></li> <li><p>Make a case for why <code>norm_r()</code>, <code>norm_d()</code> etc would be better than <code><a href="https://rdrr.io/r/stats/Normal.html">rnorm()</a></code>, <code><a href="https://rdrr.io/r/stats/Normal.html">dnorm()</a></code>. Make a case for the opposite.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -922,9 +922,9 @@ f3 &lt;- function(x, y) {
Summary</h1> Summary</h1>
<p>In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p> <p>In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.</p>
<p>We have only shown you the bare minimum to get started with functions and theres much more to learn. A few places to learn more are:</p> <p>We have only shown you the bare minimum to get started with functions and theres much more to learn. A few places to learn more are:</p>
<ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="#chp-https://dplyr.tidyverse.org/articles/programming" data-type="xref">#chp-https://dplyr.tidyverse.org/articles/programming</a> and <a href="#chp-https://tidyr.tidyverse.org/articles/programming" data-type="xref">#chp-https://tidyr.tidyverse.org/articles/programming</a> and learn more about the theory in <a href="#chp-https://rlang.r-lib.org/reference/topic-data-mask" data-type="xref">#chp-https://rlang.r-lib.org/reference/topic-data-mask</a>.</li> <ul><li>To learn more about programming with tidy evaluation, see useful recipes in <a href="https://dplyr.tidyverse.org/articles/programming.html">programming with dplyr</a> and <a href="https://tidyr.tidyverse.org/articles/programming.html">programming with tidyr</a> and learn more about the theory in <a href="https://rlang.r-lib.org/reference/topic-data-mask.html">What is data-masking and why do I need {{?</a>.</li>
<li>To learn more about reducing duplication in your ggplot2 code, read the <a href="#chp-https://ggplot2-book.org/programming" class="uri" data-type="xref">#chp-https://ggplot2-book.org/programming</a> chapter of the ggplot2 book.</li> <li>To learn more about reducing duplication in your ggplot2 code, read the <a href="https://ggplot2-book.org/programming.html" class="uri">Programming with ggplot2</a> chapter of the ggplot2 book.</li>
<li>For more advice on function style, see the <a href="#chp-https://style.tidyverse.org/functions" class="uri" data-type="xref">#chp-https://style.tidyverse.org/functions</a>.</li> <li>For more advice on function style, see the <a href="https://style.tidyverse.org/functions.html" class="uri">tidyverse style guide</a>.</li>
</ul><p>In the next chapter, well dive into some of the details of Rs vector data structures that weve omitted so far. These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration which gives you further tools for reducing code duplication.</p> </ul><p>In the next chapter, well dive into some of the details of Rs vector data structures that weve omitted so far. These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration which gives you further tools for reducing code duplication.</p>


@ -38,15 +38,15 @@ What you wont learn</h1>
<h2> <h2>
Modeling</h2> Modeling</h2>
<!--# TO DO: Say a few sentences about modelling. --> <!--# TO DO: Say a few sentences about modelling. -->
<p>To learn more about modeling, we highly recommend <a href="#chp-https://www.tmwr" data-type="xref">#chp-https://www.tmwr</a>, by our colleagues Max Kuhn and Julia Silge. This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.</p> <p>To learn more about modeling, we highly recommend <a href="https://www.tmwr.org">Tidy Modeling with R</a>, by our colleagues Max Kuhn and Julia Silge. This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.</p>
</section> </section>
<section id="big-data" data-type="sect2"> <section id="big-data" data-type="sect2">
<h2> <h2>
Big data</h2> Big data</h2>
<p>This book proudly focuses on small, in-memory datasets. This is the right place to start because you cant tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data. If youre routinely working with larger data (10-100 Gb, say), you should learn more about <a href="#chp-https://github.com/Rdatatable/data" data-type="xref">#chp-https://github.com/Rdatatable/data</a>. This book doesnt teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn. However, if youre working with large data, the performance payoff is well worth the effort required to learn it.</p> <p>This book proudly focuses on small, in-memory datasets. This is the right place to start because you cant tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data. If youre routinely working with larger data (10-100 Gb, say), you should learn more about <a href="https://github.com/Rdatatable/data.table">data.table</a>. This book doesnt teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn. However, if youre working with large data, the performance payoff is well worth the effort required to learn it.</p>
<p>If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise. While the complete data set might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that youre interested in. The challenge here is finding the right small data, which often requires a lot of iteration.</p> <p>If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise. While the complete data set might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that youre interested in. The challenge here is finding the right small data, which often requires a lot of iteration.</p>
<p>Another possibility is that your big data problem is actually a large number of small data problems in disguise. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. This would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like <a href="#chp-https://hadoop.apache.org/" data-type="xref">#chp-https://hadoop.apache.org/</a> or <a href="#chp-https://spark.apache.org/" data-type="xref">#chp-https://spark.apache.org/</a>) that allows you to send different datasets to different computers for processing. Once youve figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like <strong>sparklyr</strong> to solve it for the full dataset.</p> <p>Another possibility is that your big data problem is actually a large number of small data problems in disguise. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. This would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like <a href="https://hadoop.apache.org/">Hadoop</a> or <a href="https://spark.apache.org/">Spark</a>) that allows you to send different datasets to different computers for processing. Once youve figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like <strong>sparklyr</strong> to solve it for the full dataset.</p>
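<p>As a minimal sketch of the "many small problems" idea, the following base R code splits a data frame into independent pieces and fits the same model to each one. The built-in <code>mtcars</code> data and the <code>mpg ~ wt</code> model are stand-ins chosen for illustration and do not come from the original text; on a real cluster each piece would be sent to a different machine rather than processed in turn by <code>lapply()</code>.</p>

```r
# Split the data into independent subsets, one per number of cylinders.
pieces <- split(mtcars, mtcars$cyl)

# Fit the same model to every subset. No fit depends on any other,
# which is what makes the problem embarrassingly parallel.
fits <- lapply(pieces, function(piece) lm(mpg ~ wt, data = piece))

# Collect one slope per subset into a named numeric vector.
slopes <- vapply(fits, function(fit) coef(fit)[["wt"]], numeric(1))
slopes
```

<p>Because each element of <code>pieces</code> is handled on its own, the <code>lapply()</code> call is the only part that would need to change when moving to a distributed system.</p>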
</section> </section>
<section id="python-julia-and-friends" data-type="sect2"> <section id="python-julia-and-friends" data-type="sect2">
@ -61,7 +61,7 @@ Python, Julia, and friends</h2>
<section id="prerequisites" data-type="sect1"> <section id="prerequisites" data-type="sect1">
<h1> <h1>
Prerequisites</h1> Prerequisites</h1>
<p>Weve made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and its helpful if you have some programming experience already. If youve never programmed before, you might find <a href="#chp-https://rstudio-education.github.io/hopr/" data-type="xref">#chp-https://rstudio-education.github.io/hopr/</a> by Garrett to be a useful adjunct to this book.</p> <p>Weve made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and its helpful if you have some programming experience already. If youve never programmed before, you might find <a href="https://rstudio-education.github.io/hopr/">Hands on Programming with R</a> by Garrett to be a useful adjunct to this book.</p>
<p>There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the <strong>tidyverse</strong>, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.</p> <p>There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the <strong>tidyverse</strong>, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.</p>
<section id="r" data-type="sect2"> <section id="r" data-type="sect2">
@ -95,7 +95,7 @@ The tidyverse</h2>
<pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse")</pre> <pre data-type="programlisting" data-code-language="downlit">install.packages("tidyverse")</pre>
</div> </div>
<p>On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer. If you have problems installing, make sure that you are connected to the internet, and that <a href="https://cloud.r-project.org/" class="uri">https://cloud.r-project.org/</a> isnt blocked by your firewall or proxy.</p> <p>On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer. If you have problems installing, make sure that you are connected to the internet, and that <a href="https://cloud.r-project.org/" class="uri">https://cloud.r-project.org/</a> isnt blocked by your firewall or proxy.</p>
<p>You will not be able to use the functions, objects, or help files in a package until you load it with <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code>. Once you have installed a package, you can load it using the <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code> function:</p> <p>You will not be able to use the functions, objects, or help files in a package until you load it with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>. Once you have installed a package, you can load it using the <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> function:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse) <pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
#&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── #&gt; ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
@ -108,7 +108,7 @@ The tidyverse</h2>
#&gt; ✖ dplyr::lag() masks stats::lag()</pre> #&gt; ✖ dplyr::lag() masks stats::lag()</pre>
</div> </div>
<p>This tells you that tidyverse is loading eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These are considered to be the <strong>core</strong> of the tidyverse because you'll use them in almost every analysis.</p> <p>This tells you that tidyverse is loading eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These are considered to be the <strong>core</strong> of the tidyverse because you'll use them in almost every analysis.</p>
<p>Packages in the tidyverse change fairly frequently. You can check whether updates are available, and optionally install them, by running <code><a href="#chp-https://tidyverse.tidyverse.org/reference/tidyverse_update" data-type="xref">#chp-https://tidyverse.tidyverse.org/reference/tidyverse_update</a></code>.</p> <p>Packages in the tidyverse change fairly frequently. You can check whether updates are available, and optionally install them, by running <code><a href="https://tidyverse.tidyverse.org/reference/tidyverse_update.html">tidyverse_update()</a></code>.</p>
</section> </section>
<section id="other-packages" data-type="sect2"> <section id="other-packages" data-type="sect2">
@ -136,9 +136,9 @@ Running R code</h1>
[1] 3</code></pre> [1] 3</code></pre>
<p>There are two main differences. In your console, you type after the <code>&gt;</code>, called the <strong>prompt</strong>; we dont show the prompt in the book. In the book, output is commented out with <code>#&gt;</code>; in your console it appears directly after your code. These two differences mean that if youre working with an electronic version of the book, you can easily copy code out of the book and into the console.</p> <p>There are two main differences. In your console, you type after the <code>&gt;</code>, called the <strong>prompt</strong>; we dont show the prompt in the book. In the book, output is commented out with <code>#&gt;</code>; in your console it appears directly after your code. These two differences mean that if youre working with an electronic version of the book, you can easily copy code out of the book and into the console.</p>
<p>Throughout the book, we use a consistent set of conventions to refer to code:</p> <p>Throughout the book, we use a consistent set of conventions to refer to code:</p>
<ul><li><p>Functions are displayed in a code font and followed by parentheses, like <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>, or <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>.</p></li> <ul><li><p>Functions are displayed in a code font and followed by parentheses, like <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, or <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>.</p></li>
<li><p>Other R objects (such as data or function arguments) are in a code font, without parentheses, like <code>flights</code> or <code>x</code>.</p></li> <li><p>Other R objects (such as data or function arguments) are in a code font, without parentheses, like <code>flights</code> or <code>x</code>.</p></li>
<li><p>Sometimes, to make it clear which package an object comes from, we'll use the package name followed by two colons, like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, or<br/><code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code>. This is also valid R code.</p></li> <li><p>Sometimes, to make it clear which package an object comes from, we'll use the package name followed by two colons, like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">dplyr::mutate()</a></code>, or<br/><code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code>. This is also valid R code.</p></li>
</ul></section> </ul></section>
<section id="acknowledgements" data-type="sect1"> <section id="acknowledgements" data-type="sect1">
@ -147,7 +147,7 @@ Acknowledgements</h1>
<p>This book isnt just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that weve had with many people in the R community. There are a few people wed like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:</p> <p>This book isnt just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that weve had with many people in the R community. There are a few people wed like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:</p>
<ul><li><p>Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.</p></li> <ul><li><p>Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.</p></li>
<li><p>The three chapters on workflow were adapted (with permission), from <a href="https://stat545.com/block002_hello-r-workspace-wd-project.html" class="uri">https://stat545.com/block002_hello-r-workspace-wd-project.html</a> by Jenny Bryan.</p></li> <li><p>The three chapters on workflow were adapted (with permission), from <a href="https://stat545.com/block002_hello-r-workspace-wd-project.html" class="uri">https://stat545.com/block002_hello-r-workspace-wd-project.html</a> by Jenny Bryan.</p></li>
<li><p>Yihui Xie for his work on the <a href="#chp-https://github.com/rstudio/bookdown" data-type="xref">#chp-https://github.com/rstudio/bookdown</a> package, and for tirelessly responding to my feature requests.</p></li> <li><p>Yihui Xie for his work on the <a href="https://github.com/rstudio/bookdown">bookdown</a> package, and for tirelessly responding to my feature requests.</p></li>
<li><p>Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.</p></li> <li><p>Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.</p></li>
<li><p>The #rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.</p></li> <li><p>The #rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.</p></li>
</ul><p>This book was written in the open, and many people contributed pull requests to fix minor problems. Special thanks go to everyone who contributed via GitHub:</p> </ul><p>This book was written in the open, and many people contributed pull requests to fix minor problems. Special thanks go to everyone who contributed via GitHub:</p>
@ -160,7 +160,7 @@ Acknowledgements</h1>
<section id="colophon" data-type="sect1"> <section id="colophon" data-type="sect1">
<h1> <h1>
Colophon</h1> Colophon</h1>
<p>An online version of this book is available at <a href="https://r4ds.hadley.nz" class="uri">https://r4ds.hadley.nz</a>. It will continue to evolve in between reprints of the physical book. The source of the book is available at <a href="https://github.com/hadley/r4ds" class="uri">https://github.com/hadley/r4ds</a>. The book is powered by <a href="#chp-https://quarto" data-type="xref">#chp-https://quarto</a> which makes it easy to write books that combine text and executable code.</p> <p>An online version of this book is available at <a href="https://r4ds.hadley.nz" class="uri">https://r4ds.hadley.nz</a>. It will continue to evolve in between reprints of the physical book. The source of the book is available at <a href="https://github.com/hadley/r4ds" class="uri">https://github.com/hadley/r4ds</a>. The book is powered by <a href="https://quarto.org">Quarto</a> which makes it easy to write books that combine text and executable code.</p>
<p>This book was built with:</p> <p>This book was built with:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sessioninfo::session_info(c("tidyverse")) <pre data-type="programlisting" data-code-language="downlit">sessioninfo::session_info(c("tidyverse"))


@ -14,11 +14,11 @@ Introduction</h1>
<p>In this chapter, you'll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you'd need to explicitly double each element of <code>x</code> using some sort of for loop.</p> <p>In this chapter, you'll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally tends to look rather different from other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector <code>x</code> in R, you can just write <code>2 * x</code>. In most other languages, you'd need to explicitly double each element of <code>x</code> using some sort of for loop.</p>
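<p>To make the contrast concrete, here is a small sketch that doubles a vector both ways. The values in <code>x</code> and the name <code>doubled_loop</code> are made up for illustration; the loop is written in R only to show what the explicit version would look like.</p>

```r
x <- c(1, 5, 10)

# Implicit iteration: the multiplication is applied to every element.
doubled <- 2 * x

# The explicit loop that many other languages would require.
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- 2 * x[i]
}

# Both approaches give the same result.
identical(doubled, doubled_loop)
#> [1] TRUE
```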
<p>This book has already given you a small but powerful number of tools that perform the same action for multiple “things”:</p> <p>This book has already given you a small but powerful number of tools that perform the same action for multiple “things”:</p>
<ul><li> <ul><li>
<code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_wrap" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_wrap</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code> draw a plot for each subset.</li> <code><a href="https://ggplot2.tidyverse.org/reference/facet_wrap.html">facet_wrap()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code> draw a plot for each subset.</li>
<li> <li>
<code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> plus <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> computes summary statistics for each subset.</li> <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> plus <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> computes summary statistics for each subset.</li>
<li> <li>
<code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> create new rows and columns for each element of a list-column.</li> <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> create new rows and columns for each element of a list-column.</li>
</ul><p>Now its time to learn some more general tools, often called <strong>functional programming</strong> tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter well keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.</p> </ul><p>Now its time to learn some more general tools, often called <strong>functional programming</strong> tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter well keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
@ -33,7 +33,7 @@ Prerequisites</h2>
<p>This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev version with <code>devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr"))</code>.</p></div> <p>This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev version with <code>devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr"))</code>.</p></div>
<p>In this chapter, we'll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You've seen dplyr before, but <a href="#chp-http://purrr.tidyverse.org/" data-type="xref">#chp-http://purrr.tidyverse.org/</a> is new. We're going to use just a couple of purrr functions in this chapter, but it's a great package to explore as you improve your programming skills.</p> <p>In this chapter, we'll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You've seen dplyr before, but <a href="http://purrr.tidyverse.org/">purrr</a> is new. We're going to use just a couple of purrr functions in this chapter, but it's a great package to explore as you improve your programming skills.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre> <pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div> </div>
@ -66,7 +66,7 @@ Modifying multiple columns</h1>
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; #&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 10 -0.246 -0.287 -0.0567 0.144</pre> #&gt; 1 10 -0.246 -0.287 -0.0567 0.144</pre>
</div> </div>
<p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>:</p> <p>That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead you can use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; summarise( <pre data-type="programlisting" data-code-language="downlit">df |&gt; summarise(
n = n(), n = n(),
@ -77,14 +77,14 @@ Modifying multiple columns</h1>
#&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; #&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 10 -0.246 -0.287 -0.0567 0.144</pre> #&gt; 1 10 -0.246 -0.287 -0.0567 0.144</pre>
</div> </div>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> has three particularly important arguments, which we’ll discuss in detail in the following sections. You’ll use the first two every time you use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>: the first argument, <code>.cols</code>, specifies which columns you want to iterate over, and the second argument, <code>.fns</code>, specifies what to do with each column. You can use the <code>.names</code> argument when you need additional control over the names of output columns, which is particularly important when you use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. We’ll also discuss two important variations, <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>, which work with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>.</p> <p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> has three particularly important arguments, which we’ll discuss in detail in the following sections. 
You’ll use the first two every time you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: the first argument, <code>.cols</code>, specifies which columns you want to iterate over, and the second argument, <code>.fns</code>, specifies what to do with each column. You can use the <code>.names</code> argument when you need additional control over the names of output columns, which is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. We’ll also discuss two important variations, <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>, which work with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
<section id="selecting-columns-with-.cols" data-type="sect2"> <section id="selecting-columns-with-.cols" data-type="sect2">
<h2> <h2>
Selecting columns with <code>.cols</code> Selecting columns with <code>.cols</code>
</h2> </h2>
<p>The first argument to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>, <code>.cols</code>, selects the columns to transform. This uses the same specifications as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <a href="#sec-select" data-type="xref">#sec-select</a>, so you can use functions like <code><a href="#chp-https://tidyselect.r-lib.org/reference/starts_with" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/starts_with</a></code> and <code><a href="#chp-https://tidyselect.r-lib.org/reference/starts_with" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/starts_with</a></code> to select columns based on their name.</p> <p>The first argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, <code>.cols</code>, selects the columns to transform. This uses the same specifications as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <a href="#sec-select" data-type="xref">#sec-select</a>, so you can use functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code> and <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">ends_with()</a></code> to select columns based on their name.</p>
<p>There are two additional selection techniques that are particularly useful for <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>: <code><a href="#chp-https://tidyselect.r-lib.org/reference/everything" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/everything</a></code> and <code>where()</code>. <code><a href="#chp-https://tidyselect.r-lib.org/reference/everything" data-type="xref">#chp-https://tidyselect.r-lib.org/reference/everything</a></code> is straightforward: it selects every (non-grouping) column:</p> <p>There are two additional selection techniques that are particularly useful for <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> and <code>where()</code>. <code><a href="https://tidyselect.r-lib.org/reference/everything.html">everything()</a></code> is straightforward: it selects every (non-grouping) column:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
grp = sample(2, 10, replace = TRUE), grp = sample(2, 10, replace = TRUE),
@ -103,7 +103,7 @@ df |&gt;
#&gt; 1 1 -0.0935 -0.0163 0.363 0.364 #&gt; 1 1 -0.0935 -0.0163 0.363 0.364
#&gt; 2 2 0.312 -0.0576 0.208 0.565</pre> #&gt; 2 2 0.312 -0.0576 0.208 0.565</pre>
</div> </div>
<p>Note that grouping columns (<code>grp</code> here) are not included in <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>, because they’re automatically preserved by <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>.</p> <p>Note that grouping columns (<code>grp</code> here) are not included in <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, because they’re automatically preserved by <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>.</p>
<p><code>where()</code> allows you to select columns based on their type:</p> <p><code>where()</code> allows you to select columns based on their type:</p>
<ul><li> <ul><li>
<code>where(is.numeric)</code> selects all numeric columns.</li> <code>where(is.numeric)</code> selects all numeric columns.</li>
@ -143,8 +143,8 @@ df_types |&gt;
<section id="calling-a-single-function" data-type="sect2"> <section id="calling-a-single-function" data-type="sect2">
<h2> <h2>
Calling a single function</h2> Calling a single function</h2>
<p>The second argument to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a functional programming language.</p> <p>The second argument to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (<code>median</code>, <code>mean</code>, <code>str_flatten</code>, …) to another function (<code>across</code>). This is one of the features that makes R a functional programming language.</p>
<p>It’s important to note that we’re passing this function to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> so that <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> can call it; we’re not calling it ourselves. That means the function name should never be followed by <code>()</code>. If you forget, you’ll get an error:</p> <p>It’s important to note that we’re passing this function to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> so that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can call it; we’re not calling it ourselves. That means the function name should never be followed by <code>()</code>. If you forget, you’ll get an error:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
group_by(grp) |&gt; group_by(grp) |&gt;
@ -162,7 +162,7 @@ Calling a single function</h2>
<section id="calling-multiple-functions" data-type="sect2"> <section id="calling-multiple-functions" data-type="sect2">
<h2> <h2>
Calling multiple functions</h2> Calling multiple functions</h2>
<p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code> propagates those missing values, giving us a suboptimal output:</p> <p>In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> propagates those missing values, giving us a suboptimal output:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rnorm_na &lt;- function(n, n_na, mean = 0, sd = 1) { <pre data-type="programlisting" data-code-language="downlit">rnorm_na &lt;- function(n, n_na, mean = 0, sd = 1) {
sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na))) sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
@ -184,7 +184,7 @@ df_miss |&gt;
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; #&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 NA NA NA 0.704 5</pre> #&gt; 1 NA NA NA 0.704 5</pre>
</div> </div>
<p>It would be nice if we could pass along <code>na.rm = TRUE</code> to <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code> to remove these missing values. To do so, instead of calling <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code> directly, we need to create a new function that calls <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code> with the desired arguments:</p> <p>It would be nice if we could pass along <code>na.rm = TRUE</code> to <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> to remove these missing values. To do so, instead of calling <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> directly, we need to create a new function that calls <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code> with the desired arguments:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; <pre data-type="programlisting" data-code-language="downlit">df_miss |&gt;
summarise( summarise(
@ -204,7 +204,7 @@ df_miss |&gt;
n = n() n = n()
)</pre> )</pre>
</div> </div>
<p>In either case, <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> effectively expands to the following code:</p> <p>In either case, <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> effectively expands to the following code:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; <pre data-type="programlisting" data-code-language="downlit">df_miss |&gt;
summarise( summarise(
@ -215,7 +215,7 @@ df_miss |&gt;
n = n() n = n()
)</pre> )</pre>
</div> </div>
<p>When we remove the missing values from the <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code>, it would be nice to know just how many values we were removing. We can find that out by supplying two functions to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to <code>.fns</code>:</p> <p>When we remove the missing values from the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, it would be nice to know just how many values we were removing. We can find that out by supplying two functions to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: one to compute the median and the other to count the missing values. You supply multiple functions by using a named list to <code>.fns</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; <pre data-type="programlisting" data-code-language="downlit">df_miss |&gt;
summarise( summarise(
@ -236,7 +236,7 @@ df_miss |&gt;
<section id="column-names" data-type="sect2"> <section id="column-names" data-type="sect2">
<h2> <h2>
Column names</h2> Column names</h2>
<p>The result of <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> is named according to the specification provided in the <code>.names</code> argument. We could specify our own if we wanted the name of the function to come first<span data-type="footnote">You can’t currently change the order of the columns, but you could reorder them after the fact using <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> or similar.</span>:</p> <p>The result of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is named according to the specification provided in the <code>.names</code> argument. We could specify our own if we wanted the name of the function to come first<span data-type="footnote">You can’t currently change the order of the columns, but you could reorder them after the fact using <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> or similar.</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; <pre data-type="programlisting" data-code-language="downlit">df_miss |&gt;
summarise( summarise(
@ -255,7 +255,7 @@ Column names</h2>
#&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt; #&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5</pre> #&gt; 1 0.429 1 -0.721 1 -0.796 2 0.704 0 5</pre>
</div> </div>
<p>The <code>.names</code> argument is particularly important when you use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. By default the output of <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> is given the same names as the inputs. This means that <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> inside of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> will replace existing columns. For example, here we use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code> to replace <code>NA</code>s with <code>0</code>:</p> <p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; <pre data-type="programlisting" data-code-language="downlit">df_miss |&gt;
mutate( mutate(
@ -290,7 +290,7 @@ Column names</h2>
<section id="filtering" data-type="sect2"> <section id="filtering" data-type="sect2">
<h2> <h2>
Filtering</h2> Filtering</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> is a great match for <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> but it’s more awkward to use with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&amp;</code>. It’s clear that <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> called <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is a great match for <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> but it’s more awkward to use with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, because you usually combine multiple conditions with either <code>|</code> or <code>&amp;</code>. 
It’s clear that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> can help to create multiple logical columns, but then what? So dplyr provides two variants of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> called <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_any()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; filter(is.na(a) | is.na(b) | is.na(c) | is.na(d)) <pre data-type="programlisting" data-code-language="downlit">df_miss |&gt; filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
#&gt; # A tibble: 3 × 4 #&gt; # A tibble: 3 × 4
@ -321,7 +321,7 @@ df_miss |&gt; filter(if_all(a:d, is.na))
<section id="across-in-functions" data-type="sect2"> <section id="across-in-functions" data-type="sect2">
<h2> <h2>
<code>across()</code> in functions</h2> <code>across()</code> in functions</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> is particularly useful to program with because it allows you to operate on multiple columns. For example, <a href="#chp-https://twitter.com/_wurli/status/1571836746899283969" data-type="xref">#chp-https://twitter.com/_wurli/status/1571836746899283969</a> uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is particularly useful to program with because it allows you to operate on multiple columns. For example, <a href="https://twitter.com/_wurli/status/1571836746899283969">Jacob Scott</a> uses this little helper which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(lubridate) <pre data-type="programlisting" data-code-language="downlit">library(lubridate)
#&gt; Loading required package: timechange #&gt; Loading required package: timechange
@ -351,7 +351,7 @@ df_date |&gt;
#&gt; 1 Amy 2009-08-03 2009 8 3 #&gt; 1 Amy 2009-08-03 2009 8 3
#&gt; 2 Bob 2010-01-16 2010 1 16</pre> #&gt; 2 Bob 2010-01-16 2010 1 16</pre>
</div> </div>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in <a href="#sec-embracing" data-type="xref">#sec-embracing</a>. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in <a href="#sec-embracing" data-type="xref">#sec-embracing</a>. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summarise_means &lt;- function(df, summary_vars = where(is.numeric)) { <pre data-type="programlisting" data-code-language="downlit">summarise_means &lt;- function(df, summary_vars = where(is.numeric)) {
df |&gt; df |&gt;
@ -394,7 +394,7 @@ diamonds |&gt;
<h2> <h2>
Vs <code>pivot_longer()</code> Vs <code>pivot_longer()</code>
</h2> </h2>
<p>Before we go on, it’s worth pointing out an interesting connection between <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> (<a href="#sec-pivoting" data-type="xref">#sec-pivoting</a>). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:</p> <p>Before we go on, it’s worth pointing out an interesting connection between <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> (<a href="#sec-pivoting" data-type="xref">#sec-pivoting</a>). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
summarise(across(a:d, list(median = median, mean = mean))) summarise(across(a:d, list(median = median, mean = mean)))
@ -421,7 +421,7 @@ long
#&gt; 3 c 0.260 0.0716 #&gt; 3 c 0.260 0.0716
#&gt; 4 d 0.540 0.508</pre> #&gt; 4 d 0.540 0.508</pre>
</div> </div>
<p>And if you wanted the same structure as <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> you could pivot again:</p> <p>And if you wanted the same structure as <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> you could pivot again:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">long |&gt; <pre data-type="programlisting" data-code-language="downlit">long |&gt;
pivot_wider( pivot_wider(
@ -435,7 +435,7 @@ long
#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; #&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
#&gt; 1 0.0380 0.205 -0.0163 0.0910 0.260 0.0716 0.540 0.508</pre> #&gt; 1 0.0380 0.205 -0.0163 0.0910 0.260 0.0716 0.540 0.508</pre>
</div> </div>
<p>This is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>: when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:</p> <p>This is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>: when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_paired &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">df_paired &lt;- tibble(
a_val = rnorm(10), a_val = rnorm(10),
@ -448,7 +448,7 @@ long
d_wts = runif(10) d_wts = runif(10)
)</pre> )</pre>
</div> </div>
<p>There’s currently no way to do this with <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code><span data-type="footnote">Maybe there will be one day, but currently we don’t see how.</span>, but it’s relatively straightforward with <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>:</p> <p>There’s currently no way to do this with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code><span data-type="footnote">Maybe there will be one day, but currently we don’t see how.</span>, but it’s relatively straightforward with <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df_long &lt;- df_paired |&gt; <pre data-type="programlisting" data-code-language="downlit">df_long &lt;- df_paired |&gt;
pivot_longer( pivot_longer(
@ -479,17 +479,17 @@ df_long |&gt;
#&gt; 3 c -0.746 #&gt; 3 c -0.746
#&gt; 4 d -0.0142</pre> #&gt; 4 d -0.0142</pre>
</div> </div>
<p>If needed, you could <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> this back to the original form.</p> <p>If needed, you could <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> this back to the original form.</p>
</section> </section>
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>Compute the number of unique values in each column of <code><a href="#chp-https://allisonhorst.github.io/palmerpenguins/reference/penguins" data-type="xref">#chp-https://allisonhorst.github.io/palmerpenguins/reference/penguins</a></code>.</p></li> <ol type="1"><li><p>Compute the number of unique values in each column of <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">palmerpenguins::penguins</a></code>.</p></li>
<li><p>Compute the mean of every column in <code>mtcars</code>.</p></li> <li><p>Compute the mean of every column in <code>mtcars</code>.</p></li>
<li><p>Group <code>diamonds</code> by <code>cut</code>, <code>clarity</code>, and <code>color</code> then count the number of observations and the mean of each numeric column.</p></li> <li><p>Group <code>diamonds</code> by <code>cut</code>, <code>clarity</code>, and <code>color</code> then count the number of observations and the mean of each numeric column.</p></li>
<li><p>What happens if you use a list of functions, but don’t name them? How is the output named?</p></li> <li><p>What happens if you use a list of functions, but don’t name them? How is the output named?</p></li>
<li><p>It is possible to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> inside <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> where it’s equivalent to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>. Can you explain why?</p></li> <li><p>It is possible to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> where it’s equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">if_all()</a></code>. Can you explain why?</p></li>
<li><p>Adjust <code>expand_dates()</code> to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?</p></li> <li><p>Adjust <code>expand_dates()</code> to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?</p></li>
<li> <li>
<p>Explain what each step of the pipeline in this function does. What special feature of <code>where()</code> are we taking advantage of?</p> <p>Explain what each step of the pipeline in this function does. What special feature of <code>where()</code> are we taking advantage of?</p>
@ -512,27 +512,27 @@ nycflights13::flights |&gt; show_missing(c(year, month, day))</pre>
<section id="reading-multiple-files" data-type="sect1"> <section id="reading-multiple-files" data-type="sect1">
<h1> <h1>
Reading multiple files</h1> Reading multiple files</h1>
<p>In the previous section, you learned how to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code> to repeat a transformation on multiple columns. In this section, you’ll learn how to use <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of Excel spreadsheets<span data-type="footnote">If you instead had a directory of CSV files with the same format, you can use the technique from <a href="#sec-readr-directory" data-type="xref">#sec-readr-directory</a>.</span> you want to read. You could do it with copy and paste:</p> <p>In the previous section, you learned how to use <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> to repeat a transformation on multiple columns. In this section, you’ll learn how to use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of Excel spreadsheets<span data-type="footnote">If you instead had a directory of CSV files with the same format, you can use the technique from <a href="#sec-readr-directory" data-type="xref">#sec-readr-directory</a>.</span> you want to read. You could do it with copy and paste:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data2019 &lt;- readxl::read_excel("data/y2019.xlsx") <pre data-type="programlisting" data-code-language="downlit">data2019 &lt;- readxl::read_excel("data/y2019.xlsx")
data2020 &lt;- readxl::read_excel("data/y2020.xlsx") data2020 &lt;- readxl::read_excel("data/y2020.xlsx")
data2021 &lt;- readxl::read_excel("data/y2021.xlsx") data2021 &lt;- readxl::read_excel("data/y2021.xlsx")
data2022 &lt;- readxl::read_excel("data/y2022.xlsx")</pre> data2022 &lt;- readxl::read_excel("data/y2022.xlsx")</pre>
</div> </div>
<p>And then use <code><a href="#chp-https://dplyr.tidyverse.org/reference/bind_rows" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/bind_rows</a></code> to combine them all together:</p> <p>And then use <code><a href="https://dplyr.tidyverse.org/reference/bind_rows.html">dplyr::bind_rows()</a></code> to combine them all together:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data &lt;- bind_rows(data2019, data2020, data2021, data2022)</pre> <pre data-type="programlisting" data-code-language="downlit">data &lt;- bind_rows(data2019, data2020, data2021, data2022)</pre>
</div> </div>
<p>You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use <code><a href="#chp-https://rdrr.io/r/base/list.files" data-type="xref">#chp-https://rdrr.io/r/base/list.files</a></code> to list all the files in a directory, then use <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> to read each of them into a list, then use <code><a href="#chp-https://purrr.tidyverse.org/reference/list_c" data-type="xref">#chp-https://purrr.tidyverse.org/reference/list_c</a></code> to combine them into a single data frame. Well then discuss how you can handle situations of increasing heterogeneity, where you cant do exactly the same thing to every file.</p> <p>You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> to list all the files in a directory, then use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to read each of them into a list, then use <code><a href="https://purrr.tidyverse.org/reference/list_c.html">purrr::list_rbind()</a></code> to combine them into a single data frame. Well then discuss how you can handle situations of increasing heterogeneity, where you cant do exactly the same thing to every file.</p>
<section id="listing-files-in-a-directory" data-type="sect2"> <section id="listing-files-in-a-directory" data-type="sect2">
<h2> <h2>
Listing files in a directory</h2> Listing files in a directory</h2>
<p>As the name suggests, <code><a href="#chp-https://rdrr.io/r/base/list.files" data-type="xref">#chp-https://rdrr.io/r/base/list.files</a></code> lists the files in a directory. TO CONSIDER: why not use it via the more obvious name <code><a href="#chp-https://rdrr.io/r/base/list.files" data-type="xref">#chp-https://rdrr.io/r/base/list.files</a></code>? Youll almost always use three arguments:</p> <p>As the name suggests, <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code> lists the files in a directory. TO CONSIDER: why not use it via the more obvious name <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code>? Youll almost always use three arguments:</p>
<ul><li><p>The first argument, <code>path</code>, is the directory to look in.</p></li> <ul><li><p>The first argument, <code>path</code>, is the directory to look in.</p></li>
<li><p><code>pattern</code> is a regular expression used to filter the file names. The most common pattern is something like <code>[.]xlsx$</code> or <code>[.]csv$</code> to find all files with a specified extension.</p></li> <li><p><code>pattern</code> is a regular expression used to filter the file names. The most common pattern is something like <code>[.]xlsx$</code> or <code>[.]csv$</code> to find all files with a specified extension.</p></li>
<li><p><code>full.names</code> determines whether or not the directory name should be included in the output. You almost always want this to be <code>TRUE</code>.</p></li> <li><p><code>full.names</code> determines whether or not the directory name should be included in the output. You almost always want this to be <code>TRUE</code>.</p></li>
</ul><p>To make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one years worth of data for 142 countries. We can list them all with the appropriate call to <code><a href="#chp-https://rdrr.io/r/base/list.files" data-type="xref">#chp-https://rdrr.io/r/base/list.files</a></code>:</p> </ul><p>To make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one years worth of data for 142 countries. We can list them all with the appropriate call to <code><a href="https://rdrr.io/r/base/list.files.html">list.files()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths &lt;- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) <pre data-type="programlisting" data-code-language="downlit">paths &lt;- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
paths paths
@ -587,7 +587,7 @@ gapminder_2007 &lt;- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
<h2> <h2>
<code>purrr::map()</code> and <code>list_rbind()</code> <code>purrr::map()</code> and <code>list_rbind()</code>
</h2> </h2>
<p>The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one by one. Happily, we can use <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> to make even better use of our <code>paths</code> vector. <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> is similar to <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>, but instead of doing something to each column in a data frame, it does something to each element of a vector. <code>map(x, f)</code> is shorthand for:</p> <p>The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one by one. Happily, we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code> to make even better use of our <code>paths</code> vector. <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> is similar to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>, but instead of doing something to each column in a data frame, it does something to each element of a vector. <code>map(x, f)</code> is shorthand for:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">list( <pre data-type="programlisting" data-code-language="downlit">list(
f(x[[1]]), f(x[[1]]),
@ -596,7 +596,7 @@ gapminder_2007 &lt;- readxl::read_excel("data/gapminder/2007.xlsx")</pre>
f(x[[n]]) f(x[[n]])
)</pre> )</pre>
</div> </div>
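<p>As a small, self-contained illustration of that equivalence, here is a sketch in which the vector <code>x</code> and the choice of <code>sqrt()</code> as <code>f</code> are made up for the example and are not part of the gapminder data:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(purrr)

# Hypothetical input: any vector works, and sqrt() stands in for f
x &lt;- c(1, 4, 9)

# map() calls sqrt() on each element and collects the results in a list
mapped &lt;- map(x, sqrt)

# The same result, spelled out by hand
by_hand &lt;- list(sqrt(x[[1]]), sqrt(x[[2]]), sqrt(x[[3]]))

# Compare the two lists
identical(mapped, by_hand)</pre>
</div>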
<p>So we can use <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> to get a list of 12 data frames:</p> <p>So we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> to get a list of 12 data frames:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- map(paths, readxl::read_excel) <pre data-type="programlisting" data-code-language="downlit">files &lt;- map(paths, readxl::read_excel)
length(files) length(files)
@ -614,8 +614,8 @@ files[[1]]
#&gt; 6 Australia Oceania 69.1 8691212 10040. #&gt; 6 Australia Oceania 69.1 8691212 10040.
#&gt; # … with 136 more rows</pre> #&gt; # … with 136 more rows</pre>
</div> </div>
<p>(This is another data structure that doesn't display particularly compactly with <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code>, so you might want to load it into RStudio and inspect it with <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code>).</p> <p>(This is another data structure that doesn't display particularly compactly with <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code>, so you might want to load it into RStudio and inspect it with <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code>).</p>
<p>Now we can use <code><a href="#chp-https://purrr.tidyverse.org/reference/list_c" data-type="xref">#chp-https://purrr.tidyverse.org/reference/list_c</a></code> to combine that list of data frames into a single data frame:</p> <p>Now we can use <code><a href="https://purrr.tidyverse.org/reference/list_c.html">purrr::list_rbind()</a></code> to combine that list of data frames into a single data frame:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">list_rbind(files) <pre data-type="programlisting" data-code-language="downlit">list_rbind(files)
#&gt; # A tibble: 1,704 × 5 #&gt; # A tibble: 1,704 × 5
@ -635,7 +635,7 @@ files[[1]]
map(readxl::read_excel) |&gt; map(readxl::read_excel) |&gt;
list_rbind()</pre> list_rbind()</pre>
</div> </div>
<p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>. For example, it's often useful to peek at the first row of the data with <code>n_max = 1</code>:</p> <p>What if we want to pass in extra arguments to <code>read_excel()</code>? We use the same technique that we used with <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>. For example, it's often useful to peek at the first row of the data with <code>n_max = 1</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; <pre data-type="programlisting" data-code-language="downlit">paths |&gt;
map(\(path) readxl::read_excel(path, n_max = 1)) |&gt; map(\(path) readxl::read_excel(path, n_max = 1)) |&gt;
@ -658,7 +658,7 @@ files[[1]]
<h2> <h2>
Data in the path</h2> Data in the path</h2>
<p>Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.</p> <p>Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.</p>
<p>First, we name the vector of paths. The easiest way to do this is with the <code><a href="#chp-https://rlang.r-lib.org/reference/set_names" data-type="xref">#chp-https://rlang.r-lib.org/reference/set_names</a></code> function, which can take a function. Here we use <code><a href="#chp-https://rdrr.io/r/base/basename" data-type="xref">#chp-https://rdrr.io/r/base/basename</a></code> to extract just the file name from the full path:</p> <p>First, we name the vector of paths. The easiest way to do this is with the <code><a href="https://rlang.r-lib.org/reference/set_names.html">set_names()</a></code> function, which can take a function. Here we use <code><a href="https://rdrr.io/r/base/basename.html">basename()</a></code> to extract just the file name from the full path:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; set_names(basename) <pre data-type="programlisting" data-code-language="downlit">paths |&gt; set_names(basename)
#&gt; 1952.xlsx 1957.xlsx #&gt; 1952.xlsx 1957.xlsx
@ -680,7 +680,7 @@ Data in the path</h2>
set_names(basename) |&gt; set_names(basename) |&gt;
map(readxl::read_excel)</pre> map(readxl::read_excel)</pre>
</div> </div>
<p>That makes this call to <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> shorthand for:</p> <p>That makes this call to <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> shorthand for:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- list( <pre data-type="programlisting" data-code-language="downlit">files &lt;- list(
"1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"), "1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"),
@ -704,7 +704,7 @@ Data in the path</h2>
#&gt; 6 Australia Oceania 70.9 10794968 12217. #&gt; 6 Australia Oceania 70.9 10794968 12217.
#&gt; # … with 136 more rows</pre> #&gt; # … with 136 more rows</pre>
</div> </div>
<p>Then we use the <code>names_to</code> argument to <code><a href="#chp-https://purrr.tidyverse.org/reference/list_c" data-type="xref">#chp-https://purrr.tidyverse.org/reference/list_c</a></code> to tell it to save the names into a new column called <code>year</code> then use <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> to extract the number from the string.</p> <p>Then we use the <code>names_to</code> argument to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> to tell it to save the names into a new column called <code>year</code> then use <code><a href="https://readr.tidyverse.org/reference/parse_number.html">readr::parse_number()</a></code> to extract the number from the string.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; <pre data-type="programlisting" data-code-language="downlit">paths |&gt;
set_names(basename) |&gt; set_names(basename) |&gt;
@ -722,7 +722,7 @@ Data in the path</h2>
#&gt; 6 1952 Australia Oceania 69.1 8691212 10040. #&gt; 6 1952 Australia Oceania 69.1 8691212 10040.
#&gt; # … with 1,698 more rows</pre> #&gt; # … with 1,698 more rows</pre>
</div> </div>
<p>In more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use <code><a href="#chp-https://rlang.r-lib.org/reference/set_names" data-type="xref">#chp-https://rlang.r-lib.org/reference/set_names</a></code> (without any arguments) to record the full path, and then use <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> and friends to turn them into useful columns.</p> <p>In more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use <code><a href="https://rlang.r-lib.org/reference/set_names.html">set_names()</a></code> (without any arguments) to record the full path, and then use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">tidyr::separate_wider_delim()</a></code> and friends to turn them into useful columns.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># NOTE: this chapter also depends on dev tidyr (in addition to dev purrr and dev dplyr) <pre data-type="programlisting" data-code-language="downlit"># NOTE: this chapter also depends on dev tidyr (in addition to dev purrr and dev dplyr)
paths |&gt; paths |&gt;
@ -759,14 +759,14 @@ write_csv(gapminder, "gapminder.csv")</pre>
</div> </div>
<p>Now when you come back to this problem in the future, you can read in a single csv file.</p> <p>Now when you come back to this problem in the future, you can read in a single csv file.</p>
<p>If you're working in a project, we'd suggest calling the file that does this sort of data prep work something like <code>0-cleanup.R</code>. The <code>0</code> in the file name suggests that this should be run before anything else.</p> <p>If you're working in a project, we'd suggest calling the file that does this sort of data prep work something like <code>0-cleanup.R</code>. The <code>0</code> in the file name suggests that this should be run before anything else.</p>
<p>If your input data files change over time, you might consider learning a tool like <a href="#chp-https://docs.ropensci.org/targets/" data-type="xref">#chp-https://docs.ropensci.org/targets/</a> to set up your data cleaning code to automatically re-run whenever one of the input files is modified.</p> <p>If your input data files change over time, you might consider learning a tool like <a href="https://docs.ropensci.org/targets/">targets</a> to set up your data cleaning code to automatically re-run whenever one of the input files is modified.</p>
</section> </section>
<section id="many-simple-iterations" data-type="sect2"> <section id="many-simple-iterations" data-type="sect2">
<h2> <h2>
Many simple iterations</h2> Many simple iterations</h2>
<p>Here we've just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you'll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you're often better off doing multiple simple iterations.</p> <p>Here we've just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you'll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you're often better off doing multiple simple iterations.</p>
<p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, then call <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> once:</p> <p>For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, then call <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> once:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">process_file &lt;- function(path) { <pre data-type="programlisting" data-code-language="downlit">process_file &lt;- function(path) {
df &lt;- read_csv(path) df &lt;- read_csv(path)
@ -805,7 +805,7 @@ paths |&gt;
<section id="heterogeneous-data" data-type="sect2"> <section id="heterogeneous-data" data-type="sect2">
<h2> <h2>
Heterogeneous data</h2> Heterogeneous data</h2>
<p>Unfortunately sometimes its not possible to go from <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> straight to <code><a href="#chp-https://purrr.tidyverse.org/reference/list_c" data-type="xref">#chp-https://purrr.tidyverse.org/reference/list_c</a></code> because the data frames are so heterogeneous that <code><a href="#chp-https://purrr.tidyverse.org/reference/list_c" data-type="xref">#chp-https://purrr.tidyverse.org/reference/list_c</a></code> either fails or yields a data frame thats not very useful. In that case, its still useful to start by loading all of the files:</p> <p>Unfortunately sometimes its not possible to go from <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> straight to <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> because the data frames are so heterogeneous that <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code> either fails or yields a data frame thats not very useful. In that case, its still useful to start by loading all of the files:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- paths |&gt; <pre data-type="programlisting" data-code-language="downlit">files &lt;- paths |&gt;
map(readxl::read_excel) </pre> map(readxl::read_excel) </pre>
@ -861,21 +861,21 @@ df_types(nycflights13::flights)
#&gt; 6 1977.xlsx character character double double double #&gt; 6 1977.xlsx character character double double double
#&gt; # … with 6 more rows</pre> #&gt; # … with 6 more rows</pre>
</div> </div>
<p>If the files have heterogeneous formats you might need to do more processing before you can successfully merge them. Unfortunately were now going to leave you to figure that out on your own, but you might want to read about <code><a href="#chp-https://purrr.tidyverse.org/reference/map_if" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map_if</a></code> and <code><a href="#chp-https://purrr.tidyverse.org/reference/map_if" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map_if</a></code>. <code><a href="#chp-https://purrr.tidyverse.org/reference/map_if" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map_if</a></code> allows you to selectively modify elements of a list based on their values; <code><a href="#chp-https://purrr.tidyverse.org/reference/map_if" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map_if</a></code> allows you to selectively modify elements based on their names.</p> <p>If the files have heterogeneous formats you might need to do more processing before you can successfully merge them. Unfortunately were now going to leave you to figure that out on your own, but you might want to read about <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> and <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_if()</a></code> allows you to selectively modify elements of a list based on their values; <code><a href="https://purrr.tidyverse.org/reference/map_if.html">map_at()</a></code> allows you to selectively modify elements based on their names.</p>
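<p>As a rough sketch of the difference between the two, the following uses a small made-up list rather than the gapminder files; the list <code>x</code>, its element names, and the multiply-by-ten function are purely illustrative:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(purrr)

# Hypothetical list mixing numeric and character elements
x &lt;- list(a = 1, b = "two", c = 3)

# map_if(): modify only the elements whose *values* satisfy a predicate
by_value &lt;- map_if(x, is.numeric, \(v) v * 10)

# map_at(): modify only the elements selected by *name*
by_name &lt;- map_at(x, "c", \(v) v * 10)

str(by_value)
str(by_name)</pre>
</div>
<p>In both cases the untouched elements are returned unchanged, so the result has the same length and names as the input.</p>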
</section> </section>
<section id="handling-failures" data-type="sect2"> <section id="handling-failures" data-type="sect2">
<h2> <h2>
Handling failures</h2> Handling failures</h2>
<p>Sometimes the structure of your data might be sufficiently wild that you cant even read all the files with a single command. And then youll encounter one of the downsides of map: it succeeds or fails as a whole. <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?</p> <p>Sometimes the structure of your data might be sufficiently wild that you cant even read all the files with a single command. And then youll encounter one of the downsides of map: it succeeds or fails as a whole. <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?</p>
<p>Luckily, purrr comes with a helper to tackle this problem: <code><a href="#chp-https://purrr.tidyverse.org/reference/possibly" data-type="xref">#chp-https://purrr.tidyverse.org/reference/possibly</a></code>. <code><a href="#chp-https://purrr.tidyverse.org/reference/possibly" data-type="xref">#chp-https://purrr.tidyverse.org/reference/possibly</a></code> is whats known as a function operator: it takes a function and returns a function with modified behavior. In particular, <code><a href="#chp-https://purrr.tidyverse.org/reference/possibly" data-type="xref">#chp-https://purrr.tidyverse.org/reference/possibly</a></code> changes a function from erroring to returning a value that you specify:</p> <p>Luckily, purrr comes with a helper to tackle this problem: <code><a href="https://purrr.tidyverse.org/reference/possibly.html">possibly()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/possibly.html">possibly()</a></code> is whats known as a function operator: it takes a function and returns a function with modified behavior. In particular, <code><a href="https://purrr.tidyverse.org/reference/possibly.html">possibly()</a></code> changes a function from erroring to returning a value that you specify:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- paths |&gt; <pre data-type="programlisting" data-code-language="downlit">files &lt;- paths |&gt;
map(possibly(\(path) readxl::read_excel(path), NULL)) map(possibly(\(path) readxl::read_excel(path), NULL))
data &lt;- files |&gt; list_rbind()</pre> data &lt;- files |&gt; list_rbind()</pre>
</div> </div>
<p>This works particularly well here because <code><a href="#chp-https://purrr.tidyverse.org/reference/list_c" data-type="xref">#chp-https://purrr.tidyverse.org/reference/list_c</a></code>, like many tidyverse functions, automatically ignores <code>NULL</code>s.</p> <p>This works particularly well here because <code><a href="https://purrr.tidyverse.org/reference/list_c.html">list_rbind()</a></code>, like many tidyverse functions, automatically ignores <code>NULL</code>s.</p>
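<p>To see what <code>possibly()</code> does in isolation, here is a minimal sketch that does not touch any files. The <code>inputs</code> list and the <code>safe_log()</code> wrapper are hypothetical names chosen for this illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(purrr)

# Hypothetical inputs: log() throws an error for the character element
inputs &lt;- list(1, "a", 100)

# Wrap log() so that an error is replaced by NULL
safe_log &lt;- possibly(\(x) log(x), otherwise = NULL)

# map() now completes even though one call fails
results &lt;- map(inputs, safe_log)

# Locate the inputs that failed
failed &lt;- map_lgl(results, is.null)
inputs[failed]</pre>
</div>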
<p>Now you have all the data that can be read easily, and it's time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:</p> <p>Now you have all the data that can be read easily, and it's time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">failed &lt;- map_vec(files, is.null) <pre data-type="programlisting" data-code-language="downlit">failed &lt;- map_vec(files, is.null)
@ -889,7 +889,7 @@ paths[failed]
<section id="saving-multiple-outputs" data-type="sect1"> <section id="saving-multiple-outputs" data-type="sect1">
<h1> <h1>
Saving multiple outputs</h1> Saving multiple outputs</h1>
<p>In the last section, you learned about <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code>, which is useful for reading multiple files into a single object. In this section, we'll explore the opposite problem: how can you take one or more R objects and save them to one or more files? We'll explore this challenge using three examples:</p> <p>In the last section, you learned about <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>, which is useful for reading multiple files into a single object. In this section, we'll explore the opposite problem: how can you take one or more R objects and save them to one or more files? We'll explore this challenge using three examples:</p>
<ul><li>Saving multiple data frames into one database.</li> <ul><li>Saving multiple data frames into one database.</li>
<li>Saving multiple data frames into multiple csv files.</li> <li>Saving multiple data frames into multiple csv files.</li>
<li>Saving multiple plots to multiple <code>.png</code> files.</li> <li>Saving multiple plots to multiple <code>.png</code> files.</li>
@ -920,7 +920,7 @@ template
#&gt; 6 Australia Oceania 69.1 8691212 10040. 1952 #&gt; 6 Australia Oceania 69.1 8691212 10040. 1952
#&gt; # … with 136 more rows</pre> #&gt; # … with 136 more rows</pre>
</div> </div>
<p>Now we can connect to the database, and use <code><a href="#chp-https://dbi.r-dbi.org/reference/dbCreateTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbCreateTable</a></code> to turn our template into a database table:</p> <p>Now we can connect to the database, and use <code><a href="https://dbi.r-dbi.org/reference/dbCreateTable.html">DBI::dbCreateTable()</a></code> to turn our template into a database table:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(duckdb::duckdb()) <pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbCreateTable(con, "gapminder", template)</pre> DBI::dbCreateTable(con, "gapminder", template)</pre>
@ -933,7 +933,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
#&gt; # … with 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;, pop &lt;dbl&gt;, #&gt; # … with 6 variables: country &lt;chr&gt;, continent &lt;chr&gt;, lifeExp &lt;dbl&gt;, pop &lt;dbl&gt;,
#&gt; # gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;</pre> #&gt; # gdpPercap &lt;dbl&gt;, year &lt;dbl&gt;</pre>
</div> </div>
<p>Next, we need a function that takes a single file path, reads it into R, and adds the result to the <code>gapminder</code> table. We can do that by combining <code>read_excel()</code> with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbAppendTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbAppendTable</a></code>:</p> <p>Next, we need a function that takes a single file path, reads it into R, and adds the result to the <code>gapminder</code> table. We can do that by combining <code>read_excel()</code> with <code><a href="https://dbi.r-dbi.org/reference/dbAppendTable.html">DBI::dbAppendTable()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">append_file &lt;- function(path) { <pre data-type="programlisting" data-code-language="downlit">append_file &lt;- function(path) {
df &lt;- readxl::read_excel(path) df &lt;- readxl::read_excel(path)
@ -942,11 +942,11 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
DBI::dbAppendTable(con, "gapminder", df) DBI::dbAppendTable(con, "gapminder", df)
}</pre> }</pre>
</div> </div>
<p>Now we need to call <code>append_file()</code> once for each element of <code>paths</code>. That's certainly possible with <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code>:</p> <p>Now we need to call <code>append_file()</code> once for each element of <code>paths</code>. That's certainly possible with <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; map(append_file)</pre> <pre data-type="programlisting" data-code-language="downlit">paths |&gt; map(append_file)</pre>
</div> </div>
<p>But we don't care about the output of <code>append_file()</code>, so instead of <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> it's slightly nicer to use <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code>. <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> does exactly the same thing as <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> but throws the output away:</p> <p>But we don't care about the output of <code>append_file()</code>, so instead of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> it's slightly nicer to use <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>. <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code> does exactly the same thing as <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> but throws the output away:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; walk(append_file)</pre> <pre data-type="programlisting" data-code-language="downlit">paths |&gt; walk(append_file)</pre>
</div> </div>
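<p>As a minimal sketch of that difference (the toy input <code>1:3</code> and the anonymous function are illustrative assumptions, not part of the chapter's data): <code>map()</code> collects the results in a list, while <code>walk()</code> calls the function only for its side effects and invisibly returns its input.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(purrr)

# map() keeps what the function returns, one list element per input
doubled &lt;- map(1:3, \(i) i * 2)

# walk() discards those results and invisibly returns the input instead,
# which is why it suits functions called only for their side effects
walked &lt;- walk(1:3, \(i) i * 2)

length(doubled)          # one result per element
identical(walked, 1:3)   # checks that the input comes back unchanged</pre>
</div>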
@ -972,7 +972,7 @@ DBI::dbCreateTable(con, "gapminder", template)</pre>
<section id="writing-csv-files" data-type="sect2"> <section id="writing-csv-files" data-type="sect2">
<h2> <h2>
Writing csv files</h2> Writing csv files</h2>
<p>The same basic principle applies if we want to write multiple csv files, one for each group. Lets imagine that we want to take the <code><a href="#chp-https://ggplot2.tidyverse.org/reference/diamonds" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/diamonds</a></code> data and save one csv file for each <code>clarity</code>. First we need to make those individual datasets. There are many ways you could do that, but theres one way we particularly like: <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_nest" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_nest</a></code>.</p> <p>The same basic principle applies if we want to write multiple csv files, one for each group. Lets imagine that we want to take the <code><a href="https://ggplot2.tidyverse.org/reference/diamonds.html">ggplot2::diamonds</a></code> data and save one csv file for each <code>clarity</code>. First we need to make those individual datasets. There are many ways you could do that, but theres one way we particularly like: <code><a href="https://dplyr.tidyverse.org/reference/group_nest.html">group_nest()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">by_clarity &lt;- diamonds |&gt; <pre data-type="programlisting" data-code-language="downlit">by_clarity &lt;- diamonds |&gt;
group_nest(clarity) group_nest(clarity)
@ -1003,7 +1003,7 @@ by_clarity
#&gt; 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4 #&gt; 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4
#&gt; # … with 735 more rows</pre> #&gt; # … with 735 more rows</pre>
</div> </div>
<p>While were here, lets create a column that gives the name of output file, using <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code>:</p> <p>While were here, lets create a column that gives the name of output file, using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">by_clarity &lt;- by_clarity |&gt; <pre data-type="programlisting" data-code-language="downlit">by_clarity &lt;- by_clarity |&gt;
mutate(path = str_glue("diamonds-{clarity}.csv")) mutate(path = str_glue("diamonds-{clarity}.csv"))
@ -1028,7 +1028,7 @@ write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
... ...
write_csv(by_clarity$data[[8]], by_clarity$path[[8]])</pre> write_csv(by_clarity$data[[8]], by_clarity$path[[8]])</pre>
</div> </div>
<p>This is a little different to our previous uses of <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> because there are two arguments that are changing, not just one. That means we need a new function: <code><a href="#chp-https://purrr.tidyverse.org/reference/map2" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map2</a></code>, which varies both the first and second arguments. And because we again dont care about the output, we want <code><a href="#chp-https://purrr.tidyverse.org/reference/map2" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map2</a></code> rather than <code><a href="#chp-https://purrr.tidyverse.org/reference/map2" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map2</a></code>. That gives us:</p> <p>This is a little different to our previous uses of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> because there are two arguments that are changing, not just one. That means we need a new function: <code><a href="https://purrr.tidyverse.org/reference/map2.html">map2()</a></code>, which varies both the first and second arguments. And because we again dont care about the output, we want <code><a href="https://purrr.tidyverse.org/reference/map2.html">walk2()</a></code> rather than <code><a href="https://purrr.tidyverse.org/reference/map2.html">map2()</a></code>. That gives us:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">walk2(by_clarity$data, by_clarity$path, write_csv)</pre> <pre data-type="programlisting" data-code-language="downlit">walk2(by_clarity$data, by_clarity$path, write_csv)</pre>
</div> </div>
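<p>To try the same pattern without touching your working directory, here is a small sketch that writes to a temporary directory. The two toy data frames and the <code>demo-</code> file names are hypothetical; only <code>walk2()</code> and <code>write_csv()</code> come from the text above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(purrr)
library(readr)

# Two hypothetical datasets and one output path for each
datasets &lt;- list(data.frame(x = 1:2), data.frame(x = 3:4))
paths &lt;- file.path(tempdir(), c("demo-a.csv", "demo-b.csv"))

# walk2() passes datasets[[i]] as the first argument of write_csv()
# and paths[[i]] as the second
walk2(datasets, paths, write_csv)

# Check that one file was written per dataset
file.exists(paths)</pre>
</div>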
@ -1048,7 +1048,7 @@ carat_histogram(by_clarity$data[[1]])</pre>
<p><img src="iteration_files/figure-html/unnamed-chunk-70-1.png" class="img-fluid" width="576"/></p> <p><img src="iteration_files/figure-html/unnamed-chunk-70-1.png" class="img-fluid" width="576"/></p>
</div> </div>
</div> </div>
<p>Now we can use <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> to create a list of many plots<span data-type="footnote">You can print <code>by_clarity$plot</code> to get a crude animation — youll get one plot for each element of <code>plots</code>. NOTE: this didnt happen for me.</span> and their eventual file paths:</p> <p>Now we can use <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> to create a list of many plots<span data-type="footnote">You can print <code>by_clarity$plot</code> to get a crude animation — youll get one plot for each element of <code>plots</code>. NOTE: this didnt happen for me.</span> and their eventual file paths:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">by_clarity &lt;- by_clarity |&gt; <pre data-type="programlisting" data-code-language="downlit">by_clarity &lt;- by_clarity |&gt;
mutate( mutate(
@ -1056,7 +1056,7 @@ carat_histogram(by_clarity$data[[1]])</pre>
path = str_glue("clarity-{clarity}.png") path = str_glue("clarity-{clarity}.png")
)</pre> )</pre>
</div> </div>
<p>Then use <code><a href="#chp-https://purrr.tidyverse.org/reference/map2" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map2</a></code> with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggsave" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggsave</a></code> to save each plot:</p> <p>Then use <code><a href="https://purrr.tidyverse.org/reference/map2.html">walk2()</a></code> with <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> to save each plot:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">walk2( <pre data-type="programlisting" data-code-language="downlit">walk2(
by_clarity$path, by_clarity$path,
@ -1084,8 +1084,8 @@ ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)</pre>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>In this chapter youve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once youve mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="#chp-https://adv-r.hadley.nz/functionals" data-type="xref">#chp-https://adv-r.hadley.nz/functionals</a> of <em>Advanced R</em> and consulting the <a href="#chp-https://purrr.tidyverse" data-type="xref">#chp-https://purrr.tidyverse</a>.</p> <p>In this chapter youve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once youve mastered the techniques in this chapter, we highly recommend learning more by reading the <a href="https://adv-r.hadley.nz/functionals.html">Functionals chapter</a> of <em>Advanced R</em> and consulting the <a href="https://purrr.tidyverse.org">purrr website</a>.</p>
<p>If you know much about iteration in other languages you might be surprised that we didn't discuss the <code>for</code> loop. That's because R's orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can't, you can often use a functional programming tool like <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> that does something to each element of a list. However, you will see <code>for</code> loops in wild-caught code, so you'll learn about them in the next chapter where we'll discuss some important base R tools.</p> <p>If you know much about iteration in other languages you might be surprised that we didn't discuss the <code>for</code> loop. That's because R's orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can't, you can often use a functional programming tool like <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> that does something to each element of a list. However, you will see <code>for</code> loops in wild-caught code, so you'll learn about them in the next chapter where we'll discuss some important base R tools.</p>
</section> </section>


@ -127,7 +127,7 @@ Primary and foreign keys</h2>
<section id="checking-primary-keys" data-type="sect2"> <section id="checking-primary-keys" data-type="sect2">
<h2> <h2>
Checking primary keys</h2> Checking primary keys</h2>
<p>Now that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p> <p>Now that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation. One way to do that is to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> the primary keys and look for entries where <code>n</code> is greater than one. This reveals that <code>planes</code> and <code>weather</code> both look good:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt; <pre data-type="programlisting" data-code-language="downlit">planes |&gt;
count(tailnum) |&gt; count(tailnum) |&gt;
@ -219,13 +219,13 @@ Exercises</h2>
<section id="sec-mutating-joins" data-type="sect1"> <section id="sec-mutating-joins" data-type="sect1">
<h1> <h1>
Basic joins</h1> Basic joins</h1>
<p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p> <p>Now that you understand how data frames are connected via keys, we can start using joins to better understand the <code>flights</code> dataset. dplyr provides six join functions: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. They all have the same interface: they take a pair of data frames (<code>x</code> and <code>y</code>) and return a data frame. The order of the rows and columns in the output is primarily determined by <code>x</code>.</p>
<p>In this section, youll learn how to use one mutating join, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, and two filtering joins, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>. In the next section, youll learn exactly how these functions work, and about the remaining <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>.</p> <p>In this section, youll learn how to use one mutating join, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>, and two filtering joins, <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">semi_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>. In the next section, youll learn exactly how these functions work, and about the remaining <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>.</p>
<section id="mutating-joins" data-type="sect2"> <section id="mutating-joins" data-type="sect2">
<h2> <h2>
Mutating joins</h2> Mutating joins</h2>
<p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, the join functions add variables to the right, so if your dataset has many variables, you wont see the new ones. For these examples, well make it easier to see whats going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code> to avoid this problem.</span>:</p> <p>A <strong>mutating join</strong> allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the join functions add variables to the right, so if your dataset has many variables, you wont see the new ones. For these examples, well make it easier to see whats going on by creating a narrower dataset with just six variables<span data-type="footnote">Remember that in RStudio you can also use <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> to avoid this problem.</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 &lt;- flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights2 &lt;- flights |&gt;
select(year, time_hour, origin, dest, tailnum, carrier) select(year, time_hour, origin, dest, tailnum, carrier)
@ -241,7 +241,7 @@ flights2
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA #&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
#&gt; # … with 336,770 more rows</pre> #&gt; # … with 336,770 more rows</pre>
</div> </div>
<p>There are four types of mutating join, but theres one that youll use almost all of the time: <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>. Its special because the output will always have the same rows as <code>x</code><span data-type="footnote">Thats not 100% true, but youll get a warning whenever it isnt.</span>. The primary use of <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> is to add in additional metadata. For example, we can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> to add the full airline name to the <code>flights2</code> data:</p> <p>There are four types of mutating join, but theres one that youll use almost all of the time: <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code>. Its special because the output will always have the same rows as <code>x</code><span data-type="footnote">Thats not 100% true, but youll get a warning whenever it isnt.</span>. The primary use of <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> is to add in additional metadata. For example, we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> to add the full airline name to the <code>flights2</code> data:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt; <pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(airlines) left_join(airlines)
@ -289,7 +289,7 @@ flights2
#&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191 #&gt; 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed wi… 2 191
#&gt; # … with 336,770 more rows</pre> #&gt; # … with 336,770 more rows</pre>
</div> </div>
<p>When <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, theres no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p> <p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, theres no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt; <pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
filter(tailnum == "N3ALAA") |&gt; filter(tailnum == "N3ALAA") |&gt;
@ -312,7 +312,7 @@ flights2
<section id="specifying-join-keys" data-type="sect2"> <section id="specifying-join-keys" data-type="sect2">
<h2> <h2>
Specifying join keys</h2> Specifying join keys</h2>
<p>By default, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> will use all variables that appear in both data frames as the join key, the so called <strong>natural</strong> join. This is a useful heuristic, but it doesnt always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p> <p>By default, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> will use all variables that appear in both data frames as the join key, the so called <strong>natural</strong> join. This is a useful heuristic, but it doesnt always work. For example, what happens if we try to join <code>flights2</code> with the complete <code>planes</code> dataset?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt; <pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes) left_join(planes)
@ -329,7 +329,7 @@ Specifying join keys</h2>
#&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;, #&gt; # … with 336,770 more rows, 4 more variables: engines &lt;int&gt;, seats &lt;int&gt;,
#&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre> #&gt; # speed &lt;int&gt;, engine &lt;chr&gt;, and abbreviated variable name ¹manufacturer</pre>
</div> </div>
<p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/join_by</a></code>:</p> <p>We get a lot of missing matches because our join is trying to use <code>tailnum</code> and <code>year</code> as a compound key. Both <code>flights</code> and <code>planes</code> have a <code>year</code> column but they mean different things: <code>flights$year</code> is year the flight occurred and <code>planes$year</code> is the year the plane was built. We only want to join on <code>tailnum</code> so we need to provide an explicit specification with <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights2 |&gt; <pre data-type="programlisting" data-code-language="downlit">flights2 |&gt;
left_join(planes, join_by(tailnum)) left_join(planes, join_by(tailnum))
@ -383,7 +383,7 @@ flights2 |&gt;
<code>by = "x"</code> corresponds to <code>join_by(x)</code>.</li> <code>by = "x"</code> corresponds to <code>join_by(x)</code>.</li>
<li> <li>
<code>by = c("a" = "x")</code> corresponds to <code>join_by(a == x)</code>.</li> <code>by = c("a" = "x")</code> corresponds to <code>join_by(a == x)</code>.</li>
</ul><p>Now that it exists, we prefer <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/join_by</a></code> since it provides a clearer and more flexible specification.</p> </ul><p>Now that it exists, we prefer <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code> since it provides a clearer and more flexible specification.</p>
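<p>A minimal sketch of the second equivalence, assuming dplyr 1.1.0 or later (where <code>join_by()</code> is available) and two hypothetical tibbles, <code>df_a</code> and <code>df_x</code>, whose key columns have different names:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical tables: the key is called `a` in one and `x` in the other
df_a &lt;- tibble(a = c(1, 2, 3), val_a = c("a1", "a2", "a3"))
df_x &lt;- tibble(x = c(1, 2, 4), val_x = c("x1", "x2", "x4"))

# Older character-vector specification
with_by &lt;- df_a |&gt; left_join(df_x, by = c("a" = "x"))

# The same join written with join_by()
with_join_by &lt;- df_a |&gt; left_join(df_x, join_by(a == x))

# Compare the two results
all.equal(with_by, with_join_by)</pre>
</div>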
</section> </section>
<section id="filtering-joins" data-type="sect2"> <section id="filtering-joins" data-type="sect2">
@ -570,7 +570,7 @@ y &lt;- tribble(
<section id="row-matching" data-type="sect2"> <section id="row-matching" data-type="sect2">
<h2> <h2>
Row matching</h2> Row matching</h2>
<p>So far we've explored what happens if a row in <code>x</code> matches zero or one rows in <code>y</code>. What happens if it matches more than one row? To understand what's going on, let's first narrow our focus to the <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> and then draw a picture, <a href="#fig-join-match-types" data-type="xref">#fig-join-match-types</a>.</p> <p>So far we've explored what happens if a row in <code>x</code> matches zero or one rows in <code>y</code>. What happens if it matches more than one row? To understand what's going on, let's first narrow our focus to the <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> and then draw a picture, <a href="#fig-join-match-types" data-type="xref">#fig-join-match-types</a>.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -606,7 +606,7 @@ df1 |&gt;
#&gt; 2 2 x2 y2 #&gt; 2 2 x2 y2
#&gt; 3 2 x2 y3</pre> #&gt; 3 2 x2 y3</pre>
</div> </div>
<p>This is one reason we like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> — if it runs without warning, you know that each row of the output matches the row in the same position in <code>x</code>.</p> <p>This is one reason we like <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> — if it runs without warning, you know that each row of the output matches the row in the same position in <code>x</code>.</p>
<p>You can gain further control over row matching with two arguments:</p> <p>You can gain further control over row matching with two arguments:</p>
<ul><li> <ul><li>
<code>unmatched</code> controls what happens when a row in <code>x</code> fails to match any rows in <code>y</code>. It defaults to <code>"drop"</code> which will silently drop any unmatched rows.</li> <code>unmatched</code> controls what happens when a row in <code>x</code> fails to match any rows in <code>y</code>. It defaults to <code>"drop"</code> which will silently drop any unmatched rows.</li>
@ -635,7 +635,7 @@ df1 |&gt;
#&gt; ! Each row of `x` must have a match in `y`. #&gt; ! Each row of `x` must have a match in `y`.
#&gt; Row 1 of `x` does not have a match.</pre> #&gt; Row 1 of `x` does not have a match.</pre>
</div> </div>
<p>Note that <code>unmatched = "error"</code> is not useful with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> because, as described above, every row in <code>x</code> has a fallback match to a virtual row in <code>y</code>.</p> <p>Note that <code>unmatched = "error"</code> is not useful with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> because, as described above, every row in <code>x</code> has a fallback match to a virtual row in <code>y</code>.</p>
</section> </section>
<section id="allow-multiple-rows" data-type="sect2"> <section id="allow-multiple-rows" data-type="sect2">
@ -703,7 +703,7 @@ Filtering joins</h2>
<h1> <h1>
Non-equi joins</h1> Non-equi joins</h1>
<p>So far you've only seen equi-joins, joins where the rows match if the <code>x</code> key equals the <code>y</code> key. Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.</p> <p>So far you've only seen equi-joins, joins where the rows match if the <code>x</code> key equals the <code>y</code> key. Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.</p>
<p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> and <code>y</code> keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p> <p>But before we can do that, we need to revisit a simplification we made above. In equi-joins the <code>x</code> and <code>y</code> keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with <code>keep = TRUE</code>, leading to the code below and the re-drawn <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code> in <a href="#fig-inner-both" data-type="xref">#fig-inner-both</a>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x |&gt; left_join(y, by = "key", keep = TRUE) <pre data-type="programlisting" data-code-language="downlit">x |&gt; left_join(y, by = "key", keep = TRUE)
#&gt; # A tibble: 3 × 4 #&gt; # A tibble: 3 × 4
@ -956,7 +956,7 @@ x |&gt; full_join(y, by = "key", keep = TRUE)
#&gt; 4 NA &lt;NA&gt; 4 y3</pre> #&gt; 4 NA &lt;NA&gt; 4 y3</pre>
</div> </div>
</li> </li>
<li><p>When finding whether any party period overlapped with another party period, we used <code>q &lt; q</code> in the <code><a href="#chp-https://dplyr.tidyverse.org/reference/join_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/join_by</a></code>. Why? What happens if you remove this inequality?</p></li> <li><p>When finding whether any party period overlapped with another party period, we used <code>q &lt; q</code> in the <code><a href="https://dplyr.tidyverse.org/reference/join_by.html">join_by()</a></code>. Why? What happens if you remove this inequality?</p></li>
</ol></section> </ol></section>
</section> </section>


@ -12,23 +12,23 @@
<h1> <h1>
Introduction</h1> Introduction</h1>
<p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p> <p>In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: <code>TRUE</code>, <code>FALSE</code>, and <code>NA</code>. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.</p>
<p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>, two useful functions for making conditional changes powered by logical vectors.</p> <p>We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, two useful functions for making conditional changes powered by logical vectors.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
Prerequisites</h2> Prerequisites</h2>
<p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p> <p>Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse) <pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre> library(nycflights13)</pre>
</div> </div>
<p>However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">#chp-https://rdrr.io/r/base/c</a></code>:</p> <p>However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 3, 5, 7, 11, 13) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 3, 5, 7, 11, 13)
x * 2 x * 2
#&gt; [1] 2 4 6 10 14 22 26</pre> #&gt; [1] 2 4 6 10 14 22 26</pre>
</div> </div>
<p>This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and friends.</p> <p>This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and friends.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x) <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x)
df |&gt; df |&gt;
@ -50,7 +50,7 @@ df |&gt;
<section id="comparisons" data-type="sect1"> <section id="comparisons" data-type="sect1">
<h1> <h1>
Comparisons</h1> Comparisons</h1>
<p>A very common way to create a logical vector is via a numeric comparison with <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code>, and <code>==</code>. So far, we’ve mostly created logical variables transiently within <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>: they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that leave roughly on time:</p> <p>A very common way to create a logical vector is via a numeric comparison with <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;</code>, <code>&gt;=</code>, <code>!=</code>, and <code>==</code>. So far, we’ve mostly created logical variables transiently within <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>: they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that leave roughly on time:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time &gt; 600 &amp; dep_time &lt; 2000 &amp; abs(arr_delay) &lt; 20) filter(dep_time &gt; 600 &amp; dep_time &lt; 2000 &amp; abs(arr_delay) &lt; 20)
@ -68,7 +68,7 @@ Comparisons</h1>
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>:</p> <p>It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate( mutate(
@ -112,13 +112,13 @@ x
<pre data-type="programlisting" data-code-language="downlit">x == c(1, 2) <pre data-type="programlisting" data-code-language="downlit">x == c(1, 2)
#&gt; [1] FALSE FALSE</pre> #&gt; [1] FALSE FALSE</pre>
</div> </div>
<p>What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code> and subsequent computations will be very slightly off. We can see the exact values by calling <code><a href="#chp-https://rdrr.io/r/base/print" data-type="xref">#chp-https://rdrr.io/r/base/print</a></code> with the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p> <p>What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or <code>sqrt(2)</code> and subsequent computations will be very slightly off. We can see the exact values by calling <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> with the <code>digits</code><span data-type="footnote">R normally calls print for you (i.e. <code>x</code> is a shortcut for <code>print(x)</code>), but calling it explicitly is useful if you want to provide other arguments.</span> argument:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">print(x, digits = 16) <pre data-type="programlisting" data-code-language="downlit">print(x, digits = 16)
#&gt; [1] 0.9999999999999999 2.0000000000000004</pre> #&gt; [1] 0.9999999999999999 2.0000000000000004</pre>
</div> </div>
<p>You can see why R defaults to rounding these numbers; they really are very close to what you expect.</p> <p>You can see why R defaults to rounding these numbers; they really are very close to what you expect.</p>
<p>Now that you’ve seen why <code>==</code> is failing, what can you do about it? One option is to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/near" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/near</a></code>, which ignores small differences:</p> <p>Now that you’ve seen why <code>==</code> is failing, what can you do about it? One option is to use <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code>, which ignores small differences:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">near(x, c(1, 2)) <pre data-type="programlisting" data-code-language="downlit">near(x, c(1, 2))
#&gt; [1] TRUE TRUE</pre> #&gt; [1] TRUE TRUE</pre>
@ -153,7 +153,7 @@ x == y
#&gt; [1] NA #&gt; [1] NA
# We don't know!</pre> # We don't know!</pre>
</div> </div>
<p>So if you want to find all flights where <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield <code>NA</code> for every single row, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> automatically drops missing values:</p> <p>So if you want to find all flights where <code>dep_time</code> is missing, the following code doesn’t work because <code>dep_time == NA</code> will yield <code>NA</code> for every single row, and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> automatically drops missing values:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dep_time == NA) filter(dep_time == NA)
@ -164,7 +164,7 @@ x == y
#&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;, #&gt; # tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;, air_time &lt;dbl&gt;, distance &lt;dbl&gt;,
#&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre> #&gt; # hour &lt;dbl&gt;, minute &lt;dbl&gt;, time_hour &lt;dttm&gt;</pre>
</div> </div>
<p>Instead we’ll need a new tool: <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>.</p> <p>Instead we’ll need a new tool: <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
</section> </section>
<section id="is.na" data-type="sect2"> <section id="is.na" data-type="sect2">
@ -180,7 +180,7 @@ is.na(c(1, NA, 3))
is.na(c("a", NA, "b")) is.na(c("a", NA, "b"))
#&gt; [1] FALSE TRUE FALSE</pre> #&gt; [1] FALSE TRUE FALSE</pre>
</div> </div>
<p>We can use <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code> to find all the rows with a missing <code>dep_time</code>:</p> <p>We can use <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> to find all the rows with a missing <code>dep_time</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(is.na(dep_time)) filter(is.na(dep_time))
@ -198,7 +198,7 @@ is.na(c("a", NA, "b"))
#&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names #&gt; # minute &lt;dbl&gt;, time_hour &lt;dttm&gt;, and abbreviated variable names
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p><code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code> can also be useful in <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>. <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>:</p> <p><code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code> can also be useful in <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> usually puts all the missing values at the end but you can override this default by first sorting by <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(month == 1, day == 1) |&gt; filter(month == 1, day == 1) |&gt;
@ -240,15 +240,15 @@ flights |&gt;
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li>How does <code><a href="#chp-https://dplyr.tidyverse.org/reference/near" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/near</a></code> work? Type <code>near</code> to see the source code.</li> <ol type="1"><li>How does <code><a href="https://dplyr.tidyverse.org/reference/near.html">dplyr::near()</a></code> work? Type <code>near</code> to see the source code.</li>
<li>Use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> together to describe how the missing values in <code>dep_time</code>, <code>sched_dep_time</code> and <code>dep_delay</code> are connected.</li> <li>Use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> together to describe how the missing values in <code>dep_time</code>, <code>sched_dep_time</code> and <code>dep_delay</code> are connected.</li>
</ol></section> </ol></section>
</section> </section>
<section id="boolean-algebra" data-type="sect1"> <section id="boolean-algebra" data-type="sect1">
<h1> <h1>
Boolean algebra</h1> Boolean algebra</h1>
<p>Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, <code>&amp;</code> is “and”, <code>|</code> is “or”, <code>!</code> is “not”, and <code><a href="#chp-https://rdrr.io/r/base/Logic" data-type="xref">#chp-https://rdrr.io/r/base/Logic</a></code> is exclusive or<span data-type="footnote">That is, <code>xor(x, y)</code> is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.</span>. <a href="#fig-bool-ops" data-type="xref">#fig-bool-ops</a> shows the complete set of Boolean operations and how they work.</p> <p>Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, <code>&amp;</code> is “and”, <code>|</code> is “or”, <code>!</code> is “not”, and <code><a href="https://rdrr.io/r/base/Logic.html">xor()</a></code> is exclusive or<span data-type="footnote">That is, <code>xor(x, y)</code> is true if x is true, or y is true, but not both. This is how we usually use “or” in English. “Both” is not usually an acceptable answer to the question “would you like ice cream or cake?”.</span>. <a href="#fig-bool-ops" data-type="xref">#fig-bool-ops</a> shows the complete set of Boolean operations and how they work.</p>
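<p>As a small sketch of these operators on free-floating vectors (the names <code>a</code> and <code>b</code> are made up for illustration and are not part of the chapter’s data), the four combinations of <code>TRUE</code> and <code>FALSE</code> behave like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical vectors covering all four TRUE/FALSE combinations
a &lt;- c(TRUE, TRUE, FALSE, FALSE)
b &lt;- c(TRUE, FALSE, TRUE, FALSE)

a &amp; b      # TRUE only where both are TRUE
#&gt; [1] TRUE FALSE FALSE FALSE
a | b      # TRUE where at least one is TRUE
#&gt; [1] TRUE TRUE TRUE FALSE
!a         # flips every element
#&gt; [1] FALSE FALSE TRUE TRUE
xor(a, b)  # TRUE where exactly one is TRUE
#&gt; [1] FALSE TRUE TRUE FALSE</pre>
</div>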
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -388,8 +388,8 @@ Summaries</h1>
<section id="logical-summaries" data-type="sect2"> <section id="logical-summaries" data-type="sect2">
<h2> <h2>
Logical summaries</h2> Logical summaries</h2>
<p>There are two main logical summaries: <code><a href="#chp-https://rdrr.io/r/base/any" data-type="xref">#chp-https://rdrr.io/r/base/any</a></code> and <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code>. <code>any(x)</code> is the equivalent of <code>|</code>; it’ll return <code>TRUE</code> if there are any <code>TRUE</code>s in <code>x</code>. <code>all(x)</code> is the equivalent of <code>&amp;</code>; it’ll return <code>TRUE</code> only if all values of <code>x</code> are <code>TRUE</code>s. Like all summary functions, they’ll return <code>NA</code> if there are any missing values present, and as usual you can make the missing values go away with <code>na.rm = TRUE</code>.</p> <p>There are two main logical summaries: <code><a href="https://rdrr.io/r/base/any.html">any()</a></code> and <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>. <code>any(x)</code> is the equivalent of <code>|</code>; it’ll return <code>TRUE</code> if there are any <code>TRUE</code>s in <code>x</code>. <code>all(x)</code> is the equivalent of <code>&amp;</code>; it’ll return <code>TRUE</code> only if all values of <code>x</code> are <code>TRUE</code>s. Like all summary functions, they’ll return <code>NA</code> if there are any missing values present, and as usual you can make the missing values go away with <code>na.rm = TRUE</code>.</p>
<p>For example, we could use <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code> to find out if there were days where every flight was delayed:</p> <p>For example, we could use <code><a href="https://rdrr.io/r/base/all.html">all()</a></code> to find out if there were days where every flight was delayed:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt; group_by(year, month, day) |&gt;
@ -409,13 +409,13 @@ Logical summaries</h2>
#&gt; 6 2013 1 6 FALSE TRUE #&gt; 6 2013 1 6 FALSE TRUE
#&gt; # … with 359 more rows</pre> #&gt; # … with 359 more rows</pre>
</div> </div>
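<p>To see the missing-value behavior in isolation, here is a minimal sketch on a made-up vector <code>z</code> (an illustrative name, not taken from the flights data), chosen so that the answer really does depend on the missing element:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical vector with one known and one missing value
z &lt;- c(TRUE, NA)

# Are all values TRUE? Unknown until the NA is dropped.
all(z)
#&gt; [1] NA
all(z, na.rm = TRUE)
#&gt; [1] TRUE

# Is any value FALSE? Again unknown until the NA is dropped.
any(!z)
#&gt; [1] NA
any(!z, na.rm = TRUE)
#&gt; [1] FALSE</pre>
</div>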
<p>In most cases, however, <code><a href="#chp-https://rdrr.io/r/base/any" data-type="xref">#chp-https://rdrr.io/r/base/any</a></code> and <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code> are a little too crude, and it would be nice to be able to get a little more detail about how many values are <code>TRUE</code> or <code>FALSE</code>. That leads us to the numeric summaries.</p> <p>In most cases, however, <code><a href="https://rdrr.io/r/base/any.html">any()</a></code> and <code><a href="https://rdrr.io/r/base/all.html">all()</a></code> are a little too crude, and it would be nice to be able to get a little more detail about how many values are <code>TRUE</code> or <code>FALSE</code>. That leads us to the numeric summaries.</p>
</section> </section>
<section id="numeric-summaries-of-logical-vectors" data-type="sect2"> <section id="numeric-summaries-of-logical-vectors" data-type="sect2">
<h2> <h2>
Numeric summaries of logical vectors</h2> Numeric summaries of logical vectors</h2>
<p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p> <p>When you use a logical vector in a numeric context, <code>TRUE</code> becomes 1 and <code>FALSE</code> becomes 0. This makes <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> very useful with logical vectors because <code>sum(x)</code> will give the number of <code>TRUE</code>s and <code>mean(x)</code> the proportion of <code>TRUE</code>s. That lets us see the distribution of delays across the days of the year as shown in <a href="#fig-prop-delayed-dist" data-type="xref">#fig-prop-delayed-dist</a>.</p>
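<p>A minimal sketch of that coercion, using a made-up logical vector <code>delayed</code> (a hypothetical name chosen only for illustration):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical flags: was each of four flights delayed?
delayed &lt;- c(TRUE, FALSE, TRUE, TRUE)

sum(delayed)   # TRUE counts as 1, so this is the number of TRUEs
#&gt; [1] 3
mean(delayed)  # and this is the proportion of TRUEs
#&gt; [1] 0.75</pre>
</div>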
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt; group_by(year, month, day) |&gt;
@ -501,27 +501,27 @@ Logical subsetting</h2>
#&gt; 6 2013 1 6 24.4 -13.6 832 #&gt; 6 2013 1 6 24.4 -13.6 832
#&gt; # … with 359 more rows</pre> #&gt; # … with 359 more rows</pre>
</div> </div>
<p>Also note the difference in the group size: in the first chunk <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> gives the number of delayed flights per day; in the second, <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> gives the total number of flights.</p> <p>Also note the difference in the group size: in the first chunk <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the number of delayed flights per day; in the second, <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> gives the total number of flights.</p>
</section> </section>
<section id="exercises-2" data-type="sect2"> <section id="exercises-2" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li> <ol type="1"><li>What will <code>sum(is.na(x))</code> tell you? How about <code>mean(is.na(x))</code>?</li>
<li>What does <code><a href="#chp-https://rdrr.io/r/base/prod" data-type="xref">#chp-https://rdrr.io/r/base/prod</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? What does <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> return applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.</li> <li>What does <code><a href="https://rdrr.io/r/base/prod.html">prod()</a></code> return when applied to a logical vector? What logical summary function is it equivalent to? What does <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.</li>
</ol></section> </ol></section>
</section> </section>
<section id="conditional-transformations" data-type="sect1"> <section id="conditional-transformations" data-type="sect1">
<h1> <h1>
Conditional transformations</h1> Conditional transformations</h1>
<p>One of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>.</p> <p>One of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>.</p>
<section id="if_else" data-type="sect2"> <section id="if_else" data-type="sect2">
<h2> <h2>
<code>if_else()</code> <code>if_else()</code>
</h2> </h2>
<p>If you want to use one value when a condition is true and another value when it’s <code>FALSE</code>, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code><span data-type="footnote">dplyr’s <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> is very similar to base R’s <code><a href="#chp-https://rdrr.io/r/base/ifelse" data-type="xref">#chp-https://rdrr.io/r/base/ifelse</a></code>. There are two main advantages of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> over <code><a href="#chp-https://rdrr.io/r/base/ifelse" data-type="xref">#chp-https://rdrr.io/r/base/ifelse</a></code>: you can choose what should happen to missing values, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p> <p>If you want to use one value when a condition is true and another value when it’s <code>FALSE</code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">dplyr::if_else()</a></code><span data-type="footnote">dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is very similar to base R’s <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>. There are two main advantages of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> over <code><a href="https://rdrr.io/r/base/ifelse.html">ifelse()</a></code>: you can choose what should happen to missing values, and <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> is much more likely to give you a meaningful error if your variables have incompatible types.</span>. You’ll always use the first three arguments of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>. The first argument, <code>condition</code>, is a logical vector, the second, <code>true</code>, gives the output when the condition is true, and the third, <code>false</code>, gives the output if the condition is false.</p>
<p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p> <p>Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(-3:3, NA) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(-3:3, NA)
@ -533,32 +533,32 @@ if_else(x &gt; 0, "+ve", "-ve")
<pre data-type="programlisting" data-code-language="downlit">if_else(x &gt; 0, "+ve", "-ve", "???") <pre data-type="programlisting" data-code-language="downlit">if_else(x &gt; 0, "+ve", "-ve", "???")
#&gt; [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"</pre> #&gt; [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"</pre>
</div> </div>
<p>You can also use vectors for the <code>true</code> and <code>false</code> arguments. For example, this allows us to create a minimal implementation of <code><a href="#chp-https://rdrr.io/r/base/MathFun" data-type="xref">#chp-https://rdrr.io/r/base/MathFun</a></code>:</p> <p>You can also use vectors for the <code>true</code> and <code>false</code> arguments. For example, this allows us to create a minimal implementation of <code><a href="https://rdrr.io/r/base/MathFun.html">abs()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">if_else(x &lt; 0, -x, x) <pre data-type="programlisting" data-code-language="downlit">if_else(x &lt; 0, -x, x)
#&gt; [1] 3 2 1 0 1 2 3 NA</pre> #&gt; [1] 3 2 1 0 1 2 3 NA</pre>
</div> </div>
<p>So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code> like this:</p> <p>So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> like this:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- c(NA, 1, 2, NA) <pre data-type="programlisting" data-code-language="downlit">x1 &lt;- c(NA, 1, 2, NA)
y1 &lt;- c(3, NA, 4, 6) y1 &lt;- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1) if_else(is.na(x1), y1, x1)
#&gt; [1] 3 1 2 6</pre> #&gt; [1] 3 1 2 6</pre>
</div> </div>
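<p>As a quick check, and assuming dplyr is installed, the built-in <code>coalesce()</code> should agree with the hand-rolled version above on these same inputs:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

x1 &lt;- c(NA, 1, 2, NA)
y1 &lt;- c(3, NA, 4, 6)

# coalesce() takes the first non-missing value at each position
coalesce(x1, y1)
#&gt; [1] 3 1 2 6</pre>
</div>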
<p>You might have noticed a small infelicity in our labeling: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>:</p> <p>You might have noticed a small infelicity in our labeling: zero is neither positive nor negative. We could resolve this by adding an additional <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">if_else(x == 0, "0", if_else(x &lt; 0, "-ve", "+ve"), "???") <pre data-type="programlisting" data-code-language="downlit">if_else(x == 0, "0", if_else(x &lt; 0, "-ve", "+ve"), "???")
#&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre> #&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
</div> </div>
<p>This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>.</p> <p>This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">dplyr::case_when()</a></code>.</p>
</section> </section>
<section id="case_when" data-type="sect2"> <section id="case_when" data-type="sect2">
<h2> <h2>
<code>case_when()</code> <code>case_when()</code>
</h2> </h2>
<p>dplyr’s <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p> <p>dplyr’s <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is inspired by SQL’s <code>CASE</code> statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like <code>condition ~ output</code>. <code>condition</code> must be a logical vector; when it’s <code>TRUE</code>, <code>output</code> will be used.</p>
<p>This means we could recreate our previous nested <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> as follows:</p> <p>This means we could recreate our previous nested <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> as follows:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">case_when( <pre data-type="programlisting" data-code-language="downlit">case_when(
x == 0 ~ "0", x == 0 ~ "0",
@ -569,7 +569,7 @@ if_else(is.na(x1), y1, x1)
#&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre> #&gt; [1] "-ve" "-ve" "-ve" "0" "+ve" "+ve" "+ve" "???"</pre>
</div> </div>
<p>This is more code, but it’s also more explicit.</p> <p>This is more code, but it’s also more explicit.</p>
<p>To explain how <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> works, let’s explore some simpler cases. If none of the cases match, the output gets an <code>NA</code>:</p> <p>To explain how <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> works, let’s explore some simpler cases. If none of the cases match, the output gets an <code>NA</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">case_when( <pre data-type="programlisting" data-code-language="downlit">case_when(
x &lt; 0 ~ "-ve", x &lt; 0 ~ "-ve",
@ -594,7 +594,7 @@ if_else(is.na(x1), y1, x1)
) )
#&gt; [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre> #&gt; [1] NA NA NA NA "+ve" "+ve" "+ve" NA</pre>
</div> </div>
<p>Just like with <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> you can use variables on both sides of the <code>~</code> and you can mix and match variables as needed for your problem. For example, we could use <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> to provide some human readable labels for the arrival delay:</p> <p>Just like with <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> you can use variables on both sides of the <code>~</code> and you can mix and match variables as needed for your problem. For example, we could use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> to provide some human readable labels for the arrival delay:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate( mutate(
@ -625,7 +625,7 @@ if_else(is.na(x1), y1, x1)
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, <code>==</code>, <code>!=</code>, and <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="#chp-https://rdrr.io/r/base/any" data-type="xref">#chp-https://rdrr.io/r/base/any</a></code>, <code><a href="#chp-https://rdrr.io/r/base/all" data-type="xref">#chp-https://rdrr.io/r/base/all</a></code>, <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>, and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>. You also learned the powerful <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> functions, which allow you to return values depending on the value of a logical vector.</p> <p>The definition of a logical vector is simple because each value must be either <code>TRUE</code>, <code>FALSE</code>, or <code>NA</code>. But logical vectors provide a huge amount of power. 
In this chapter, you learned how to create logical vectors with <code>&gt;</code>, <code>&lt;</code>, <code>&lt;=</code>, <code>&gt;=</code>, <code>==</code>, <code>!=</code>, and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>, how to combine them with <code>!</code>, <code>&amp;</code>, and <code>|</code>, and how to summarize them with <code><a href="https://rdrr.io/r/base/any.html">any()</a></code>, <code><a href="https://rdrr.io/r/base/all.html">all()</a></code>, <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>. You also learned the powerful <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> functions, which allow you to return values depending on the value of a logical vector.</p>
<p>We’ll see logical vectors again in the following chapters. For example, in <a href="#chp-strings" data-type="xref">#chp-strings</a> you’ll learn about <code>str_detect(x, pattern)</code>, which returns a logical vector that’s <code>TRUE</code> for the elements of <code>x</code> that match the <code>pattern</code>, and in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.</p> <p>We’ll see logical vectors again in the following chapters. For example, in <a href="#chp-strings" data-type="xref">#chp-strings</a> you’ll learn about <code>str_detect(x, pattern)</code>, which returns a logical vector that’s <code>TRUE</code> for the elements of <code>x</code> that match the <code>pattern</code>, and in <a href="#chp-datetimes" data-type="xref">#chp-datetimes</a> you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.</p>


@ -42,7 +42,7 @@ Last observation carried forward</h2>
"Katherine Burke", 1, 4 "Katherine Burke", 1, 4
)</pre> )</pre>
</div> </div>
<p>You can fill in these missing values with <code><a href="#chp-https://tidyr.tidyverse.org/reference/fill" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/fill</a></code>. It works like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, taking a set of columns:</p> <p>You can fill in these missing values with <code><a href="https://tidyr.tidyverse.org/reference/fill.html">tidyr::fill()</a></code>. It works like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, taking a set of columns:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">treatment |&gt; <pre data-type="programlisting" data-code-language="downlit">treatment |&gt;
fill(everything()) fill(everything())
@ -60,14 +60,14 @@ Last observation carried forward</h2>
<section id="fixed-values" data-type="sect2"> <section id="fixed-values" data-type="sect2">
<h2> <h2>
Fixed values</h2> Fixed values</h2>
<p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code> to replace them:</p> <p>Sometimes missing values represent some fixed and known value, most commonly 0. You can use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">dplyr::coalesce()</a></code> to replace them:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, NA) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, NA)
coalesce(x, 0) coalesce(x, 0)
#&gt; [1] 1 4 5 7 0</pre> #&gt; [1] 1 4 5 7 0</pre>
</div> </div>
<p>Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p> <p>Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.</p>
<p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/na_if" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/na_if</a></code>:</p> <p>If possible, handle this when reading in the data, for example, by using the <code>na</code> argument to <code><a href="https://readr.tidyverse.org/reference/read_delim.html">readr::read_csv()</a></code>. If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use <code><a href="https://dplyr.tidyverse.org/reference/na_if.html">dplyr::na_if()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, -99) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 4, 5, 7, -99)
na_if(x, -99) na_if(x, -99)
@ -147,7 +147,7 @@ Pivoting</h2>
<section id="complete" data-type="sect2"> <section id="complete" data-type="sect2">
<h2> <h2>
Complete</h2> Complete</h2>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p> <p><code><a href="https://tidyr.tidyverse.org/reference/complete.html">tidyr::complete()</a></code> allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of <code>year</code> and <code>qtr</code> should exist in the <code>stocks</code> data:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt; <pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
complete(year, qtr) complete(year, qtr)
@ -162,7 +162,7 @@ Complete</h2>
#&gt; 6 2021 2 0.92 #&gt; 6 2021 2 0.92
#&gt; # … with 2 more rows</pre> #&gt; # … with 2 more rows</pre>
</div> </div>
<p>Typically, you’ll call <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p> <p>Typically, you’ll call <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the <code>stocks</code> dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for <code>year</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">stocks |&gt; <pre data-type="programlisting" data-code-language="downlit">stocks |&gt;
complete(year = 2019:2021, qtr) complete(year = 2019:2021, qtr)
@ -178,14 +178,14 @@ Complete</h2>
#&gt; # … with 6 more rows</pre> #&gt; # … with 6 more rows</pre>
</div> </div>
<p>If the range of a variable is correct, but not all values are present, you could use <code>full_seq(x, 1)</code> to generate all values from <code>min(x)</code> to <code>max(x)</code> spaced out by 1.</p> <p>If the range of a variable is correct, but not all values are present, you could use <code>full_seq(x, 1)</code> to generate all values from <code>min(x)</code> to <code>max(x)</code> spaced out by 1.</p>
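<p>A minimal sketch of <code>full_seq()</code> on its own, assuming tidyr is installed. The <code>year</code> vector here is made up for illustration and is not the <code>stocks</code> data:</p>

```r
library(tidyr)

# Hypothetical years with gaps: 2017 and 2019 never appear in the data.
year <- c(2015, 2016, 2018, 2020)

# full_seq() fills in every value from min(year) to max(year) in steps of 1.
full_seq(year, 1)
#> [1] 2015 2016 2017 2018 2019 2020
```

<p>Inside <code>complete()</code>, the same call can be supplied in place of a column name, for example <code>complete(year = full_seq(year, 1), qtr)</code>.</p>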
<p>In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>.</p> <p>In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code> does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">dplyr::full_join()</a></code>.</p>
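<p>The manual approach might look like the sketch below. It assumes dplyr and tidyr are installed; the <code>observed</code> data frame and its values are hypothetical stand-ins, and <code>expand_grid()</code> is just one of several ways to build the rows that should exist.</p>

```r
library(dplyr)
library(tidyr)

# Hypothetical observed data: 2021 Q2 is implicitly missing.
observed <- tibble(
  year  = c(2020, 2020, 2021),
  qtr   = c(1, 2, 1),
  price = c(1.88, 0.59, 0.35)
)

# Step 1: build every row that should exist.
expected <- expand_grid(year = c(2020, 2021), qtr = c(1, 2))

# Step 2: join it to the observed data; unmatched rows get NA for price.
completed <- full_join(expected, observed, by = c("year", "qtr"))
completed
```

<p>The result has four rows, and the row for 2021 Q2 carries an explicit <code>NA</code> in <code>price</code>.</p>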
</section> </section>
<section id="joins" data-type="sect2"> <section id="joins" data-type="sect2">
<h2> <h2>
Joins</h2> Joins</h2>
<p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p> <p>This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in <a href="#chp-joins" data-type="xref">#chp-joins</a>, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.</p>
<p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter-joins</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p> <p><code>dplyr::anti_join(x, y)</code> is a particularly useful tool here because it selects only the rows in <code>x</code> that don’t have a match in <code>y</code>. For example, we can use two <code><a href="https://dplyr.tidyverse.org/reference/filter-joins.html">anti_join()</a></code>s to reveal that we’re missing information for four airports and 722 planes mentioned in <code>flights</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(nycflights13) <pre data-type="programlisting" data-code-language="downlit">library(nycflights13)
@ -236,7 +236,7 @@ Factors and empty groups</h1>
age = c(34L, 88L, 75L, 47L, 56L), age = c(34L, 88L, 75L, 47L, 56L),
)</pre> )</pre>
</div> </div>
<p>And we want to count the number of smokers with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>:</p> <p>And we want to count the number of smokers with <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker) <pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker)
#&gt; # A tibble: 1 × 2 #&gt; # A tibble: 1 × 2
@ -244,7 +244,7 @@ Factors and empty groups</h1>
#&gt; &lt;fct&gt; &lt;int&gt; #&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 no 5</pre> #&gt; 1 no 5</pre>
</div> </div>
<p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p> <p>This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to keep all the groups, even those not seen in the data, by using <code>.drop = FALSE</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker, .drop = FALSE) <pre data-type="programlisting" data-code-language="downlit">health |&gt; count(smoker, .drop = FALSE)
#&gt; # A tibble: 2 × 2 #&gt; # A tibble: 2 × 2
@ -273,7 +273,7 @@ ggplot(health, aes(smoker)) +
</div> </div>
</div> </div>
</div> </div>
<p>The same problem comes up more generally with <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p> <p>The same problem comes up more generally with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">dplyr::group_by()</a></code>. And again you can use <code>.drop = FALSE</code> to preserve all factor levels:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; <pre data-type="programlisting" data-code-language="downlit">health |&gt;
group_by(smoker, .drop = FALSE) |&gt; group_by(smoker, .drop = FALSE) |&gt;
@ -309,8 +309,8 @@ x2 &lt;- numeric()
length(x2) length(x2)
#&gt; [1] 0</pre> #&gt; [1] 0</pre>
</div> </div>
<p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p> <p>All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see <code>mean(age)</code> returning <code>NaN</code> because <code>mean(age)</code> = <code>sum(age)/length(age)</code> which here is 0/0. <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> return -Inf and Inf for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data<span data-type="footnote">In other words, <code>min(c(x, y))</code> is always equal to <code>min(min(x), min(y))</code>.</span>.</p>
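<p>The behaviour described above can be checked with base R alone. In this sketch <code>new_data</code> is a made-up vector, and <code>suppressWarnings()</code> is used only because base R warns when <code>min()</code> or <code>max()</code> receives an empty vector:</p>

```r
# A zero-length vector and some hypothetical new data arriving later.
age <- numeric()
new_data <- c(34, 88, 75)

mean(age)                                # 0/0, so NaN
empty_min <- suppressWarnings(min(age))  # Inf (base R also warns)
empty_max <- suppressWarnings(max(age))  # -Inf

# Combining the stored summaries with the new data gives the same answer
# as recomputing from scratch.
min(empty_min, min(new_data)) == min(c(age, new_data))
#> [1] TRUE
max(empty_max, max(new_data)) == max(c(age, new_data))
#> [1] TRUE
```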
<p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="#chp-https://tidyr.tidyverse.org/reference/complete" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/complete</a></code>.</p> <p>Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with <code><a href="https://tidyr.tidyverse.org/reference/complete.html">complete()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">health |&gt; <pre data-type="programlisting" data-code-language="downlit">health |&gt;
group_by(smoker) |&gt; group_by(smoker) |&gt;


@ -12,12 +12,12 @@
<h1> <h1>
Introduction</h1> Introduction</h1>
<p>Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.</p> <p>Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.</p>
<p>We’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>. Then we’ll dive into various numeric transformations that pair well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> and show you how they can also be used with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>.</p> <p>We’ll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we’ll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
Prerequisites</h2> Prerequisites</h2>
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>. Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">#chp-https://rdrr.io/r/base/c</a></code> and <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code>.</p> <p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse) <pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
library(nycflights13)</pre> library(nycflights13)</pre>
@ -29,13 +29,13 @@ library(nycflights13)</pre>
<h1> <h1>
Making numbers</h1> Making numbers</h1>
<p>In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.</p> <p>In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or because something has gone wrong in your data import process.</p>
<p>readr provides two useful functions for parsing strings into numbers: <code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code>. Use <code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> when you have numbers that have been written as strings:</p> <p>readr provides two useful functions for parsing strings into numbers: <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code>. Use <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">parse_double()</a></code> when you have numbers that have been written as strings:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("1.2", "5.6", "1e3") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("1.2", "5.6", "1e3")
parse_double(x) parse_double(x)
#&gt; [1] 1.2 5.6 1000.0</pre> #&gt; [1] 1.2 5.6 1000.0</pre>
</div> </div>
<p>Use <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:</p> <p>Use <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("$1,234", "USD 3,513", "59%") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("$1,234", "USD 3,513", "59%")
parse_number(x) parse_number(x)
@ -46,7 +46,7 @@ parse_number(x)
<section id="counts" data-type="sect1"> <section id="counts" data-type="sect1">
<h1> <h1>
Counts</h1> Counts</h1>
<p>It’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>. This function is great for quick exploration and checks during analysis:</p> <p>It’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. This function is great for quick exploration and checks during analysis:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(dest) <pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(dest)
#&gt; # A tibble: 105 × 2 #&gt; # A tibble: 105 × 2
@ -60,7 +60,7 @@ Counts</h1>
#&gt; 6 AUS 2439 #&gt; 6 AUS 2439
#&gt; # … with 99 more rows</pre> #&gt; # … with 99 more rows</pre>
</div> </div>
<p>(Despite the advice in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, we usually put <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)</p> <p>(Despite the advice in <a href="#chp-workflow-style" data-type="xref">#chp-workflow-style</a>, we usually put <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)</p>
<p>If you want to see the most common values add <code>sort = TRUE</code>:</p> <p>If you want to see the most common values add <code>sort = TRUE</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(dest, sort = TRUE) <pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(dest, sort = TRUE)
@ -76,7 +76,7 @@ Counts</h1>
#&gt; # … with 99 more rows</pre> #&gt; # … with 99 more rows</pre>
</div> </div>
<p>And remember that if you want to see all the values, you can use <code>|&gt; View()</code> or <code>|&gt; print(n = Inf)</code>.</p> <p>And remember that if you want to see all the values, you can use <code>|&gt; View()</code> or <code>|&gt; print(n = Inf)</code>.</p>
<p>You can perform the same computation “by hand” with <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code>. This is useful because it allows you to compute other summaries at the same time:</p> <p>You can perform the same computation “by hand” with <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code>. This is useful because it allows you to compute other summaries at the same time:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt; group_by(dest) |&gt;
@ -95,14 +95,14 @@ Counts</h1>
#&gt; 6 AUS 2439 6.02 #&gt; 6 AUS 2439 6.02
#&gt; # … with 99 more rows</pre> #&gt; # … with 99 more rows</pre>
</div> </div>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">n() <pre data-type="programlisting" data-code-language="downlit">n()
#&gt; Error in `n()`: #&gt; Error in `n()`:
#&gt; ! Must only be used inside data-masking verbs like `mutate()`, #&gt; ! Must only be used inside data-masking verbs like `mutate()`,
#&gt; `filter()`, and `group_by()`.</pre> #&gt; `filter()`, and `group_by()`.</pre>
</div> </div>
<p>There are a couple of variants of <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> that you might find useful:</p> <p>There are a couple of variants of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> that you might find useful:</p>
<ul><li> <ul><li>
<p><code>n_distinct(x)</code> counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:</p> <p><code>n_distinct(x)</code> counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:</p>
<div class="cell"> <div class="cell">
@ -141,7 +141,7 @@ Counts</h1>
#&gt; 6 N104UW 25157 #&gt; 6 N104UW 25157
#&gt; # … with 4,038 more rows</pre> #&gt; # … with 4,038 more rows</pre>
</div> </div>
<p>Weighted counts are a common problem so <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> has a <code>wt</code> argument that does the same thing:</p> <p>Weighted counts are a common problem so <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> has a <code>wt</code> argument that does the same thing:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(tailnum, wt = distance) <pre data-type="programlisting" data-code-language="downlit">flights |&gt; count(tailnum, wt = distance)
#&gt; # A tibble: 4,044 × 2 #&gt; # A tibble: 4,044 × 2
@ -157,7 +157,7 @@ Counts</h1>
</div> </div>
</li> </li>
<li> <li>
<p>You can count missing values by combining <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code> and <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>. In the <code>flights</code> dataset this represents flights that are cancelled:</p> <p>You can count missing values by combining <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>. In the <code>flights</code> dataset this represents flights that are cancelled:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt; group_by(dest) |&gt;
@ -178,8 +178,8 @@ Counts</h1>
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li>How can you use <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to count the number of rows with a missing value for a given variable?</li> <ol type="1"><li>How can you use <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to count the number of rows with a missing value for a given variable?</li>
<li>Expand the following calls to <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to instead use <code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code>: <li>Expand the following calls to <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to instead use <code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code>:
<ol type="1"><li><p><code>flights |&gt; count(dest, sort = TRUE)</code></p></li> <ol type="1"><li><p><code>flights |&gt; count(dest, sort = TRUE)</code></p></li>
<li><p><code>flights |&gt; count(tailnum, wt = distance)</code></p></li> <li><p><code>flights |&gt; count(tailnum, wt = distance)</code></p></li>
</ol></li> </ol></li>
@ -189,7 +189,7 @@ Exercises</h2>
<section id="numeric-transformations" data-type="sect1"> <section id="numeric-transformations" data-type="sect1">
<h1> <h1>
Numeric transformations</h1> Numeric transformations</h1>
<p>Transformation functions work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It’s impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don’t list them here because they’re rarely needed for data science.</p> <p>Transformation functions work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It’s impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don’t list them here because they’re rarely needed for data science.</p>
<section id="sec-recycling" data-type="sect2"> <section id="sec-recycling" data-type="sect2">
<h2> <h2>
@ -232,13 +232,13 @@ x * c(1, 2, 3)
#&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> #&gt; # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre>
</div> </div>
<p>The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because <code>flights</code> has an even number of rows.</p> <p>The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because <code>flights</code> has an even number of rows.</p>
<p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>.</p> <p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
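<p>As a minimal base R sketch of this silent failure (the vector <code>month</code> below is made up for illustration and is not taken from <code>flights</code>), compare <code>==</code> with <code>%in%</code>, which tests each element against the whole set instead of recycling it position by position:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Hypothetical data: four month numbers, so the length is a multiple of 2
month &lt;- c(1, 2, 2, 1)

# == recycles c(1, 2) to c(1, 2, 1, 2) and compares position by position
month == c(1, 2)
#&gt; [1]  TRUE  TRUE FALSE FALSE

# %in% asks whether each element appears anywhere in the set
month %in% c(1, 2)
#&gt; [1] TRUE TRUE TRUE TRUE</pre>
</div>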
</section> </section>
<section id="minimum-and-maximum" data-type="sect2"> <section id="minimum-and-maximum" data-type="sect2">
<h2> <h2>
Minimum and maximum</h2> Minimum and maximum</h2>
<p>The arithmetic functions work with pairs of variables. Two closely related functions are <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code>, which when given two or more variables will return the smallest or largest value in each row:</p> <p>The arithmetic functions work with pairs of variables. Two closely related functions are <code><a href="https://rdrr.io/r/base/Extremes.html">pmin()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">pmax()</a></code>, which when given two or more variables will return the smallest or largest value in each row:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~x, ~y, ~x, ~y,
@ -259,7 +259,7 @@ df |&gt;
#&gt; 2 5 2 2 5 #&gt; 2 5 2 2 5
#&gt; 3 7 NA 7 7</pre> #&gt; 3 7 NA 7 7</pre>
</div> </div>
<p>Note that these are different to the summary functions <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:</p> <p>Note that these are different to the summary functions <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
mutate( mutate(
@ -353,8 +353,8 @@ money &lt;- tibble(
</div> </div>
</div> </div>
<p>This is a straight line because a little algebra reveals that <code>log(money) = log(starting) + n * log(interest)</code>, which matches the pattern for a line, <code>y = m * x + b</code>. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there’s underlying exponential growth.</p> <p>This is a straight line because a little algebra reveals that <code>log(money) = log(starting) + n * log(interest)</code>, which matches the pattern for a line, <code>y = m * x + b</code>. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there’s underlying exponential growth.</p>
<p>If you’re log-transforming your data with dplyr you have a choice of three logarithms provided by base R: <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> (the natural log, base e), <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> (base 2), and <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> (base 10). We recommend using <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> or <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code>. <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> is easy to back-transform because (e.g.) 3 is 10^3 = 1000.</p> <p>If you’re log-transforming your data with dplyr you have a choice of three logarithms provided by base R: <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> (the natural log, base e), <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> (base 2), and <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> (base 10). We recommend using <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code>. 
<code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> is easy to back-transform because (e.g.) 3 is 10^3 = 1000.</p>
<p>The inverse of <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> is <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code>; to compute the inverse of <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> or <code><a href="#chp-https://rdrr.io/r/base/Log" data-type="xref">#chp-https://rdrr.io/r/base/Log</a></code> you’ll need to use <code>2^</code> or <code>10^</code>.</p> <p>The inverse of <code><a href="https://rdrr.io/r/base/Log.html">log()</a></code> is <code><a href="https://rdrr.io/r/base/Log.html">exp()</a></code>; to compute the inverse of <code><a href="https://rdrr.io/r/base/Log.html">log2()</a></code> or <code><a href="https://rdrr.io/r/base/Log.html">log10()</a></code> you’ll need to use <code>2^</code> or <code>10^</code>.</p>
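<p>A quick round trip can confirm these inverses. This is only a sketch, and the value of <code>x</code> is an arbitrary choice; the comparisons use <code>all.equal()</code> because floating point results may differ from the input by a tiny rounding error:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 1000  # arbitrary positive value chosen for illustration

# Each pair should recover x, up to floating point tolerance
all.equal(exp(log(x)), x)
#&gt; [1] TRUE
all.equal(2^log2(x), x)
#&gt; [1] TRUE
all.equal(10^log10(x), x)
#&gt; [1] TRUE</pre>
</div>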
</section> </section>
<section id="sec-rounding" data-type="sect2"> <section id="sec-rounding" data-type="sect2">
@ -376,13 +376,13 @@ round(123.456, -1) # round to nearest ten
round(123.456, -2) # round to nearest hundred round(123.456, -2) # round to nearest hundred
#&gt; [1] 100</pre> #&gt; [1] 100</pre>
</div> </div>
<p>There’s one weirdness with <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">#chp-https://rdrr.io/r/base/Round</a></code> that seems surprising at first glance:</p> <p>There’s one weirdness with <code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> that seems surprising at first glance:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">round(c(1.5, 2.5)) <pre data-type="programlisting" data-code-language="downlit">round(c(1.5, 2.5))
#&gt; [1] 2 2</pre> #&gt; [1] 2 2</pre>
</div> </div>
<p><code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">#chp-https://rdrr.io/r/base/Round</a></code> uses what’s known as “round half to even” or Banker’s rounding: if a number is halfway between two integers, it will be rounded to the <strong>even</strong> integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.</p> <p><code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> uses what’s known as “round half to even” or Banker’s rounding: if a number is halfway between two integers, it will be rounded to the <strong>even</strong> integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.</p>
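<p>To see the pattern across a longer run of halves, here is a small sketch. The inputs are chosen because they are exactly representable in floating point, so each result lands on the nearest even integer:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Halves alternate between rounding down and rounding up
round(c(0.5, 1.5, 2.5, 3.5, 4.5))
#&gt; [1] 0 2 2 4 4</pre>
</div>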
<p><code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">#chp-https://rdrr.io/r/base/Round</a></code> is paired with <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">#chp-https://rdrr.io/r/base/Round</a></code> which always rounds down and <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">#chp-https://rdrr.io/r/base/Round</a></code> which always rounds up:</p> <p><code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> is paired with <code><a href="https://rdrr.io/r/base/Round.html">floor()</a></code> which always rounds down and <code><a href="https://rdrr.io/r/base/Round.html">ceiling()</a></code> which always rounds up:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 123.456 <pre data-type="programlisting" data-code-language="downlit">x &lt;- 123.456
@ -400,7 +400,7 @@ floor(x / 0.01) * 0.01
ceiling(x / 0.01) * 0.01 ceiling(x / 0.01) * 0.01
#&gt; [1] 123.46</pre> #&gt; [1] 123.46</pre>
</div> </div>
<p>You can use the same technique if you want to <code><a href="#chp-https://rdrr.io/r/base/Round" data-type="xref">#chp-https://rdrr.io/r/base/Round</a></code> to a multiple of some other number:</p> <p>You can use the same technique if you want to <code><a href="https://rdrr.io/r/base/Round.html">round()</a></code> to a multiple of some other number:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Round to nearest multiple of 4 <pre data-type="programlisting" data-code-language="downlit"># Round to nearest multiple of 4
round(x / 4) * 4 round(x / 4) * 4
@ -415,7 +415,7 @@ round(x / 0.25) * 0.25
<section id="cutting-numbers-into-ranges" data-type="sect2"> <section id="cutting-numbers-into-ranges" data-type="sect2">
<h2> <h2>
Cutting numbers into ranges</h2> Cutting numbers into ranges</h2>
<p>Use <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">#chp-https://rdrr.io/r/base/cut</a></code><span data-type="footnote">ggplot2 provides some helpers for common cases in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>, and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>. ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.</span> to break up a numeric vector into discrete buckets:</p> <p>Use <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code><span data-type="footnote">ggplot2 provides some helpers for common cases in <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_interval()</a></code>, <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>, and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>. ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed.</span> to break up a numeric vector into discrete buckets:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 5, 10, 15, 20) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 5, 10, 15, 20)
cut(x, breaks = c(0, 5, 10, 15, 20)) cut(x, breaks = c(0, 5, 10, 15, 20))
@ -450,13 +450,13 @@ cut(y, breaks = c(0, 5, 10, 15, 20))
<section id="cumulative-and-rolling-aggregates" data-type="sect2"> <section id="cumulative-and-rolling-aggregates" data-type="sect2">
<h2> <h2>
Cumulative and rolling aggregates</h2> Cumulative and rolling aggregates</h2>
<p>Base R provides <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">#chp-https://rdrr.io/r/base/cumsum</a></code>, <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">#chp-https://rdrr.io/r/base/cumsum</a></code>, <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">#chp-https://rdrr.io/r/base/cumsum</a></code>, <code><a href="#chp-https://rdrr.io/r/base/cumsum" data-type="xref">#chp-https://rdrr.io/r/base/cumsum</a></code> for running, or cumulative, sums, products, mins and maxes. dplyr provides <code><a href="#chp-https://dplyr.tidyverse.org/reference/cumall" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/cumall</a></code> for cumulative means. Cumulative sums tend to come up the most in practice:</p> <p>Base R provides <code><a href="https://rdrr.io/r/base/cumsum.html">cumsum()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cumprod()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummin()</a></code>, <code><a href="https://rdrr.io/r/base/cumsum.html">cummax()</a></code> for running, or cumulative, sums, products, mins and maxes. dplyr provides <code><a href="https://dplyr.tidyverse.org/reference/cumall.html">cummean()</a></code> for cumulative means. Cumulative sums tend to come up the most in practice:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 1:10 <pre data-type="programlisting" data-code-language="downlit">x &lt;- 1:10
cumsum(x) cumsum(x)
#&gt; [1] 1 3 6 10 15 21 28 36 45 55</pre> #&gt; [1] 1 3 6 10 15 21 28 36 45 55</pre>
</div> </div>
<p>If you need more complex rolling or sliding aggregates, try the <a href="#chp-https://davisvaughan.github.io/slider/" data-type="xref">#chp-https://davisvaughan.github.io/slider/</a> package by Davis Vaughan. The following example illustrates some of its features.</p> <p>If you need more complex rolling or sliding aggregates, try the <a href="https://davisvaughan.github.io/slider/">slider</a> package by Davis Vaughan. The following example illustrates some of its features.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(slider) <pre data-type="programlisting" data-code-language="downlit">library(slider)
@ -505,7 +505,7 @@ General transformations</h1>
<section id="ranks" data-type="sect2"> <section id="ranks" data-type="sect2">
<h2> <h2>
Ranks</h2> Ranks</h2>
<p>dplyr provides a number of ranking functions inspired by SQL, but you should always start with <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/row_number</a></code>. It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.</p> <p>dplyr provides a number of ranking functions inspired by SQL, but you should always start with <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::min_rank()</a></code>. It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 2, 3, 4, NA) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 2, 3, 4, NA)
min_rank(x) min_rank(x)
@ -516,7 +516,7 @@ min_rank(x)
<pre data-type="programlisting" data-code-language="downlit">min_rank(desc(x)) <pre data-type="programlisting" data-code-language="downlit">min_rank(desc(x))
#&gt; [1] 5 3 3 2 1 NA</pre> #&gt; [1] 5 3 3 2 1 NA</pre>
</div> </div>
<p>If <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/row_number</a></code> doesn’t do what you need, look at the variants <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/row_number</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/row_number</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/percent_rank" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/percent_rank</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/percent_rank" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/percent_rank</a></code>. See the documentation for details.</p> <p>If <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code> doesn’t do what you need, look at the variants <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::row_number()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">dplyr::dense_rank()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/percent_rank.html">dplyr::percent_rank()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/percent_rank.html">dplyr::cume_dist()</a></code>. See the documentation for details.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = x) <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = x)
df |&gt; df |&gt;
@ -536,8 +536,8 @@ df |&gt;
#&gt; 5 4 5 4 1 1 #&gt; 5 4 5 4 1 1
#&gt; 6 NA NA NA NA NA</pre> #&gt; 6 NA NA NA NA NA</pre>
</div> </div>
<p>You can achieve many of the same results by picking the appropriate <code>ties.method</code> argument to base R’s <code><a href="#chp-https://rdrr.io/r/base/rank" data-type="xref">#chp-https://rdrr.io/r/base/rank</a></code>; you’ll probably also want to set <code>na.last = "keep"</code> to keep <code>NA</code>s as <code>NA</code>.</p> <p>You can achieve many of the same results by picking the appropriate <code>ties.method</code> argument to base R’s <code><a href="https://rdrr.io/r/base/rank.html">rank()</a></code>; you’ll probably also want to set <code>na.last = "keep"</code> to keep <code>NA</code>s as <code>NA</code>.</p>
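<p>For instance, the following sketch (which re-creates the vector <code>x</code> from the <code>min_rank()</code> example so that it stands alone) should give the same ranks using base R only:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(1, 2, 2, 3, 4, NA)

# ties.method = "min" gives tied values the smallest rank they share;
# na.last = "keep" leaves missing values as NA instead of ranking them
rank(x, ties.method = "min", na.last = "keep")
#&gt; [1]  1  2  2  4  5 NA</pre>
</div>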
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/row_number</a></code> can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with <code>%%</code> or <code>%/%</code> this can be a useful tool for dividing data into similarly sized groups:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/row_number.html">row_number()</a></code> can also be used without any arguments when inside a dplyr verb. In this case, it’ll give the number of the “current” row. When combined with <code>%%</code> or <code>%/%</code> this can be a useful tool for dividing data into similarly sized groups:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = runif(10)) <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = runif(10))
@ -563,7 +563,7 @@ df |&gt;
<section id="offsets" data-type="sect2"> <section id="offsets" data-type="sect2">
<h2> <h2>
Offsets</h2> Offsets</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code> allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with <code>NA</code>s at the start or end:</p> <p><code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">dplyr::lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">dplyr::lag()</a></code> allow you to refer to the values just before or just after the “current” value. They return a vector of the same length as the input, padded with <code>NA</code>s at the start or end:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(2, 5, 11, 11, 19, 35) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(2, 5, 11, 11, 19, 35)
lag(x) lag(x)
@ -591,13 +591,13 @@ lead(x)
<section id="exercises-2" data-type="sect2"> <section id="exercises-2" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="#chp-https://dplyr.tidyverse.org/reference/row_number" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/row_number</a></code>.</p></li> <ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code>.</p></li>
<li><p>Which plane (<code>tailnum</code>) has the worst on-time record?</p></li> <li><p>Which plane (<code>tailnum</code>) has the worst on-time record?</p></li>
<li><p>What time of day should you fly if you want to avoid delays as much as possible?</p></li> <li><p>What time of day should you fly if you want to avoid delays as much as possible?</p></li>
<li><p>What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number() &lt; 4)</code> do? What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number(dep_delay) &lt; 4)</code> do?</p></li> <li><p>What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number() &lt; 4)</code> do? What does <code>flights |&gt; group_by(dest) |&gt; filter(row_number(dep_delay) &lt; 4)</code> do?</p></li>
<li><p>For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.</p></li> <li><p>For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.</p></li>
<li> <li>
<p>Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code>, explore how the average flight delay for an hour is related to the average delay for the previous hour.</p> <p>Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>, explore how the average flight delay for an hour is related to the average delay for the previous hour.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(hour = dep_time %/% 100) |&gt; mutate(hour = dep_time %/% 100) |&gt;
@ -623,7 +623,7 @@ Numeric summaries</h1>
<section id="center" data-type="sect2"> <section id="center" data-type="sect2">
<h2> <h2>
Center</h2> Center</h2>
<p>So far, we've mostly used <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code>, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.</p> <p>So far, we've mostly used <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values are above it and 50% are below it. Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.</p>
<p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean vs the median when looking at the hourly departure delay. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.</p> <p><a href="#fig-mean-vs-median" data-type="xref">#fig-mean-vs-median</a> compares the mean vs the median when looking at the hourly departure delay. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
@ -646,13 +646,13 @@ Center</h2>
</figure> </figure>
</div> </div>
</div> </div>
<p>You might also wonder about the <strong>mode</strong>, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn't work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there's no mode function included in base R<span data-type="footnote">The <code><a href="#chp-https://rdrr.io/r/base/mode" data-type="xref">#chp-https://rdrr.io/r/base/mode</a></code> function does something quite different!</span>.</p> <p>You might also wonder about the <strong>mode</strong>, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn't work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there's no mode function included in base R<span data-type="footnote">The <code><a href="https://rdrr.io/r/base/mode.html">mode()</a></code> function does something quite different!</span>.</p>
</section> </section>
<section id="sec-min-max-summary" data-type="sect2"> <section id="sec-min-max-summary" data-type="sect2">
<h2> <h2>
Minimum, maximum, and quantiles</h2> Minimum, maximum, and quantiles</h2>
<p>What if you're interested in locations other than the center? <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> and <code><a href="#chp-https://rdrr.io/r/base/Extremes" data-type="xref">#chp-https://rdrr.io/r/base/Extremes</a></code> will give you the smallest and largest values. Another powerful tool is <code><a href="#chp-https://rdrr.io/r/stats/quantile" data-type="xref">#chp-https://rdrr.io/r/stats/quantile</a></code> which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find a value that's greater than 95% of the values.</p> <p>What if you're interested in locations other than the center? <code><a href="https://rdrr.io/r/base/Extremes.html">min()</a></code> and <code><a href="https://rdrr.io/r/base/Extremes.html">max()</a></code> will give you the smallest and largest values. Another powerful tool is <code><a href="https://rdrr.io/r/stats/quantile.html">quantile()</a></code> which is a generalization of the median: <code>quantile(x, 0.25)</code> will find the value of <code>x</code> that is greater than 25% of the values, <code>quantile(x, 0.5)</code> is equivalent to the median, and <code>quantile(x, 0.95)</code> will find a value that's greater than 95% of the values.</p>
<p>For the <code>flights</code> data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.</p> <p>For the <code>flights</code> data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
@ -678,8 +678,8 @@ Minimum, maximum, and quantiles</h2>
<section id="spread" data-type="sect2"> <section id="spread" data-type="sect2">
<h2> <h2>
Spread</h2> Spread</h2>
<p>Sometimes you're not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, <code>sd(x)</code>, and the inter-quartile range, <code><a href="#chp-https://rdrr.io/r/stats/IQR" data-type="xref">#chp-https://rdrr.io/r/stats/IQR</a></code>. We won't explain <code><a href="#chp-https://rdrr.io/r/stats/sd" data-type="xref">#chp-https://rdrr.io/r/stats/sd</a></code> here since you're probably already familiar with it, but <code><a href="#chp-https://rdrr.io/r/stats/IQR" data-type="xref">#chp-https://rdrr.io/r/stats/IQR</a></code> might be new: it's <code>quantile(x, 0.75) - quantile(x, 0.25)</code> and gives you the range that contains the middle 50% of the data.</p> <p>Sometimes you're not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, <code>sd(x)</code>, and the inter-quartile range, <code><a href="https://rdrr.io/r/stats/IQR.html">IQR()</a></code>. We won't explain <code><a href="https://rdrr.io/r/stats/sd.html">sd()</a></code> here since you're probably already familiar with it, but <code><a href="https://rdrr.io/r/stats/IQR.html">IQR()</a></code> might be new: it's <code>quantile(x, 0.75) - quantile(x, 0.25)</code> and gives you the range that contains the middle 50% of the data.</p>
<p>We can use this to reveal a small oddity in the <code>flights</code> data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it look like one airport, <a href="#chp-https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport" data-type="xref">#chp-https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport</a>, might have moved.</p> <p>We can use this to reveal a small oddity in the <code>flights</code> data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it look like one airport, <a href="https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport">EGE</a>, might have moved.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(origin, dest) |&gt; group_by(origin, dest) |&gt;
@ -774,7 +774,7 @@ Positions</h2>
#&gt; # … with 359 more rows</pre> #&gt; # … with 359 more rows</pre>
</div> </div>
<p>(These functions currently lack an <code>na.rm</code> argument but will hopefully be fixed by the time you read this book: <a href="https://github.com/tidyverse/dplyr/issues/6242" class="uri">https://github.com/tidyverse/dplyr/issues/6242</a>).</p> <p>(These functions currently lack an <code>na.rm</code> argument but will hopefully be fixed by the time you read this book: <a href="https://github.com/tidyverse/dplyr/issues/6242" class="uri">https://github.com/tidyverse/dplyr/issues/6242</a>).</p>
<p>If you're familiar with <code>[</code>, you might wonder if you ever need these functions. There are two main reasons: the <code>default</code> argument and the <code>order_by</code> argument. <code>default</code> allows you to set a default value that's used if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group. <code>order_by</code> lets you locally override the existing ordering of the rows, so you can get the element at the position in the ordering by <code><a href="#chp-https://dplyr.tidyverse.org/reference/order_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/order_by</a></code>.</p> <p>If you're familiar with <code>[</code>, you might wonder if you ever need these functions. There are two main reasons: the <code>default</code> argument and the <code>order_by</code> argument. <code>default</code> allows you to set a default value that's used if the requested position doesn't exist, e.g. you're trying to get the 3rd element from a two element group. <code>order_by</code> lets you locally override the existing ordering of the rows, so you can get the element at the position in the ordering by <code><a href="https://dplyr.tidyverse.org/reference/order_by.html">order_by()</a></code>.</p>
<p>Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:</p> <p>Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
@ -802,7 +802,7 @@ Positions</h2>
<h2> <h2>
With <code>mutate()</code> With <code>mutate()</code>
</h2> </h2>
<p>As the names suggest, the summary functions are typically paired with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, particularly when you want to do some sort of group standardization. For example:</p> <p>As the names suggest, the summary functions are typically paired with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, particularly when you want to do some sort of group standardization. For example:</p>
<ul><li> <ul><li>
<code>x / sum(x)</code> calculates the proportion of a total.</li> <code>x / sum(x)</code> calculates the proportion of a total.</li>
<li> <li>

@ -6,7 +6,7 @@
<li><p>The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.</p></li> <li><p>The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.</p></li>
<li><p>We've added new chapters on column-wise and row-wise operations.</p></li> <li><p>We've added new chapters on column-wise and row-wise operations.</p></li>
<li><p>We've added a new set of chapters on import that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and scraping data from the web.</p></li> <li><p>We've added a new set of chapters on import that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and scraping data from the web.</p></li>
<li><p>The modeling part has been removed. For modeling, we recommend using packages from <a href="#chp-https://www.tidymodels.org/" data-type="xref">#chp-https://www.tidymodels.org/</a> and reading <a href="#chp-https://www.tmwr.org/" data-type="xref">#chp-https://www.tmwr.org/</a> by Max Kuhn and Julia Silge to learn more about them.</p></li> <li><p>The modeling part has been removed. For modeling, we recommend using packages from <a href="https://www.tidymodels.org/">tidymodels</a> and reading <a href="https://www.tmwr.org/">Tidy Modeling with R</a> by Max Kuhn and Julia Silge to learn more about them.</p></li>
<li><p>We've switched from the magrittr pipe to the base pipe.</p></li> <li><p>We've switched from the magrittr pipe to the base pipe.</p></li>
</ul></section> </ul></section>

@ -13,6 +13,6 @@
<h1>Learning more</h1> <h1>Learning more</h1>
<p>The goal of these chapters is to teach you the minimum about programming that you need to practice data science. Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills. Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.</p> <p>The goal of these chapters is to teach you the minimum about programming that you need to practice data science. Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills. Learning more about programming is a long-term investment: it won't pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.</p>
<p>To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:</p> <p>To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:</p>
<ul><li><p><a href="#chp-https://rstudio-education.github.io/hopr/" data-type="xref">#chp-https://rstudio-education.github.io/hopr/</a>, by Garrett Grolemund. This is an introduction to R as a programming language and is a great place to start if R is your first programming language. It covers similar material to these chapters, but with a different style and different motivation examples (based in the casino). It's a useful complement if you find that these four chapters go by too quickly.</p></li> <ul><li><p><a href="https://rstudio-education.github.io/hopr/"><em>Hands on Programming with R</em></a>, by Garrett Grolemund. This is an introduction to R as a programming language and is a great place to start if R is your first programming language. It covers similar material to these chapters, but with a different style and different motivation examples (based in the casino). It's a useful complement if you find that these four chapters go by too quickly.</p></li>
<li><p><a href="#chp-https://adv-r.hadley.nz/" data-type="xref">#chp-https://adv-r.hadley.nz/</a> by Hadley Wickham. This dives into the details of R the programming language. This is a great place to start if you have existing programming experience. It's also a great next step once you've internalized the ideas in these chapters.</p></li> <li><p><a href="https://adv-r.hadley.nz/"><em>Advanced R</em></a> by Hadley Wickham. This dives into the details of R the programming language. This is a great place to start if you have existing programming experience. It's also a great next step once you've internalized the ideas in these chapters.</p></li>
</ul></section></div> </ul></section></div>

@ -284,10 +284,10 @@ Other formats</h1>
<h1> <h1>
Learning more</h1> Learning more</h1>
<p>To learn more about effective communication in these different formats we recommend the following resources:</p> <p>To learn more about effective communication in these different formats we recommend the following resources:</p>
<ul><li><p>To improve your presentation skills, try <a href="#chp-https://amzn.com/0321820800" data-type="xref">#chp-https://amzn.com/0321820800</a>, by Neal Ford, Matthew McCollough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.</p></li> <ul><li><p>To improve your presentation skills, try <a href="https://amzn.com/0321820800"><em>Presentation Patterns</em></a>, by Neal Ford, Matthew McCollough, and Nathaniel Schutta. It provides a set of effective patterns (both low- and high-level) that you can apply to improve your presentations.</p></li>
<li><p>If you give academic talks, you might like the <a href="#chp-https://github.com/jtleek/talkguide" data-type="xref">#chp-https://github.com/jtleek/talkguide</a>.</p></li> <li><p>If you give academic talks, you might like the <a href="https://github.com/jtleek/talkguide"><em>Leek group guide to giving talks</em></a>.</p></li>
<li><p>We haven't taken it ourselves, but we've heard good things about Matt McGarrity's online course on public speaking: <a href="https://www.coursera.org/learn/public-speaking" class="uri">https://www.coursera.org/learn/public-speaking</a>.</p></li> <li><p>We haven't taken it ourselves, but we've heard good things about Matt McGarrity's online course on public speaking: <a href="https://www.coursera.org/learn/public-speaking" class="uri">https://www.coursera.org/learn/public-speaking</a>.</p></li>
<li><p>If you are creating a lot of dashboards, make sure to read Stephen Few's <a href="#chp-https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167" data-type="xref">#chp-https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167</a>. It will help you create dashboards that are truly useful, not just pretty to look at.</p></li> <li><p>If you are creating a lot of dashboards, make sure to read Stephen Few's <a href="https://www.amazon.com/Information-Dashboard-Design-Effective-Communication/dp/0596100167"><em>Information Dashboard Design: The Effective Visual Communication of Data</em></a>. It will help you create dashboards that are truly useful, not just pretty to look at.</p></li>
<li><p>Effectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams' <a href="#chp-https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151" data-type="xref">#chp-https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151</a> is a great place to start.</p></li> <li><p>Effectively communicating your ideas often benefits from some knowledge of graphic design. Robin Williams' <a href="https://www.amazon.com/Non-Designers-Design-Book-4th/dp/0133966151"><em>The Non-Designer's Design Book</em></a> is a great place to start.</p></li>
</ul></section> </ul></section>
</section> </section>

@ -17,9 +17,9 @@
<p>Use ISO8601 YYYY-MM-DD format so that there's no ambiguity. Use it even if you don't normally write dates that way!</p> <p>Use ISO8601 YYYY-MM-DD format so that there's no ambiguity. Use it even if you don't normally write dates that way!</p>
</li> </li>
<li><p>If you spend a lot of time on an analysis idea and it turns out to be a dead end, don't delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.</p></li> <li><p>If you spend a lot of time on an analysis idea and it turns out to be a dead end, don't delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.</p></li>
<li><p>Generally, you're better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using <code><a href="#chp-https://tibble.tidyverse.org/reference/tribble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tribble</a></code>.</p></li> <li><p>Generally, you're better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tibble::tribble()</a></code>.</p></li>
<li><p>If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.</p></li> <li><p>If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.</p></li>
<li><p>Before you finish for the day, make sure you can render the notebook. If you're using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.</p></li> <li><p>Before you finish for the day, make sure you can render the notebook. If you're using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.</p></li>
<li><p>If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you'll need to track the versions of the packages that your code uses. A rigorous approach is to use <strong>renv</strong>, <a href="https://rstudio.github.io/renv/index.html" class="uri">https://rstudio.github.io/renv/index.html</a>, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs <code><a href="#chp-https://rdrr.io/r/utils/sessionInfo" data-type="xref">#chp-https://rdrr.io/r/utils/sessionInfo</a></code>; that won't let you easily recreate your packages as they are today, but at least you'll know what they were.</p></li> <li><p>If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you'll need to track the versions of the packages that your code uses. A rigorous approach is to use <strong>renv</strong>, <a href="https://rstudio.github.io/renv/index.html" class="uri">https://rstudio.github.io/renv/index.html</a>, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs <code><a href="https://rdrr.io/r/utils/sessionInfo.html">sessionInfo()</a></code>; that won't let you easily recreate your packages as they are today, but at least you'll know what they were.</p></li>
<li><p>You are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.</p></li> <li><p>You are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.</p></li>
</ul></section> </ul></section>

@ -103,7 +103,7 @@ Exercises</h2>
<section id="visual-editor" data-type="sect1"> <section id="visual-editor" data-type="sect1">
<h1> <h1>
Visual editor</h1> Visual editor</h1>
<p>The Visual editor in RStudio provides a <a href="#chp-https://en.wikipedia.org/wiki/WYSIWYM" data-type="xref">#chp-https://en.wikipedia.org/wiki/WYSIWYM</a> interface for authoring Quarto documents. Under the hood, prose in Quarto documents (<code>.qmd</code> files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in <a href="#sec-source-editor" data-type="xref">#sec-source-editor</a>, it still requires learning new syntax. Therefore, if you're new to computational documents like <code>.qmd</code> files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.</p> <p>The Visual editor in RStudio provides a <a href="https://en.wikipedia.org/wiki/WYSIWYM">WYSIWYM</a> interface for authoring Quarto documents. Under the hood, prose in Quarto documents (<code>.qmd</code> files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in <a href="#sec-source-editor" data-type="xref">#sec-source-editor</a>, it still requires learning new syntax. 
Therefore, if you're new to computational documents like <code>.qmd</code> files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.</p>
<p>In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all <kbd>⌘ /</kbd> shortcut to insert just about anything. If you are at the beginning of a line (as shown below), you can also enter just <kbd>/</kbd> to invoke the shortcut.</p> <p>In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all <kbd>⌘ /</kbd> shortcut to insert just about anything. If you are at the beginning of a line (as shown below), you can also enter just <kbd>/</kbd> to invoke the shortcut.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -339,7 +339,7 @@ Inline code</h2>
<blockquote class="blockquote"> <blockquote class="blockquote">
<p>We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:</p> <p>We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:</p>
</blockquote> </blockquote>
<p>When inserting numbers into text, <code><a href="#chp-https://rdrr.io/r/base/format" data-type="xref">#chp-https://rdrr.io/r/base/format</a></code> is your friend. It allows you to set the number of <code>digits</code> so you don't print to a ridiculous degree of accuracy, and a <code>big.mark</code> to make numbers easier to read. You might combine these into a helper function:</p> <p>When inserting numbers into text, <code><a href="https://rdrr.io/r/base/format.html">format()</a></code> is your friend. It allows you to set the number of <code>digits</code> so you don't print to a ridiculous degree of accuracy, and a <code>big.mark</code> to make numbers easier to read. You might combine these into a helper function:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">comma &lt;- function(x) format(x, digits = 2, big.mark = ",") <pre data-type="programlisting" data-code-language="downlit">comma &lt;- function(x) format(x, digits = 2, big.mark = ",")
comma(3452345) comma(3452345)
@ -423,7 +423,7 @@ Tables</h1>
#&gt; Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 #&gt; Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#&gt; Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2</pre> #&gt; Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2</pre>
</div> </div>
<p>If you prefer that data be displayed with additional formatting you can use the <code><a href="#chp-https://rdrr.io/pkg/knitr/man/kable" data-type="xref">#chp-https://rdrr.io/pkg/knitr/man/kable</a></code> function. The code below generates <a href="#tbl-kable" data-type="xref">#tbl-kable</a>.</p> <p>If you prefer that data be displayed with additional formatting you can use the <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">knitr::kable()</a></code> function. The code below generates <a href="#tbl-kable" data-type="xref">#tbl-kable</a>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">knitr::kable(mtcars[1:5, ], )</pre> <pre data-type="programlisting" data-code-language="downlit">knitr::kable(mtcars[1:5, ], )</pre>
<div class="cell-output-display"> <div class="cell-output-display">
@ -504,7 +504,7 @@ Tables</h1>
</tr></tbody></table></div> </tr></tbody></table></div>
</div> </div>
</div> </div>
<p>Read the documentation for <code><a href="#chp-https://rdrr.io/pkg/knitr/man/kable" data-type="xref">#chp-https://rdrr.io/pkg/knitr/man/kable</a></code> to see the other ways in which you can customize the table. For even deeper customization, consider the <strong>gt</strong>, <strong>huxtable</strong>, <strong>reactable</strong>, <strong>kableExtra</strong>, <strong>xtable</strong>, <strong>stargazer</strong>, <strong>pander</strong>, <strong>tables</strong>, and <strong>ascii</strong> packages. Each provides a set of tools for returning formatted tables from R code.</p> <p>Read the documentation for <code><a href="https://rdrr.io/pkg/knitr/man/kable.html">?knitr::kable</a></code> to see the other ways in which you can customize the table. For even deeper customization, consider the <strong>gt</strong>, <strong>huxtable</strong>, <strong>reactable</strong>, <strong>kableExtra</strong>, <strong>xtable</strong>, <strong>stargazer</strong>, <strong>pander</strong>, <strong>tables</strong>, and <strong>ascii</strong> packages. Each provides a set of tools for returning formatted tables from R code.</p>
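<p>As a minimal sketch of that customization, the call below uses three documented <code>kable()</code> arguments: <code>digits</code>, <code>col.names</code>, and <code>caption</code>. The column subset, the header labels, and the caption text are illustrative choices, not taken from the chapter.</p>

```r
# Sketch only: the column subset, labels, and caption are illustrative.
library(knitr)

tbl <- kable(
  mtcars[1:5, c("mpg", "cyl", "wt")],
  digits = 1,                                   # round numeric columns
  col.names = c("MPG", "Cylinders", "Weight"),  # friendlier column headers
  caption = "The first five rows of mtcars."
)
tbl
```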
<p>There is also a rich set of options for controlling how figures are embedded. You'll learn about these in <a href="#chp-communicate-plots" data-type="xref">#chp-communicate-plots</a>.</p> <p>There is also a rich set of options for controlling how figures are embedded. You'll learn about these in <a href="#chp-communicate-plots" data-type="xref">#chp-communicate-plots</a>.</p>
<section id="exercises-5" data-type="sect2"> <section id="exercises-5" data-type="sect2">
@ -559,20 +559,20 @@ processed_data &lt;- rawdata |&gt;
mutate(new_variable = complicated_transformation(x, y, z)) mutate(new_variable = complicated_transformation(x, y, z))
```</code></pre> ```</code></pre>
<p><code>dependson</code> should contain a character vector of <em>every</em> chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.</p> <p><code>dependson</code> should contain a character vector of <em>every</em> chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.</p>
<p>Note that the chunks won't update if <code>a_very_large_file.csv</code> changes, because knitr caching only tracks changes within the <code>.qmd</code> file. If you want to also track changes to that file you can use the <code>cache.extra</code> option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is <code><a href="#chp-https://rdrr.io/r/base/file.info" data-type="xref">#chp-https://rdrr.io/r/base/file.info</a></code>: it returns a bunch of information about the file including when it was last modified. Then you can write:</p> <p>Note that the chunks won't update if <code>a_very_large_file.csv</code> changes, because knitr caching only tracks changes within the <code>.qmd</code> file. If you want to also track changes to that file you can use the <code>cache.extra</code> option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is <code><a href="https://rdrr.io/r/base/file.info.html">file.info()</a></code>: it returns a bunch of information about the file including when it was last modified. Then you can write:</p>
<pre><code>```{r} <pre><code>```{r}
#| label: raw-data #| label: raw-data
#| cache.extra: file.info("a_very_large_file.csv") #| cache.extra: file.info("a_very_large_file.csv")
rawdata &lt;- readr::read_csv("a_very_large_file.csv") rawdata &lt;- readr::read_csv("a_very_large_file.csv")
```</code></pre> ```</code></pre>
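<p>To see why <code>file.info()</code> suits <code>cache.extra</code>, it may help to watch its value change when a file is modified. The sketch below is an assumption-laden illustration: it uses a temporary file as a stand-in for <code>a_very_large_file.csv</code> and simply compares the metadata before and after an edit.</p>

```r
# Sketch: a temporary file stands in for a_very_large_file.csv.
path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), path)
before <- file.info(path)

# Simulate the data file changing on disk.
cat("3,4\n", file = path, append = TRUE)
after <- file.info(path)

# file.info() reports size, modification time, and other metadata.
before[, c("size", "mtime")]
after[, c("size", "mtime")]

# A changed value is what would invalidate the cached chunk.
identical(before, after)
```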
<p>As your caching strategies get progressively more complicated, it's a good idea to regularly clear out all your caches with <code><a href="#chp-https://rdrr.io/pkg/knitr/man/clean_cache" data-type="xref">#chp-https://rdrr.io/pkg/knitr/man/clean_cache</a></code>.</p> <p>As your caching strategies get progressively more complicated, it's a good idea to regularly clear out all your caches with <code><a href="https://rdrr.io/pkg/knitr/man/clean_cache.html">knitr::clean_cache()</a></code>.</p>
<p>We've followed the advice of <a href="#chp-https://twitter.com/drob/status/738786604731490304" data-type="xref">#chp-https://twitter.com/drob/status/738786604731490304</a> to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the <code>dependson</code> specification.</p> <p>We've followed the advice of <a href="https://twitter.com/drob/status/738786604731490304">David Robinson</a> to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the <code>dependson</code> specification.</p>
<section id="exercises-6" data-type="sect2"> <section id="exercises-6" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li>Set up a network of chunks where <code>d</code> depends on <code>c</code> and <code>b</code>, and both <code>b</code> and <code>c</code> depend on <code>a</code>. Have each chunk print <code><a href="#chp-https://lubridate.tidyverse.org/reference/now" data-type="xref">#chp-https://lubridate.tidyverse.org/reference/now</a></code>, set <code>cache: true</code>, then verify your understanding of caching.</li> <ol type="1"><li>Set up a network of chunks where <code>d</code> depends on <code>c</code> and <code>b</code>, and both <code>b</code> and <code>c</code> depend on <code>a</code>. Have each chunk print <code><a href="https://lubridate.tidyverse.org/reference/now.html">lubridate::now()</a></code>, set <code>cache: true</code>, then verify your understanding of caching.</li>
</ol></section> </ol></section>
</section> </section>
@ -582,8 +582,8 @@ Troubleshooting</h1>
<p>Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.</p> <p>Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.</p>
<p>One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.</p> <p>One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.</p>
<p>If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu (under Run region) or with the keyboard shortcut Ctrl + Alt + R. If you're lucky, that will recreate the problem, and you can figure out what's going on interactively.</p> <p>If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks”, either from the Code menu (under Run region) or with the keyboard shortcut Ctrl + Alt + R. If you're lucky, that will recreate the problem, and you can figure out what's going on interactively.</p>
<p>If that doesn't help, there must be something different between your interactive environment and the Quarto environment. You're going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including <code><a href="#chp-https://rdrr.io/r/base/getwd" data-type="xref">#chp-https://rdrr.io/r/base/getwd</a></code> in a chunk.</p> <p>If that doesn't help, there must be something different between your interactive environment and the Quarto environment. You're going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code> in a chunk.</p>
<p>Next, brainstorm all the things that might cause the bug. You'll need to systematically check that they're the same in your R session and your Quarto session. The easiest way to do that is to set <code>error: true</code> on the chunk causing the problem, then use <code><a href="#chp-https://rdrr.io/r/base/print" data-type="xref">#chp-https://rdrr.io/r/base/print</a></code> and <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code> to check that settings are as you expect.</p> <p>Next, brainstorm all the things that might cause the bug. You'll need to systematically check that they're the same in your R session and your Quarto session. The easiest way to do that is to set <code>error: true</code> on the chunk causing the problem, then use <code><a href="https://rdrr.io/r/base/print.html">print()</a></code> and <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> to check that settings are as you expect.</p>
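<p>A diagnostic chunk along these lines might look like the sketch below. Which settings to inspect is an assumption made for illustration (library paths, two global options, one environment variable); in practice you would print whatever you suspect differs, run the same code in both the interactive session and the rendered document, and compare the output.</p>

```r
# Sketch of a diagnostic chunk; the settings inspected here are illustrative.
print(getwd())                       # working directory used for this run
print(.libPaths())                   # where packages are being loaded from
str(options()[c("digits", "warn")])  # a couple of global options
str(Sys.getenv("HOME"))              # an environment variable
```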
</section> </section>
<section id="yaml-header" data-type="sect1"> <section id="yaml-header" data-type="sect1">
@ -645,9 +645,9 @@ ggplot(class, aes(displ, hwy)) +
Bibliographies and Citations</h2> Bibliographies and Citations</h2>
<p>Quarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.</p> <p>Quarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.</p>
<p>To add a citation using the visual editor, go to Insert &gt; Citation. Citations can be inserted from a variety of sources:</p> <p>To add a citation using the visual editor, go to Insert &gt; Citation. Citations can be inserted from a variety of sources:</p>
<ol type="1"><li><p><a href="#citations-from-dois" data-type="xref">#citations-from-dois</a> (Document Object Identifier) references.</p></li> <ol type="1"><li><p><a href="https://quarto.org/docs/visual-editor/technical.html#citations-from-dois">DOI</a> (Document Object Identifier) references.</p></li>
<li><p><a href="#citations-from-zotero" data-type="xref">#citations-from-zotero</a> personal or group libraries.</p></li> <li><p><a href="https://quarto.org/docs/visual-editor/technical.html#citations-from-zotero">Zotero</a> personal or group libraries.</p></li>
<li><p>Searches of <a href="#chp-https://www.crossref.org/" data-type="xref">#chp-https://www.crossref.org/</a>, <a href="#chp-https://datacite.org/" data-type="xref">#chp-https://datacite.org/</a>, or <a href="#chp-https://pubmed.ncbi.nlm.nih.gov/" data-type="xref">#chp-https://pubmed.ncbi.nlm.nih.gov/</a>.</p></li> <li><p>Searches of <a href="https://www.crossref.org/">Crossref</a>, <a href="https://datacite.org/">DataCite</a>, or <a href="https://pubmed.ncbi.nlm.nih.gov/">PubMed</a>.</p></li>
<li><p>Your document bibliography (a <code>.bib</code> file in the directory of your document)</p></li> <li><p>Your document bibliography (a <code>.bib</code> file in the directory of your document)</p></li>
</ol><p>Under the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g. <code>[@citation]</code>).</p> </ol><p>Under the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g. <code>[@citation]</code>).</p>
<p>If you add a citation using one of the first three methods, the visual editor will automatically create a <code>bibliography.bib</code> file for you and add the reference to it. It will also add a <code>bibliography</code> field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats including BibLaTeX, BibTeX, EndNote, Medline.</p> <p>If you add a citation using one of the first three methods, the visual editor will automatically create a <code>bibliography.bib</code> file for you and add the reference to it. It will also add a <code>bibliography</code> field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats including BibLaTeX, BibTeX, EndNote, Medline.</p>
@ -675,7 +675,7 @@ csl: apa.csl</pre>
Learning more</h1> Learning more</h1>
<p>Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: <a href="https://quarto.org/" class="uri">https://quarto.org</a>.</p> <p>Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: <a href="https://quarto.org/" class="uri">https://quarto.org</a>.</p>
<p>There are two important topics that we haven't covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user-friendly introduction to Git and GitHub for R users, by Jenny Bryan. The book is freely available online: <a href="https://happygitwithr.com" class="uri">https://happygitwithr.com</a>.</p> <p>There are two important topics that we haven't covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user-friendly introduction to Git and GitHub for R users, by Jenny Bryan. The book is freely available online: <a href="https://happygitwithr.com" class="uri">https://happygitwithr.com</a>.</p>
<p>We have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either <a href="#chp-https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416" data-type="xref">#chp-https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416</a> by Joseph M. Williams &amp; Joseph Bizup, or <a href="#chp-https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327" data-type="xref">#chp-https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327</a> by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <a href="https://www.georgegopen.com/the-litigation-articles.html" class="uri">https://www.georgegopen.com/the-litigation-articles.html</a>. They are aimed at lawyers, but almost everything applies to data scientists too.</p> <p>We have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either <a href="https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416"><em>Style: Lessons in Clarity and Grace</em></a> by Joseph M. Williams &amp; Joseph Bizup, or <a href="https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327"><em>The Sense of Structure: Writing from the Reader's Perspective</em></a> by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. 
(These books are rather expensive if purchased new, but they're used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at <a href="https://www.georgegopen.com/the-litigation-articles.html" class="uri">https://www.georgegopen.com/the-litigation-articles.html</a>. They are aimed at lawyers, but almost everything applies to data scientists too.</p>
</section> </section>


@ -8,13 +8,13 @@
Base R Base R
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>It's possible to put a list in a column of a <code>data.frame</code>, but it's a lot fiddlier because <code><a href="#chp-https://rdrr.io/r/base/data.frame" data-type="xref">#chp-https://rdrr.io/r/base/data.frame</a></code> treats a list as a list of columns:</p><div class="cell"> <p>It's possible to put a list in a column of a <code>data.frame</code>, but it's a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5)) <pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5))
#&gt; x.1.3 x.3.5 #&gt; x.1.3 x.3.5
#&gt; 1 1 3 #&gt; 1 1 3
#&gt; 2 2 4 #&gt; 2 2 4
#&gt; 3 3 5</pre> #&gt; 3 3 5</pre>
</div><p>You can force <code><a href="#chp-https://rdrr.io/r/base/data.frame" data-type="xref">#chp-https://rdrr.io/r/base/data.frame</a></code> to treat a list as a list of rows by wrapping it in <code><a href="#chp-https://rdrr.io/r/base/AsIs" data-type="xref">#chp-https://rdrr.io/r/base/AsIs</a></code>, but the result doesn't print particularly well:</p><div class="cell"> </div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn't print particularly well:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame( <pre data-type="programlisting" data-code-language="downlit">data.frame(
x = I(list(1:2, 3:5)), x = I(list(1:2, 3:5)),
y = c("1, 2", "3, 4, 5") y = c("1, 2", "3, 4, 5")
@ -22,13 +22,13 @@ Base R
#&gt; x y #&gt; x y
#&gt; 1 1, 2 1, 2 #&gt; 1 1, 2 1, 2
#&gt; 2 3, 4, 5 3, 4, 5</pre> #&gt; 2 3, 4, 5 3, 4, 5</pre>
</div><p>It's easier to use list-columns with tibbles because <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tibble</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div> </div><p>It's easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
<section id="introduction" data-type="sect1"> <section id="introduction" data-type="sect1">
<h1> <h1>
Introduction</h1> Introduction</h1>
<p>In this chapter, you'll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p> <p>In this chapter, you'll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
<p>To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible. Then you'll learn about two crucial tidyr functions: <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>. We'll then show you a few case studies, applying these simple functions again and again to solve real problems. We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.</p> <p>To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible. Then you'll learn about two crucial tidyr functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">tidyr::unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">tidyr::unnest_wider()</a></code>. We'll then show you a few case studies, applying these simple functions again and again to solve real problems. We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.</p>
<section id="prerequisites" data-type="sect2"> <section id="prerequisites" data-type="sect2">
<h2> <h2>
@ -45,7 +45,7 @@ library(jsonlite)</pre>
<section id="lists" data-type="sect1"> <section id="lists" data-type="sect1">
<h1> <h1>
Lists</h1> Lists</h1>
<p>So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they're homogeneous: every element is the same type. If you want to store elements of different types in the same vector, you'll need a <strong>list</strong>, which you create with <code><a href="#chp-https://rdrr.io/r/base/list" data-type="xref">#chp-https://rdrr.io/r/base/list</a></code>:</p> <p>So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they're homogeneous: every element is the same type. If you want to store elements of different types in the same vector, you'll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- list(1:4, "a", TRUE) <pre data-type="programlisting" data-code-language="downlit">x1 &lt;- list(1:4, "a", TRUE)
x1 x1
@ -71,7 +71,7 @@ x2
#&gt; $c #&gt; $c
#&gt; [1] 1 2 3 4</pre> #&gt; [1] 1 2 3 4</pre>
</div> </div>
<p>Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code>, which generates a compact display of the <strong>str</strong>ucture, de-emphasizing the contents:</p> <p>Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code>, which generates a compact display of the <strong>str</strong>ucture, de-emphasizing the contents:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str(x1) <pre data-type="programlisting" data-code-language="downlit">str(x1)
#&gt; List of 3 #&gt; List of 3
@ -84,7 +84,7 @@ str(x2)
#&gt; $ b: int [1:3] 1 2 3 #&gt; $ b: int [1:3] 1 2 3
#&gt; $ c: int [1:4] 1 2 3 4</pre> #&gt; $ c: int [1:4] 1 2 3 4</pre>
</div> </div>
<p>As you can see, <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code> displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.</p> <p>As you can see, <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.</p>
<section id="hierarchy" data-type="sect2"> <section id="hierarchy" data-type="sect2">
<h2> <h2>
@ -101,7 +101,7 @@ str(x3)
#&gt; ..$ : num 3 #&gt; ..$ : num 3
#&gt; ..$ : num 4</pre> #&gt; ..$ : num 4</pre>
</div> </div>
<p>This is notably different to <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">#chp-https://rdrr.io/r/base/c</a></code>, which generates a flat vector:</p> <p>This is notably different to <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>, which generates a flat vector:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">c(c(1, 2), c(3, 4)) <pre data-type="programlisting" data-code-language="downlit">c(c(1, 2), c(3, 4))
#&gt; [1] 1 2 3 4 #&gt; [1] 1 2 3 4
@ -114,7 +114,7 @@ str(x4)
#&gt; $ : num 3 #&gt; $ : num 3
#&gt; $ : num 4</pre> #&gt; $ : num 4</pre>
</div> </div>
<p>As lists get more complex, <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code> gets more useful, as it lets you see the hierarchy at a glance:</p> <p>As lists get more complex, <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> gets more useful, as it lets you see the hierarchy at a glance:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x5 &lt;- list(1, list(2, list(3, list(4, list(5))))) <pre data-type="programlisting" data-code-language="downlit">x5 &lt;- list(1, list(2, list(3, list(4, list(5)))))
str(x5) str(x5)
@ -129,7 +129,7 @@ str(x5)
#&gt; .. .. ..$ :List of 1 #&gt; .. .. ..$ :List of 1
#&gt; .. .. .. ..$ : num 5</pre> #&gt; .. .. .. ..$ : num 5</pre>
</div> </div>
<p>As lists get even larger and more complex, <code><a href="#chp-https://rdrr.io/r/utils/str" data-type="xref">#chp-https://rdrr.io/r/utils/str</a></code> eventually starts to fail, and you'll need to switch to <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code><span data-type="footnote">This is an RStudio feature.</span>. <a href="#fig-view-collapsed" data-type="xref">#fig-view-collapsed</a> shows the result of calling <code>View(x4)</code>. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in <a href="#fig-view-expand-1" data-type="xref">#fig-view-expand-1</a>. RStudio will also show you the code you need to access that element, as in <a href="#fig-view-expand-2" data-type="xref">#fig-view-expand-2</a>. We'll come back to how this code works in <a href="#sec-subset-one" data-type="xref">#sec-subset-one</a>.</p> <p>As lists get even larger and more complex, <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> eventually starts to fail, and you'll need to switch to <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code><span data-type="footnote">This is an RStudio feature.</span>. <a href="#fig-view-collapsed" data-type="xref">#fig-view-collapsed</a> shows the result of calling <code>View(x4)</code>. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in <a href="#fig-view-expand-1" data-type="xref">#fig-view-expand-1</a>. RStudio will also show you the code you need to access that element, as in <a href="#fig-view-expand-2" data-type="xref">#fig-view-expand-2</a>. We'll come back to how this code works in <a href="#sec-subset-one" data-type="xref">#sec-subset-one</a>.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -159,7 +159,7 @@ str(x5)
<section id="list-columns" data-type="sect2"> <section id="list-columns" data-type="sect2">
<h2> <h2>
List-columns</h2> List-columns</h2>
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a tibble. In particular, list-columns are used a lot in the <a href="#chp-https://www.tidymodels" data-type="xref">#chp-https://www.tidymodels</a> ecosystem, because they allow you to store things like models or resamples in a data frame.</p> <p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a tibble. In particular, list-columns are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like models or resamples in a data frame.</p>
<p>Here's a simple example of a list-column:</p> <p>Here's a simple example of a list-column:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
@ -195,18 +195,18 @@ df
#&gt; ..$ : num 1 #&gt; ..$ : num 1
#&gt; ..$ : num 2</pre> #&gt; ..$ : num 2</pre>
</div> </div>
<p>Similarly, if you <code><a href="#chp-https://rdrr.io/r/utils/View" data-type="xref">#chp-https://rdrr.io/r/utils/View</a></code> a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns. To explore those fields you'll need to <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pull</a></code> and view, e.g. <code>df |&gt; pull(z) |&gt; View()</code>.</p> <p>Similarly, if you <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns. To explore those fields you'll need to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> and view, e.g. <code>df |&gt; pull(z) |&gt; View()</code>.</p>
<div data-type="note"><h1> <div data-type="note"><h1>
Base R Base R
</h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p>
<p>It's possible to put a list in a column of a <code>data.frame</code>, but it's a lot fiddlier because <code><a href="#chp-https://rdrr.io/r/base/data.frame" data-type="xref">#chp-https://rdrr.io/r/base/data.frame</a></code> treats a list as a list of columns:</p><div class="cell"> <p>It's possible to put a list in a column of a <code>data.frame</code>, but it's a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5)) <pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5))
#&gt; x.1.3 x.3.5 #&gt; x.1.3 x.3.5
#&gt; 1 1 3 #&gt; 1 1 3
#&gt; 2 2 4 #&gt; 2 2 4
#&gt; 3 3 5</pre> #&gt; 3 3 5</pre>
</div><p>You can force <code><a href="#chp-https://rdrr.io/r/base/data.frame" data-type="xref">#chp-https://rdrr.io/r/base/data.frame</a></code> to treat a list as a list of rows by wrapping it in <code><a href="#chp-https://rdrr.io/r/base/AsIs" data-type="xref">#chp-https://rdrr.io/r/base/AsIs</a></code>, but the result doesn't print particularly well:</p><div class="cell"> </div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn't print particularly well:</p><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">data.frame( <pre data-type="programlisting" data-code-language="downlit">data.frame(
x = I(list(1:2, 3:5)), x = I(list(1:2, 3:5)),
y = c("1, 2", "3, 4, 5") y = c("1, 2", "3, 4, 5")
@ -214,7 +214,7 @@ Base R
#&gt; x y #&gt; x y
#&gt; 1 1, 2 1, 2 #&gt; 1 1, 2 1, 2
#&gt; 2 3, 4, 5 3, 4, 5</pre> #&gt; 2 3, 4, 5 3, 4, 5</pre>
</div><p>It's easier to use list-columns with tibbles because <code><a href="#chp-https://tibble.tidyverse.org/reference/tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/tibble</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div> </div><p>It's easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
</section> </section>
</section> </section>
@ -242,13 +242,13 @@ df2 &lt;- tribble(
3, list(31, 32), 3, list(31, 32),
)</pre> )</pre>
</div> </div>
<p>tidyr provides two functions for these two cases: <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code>. The following sections explain how they work.</p> <p>tidyr provides two functions for these two cases: <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>. The following sections explain how they work.</p>
<section id="unnest_wider" data-type="sect2"> <section id="unnest_wider" data-type="sect2">
<h2> <h2>
<code>unnest_wider()</code> <code>unnest_wider()</code>
</h2> </h2>
<p>When each row has the same number of elements with the same names, like <code>df1</code>, it's natural to put each component into its own column with <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>:</p> <p>When each row has the same number of elements with the same names, like <code>df1</code>, it's natural to put each component into its own column with <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 |&gt; <pre data-type="programlisting" data-code-language="downlit">df1 |&gt;
unnest_wider(y) unnest_wider(y)
@ -270,7 +270,7 @@ df2 &lt;- tribble(
#&gt; 2 2 21 22 #&gt; 2 2 21 22
#&gt; 3 3 31 32</pre> #&gt; 3 3 31 32</pre>
</div> </div>
<p>We can also use <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> with unnamed list-columns, as in <code>df2</code>. Since columns require names but the list lacks them, <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> will label them with consecutive integers:</p> <p>We can also use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> with unnamed list-columns, as in <code>df2</code>. Since columns require names but the list lacks them, <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> will label them with consecutive integers:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df2 |&gt; <pre data-type="programlisting" data-code-language="downlit">df2 |&gt;
unnest_wider(y, names_sep = "_") unnest_wider(y, names_sep = "_")
@ -281,14 +281,14 @@ df2 &lt;- tribble(
#&gt; 2 2 21 NA NA #&gt; 2 2 21 NA NA
#&gt; 3 3 31 32 NA</pre> #&gt; 3 3 31 32 NA</pre>
</div> </div>
<p>You'll notice that <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>, much like <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>, turns implicit missing values into explicit missing values.</p> <p>You'll notice that <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>, much like <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, turns implicit missing values into explicit missing values.</p>
</section> </section>
<section id="unnest_longer" data-type="sect2"> <section id="unnest_longer" data-type="sect2">
<h2> <h2>
<code>unnest_longer()</code> <code>unnest_longer()</code>
</h2> </h2>
<p>When each row contains an unnamed list, it's most natural to put each element into its own row with <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code>:</p> <p>When each row contains an unnamed list, it's most natural to put each element into its own row with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df2 |&gt; <pre data-type="programlisting" data-code-language="downlit">df2 |&gt;
unnest_longer(y) unnest_longer(y)
@ -360,7 +360,7 @@ Inconsistent types</h2>
"b", list(TRUE, factor("a"), 5) "b", list(TRUE, factor("a"), 5)
)</pre> )</pre>
</div> </div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> produce five rows while keeping everything in <code>y</code>?</p> <p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df4 |&gt; <pre data-type="programlisting" data-code-language="downlit">df4 |&gt;
unnest_longer(y) unnest_longer(y)
@ -373,7 +373,7 @@ Inconsistent types</h2>
#&gt; 4 b &lt;fct [1]&gt; #&gt; 4 b &lt;fct [1]&gt;
#&gt; 5 b &lt;dbl [1]&gt;</pre> #&gt; 5 b &lt;dbl [1]&gt;</pre>
</div> </div>
<p>As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> can't find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though the contents of each element have different types.</p> <p>As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> can't find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though the contents of each element have different types.</p>
<p>What happens if you find this problem in a dataset you're trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. It's not particularly useful here because there's really only one class that these five values can be converted to: character.</p> <p>What happens if you find this problem in a dataset you're trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. It's not particularly useful here because there's really only one class that these five values can be converted to: character.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df4 |&gt; <pre data-type="programlisting" data-code-language="downlit">df4 |&gt;
@ -398,7 +398,7 @@ Inconsistent types</h2>
#&gt; 1 a &lt;dbl [1]&gt; #&gt; 1 a &lt;dbl [1]&gt;
#&gt; 2 b &lt;dbl [1]&gt;</pre> #&gt; 2 b &lt;dbl [1]&gt;</pre>
</div> </div>
<p>Then you can call <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> once more:</p> <p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df4 |&gt; <pre data-type="programlisting" data-code-language="downlit">df4 |&gt;
unnest_longer(y) |&gt; unnest_longer(y) |&gt;
@ -410,7 +410,7 @@ Inconsistent types</h2>
#&gt; 1 a 1 #&gt; 1 a 1
#&gt; 2 b 5</pre> #&gt; 2 b 5</pre>
</div> </div>
<p>You'll learn more about <code><a href="#chp-https://purrr.tidyverse.org/reference/map" data-type="xref">#chp-https://purrr.tidyverse.org/reference/map</a></code> in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p> <p>You'll learn more about <code><a href="https://purrr.tidyverse.org/reference/map.html">map_lgl()</a></code> in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
</section> </section>
<section id="other-functions" data-type="sect2"> <section id="other-functions" data-type="sect2">
@ -418,11 +418,11 @@ Inconsistent types</h2>
Other functions</h2> Other functions</h2>
<p>tidyr has a few other useful rectangling functions that we're not going to cover in this book:</p> <p>tidyr has a few other useful rectangling functions that we're not going to cover in this book:</p>
<ul><li> <ul><li>
<code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_auto" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_auto</a></code> automatically picks between <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.</li> <code><a href="https://tidyr.tidyverse.org/reference/unnest_auto.html">unnest_auto()</a></code> automatically picks between <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.</li>
<li> <li>
<code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest</a></code> expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.</li> <code><a href="https://tidyr.tidyverse.org/reference/unnest.html">unnest()</a></code> expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.</li>
<li> <li>
<code><a href="#chp-https://tidyr.tidyverse.org/reference/hoist" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/hoist</a></code> allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> + <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.</li> <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code> allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> + <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.</li>
</ul><p>These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.</p> </ul><p>These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.</p>
</section> </section>
@ -430,7 +430,7 @@ Other functions</h2>
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li> <ol type="1"><li>
<p>From time to time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of <code>y</code> and <code>z</code> are aligned (i.e. <code>y</code> and <code>z</code> will always have the same length within a row, and the first value of <code>y</code> corresponds to the first value of <code>z</code>). What happens if you apply two <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> calls to this data frame? How can you preserve the relationship between <code>y</code> and <code>z</code>? (Hint: carefully read the docs).</p> <p>From time to time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of <code>y</code> and <code>z</code> are aligned (i.e. <code>y</code> and <code>z</code> will always have the same length within a row, and the first value of <code>y</code> corresponds to the first value of <code>z</code>). What happens if you apply two <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> calls to this data frame? How can you preserve the relationship between <code>y</code> and <code>z</code>? (Hint: carefully read the docs).</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df4 &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">df4 &lt;- tribble(
~x, ~y, ~z, ~x, ~y, ~z,
@ -445,7 +445,7 @@ Exercises</h2>
<section id="case-studies" data-type="sect1"> <section id="case-studies" data-type="sect1">
<h1> <h1>
Case studies</h1> Case studies</h1>
<p>The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> and/or <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>. This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.</p> <p>The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> and/or <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>. This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.</p>
<section id="very-wide-data" data-type="sect2"> <section id="very-wide-data" data-type="sect2">
<h2> <h2>
@ -465,7 +465,7 @@ repos
#&gt; 5 &lt;list [30]&gt; #&gt; 5 &lt;list [30]&gt;
#&gt; 6 &lt;list [30]&gt;</pre> #&gt; 6 &lt;list [30]&gt;</pre>
</div> </div>
<p>This tibble contains 6 rows, one row for each child of <code>gh_repos</code>. Each row contains an unnamed list with either 26 or 30 elements. Since these are unnamed, we'll start with <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> to put each child in its own row:</p> <p>This tibble contains 6 rows, one row for each child of <code>gh_repos</code>. Each row contains an unnamed list with either 26 or 30 elements. Since these are unnamed, we'll start with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put each child in its own row:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">repos |&gt; <pre data-type="programlisting" data-code-language="downlit">repos |&gt;
unnest_longer(json) unnest_longer(json)
@ -480,7 +480,7 @@ repos
#&gt; 6 &lt;named list [68]&gt; #&gt; 6 &lt;named list [68]&gt;
#&gt; # … with 170 more rows</pre> #&gt; # … with 170 more rows</pre>
</div> </div>
<p>At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6) each element of <code>json</code> is still a list. However, there's an important difference: now each element is a <strong>named</strong> list so we can use <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> to put each element into its own column:</p> <p>At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6) each element of <code>json</code> is still a list. However, there's an important difference: now each element is a <strong>named</strong> list so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put each element into its own column:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">repos |&gt; <pre data-type="programlisting" data-code-language="downlit">repos |&gt;
unnest_longer(json) |&gt; unnest_longer(json) |&gt;
@ -502,7 +502,7 @@ repos
#&gt; # languages_url &lt;chr&gt;, stargazers_url &lt;chr&gt;, contributors_url &lt;chr&gt;, #&gt; # languages_url &lt;chr&gt;, stargazers_url &lt;chr&gt;, contributors_url &lt;chr&gt;,
#&gt; # subscribers_url &lt;chr&gt;, subscription_url &lt;chr&gt;, commits_url &lt;chr&gt;, …</pre> #&gt; # subscribers_url &lt;chr&gt;, subscription_url &lt;chr&gt;, commits_url &lt;chr&gt;, …</pre>
</div> </div>
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them! We can see them all with <code><a href="#chp-https://rdrr.io/r/base/names" data-type="xref">#chp-https://rdrr.io/r/base/names</a></code>:</p> <p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">repos |&gt; <pre data-type="programlisting" data-code-language="downlit">repos |&gt;
unnest_longer(json) |&gt; unnest_longer(json) |&gt;
@ -550,7 +550,7 @@ repos
#&gt; # … with 170 more rows</pre> #&gt; # … with 170 more rows</pre>
</div> </div>
<p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p> <p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
<p><code>owner</code> is another list-column, and since it contains a named list, we can use <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> to get at the values:</p> <p><code>owner</code> is another list-column, and since it contains a named list, we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to get at the values:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">repos |&gt; <pre data-type="programlisting" data-code-language="downlit">repos |&gt;
unnest_longer(json) |&gt; unnest_longer(json) |&gt;
@ -729,7 +729,7 @@ characters |&gt;
<section id="a-dash-of-text-analysis" data-type="sect2"> <section id="a-dash-of-text-analysis" data-type="sect2">
<h2> <h2>
A dash of text analysis</h2> A dash of text analysis</h2>
<p>What if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_split" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_split</a></code> to break each element of <code>title</code> up into words by splitting on <code>" "</code>:</p> <p>What if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by splitting on <code>" "</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">titles |&gt; <pre data-type="programlisting" data-code-language="downlit">titles |&gt;
mutate(word = str_split(title, " "), .keep = "unused") mutate(word = str_split(title, " "), .keep = "unused")
@ -744,7 +744,7 @@ A dash of text analysis</h2>
#&gt; 6 1074 &lt;chr [6]&gt; #&gt; 6 1074 &lt;chr [6]&gt;
#&gt; # … with 47 more rows</pre> #&gt; # … with 47 more rows</pre>
</div> </div>
<p>This creates an unnamed, variable-length list-column, so we can use <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code>:</p> <p>This creates an unnamed, variable-length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">titles |&gt; <pre data-type="programlisting" data-code-language="downlit">titles |&gt;
mutate(word = str_split(title, " "), .keep = "unused") |&gt; mutate(word = str_split(title, " "), .keep = "unused") |&gt;
@ -798,13 +798,13 @@ titles |&gt;
#&gt; 6 Queen 5 #&gt; 6 Queen 5
#&gt; # … with 70 more rows</pre> #&gt; # … with 70 more rows</pre>
</div> </div>
<p>Breaking up text into individual fragments is a powerful idea that underlies much of text analysis. If this sounds interesting, a good place to learn more is <a href="#chp-https://www.tidytextmining" data-type="xref">#chp-https://www.tidytextmining</a> by Julia Silge and David Robinson.</p> <p>Breaking up text into individual fragments is a powerful idea that underlies much of text analysis. If this sounds interesting, a good place to learn more is <a href="https://www.tidytextmining.com">Text Mining with R</a> by Julia Silge and David Robinson.</p>
</section> </section>
<section id="deeply-nested" data-type="sect2"> <section id="deeply-nested" data-type="sect2">
<h2> <h2>
Deeply nested</h2> Deeply nested</h2>
<p>We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> to unravel: <code>gmaps_cities</code>. This is a two-column tibble containing five city names and the results of using Google's <a href="#chp-https://developers.google.com/maps/documentation/geocoding" data-type="xref">#chp-https://developers.google.com/maps/documentation/geocoding</a> to determine their location:</p> <p>We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to unravel: <code>gmaps_cities</code>. This is a two-column tibble containing five city names and the results of using Google's <a href="https://developers.google.com/maps/documentation/geocoding">geocoding API</a> to determine their location:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities <pre data-type="programlisting" data-code-language="downlit">gmaps_cities
#&gt; # A tibble: 5 × 2 #&gt; # A tibble: 5 × 2
@ -816,7 +816,7 @@ Deeply nested</h2>
#&gt; 4 Chicago &lt;named list [2]&gt; #&gt; 4 Chicago &lt;named list [2]&gt;
#&gt; 5 Arlington &lt;named list [2]&gt;</pre> #&gt; 5 Arlington &lt;named list [2]&gt;</pre>
</div> </div>
<p><code>json</code> is a list-column with internal names, so we start with an <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>:</p> <p><code>json</code> is a list-column with internal names, so we start with an <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities |&gt; <pre data-type="programlisting" data-code-language="downlit">gmaps_cities |&gt;
unnest_wider(json) unnest_wider(json)
@ -846,7 +846,7 @@ Deeply nested</h2>
#&gt; 6 Arlington &lt;named list [5]&gt; #&gt; 6 Arlington &lt;named list [5]&gt;
#&gt; # … with 1 more row</pre> #&gt; # … with 1 more row</pre>
</div> </div>
<p>Now <code>results</code> is a named list, so we'll use <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>:</p> <p>Now <code>results</code> is a named list, so we'll use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">locations &lt;- gmaps_cities |&gt; <pre data-type="programlisting" data-code-language="downlit">locations &lt;- gmaps_cities |&gt;
unnest_wider(json) |&gt; unnest_wider(json) |&gt;
@ -938,8 +938,8 @@ locations
#&gt; 6 Arlington Arlington, TX, USA 32.8 -97.0 32.6 -97.2 #&gt; 6 Arlington Arlington, TX, USA 32.8 -97.0 32.6 -97.2
#&gt; # … with 1 more row</pre> #&gt; # … with 1 more row</pre>
</div> </div>
<p>Note how we unnest two columns simultaneously by supplying a vector of variable names to <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code>.</p> <p>Note how we unnest two columns simultaneously by supplying a vector of variable names to <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>.</p>
<p>This is somewhere that <code><a href="#chp-https://tidyr.tidyverse.org/reference/hoist" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/hoist</a></code>, mentioned briefly above, can be useful. Once you've discovered the path to get to the components you're interested in, you can extract them directly using <code><a href="#chp-https://tidyr.tidyverse.org/reference/hoist" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/hoist</a></code>:</p> <p>This is somewhere that <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned briefly above, can be useful. Once you've discovered the path to get to the components you're interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">locations |&gt; <pre data-type="programlisting" data-code-language="downlit">locations |&gt;
select(city, formatted_address, geometry) |&gt; select(city, formatted_address, geometry) |&gt;
@ -968,7 +968,7 @@ locations
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>Roughly estimate when <code>gh_repos</code> was created. Why can you only roughly estimate the date?</p></li> <ol type="1"><li><p>Roughly estimate when <code>gh_repos</code> was created. Why can you only roughly estimate the date?</p></li>
<li><p>The <code>owner</code> column of <code>gh_repos</code> contains a lot of duplicated information because each owner can have many repos. Can you construct an <code>owners</code> data frame that contains one row for each owner? (Hint: does <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code> work with <code>list-cols</code>?)</p></li> <li><p>The <code>owner</code> column of <code>gh_repos</code> contains a lot of duplicated information because each owner can have many repos. Can you construct an <code>owners</code> data frame that contains one row for each owner? (Hint: does <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> work with <code>list-cols</code>?)</p></li>
<li> <li>
<p>Explain the following code line-by-line. Why is it interesting? Why does it work for <code>got_chars</code> but might not work in general?</p> <p>Explain the following code line-by-line. Why is it interesting? Why does it work for <code>got_chars</code> but might not work in general?</p>
<div class="cell"> <div class="cell">
@ -983,7 +983,7 @@ Exercises</h2>
unnest_longer(value)</pre> unnest_longer(value)</pre>
</div> </div>
</li> </li>
<li><p>In <code>gmaps_cities</code>, what does <code>address_components</code> contain? Why does the length vary between rows? Unnest it appropriately to figure it out. (Hint: <code>types</code> always appears to contain two elements. Does <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> make it easier to work with than <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code>?)</p></li> <li><p>In <code>gmaps_cities</code>, what does <code>address_components</code> contain? Why does the length vary between rows? Unnest it appropriately to figure it out. (Hint: <code>types</code> always appears to contain two elements. Does <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> make it easier to work with than <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>?)</p></li>
</ol></section> </ol></section>
</section> </section>
@ -1001,13 +1001,13 @@ Data types</h2>
<li>A <strong>number</strong> is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.</li> <li>A <strong>number</strong> is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.</li>
<li>A <strong>boolean</strong> is similar to R’s <code>TRUE</code> and <code>FALSE</code>, but uses lowercase <code>true</code> and <code>false</code>.</li> <li>A <strong>boolean</strong> is similar to R’s <code>TRUE</code> and <code>FALSE</code>, but uses lowercase <code>true</code> and <code>false</code>.</li>
</ul><p>JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.</p> </ul><p>JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.</p>
<p>Both arrays and objects are similar to lists in R; the difference is whether or not theyre named. An <strong>array</strong> is like an unnamed list, and is written with <code>[]</code>. For example <code>[1, 2, 3]</code> is an array containing 3 numbers, and <code>[null, 1, "string", false]</code> is an array that contains a null, a number, a string, and a boolean. An <strong>object</strong> is like a named list, and is written with <code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code>. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, <code>{"x": 1, "y": 2}</code> is an object that maps <code>x</code> to 1 and <code>y</code> to 2.</p> <p>Both arrays and objects are similar to lists in R; the difference is whether or not theyre named. An <strong>array</strong> is like an unnamed list, and is written with <code>[]</code>. For example <code>[1, 2, 3]</code> is an array containing 3 numbers, and <code>[null, 1, "string", false]</code> is an array that contains a null, a number, a string, and a boolean. An <strong>object</strong> is like a named list, and is written with <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, <code>{"x": 1, "y": 2}</code> is an object that maps <code>x</code> to 1 and <code>y</code> to 2.</p>
</section> </section>
<section id="jsonlite" data-type="sect2"> <section id="jsonlite" data-type="sect2">
<h2> <h2>
jsonlite</h2> jsonlite</h2>
<p>To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. Well use only two jsonlite functions: <code><a href="#chp-https://rdrr.io/pkg/jsonlite/man/read_json" data-type="xref">#chp-https://rdrr.io/pkg/jsonlite/man/read_json</a></code> and <code><a href="#chp-https://rdrr.io/pkg/jsonlite/man/read_json" data-type="xref">#chp-https://rdrr.io/pkg/jsonlite/man/read_json</a></code>. In real life, youll use <code><a href="#chp-https://rdrr.io/pkg/jsonlite/man/read_json" data-type="xref">#chp-https://rdrr.io/pkg/jsonlite/man/read_json</a></code> to read a JSON file from disk. For example, the repurrsive package also provides the source for <code>gh_user</code> as a JSON file and you can read it with <code><a href="#chp-https://rdrr.io/pkg/jsonlite/man/read_json" data-type="xref">#chp-https://rdrr.io/pkg/jsonlite/man/read_json</a></code>:</p> <p>To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. Well use only two jsonlite functions: <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code> and <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>. In real life, youll use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code> to read a JSON file from disk. For example, the repurrsive package also provides the source for <code>gh_user</code> as a JSON file and you can read it with <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># A path to a json file inside the package: <pre data-type="programlisting" data-code-language="downlit"># A path to a json file inside the package:
gh_users_json() gh_users_json()
@ -1020,7 +1020,7 @@ gh_users2 &lt;- read_json(gh_users_json())
identical(gh_users, gh_users2) identical(gh_users, gh_users2)
#&gt; [1] TRUE</pre> #&gt; [1] TRUE</pre>
</div> </div>
<p>In this book, I’ll also use <code><a href="#chp-https://rdrr.io/pkg/jsonlite/man/read_json" data-type="xref">#chp-https://rdrr.io/pkg/jsonlite/man/read_json</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:</p> <p>In this book, I’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str(parse_json('1')) <pre data-type="programlisting" data-code-language="downlit">str(parse_json('1'))
#&gt; int 1 #&gt; int 1
@ -1036,7 +1036,7 @@ str(parse_json('{"x": [1, 2, 3]}'))
#&gt; ..$ : int 2 #&gt; ..$ : int 2
#&gt; ..$ : int 3</pre> #&gt; ..$ : int 3</pre>
</div> </div>
<p>jsonlite has another important function called <code><a href="#chp-https://rdrr.io/pkg/jsonlite/man/fromJSON" data-type="xref">#chp-https://rdrr.io/pkg/jsonlite/man/fromJSON</a></code>. We dont use it here because it performs automatic simplification (<code>simplifyVector = TRUE</code>). This often works well, particularly in simple cases, but we think youre better off doing the rectangling yourself so you know exactly whats happening and can more easily handle the most complicated nested structures.</p> <p>jsonlite has another important function called <code><a href="https://rdrr.io/pkg/jsonlite/man/fromJSON.html">fromJSON()</a></code>. We dont use it here because it performs automatic simplification (<code>simplifyVector = TRUE</code>). This often works well, particularly in simple cases, but we think youre better off doing the rectangling yourself so you know exactly whats happening and can more easily handle the most complicated nested structures.</p>
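<p>To make the effect of automatic simplification concrete, here is a minimal sketch that runs both functions on the same input. The variable <code>json</code> is ours and exists only for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(jsonlite)

json &lt;- '[1, 2, 3]'

# parse_json() keeps the JSON array as a list with one element per value
is.list(parse_json(json))

# fromJSON() simplifies the same array to an atomic vector by default
is.atomic(fromJSON(json))</pre>
</div>
<p>Both checks should come back <code>TRUE</code>: the same JSON produces a list in one case and a plain vector in the other, which is the kind of implicit decision we prefer to make ourselves when rectangling.</p>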
</section> </section>
<section id="starting-the-rectangling-process" data-type="sect2"> <section id="starting-the-rectangling-process" data-type="sect2">
@ -1107,7 +1107,7 @@ df |&gt;
<section id="translation-challenges" data-type="sect2"> <section id="translation-challenges" data-type="sect2">
<h2> <h2>
Translation challenges</h2> Translation challenges</h2>
<p>Since JSON doesn’t have any way to represent dates or date-times, they’re often stored as ISO8601 date times in strings, and you’ll need to use <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code> or <code><a href="#chp-https://readr.tidyverse.org/reference/parse_datetime" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_datetime</a></code> to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply <code><a href="#chp-https://readr.tidyverse.org/reference/parse_atomic" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_atomic</a></code> as needed to get the correct variable type.</p> <p>Since JSON doesn’t have any way to represent dates or date-times, they’re often stored as ISO8601 date times in strings, and you’ll need to use <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">readr::parse_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">readr::parse_datetime()</a></code> to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">readr::parse_double()</a></code> as needed to get the correct variable type.</p>
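<p>As a rough sketch of what that conversion looks like, the following parses a couple of made-up strings of the kind you might find in a JSON response. The values and the names <code>created</code> and <code>amounts</code> are hypothetical, not taken from any dataset in this chapter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readr)

# Hypothetical ISO8601 date-times, stored as strings
created &lt;- c("2022-11-18T10:30:32Z", "2021-01-05T08:00:00Z")
parse_datetime(created)

# Hypothetical numbers, stored as strings
amounts &lt;- c("1.5", "2e3")
parse_double(amounts)</pre>
</div>
<p>In a data frame you would typically wrap these calls in <code>mutate()</code> after unnesting, once the strings sit in ordinary columns.</p>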
</section> </section>
<section id="exercises-2" data-type="sect2"> <section id="exercises-2" data-type="sect2">
@ -1140,7 +1140,7 @@ df_row &lt;- tibble(json = json_row)</pre>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">
<h1> <h1>
Summary</h1> Summary</h1>
<p>In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly we only need two new functions: <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_longer</a></code> to put list elements into rows and <code><a href="#chp-https://tidyr.tidyverse.org/reference/unnest_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/unnest_wider</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.</p> <p>In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.</p>
<p>JSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.</p> <p>JSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.</p>

@ -44,7 +44,7 @@ library(babynames)</pre>
<section id="sec-reg-basics" data-type="sect1"> <section id="sec-reg-basics" data-type="sect1">
<h1> <h1>
Pattern basics</h1> Pattern basics</h1>
<p>Well use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> to learn how regex patterns work. We used <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> in the last chapter to better understand a string vs its printed representation, and now well use it with its second argument, a regular expression. When this is supplied, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> will show only the elements of the string vector that match, surrounding each match with <code>&lt;&gt;</code>, and, where possible, highlighting the match in blue.</p> <p>Well use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> to learn how regex patterns work. We used <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> in the last chapter to better understand a string vs its printed representation, and now well use it with its second argument, a regular expression. When this is supplied, <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> will show only the elements of the string vector that match, surrounding each match with <code>&lt;&gt;</code>, and, where possible, highlighting the match in blue.</p>
<p>The simplest patterns consist of letters and numbers which match those characters exactly:</p> <p>The simplest patterns consist of letters and numbers which match those characters exactly:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "berry") <pre data-type="programlisting" data-code-language="downlit">str_view(fruit, "berry")
@ -167,12 +167,12 @@ Key functions</h1>
<section id="detect-matches" data-type="sect2"> <section id="detect-matches" data-type="sect2">
<h2> <h2>
Detect matches</h2> Detect matches</h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p> <p><code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector that is <code>TRUE</code> if the pattern matched an element of the character vector and <code>FALSE</code> otherwise:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_detect(c("a", "b", "c"), "[aeiou]") <pre data-type="programlisting" data-code-language="downlit">str_detect(c("a", "b", "c"), "[aeiou]")
#&gt; [1] TRUE FALSE FALSE</pre> #&gt; [1] TRUE FALSE FALSE</pre>
</div> </div>
<p>Since <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> returns a logical vector of the same length as the initial vector, it pairs well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>. For example, this code finds all the most popular names containing a lower-case “x”:</p> <p>Since <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> returns a logical vector of the same length as the initial vector, it pairs well with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. For example, this code finds all the most popular names containing a lower-case “x”:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt; <pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
filter(str_detect(name, "x")) |&gt; filter(str_detect(name, "x")) |&gt;
@ -188,7 +188,7 @@ Detect matches</h2>
#&gt; 6 Alexa 123032 #&gt; 6 Alexa 123032
#&gt; # … with 968 more rows</pre> #&gt; # … with 968 more rows</pre>
</div> </div>
<p>We can also use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> by pairing it with <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code> or <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>: <code>sum(str_detect(x, pattern))</code> tells you the number of observations that match and <code>mean(str_detect(x, pattern))</code> tells you the proportion that match. For example, the following snippet computes and visualizes the proportion of baby names<span data-type="footnote">This gives us the proportion of <strong>names</strong> that contain an “x”; if you wanted the proportion of babies with a name containing an x, youd need to perform a weighted mean.</span> that contain “x”, broken down by year. It looks like theyve radically increased in popularity lately!</p> <p>We can also use <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> by pairing it with <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> or <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>: <code>sum(str_detect(x, pattern))</code> tells you the number of observations that match and <code>mean(str_detect(x, pattern))</code> tells you the proportion that match. 
For example, the following snippet computes and visualizes the proportion of baby names<span data-type="footnote">This gives us the proportion of <strong>names</strong> that contain an “x”; if you wanted the proportion of babies with a name containing an x, youd need to perform a weighted mean.</span> that contain “x”, broken down by year. It looks like theyve radically increased in popularity lately!</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt; <pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
group_by(year) |&gt; group_by(year) |&gt;
@ -202,7 +202,7 @@ Detect matches</h2>
</figure> </figure>
</div> </div>
</div> </div>
<p>There are two functions that are closely related to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code>, namely <code><a href="#chp-https://stringr.tidyverse.org/reference/str_subset" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_subset</a></code> which returns just the strings that contain a match and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_which" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_which</a></code> which returns the indexes of strings that have a match:</p> <p>There are two functions that are closely related to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>, namely <code><a href="https://stringr.tidyverse.org/reference/str_subset.html">str_subset()</a></code> which returns just the strings that contain a match and <code><a href="https://stringr.tidyverse.org/reference/str_which.html">str_which()</a></code> which returns the indexes of strings that have a match:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_subset(c("a", "b", "c"), "[aeiou]") <pre data-type="programlisting" data-code-language="downlit">str_subset(c("a", "b", "c"), "[aeiou]")
#&gt; [1] "a" #&gt; [1] "a"
@ -214,7 +214,7 @@ str_which(c("a", "b", "c"), "[aeiou]")
<section id="count-matches" data-type="sect2"> <section id="count-matches" data-type="sect2">
<h2> <h2>
Count matches</h2> Count matches</h2>
<p>The next step up in complexity from <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> is <code><a href="#chp-https://stringr.tidyverse.org/reference/str_count" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_count</a></code>: rather than a simple true or false, it tells you how many matches there are in each string.</p> <p>The next step up in complexity from <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> is <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code>: rather than a simple true or false, it tells you how many matches there are in each string.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "banana", "pear") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "banana", "pear")
str_count(x, "p") str_count(x, "p")
@ -227,7 +227,7 @@ str_count(x, "p")
str_view("abababa", "aba") str_view("abababa", "aba")
#&gt; [1] │ &lt;aba&gt;b&lt;aba&gt;</pre> #&gt; [1] │ &lt;aba&gt;b&lt;aba&gt;</pre>
</div> </div>
<p>Its natural to use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_count" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_count</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. The following example uses <code><a href="#chp-https://stringr.tidyverse.org/reference/str_count" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_count</a></code> with character classes to count the number of vowels and consonants in each name.</p> <p>Its natural to use <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. The following example uses <code><a href="https://stringr.tidyverse.org/reference/str_count.html">str_count()</a></code> with character classes to count the number of vowels and consonants in each name.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt; <pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
count(name) |&gt; count(name) |&gt;
@ -249,7 +249,7 @@ str_view("abababa", "aba")
<p>If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:</p> <p>If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this:</p>
<ul><li>Add the upper case vowels to the character class: <code>str_count(name, "[aeiouAEIOU]")</code>.</li> <ul><li>Add the upper case vowels to the character class: <code>str_count(name, "[aeiouAEIOU]")</code>.</li>
<li>Tell the regular expression to ignore case: <code>str_count(name, regex("[aeiou]", ignore_case = TRUE))</code>. We’ll talk more about this in <a href="#sec-flags" data-type="xref">#sec-flags</a>.</li> <li>Tell the regular expression to ignore case: <code>str_count(name, regex("[aeiou]", ignore_case = TRUE))</code>. We’ll talk more about this in <a href="#sec-flags" data-type="xref">#sec-flags</a>.</li>
<li>Use <code><a href="#chp-https://stringr.tidyverse.org/reference/case" data-type="xref">#chp-https://stringr.tidyverse.org/reference/case</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>. You learned about this function in <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a>.</li> <li>Use <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> to convert the names to lower case: <code>str_count(str_to_lower(name), "[aeiou]")</code>. You learned about this function in <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a>.</li>
</ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p> </ul><p>This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.</p>
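<p>As a quick check that the three approaches listed above agree, here is a small sketch on a two-element vector. “Aaban” comes from the output above; “Olivia” and the variable <code>name</code> are our own additions for illustration. Note that <code>regex()</code> wraps the pattern, not the string being searched:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

name &lt;- c("Aaban", "Olivia")

# 1. Add the upper case vowels to the character class
str_count(name, "[aeiouAEIOU]")
#&gt; [1] 3 4

# 2. Ask the regular expression to ignore case
str_count(name, regex("[aeiou]", ignore_case = TRUE))
#&gt; [1] 3 4

# 3. Convert the strings to lower case before matching
str_count(str_to_lower(name), "[aeiou]")
#&gt; [1] 3 4</pre>
</div>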
<p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p> <p>In this case, since we’re applying two functions to the name, I think it’s easier to transform it first:</p>
<div class="cell"> <div class="cell">
@ -276,25 +276,25 @@ str_view("abababa", "aba")
<section id="replace-values" data-type="sect2"> <section id="replace-values" data-type="sect2">
<h2> <h2>
Replace values</h2> Replace values</h2>
<p>As well as detecting and counting matches, we can also modify them with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code>. <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code> replaces the first match, and as the name suggests, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code> replaces all matches.</p> <p>As well as detecting and counting matches, we can also modify them with <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>. <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code> replaces the first match, and as the name suggests, <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code> replaces all matches.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "pear", "banana") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-") str_replace_all(x, "[aeiou]", "-")
#&gt; [1] "-ppl-" "p--r" "b-n-n-"</pre> #&gt; [1] "-ppl-" "p--r" "b-n-n-"</pre>
</div> </div>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_remove" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_remove</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_remove" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_remove</a></code> are handy shortcuts for <code>str_replace(x, pattern, "")</code> and <code>str_replace_all(x, pattern, "")</code>, respectively.</p> <p><code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_remove.html">str_remove_all()</a></code> are handy shortcuts for <code>str_replace(x, pattern, "")</code> and <code>str_replace_all(x, pattern, "")</code>, respectively.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "pear", "banana") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("apple", "pear", "banana")
str_remove_all(x, "[aeiou]") str_remove_all(x, "[aeiou]")
#&gt; [1] "ppl" "pr" "bnn"</pre> #&gt; [1] "ppl" "pr" "bnn"</pre>
</div> </div>
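<p>If you want to convince yourself that the remove functions really are shortcuts for replacing with an empty string, a small sketch like the following compares each pair directly, reusing the same <code>x</code> as above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

x &lt;- c("apple", "pear", "banana")

# Removing the first match is the same as replacing it with ""
identical(str_remove(x, "[aeiou]"), str_replace(x, "[aeiou]", ""))
#&gt; [1] TRUE

# Removing every match is the same as replacing every match with ""
identical(str_remove_all(x, "[aeiou]"), str_replace_all(x, "[aeiou]", ""))
#&gt; [1] TRUE</pre>
</div>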
<p>These functions are naturally paired with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> when doing data cleaning, and youll often apply them repeatedly to peel off layers of inconsistent formatting.</p> <p>These functions are naturally paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> when doing data cleaning, and youll often apply them repeatedly to peel off layers of inconsistent formatting.</p>
</section> </section>
<section id="extract-variables" data-type="sect2"> <section id="extract-variables" data-type="sect2">
<h2> <h2>
Extract variables</h2> Extract variables</h2>
<p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>. It’s a peer of the <code>separate_wider_position()</code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p> <p>The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. It’s a peer of the <code>separate_wider_position()</code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> functions that you learned about in <a href="#sec-string-columns" data-type="xref">#sec-string-columns</a>. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.</p>
<p>Lets create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that youd never see something this weird in real life, but unfortunately over the course of your career youre likely to see much weirder!</span>:</p> <p>Lets create a simple dataset to show how it works. Here we have some data derived from <code>babynames</code> where we have the name, gender, and age of a bunch of people in a rather weird format<span data-type="footnote">We wish we could reassure you that youd never see something this weird in real life, but unfortunately over the course of your career youre likely to see much weirder!</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
@ -308,7 +308,7 @@ Extract variables</h2>
"&lt;Patricia&gt;-F_84", "&lt;Patricia&gt;-F_84",
)</pre> )</pre>
</div> </div>
<p>To extract this data using <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:</p> <p>To extract this data using <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
separate_wider_regex( separate_wider_regex(
@ -330,7 +330,7 @@ Extract variables</h2>
#&gt; 6 Justin M 41 #&gt; 6 Justin M 41
#&gt; # … with 1 more row</pre> #&gt; # … with 1 more row</pre>
</div> </div>
<p>If the match fails, you can use <code>too_short = "debug"</code> to figure out what went wrong, just like <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>.</p> <p>If the match fails, you can use <code>too_short = "debug"</code> to figure out what went wrong, just like <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code>.</p>
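<p>As a rough sketch of the debugging workflow described above (the second input row and the exact patterns are assumptions made for illustration, not taken from the original example), a row that does not follow the expected layout can be inspected like this:</p>

```r
library(tidyr)
library(tibble)

# Hypothetical data: the second row does not follow the <name>-gender_age layout.
people <- tribble(
  ~str,
  "<Patricia>-F_84",
  "Kisha/F/45"
)

# Assumed patterns for the <name>-gender_age layout.
# Named pieces become columns; unnamed pieces are matched and dropped.
debugged <- people |>
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+"
    ),
    too_short = "debug"  # keep all rows and add diagnostic columns instead of erroring
  )

debugged
```

<p>With <code>too_short = "debug"</code> the call should keep both rows and add extra diagnostic columns next to <code>name</code>, <code>gender</code>, and <code>age</code>, which makes it easier to see where the match stopped for the malformed row.</p>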
</section> </section>
<section id="exercises-1" data-type="sect2"> <section id="exercises-1" data-type="sect2">
@ -338,7 +338,7 @@ Extract variables</h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li><p>What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)</p></li> <ol type="1"><li><p>What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)</p></li>
<li><p>Replace all forward slashes in a string with backslashes.</p></li> <li><p>Replace all forward slashes in a string with backslashes.</p></li>
<li><p>Implement a simple version of <code><a href="#chp-https://stringr.tidyverse.org/reference/case" data-type="xref">#chp-https://stringr.tidyverse.org/reference/case</a></code> using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code>.</p></li> <li><p>Implement a simple version of <code><a href="https://stringr.tidyverse.org/reference/case.html">str_to_lower()</a></code> using <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace_all()</a></code>.</p></li>
<li><p>Create a regular expression that will match telephone numbers as commonly written in your country.</p></li> <li><p>Create a regular expression that will match telephone numbers as commonly written in your country.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -415,7 +415,7 @@ str_view(fruit, "a$")
str_view(fruit, "^apple$") str_view(fruit, "^apple$")
#&gt; [1] │ &lt;apple&gt;</pre> #&gt; [1] │ &lt;apple&gt;</pre>
</div> </div>
<p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code> and so on:</p> <p>You can also match the boundary between words (i.e. the start or end of a word) with <code>\b</code>. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>, you can search for <code>\bsum\b</code> to avoid matching <code>summarise</code>, <code>summary</code>, <code>rowsum</code> and so on:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum") str_view(x, "sum")
@ -495,7 +495,7 @@ str_view(x, "\\S+")
<section id="sec-quantifiers" data-type="sect2"> <section id="sec-quantifiers" data-type="sect2">
<h2> <h2>
Quantifiers</h2> Quantifiers</h2>
<p><strong>Quantifiers</strong> control how many times a pattern matches. In <a href="#sec-reg-basics" data-type="xref">#sec-reg-basics</a> you learned about <code>?</code> (0 or 1 matches), <code>+</code> (1 or more matches), and <code>*</code> (0 or more matches). For example, <code>colou?r</code> will match American or British spelling, <code>\d+</code> will match one or more digits, and <code>\s?</code> will optionally match a single item of whitespace. You can also specify the number of matches precisely with <code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code>:</p> <p><strong>Quantifiers</strong> control how many times a pattern matches. In <a href="#sec-reg-basics" data-type="xref">#sec-reg-basics</a> you learned about <code>?</code> (0 or 1 matches), <code>+</code> (1 or more matches), and <code>*</code> (0 or more matches). For example, <code>colou?r</code> will match American or British spelling, <code>\d+</code> will match one or more digits, and <code>\s?</code> will optionally match a single item of whitespace. You can also specify the number of matches precisely with <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>:</p>
<ul><li> <ul><li>
<code>{n}</code> matches exactly n times.</li> <code>{n}</code> matches exactly n times.</li>
<li> <li>
@ -551,7 +551,7 @@ Grouping and capturing</h2>
#&gt; [699] │ &lt;require&gt; #&gt; [699] │ &lt;require&gt;
#&gt; [739] │ &lt;sense&gt;</pre> #&gt; [739] │ &lt;sense&gt;</pre>
</div> </div>
<p>You can also use back references in <code><a href="#chp-https://stringr.tidyverse.org/reference/str_replace" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_replace</a></code>. For example, this code switches the order of the second and third words in <code>sentences</code>:</p> <p>You can also use back references in <code><a href="https://stringr.tidyverse.org/reference/str_replace.html">str_replace()</a></code>. For example, this code switches the order of the second and third words in <code>sentences</code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sentences |&gt; <pre data-type="programlisting" data-code-language="downlit">sentences |&gt;
str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |&gt; str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |&gt;
@ -568,7 +568,7 @@ Grouping and capturing</h2>
#&gt; [10] │ A size large in stockings is hard to sell. #&gt; [10] │ A size large in stockings is hard to sell.
#&gt; ... and 710 more</pre> #&gt; ... and 710 more</pre>
</div> </div>
<p>If you want to extract the matches for each group you can use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_match" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_match</a></code>. But <code><a href="#chp-https://stringr.tidyverse.org/reference/str_match" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_match</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p> <p>If you want to extract the matches for each group you can use <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code>. But <code><a href="https://stringr.tidyverse.org/reference/str_match.html">str_match()</a></code> returns a matrix, so it’s not particularly easy to work with<span data-type="footnote">Mostly because we never discuss matrices in this book!</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sentences |&gt; <pre data-type="programlisting" data-code-language="downlit">sentences |&gt;
str_match("the (\\w+) (\\w+)") |&gt; str_match("the (\\w+) (\\w+)") |&gt;
@ -598,7 +598,7 @@ Grouping and capturing</h2>
#&gt; 6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt; #&gt; 6 &lt;NA&gt; &lt;NA&gt; &lt;NA&gt;
#&gt; # … with 714 more rows</pre> #&gt; # … with 714 more rows</pre>
</div> </div>
<p>But then you’ve basically recreated your own version of <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>. Indeed, behind the scenes, <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> converts your vector of patterns to a single regex that uses grouping to capture the named components.</p> <p>But then you’ve basically recreated your own version of <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>. Indeed, behind the scenes, <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code> converts your vector of patterns to a single regex that uses grouping to capture the named components.</p>
<p>Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with <code>(?:)</code>.</p> <p>Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with <code>(?:)</code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("a gray cat", "a grey dog") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("a gray cat", "a grey dog")
@ -619,11 +619,11 @@ Exercises</h2>
<ol type="1"><li><p>How would you match the literal string <code>"'\</code>? How about <code>"$^$"</code>?</p></li> <ol type="1"><li><p>How would you match the literal string <code>"'\</code>? How about <code>"$^$"</code>?</p></li>
<li><p>Explain why each of these patterns doesn’t match a <code>\</code>: <code>"\"</code>, <code>"\\"</code>, <code>"\\\"</code>.</p></li> <li><p>Explain why each of these patterns doesn’t match a <code>\</code>: <code>"\"</code>, <code>"\\"</code>, <code>"\\\"</code>.</p></li>
<li> <li>
<p>Given the corpus of common words in <code><a href="#chp-https://stringr.tidyverse.org/reference/stringr-data" data-type="xref">#chp-https://stringr.tidyverse.org/reference/stringr-data</a></code>, create regular expressions that find all words that:</p> <p>Given the corpus of common words in <code><a href="https://stringr.tidyverse.org/reference/stringr-data.html">stringr::words</a></code>, create regular expressions that find all words that:</p>
<ol type="a"><li>Start with “y”.</li> <ol type="a"><li>Start with “y”.</li>
<li>Don’t start with “y”.</li> <li>Don’t start with “y”.</li>
<li>End with “x”.</li> <li>End with “x”.</li>
<li>Are exactly three letters long. (Don’t cheat by using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code>!)</li> <li>Are exactly three letters long. (Don’t cheat by using <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code>!)</li>
<li>Have seven letters or more.</li> <li>Have seven letters or more.</li>
<li>Contain a vowel-consonant pair.</li> <li>Contain a vowel-consonant pair.</li>
<li>Contain at least two vowel-consonant pairs in a row.</li> <li>Contain at least two vowel-consonant pairs in a row.</li>
@ -653,7 +653,7 @@ Pattern control</h1>
<section id="sec-flags" data-type="sect2"> <section id="sec-flags" data-type="sect2">
<h2> <h2>
Regex flags</h2> Regex flags</h2>
<p>There are a number of settings that you can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p> <p>There are a number of settings that you can use to control the details of the regexp. These settings are often called <strong>flags</strong> in other programming languages. In stringr, you can use these by wrapping the pattern in a call to <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code>. The most useful flag is probably <code>ignore_case = TRUE</code> because it allows characters to match either their uppercase or lowercase forms:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">bananas &lt;- c("banana", "Banana", "BANANA") <pre data-type="programlisting" data-code-language="downlit">bananas &lt;- c("banana", "Banana", "BANANA")
str_view(bananas, "banana") str_view(bananas, "banana")
@ -719,19 +719,19 @@ str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
<section id="fixed-matches" data-type="sect2"> <section id="fixed-matches" data-type="sect2">
<h2> <h2>
Fixed matches</h2> Fixed matches</h2>
<p>You can opt-out of the regular expression rules by using <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code>:</p> <p>You can opt-out of the regular expression rules by using <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view(c("", "a", "."), fixed(".")) <pre data-type="programlisting" data-code-language="downlit">str_view(c("", "a", "."), fixed("."))
#&gt; [3] │ &lt;.&gt;</pre> #&gt; [3] │ &lt;.&gt;</pre>
</div> </div>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code> also gives you the ability to ignore case:</p> <p><code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code> also gives you the ability to ignore case:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view("x X", "X") <pre data-type="programlisting" data-code-language="downlit">str_view("x X", "X")
#&gt; [1] │ x &lt;X&gt; #&gt; [1] │ x &lt;X&gt;
str_view("x X", fixed("X", ignore_case = TRUE)) str_view("x X", fixed("X", ignore_case = TRUE))
#&gt; [1] │ &lt;x&gt; &lt;X&gt;</pre> #&gt; [1] │ &lt;x&gt; &lt;X&gt;</pre>
</div> </div>
<p>If you’re working with non-English text, you will probably want <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code> instead of <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code>, as it implements the full rules for capitalization as used by the <code>locale</code> you specify. See <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a> for more details on locales.</p> <p>If you’re working with non-English text, you will probably want <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">coll()</a></code> instead of <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">fixed()</a></code>, as it implements the full rules for capitalization as used by the <code>locale</code> you specify. See <a href="#sec-other-languages" data-type="xref">#sec-other-languages</a> for more details on locales.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_view("i İ ı I", fixed("İ", ignore_case = TRUE)) <pre data-type="programlisting" data-code-language="downlit">str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
#&gt; [1] │ i &lt;İ&gt; ı I #&gt; [1] │ i &lt;İ&gt; ı I
@ -864,7 +864,7 @@ Boolean operations</h2>
#&gt; [71] │ &lt;ba&gt;ll #&gt; [71] │ &lt;ba&gt;ll
#&gt; ... and 20 more</pre> #&gt; ... and 20 more</pre>
</div> </div>
<p>It’s simpler to combine the results of two calls to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code>:</p> <p>It’s simpler to combine the results of two calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">words[str_detect(words, "a") &amp; str_detect(words, "b")] <pre data-type="programlisting" data-code-language="downlit">words[str_detect(words, "a") &amp; str_detect(words, "b")]
#&gt; [1] "able" "about" "absolute" "available" "baby" "back" #&gt; [1] "able" "about" "absolute" "available" "baby" "back"
@ -879,7 +879,7 @@ Boolean operations</h2>
# ... # ...
words[str_detect(words, "u.*o.*i.*e.*a")]</pre> words[str_detect(words, "u.*o.*i.*e.*a")]</pre>
</div> </div>
<p>It’s much simpler to combine five calls to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code>:</p> <p>It’s much simpler to combine five calls to <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">words[ <pre data-type="programlisting" data-code-language="downlit">words[
str_detect(words, "a") &amp; str_detect(words, "a") &amp;
@ -915,7 +915,7 @@ Creating a pattern with code</h2>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rgb &lt;- c("red", "green", "blue")</pre> <pre data-type="programlisting" data-code-language="downlit">rgb &lt;- c("red", "green", "blue")</pre>
</div> </div>
<p>Well, we can! We’d just need to create the pattern from the vector using <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code>:</p> <p>Well, we can! We’d just need to create the pattern from the vector using <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_c("\\b(", str_flatten(rgb, "|"), ")\\b") <pre data-type="programlisting" data-code-language="downlit">str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
#&gt; [1] "\\b(red|green|blue)\\b"</pre> #&gt; [1] "\\b(red|green|blue)\\b"</pre>
@ -968,21 +968,21 @@ str_view(sentences, pattern)
#&gt; [167] │ The office paint was a dull, sad &lt;tan&gt;. #&gt; [167] │ The office paint was a dull, sad &lt;tan&gt;.
#&gt; ... and 53 more</pre> #&gt; ... and 53 more</pre>
</div> </div>
<p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="#chp-https://stringr.tidyverse.org/reference/str_escape" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_escape</a></code> to ensure they match literally.</p> <p>In this example, <code>cols</code> only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through <code><a href="https://stringr.tidyverse.org/reference/str_escape.html">str_escape()</a></code> to ensure they match literally.</p>
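<p>A minimal sketch of why escaping matters, assuming a stringr version that provides <code>str_escape()</code>; the <code>literals</code> vector and the test strings are made up for illustration:</p>

```r
library(stringr)

# Hypothetical inputs that contain regex metacharacters ("." and "+").
literals <- c("a.b", "1+1")

# str_escape() backslash-escapes metacharacters so each string matches itself literally.
pattern <- str_c("(", str_flatten(str_escape(literals), "|"), ")")

# Only the exact literals match; "axb" and "11" do not.
hits <- str_detect(c("a.b", "axb", "1+1", "11"), pattern)
hits
```

<p>Without the call to <code>str_escape()</code>, the unescaped <code>.</code> would also match the “x” in “axb”, and <code>1+1</code> would match “11”.</p>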
</section> </section>
<section id="exercises-3" data-type="sect2"> <section id="exercises-3" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li> <ol type="1"><li>
<p>For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple <code><a href="#chp-https://stringr.tidyverse.org/reference/str_detect" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_detect</a></code> calls.</p> <p>For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple <code><a href="https://stringr.tidyverse.org/reference/str_detect.html">str_detect()</a></code> calls.</p>
<ol type="a"><li>Find all <code>words</code> that start or end with <code>x</code>.</li> <ol type="a"><li>Find all <code>words</code> that start or end with <code>x</code>.</li>
<li>Find all <code>words</code> that start with a vowel and end with a consonant.</li> <li>Find all <code>words</code> that start with a vowel and end with a consonant.</li>
<li>Are there any <code>words</code> that contain at least one of each different vowel?</li> <li>Are there any <code>words</code> that contain at least one of each different vowel?</li>
</ol></li> </ol></li>
<li><p>Construct patterns to find evidence for and against the rule “i before e except after c”?</p></li> <li><p>Construct patterns to find evidence for and against the rule “i before e except after c”?</p></li>
<li><p><code><a href="#chp-https://rdrr.io/r/grDevices/colors" data-type="xref">#chp-https://rdrr.io/r/grDevices/colors</a></code> contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).</p></li> <li><p><code><a href="https://rdrr.io/r/grDevices/colors.html">colors()</a></code> contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).</p></li>
<li><p>Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the <code><a href="#chp-https://rdrr.io/r/utils/data" data-type="xref">#chp-https://rdrr.io/r/utils/data</a></code> function: <code>data(package = "datasets")$results[, "Item"]</code>. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.</p></li> <li><p>Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the <code><a href="https://rdrr.io/r/utils/data.html">data()</a></code> function: <code>data(package = "datasets")$results[, "Item"]</code>. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.</p></li>
</ol></section> </ol></section>
</section> </section>
@ -995,9 +995,9 @@ Regular expressions in other places</h1>
<h2> <h2>
tidyverse</h2> tidyverse</h2>
<p>There are three other particularly useful places where you might want to use regular expressions:</p> <p>There are three other particularly useful places where you might want to use regular expressions:</p>
<ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/across" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/across</a></code>).</p></li> <ul><li><p><code>matches(pattern)</code> will select all variables whose name matches the supplied pattern. It’s a “tidyselect” function that you can use in any tidyverse function that selects variables (e.g. <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename_with()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code>).</p></li>
<li><p><code>pivot_longer()</code>’s <code>names_pattern</code> argument takes a vector of regular expressions, just like <code>separate_wider_regex()</code>. It’s useful when extracting data out of variable names with a complex structure.</p></li> <li><p><code>pivot_longer()</code>’s <code>names_pattern</code> argument takes a vector of regular expressions, just like <code>separate_wider_regex()</code>. It’s useful when extracting data out of variable names with a complex structure.</p></li>
<li><p>The <code>delim</code> argument in <code>separate_longer_delim()</code> and <code>separate_wider_delim()</code> usually matches a fixed string, but you can use <code><a href="#chp-https://stringr.tidyverse.org/reference/modifiers" data-type="xref">#chp-https://stringr.tidyverse.org/reference/modifiers</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li> <li><p>The <code>delim</code> argument in <code>separate_longer_delim()</code> and <code>separate_wider_delim()</code> usually matches a fixed string, but you can use <code><a href="https://stringr.tidyverse.org/reference/modifiers.html">regex()</a></code> to make it match a pattern. This is useful, for example, if you want to match a comma that is optionally followed by a space, i.e. <code>regex(", ?")</code>.</p></li>
</ul></section> </ul></section>
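<p>The <code>regex(", ?")</code> delimiter mentioned in the last bullet can be sketched as follows, using tidyr’s <code>separate_wider_delim()</code>; the <code>pairs</code> data frame and the output column names are assumptions made for illustration:</p>

```r
library(tidyr)
library(tibble)
library(stringr)

# Hypothetical data: the comma is sometimes followed by a space.
pairs <- tibble(x = c("a,b", "c, d"))

parts <- pairs |>
  separate_wider_delim(
    x,
    delim = regex(", ?"),          # a comma optionally followed by one space
    names = c("first", "second")
  )

parts
```

<p>With a plain <code>delim = ","</code> the second row would keep its leading space in <code>second</code>; wrapping the delimiter in <code>regex()</code> consumes the optional space as part of the separator.</p>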
<section id="base-r" data-type="sect2"> <section id="base-r" data-type="sect2">
@ -1014,7 +1014,7 @@ Base R</h2>
<pre data-type="programlisting" data-code-language="downlit">head(list.files(pattern = "\\.Rmd$")) <pre data-type="programlisting" data-code-language="downlit">head(list.files(pattern = "\\.Rmd$"))
#&gt; character(0)</pre> #&gt; character(0)</pre>
</div> </div>
<p>It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the <a href="#chp-https://stringi.gagolewski" data-type="xref">#chp-https://stringi.gagolewski</a>, which is in turn built on top of the <a href="#chp-https://unicode-org.github.io/icu/userguide/strings/regexp" data-type="xref">#chp-https://unicode-org.github.io/icu/userguide/strings/regexp</a>, whereas base R functions use either the <a href="#chp-https://github.com/laurikari/tre" data-type="xref">#chp-https://github.com/laurikari/tre</a> or the <a href="#chp-https://www.pcre" data-type="xref">#chp-https://www.pcre</a>, depending on whether or not you’ve set <code>perl = TRUE</code>. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the <code>(?…)</code> syntax.</p> <p>It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the <a href="https://stringi.gagolewski.com">stringi package</a>, which is in turn built on top of the <a href="https://unicode-org.github.io/icu/userguide/strings/regexp.html">ICU engine</a>, whereas base R functions use either the <a href="https://github.com/laurikari/tre">TRE engine</a> or the <a href="https://www.pcre.org">PCRE engine</a>, depending on whether or not you’ve set <code>perl = TRUE</code>. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the <code>(?…)</code> syntax.</p>
</section> </section>
</section> </section>
@ -1023,7 +1023,7 @@ Base R</h2>
Summary</h1> Summary</h1>
<p>With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. Theyre definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.</p> <p>With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. Theyre definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.</p>
<p>In this chapter, youve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.</p> <p>In this chapter, youve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.</p>
<p>A good place to start is <code><a href="#chp-https://stringr.tidyverse.org/articles/regular-expressions" data-type="xref">#chp-https://stringr.tidyverse.org/articles/regular-expressions</a></code>: it documents the full set of syntax supported by stringr. Another useful reference is <a href="https://www.regular-expressions.info/tutorial.html">https://www.regular-expressions.info/</a>. Its not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.</p> <p>A good place to start is <code><a href="https://stringr.tidyverse.org/articles/regular-expressions.html">vignette("regular-expressions", package = "stringr")</a></code>: it documents the full set of syntax supported by stringr. Another useful reference is <a href="https://www.regular-expressions.info/tutorial.html">https://www.regular-expressions.info/</a>. Its not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.</p>
<p>It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.</p> <p>It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.</p>
<p>In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.</p> <p>In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.</p>


@ -35,12 +35,12 @@ library(tidyverse)</pre>
Getting started</h2> Getting started</h2>
<p>Most of readxl’s functions allow you to load Excel spreadsheets into R:</p> <p>Most of readxl’s functions allow you to load Excel spreadsheets into R:</p>
<ul><li> <ul><li>
<code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> reads Excel files with <code>xls</code> format.</li> <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_xls()</a></code> reads Excel files with <code>xls</code> format.</li>
<li> <li>
<code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> reads Excel files with <code>xlsx</code> format.</li> <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_xlsx()</a></code> reads Excel files with <code>xlsx</code> format.</li>
<li> <li>
<code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> can read files with both <code>xls</code> and <code>xlsx</code> format. It guesses the file type based on the input.</li> <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> can read files with both <code>xls</code> and <code>xlsx</code> format. It guesses the file type based on the input.</li>
</ul><p>These functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g. <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code>, <code><a href="#chp-https://readr.tidyverse.org/reference/read_table" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_table</a></code>, etc. For the rest of the chapter we will focus on using <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>.</p> </ul><p>These functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g. <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code>, <code><a href="https://readr.tidyverse.org/reference/read_table.html">read_table()</a></code>, etc. For the rest of the chapter we will focus on using <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
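<p>As a minimal sketch of how the three readers relate, the code below reads the same example data in both formats. It assumes readxl is installed and relies on the <code>datasets.xlsx</code> and <code>datasets.xls</code> example workbooks that ship with the package; the object names are illustrative only.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)

# Example workbooks bundled with readxl (not the book's own data files)
xlsx_path &lt;- readxl_example("datasets.xlsx")
xls_path &lt;- readxl_example("datasets.xls")

# Format-specific readers
from_xlsx &lt;- read_xlsx(xlsx_path)
from_xls &lt;- read_xls(xls_path)

# read_excel() works out the format itself and dispatches accordingly
guess_xlsx &lt;- read_excel(xlsx_path)
guess_xls &lt;- read_excel(xls_path)

# Both routes should give data frames of the same shape
identical(dim(from_xlsx), dim(guess_xlsx))
identical(dim(from_xls), dim(guess_xls))</pre>
</div>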
</section> </section>
<section id="sec-reading-spreadsheets" data-type="sect2"> <section id="sec-reading-spreadsheets" data-type="sect2">
@ -55,11 +55,11 @@ Reading spreadsheets</h2>
</figure> </figure>
</div> </div>
</div> </div>
<p>The first argument to <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> is the path to the file to read.</p> <p>The first argument to <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> is the path to the file to read.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students &lt;- read_excel("data/students.xlsx")</pre> <pre data-type="programlisting" data-code-language="downlit">students &lt;- read_excel("data/students.xlsx")</pre>
</div> </div>
<p><code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> will read the file in as a tibble.</p> <p><code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> will read the file in as a tibble.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">students <pre data-type="programlisting" data-code-language="downlit">students
#&gt; # A tibble: 6 × 5 #&gt; # A tibble: 6 × 5
@ -130,7 +130,7 @@ Reading spreadsheets</h2>
</div> </div>
</li> </li>
<li> <li>
<p>One other remaining issue is that <code>age</code> is read in as a character variable, but it really should be numeric. Just like with <code><a href="#chp-https://readr.tidyverse.org/reference/read_delim" data-type="xref">#chp-https://readr.tidyverse.org/reference/read_delim</a></code> and friends for reading data from flat files, you can supply a <code>col_types</code> argument to <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are <code>"skip"</code>, <code>"guess"</code>, <code>"logical"</code>, <code>"numeric"</code>, <code>"date"</code>, <code>"text"</code> or <code>"list"</code>.</p> <p>One other remaining issue is that <code>age</code> is read in as a character variable, but it really should be numeric. Just like with <code><a href="https://readr.tidyverse.org/reference/read_delim.html">read_csv()</a></code> and friends for reading data from flat files, you can supply a <code>col_types</code> argument to <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are <code>"skip"</code>, <code>"guess"</code>, <code>"logical"</code>, <code>"numeric"</code>, <code>"date"</code>, <code>"text"</code> or <code>"list"</code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel( <pre data-type="programlisting" data-code-language="downlit">read_excel(
"data/students.xlsx", "data/students.xlsx",
@ -193,7 +193,7 @@ Reading individual sheets</h2>
</figure> </figure>
</div> </div>
</div> </div>
<p>You can read a single sheet from a spreadsheet with the <code>sheet</code> argument in <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>.</p> <p>You can read a single sheet from a spreadsheet with the <code>sheet</code> argument in <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island") <pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
#&gt; # A tibble: 52 × 8 #&gt; # A tibble: 52 × 8
@ -225,12 +225,12 @@ penguins_torgersen
#&gt; # … with 46 more rows, and abbreviated variable names ¹flipper_length_mm, #&gt; # … with 46 more rows, and abbreviated variable names ¹flipper_length_mm,
#&gt; # ²body_mass_g</pre> #&gt; # ²body_mass_g</pre>
</div> </div>
<p>However, we cheated here a bit. We looked inside the Excel spreadsheet, which is not a recommended workflow. Instead, you can use <code><a href="#chp-https://readxl.tidyverse.org/reference/excel_sheets" data-type="xref">#chp-https://readxl.tidyverse.org/reference/excel_sheets</a></code> to get information on all sheets in an Excel spreadsheet, and then read the one(s) you’re interested in.</p> <p>However, we cheated here a bit. We looked inside the Excel spreadsheet, which is not a recommended workflow. Instead, you can use <code><a href="https://readxl.tidyverse.org/reference/excel_sheets.html">excel_sheets()</a></code> to get information on all sheets in an Excel spreadsheet, and then read the one(s) you’re interested in.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">excel_sheets("data/penguins.xlsx") <pre data-type="programlisting" data-code-language="downlit">excel_sheets("data/penguins.xlsx")
#&gt; [1] "Torgersen Island" "Biscoe Island" "Dream Island"</pre> #&gt; [1] "Torgersen Island" "Biscoe Island" "Dream Island"</pre>
</div> </div>
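<p>Because <code>excel_sheets()</code> returns a plain character vector, its result can be passed straight to the <code>sheet</code> argument. The sketch below illustrates this with the <code>datasets.xlsx</code> workbook bundled with readxl rather than the penguins file, so it should run without the book’s data; the variable names are assumptions made for the example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(readxl)

# Bundled example workbook (an assumption: readxl is installed)
path &lt;- readxl_example("datasets.xlsx")

# Sheet names come back as a character vector
sheets &lt;- excel_sheets(path)

# Read one sheet by name, taken from the vector rather than typed by hand
first_sheet &lt;- read_excel(path, sheet = sheets[1])

# Or read every sheet into a named list of tibbles
all_sheets &lt;- lapply(sheets, function(s) read_excel(path, sheet = s))
names(all_sheets) &lt;- sheets</pre>
</div>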
<p>Once you know the names of the sheets, you can read them in individually with <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>.</p> <p>Once you know the names of the sheets, you can read them in individually with <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins_biscoe &lt;- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA") <pre data-type="programlisting" data-code-language="downlit">penguins_biscoe &lt;- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
penguins_dream &lt;- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")</pre> penguins_dream &lt;- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")</pre>
@ -244,7 +244,7 @@ dim(penguins_biscoe)
dim(penguins_dream) dim(penguins_dream)
#&gt; [1] 124 8</pre> #&gt; [1] 124 8</pre>
</div> </div>
<p>We can put them together with <code><a href="#chp-https://dplyr.tidyverse.org/reference/bind_rows" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/bind_rows</a></code>.</p> <p>We can put them together with <code><a href="https://dplyr.tidyverse.org/reference/bind_rows.html">bind_rows()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream) <pre data-type="programlisting" data-code-language="downlit">penguins &lt;- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins penguins
@ -275,7 +275,7 @@ Reading part of a sheet</h2>
</figure> </figure>
</div> </div>
</div> </div>
<p>This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the <code><a href="#chp-https://readxl.tidyverse.org/reference/readxl_example" data-type="xref">#chp-https://readxl.tidyverse.org/reference/readxl_example</a></code> function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code> as usual.</p> <p>This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the <code><a href="https://readxl.tidyverse.org/reference/readxl_example.html">readxl_example()</a></code> function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_excel()</a></code> as usual.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">deaths_path &lt;- readxl_example("deaths.xlsx") <pre data-type="programlisting" data-code-language="downlit">deaths_path &lt;- readxl_example("deaths.xlsx")
deaths &lt;- read_excel(deaths_path) deaths &lt;- read_excel(deaths_path)
@ -389,7 +389,7 @@ bake_sale
#&gt; 2 cupcake 5 #&gt; 2 cupcake 5
#&gt; 3 cookie 8</pre> #&gt; 3 cookie 8</pre>
</div> </div>
<p>You can write data back to disk as an Excel file using the <code><a href="#chp-https://docs.ropensci.org/writexl/reference/write_xlsx" data-type="xref">#chp-https://docs.ropensci.org/writexl/reference/write_xlsx</a></code> from the <strong>writexl</strong> package.</p> <p>You can write data back to disk as an Excel file using the <code><a href="https://docs.ropensci.org/writexl/reference/write_xlsx.html">write_xlsx()</a></code> from the <strong>writexl</strong> package.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(writexl) <pre data-type="programlisting" data-code-language="downlit">library(writexl)
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre> write_xlsx(bake_sale, path = "data/bake-sale.xlsx")</pre>
@ -469,7 +469,7 @@ writeDataTable(
#&gt; Active Sheet 1: "Adelie" #&gt; Active Sheet 1: "Adelie"
#&gt; Position: 1</pre> #&gt; Position: 1</pre>
</div> </div>
<p>And we can write this to disk with <code><a href="#chp-https://rdrr.io/pkg/openxlsx/man/saveWorkbook" data-type="xref">#chp-https://rdrr.io/pkg/openxlsx/man/saveWorkbook</a></code>.</p> <p>And we can write this to disk with <code><a href="https://rdrr.io/pkg/openxlsx/man/saveWorkbook.html">saveWorkbook()</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre> <pre data-type="programlisting" data-code-language="downlit">saveWorkbook(penguins_species, "data/penguins-species.xlsx")</pre>
</div> </div>
@ -488,8 +488,8 @@ writeDataTable(
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li>Recreate the <code>bake_sale</code> data frame, write it out to an Excel file using the <code><a href="#chp-https://rdrr.io/pkg/openxlsx/man/write.xlsx" data-type="xref">#chp-https://rdrr.io/pkg/openxlsx/man/write.xlsx</a></code> function from the openxlsx package.</li> <ol type="1"><li>Recreate the <code>bake_sale</code> data frame, write it out to an Excel file using the <code><a href="https://rdrr.io/pkg/openxlsx/man/write.xlsx.html">write.xlsx()</a></code> function from the openxlsx package.</li>
<li>What happens if you try to read in a file with <code>.xlsx</code> extension with <code><a href="#chp-https://readxl.tidyverse.org/reference/read_excel" data-type="xref">#chp-https://readxl.tidyverse.org/reference/read_excel</a></code>?</li> <li>What happens if you try to read in a file with <code>.xlsx</code> extension with <code><a href="https://readxl.tidyverse.org/reference/read_excel.html">read_xls()</a></code>?</li>
</ol><!--# Need moar exercises --></section> </ol><!--# Need moar exercises --></section>
</section> </section>


@ -44,7 +44,7 @@ library(babynames)</pre>
<section id="creating-a-string" data-type="sect1"> <section id="creating-a-string" data-type="sect1">
<h1> <h1>
Creating a string</h1> Creating a string</h1>
<p>We’ve created strings in passing earlier in the book, but didn’t discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two so in the interests of consistency the <a href="#character-vectors" data-type="xref">#character-vectors</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p> <p>We’ve created strings in passing earlier in the book, but didn’t discuss the details. Firstly, you can create a string using either single quotes (<code>'</code>) or double quotes (<code>"</code>). There’s no difference in behavior between the two so in the interests of consistency the <a href="https://style.tidyverse.org/syntax.html#character-vectors">tidyverse style guide</a> recommends using <code>"</code>, unless the string contains multiple <code>"</code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">string1 &lt;- "This is a string" <pre data-type="programlisting" data-code-language="downlit">string1 &lt;- "This is a string"
string2 &lt;- 'If I want to include a "quote" inside a string, I use single quotes'</pre> string2 &lt;- 'If I want to include a "quote" inside a string, I use single quotes'</pre>
@ -68,7 +68,7 @@ single_quote &lt;- '\'' # or "'"</pre>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">backslash &lt;- "\\"</pre> <pre data-type="programlisting" data-code-language="downlit">backslash &lt;- "\\"</pre>
</div> </div>
<p>Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code><span data-type="footnote">Or use the base R function <code><a href="#chp-https://rdrr.io/r/base/writeLines" data-type="xref">#chp-https://rdrr.io/r/base/writeLines</a></code>.</span>:</p> <p>Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code><span data-type="footnote">Or use the base R function <code><a href="https://rdrr.io/r/base/writeLines.html">writeLines()</a></code>.</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(single_quote, double_quote, backslash) <pre data-type="programlisting" data-code-language="downlit">x &lt;- c(single_quote, double_quote, backslash)
x x
@ -92,7 +92,7 @@ str_view(tricky)
#&gt; [1] │ double_quote &lt;- "\"" # or '"' #&gt; [1] │ double_quote &lt;- "\"" # or '"'
#&gt; │ single_quote &lt;- '\'' # or "'"</pre> #&gt; │ single_quote &lt;- '\'' # or "'"</pre>
</div> </div>
<p>That’s a lot of backslashes! (This is sometimes called <a href="#chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome" data-type="xref">#chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome</a>.) To eliminate the escaping you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p> <p>That’s a lot of backslashes! (This is sometimes called <a href="https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome">leaning toothpick syndrome</a>.) To eliminate the escaping you can instead use a <strong>raw string</strong><span data-type="footnote">Available in R 4.0.0 and above.</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tricky &lt;- r"(double_quote &lt;- "\"" # or '"' <pre data-type="programlisting" data-code-language="downlit">tricky &lt;- r"(double_quote &lt;- "\"" # or '"'
single_quote &lt;- '\'' # or "'")" single_quote &lt;- '\'' # or "'")"
@ -106,7 +106,7 @@ str_view(tricky)
<section id="other-special-characters" data-type="sect2"> <section id="other-special-characters" data-type="sect2">
<h2> <h2>
Other special characters</h2> Other special characters</h2>
<p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code> there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="#chp-https://rdrr.io/r/base/Quotes" data-type="xref">#chp-https://rdrr.io/r/base/Quotes</a></code>.</p> <p>As well as <code>\"</code>, <code>\'</code>, and <code>\\</code> there are a handful of other special characters that may come in handy. The most common are <code>\n</code>, newline, and <code>\t</code>, tab. You’ll also sometimes see strings containing Unicode escapes that start with <code>\u</code> or <code>\U</code>. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in <code><a href="https://rdrr.io/r/base/Quotes.html">?'"'</a></code>.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604") <pre data-type="programlisting" data-code-language="downlit">x &lt;- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
x x
@ -118,7 +118,7 @@ str_view(x)
#&gt; [3] │ µ #&gt; [3] │ µ
#&gt; [4] │ 😄</pre> #&gt; [4] │ 😄</pre>
</div> </div>
<p>Note that <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.</p> <p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.</p>
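<p>One way to convince yourself that an escape sequence stands for a single character is to count characters with base R’s <code>nchar()</code>. The following is a small sketch using only base R; the vector <code>escapes</code> is made up for the illustration.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Each escape (\n, \\, \u00b5) is written with several keystrokes
# but is stored as one character
escapes &lt;- c("one\ntwo", "a\\b", "\u00b5")

nchar(escapes)
#&gt; [1] 7 3 1

# print() shows the escaped form; writeLines() shows the raw contents
print(escapes)
writeLines(escapes)</pre>
</div>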
</section> </section>
<section id="exercises" data-type="sect2"> <section id="exercises" data-type="sect2">
@ -131,7 +131,7 @@ Exercises</h2>
<li><p><code>\\\\\\</code></p></li> <li><p><code>\\\\\\</code></p></li>
</ol></li> </ol></li>
<li> <li>
<p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="#chp-https://stringr.tidyverse.org/reference/str_view" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_view</a></code> display it? Can you do a little googling to figure out what this special character is?</p> <p>Create the string in your R session and print it. What happens to the special “\u00a0”? How does <code><a href="https://stringr.tidyverse.org/reference/str_view.html">str_view()</a></code> display it? Can you do a little googling to figure out what this special character is?</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- "This\u00a0is\u00a0tricky"</pre> <pre data-type="programlisting" data-code-language="downlit">x &lt;- "This\u00a0is\u00a0tricky"</pre>
</div> </div>
@ -142,13 +142,13 @@ Exercises</h2>
<section id="creating-many-strings-from-data" data-type="sect1"> <section id="creating-many-strings-from-data" data-type="sect1">
<h1> <h1>
Creating many strings from data</h1> Creating many strings from data</h1>
<p>Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. For example, to create a greeting you might combine “Hello” with a <code>name</code> variable. We’ll show you how to do this with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> and how you can use them with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>. That naturally raises the question of what string functions you might use with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, so we’ll finish this section with a discussion of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code> which is a summary function for strings.</p> <p>Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. For example, to create a greeting you might combine “Hello” with a <code>name</code> variable. 
We’ll show you how to do this with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> and how you can use them with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. That naturally raises the question of what string functions you might use with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, so we’ll finish this section with a discussion of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code> which is a summary function for strings.</p>
<section id="str_c" data-type="sect2"> <section id="str_c" data-type="sect2">
<h2> <h2>
<code>str_c()</code> <code>str_c()</code>
</h2> </h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code><span data-type="footnote"><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> is very similar to the base <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code>. There are two main reasons we recommend it: it propagates <code>NA</code>s (rather than converting them to <code>"NA"</code>) and it uses the tidyverse recycling rules.</span> takes any number of vectors as arguments and returns a character vector:</p> <p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code><span data-type="footnote"><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is very similar to the base <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code>. There are two main reasons we recommend it: it propagates <code>NA</code>s (rather than converting them to <code>"NA"</code>) and it uses the tidyverse recycling rules.</span> takes any number of vectors as arguments and returns a character vector:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_c("x", "y") <pre data-type="programlisting" data-code-language="downlit">str_c("x", "y")
#&gt; [1] "xy" #&gt; [1] "xy"
@ -157,7 +157,7 @@ str_c("x", "y", "z")
str_c("Hello ", c("John", "Susan")) str_c("Hello ", c("John", "Susan"))
#&gt; [1] "Hello John" "Hello Susan"</pre> #&gt; [1] "Hello John" "Hello Susan"</pre>
</div> </div>
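<p>The footnote’s point about missing values can be seen by running the two functions side by side. This is a minimal sketch assuming stringr is installed; the input vector is invented for the comparison.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(stringr)

# Hypothetical input with one missing name
who &lt;- c("John", NA)

# Base R turns the missing value into the literal text "NA"
paste0("Hi ", who)
#&gt; [1] "Hi John" "Hi NA"

# str_c() keeps the result missing instead
str_c("Hi ", who)
#&gt; [1] "Hi John" NA</pre>
</div>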
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> is designed to be used with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> so it obeys the usual rules for recycling and missing values:</p> <p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> is designed to be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> so it obeys the usual rules for recycling and missing values:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">set.seed(1410) <pre data-type="programlisting" data-code-language="downlit">set.seed(1410)
df &lt;- tibble(name = c(wakefield::name(3), NA)) df &lt;- tibble(name = c(wakefield::name(3), NA))
@ -170,7 +170,7 @@ df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
#&gt; 3 Graylon Hi Graylon! #&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; &lt;NA&gt;</pre> #&gt; 4 &lt;NA&gt; &lt;NA&gt;</pre>
</div> </div>
<p>If you want missing values to display in some other way, use <code><a href="#chp-https://dplyr.tidyverse.org/reference/coalesce" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/coalesce</a></code>. Depending on what you want, you might use it either inside or outside of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>:</p> <p>If you want missing values to display in some other way, use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code>. Depending on what you want, you might use it either inside or outside of <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; <pre data-type="programlisting" data-code-language="downlit">df |&gt;
mutate( mutate(
@ -191,7 +191,7 @@ df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
<h2> <h2>
<code>str_glue()</code> <code>str_glue()</code>
</h2> </h2>
<p>If you are mixing many fixed and variable strings with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="#chp-https://glue.tidyverse" data-type="xref">#chp-https://glue.tidyverse</a> via <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="#chp-https://glue.tidyverse.org/reference/glue" data-type="xref">#chp-https://glue.tidyverse.org/reference/glue</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="#chp-https://rdrr.io/r/base/Paren" data-type="xref">#chp-https://rdrr.io/r/base/Paren</a></code> will be evaluated like it’s outside of the quotes:</p> <p>If you are mixing many fixed and variable strings with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>, you’ll notice that you type a lot of <code>"</code>s, making it hard to see the overall goal of the code. An alternative approach is provided by the <a href="https://glue.tidyverse.org">glue package</a> via <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code><span data-type="footnote">If you’re not using stringr, you can also access it directly with <code><a href="https://glue.tidyverse.org/reference/glue.html">glue::glue()</a></code>.</span>. You give it a single string that has a special feature: anything inside <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code> will be evaluated like it’s outside of the quotes:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(greeting = str_glue("Hi {name}!")) <pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(greeting = str_glue("Hi {name}!"))
#&gt; # A tibble: 4 × 2 #&gt; # A tibble: 4 × 2
@ -202,7 +202,7 @@ df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
#&gt; 3 Graylon Hi Graylon! #&gt; 3 Graylon Hi Graylon!
#&gt; 4 &lt;NA&gt; Hi NA!</pre> #&gt; 4 &lt;NA&gt; Hi NA!</pre>
</div> </div>
<p>As you can see, <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code>.</p> <p>As you can see, <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> currently converts missing values to the string <code>"NA"</code>, unfortunately making it inconsistent with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code>.</p>
<p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. If you guess that you’ll need to somehow escape it, you’re on the right track. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p> <p>You also might wonder what happens if you need to include a regular <code>{</code> or <code>}</code> in your string. If you guess that you’ll need to somehow escape it, you’re on the right track. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like <code>\</code>, you double up the special characters:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}")) <pre data-type="programlisting" data-code-language="downlit">df |&gt; mutate(greeting = str_glue("{{Hi {name}!}}"))
@ -220,7 +220,7 @@ df |&gt; mutate(greeting = str_c("Hi ", name, "!"))
<h2> <h2>
<code>str_flatten()</code> <code>str_flatten()</code>
</h2> </h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> and <code>glue()</code> work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>, i.e. something that always returns a single string? That’s the job of <code><a href="#chp-https://stringr.tidyverse.org/reference/str_flatten" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_flatten</a></code><span data-type="footnote">The base R equivalent is <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p> <p><code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> and <code>glue()</code> work well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because their output is the same length as their inputs. What if you want a function that works well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>, i.e. something that always returns a single string? That’s the job of <code><a href="https://stringr.tidyverse.org/reference/str_flatten.html">str_flatten()</a></code><span data-type="footnote">The base R equivalent is <code><a href="https://rdrr.io/r/base/paste.html">paste()</a></code> used with the <code>collapse</code> argument.</span>: it takes a character vector and combines each element of the vector into a single string:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_flatten(c("x", "y", "z")) <pre data-type="programlisting" data-code-language="downlit">str_flatten(c("x", "y", "z"))
#&gt; [1] "xyz" #&gt; [1] "xyz"
@ -229,7 +229,7 @@ str_flatten(c("x", "y", "z"), ", ")
str_flatten(c("x", "y", "z"), ", ", last = ", and ") str_flatten(c("x", "y", "z"), ", ", last = ", and ")
#&gt; [1] "x, y, and z"</pre> #&gt; [1] "x, y, and z"</pre>
</div> </div>
<p>This makes it work well with <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>:</p> <p>This makes it work well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble( <pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~ name, ~ fruit, ~ name, ~ fruit,
@ -256,14 +256,14 @@ df |&gt;
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li> <ol type="1"><li>
<p>Compare and contrast the results of <code><a href="#chp-https://rdrr.io/r/base/paste" data-type="xref">#chp-https://rdrr.io/r/base/paste</a></code> with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> for the following inputs:</p> <p>Compare and contrast the results of <code><a href="https://rdrr.io/r/base/paste.html">paste0()</a></code> with <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> for the following inputs:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_c("hi ", NA) <pre data-type="programlisting" data-code-language="downlit">str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])</pre> str_c(letters[1:2], letters[1:3])</pre>
</div> </div>
</li> </li>
<li> <li>
<p>Convert the following expressions from <code><a href="#chp-https://stringr.tidyverse.org/reference/str_c" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_c</a></code> to <code><a href="#chp-https://stringr.tidyverse.org/reference/str_glue" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_glue</a></code> or vice versa:</p> <p>Convert the following expressions from <code><a href="https://stringr.tidyverse.org/reference/str_c.html">str_c()</a></code> to <code><a href="https://stringr.tidyverse.org/reference/str_glue.html">str_glue()</a></code> or vice versa:</p>
<ol type="a"><li><p><code>str_c("The price of ", food, " is ", price)</code></p></li> <ol type="a"><li><p><code>str_c("The price of ", food, " is ", price)</code></p></li>
<li><p><code>str_glue("I'm {age} years old and live in {country}")</code></p></li> <li><p><code>str_glue("I'm {age} years old and live in {country}")</code></p></li>
<li><p><code>str_c("\\section{", title, "}")</code></p></li> <li><p><code>str_c("\\section{", title, "}")</code></p></li>
@ -290,7 +290,7 @@ Extracting data from strings</h1>
<section id="separating-into-rows" data-type="sect2"> <section id="separating-into-rows" data-type="sect2">
<h2> <h2>
Separating into rows</h2> Separating into rows</h2>
<p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case requires <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim</a></code> to split based on a delimiter:</p> <p>Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case requires <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_delim()</a></code> to split based on a delimiter:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 &lt;- tibble(x = c("a,b,c", "d,e", "f")) <pre data-type="programlisting" data-code-language="downlit">df1 &lt;- tibble(x = c("a,b,c", "d,e", "f"))
df1 |&gt; df1 |&gt;
@ -305,7 +305,7 @@ df1 |&gt;
#&gt; 5 e #&gt; 5 e
#&gt; 6 f</pre> #&gt; 6 f</pre>
</div> </div>
<p>It’s rarer to see <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_longer_delim</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p> <p>It’s rarer to see <code><a href="https://tidyr.tidyverse.org/reference/separate_longer_delim.html">separate_longer_position()</a></code> in the wild, but some older datasets do use a very compact format where each character is used to record a value:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df2 &lt;- tibble(x = c("1211", "131", "21")) <pre data-type="programlisting" data-code-language="downlit">df2 &lt;- tibble(x = c("1211", "131", "21"))
df2 |&gt; df2 |&gt;
@ -326,7 +326,7 @@ df2 |&gt;
<section id="sec-string-columns" data-type="sect2"> <section id="sec-string-columns" data-type="sect2">
<h2> <h2>
Separating into columns</h2> Separating into columns</h2>
<p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> we supply the delimiter and the names in two arguments:</p> <p>Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their <code>longer</code> equivalents because you need to name the columns. For example, in the following dataset <code>x</code> is made up of a code, an edition number, and a year, separated by <code>"."</code>. To use <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> we supply the delimiter and the names in two arguments:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df3 &lt;- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015")) <pre data-type="programlisting" data-code-language="downlit">df3 &lt;- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 |&gt; df3 |&gt;
@ -357,7 +357,7 @@ df3 |&gt;
#&gt; 2 b10 2011 #&gt; 2 b10 2011
#&gt; 3 e15 2015</pre> #&gt; 3 e15 2015</pre>
</div> </div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p> <p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA")) <pre data-type="programlisting" data-code-language="downlit">df4 &lt;- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 |&gt; df4 |&gt;
@ -377,7 +377,7 @@ df4 |&gt;
<section id="diagnosing-widening-problems" data-type="sect2"> <section id="diagnosing-widening-problems" data-type="sect2">
<h2> <h2>
Diagnosing widening problems</h2> Diagnosing widening problems</h2>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code><span data-type="footnote">The same principles apply to <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p> <p><code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code><span data-type="footnote">The same principles apply to <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_position()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_regex()</a></code>.</span> requires a fixed and known set of columns. What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so <code><a href="https://tidyr.tidyverse.org/reference/separate_wider_delim.html">separate_wider_delim()</a></code> provides two arguments to help: <code>too_few</code> and <code>too_many</code>. Let’s first look at the <code>too_few</code> case with the following sample dataset:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1")) <pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
@ -523,12 +523,12 @@ Letters</h1>
<section id="length" data-type="sect2"> <section id="length" data-type="sect2">
<h2> <h2>
Length</h2> Length</h2>
<p><code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> tells you the number of letters in the string:</p> <p><code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> tells you the number of letters in the string:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_length(c("a", "R for data science", NA)) <pre data-type="programlisting" data-code-language="downlit">str_length(c("a", "R for data science", NA))
#&gt; [1] 1 18 NA</pre> #&gt; [1] 1 18 NA</pre>
</div> </div>
<p>You could use this with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> to find the distribution of lengths of US babynames, and then with <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p> <p>You could use this with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to find the distribution of lengths of US babynames, and then with <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> to look at the longest names<span data-type="footnote">Looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters.</span>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt; <pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
count(length = str_length(name), wt = n) count(length = str_length(name), wt = n)
@ -573,12 +573,12 @@ str_sub(x, 1, 3)
<pre data-type="programlisting" data-code-language="downlit">str_sub(x, -3, -1) <pre data-type="programlisting" data-code-language="downlit">str_sub(x, -3, -1)
#&gt; [1] "ple" "ana" "ear"</pre> #&gt; [1] "ple" "ana" "ear"</pre>
</div> </div>
<p>Note that <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> won’t fail if the string is too short: it will just return as much as possible:</p> <p>Note that <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> won’t fail if the string is too short: it will just return as much as possible:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str_sub("a", 1, 5) <pre data-type="programlisting" data-code-language="downlit">str_sub("a", 1, 5)
#&gt; [1] "a"</pre> #&gt; [1] "a"</pre>
</div> </div>
<p>We could use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> with <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to find the first and last letter of each name:</p> <p>We could use <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to find the first and last letter of each name:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">babynames |&gt; <pre data-type="programlisting" data-code-language="downlit">babynames |&gt;
mutate( mutate(
@ -626,7 +626,7 @@ str_view(str_wrap(x, 30))
<section id="exercises-2" data-type="sect2"> <section id="exercises-2" data-type="sect2">
<h2> <h2>
Exercises</h2> Exercises</h2>
<ol type="1"><li>Use <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li> <ol type="1"><li>Use <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code> to extract the middle letter from each baby name. What will you do if the string has an even number of characters?</li>
<li>Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?</li> <li>Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?</li>
</ol></section> </ol></section>
</section> </section>
@ -639,7 +639,7 @@ Non-English text</h1>
<section id="encoding" data-type="sect2"> <section id="encoding" data-type="sect2">
<h2> <h2>
Encoding</h2> Encoding</h2>
<p>When working with non-English text the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="#chp-https://rdrr.io/r/base/rawConversion" data-type="xref">#chp-https://rdrr.io/r/base/rawConversion</a></code>:</p> <p>When working with non-English text the first challenge is often the <strong>encoding</strong>. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using <code><a href="https://rdrr.io/r/base/rawConversion.html">charToRaw()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">charToRaw("Hadley") <pre data-type="programlisting" data-code-language="downlit">charToRaw("Hadley")
#&gt; [1] 48 61 64 6c 65 79</pre> #&gt; [1] 48 61 64 6c 65 79</pre>
@ -676,7 +676,7 @@ read_csv(x2, locale = locale(encoding = "Shift-JIS"))
#&gt; &lt;chr&gt; #&gt; &lt;chr&gt;
#&gt; 1 こんにちは</pre> #&gt; 1 こんにちは</pre>
</div> </div>
<p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="#chp-https://readr.tidyverse.org/reference/encoding" data-type="xref">#chp-https://readr.tidyverse.org/reference/encoding</a></code> to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p> <p>How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides <code><a href="https://readr.tidyverse.org/reference/encoding.html">guess_encoding()</a></code> to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">guess_encoding(x1) <pre data-type="programlisting" data-code-language="downlit">guess_encoding(x1)
#&gt; # A tibble: 1 × 2 #&gt; # A tibble: 1 × 2
@ -695,7 +695,7 @@ guess_encoding(x2)
<section id="letter-variations" data-type="sect2"> <section id="letter-variations" data-type="sect2">
<h2> <h2>
Letter variations</h2> Letter variations</h2>
<p>If you’re working with individual letters (e.g. with <code><a href="#chp-https://stringr.tidyverse.org/reference/str_length" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_length</a></code> and <code><a href="#chp-https://stringr.tidyverse.org/reference/str_sub" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_sub</a></code>) there’s an important challenge if you’re working with a language that has accents because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:</p> <p>If you’re working with individual letters (e.g. with <code><a href="https://stringr.tidyverse.org/reference/str_length.html">str_length()</a></code> and <code><a href="https://stringr.tidyverse.org/reference/str_sub.html">str_sub()</a></code>) there’s an important challenge if you’re working with a language that has accents because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">u &lt;- c("\u00fc", "u\u0308") <pre data-type="programlisting" data-code-language="downlit">u &lt;- c("\u00fc", "u\u0308")
str_view(u) str_view(u)
@ -709,7 +709,7 @@ str_view(u)
str_sub(u, 1, 1) str_sub(u, 1, 1)
#&gt; [1] "ü" "u"</pre> #&gt; [1] "ü" "u"</pre>
</div> </div>
<p>Finally, note that these strings are treated as different when you compare them with <code>==</code>, which is why stringr provides the handy <code><a href="#chp-https://stringr.tidyverse.org/reference/str_equal" data-type="xref">#chp-https://stringr.tidyverse.org/reference/str_equal</a></code> function:</p> <p>Finally, note that these strings are treated as different when you compare them with <code>==</code>, which is why stringr provides the handy <code><a href="https://stringr.tidyverse.org/reference/str_equal.html">str_equal()</a></code> function:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">u[[1]] == u[[2]] <pre data-type="programlisting" data-code-language="downlit">u[[1]] == u[[2]]
#&gt; [1] FALSE #&gt; [1] FALSE
@ -722,7 +722,7 @@ str_equal(u[[1]], u[[2]])
<section id="locale-dependent-function" data-type="sect2"> <section id="locale-dependent-function" data-type="sect2">
<h2> <h2>
Locale-dependent function</h2> Locale-dependent function</h2>
<p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="#chp-https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes" data-type="xref">#chp-https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="#chp-https://rdrr.io/pkg/stringi/man/stri_locale_list" data-type="xref">#chp-https://rdrr.io/pkg/stringi/man/stri_locale_list</a></code>.</p> <p>Finally, there are a handful of stringr functions whose behavior depends on your <strong>locale</strong>. A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a <code>_</code> and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes">Wikipedia</a> has a good list, and you can see which are supported in stringr by looking at <code><a href="https://rdrr.io/pkg/stringi/man/stri_locale_list.html">stringi::stri_locale_list()</a></code>.</p>
<p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to using English rules, by using the “en” locale, and requires you to specify the <code>locale</code> argument to override it. Fortunately there are two sets of functions where the locale really matters: changing case and sorting.</p> <p>Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to using English rules, by using the “en” locale, and requires you to specify the <code>locale</code> argument to override it. Fortunately there are two sets of functions where the locale really matters: changing case and sorting.</p>
<p><strong>T</strong>he rules for changing case are not the same in every language. For example, Turkish has two i’s: with and without a dot, and it capitalizes them in a different way to English:</p> <p><strong>T</strong>he rules for changing case are not the same in every language. For example, Turkish has two i’s: with and without a dot, and it capitalizes them in a different way to English:</p>
<div class="cell"> <div class="cell">
@ -738,7 +738,7 @@ str_to_upper(c("i", "ı"), locale = "tr")
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs") str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
#&gt; [1] "a" "c" "h" "ch" "z"</pre> #&gt; [1] "a" "c" "h" "ch" "z"</pre>
</div> </div>
<p>This also comes up when sorting strings with <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> which is why it also has a <code>locale</code> argument.</p> <p>This also comes up when sorting strings with <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">dplyr::arrange()</a></code> which is why it also has a <code>locale</code> argument.</p>
</section> </section>
</section> </section>


@ -24,7 +24,7 @@ sin(pi / 2)
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- 3 * 4</pre> <pre data-type="programlisting" data-code-language="downlit">x &lt;- 3 * 4</pre>
</div> </div>
<p>You can <strong>c</strong>ombine multiple elements into a vector with <code><a href="#chp-https://rdrr.io/r/base/c" data-type="xref">#chp-https://rdrr.io/r/base/c</a></code>:</p> <p>You can <strong>c</strong>ombine multiple elements into a vector with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">primes &lt;- c(2, 3, 5, 7, 11, 13)</pre> <pre data-type="programlisting" data-code-language="downlit">primes &lt;- c(2, 3, 5, 7, 11, 13)</pre>
</div> </div>
@ -105,7 +105,7 @@ Calling functions</h1>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">function_name(arg1 = val1, arg2 = val2, ...)</pre> <pre data-type="programlisting" data-code-language="downlit">function_name(arg1 = val1, arg2 = val2, ...)</pre>
</div> </div>
<p>Let’s try using <code><a href="#chp-https://rdrr.io/r/base/seq" data-type="xref">#chp-https://rdrr.io/r/base/seq</a></code>, which makes regular <strong>seq</strong>uences of numbers and, while we’re at it, learn more helpful features of RStudio. Type <code>se</code> and hit TAB. A popup shows you possible completions. Specify <code><a href="#chp-https://rdrr.io/r/base/seq" data-type="xref">#chp-https://rdrr.io/r/base/seq</a></code> by typing more (a <code>q</code>) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.</p> <p>Let’s try using <code><a href="https://rdrr.io/r/base/seq.html">seq()</a></code>, which makes regular <strong>seq</strong>uences of numbers and, while we’re at it, learn more helpful features of RStudio. Type <code>se</code> and hit TAB. A popup shows you possible completions. Specify <code><a href="https://rdrr.io/r/base/seq.html">seq()</a></code> by typing more (a <code>q</code>) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.</p>
<p>When youve selected the function you want, press TAB again. RStudio will add matching opening (<code>(</code>) and closing (<code>)</code>) parentheses for you. Type the arguments <code>1, 10</code> and hit return.</p> <p>When youve selected the function you want, press TAB again. RStudio will add matching opening (<code>(</code>) and closing (<code>)</code>) parentheses for you. Type the arguments <code>1, 10</code> and hit return.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">seq(1, 10) <pre data-type="programlisting" data-code-language="downlit">seq(1, 10)


@ -12,14 +12,14 @@
<h1> <h1>
Google is your friend</h1> Google is your friend</h1>
<p>If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run <code>Sys.setenv(LANGUAGE = "en")</code> and re-run the code; you're more likely to find help for English error messages.)</p> <p>If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn't useful, it often means that there aren't any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn't in English, run <code>Sys.setenv(LANGUAGE = "en")</code> and re-run the code; you're more likely to find help for English error messages.)</p>
<p>If Google doesn't help, try <a href="#chp-https://stackoverflow" data-type="xref">#chp-https://stackoverflow</a>. Start by spending a little time searching for an existing answer, including <code>[R]</code> to restrict your search to questions and answers that use R.</p> <p>If Google doesn't help, try <a href="https://stackoverflow.com">Stack Overflow</a>. Start by spending a little time searching for an existing answer, including <code>[R]</code> to restrict your search to questions and answers that use R.</p>
</section> </section>
<section id="making-a-reprex" data-type="sect1"> <section id="making-a-reprex" data-type="sect1">
<h1> <h1>
Making a reprex</h1> Making a reprex</h1>
<p>If your googling doesn't find anything useful, it's a really good idea to prepare a <strong>reprex,</strong> short for minimal <strong>repr</strong>oducible <strong>ex</strong>ample. A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:</p> <p>If your googling doesn't find anything useful, it's a really good idea to prepare a <strong>reprex,</strong> short for minimal <strong>repr</strong>oducible <strong>ex</strong>ample. A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:</p>
<ul><li><p>First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code> calls and create all necessary objects. The easiest way to make sure you've done this is to use the reprex package.</p></li> <ul><li><p>First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> calls and create all necessary objects. The easiest way to make sure you've done this is to use the reprex package.</p></li>
<li><p>Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you're facing in real life or even using built-in data.</p></li> <li><p>Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you're facing in real life or even using built-in data.</p></li>
</ul><p>That sounds like a lot of work! And it can be, but it has a great payoff:</p> </ul><p>That sounds like a lot of work! And it can be, but it has a great payoff:</p>
<ul><li><p>80% of the time creating an excellent reprex reveals the source of your problem. It's amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.</p></li> <ul><li><p>80% of the time creating an excellent reprex reveals the source of your problem. It's amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.</p></li>
@ -47,7 +47,7 @@ mean(y)
<p>There are three things you need to include to make your example reproducible: required packages, data, and code.</p> <p>There are three things you need to include to make your example reproducible: required packages, data, and code.</p>
<ol type="1"><li><p><strong>Packages</strong> should be loaded at the top of the script, so it's easy to see which ones the example needs. This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run <code>tidyverse_update()</code>.</p></li> <ol type="1"><li><p><strong>Packages</strong> should be loaded at the top of the script, so it's easy to see which ones the example needs. This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run <code>tidyverse_update()</code>.</p></li>
<li> <li>
<p>The easiest way to include <strong>data</strong> is to use <code><a href="#chp-https://rdrr.io/r/base/dput" data-type="xref">#chp-https://rdrr.io/r/base/dput</a></code> to generate the R code needed to recreate it. For example, to recreate the <code>mtcars</code> dataset in R, perform the following steps:</p> <p>The easiest way to include <strong>data</strong> is to use <code><a href="https://rdrr.io/r/base/dput.html">dput()</a></code> to generate the R code needed to recreate it. For example, to recreate the <code>mtcars</code> dataset in R, perform the following steps:</p>
<ol type="1"><li>Run <code>dput(mtcars)</code> in R</li> <ol type="1"><li>Run <code>dput(mtcars)</code> in R</li>
<li>Copy the output</li> <li>Copy the output</li>
<li>In reprex, type <code>mtcars &lt;-</code> then paste.</li> <li>In reprex, type <code>mtcars &lt;-</code> then paste.</li>
@ -66,8 +66,8 @@ mean(y)
<section id="investing-in-yourself" data-type="sect1"> <section id="investing-in-yourself" data-type="sect1">
<h1> <h1>
Investing in yourself</h1> Investing in yourself</h1>
<p>You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the <a href="#chp-https://www.tidyverse.org/blog/" data-type="xref">#chp-https://www.tidyverse.org/blog/</a>. To keep up with the R community more broadly, we recommend reading <a href="#chp-https://rweekly" data-type="xref">#chp-https://rweekly</a>: it's a community effort to aggregate the most interesting news in the R community each week.</p> <p>You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the <a href="https://www.tidyverse.org/blog/">tidyverse blog</a>. To keep up with the R community more broadly, we recommend reading <a href="https://rweekly.org">R Weekly</a>: it's a community effort to aggregate the most interesting news in the R community each week.</p>
<p>If you're an active Twitter user, you might also want to follow Hadley (<a href="#chp-https://twitter.com/hadleywickham" data-type="xref">#chp-https://twitter.com/hadleywickham</a>), Mine (<a href="#chp-https://twitter.com/minebocek" data-type="xref">#chp-https://twitter.com/minebocek</a>), Garrett (<a href="#chp-https://twitter.com/statgarrett" data-type="xref">#chp-https://twitter.com/statgarrett</a>), or follow <a href="#chp-https://twitter.com/rstudiotips" data-type="xref">#chp-https://twitter.com/rstudiotips</a> to keep up with new features in the IDE. If you want the full fire hose of new developments, you can also read the (<a href="#chp-https://twitter.com/search?q=%23rstats" data-type="xref">#chp-https://twitter.com/search?q=%23rstats</a>) hashtag. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.</p> <p>If you're an active Twitter user, you might also want to follow Hadley (<a href="https://twitter.com/hadleywickham">@hadleywickham</a>), Mine (<a href="https://twitter.com/minebocek">@minebocek</a>), Garrett (<a href="https://twitter.com/statgarrett">@statgarrett</a>), or follow <a href="https://twitter.com/rstudiotips">@rstudiotips</a> to keep up with new features in the IDE. If you want the full fire hose of new developments, you can also read the (<a href="https://twitter.com/search?q=%23rstats"><code>#rstats</code></a>) hashtag. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.</p>
</section> </section>
<section id="summary" data-type="sect1"> <section id="summary" data-type="sect1">


@ -84,7 +84,7 @@ mtcars %&gt;%
<ul><li><p>By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. <code>%&gt;%</code> allows you to change the placement with a <code>.</code> placeholder. For example, <code>x %&gt;% f(1)</code> is equivalent to <code>f(x, 1)</code> but <code>x %&gt;% f(1, .)</code> is equivalent to <code>f(1, x)</code>. R 4.2.0 added a <code>_</code> placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, <code>x |&gt; f(1, y = _)</code> is equivalent to <code>f(1, y = x)</code>.</p></li> <ul><li><p>By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. <code>%&gt;%</code> allows you to change the placement with a <code>.</code> placeholder. For example, <code>x %&gt;% f(1)</code> is equivalent to <code>f(x, 1)</code> but <code>x %&gt;% f(1, .)</code> is equivalent to <code>f(1, x)</code>. R 4.2.0 added a <code>_</code> placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, <code>x |&gt; f(1, y = _)</code> is equivalent to <code>f(1, y = x)</code>.</p></li>
<li> <li>
<p>The <code>|&gt;</code> placeholder is deliberately simple and can't replicate many features of the <code>%&gt;%</code> placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function. For example, <code>df %&gt;% split(.$var)</code> is equivalent to <code>split(df, df$var)</code> and <code>df %&gt;% {split(.$x, .$y)}</code> is equivalent to <code>split(df$x, df$y)</code>.</p> <p>The <code>|&gt;</code> placeholder is deliberately simple and can't replicate many features of the <code>%&gt;%</code> placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function. For example, <code>df %&gt;% split(.$var)</code> is equivalent to <code>split(df, df$var)</code> and <code>df %&gt;% {split(.$x, .$y)}</code> is equivalent to <code>split(df$x, df$y)</code>.</p>
<p>With <code>%&gt;%</code> you can use <code>.</code> on the left-hand side of operators like <code>$</code>, <code>[[</code>, <code>[</code> (which you'll learn about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>), so you can extract a single column from a data frame with (e.g.) <code>mtcars %&gt;% .$cyl</code>. A future version of R may add similar support for <code>|&gt;</code> and <code>_</code>. For the special case of extracting a column out of a data frame, you can also use <code><a href="#chp-https://dplyr.tidyverse.org/reference/pull" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/pull</a></code>:</p> <p>With <code>%&gt;%</code> you can use <code>.</code> on the left-hand side of operators like <code>$</code>, <code>[[</code>, <code>[</code> (which you'll learn about in <a href="#sec-subset-many" data-type="xref">#sec-subset-many</a>), so you can extract a single column from a data frame with (e.g.) <code>mtcars %&gt;% .$cyl</code>. A future version of R may add similar support for <code>|&gt;</code> and <code>_</code>. For the special case of extracting a column out of a data frame, you can also use <code><a href="https://dplyr.tidyverse.org/reference/pull.html">dplyr::pull()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">mtcars |&gt; pull(cyl) <pre data-type="programlisting" data-code-language="downlit">mtcars |&gt; pull(cyl)
#&gt; [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4</pre> #&gt; [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4</pre>
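The placeholder rules described in this hunk can be tried with base R alone. The following is a minimal sketch, assuming R 4.2.0 or later; the object names `n_four_cyl` and `fit` are invented for illustration and are not part of the book's text.

```r
# Assumes R >= 4.2.0, where the base pipe gained the `_` placeholder.

# Default behavior: the left-hand side becomes the first argument,
# so this is subset(mtcars, cyl == 4) followed by nrow().
n_four_cyl <- mtcars |> subset(cyl == 4) |> nrow()

# With `_`, the left-hand side can go to another argument, but that
# argument must be named (here `data`).
fit <- mtcars |> lm(mpg ~ wt, data = _)

n_four_cyl
coef(fit)
```

Writing `mtcars |> lm(mpg ~ wt, _)` without the argument name is rejected by the parser, which is the "has to be named" restriction mentioned above.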


@ -39,7 +39,7 @@ not_cancelled |&gt;
summarize(mean = mean(dep_delay))</pre> summarize(mean = mean(dep_delay))</pre>
</div> </div>
<p>Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.</p> <p>Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.</p>
<p>We recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include <code><a href="#chp-https://rdrr.io/r/utils/install.packages" data-type="xref">#chp-https://rdrr.io/r/utils/install.packages</a></code> in a script that you share. It's very antisocial to change settings on someone else's computer!</p> <p>We recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> in a script that you share. It's very antisocial to change settings on someone else's computer!</p>
<p>When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.</p> <p>When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.</p>
</section> </section>
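One way to follow the advice above (list required packages at the top, never call `install.packages()` in a shared script) is sketched below. This is only an illustration under assumptions: the `needed` vector uses two packages that ship with R as stand-ins, and the variable names are hypothetical.

```r
# Hypothetical script header: declare what the script needs and report,
# rather than install, anything that is missing.
# "stats" and "utils" are placeholders for the packages a real script uses.
needed <- c("stats", "utils")

# requireNamespace() returns TRUE/FALSE without attaching the package.
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!available]

if (length(missing_pkgs) > 0) {
  stop("Please install: ", paste(missing_pkgs, collapse = ", "))
}
```

The script stops with a readable message and leaves the decision to install with the person running it.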
@ -84,7 +84,7 @@ Figure_02.png
model_first_try.R model_first_try.R
run-first.r run-first.r
temp.txt</code></pre> temp.txt</code></pre>
<p>There are a variety of problems here: it's hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (<code>finalreport</code> vs. <code>FinalReport</code><span data-type="footnote">Not to mention that you're tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a <a href="#chp-https://phdcomics.com/comics/archive.php?comicid=1531" data-type="xref">#chp-https://phdcomics.com/comics/archive.php?comicid=1531</a>.</span>), and some names don't describe their contents (<code>run-first</code> and <code>temp</code>).</p> <p>There are a variety of problems here: it's hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (<code>finalreport</code> vs. <code>FinalReport</code><span data-type="footnote">Not to mention that you're tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a <a href="https://phdcomics.com/comics/archive.php?comicid=1531">fun strip on this</a>.</span>), and some names don't describe their contents (<code>run-first</code> and <code>temp</code>).</p>
<p>Here's a better way of naming and organizing the same set of files:</p> <p>Here's a better way of naming and organizing the same set of files:</p>
<pre><code>01-load-data.R <pre><code>01-load-data.R
02-exploratory-analysis.R 02-exploratory-analysis.R
@ -111,7 +111,7 @@ Projects</h1>
<h2> <h2>
What is the source of truth?</h2> What is the source of truth?</h2>
<p>As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis. However, in the long run, you'll be much better off if you ensure that your R scripts are the source of truth. With your R scripts (and your data files), you can recreate the environment. With only your environment, it's much harder to recreate your R scripts: you'll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you'll have to carefully mine your R history.</p> <p>As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis. However, in the long run, you'll be much better off if you ensure that your R scripts are the source of truth. With your R scripts (and your data files), you can recreate the environment. With only your environment, it's much harder to recreate your R scripts: you'll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you'll have to carefully mine your R history.</p>
<p>To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running <code><a href="#chp-https://usethis.r-lib.org/reference/use_blank_slate" data-type="xref">#chp-https://usethis.r-lib.org/reference/use_blank_slate</a></code><span data-type="footnote">If you don't have usethis installed, you can install it with <code>install.packages("usethis")</code>.</span> or by mimicking the options shown in <a href="#fig-blank-slate" data-type="xref">#fig-blank-slate</a>. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time. But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.</p> <p>To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running <code><a href="https://usethis.r-lib.org/reference/use_blank_slate.html">usethis::use_blank_slate()</a></code><span data-type="footnote">If you don't have usethis installed, you can install it with <code>install.packages("usethis")</code>.</span> or by mimicking the options shown in <a href="#fig-blank-slate" data-type="xref">#fig-blank-slate</a>. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time. But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.</p>
<div class="cell"> <div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
@ -141,7 +141,7 @@ Where does your analysis live?</h2>
<p><img src="screenshots/rstudio-wd.png" alt="The Console tab shows the current working directory as ~/Documents/r4ds/r4ds. " width="321"/></p> <p><img src="screenshots/rstudio-wd.png" alt="The Console tab shows the current working directory as ~/Documents/r4ds/r4ds. " width="321"/></p>
</div> </div>
</div> </div>
<p>And you can print this out in R code by running <code><a href="#chp-https://rdrr.io/r/base/getwd" data-type="xref">#chp-https://rdrr.io/r/base/getwd</a></code>:</p> <p>And you can print this out in R code by running <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code>:</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">getwd() <pre data-type="programlisting" data-code-language="downlit">getwd()
#&gt; [1] "/Users/hadley/Documents/r4ds/r4ds"</pre> #&gt; [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>


@ -7,7 +7,7 @@
</div> </div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it's a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the <a href="#chp-https://style.tidyverse" data-type="xref">#chp-https://style.tidyverse</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="#chp-https://styler.r-lib" data-type="xref">#chp-https://styler.r-lib</a> package by Lorenz Walthert. Once you've installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio's <strong>command palette</strong>. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell"> <p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it's a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the <a href="https://style.tidyverse.org">tidyverse style guide</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature.
Additionally, there are some great tools to quickly restyle existing code, like the <a href="https://styler.r-lib.org">styler</a> package by Lorenz Walthert. Once you've installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio's <strong>command palette</strong>. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell">
<div class="cell-output-display"> <div class="cell-output-display">
<figure id="fig-rstudio-sections"><p><img src="screenshots/rstudio-palette.png" alt="A screenshot showing the command palette after typing &quot;styler&quot;, showing the four styling tool provided by the package." width="638"/></p> <figure id="fig-rstudio-sections"><p><img src="screenshots/rstudio-palette.png" alt="A screenshot showing the command palette after typing &quot;styler&quot;, showing the four styling tool provided by the package." width="638"/></p>
@ -21,7 +21,7 @@ library(nycflights13)</pre>
<section id="names" data-type="sect1"> <section id="names" data-type="sect1">
<h1> <h1>
Names</h1> Names</h1>
<p>We talked briefly about names in <a href="#sec-whats-in-a-name" data-type="xref">#sec-whats-in-a-name</a>. Remember that variable names (those created by <code>&lt;-</code> and those created by <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>) should use only lowercase letters, numbers, and <code>_</code>. Use <code>_</code> to separate words within a name.</p> <p>We talked briefly about names in <a href="#sec-whats-in-a-name" data-type="xref">#sec-whats-in-a-name</a>. Remember that variable names (those created by <code>&lt;-</code> and those created by <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>) should use only lowercase letters, numbers, and <code>_</code>. Use <code>_</code> to separate words within a name.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for: <pre data-type="programlisting" data-code-language="downlit"># Strive for:
short_flights &lt;- flights |&gt; filter(air_time &lt; 60) short_flights &lt;- flights |&gt; filter(air_time &lt; 60)
@ -52,7 +52,7 @@ mean(x, na.rm = TRUE)
# Avoid # Avoid
mean (x ,na.rm=TRUE)</pre> mean (x ,na.rm=TRUE)</pre>
</div> </div>
<p>It's OK to add extra spaces if it improves alignment. For example, if you're creating multiple variables in <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, you might want to add spaces so that all the <code>=</code> line up. This makes it easier to skim the code.</p> <p>It's OK to add extra spaces if it improves alignment. For example, if you're creating multiple variables in <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, you might want to add spaces so that all the <code>=</code> line up. This makes it easier to skim the code.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt; <pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate( mutate(
@ -76,7 +76,7 @@ flights |&gt;
# Avoid # Avoid
flights|&gt;filter(!is.na(arr_delay), !is.na(tailnum))|&gt;count(dest)</pre> flights|&gt;filter(!is.na(arr_delay), !is.na(tailnum))|&gt;count(dest)</pre>
</div> </div>
<p>If the function you're piping into has named arguments (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>), put each argument on a new line. If the function doesn't have named arguments (like <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code>) keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line.</p> <p>If the function you're piping into has named arguments (like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>), put each argument on a new line. If the function doesn't have named arguments (like <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>) keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line.</p>
<div class="cell"> <div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Strive for <pre data-type="programlisting" data-code-language="downlit"># Strive for
flights |&gt; flights |&gt;


@ -13,4 +13,4 @@
<li><p>In <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a>, you'll learn about harvesting data off the web and getting it into R.</p></li> <li><p>In <a href="#chp-webscraping" data-type="xref">#chp-webscraping</a>, you'll learn about harvesting data off the web and getting it into R.</p></li>
</ul><p>Some other types of data are not covered in this book:</p><ul><li><p><strong>haven</strong> reads SPSS, Stata, and SAS files.</p></li> </ul><p>Some other types of data are not covered in this book:</p><ul><li><p><strong>haven</strong> reads SPSS, Stata, and SAS files.</p></li>
<li><p><strong>xml2</strong> reads XML files.</p></li> <li><p><strong>xml2</strong> reads XML files.</p></li>
</ul><p>For other file types, try the <a href="#chp-https://cran.r-project.org/doc/manuals/r-release/R-data" data-type="xref">#chp-https://cran.r-project.org/doc/manuals/r-release/R-data</a> and the <a href="#chp-https://github.com/leeper/rio" data-type="xref">#chp-https://github.com/leeper/rio</a> package.</p></div> </ul><p>For other file types, try the <a href="https://cran.r-project.org/doc/manuals/r-release/R-data.html">R data import/export manual</a> and the <a href="https://github.com/leeper/rio"><strong>rio</strong></a> package.</p></div>