More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions


@@ -1,22 +1,24 @@
<section data-type="chapter" id="chp-numbers">
<h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="numbers-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Numeric vectors are the backbone of data science, and you've already used them a bunch of times earlier in the book. Now it's time to systematically survey what you can do with them in R, ensuring that you're well situated to tackle any future problem involving numeric vectors.</p>
<p>We'll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail on <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>. Then we'll dive into various numeric transformations that pair well with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, including more general transformations that can be applied to other types of vectors, but are often used with numeric vectors. We'll finish off by covering the summary functions that pair well with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> and show you how they can also be used with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>.</p>
<section id="prerequisites" data-type="sect2">
<section id="numbers-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<div data-type="important"><div class="callout-body d-flex">
<div data-type="important">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/dplyr")</code>.</p>
<p>This chapter relies on features only found in dplyr 1.1.0, which is still in development. If you want to live on the edge, you can get the dev versions with <code>devtools::install_github("tidyverse/dplyr")</code>.</p></div>
</div>
</div>
<p>This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we'll use these base R functions inside of tidyverse functions like <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>. Like in the last chapter, we'll use real examples from nycflights13, as well as toy examples made with <code><a href="https://rdrr.io/r/base/c.html">c()</a></code> and <code><a href="https://tibble.tidyverse.org/reference/tribble.html">tribble()</a></code>.</p>
<div class="cell">
@@ -109,9 +111,7 @@ Counts</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt;
group_by(dest) |&gt;
summarize(
carriers = n_distinct(carrier)
) |&gt;
summarize(carriers = n_distinct(carrier)) |&gt;
arrange(desc(carriers))
#&gt; # A tibble: 105 × 2
#&gt; dest carriers
@@ -144,17 +144,7 @@ Counts</h1>
</div>
<p>Weighted counts are a common problem so <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> has a <code>wt</code> argument that does the same thing:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(tailnum, wt = distance)
#&gt; # A tibble: 4,044 × 2
#&gt; tailnum n
#&gt; &lt;chr&gt; &lt;dbl&gt;
#&gt; 1 D942DN 3418
#&gt; 2 N0EGMQ 250866
#&gt; 3 N10156 115966
#&gt; 4 N102UW 25722
#&gt; 5 N103US 24619
#&gt; 6 N104UW 25157
#&gt; # … with 4,038 more rows</pre>
<pre data-type="programlisting" data-code-language="r">flights |&gt; count(tailnum, wt = distance)</pre>
</div>
</li>
<li>
@@ -176,7 +166,7 @@ Counts</h1>
</div>
</li>
</ul>
<section id="exercises" data-type="sect2">
<section id="numbers-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>How can you use <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> to count the number of rows with a missing value for a given variable?</li>
@@ -228,9 +218,7 @@ x * c(1, 2, 3)
#&gt; 5 2013 1 1 557 600 -3 838 846
#&gt; 6 2013 1 1 558 600 -2 849 851
#&gt; # … with 25,971 more rows, and 11 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
<p>The code runs without error, but it doesn't return what you want. Because of the recycling rules, it finds flights in odd-numbered rows that departed in January and flights in even-numbered rows that departed in February. And unfortunately there's no warning because <code>flights</code> has an even number of rows.</p>
<p>To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values. Unfortunately that doesn't help here, or in many other cases, because the key computation is performed by the base R function <code>==</code>, not <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>.</p>
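<p>The silent recycling performed by <code>==</code> is easier to see on a small vector. The sketch below is only an illustration: the vector <code>month</code> and its values are invented here and are not taken from <code>flights</code>. It contrasts <code>==</code> with base R's <code>%in%</code>, which is one common way to ask whether each element matches any of several values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Toy data, invented for illustration (not from flights)
month &lt;- c(1, 1, 2, 2, 3, 3)

# == recycles c(1, 2) to 1, 2, 1, 2, 1, 2 and compares position by position.
# The lengths divide evenly, so there is no warning.
month == c(1, 2)
#&gt; [1]  TRUE FALSE FALSE  TRUE FALSE FALSE

# %in% instead checks each element against every value on the right
month %in% c(1, 2)
#&gt; [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE</pre>
</div>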
@@ -476,7 +464,7 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="numbers-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explain in words what each line of the code used to generate <a href="#fig-prop-cancelled" data-type="xref">#fig-prop-cancelled</a> does.</p></li>
@@ -671,7 +659,7 @@ df
</div>
</section>
<section id="exercises-2" data-type="sect2">
<section id="numbers-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for <code><a href="https://dplyr.tidyverse.org/reference/row_number.html">min_rank()</a></code>.</p></li>
@@ -718,10 +706,8 @@ Center</h2>
.groups = "drop"
) |&gt;
ggplot(aes(x = mean, y = median)) +
geom_abline(slope = 1, intercept = 0, color = "white", size = 2) +
geom_point()
#&gt; Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#&gt; Please use `linewidth` instead.</pre>
geom_abline(slope = 1, intercept = 0, color = "white", linewidth = 2) +
geom_point()</pre>
<div class="cell-output-display">
<figure id="fig-mean-vs-median"><p><img src="numbers_files/figure-html/fig-mean-vs-median-1.png" alt="All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55." width="576"/></p>
@@ -875,15 +861,13 @@ Positions</h2>
#&gt; 5 2013 1 2 42 2359 43 518 442
#&gt; 6 2013 1 2 458 500 -2 703 650
#&gt; # … with 1,189 more rows, and 12 more variables: arr_delay &lt;dbl&gt;,
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,
#&gt; # air_time &lt;dbl&gt;, distance &lt;dbl&gt;, hour &lt;dbl&gt;, minute &lt;dbl&gt;,
#&gt; # time_hour &lt;dttm&gt;, r &lt;int&gt;</pre>
#&gt; # carrier &lt;chr&gt;, flight &lt;int&gt;, tailnum &lt;chr&gt;, origin &lt;chr&gt;, dest &lt;chr&gt;,</pre>
</div>
</section>
<section id="with-mutate" data-type="sect2">
<h2>
With<code>mutate()</code>
With mutate()
</h2>
<p>As the names suggest, the summary functions are typically paired with <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>. However, because of the recycling rules we discussed in <a href="#sec-recycling" data-type="xref">#sec-recycling</a> they can also be usefully paired with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, particularly when you want to do some sort of group standardization. For example:</p>
<ul><li>
@@ -894,7 +878,7 @@ With<code>mutate()</code>
<code>x / first(x)</code> computes an index based on the first observation.</li>
</ul></section>
<section id="exercises-3" data-type="sect2">
<section id="numbers-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -910,7 +894,7 @@ Exercises</h2>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="numbers-summary" data-type="sect1">
<h1>
Summary</h1>
<p>You're already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You've also learned a handful of useful general transformations, like ranks and offsets, that are commonly, but not exclusively, applied to numeric vectors. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.</p>