Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions


@@ -1,5 +1,5 @@
<section data-type="chapter" id="chp-base-R">
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you'll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It's not possible to use the tidyverse without using base R, so we've actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you'll learn other approaches to the same problems using base R, data.table, and other packages. You'll certainly encounter these other approaches when you start reading R code written by other people, particularly if you're using StackOverflow. 
It's 100% okay to write code that uses a mix of approaches, and don't let anyone tell you otherwise!</p><p>In this chapter, we'll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we'll briefly discuss two important plotting functions.</p>
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you'll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It's not possible to use the tidyverse without using base R, so we've actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&amp;</code>, and <code>!</code>. What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you'll learn other approaches to the same problems using base R, data.table, and other packages. You'll certainly encounter these other approaches when you start reading R code written by other people, particularly if you're using StackOverflow. 
It's 100% okay to write code that uses a mix of approaches, and don't let anyone tell you otherwise!</p><p>In this chapter, we'll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and <code>for</code> loops. To finish off, we'll briefly discuss two important plotting functions.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
@@ -44,14 +44,10 @@ x[c(3, 2, 5)]
<pre data-type="programlisting" data-code-language="r">x &lt;- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
!is.na(x)
#&gt; [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE
x[!is.na(x)]
#&gt; [1] 10 3 5 8 1
# All even (or missing!) values of x
x %% 2 == 0
#&gt; [1] TRUE FALSE NA FALSE TRUE FALSE NA
x[x %% 2 == 0]
#&gt; [1] 10 NA 8 NA</pre>
</div>
@@ -73,7 +69,7 @@ x[c("xyz", "def")]
<section id="subsetting-data-frames" data-type="sect2">
<h2>
Subsetting data frames</h2>
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to selecting rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to select rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
<p>Here are a couple of examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
@@ -107,7 +103,7 @@ df[df$x &gt; 1, ]
#&gt; 2 3 f 0.601</pre>
</div>
<p>We'll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn't use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
<p>There's an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we've mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frame interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write <code>data.frame</code>s. So if <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
<p>There's an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we've mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write <code>data.frame</code>. If <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1 &lt;- data.frame(x = 1:3)
df1[, "x"]
@@ -124,7 +120,7 @@ df2[, "x"]
</div>
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df1[, "x", drop = FALSE]
<pre data-type="programlisting" data-code-language="r">df1[, "x" , drop = FALSE]
#&gt; x
#&gt; 1 1
#&gt; 2 2
@@ -159,7 +155,7 @@ df[!is.na(df$x) &amp; df$x &gt; 1, ]</pre>
# same as
df[order(df$x, df$y), ]</pre>
</div>
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to individual sort columns in decreasing order.</p>
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to individually sort columns in decreasing order.</p>
</li>
<li>
<p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
@@ -209,12 +205,12 @@ Exercises</h2>
<h1>
Selecting a single element with <code>$</code> and <code>[[</code>
</h1>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we'll show you how to use <code>[[</code> and <code>$</code> to pull columns out of a data frames, discuss a couple more differences between <code>data.frame</code>s and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we'll show you how to use <code>[[</code> and <code>$</code> to pull columns out of data frames, discuss a couple more differences between <code>data.frame</code>s and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
<section id="data-frames" data-type="sect2">
<h2>
Data frames</h2>
<p><code>[[</code> and <code>$</code> can be used like <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
<p><code>[[</code> and <code>$</code> can be used to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">tb &lt;- tibble(
x = 1:4,
@@ -243,8 +239,8 @@ tb
#&gt; 3 3 1 4
#&gt; 4 4 21 25</pre>
</div>
<p>There are a number other base approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want find the size of the biggest diamond or the possible values of <code>cut</code>, there's no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
<p>There are a number of other base R approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there's no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">max(diamonds$carat)
#&gt; [1] 5.01
@@ -252,6 +248,14 @@ tb
levels(diamonds$cut)
#&gt; [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
</div>
<p>dplyr also provides an equivalent to <code>[[</code>/<code>$</code> that we didn't mention in <a href="#chp-data-transform" data-type="xref">#chp-data-transform</a>: <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt; pull(carat) |&gt; max()
#&gt; [1] 5.01
diamonds |&gt; pull(cut) |&gt; levels()
#&gt; [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
</div>
</section>
<section id="tibbles" data-type="sect2">
@@ -283,7 +287,7 @@ tb$z
<section id="lists" data-type="sect2">
<h2>
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it's important to understand how they differ to <code>[</code>. Let's illustrate the differences with a list named <code>l</code>:</p>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it's important to understand how they differ from <code>[</code>. Let's illustrate the differences with a list named <code>l</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">l &lt;- list(
a = 1:3,
@@ -299,6 +303,9 @@ Lists</h2>
#&gt; List of 2
#&gt; $ a: int [1:3] 1 2 3
#&gt; $ b: chr "a string"
str(l[1])
#&gt; List of 1
#&gt; $ a: int [1:3] 1 2 3
str(l[4])
#&gt; List of 1
#&gt; $ d:List of 2
@@ -376,7 +383,7 @@ Exercises</h2>
<section id="apply-family" data-type="sect1">
<h1>
Apply family</h1>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you'll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and maps are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we'll give you a quick overview of this family so you can recognize them in the wild.</p>
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you'll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we'll give you a quick overview of this family so you can recognize them in the wild.</p>
<p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there's an error.</span>. In fact, because we haven't used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>'s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
<p>There's no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
<div class="cell">
@@ -408,7 +415,7 @@ df
<div class="cell">
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summarise(price = mean(price))
summarize(price = mean(price))
#&gt; # A tibble: 5 × 2
#&gt; cut price
#&gt; &lt;ord&gt; &lt;dbl&gt;
@@ -423,29 +430,29 @@ tapply(diamonds$price, diamonds$cut, mean)
#&gt; 4358.758 3928.864 3981.760 4584.258 3457.542</pre>
</div>
<p>Unfortunately, <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> returns its results in a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it's certainly possible to not do this and just work with free-floating vectors, but in our experience that just delays the work). If you want to see how you might use <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques <a href="https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec">in a gist</a>.</p>
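<p>As a rough sketch of the kind of gymnastics involved, the following turns a single <code>tapply()</code> result into a data frame. It is only an illustration, not the approach from the gist: it uses the built-in <code>mtcars</code> dataset so that it runs without any packages, and the names <code>means</code> and <code>by_cyl</code> are ours.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Mean mpg for each number of cylinders; the group labels become the names
means &lt;- tapply(mtcars$mpg, mtcars$cyl, mean)

# Rebuild a data frame by hand: the names hold the groups (as character
# strings), and as.vector() drops the names from the summary values
by_cyl &lt;- data.frame(
  cyl = names(means),
  mpg = as.vector(means)
)
by_cyl</pre>
</div>
<p>Note that the grouping variable comes back as a character vector, so you would need to convert it yourself if you want the original type.</p>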
<p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out of <code>apply(df, 2, something)</code> which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
<p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out for <code>apply(df, 2, something)</code>, which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
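<p>To see why <code>apply()</code> can be risky with a data frame, here is a minimal sketch using a small made-up data frame (the name <code>df_mixed</code> and its contents are hypothetical). <code>apply()</code> first converts its input to a matrix, and a matrix can hold only one type, so mixing numeric and character columns turns everything into character. <code>lapply()</code> sees each column as it is:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical example data: one integer column and one character column
df_mixed &lt;- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# apply() converts the data frame to a character matrix before calling
# class(), so every column is reported as "character"
apply(df_mixed, 2, class)

# lapply() works column by column, so x is still reported as "integer"
lapply(df_mixed, class)</pre>
</div>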
</section>
<section id="for-loops" data-type="sect1">
<h1>
For loops</h1>
<p>For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are powerful and general tool that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:</p>
<p><code>for</code> loops are the fundamental building block of iteration that both the apply and map families use under the hood. <code>for</code> loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a <code>for</code> loop looks like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">for (element in vector) {
# do something with element
}</pre>
</div>
<p>The most straightforward use of <code>for()</code> loops is achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using <code>walk()</code>:</p>
<p>The most straightforward use of <code>for</code> loops is to achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using <code>walk()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths |&gt; walk(append_file)</pre>
</div>
<p>We could have used a for loop:</p>
<p>We could have used a <code>for</code> loop:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">for (path in paths) {
append_file(path)
}</pre>
</div>
<p>Things get a little trickier if you want to save the output of the for-loop, for example, reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
<p>Things get a little trickier if you want to save the output of the <code>for</code> loop, for example, reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files &lt;- map(paths, readxl::read_excel)</pre>
@@ -486,23 +493,23 @@ for (path in paths) {
out &lt;- rbind(out, readxl::read_excel(path))
}</pre>
</div>
<p>We recommend avoiding this pattern because it can become very slow when the vector is very long. This the source of the persistent canard that <code>for</code> loops are slow: they're not, but iteratively growing a vector is.</p>
<p>We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that <code>for</code> loops are slow: they're not, but iteratively growing a vector is.</p>
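<p>The difference between growing and preallocating can be sketched with a plain numeric vector. This is an illustration under our own assumptions (the names <code>n</code>, <code>grown</code>, and <code>filled</code> are made up, and squaring stands in for real work): both loops produce the same result, but the first copies the whole vector on every iteration while the second allocates it once.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">n &lt;- 1000

# Growing: each call to c() builds a new, longer copy of the vector
grown &lt;- c()
for (i in seq_len(n)) {
  grown &lt;- c(grown, i^2)
}

# Preallocating: create the full-length vector once, then fill it in
filled &lt;- vector("double", n)
for (i in seq_len(n)) {
  filled[[i]] &lt;- i^2
}

# Same values either way; only the amount of copying differs
identical(grown, filled)</pre>
</div>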
</section>
<section id="plots" data-type="sect1">
<h1>
Plots</h1>
<p>Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, modern look. However, base R plotting functions can still be useful because they're so concise: it's very little typing to do a basic exploratory plot.</p>
<p>Many R users who don't otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they're so concise: it takes very little typing to do a basic exploratory plot.</p>
<p>There are two main types of base plot you'll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here's a quick example from the diamonds dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">hist(diamonds$carat)
plot(diamonds$carat, diamonds$price)</pre>
<div class="cell-output-display">
<p><img src="base-R_files/figure-html/unnamed-chunk-39-1.png" width="576"/></p>
<p><img src="base-R_files/figure-html/unnamed-chunk-40-1.png" width="576"/></p>
</div>
<div class="cell-output-display">
<p><img src="base-R_files/figure-html/unnamed-chunk-39-2.png" width="576"/></p>
<p><img src="base-R_files/figure-html/unnamed-chunk-40-2.png" width="576"/></p>
</div>
</div>
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
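<p>As a small sketch of what “some other technique” might look like, the calls below all hand a bare vector to a base plotting function. We use the built-in <code>mtcars</code> dataset here (our choice, so that the code runs without loading any packages); <code>[[</code> and <code>with()</code> are the alternatives to <code>$</code> being illustrated.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Three equivalent ways to pass the same column to hist()
hist(mtcars$mpg)
hist(mtcars[["mpg"]])
with(mtcars, hist(mpg))

# with() evaluates the call using the columns of mtcars, which saves
# typing when a plot needs more than one column
with(mtcars, plot(wt, mpg))</pre>
</div>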
@@ -511,8 +518,8 @@ plot(diamonds$carat, diamonds$price)</pre>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, we've shown you selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
<p>This chapter concludes the programming section of the book. You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interest in programming and that you're are looking forward to learning more outside of this book.</p>
<p>In this chapter, we've shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
<p>This chapter concludes the programming section of the book. You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.</p>
</section>