Fix code language

This commit is contained in:
Hadley Wickham
2022-11-18 11:26:25 -06:00
parent 69b4597f3b
commit 868a35ca71
29 changed files with 912 additions and 907 deletions

@@ -4,7 +4,7 @@
<h2>
Prerequisites</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
@@ -21,27 +21,27 @@ Subsetting vectors</h2>
<ol type="1"><li>
<p><strong>A vector of positive integers</strong>. Subsetting with positive integers keeps the elements at those positions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c("one", "two", "three", "four", "five")
<pre data-type="programlisting" data-code-language="r">x &lt;- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
#&gt; [1] "three" "two" "five"</pre>
</div>
<p>By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x[c(1, 1, 5, 5, 5, 2)]
<pre data-type="programlisting" data-code-language="r">x[c(1, 1, 5, 5, 5, 2)]
#&gt; [1] "one" "one" "five" "five" "five" "two"</pre>
</div>
</li>
<li>
<p><strong>A vector of negative integers</strong>. Negative values drop the elements at the specified positions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x[c(-1, -3, -5)]
<pre data-type="programlisting" data-code-language="r">x[c(-1, -3, -5)]
#&gt; [1] "two" "four"</pre>
</div>
</li>
<li>
<p><strong>A logical vector</strong>. Subsetting with a logical vector keeps all values corresponding to a <code>TRUE</code> value. This is most often useful in conjunction with the comparison functions.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(10, 3, NA, 5, 8, 1, NA)
<pre data-type="programlisting" data-code-language="r">x &lt;- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
!is.na(x)
@@ -60,7 +60,7 @@ x[x %% 2 == 0]
<li>
<p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x &lt;- c(abc = 1, def = 2, xyz = 5)
<pre data-type="programlisting" data-code-language="r">x &lt;- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
#&gt; xyz def
#&gt; 5 2</pre>
@@ -76,7 +76,7 @@ Subsetting data frames</h2>
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to select rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
<p>Here are a couple of examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = 1:3,
y = c("a", "e", "f"),
z = runif(3)
@@ -109,7 +109,7 @@ df[df$x &gt; 1, ]
<p>We’ll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn’t use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frames interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>s. So if <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>cols</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1 &lt;- data.frame(x = 1:3)
<pre data-type="programlisting" data-code-language="r">df1 &lt;- data.frame(x = 1:3)
df1[, "x"]
#&gt; [1] 1 2 3
@@ -124,7 +124,7 @@ df2[, "x"]
</div>
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df1[, "x", drop = FALSE]
<pre data-type="programlisting" data-code-language="r">df1[, "x", drop = FALSE]
#&gt; x
#&gt; 1 1
#&gt; 2 2
@@ -139,7 +139,7 @@ dplyr equivalents</h2>
<ul><li>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
x = c(2, 3, 1, 1, NA),
y = letters[1:5],
z = runif(5)
@@ -154,7 +154,7 @@ df[!is.na(df$x) &amp; df$x &gt; 1, ]</pre>
<li>
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> is equivalent to subsetting the rows with an integer vector, usually created with <code><a href="https://rdrr.io/r/base/order.html">order()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; arrange(x, y)
<pre data-type="programlisting" data-code-language="r">df |&gt; arrange(x, y)
# same as
df[order(df$x, df$y), ]</pre>
@@ -164,7 +164,7 @@ df[order(df$x, df$y), ]</pre>
<li>
<p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt; select(x, z)
<pre data-type="programlisting" data-code-language="r">df |&gt; select(x, z)
# same as
df[, c("x", "z")]</pre>
@@ -172,7 +172,7 @@ df[, c("x", "z")]</pre>
</li>
</ul><p>Base R also provides a function that combines the features of <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code>.</span> called <code><a href="https://rdrr.io/r/base/subset.html">subset()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
<pre data-type="programlisting" data-code-language="r">df |&gt;
filter(x &gt; 1) |&gt;
select(y, z)
#&gt; # A tibble: 2 × 2
@@ -216,7 +216,7 @@ Selecting a single element<code>$</code> and <code>[[</code>
Data frames</h2>
<p><code>[[</code> and <code>$</code> can be used like <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb &lt;- tibble(
<pre data-type="programlisting" data-code-language="r">tb &lt;- tibble(
x = 1:4,
y = c(10, 4, 1, 21)
)
@@ -233,7 +233,7 @@ tb$x
</div>
<p>They can also be used to create new columns, the base R equivalent of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb$z &lt;- tb$x + tb$y
<pre data-type="programlisting" data-code-language="r">tb$z &lt;- tb$x + tb$y
tb
#&gt; # A tibble: 4 × 3
#&gt; x y z
@@ -246,7 +246,7 @@ tb
<p>There are a number of other base approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
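<p>As a rough illustration of those alternatives, the sketch below computes the same <code>z</code> column three ways. The data frame <code>df</code> and the names <code>df_t</code> and <code>df_w</code> are made up for this example; the values mirror the <code>tb</code> example above.</p>

```r
# Hypothetical data for illustration (same values as the tb example)
df <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# transform() returns a modified copy; columns are referred to by bare name
df_t <- transform(df, z = x + y)

# within() evaluates a block of assignments inside the data frame
# and returns the modified copy
df_w <- within(df, {
  z <- x + y
})

# with() only evaluates an expression and returns its value,
# so the result has to be assigned back explicitly
df$z <- with(df, x + y)

df$z
```

<p>All three produce the same column; which one reads best is largely a matter of taste.</p>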
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">max(diamonds$carat)
<pre data-type="programlisting" data-code-language="r">max(diamonds$carat)
#&gt; [1] 5.01
levels(diamonds$cut)
@@ -259,7 +259,7 @@ levels(diamonds$cut)
Tibbles</h2>
<p>There are a couple of important differences between tibbles and base <code>data.frame</code>s when it comes to <code>$</code>. Data frames match the prefix of any variable names (so-called <strong>partial matching</strong>) and don’t complain if a column doesn’t exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- data.frame(x1 = 1)
<pre data-type="programlisting" data-code-language="r">df &lt;- data.frame(x1 = 1)
df$x
#&gt; Warning in df$x: partial match of 'x' to 'x1'
#&gt; [1] 1
@@ -268,7 +268,7 @@ df$z
</div>
<p>Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">tb &lt;- tibble(x1 = 1)
<pre data-type="programlisting" data-code-language="r">tb &lt;- tibble(x1 = 1)
tb$x
#&gt; Warning: Unknown or uninitialised column: `x`.
@@ -285,7 +285,7 @@ tb$z
Lists</h2>
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ from <code>[</code>. Let’s illustrate the differences with a list named <code>l</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">l &lt;- list(
<pre data-type="programlisting" data-code-language="r">l &lt;- list(
a = 1:3,
b = "a string",
c = pi,
@@ -295,7 +295,7 @@ Lists</h2>
<ul><li>
<p><code>[</code> extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str(l[1:2])
<pre data-type="programlisting" data-code-language="r">str(l[1:2])
#&gt; List of 2
#&gt; $ a: int [1:3] 1 2 3
#&gt; $ b: chr "a string"
@@ -310,7 +310,7 @@ str(l[4])
<li>
<p><code>[[</code> and <code>$</code> extract a single component from a list. They remove a level of hierarchy from the list.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">str(l[[1]])
<pre data-type="programlisting" data-code-language="r">str(l[[1]])
#&gt; int [1:3] 1 2 3
str(l[[4]])
#&gt; List of 2
@@ -348,7 +348,7 @@ str(l$a)
</div>
<p>This same principle applies when you use 1d <code>[</code> with a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(x = 1:3, y = 3:5)
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(x = 1:3, y = 3:5)
# returns a one-column data frame
df["x"]
@@ -380,7 +380,7 @@ Apply family</h1>
<p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.</span>. In fact, because we haven’t used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>’s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
<p>There’s no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
# First find numeric columns
num_cols &lt;- sapply(df, is.numeric)
@@ -399,14 +399,14 @@ df
<p>The code above uses a new function, <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code>. It’s similar to <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called <code><a href="https://purrr.tidyverse.org/reference/map.html">map_vec()</a></code> that we didn’t mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
<p>Base R provides a stricter version of <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> called <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> call above with this <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> where we specify that we expect <code><a href="https://rdrr.io/r/base/numeric.html">is.numeric()</a></code> to return a logical vector of length 1:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">vapply(df, is.numeric, logical(1))
<pre data-type="programlisting" data-code-language="r">vapply(df, is.numeric, logical(1))
#&gt; a b c d e
#&gt; TRUE TRUE FALSE FALSE TRUE</pre>
</div>
<p>The distinction between <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> and <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.</p>
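<p>One way to see why this matters inside a function is to feed both versions an unusual input. The sketch below is only an illustration: the helper names <code>numeric_cols_s()</code> and <code>numeric_cols_v()</code> are hypothetical, and a zero-column data frame stands in for the "unusual input".</p>

```r
# Hypothetical helpers: report which columns of a data frame are numeric
numeric_cols_s <- function(df) sapply(df, is.numeric)
numeric_cols_v <- function(df) vapply(df, is.numeric, logical(1))

df <- data.frame(a = 1, b = "x")
empty <- data.frame()

# On ordinary input both return a named logical vector
numeric_cols_s(df)
numeric_cols_v(df)

# With zero columns there is nothing for sapply() to simplify,
# so it hands back an empty list
class(numeric_cols_s(empty))

# vapply() still returns the type it was promised: a logical vector
class(numeric_cols_v(empty))
```

<p>Code downstream of <code>numeric_cols_v()</code> can rely on getting a logical vector back, whereas code downstream of <code>numeric_cols_s()</code> has to cope with a list in the edge case.</p>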
<p>Another important member of the apply family is <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> which computes a single grouped summary:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
<pre data-type="programlisting" data-code-language="r">diamonds |&gt;
group_by(cut) |&gt;
summarise(price = mean(price))
#&gt; # A tibble: 5 × 2
@@ -431,43 +431,43 @@ tapply(diamonds$price, diamonds$cut, mean)
For loops</h1>
<p>For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are a powerful and general tool that is important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">for (element in vector) {
<pre data-type="programlisting" data-code-language="r">for (element in vector) {
# do something with element
}</pre>
</div>
<p>The most straightforward use of <code>for()</code> loops is to achieve the same effect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using walk:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths |&gt; walk(append_file)</pre>
<pre data-type="programlisting" data-code-language="r">paths |&gt; walk(append_file)</pre>
</div>
<p>We could have used a for loop:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">for (path in paths) {
<pre data-type="programlisting" data-code-language="r">for (path in paths) {
append_file(path)
}</pre>
</div>
<p>Things get a little trickier if you want to save the output of the for-loop, for example reading all of the Excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
<pre data-type="programlisting" data-code-language="r">paths &lt;- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files &lt;- map(paths, readxl::read_excel)</pre>
</div>
<p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as <code>paths</code>, which we can create with <code><a href="https://rdrr.io/r/base/vector.html">vector()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">files &lt;- vector("list", length(paths))</pre>
<pre data-type="programlisting" data-code-language="r">files &lt;- vector("list", length(paths))</pre>
</div>
<p>Then instead of iterating over the elements of <code>paths</code>, we’ll iterate over their indices, using <code><a href="https://rdrr.io/r/base/seq.html">seq_along()</a></code> to generate one index for each element of <code>paths</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">seq_along(paths)
<pre data-type="programlisting" data-code-language="r">seq_along(paths)
#&gt; [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre>
</div>
<p>Using the indices is important because it allows us to link each position in the input with the corresponding position in the output:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">for (i in seq_along(paths)) {
<pre data-type="programlisting" data-code-language="r">for (i in seq_along(paths)) {
files[[i]] &lt;- readxl::read_excel(paths[[i]])
}</pre>
</div>
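<p>The same preallocate-then-fill pattern can be tried without any files on disk. In the sketch below, the character vector <code>inputs</code> is a hypothetical stand-in for <code>paths</code>, and a small <code>data.frame()</code> call stands in for <code>readxl::read_excel()</code>.</p>

```r
# Hypothetical inputs standing in for file paths
inputs <- c("a", "bb", "ccc")

# Preallocate a list with one slot per input
out <- vector("list", length(inputs))

for (i in seq_along(inputs)) {
  # Store each result at the position of the input that produced it
  out[[i]] <- data.frame(
    input = inputs[[i]],
    n_chars = nchar(inputs[[i]])
  )
}

length(out)
out[[2]]$n_chars
```

<p>Because <code>seq_along()</code> returns an empty vector for an empty input, the loop body simply never runs when there is nothing to process.</p>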
<p>To combine the list of tibbles into a single tibble you can use <code><a href="https://rdrr.io/r/base/do.call.html">do.call()</a></code> + <code><a href="https://rdrr.io/r/base/cbind.html">rbind()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">do.call(rbind, files)
<pre data-type="programlisting" data-code-language="r">do.call(rbind, files)
#&gt; # A tibble: 1,704 × 5
#&gt; country continent lifeExp pop gdpPercap
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
@@ -481,7 +481,7 @@ files &lt;- map(paths, readxl::read_excel)</pre>
</div>
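<p>To make the <code>do.call()</code> step less opaque, here is a small self-contained sketch. The list <code>pieces</code> and its columns are invented for illustration; the point is only that <code>do.call(rbind, pieces)</code> behaves like calling <code>rbind()</code> with every list element as a separate argument.</p>

```r
# Hypothetical list of small data frames with matching columns
pieces <- list(
  data.frame(id = 1:2, value = c(0.5, 1.5)),
  data.frame(id = 3L, value = 2.5),
  data.frame(id = 4:5, value = c(3.5, 4.5))
)

# do.call() passes each element of the list to rbind() as its own argument,
# i.e. rbind(pieces[[1]], pieces[[2]], pieces[[3]])
combined <- do.call(rbind, pieces)

nrow(combined)
combined$id
```

<p>This works for a list of any length, which is why it is handy when the number of pieces is not known in advance.</p>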
<p>Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">out &lt;- NULL
<pre data-type="programlisting" data-code-language="r">out &lt;- NULL
for (path in paths) {
out &lt;- rbind(out, readxl::read_excel(path))
}</pre>
@@ -495,7 +495,7 @@ Plots</h1>
<p>Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise: it’s very little typing to do a basic exploratory plot.</p>
<p>There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here’s a quick example from the diamonds dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">hist(diamonds$carat)
<pre data-type="programlisting" data-code-language="r">hist(diamonds$carat)
plot(diamonds$carat, diamonds$price)</pre>
<div class="cell-output-display">