520 lines
32 KiB
HTML
520 lines
32 KiB
HTML
<section data-type="chapter" id="chp-base-R">
|
||
<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p>
|
||
<section id="prerequisites" data-type="sect2">
|
||
<h2>
|
||
Prerequisites</h2>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section id="sec-subset-many" data-type="sect1">
|
||
<h1>
|
||
Selecting multiple elements with<code>[</code>
|
||
</h1>
|
||
<p><code>[</code> is used to extract sub-components from vectors and data frames, and is called like <code>x[i]</code> or <code>x[i, j]</code>. In this section, we’ll introduce you to the power of <code>[</code>, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of <code>[</code>.</p>
|
||
|
||
<section id="subsetting-vectors" data-type="sect2">
|
||
<h2>
|
||
Subsetting vectors</h2>
|
||
<p>There are five main types of things that you can subset a vector with, i.e. that can be the <code>i</code> in <code>x[i]</code>:</p>
|
||
<ol type="1"><li>
|
||
<p><strong>A vector of positive integers</strong>. Subsetting with positive integers keeps the elements at those positions:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">x <- c("one", "two", "three", "four", "five")
|
||
x[c(3, 2, 5)]
|
||
#> [1] "three" "two" "five"</pre>
|
||
</div>
|
||
<p>By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">x[c(1, 1, 5, 5, 5, 2)]
|
||
#> [1] "one" "one" "five" "five" "five" "two"</pre>
|
||
</div>
|
||
</li>
|
||
<li>
|
||
<p><strong>A vector of negative integers</strong>. Negative values drop the elements at the specified positions:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">x[c(-1, -3, -5)]
|
||
#> [1] "two" "four"</pre>
|
||
</div>
|
||
</li>
|
||
<li>
|
||
<p><strong>A logical vector</strong>. Subsetting with a logical vector keeps all values corresponding to a <code>TRUE</code> value. This is most often useful in conjunction with the comparison functions.</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">x <- c(10, 3, NA, 5, 8, 1, NA)
|
||
|
||
# All non-missing values of x
|
||
!is.na(x)
|
||
#> [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE
|
||
x[!is.na(x)]
|
||
#> [1] 10 3 5 8 1
|
||
|
||
# All even (or missing!) values of x
|
||
x %% 2 == 0
|
||
#> [1] TRUE FALSE NA FALSE TRUE FALSE NA
|
||
x[x %% 2 == 0]
|
||
#> [1] 10 NA 8 NA</pre>
|
||
</div>
|
||
<p>Note that, unlike <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code>, <code>NA</code> indices will be included in the output as <code>NA</code>s.</p>
|
||
</li>
|
||
<li>
|
||
<p><strong>A character vector</strong>. If you have a named vector, you can subset it with a character vector:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">x <- c(abc = 1, def = 2, xyz = 5)
|
||
x[c("xyz", "def")]
|
||
#> xyz def
|
||
#> 5 2</pre>
|
||
</div>
|
||
<p>As with subsetting with positive integers, you can use a character vector to duplicate individual entries.</p>
|
||
</li>
|
||
<li><p><strong>Nothing</strong>. The final type of subsetting is nothing, <code>x[]</code>, which returns the complete <code>x</code>. This is not useful for subsetting vectors, but as we’ll see shortly it is useful when subsetting 2d structures like tibbles.</p></li>
|
||
</ol></section>
|
||
|
||
<section id="subsetting-data-frames" data-type="sect2">
|
||
<h2>
|
||
Subsetting data frames</h2>
|
||
<p>There are quite a few different ways<span data-type="footnote">Read <a href="https://adv-r.hadley.nz/subsetting.html#subset-multiple" class="uri">https://adv-r.hadley.nz/subsetting.html#subset-multiple</a> to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.</span> that you can use <code>[</code> with a data frame, but the most important way is to selecting rows and columns independently with <code>df[rows, cols]</code>. Here <code>rows</code> and <code>cols</code> are vectors as described above. For example, <code>df[rows, ]</code> and <code>df[, cols]</code> select just rows or just columns, using the empty subset to preserve the other dimension.</p>
|
||
<p>Here are a couple of examples:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||
x = 1:3,
|
||
y = c("a", "e", "f"),
|
||
z = runif(3)
|
||
)
|
||
|
||
# Select first row and second column
|
||
df[1, 2]
|
||
#> # A tibble: 1 × 1
|
||
#> y
|
||
#> <chr>
|
||
#> 1 a
|
||
|
||
# Select all rows and columns x and y
|
||
df[, c("x" , "y")]
|
||
#> # A tibble: 3 × 2
|
||
#> x y
|
||
#> <int> <chr>
|
||
#> 1 1 a
|
||
#> 2 2 e
|
||
#> 3 3 f
|
||
|
||
# Select rows where `x` is greater than 1 and all columns
|
||
df[df$x > 1, ]
|
||
#> # A tibble: 2 × 3
|
||
#> x y z
|
||
#> <int> <chr> <dbl>
|
||
#> 1 2 e 0.834
|
||
#> 2 3 f 0.601</pre>
|
||
</div>
|
||
<p>We’ll come back to <code>$</code> shortly, but you should be able to guess what <code>df$x</code> does from the context: it extracts the <code>x</code> variable from <code>df</code>. We need to use it here because <code>[</code> doesn’t use tidy evaluation, so you need to be explicit about the source of the <code>x</code> variable.</p>
|
||
<p>There’s an important difference between tibbles and data frames when it comes to <code>[</code>. In this book we’ve mostly used tibbles, which <em>are</em> data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frame interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write <code>data.frame</code>s. So if <code>df</code> is a <code>data.frame</code>, then <code>df[, cols]</code> will return a vector if <code>col</code> selects a single column and a data frame if it selects more than one column. If <code>df</code> is a tibble, then <code>[</code> will always return a tibble.</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df1 <- data.frame(x = 1:3)
|
||
df1[, "x"]
|
||
#> [1] 1 2 3
|
||
|
||
df2 <- tibble(x = 1:3)
|
||
df2[, "x"]
|
||
#> # A tibble: 3 × 1
|
||
#> x
|
||
#> <int>
|
||
#> 1 1
|
||
#> 2 2
|
||
#> 3 3</pre>
|
||
</div>
|
||
<p>One way to avoid this ambiguity with <code>data.frame</code>s is to explicitly specify <code>drop = FALSE</code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df1[, "x", drop = FALSE]
|
||
#> x
|
||
#> 1 1
|
||
#> 2 2
|
||
#> 3 3</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section id="dplyr-equivalents" data-type="sect2">
|
||
<h2>
|
||
dplyr equivalents</h2>
|
||
<p>A number of dplyr verbs are special cases of <code>[</code>:</p>
|
||
<ul><li>
|
||
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
|
||
x = c(2, 3, 1, 1, NA),
|
||
y = letters[1:5],
|
||
z = runif(5)
|
||
)
|
||
df |> filter(x > 1)
|
||
|
||
# same as
|
||
df[!is.na(df$x) & df$x > 1, ]</pre>
|
||
</div>
|
||
<p>Another common technique in the wild is to use <code><a href="https://rdrr.io/r/base/which.html">which()</a></code> for its side-effect of dropping missing values: <code>df[which(df$x > 1), ]</code>.</p>
|
||
</li>
|
||
<li>
|
||
<p><code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> is equivalent to subsetting the rows with an integer vector, usually created with <code><a href="https://rdrr.io/r/base/order.html">order()</a></code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df |> arrange(x, y)
|
||
|
||
# same as
|
||
df[order(df$x, df$y), ]</pre>
|
||
</div>
|
||
<p>You can use <code>order(decreasing = TRUE)</code> to sort all columns in descending order or <code>-rank(col)</code> to individual sort columns in decreasing order.</p>
|
||
</li>
|
||
<li>
|
||
<p>Both <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> are similar to subsetting the columns with a character vector:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df |> select(x, z)
|
||
|
||
# same as
|
||
df[, c("x", "z")]</pre>
|
||
</div>
|
||
</li>
|
||
</ul><p>Base R also provides a function that combines the features of <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code><span data-type="footnote">But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like <code><a href="https://tidyselect.r-lib.org/reference/starts_with.html">starts_with()</a></code>.</span> called <code><a href="https://rdrr.io/r/base/subset.html">subset()</a></code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df |>
|
||
filter(x > 1) |>
|
||
select(y, z)
|
||
#> # A tibble: 2 × 2
|
||
#> y z
|
||
#> <chr> <dbl>
|
||
#> 1 a 0.157
|
||
#> 2 b 0.00740
|
||
|
||
# same as
|
||
df |> subset(x > 1, c(y, z))
|
||
#> # A tibble: 2 × 2
|
||
#> y z
|
||
#> <chr> <dbl>
|
||
#> 1 a 0.157
|
||
#> 2 b 0.00740</pre>
|
||
</div>
|
||
<p>This function was the inspiration for much of dplyr’s syntax.</p>
|
||
</section>
|
||
|
||
<section id="exercises" data-type="sect2">
|
||
<h2>
|
||
Exercises</h2>
|
||
<ol type="1"><li>
|
||
<p>Create functions that take a vector as input and return:</p>
|
||
<ol type="a"><li>The elements at even numbered positions.</li>
|
||
<li>Every element except the last value.</li>
|
||
<li>Only even values (and no missing values).</li>
|
||
</ol></li>
|
||
<li><p>Why is <code>x[-which(x > 0)]</code> not the same as <code>x[x <= 0]</code>? Read the documentation for <code><a href="https://rdrr.io/r/base/which.html">which()</a></code> and do some experiments to figure it out.</p></li>
|
||
</ol></section>
|
||
</section>
|
||
|
||
<section id="sec-subset-one" data-type="sect1">
|
||
<h1>
|
||
Selecting a single element<code>$</code> and <code>[[</code>
|
||
</h1>
|
||
<p><code>[</code>, which selects many elements, is paired with <code>[[</code> and <code>$</code>, which extract a single element. In this section, we’ll show you how to use <code>[[</code> and <code>$</code> to pull columns out of a data frames, discuss a couple more differences between <code>data.frames</code> and tibbles, and emphasize some important differences between <code>[</code> and <code>[[</code> when used with lists.</p>
|
||
|
||
<section id="data-frames" data-type="sect2">
|
||
<h2>
|
||
Data frames</h2>
|
||
<p><code>[[</code> and <code>$</code> can be used like <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> to extract columns out of a data frame. <code>[[</code> can access by position or by name, and <code>$</code> is specialized for access by name:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">tb <- tibble(
|
||
x = 1:4,
|
||
y = c(10, 4, 1, 21)
|
||
)
|
||
|
||
# by position
|
||
tb[[1]]
|
||
#> [1] 1 2 3 4
|
||
|
||
# by name
|
||
tb[["x"]]
|
||
#> [1] 1 2 3 4
|
||
tb$x
|
||
#> [1] 1 2 3 4</pre>
|
||
</div>
|
||
<p>They can also be used to create new columns, the base R equivalent of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">tb$z <- tb$x + tb$y
|
||
tb
|
||
#> # A tibble: 4 × 3
|
||
#> x y z
|
||
#> <int> <dbl> <dbl>
|
||
#> 1 1 10 11
|
||
#> 2 2 4 6
|
||
#> 3 3 1 4
|
||
#> 4 4 21 25</pre>
|
||
</div>
|
||
<p>There are a number other base approaches to creating new columns including with <code><a href="https://rdrr.io/r/base/transform.html">transform()</a></code>, <code><a href="https://rdrr.io/r/base/with.html">with()</a></code>, and <code><a href="https://rdrr.io/r/base/with.html">within()</a></code>. Hadley collected a few examples at <a href="https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf" class="uri">https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf</a>.</p>
|
||
<p>Using <code>$</code> directly is convenient when performing quick summaries. For example, if you just want find the size of the biggest diamond or the possible values of <code>cut</code>, there’s no need to use <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">max(diamonds$carat)
|
||
#> [1] 5.01
|
||
|
||
levels(diamonds$cut)
|
||
#> [1] "Fair" "Good" "Very Good" "Premium" "Ideal"</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section id="tibbles" data-type="sect2">
|
||
<h2>
|
||
Tibbles</h2>
|
||
<p>There are a couple of important differences between tibbles and base <code>data.frame</code>s when it comes to <code>$</code>. Data frames match the prefix of any variable names (so-called <strong>partial matching</strong>) and don’t complain if a column doesn’t exist:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df <- data.frame(x1 = 1)
|
||
df$x
|
||
#> Warning in df$x: partial match of 'x' to 'x1'
|
||
#> [1] 1
|
||
df$z
|
||
#> NULL</pre>
|
||
</div>
|
||
<p>Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">tb <- tibble(x1 = 1)
|
||
|
||
tb$x
|
||
#> Warning: Unknown or uninitialised column: `x`.
|
||
#> NULL
|
||
tb$z
|
||
#> Warning: Unknown or uninitialised column: `z`.
|
||
#> NULL</pre>
|
||
</div>
|
||
<p>For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.</p>
|
||
</section>
|
||
|
||
<section id="lists" data-type="sect2">
|
||
<h2>
|
||
Lists</h2>
|
||
<p><code>[[</code> and <code>$</code> are also really important for working with lists, and it’s important to understand how they differ to <code>[</code>. Lets illustrate the differences with a list named <code>l</code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">l <- list(
|
||
a = 1:3,
|
||
b = "a string",
|
||
c = pi,
|
||
d = list(-1, -5)
|
||
)</pre>
|
||
</div>
|
||
<ul><li>
|
||
<p><code>[</code> extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">str(l[1:2])
|
||
#> List of 2
|
||
#> $ a: int [1:3] 1 2 3
|
||
#> $ b: chr "a string"
|
||
str(l[4])
|
||
#> List of 1
|
||
#> $ d:List of 2
|
||
#> ..$ : num -1
|
||
#> ..$ : num -5</pre>
|
||
</div>
|
||
<p>Like with vectors, you can subset with a logical, integer, or character vector.</p>
|
||
</li>
|
||
<li>
|
||
<p><code>[[</code> and <code>$</code> extract a single component from a list. They remove a level of hierarchy from the list.</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">str(l[[1]])
|
||
#> int [1:3] 1 2 3
|
||
str(l[[4]])
|
||
#> List of 2
|
||
#> $ : num -1
|
||
#> $ : num -5
|
||
|
||
str(l$a)
|
||
#> int [1:3] 1 2 3</pre>
|
||
</div>
|
||
</li>
|
||
</ul><p>The difference between <code>[</code> and <code>[[</code> is particularly important for lists because <code>[[</code> drills down into the list while <code>[</code> returns a new, smaller list. To help you remember the difference, take a look at the an unusual pepper shaker shown in <a href="#fig-pepper-1" data-type="xref">#fig-pepper-1</a>. If this pepper shaker is your list <code>pepper</code>, then, <code>pepper[1]</code> is a pepper shaker containing a single pepper packet, as in <a href="#fig-pepper-2" data-type="xref">#fig-pepper-2</a>. If we suppose this pepper shaker is a list <code>pepper</code>, then, <code>pepper[1]</code> is a pepper shaker containing a single pepper packet, as in <a href="#fig-pepper-2" data-type="xref">#fig-pepper-2</a>. <code>pepper[2]</code> would look the same, but would contain the second packet. <code>pepper[1:2]</code> would be a pepper shaker containing two pepper packets. <code>pepper[[1]]</code> would extract the pepper packet itself, as in <a href="#fig-pepper-3" data-type="xref">#fig-pepper-3</a>.</p>
|
||
<div class="cell">
|
||
<div class="cell-output-display">
|
||
|
||
<figure id="fig-pepper-1"><p><img src="images/pepper.jpg" style="width:25.0%" alt="A photo of a glass pepper shaker. Instead of the pepper shaker containing pepper, it contains many packets of pepper."/></p>
|
||
<figcaption>A pepper shaker that Hadley once found in his hotel room.</figcaption>
|
||
</figure>
|
||
</div>
|
||
</div>
|
||
<div class="cell">
|
||
<div class="cell-output-display">
|
||
|
||
<figure id="fig-pepper-2"><p><img src="images/pepper-1.jpg" style="width:25.0%" alt="A photo of the glass pepper shaker containing just one packet of pepper."/></p>
|
||
<figcaption><code>pepper[1]</code></figcaption>
|
||
</figure>
|
||
</div>
|
||
</div>
|
||
<div class="cell">
|
||
<div class="cell-output-display">
|
||
|
||
<figure id="fig-pepper-3"><p><img src="images/pepper-2.jpg" style="width:25.0%" alt="A photo of single packet of pepper."/></p>
|
||
<figcaption><code>pepper[[1]]</code></figcaption>
|
||
</figure>
|
||
</div>
|
||
</div>
|
||
<p>This same principle applies when you use 1d <code>[</code> with a data frame:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df <- tibble(x = 1:3, y = 3:5)
|
||
|
||
# returns a one-column data frame
|
||
df["x"]
|
||
#> # A tibble: 3 × 1
|
||
#> x
|
||
#> <int>
|
||
#> 1 1
|
||
#> 2 2
|
||
#> 3 3
|
||
|
||
# returns the contents of x
|
||
df[["x"]]
|
||
#> [1] 1 2 3</pre>
|
||
</div>
|
||
</section>
|
||
|
||
<section id="exercises-1" data-type="sect2">
|
||
<h2>
|
||
Exercises</h2>
|
||
<ol type="1"><li><p>What happens when you use <code>[[</code> with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?</p></li>
|
||
<li><p>What would <code>pepper[[1]][1]</code> be? What about <code>pepper[[1]][[1]]</code>?</p></li>
|
||
</ol></section>
|
||
</section>
|
||
|
||
<section id="apply-family" data-type="sect1">
|
||
<h1>
|
||
Apply family</h1>
|
||
<p>In <a href="#chp-iteration" data-type="xref">#chp-iteration</a>, you learned tidyverse techniques for iteration like <code><a href="https://dplyr.tidyverse.org/reference/across.html">dplyr::across()</a></code> and the map family of functions. In this section, you’ll learn about their base equivalents, the <strong>apply family</strong>. In this context apply and maps are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.</p>
|
||
<p>The most important member of this family is <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>, which is very similar to <code><a href="https://purrr.tidyverse.org/reference/map.html">purrr::map()</a></code><span data-type="footnote">It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.</span>. In fact, because we haven’t used any of <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code>’s more advanced features, you can replace every <code><a href="https://purrr.tidyverse.org/reference/map.html">map()</a></code> call in <a href="#chp-iteration" data-type="xref">#chp-iteration</a> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>.</p>
|
||
<p>There’s no exact base R equivalent to <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> but you can get close by using <code>[</code> with <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code>. This works because under the hood, data frames are lists of columns, so calling <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> on a data frame applies the function to each column.</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
|
||
|
||
# First find numeric columns
|
||
num_cols <- sapply(df, is.numeric)
|
||
num_cols
|
||
#> a b c d e
|
||
#> TRUE TRUE FALSE FALSE TRUE
|
||
|
||
# Then transform each column with lapply() then replace the original values
|
||
df[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
|
||
df
|
||
#> # A tibble: 1 × 5
|
||
#> a b c d e
|
||
#> <dbl> <dbl> <chr> <chr> <dbl>
|
||
#> 1 2 4 a b 8</pre>
|
||
</div>
|
||
<p>The code above uses a new function, <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code>. It’s similar to <code><a href="https://rdrr.io/r/base/lapply.html">lapply()</a></code> but it always tries to simplify the result, hence the <code>s</code> in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called <code><a href="https://purrr.tidyverse.org/reference/map.html">map_vec()</a></code> that we didn’t mention in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
|
||
<p>Base R provides a stricter version of <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> called <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code>, short for <strong>v</strong>ector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> call above with this <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> where we specify that we expect <code><a href="https://rdrr.io/r/base/numeric.html">is.numeric()</a></code> to return a logical vector of length 1:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">vapply(df, is.numeric, logical(1))
|
||
#> a b c d e
|
||
#> TRUE TRUE FALSE FALSE TRUE</pre>
|
||
</div>
|
||
<p>The distinction between <code><a href="https://rdrr.io/r/base/lapply.html">sapply()</a></code> and <code><a href="https://rdrr.io/r/base/lapply.html">vapply()</a></code> is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.</p>
|
||
<p>Another important member of the apply family is <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> which computes a single grouped summary:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||
group_by(cut) |>
|
||
summarise(price = mean(price))
|
||
#> # A tibble: 5 × 2
|
||
#> cut price
|
||
#> <ord> <dbl>
|
||
#> 1 Fair 4359.
|
||
#> 2 Good 3929.
|
||
#> 3 Very Good 3982.
|
||
#> 4 Premium 4584.
|
||
#> 5 Ideal 3458.
|
||
|
||
tapply(diamonds$price, diamonds$cut, mean)
|
||
#> Fair Good Very Good Premium Ideal
|
||
#> 4358.758 3928.864 3981.760 4584.258 3457.542</pre>
|
||
</div>
|
||
<p>Unfortunately <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use <code><a href="https://rdrr.io/r/base/tapply.html">tapply()</a></code> or other base techniques to perform other grouped summaries, Hadley has collected a few techniques <a href="https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec">in a gist</a>.</p>
|
||
<p>The final member of the apply family is the titular <code><a href="https://rdrr.io/r/base/apply.html">apply()</a></code>, which works with matrices and arrays. In particular, watch out of <code>apply(df, 2, something)</code> which is a slow and potentially dangerous way of doing <code>lapply(df, something)</code>. This rarely comes up in data science because we usually work with data frames and not matrices.</p>
|
||
</section>
|
||
|
||
<section id="for-loops" data-type="sect1">
|
||
<h1>
|
||
For loops</h1>
|
||
<p>For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are powerful and general tool that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">for (element in vector) {
|
||
# do something with element
|
||
}</pre>
|
||
</div>
|
||
<p>The most straightforward use of <code>for()</code> loops is achieve the same affect as <code><a href="https://purrr.tidyverse.org/reference/map.html">walk()</a></code>: call some function with a side-effect on each element of a list. For example, in <a href="#sec-save-database" data-type="xref">#sec-save-database</a> instead of using walk:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">paths |> walk(append_file)</pre>
|
||
</div>
|
||
<p>We could have used a for loop:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">for (path in paths) {
|
||
append_file(path)
|
||
}</pre>
|
||
</div>
|
||
<p>Things get a little trickier if you want to save the output of the for-loop, for example reading all of the excel files in a directory like we did in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
|
||
files <- map(paths, readxl::read_excel)</pre>
|
||
</div>
|
||
<p>There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as <code>paths</code>, which we can create with <code><a href="https://rdrr.io/r/base/vector.html">vector()</a></code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">files <- vector("list", length(paths))</pre>
|
||
</div>
|
||
<p>Then instead of iterating over the elements of <code>paths</code>, we’ll iterate over their indices, using <code><a href="https://rdrr.io/r/base/seq.html">seq_along()</a></code> to generate one index for each element of paths:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">seq_along(paths)
|
||
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12</pre>
|
||
</div>
|
||
<p>Using the indices is important because it allows us to link to each position in the input with the corresponding position in the output:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">for (i in seq_along(paths)) {
|
||
files[[i]] <- readxl::read_excel(paths[[i]])
|
||
}</pre>
|
||
</div>
|
||
<p>To combine the list of tibbles into a single tibble you can use <code><a href="https://rdrr.io/r/base/do.call.html">do.call()</a></code> + <code><a href="https://rdrr.io/r/base/cbind.html">rbind()</a></code>:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">do.call(rbind, files)
|
||
#> # A tibble: 1,704 × 5
|
||
#> country continent lifeExp pop gdpPercap
|
||
#> <chr> <chr> <dbl> <dbl> <dbl>
|
||
#> 1 Afghanistan Asia 28.8 8425333 779.
|
||
#> 2 Albania Europe 55.2 1282697 1601.
|
||
#> 3 Algeria Africa 43.1 9279525 2449.
|
||
#> 4 Angola Africa 30.0 4232095 3521.
|
||
#> 5 Argentina Americas 62.5 17876956 5911.
|
||
#> 6 Australia Oceania 69.1 8691212 10040.
|
||
#> # … with 1,698 more rows</pre>
|
||
</div>
|
||
<p>Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">out <- NULL
|
||
for (path in paths) {
|
||
out <- rbind(out, readxl::read_excel(path))
|
||
}</pre>
|
||
</div>
|
||
<p>We recommend avoiding this pattern because it can become very slow when the vector is very long. This the source of the persistent canard that <code>for</code> loops are slow: they’re not, but iteratively growing a vector is.</p>
|
||
</section>
|
||
|
||
<section id="plots" data-type="sect1">
|
||
<h1>
|
||
Plots</h1>
|
||
<p>Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, modern look. However, base R plotting functions can still be useful because they’re so concise — it’s very little typing to do a basic exploratory plot.</p>
|
||
<p>There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with <code><a href="https://rdrr.io/r/graphics/plot.default.html">plot()</a></code> and <code><a href="https://rdrr.io/r/graphics/hist.html">hist()</a></code> respectively. Here’s a quick example from the diamonds dataset:</p>
|
||
<div class="cell">
|
||
<pre data-type="programlisting" data-code-language="r">hist(diamonds$carat)
|
||
|
||
plot(diamonds$carat, diamonds$price)</pre>
|
||
<div class="cell-output-display">
|
||
<p><img src="base-R_files/figure-html/unnamed-chunk-39-1.png" width="576"/></p>
|
||
</div>
|
||
<div class="cell-output-display">
|
||
<p><img src="base-R_files/figure-html/unnamed-chunk-39-2.png" width="576"/></p>
|
||
</div>
|
||
</div>
|
||
<p>Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using <code>$</code> or some other technique.</p>
|
||
</section>
|
||
|
||
<section id="summary" data-type="sect1">
|
||
<h1>
|
||
Summary</h1>
|
||
<p>In this chapter, we’ve shown you selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.</p>
|
||
<p>This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can <em>program</em> in R. We hope these chapters have sparked your interested in programming and that you’re are looking forward to learning more outside of this book.</p>
|
||
|
||
|
||
</section>
|
||
</section>
|