r4ds/iteration.qmd

# Iteration {#sec-iteration}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```

## Introduction

In @sec-functions, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
Reducing code duplication has three main benefits:

1.  It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.

2.  It's easier to respond to changes in requirements.
    As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.

3.  You're likely to have fewer bugs because each line of code is used in more places.

One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extract them out into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

Iteration is somewhat of a moving target in the tidyverse because we're keep adding new features to make it easier to solve problems that previously required explicit iteration.
For example:

-   To draw one plot for each group you can use ggplot2's facetting.
-   To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
-   To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
-   To extract every element from a named list you can use `tidyr::unnest_wider()`.

In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving objects.

### Prerequisites

We'll use a selection of useful iteration idioms from dplyr and purrr, both core members of the tidyverse.

```{r}
#| label: setup
#| message: false

library(tidyverse)
```

## For each column

### Motivation

Imagine you have this simple tibble:

```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```

And you want to compute the median of every column.
You could do it with copy-and-paste:

```{r}
df %>% summarise(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d),
)
```

But that breaks our rule of thumb: never copy and paste more than twice.
And you could imagine that this will get particularly tedious if you have tens or even hundreds of variables.
Instead you can use `across()`:

```{r}
df %>% summarise(
  across(a:d, median)
)
```

There are two arguments that you'll use every time:

-   The first argument specifies which columns you want to iterate over. It uses the same syntax as `select()`.
-   The second argument specifies what to do with each column.

There's another argument, `.names` that's useful when use `across()` with `mutate()`, and two variations `if_any()` and `if_all()` that work with `filter()`.
These are described in detail below.

### Which columns

The first argument to `across()`, `.cols`, selects the columns to transform.
This argument uses the same specifications as `select()`, @sec-select, so you can use functions like `starts_with()` and `ends_with()` to select variables based on their name.
There are two other techniques that you can use with both `select()` and `across()` that we didn't discuss earlier because they're particularly useful for `across()`: `everything()` and `where()` .

`everything()` is straightforward: it selects every (non-grouping) column!

```{r}
df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df %>% 
  group_by(grp) |> 
  summarise(across(everything(), median))
```

`where()` allows you to select columns based on their type:

-   `where(is.numeric)` selects all numeric columns.
-   `where(is.character)` selects all string columns.
-   `where(is.Date)` selects all date columns.
-   `where(is.POSIXct)` selects all date-time columns.
-   `where(is.logical)` selects all logical columns.

You can combine these in the usual `select()` way with Boolean algebra so that `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".

### Extra arguments

The second argument, `.funs`, determines what happens to each column selected by the first argument.
In most cases, this will be the name of an existing function, but you can also create your own function inline, or supply multiple functions.

Lets motivate this problem with an example: what happens if we have some missing values?
It'd be nice to be able to pass along additional arguments to `median()`:

```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = 1), rep(NA, n_na)))
}

df <- tibble(
  a = rnorm_na(10, 2),
  b = rnorm_na(10, 2),
  c = rnorm_na(10, 4),
  d = rnorm(10)
)
df %>% summarise(
  across(a:d, median)
)
```

For complicated reasons, it's not easy to pass on arguments from `across()`, so instead we can create another function that wraps `median()` and calls it with the correct arguments.
We can write that compactly using R's anonymous function shorthand:

```{r}
df %>% summarise(
  across(a:d, \(x) median(x, na.rm = TRUE))
)
```

This is short hand for creating a function, as below.
It's easier to remember because you just replace the eight letters of `function` with a single `\`.

```{r}
#| results: false
df %>% summarise(
  across(a:d, function(x) median(x, na.rm = TRUE))
)
```

As well as computing the median with out missing values, it'd be nice to know how many missing values there were.
We can do that by supplying a named list of functions to `across()`:

```{r}
df %>% summarise(
  across(a:d, list(
    median = \(x) median(x, na.rm = TRUE),
    n_miss = \(x) sum(is.na(x))
  ))
)
```

Note that you could describe the name of the new columns using a glue specification (@sec-glue) like `{.col}_{.fn}` where `.col` is the name of the original column and `.fun` is the name of the function in the list.
That's not a coincidence because you can use the `.names` argument to set these names.

### Column names

The result of `across()` is named according to the specification provided in the `.names` variable.
We could specify our own if we wanted the name of the function to come first.
(You can't currently change the order of the columns).

```{r}
df %>% summarise(
  across(a:d, list(
    median = \(x) median(x, na.rm = TRUE),
    n_miss = \(x) sum(is.na(x))
  ), .names = "{.fn}_{.col}")
)
```

The `.names` argument is particularly important when you use `across()` with `mutate()`.
By default the outputs of `across()` are given the same numbers as the inputs.
This means that `across()` inside of `mutate()` will replace existing columns:

```{r}
df %>% mutate(
  across(a:d, \(x) x + 1)
)
```

If you'd like to instead create new columns, you can supply the `.names` argument which takes a glue specification where `{.col}` refers to the current column name.

```{r}
df %>% mutate(
  across(a:d, \(x) x * 2, .names = "{.col}_2")
)
```

### Filtering

`across()` is a great match for `summarise()` and `mutate()` but it's not such a great fit for `filter()` because you usually string together calls to multiple functions either with `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:

```{r}
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
# same as:
df |> filter(if_any(a:d, is.na))

df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
# same as:
df |> filter(if_all(a:d, is.na))
```

### Vs `pivot_longer()`

Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()`.
In many cases, you perform the same calculations by first pivoting the data and then performing the operations by group rather than by column.
For example, we could rewrite our multiple summary `across()` as:

```{r}
df |> 
  pivot_longer(a:d) |> 
  group_by(name) |> 
  summarise(
    median = median(value, na.rm = TRUE),
    n_miss = sum(is.na(value))
  )
```

This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:

```{r}
df3 <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
  b_w = runif(10),
  c_val = rnorm(10),
  c_w = runif(10),
  d_val = rnorm(10),
  d_w = runif(10)
)
```

There's currently no way to do this with `across()`[^iteration-1], but it's relatively straightforward with `pivot_longer()`:

[^iteration-1]: Maybe there will be one day, but currently we don't see how.

```{r}
df3_long <- df3 |> 
  pivot_longer(
    everything(), 
    names_to = c("group", ".value"), 
    names_sep = "_"
  )
df3_long

df3_long |> 
  group_by(group) |> 
  summarise(mean = weighted.mean(val, w))
```

If needed, you could `pivot_wider()` this back to the original form.

### Exercises

1.  Compute the number of unique values in each column of `palmerpenguins::penguins`.
2.  Compute the mean of every column in `mtcars`.
3.  Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric variable.
4.  What happens if you use a list of functions, but don't name them? How is the output named?
5.  It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?

## For each file

Imagine you have a directory full of excel spreadsheets[^iteration-2] you want to read in.
You could do it with copy and paste:

[^iteration-2]: If you instead had a directory of csv files with the same format, you can use `read_csv()` directly: `read_csv(c("data/y2019.xls", "data/y2020.xls", "data/y2021.xls", "data/y2020.xls").`

```{r}
#| eval: false
data2019 <- readr::read_excel("data/y2019.xls")
data2020 <- readr::read_excel("data/y2020.xls")
data2021 <- readr::read_excel("data/y2021.xls")
data2022 <- readr::read_excel("data/y2022.xls")
```

And then use `dplyr::bind_rows()` to combine them all together:

```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```

But you can imagine that this would get tedious quickly, since often you won't have four files, but more like 400.
In this section you'll first learn a little bit about the base `dir()` function which allows you to list all the files in a directory.
And then about `map()` which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.

`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.

### Listing files in a directory

`dir()`.
Use `pattern`, a regular expression, to filter files.
Always use `full.name`.

```{r}
#| eval: false
paths <- dir("data", pattern = "\\.xls$", full.names = TRUE)
```

### Basic pattern

Two steps --- read every file into a list.
Then join the pieces back into a data frame.
Overall this framework is sometimes called split-apply-combine.
You split the problem up into pieces (here paths), apply a function to each piece (read_csv), and then combine the pieces back together.

```{r}
#| eval: false
paths |> 
  map(\(path) readxl::read_excel(path)) |> 
  list_rbind()
```

### Data in the path

If the file name itself contains data, try:

```{r}
#| eval: false
paths |> 
  set_names(basename) |> 
  map(\(path) readxl::read_excel) |> 
  list_rbind(.id = "path")
```

You can then use `tidyr::separate_by()` and friends to turn into useful columns.

You can use `set_names(basename)` to just use the file name.

### Get to a single data frame as quickly as possible

If you need to read and transform your data in some way you have two basic ways of structuring your data: doing a little iteration and a lot in a function, or doing a lot of iteration with simple functions.
Let's make that concrete with an example.

Say you want to read in a bunch of files, filter out missing values, pivot them, and then join them all together.
One way to approach the problem is write a function that takes a file and does all those steps:

```{r}
#| eval: false
process_file <- function(path) {
  df <- read_csv(path)
  
  df |> 
    filter(!is.na(id)) |> 
    mutate(id = tolower(id)) |> 
    pivot_longer(jan:dec, names_to = "month")
}

all <- paths |> 
  map(process_file) |> 
  list_rbind()
```

Alternatively, you could write

```{r}
#| eval: false

data <- paths |> 
  map(read_csv) |> 
  list_rbind() 

data |> 
  filter(!is.na(id)) |> 
  mutate(id = tolower(id)) |> 
  pivot_longer(jan:dec, names_to = "month")
```

If you need to do more work to get `list_rbind()` to work, you should do it, but in generate the sooner you can everything into one big data frame the better.

This is particularly important if the structure of your data varies in some way because it's usually easier to understand the variations when you have them all in front of you.
Much easier to interactively experiment and figure out what the right approach is.

### Heterogeneous data

However, sometimes that's not possible because the data frames are sufficiently inconsistent that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, start by loading all the files:

```{r}
#| eval: false
files <- paths |> map(read_excel, .id = "id") 
```

If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
You can use `map_if()` or `map_at()` to selectively modify inputs.
Use `map_if()` if its easier to select the elements to transform with a function; use `map_at()` if you can tell based on their names.

After spending all this effort, save it to a new csv file.

In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.

If the files are really inconsistent, one useful way to get some traction is to think about the structure of the files as data itself.

```{r}
#| eval: false

paths |> 
  set_names(basename) |> 
  map(\(path) read_csv(path, n_max = 0)) |> 
  map(\(df) data.frame(cols = names(df))) |> 
  list_rbind(.id = "name")
```

You could then think about pivotting or plotting this code to understand what the differences are.

### Handling failures

Some times you might not be able

```{r}
#| eval: false
paths |> 
  map(safely(\(path) readxl::read_excel(path)))
```

When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail.
When this happens, you'll get an error message, and no output.
This is annoying: why does one failure prevent you from accessing all the other successes?
How do you ensure that one bad apple doesn't ruin the whole barrel?

In this section you'll learn how to deal with this situation with a new function: `safely()`.
`safely()` is an adverb: it takes a function (a verb) and returns a modified version.
In this case, the modified function will never throw an error.
Instead, it always returns a list with two elements:

1.  `result` is the original result.
    If there was an error, this will be `NULL`.

2.  `error` is an error object.
    If the operation was successful, this will be `NULL`.

(You might be familiar with the `try()` function in base R.
It's similar, but because it sometimes returns the original result and it sometimes returns an error object it's more difficult to work with.)

Let's illustrate this with a simple example: `log()`:

```{r}
safe_log <- safely(log)
str(safe_log(10))
str(safe_log("a"))
```

When the function succeeds, the `result` element contains the result and the `error` element is `NULL`.
When the function fails, the `result` element is `NULL` and the `error` element contains an error object.

`safely()` is designed to work with `map()`:

```{r}
x <- list(1, 10, "a")
y <- x |> map(safely(log))
str(y)
```

```{r}
#| eval: false
paths |> 
  map(safely(read_csv))
```

This would be easier to work with if we had two lists: one of all the errors and one of all the output.
That's easy to get with `purrr::transpose()`:

```{r}
y <- y |> transpose()
str(y)
```

It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error, or work with the values of `y` that are ok:

```{r}
is_ok <- y$error |> map_lgl(is_null)
x[!is_ok]
y$result[is_ok] |> flatten_dbl()
```

## Writing multiple outputs

So far we've focused on map, which is design for functions that return something.
But some functions don't return data, they instead change the state of the world in some way.
In this section, you'll learn about `map()`'s friend `walk()`, which is design to work with this sort of function.
Along the way you'll see how to use it to load multiple csv files into a database and turn multiple plots into files.

### Very large data

Another exception to this rule is if you have very large data --- it might be impossible to store all the data in memory at once.
If you're lucky, the database you're working with will have a function to load csv files directly into the database.
For example, if you're using duckdb, you can:

```{r}
#| eval: false
duckdb::duckdb_read_csv(con, "cars", paths)
```

Otherwise:

```{r}
#| eval: false
template <- read_csv(paths[[1]])
DBI::dbWriteTable(con, "cars", filter(template, FALSE))

append_csv <- function(path) {
  df <- read_csv(path)
  DBI::dbAppendTable(con, "cars", df)
}

paths |> walk(append_csv)
```

Or maybe you just write one clean csv for each file and then read with `arrow::open_dataset()`.

### Saving plots

To save plots, we need to embrace a new challenge: there's now two important arguments: the object you want to save and the place you want to save it.
So we're going to switch from `walk()` to `walk2()`.

`walk2()`.
It differs in two ways: it iterates over two arguments at the same time, and it hides the output.
Let's first make some plots:

```{r}
plots <- mtcars |> 
  group_split(cyl) |> 
  map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
```

Then

```{r}
file_names <- str_c(names(plots), ".pdf")

plots |> 
  walk2(file_names, \(plot, name) ggsave(name, plot, path = tempdir()))
```

## For loops

Another way to attack this sort of problem is with a `for` loop.
We don't teach for loops here to stay focused.
They're definitely important.
You can learn more about them and how they're connected to the map functions in purr in <https://adv-r.hadley.nz/control-flow.html#loops> and <https://adv-r.hadley.nz/functionals.html>.

Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a `for` loop instead of a map function.
The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).

Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefits of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.

If you actually need to worry about performance, you'll know, it'll be obvious.
till then, don't worry about it.