Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
2. It's easier to respond to changes in requirements.
As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.
3. You're likely to have fewer bugs because each line of code is used in more places.
One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extract them out into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is **iteration**, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Iteration is somewhat of a moving target in the tidyverse because we're keep adding new features to make it easier to solve problems that previously required explicit iteration.
For example:
- To draw one plot for each group you can use ggplot2's facetting.
- To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
- To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
- To extract every element from a named list you can use `tidyr::unnest_wider()`.
In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving objects.
There's another argument, `.names` that's useful when use `across()` with `mutate()`, and two variations `if_any()` and `if_all()` that work with `filter()`.
The first argument to `across()`, `.cols`, selects the columns to transform.
This argument uses the same specifications as `select()`, @sec-select, so you can use functions like `starts_with()` and `ends_with()` to select variables based on their name.
There are two other techniques that you can use with both `select()` and `across()` that we didn't discuss earlier because they're particularly useful for `across()`: `everything()` and `where()` .
`everything()` is straightforward: it selects every (non-grouping) column!
```{r}
df <- tibble(
grp = sample(2, 10, replace = TRUE),
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df %>%
group_by(grp) |>
summarise(across(everything(), median))
```
`where()` allows you to select columns based on their type:
- `where(is.numeric)` selects all numeric columns.
- `where(is.character)` selects all string columns.
- `where(is.Date)` selects all date columns.
- `where(is.POSIXct)` selects all date-time columns.
- `where(is.logical)` selects all logical columns.
You can combine these in the usual `select()` way with Boolean algebra so that `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
For complicated reasons, it's not easy to pass on arguments from `across()`, so instead we can create another function that wraps `median()` and calls it with the correct arguments.
We can write that compactly using R's anonymous function shorthand:
As well as computing the median with out missing values, it'd be nice to know how many missing values there were.
We can do that by supplying a named list of functions to `across()`:
```{r}
df %>% summarise(
across(a:d, list(
median = \(x) median(x, na.rm = TRUE),
n_miss = \(x) sum(is.na(x))
))
)
```
Note that you could describe the name of the new columns using a glue specification (@sec-glue) like `{.col}_{.fn}` where `.col` is the name of the original column and `.fun` is the name of the function in the list.
That's not a coincidence because you can use the `.names` argument to set these names.
If you'd like to instead create new columns, you can supply the `.names` argument which takes a glue specification where `{.col}` refers to the current column name.
`across()` is a great match for `summarise()` and `mutate()` but it's not such a great fit for `filter()` because you usually string together calls to multiple functions either with `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:
This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:
Imagine you have a directory full of excel spreadsheets[^iteration-2] you want to read in.
You could do it with copy and paste:
[^iteration-2]: If you instead had a directory of csv files with the same format, you can use `read_csv()` directly: `read_csv(c("data/y2019.xls", "data/y2020.xls", "data/y2021.xls", "data/y2020.xls").`
```{r}
#| eval: false
data2019 <- readr::read_excel("data/y2019.xls")
data2020 <- readr::read_excel("data/y2020.xls")
data2021 <- readr::read_excel("data/y2021.xls")
data2022 <- readr::read_excel("data/y2022.xls")
```
And then use `dplyr::bind_rows()` to combine them all together:
```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```
But you can imagine that this would get tedious quickly, since often you won't have four files, but more like 400.
In this section you'll first learn a little bit about the base `dir()` function which allows you to list all the files in a directory.
And then about `map()` which lets you repeatedly apply a function to each element of a vector, allowing you to read many files in one step.
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
If you need to read and transform your data in some way you have two basic ways of structuring your data: doing a little iteration and a lot in a function, or doing a lot of iteration with simple functions.
If you need to do more work to get `list_rbind()` to work, you should do it, but in generate the sooner you can everything into one big data frame the better.
This is particularly important if the structure of your data varies in some way because it's usually easier to understand the variations when you have them all in front of you.
Much easier to interactively experiment and figure out what the right approach is.
However, sometimes that's not possible because the data frames are sufficiently inconsistent that `list_rbind()` either fails or yields a data frame that's not very useful.
In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.
It's up to you how to deal with the errors, but typically you'll either look at the values of `x` where `y` is an error, or work with the values of `y` that are ok:
Another way to attack this sort of problem is with a `for` loop.
We don't teach for loops here to stay focused.
They're definitely important.
You can learn more about them and how they're connected to the map functions in purr in <https://adv-r.hadley.nz/control-flow.html#loops> and <https://adv-r.hadley.nz/functionals.html>.
Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a `for` loop instead of a map function.
The map functions are a step up a tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).
Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefits of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.