# Iteration {#sec-iteration}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```

## Introduction

Iteration is somewhat of a moving target in the tidyverse because we keep adding new features to make it easier to solve problems that previously required explicit iteration.
For example:

- To draw one plot for each group you can use ggplot2's faceting.
- To compute summary statistics for subgroups you can use `dplyr::group_by()` + `dplyr::summarise()`.
- To read every .csv file in a directory you can pass a vector to `readr::read_csv()`.
- To extract every element from a named list you can use `tidyr::unnest_wider()`.

In this section we'll show you three related sets of tools for manipulating each column in a data frame, reading each file in a directory, and saving each element in a list.

These are the basics of iteration, focusing on the places where it comes up in an analysis.
But in general, iteration is a superpower: once you've solved one problem, you can apply iteration techniques to solve every similar problem.
You can learn more at <https://purrr.tidyverse.org> and in the [Functionals chapter](https://adv-r.hadley.nz/functionals.html) of *Advanced R*.

### Prerequisites

We'll use a selection of important iteration idioms from dplyr and purrr, both core members of the tidyverse.

```{r}
#| label: setup
#| message: false

library(tidyverse)
```

## Modifying multiple columns {#sec-across}

Imagine you have this simple tibble:

```{r}
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
```

And you want to compute the median of every column.
You could do it with copy-and-paste:

```{r}
df |> summarise(
  a = median(a),
  b = median(b),
  c = median(c),
  d = median(d),
  n = n()
)
```

But that breaks our rule of thumb: never copy and paste more than twice.
And you could imagine that this will get particularly tedious if you have tens or even hundreds of variables.
Instead you can use `across()`:

```{r}
df |> summarise(
  across(a:d, median),
  n = n()
)
```

There are two arguments that you'll use every time:

- The first argument specifies which columns you want to iterate over. It uses the same syntax as `select()`.
- The second argument specifies what to do with each column.

There's another argument, `.names`, that's useful when you use `across()` with `mutate()`, and two variations, `if_any()` and `if_all()`, that work with `filter()`.
These are described in detail below.

### Selecting columns with `.cols`

The first argument to `across()`, `.cols`, selects the columns to transform.
This argument uses the same specifications as `select()` (@sec-select), so you can use functions like `starts_with()` and `ends_with()` to select variables based on their name.
Grouping columns are automatically ignored because they're carried along for the ride by the dplyr verb.

There are two other techniques that you can use with both `select()` and `across()` that we didn't discuss earlier because they're particularly useful for `across()`: `everything()` and `where()`.

`everything()` is straightforward: it selects every (non-grouping) column!

```{r}
df <- tibble(
  grp = sample(2, 10, replace = TRUE),
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df |>
  group_by(grp) |>
  summarise(across(everything(), median))
```

`where()` allows you to select columns based on their type:

- `where(is.numeric)` selects all numeric columns.
- `where(is.character)` selects all string columns.
- `where(is.Date)` selects all date columns.
- `where(is.POSIXct)` selects all date-time columns.
- `where(is.logical)` selects all logical columns.

```{r}
df <- tibble(
  x1 = 1:3,
  x2 = runif(3),
  y1 = sample(letters, 3),
  y2 = c("banana", "apple", "egg")
)

df |>
  summarise(across(where(is.numeric), mean))
df |>
  summarise(across(where(is.character), str_flatten))
```

You can combine these in the usual `select()` way with Boolean algebra so that `!where(is.numeric)` selects all non-numeric columns and `starts_with("a") & where(is.logical)` selects all logical columns whose name starts with "a".
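
For example, we can check the first of these on the `df` we just created:

```{r}
df |>
  select(!where(is.numeric))
```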

### Defining the action with `.funs`

The second argument, `.funs`, determines what happens to each column selected by the first argument.
In most cases, this will be the name of an existing function, but you can also create your own function inline, or supply multiple functions.

Let's motivate this problem with an example: what happens if we have some missing values?
`median()` propagates missing values by default, so it'd be nice to be able to pass along additional arguments to it:

```{r}
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
}

df <- tibble(
  a = rnorm_na(10, 2),
  b = rnorm_na(10, 2),
  c = rnorm_na(10, 4),
  d = rnorm(10)
)
df |>
  summarise(
    across(a:d, median),
    n = n()
  )
```

For complicated reasons, it's not easy to pass on arguments from `across()`, so instead we can create another function that wraps `median()` and calls it with the correct arguments.
We can write that compactly using R's anonymous function shorthand:

```{r}
df |>
  summarise(
    across(a:d, \(x) median(x, na.rm = TRUE)),
    n = n()
  )
```

This expands to the following code.
Each call is the same, apart from the argument which changes each time.

```{r}
#| eval: false

df |> summarise(
  a = median(a, na.rm = TRUE),
  b = median(b, na.rm = TRUE),
  c = median(c, na.rm = TRUE),
  d = median(d, na.rm = TRUE),
  n = n()
)
```

This is shorthand for creating a function, as below.
It's easier to remember because you just replace the eight letters of `function` with a single `\`.

```{r}
#| results: false
df |>
  summarise(
    across(a:d, function(x) median(x, na.rm = TRUE)),
    n = n()
  )
```

As well as computing the median without missing values, it'd be nice to know how many missing values there were.
We can do that by supplying a named list of functions to `across()`:

```{r}
df |>
  summarise(
    across(a:d, list(
      median = \(x) median(x, na.rm = TRUE),
      n_miss = \(x) sum(is.na(x))
    )),
    n = n()
  )
```

If you look carefully, you might intuit that the columns are named using a glue specification (@sec-glue) like `{.col}_{.fn}`, where `.col` is the name of the original column and `.fn` is the name of the function in the list.
That's not a coincidence because you can use the `.names` argument to set these names, the topic of the next section.

### Column names

The result of `across()` is named according to the specification provided in the `.names` argument.
We could specify our own if we wanted the name of the function to come first[^iteration-1]:

[^iteration-1]: You can't currently change the order of the columns, but you could reorder them after the fact using `relocate()` or similar.

```{r}
df |>
  summarise(
    across(
      a:d,
      list(
        median = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_{.col}"
    ),
    n = n()
  )
```

The `.names` argument is particularly important when you use `across()` with `mutate()`.
By default the output of `across()` is given the same names as the inputs.
This means that `across()` inside of `mutate()` will replace existing columns:

```{r}
df |>
  mutate(
    across(a:d, \(x) x + 1)
  )
```

If you'd like to instead create new columns, you can use the `.names` argument to give the output new names:

```{r}
df |>
  mutate(
    across(a:d, \(x) x * 2, .names = "{.col}_double")
  )
```

### Filtering

`across()` is a great match for `summarise()` and `mutate()`, but it's not such a great fit for `filter()`, where you usually combine multiple conditions with either `|` or `&`.
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:

```{r}
df |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
# same as:
df |> filter(if_any(a:d, is.na))

df |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
# same as:
df |> filter(if_all(a:d, is.na))
```

### `across()` in functions

`across()` is particularly useful to program with because it allows you to operate on multiple columns.
For example, [Jacob Scott](https://twitter.com/_wurli/status/1571836746899283969) uses this little helper to expand all date columns into year, month, and day columns:

```{r}
expand_dates <- function(df) {
  df |>
    mutate(
      across(
        where(lubridate::is.Date),
        list(year = year, month = month, day = mday)
      )
    )
}
```

It also lets the user supply multiple variables.
The key thing to remember is that the first argument to `across()` uses tidy evaluation, so you need to embrace any arguments.
For example, this function will compute the means of numeric variables by default.
But by supplying the second argument you can choose to summarize just selected variables.

```{r}
summarise_means <- function(data, summary_vars = where(is.numeric)) {
  data |>
    summarise(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n()
    )
}
diamonds |>
  group_by(clarity) |>
  summarise_means()

diamonds |>
  group_by(clarity) |>
  summarise_means(c(carat, x:z))
```

### Vs `pivot_longer()`

Before we go on, it's worth pointing out an interesting connection between `across()` and `pivot_longer()`.
In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column.
For example, we could rewrite our multiple summary `across()` as:

```{r}
df |>
  pivot_longer(a:d) |>
  group_by(name) |>
  summarise(
    median = median(value, na.rm = TRUE),
    n_miss = sum(is.na(value))
  )
```

This is a useful technique to know about because sometimes you'll hit a problem that's not currently possible to solve with `across()`: when you have groups of variables that you want to compute with simultaneously.
For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:

```{r}
df3 <- tibble(
  a_val = rnorm(10),
  a_w = runif(10),
  b_val = rnorm(10),
  b_w = runif(10),
  c_val = rnorm(10),
  c_w = runif(10),
  d_val = rnorm(10),
  d_w = runif(10)
)
```

There's currently no way to do this with `across()`[^iteration-2], but it's relatively straightforward with `pivot_longer()`:

[^iteration-2]: Maybe there will be one day, but currently we don't see how.

```{r}
df3_long <- df3 |>
  pivot_longer(
    everything(),
    names_to = c("group", ".value"),
    names_sep = "_"
  )
df3_long

df3_long |>
  group_by(group) |>
  summarise(mean = weighted.mean(val, w))
```

If needed, you could `pivot_wider()` this back to the original form.
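
Here's a hedged sketch of that round trip; note that we first need a row identifier so `pivot_wider()` knows which values belong together (the `row` column is our own scaffolding, not part of the original data):

```{r}
#| eval: false
df3_long |>
  mutate(row = row_number(), .by = group) |>
  pivot_wider(
    id_cols = row,
    names_from = group,
    values_from = c(val, w),
    names_glue = "{group}_{.value}"
  )
```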

### Exercises

1. Compute the number of unique values in each column of `palmerpenguins::penguins`.
2. Compute the mean of every column in `mtcars`.
3. Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric variable.
4. What happens if you use a list of functions, but don't name them? How is the output named?
5. It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`. Can you explain why?

## Reading multiple files

Imagine you have a directory full of Excel spreadsheets[^iteration-3] you want to read.
You could do it with copy and paste:

[^iteration-3]: If you instead had a directory of csv files with the same format, you can use the technique from @sec-readr-directory.

```{r}
#| eval: false
data2019 <- readxl::read_excel("data/y2019.xls")
data2020 <- readxl::read_excel("data/y2020.xls")
data2021 <- readxl::read_excel("data/y2021.xls")
data2022 <- readxl::read_excel("data/y2022.xls")
```

And then use `dplyr::bind_rows()` to combine them all together:

```{r}
#| eval: false
data <- bind_rows(data2019, data2020, data2021, data2022)
```

But you can imagine that this would get tedious quickly, especially if you had 400 files, not just four.

In the following sections you'll learn how to use `dir()` to list all the files in a directory, then `purrr::map()` to read each of them into a list, and then `purrr::list_rbind()` to combine them into a single data frame.
We'll then discuss how you can use these tools as the challenge level increases.

### Listing files in a directory

`dir()` lists the files in a directory.
You'll almost always use three arguments:

- `path`, the first argument, which you won't usually name, is the directory to look in.

- `pattern` is a regular expression that file names must match to be included in the output.
  The most common pattern is to match an extension like `\\.xlsx$` or `\\.csv$`, but you can use whatever you need to extract your data files.

- `full.names` determines whether or not the directory name should be included in the output.
  You almost always want this to be `TRUE`.

For example, this book contains a folder with 12 Excel spreadsheets that contain data from the gapminder package.
Each file provides the life expectancy, population, and per capita GDP for 142 countries for one year.
We can list them all with the appropriate call to `dir()`:

```{r}
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
paths
```

### `purrr::map()` and `list_rbind()`

Now that we have these 12 paths, we could call `read_excel()` 12 times to get 12 data frames.
We're going to make a small generalization compared to the example above.
Since, in general, we won't know how many files there are to read, instead of loading each individual data frame in its own variable, we'll put them all into a list, something like this:

```{r}
#| eval: false
list(
  readxl::read_excel("data/gapminder/1952.xlsx"),
  readxl::read_excel("data/gapminder/1957.xlsx"),
  readxl::read_excel("data/gapminder/1962.xlsx"),
  ...,
  readxl::read_excel("data/gapminder/2007.xlsx")
)
```

Now that's just as tedious to type as before, but we can use a shortcut: `purrr::map()`.
`map()` is similar to `across()`, but instead of doing something to each column in a data frame, it does something to each element of a vector.
`map(x, f)` is shorthand for:

```{r}
#| eval: false
list(
  f(x[[1]]),
  f(x[[2]]),
  ...,
  f(x[[n]])
)
```

So we can use `map()` to get a list of 12 data frames:

```{r}
files <- map(paths, readxl::read_excel)
length(files)

files[[1]]
```

(This is another data structure that doesn't display particularly compactly with `str()`, so you might want to load it into RStudio and inspect it with `View()`.)

Now we can use `purrr::list_rbind()` to combine that list of data frames into a single data frame:

```{r}
list_rbind(files)
```

Or we could do both steps at once in a pipeline:

```{r}
#| results: false
paths |>
  map(readxl::read_excel) |>
  list_rbind()
```

What if we want to pass in extra arguments to `read_excel()`?
We use the same trick that we used with `across()`.
For example, it's often useful to peek at just the first few rows of the data, which we can do with `n_max`:

```{r}
paths |>
  map(\(path) readxl::read_excel(path, n_max = 1)) |>
  list_rbind()
```

This makes it very clear that each individual sheet doesn't contain the year, which is only recorded in the path.
We'll tackle that problem next.

### Data in the path

Sometimes the name of the file is itself data.
In this example, the file name contains the year, which is not otherwise recorded in the individual data frames.
To get that column into the final data frame, we need to do two things.

Firstly, we name the vector of paths.
The easiest way to do this is with the `set_names()` function, which can take a function.
Here we use `basename()` to extract just the file name from the full path:

```{r}
paths <- paths |> set_names(basename)
paths
```

Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:

```{r}
#| eval: false
paths |>
  map(readxl::read_excel) |>
  names()
```

Then we use the `names_to` argument to `list_rbind()` to tell it to save the names to a new column called `year`, and use `readr::parse_number()` to turn it into a number.

```{r}
paths |>
  set_names(basename) |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  mutate(year = parse_number(year))
```

In other cases, there might be other variables stored in the directory name, or maybe multiple variables encoded in the path.
In that case, you can use `set_names()` without any argument to record the full path, and then use `tidyr::separate()` and friends to turn the pieces into useful columns.

```{r}
paths |>
  set_names() |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  separate(year, into = c(NA, "directory", "file", "ext"), sep = "[/.]")
```

### Save your work

Now that you've done all this hard work to get to a nice tidy data frame, make sure to save your work!

In terms of organising your analysis project, you might want to have a file called `0-cleanup.R` that generates nice csv files to be used by the rest of your project.
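
For example, a minimal sketch of what `0-cleanup.R` might contain, reusing the pipeline from above (the output file name is just an illustration):

```{r}
#| eval: false
gapminder <- paths |>
  set_names(basename) |>
  map(readxl::read_excel) |>
  list_rbind(names_to = "year") |>
  mutate(year = parse_number(year))

write_csv(gapminder, "gapminder.csv")
```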

### Many simple iterations

If you need to read and transform your data in some way you have two basic approaches: doing one round of iteration with a complex function, or doing multiple rounds of iteration with simple functions.
In our experience, you will be better off with many simple iterations, but most folks reach first for one complex iteration.

Let's make that concrete with an example.
Imagine that you want to read in a bunch of files, filter out missing values, pivot them, and then combine them all together.
One way to approach the problem is to write a function that takes a file and does all those steps:

```{r}
#| eval: false
process_file <- function(path) {
  df <- read_csv(path)

  df |>
    filter(!is.na(id)) |>
    mutate(id = tolower(id)) |>
    pivot_longer(jan:dec, names_to = "month")
}
```

Then you call `map()` once:

```{r}
#| eval: false
all <- paths |>
  map(process_file) |>
  list_rbind()
```

Alternatively, you could read all the files first:

```{r}
#| eval: false
data <- paths |>
  map(read_csv) |>
  list_rbind()
```

Then rely on dplyr functions to do the rest:

```{r}
#| eval: false
data |>
  filter(!is.na(id)) |>
  mutate(id = tolower(id)) |>
  pivot_longer(jan:dec, names_to = "month")
```

We think this second approach is usually more desirable because it stops you getting fixated on getting the first file right before moving on to the rest.
By considering all of the data when you do your tidying and cleaning, you're more likely to think holistically about the problems and end up with a higher quality result.

### Heterogeneous data

Unfortunately sometimes this strategy fails because the data frames are so heterogeneous that `list_rbind()` either fails or yields a data frame that's not very useful.
In that case, it's still useful to start by getting all of the files into memory:

```{r}
#| eval: false
files <- paths |> map(readxl::read_excel)
```

And then a very useful strategy is to capture the structure of the data frames as data so that you can explore it.
One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:

```{r}
df_types <- function(df) {
  tibble(
    col_name = names(df),
    col_type = map_chr(df, vctrs::vec_ptype_full)
  )
}

df_types(starwars)
```

You can then use this to explore all of the files:

```{r}
files |>
  map(df_types) |>
  list_rbind(names_to = "file_name") |>
  pivot_wider(names_from = col_name, values_from = col_type)
```

If the files have heterogeneous formats you might need to do more processing before you can successfully merge them.
You can use `map_if()` or `map_at()` to selectively modify inputs.
Use `map_if()` if it's easier to select the elements to transform with a function; use `map_at()` if you can tell based on their names.
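
For example, here's a hedged sketch of `map_if()`, assuming some files read `year` in as a character column that we want to convert before binding (the column name and fix are illustrative):

```{r}
#| eval: false
files <- files |>
  map_if(
    \(df) is.character(df$year),
    \(df) df |> mutate(year = parse_number(year))
  )
```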

### Handling failures

Sometimes the structure of your data might be sufficiently wild that you can't even read all the files with a single command.
One of the downsides of `map()` is that it succeeds or fails as a whole: either you successfully read all of the files in a directory or you fail with an error.
This is annoying: why does one failure prevent you from accessing all the other successes?
How do you ensure that one bad apple doesn't ruin the whole barrel?

Luckily, purrr comes with a helper for this situation: `possibly()`.
It wraps a function so that, instead of erroring, it returns a value of your choosing.
Now any failure will put a `NULL` in the list of files, and `list_rbind()` will automatically ignore those `NULL`s.

```{r}
files <- paths |>
  map(possibly(\(path) readxl::read_excel(path), NULL))

data <- files |> list_rbind()
```

Now comes the hard part of figuring out why they failed and what to do about it.
Start by getting the paths that failed:

```{r}
failed <- map_vec(files, is.null)
paths[failed]
```

Now the hard work begins: you'll have to look at each failure, call the import function again, and figure out what went wrong.
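
A minimal sketch, assuming at least one file failed: re-run the reader on a failing path so you see the full error message rather than a silent `NULL`:

```{r}
#| eval: false
readxl::read_excel(paths[failed][[1]])
```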

## Saving multiple objects

So far we've focused on `map()`, which is designed for functions that return something.
But some functions don't return things, they instead do things (i.e. their return value isn't important).
This sort of function includes:

- Saving data to a database.
- Saving data to disk, like `readr::write_csv()`.
- Saving plots to disk with `ggsave()`.

In this section, you'll learn about `map()`'s friend `walk()`, which is designed to work with this sort of function.
Along the way you'll see how to use it to load multiple csv files into a database and turn multiple plots into files.

### Writing to a database {#sec-save-database}

Sometimes when working with many files, it's not possible to load all your data into memory at once.
If you can't `map(files, read_csv)`, how can you work with your data?
Well, one approach is to put it all into a database and then use dbplyr to access just the subsets that you need.

Sometimes the database package will provide a handy function that takes a vector of paths and loads them all into the database.
This is the case with duckdb's `duckdb_read_csv()`:

```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_read_csv(con, "gapminder", paths)
```

But we don't have csv files, we have Excel spreadsheets.
So we're going to have to do it "by hand".
And you can use this same pattern for databases that don't have a handy function like this.

Unlike in @sec-load-data, we're not using `dbWriteTable()`, because we're going to create the table once, and then append to it multiple times.
So instead we'll use `dbCreateTable()` and `dbAppendTable()`.
We first create an empty table with the fields we'll use:

```{r}
con <- DBI::dbConnect(duckdb::duckdb())

template <- readxl::read_excel(paths[[1]])
template$year <- 1952

DBI::dbCreateTable(con, "gapminder", template)
```

Unlike `dbWriteTable()`, `dbCreateTable()` doesn't load in any data.
Its job is to create the right table fields with the right types:

```{r}
con |> tbl("gapminder")
```

Now we need a function that takes a single path and loads it into an existing table in the database with `dbAppendTable()`:

```{r}
append_file <- function(path) {
  df <- readxl::read_excel(path)
  df$year <- parse_number(basename(path))

  DBI::dbAppendTable(con, "gapminder", df)
}
```

Now you need to call `append_file()` once for each element of `paths`.
That's certainly possible with `map()`:

```{r}
#| eval: false
paths |> map(append_file)
```

But we don't actually care about the output, so instead we can use `walk()`.
This does exactly the same thing as `map()` but throws the output away.

```{r}
paths |> walk(append_file)
```

Now if we look at the data we can see we have all the data in one place:

```{r}
con |> tbl("gapminder")
```

```{r}
#| include: false
DBI::dbDisconnect(con, shutdown = TRUE)
```

### Writing csv files

The same basic principle applies if we want to save out multiple csv files, one for each group.
Let's imagine that we want to take the `ggplot2::diamonds` data and save one csv file for each `clarity`.
First we need to make those individual datasets.
One way to do that is with dplyr's `group_split()`:

```{r}
by_clarity <- diamonds |>
  group_by(clarity) |>
  group_split()
```

This produces a list of length 8, containing one tibble for each unique value of `clarity`:

```{r}
length(by_clarity)

by_clarity[[1]]
```

If we were going to save these data frames by hand, we might write something like:

```{r}
#| eval: false
write_csv(by_clarity[[1]], "diamonds-I1.csv")
write_csv(by_clarity[[2]], "diamonds-SI2.csv")
write_csv(by_clarity[[3]], "diamonds-SI1.csv")
...
write_csv(by_clarity[[8]], "diamonds-IF.csv")
```

This is a little different compared to our previous uses of `map()` because instead of changing one argument we're now changing two.
This means that we'll need to use `map2()` instead of `map()`.

We'll also need to generate the names for those files somehow.
The most general way to do so is to use `dplyr::group_keys()`:

```{r}
keys <- diamonds |>
  group_by(clarity) |>
  group_keys()
keys

paths <- keys |>
  mutate(path = str_glue("diamonds-{clarity}.csv")) |>
  pull()
paths
```

This feels a bit fiddly here because we're only grouping by a single variable, but you can imagine how powerful this is when you group by multiple variables.

Now that we have all the pieces in place, we can eliminate the need to copy and paste by running `walk2()`:

```{r}
#| eval: false
walk2(by_clarity, paths, write_csv)
```

### Saving plots

We can take the same basic approach to create many plots.
We're jumping the gun here a bit because you won't learn how to save a single plot until @sec-ggsave, but hopefully the basic idea will make sense anyway.

Let's first split up the data:

```{r}
by_cyl <- mtcars |> group_by(cyl)
```

Then create the plots using `map()` to call `ggplot()` repeatedly with different datasets.
That gives us a list of plots[^iteration-4]:

[^iteration-4]: You can print `plots` to get a crude animation --- you'll get one plot for each element of `plots`.

```{r}
plots <- by_cyl |>
  group_split() |>
  map(\(df) ggplot(df, aes(mpg, wt)) + geom_point())
```

(If this were a more complicated plot you'd use a named function so there's more room for all the details.)

Then you create the file names:

```{r}
paths <- by_cyl |>
  group_keys() |>
  mutate(path = str_glue("cyl-{cyl}.png")) |>
  pull()
paths
```

Then use `walk2()` with `ggsave()` to save each plot:

```{r}
walk2(plots, paths, \(plot, name) ggsave(name, plot, path = tempdir()))
```

This is shorthand for:

```{r}
#| eval: false
ggsave(paths[[1]], plots[[1]], path = tempdir())
ggsave(paths[[2]], plots[[2]], path = tempdir())
ggsave(paths[[3]], plots[[3]], path = tempdir())
```

It's barely necessary here, but you can imagine how useful this would be if you had to create hundreds of plots.

### Exercises

1. Imagine you have a table of student data containing (amongst other variables) `school_name` and `student_id`. Sketch out what code you'd write if you want to save all the information for each student in a file called `{student_id}.csv` in the `{school_name}` directory.

## For loops

Another way to attack this sort of problem is with a `for` loop.
We don't teach for loops here to stay focused.
They're definitely important.
You can learn more about them and how they're connected to the map functions in purrr in <https://adv-r.hadley.nz/control-flow.html#loops> and <https://adv-r.hadley.nz/functionals.html>.
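
To give you the basic flavor, here's a sketch of how you might read the gapminder files from earlier with a `for` loop instead of `map()`:

```{r}
#| eval: false
# Pre-allocate a list, fill it one file at a time, then combine
files <- vector("list", length(paths))
for (i in seq_along(paths)) {
  files[[i]] <- readxl::read_excel(paths[[i]])
}
data <- list_rbind(files)
```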

Once you master these functions, you'll find it takes much less time to solve iteration problems.
But you should never feel bad about using a `for` loop instead of a map function.
The map functions are a step up the tower of abstraction, and it can take a long time to get your head around how they work.
The important thing is that you solve the problem that you're working on, not write the most concise and elegant code (although that's definitely something you want to strive towards!).

Some people will tell you to avoid `for` loops because they are slow.
They're wrong!
(Well at least they're rather out of date, as `for` loops haven't been slow for many years.) The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier to write and to read.

If you actually need to worry about performance, you'll know; it'll be obvious.
Till then, don't worry about it.