parent c89aa20627
commit bdc3555b9a
13 base-R.qmd
@@ -219,7 +219,7 @@ In this section, we'll show you how to use `[[` and `$` to pull columns out of a

### Data frames

`[[` and `$` can be used like `pull()` to extract columns out of a data frame.
`[[` and `$` can be used to extract columns out of a data frame.
`[[` can access by position or by name, and `$` is specialized for access by name:

```{r}
@@ -255,6 +255,16 @@ max(diamonds$carat)
levels(diamonds$cut)
```

dplyr also provides an equivalent to `[[`/`$` that we didn't mention in @sec-data-transform: `pull()`.
`pull()` takes either a variable name or variable position and returns just that column.
That means we could rewrite the above code to use the pipe:

```{r}
diamonds |> pull(carat) |> mean()

diamonds |> pull(cut) |> levels()
```
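
As a rough check of that equivalence, the sketch below compares the base R and dplyr forms on a small made-up tibble. The `toy` data frame and its columns are invented for illustration and don't appear elsewhere in the chapter.

```r
library(dplyr)

# `toy` is a made-up tibble used only for this illustration
toy <- tibble(x = c(1, 2, 3), y = c("a", "b", "c"))

by_dollar   <- toy$x            # `$` extracts by name
by_name     <- toy[["x"]]       # `[[` extracts by name ...
by_position <- toy[[1]]         # ... or by position
via_pull    <- toy |> pull(x)   # pull() accepts a variable name ...
via_pull_1  <- toy |> pull(1)   # ... or a position

# Each comparison returns TRUE: all five forms yield the same vector
identical(by_dollar, by_name)
identical(by_name, by_position)
identical(by_position, via_pull)
identical(via_pull, via_pull_1)
```

The main practical difference is that `pull()` slots into a pipeline, while `$` and `[[` read more naturally when you start from the data frame's name.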

### Tibbles

There are a couple of important differences between tibbles and base `data.frame`s when it comes to `$`.
@@ -537,3 +547,4 @@ This often makes life easier for programming and so becomes more important as yo
This chapter concludes the programming section of the book.
You've made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can *program* in R.
We hope these chapters have sparked your interest in programming and that you're looking forward to learning more outside of this book.
@@ -384,14 +384,14 @@ With this theory under your belt, we'll then show you a bunch of examples to ill
### Indirection and tidy evaluation

When you start writing functions that use dplyr verbs you rapidly hit the problem of indirection.
Let's illustrate the problem with a very simple function: `pull_unique()`.
The goal of this function is to `pull()` the unique (distinct) values of a variable:
Let's illustrate the problem with a very simple function: `grouped_mean()`.
The goal of this function is to compute the mean of `mean_var` grouped by `group_var`:

```{r}
pull_unique <- function(df, var) {
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    distinct(var) |>
    pull(var)
    group_by(group_var) |>
    summarize(mean(mean_var))
}
```

@@ -399,38 +399,45 @@ If we try and use it, we get an error:

```{r}
#| error: true
diamonds |> pull_unique(clarity)
diamonds |> grouped_mean(cut, carat)
```

To make the problem a bit clearer, we can use a made-up data frame:

```{r}
df <- tibble(var = "var", x = "x", y = "y")
df |> pull_unique(x)
df |> pull_unique(y)
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)
df |> grouped_mean(group, x)
df |> grouped_mean(group, y)
```

Regardless of how we call `pull_unique()` it always does `df |> distinct(var) |> pull(var)`, instead of `df |> distinct(x) |> pull(x)` or `df |> distinct(y) |> pull(y)`.
Regardless of how we call `grouped_mean()` it always does `df |> group_by(group_var) |> summarise(mean(mean_var))`, instead of `df |> group_by(group) |> summarise(mean(x))` or `df |> group_by(group) |> summarise(mean(y))`.
This is a problem of indirection, and it arises because dplyr uses **tidy evaluation** to allow you to refer to the names of variables inside your data frame without any special treatment.

Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it's obvious from the context.
The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function.
Here we need some way to tell `distinct()` and `pull()` not to treat `var` as the name of a variable, but instead look inside `var` for the variable we actually want to use.
Here we need some way to tell `group_by()` and `summarise()` not to treat `group_var` and `mean_var` as the names of variables, but instead look inside them for the variables we actually want to use.

Tidy evaluation includes a solution to this problem called **embracing** 🤗.
Embracing a variable means to wrap it in braces so (e.g.) `var` becomes `{{ var }}`.
Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name.
One way to remember what's happening is to think of `{{ }}` as looking down a tunnel --- `{{ var }}` will make a dplyr function look inside of `var` rather than looking for a variable called `var`.

So to make `pull_unique()` work we need to replace `var` with `{{ var }}`:
So to make `grouped_mean()` work we need to surround `group_var` and `mean_var` with `{{ }}`:

```{r}
pull_unique <- function(df, var) {
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    distinct({{ var }}) |>
    pull({{ var }})
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}
diamonds |> pull_unique(clarity)

diamonds |> grouped_mean(cut, carat)
```

Success!
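
To see that embracing really resolves the indirection, here is a small sketch that re-creates the made-up data frame from above and calls the embraced `grouped_mean()` on it. The results are extracted by position so the sketch doesn't rely on the name dplyr gives the summary column.

```r
library(dplyr)

# Embraced version of the function from the text
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

# The made-up data frame used earlier to expose the problem
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

mean_x <- df |> grouped_mean(group, x)
mean_y <- df |> grouped_mean(group, y)

mean_x[[2]]  # mean of `x` within the single group
mean_y[[2]]  # mean of `y` within the single group
```

If the function were still ignoring its arguments, both calls would summarise the `mean_var` column and return the same value; with embracing, each call summarises the column that was actually passed in.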
@@ -511,8 +518,7 @@ unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    distinct({{ var }}) |>
    arrange({{ var }}) |>
    pull({{ var }})
    arrange({{ var }})
}

# Find all the destinations in December
@@ -521,7 +527,7 @@ flights |> unique_where(month == 12, dest)
flights |> unique_where(tailnum == "N14228", month)
```

Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()`, `arrange()`, and `pull()`.
Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
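
As a minimal sketch of how the two embraced arguments behave, the following runs the `pull()`-free version of `unique_where()` on a tiny stand-in table. `toy_flights` is a made-up data frame, used here only so the example runs without the nycflights13 package.

```r
library(dplyr)

unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>   # `condition` is an expression such as month == 12
    distinct({{ var }}) |>       # `var` names the column to de-duplicate
    arrange({{ var }})           # and the column to sort by
}

# Hypothetical stand-in for `flights`
toy_flights <- tibble(
  month = c(12, 12, 1, 12),
  dest = c("LAX", "ATL", "ORD", "LAX")
)

december_dests <- toy_flights |> unique_where(month == 12, dest)
december_dests
```

Because the function no longer ends in `pull()`, the result is a one-column tibble rather than a bare vector.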

We've made all these examples take a data frame as the first argument, but if you're working repeatedly with the same data, it can make sense to hardcode it.
For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.
@@ -890,21 +896,20 @@ This makes it easier to see the hierarchy in your code by skimming the left-hand

```{r}
# missing extra two spaces
pull_unique <- function(df, var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
density <- function(colour, facets, binwidth = 0.1) {
diamonds |>
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}

# Pipe indented incorrectly
pull_unique <- function(df, var) {
df |>
distinct({{ var }}) |>
pull({{ var }})
density <- function(colour, facets, binwidth = 0.1) {
diamonds |>
ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
geom_freqpoly(binwidth = binwidth) +
facet_wrap(vars({{ facets }}))
}

# Missing {} and all one line
pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})
```

As you can see we recommend putting extra spaces inside of `{{ }}`.