Use US spelling of summarize()

Fixes #1125
Hadley Wickham 2022-11-18 16:07:54 -06:00
parent d06d4129e6
commit 3045d05e64
17 changed files with 87 additions and 88 deletions

@ -247,7 +247,7 @@ There are a number other base approaches to creating new columns including with
Hadley collected a few examples at <https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf>. Hadley collected a few examples at <https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf>.
Using `$` directly is convenient when performing quick summaries. Using `$` directly is convenient when performing quick summaries.
For example, if you just want find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarise()`: For example, if you just want find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarize()`:
```{r} ```{r}
max(diamonds$carat) max(diamonds$carat)
@ -423,7 +423,7 @@ Another important member of the apply family is `tapply()` which computes a sing
```{r} ```{r}
diamonds |> diamonds |>
group_by(cut) |> group_by(cut) |>
summarise(price = mean(price)) summarize(price = mean(price))
tapply(diamonds$price, diamonds$cut, mean) tapply(diamonds$price, diamonds$cut, mean)
``` ```

@ -187,7 +187,7 @@ It's not wonderful for this plot, but it isn't too bad.
```{r} ```{r}
class_avg <- mpg |> class_avg <- mpg |>
group_by(class) |> group_by(class) |>
summarise( summarize(
displ = median(displ), displ = median(displ),
hwy = median(hwy) hwy = median(hwy)
) )
@ -208,7 +208,7 @@ Often, you want the label in the corner of the plot, so it's convenient to creat
```{r} ```{r}
label_info <- mpg |> label_info <- mpg |>
summarise( summarize(
displ = max(displ), displ = max(displ),
hwy = max(hwy), hwy = max(hwy),
label = "Increasing engine size is \nrelated to decreasing fuel economy." label = "Increasing engine size is \nrelated to decreasing fuel economy."

@ -423,11 +423,10 @@ This means subsequent operations will now work "by month".
### `summarize()` {#sec-summarize} ### `summarize()` {#sec-summarize}
The most important grouped operation is a summary. The most important grouped operation is a summary, which each collapses each group to a single row.
It collapses each group to a single row[^data-transform-3]. In dplyr, this is operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:
Here we compute the average departure delay by month:
[^data-transform-3]: This is a slightly simplification; later on you'll learn how to use `summarize()` to produce multiple summary rows for each group. [^data-transform-3]: Or `summarise()`, if you prefer British English.
```{r} ```{r}
flights |> flights |>
@ -673,7 +672,7 @@ You can find a good explanation of this problem and how to overcome it at <http:
## Summary ## Summary
In this chapter, you've learned the tools that dplyr provides for working with data frames. In this chapter, you've learned the tools that dplyr provides for working with data frames.
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`, those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarise()`). The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`, those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarize()`).
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with the individual variable. In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with the individual variable.
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable. We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.

@ -310,7 +310,7 @@ flights |>
```{r} ```{r}
flights |> flights |>
group_by(dest) |> group_by(dest) |>
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |> summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
show_query() show_query()
``` ```
@ -393,14 +393,14 @@ You'll see more complex examples once we hit the join functions.
### GROUP BY ### GROUP BY
`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause: `group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarize()` is translated to the `SELECT` clause:
[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause. [^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
```{r} ```{r}
diamonds_db |> diamonds_db |>
group_by(cut) |> group_by(cut) |>
summarise( summarize(
n = n(), n = n(),
avg_price = mean(price, na.rm = TRUE) avg_price = mean(price, na.rm = TRUE)
) |> ) |>
@ -445,7 +445,7 @@ dbplyr will remind you about this behavior the first time you hit it:
```{r} ```{r}
flights |> flights |>
group_by(dest) |> group_by(dest) |>
summarise(delay = mean(arr_delay)) summarize(delay = mean(arr_delay))
``` ```
If you want to learn more about how NULLs work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand. If you want to learn more about how NULLs work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
@ -471,7 +471,7 @@ This is a one of the idiosyncracies of SQL created because `WHERE` is evaluated
```{r} ```{r}
diamonds_db |> diamonds_db |>
group_by(cut) |> group_by(cut) |>
summarise(n = n()) |> summarize(n = n()) |>
filter(n > 100) |> filter(n > 100) |>
show_query() show_query()
``` ```
@ -579,13 +579,13 @@ The easiest way to see the full set of what's currently available is to visit th
So far we've focused on the big picture of how dplyr verbs are translated to the clauses of a query. So far we've focused on the big picture of how dplyr verbs are translated to the clauses of a query.
Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`? Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?
To help see what's going on, we'll use a couple of little helper functions that run a `summarise()` or `mutate()` and show the generated SQL. To help see what's going on, we'll use a couple of little helper functions that run a `summarize()` or `mutate()` and show the generated SQL.
That will make it a little easier to explore a few variations and see how summaries and transformations can differ. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.
```{r} ```{r}
summarize_query <- function(df, ...) { summarize_query <- function(df, ...) {
df |> df |>
summarise(...) |> summarize(...) |>
show_query() show_query()
} }
mutate_query <- function(df, ...) { mutate_query <- function(df, ...) {

@ -351,7 +351,7 @@ It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays
flights_dt |> flights_dt |>
mutate(minute = minute(dep_time)) |> mutate(minute = minute(dep_time)) |>
group_by(minute) |> group_by(minute) |>
summarise( summarize(
avg_delay = mean(dep_delay, na.rm = TRUE), avg_delay = mean(dep_delay, na.rm = TRUE),
n = n()) |> n = n()) |>
ggplot(aes(minute, avg_delay)) + ggplot(aes(minute, avg_delay)) +
@ -369,7 +369,7 @@ Interestingly, if we look at the *scheduled* departure time we don't see such a
sched_dep <- flights_dt |> sched_dep <- flights_dt |>
mutate(minute = minute(sched_dep_time)) |> mutate(minute = minute(sched_dep_time)) |>
group_by(minute) |> group_by(minute) |>
summarise( summarize(
avg_delay = mean(arr_delay, na.rm = TRUE), avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()) n = n())

@ -179,7 +179,7 @@ For example, imagine you want to explore the average number of hours spent watch
#| any sense of overall pattern. #| any sense of overall pattern.
relig_summary <- gss_cat |> relig_summary <- gss_cat |>
group_by(relig) |> group_by(relig) |>
summarise( summarize(
age = mean(age, na.rm = TRUE), age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE),
n = n() n = n()
@ -232,7 +232,7 @@ What if we create a similar plot looking at how average age varies across report
#| then $8000-9999. #| then $8000-9999.
rincome_summary <- gss_cat |> rincome_summary <- gss_cat |>
group_by(rincome) |> group_by(rincome) |>
summarise( summarize(
age = mean(age, na.rm = TRUE), age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE),
n = n() n = n()

@ -441,7 +441,7 @@ So the key challenge in writing data frame functions is figuring out which argum
Fortunately this is easy because you can look it up from the documentation 😄. Fortunately this is easy because you can look it up from the documentation 😄.
There are two terms to look for in the docs which corresponding to the two most common sub-types of tidy evaluation: There are two terms to look for in the docs which corresponding to the two most common sub-types of tidy evaluation:
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` that compute with variables. - **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.
- **Tidy-selection**: this is used for for functions like `select()`, `relocate()`, and `rename()` that select variables. - **Tidy-selection**: this is used for for functions like `select()`, `relocate()`, and `rename()` that select variables.
@ -455,7 +455,7 @@ If you commonly perform the same set of summaries when doing initial data explor
```{r} ```{r}
summary6 <- function(data, var) { summary6 <- function(data, var) {
data |> summarise( data |> summarize(
min = min({{ var }}, na.rm = TRUE), min = min({{ var }}, na.rm = TRUE),
mean = mean({{ var }}, na.rm = TRUE), mean = mean({{ var }}, na.rm = TRUE),
median = median({{ var }}, na.rm = TRUE), median = median({{ var }}, na.rm = TRUE),
@ -468,9 +468,9 @@ summary6 <- function(data, var) {
diamonds |> summary6(carat) diamonds |> summary6(carat)
``` ```
(Whenever you wrap `summarise()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.) (Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
The nice thing about this function is because it wraps `summarise()` you can used it on grouped data: The nice thing about this function is because it wraps `summarize()` you can used it on grouped data:
```{r} ```{r}
diamonds |> diamonds |>
@ -489,7 +489,7 @@ diamonds |>
To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`. To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`.
Another popular `summarise()` helper function is a version of `count()` that also computes proportions: Another popular `summarize()` helper function is a version of `count()` that also computes proportions:
```{r} ```{r}
# https://twitter.com/Diabb6/status/1571635146658402309 # https://twitter.com/Diabb6/status/1571635146658402309
@ -547,7 +547,7 @@ You might try writing something like:
count_missing <- function(df, group_vars, x_var) { count_missing <- function(df, group_vars, x_var) {
df |> df |>
group_by({{ group_vars }}) |> group_by({{ group_vars }}) |>
summarise(n_miss = sum(is.na({{ x_var }}))) summarize(n_miss = sum(is.na({{ x_var }})))
} }
flights |> flights |>
count_missing(c(year, month, day), dep_time) count_missing(c(year, month, day), dep_time)
@ -560,7 +560,7 @@ We can work around that problem by using the handy `pick()` which allows you to
count_missing <- function(df, group_vars, x_var) { count_missing <- function(df, group_vars, x_var) {
df |> df |>
group_by(pick({{ group_vars }})) |> group_by(pick({{ group_vars }})) |>
summarise(n_miss = sum(is.na({{ x_var }}))) summarize(n_miss = sum(is.na({{ x_var }})))
} }
flights |> flights |>
count_missing(c(year, month, day), dep_time) count_missing(c(year, month, day), dep_time)
@ -602,7 +602,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
```{r} ```{r}
#| eval: false #| eval: false
flights |> group_by(dest) |> summarise_severe() flights |> group_by(dest) |> summarize_severe()
``` ```
3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours: 3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:
@ -616,7 +616,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
```{r} ```{r}
#| eval: false #| eval: false
weather |> summarise_weather(temp) weather |> summarize_weather(temp)
``` ```
5. Converts the user supplied variable that uses clock time (e.g. `dep_time`, `arr_time`, etc) into a decimal time (i.e. hours + minutes / 60). 5. Converts the user supplied variable that uses clock time (e.g. `dep_time`, `arr_time`, etc) into a decimal time (i.e. hours + minutes / 60).

@ -17,7 +17,7 @@ In most other languages, you'd need to explicitly double each element of x using
This book has already given you a small but powerful number of tools that perform the same action for multiple "things": This book has already given you a small but powerful number of tools that perform the same action for multiple "things":
- `facet_wrap()` and `facet_grid()` draws a plot for each subset. - `facet_wrap()` and `facet_grid()` draws a plot for each subset.
- `group_by()` plus `summarise()` computes a summary statistics for each subset. - `group_by()` plus `summarize()` computes a summary statistics for each subset.
- `unnest_wider()` and `unnest_longer()` create new rows and columns for each element of a list-column. - `unnest_wider()` and `unnest_longer()` create new rows and columns for each element of a list-column.
Now it's time to learn some more general tools, often called **functional programming** tools because they are built around functions that take other functions as inputs. Now it's time to learn some more general tools, often called **functional programming** tools because they are built around functions that take other functions as inputs.
@ -57,7 +57,7 @@ df <- tibble(
You could do it with copy-and-paste: You could do it with copy-and-paste:
```{r} ```{r}
df |> summarise( df |> summarize(
n = n(), n = n(),
a = median(a), a = median(a),
b = median(b), b = median(b),
@ -70,7 +70,7 @@ That breaks our rule of thumb to never copy and paste more than twice, and you c
Instead you can use `across()`: Instead you can use `across()`:
```{r} ```{r}
df |> summarise( df |> summarize(
n = n(), n = n(),
across(a:d, median), across(a:d, median),
) )
@ -100,10 +100,10 @@ df <- tibble(
df |> df |>
group_by(grp) |> group_by(grp) |>
summarise(across(everything(), median)) summarize(across(everything(), median))
``` ```
Note grouping columns (`grp` here) are not included in `across()`, because they're automatically preserved by `summarise()`. Note grouping columns (`grp` here) are not included in `across()`, because they're automatically preserved by `summarize()`.
`where()` allows you to select columns based on their type: `where()` allows you to select columns based on their type:
@ -122,10 +122,10 @@ df_types <- tibble(
) )
df_types |> df_types |>
summarise(across(where(is.numeric), mean)) summarize(across(where(is.numeric), mean))
df_types |> df_types |>
summarise(across(where(is.character), str_flatten)) summarize(across(where(is.character), str_flatten))
``` ```
Just like other selectors, you can combine these with Boolean algebra. Just like other selectors, you can combine these with Boolean algebra.
@ -146,7 +146,7 @@ If you forget, you'll get an error:
#| error: true #| error: true
df |> df |>
group_by(grp) |> group_by(grp) |>
summarise(across(everything(), median())) summarize(across(everything(), median()))
``` ```
This error arises because you're calling the function with no input, e.g.: This error arises because you're calling the function with no input, e.g.:
@ -174,7 +174,7 @@ df_miss <- tibble(
d = rnorm(5) d = rnorm(5)
) )
df_miss |> df_miss |>
summarise( summarize(
across(a:d, median), across(a:d, median),
n = n() n = n()
) )
@ -185,7 +185,7 @@ To do so, instead of calling `median()` directly, we need to create a new functi
```{r} ```{r}
df_miss |> df_miss |>
summarise( summarize(
across(a:d, function(x) median(x, na.rm = TRUE)), across(a:d, function(x) median(x, na.rm = TRUE)),
n = n() n = n()
) )
@ -203,7 +203,7 @@ This is a little verbose, so R comes with a handy shortcut: for this sort of thr
```{r} ```{r}
#| results: false #| results: false
df_miss |> df_miss |>
summarise( summarize(
across(a:d, \(x) median(x, na.rm = TRUE)), across(a:d, \(x) median(x, na.rm = TRUE)),
n = n() n = n()
) )
@ -215,7 +215,7 @@ In either case, `across()` effectively expands to the following code:
#| eval: false #| eval: false
df_miss |> df_miss |>
summarise( summarize(
a = median(a, na.rm = TRUE), a = median(a, na.rm = TRUE),
b = median(b, na.rm = TRUE), b = median(b, na.rm = TRUE),
c = median(c, na.rm = TRUE), c = median(c, na.rm = TRUE),
@ -230,7 +230,7 @@ You supply multiple functions by using a named list to `.fns`:
```{r} ```{r}
df_miss |> df_miss |>
summarise( summarize(
across(a:d, list( across(a:d, list(
median = \(x) median(x, na.rm = TRUE), median = \(x) median(x, na.rm = TRUE),
n_miss = \(x) sum(is.na(x)) n_miss = \(x) sum(is.na(x))
@ -252,7 +252,7 @@ We could specify our own if we wanted the name of the function to come first[^it
```{r} ```{r}
df_miss |> df_miss |>
summarise( summarize(
across( across(
a:d, a:d,
list( list(
@ -288,7 +288,7 @@ df_miss |>
### Filtering ### Filtering
`across()` is a great match for `summarise()` and `mutate()` but it's more awkward to use with `filter()`, because you usually combine multiple conditions with either `|` or `&`. `across()` is a great match for `summarize()` and `mutate()` but it's more awkward to use with `filter()`, because you usually combine multiple conditions with either `|` or `&`.
It's clear that `across()` can help to create multiple logical columns, but then what? It's clear that `across()` can help to create multiple logical columns, but then what?
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`: So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:
@ -331,20 +331,20 @@ For example, this function will compute the means of numeric columns by default.
But by supplying the second argument you can choose to summarize just selected columns: But by supplying the second argument you can choose to summarize just selected columns:
```{r} ```{r}
summarise_means <- function(df, summary_vars = where(is.numeric)) { summarize_means <- function(df, summary_vars = where(is.numeric)) {
df |> df |>
summarise( summarize(
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)), across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
n = n() n = n()
) )
} }
diamonds |> diamonds |>
group_by(clarity) |> group_by(clarity) |>
summarise_means() summarize_means()
diamonds |> diamonds |>
group_by(clarity) |> group_by(clarity) |>
summarise_means(c(carat, x:z)) summarize_means(c(carat, x:z))
``` ```
### Vs `pivot_longer()` ### Vs `pivot_longer()`
@ -355,7 +355,7 @@ For example, take this multi-function summary:
```{r} ```{r}
df |> df |>
summarise(across(a:d, list(median = median, mean = mean))) summarize(across(a:d, list(median = median, mean = mean)))
``` ```
We could compute the same values by pivoting longer and then summarizing: We could compute the same values by pivoting longer and then summarizing:
@ -364,7 +364,7 @@ We could compute the same values by pivoting longer and then summarizing:
long <- df |> long <- df |>
pivot_longer(a:d) |> pivot_longer(a:d) |>
group_by(name) |> group_by(name) |>
summarise( summarize(
median = median(value), median = median(value),
mean = mean(value) mean = mean(value)
) )
@ -414,7 +414,7 @@ df_long
df_long |> df_long |>
group_by(group) |> group_by(group) |>
summarise(mean = weighted.mean(val, wts)) summarize(mean = weighted.mean(val, wts))
``` ```
If needed, you could `pivot_wider()` this back to the original form. If needed, you could `pivot_wider()` this back to the original form.
@ -445,7 +445,7 @@ If needed, you could `pivot_wider()` this back to the original form.
show_missing <- function(df, group_vars, summary_vars = everything()) { show_missing <- function(df, group_vars, summary_vars = everything()) {
df |> df |>
group_by(pick({{ group_vars }})) |> group_by(pick({{ group_vars }})) |>
summarise( summarize(
across({{ summary_vars }}, \(x) sum(is.na(x))), across({{ summary_vars }}, \(x) sum(is.na(x))),
.groups = "drop" .groups = "drop"
) |> ) |>

@ -391,7 +391,7 @@ flights2 |>
worst <- filter(flights, !is.na(dep_time), month == 6, day == 13) worst <- filter(flights, !is.na(dep_time), month == 6, day == 13)
worst |> worst |>
group_by(dest) |> group_by(dest) |>
summarise(delay = mean(arr_delay), n = n()) |> summarize(delay = mean(arr_delay), n = n()) |>
filter(n > 5) |> filter(n > 5) |>
inner_join(airports, by = c("dest" = "faa")) |> inner_join(airports, by = c("dest" = "faa")) |>
ggplot(aes(lon, lat)) + ggplot(aes(lon, lat)) +

@ -336,7 +336,7 @@ For example, we could use `all()` to find out if there were days where every fli
```{r} ```{r}
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
all_delayed = all(arr_delay >= 0, na.rm = TRUE), all_delayed = all(arr_delay >= 0, na.rm = TRUE),
any_delayed = any(arr_delay >= 0, na.rm = TRUE), any_delayed = any(arr_delay >= 0, na.rm = TRUE),
.groups = "drop" .groups = "drop"
@ -361,7 +361,7 @@ That lets us see the distribution of delays across the days of the year as shown
#| peaks around 30% delayed flights. #| peaks around 30% delayed flights.
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
prop_delayed = mean(arr_delay > 0, na.rm = TRUE), prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
.groups = "drop" .groups = "drop"
) |> ) |>
@ -374,7 +374,7 @@ Or we could ask how many flights left before 5am, which are often flights that w
```{r} ```{r}
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
n_early = sum(dep_time < 500, na.rm = TRUE), n_early = sum(dep_time < 500, na.rm = TRUE),
.groups = "drop" .groups = "drop"
) |> ) |>
@ -393,7 +393,7 @@ One way to do so would be to first filter the flights:
flights |> flights |>
filter(arr_delay > 0) |> filter(arr_delay > 0) |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
behind = mean(arr_delay), behind = mean(arr_delay),
n = n(), n = n(),
.groups = "drop" .groups = "drop"
@ -411,7 +411,7 @@ This leads to:
```{r} ```{r}
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE), behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE), ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
n = n(), n = n(),

@ -250,7 +250,7 @@ And again you can use `.drop = FALSE` to preserve all factor levels:
```{r} ```{r}
health |> health |>
group_by(smoker, .drop = FALSE) |> group_by(smoker, .drop = FALSE) |>
summarise( summarize(
n = n(), n = n(),
mean_age = mean(age), mean_age = mean(age),
min_age = min(age), min_age = min(age),
@ -283,7 +283,7 @@ Sometimes a simpler approach is to perform the summary and then make the implici
```{r} ```{r}
health |> health |>
group_by(smoker) |> group_by(smoker) |>
summarise( summarize(
n = n(), n = n(),
mean_age = mean(age), mean_age = mean(age),
min_age = min(age), min_age = min(age),

@ -14,7 +14,7 @@ Now it's time to systematically survey what you can do with them in R, ensuring
We'll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of `count()`. We'll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of `count()`.
Then we'll dive into various numeric transformations that pair well with `mutate()`, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors. Then we'll dive into various numeric transformations that pair well with `mutate()`, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors.
We'll finish off by covering the summary functions that pair well with `summarise()` and show you how they can also be used with `mutate()`. We'll finish off by covering the summary functions that pair well with `summarize()` and show you how they can also be used with `mutate()`.
### Prerequisites ### Prerequisites
@ -70,13 +70,13 @@ flights |> count(dest, sort = TRUE)
And remember that if you want to see all the values, you can use `|> View()` or `|> print(n = Inf)`. And remember that if you want to see all the values, you can use `|> View()` or `|> print(n = Inf)`.
You can perform the same computation "by hand" with `group_by()`, `summarise()` and `n()`. You can perform the same computation "by hand" with `group_by()`, `summarize()` and `n()`.
This is useful because it allows you to compute other summaries at the same time: This is useful because it allows you to compute other summaries at the same time:
```{r} ```{r}
flights |> flights |>
group_by(dest) |> group_by(dest) |>
summarise( summarize(
n = n(), n = n(),
delay = mean(arr_delay, na.rm = TRUE) delay = mean(arr_delay, na.rm = TRUE)
) )
@ -99,7 +99,7 @@ There are a couple of variants of `n()` that you might find useful:
```{r} ```{r}
flights |> flights |>
group_by(dest) |> group_by(dest) |>
summarise( summarize(
carriers = n_distinct(carrier) carriers = n_distinct(carrier)
) |> ) |>
arrange(desc(carriers)) arrange(desc(carriers))
@ -111,7 +111,7 @@ There are a couple of variants of `n()` that you might find useful:
```{r} ```{r}
flights |> flights |>
group_by(tailnum) |> group_by(tailnum) |>
summarise(miles = sum(distance)) summarize(miles = sum(distance))
``` ```
Weighted counts are a common problem so `count()` has a `wt` argument that does the same thing: Weighted counts are a common problem so `count()` has a `wt` argument that does the same thing:
@ -126,13 +126,13 @@ There are a couple of variants of `n()` that you might find useful:
```{r} ```{r}
flights |> flights |>
group_by(dest) |> group_by(dest) |>
summarise(n_cancelled = sum(is.na(dep_time))) summarize(n_cancelled = sum(is.na(dep_time)))
``` ```
### Exercises ### Exercises
1. How can you use `count()` to count the number rows with a missing value for a given variable? 1. How can you use `count()` to count the number rows with a missing value for a given variable?
2. Expand the following calls to `count()` to instead use `group_by()`, `summarise()`, and `arrange()`: 2. Expand the following calls to `count()` to instead use `group_by()`, `summarize()`, and `arrange()`:
1. `flights |> count(dest, sort = TRUE)` 1. `flights |> count(dest, sort = TRUE)`
2. `flights |> count(tailnum, wt = distance)` 2. `flights |> count(tailnum, wt = distance)`
@ -254,7 +254,7 @@ The results are shown in @fig-prop-cancelled.
#| getting down to around 1% by midnight. #| getting down to around 1% by midnight.
flights |> flights |>
group_by(hour = sched_dep_time %/% 100) |> group_by(hour = sched_dep_time %/% 100) |>
summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |> summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
filter(hour > 1) |> filter(hour > 1) |>
ggplot(aes(hour, prop_cancelled)) + ggplot(aes(hour, prop_cancelled)) +
geom_line(color = "grey50") + geom_line(color = "grey50") +
@ -543,7 +543,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
flights |> flights |>
mutate(hour = dep_time %/% 100) |> mutate(hour = dep_time %/% 100) |>
group_by(year, month, day, hour) |> group_by(year, month, day, hour) |>
summarise( summarize(
dep_delay = mean(dep_delay, na.rm = TRUE), dep_delay = mean(dep_delay, na.rm = TRUE),
n = n(), n = n(),
.groups = "drop" .groups = "drop"
@ -589,7 +589,7 @@ The median delay is always smaller than the mean delay because because flights s
#| outlying points with mean ~60, median ~50, and mean ~85, median ~55. #| outlying points with mean ~60, median ~50, and mean ~85, median ~55.
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
mean = mean(dep_delay, na.rm = TRUE), mean = mean(dep_delay, na.rm = TRUE),
median = median(dep_delay, na.rm = TRUE), median = median(dep_delay, na.rm = TRUE),
n = n(), n = n(),
@ -618,7 +618,7 @@ For the `flights` data, you might want to look at the 95% quantile of delays rat
```{r} ```{r}
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
max = max(dep_delay, na.rm = TRUE), max = max(dep_delay, na.rm = TRUE),
q95 = quantile(dep_delay, 0.95, na.rm = TRUE), q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
.groups = "drop" .groups = "drop"
@ -638,7 +638,7 @@ But the code below makes it looks like one airport, [EGE](https://en.wikipedia.o
```{r} ```{r}
flights |> flights |>
group_by(origin, dest) |> group_by(origin, dest) |>
summarise( summarize(
distance_sd = IQR(distance), distance_sd = IQR(distance),
n = n(), n = n(),
.groups = "drop" .groups = "drop"
@ -721,7 +721,7 @@ For example, we can find the first and last departure for each day:
```{r} ```{r}
flights |> flights |>
group_by(year, month, day) |> group_by(year, month, day) |>
summarise( summarize(
first_dep = first(dep_time), first_dep = first(dep_time),
fifth_dep = nth(dep_time, 5), fifth_dep = nth(dep_time, 5),
last_dep = last(dep_time) last_dep = last(dep_time)
@ -747,7 +747,7 @@ flights |>
### With `mutate()` ### With `mutate()`
As the names suggest, the summary functions are typically paired with `summarise()`. As the names suggest, the summary functions are typically paired with `summarize()`.
However, because of the recycling rules we discussed in @sec-recycling they can also be usefully paired with `mutate()`, particularly when you want do some sort of group standardization. However, because of the recycling rules we discussed in @sec-recycling they can also be usefully paired with `mutate()`, particularly when you want do some sort of group standardization.
For example: For example:

@ -28,7 +28,7 @@ tints <- seq(0.1, 1, length.out = 10)
oreilly |> oreilly |>
group_by(name) |> group_by(name) |>
summarise( summarize(
tint = paste0("t", tints * 100), tint = paste0("t", tints * 100),
colour = tint(col, tints), colour = tint(col, tints),
.groups = "drop" .groups = "drop"

@ -227,7 +227,7 @@ str_split(x, " ")[[1]]
str_split(x, boundary("word"))[[1]] str_split(x, boundary("word"))[[1]]
``` ```
Show how `separate_rows()` is a special case of `str_split()` + `summarise()`. Show how `separate_rows()` is a special case of `str_split()` + `summarize()`.
## Replace with function ## Replace with function

View File

@ -165,7 +165,7 @@ It looks like they've radically increased in popularity lately!
babynames |> babynames |>
group_by(year) |> group_by(year) |>
summarise(prop_x = mean(str_detect(name, "x"))) |> summarize(prop_x = mean(str_detect(name, "x"))) |>
ggplot(aes(year, prop_x)) + ggplot(aes(year, prop_x)) +
geom_line() geom_line()
``` ```
@ -388,10 +388,10 @@ str_view(fruit, "^apple$")
You can also match the boundary between words (i.e. the start or end of a word) with `\b`. You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
This can be particularly useful when using RStudio's find and replace tool. This can be particularly useful when using RStudio's find and replace tool.
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on: For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarize`, `summary`, `rowsum` and so on:
```{r} ```{r}
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)") x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum") str_view(x, "sum")
str_view(x, "\\bsum\\b") str_view(x, "\\bsum\\b")
``` ```
@ -569,7 +569,7 @@ str_match(x, "gr(?:e|a)y")
g. Contain at least two vowel-consonant pairs in a row. g. Contain at least two vowel-consonant pairs in a row.
h. Only consist of repeated vowel-consonant pairs. h. Only consist of repeated vowel-consonant pairs.
4. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. 4. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
Try and make the shortest possible regex! Try and make the shortest possible regex!
5. Switch the first and last letters in `words`. 5. Switch the first and last letters in `words`.

View File

@ -155,7 +155,7 @@ Now that you've learned the basics of creating a string or two by "hand", we'll
This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame.
For example, to create a greeting you might combine "Hello" with a `name` variable. For example, to create a greeting you might combine "Hello" with a `name` variable.
We'll show you how to do this with `str_c()` and `str_glue()` and how you can use them with `mutate()`. We'll show you how to do this with `str_c()` and `str_glue()` and how you can use them with `mutate()`.
That naturally raises the question of what string functions you might use with `summarise()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings. That naturally raises the question of what string functions you might use with `summarize()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.
### `str_c()` ### `str_c()`
@ -212,7 +212,7 @@ df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
### `str_flatten()` ### `str_flatten()`
`str_c()` and `str_glue()` work well with `mutate()` because their output is the same length as their inputs. `str_c()` and `str_glue()` work well with `mutate()` because their output is the same length as their inputs.
What if you want a function that works well with `summarise()`, i.e. something that always returns a single string? What if you want a function that works well with `summarize()`, i.e. something that always returns a single string?
That's the job of `str_flatten()`[^strings-5]: it takes a character vector and combines each element of the vector into a single string: That's the job of `str_flatten()`[^strings-5]: it takes a character vector and combines each element of the vector into a single string:
[^strings-5]: The base R equivalent is `paste()` used with the `collapse` argument. [^strings-5]: The base R equivalent is `paste()` used with the `collapse` argument.
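
To make the footnote concrete, here is a small sketch comparing the two calls on an invented vector. The variable names are hypothetical, and the comparison assumes a plain character vector with no missing values, since the two functions can treat `NA` differently:

```r
library(stringr)

# Invented input for illustration
x <- c("x", "y", "z")

flat_stringr <- str_flatten(x, ", ")    # stringr version
flat_base <- paste(x, collapse = ", ")  # base R version

# Both collapse the vector into a single string
identical(flat_stringr, flat_base)
```

`paste()` has no direct counterpart to the `last` argument of `str_flatten()`, so a different final separator would need extra code in base R.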
@ -223,7 +223,7 @@ str_flatten(c("x", "y", "z"), ", ")
str_flatten(c("x", "y", "z"), ", ", last = ", and ") str_flatten(c("x", "y", "z"), ", ", last = ", and ")
``` ```
This makes it work well with `summarise()`: This makes it work well with `summarize()`:
```{r} ```{r}
df <- tribble( df <- tribble(
@ -237,7 +237,7 @@ df <- tribble(
) )
df |> df |>
group_by(name) |> group_by(name) |>
summarise(fruits = str_flatten(fruit, ", ")) summarize(fruits = str_flatten(fruit, ", "))
``` ```
### Exercises ### Exercises

View File

@ -36,7 +36,7 @@ For example, the last chapter finished with a moderately complex pipe:
flights |> flights |>
filter(!is.na(arr_delay), !is.na(tailnum)) |> filter(!is.na(arr_delay), !is.na(tailnum)) |>
group_by(tailnum) |> group_by(tailnum) |>
summarise( summarize(
delay = mean(arr_delay, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE),
n = n() n = n()
) )
@ -50,7 +50,7 @@ We could nest each function call inside the previous call:
```{r} ```{r}
#| eval: false #| eval: false
summarise( summarize(
group_by( group_by(
filter( filter(
flights, flights,
@ -71,7 +71,7 @@ Or we could use a bunch of intermediate variables:
flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum)) flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
flights2 <- group_by(flights1, tailnum) flights2 <- group_by(flights1, tailnum)
flights3 <- summarise(flights2, flights3 <- summarize(flights2,
delay = mean(arr_delay, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE),
n = n() n = n()
) )
@ -91,7 +91,7 @@ library(tidyverse)
mtcars %>% mtcars %>%
group_by(cyl) %>% group_by(cyl) %>%
summarise(n = n()) summarize(n = n())
``` ```
For simple cases `|>` and `%>%` behave identically. For simple cases `|>` and `%>%` behave identically.
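
As a quick check of that claim, the sketch below runs the same simple pipeline with both pipes and compares the results. It assumes R 4.1 or later (for `|>`) and relies on dplyr re-exporting `%>%`; the object names are hypothetical:

```r
library(dplyr)  # also makes %>% available

# Same pipeline written with the native pipe...
by_cyl_native <- mtcars |>
  group_by(cyl) |>
  summarize(n = n())

# ...and with the magrittr pipe
by_cyl_magrittr <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n())

# Compare the two results
all.equal(by_cyl_native, by_cyl_magrittr)
```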