parent
d06d4129e6
commit
3045d05e64
|
@ -247,7 +247,7 @@ There are a number other base approaches to creating new columns including with
|
|||
Hadley collected a few examples at <https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf>.
|
||||
|
||||
Using `$` directly is convenient when performing quick summaries.
|
||||
For example, if you just want to find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarise()`:
|
||||
For example, if you just want to find the size of the biggest diamond or the possible values of `cut`, there's no need to use `summarize()`:
|
||||
|
||||
```{r}
|
||||
max(diamonds$carat)
|
||||
|
@ -423,7 +423,7 @@ Another important member of the apply family is `tapply()` which computes a sing
|
|||
```{r}
|
||||
diamonds |>
|
||||
group_by(cut) |>
|
||||
summarise(price = mean(price))
|
||||
summarize(price = mean(price))
|
||||
|
||||
tapply(diamonds$price, diamonds$cut, mean)
|
||||
```
|
||||
|
|
|
@ -187,7 +187,7 @@ It's not wonderful for this plot, but it isn't too bad.
|
|||
```{r}
|
||||
class_avg <- mpg |>
|
||||
group_by(class) |>
|
||||
summarise(
|
||||
summarize(
|
||||
displ = median(displ),
|
||||
hwy = median(hwy)
|
||||
)
|
||||
|
@ -208,7 +208,7 @@ Often, you want the label in the corner of the plot, so it's convenient to creat
|
|||
|
||||
```{r}
|
||||
label_info <- mpg |>
|
||||
summarise(
|
||||
summarize(
|
||||
displ = max(displ),
|
||||
hwy = max(hwy),
|
||||
label = "Increasing engine size is \nrelated to decreasing fuel economy."
|
||||
|
|
|
@ -423,11 +423,10 @@ This means subsequent operations will now work "by month".
|
|||
|
||||
### `summarize()` {#sec-summarize}
|
||||
|
||||
The most important grouped operation is a summary.
|
||||
It collapses each group to a single row[^data-transform-3].
|
||||
Here we compute the average departure delay by month:
|
||||
The most important grouped operation is a summary, which collapses each group to a single row.
|
||||
In dplyr, this operation is performed by `summarize()`[^data-transform-3], as shown by the following example, which computes the average departure delay by month:
|
||||
|
||||
[^data-transform-3]: This is a slight simplification; later on you'll learn how to use `summarize()` to produce multiple summary rows for each group.
|
||||
[^data-transform-3]: Or `summarise()`, if you prefer British English.
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
|
@ -673,7 +672,7 @@ You can find a good explanation of this problem and how to overcome it at <http:
|
|||
## Summary
|
||||
|
||||
In this chapter, you've learned the tools that dplyr provides for working with data frames.
|
||||
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarise()`).
|
||||
The tools are roughly grouped into three categories: those that manipulate the rows (like `filter()` and `arrange()`), those that manipulate the columns (like `select()` and `mutate()`), and those that manipulate groups (like `group_by()` and `summarize()`).
|
||||
In this chapter, we've focused on these "whole data frame" tools, but you haven't yet learned much about what you can do with the individual variable.
|
||||
We'll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.
|
||||
|
||||
|
|
|
@ -310,7 +310,7 @@ flights |>
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(dest) |>
|
||||
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
|
@ -393,14 +393,14 @@ You'll see more complex examples once we hit the join functions.
|
|||
|
||||
### GROUP BY
|
||||
|
||||
`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause:
|
||||
`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarize()` is translated to the `SELECT` clause:
|
||||
|
||||
[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
|
||||
|
||||
```{r}
|
||||
diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
avg_price = mean(price, na.rm = TRUE)
|
||||
) |>
|
||||
|
@ -445,7 +445,7 @@ dbplyr will remind you about this behavior the first time you hit it:
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(dest) |>
|
||||
summarise(delay = mean(arr_delay))
|
||||
summarize(delay = mean(arr_delay))
|
||||
```
|
||||
|
||||
If you want to learn more about how NULLs work, you might enjoy "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
|
||||
|
@ -471,7 +471,7 @@ This is a one of the idiosyncracies of SQL created because `WHERE` is evaluated
|
|||
```{r}
|
||||
diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(n = n()) |>
|
||||
summarize(n = n()) |>
|
||||
filter(n > 100) |>
|
||||
show_query()
|
||||
```
|
||||
|
@ -579,13 +579,13 @@ The easiest way to see the full set of what's currently available is to visit th
|
|||
So far we've focused on the big picture of how dplyr verbs are translated to the clauses of a query.
|
||||
Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?
|
||||
|
||||
To help see what's going on, we'll use a couple of little helper functions that run a `summarise()` or `mutate()` and show the generated SQL.
|
||||
To help see what's going on, we'll use a couple of little helper functions that run a `summarize()` or `mutate()` and show the generated SQL.
|
||||
That will make it a little easier to explore a few variations and see how summaries and transformations can differ.
|
||||
|
||||
```{r}
|
||||
summarize_query <- function(df, ...) {
|
||||
df |>
|
||||
summarise(...) |>
|
||||
summarize(...) |>
|
||||
show_query()
|
||||
}
|
||||
mutate_query <- function(df, ...) {
|
||||
|
|
|
@ -351,7 +351,7 @@ It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays
|
|||
flights_dt |>
|
||||
mutate(minute = minute(dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
summarize(
|
||||
avg_delay = mean(dep_delay, na.rm = TRUE),
|
||||
n = n()) |>
|
||||
ggplot(aes(minute, avg_delay)) +
|
||||
|
@ -369,7 +369,7 @@ Interestingly, if we look at the *scheduled* departure time we don't see such a
|
|||
sched_dep <- flights_dt |>
|
||||
mutate(minute = minute(sched_dep_time)) |>
|
||||
group_by(minute) |>
|
||||
summarise(
|
||||
summarize(
|
||||
avg_delay = mean(arr_delay, na.rm = TRUE),
|
||||
n = n())
|
||||
|
||||
|
|
|
@ -179,7 +179,7 @@ For example, imagine you want to explore the average number of hours spent watch
|
|||
#| any sense of overall pattern.
|
||||
relig_summary <- gss_cat |>
|
||||
group_by(relig) |>
|
||||
summarise(
|
||||
summarize(
|
||||
age = mean(age, na.rm = TRUE),
|
||||
tvhours = mean(tvhours, na.rm = TRUE),
|
||||
n = n()
|
||||
|
@ -232,7 +232,7 @@ What if we create a similar plot looking at how average age varies across report
|
|||
#| then $8000-9999.
|
||||
rincome_summary <- gss_cat |>
|
||||
group_by(rincome) |>
|
||||
summarise(
|
||||
summarize(
|
||||
age = mean(age, na.rm = TRUE),
|
||||
tvhours = mean(tvhours, na.rm = TRUE),
|
||||
n = n()
|
||||
|
|
|
@ -441,7 +441,7 @@ So the key challenge in writing data frame functions is figuring out which argum
|
|||
Fortunately this is easy because you can look it up from the documentation 😄.
|
||||
There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:
|
||||
|
||||
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarise()` that compute with variables.
|
||||
- **Data-masking**: this is used in functions like `arrange()`, `filter()`, and `summarize()` that compute with variables.
|
||||
|
||||
- **Tidy-selection**: this is used for functions like `select()`, `relocate()`, and `rename()` that select variables.
|
||||
|
||||
|
@ -455,7 +455,7 @@ If you commonly perform the same set of summaries when doing initial data explor
|
|||
|
||||
```{r}
|
||||
summary6 <- function(data, var) {
|
||||
data |> summarise(
|
||||
data |> summarize(
|
||||
min = min({{ var }}, na.rm = TRUE),
|
||||
mean = mean({{ var }}, na.rm = TRUE),
|
||||
median = median({{ var }}, na.rm = TRUE),
|
||||
|
@ -468,9 +468,9 @@ summary6 <- function(data, var) {
|
|||
diamonds |> summary6(carat)
|
||||
```
|
||||
|
||||
(Whenever you wrap `summarise()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
|
||||
(Whenever you wrap `summarize()` in a helper, we think it's good practice to set `.groups = "drop"` to both avoid the message and leave the data in an ungrouped state.)
|
||||
|
||||
The nice thing about this function is that, because it wraps `summarise()`, you can use it on grouped data:
|
||||
The nice thing about this function is that, because it wraps `summarize()`, you can use it on grouped data:
|
||||
|
||||
```{r}
|
||||
diamonds |>
|
||||
|
@ -489,7 +489,7 @@ diamonds |>
|
|||
|
||||
To summarize multiple variables you'll need to wait until @sec-across, where you'll learn how to use `across()`.
|
||||
|
||||
Another popular `summarise()` helper function is a version of `count()` that also computes proportions:
|
||||
Another popular `summarize()` helper function is a version of `count()` that also computes proportions:
|
||||
|
||||
```{r}
|
||||
# https://twitter.com/Diabb6/status/1571635146658402309
|
||||
|
@ -547,7 +547,7 @@ You might try writing something like:
|
|||
count_missing <- function(df, group_vars, x_var) {
|
||||
df |>
|
||||
group_by({{ group_vars }}) |>
|
||||
summarise(n_miss = sum(is.na({{ x_var }})))
|
||||
summarize(n_miss = sum(is.na({{ x_var }})))
|
||||
}
|
||||
flights |>
|
||||
count_missing(c(year, month, day), dep_time)
|
||||
|
@ -560,7 +560,7 @@ We can work around that problem by using the handy `pick()` which allows you to
|
|||
count_missing <- function(df, group_vars, x_var) {
|
||||
df |>
|
||||
group_by(pick({{ group_vars }})) |>
|
||||
summarise(n_miss = sum(is.na({{ x_var }})))
|
||||
summarize(n_miss = sum(is.na({{ x_var }})))
|
||||
}
|
||||
flights |>
|
||||
count_missing(c(year, month, day), dep_time)
|
||||
|
@ -602,7 +602,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
|
|||
|
||||
```{r}
|
||||
#| eval: false
|
||||
flights |> group_by(dest) |> summarise_severe()
|
||||
flights |> group_by(dest) |> summarize_severe()
|
||||
```
|
||||
|
||||
3. Finds all flights that were cancelled or delayed by more than a user supplied number of hours:
|
||||
|
@ -616,7 +616,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins
|
|||
|
||||
```{r}
|
||||
#| eval: false
|
||||
weather |> summarise_weather(temp)
|
||||
weather |> summarize_weather(temp)
|
||||
```
|
||||
|
||||
5. Converts the user supplied variable that uses clock time (e.g. `dep_time`, `arr_time`, etc) into a decimal time (i.e. hours + minutes / 60).
|
||||
|
|
|
@ -17,7 +17,7 @@ In most other languages, you'd need to explicitly double each element of x using
|
|||
This book has already given you a small but powerful number of tools that perform the same action for multiple "things":
|
||||
|
||||
- `facet_wrap()` and `facet_grid()` draw a plot for each subset.
|
||||
- `group_by()` plus `summarise()` computes summary statistics for each subset.
|
||||
- `group_by()` plus `summarize()` computes summary statistics for each subset.
|
||||
- `unnest_wider()` and `unnest_longer()` create new rows and columns for each element of a list-column.
|
||||
|
||||
Now it's time to learn some more general tools, often called **functional programming** tools because they are built around functions that take other functions as inputs.
|
||||
|
@ -57,7 +57,7 @@ df <- tibble(
|
|||
You could do it with copy-and-paste:
|
||||
|
||||
```{r}
|
||||
df |> summarise(
|
||||
df |> summarize(
|
||||
n = n(),
|
||||
a = median(a),
|
||||
b = median(b),
|
||||
|
@ -70,7 +70,7 @@ That breaks our rule of thumb to never copy and paste more than twice, and you c
|
|||
Instead you can use `across()`:
|
||||
|
||||
```{r}
|
||||
df |> summarise(
|
||||
df |> summarize(
|
||||
n = n(),
|
||||
across(a:d, median),
|
||||
)
|
||||
|
@ -100,10 +100,10 @@ df <- tibble(
|
|||
|
||||
df |>
|
||||
group_by(grp) |>
|
||||
summarise(across(everything(), median))
|
||||
summarize(across(everything(), median))
|
||||
```
|
||||
|
||||
Note grouping columns (`grp` here) are not included in `across()`, because they're automatically preserved by `summarise()`.
|
||||
Note grouping columns (`grp` here) are not included in `across()`, because they're automatically preserved by `summarize()`.
|
||||
|
||||
`where()` allows you to select columns based on their type:
|
||||
|
||||
|
@ -122,10 +122,10 @@ df_types <- tibble(
|
|||
)
|
||||
|
||||
df_types |>
|
||||
summarise(across(where(is.numeric), mean))
|
||||
summarize(across(where(is.numeric), mean))
|
||||
|
||||
df_types |>
|
||||
summarise(across(where(is.character), str_flatten))
|
||||
summarize(across(where(is.character), str_flatten))
|
||||
```
|
||||
|
||||
Just like other selectors, you can combine these with Boolean algebra.
|
||||
|
@ -146,7 +146,7 @@ If you forget, you'll get an error:
|
|||
#| error: true
|
||||
df |>
|
||||
group_by(grp) |>
|
||||
summarise(across(everything(), median()))
|
||||
summarize(across(everything(), median()))
|
||||
```
|
||||
|
||||
This error arises because you're calling the function with no input, e.g.:
|
||||
|
@ -174,7 +174,7 @@ df_miss <- tibble(
|
|||
d = rnorm(5)
|
||||
)
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, median),
|
||||
n = n()
|
||||
)
|
||||
|
@ -185,7 +185,7 @@ To do so, instead of calling `median()` directly, we need to create a new functi
|
|||
|
||||
```{r}
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, function(x) median(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
)
|
||||
|
@ -203,7 +203,7 @@ This is a little verbose, so R comes with a handy shortcut: for this sort of thr
|
|||
```{r}
|
||||
#| results: false
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, \(x) median(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
)
|
||||
|
@ -215,7 +215,7 @@ In either case, `across()` effectively expands to the following code:
|
|||
#| eval: false
|
||||
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
a = median(a, na.rm = TRUE),
|
||||
b = median(b, na.rm = TRUE),
|
||||
c = median(c, na.rm = TRUE),
|
||||
|
@ -230,7 +230,7 @@ You supply multiple functions by using a named list to `.fns`:
|
|||
|
||||
```{r}
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(a:d, list(
|
||||
median = \(x) median(x, na.rm = TRUE),
|
||||
n_miss = \(x) sum(is.na(x))
|
||||
|
@ -252,7 +252,7 @@ We could specify our own if we wanted the name of the function to come first[^it
|
|||
|
||||
```{r}
|
||||
df_miss |>
|
||||
summarise(
|
||||
summarize(
|
||||
across(
|
||||
a:d,
|
||||
list(
|
||||
|
@ -288,7 +288,7 @@ df_miss |>
|
|||
|
||||
### Filtering
|
||||
|
||||
`across()` is a great match for `summarise()` and `mutate()` but it's more awkward to use with `filter()`, because you usually combine multiple conditions with either `|` or `&`.
|
||||
`across()` is a great match for `summarize()` and `mutate()` but it's more awkward to use with `filter()`, because you usually combine multiple conditions with either `|` or `&`.
|
||||
It's clear that `across()` can help to create multiple logical columns, but then what?
|
||||
So dplyr provides two variants of `across()` called `if_any()` and `if_all()`:
|
||||
|
||||
|
@ -331,20 +331,20 @@ For example, this function will compute the means of numeric columns by default.
|
|||
But by supplying the second argument you can choose to summarize just selected columns:
|
||||
|
||||
```{r}
|
||||
summarise_means <- function(df, summary_vars = where(is.numeric)) {
|
||||
summarize_means <- function(df, summary_vars = where(is.numeric)) {
|
||||
df |>
|
||||
summarise(
|
||||
summarize(
|
||||
across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
|
||||
n = n()
|
||||
)
|
||||
}
|
||||
diamonds |>
|
||||
group_by(clarity) |>
|
||||
summarise_means()
|
||||
summarize_means()
|
||||
|
||||
diamonds |>
|
||||
group_by(clarity) |>
|
||||
summarise_means(c(carat, x:z))
|
||||
summarize_means(c(carat, x:z))
|
||||
```
|
||||
|
||||
### Vs `pivot_longer()`
|
||||
|
@ -355,7 +355,7 @@ For example, take this multi-function summary:
|
|||
|
||||
```{r}
|
||||
df |>
|
||||
summarise(across(a:d, list(median = median, mean = mean)))
|
||||
summarize(across(a:d, list(median = median, mean = mean)))
|
||||
```
|
||||
|
||||
We could compute the same values by pivoting longer and then summarizing:
|
||||
|
@ -364,7 +364,7 @@ We could compute the same values by pivoting longer and then summarizing:
|
|||
long <- df |>
|
||||
pivot_longer(a:d) |>
|
||||
group_by(name) |>
|
||||
summarise(
|
||||
summarize(
|
||||
median = median(value),
|
||||
mean = mean(value)
|
||||
)
|
||||
|
@ -414,7 +414,7 @@ df_long
|
|||
|
||||
df_long |>
|
||||
group_by(group) |>
|
||||
summarise(mean = weighted.mean(val, wts))
|
||||
summarize(mean = weighted.mean(val, wts))
|
||||
```
|
||||
|
||||
If needed, you could `pivot_wider()` this back to the original form.
|
||||
|
@ -445,7 +445,7 @@ If needed, you could `pivot_wider()` this back to the original form.
|
|||
show_missing <- function(df, group_vars, summary_vars = everything()) {
|
||||
df |>
|
||||
group_by(pick({{ group_vars }})) |>
|
||||
summarise(
|
||||
summarize(
|
||||
across({{ summary_vars }}, \(x) sum(is.na(x))),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
|
|
|
@ -391,7 +391,7 @@ flights2 |>
|
|||
worst <- filter(flights, !is.na(dep_time), month == 6, day == 13)
|
||||
worst |>
|
||||
group_by(dest) |>
|
||||
summarise(delay = mean(arr_delay), n = n()) |>
|
||||
summarize(delay = mean(arr_delay), n = n()) |>
|
||||
filter(n > 5) |>
|
||||
inner_join(airports, by = c("dest" = "faa")) |>
|
||||
ggplot(aes(lon, lat)) +
|
||||
|
|
10
logicals.qmd
|
@ -336,7 +336,7 @@ For example, we could use `all()` to find out if there were days where every fli
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
all_delayed = all(arr_delay >= 0, na.rm = TRUE),
|
||||
any_delayed = any(arr_delay >= 0, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
|
@ -361,7 +361,7 @@ That lets us see the distribution of delays across the days of the year as shown
|
|||
#| peaks around 30% delayed flights.
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
|
@ -374,7 +374,7 @@ Or we could ask how many flights left before 5am, which are often flights that w
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n_early = sum(dep_time < 500, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
) |>
|
||||
|
@ -393,7 +393,7 @@ One way to do so would be to first filter the flights:
|
|||
flights |>
|
||||
filter(arr_delay > 0) |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
behind = mean(arr_delay),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
|
@ -411,7 +411,7 @@ This leads to:
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
|
||||
ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
|
||||
n = n(),
|
||||
|
|
|
@ -250,7 +250,7 @@ And again you can use `.drop = FALSE` to preserve all factor levels:
|
|||
```{r}
|
||||
health |>
|
||||
group_by(smoker, .drop = FALSE) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
mean_age = mean(age),
|
||||
min_age = min(age),
|
||||
|
@ -283,7 +283,7 @@ Sometimes a simpler approach is to perform the summary and then make the implici
|
|||
```{r}
|
||||
health |>
|
||||
group_by(smoker) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
mean_age = mean(age),
|
||||
min_age = min(age),
|
||||
|
|
28
numbers.qmd
|
@ -14,7 +14,7 @@ Now it's time to systematically survey what you can do with them in R, ensuring
|
|||
|
||||
We'll start by giving you a couple of tools to make numbers if you have strings, and then going into a little more detail of `count()`.
|
||||
Then we'll dive into various numeric transformations that pair well with `mutate()`, including more general transformations that can be applied to other types of vector, but are often used with numeric vectors.
|
||||
We'll finish off by covering the summary functions that pair well with `summarise()` and show you how they can also be used with `mutate()`.
|
||||
We'll finish off by covering the summary functions that pair well with `summarize()` and show you how they can also be used with `mutate()`.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -70,13 +70,13 @@ flights |> count(dest, sort = TRUE)
|
|||
|
||||
And remember that if you want to see all the values, you can use `|> View()` or `|> print(n = Inf)`.
|
||||
|
||||
You can perform the same computation "by hand" with `group_by()`, `summarise()` and `n()`.
|
||||
You can perform the same computation "by hand" with `group_by()`, `summarize()` and `n()`.
|
||||
This is useful because it allows you to compute other summaries at the same time:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
group_by(dest) |>
|
||||
summarise(
|
||||
summarize(
|
||||
n = n(),
|
||||
delay = mean(arr_delay, na.rm = TRUE)
|
||||
)
|
||||
|
@ -99,7 +99,7 @@ There are a couple of variants of `n()` that you might find useful:
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(dest) |>
|
||||
summarise(
|
||||
summarize(
|
||||
carriers = n_distinct(carrier)
|
||||
) |>
|
||||
arrange(desc(carriers))
|
||||
|
@ -111,7 +111,7 @@ There are a couple of variants of `n()` that you might find useful:
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(tailnum) |>
|
||||
summarise(miles = sum(distance))
|
||||
summarize(miles = sum(distance))
|
||||
```
|
||||
|
||||
Weighted counts are a common problem so `count()` has a `wt` argument that does the same thing:
|
||||
|
@ -126,13 +126,13 @@ There are a couple of variants of `n()` that you might find useful:
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(dest) |>
|
||||
summarise(n_cancelled = sum(is.na(dep_time)))
|
||||
summarize(n_cancelled = sum(is.na(dep_time)))
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. How can you use `count()` to count the number of rows with a missing value for a given variable?
|
||||
2. Expand the following calls to `count()` to instead use `group_by()`, `summarise()`, and `arrange()`:
|
||||
2. Expand the following calls to `count()` to instead use `group_by()`, `summarize()`, and `arrange()`:
|
||||
1. `flights |> count(dest, sort = TRUE)`
|
||||
|
||||
2. `flights |> count(tailnum, wt = distance)`
|
||||
|
@ -254,7 +254,7 @@ The results are shown in @fig-prop-cancelled.
|
|||
#| getting down to around 1% by midnight.
|
||||
flights |>
|
||||
group_by(hour = sched_dep_time %/% 100) |>
|
||||
summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
|
||||
summarize(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
|
||||
filter(hour > 1) |>
|
||||
ggplot(aes(hour, prop_cancelled)) +
|
||||
geom_line(color = "grey50") +
|
||||
|
@ -543,7 +543,7 @@ You can lead or lag by more than one position by using the second argument, `n`.
|
|||
flights |>
|
||||
mutate(hour = dep_time %/% 100) |>
|
||||
group_by(year, month, day, hour) |>
|
||||
summarise(
|
||||
summarize(
|
||||
dep_delay = mean(dep_delay, na.rm = TRUE),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
|
@ -589,7 +589,7 @@ The median delay is always smaller than the mean delay because because flights s
|
|||
#| outlying points with mean ~60, median ~50, and mean ~85, median ~55.
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
mean = mean(dep_delay, na.rm = TRUE),
|
||||
median = median(dep_delay, na.rm = TRUE),
|
||||
n = n(),
|
||||
|
@ -618,7 +618,7 @@ For the `flights` data, you might want to look at the 95% quantile of delays rat
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
max = max(dep_delay, na.rm = TRUE),
|
||||
q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
|
||||
.groups = "drop"
|
||||
|
@ -638,7 +638,7 @@ But the code below makes it looks like one airport, [EGE](https://en.wikipedia.o
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(origin, dest) |>
|
||||
summarise(
|
||||
summarize(
|
||||
distance_sd = IQR(distance),
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
|
@ -721,7 +721,7 @@ For example, we can find the first and last departure for each day:
|
|||
```{r}
|
||||
flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarise(
|
||||
summarize(
|
||||
first_dep = first(dep_time),
|
||||
fifth_dep = nth(dep_time, 5),
|
||||
last_dep = last(dep_time)
|
||||
|
@ -747,7 +747,7 @@ flights |>
|
|||
|
||||
### With `mutate()`
|
||||
|
||||
As the names suggest, the summary functions are typically paired with `summarise()`.
|
||||
As the names suggest, the summary functions are typically paired with `summarize()`.
|
||||
However, because of the recycling rules we discussed in @sec-recycling they can also be usefully paired with `mutate()`, particularly when you want to do some sort of group standardization.
|
||||
For example:
|
||||
|
||||
|
|
|
@ -28,7 +28,7 @@ tints <- seq(0.1, 1, length.out = 10)
|
|||
|
||||
oreilly |>
|
||||
group_by(name) |>
|
||||
summarise(
|
||||
summarize(
|
||||
tint = paste0("t", tints * 100),
|
||||
colour = tint(col, tints),
|
||||
.groups = "drop"
|
||||
|
|
|
@ -227,7 +227,7 @@ str_split(x, " ")[[1]]
|
|||
str_split(x, boundary("word"))[[1]]
|
||||
```
|
||||
|
||||
Show how `separate_rows()` is a special case of `str_split()` + `summarise()`.
|
||||
Show how `separate_rows()` is a special case of `str_split()` + `summarize()`.
|
||||
|
||||
## Replace with function
|
||||
|
||||
|
|
|
@ -165,7 +165,7 @@ It looks like they've radically increased in popularity lately!
|
|||
|
||||
babynames |>
|
||||
group_by(year) |>
|
||||
summarise(prop_x = mean(str_detect(name, "x"))) |>
|
||||
summarize(prop_x = mean(str_detect(name, "x"))) |>
|
||||
ggplot(aes(year, prop_x)) +
|
||||
geom_line()
|
||||
```
|
||||
|
@ -388,10 +388,10 @@ str_view(fruit, "^apple$")
|
|||
|
||||
You can also match the boundary between words (i.e. the start or end of a word) with `\b`.
|
||||
This can be particularly useful when using RStudio's find and replace tool.
|
||||
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarise`, `summary`, `rowsum` and so on:
|
||||
For example, to find all uses of `sum()`, you can search for `\bsum\b` to avoid matching `summarize`, `summary`, `rowsum` and so on:
|
||||
|
||||
```{r}
|
||||
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
|
||||
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
|
||||
str_view(x, "sum")
|
||||
str_view(x, "\\bsum\\b")
|
||||
```
|
||||
|
@ -569,7 +569,7 @@ str_match(x, "gr(?:e|a)y")
|
|||
g. Contain at least two vowel-consonant pairs in a row.
|
||||
h. Only consist of repeated vowel-consonant pairs.
|
||||
|
||||
4. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
|
||||
4. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut.
|
||||
Try and make the shortest possible regex!
|
||||
|
||||
5. Switch the first and last letters in `words`.
|
||||
|
|
|
@ -155,7 +155,7 @@ Now that you've learned the basics of creating a string or two by "hand", we'll
|
|||
This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame.
|
||||
For example, to create a greeting you might combine "Hello" with a `name` variable.
|
||||
We'll show you how to do this with `str_c()` and `str_glue()` and how you can use them with `mutate()`.
|
||||
That naturally raises the question of what string functions you might use with `summarise()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.
|
||||
That naturally raises the question of what string functions you might use with `summarize()`, so we'll finish this section with a discussion of `str_flatten()` which is a summary function for strings.
|
||||
|
||||
### `str_c()`
|
||||
|
||||
|
@ -212,7 +212,7 @@ df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
|
|||
### `str_flatten()`
|
||||
|
||||
`str_c()` and `glue()` work well with `mutate()` because their output is the same length as their inputs.
|
||||
What if you want a function that works well with `summarise()`, i.e. something that always returns a single string?
|
||||
What if you want a function that works well with `summarize()`, i.e. something that always returns a single string?
|
||||
That's the job of `str_flatten()`[^strings-5]: it takes a character vector and combines each element of the vector into a single string:
|
||||
|
||||
[^strings-5]: The base R equivalent is `paste()` used with the `collapse` argument.
|
||||
|
@ -223,7 +223,7 @@ str_flatten(c("x", "y", "z"), ", ")
|
|||
str_flatten(c("x", "y", "z"), ", ", last = ", and ")
|
||||
```
|
||||
|
||||
This makes it work well with `summarise()`:
|
||||
This makes it work well with `summarize()`:
|
||||
|
||||
```{r}
|
||||
df <- tribble(
|
||||
|
@ -237,7 +237,7 @@ df <- tribble(
|
|||
)
|
||||
df |>
|
||||
group_by(name) |>
|
||||
summarise(fruits = str_flatten(fruit, ", "))
|
||||
summarize(fruits = str_flatten(fruit, ", "))
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
|
|
@ -36,7 +36,7 @@ For example, the last chapter finished with a moderately complex pipe:
|
|||
flights |>
|
||||
filter(!is.na(arr_delay), !is.na(tailnum)) |>
|
||||
group_by(tailnum) |>
|
||||
summarise(
|
||||
summarize(
|
||||
delay = mean(arr_delay, na.rm = TRUE),
|
||||
n = n()
|
||||
)
|
||||
|
@ -50,7 +50,7 @@ We could nest each function call inside the previous call:
|
|||
```{r}
|
||||
#| eval: false
|
||||
|
||||
summarise(
|
||||
summarize(
|
||||
group_by(
|
||||
filter(
|
||||
flights,
|
||||
|
@ -71,7 +71,7 @@ Or we could use a bunch of intermediate variables:
|
|||
|
||||
flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
|
||||
flights2 <- group_by(flights1, tailnum)
|
||||
flights3 <- summarise(flights2,
|
||||
flights3 <- summarize(flights2,
|
||||
delay = mean(arr_delay, na.rm = TRUE),
|
||||
n = n()
|
||||
)
|
||||
|
@ -91,7 +91,7 @@ library(tidyverse)
|
|||
|
||||
mtcars %>%
|
||||
group_by(cyl) %>%
|
||||
summarise(n = n())
|
||||
summarize(n = n())
|
||||
```
|
||||
|
||||
For simple cases `|>` and `%>%` behave identically.
|
||||
|
|