Minor polishing to get back into the swing of things
parent
e80ed2d577
commit
821b51d536
|
@ -62,16 +62,6 @@ There are three other common types that aren't used here but you'll encounter la
|
|||
### dplyr basics
|
||||
|
||||
In this chapter you are going to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges.
|
||||
They are organised into four camps:
|
||||
|
||||
- Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns, `slice()` and friends subset rows based on their position, and `arrange()` changes the order of the rows.
|
||||
|
||||
- Functions that operate on **columns**: `mutate()` creates new columns, `select()` selects columns, `rename()` changes their names, and `relocate()` changes their positions.
|
||||
|
||||
- Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
|
||||
|
||||
Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
|
||||
|
||||
All dplyr verbs work the same way:
|
||||
|
||||
1. The first argument is a data frame.
|
||||
|
@ -81,11 +71,22 @@ All dplyr verbs work the same way:
|
|||
3. The result is a new data frame.
|
||||
|
||||
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
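As a minimal sketch of that chaining, assuming a small made-up tibble called `delays` (a hypothetical stand-in, not one of the chapter's datasets):

```r
library(dplyr)

# Hypothetical toy data standing in for a larger table such as flights
delays <- tibble(
  carrier = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(5, 30, -2, 12, 45)
)

# Each verb takes a data frame as its first argument and returns a new
# data frame, so the result of one step can be piped into the next
late <- delays %>%
  filter(dep_delay > 0) %>%
  arrange(desc(dep_delay))

late
```

The original `delays` is left untouched; `late` is a new data frame holding only the delayed rows, sorted from the longest delay to the shortest.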
|
||||
The verbs are organised into four groups:
|
||||
|
||||
- Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns and `arrange()` changes the order of the rows.
|
||||
|
||||
- Functions that operate on **columns**: `mutate()` creates new columns, `select()` selects columns, `rename()` changes their names, and `relocate()` changes their positions.
|
||||
|
||||
- Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
|
||||
|
||||
- Functions that operate on **tables**, like the join functions and the set operations.
|
||||
We'll come back to these in Chapter \@ref(relational-data).
|
||||
|
||||
Let's dive in and see how these verbs work.
|
||||
|
||||
## Rows
|
||||
|
||||
These functions affect the rows (the observations), leaving the columns (the variables) unchanged.
|
||||
`filter()` and `arrange()` affect the rows (the observations), leaving the columns (the variables) unchanged.
|
||||
`filter()` changes which rows are included without changing their order, while `arrange()` changes the order without changing which rows are included.
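One way to see the difference is with a tiny example. The tibble `df` and its `id` and `x` columns below are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data; id records the original row position
df <- tibble(id = 1:5, x = c(3, 1, 4, 1, 5))

# filter() drops rows but keeps the survivors in their original order
kept <- filter(df, x > 1)
kept$id

# arrange() keeps every row but changes the order
sorted <- arrange(df, x)
nrow(sorted)
```

Here `kept` contains the rows with `id` 1, 3, and 5, in that order, while `sorted` still has all five rows, now ordered by `x`.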
|
||||
|
||||
### `filter()`
|
||||
|
@ -111,6 +112,7 @@ jan1 <- filter(flights, month == 1, day == 1)
|
|||
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
|
||||
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
|
||||
It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
|
||||
We'll come back to these operations again in Chapter \@ref(logicals-numbers).
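A short sketch of `%in%` next to the equivalent comparison written out by hand, using a made-up tibble `df` with `month` and `day` columns rather than the real flights data:

```r
library(dplyr)

# Hypothetical toy data with the same column names as flights
df <- tibble(
  month = c(1, 2, 11, 12, 12),
  day = c(1, 15, 3, 25, 31)
)

# Rows where month is 11 or 12
nov_dec <- filter(df, month %in% c(11, 12))

# The same condition spelled out with == and |
nov_dec_long <- filter(df, month == 11 | month == 12)

nov_dec
```

Both calls should return the same three rows; `%in%` simply becomes more convenient as the list of values grows.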
|
||||
|
||||
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
|
||||
`filter()` will let you know when this happens:
|
||||
|
@ -158,7 +160,7 @@ arrange(flights, desc(dep_delay))
|
|||
|
||||
## Columns
|
||||
|
||||
These functions affect the columns (the variables) without changing the rows (the observations).
|
||||
`mutate()`, `select()`, `rename()`, and `relocate()` affect the columns (the variables) without changing the rows (the observations).
|
||||
`mutate()` creates new variables that are functions of the existing variables; `select()`, `rename()`, and `relocate()` change which variables are present, their names, and their positions.
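As a quick preview, here is a sketch of all four verbs applied to a made-up two-row tibble (`df` and the `speed` and `airline` names are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical toy data with a few flights-like columns
df <- tibble(
  distance = c(100, 250),
  air_time = c(20, 50),
  carrier = c("AA", "UA")
)

# mutate() adds a column computed from existing ones (added at the end by default)
with_speed <- mutate(df, speed = distance / air_time * 60)

# select() keeps only the named columns
picked <- select(df, carrier, distance)

# rename() changes a name: new name on the left, old name on the right
renamed <- rename(df, airline = carrier)

# relocate() moves columns, to the front by default
moved <- relocate(df, carrier)

names(moved)
```

In every case the number of rows is unchanged; only the set, names, or order of the columns differs.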
|
||||
|
||||
### `mutate()`
|
||||
|
@ -187,8 +189,8 @@ mutate(flights,
|
|||
)
|
||||
```
|
||||
|
||||
The leading `.` is a sign that `.before` is an argument to the function, not a new variable being created.
|
||||
You can also use `.after` to add after a variable, and use a variable name instead of a position:
|
||||
The leading `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
|
||||
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use a variable name instead of a position:
|
||||
|
||||
```{r}
|
||||
mutate(flights,
|
||||
|
@ -212,7 +214,7 @@ mutate(flights,
|
|||
### `select()` {#select}
|
||||
|
||||
It's not uncommon to get datasets with hundreds or even thousands of variables.
|
||||
In this case, the first challenge is often narrowing in on the variables you're actually interested in.
|
||||
In this case, the first challenge is often focussing on just the variables you're interested in.
|
||||
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
|
||||
|
||||
`select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
|
||||
|
@ -239,7 +241,7 @@ There are a number of helper functions you can use within `select()`:
|
|||
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
|
||||
|
||||
See `?select` for more details.
|
||||
Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be able to use `matches()` to select variables that match a regexp.
|
||||
Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be able to use `matches()` to select variables that match a pattern.
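To give a flavour of `matches()`, here is a small sketch on a made-up one-row tibble that borrows a few column names from flights. The pattern itself is just an illustration:

```r
library(dplyr)

# Hypothetical toy data reusing some flights column names
df <- tibble(
  dep_time = 517,
  sched_dep_time = 515,
  arr_time = 830,
  carrier = "UA"
)

# Keep columns whose names start with "dep_" or "arr_"
times <- select(df, matches("^(dep|arr)_"))

names(times)
```

Because the pattern is anchored with `^`, `sched_dep_time` is not selected even though it contains `dep_`.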
|
||||
|
||||
You can rename variables as you `select()` them by using `=`.
|
||||
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
|
||||
|
@ -267,7 +269,7 @@ By default it moves variables to the front:
|
|||
relocate(flights, time_hour, air_time)
|
||||
```
|
||||
|
||||
But you can use the `.before` and `.after` arguments to choose where to place them:
|
||||
But, as with `mutate()`, you can use the `.before` and `.after` arguments to choose where to place them:
|
||||
|
||||
```{r}
|
||||
relocate(flights, year:dep_time, .after = time_hour)
|
||||
|
@ -406,13 +408,13 @@ daily %>% summarise(n = n())
|
|||
|
||||
If you're happy with this behaviour, you can explicitly define it in order to suppress the message:
|
||||
|
||||
```{r results = FALSE}
|
||||
```{r, results = FALSE}
|
||||
daily %>% summarise(n = n(), .groups = "drop_last")
|
||||
```
|
||||
|
||||
Alternatively, you can change the default behaviour by setting a different value, e.g. `"drop"` to drop all levels of grouping or `"keep"` to keep the same grouping structure as `daily`:
|
||||
|
||||
```{r results = FALSE}
|
||||
```{r, results = FALSE}
|
||||
daily %>% summarise(n = n(), .groups = "drop")
|
||||
daily %>% summarise(n = n(), .groups = "keep")
|
||||
```
|
||||
|
@ -433,26 +435,17 @@ daily %>%
|
|||
|
||||
For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.
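A small sketch makes the contrast concrete. The tibble `df` and its columns `g` and `x` are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data with one grouping column
df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

# Ungrouped: the whole data frame is treated as one group, so one row comes back
overall <- summarise(df, n = n(), mean_x = mean(x))

# Grouped: one row per group
per_group <- df %>%
  group_by(g) %>%
  summarise(n = n(), mean_x = mean(x))

overall
per_group
```

`overall` has a single row covering all three observations, while `per_group` has one row for each value of `g`.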
|
||||
|
||||
### Selecting rows
|
||||
|
||||
`arrange()` and `filter()` are mostly unaffected by grouping.
|
||||
But the slice functions are super useful:
|
||||
|
||||
- `slice_head()` and `slice_tail()` select the first or last rows in each group.
|
||||
|
||||
- `slice_max()` and `slice_min()` select the rows in each group with highest or lowest values.
|
||||
|
||||
- `slice_sample()` randomly selects rows from each group.
|
||||
|
||||
Each of these verbs takes either an `n` or a `prop` argument, depending on whether you want to select a fixed number of rows or a number of rows proportional to the group size.
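Here is a minimal sketch of the slice functions on grouped data, assuming a made-up tibble `delays` (hypothetical, not the flights data) and using only the `n` argument:

```r
library(dplyr)

# Hypothetical toy data: three AA flights and two UA flights
delays <- tibble(
  carrier = c("AA", "AA", "AA", "UA", "UA"),
  dep_delay = c(5, 30, 12, 45, -2)
)

by_carrier <- group_by(delays, carrier)

# The row with the largest dep_delay in each group
worst <- slice_max(by_carrier, dep_delay, n = 1)

# The first row of each group, in the order the data arrived
first <- slice_head(by_carrier, n = 1)

# One randomly chosen row per group
sampled <- slice_sample(by_carrier, n = 1)

worst
```

Each result has one row per carrier. Swapping `n = 1` for something like `prop = 0.5` would instead select a share of each group's rows.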
|
||||
|
||||
### Other verbs
|
||||
|
||||
`group_by()` is usually paired with `summarise()`, but it's good to know how it affects other verbs:
|
||||
|
||||
- `select()`, `rename()`, `relocate()`: grouping has no effect.
|
||||
|
||||
- `filter()`, `mutate()`: computation happens per group.
|
||||
- `mutate()`: computation happens per group.
|
||||
This doesn't affect the functions you currently know, but it is very useful once you learn about window functions (Section \@ref(window-functions)).
|
||||
|
||||
- `arrange()` and `filter()` are mostly unaffected by grouping, unless you are doing computation (e.g. `filter(flights, dep_delay == min(dep_delay))`), in which case the `mutate()` caveat applies.
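To illustrate per-group computation, here is a sketch using a made-up tibble `delays` (hypothetical data; `mean_delay` is a name chosen only for this example):

```r
library(dplyr)

# Hypothetical toy data: two flights for each of two carriers
delays <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  dep_delay = c(5, -3, 10, 0)
)

# Ungrouped: min() is computed over the whole column, so one row matches
overall_min <- filter(delays, dep_delay == min(dep_delay))

# Grouped: min() is computed within each carrier, so one row per carrier matches
group_min <- delays %>%
  group_by(carrier) %>%
  filter(dep_delay == min(dep_delay))

# Grouped mutate(): the mean is computed within each carrier
with_mean <- delays %>%
  group_by(carrier) %>%
  mutate(mean_delay = mean(dep_delay))

group_min
```

The same `filter()` expression returns one row without grouping and two rows with it, because the summary inside the expression is evaluated separately for each group.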
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Which carrier has the worst delays?
|
||||
|