Diving back into transformation chapter
parent 2efcd7e4fe
commit 6825c577d9
|
@ -8,7 +8,10 @@ status("restructuring")
|
|||
|
||||
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need.
|
||||
Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.
|
||||
You'll learn how to do all that (and more!) in this chapter, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013.
|
||||
You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights departing New York City in 2013.
|
||||
|
||||
The goal of this chapter is to give you an overview of all the key tools for transforming a data frame.
|
||||
We'll come back to these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).
|
||||
|
||||
### Prerequisites
|
||||
|
||||
|
@ -34,30 +37,15 @@ The data comes from the US [Bureau of Transportation Statistics](http://www.tran
|
|||
flights
|
||||
```
|
||||
|
||||
If you've used R before, you might notice that this data frame prints a little differently to data frames you might've worked with in the past.
|
||||
If you've used R before, you might notice that this data frame prints a little differently to data frames that you might've worked with in the past.
|
||||
That's because it's a **tibble**, a special type of data frame designed by the tidyverse team to avoid some common data.frame gotchas.
|
||||
The most important difference is the way it prints: tibbles are designed for large datasets, so only show the first few rows and only the columns that fit on one screen.
|
||||
The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
|
||||
If you want to see everything you can use `View(flights)` to open the dataset in the RStudio viewer.
|
||||
We'll come back to other important differences in Chapter \@ref(tibbles).
|
||||
|
||||
You might also have noticed the row of short abbreviations under the column names.
|
||||
These describe the type of each variable:
|
||||
|
||||
- `int` stands for integer.
|
||||
|
||||
- `dbl` stands for double, a vector of real numbers.
|
||||
|
||||
- `chr` stands for character, a vector of strings.
|
||||
|
||||
- `dttm` stands for date-time (a date + a time).
|
||||
|
||||
There are three other common types that aren't used here but you'll encounter later in the book:
|
||||
|
||||
- `lgl` stands for logical, a vector that contains only `TRUE` or `FALSE`.
|
||||
|
||||
- `fctr` stands for factor, which R uses to represent categorical variables with fixed possible values.
|
||||
|
||||
- `date` stands for date.
|
||||
You might also have noticed the row of short abbreviations following each column name.
|
||||
These describe the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
|
||||
These types are important because the operations you can perform on a column depend heavily on its type; they are also used to organize the chapters in the Transform section of this book.
|
||||
|
||||
### dplyr basics
|
||||
|
||||
|
@ -70,72 +58,134 @@ All dplyr verbs work the same way:
|
|||
|
||||
3. The result is a new data frame.
|
||||
|
||||
This means that dplyr code typically looks something like this:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
data |>
|
||||
filter(x == 1) |>
|
||||
mutate(
|
||||
y = x + 1
|
||||
)
|
||||
```
|
||||
|
||||
`|>` is a special operator called a pipe.
|
||||
It takes the thing on its left and passes it along to the function on its right.
|
||||
The easiest way to pronounce the pipe is "then".
|
||||
So you can read the above as take data, then filter it, then mutate it.
|
||||
We'll come back to the pipe and its alternatives in Chapter \@ref(pipes).
|
||||
In RStudio, you can make the pipe by pressing Ctrl/Cmd + Shift + M.
|
||||
|
||||
Behind the scenes, `x |> f(y)` turns into `f(x, y)`, and `x |> f(y) |> g(z)` turns into `g(f(x, y), z)` and so on.
|
||||
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
|
||||
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(workflow-pipes).
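
To make the rewriting concrete, here is a minimal sketch in base R, assuming R 4.1 or later (where `|>` is available). The vector `x` is made up for illustration; it is not part of the flights data.

```r
# A small made-up vector; nothing here comes from the flights data.
x <- c(1, 4, 9)

# Piped form: take x, then take square roots, then sum them.
piped <- x |> sqrt() |> sum()

# Nested form that the piped code is equivalent to.
nested <- sum(sqrt(x))

# Both forms give the same result.
identical(piped, nested)
```

The piped version reads in the order the steps happen, while the nested version has to be read from the inside out.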
|
||||
|
||||
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
|
||||
The verbs are organised into four groups:
|
||||
|
||||
- Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns and `arrange()` changes the order of the rows.
|
||||
|
||||
- Functions that operate on **columns**: `mutate()` creates new columns, `select()` chooses which columns are present, `rename()` changes their names, and `relocate()` changes their positions.
|
||||
|
||||
- Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
|
||||
|
||||
- Functions that operate on **tables**, like the join functions and the set operations.
|
||||
We'll come back to these in Chapter \@ref(relational-data).
|
||||
|
||||
Let's dive in and see how these verbs work.
|
||||
The verbs are organised into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
|
||||
In the following sections you'll learn the most important verbs for rows, columns, and groups.
|
||||
We'll come back to operations that work on multiple tables in Chapter \@ref(relational-data).
|
||||
Let's dive in!
|
||||
|
||||
## Rows
|
||||
|
||||
`filter()` and `arrange()` affect the rows (the observations), leaving the columns (the variables) unchanged.
|
||||
`filter()` changes which rows are included without changing the order, `arrange()` changes the order without changing the membership.
|
||||
The most important verbs that affect the rows are `filter()`, which changes membership without changing the order, and `arrange()`, which changes the order without changing the membership.
|
||||
Both functions only affect the rows, so the columns are left unchanged.
|
||||
|
||||
### `filter()`
|
||||
|
||||
`filter()` allows you to choose rows based on their values[^data-transform-1].
|
||||
The first argument is the name of the data frame.
|
||||
The second and subsequent arguments are the expressions that filter the data frame.
|
||||
For example, we can select all flights on January 1st with:
|
||||
`filter()` allows you to pick rows based on the values of the columns[^data-transform-1].
|
||||
The first argument is the data frame.
|
||||
The second and subsequent arguments are the conditions that must be true to keep the row.
|
||||
For example, we could find all flights that arrived more than 120 minutes (two hours) late:
|
||||
|
||||
[^data-transform-1]: Later, you'll learn about the `slice_*()` family which allows you to choose rows based on their positions.
|
||||
|
||||
```{r}
|
||||
filter(flights, month == 1, day == 1)
|
||||
flights |>
|
||||
filter(arr_delay > 120)
|
||||
```
|
||||
|
||||
When you run that line of code, dplyr executes the filtering operation and returns a new data frame.
|
||||
dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`:
|
||||
As well as `>` (greater than), R provides `>=` (greater than or equal to), `<` (less than), `<=` (less than or equal to), `==` (equal to), and `!=` (not equal to).
|
||||
|
||||
You can use `&` (and) or `|` (or) to combine multiple conditions:
|
||||
|
||||
```{r}
|
||||
jan1 <- filter(flights, month == 1, day == 1)
|
||||
# Flights that departed on January 1
|
||||
flights |>
|
||||
filter(month == 1 & day == 1)
|
||||
|
||||
# Flights that departed in January or February
|
||||
flights |>
|
||||
filter(month == 1 | month == 2)
|
||||
```
|
||||
|
||||
To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
|
||||
R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
|
||||
It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
|
||||
We'll come back to these operations again in Chapter \@ref(logicals-numbers).
|
||||
There's a useful shortcut when you're combining `|` and `==`: `%in%`.
|
||||
It returns `TRUE` if the value on the left hand side is any of the values on the right hand side:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
filter(month %in% c(1, 2))
|
||||
```
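
The equivalence between `%in%` and a chain of `==` joined by `|` can be checked on a plain vector. This is a small sketch using a made-up `month` vector rather than the flights data:

```r
# Made-up month values for illustration only.
month <- c(1, 2, 3, 12)

# Long form: one equality test per value, joined with |.
long_form <- month == 1 | month == 2

# Short form: a single %in% test against all the values.
short_form <- month %in% c(1, 2)

# Both produce the same logical vector.
identical(long_form, short_form)
```

The short form stays the same length however many values you need to match, which is where it pays off.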
|
||||
|
||||
We'll come back to these comparisons and logical operators in more detail in Chapter \@ref(logicals-numbers).
|
||||
|
||||
When you run `filter()`, dplyr executes the filtering operation, creating a new data frame, and then prints it.
|
||||
It doesn't modify the existing `flights` dataset because dplyr functions never modify their inputs.
|
||||
To save the result, you need to use the assignment operator, `<-`:
|
||||
|
||||
```{r}
|
||||
jan1 <- flights |>
|
||||
filter(month == 1 & day == 1)
|
||||
```
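
The point that dplyr never modifies its input can be seen on a tiny example. This is a sketch that assumes dplyr is installed and R is 4.1 or later; `toy` and `jan1_toy` are hypothetical names, not objects from the chapter.

```r
library(dplyr)

# Hypothetical toy data, not taken from the flights dataset.
toy <- tibble(
  month = c(1, 1, 2),
  day   = c(1, 2, 1)
)

# filter() returns a new data frame; assign it to keep the result.
jan1_toy <- toy |>
  filter(month == 1 & day == 1)

# The original still has all of its rows; the result has only the match.
nrow(toy)
nrow(jan1_toy)
```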
|
||||
|
||||
### `arrange()`
|
||||
|
||||
`arrange()` changes the order of the rows based on the value of the columns.
|
||||
Again, it takes a data frame and a set of column names (or more complicated expressions) to order by.
|
||||
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
|
||||
For example, the following code sorts by the departure time, which is spread over four columns.
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
arrange(year, month, day, dep_time)
|
||||
```
|
||||
|
||||
You can use `desc()` to re-order by a column in descending order.
|
||||
For example, this is useful if you want to see the most delayed flights:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
arrange(desc(dep_delay))
|
||||
```
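
Tie-breaking and `desc()` are easier to see on a handful of rows. The following is a sketch on made-up data, assuming dplyr is installed and R is 4.1 or later; `toy` is a hypothetical name.

```r
library(dplyr)

# Hypothetical toy data, not taken from the flights dataset.
toy <- tibble(
  month = c(2, 1, 1, 2),
  day   = c(1, 3, 1, 2)
)

# Sort by month first; day only breaks ties within the same month.
sorted <- toy |>
  arrange(month, day)

# desc() reverses the order of that one column only; day is still ascending.
sorted_desc <- toy |>
  arrange(desc(month), day)

sorted
sorted_desc
```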
|
||||
|
||||
You can of course combine `arrange()` and `filter()` to solve more complex problems.
|
||||
For example, we could look for the flights that were most delayed on arrival that left roughly on time:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
filter(dep_delay <= 10 & dep_delay >= -10) |>
|
||||
arrange(desc(arr_delay))
|
||||
```
|
||||
|
||||
### Common mistakes
|
||||
|
||||
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
|
||||
`filter()` will let you know when this happens:
|
||||
|
||||
```{r, error = TRUE}
|
||||
filter(flights, month = 1)
|
||||
flights |>
|
||||
filter(month = 1)
|
||||
```
|
||||
|
||||
### `arrange()`
|
||||
Another mistake is to write "or" statements like you would in English:
|
||||
|
||||
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
|
||||
It takes a data frame and a set of column names (or more complicated expressions) to order by.
|
||||
If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
|
||||
|
||||
```{r}
|
||||
arrange(flights, year, month, day)
|
||||
```{r, eval = FALSE}
|
||||
flights |>
|
||||
filter(month == 1 | 2)
|
||||
```
|
||||
|
||||
You can use `desc()` to re-order by a column in descending order:
|
||||
|
||||
```{r}
|
||||
arrange(flights, desc(dep_delay))
|
||||
```
|
||||
This works, in the sense that it doesn't throw an error, but it doesn't do what you want.
|
||||
We'll come back to what it does and why in Section \@ref(boolean-operations).
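
As a brief preview, the behaviour can be reproduced on a plain vector. In this sketch (base R only, with a made-up `month` vector), `==` is evaluated before `|`, so `month == 1 | 2` is read as `(month == 1) | 2`, and the number `2` is treated as `TRUE`:

```r
# Made-up month values for illustration only.
month <- c(1, 2, 3)

# Parsed as (month == 1) | 2; the 2 is coerced to TRUE,
# so every element comes out TRUE and no row would be dropped.
wrong <- month == 1 | 2

# Spelling out both comparisons gives the intended result.
right <- month == 1 | month == 2

wrong
right
```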
|
||||
|
||||
### Exercises
|
||||
|
||||
|
@ -147,97 +197,109 @@ arrange(flights, desc(dep_delay))
|
|||
d. Departed in summer (July, August, and September)
|
||||
e. Arrived more than two hours late, but didn't leave late
|
||||
f. Were delayed by at least an hour, but made up over 30 minutes in flight
|
||||
g. Departed between midnight and 6am (inclusive)
|
||||
|
||||
2. Sort `flights` to find the flights with longest departure delays.
|
||||
Find the flights that left earliest.
|
||||
Find the flights that left earliest in the morning.
|
||||
|
||||
3. Sort `flights` to find the fastest (highest speed) flights.
|
||||
(Hint: try sorting by a calculation).
|
||||
3. Sort `flights` to find the fastest flights (Hint: try sorting by a calculation).
|
||||
|
||||
4. Which flights travelled the farthest?
|
||||
Which travelled the shortest?
|
||||
4. Which flights traveled the farthest?
|
||||
Which traveled the shortest?
|
||||
|
||||
5. Does it matter what order you used `filter()` and `arrange()` in if you're using both?
|
||||
Why/why not?
|
||||
Think about the results and how much work the functions would have to do.
|
||||
|
||||
## Columns
|
||||
|
||||
`mutate()`, `select()`, `rename()`, and `relocate()` affect the columns (the variables) without changing the rows (the observations).
|
||||
`mutate()` creates new variables that are functions of the existing variables; `select()`, `rename()`, and `relocate()` changes which variables are present, their names, and their positions.
|
||||
There are four important verbs that affect the columns without changing the rows: `mutate()`, `select()`, `rename()`, and `relocate()`.
|
||||
`mutate()` creates new columns that are functions of the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, and their positions.
|
||||
|
||||
### `mutate()`
|
||||
|
||||
The job of `mutate()` is to add new columns that are functions of existing columns.
|
||||
In the later chapters, you'll learn the full set of functions that you can use to manipulate different types of variables.
|
||||
For now, we'll stick with basic mathematical operators, which allows us to compute the `gain`, how much time a delayed flight made up in the air, and the `speed` in miles per hour:
|
||||
The job of `mutate()` is to add new columns that are calculated from the existing columns.
|
||||
In the transform chapters, you'll learn a large set of functions that you can use to manipulate different types of variables.
|
||||
For now, we'll stick with basic algebra, which allows us to compute the `gain`, how much time a delayed flight made up in the air, and the `speed` in miles per hour:
|
||||
|
||||
```{r}
|
||||
mutate(flights,
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60
|
||||
)
|
||||
flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60
|
||||
)
|
||||
```
|
||||
|
||||
By default, `mutate()` adds new columns at the right hand side of your dataset, which makes it hard to see what's happening here.
|
||||
By default, `mutate()` adds new columns on the right hand side of your dataset, which makes it hard to see what's happening here.
|
||||
We can use the `.before` argument to instead add the variables to the left hand side[^data-transform-2]:
|
||||
|
||||
[^data-transform-2]: Remember that when you're in RStudio, the easiest way to see all the columns is `View()`.
|
||||
|
||||
```{r}
|
||||
mutate(flights,
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
.before = 1
|
||||
)
|
||||
flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
.before = 1
|
||||
)
|
||||
```
|
||||
|
||||
The leading `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
|
||||
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use a variable name instead of a position:
|
||||
The `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
|
||||
You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use a variable name instead of a position.
|
||||
For example, we could add the new variables after `day`:
|
||||
|
||||
```{r}
|
||||
mutate(flights,
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
.after = day
|
||||
)
|
||||
flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
speed = distance / air_time * 60,
|
||||
.after = day
|
||||
)
|
||||
```
|
||||
|
||||
Alternatively, you can control which variables are kept with the `.keep` argument:
|
||||
Alternatively, you can control which variables are kept with the `.keep` argument.
|
||||
A particularly useful value is `"used"`, which allows you to see the inputs and outputs from your calculations:
|
||||
|
||||
```{r}
|
||||
mutate(flights,
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
gain_per_hour = gain / hours,
|
||||
.keep = "none"
|
||||
)
|
||||
flights |>
|
||||
mutate(
|
||||
gain = dep_delay - arr_delay,
|
||||
hours = air_time / 60,
|
||||
gain_per_hour = gain / hours,
|
||||
.keep = "used"
|
||||
)
|
||||
```
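
The effect of `.keep = "used"` is easiest to check on a data frame with an extra column that the calculation never touches. This is a sketch on made-up data, assuming dplyr is installed and R is 4.1 or later; `toy` and its `carrier` column are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; carrier is deliberately not used below.
toy <- tibble(
  carrier   = c("AA", "UA"),
  dep_delay = c(10, -5),
  arr_delay = c(4, 20),
  air_time  = c(120, 60)
)

out <- toy |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )

# carrier is dropped: only the inputs and the new columns remain.
names(out)
```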
|
||||
|
||||
### `select()` {#select}
|
||||
|
||||
It's not uncommon to get datasets with hundreds or even thousands of variables.
|
||||
In this case, the first challenge is often focussing on just the variables you're interested in.
|
||||
In this case, the first challenge is often focusing on just the variables you're interested in.
|
||||
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
|
||||
|
||||
`select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
|
||||
|
||||
```{r}
|
||||
# Select columns by name
|
||||
select(flights, year, month, day)
|
||||
flights |>
|
||||
select(year, month, day)
|
||||
|
||||
# Select all columns between year and day (inclusive)
|
||||
select(flights, year:day)
|
||||
flights |>
|
||||
select(year:day)
|
||||
|
||||
# Select all columns except those from year to day (inclusive)
|
||||
select(flights, -(year:day))
|
||||
flights |>
|
||||
select(-(year:day))
|
||||
|
||||
# Select all columns that are characters
|
||||
flights |>
|
||||
select(where(is.character))
|
||||
```
|
||||
|
||||
There are a number of helper functions you can use within `select()`:
|
||||
|
||||
- `starts_with("abc")`: matches names that begin with "abc".
|
||||
|
||||
- `ends_with("xyz")`: matches names that end with "xyz".
|
||||
|
||||
- `contains("ijk")`: matches names that contain "ijk".
|
||||
|
||||
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
|
||||
|
||||
See `?select` for more details.
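
The helpers can be tried out on a small data frame whose column names are chosen to exercise each one. This is a sketch on made-up data, assuming dplyr is installed and R is 4.1 or later; `toy` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical one-row data frame; only the column names matter here.
toy <- tibble(
  x1 = 1, x2 = 2, x3 = 3,
  dep_time = 4, arr_time = 5, dep_delay = 6
)

# Names that begin with "dep".
dep_cols <- toy |> select(starts_with("dep")) |> names()

# Names that end with "time".
time_cols <- toy |> select(ends_with("time")) |> names()

# Names that contain an underscore anywhere.
underscore_cols <- toy |> select(contains("_")) |> names()

# x1 and x2, built from a prefix and a numeric range.
x_cols <- toy |> select(num_range("x", 1:2)) |> names()

dep_cols
time_cols
underscore_cols
x_cols
```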
|
||||
|
@ -247,7 +309,7 @@ You can rename variables as you `select()` them by using `=`.
|
|||
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
|
||||
|
||||
```{r}
|
||||
select(flights, tail_num = tailnum)
|
||||
flights |> select(tail_num = tailnum)
|
||||
```
|
||||
|
||||
### `rename()`
|
||||
|
@ -255,29 +317,31 @@ select(flights, tail_num = tailnum)
|
|||
If you just want to keep all the existing variables and just want to rename a few, you can use `rename()` instead of `select()`:
|
||||
|
||||
```{r}
|
||||
rename(flights, tail_num = tailnum)
|
||||
flights |>
|
||||
rename(tail_num = tailnum)
|
||||
```
|
||||
|
||||
It works exactly the same way as `select()`, but keeps all the variables that aren't explicitly selected.
|
||||
|
||||
### `relocate()`
|
||||
|
||||
You can move variables around with `relocate`.
|
||||
You can move variables around with `relocate()`.
|
||||
By default it moves variables to the front:
|
||||
|
||||
```{r}
|
||||
relocate(flights, time_hour, air_time)
|
||||
flights |>
|
||||
relocate(time_hour, air_time)
|
||||
```
|
||||
|
||||
But like with `mutate()`, you can use the `.before` and `.after` arguments to choose where to place them:
|
||||
But you can use the same `.before` and `.after` arguments as `mutate()` to choose where to put them:
|
||||
|
||||
```{r}
|
||||
relocate(flights, year:dep_time, .after = time_hour)
|
||||
relocate(flights, starts_with("arr"), .before = dep_time)
|
||||
flights |>
|
||||
relocate(year:dep_time, .after = time_hour)
|
||||
flights |>
|
||||
relocate(starts_with("arr"), .before = dep_time)
|
||||
```
|
||||
|
||||
These work the same way as the `.before` and `.after` arguments to `mutate()` --- they can be a numeric position, the name of a variable, or any of the other functions that you can use with `select()`.
|
||||
|
||||
### Exercises
|
||||
|
||||
```{r, eval = FALSE, echo = FALSE}
|
||||
|
@ -334,11 +398,11 @@ The two key functions are `group_by()` and `summarise()`, but as you'll learn `g
|
|||
Use `group_by()` to divide your dataset into groups meaningful for your analysis:
|
||||
|
||||
```{r}
|
||||
by_month <- group_by(flights, month)
|
||||
by_month
|
||||
flights |>
|
||||
group_by(month)
|
||||
```
|
||||
|
||||
`group_by()` doesn't change the data but, if you look closely, you'll notice that it's now "grouped by" month.
|
||||
`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that it's now "grouped by" month.
|
||||
The reason to group your data is because it changes the operation of subsequent verbs.
|
||||
|
||||
### `summarise()`
|
||||
|
@ -350,73 +414,83 @@ Here we compute the average departure delay by month:
|
|||
[^data-transform-3]: This is a slight simplification; later on you'll learn how to use `summarise()` to produce multiple summary rows for each group.
|
||||
|
||||
```{r}
|
||||
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
|
||||
flights |>
|
||||
group_by(month) |>
|
||||
summarise(
|
||||
delay = mean(dep_delay)
|
||||
)
|
||||
```
|
||||
|
||||
Uh-oh!
|
||||
Something has gone wrong and all of our results are `NA`, R's symbol for a missing value.
|
||||
We'll come back to discuss missing values in Chapter \@ref(missing-values), but for now we'll remove them by using `na.rm = TRUE`:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
group_by(month) |>
|
||||
summarise(
|
||||
delay = mean(dep_delay, na.rm = TRUE)
|
||||
)
|
||||
```
|
||||
|
||||
You can create any number of summaries in a single call to `summarise()`.
|
||||
You'll learn various useful summaries in the upcoming chapters on individual data types, but one very useful summary is `n()`, which returns the number of rows in each group:
|
||||
You'll learn various useful summaries in the upcoming chapters, but one very useful summary is `n()`, which returns the number of rows in each group:
|
||||
|
||||
```{r}
|
||||
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE), n = n())
|
||||
flights |>
|
||||
group_by(month) |>
|
||||
summarise(
|
||||
delay = mean(dep_delay, na.rm = TRUE),
|
||||
n = n()
|
||||
)
|
||||
```
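
The interaction between missing values, `na.rm = TRUE`, and `n()` can be reproduced with four rows. This is a sketch on made-up data, assuming dplyr is installed and R is 4.1 or later; `toy` and `res` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: one missing delay in month 1.
toy <- tibble(
  month     = c(1, 1, 2, 2),
  dep_delay = c(10, NA, 5, 25)
)

res <- toy |>
  group_by(month) |>
  summarise(
    delay_with_na = mean(dep_delay),               # NA for month 1
    delay         = mean(dep_delay, na.rm = TRUE), # missing value dropped
    n             = n()                            # rows per group, NA included
  )

res
```

Note that `n()` still counts the row with the missing delay, so the count and the number of values behind the mean can differ.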
|
||||
|
||||
(In fact, `count()`, which we've used a bunch in previous chapters, is just shorthand for `group_by()` + `summarise(n = n())`.)
|
||||
|
||||
We'll come back to discuss missing values in Chapter \@ref(missing-values).
|
||||
For now, know you can drop them in summary functions by using `na.rm = TRUE` or remove them with a filter by using `!is.na()`:
|
||||
|
||||
```{r}
|
||||
not_cancelled <- filter(flights, !is.na(dep_delay))
|
||||
by_month <- group_by(not_cancelled, month)
|
||||
summarise(by_month, delay = mean(dep_delay))
|
||||
```
|
||||
|
||||
### Combining multiple operations
|
||||
|
||||
This code is starting to get a little frustrating to write because each intermediate data frame has to be given a name, even though we don't care about them.
|
||||
Naming things is hard, so this slows down our analysis.
|
||||
There's another way to tackle the same problem with the **pipe**, `%>%`:
|
||||
|
||||
```{r}
|
||||
flights %>%
|
||||
filter(!is.na(dep_delay)) %>%
|
||||
group_by(month) %>%
|
||||
summarise(delay = mean(dep_delay), n = n())
|
||||
```
|
||||
|
||||
When you see `%>%` in code, a good way to "pronounce" it in your head is as "then".
|
||||
That way you can read this code as a series of imperative statements: take the flights dataset, then filter it to remove rows with missing `dep_delay`, then group it by month, then summarise it with the average `dep_delay` and the number of observations.
|
||||
|
||||
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
|
||||
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
|
||||
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(workflow-pipes).
|
||||
Means and counts can get you a surprisingly long way in data science!
|
||||
|
||||
### Grouping by multiple variables
|
||||
|
||||
You can group a data frame by multiple variables:
|
||||
|
||||
```{r}
|
||||
daily <- flights %>% group_by(year, month, day)
|
||||
daily <- flights %>%
|
||||
group_by(year, month, day)
|
||||
daily
|
||||
```
|
||||
|
||||
When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.
|
||||
|
||||
```{r}
|
||||
daily %>% summarise(n = n())
|
||||
daily %>%
|
||||
summarise(
|
||||
n = n()
|
||||
)
|
||||
```
|
||||
|
||||
If you're happy with this behaviour, you can explicitly define it in order to suppress the message:
|
||||
|
||||
```{r, results = FALSE}
|
||||
daily %>% summarise(n = n(), .groups = "drop_last")
|
||||
daily %>%
|
||||
summarise(
|
||||
n = n(),
|
||||
.groups = "drop_last"
|
||||
)
|
||||
```
|
||||
|
||||
Alternatively, you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:
|
||||
|
||||
```{r, results = FALSE}
|
||||
daily %>% summarise(n = n(), .groups = "drop")
|
||||
daily %>% summarise(n = n(), .groups = "keep")
|
||||
daily %>%
|
||||
summarise(
|
||||
n = n(),
|
||||
.groups = "drop"
|
||||
)
|
||||
daily %>%
|
||||
summarise(
|
||||
n = n(),
|
||||
.groups = "keep"
|
||||
)
|
||||
```
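
One way to see what each `.groups` value does is to inspect the grouping variables of the result with `group_vars()`. This is a sketch on made-up data, assuming dplyr is installed and R is 4.1 or later; `toy_daily` is a hypothetical stand-in for `daily`.

```r
library(dplyr)

# Hypothetical toy data grouped the same way as daily.
toy_daily <- tibble(
  year  = 2013,
  month = c(1, 1, 2),
  day   = c(1, 1, 1)
) |>
  group_by(year, month, day)

# "drop_last" peels off the last grouping variable (day).
drop_last <- toy_daily |> summarise(n = n(), .groups = "drop_last")

# "drop" removes all grouping.
dropped <- toy_daily |> summarise(n = n(), .groups = "drop")

# "keep" retains the original grouping.
kept <- toy_daily |> summarise(n = n(), .groups = "keep")

group_vars(drop_last)
group_vars(dropped)
group_vars(kept)
```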
|
||||
|
||||
### Ungrouping
|
||||
|
@ -439,7 +513,7 @@ For the purposes of summarising, ungrouped data is treated as if all your data w
|
|||
|
||||
`group_by()` is usually paired with `summarise()`, but it's good to know how it affects other verbs:
|
||||
|
||||
- `select()`, `rename()`, `relocate()`: grouping has no effect
|
||||
- `select()`, `rename()`, `relocate()`: grouping has no effect.
|
||||
|
||||
- `mutate()`: computation happens per group.
|
||||
This doesn't affect the functions you currently know but is very useful once you learn about window functions, Section \@ref(window-functions).
|
||||
|
@ -458,15 +532,17 @@ For the purposes of summarising, ungrouped data is treated as if all your data w
|
|||
|
||||
## Case study: aggregates and sample size
|
||||
|
||||
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).
|
||||
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
|
||||
That way you can check that you're not drawing conclusions based on very small amounts of data.
|
||||
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
|
||||
|
||||
```{r}
|
||||
delays <- not_cancelled %>%
|
||||
delays <- flights %>%
|
||||
filter(!is.na(arr_delay)) %>%
|
||||
group_by(tailnum) %>%
|
||||
summarise(
|
||||
delay = mean(arr_delay)
|
||||
delay = mean(arr_delay),
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(data = delays, mapping = aes(x = delay)) +
|
||||
|
@ -479,13 +555,6 @@ The story is actually a little more nuanced.
|
|||
We can get more insight if we draw a scatterplot of number of flights vs. average delay:
|
||||
|
||||
```{r}
|
||||
delays <- not_cancelled %>%
|
||||
group_by(tailnum) %>%
|
||||
summarise(
|
||||
delay = mean(arr_delay),
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
|
||||
geom_point(alpha = 1/10)
|
||||
```
|
||||
|
@ -501,13 +570,24 @@ It's a bit painful that you have to switch from `%>%` to `+`, but once you get t
|
|||
delays %>%
|
||||
filter(n > 25) %>%
|
||||
ggplot(mapping = aes(x = n, y = delay)) +
|
||||
geom_point(alpha = 1/10)
|
||||
geom_point(alpha = 1/10) +
|
||||
geom_smooth(se = FALSE)
|
||||
```
|
||||
|
||||
There's another common variation of this type of pattern.
|
||||
Let's look at how the average performance of batters in baseball is related to the number of times they're at bat.
|
||||
Here I use data from the **Lahman** package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
|
||||
|
||||
```{r}
|
||||
batters <- Lahman::Batting %>%
|
||||
group_by(playerID) %>%
|
||||
summarise(
|
||||
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
|
||||
ab = sum(AB, na.rm = TRUE)
|
||||
)
|
||||
batters
|
||||
```
|
||||
|
||||
When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
|
||||
|
||||
1. As above, the variation in our aggregate decreases as we get more data points.
|
||||
|
@ -516,20 +596,10 @@ When I plot the skill of the batter (measured by the batting average, `ba`) agai
|
|||
This is because teams control who gets to play, and obviously they'll pick their best players.
|
||||
|
||||
```{r}
|
||||
# Convert to a tibble so it prints nicely
|
||||
batting <- as_tibble(Lahman::Batting)
|
||||
|
||||
batters <- batting %>%
|
||||
group_by(playerID) %>%
|
||||
summarise(
|
||||
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
|
||||
ab = sum(AB, na.rm = TRUE)
|
||||
)
|
||||
|
||||
batters %>%
|
||||
filter(ab > 100) %>%
|
||||
ggplot(mapping = aes(x = ab, y = ba)) +
|
||||
geom_point() +
|
||||
geom_point(alpha = 1 / 10) +
|
||||
geom_smooth(se = FALSE)
|
||||
```
|
||||
|
||||
|
|
|
@ -24,6 +24,19 @@ You can create new objects with `<-`:
|
|||
x <- 3 * 4
|
||||
```
|
||||
|
||||
You can **c**ombine multiple elements into a vector with `c()`:
|
||||
|
||||
```{r}
|
||||
primes <- c(2, 3, 5, 7, 11, 13)
|
||||
```
|
||||
|
||||
And basic arithmetic is applied to every element of the vector:
|
||||
|
||||
```{r}
|
||||
primes * 2
|
||||
primes - 1
|
||||
```
|
||||
|
||||
All R statements where you create objects, **assignment** statements, have the same form:
|
||||
|
||||
```{r eval = FALSE}
|
||||
|
@ -134,20 +147,6 @@ If this happens, R will show you the continuation character "+":
|
|||
The `+` tells you that R is waiting for more input; it doesn't think you're done yet.
|
||||
Usually that means you've forgotten either a `"` or a `)`. Either add the missing pair, or press ESCAPE to abort the expression and try again.
|
||||
|
||||
If you make an assignment, you don't get to see the value.
|
||||
You're then tempted to immediately double-check the result:
|
||||
|
||||
```{r}
|
||||
y <- seq(1, 10, length.out = 5)
|
||||
y
|
||||
```
|
||||
|
||||
This common action can be shortened by surrounding the assignment with parentheses, which causes assignment and "print to screen" to happen.
|
||||
|
||||
```{r}
|
||||
(y <- seq(1, 10, length.out = 5))
|
||||
```
|
||||
|
||||
Now look at your environment in the upper right pane:
|
||||
|
||||
```{r, echo = FALSE, out.width = NULL}
|
||||
|
|