Correct description of flights joins/keys

Fixes #757
Hadley Wickham 2022-08-30 08:38:46 -05:00
parent 47607389c1
commit 5e611fd079
4 changed files with 35 additions and 38 deletions

Binary file not shown.

Before (size: 45 KiB)

Binary file not shown.

diagrams/relational.png Normal file

Binary file not shown.

After (size: 76 KiB)

@ -9,25 +9,21 @@ status("restructuring")
## Introduction ## Introduction
Waiting on <https://github.com/tidyverse/dplyr/pull/5910> <!-- TODO: redraw all diagrams to match O'Reilly style. From one to many on -->
<!-- TODO: redraw all diagrams to match O'Reilly style -->
It's rare that a data analysis involves only a single data frame. It's rare that a data analysis involves only a single data frame.
Typically you have many data frames, and you must **join** them together to answer the questions that you're interested in. Typically you have many data frames, and you must **join** them together to answer the questions that you're interested in.
All the verbs in this chapter use a pair of data frames. All the verbs in this chapter use a pair of data frames.
Fortunately this is enough, since you can combine three data frames by combining two pairs. Fortunately this is enough, since you can solve any more complex problem a pair at a time.
Sometimes both elements of a pair will be the same data frame.
This is needed if, for example, you have a data frame of people, and each person has a reference to their parents.
There are two important types of joins. You'll learn about two important types of joins in this chapter:
**Mutating joins** adds new variables to one data frame from matching observations in another.
**Filtering joins**, which filters observations from one data frame based on whether or not they match an observation in another.
If you're familiar with SQL, you should find these ideas very familiar as their realization in dplyr is very similar. - **Mutating joins** add new variables to one data frame from matching observations in another.
- **Filtering joins** filter observations from one data frame based on whether or not they match an observation in another.
If you're familiar with SQL, you should find the ideas in this chapter familiar, as their realization in dplyr is very similar.
We'll point out any important differences as we go. We'll point out any important differences as we go.
Don't worry if you're not familiar with SQL, we'll back to it in @sec-import-databases. Don't worry if you're not familiar with SQL as you'll learn more about it in @sec-import-databases.
### Prerequisites ### Prerequisites
@ -43,7 +39,7 @@ library(nycflights13)
## nycflights13 {#sec-nycflights13-relational} ## nycflights13 {#sec-nycflights13-relational}
nycflights13 contains five tibbles : `airlines`, `airports`, `weather` and `planes` which are all related to the `flights` data frame that you used in @sec-data-transform on data transformation: As well as the `flights` data frame that you used in @sec-data-transform, nycflights13 contains four additional related tibbles:
- `airlines` lets you look up the full carrier name from its abbreviated code: - `airlines` lets you look up the full carrier name from its abbreviated code:
@ -71,13 +67,13 @@ nycflights13 contains five tibbles : `airlines`, `airports`, `weather` and `plan
These datasets are connected as follows: These datasets are connected as follows:
- `flights` connects to `planes` via a single variable, `tailnum`. - `flights` connects to `planes` through the `tailnum` variable.
- `flights` connects to `airlines` through the `carrier` variable. - `flights` connects to `airlines` through the `carrier` variable.
- `flights` connects to `airports` in two ways: via the `origin` and `dest` variables. - `flights` connects to `airports` in two ways: through the origin (`origin`) and through the destination (`dest`).
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time). - `flights` connects to `weather` through two variables at the same time: the location (`origin`) and the time (`time_hour`).
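As a rough illustration of the last connection, the sketch below joins two tiny hand-made tibbles on `origin` and `time_hour` together. The `flights_mini` and `weather_mini` tibbles and their values are hypothetical stand-ins, not data from nycflights13, and the character-vector form of `by` is used here only because it works across dplyr versions.

```r
library(dplyr)

# Hypothetical miniature stand-ins for `flights` and `weather`
flights_mini <- tibble(
  flight = c(1L, 2L, 3L),
  origin = c("JFK", "JFK", "EWR"),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00", "2013-01-01 06:00", "2013-01-01 05:00"),
    tz = "UTC"
  )
)

weather_mini <- tibble(
  origin = c("JFK", "JFK", "EWR"),
  time_hour = as.POSIXct(
    c("2013-01-01 05:00", "2013-01-01 06:00", "2013-01-01 05:00"),
    tz = "UTC"
  ),
  temp = c(39.0, 39.9, 37.0)
)

# Match on both variables at once: a row in `weather_mini` is only
# attached to a flight when origin AND time_hour agree
joined <- flights_mini |>
  left_join(weather_mini, by = c("origin", "time_hour"))

joined
```

Joining on `origin` alone would pair each flight with every hour of weather at that airport, which is why both variables are needed.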
One way to show the relationships between the different data frames is with a diagram, as in @fig-flights-relationships. One way to show the relationships between the different data frames is with a diagram, as in @fig-flights-relationships.
This diagram is a little overwhelming, but it's simple compared to some you'll see in the wild! This diagram is a little overwhelming, but it's simple compared to some you'll see in the wild!
@ -87,20 +83,22 @@ You don't need to understand the whole thing; you just need to understand the ch
```{r} ```{r}
#| label: fig-flights-relationships #| label: fig-flights-relationships
#| echo: false #| echo: false
#| out-width: ~
#| fig-cap: > #| fig-cap: >
#| Connections between all six data frames in the nycflights package. #| Connections between all five data frames in the nycflights13 package.
#| fig-alt: > #| fig-alt: >
#| Diagram showing the relationships between airports, planes, flights, #| Diagram showing the relationships between airports, planes, flights,
#| weather, and airlines datasets from the nycflights13 package. The faa #| weather, and airlines datasets from the nycflights13 package. The faa
#| variable in the airports data frame is connected to the origin and dest #| variable in the airports data frame is connected to the origin and dest
#| variables in the flights data frame. The tailnum variable in the planes #| variables in the flights data frame. The tailnum variable in the planes
#| data frame is connected to the tailnum variable in flights. The year, #| data frame is connected to the tailnum variable in flights. The
#| month, day, hour, and origin variables are connected to the variables #| time_hour and origin variables in the weather data frame are connected
#| with the same name in the flights data frame. And finally the carrier #| to the variables with the same name in the flights data frame. And
#| variables in the airlines data frame is connected to the carrier #| finally the carrier variable in the airlines data frame is connected
#| variable in the flights data frame. There are no direct connections #| to the carrier variable in the flights data frame. There are no direct
#| between airports, planes, airlines, and weather data frames. #| connections between airports, planes, airlines, and weather data
knitr::include_graphics("diagrams/relational-nycflights.png") #| frames.
knitr::include_graphics("diagrams/relational.png", dpi = 270)
``` ```
### Exercises ### Exercises
@ -122,7 +120,7 @@ A key is a variable (or set of variables) that uniquely identifies an observatio
In simple cases, a single variable is sufficient to identify an observation. In simple cases, a single variable is sufficient to identify an observation.
For example, each plane is uniquely identified by its `tailnum`. For example, each plane is uniquely identified by its `tailnum`.
In other cases, multiple variables may be needed. In other cases, multiple variables may be needed.
For example, to identify an observation in `weather` you need five variables: `year`, `month`, `day`, `hour`, and `origin`. For example, to identify an observation in `weather` you need two variables: `time_hour` and `origin`.
There are two types of keys: There are two types of keys:
@ -144,26 +142,22 @@ planes |>
filter(n > 1) filter(n > 1)
weather |> weather |>
count(year, month, day, hour, origin) |> count(time_hour, origin) |>
filter(n > 1) filter(n > 1)
``` ```
Sometimes a data frame doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. Sometimes a data frame doesn't have an explicit primary key and only an unwieldy combination of variables reliably identifies an observation.
For example, what's the primary key in the `flights` data frame? For example, to uniquely identify a flight, we need the hour the flight departs, the carrier, and the flight number:
You might think it would be the date plus the flight or tail number, but neither of those are unique:
```{r} ```{r}
flights |> flights |>
count(year, month, day, flight) |> count(time_hour, carrier, flight) |>
filter(n > 1)
flights |>
count(year, month, day, tailnum) |>
filter(n > 1) filter(n > 1)
``` ```
When starting to work with this data, we had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. When starting to work with this data, we had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight.
Unfortunately that is not the case! Unfortunately that is not the case, and we have to assume that a flight number will never be re-used within an hour.
If a data frame lacks a primary key, it's sometimes useful to add one with `mutate()` and `row_number()`. If a data frame lacks a primary key, it's sometimes useful to add one with `mutate()` and `row_number()`.
That makes it easier to match observations if you've done some filtering and want to check back in with the original data. That makes it easier to match observations if you've done some filtering and want to check back in with the original data.
This is called a **surrogate key**. This is called a **surrogate key**.
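A minimal sketch of adding a surrogate key, assuming a small hand-made tibble (`flights_mini` and the `id` column name are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical data in which two rows are otherwise indistinguishable
flights_mini <- tibble(
  carrier = c("UA", "UA", "AA"),
  flight = c(1545L, 1545L, 1141L),
  origin = c("EWR", "EWR", "JFK")
)

# Number the rows in their current order and put the new key first
flights_keyed <- flights_mini |>
  mutate(id = row_number(), .before = 1)

# Confirm that `id` is unique: this returns zero rows
flights_keyed |>
  count(id) |>
  filter(n > 1)
```

Because the key is derived from row order, it is only stable as long as the surrogate key is added before any filtering or reordering.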
@ -180,12 +174,15 @@ For example, in this data there's a many-to-many relationship between airlines a
1. Add a surrogate key to `flights`. 1. Add a surrogate key to `flights`.
2. We know that some days of the year are "special", and fewer people than usual fly on them. 2. The year, month, day, hour, and origin variables almost form a compound key for weather, but there's one hour that has duplicate observations.
Can you figure out what's special about this time?
3. We know that some days of the year are "special", and fewer people than usual fly on them.
How might you represent that data as a data frame? How might you represent that data as a data frame?
What would be the primary keys of that data frame? What would be the primary keys of that data frame?
How would it connect to the existing data frames? How would it connect to the existing data frames?
3. Identify the keys in the following datasets 4. Identify the keys in the following datasets
a. `Lahman::Batting` a. `Lahman::Batting`
b. `babynames::babynames` b. `babynames::babynames`
@ -195,7 +192,7 @@ For example, in this data there's a many-to-many relationship between airlines a
(You might need to install some packages and read some documentation.) (You might need to install some packages and read some documentation.)
4. Draw a diagram illustrating the connections between the `Batting`, `People`, and `Salaries` data frames in the Lahman package. 5. Draw a diagram illustrating the connections between the `Batting`, `People`, and `Salaries` data frames in the Lahman package.
Draw another diagram that shows the relationship between `People`, `Managers`, `AwardsManagers`. Draw another diagram that shows the relationship between `People`, `Managers`, `AwardsManagers`.
How would you characterise the relationship between the `Batting`, `Pitching`, and `Fielding` data frames? How would you characterise the relationship between the `Batting`, `Pitching`, and `Fielding` data frames?