These datasets are all connected via the `flights` data frame because the `tailnum`, `carrier`, `origin`, `dest`, and `time_hour` variables are all primary keys in other datasets, which makes them foreign keys in `flights`.
- `flights$tailnum` connects to primary key `planes$tailnum`.
This diagram is a little overwhelming, but it's simple compared to some you'll see in the wild!
The key to understanding diagrams like this is that you'll solve real problems by working with pairs of data frames.
You don't need to understand the whole thing; you just need to understand the chain of connections between the two data frames that you're interested in.
You should also check for missing values in your primary keys --- if a value is missing then it can't identify an observation!
```{r}
planes |>
  filter(is.na(tailnum))

weather |>
  filter(is.na(time_hour) | is.na(origin))
```
### Surrogate keys
So far we haven't talked about the primary key for `flights`.
It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to think about because it's easier to work with observations if we have some way to uniquely identify them.
There's clearly no one variable or even a pair of variables that uniquely identifies a flight, but we can find three together that work:
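One way to run that check, sketched under the assumption that dplyr and nycflights13 are installed and loaded as elsewhere in this chapter, is to count the rows for each combination and keep any combination that occurs more than once. The name `flight_dupes` is only illustrative.

```{r}
library(dplyr)
library(nycflights13)

# Count rows per candidate key; any n > 1 would mean the key is not unique.
flight_dupes <- flights |>
  count(time_hour, carrier, flight) |>
  filter(n > 1)

flight_dupes
```

An empty result means that no combination of the three variables repeats in the data.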
Does that make `time_hour`-`carrier`-`flight` a primary key?
It's certainly a good start, but it doesn't guarantee it.
For example, are altitude and latitude a primary key for `airports`?
```{r}
airports |>
  count(alt, lat) |>
  filter(n > 1)
```
Identifying an airport by its altitude and latitude is clearly a bad idea, and in general it's not possible to know from the data itself whether or not a combination of variables that uniquely identifies an observation is a primary key.
For flights, the combination of `time_hour`, `carrier`, and `flight` seems like a reasonable primary key because it would be really confusing for the airline if there were multiple flights with the same number in the air at the same time.
That said, we might be better off introducing a simple numeric **surrogate** key using the row number:
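A minimal sketch of that idea, assuming dplyr and nycflights13 are available. The object name `flights_with_id` and the column name `id` are hypothetical choices for this illustration.

```{r}
library(dplyr)
library(nycflights13)

# `flights_with_id` and `id` are hypothetical names chosen for this sketch.
# row_number() numbers the rows 1, 2, 3, ... and .before = 1 puts the new
# column first so it is easy to see when printing.
flights_with_id <- flights |>
  mutate(id = row_number(), .before = 1)

flights_with_id
```

Because the key is just the row position, it is only stable as long as the rows are not reordered or filtered before the key is created.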
Surrogate keys can be particularly useful when communicating with other humans: it's much easier to tell someone to take a look at flight 2001 than to say look at the UA430 which departed 9am 2013-01-03.
Now that you understand how data frames are connected via keys, we can start using them to better understand the `flights` dataset.
We'll first show you the mutating joins, so called because their primary role[^joins-1] is to add additional columns to the `x` data frame, just like `mutate()`. You'll then learn about join keys, and finish up with a discussion of the filtering joins, which work like a `filter()` rather than a `mutate()`.
[^joins-1]: They also affect the number of rows; we'll come back to that shortly.
### Mutating joins
A **mutating join** allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.
Like `mutate()`, the join functions add variables to the right, so if you have a lot of variables already, you won't see the new variables.
To make it easier to see what's going on in these examples, we'll create a narrower dataset:
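The sketch below shows one plausible version of that narrower dataset together with a first mutating join. The exact column selection is an assumption based on the columns used later in this section, and `flights2_named` is a hypothetical name. The join key is spelled out with `join_by()`, which the next section explains.

```{r}
library(dplyr)
library(nycflights13)

# Assumed column selection: just the variables the later examples rely on.
flights2 <- flights |>
  select(year, time_hour, origin, dest, tailnum, carrier)

# Add the full airline name by matching on the shared `carrier` column.
flights2_named <- flights2 |>
  left_join(airlines, join_by(carrier))

# Same number of rows, one extra column on the right.
dim(flights2)
dim(flights2_named)
```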
Note that in each of these cases the number of rows has stayed the same, but we've added new columns to the right.
### Specifying join keys
By default, `left_join()` will use all variables that appear in both data frames as the join key, the so-called **natural** join.
This is a useful heuristic, but it doesn't always work.
What happens if we try to join `flights` with the complete `planes`?
```{r}
flights2 |>
  left_join(planes)
```
We get a lot of missing matches because both `flights` and `planes` have a `year` column but they mean different things: the year the flight occurred and the year the plane was built.
We only want to join on the `tailnum` column, so we need to switch to an explicit specification:
```{r}
flights2 |>
  left_join(planes, join_by(tailnum))
```
Note that the `year` variables are disambiguated in the output with a suffix.
You can control this with the `suffix` argument.
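As a rough illustration of `suffix`, the sketch below uses two hypothetical miniature tables (`mini_flights` and `mini_planes`) that both contain a `year` column. The suffix strings are arbitrary choices for this example.

```{r}
library(dplyr)

# Hypothetical miniature versions of the flights and planes tables.
mini_flights <- tibble(year = 2013L, tailnum = c("N1", "N2"))
mini_planes <- tibble(tailnum = c("N1", "N2"), year = c(1999L, 2004L))

# The first suffix is applied to the clashing column from `x`,
# the second to the clashing column from `y`.
labelled <- mini_flights |>
  left_join(mini_planes, join_by(tailnum), suffix = c("_flight", "_built"))

names(labelled)
```

Descriptive suffixes like these can make it clearer which table each `year` came from than the default `.x` and `.y`.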
`join_by(tailnum)` is short for `join_by(tailnum == tailnum)`.
This fuller form is important because it's how you specify different join keys in each table.
For example, there are two ways to join the `flights2` and `airports` tables: either by `dest` or `origin`:
```{r}
flights2 |>
  left_join(airports, join_by(dest == faa))

flights2 |>
  left_join(airports, join_by(origin == faa))
```
In older code you might see a different way of specifying the join keys, using a character vector:
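In that older style, `by = "x"` plays the role of `join_by(x)`, and a named vector such as `by = c("a" = "x")` plays the role of `join_by(a == x)`. The sketch below compares the two spellings on hypothetical toy tables (`toy_flights` and `toy_airports`, with made-up airport names).

```{r}
library(dplyr)

# Hypothetical toy tables for comparing the two ways of writing the key.
toy_flights <- tibble(dest = c("IAH", "MIA"), tailnum = c("N1", "N2"))
toy_airports <- tibble(faa = c("IAH", "MIA"), name = c("Houston", "Miami"))

# Older style: a named character vector, x column name = y column name.
old_style <- toy_flights |>
  left_join(toy_airports, by = c("dest" = "faa"))

# Newer style: an expression inside join_by().
new_style <- toy_flights |>
  left_join(toy_airports, join_by(dest == faa))

all.equal(old_style, new_style)
```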
Now that it exists, we prefer `join_by()` as it's a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.
### Filtering joins
As you might guess, the primary action of a **filtering join** is to filter the rows.
There are two types: semi-joins and anti-joins.
**Semi-joins** keep all rows in `x` that have a match in `y`, which makes them useful for matching filtered summary data frames back to the original rows.
For example, we could use a semi-join to filter the `airports` dataset to show just the origin airports:
```{r}
airports |>
  semi_join(flights2, join_by(faa == origin))
```
Or just the destinations:
```{r}
airports |>
  semi_join(flights2, join_by(faa == dest))
```
**Anti-joins** are the opposite: they return all rows in `x` that don't have a match in `y`.
They're useful for figuring out what's missing.
For example, we can figure out which flights are missing information about the destination airport:
```{r}
flights2 |>
  anti_join(airports, join_by(dest == faa))
```
Or which flights lack metadata about their plane:
```{r}
flights2 |>
  anti_join(planes, join_by(tailnum)) |>
  distinct(tailnum)
```
### Exercises
1. Does every departing flight have corresponding weather data for that hour?
2. Find the 48 hours (over the course of the whole year) that have the worst delays.
Cross-reference it with the `weather` data.
Can you see any patterns?
3. Imagine you've found the top 10 most popular destinations using this code:
```{r}
top_dest <- flights2 |>
  count(dest, sort = TRUE) |>
  head(10)
```
How can you use this code to find all flights to those destinations?
4. What does it mean for a flight to have a missing `tailnum`?
What do the tail numbers that don't have a matching record in `planes` have in common?
(Hint: one variable explains \~90% of the problems.)
5. You might expect that there's an implicit relationship between plane and airline, because each plane is flown by a single airline.
Confirm or reject this hypothesis using the tools you've learned above.
6. Add the location of the origin *and* destination (i.e. the `lat` and `lon`) to `flights`.
Is it easier to rename the columns before or after the join?
7. Compute the average delay by destination, then join on the `airports` data frame so you can show the spatial distribution of delays.
Here's an easy way to draw a map of the United States:
```{r}
#| eval: false
airports |>
  semi_join(flights, join_by(faa == dest)) |>
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()
```
You might want to use the `size` or `colour` of the points to display the average delay for each airport.
8. What happened on June 13 2013?
Display the spatial pattern of delays, and then use Google to cross-reference with the weather.
```{r}
#| eval: false
#| include: false
worst <- filter(flights, !is.na(dep_time), month == 6, day == 13)
worst |>
  group_by(dest) |>
  # na.rm = TRUE: flights that departed can still have a missing arr_delay
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) |>
  filter(n > 5) |>
  inner_join(airports, join_by(dest == faa)) |>
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(size = n, colour = delay)) +
    coord_quickmap()
```
## How do joins work?
Now that you've used a few joins it's time to learn more about how they work, focusing especially on how each row in `x` matches with each row in `y`.
Another way to show how the outer joins differ is with a Venn diagram, @fig-join-venn.
This, however, is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what's happening with the columns.
This is another reason we recommend the `left_join()` --- every row in `x` is guaranteed to match a "virtual" row in `y` so it'll never drop rows, and you'll always get a warning when it duplicates rows.
- `unmatched` controls what happens if a row in `x` doesn't match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
- `multiple` controls what happens if a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if there are any multiple matches.
(`unmatched = "error"` is not useful with `left_join()` because as described above, a `left_join()` always matches a virtual row in `y` filled with missing values).
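To make the effect of `unmatched` concrete, here is a small sketch that assumes dplyr 1.1.0 or later. The tables `left_tbl` and `right_tbl` are hypothetical, and `tryCatch()` is there only so the example keeps running after the join fails.

```{r}
library(dplyr)

# Hypothetical toy tables: key 3 in `left_tbl` has no partner in `right_tbl`.
left_tbl <- tibble(key = 1:3, val_x = c("x1", "x2", "x3"))
right_tbl <- tibble(key = 1:2, val_y = c("y1", "y2"))

# Default (unmatched = "drop"): the unmatched row silently disappears.
dropped <- left_tbl |>
  inner_join(right_tbl, join_by(key))

# With unmatched = "error", the same join stops instead of dropping the row.
checked <- tryCatch(
  left_tbl |>
    inner_join(right_tbl, join_by(key), unmatched = "error"),
  error = function(e) "join failed: unmatched rows"
)

nrow(dropped)
checked
```

Turning silent row loss into an error like this can be a useful safeguard when you believe every row should have a match.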
The number of matches also determines the behavior of the filtering joins.
The semi-join keeps rows in `x` that have one or more matches in `y`, as in @fig-join-semi. The anti-join keeps rows in `x` that don't have a match in `y`, as in @fig-join-anti.
In both cases, only the existence of a match is important; it doesn't matter which observation is matched.
This means that filtering joins never duplicate rows like mutating joins do.
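The difference is easy to see on a small example. In the sketch below, the tables `customers` and `orders` are hypothetical, and key 1 deliberately has two matches.

```{r}
library(dplyr)

# Hypothetical toy tables: key 1 appears twice in `orders`, key 3 never.
customers <- tibble(key = 1:3, name = c("a", "b", "c"))
orders <- tibble(key = c(1L, 1L, 2L), item = c("pen", "ink", "pad"))

# Mutating join: the row for key 1 is repeated once per match.
inner <- customers |>
  inner_join(orders, join_by(key))

# Filtering joins: each row of `customers` is either kept once or dropped.
semi <- customers |>
  semi_join(orders, join_by(key))
anti <- customers |>
  anti_join(orders, join_by(key))

c(inner = nrow(inner), semi = nrow(semi), anti = nrow(anti))
```

Note also that the filtering joins return only the columns of `customers`; nothing from `orders` is copied across.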
This allows us to make an important simplification in both the diagrams and the data frames that joins return: we only ever include the join key from one table.
We can request that dplyr keep both keys with `keep = TRUE`.
This is shown in the code below and in @fig-inner-both.
```{r}
x |> left_join(y, by = "key", keep = TRUE)
```
```{r}
#| label: fig-inner-both
#| fig-cap: >
#|   Inner join showing keys from both `x` and `y`. This is not the
#|   default because for equi-joins, the keys are the same so showing
#|   both is redundant.
```
This distinction between the keys becomes much more important as we move away from equi-joins because the key values are much more likely to be different.
Because of this, dplyr defaults to showing both keys when you perform a non-equi join.
For example, instead of requiring that the `x` and `y` keys be equal, we could request that the key from `x` be less than the key from `y`, as in the code below and @fig-join-gte.
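A minimal sketch of such a join, assuming dplyr 1.1.0 or later. The tables `df_x` and `df_y` are hypothetical stand-ins, and the `x$` and `y$` prefixes are used to spell out which table each `key` comes from.

```{r}
library(dplyr)

# Hypothetical toy tables; `df_x` and `df_y` are illustrative names.
df_x <- tibble(key = 1:3, val_x = c("x1", "x2", "x3"))
df_y <- tibble(key = c(1L, 2L, 4L), val_y = c("y1", "y2", "y4"))

# Match each row of df_x to every row of df_y with a strictly larger key.
# Inside join_by(), `x$` and `y$` mean "first table" and "second table",
# whatever the data frames themselves are called.
less_than <- df_x |>
  inner_join(df_y, join_by(x$key < y$key))

less_than
```

The result keeps both keys, as `key.x` and `key.y`, because in a non-equi join the two values generally differ.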
Non-equi join isn't a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps a bit by identifying three useful types of non-equi join: inequality joins, rolling joins, and overlap joins.
Here we perform a self-join (i.e. we join a table to itself), then use the inequality join to ensure that we get just one of the two possible pairs (e.g. just (a, b), not also (b, a)) and don't match a row with itself.
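The sketch below shows the idea on a hypothetical four-row table called `people`. The names in it and the result name `unique_pairs` are assumptions made for illustration.

```{r}
library(dplyr)

# Hypothetical table of people for this sketch.
people <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))

# Self-join: the left `id` comes from the first copy of the table and the
# right `id` from the second. Requiring id < id keeps each unordered pair
# once and rules out pairing a row with itself.
unique_pairs <- people |>
  inner_join(people, join_by(id < id))

unique_pairs
```

With four people this gives the six possible pairs, each listed once.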
Rolling joins are a special type of inequality join where instead of getting *every* row that satisfies the inequality, you get just the closest row.
They're particularly useful when you have two tables of dates that don't perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.
You can turn any inequality join into a rolling join by adding `closest()`.
For example, `join_by(closest(x <= y))` finds the smallest `y` that's greater than or equal to `x`, and `join_by(closest(x > y))` finds the biggest `y` that's less than `x`.
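Both forms can be tried on a pair of tiny numeric tables. This is only a sketch, assuming dplyr 1.1.0 or later, with hypothetical tables `lookups` (column `x`) and `targets` (column `y`).

```{r}
library(dplyr)

# Hypothetical toy tables, named to avoid clashing with other objects.
lookups <- tibble(x = c(1, 5, 9))
targets <- tibble(y = c(2, 4, 6, 8))

# For each x, keep only the smallest y with x <= y.
next_y <- lookups |>
  left_join(targets, join_by(closest(x <= y)))

# For each x, keep only the biggest y with x > y.
prev_y <- lookups |>
  left_join(targets, join_by(closest(x > y)))

next_y
prev_y
```

Because these are left joins, an `x` with no qualifying `y` (9 in the first join, 1 in the second) is kept with a missing `y`.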
For example, imagine that you're in charge of office birthdays.
Your company is rather stingy so instead of having individual parties, you only have a party once each quarter.
Parties are always on a Monday. You skip the first week of January since a lot of people are on holiday, and because the first Monday of Q3 is July 4, that party has to be pushed back a week.
That leads to the following party days:
```{r}
parties <- tibble(
  q = 1:4,
  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)
```
Then we have a table of employees along with their birthdays:
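As a stand-in, the sketch below uses a hypothetical four-person table, `employees_demo`, with made-up names and birthdays. It is then matched to the most recent party on or before each birthday with a rolling join. `parties_demo` repeats the party dates from above (built with `as.Date()`) so that the block runs on its own.

```{r}
library(dplyr)

# Same party dates as `parties` above, repeated so this block is self-contained.
parties_demo <- tibble(
  q = 1:4,
  party = as.Date(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
)

# Hypothetical employees; names and birthdays are invented for illustration.
employees_demo <- tibble(
  name = c("Ana", "Ben", "Chloe", "Dev"),
  birthday = as.Date(c("2022-01-05", "2022-03-15", "2022-07-11", "2022-12-25"))
)

# Rolling join: for each birthday, the latest party date that is <= birthday.
celebrations <- employees_demo |>
  left_join(parties_demo, join_by(closest(birthday >= party)))

celebrations
```

A birthday that falls before the first party of the year (January 5 in this sketch) has no earlier party to roll back to, so it gets a missing `party`.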
We could also flip the question around and ask which employees will celebrate in each party.
This requires explicitly specifying which table each variable comes from since otherwise `between()` assumes that the first argument comes from `x` and the second and third come from `y`.
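A sketch of the flipped join, assuming dplyr 1.1.0 or later. Everything in it is hypothetical: `party_windows` adds assumed `start` and `end` columns giving the range of birthdays each party covers, and `employees_demo` holds made-up names and birthdays.

```{r}
library(dplyr)

# Hypothetical party windows: `start` and `end` are assumed columns giving
# the range of birthdays covered by each party.
party_windows <- tibble(
  q = 1:4,
  start = as.Date(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
  end = as.Date(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
)

# Hypothetical employees; names and birthdays are invented for illustration.
employees_demo <- tibble(
  name = c("Ana", "Ben", "Chloe", "Dev"),
  birthday = as.Date(c("2022-01-05", "2022-03-15", "2022-07-11", "2022-12-25"))
)

# The parties are now the first table (`x`), so the prefixes state that
# `birthday` comes from `y` and the window bounds come from `x`.
attendees <- party_windows |>
  inner_join(employees_demo, join_by(between(y$birthday, x$start, x$end)))

attendees
```

In this sketch nobody has a birthday in the second window, so the inner join returns no row for that party.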