Edits to joins chapter (#1086)
* Add missing word
* Delete a word
* Add missing word
* Don't say "value of a primary key"; use more parallel language
* Typo
* How about "Now"?
* Comma, wording, grammar
* Plural
* 'Special' used in same same sense, unquoted, in previous exercise
* Add word, remove 's'
* Add words
* Subject-verb
* Don't use 'key' in a non-join-y way
* Copy edits to match details
* Wording
* Add words
parent 4ac50eb359
commit 0c9acc7074
66
joins.qmd
@@ -17,9 +17,9 @@ This chapter will introduce you to two important types of joins:
 - Filtering joins, filter observations from one data frame based on whether or not they match an observation in another.

 We'll begin by discussing keys, the variables used to connect a pair of data frames in a join.
-You'll then see how to use joins to a variety of challenges from the nycflights13 dataset.
+You'll then see how to use joins to tackle a variety of challenges from the nycflights13 dataset.
 Next we'll discuss how joins work, focusing on their action on the rows.
-We'll finish up by with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
+We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.

 If you're familiar with SQL, you should find the ideas in this chapter familiar, as their realization in dplyr is very similar.

@@ -46,7 +46,7 @@ You'll also learn how to check that your keys are valid, and what to do if your

 Every join involves a pair of keys: a primary key and a foreign key.
 A **primary key** is a variable (or group of variables) that uniquely identifies an observation.
-A **foreign key** is the value of a primary key in another table so can be used to lookup the corresponding observation.
+A **foreign key** is the corresponding variable (or groups of variables) in another table.
 Let's make those terms concrete by looking at four of the data frames in nycfights13:

 - `airlines` lets you look up the full carrier name from its abbreviated code.
@@ -57,7 +57,7 @@ Let's make those terms concrete by looking at four of the data frames in nycfigh
 ```

 - `airports` gives information about each airport.
-Its primary key is the three `faa` airport code.
+Its primary key is the three letter `faa` airport code.

 ```{r}
 airports
@@ -80,7 +80,7 @@ Let's make those terms concrete by looking at four of the data frames in nycfigh
 These datasets are all connected via the `flights` data frame because the `tailnum`, `carrier`, `origin`, `dest`, and `time_hour` variables are all foreign keys:

 - `flights$tailnum` connects to primary key `planes$tailnum`.
-- `flights$carrier` connects to primary key `airlines$carrer`.
+- `flights$carrier` connects to primary key `airlines$carrier`.
 - `flights$origin` connects to primary key `airports$faa`.
 - `flights$dest` connects to primary key `airports$faa` .
 - `flights$origin`-`flights$time_hour` connects to primary key `weather$origin`-`weather$time_hour`.
@@ -115,7 +115,7 @@ knitr::include_graphics("diagrams/relational.png", dpi = 270)

 ### Checking primary keys

-That that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation.
+Now that that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation.
 One way to do that is to `count()` the primary keys and look for entries where `n` is greater than one.
 This reveals that `planes` and `weather` both look good:

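The `count()` check described in this hunk can be sketched as follows. This is a minimal sketch, assuming the dplyr and nycflights13 packages are installed; the key columns are the ones named in the foreign-key list earlier in the diff (`planes$tailnum`, and `weather$origin` together with `weather$time_hour`), and the names `planes_dupes` and `weather_dupes` are only illustrative.

```r
library(dplyr)
library(nycflights13)

# A valid primary key never repeats, so counting by the key and keeping
# only groups with n > 1 should leave zero rows.
planes_dupes <- planes |>
  count(tailnum) |>
  filter(n > 1)

weather_dupes <- weather |>
  count(time_hour, origin) |>
  filter(n > 1)

# Number of duplicated key values found in each table
nrow(planes_dupes)
nrow(weather_dupes)
```

Any row that survives the `filter()` is a key value that identifies more than one observation, which is exactly what a primary key must not do.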
@@ -144,7 +144,7 @@ weather |>
 So far we haven't talked about the primary key for `flights`.
 It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if have some way to describe them to others.

-After a little thinking and experimentation we discovered that there are three variables that together uniquely identifies each flight:
+After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:

 ```{r}
 flights |>
@@ -180,13 +180,13 @@ Surrogate keys can be particular useful when communicating to other humans: it's
 1. We forgot to draw the relationship between `weather` and `airports` in @fig-flights-relationships.
 What is the relationship and how should it appear in the diagram?

-2. `weather` only contains information for the three origin airport in NYC.
+2. `weather` only contains information for the three origin airports in NYC.
 If it contained weather records for all airports in the USA, what additional connection would it make to `flights`?

 3. The `year`, `month`, `day`, `hour`, and `origin` variables almost form a compound key for `weather`, but there's one hour that has duplicate observations.
 Can you figure out what's special about that hour?

-4. We know that some days of the year are "special" and fewer people than usual fly on them.
+4. We know that some days of the year are special and fewer people than usual fly on them.
 How might you represent that data as a data frame?
 What would be the primary key?
 How would it connect to the existing data frames?
@@ -199,10 +199,10 @@ Surrogate keys can be particular useful when communicating to other humans: it's

 Now that you understand how data frames are connected via keys, we can start using joins to better understand the `flights` dataset.
 dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `semi_join()`, and `anti_join()`.
-They all the same interface: they take a pair of data frames `x` and `y` and return a data frame.
+They all have the same interface: they take a pair of data frames `x` and `y` and return a data frame.
 The order of the rows and columns in the output is primarily determined by `x`.

-In this section, you'll learn how to use one mutating joins, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
+In this section, you'll learn how to use one mutating join, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
 In the next section, you'll learn exactly how these functions work, and about the remaining `inner_join()`, `right_join()` and `full_join()`.

 ### Mutating joins
@@ -267,7 +267,7 @@ flights2 |>
 left_join(planes)
 ```

-We get a lot of missing matches our join is trying to use both `tailnum` and `year`.
+We get a lot of missing matches because our join is trying to use both `tailnum` and `year`.
 Both `flights` and `planes` have a `year` column but they mean different things: `flights$year` is year the flight occurred and `planes$year` is the year the plane was built.
 We only want to join on `tailnum` so we need to provide an explicit specification with `join_by()`:

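The explicit specification mentioned at the end of this hunk might look like the sketch below. It assumes dplyr 1.1.0 or later and nycflights13 are installed. The definition of `flights2` is not visible in this diff, so the one used here (a narrow selection of `flights` columns) is an assumption, as is the name `with_planes`.

```r
library(dplyr)
library(nycflights13)

# Assumed definition: flights2 is not shown in this diff, so we take a
# narrow subset of flights that still contains tailnum and year.
flights2 <- flights |>
  select(year, time_hour, origin, dest, tailnum, carrier)

# Join on tailnum only, so year is no longer treated as a key.
with_planes <- flights2 |>
  left_join(planes, join_by(tailnum))

# Both year columns are kept and told apart by a suffix.
names(with_planes)
```

Because `year` is no longer a join key, the output keeps both versions of it, disambiguated by dplyr's default `.x` and `.y` suffixes.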
@@ -295,14 +295,14 @@ In older code you might see a different way of specifying the join keys, using a
 - `by = "x"` corresponds to `join_by(x)`.
 - `by = c("a" = "x")` corresponds to `join_by(a == x)`.

-Now that it exists, we prefer `join_by()` since provides a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.
+Now that it exists, we prefer `join_by()` since it provides a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.

 ### Filtering joins

 As you might guess the primary action of a **filtering join** is to filter the rows.
 There are two types: semi-joins and anti-joins.
 **Semi-joins** keep all rows in `x` that have a match in `y`.
-For example, we could use to filter the `airports` dataset to show just the origin airports:
+For example, we could use a semi-join to filter the `airports` dataset to show just the origin airports:

 ```{r}
 airports |>
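The semi-join in this hunk is cut off by the diff context, so here is one way it could be written. This is a sketch, assuming dplyr 1.1.0 or later and nycflights13; it joins against `flights` directly, and the names `origins` and `origins_old` are illustrative. It also shows the older character-vector spelling of the same keys for comparison.

```r
library(dplyr)
library(nycflights13)

# Keep only the airports that appear as an origin in flights.
origins <- airports |>
  semi_join(flights, join_by(faa == origin))

# The older spelling of the same key specification.
origins_old <- airports |>
  semi_join(flights, by = c("faa" = "origin"))

# Airport codes that survived the filter
origins$faa

# Do the two spellings agree?
all.equal(origins, origins_old)
```

Note that a semi-join adds no columns from `flights`: the result has exactly the columns of `airports`, just fewer rows.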
@@ -423,8 +423,8 @@ y <- tribble(
 #| out-width: ~
 #| fig-cap: >
 #| Graphical representation of two simple tables. The coloured `key`
-#| columns map background colour to key value. The grey columns represents
-#| the "value" columns that is carried along for the ride.
+#| columns map background colour to key value. The grey columns represent
+#| the "value" columns that are carried along for the ride.
 #| fig-alt: >
 #| x and y are two data frames with 2 columns and 3 rows each. The first
 #| column in each is the key and the second is the value. The contents of
@@ -518,7 +518,7 @@ There are three types of outer joins:
 ```

 - A **full join** keeps all observations that appear in `x` or `y`, @fig-join-full.
-Every row of `x` and `y` `is` included in the output because both `x` and `y` have a fall back row of `NA`s.
+Every row of `x` and `y` is included in the output because both `x` and `y` have a fall back row of `NA`s.
 Note the output will consist of all `x` rows followed by the remaining `y` rows.

 ```{r}
@@ -571,7 +571,7 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
 #| echo: false
 #| out-width: ~
 #| fig-cap: >
-#| The three key ways a row in `x` can match. `x1` matches
+#| The three ways a row in `x` can match. `x1` matches
 #| one row in `y`, `x2` matches two rows in `y`, `x3` matches
 #| zero rows in y. Note that while there are three rows in
 #| `x` and three rows in the output, there isn't a direct
@@ -584,20 +584,20 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
 knitr::include_graphics("diagrams/join/match-types.png", dpi = 270)
 ```

-There are three possible outcomes for a row:
+There are three possible outcomes for a row in `x`:

 - If it doesn't match anything, it's dropped.
-- If it matches 1 row, it's kept as is.
-- If it matches more than 1 row, it's duplicated once for each match.
+- If it matches 1 row in `y`, it's kept as is.
+- If it matches more than 1 row in `y`, it's duplicated once for each match.

-In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`:
+In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`, compared to the number of rows in `x`.

 - There might be fewer rows if some rows in `x` don't match any rows in `y`.
 - There might be more rows if some rows in `x` match multiple rows in `y`.
 - There might be the same number of rows if every row in `x` matches one row in `y`.
 - There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in `y`!!

-Row expansion is a fundamental property of joins, but it's dangerous because it might by hidden.
+Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it.
 To avoid this problem, dplyr will warn whenever there are multiple matches:

 ```{r}
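The three outcomes listed in this hunk can be reproduced with two tiny tables. This is a sketch with made-up data, assuming dplyr 1.1.0 or later; `x`, `y`, and `out` are illustrative names. Whether a warning is emitted for multiple matches depends on the dplyr version, so the sketch sets `multiple = "all"` explicitly to behave the same either way.

```r
library(dplyr)

# x1 matches one row in y, x2 matches two rows, x3 matches none.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# multiple = "all" states that duplicated matches are intended.
out <- inner_join(x, y, join_by(key), multiple = "all")
out

# Same number of rows as x, but not the same rows:
# x2 appears twice and x3 has been dropped.
nrow(out) == nrow(x)
```

This is the last case in the list above: one row is dropped and one row is duplicated, so the row count alone would not reveal that anything happened.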
@@ -612,7 +612,7 @@ This is another reason we recommend `left_join()` --- if it runs without warning

 You can gain further control over row matching with two arguments:

-- `unmatched` controls what happens when in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
+- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
 - `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if any rows have multiple matches.

 There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.
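As a rough illustration of the `unmatched` argument described in this hunk, the sketch below compares the default with `unmatched = "error"` on made-up data. It assumes dplyr 1.1.0 or later; the defaults of `multiple` have changed between dplyr releases, so the sketch avoids relying on them. The names `dropped` and `strict` and the message string are illustrative.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2), val_y = c("y1", "y2"))

# Default (unmatched = "drop"): the row with key 3 silently disappears.
dropped <- inner_join(x, y, join_by(key))
nrow(dropped)

# unmatched = "error" turns the same situation into an error,
# which we catch here so the script keeps running.
strict <- tryCatch(
  inner_join(x, y, join_by(key), unmatched = "error"),
  error = function(e) "unmatched rows detected"
)
strict
```

Failing loudly is usually what you want when every row is expected to find a partner, since a silent drop can go unnoticed for a long time.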
@@ -638,7 +638,7 @@ Note that `unmatched = "error"` is not useful with `left_join()` because, as des
 ### Allow multiple rows

 Sometimes it's useful to deliberately expand the number of rows in the output.
-This can come about naturally if "flip" the direction of the question you're asking.
+This can come about naturally if you "flip" the direction of the question you're asking.
 For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:

 ```{r}
@@ -655,7 +655,7 @@ plane_flights <- planes |>
 left_join(flights2, by = "tailnum")
 ```

-Since this duplicate rows in `x` (the planes), we need to explicitly say we're ok with the multiple matches by setting `multiple = "all"`:
+Since this duplicates rows in `x` (the planes), we need to explicitly say we're ok with the multiple matches by setting `multiple = "all"`:

 ```{r}
 plane_flights <- planes |>
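The code chunk at the end of this hunk is truncated by the diff context. A sketch of what a complete call could look like follows, assuming dplyr 1.1.0 or later and nycflights13; as before, the definition of `flights2` is an assumption because it is not shown in this diff.

```r
library(dplyr)
library(nycflights13)

# Assumed definition: flights2 is not shown in this diff.
flights2 <- flights |>
  select(year, time_hour, origin, dest, tailnum, carrier)

# One plane flies many flights, so each row of planes is repeated once
# per matching flight; multiple = "all" says that this is intended.
plane_flights <- planes |>
  left_join(flights2, join_by(tailnum), multiple = "all")

# Compare the row counts before and after the join
c(planes = nrow(planes), plane_flights = nrow(plane_flights))
```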
@@ -670,7 +670,7 @@ plane_flights
 The number of matches also determines the behavior of the filtering joins.
 The semi-join keeps rows in `x` that have one or more matches in `y`, as in @fig-join-semi.
 The anti-join keeps rows in `x` that don't have a match in `y`, as in @fig-join-anti.
-In both cases, only the existence of a match is important; it doesn't matter how many times its match.
+In both cases, only the existence of a match is important; it doesn't matter how many times it matches.
 This means that filtering joins never duplicate rows like mutating joins do.

 ```{r}
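To make the "existence, not count" point in this hunk concrete, here is a small sketch on made-up data, assuming dplyr 1.1.0 or later. The names `kept` and `unmatched_rows` are illustrative.

```r
library(dplyr)

# The row with key 2 in x has two matches in y; key 3 has none.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# semi_join() keeps x rows with at least one match, each exactly once.
kept <- semi_join(x, y, join_by(key))
kept

# anti_join() keeps x rows with no match at all.
unmatched_rows <- anti_join(x, y, join_by(key))
unmatched_rows
```

Together the two results partition `x`: every row lands in exactly one of them, and neither gains columns from `y`.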
@@ -709,7 +709,7 @@ knitr::include_graphics("diagrams/join/anti.png", dpi = 270)

 ## Non-equi joins

-So far you've only seen **equi-joins**, joins where the two rows match if the `x` keys equal the `y` keys.
+So far you've only seen **equi-joins**, joins where the two rows match if the `x` keys are exactly equal to the `y` keys.
 Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.

 But before we can do that, we need to revisit a simplification we made above.
@@ -736,7 +736,7 @@ knitr::include_graphics("diagrams/join/inner-both.png", dpi = 270)
 ```

 When we move away from equi-joins we'll always show the keys, because the key values will often different.
-For example, instead matching when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal the `y$key`, leading to @fig-join-gte.
+For example, instead of matching only when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal to the `y$key`, leading to @fig-join-gte.
 dplyr's join functions understand this distinction so will always show both keys when you perform a non-equi-join.

 ```{r}
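The greater-than-or-equal match described in this hunk can be sketched with `join_by()` on made-up data, assuming dplyr 1.1.0 or later. The `x$` and `y$` prefixes inside `join_by()` are used only to make it explicit which table each key comes from; `gte` is an illustrative name.

```r
library(dplyr)

x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# A pair of rows matches whenever the x key is >= the y key.
gte <- inner_join(x, y, join_by(x$key >= y$key))
gte

# Both keys are kept, because they are no longer guaranteed to be equal.
names(gte)
```

Here the key 1 in `x` matches one row of `y`, while keys 2 and 3 each match two, so a single `key` column could not describe the output.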
@@ -882,7 +882,7 @@ parties
 ```

 Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don't overlap.
-You can perform an self-join and check to see if any start-end interval overlaps with any other:
+You can perform a self-join and check to see if any start-end interval overlaps with any other:

 ```{r}
 parties |>
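The self-join in this hunk is truncated, and the `parties` data is not visible in the diff, so the sketch below invents a small table with `q`, `start`, and `end` columns purely for illustration. It assumes dplyr 1.1.0 or later, which provides the `overlaps()` helper inside `join_by()`; the extra `q < q` condition is one way to avoid matching a row with itself and to report each pair only once.

```r
library(dplyr)

# Hypothetical data: party 1 ends after party 2 has already started.
parties <- tibble(
  q = 1:3,
  start = as.Date(c("2022-01-01", "2022-04-01", "2022-08-01")),
  end = as.Date(c("2022-04-03", "2022-06-30", "2022-09-30"))
)

# Self-join: pair each interval with every later one it overlaps.
overlapping <- parties |>
  inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |>
  select(start.x, end.x, start.y, end.y)

overlapping
```

An empty result would mean the intervals are clean; any row that comes back names a pair of periods that needs fixing.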
@@ -911,7 +911,7 @@ employees |>

 ### Exercises

-1. Can you explain what's happening the keys in this equi-join?
+1. Can you explain what's happening with the keys in this equi-join?
 Why are they different?

 ```{r}
@@ -927,11 +927,11 @@ employees |>
 ## Summary

 In this chapter, you've learned how to use mutating and filtering joins to combine data from a pair of data frames.
-Along the way you learned how to identify keys, and the between primary and foreign keys.
+Along the way you learned how to identify keys, and the difference between primary and foreign keys.
 You also understand how joins work and how to figure out how many rows the output will have.
 Finally, you've gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.

 This chapter concludes the "Transform" part of the book where the focus was on the tools you could use with individual columns and tibbles.
 You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working strings, lubridate functions for working with date-times, and forcats functions for working with factors.

-In the next part of the book, you'll learn more getting various types of data into R in a tidy form.
+In the next part of the book, you'll learn more about getting various types of data into R in a tidy form.