Updates for new relationship argument (#1331)
parent 8b8b31a4b9
commit 1eed88433c
@@ -36,5 +36,6 @@ Suggests:
    jpeg,
    knitr,
    sessioninfo
Remotes: tidyverse/dplyr
Encoding: UTF-8
License: CC NC ND 3.0
85
joins.qmd
@@ -412,8 +412,7 @@ flights2 |>
## How do joins work?

Now that you've used joins a few times it's time to learn more about how they work, focusing on how each row in `x` matches rows in `y`.
We'll begin by using @fig-join-setup to introduce a visual representation of the two simple tibbles defined below.
In these examples we'll use a single key called `key` and a single value column (`val_x` and `val_y`), but the ideas all generalize to multiple keys and multiple values.
We'll begin by introducing a visual representation of joins, using the simple tibbles defined below and shown in @fig-join-setup. In these examples we'll use a single key called `key` and a single value column (`val_x` and `val_y`), but the ideas all generalize to multiple keys and multiple values.

```{r}
x <- tribble(
@@ -446,7 +445,8 @@ y <- tribble(
knitr::include_graphics("diagrams/join/setup.png", dpi = 270)
```

@fig-join-setup2 shows all potential matches between `x` and `y` as the intersection between lines drawn from each row of `x` and each row of `y`.
@fig-join-setup2 introduces the foundation for our visual representation.
It shows all potential matches between `x` and `y` as the intersection between lines drawn from each row of `x` and each row of `y`.
The rows and columns in the output are primarily determined by `x`, so the `x` table is horizontal and lines up with the output.

```{r}
@@ -465,8 +465,9 @@ The rows and columns in the output are primarily determined by `x`, so the `x` t
knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)
```

In an actual join, matches will be indicated with dots, as in @fig-join-inner.
The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
To describe a specific type of join, we indicate matches with dots.
The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values.
For example, @fig-join-inner shows an inner join, where rows are retained if and only if the keys are equal.
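
As an illustration of the inner-join rule (a sketch, not part of this commit), the chunk below joins two small tibbles and keeps only rows whose keys are equal. The contents of `x` and `y` are assumed here, because their `tribble()` definitions are elided in the diff context; only the column names `key`, `val_x`, and `val_y` come from the text.

```{r}
library(dplyr)

# Hypothetical stand-ins for the tibbles elided from the diff context.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# A row is retained if and only if its key appears in both tables.
matched <- x |>
  inner_join(y, join_by(key))

matched
```

With these assumed values, keys 1 and 2 appear in both tables, so the output has two rows; key 3 (only in `x`) and key 4 (only in `y`) are absent.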

```{r}
#| label: fig-join-inner
@@ -484,7 +485,7 @@ The number of dots equals the number of matches, which in turn equals the number
knitr::include_graphics("diagrams/join/inner.png", dpi = 270)
```

An **outer join** keeps observations that appear in at least one of the data frames.
We can apply the same principles to explain the **outer joins**, which keep observations that appear in at least one of the data frames.
These joins work by adding an additional "virtual" observation to each data frame.
This observation has a key that matches if no other key matches, and values filled with `NA`.
There are three types of outer joins:
@@ -606,78 +607,26 @@ There are three possible outcomes for a row in `x`:
- If it matches 1 row in `y`, it's preserved.
- If it matches more than 1 row in `y`, it's duplicated once for each match.
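
The three outcomes for a row in `x` can be seen in one small inner join. This is a sketch for illustration only (not part of this commit); `x_demo` and `y_demo` are hypothetical tibbles chosen so that one key has no match, one has exactly one match, and one has two.

```{r}
library(dplyr)

# Hypothetical data: key 1 has no match in y_demo, key 2 has one, key 3 has two.
x_demo <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y_demo <- tibble(key = c(2, 3, 3), val_y = c("y1", "y2", "y3"))

out <- x_demo |>
  inner_join(y_demo, join_by(key))

# Count how many output rows each key of x_demo produced.
out |>
  count(key)
```

Each row of `y_demo` matches at most one row of `x_demo`, so this is not a many-to-many join even though the row for key 3 is duplicated.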

In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in the `x`:

- There might be fewer rows if some rows in `x` don't match any rows in `y`.
- There might be more rows if some rows in `x` match multiple rows in `y`.
- There might be the same number of rows if every row in `x` matches one row in `y`.
- There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in `y`!!

Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it.
To avoid this problem, dplyr will warn whenever there are multiple matches:
In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in `x`, but in practice, this rarely causes problems.
There is, however, one particularly dangerous case which can cause a combinatorial explosion of rows.
Imagine joining the following two tables:

```{r}
df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
```

While the first row in `df1` only matches one row in `df2`, the second and third rows both match two rows.
This is sometimes called a `many-to-many` join, and will cause dplyr to emit a warning:

```{r}
df1 |>
  inner_join(df2, join_by(key))
```

You can gain further control over row matching with two arguments:
If you are doing this deliberately, you can set `relationship = "many-to-many"`, as the warning suggests.
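
A minimal sketch of that opt-in (for illustration, not part of this commit), assuming an installed dplyr version that includes the `relationship` argument; the commit's `Remotes: tidyverse/dplyr` line suggests the development version was needed at the time. The tables repeat the revised `df1` and `df2` from the diff.

```{r}
library(dplyr)

df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# Declaring the relationship states that the row expansion is intended.
# Assumes dplyr supports `relationship`; older versions will reject it.
expanded <- df1 |>
  inner_join(df2, join_by(key), relationship = "many-to-many")

# Key 1 contributes 1 x 1 rows and key 2 contributes 2 x 2 rows.
nrow(expanded)
```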

- `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
- `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if any rows have multiple matches.

There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.

### One-to-one mapping

Both `unmatched` and `multiple` can take value `"error"` which means that the join will fail unless each row in `x` matches exactly one row in `y`:

```{r}
#| error: true
df1 <- tibble(x = 1)
df2 <- tibble(x = c(1, 1))
df3 <- tibble(x = 3)

df1 |>
  inner_join(df2, join_by(x), unmatched = "error", multiple = "error")
df1 |>
  inner_join(df3, join_by(x), unmatched = "error", multiple = "error")
```

Note that `unmatched = "error"` is not useful with `left_join()` because, as described above, every row in `x` has a fallback match to a virtual row in `y`.

### Allow multiple rows

Sometimes it's useful to deliberately expand the number of rows in the output.
This can come about naturally if you "flip" the direction of the question you're asking.
For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:

```{r}
#| results: false
flights2 |>
  left_join(planes, by = "tailnum")
```

But it's also reasonable to ask what flights did each plane fly:

```{r}
plane_flights <- planes |>
  select(tailnum, type, engines, seats) |>
  left_join(flights2, by = "tailnum")
```

Since this duplicates rows in `x` (the planes), we need to explicitly say that we're ok with the multiple matches by setting `multiple = "all"`:

```{r}
plane_flights <- planes |>
  select(tailnum, type, engines, seats) |>
  left_join(flights2, by = "tailnum", multiple = "all")

plane_flights
```

### Filtering joins {#sec-non-equi-joins}