Updates for new relationship argument (#1331)

2023-03-01 14:01:24 -06:00 · 2023-03-01 14:01:24 -06:00 · 1eed88433c
parent 8b8b31a4b9
commit 1eed88433c
2 changed files with 18 additions and 68 deletions
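
Note: a minimal sketch of the behaviour the revised `joins.qmd` text describes, assuming dplyr 1.1.1 or later is installed (the `Remotes: tidyverse/dplyr` line suggests the development version was needed when this commit was made). The tables are the `df1` and `df2` from the diff; the name `out` is only illustrative.

```r
library(dplyr)

# Same tables as in the updated chapter text: key 2 appears twice in each.
df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

# Declaring the relationship tells dplyr the many-to-many match is deliberate,
# so no warning is emitted.
out <- df1 |>
  inner_join(df2, join_by(key), relationship = "many-to-many")

# key 1 contributes 1 x 1 = 1 row, key 2 contributes 2 x 2 = 4 rows.
nrow(out)
#> [1] 5
```

Without the `relationship` argument the same call returns the same five rows but emits the many-to-many warning.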
--- a/1
+++ b/1
@@ -36,5 +36,6 @@ Suggests:
    jpeg,
    knitr,
    sessioninfo
+Remotes: tidyverse/dplyr
 Encoding: UTF-8
 License: CC NC ND 3.0
--- a/joins.qmd
+++ b/joins.qmd
@@ -412,8 +412,7 @@ flights2 |>
 ## How do joins work?

 Now that you've used joins a few times it's time to learn more about how they work, focusing on how each row in `x` matches rows in `y`.
-We'll begin by using @fig-join-setup to introduce a visual representation of the two simple tibbles defined below.
-In these examples we'll use a single key called `key` and a single value column (`val_x` and `val_y`), but the ideas all generalize to multiple keys and multiple values.
+We'll begin by introducing a visual representation of joins, using the simple tibbles defined below and shown in @fig-join-setup. In these examples we'll use a single key called `key` and a single value column (`val_x` and `val_y`), but the ideas all generalize to multiple keys and multiple values.

 ```{r}
 x <- tribble(
@@ -446,7 +445,8 @@ y <- tribble(
 knitr::include_graphics("diagrams/join/setup.png", dpi = 270)
 ```

-@fig-join-setup2 shows all potential matches between `x` and `y` as the intersection between lines drawn from each row of `x` and each row of `y`.
+@fig-join-setup2 introduces the foundation for our visual representation.
+It shows all potential matches between `x` and `y` as the intersection between lines drawn from each row of `x` and each row of `y`.
 The rows and columns in the output are primarily determined by `x`, so the `x` table is horizontal and lines up with the output.

 ```{r}
@@ -465,8 +465,9 @@ The rows and columns in the output are primarily determined by `x`, so the `x` t
 knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)
 ```

-In an actual join, matches will be indicated with dots, as in @fig-join-inner.
-The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
+To describe a specific type of join, we indicate matches with dots.
+The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values.
+For example, @fig-join-inner shows an inner join, where rows are retained if and only if the keys are equal.

 ```{r}
 #| label: fig-join-inner
@@ -484,7 +485,7 @@ The number of dots equals the number of matches, which in turn equals the number
 knitr::include_graphics("diagrams/join/inner.png", dpi = 270)
 ```

-An **outer join** keeps observations that appear in at least one of the data frames.
+We can apply the same principles to explain the **outer joins**, which keep observations that appear in at least one of the data frames.
 These joins work by adding an additional "virtual" observation to each data frame.
 This observation has a key that matches if no other key matches, and values filled with `NA`.
 There are three types of outer joins:
@@ -606,78 +607,26 @@ There are three possible outcomes for a row in `x`:
 -   If it matches 1 row in `y`, it's preserved.
 -   If it matches more than 1 row in `y`, it's duplicated once for each match.

-In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in the `x`:
-
-   There might be fewer rows if some rows in `x` don't match any rows in `y`.
-   There might be more rows if some rows in `x` match multiple rows in `y`.
-   There might be the same number of rows if every row in `x` matches one row in `y`.
-   There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in `y`!!
-
-Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it.
-To avoid this problem, dplyr will warn whenever there are multiple matches:
+In principle, this means that there's no guaranteed correspondence between the rows in the output and the rows in `x`, but in practice, this rarely causes problems.
+There is, however, one particularly dangerous case which can cause a combinatorial explosion of rows.
+Imagine joining the following two tables:

 ```{r}
-df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
+df1 <- tibble(key = c(1, 2, 2), val_x = c("x1", "x2", "x3"))
 df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
+```

+While the first row in `df1` only matches one row in `df2`, the second and third rows both match two rows.
+This is sometimes called a `many-to-many` join, and will cause dplyr to emit a warning:
+
+```{r}
 df1 |> 
  inner_join(df2, join_by(key))
 ```

-You can gain further control over row matching with two arguments:
+If you are doing this deliberately, you can set `relationship = "many-to-many"`, as the warning suggests.

-   `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
-   `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if any rows have multiple matches.

-There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.
-
-### One-to-one mapping
-
-Both `unmatched` and `multiple` can take value `"error"` which means that the join will fail unless each row in `x` matches exactly one row in `y`:
-
-```{r}
-#| error: true
-df1 <- tibble(x = 1)
-df2 <- tibble(x = c(1, 1))
-df3 <- tibble(x = 3)
-
-df1 |> 
-  inner_join(df2, join_by(x), unmatched = "error", multiple = "error")
-df1 |> 
-  inner_join(df3, join_by(x), unmatched = "error", multiple = "error")
-```
-
-Note that `unmatched = "error"` is not useful with `left_join()` because, as described above, every row in `x` has a fallback match to a virtual row in `y`.
-
-### Allow multiple rows
-
-Sometimes it's useful to deliberately expand the number of rows in the output.
-This can come about naturally if you "flip" the direction of the question you're asking.
-For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:
-
-```{r}
-#| results: false
-flights2 |> 
-  left_join(planes, by = "tailnum")
-```
-
-But it's also reasonable to ask what flights did each plane fly:
-
-```{r}
-plane_flights <- planes |> 
-  select(tailnum, type, engines, seats) |> 
-  left_join(flights2, by = "tailnum")
-```
-
-Since this duplicates rows in `x` (the planes), we need to explicitly say that we're ok with the multiple matches by setting `multiple = "all"`:
-
-```{r}
-plane_flights <- planes |> 
-  select(tailnum, type, engines, seats) |> 
-  left_join(flights2, by = "tailnum", multiple = "all")
-
-plane_flights
-```

 ### Filtering joins {#sec-non-equi-joins}