equi join and non-equi join

mine-cetinkaya-rundel 2023-05-25 21:08:14 -04:00
parent daaf3ef52e
commit 386c9156b0
1 changed files with 10 additions and 10 deletions

@ -19,7 +19,7 @@ This chapter will introduce you to two important types of joins:
We'll begin by discussing keys, the variables used to connect a pair of data frames in a join.
We cement the theory with an examination of the keys in the datasets from the nycflights13 package, then use that knowledge to start joining data frames together.
Next we'll discuss how joins work, focusing on their action on the rows.
We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
We'll finish up with a discussion of non-equi joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
### Prerequisites
@ -283,8 +283,8 @@ You can override the default suffixes with the `suffix` argument.
`join_by(tailnum)` is short for `join_by(tailnum == tailnum)`.
It's important to know about this fuller form for two reasons.
Firstly, it describes the relationship between the two tables: the keys must be equal.
That's why this type of join is often called an **equi-join**.
You'll learn about non-equi-joins in @sec-non-equi-joins.
That's why this type of join is often called an **equi join**.
You'll learn about non-equi joins in @sec-non-equi-joins.
Secondly, it's how you specify different join keys in each table.
For example, there are two ways to join the `flights2` and `airports` tables: either by `dest` or `origin`:
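As a minimal sketch of the two options, the example below uses small hypothetical stand-ins for the real tables (`flights_toy` and `airports_toy` are invented here for illustration and are not part of nycflights13). It assumes dplyr 1.1.0 or later, where `join_by()` is available.

```r
library(dplyr)

# Hypothetical toy tables standing in for the flights and airports data.
flights_toy <- tibble(
  origin = c("JFK", "LGA"),
  dest   = c("LAX", "ORD")
)
airports_toy <- tibble(
  faa  = c("JFK", "LGA", "LAX", "ORD"),
  name = c("John F Kennedy Intl", "La Guardia", "Los Angeles Intl", "Chicago O'Hare Intl")
)

# Match the destination code in x against the airport code in y.
by_dest <- flights_toy |>
  left_join(airports_toy, join_by(dest == faa))

# Match the origin code in x against the airport code in y.
by_origin <- flights_toy |>
  left_join(airports_toy, join_by(origin == faa))
```

In both results only the key from `x` is kept, so `faa` does not appear in the output; `name` describes the destination airport in `by_dest` and the origin airport in `by_origin`.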
@ -575,7 +575,7 @@ knitr::include_graphics("diagrams/join/venn.png", dpi = 270)
```
The joins shown here are the so-called **equi joins**, where rows match if the keys are equal.
Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just say "inner join" rather than "equi inner join".
Equi joins are the most common type of join, so we'll typically omit the equi prefix, and just say "inner join" rather than "equi inner join".
We'll come back to non-equi joins in @sec-non-equi-joins.
### Row matching
@ -666,11 +666,11 @@ knitr::include_graphics("diagrams/join/anti.png", dpi = 270)
## Non-equi joins {#sec-non-equi-joins}
So far you've only seen equi-joins, joins where the rows match if the `x` key equals the `y` key.
So far you've only seen equi joins, joins where the rows match if the `x` key equals the `y` key.
Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.
But before we can do that, we need to revisit a simplification we made above.
In equi-joins the `x` and `y` keys are always equal, so we only need to show one in the output.
In equi joins the `x` and `y` keys are always equal, so we only need to show one in the output.
We can request that dplyr keep both keys with `keep = TRUE`, leading to the code below and the re-drawn `inner_join()` in @fig-inner-both.
```{r}
@ -692,7 +692,7 @@ x |> left_join(y, by = "key", keep = TRUE)
knitr::include_graphics("diagrams/join/inner-both.png", dpi = 270)
```
When we move away from equi-joins we'll always show the keys, because the key values will often be different.
When we move away from equi joins we'll always show the keys, because the key values will often be different.
For example, instead of matching only when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal to the `y$key`, leading to @fig-join-gte.
dplyr's join functions understand this distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.
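A small sketch may make the difference concrete. The tibbles `x` and `y` below are hypothetical toy data chosen for illustration, and the code assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy tables sharing a `key` column.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))

# Equi join: keys are equal, so both are shown only because keep = TRUE.
equi <- x |> inner_join(y, join_by(key == key), keep = TRUE)

# Non-equi join: a row of x matches every row of y whose key is
# less than or equal to its own, and both keys are kept by default.
gte <- x |> inner_join(y, join_by(key >= key))
```

In `equi`, `key.x` and `key.y` are identical in every row. In `gte` they can differ, which is why dplyr keeps both columns without being asked.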
@ -711,7 +711,7 @@ dplyr's join functions understand this distinction equi and non-equi joins so wi
knitr::include_graphics("diagrams/join/gte.png", dpi = 270)
```
Non-equi-join isn't a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi-join:
Non-equi join isn't a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi join:
- **Cross joins** match every pair of rows.
- **Inequality joins** use `<`, `<=`, `>`, and `>=` instead of `==`.
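To illustrate the first two types, here is a minimal sketch that joins a table to itself. The tibble `df` is hypothetical toy data, and the code assumes dplyr 1.1.0 or later, which provides `cross_join()` and `join_by()`.

```r
library(dplyr)

# Hypothetical toy table.
df <- tibble(id = 1:3, name = c("a", "b", "c"))

# Cross join: every row of x is paired with every row of y.
all_pairs <- df |> cross_join(df)

# Inequality join: keep only the pairs where the x id is smaller
# than the y id, so each unordered pair appears once.
unique_pairs <- df |> inner_join(df, join_by(id < id))
```

With three input rows the cross join produces 3 × 3 = 9 rows, while the inequality join keeps the 3 pairs in which `id.x` is less than `id.y`.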
@ -883,7 +883,7 @@ employees |>
### Exercises
1. Can you explain what's happening with the keys in this equi-join?
1. Can you explain what's happening with the keys in this equi join?
Why are they different?
```{r}
@ -901,7 +901,7 @@ employees |>
In this chapter, you've learned how to use mutating and filtering joins to combine data from a pair of data frames.
Along the way you learned how to identify keys, and the difference between primary and foreign keys.
You also understand how joins work and how to figure out how many rows the output will have.
Finally, you've gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.
Finally, you've gained a glimpse into the power of non-equi joins and seen a few interesting use cases.
This chapter concludes the "Transform" part of the book where the focus was on the tools you could use with individual columns and tibbles.
You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.