Edits to joins chapter (#1086)
* Add missing word
* Delete a word
* Add missing word
* Don't say "value of a primary key"; use more parallel language
* Typo
* How about "Now"?
* Comma, wording, grammar
* Plural
* 'Special' used in same sense, unquoted, in previous exercise
* Add word, remove 's'
* Add words
* Subject-verb
* Don't use 'key' in a non-join-y way
* Copy edits to match details
* Wording
* Add words
committed by GitHub
parent 4ac50eb359
commit 0c9acc7074

joins.qmd | 66
@@ -17,9 +17,9 @@ This chapter will introduce you to two important types of joins:
 -   Filtering joins, filter observations from one data frame based on whether or not they match an observation in another.
 
 We'll begin by discussing keys, the variables used to connect a pair of data frames in a join.
-You'll then see how to use joins to a variety of challenges from the nycflights13 dataset.
+You'll then see how to use joins to tackle a variety of challenges from the nycflights13 dataset.
 Next we'll discuss how joins work, focusing on their action on the rows.
-We'll finish up by with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
+We'll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.
 
 If you're familiar with SQL, you should find the ideas in this chapter familiar, as their realization in dplyr is very similar.
 
@@ -46,7 +46,7 @@ You'll also learn how to check that your keys are valid, and what to do if your
 
 Every join involves a pair of keys: a primary key and a foreign key.
 A **primary key** is a variable (or group of variables) that uniquely identifies an observation.
-A **foreign key** is the value of a primary key in another table so can be used to lookup the corresponding observation.
+A **foreign key** is the corresponding variable (or groups of variables) in another table.
 Let's make those terms concrete by looking at four of the data frames in nycfights13:
 
 -   `airlines` lets you look up the full carrier name from its abbreviated code.
@@ -57,7 +57,7 @@ Let's make those terms concrete by looking at four of the data frames in nycfigh
     ```
 
 -   `airports` gives information about each airport.
-    Its primary key is the three `faa` airport code.
+    Its primary key is the three letter `faa` airport code.
 
     ```{r}
     airports
@@ -80,7 +80,7 @@ Let's make those terms concrete by looking at four of the data frames in nycfigh
 These datasets are all connected via the `flights` data frame because the `tailnum`, `carrier`, `origin`, `dest`, and `time_hour` variables are all foreign keys:
 
 -   `flights$tailnum` connects to primary key `planes$tailnum`.
--   `flights$carrier` connects to primary key `airlines$carrer`.
+-   `flights$carrier` connects to primary key `airlines$carrier`.
 -   `flights$origin` connects to primary key `airports$faa`.
 -   `flights$dest` connects to primary key `airports$faa` .
 -   `flights$origin`-`flights$time_hour` connects to primary key `weather$origin`-`weather$time_hour`.
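
Reviewer sketch (not part of the commit): the foreign-key links listed in this hunk can be tried out on a small scale. The tables `flights_toy` and `airports_toy` below are made-up stand-ins, not the nycflights13 data, and the code assumes dplyr 1.1.0 or later so that `join_by()` is available.

```r
library(dplyr)

# Hypothetical miniature versions of `flights` and `airports`.
flights_toy <- tibble(
  flight = c(1, 2, 3),
  origin = c("JFK", "LGA", "EWR"),
  dest   = c("LAX", "ORD", "LAX")
)
airports_toy <- tibble(
  faa  = c("JFK", "LGA", "EWR", "LAX", "ORD"),
  name = c("John F Kennedy Intl", "La Guardia", "Newark Liberty Intl",
           "Los Angeles Intl", "Chicago O'Hare Intl")
)

# The foreign key `origin` points at the primary key `faa`, so the key
# columns have different names in the two tables.
with_origin <- flights_toy |>
  left_join(airports_toy, join_by(origin == faa))

# Any foreign-key value without a matching primary key would be listed here.
orphans <- flights_toy |>
  anti_join(airports_toy, join_by(dest == faa))
```

In this toy data every `dest` has a matching `faa`, so `orphans` is empty; with real data a non-empty result would point at foreign-key values that have no corresponding primary key.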
@@ -115,7 +115,7 @@ knitr::include_graphics("diagrams/relational.png", dpi = 270)
 
 ### Checking primary keys
 
-That that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation.
+Now that that we've identified the primary keys in each table, it's good practice to verify that they do indeed uniquely identify each observation.
 One way to do that is to `count()` the primary keys and look for entries where `n` is greater than one.
 This reveals that `planes` and `weather` both look good:
 
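
Reviewer sketch (not part of the commit): the `count()` check described in this hunk, run on made-up tables. `planes_toy` and `weather_toy` are hypothetical; the chapter itself runs the check on the real `planes` and `weather` data frames.

```r
library(dplyr)

# Hypothetical table whose intended primary key is `tailnum`.
planes_toy <- tibble(
  tailnum = c("N101", "N102", "N103"),
  seats   = c(100, 150, 200)
)

# A valid primary key: no key value occurs more than once,
# so filtering for n > 1 leaves zero rows.
planes_dupes <- planes_toy |>
  count(tailnum) |>
  filter(n > 1)

# Hypothetical table where `origin` alone does not identify a row ...
weather_toy <- tibble(
  origin = c("JFK", "JFK", "LGA"),
  hour   = c(1, 2, 1),
  temp   = c(39, 38, 40)
)
origin_dupes <- weather_toy |>
  count(origin) |>
  filter(n > 1)

# ... but `origin` and `hour` together do (a compound key).
compound_dupes <- weather_toy |>
  count(origin, hour) |>
  filter(n > 1)
```

An empty result means the candidate key is unique in that table; any rows that remain show which key values are duplicated.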
@@ -144,7 +144,7 @@ weather |>
 So far we haven't talked about the primary key for `flights`.
 It's not super important here, because there are no data frames that use it as a foreign key, but it's still useful to consider because it's easier to work with observations if have some way to describe them to others.
 
-After a little thinking and experimentation we discovered that there are three variables that together uniquely identifies each flight:
+After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:
 
 ```{r}
 flights |> 
@@ -180,13 +180,13 @@ Surrogate keys can be particular useful when communicating to other humans: it's
 1.  We forgot to draw the relationship between `weather` and `airports` in @fig-flights-relationships.
     What is the relationship and how should it appear in the diagram?
 
-2.  `weather` only contains information for the three origin airport in NYC.
+2.  `weather` only contains information for the three origin airports in NYC.
     If it contained weather records for all airports in the USA, what additional connection would it make to `flights`?
 
 3.  The `year`, `month`, `day`, `hour`, and `origin` variables almost form a compound key for `weather`, but there's one hour that has duplicate observations.
     Can you figure out what's special about that hour?
 
-4.  We know that some days of the year are "special" and fewer people than usual fly on them.
+4.  We know that some days of the year are special and fewer people than usual fly on them.
     How might you represent that data as a data frame?
     What would be the primary key?
     How would it connect to the existing data frames?
@@ -199,10 +199,10 @@ Surrogate keys can be particular useful when communicating to other humans: it's
 
 Now that you understand how data frames are connected via keys, we can start using joins to better understand the `flights` dataset.
 dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `semi_join()`, and `anti_join()`.
-They all the same interface: they take a pair of data frames `x` and `y` and return a data frame.
+They all have the same interface: they take a pair of data frames `x` and `y` and return a data frame.
 The order of the rows and columns in the output is primarily determined by `x`.
 
-In this section, you'll learn how to use one mutating joins, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
+In this section, you'll learn how to use one mutating join, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
 In the next section, you'll learn exactly how these functions work, and about the remaining `inner_join()`, `right_join()` and `full_join()`.
 
 ### Mutating joins
@@ -267,7 +267,7 @@ flights2 |>
   left_join(planes)
 ```
 
-We get a lot of missing matches our join is trying to use both `tailnum` and `year`.
+We get a lot of missing matches because our join is trying to use both `tailnum` and `year`.
 Both `flights` and `planes` have a `year` column but they mean different things: `flights$year` is year the flight occurred and `planes$year` is the year the plane was built.
 We only want to join on `tailnum` so we need to provide an explicit specification with `join_by()`:
 
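
Reviewer sketch (not part of the commit): the situation this hunk describes, reproduced on made-up data. `flights_toy` and `planes_toy` are hypothetical tables that share both a `tailnum` and a `year` column; the code assumes dplyr 1.1.0 or later for `join_by()`.

```r
library(dplyr)

# Hypothetical tables: `year` means "year of flight" in one
# and "year built" in the other.
flights_toy <- tibble(
  year    = c(2013, 2013),
  tailnum = c("N101", "N102")
)
planes_toy <- tibble(
  tailnum = c("N101", "N102"),
  year    = c(2004, 1998),
  type    = c("Fixed wing multi engine", "Fixed wing single engine")
)

# Natural join: with no `by`, dplyr joins on every shared column
# (`year` and `tailnum`), so nothing matches and `type` is all NA.
natural <- flights_toy |>
  left_join(planes_toy)

# Explicit join on `tailnum` only: both `year` columns survive,
# disambiguated with the default `.x` / `.y` suffixes.
explicit <- flights_toy |>
  left_join(planes_toy, join_by(tailnum))
```

The natural join also prints a message naming the columns it chose, which is a useful cue that more keys are in play than intended.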
@@ -295,14 +295,14 @@ In older code you might see a different way of specifying the join keys, using a
 -   `by = "x"` corresponds to `join_by(x)`.
 -   `by = c("a" = "x")` corresponds to `join_by(a == x)`.
 
-Now that it exists, we prefer `join_by()` since provides a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.
+Now that it exists, we prefer `join_by()` since it provides a more flexible specification that supports more types of join, as you'll learn in @sec-non-equi-joins.
 
 ### Filtering joins
 
 As you might guess the primary action of a **filtering join** is to filter the rows.
 There are two types: semi-joins and anti-joins.
 **Semi-joins** keep all rows in `x` that have a match in `y`.
-For example, we could use to filter the `airports` dataset to show just the origin airports:
+For example, we could use a semi-join to filter the `airports` dataset to show just the origin airports:
 
 ```{r}
 airports |> 
@@ -423,8 +423,8 @@ y <- tribble(
 #| out-width: ~
 #| fig-cap: >
 #|   Graphical representation of two simple tables. The coloured `key`
-#|   columns map background colour to key value. The grey columns represents
-#|   the "value" columns that is carried along for the ride. 
+#|   columns map background colour to key value. The grey columns represent
+#|   the "value" columns that are carried along for the ride. 
 #| fig-alt: >
 #|   x and y are two data frames with 2 columns and 3 rows each. The first
 #|   column in each is the key and the second is the value. The contents of
@@ -518,7 +518,7 @@ There are three types of outer joins:
     ```
 
 -   A **full join** keeps all observations that appear in `x` or `y`, @fig-join-full.
-    Every row of `x` and `y` `is` included in the output because both `x` and `y` have a fall back row of `NA`s.
+    Every row of `x` and `y` is included in the output because both `x` and `y` have a fall back row of `NA`s.
     Note the output will consist of all `x` rows followed by the remaining `y` rows.
 
     ```{r}
@@ -571,7 +571,7 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
 #| echo: false
 #| out-width: ~
 #| fig-cap: > 
-#|   The three key ways a row in `x` can match. `x1` matches
+#|   The three ways a row in `x` can match. `x1` matches
 #|   one row in `y`, `x2` matches two rows in `y`, `x3` matches
 #|   zero rows in y. Note that while there are three rows in
 #|   `x` and three rows in the output, there isn't a direct
@@ -584,20 +584,20 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
 knitr::include_graphics("diagrams/join/match-types.png", dpi = 270)
 ```
 
-There are three possible outcomes for a row:
+There are three possible outcomes for a row in `x`:
 
 -   If it doesn't match anything, it's dropped.
--   If it matches 1 row, it's kept as is.
--   If it matches more than 1 row, it's duplicated once for each match.
+-   If it matches 1 row in `y`, it's kept as is.
+-   If it matches more than 1 row in `y`, it's duplicated once for each match.
 
-In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`:
+In principle, this means that there are no guarantees about the number of rows in the output of an `inner_join()`, compared to the number of rows in `x`.
 
 -   There might be fewer rows if some rows in `x` don't match any rows in `y`.
 -   There might be more rows if some rows in `x` match multiple rows in `y`.
 -   There might be the same number of rows if every row in `x` matches one row in `y`.
 -   There might be the same number of rows if some rows don't match any rows, and exactly the same number of rows match two rows in `y`!!
 
-Row expansion is a fundamental property of joins, but it's dangerous because it might by hidden.
+Row expansion is a fundamental property of joins, but it's dangerous because it might happen without you realizing it.
 To avoid this problem, dplyr will warn whenever there are multiple matches:
 
 ```{r}
@@ -612,7 +612,7 @@ This is another reason we recommend `left_join()` --- if it runs without warning
 
 You can gain further control over row matching with two arguments:
 
--   `unmatched` controls what happens when in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
+-   `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
 -   `multiple` controls what happens when a row in `x` matches more than one row in `y`. For equi-joins, it defaults to `"warn"` which emits a warning message if any rows have multiple matches.
 
 There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.
@@ -638,7 +638,7 @@ Note that `unmatched = "error"` is not useful with `left_join()` because, as des
 ### Allow multiple rows
 
 Sometimes it's useful to deliberately expand the number of rows in the output.
-This can come about naturally if "flip" the direction of the question you're asking.
+This can come about naturally if you "flip" the direction of the question you're asking.
 For example, as we've seen above, it's natural to supplement the `flights` data with information about the plane that flew each flight:
 
 ```{r}
@@ -655,7 +655,7 @@ plane_flights <- planes |>
   left_join(flights2, by = "tailnum")
 ```
 
-Since this duplicate rows in `x` (the planes), we need to explicitly say we're ok with the multiple matches by setting `multiple = "all"`:
+Since this duplicates rows in `x` (the planes), we need to explicitly say we're ok with the multiple matches by setting `multiple = "all"`:
 
 ```{r}
 plane_flights <- planes |> 
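
Reviewer sketch (not part of the commit): the "flipped" join from this hunk on made-up data, to make the row expansion visible. `planes_toy` and `flights_toy` are hypothetical, and the code assumes a dplyr version (1.1.0 or later) in which `left_join()` accepts `multiple = "all"` and `join_by()`.

```r
library(dplyr)

# Hypothetical tables: one row per plane, one row per flight.
planes_toy <- tibble(
  tailnum = c("N101", "N102", "N103"),
  seats   = c(100, 150, 200)
)
flights_toy <- tibble(
  flight  = c(1, 2, 3),
  tailnum = c("N101", "N101", "N102")
)

# Starting from planes, each plane row is repeated once per matching
# flight; a plane with no flights is kept once with NA in `flight`.
plane_flights_toy <- planes_toy |>
  left_join(flights_toy, join_by(tailnum), multiple = "all")
```

Here three planes become four output rows: `N101` appears twice, `N102` once, and `N103` once with a missing `flight`. Passing `multiple` explicitly also documents the intent, whatever the default is in the installed dplyr version.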
@@ -670,7 +670,7 @@ plane_flights
 The number of matches also determines the behavior of the filtering joins.
 The semi-join keeps rows in `x` that have one or more matches in `y`, as in @fig-join-semi.
 The anti-join keeps rows in `x` that don't have a match in `y`, as in @fig-join-anti.
-In both cases, only the existence of a match is important; it doesn't matter how many times its match.
+In both cases, only the existence of a match is important; it doesn't matter how many times it matches.
 This means that filtering joins never duplicate rows like mutating joins do.
 
 ```{r}
@@ -709,7 +709,7 @@ knitr::include_graphics("diagrams/join/anti.png", dpi = 270)
 
 ## Non-equi joins
 
-So far you've only seen **equi-joins**, joins where the two rows match if the `x` keys equal the `y` keys.
+So far you've only seen **equi-joins**, joins where the two rows match if the `x` keys are exactly equal to the `y` keys.
 Now we're going to relax that restriction and discuss other ways of determining if a pair of rows match.
 
 But before we can do that, we need to revisit a simplification we made above.
@@ -736,7 +736,7 @@ knitr::include_graphics("diagrams/join/inner-both.png", dpi = 270)
 ```
 
 When we move away from equi-joins we'll always show the keys, because the key values will often different.
-For example, instead matching when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal the `y$key`, leading to @fig-join-gte.
+For example, instead of matching only when the `x$key` and `y$key` are equal, we could match whenever the `x$key` is greater than or equal to the `y$key`, leading to @fig-join-gte.
 dplyr's join functions understand this distinction so will always show both keys when you perform a non-equi-join.
 
 ```{r}
@@ -882,7 +882,7 @@ parties
 ```
 
 Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don't overlap.
-You can perform an self-join and check to see if any start-end interval overlaps with any other:
+You can perform a self-join and check to see if any start-end interval overlaps with any other:
 
 ```{r}
 parties |> 
@@ -911,7 +911,7 @@ employees |>
 
 ### Exercises
 
-1.  Can you explain what's happening the keys in this equi-join?
+1.  Can you explain what's happening with the keys in this equi-join?
     Why are they different?
 
     ```{r}
@@ -927,11 +927,11 @@ employees |>
 ## Summary
 
 In this chapter, you've learned how to use mutating and filtering joins to combine data from a pair of data frames.
-Along the way you learned how to identify keys, and the between primary and foreign keys.
+Along the way you learned how to identify keys, and the difference between primary and foreign keys.
 You also understand how joins work and how to figure out how many rows the output will have.
 Finally, you've gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.
 
 This chapter concludes the "Transform" part of the book where the focus was on the tools you could use with individual columns and tibbles.
 You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working strings, lubridate functions for working with date-times, and forcats functions for working with factors.
 
-In the next part of the book, you'll learn more getting various types of data into R in a tidy form.
+In the next part of the book, you'll learn more about getting various types of data into R in a tidy form.