Light updates to joins chapter

This commit is contained in:
Hadley Wickham 2022-06-20 10:08:33 -05:00
parent 0705aceba7
commit ca38492660
1 changed files with 12 additions and 58 deletions

@@ -1,4 +1,4 @@
# Two-table verbs {#sec-relational-data}
# Joins {#sec-relational-data}
```{r}
#| results: "asis"
@@ -14,31 +14,24 @@ Waiting on <https://github.com/tidyverse/dplyr/pull/5910>
<!-- TODO: redraw all diagrams to match O'Reilly style -->
It's rare that a data analysis involves only a single data frame.
Typically you have many data frames, and you must combine them to answer the questions that you're interested in.
Typically you have many data frames, and you must **join** them together to answer the questions that you're interested in.
All the verbs in this chapter use a pair of data frames.
Fortunately this is enough, since you can combine three data frames by combining two pairs.
Sometimes both elements of a pair will be the same data frame.
This is needed if, for example, you have a data frame of people, and each person has a reference to their parents.
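A self-join like the one described above can be sketched as follows. This is a minimal illustration, not code from the chapter: the `people` data frame and its `id`, `name`, and `parent_id` columns are hypothetical.

```{r}
library(dplyr)

# Hypothetical data: each person optionally refers to a parent by id.
people <- tribble(
  ~id, ~name, ~parent_id,
    1, "Ada",         NA,
    2, "Bea",          1,
    3, "Cal",          1,
    4, "Dee",          2
)

# A second view of the same data frame, renamed so it can act as the
# lookup table for parents.
parents <- people |>
  select(parent_id = id, parent_name = name)

# Join `people` to itself: every person keeps their row and gains the
# name of their parent (NA when no parent is recorded).
with_parents <- people |>
  left_join(parents, by = "parent_id")

with_parents
```

Both inputs to the join come from the same data frame; only the column names differ.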
There are three families of verbs designed to work with pairs of data frames:
There are two important types of joins.
**Mutating joins** add new variables to one data frame from matching observations in another.
**Filtering joins** filter observations from one data frame based on whether or not they match an observation in another.
- **Mutating joins**, which add new variables to one data frame from matching observations in another.
- **Filtering joins**, which filter observations from one data frame based on whether or not they match an observation in another.
- **Set operations**, which treat observations as if they were set elements.
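To make the difference between mutating and filtering joins concrete, here is a small sketch. The `flights_small` and `carriers` data frames are made up for illustration and are not part of nycflights13.

```{r}
library(dplyr)

# Hypothetical toy data: carrier "ZZ" has no entry in `carriers`.
flights_small <- tribble(
  ~flight, ~carrier,
        1, "AA",
        2, "UA",
        3, "ZZ"
)
carriers <- tribble(
  ~carrier, ~name,
  "AA",     "American",
  "UA",     "United"
)

# Mutating join: keeps every row of x and adds the `name` column.
mutated <- left_join(flights_small, carriers, by = "carrier")

# Filtering joins: never add columns, only keep or drop rows of x.
matched   <- semi_join(flights_small, carriers, by = "carrier")
unmatched <- anti_join(flights_small, carriers, by = "carrier")
```

Here `mutated` has the same rows as `flights_small` plus one extra column, while `matched` and `unmatched` have the same columns as `flights_small` but fewer rows.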
The most common place to find relational data is in a *relational* database management system (or RDBMS), a term that encompasses almost all modern databases.
If you've used a database before, you've almost certainly used SQL.
If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.
One other major terminology difference between databases and R is that what we generally refer to as a "data frame" in R is referred to as a "table" in databases.
Hence you'll see references to one-table and two-table verbs in dplyr documentation.
Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
If you're not familiar with databases or SQL, you'll learn more about them in [Chapter -@sec-import-databases].
If you're familiar with SQL, you should find these ideas familiar, as their instantiation in dplyr is very similar.
We'll point out any important differences as we go.
Don't worry if you're not familiar with SQL; we'll come back to it in @sec-import-databases.
### Prerequisites
We will explore relational data from `nycflights13` using the two-table verbs from dplyr.
We will explore relational data from nycflights13 using the join functions from dplyr.
```{r}
#| label: setup
@@ -50,7 +43,7 @@ library(nycflights13)
## nycflights13 {#sec-nycflights13-relational}
nycflights13 contains five tibbles: `airlines`, `airports`, `weather`, and `planes`, all of which are related to the fifth, the `flights` data frame that you used in [Chapter -@sec-data-transform] on data transformation:
nycflights13 contains five tibbles: `airlines`, `airports`, `weather`, and `planes`, all of which are related to the fifth, the `flights` data frame that you used in @sec-data-transform on data transformation:
- `airlines` lets you look up the full carrier name from its abbreviated code:
@@ -253,7 +246,7 @@ We'll then use that to explain the four mutating join functions: the inner join,
When working with real data, keys don't always uniquely identify observations, so next we'll talk about what happens when there isn't a unique match.
Finally, you'll learn how to tell dplyr which variables are the keys for a given join.
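As a brief preview of specifying keys, the sketch below uses the `by` argument of `left_join()` with a named character vector to match differently named key columns. The two small data frames are hypothetical stand-ins, loosely modelled on `flights` and `airports`.

```{r}
library(dplyr)

# Hypothetical toy data: the key is called `dest` in one data frame
# and `faa` in the other.
flights_small <- tribble(
  ~flight, ~dest,
        1, "ORD",
        2, "LAX"
)
airports_small <- tribble(
  ~faa,  ~name,
  "ORD", "Chicago O'Hare",
  "LAX", "Los Angeles"
)

# `by = c("dest" = "faa")` matches `dest` in x against `faa` in y.
# The output keeps the key column name from x.
joined <- left_join(flights_small, airports_small, by = c("dest" = "faa"))

joined
```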
### Understanding joins
## Join types
To help you learn how joins work, I'm going to use a visual representation:
@@ -727,42 +720,3 @@ Your own data is unlikely to be so nice, so there are a few things that you should
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
If you have an inner join with duplicate keys in both data frames, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
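The coincidence described above is easy to reproduce. The following sketch uses made-up data frames `x` and `y` with a `key` column. It is one possible way to check a join, not a prescribed workflow.

```{r}
library(dplyr)

# Hypothetical data with duplicated keys in both inputs.
x <- tribble(
  ~key, ~val_x,
  "a",  1,
  "a",  2,
  "b",  3,
  "c",  4
)
y <- tribble(
  ~key, ~val_y,
  "a",  10,
  "a",  20
)

# "a" matches 2 x 2 = 4 times, while "b" and "c" are dropped,
# so the output has 4 rows: exactly as many as `x`.
# (Recent versions of dplyr may warn about a many-to-many relationship.)
joined <- inner_join(x, y, by = "key")
nrow(joined) == nrow(x)

# More reliable checks: look for duplicated keys and unmatched rows.
dup_keys  <- x |> count(key) |> filter(n > 1)
unmatched <- anti_join(x, y, by = "key")
```

The row counts agree even though two rows of `x` were lost and the remaining two were duplicated; `dup_keys` and `unmatched` reveal both problems.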
## Set operations {#sec-set-operations}
Set operations are the final type of two-table verb.
Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces.
All these operations work with a complete row, comparing the values of every variable.
These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
- `intersect(x, y)`: return only observations in both `x` and `y`.
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
Given this simple data:
```{r}
df1 <- tribble(
~x, ~y,
1, 1,
2, 1
)
df2 <- tribble(
~x, ~y,
1, 1,
1, 2
)
```
The four possibilities are:
```{r}
intersect(df1, df2)
# Note that we get 3 rows, not 4
union(df1, df2)
setdiff(df1, df2)
setdiff(df2, df1)
```