Light updates to joins chapter
parent
0705aceba7
commit
ca38492660
|
@@ -1,4 +1,4 @@
|
|||
# Two-table verbs {#sec-relational-data}
|
||||
# Joins {#sec-relational-data}
|
||||
|
||||
```{r}
|
||||
#| results: "asis"
|
||||
|
@@ -14,31 +14,24 @@ Waiting on <https://github.com/tidyverse/dplyr/pull/5910>
|
|||
<!-- TODO: redraw all diagrams to match O'Reilly style -->
|
||||
|
||||
It's rare that a data analysis involves only a single data frame.
|
||||
Typically you have many data frames, and you must combine them to answer the questions that you're interested in.
|
||||
Typically you have many data frames, and you must **join** them together to answer the questions that you're interested in.
|
||||
|
||||
All the verbs in this chapter use a pair of data frames.
|
||||
Fortunately this is enough, since you can combine three data frames by combining two pairs.
|
||||
Sometimes both elements of a pair will be the same data frame.
|
||||
This is needed if, for example, you have a data frame of people, and each person has a reference to their parents.
|
||||
|
||||
There are three families of verbs designed to work with pairs of data frames:
|
||||
There are two important types of joins.
|
||||
**Mutating joins** add new variables to one data frame from matching observations in another.
|
||||
**Filtering joins** filter observations from one data frame based on whether or not they match an observation in another.
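
To make the distinction concrete, here is a minimal sketch using two made-up tibbles. The names `x`, `y`, `key`, `val_x`, and `val_y` are illustrative assumptions, not part of nycflights13. It shows that a mutating join adds columns, while filtering joins only keep or drop rows:

```{r}
library(dplyr)

# Hypothetical data: keys 1 and 2 appear in both tables,
# key 3 only in x, key 4 only in y
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Mutating join: keeps every row of x and adds val_y where the key matches
# (val_y is NA for key 3, which has no match in y)
mutated <- left_join(x, y, by = "key")

# Filtering joins: the columns of x are unchanged, only the rows differ
kept <- semi_join(x, y, by = "key")    # rows of x with a match in y
dropped <- anti_join(x, y, by = "key") # rows of x without a match in y
```

Here `mutated` has three columns, while `kept` and `dropped` still have only the two columns of `x`.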
|
||||
|
||||
- **Mutating joins**, which add new variables to one data frame from matching observations in another.
|
||||
|
||||
- **Filtering joins**, which filter observations from one data frame based on whether or not they match an observation in another.
|
||||
|
||||
- **Set operations**, which treat observations as if they were set elements.
|
||||
|
||||
The most common place to find relational data is in a *relational* database management system (or RDBMS), a term that encompasses almost all modern databases.
|
||||
If you've used a database before, you've almost certainly used SQL.
|
||||
If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different.
|
||||
One other major terminology difference between databases and R is that what we generally refer to as a data frame in R is referred to as a "table" in databases.
|
||||
Hence you'll see references to one-table and two-table verbs in dplyr documentation.
|
||||
Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
|
||||
If you're not familiar with databases or SQL, you'll learn more about them in [Chapter -@sec-import-databases].
|
||||
If you're familiar with SQL, you should find these ideas very familiar as their instantiation in dplyr is very similar.
|
||||
We'll point out any important differences as we go.
|
||||
Don't worry if you're not familiar with SQL; we'll come back to it in @sec-import-databases.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
We will explore relational data from `nycflights13` using the two-table verbs from dplyr.
|
||||
We will explore relational data from nycflights13 using the join functions from dplyr.
|
||||
|
||||
```{r}
|
||||
#| label: setup
|
||||
|
@@ -50,7 +43,7 @@ library(nycflights13)
|
|||
|
||||
## nycflights13 {#sec-nycflights13-relational}
|
||||
|
||||
nycflights13 contains five tibbles: `airlines`, `airports`, `weather`, and `planes`, which are all related to the `flights` data frame that you used in [Chapter -@sec-data-transform] on data transformation:
|
||||
nycflights13 contains five tibbles: `airlines`, `airports`, `weather`, and `planes`, which are all related to the `flights` data frame that you used in @sec-data-transform on data transformation:
|
||||
|
||||
- `airlines` lets you look up the full carrier name from its abbreviated code:
|
||||
|
||||
|
@@ -253,7 +246,7 @@ We'll then use that to explain the four mutating join functions: the inner join,
|
|||
When working with real data, keys don't always uniquely identify observations, so next we'll talk about what happens when there isn't a unique match.
|
||||
Finally, you'll learn how to tell dplyr which variables are the keys for a given join.
|
||||
|
||||
### Understanding joins
|
||||
## Join types
|
||||
|
||||
To help you learn how joins work, I'm going to use a visual representation:
|
||||
|
||||
|
@@ -727,42 +720,3 @@ Your own data is unlikely to be so nice, so there are a few things that you shou
|
|||
|
||||
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
|
||||
If you have an inner join with duplicate keys in both data frames, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
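
Here is a minimal sketch of how that can happen, using two made-up tibbles (`x`, `y`, and their columns are illustrative assumptions, not part of nycflights13). Two rows of `x` are dropped and the other two are each duplicated, so the row count is unchanged:

```{r}
library(dplyr)

# Hypothetical data: key 1 is duplicated in both tables,
# and keys 2 and 3 in x have no match in y
x <- tibble(key = c(1, 1, 2, 3), val_x = c("a", "b", "c", "d"))
y <- tibble(key = c(1, 1), val_y = c("e", "f"))

# Recent versions of dplyr may warn about a many-to-many relationship here
joined <- inner_join(x, y, by = "key")

# Same number of rows before and after the join,
# even though the contents are quite different
nrow(x)
nrow(joined)

# More informative checks: look for duplicated keys and for unmatched rows
dup_keys <- filter(count(x, key), n > 1)
unmatched <- anti_join(x, y, by = "key")
```

Checking `dup_keys` and `unmatched` directly tells you what actually happened to the rows, which a before-and-after row count cannot.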
|
||||
|
||||
## Set operations {#sec-set-operations}
|
||||
|
||||
The final type of two-table verb is the set operation.
|
||||
Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces.
|
||||
All these operations work with a complete row, comparing the values of every variable.
|
||||
These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
|
||||
|
||||
- `intersect(x, y)`: return only observations in both `x` and `y`.
|
||||
- `union(x, y)`: return unique observations in `x` and `y`.
|
||||
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
|
||||
|
||||
Given this simple data:
|
||||
|
||||
```{r}
|
||||
df1 <- tribble(
|
||||
~x, ~y,
|
||||
1, 1,
|
||||
2, 1
|
||||
)
|
||||
df2 <- tribble(
|
||||
~x, ~y,
|
||||
1, 1,
|
||||
1, 2
|
||||
)
|
||||
```
|
||||
|
||||
The four possibilities are:
|
||||
|
||||
```{r}
|
||||
intersect(df1, df2)
|
||||
|
||||
# Note that we get 3 rows, not 4
|
||||
union(df1, df2)
|
||||
|
||||
setdiff(df1, df2)
|
||||
|
||||
setdiff(df2, df1)
|
||||
```
|
||||
|
|