Merge pull request #39 from radugrosu/patch-4
Update relational-data.Rmd
commit 0578bf5e70
@ -18,7 +18,7 @@ It's rare that a data analysis involves only a single table of data. Typically y
|
||||||
|
|
||||||
Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair; sometimes both elements of a pair can be the same table.
|
Relations are always defined between a pair of tables. All other relations are built up from this simple idea: the relations of three or more tables are always a property of the relations between each pair; sometimes both elements of a pair can be the same table.
|
||||||
|
|
||||||
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs design to work with relational data:
|
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
|
||||||
|
|
||||||
* __Mutating joins__, which add new variables to one data frame from matching
|
* __Mutating joins__, which add new variables to one data frame from matching
|
||||||
rows in another.
|
rows in another.
|
||||||
|
@ -28,11 +28,11 @@ To work with relational data you need verbs that work with pairs of tables. Ther
|
||||||
|
|
||||||
* __Set operations__, which treat observations like they were set elements.
|
* __Set operations__, which treat observations like they were set elements.
|
||||||
|
|
||||||
The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is little different. Generally, dplyr is a little easier to use than SQL because it's specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.
|
The most common place to find relational data is in a _relational_ database management system, a term that encompasses almost all modern databases. If you've used a database before, you've almost certainly used SQL. If so, you should find the concepts in this chapter familiar, although their expression in dplyr is a little different. Generally, dplyr is a little easier to use than SQL because it's specialised to data analysis: it makes common data analysis operations easier, at the expense of making it difficult to do other things.
|
||||||
|
|
||||||
## nycflights13 {#nycflights13-relational}
|
## nycflights13 {#nycflights13-relational}
|
||||||
|
|
||||||
You'll learn about relational data with other datasets from the nycflights13 package. As well as the `flights` table that you've worked with so far, nycflights13 contains a four related data frames:
|
You'll learn about relational data with other datasets from the nycflights13 package. As well as the `flights` table that you've worked with so far, nycflights13 contains four other related data frames:
|
||||||
|
|
||||||
* `airlines` lets you look up the full carrier name from its abbreviated
|
* `airlines` lets you look up the full carrier name from its abbreviated
|
||||||
code:
|
code:
|
||||||
|
@ -112,7 +112,7 @@ There are two types of keys:
|
||||||
each plane.
|
each plane.
|
||||||
|
|
||||||
* A __foreign key__ uniquely identifies an observation in another table.
|
* A __foreign key__ uniquely identifies an observation in another table.
|
||||||
For example, the `flights$tailnum` is a foregin key because it matches each
|
For example, the `flights$tailnum` is a foreign key because it matches each
|
||||||
flight to a unique plane.
|
flight to a unique plane.
|
||||||
|
|
||||||
A variable can be both part of a primary key _and_ a foreign key. For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` table.
|
A variable can be both part of a primary key _and_ a foreign key. For example, `origin` is part of the `weather` primary key, and is also a foreign key for the `airports` table.
|
||||||
|
@ -124,16 +124,16 @@ planes %>% count(tailnum) %>% filter(n > 1)
|
||||||
weather %>% count(year, month, day, hour, origin) %>% filter(n > 1)
|
weather %>% count(year, month, day, hour, origin) %>% filter(n > 1)
|
||||||
```
|
```
|
||||||
|
|
||||||
Sometimes a table does't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what's the primary key in the `flights` table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
|
Sometimes a table doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what's the primary key in the `flights` table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
flights %>% count(year, month, day, flight) %>% filter(n > 1)
|
flights %>% count(year, month, day, flight) %>% filter(n > 1)
|
||||||
flights %>% count(year, month, day, tailnum) %>% filter(n > 1)
|
flights %>% count(year, month, day, tailnum) %>% filter(n > 1)
|
||||||
```
|
```
|
||||||
|
|
||||||
When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easiser to communicate problems with a specific flight. Unfortunately that is not the case! If a table lacks a primary key, it's sometimes useful to add one with `row_number()`. That makes it easier to match observations if you've done some filtering and want to check back in with the original data. This is called a surrogate key.
|
When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. Unfortunately that is not the case! If a table lacks a primary key, it's sometimes useful to add one with `row_number()`. That makes it easier to match observations if you've done some filtering and want to check back in with the original data. This is called a surrogate key.
|
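As an illustration of the surrogate key idea described in this hunk, here is a minimal sketch. It assumes dplyr is installed and uses a small made-up table in place of `flights`; the names `toy_flights`, `with_key`, and `flight_id` are hypothetical and do not appear in the chapter.

```r
library(dplyr)

# Toy stand-in for `flights` (made-up data): flight number 1 is used twice
# on the same day, so (year, month, day, flight) is not a primary key.
toy_flights <- tibble(
  year = 2013, month = 1, day = 1,
  flight = c(1, 1, 2),
  dest = c("IAH", "MIA", "ORD")
)

# The natural candidate key has duplicates: this returns one row.
duplicates <- toy_flights %>%
  count(year, month, day, flight) %>%
  filter(n > 1)

# Add a surrogate key; `flight_id` is a hypothetical column name.
with_key <- toy_flights %>% mutate(flight_id = row_number())

# The surrogate key is unique by construction: this returns zero rows.
surrogate_duplicates <- with_key %>%
  count(flight_id) %>%
  filter(n > 1)
```

After filtering `with_key`, the `flight_id` column can be used to match the surviving rows back to the original table.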
||||||
|
|
||||||
A primary key and the corresponding foreign key in another table form a __relation__. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights. In other data, you'll occassionaly see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. It's possible to model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.
|
A primary key and the corresponding foreign key in another table form a __relation__. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights. In other data, you'll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. It's possible to model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there's a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.
|
||||||
|
|
||||||
### Exercises
|
### Exercises
|
||||||
|
|
||||||
|
@ -243,7 +243,7 @@ Graphically, that looks like:
|
||||||
knitr::include_graphics("diagrams/join-outer.png")
|
knitr::include_graphics("diagrams/join-outer.png")
|
||||||
```
|
```
|
||||||
|
|
||||||
The most commonly used join is the left join: you use this when ever you look up additional data out of another table, becasuse it preserves the original observations even when there isn't a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
|
The most commonly used join is the left join: you use this whenever you look up additional data out of another table, because it preserves the original observations even when there isn't a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
|
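To make the point about preserving observations concrete, the following sketch compares a left join with an inner join. It assumes dplyr is installed; the tables `x` and `y` and their columns are made up for illustration.

```r
library(dplyr)

# Made-up tables: key 3 exists in x but has no match in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2), val_y = c("y1", "y2"))

# The left join keeps every row of x; the unmatched row gets NA in val_y.
left <- left_join(x, y, by = "key")

# The inner join silently drops the unmatched row.
inner <- inner_join(x, y, by = "key")

nrow(left)   # 3
nrow(inner)  # 2
```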
||||||
|
|
||||||
Another way to depict the different types of joins is with a Venn diagram:
|
Another way to depict the different types of joins is with a Venn diagram:
|
||||||
|
|
||||||
|
@ -352,7 +352,7 @@ So far, the pairs of tables have always been joined by a single variable, and th
|
||||||
1. What weather conditions make it more likely to see a delay?
|
1. What weather conditions make it more likely to see a delay?
|
||||||
|
|
||||||
1. What happened on June 13 2013? Display the spatial pattern of delays,
|
1. What happened on June 13 2013? Display the spatial pattern of delays,
|
||||||
and then use google to cross-reference with the weather.
|
and then use Google to cross-reference with the weather.
|
||||||
|
|
||||||
```{r, eval = FALSE, include = FALSE}
|
```{r, eval = FALSE, include = FALSE}
|
||||||
worst <- filter(not_cancelled, month == 6, day == 13)
|
worst <- filter(not_cancelled, month == 6, day == 13)
|
||||||
|
@ -385,17 +385,17 @@ SQL is the inspiration for dplyr's conventions, so the translation is straightfo
|
||||||
dplyr | SQL
|
dplyr | SQL
|
||||||
-----------------------------|-------------------------------------------
|
-----------------------------|-------------------------------------------
|
||||||
`inner_join(x, y, by = "z")` | `SELECT * FROM x INNER JOIN y USING (z)`
|
`inner_join(x, y, by = "z")` | `SELECT * FROM x INNER JOIN y USING (z)`
|
||||||
`left_join(x, y, by = "z")` | `SELECT * FROM x LEFT OUTER JOIN USING (z)`
|
`left_join(x, y, by = "z")` | `SELECT * FROM x LEFT OUTER JOIN y USING (z)`
|
||||||
`right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN USING (z)`
|
`right_join(x, y, by = "z")` | `SELECT * FROM x RIGHT OUTER JOIN y USING (z)`
|
||||||
`full_join(x, y, by = "z")` | `SELECT * FROM x FULL OUTER JOIN USING (z)`
|
`full_join(x, y, by = "z")` | `SELECT * FROM x FULL OUTER JOIN y USING (z)`
|
||||||
|
|
||||||
Note that "INNER" and "OUTER" are optional, and often ommitted.
|
Note that "INNER" and "OUTER" are optional, and often omitted.
|
||||||
|
|
||||||
Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equiality (sometimes called non-equijoins).
|
Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests, SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).
|
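The `by = c("a" = "b")` form mentioned above can be sketched as follows. This assumes dplyr is installed; the tables and their values are made up, and only the column names `a` and `b` come from the text.

```r
library(dplyr)

# Made-up tables whose key columns have different names.
x <- tibble(a = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(b = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Match x$a against y$b; the result keeps the key under the name from x.
joined <- inner_join(x, y, by = c("a" = "b"))

names(joined)  # "a" "val_x" "val_y"
```

This corresponds to the equality case `ON x.a = y.b` in the SQL shown above.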
||||||
|
|
||||||
## Filtering joins {#filtering-joins}
|
## Filtering joins {#filtering-joins}
|
||||||
|
|
||||||
Filtering joins match obserations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
|
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
|
||||||
|
|
||||||
* `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`.
|
* `semi_join(x, y)` __keeps__ all observations in `x` that have a match in `y`.
|
||||||
* `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`.
|
* `anti_join(x, y)` __drops__ all observations in `x` that have a match in `y`.
|
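A minimal sketch of the two filtering joins listed above, assuming dplyr is installed. The tables `x` and `y` are made up for illustration.

```r
library(dplyr)

# Made-up tables: keys 1 and 2 match, key 3 is only in x, key 4 only in y.
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# semi_join() keeps the rows of x that have a match in y (keys 1 and 2).
kept <- semi_join(x, y, by = "key")

# anti_join() drops those rows, leaving only the unmatched one (key 3).
dropped <- anti_join(x, y, by = "key")

# Neither result gains columns from y: only the rows of x are affected.
names(kept)     # "key" "val_x"
names(dropped)  # "key" "val_x"
```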
||||||
|
@ -494,7 +494,7 @@ Be aware that simply checking the number of rows before and after the join is no
|
||||||
|
|
||||||
## Set operations {#set-operations}
|
## Set operations {#set-operations}
|
||||||
|
|
||||||
The final type of two-table verb is set operations. Generally, I use these the least frequently, but they are occassionally useful when you want to break a single complex filter into simpler pieces that you then combine.
|
The final type of two-table verb is set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces that you then combine.
|
||||||
|
|
||||||
All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
|
All these operations work with a complete row, comparing the values of every variable. These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets:
|
||||||
|
|
||||||
|
|