Working on second half
This commit is contained in:
parent
cd16ec24fc
commit
a44a1a0551
|
@ -126,12 +126,11 @@ As you'll see in this chapter, it's really easy to get started with but it can a
|
|||
### Load some data {#sec-load-data}
|
||||
|
||||
Since this is a temporary database, we need to start by adding some data.
|
||||
Here we'll use the `mpg` and `diamonds` datasets from ggplot2, and all data in the nycflights13 package.
|
||||
Here we'll use the `mpg` and `diamonds` datasets from ggplot2.
|
||||
|
||||
```{r}
|
||||
dbWriteTable(con, "mpg", ggplot2::mpg)
|
||||
dbWriteTable(con, "diamonds", ggplot2::diamonds)
|
||||
dbplyr::copy_nycflights13(con)
|
||||
```
|
||||
|
||||
If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
|
||||
|
@ -206,7 +205,7 @@ There are lots of other functions in DBI that you might find useful if you're ma
|
|||
Now that you've learned the low-level basics for connecting to a database and running a query, we're going to switch it up a bit and learn a bit about dbplyr.
|
||||
dbplyr is a dplyr **backend**, which means that you write the dplyr code that you're already familiar with and dbplyr translates it to run in a different way, in this case to SQL.
|
||||
|
||||
To use dbplyr you start start by creating a `tbl()`: this creates something that looks like a tibble, but is really a reference to a table in a database[^import-databases-3]:
|
||||
To use dbplyr, you must first use `tbl()` to create an object that represents a database table[^import-databases-3]:
|
||||
|
||||
[^import-databases-3]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, SQL("SELECT * FROM foo")).`
|
||||
|
||||
|
@ -215,29 +214,29 @@ diamonds_db <- tbl(con, "diamonds")
|
|||
diamonds_db
|
||||
```
|
||||
|
||||
You can tell it's a database query because it prints the database name at the top, and typically it won't be able to tell you the total number of rows.
|
||||
This is because finding the total number of rows is often an expensive computation for a database.
|
||||
|
||||
You can see the SQL generated by a dbplyr query by called `show_query()`.
|
||||
We can create the SQL from @sec-dbGetQuery with the following dplyr code:
|
||||
This object is **lazy**; when you use dplyr verbs on it, dplyr doesn't do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
|
||||
For example, take the following pipeline:
|
||||
|
||||
```{r}
|
||||
big_diamonds_db <- diamonds_db |>
|
||||
filter(price > 15000) |>
|
||||
select(carat:clarity, price)
|
||||
|
||||
big_diamonds_db
|
||||
```
|
||||
|
||||
`big_diamonds_db` captures the transformations we want to perform on the data but doesn't actually perform them.
|
||||
Instead, it translates your dplyr code into SQL, which you can see with `show_query()`:
|
||||
You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically it won't tell you the number of rows.
|
||||
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
|
||||
|
||||
You can see the SQL the dbplyr generates by a dbplyr query by calling `show_query()`:
|
||||
|
||||
```{r}
|
||||
big_diamonds_db |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
To get the data back into R, we call `collect()`.
|
||||
Behind the scenes, this generates the SQL, calls `dbGetQuery()`, and turns the result back into a tibble:
|
||||
To get all the data back into R, you call `collect()`.
|
||||
Behind the scenes, this generates the SQL, calls `dbGetQuery()` to get the data, then turns the result into a tibble:
|
||||
|
||||
```{r}
|
||||
big_diamonds <- big_diamonds_db |>
|
||||
|
@ -247,105 +246,167 @@ big_diamonds
|
|||
|
||||
## SQL
|
||||
|
||||
This SQL is a little different to what you might write by hand: dbplyr quotes every variable name and may include parentheses when they're not absolutely needed.
|
||||
If you were to write this by hand, you'd probably do:
|
||||
The rest of the chapter will teach you a little about SQL through the lens of dbplyr.
|
||||
It's a rather non-traditional introduction to SQL but I hope it will get you quickly up to speed with the basics.
|
||||
It will hopefully help you understand the parallels between SQL and dplyr but it's not going to give you much practice writing SQL.
|
||||
For that, I'd recommend [*SQL for Data Scientists*](https://sqlfordatascientists.com)by Renée M. P. Teate.
|
||||
It's an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organisations.
|
||||
|
||||
``` sql
|
||||
SELECT carat, cut, color, clarity, price
|
||||
FROM diamonds
|
||||
WHERE price > 15000
|
||||
```
|
||||
|
||||
### SQL basics
|
||||
|
||||
The basic unit of composition in SQL is not a function, but a **statement**.
|
||||
Common statements include `INSERT` for adding new data, `CREATE` for making new tables, and `UPDATE` for modifying data, and `SELECT` for retrieving data.
|
||||
Unlike R SQL is (mostly) case insensitive, but by convention, to make them stand out the clauses are usually capitalized like `SELECT`, `FROM`, and `WHERE` above.
|
||||
|
||||
We're going to focus on `SELECT` statements because they are almost exclusively what you'll use as a data scientist.
|
||||
The other statements will be handled by someone else; in the case that you need to update your own database, you can solve most problems with `dbWriteTable()` and/or `dbInsertTable()`.
|
||||
In fact, as a data scientist in most cases you won't even be able to run these statements because you only have read only access to the database.
|
||||
This ensures that there's no way for you to accidentally mess things up.
|
||||
|
||||
A `SELECT` statement is often called a query, and a query is made up of clauses.
|
||||
Every query must have two clauses `SELECT` and `FROM`[^import-databases-4].
|
||||
The simplest query is something like `SELECT * FROM tablename` which will select all columns from `tablename`. Other optional clauses allow you
|
||||
|
||||
[^import-databases-4]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculation.
|
||||
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
|
||||
|
||||
The following sections work through the most important optional clauses.
|
||||
Unlike in R, SQL clauses must come in a specific order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`.
|
||||
We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights dataset, the `flights` and `planes` tibbles.
|
||||
These are easy to get into our learning database because dbplyr has a function designed for this exact scenario.
|
||||
|
||||
```{r}
|
||||
dbplyr::copy_nycflights13(con)
|
||||
flights <- tbl(con, "flights")
|
||||
planes <- tbl(con, "planes")
|
||||
```
|
||||
|
||||
### SELECT and FROM
|
||||
### SQL basics
|
||||
|
||||
The two most important clauses are `FROM`, which determines the source table or tables, and `SELECT` which determines which columns are in the output.
|
||||
There's no real equivalent to `FROM` in dbplyr; it's just the name of the data frame.
|
||||
Instead of functions, like R, SQL has **statements**.
|
||||
Common statements include `CREATE` for defining new tables, `INSERT` for adding data, and `SELECT` for retrieving data.
|
||||
We're going to focus on `SELECT` statements, aka **queries**, because they are almost exclusively what you'll use as a data scientist.
|
||||
Your job is usually to analyse existing data, and in most cases you won't even have permission to modify the data.
|
||||
|
||||
`SELECT` is the workhorse of SQL queries, and is used for `select()`, `mutate()`, `rename()`, and `relocate()`.
|
||||
In the next section, you'll see that `SELECT` is *also* used for `summarize()` when paired with `GROUP BY`.
|
||||
A query is made up of **clauses**.
|
||||
Every query must have two clauses `SELECT` and `FROM`[^import-databases-4].
|
||||
The simplest query is uses `SELECT * FROM tablename` to select columns from the specified table.
|
||||
This is what dplyr generates for an adulterated table:
|
||||
|
||||
`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT` --- they just change the number and order of the variables, renaming where necessary with `AS`.
|
||||
Unlike R, the old name is on the left and the new name is on the right.
|
||||
[^import-databases-4]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculation.
|
||||
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
|
||||
|
||||
```{r}
|
||||
diamonds_db |> select(cut:carat) |> show_query()
|
||||
diamonds_db |> rename(colour = color) |> show_query()
|
||||
diamonds_db |> relocate(x:z) |> show_query()
|
||||
flights |> show_query()
|
||||
planes |> show_query()
|
||||
```
|
||||
|
||||
There are three other important clauses: `WHERE`, `ORDER BY`, and `GROUP BY`. `WHERE` and `ORDER BY` control which rows are included in the result and how they are ordered:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
filter(dest == "IAH") |>
|
||||
arrange(dep_delay) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
While `GROUP BY` works just like `dplyr::group_by()` causing aggregation to happen by group:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
group_by(dest) |>
|
||||
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
There are two important differences between dplyr verbs and SELECT clauses:
|
||||
|
||||
- SQL, unlike R, is **case** **insensitive** so you can write `select`, `SELECT`, or even `SeLeCt`. In this book we'll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variables names.
|
||||
- In SQL, order matters. Unlike dplyr, where you can call the verbs in whatever order makes the most sense to you, SQL clauses must come in a specific order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how they are actually evaluated which is `FROM`, `WHERE`, `GROUP BY`, `SELECT`, `ORDER BY`.
|
||||
|
||||
The following sections will explore each clause in more detail.
|
||||
|
||||
### SELECT
|
||||
|
||||
`SELECT` is the workhorse of SQL queries, and is equivalent to `select()`, `mutate()`, `rename()`, `relocate()`, and, as you'll learn in the next section, `summarize()`.
|
||||
`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT` as they affect where a column appears (if at all) along with its name:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
select(year:day, starts_with("dep")) |>
|
||||
show_query()
|
||||
|
||||
flights |>
|
||||
rename(tail_num = tailnum) |>
|
||||
show_query()
|
||||
|
||||
flights |>
|
||||
relocate(hour:minute, .after = day) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
This example also shows you have SQL does renaming.
|
||||
In SQL terminology is called **aliasing** and is done with `AS`; just note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
|
||||
|
||||
The translations for `mutate()` are similarly straightforward.
|
||||
We'll come back to the translation of individual components in @sec-sql-expressions.
|
||||
|
||||
```{r}
|
||||
diamonds_db |> mutate(price_per_carat = price / carat) |> show_query()
|
||||
```
|
||||
|
||||
### WHERE
|
||||
|
||||
`filter()` is translated to `WHERE`:
|
||||
|
||||
```{r}
|
||||
diamonds_db |>
|
||||
filter(carat > 1, colour == "J") |>
|
||||
mutate(price_per_carat = price / carat) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
### GROUP BY
|
||||
|
||||
`SELECT` is also used for summaries when pared with `GROUP BY`:
|
||||
When pared with `group_by()`, `summarise()` is also translated to `SELECT`:
|
||||
|
||||
```{r}
|
||||
diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
avg_price = mean(price)
|
||||
avg_price = mean(price, na.rm = TRUE)
|
||||
) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
Note the warning: unlike R, missing values (called `NULL` instead of `NA` in SQL) are not infectious in summary statistics.
|
||||
We'll come back to this challenge a bit later in Section \@ref(sql-expressions).
|
||||
We'll come back to the translations of `n()` and `mean()` in @sec-sql-expressions.
|
||||
But it's no coincidence that `group_by()` is translated to `GROUP BY`: the SQL clause inspired the R function name.
|
||||
|
||||
###
|
||||
### WHERE
|
||||
|
||||
`filter()` is translated to `WHERE`.
|
||||
`|` becomes `OR` and `&` becomes `AND:`
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
filter(dest == "IAH" | dest == "HOU") |>
|
||||
show_query()
|
||||
|
||||
flights |>
|
||||
filter(arr_delay > 0 & arr_delay < 20) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
Note that SQL uses `=` for comparison, not `==`.
|
||||
This is super annoying if you're switching between writing R code and SQL!
|
||||
|
||||
Another useful SQL function is `IN`, which is very close to R's `%in%`:
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
filter(dest %in% c("IAH", "HOU")) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
SQL doesn't have `NA`s, but instead has `NULL`s.
|
||||
They behave very similarly to `NA`s, including their "infectious" properties.
|
||||
|
||||
```{r}
|
||||
flights |>
|
||||
filter(!is.na(dep_delay)) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
This SQL illustrates one of the drawbacks of dbplyr: it doesn't always generate the simplest SQL.
|
||||
In this case, the parentheses are redundant and you could use the special form `IS NOT NULL` yielding:
|
||||
|
||||
``` sql
|
||||
WHERE "dep_delay" IS NOT NULL
|
||||
```
|
||||
|
||||
### ORDER BY
|
||||
|
||||
`arrange()` is translated to `ORDER BY`:
|
||||
|
||||
```{r}
|
||||
diamonds_db |>
|
||||
arrange(carat, desc(price)) |>
|
||||
flights |>
|
||||
arrange(year, month, day, desc(dep_delay)) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
And `desc()` becomes `DESC` --- and now you know the inspiration for the function name 😄.
|
||||
Note that `desc()` becomes `DESC`; this is another R function whose named was directly inspired by SQL.
|
||||
|
||||
### Subqueries
|
||||
|
||||
|
@ -376,45 +437,22 @@ diamonds_db |>
|
|||
```
|
||||
|
||||
Sometimes dbplyr uses a subquery where strictly speaking it's not necessary.
|
||||
For example, take this pipeline that filters on a summary value:
|
||||
|
||||
```{r}
|
||||
diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
avg_price = mean(price)
|
||||
) |>
|
||||
filter(n > 10) |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
In this case it's possible to use the special `HAVING` clause.
|
||||
This is works the same way as `WHERE` except that it's applied *after* the aggregates have been computed, not before.
|
||||
|
||||
``` sql
|
||||
SELECT "cut", COUNT(*) AS "n", AVG("price") AS "avg_price"
|
||||
FROM "diamonds"
|
||||
GROUP BY "cut"
|
||||
HAVING "n" > 10.0
|
||||
```
|
||||
For example, take this pipeline that filters on a summary value.
|
||||
|
||||
### Joins
|
||||
|
||||
SQL joins are straightforward, but dbplyr's current translation requires spelling out
|
||||
|
||||
```{r}
|
||||
flights |> inner_join(planes, by = "tailnum") |> show_query()
|
||||
flights |> left_join(planes, by = "tailnum") |> show_query()
|
||||
flights |> full_join(planes, by = "tailnum") |> show_query()
|
||||
flights |>
|
||||
left_join(planes, by = "tailnum") |>
|
||||
show_query()
|
||||
```
|
||||
|
||||
### Semi and anti-joins
|
||||
|
||||
SQL's syntax for semi- and anti-joins are a bit arcane.
|
||||
I don't remember these and just google if I ever need the syntax outside of SQL.
|
||||
Instead we'll create some dummy data:
|
||||
|
||||
```{r}
|
||||
flights |> semi_join(planes, by = "tailnum") |> show_query()
|
||||
flights |> anti_join(planes, by = "tailnum") |> show_query()
|
||||
|
||||
```
|
||||
|
||||
### Temporary data
|
||||
|
@ -433,6 +471,12 @@ Most database will allow you to create temporary tables, even if you don't other
|
|||
Rather than copying the data to the database, it builds SQL that generates the data inline.
|
||||
It's useful if you don't have permission to create temporary tables, and is faster than `copy_to()` for small datasets.
|
||||
|
||||
### Other statements
|
||||
|
||||
in the case that you need to update your own database, you can solve most problems with `dbWriteTable()` and/or `dbInsertTable()`.
|
||||
In fact, as a data scientist in most cases you won't even be able to run these statements because you only have read only access to the database.
|
||||
This ensures that there's no way for you to accidentally mess things up.
|
||||
|
||||
## SQL expressions {#sec-sql-expressions}
|
||||
|
||||
https://dbplyr.tidyverse.org/articles/translation-function.html
|
||||
|
|
Loading…
Reference in New Issue