Use dev dbplyr
parent 226e0061ad
commit c83d21200d
@@ -48,6 +48,7 @@ Suggests:
     tidymodels,
     xml2
 Remotes:
+    tidyverse/dbplyr,
     tidyverse/stringr,
     tidyverse/tidyr,
     jennybc/repurrrsive
110
databases.qmd
@@ -13,13 +13,13 @@ A huge amount of data lives in databases, so it's essential that you know how to
 Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
 You want to be able to reach into the database directly to get the data you need, when you need it.
 
-In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^import-databases-1] query.
+In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^databases-1] query.
 **SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for all data scientists to learn.
 That said, we're not going to start with SQL, but instead we'll teach you dbplyr, which can translate your dplyr code to SQL.
 We'll use that as a way to teach you some of the most important features of SQL.
 You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
 
-[^import-databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
+[^databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
 
 ### Prerequisites
 
@@ -73,10 +73,10 @@ This uses the ODBC protocol supported by many DBMS.
 odbc requires a little more setup because you'll also need to install an ODBC driver and tell the odbc package where to find it.
 
 Concretely, you create a database connection using `DBI::dbConnect()`.
-The first argument selects the DBMS[^import-databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
+The first argument selects the DBMS[^databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
 The following code shows a couple of typical examples:
 
-[^import-databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
+[^databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
 
 ```{r}
 #| eval: false
@@ -133,16 +133,16 @@ dbWriteTable(con, "diamonds", ggplot2::diamonds)
 If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
 These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
 
-## Database basics
+## DBI basics
 
 Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
 
 ### What's there?
 
 The most important database objects for data scientists are tables.
-DBI provides two useful functions to either list all the tables in the database[^import-databases-3] or to check if a specific table already exists:
+DBI provides two useful functions to either list all the tables in the database[^databases-3] or to check if a specific table already exists:
 
-[^import-databases-3]: At least, all the tables that you have permission to see.
+[^databases-3]: At least, all the tables that you have permission to see.
 
 ```{r}
 dbListTables(con)
@@ -279,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
 We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
 
 A query is made up of **clauses**.
-There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
-. This is what dplyr generates for an adulterated table
+There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^databases-4] and `FROM`[^databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
+. This is what dplyr generates for an unadulterated table
 :
 
-[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
+[^databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
 To avoid this confusion, we'll generally use query instead of `SELECT` statement.
 
-[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
+[^databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
 But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
 
 ```{r}
@@ -334,14 +334,16 @@ The `SELECT` clause is the workhorse of queries and performs the same job as `se
 
 ```{r}
 planes |>
-  select(tailnum, type, manufacturer, model) |>
+  select(tailnum, type, manufacturer, model, year) |>
   show_query()
 
 planes |>
+  select(tailnum, type, manufacturer, model, year) |>
   rename(year_built = year) |>
   show_query()
 
 planes |>
+  select(tailnum, type, manufacturer, model, year) |>
   relocate(manufacturer, model, .before = type) |>
   show_query()
 ```
@@ -350,42 +352,48 @@ This example also shows you how SQL does renaming.
 In SQL terminology renaming is called **aliasing** and is done with `AS`.
 Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
 
+::: callout-note
+In the examples above note that `"year"` and `"type"` are wrapped in double quotes.
+That's because these are **reserved words** in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
+
+When working with other databases you're likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.
+
+``` sql
+SELECT "tailnum", "type", "manufacturer", "model", "year"
+FROM "planes"
+```
+
+Some other database systems use backticks instead of quotes:
+
+``` sql
+SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
+FROM `planes`
+```
+:::
+
 The translations for `mutate()` are similarly straightforward: each variable becomes a new expression in `SELECT`:
 
 ```{r}
-diamonds_db |>
+flights |>
   mutate(
-    price_per_carat = price / carat
+    speed = distance / (air_time / 60)
   ) |>
   show_query()
 ```
 
 We'll come back to the translation of individual components (like `/`) in @sec-sql-expressions.
 
-::: callout-note
-When working with other databases you're likely to see variable names wrapped in some sort of quote character, like this:
+### FROM
 
-``` sql
-SELECT "year", "month", "day", "dep_time", "dep_delay"
-FROM "flights"
-```
-
-Or like this:
-
-``` sql
-SELECT `year`, `month`, `day`, `dep_time`, `dep_delay`
-FROM `flights`
-```
-
-Quoting is only required for **reserved words** like `SELECT` or `FROM` to avoid confusion between column/tables names and SQL operators.
-But only a handful of client packages, like duckdb, know what all the reserved words are, so most packages quote everything just to be safe.
-:::
+The `FROM` clause defines the data source.
+It's going to be rather uninteresting for a little while, because we're just using single tables.
+You'll see more complex examples once we hit the join functions.
 
 ### GROUP BY
 
-`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:
+`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause:
 
-[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
+[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
 
 ```{r}
 diamonds_db |>
@@ -430,7 +438,7 @@ flights |>
 SQL uses `NULL` instead of `NA`.
 `NULL`s behave similarly to `NA`s.
 The main difference is that while they're "infectious" in comparisons and arithmetic, they are silently dropped when summarizing.
-dbplyr will remind you about this behaviour the first time you hit it:
+dbplyr will remind you about this behavior the first time you hit it:
 
 ```{r}
 flights |>
@@ -438,7 +446,7 @@ flights |>
   summarise(delay = mean(arr_delay))
 ```
 
-If you want to learn more about how NULLs work, I recomend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand,
+If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
 
 In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:
 
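The `NULL` behavior described in the two hunks above (infectious in comparisons and arithmetic, dropped when summarizing) can be checked with hand-written SQL. This is a minimal sketch, assuming the DBI and duckdb packages are installed; the connection `con_demo` and the inline table `t(x)` are hypothetical names used only for illustration and are not part of this commit.

```r
library(DBI)

# Hypothetical in-memory duckdb connection, used only for this illustration.
con_demo <- dbConnect(duckdb::duckdb())

# NULL is "infectious": arithmetic and comparisons involving NULL yield NULL,
# which DBI returns to R as NA.
infectious <- dbGetQuery(
  con_demo,
  "SELECT NULL + 1 AS arithmetic, NULL > 1 AS comparison"
)

# Aggregates silently skip NULLs: AVG() is computed over 1 and 3 only.
dropped <- dbGetQuery(
  con_demo,
  "SELECT AVG(x) AS mean_x FROM (VALUES (1), (NULL), (3)) AS t(x)"
)

dbDisconnect(con_demo, shutdown = TRUE)
```

Here `infectious` should hold `NA` in both columns, while `dropped$mean_x` is the mean of the two non-`NULL` values.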
@@ -455,6 +463,17 @@ In this case, you could drop the parentheses and use a special operator that's e
 WHERE "dep_delay" IS NOT NULL
 ```
 
+Note that if you `filter()` a variable that you created using a summarize, dbplyr will generate a `HAVING` clause, rather than a `WHERE` clause.
+This is one of the idiosyncrasies of SQL created because `WHERE` is evaluated before `SELECT`, so it needs another clause that's evaluated afterwards.
+
+```{r}
+diamonds_db |>
+  group_by(cut) |>
+  summarise(n = n()) |>
+  filter(n > 100) |>
+  show_query()
+```
+
 ### ORDER BY
 
 Ordering rows involves a straightforward translation from `arrange()` to the `ORDER BY` clause:
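The split between row-level and group-level filtering that the added `HAVING` paragraph describes can also be seen in hand-written SQL. This is a minimal sketch, assuming the DBI, duckdb, and ggplot2 packages are installed; the connection `con_demo` and both thresholds are hypothetical choices for illustration, not taken from this commit.

```r
library(DBI)

# Hypothetical in-memory duckdb connection holding a copy of ggplot2::diamonds.
con_demo <- dbConnect(duckdb::duckdb())
dbWriteTable(con_demo, "diamonds", ggplot2::diamonds)

# WHERE filters individual rows before grouping, so it can only test
# columns of the table (here: price), not the aggregate n.
expensive_by_cut <- dbGetQuery(con_demo, "
  SELECT cut, COUNT(*) AS n
  FROM diamonds
  WHERE price > 10000
  GROUP BY cut
")

# HAVING filters whole groups after aggregation, so it can test COUNT(*).
big_groups <- dbGetQuery(con_demo, "
  SELECT cut, COUNT(*) AS n
  FROM diamonds
  GROUP BY cut
  HAVING COUNT(*) > 2000
")

dbDisconnect(con_demo, shutdown = TRUE)
```

Every row of `big_groups` has `n` above the threshold, whereas `expensive_by_cut` keeps every cut that has at least one matching row, however small its count.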
@@ -501,33 +520,14 @@ As dbplyr improves over time, these cases will get rarer but will probably never
 ### Joins
 
 If you're familiar with dplyr's joins, SQL joins are very similar.
-Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
 Here's a simple example:
 
-[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
-
 ```{r}
 flights |>
   left_join(planes |> rename(year_built = year), by = "tailnum") |>
   show_query()
 ```
 
-If you were writing this by hand, you'd probably write this as:
-
-``` sql
-SELECT
-  flights.*,
-  year as year_built,
-  "type",
-  manufacturer,
-  model,
-  engines,
-  seats,
-  speed
-FROM flights
-LEFT JOIN planes ON (flights.tailnum = planes.tailnum)
-```
-
 The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
 
 dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join()`:
@@ -641,7 +641,7 @@ Here's a couple of simple examples:
 ```{r}
 flights |>
   mutate_query(
-    description = if_else(arr_deay > 0, "delayed", "on-time")
+    description = if_else(arr_delay > 0, "delayed", "on-time")
   )
 flights |>
   mutate_query(