Use dev dbplyr
@@ -48,6 +48,7 @@ Suggests:
     tidymodels,
     xml2
 Remotes:
+    tidyverse/dbplyr,
     tidyverse/stringr,
     tidyverse/tidyr,
     jennybc/repurrrsive
databases.qmd (110 lines changed)
@@ -13,13 +13,13 @@ A huge amount of data lives in databases, so it's essential that you know how to
 Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you'll have to communicate with another human.
 You want to be able to reach into the database directly to get the data you need, when you need it.
 
-In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^import-databases-1] query.
+In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL[^databases-1] query.
 **SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for all data scientists to learn.
 That said, we're not going to start with SQL, but instead we'll teach you dbplyr, which can translate your dplyr code to the SQL.
 We'll use that as way to teach you some of the most important features of SQL.
 You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
 
-[^import-databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
+[^databases-1]: SQL is either pronounced "s"-"q"-"l" or "sequel".
 
 ### Prerequisites
 
@@ -73,10 +73,10 @@ This uses the ODBC protocol supported by many DBMS.
 odbc requires a little more setup because you'll also need to install an ODBC driver and tell the odbc package where to find it.
 
 Concretely, you create a database connection using `DBI::dbConnect()`.
-The first argument selects the DBMS[^import-databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
+The first argument selects the DBMS[^databases-2], then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it).
 The following code shows a couple of typical examples:
 
-[^import-databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
+[^databases-2]: Typically, this is the only function you'll use from the client package, so we recommend using `::` to pull out that one function, rather than loading the complete package with `library()`.
 
 ```{r}
 #| eval: false
@@ -133,16 +133,16 @@ dbWriteTable(con, "diamonds", ggplot2::diamonds)
 If you're using duckdb in a real project, I highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
 These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it in to R.
 
-## Database basics
+## DBI basics
 
 Now that we've connected to a database with some data in it, lets perform some basic operations with DBI.
 
 ### What's there?
 
 The most important database objects for data scientists are tables.
-DBI provides two useful functions to either list all the tables in the database[^import-databases-3] or to check if a specific table already exists:
+DBI provides two useful functions to either list all the tables in the database[^databases-3] or to check if a specific table already exists:
 
-[^import-databases-3]: At least, all the tables that you have permission to see.
+[^databases-3]: At least, all the tables that you have permission to see.
 
 ```{r}
 dbListTables(con)
@@ -279,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
 We will on focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
 
 A query is made up of **clauses**.
-There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
-. This is what dplyr generates for an adulterated table
+There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^databases-4] and `FROM`[^databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
+. This is what dplyr generates for an unadulterated table
 :
 
-[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
+[^databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
     To avoid this confusion, we'll generally use query instead of `SELECT` statement.
 
-[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
+[^databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
     But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
 
 ```{r}
@@ -334,14 +334,16 @@ The `SELECT` clause is the workhorse of queries and performs the same job as `se
 
 ```{r}
 planes |>
-  select(tailnum, type, manufacturer, model) |>
+  select(tailnum, type, manufacturer, model, year) |>
   show_query()
 
 planes |>
+  select(tailnum, type, manufacturer, model, year) |>
   rename(year_built = year) |>
   show_query()
 
 planes |>
+  select(tailnum, type, manufacturer, model, year) |>
   relocate(manufacturer, model, .before = type) |>
   show_query()
 ```
@@ -350,42 +352,48 @@ This example also shows you how SQL does renaming.
 In SQL terminology renaming is called **aliasing** and is done with `AS`.
 Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
 
+::: callout-note
+In the examples above note that `"year"` and `"type"` are wrapped in double quotes.
+That's because these are **reserved words** in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.
+
+When working with other databases you're likely to see every variable name quotes because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.
+
+``` sql
+SELECT "tailnum", "type", "manufacturer", "model", "year"
+FROM "planes"
+```
+
+Some other database systems use backticks instead of quotes:
+
+``` sql
+SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
+FROM `planes`
+```
+:::
+
 The translations for `mutate()` are similarly straightforward: each variable becomes a new expression in `SELECT`:
 
 ```{r}
-diamonds_db |>
+flights |>
   mutate(
-    price_per_carat = price / carat
+    speed = distance / (air_time / 60)
   ) |>
   show_query()
 ```
 
 We'll come back to the translation of individual components (like `/`) in @sec-sql-expressions.
 
-::: callout-note
-When working with other databases you're likely to see variable names wrapped in some sort of quote character, like this:
+### FROM
 
-``` sql
-SELECT "year", "month", "day", "dep_time", "dep_delay"
-FROM "flights"
-```
-
-Or like this:
-
-``` sql
-SELECT `year`, `month`, `day`, `dep_time`, `dep_delay`
-FROM `flights`
-```
-
-Quoting is only required for **reserved words** like `SELECT` or `FROM` to avoid confusion between column/tables names and SQL operators.
-But only a handful of client packages, like duckdb, know what all the reserved words are, so most packages quote everything just to be safe.
-:::
+The `FROM` clause defines the data source.
+It's going to be rather uninteresting for a little while, because we're just using single tables.
+You'll see more complex examples once we hit the join functions.
 
 ### GROUP BY
 
-`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:
+`group_by()` is translated to the `GROUP BY`[^databases-6] clause and `summarise()` is translated to the `SELECT` clause:
 
-[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
+[^databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
 
 ```{r}
 diamonds_db |>
@@ -430,7 +438,7 @@ flights |>
 SQL uses `NULL` instead of `NA`.
 `NULL`s behave similarly to `NA`s.
 The main difference is that while they're "infectious" in comparisons and arithmetic, they are silently dropped when summarizing.
-dbplyr will remind you about this behaviour the first time you hit it:
+dbplyr will remind you about this behavior the first time you hit it:
 
 ```{r}
 flights |>
@@ -438,7 +446,7 @@ flights |>
   summarise(delay = mean(arr_delay))
 ```
 
-If you want to learn more about how NULLs work, I recomend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand,
+If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
 
 In general, you can work with `NULL`s using the functions you'd use for `NA`s in R:
 
@@ -455,6 +463,17 @@ In this case, you could drop the parentheses and use a special operator that's e
 WHERE "dep_delay" IS NOT NULL
 ```
 
+Note that if you `filter()` a variable that you created using a summarize, dbplyr will generate a `HAVING` clause, rather than a `FROM` clause.
+This is a one of the idiosyncracies of SQL created because `WHERE` is evaluated before `SELECT`, so it needs another clause that's evaluated afterwards.
+
+```{r}
+diamonds_db |>
+  group_by(cut) |>
+  summarise(n = n()) |>
+  filter(n > 100) |>
+  show_query()
+```
+
 ### ORDER BY
 
 Ordering rows involves a straightforward translation from `arrange()` to the `ORDER BY` clause:
@@ -501,33 +520,14 @@ As dbplyr improves over time, these cases will get rarer but will probably never
 ### Joins
 
 If you're familiar with dplyr's joins, SQL joins are very similar.
-Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
 Here's a simple example:
 
-[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
-
 ```{r}
 flights |>
   left_join(planes |> rename(year_built = year), by = "tailnum") |>
   show_query()
 ```
 
-If you were writing this by hand, you'd probably write this as:
-
-``` sql
-SELECT
-  flights.*,
-  year as year_built,
-  "type",
-  manufacturer,
-  model,
-  engines,
-  seats,
-  speed
-FROM flights
-LEFT JOIN planes ON (flights.tailnum = planes.tailnum)
-```
-
 The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
 
 dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join():`
@@ -641,7 +641,7 @@ Here's a couple of simple examples:
 ```{r}
 flights |>
   mutate_query(
-    description = if_else(arr_deay > 0, "delayed", "on-time")
+    description = if_else(arr_delay > 0, "delayed", "on-time")
   )
 flights |>
   mutate_query(