Respond to feedback from twitter

Hadley Wickham 2022-06-04 09:22:05 -05:00
parent 6408e00d93
commit d411ae3780
1 changed files with 37 additions and 18 deletions


@ -161,7 +161,7 @@ con |>
`dbReadTable()` returns a `data.frame` so I use `as_tibble()` to convert it into a tibble so that it prints nicely.
In real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to use the database to bring back only a subset of the rows and columns.
In real life, it's rare that you'll use `dbReadTable()` because database tables are often too big to fit in memory, and you want to bring back only a subset of the rows and columns.
### Run a query {#sec-dbGetQuery}
@ -169,13 +169,12 @@ The way you'll usually retrieve data is with `dbGetQuery()`.
It takes a database connection and some SQL code and returns a data frame:
```{r}
con |>
  dbGetQuery("
    SELECT carat, cut, clarity, color, price
    FROM diamonds
    WHERE price > 15000
  ") |>
  as_tibble()
sql <- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price > 15000
"
as_tibble(dbGetQuery(con, sql))
```
Don't worry if you've never seen SQL before; you'll learn more about it shortly.
@ -194,15 +193,32 @@ Now that you've learned the low-level basics for connecting to a database and ru
dbplyr is a dplyr **backend**, which means that you keep writing dplyr code but the backend executes it differently.
In this case, dbplyr translates to SQL; other backends include [dtplyr](https://dtplyr.tidyverse.org), which translates to [data.table](https://r-datatable.com), and [multidplyr](https://multidplyr.tidyverse.org), which executes your code on multiple cores.
To use dbplyr, you must first use `tbl()` to create an object that represents a database table[^import-databases-4]:
[^import-databases-4]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, sql("SELECT * FROM foo"))`.
To use dbplyr, you must first use `tbl()` to create an object that represents a database table:
```{r}
diamonds_db <- tbl(con, "diamonds")
diamonds_db
```
::: callout-note
There are two other common ways to interact with a database.
First, many corporate databases are very large, so they need some hierarchy to keep all the tables organised.
In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:
```{r}
#| eval: false
diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
```
Other times you might want to use your own SQL query as a starting point:
```{r}
#| eval: false
diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
```
:::
This object is **lazy**; when you use dplyr verbs on it, dplyr doesn't do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
For example, take the following pipeline:
@ -233,6 +249,9 @@ big_diamonds <- big_diamonds_db |>
big_diamonds
```
Typically, you'll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below.
Then, once you're ready to analyse the data with functions that are unique to R, you'll `collect()` the data to get an in-memory tibble, and continue your work with pure R code.
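
As a minimal sketch of that workflow, the following assumes an in-memory duckdb database holding a copy of `ggplot2::diamonds`; the names `by_cut_db` and `by_cut` are purely illustrative and do not appear elsewhere in the chapter:

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: a throwaway in-memory duckdb database with a copy of diamonds
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)

diamonds_db <- tbl(con, "diamonds")

# Filtering and aggregation are only recorded here; no rows are fetched yet
by_cut_db <- diamonds_db |>
  filter(price > 15000) |>
  group_by(cut) |>
  summarise(n = n(), avg_price = mean(price, na.rm = TRUE))

# Inspect the SQL that would be sent to the database
by_cut_db |> show_query()

# collect() runs the query and returns an in-memory tibble
by_cut <- by_cut_db |> collect()

dbDisconnect(con, shutdown = TRUE)
```

After `collect()`, `by_cut` is an ordinary tibble, so any R function can be applied to it, whether or not dbplyr knows how to translate it.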
## SQL
The rest of the chapter will teach you a little SQL through the lens of dbplyr.
@ -260,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.
A query is made up of **clauses**.
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-5] and `FROM`[^import-databases-6] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:
[^import-databases-5]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
To avoid this confusion, we'll generally use query instead of `SELECT` statement.
[^import-databases-6]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
But if you want to work with data (as you always do!) you'll also need a `FROM` clause.
```{r}
@ -364,9 +383,9 @@ But only a handful of client packages, like duckdb, know what all the reserved w
### GROUP BY
`group_by()` is translated to the `GROUP BY`[^import-databases-7] clause and `summarise()` is translated to the `SELECT` clause:
`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:
[^import-databases-7]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
```{r}
diamonds_db |>
@ -482,10 +501,10 @@ As dbplyr improves over time, these cases will get rarer but will probably never
### Joins
If you're familiar with dplyr's joins, SQL joins are very similar.
Unfortunately, dbplyr's current translations are rather verbose[^import-databases-8].
Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
Here's a simple example:
[^import-databases-8]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
```{r}
flights |>