Respond to feedback from twitter

2022-06-04 09:22:05 -05:00
parent 6408e00d93
commit d411ae3780
1 changed files with 37 additions and 18 deletions
--- a/import-databases.qmd
+++ b/import-databases.qmd
@@ -161,7 +161,7 @@ con |>

 `dbReadTable()` returns a `data.frame` so I use `as_tibble()` to convert it into a tibble so that it prints nicely.

-In real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to use the database to bring back only a subset of the rows and columns.
+In real life, it's rare that you'll use `dbReadTable()` because database tables are often too big to fit in memory, and you want to bring back only a subset of the rows and columns.

 ### Run a query {#sec-dbGetQuery}

@@ -169,13 +169,12 @@ The way you'll usually retrieve data is with `dbGetQuery()`.
 It takes a database connection and some SQL code and returns a data frame:

 ```{r}
-con |> 
-  dbGetQuery("
-    SELECT carat, cut, clarity, color, price 
-    FROM diamonds 
-    WHERE price > 15000
-  ") |> 
-  as_tibble()
+sql <- "
+  SELECT carat, cut, clarity, color, price 
+  FROM diamonds 
+  WHERE price > 15000
+"
+as_tibble(dbGetQuery(con, sql))
 ```

 Don't worry if you've never seen SQL before; you'll learn more about it shortly.
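
To run the rewritten chunk outside the book, a minimal self-contained sketch follows. It assumes the DBI, duckdb, ggplot2, and tibble packages are installed, and that `con` is an in-memory duckdb connection holding a copy of `ggplot2::diamonds`. That setup is an assumption for illustration and is not part of this hunk.

```r
library(DBI)
library(tibble)

# Assumed setup (not shown in this hunk): an in-memory duckdb database
# containing a copy of the diamonds dataset.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)

# Store the query in a variable, then pass it to dbGetQuery(),
# mirroring the new version of the chunk.
sql <- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price > 15000
"

# dbGetQuery() returns a data.frame; as_tibble() makes it print nicely.
big <- as_tibble(dbGetQuery(con, sql))
big

# Close the connection and shut down the in-memory database.
dbDisconnect(con, shutdown = TRUE)
```

Keeping the SQL in its own variable separates the query text from the call that executes it, which appears to be the motivation for the change.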
@@ -194,15 +193,32 @@ Now that you've learned the low-level basics for connecting to a database and ru
 dbplyr is a dplyr **backend**, which means that you keep writing dplyr code but the backend executes it differently.
 In this case, dbplyr translates to SQL; other backends include [dtplyr](https://dtplyr.tidyverse.org) which translates to [data.table](https://r-datatable.com), and [multidplyr](https://multidplyr.tidyverse.org) which executes your code on multiple cores.

-To use dbplyr, you must first use `tbl()` to create an object that represents a database table[^import-databases-4]:
-
-[^import-databases-4]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, sql("SELECT * FROM foo")).`
+To use dbplyr, you must first use `tbl()` to create an object that represents a database table:

 ```{r}
 diamonds_db <- tbl(con, "diamonds")
 diamonds_db
 ```

+::: callout-note
+There are two other common ways to interact with a database.
+First, many corporate databases are very large, so you need some hierarchy to keep all the tables organised.
+In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:
+
+```{r}
+#| eval: false
+diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
+diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
+```
+
+Other times you might want to use your own SQL query as a starting point:
+
+```{r}
+#| eval: false
+diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
+```
+:::
+
 This object is **lazy**; when you use dplyr verbs on it, dplyr doesn't do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
 For example, take the following pipeline:

@@ -233,6 +249,9 @@ big_diamonds <- big_diamonds_db |>
 big_diamonds
 ```

+Typically, you'll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below.
+Then, once you're ready to analyse the data with functions that are unique to R, you'll `collect()` the data to get an in-memory tibble, and continue your work with pure R code.
+
 ## SQL

 The rest of the chapter will teach you a little SQL through the lens of dbplyr.
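
The workflow described in the paragraph added above (filter and aggregate in the database, then `collect()`) can be sketched as follows. This is a minimal illustration, assuming the DBI, duckdb, dplyr, dbplyr, and ggplot2 packages are installed. The in-memory connection and the names `by_cut_db` and `by_cut` are hypothetical and do not appear in the chapter.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumed setup: an in-memory duckdb database with a copy of diamonds.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "diamonds", ggplot2::diamonds)

diamonds_db <- tbl(con, "diamonds")

# Lazy: these verbs only record the operations; no query is run yet.
by_cut_db <- diamonds_db |>
  filter(price > 15000) |>
  group_by(cut) |>
  summarise(n = n(), avg_carat = mean(carat, na.rm = TRUE))

# Inspect the SQL that dbplyr generates for the pipeline.
show_query(by_cut_db)

# collect() runs the query and returns an in-memory tibble.
by_cut <- collect(by_cut_db)

# From here on this is ordinary R code working on local data.
by_cut <- by_cut |> arrange(desc(n))
by_cut

dbDisconnect(con, shutdown = TRUE)
```

Only the small summarised result crosses from the database into R, which is the point of doing the filtering and aggregation before `collect()`.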
@@ -260,14 +279,14 @@ Common statements include `CREATE` for defining new tables, `INSERT` for adding
 We will focus on `SELECT` statements, also called **queries**, because they are almost exclusively what you'll use as a data scientist.

 A query is made up of **clauses**.
-There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-5] and `FROM`[^import-databases-6] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
+There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^import-databases-4] and `FROM`[^import-databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:

-[^import-databases-5]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
+[^import-databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
    To avoid this confusion, we'll generally use query instead of `SELECT` statement.

-[^import-databases-6]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
+[^import-databases-5]: Ok, technically, only the `SELECT` is required, since you can write queries like `SELECT 1+1` to perform basic calculations.
    But if you want to work with data (as you always do!) you'll also need a `FROM` clause.

 ```{r}
@@ -364,9 +383,9 @@ But only a handful of client packages, like duckdb, know what all the reserved w

 ### GROUP BY

-`group_by()` is translated to the `GROUP BY`[^import-databases-7] clause and `summarise()` is translated to the `SELECT` clause:
+`group_by()` is translated to the `GROUP BY`[^import-databases-6] clause and `summarise()` is translated to the `SELECT` clause:

-[^import-databases-7]: This is no coincidence: the dplyr function name was inspired by the SQL clause.
+[^import-databases-6]: This is no coincidence: the dplyr function name was inspired by the SQL clause.

 ```{r}
 diamonds_db |> 
@@ -482,10 +501,10 @@ As dbplyr improves over time, these cases will get rarer but will probably never
 ### Joins

 If you're familiar with dplyr's joins, SQL joins are very similar.
-Unfortunately, dbplyr's current translations are rather verbose[^import-databases-8].
+Unfortunately, dbplyr's current translations are rather verbose[^import-databases-7].
 Here's a simple example:

-[^import-databases-8]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃
+[^import-databases-7]: We're working on doing better in the future, so if you're lucky it'll be better by the time you're reading this 😃

 ```{r}
 flights |>