Update databases.qmd (#1068)
Fixed a bunch of errors/typos. I am so glad the second version of the book provides such a nicely written chapter on database.
parent d080f3279c
commit 5ac3dac6bd
@@ -91,7 +91,7 @@ con <- DBI::dbConnect(
 )
 ```
 
-The precise details of the connection varies a lot from DBMS to DBMS so unfortunately we can't cover all the details here.
+The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can't cover all the details here.
 This means you'll need to do a little research on your own.
 Typically you can ask the other data scientists in your team or talk to your DBA (**d**ata**b**ase **a**dministrator).
 The initial setup will often take a little fiddling (and maybe some googling) to get right, but you'll generally only need to do it once.
@@ -112,7 +112,7 @@ con <- DBI::dbConnect(duckdb::duckdb())
 duckdb is a high-performance database that's designed very much for the needs of a data scientist.
 We use it here because it's very to easy to get started with, but it's also capable of handling gigabytes of data with great speed.
 If you want to use duckdb for a real data analysis project, you'll also need to supply the `dbdir` argument to make a persistent database and tell duckdb where to save it.
-Assuming you're using a project (Chapter -@sec-workflow-scripts-projects)), it's reasonable to store it in the `duckdb` directory of the current project:
+Assuming you're using a project (@sec-workflow-scripts-projects), it's reasonable to store it in the `duckdb` directory of the current project:
 
 ```{r}
 #| eval: false
@@ -122,7 +122,7 @@ con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
 ### Load some data {#sec-load-data}
 
 Since this is a new database, we need to start by adding some data.
-Here we'll use add `mpg` and `diamonds` datasets from ggplot2 using `DBI::dbWriteTable()`.
+Here we'll add `mpg` and `diamonds` datasets from ggplot2 using `DBI::dbWriteTable()`.
 The simplest usage of `dbWriteTable()` needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.
 
 ```{r}
@@ -131,11 +131,11 @@ dbWriteTable(con, "diamonds", ggplot2::diamonds)
 ```
 
 If you're using duckdb in a real project, we highly recommend learning about `duckdb_read_csv()` and `duckdb_register_arrow()`.
-These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it in to R.
+These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
 
 ## DBI basics
 
-Now that we've connected to a database with some data in it, lets perform some basic operations with DBI.
+Now that we've connected to a database with some data in it, let's perform some basic operations with DBI.
 
 ### What's there?
 
@@ -201,8 +201,8 @@ diamonds_db
 ```
 
 ::: callout-note
-There are two other common way to a database.
-First, many corporate databases are very large so need some hierarchy to keep all the tables organised.
+There are two other common ways to interact with a database.
+First, many corporate databases are very large so you need some hierarchy to keep all the tables organised.
 In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you're interested in:
 
 ```{r}
@@ -233,7 +233,7 @@ big_diamonds_db
 You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn't know the number of rows.
 This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
 
-You can see the SQL the dbplyr generates by a dbplyr query by calling `show_query()`:
+You can see the SQL code generated by the dbplyr function `show_query()`:
 
 ```{r}
 big_diamonds_db |>
@@ -259,7 +259,7 @@ It's a rather non-traditional introduction to SQL but we hope it will get you qu
 Luckily, if you understand dplyr you're in a great place to quickly pick up SQL because so many of the concepts are the same.
 
 We'll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: `flights` and `planes`.
-These dataset are easy to get into our learning database because dbplyr has a function designed for this exact scenario:
+These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:
 
 ```{r}
 dbplyr::copy_nycflights13(con)
@@ -280,7 +280,7 @@ We will on focus on `SELECT` statements, also called **queries**, because they a
 
 A query is made up of **clauses**.
 There are five important clauses: `SELECT`, `FROM`, `WHERE`, `ORDER BY`, and `GROUP BY`. Every query must have the `SELECT`[^databases-4] and `FROM`[^databases-5] clauses and the simplest query is `SELECT * FROM table`, which selects all columns from the specified table
-. This is what dplyr generates for an unadulterated table
+. This is what dbplyr generates for an unadulterated table
 :
 
 [^databases-4]: Confusingly, depending on the context, `SELECT` is either a statement or a clause.
@@ -350,7 +350,7 @@ planes |>
 
 This example also shows you how SQL does renaming.
 In SQL terminology renaming is called **aliasing** and is done with `AS`.
-Note that unlike with `mutate()`, the old name is on the left and the new name is on the right.
+Note that unlike `mutate()`, the old name is on the left and the new name is on the right.
 
 ::: callout-note
 In the examples above note that `"year"` and `"type"` are wrapped in double quotes.
@@ -578,7 +578,7 @@ So far we've focused on the big picture of how dplyr verbs are translated to the
 Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?
 
 To help see what's going on, we'll use a couple of little helper functions that run a `summarise()` or `mutate()` and show the generated SQL.
-That will make it a easier to explore a few variations and see how summaries and transformations can differ.
+That will make it a little easier to explore a few variations and see how summaries and transformations can differ.
 
 ```{r}
 summarize_query <- function(df, ...) {