Sometimes it's possible to get someone to download a snapshot into a .csv for you, but this is generally not desirable as the iteration speed is very slow.
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and how to retrieve data by executing an SQL query.
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for you to learn as a data scientist.
However, we're not going to start with SQL, but instead we'll teach you dbplyr, which can convert your dplyr code to the equivalent SQL.
We'll use that as a way to teach you some of the most important features of SQL.
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
The main focus of this chapter is working with data that already exists, data that someone else has collected in a database for you, as this represents the most common case.
But as we go along, we will point out a few tips and tricks for getting your own data into a database.
- You'll always use DBI (**d**ata**b**ase **i**nterface) because it provides a set of generic functions that connect to the database, upload data, run queries, and so on.
If you can't find a specific package for your DBMS, you can usually use the generic odbc package instead.
This uses the widespread ODBC standard.
odbc requires a little more setup because you'll also need to install and configure an ODBC driver.
Concretely, you create a database connection using `DBI::dbConnect()`.
The first argument specifies the DBMS and the second and subsequent arguments describe where the database lives and any credentials that you'll need to access it.
The following code shows a few typical examples:
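As a minimal, self-contained sketch of this pattern (assuming the DBI and duckdb packages are installed), the following connects to a temporary DuckDB database. The first argument picks the DBMS; a client-server DBMS would typically also need arguments such as a host name and credentials, and their names vary by backend package.

```r
# First argument: the DBMS, supplied by the backend package (duckdb here).
# With no further arguments, duckdb creates a temporary in-memory database
# that disappears when the connection is closed.
con <- DBI::dbConnect(duckdb::duckdb())

# Check that the connection is usable before doing any work with it.
connected <- DBI::dbIsValid(con)

# Always disconnect when you're done.
DBI::dbDisconnect(con, shutdown = TRUE)
```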
If you want to use duckdb for a real data analysis project[^import-databases-1], you'll also need to supply the `dbdir` argument to tell duckdb where to store the database files.
Assuming you're using a project (Chapter -@sec-workflow-scripts-projects), it's reasonable to store it in the `duckdb` directory of the current project:
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
```
duckdb is a high-performance database that's designed very much with the needs of the data scientist in mind, and the developers very much understand R and the types of real problems that R users face.
As you'll see in this chapter, it's really easy to get started with but it can also handle very large datasets.
Notice something important with the diamonds dataset: the `cut`, `color`, and `clarity` columns were originally ordered factors, but now they're regular factors.
This particular case isn't very important since ordered factors are barely different to regular factors, but it's good to know that the way that the database represents data can be slightly different to the way R represents data.
In this case, we're actually quite lucky because most databases don't support factors at all and would've converted the column to a string.
Again, not that important, because most of the time you'll be working with data that lives in a database, but good to be aware of if you're storing your own data into a database.
Generally you can expect numbers, strings, dates, and date-times to convert just fine, but other types may not.
-->
```
In real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to use the database to bring back only a subset of the rows and columns.
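To make the contrast concrete, here is a small sketch that assumes an in-memory DuckDB database holding a copy of the built-in `mtcars` data (a stand-in for a genuinely large table). `dbReadTable()` brings everything into R, while a query lets the database return only the rows and columns you ask for.

```r
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)

# Reading the whole table pulls every row and column into R.
everything <- DBI::dbReadTable(con, "mtcars")

# A query lets the database do the filtering, so only the rows and
# columns you need are transferred to R.
four_cyl <- DBI::dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE cyl = 4")

DBI::dbDisconnect(con, shutdown = TRUE)
```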
We won't discuss it further here, but if you're dealing with very large datasets it's possible to deal with a "page" of data at a time by using `dbSendQuery()` to get a "result set" which you can page through by calling `dbFetch()` until `dbHasCompleted()` returns `TRUE`.
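For reference, the paging pattern can be sketched as follows. This assumes an in-memory DuckDB database with a copy of `mtcars`, and a page size of 10 rows chosen purely for illustration.

```r
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)

# dbSendQuery() starts the query and returns a result set, not the data.
res <- DBI::dbSendQuery(con, "SELECT mpg, cyl FROM mtcars")

rows_seen <- 0
while (!DBI::dbHasCompleted(res)) {
  # Fetch (at most) the next 10 rows of the result set.
  page <- DBI::dbFetch(res, n = 10)
  # Process the page here; this sketch only counts the rows.
  rows_seen <- rows_seen + nrow(page)
}

# Release the result set before closing the connection.
DBI::dbClearResult(res)
DBI::dbDisconnect(con, shutdown = TRUE)
```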
There are lots of other functions in DBI that you might find useful if you're managing your own data (like `dbWriteTable()` which we used in @sec-load-data), but we're going to skip past them in the interests of staying focused on working with data that already lives in a database.
Now that you've learned the low-level basics for connecting to a database and running a query, we're going to switch it up a bit and learn a bit about dbplyr.
dbplyr is a dplyr **backend**, which means that you write the dplyr code that you're already familiar with and dbplyr translates it to run in a different way, in this case to SQL.
To use dbplyr you start by creating a `tbl()`: this creates something that looks like a tibble, but is really a reference to a table in a database[^import-databases-3]:
[^import-databases-3]: If you want to mix SQL and dbplyr, you can also create a tbl from a SQL query with `tbl(con, SQL("SELECT * FROM foo"))`.
You can tell it's a database query because it prints the database name at the top, and typically it won't be able to tell you the total number of rows.
This is because finding the total number of rows is often an expensive computation for a database.
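A minimal sketch of this workflow, assuming an in-memory DuckDB database that has been loaded with `ggplot2::diamonds` (the table name `"diamonds"` is an assumption of this sketch):

```r
library(dplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "diamonds", ggplot2::diamonds)

# A reference to the table in the database, not a copy of its data.
diamonds_db <- tbl(con, "diamonds")

# dplyr verbs are recorded lazily and translated to SQL by dbplyr.
big <- diamonds_db |>
  filter(price > 15000) |>
  select(carat, price)

# collect() runs the query and brings the result back as a tibble.
big_local <- collect(big)

DBI::dbDisconnect(con, shutdown = TRUE)
```

Until `collect()` is called, no rows are pulled into R, which is why the printed object can't cheaply report how many rows it has.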
This SQL is a little different to what you might write by hand: dbplyr quotes every variable name and may include parentheses when they're not absolutely needed.
If you were to write this by hand, you'd probably do:
The basic unit of composition in SQL is not a function, but a **statement**.
Common statements include `INSERT` for adding new data, `CREATE` for making new tables, `UPDATE` for modifying data, and `SELECT` for retrieving data.
Unlike R, SQL is (mostly) case insensitive, but by convention the clauses are usually capitalized, like `SELECT`, `FROM`, and `WHERE` above, to make them stand out.
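A quick way to convince yourself of this, sketched against an in-memory DuckDB copy of `mtcars`: the same query written in upper and lower case returns the same result. (The "mostly" matters: the contents of strings are still compared case sensitively.)

```r
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)

# The same query, once with conventional capitalization and once without.
upper <- DBI::dbGetQuery(con, "SELECT mpg FROM mtcars WHERE cyl = 4 ORDER BY mpg")
lower <- DBI::dbGetQuery(con, "select mpg from mtcars where cyl = 4 order by mpg")

same_result <- identical(upper, lower)

DBI::dbDisconnect(con, shutdown = TRUE)
```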
We're going to focus on `SELECT` statements because they are almost exclusively what you'll use as a data scientist.
The other statements will be handled by someone else; in the case that you need to update your own database, you can solve most problems with `dbWriteTable()` and/or `dbAppendTable()`.
In fact, as a data scientist in most cases you won't even be able to run these statements because you only have read-only access to the database.
This ensures that there's no way for you to accidentally mess things up.
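If you do manage your own database, the two DBI functions cover the common cases without writing `CREATE` or `INSERT` by hand. A small sketch, assuming an in-memory DuckDB database and a made-up `scores` table:

```r
con <- DBI::dbConnect(duckdb::duckdb())

# dbWriteTable() creates the table and fills it from a data frame.
DBI::dbWriteTable(con, "scores", data.frame(name = c("a", "b"), score = c(1, 2)))

# dbAppendTable() adds rows to a table that already exists.
DBI::dbAppendTable(con, "scores", data.frame(name = c("c", "d"), score = c(3, 4)))

n_rows <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM scores")$n

DBI::dbDisconnect(con, shutdown = TRUE)
```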
A `SELECT` statement is often called a query, and a query is made up of clauses.
`SELECT` is the workhorse of SQL queries, and is used for `select()`, `mutate()`, `rename()`, and `relocate()`.
In the next section, you'll see that `SELECT` is *also* used for `summarize()` when paired with `GROUP BY`.
`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT` --- they just change the number and order of the variables, renaming where necessary with `AS`.
Unlike R, the old name is on the left and the new name is on the right.
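You can see the `AS` translation without a real database by using a simulated table. This sketch assumes `dbplyr::lazy_frame()` with a simulated Postgres connection; the column names are illustrative, and other backends may quote identifiers differently.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A simulated table: no database is needed to inspect the generated SQL.
lf <- lazy_frame(carat = 1, price = 2, con = simulate_postgres())

# In dplyr the new name is on the left: rename(new = old).
renamed_sql <- lf |>
  rename(cost = price) |>
  sql_render() |>
  as.character()

# In the SQL the old name comes first: old AS new.
cat(renamed_sql)
```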
Sometimes it's not possible to express what you want in a single query.
For example, `SELECT` can only refer to columns that exist in the `FROM`, not columns that you have just created.
So if you modify a column that you just created, dbplyr will need to create a subquery:
```{r}
diamonds_db |>
  select(carat) |>
  mutate(
    carat2 = carat + 2,
    carat3 = carat2 + 1
  ) |>
  show_query()
```
A subquery is just a query that's nested inside of `FROM`, so instead of a table being used as the source, the new query is.
Another similar restriction is that `WHERE`, like `SELECT`, can only operate on variables in `FROM`, so if you try and filter based on a variable that you just created, you'll need to create a subquery.
```{r}
diamonds_db |>
  select(carat) |>
  mutate(carat2 = carat + 2) |>
  filter(carat2 > 1) |>
  show_query()
```
Sometimes dbplyr uses a subquery where strictly speaking it's not necessary.
For example, take this pipeline that filters on a summary value:
```{r}
diamonds_db |>
  group_by(cut) |>
  summarise(
    n = n(),
    avg_price = mean(price)
  ) |>
  filter(n > 10) |>
  show_query()
```
In this case it's possible to use the special `HAVING` clause.
This works the same way as `WHERE` except that it's applied *after* the aggregates have been computed, not before.
``` sql
SELECT "cut", COUNT(*) AS "n", AVG("price") AS "avg_price"
FROM "diamonds"
GROUP BY "cut"
HAVING COUNT(*) > 10
```

```{r}
flights |> inner_join(planes, by = "tailnum") |> show_query()
flights |> left_join(planes, by = "tailnum") |> show_query()
flights |> full_join(planes, by = "tailnum") |> show_query()
```
### Semi and anti-joins
SQL's syntax for semi- and anti-joins is a bit arcane.
I don't remember it, and just look it up whenever I need to write one by hand.
```{r}
flights |> semi_join(planes, by = "tailnum") |> show_query()
flights |> anti_join(planes, by = "tailnum") |> show_query()
```
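If you ever need to write these by hand, one common form uses a correlated `WHERE EXISTS` (semi-join) or `WHERE NOT EXISTS` (anti-join) subquery. The sketch below runs both against an in-memory DuckDB database; the tiny `flights` and `planes` tables are made up for illustration, and the SQL is one hand-written variant rather than a copy of what dbplyr generates.

```r
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "flights", data.frame(
  flight = 1:4,
  tailnum = c("N1", "N2", "N3", "N2")
))
DBI::dbWriteTable(con, "planes", data.frame(
  tailnum = c("N1", "N2"),
  year = c(2001, 2005)
))

# Semi-join: keep flights that have at least one matching plane.
semi <- DBI::dbGetQuery(con, "
  SELECT flights.*
  FROM flights
  WHERE EXISTS (
    SELECT 1 FROM planes WHERE flights.tailnum = planes.tailnum
  )
  ORDER BY flight
")

# Anti-join: keep flights that have no matching plane.
anti <- DBI::dbGetQuery(con, "
  SELECT flights.*
  FROM flights
  WHERE NOT EXISTS (
    SELECT 1 FROM planes WHERE flights.tailnum = planes.tailnum
  )
  ORDER BY flight
")

DBI::dbDisconnect(con, shutdown = TRUE)
```

Note that neither query adds columns from `planes`; the subquery is only used to decide which rows of `flights` to keep.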
### Temporary data
Sometimes it's useful to perform a join or semi/anti join with data that you have locally.
How can you get that data into the database?
There are a few ways to do so.
You can set `copy = TRUE` in the join call to automatically copy the local data frame to the database.
There are two other ways that give you a little more control:

- `copy_to()`: this works very similarly to `DBI::dbWriteTable()` but returns a `tbl` so you don't need to create one after the fact.
  By default this creates a temporary table, which will only be visible to the current connection (not to other people using the database), and will automatically be deleted when the connection finishes.
  Most databases will allow you to create temporary tables, even if you don't otherwise have write access to the data.

- `copy_inline()`: a more recent addition to dbplyr.
  Rather than copying the data to the database, it builds SQL that generates the data inline.
  It's useful if you don't have permission to create temporary tables, and is faster than `copy_to()` for small datasets.
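The three approaches can be sketched side by side. This assumes an in-memory DuckDB database; the `flights` table, the local `wanted` data frame, and the temporary table name `wanted_tmp` are all made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "flights", data.frame(
  tailnum = c("N1", "N2", "N3"),
  dep_delay = c(5, 10, 15)
))
flights_db <- tbl(con, "flights")

# Local data that we want to use to filter the remote table.
wanted <- data.frame(tailnum = c("N1", "N3"))

# 1. Let the join copy the local data for you.
via_copy <- flights_db |>
  semi_join(wanted, by = "tailnum", copy = TRUE) |>
  arrange(tailnum) |>
  collect()

# 2. Copy explicitly with copy_to(); the table is temporary by default.
wanted_tmp <- copy_to(con, wanted, name = "wanted_tmp")
via_temp <- flights_db |>
  semi_join(wanted_tmp, by = "tailnum") |>
  arrange(tailnum) |>
  collect()

# 3. copy_inline() writes the values into the SQL itself, so no table
#    is created; here we only render the SQL to see what it looks like.
inline_sql <- copy_inline(con, wanted) |>
  sql_render() |>
  as.character()

DBI::dbDisconnect(con, shutdown = TRUE)
```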
Now that you understand the big picture of a SQL query and the equivalence between the SELECT clauses and dplyr verbs, it's time to look more at the details of the conversion of the individual expressions, i.e. what happens when you use `mean(x)` in a `summarize()`?
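As a first taste, you can ask dbplyr what it does with `mean()` by using a simulated table. This sketch assumes `dbplyr::lazy_frame()` with a simulated Postgres connection; the column names `g` and `x` are placeholders.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

lf <- lazy_frame(g = "a", x = 1, con = simulate_postgres())

summary_sql <- lf |>
  group_by(g) |>
  summarize(avg_x = mean(x, na.rm = TRUE)) |>
  sql_render() |>
  as.character()

# mean() becomes SQL's AVG(), and group_by() becomes GROUP BY.
cat(summary_sql)
```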
- In R, strings are surrounded by `"` or `'` and variable names (if needed) use `` ` ``. In SQL, strings only use `'` and most databases use `"` for variable names.
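The quoting difference is easy to see in generated SQL. The sketch below again assumes a simulated Postgres connection, which follows the common convention of `"` for names and `'` for strings; the columns are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

lf <- lazy_frame(name = "Hadley", age = 1, con = simulate_postgres())

# In R, the string is written with double quotes ...
quoted_sql <- lf |>
  filter(name == "Hadley") |>
  sql_render() |>
  as.character()

# ... but in the SQL the string uses single quotes and the
# column name uses double quotes.
cat(quoted_sql)
```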
Note that every database uses a slightly different dialect of SQL.
For the vast majority of simple examples in this chapter, you won't see any differences.
But as you start to write more complex SQL you'll discover that what works on one database might not work on another.
Fortunately, dbplyr will take care of a lot of this for you, as it automatically varies the SQL that it generates based on the database you're using.
It's not perfect, but if you discover that dbplyr creates SQL that works on one database but not another, please file an issue so we can try to make it better.
If you just want to see the SQL dbplyr generates for different databases, you can create a special simulated data frame.
This is mostly useful for the developers of dbplyr, but it also gives you an easy way to experiment with SQL variants.
```{r}
lf1 <- dbplyr::lazy_frame(name = "Hadley", con = dbplyr::simulate_oracle())
lf2 <- dbplyr::lazy_frame(name = "Hadley", con = dbplyr::simulate_postgres())