Sometimes it's possible to get someone to download a snapshot into a .csv for you, but this is generally not desirable as the iteration speed is very slow.
In this chapter, you'll first learn the basics of the DBI package: how to use it to connect to a database and how to retrieve data by executing an SQL query.
**SQL**, short for **s**tructured **q**uery **l**anguage, is the lingua franca of databases, and is an important language for you to learn as a data scientist.
However, we're not going to start with SQL, but instead we'll teach you dbplyr, which can convert your dplyr code to the equivalent SQL.
We'll use that as a way to teach you some of the most important features of SQL.
You won't become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.
The main focus of this chapter is working with data that already exists, data that someone else has collected in a database for you, as this represents the most common case.
But as we go along, we will point out a few tips and tricks for getting your own data into a database.
- You'll always use DBI (**d**ata**b**ase **i**nterface) because it provides a set of generic functions that connect to the database, upload data, run queries, and so on.
If you can't find a specific package for your DBMS, you can usually use the generic odbc package instead.
This uses the widespread ODBC standard.
odbc requires a little more setup because you'll also need to install and configure an ODBC driver.
Concretely, you create a database connection using `DBI::dbConnect()`.
The first argument specifies the DBMS and the second and subsequent arguments describe where the database lives and any credentials that you'll need to access it.
The following code shows a few typical examples:
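```r
# Sketches only: the package names are real, but the server details and
# credentials are placeholders that your database administrator would supply
con <- DBI::dbConnect(
  RMariaDB::MariaDB(),
  username = "foo"
)

con <- DBI::dbConnect(
  RPostgres::Postgres(),
  host = "databases.mycompany.com",
  port = 1234
)
```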
If you want to use duckdb for a real data analysis project[^import-databases-1], you'll also need to supply the `dbdir` argument to tell duckdb where to store the database files.
Assuming you're using a project (Chapter -@sec-workflow-scripts-projects), it's reasonable to store it in the `duckdb` directory of the current project:
```r
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
```
duckdb is a high-performance database that's designed very much with the needs of the data scientist in mind, and the developers understand R and the types of real problems that R users face.
As you'll see in this chapter, it's really easy to get started with, but it can also handle very large datasets.
Notice something important with the diamonds dataset: the `cut`, `color`, and `clarity` columns were originally ordered factors, but now they're regular factors.
This particular case isn't very important since ordered factors are barely different from regular factors, but it's good to know that the way that the database represents data can be slightly different from the way R represents data.
In this case, we're actually quite lucky because most databases don't support factors at all and would've converted the column to a string.
Again, not that important, because most of the time you'll be working with data that lives in a database, but good to be aware of if you're storing your own data into a database.
Generally you can expect numbers, strings, dates, and date-times to convert just fine, but other types may not.
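For example, assuming the connection and `diamonds` table from earlier, a quick check might look like this (a sketch; exactly what comes back depends on your DBMS):

```r
db_diamonds <- DBI::dbReadTable(con, "diamonds")

class(ggplot2::diamonds$cut)
#> [1] "ordered" "factor"
class(db_diamonds$cut)  # with duckdb, a plain factor; other DBMSs may return character
```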
In real life, it's rare that you'll use `dbReadTable()` because the whole reason you're using a database is that there's too much data to fit in a data frame, and you want to use the database to bring back only a subset of the rows and columns.
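Instead, you'll usually retrieve a subset with `DBI::dbGetQuery()`. A sketch, again assuming the `diamonds` table from earlier:

```r
sql <- "
  SELECT carat, cut, clarity, color, price
  FROM diamonds
  WHERE price > 15000
"
DBI::dbGetQuery(con, sql)
```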
We won't discuss it further here, but if you're dealing with very large datasets it's possible to deal with a "page" of data at a time by using `dbSendQuery()` to get a "result set" which you can page through by calling `dbFetch()` until `dbHasCompleted()` returns `TRUE`.
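A minimal sketch of that pattern:

```r
res <- DBI::dbSendQuery(con, "SELECT carat, price FROM diamonds")
while (!DBI::dbHasCompleted(res)) {
  page <- DBI::dbFetch(res, n = 10000)  # fetch up to 10,000 rows at a time
  # ... process `page` here ...
}
DBI::dbClearResult(res)  # release the result set when you're done
```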
There are lots of other functions in DBI that you might find useful if you're managing your own data (like `dbWriteTable()` which we used in @sec-load-data), but we're going to skip past them in the interests of staying focused on working with data that already lives in a database.
Now that you've learned the low-level basics for connecting to a database and running a query, we're going to switch it up a bit and learn about dbplyr.
dbplyr is a dplyr **backend**, which means that you write the dplyr code that you're already familiar with and dbplyr translates it to run in a different way, in this case to SQL.
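To use dbplyr, you first create an object that represents a database table; a sketch using the `diamonds` table from earlier:

```r
library(dplyr)

diamonds_db <- tbl(con, "diamonds")
diamonds_db
```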
This object is **lazy**; when you use dplyr verbs on it, dplyr doesn't do any work: it just records the sequence of operations that you want to perform and only performs them when needed.
You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically won't tell you the number of rows.
This is because finding the total number of rows usually requires executing the complete query, something we're trying to avoid.
You can see the SQL generated by a dbplyr query by calling `show_query()`:
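```r
# Continuing the sketch above; the exact formatting of the generated SQL
# varies a little across dbplyr versions and DBMSs
diamonds_db |>
  filter(price > 15000) |>
  select(carat:clarity, price) |>
  show_query()
#> roughly: SELECT carat, cut, color, clarity, price
#>          FROM diamonds
#>          WHERE (price > 15000.0)
```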
It's an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you're likely to encounter in real organizations.
We're going to focus on `SELECT` statements, which are commonly called **queries**, because they are almost exclusively what you'll use as a data scientist.
Every query must have two clauses: `SELECT`[^import-databases-4] and `FROM`[^import-databases-5]. The simplest query is `SELECT * FROM tablename`, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:
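```r
diamonds_db |> show_query()
#> <SQL>
#> SELECT *
#> FROM diamonds
```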
There are three other important clauses: `WHERE`, `ORDER BY`, and `GROUP BY`. `WHERE` and `ORDER BY` control which rows are included and how they are ordered:
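```r
# A sketch: filter() becomes WHERE and arrange() becomes ORDER BY
diamonds_db |>
  filter(clarity == "IF") |>
  arrange(desc(price)) |>
  show_query()
#> roughly: SELECT * FROM diamonds
#>          WHERE (clarity = 'IF')
#>          ORDER BY price DESC
```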
- In SQL, case doesn't matter: you can write `select`, `SELECT`, or even `SeLeCt`. In this book we'll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.
- In SQL, order matters: you must always write the clauses in the order `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`. Confusingly, this order doesn't match how the clauses are actually evaluated, which is first `FROM`, then `WHERE`, `GROUP BY`, `SELECT`, and `ORDER BY`.
Note that while SQL is a standard, it is an extremely complex standard and no database follows it exactly.
This means that while the main components that we'll focus on in this book are very similar between DBMSs, there are a lot of minor variations.
Fortunately, dbplyr knows about this problem and generates different translations for different databases.
It's not perfect, but it's continually improving, and if you hit a problem you can file an issue [on GitHub](https://github.com/tidyverse/dbplyr/issues/) to help us improve it.
The `SELECT` clause is the workhorse of queries, and is equivalent to `select()`, `mutate()`, `rename()`, `relocate()`, and, as you'll learn in the next section, `summarize()`.
`select()`, `rename()`, and `relocate()` have very direct translations to `SELECT` as they affect where a column appears (if at all) along with its name:
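```r
# A sketch: select() picks the columns, rename() becomes AS, and
# relocate() just changes the order in which columns are listed
diamonds_db |>
  select(carat, color, price) |>
  rename(colour = color) |>
  relocate(price) |>
  show_query()
#> roughly: SELECT price, carat, color AS colour FROM diamonds
```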
If you want to learn more about how NULLs work, I recommend "[*Three valued logic*](https://modern-sql.com/concept/three-valued-logic)" by Markus Winand.
dbplyr typically uses subqueries to work around limitations of SQL.
For example, expressions in the `SELECT` clause can't refer to columns that were just created.
That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes `year1` and then the second (outer) query can compute `year2`:
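```r
# A sketch assuming a `flights` table with a `year` column has been
# copied into the database, e.g. with
# dplyr::copy_to(con, nycflights13::flights, "flights")
flights_db <- tbl(con, "flights")
flights_db |>
  mutate(
    year1 = year + 1,
    year2 = year1 + 1
  ) |>
  show_query()
#> roughly: SELECT *, year1 + 1.0 AS year2
#>          FROM (SELECT *, year + 1.0 AS year1 FROM flights) q01
```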
The main thing to notice here is the syntax: SQL joins use sub-clauses of the `FROM` clause to bring in additional tables, using `ON` to define how the tables are related.
dplyr's names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for `inner_join()`, `right_join()`, and `full_join()`:
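For example, here's a sketch with two tiny hypothetical tables copied into the database just for illustration:

```r
x_db <- copy_to(con, data.frame(id = 1:3, x_val = c("a", "b", "c")), "x")
y_db <- copy_to(con, data.frame(id = 2:4, y_val = c("d", "e", "f")), "y")

x_db |>
  inner_join(y_db, by = "id") |>
  show_query()
#> roughly: SELECT x.id, x_val, y_val
#>          FROM x
#>          INNER JOIN y ON (x.id = y.id)
```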
When you're working with data from a database, you're likely to need many more joins than with data from other sources.
That's because database tables are often stored in a highly normalized form, where each "fact" is stored in a single place.
Typically, this involves a complex network of tables connected by primary and foreign keys.
If you hit this scenario, the [dm package](https://cynkra.github.io/dm/), by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, can be a life saver.
It can automatically determine the connections between tables in a database using the constraints the DBAs often supply, automatically visualize them so you can see what's going on, and automatically generate the joins you need to connect one table to another.
dbplyr also translates other verbs like `distinct()`, `slice_*()`, and `intersect()`, and a growing selection of tidyr functions like `pivot_longer()` and `pivot_wider()`.
Now we're going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use `mean(x)` in a `summarize()`?
You can generally trust dbplyr's translations, but again it's a good way to learn a bit more about SQL.
To help you see what's going on, I'm going to make a couple of little helper functions that run a `summarize()` or `mutate()` and show the generated SQL.
That'll make it a little easier to explore a few variations and see how summaries and transformations can differ.
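Here's one way to write them (a small sketch; the names `summarize_query()` and `mutate_query()` are just conveniences, not part of dbplyr):

```r
summarize_query <- function(df, ...) {
  df |>
    summarize(...) |>
    show_query()
}

mutate_query <- function(df, ...) {
  df |>
    mutate(...) |>
    show_query()
}
```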
Moving back to regular transformation, another really important SQL function is `CASE WHEN`. It's used for `if_else()` and it also inspired dplyr's `case_when()` function.
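For example, using the `mutate_query()` helper from above on `diamonds_db`:

```r
diamonds_db |>
  mutate_query(size = if_else(carat > 1, "big", "small"))
#> the if_else() translates to roughly:
#> CASE WHEN (carat > 1.0) THEN 'big' ELSE 'small' END
```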
dbplyr also translates common string and date-time manipulation functions, which you can learn about in `vignette("translation-function", package = "dbplyr")`.
dbplyr's translations are certainly not perfect, and there are many R functions that aren't translated yet, but dbplyr does a surprisingly good job covering the functions that you'll use most of the time.