Fix code language
@@ -11,7 +11,7 @@ Introduction</h1>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(DBI)
|
||||
<pre data-type="programlisting" data-code-language="r">library(DBI)
|
||||
library(dbplyr)
|
||||
library(tidyverse)</pre>
|
||||
</div>
|
||||
@@ -43,7 +43,7 @@ Connecting to a database</h1>
|
||||
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMS. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
|
||||
<p>Concretely, you create a database connection using <code><a href="https://dbi.r-dbi.org/reference/dbConnect.html">DBI::dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(
|
||||
RMariaDB::MariaDB(),
|
||||
username = "foo"
|
||||
)
|
||||
@@ -61,11 +61,11 @@ In this book</h2>
|
||||
<p>Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code as well as easily take what you learn and apply it elsewhere.</p>
|
||||
<p>Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(duckdb::duckdb())</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb())</pre>
|
||||
</div>
|
||||
<p>duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the <code>dbdir</code> argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (<a href="#chp-workflow-scripts" data-type="xref">#chp-workflow-scripts</a>), it’s reasonable to store it in the <code>duckdb</code> directory of the current project:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")</pre>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
@@ -74,7 +74,7 @@ In this book</h2>
|
||||
Load some data</h2>
|
||||
<p>Since this is a new database, we need to start by adding some data. Here we’ll add the <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">DBI::dbWriteTable()</a></code>. The simplest usage of <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dbWriteTable(con, "mpg", ggplot2::mpg)
|
||||
<pre data-type="programlisting" data-code-language="r">dbWriteTable(con, "mpg", ggplot2::mpg)
|
||||
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
|
||||
</div>
|
||||
<p>If you’re using duckdb in a real project, we highly recommend learning about <code>duckdb_read_csv()</code> and <code>duckdb_register_arrow()</code>. These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.</p>
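<p>As a minimal sketch of the first of these functions, the following code writes a small CSV file and loads it with <code>duckdb::duckdb_read_csv()</code>. The table name <code>cars</code> and the temporary file are illustrative assumptions rather than part of the chapter’s running example, and the sketch opens its own connection so that it can be run in isolation:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

# A fresh in-memory database, separate from the chapter's `con`.
con_csv <- dbConnect(duckdb::duckdb())

# Create a small CSV file so the example is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(mtcars, path, row.names = FALSE)

# Load the file into a new table named "cars" (a hypothetical name).
duckdb::duckdb_read_csv(con_csv, "cars", path)

# Confirm that the table now exists, then clean up.
dbExistsTable(con_csv, "cars")
dbDisconnect(con_csv, shutdown = TRUE)</pre>
</div>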
|
||||
@@ -92,7 +92,7 @@ DBI basics</h1>
|
||||
What’s there?</h2>
|
||||
<p>The most important database objects for data scientists are tables. DBI provides two useful functions to either list all the tables in the database<span data-type="footnote">At least, all the tables that you have permission to see.</span> or to check if a specific table already exists:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dbListTables(con)
|
||||
<pre data-type="programlisting" data-code-language="r">dbListTables(con)
|
||||
#> [1] "diamonds" "mpg"
|
||||
dbExistsTable(con, "foo")
|
||||
#> [1] FALSE</pre>
|
||||
@@ -104,7 +104,7 @@ dbExistsTable(con, "foo")
|
||||
Extract some data</h2>
|
||||
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">con |>
|
||||
<pre data-type="programlisting" data-code-language="r">con |>
|
||||
dbReadTable("diamonds") |>
|
||||
as_tibble()
|
||||
#> # A tibble: 53,940 × 10
|
||||
@@ -127,7 +127,7 @@ Extract some data</h2>
|
||||
Run a query</h2>
|
||||
<p>The way you’ll usually retrieve data is with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">sql <- "
|
||||
<pre data-type="programlisting" data-code-language="r">sql <- "
|
||||
SELECT carat, cut, clarity, color, price
|
||||
FROM diamonds
|
||||
WHERE price > 15000
|
||||
@@ -161,7 +161,7 @@ dbplyr basics</h1>
|
||||
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="https://dtplyr.tidyverse.org">dtplyr</a> which translates to <a href="https://r-datatable.com">data.table</a>, and <a href="https://multidplyr.tidyverse.org">multidplyr</a> which executes your code on multiple cores.</p>
|
||||
<p>To use dbplyr, you must first use <code><a href="https://dplyr.tidyverse.org/reference/tbl.html">tbl()</a></code> to create an object that represents a database table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, "diamonds")
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, "diamonds")
|
||||
diamonds_db
|
||||
#> # Source: table<diamonds> [?? x 10]
|
||||
#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
|
||||
@@ -183,10 +183,10 @@ diamonds_db
|
||||
</div>
|
||||
|
||||
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
|
||||
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
|
||||
@@ -197,7 +197,7 @@ FROM `planes`</pre></div>
|
||||
|
||||
<p>This object is <strong>lazy</strong>; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db <- diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">big_diamonds_db <- diamonds_db |>
|
||||
filter(price > 15000) |>
|
||||
select(carat:clarity, price)
|
||||
|
||||
@@ -217,7 +217,7 @@ big_diamonds_db
|
||||
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
|
||||
<p>You can see the SQL code generated by the dbplyr function <code><a href="https://dplyr.tidyverse.org/reference/explain.html">show_query()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">big_diamonds_db |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
#> SELECT carat, cut, color, clarity, price
|
||||
@@ -226,7 +226,7 @@ big_diamonds_db
|
||||
</div>
|
||||
<p>To get all the data back into R, you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">big_diamonds <- big_diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">big_diamonds <- big_diamonds_db |>
|
||||
collect()
|
||||
big_diamonds
|
||||
#> # A tibble: 1,655 × 5
|
||||
@@ -249,7 +249,7 @@ SQL</h1>
|
||||
<p>The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.</p>
|
||||
<p>We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: <code>flights</code> and <code>planes</code>. These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">dbplyr::copy_nycflights13(con)
|
||||
<pre data-type="programlisting" data-code-language="r">dbplyr::copy_nycflights13(con)
|
||||
#> Creating table: airlines
|
||||
#> Creating table: airports
|
||||
#> Creating table: flights
|
||||
@@ -268,7 +268,7 @@ SQL basics</h2>
|
||||
<p>The top-level components of SQL are called <strong>statements</strong>. Common statements include <code>CREATE</code> for defining new tables, <code>INSERT</code> for adding data, and <code>SELECT</code> for retrieving data. We will focus on <code>SELECT</code> statements, also called <strong>queries</strong>, because they are almost exclusively what you’ll use as a data scientist.</p>
|
||||
<p>A query is made up of <strong>clauses</strong>. There are five important clauses: <code>SELECT</code>, <code>FROM</code>, <code>WHERE</code>, <code>ORDER BY</code>, and <code>GROUP BY</code>. Every query must have the <code>SELECT</code><span data-type="footnote">Confusingly, depending on the context, <code>SELECT</code> is either a statement or a clause. To avoid this confusion, we’ll generally use query instead of <code>SELECT</code> statement.</span> and <code>FROM</code><span data-type="footnote">Ok, technically, only the <code>SELECT</code> is required, since you can write queries like <code>SELECT 1+1</code> to perform basic calculations. But if you want to work with data (as you always do!) you’ll also need a <code>FROM</code> clause.</span> clauses and the simplest query is <code>SELECT * FROM table</code>, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |> show_query()
|
||||
<pre data-type="programlisting" data-code-language="r">flights |> show_query()
|
||||
#> <SQL>
|
||||
#> SELECT *
|
||||
#> FROM flights
|
||||
@@ -279,7 +279,7 @@ planes |> show_query()
|
||||
</div>
|
||||
<p><code>WHERE</code> and <code>ORDER BY</code> control which rows are included and how they are ordered:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest == "IAH") |>
|
||||
arrange(dep_delay) |>
|
||||
show_query()
|
||||
@@ -291,7 +291,7 @@ planes |> show_query()
|
||||
</div>
|
||||
<p><code>GROUP BY</code> converts the query to a summary, causing aggregation to happen:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
|
||||
show_query()
|
||||
@@ -312,10 +312,10 @@ planes |> show_query()
|
||||
</div>
|
||||
|
||||
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
|
||||
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
|
||||
@@ -332,7 +332,7 @@ SELECT</h2>
|
||||
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and, as you’ll learn in the next section, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">planes |>
|
||||
<pre data-type="programlisting" data-code-language="r">planes |>
|
||||
select(tailnum, type, manufacturer, model, year) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
@@ -364,10 +364,10 @@ planes |>
|
||||
</div>
|
||||
|
||||
<p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
|
||||
diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre>
|
||||
</div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre>
|
||||
</div>
|
||||
|
||||
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
|
||||
@@ -378,7 +378,7 @@ FROM `planes`</pre></div>
|
||||
|
||||
<p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
speed = distance / (air_time / 60)
|
||||
) |>
|
||||
@@ -401,7 +401,7 @@ FROM</h2>
|
||||
GROUP BY</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(
|
||||
n = n(),
|
||||
@@ -421,7 +421,7 @@ GROUP BY</h2>
|
||||
WHERE</h2>
|
||||
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest == "IAH" | dest == "HOU") |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
@@ -444,7 +444,7 @@ flights |>
|
||||
<li>SQL uses only <code>''</code> for strings, not <code>""</code>. In SQL, <code>""</code> is used to identify variables, like R’s <code>``</code>.</li>
|
||||
</ul><p>Another useful SQL operator is <code>IN</code>, which is very close to R’s <code>%in%</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(dest %in% c("IAH", "HOU")) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
@@ -454,7 +454,7 @@ flights |>
|
||||
</div>
|
||||
<p>SQL uses <code>NULL</code> instead of <code>NA</code>. <code>NULL</code>s behave similarly to <code>NA</code>s. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
summarise(delay = mean(arr_delay))
|
||||
#> Warning: Missing values are always removed in SQL aggregation functions.
|
||||
@@ -475,7 +475,7 @@ flights |>
|
||||
<p>If you want to learn more about how NULLs work, you might enjoy “<a href="https://modern-sql.com/concept/three-valued-logic"><em>Three valued logic</em></a>” by Markus Winand.</p>
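<p>To see this behavior directly, here is a small sketch that sends a hand-written query to a throwaway duckdb connection. The column aliases are illustrative, and the values in the inline table are made up for the example. <code>NULL</code>s come back to R as <code>NA</code>s:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

# A fresh in-memory database, separate from the chapter's `con`.
con_null <- dbConnect(duckdb::duckdb())

# Comparisons and arithmetic involving NULL yield NULL;
# IS NULL is the way to test for missingness.
dbGetQuery(con_null, "
  SELECT
    NULL = 1     AS compared,
    NULL + 1     AS added,
    NULL IS NULL AS is_null
")

# Aggregation functions skip NULLs: only the values 1 and 3 are averaged.
dbGetQuery(con_null, "
  SELECT AVG(x) AS mean_x
  FROM (VALUES (1), (NULL), (3)) AS t(x)
")

dbDisconnect(con_null, shutdown = TRUE)</pre>
</div>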
|
||||
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
filter(!is.na(dep_delay)) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
@@ -487,7 +487,7 @@ flights |>
|
||||
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
|
||||
<p>Note that if you <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL created because <code>WHERE</code> is evaluated before <code>SELECT</code>, so it needs another clause that’s evaluated afterwards.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds_db |>
|
||||
group_by(cut) |>
|
||||
summarise(n = n()) |>
|
||||
filter(n > 100) |>
|
||||
@@ -505,7 +505,7 @@ flights |>
|
||||
ORDER BY</h2>
|
||||
<p>Ordering rows involves a straightforward translation from <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
arrange(year, month, day, desc(dep_delay)) |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
@@ -522,7 +522,7 @@ Subqueries</h2>
|
||||
<p>Sometimes it’s not possible to translate a dplyr pipeline into a single <code>SELECT</code> statement and you need to use a subquery. A <strong>subquery</strong> is just a query used as a data source in the <code>FROM</code> clause, instead of the usual table.</p>
|
||||
<p>dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the <code>SELECT</code> clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes <code>year1</code> and then the second (outer) query can compute <code>year2</code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(
|
||||
year1 = year + 1,
|
||||
year2 = year1 + 1
|
||||
@@ -537,7 +537,7 @@ Subqueries</h2>
|
||||
</div>
|
||||
<p>You’ll also see this if you attempted to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate(year1 = year + 1) |>
|
||||
filter(year1 == 2014) |>
|
||||
show_query()
|
||||
@@ -557,7 +557,7 @@ Subqueries</h2>
|
||||
Joins</h2>
|
||||
<p>If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
left_join(planes |> rename(year_built = year), by = "tailnum") |>
|
||||
show_query()
|
||||
#> <SQL>
|
||||
@@ -619,7 +619,7 @@ Function translations</h1>
|
||||
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
|
||||
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">summarize_query <- function(df, ...) {
|
||||
<pre data-type="programlisting" data-code-language="r">summarize_query <- function(df, ...) {
|
||||
df |>
|
||||
summarise(...) |>
|
||||
show_query()
|
||||
@@ -632,7 +632,7 @@ mutate_query <- function(df, ...) {
|
||||
</div>
|
||||
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>, have a relatively simple translation while others, like <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
summarize_query(
|
||||
mean = mean(arr_delay, na.rm = TRUE),
|
||||
@@ -652,7 +652,7 @@ mutate_query <- function(df, ...) {
|
||||
</div>
|
||||
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(year, month, day) |>
|
||||
mutate_query(
|
||||
mean = mean(arr_delay, na.rm = TRUE),
|
||||
@@ -668,7 +668,7 @@ mutate_query <- function(df, ...) {
|
||||
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summary so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument to <code>OVER</code>.</p>
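<p>For comparison, a hand-written window function might look like the following sketch. It assumes a throwaway duckdb connection holding a copy of <code>ggplot2::mpg</code>; the alias <code>mean_hwy</code> is illustrative. Unlike a <code>GROUP BY</code> summary, every input row is kept, and each row is paired with the mean for its manufacturer:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(DBI)

# A fresh in-memory database, separate from the chapter's `con`.
con_win <- dbConnect(duckdb::duckdb())
dbWriteTable(con_win, "mpg", ggplot2::mpg)

# AVG() becomes a window function once OVER is added;
# PARTITION BY plays the role that group_by() plays before mutate().
dbGetQuery(con_win, "
  SELECT
    manufacturer,
    hwy,
    AVG(hwy) OVER (PARTITION BY manufacturer) AS mean_hwy
  FROM mpg
  LIMIT 5
")

dbDisconnect(con_win, shutdown = TRUE)</pre>
</div>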
|
||||
<p>Window functions include all functions that look forward or backwards, like <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
group_by(dest) |>
|
||||
arrange(time_hour) |>
|
||||
mutate_query(
|
||||
@@ -686,7 +686,7 @@ mutate_query <- function(df, ...) {
|
||||
<p>Here it’s important to <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
|
||||
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, the dplyr function that it directly inspired. Here’s a couple of simple examples:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate_query(
|
||||
description = if_else(arr_delay > 0, "delayed", "on-time")
|
||||
)
|
||||
@@ -712,7 +712,7 @@ flights |>
|
||||
</div>
|
||||
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">flights |>
|
||||
mutate_query(
|
||||
description = cut(
|
||||
arr_delay,
|
||||