Don't transform non-crossref links

This commit is contained in:
Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions


@@ -15,7 +15,7 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
@@ -62,7 +62,7 @@ Connecting to a database</h1>
<ul><li><p>You’ll always use DBI (<strong>d</strong>ata<strong>b</strong>ase <strong>i</strong>nterface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.</p></li>
<li><p>You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for Postgres and RMariaDB for MySQL.</p></li>
</ul><p>If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMSs. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.</p>
<p>Concretely, you create a database connection using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbConnect" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbConnect</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="#chp-https://rdrr.io/r/base/library" data-type="xref">#chp-https://rdrr.io/r/base/library</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
<p>Concretely, you create a database connection using <code><a href="https://dbi.r-dbi.org/reference/dbConnect.html">DBI::dbConnect()</a></code>. The first argument selects the DBMS<span data-type="footnote">Typically, this is the only function you’ll use from the client package, so we recommend using <code>::</code> to pull out that one function, rather than loading the complete package with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>.</span>, then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con &lt;- DBI::dbConnect(
RMariaDB::MariaDB(),
@@ -93,7 +93,7 @@ In this book</h2>
<section id="sec-load-data" data-type="sect2">
<h2>
Load some data</h2>
<p>Since this is a new database, we need to start by adding some data. Here we’ll add the <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbWriteTable</a></code>. The simplest usage of <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbWriteTable</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
<p>Since this is a new database, we need to start by adding some data. Here we’ll add the <code>mpg</code> and <code>diamonds</code> datasets from ggplot2 using <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">DBI::dbWriteTable()</a></code>. The simplest usage of <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">dbWriteTable(con, "mpg", ggplot2::mpg)
dbWriteTable(con, "diamonds", ggplot2::diamonds)</pre>
@@ -123,7 +123,7 @@ dbExistsTable(con, "foo")
<section id="extract-some-data" data-type="sect2">
<h2>
Extract some data</h2>
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbReadTable</a></code>:</p>
<p>Once you’ve determined a table exists, you can retrieve it with <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">con |&gt;
dbReadTable("diamonds") |&gt;
@@ -139,14 +139,14 @@ Extract some data</h2>
#&gt; 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#&gt; # … with 53,934 more rows</pre>
</div>
<p><code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbReadTable</a></code> returns a <code>data.frame</code> so we use <code><a href="#chp-https://tibble.tidyverse.org/reference/as_tibble" data-type="xref">#chp-https://tibble.tidyverse.org/reference/as_tibble</a></code> to convert it into a tibble so that it prints nicely.</p>
<p>In real life, it’s rare that you’ll use <code><a href="#chp-https://dbi.r-dbi.org/reference/dbReadTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbReadTable</a></code> because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.</p>
<p><code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> returns a <code>data.frame</code> so we use <code><a href="https://tibble.tidyverse.org/reference/as_tibble.html">as_tibble()</a></code> to convert it into a tibble so that it prints nicely.</p>
<p>In real life, it’s rare that you’ll use <code><a href="https://dbi.r-dbi.org/reference/dbReadTable.html">dbReadTable()</a></code> because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.</p>
</section>
<section id="sec-dbGetQuery" data-type="sect2">
<h2>
Run a query</h2>
<p>The way you’ll usually retrieve data is with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbGetQuery</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
<p>The way you’ll usually retrieve data is with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code>. It takes a database connection and some SQL code and returns a data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sql &lt;- "
SELECT carat, cut, clarity, color, price
@@ -166,21 +166,21 @@ as_tibble(dbGetQuery(con, sql))
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where <code>price</code> is greater than 15,000.</p>
<p>You’ll need to be a little careful with <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbGetQuery</a></code> since it can potentially return more data than fits in memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="#chp-https://dbi.r-dbi.org/reference/dbSendQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbSendQuery</a></code> to get a “result set” which you can page through by calling <code><a href="#chp-https://dbi.r-dbi.org/reference/dbFetch" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbFetch</a></code> until <code><a href="#chp-https://dbi.r-dbi.org/reference/dbHasCompleted" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbHasCompleted</a></code> returns <code>TRUE</code>.</p>
<p>You’ll need to be a little careful with <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> since it can potentially return more data than fits in memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using <code><a href="https://dbi.r-dbi.org/reference/dbSendQuery.html">dbSendQuery()</a></code> to get a “result set” which you can page through by calling <code><a href="https://dbi.r-dbi.org/reference/dbFetch.html">dbFetch()</a></code> until <code><a href="https://dbi.r-dbi.org/reference/dbHasCompleted.html">dbHasCompleted()</a></code> returns <code>TRUE</code>.</p>
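<p>The paging pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the throwaway in-memory duckdb connection, the page size of 500 rows, and the <code>n_rows</code> counter are choices made for this example rather than anything the chapter prescribes.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Assumption: an in-memory duckdb database that holds a copy of diamonds
con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "diamonds", ggplot2::diamonds)

# dbSendQuery() returns a result set instead of a data frame
res &lt;- DBI::dbSendQuery(
  con,
  "SELECT carat, cut, price FROM diamonds WHERE price &gt; 15000"
)

n_rows &lt;- 0
while (!DBI::dbHasCompleted(res)) {
  # Assumption: 500 rows per page; pick whatever fits in memory
  page &lt;- DBI::dbFetch(res, n = 500)
  # Process the page here; this sketch only counts the rows it has seen
  n_rows &lt;- n_rows + nrow(page)
}

# Release the result set, then close the connection
DBI::dbClearResult(res)
DBI::dbDisconnect(con, shutdown = TRUE)</pre>
</div>
<p>Each call to <code>dbFetch()</code> returns an ordinary data frame with at most <code>n</code> rows, so only one page needs to be held in memory at a time.</p>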
</section>
<section id="other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="#chp-https://dbi.r-dbi.org/reference/dbWriteTable" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbWriteTable</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
<p>There are lots of other functions in DBI that you might find useful if you’re managing your own data (like <code><a href="https://dbi.r-dbi.org/reference/dbWriteTable.html">dbWriteTable()</a></code> which we used in <a href="#sec-load-data" data-type="xref">#sec-load-data</a>), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.</p>
</section>
</section>
<section id="dbplyr-basics" data-type="sect1">
<h1>
dbplyr basics</h1>
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="#chp-https://dtplyr.tidyverse" data-type="xref">#chp-https://dtplyr.tidyverse</a> which translates to <a href="#chp-https://r-datatable" data-type="xref">#chp-https://r-datatable</a>, and <a href="#chp-https://multidplyr.tidyverse" data-type="xref">#chp-https://multidplyr.tidyverse</a> which executes your code on multiple cores.</p>
<p>To use dbplyr, you must first use <code><a href="#chp-https://dplyr.tidyverse.org/reference/tbl" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/tbl</a></code> to create an object that represents a database table:</p>
<p>Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn a bit about dbplyr. dbplyr is a dplyr <strong>backend</strong>, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include <a href="https://dtplyr.tidyverse.org">dtplyr</a> which translates to <a href="https://r-datatable.com">data.table</a>, and <a href="https://multidplyr.tidyverse.org">multidplyr</a> which executes your code on multiple cores.</p>
<p>To use dbplyr, you must first use <code><a href="https://dplyr.tidyverse.org/reference/tbl.html">tbl()</a></code> to create an object that represents a database table:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, "diamonds")
diamonds_db
@@ -212,7 +212,7 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
@@ -238,7 +238,7 @@ big_diamonds_db
#&gt; # … with more rows</pre>
</div>
<p>You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.</p>
<p>You can see the SQL code generated by the dbplyr function <code><a href="#chp-https://dplyr.tidyverse.org/reference/explain" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/explain</a></code>:</p>
<p>You can see the SQL code generated by the dbplyr function <code><a href="https://dplyr.tidyverse.org/reference/explain.html">show_query()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds_db |&gt;
show_query()
@@ -247,7 +247,7 @@ big_diamonds_db
#&gt; FROM diamonds
#&gt; WHERE (price &gt; 15000.0)</pre>
</div>
<p>To get all the data back into R, you call <code><a href="#chp-https://dplyr.tidyverse.org/reference/compute" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/compute</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="#chp-https://dbi.r-dbi.org/reference/dbGetQuery" data-type="xref">#chp-https://dbi.r-dbi.org/reference/dbGetQuery</a></code> to get the data, then turns the result into a tibble:</p>
<p>To get all the data back into R, you call <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code>. Behind the scenes, this generates the SQL, calls <code><a href="https://dbi.r-dbi.org/reference/dbGetQuery.html">dbGetQuery()</a></code> to get the data, then turns the result into a tibble:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">big_diamonds &lt;- big_diamonds_db |&gt;
collect()
@@ -263,7 +263,7 @@ big_diamonds
#&gt; 6 1.73 Very Good G VS1 15014
#&gt; # … with 1,649 more rows</pre>
</div>
<p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="#chp-https://dplyr.tidyverse.org/reference/compute" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/compute</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p>
<p>Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll <code><a href="https://dplyr.tidyverse.org/reference/compute.html">collect()</a></code> the data to get an in-memory tibble, and continue your work with pure R code.</p>
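<p>As a rough sketch of that workflow, the code below does the filtering in the database and only then switches to an R-only function. The in-memory duckdb connection and the choice of <code>lm()</code> as the “unique to R” step are assumptions made for illustration; the dbplyr package must be installed for <code>tbl()</code> to work on a database connection.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Assumption: an in-memory duckdb database that holds a copy of diamonds
con &lt;- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "diamonds", ggplot2::diamonds)
diamonds_db &lt;- tbl(con, "diamonds")

# These verbs are translated to SQL and run inside the database
big_diamonds &lt;- diamonds_db |&gt;
  filter(price &gt; 15000) |&gt;
  select(carat, cut, price) |&gt;
  collect()

# big_diamonds is now an in-memory tibble, so R-only tools apply.
# lm() is just a stand-in for any function with no SQL translation.
fit &lt;- lm(price ~ carat, data = big_diamonds)
coef(fit)

DBI::dbDisconnect(con, shutdown = TRUE)</pre>
</div>
<p>Keeping <code>collect()</code> as late as possible means the database does the row and column reduction, and R only receives the subset it needs.</p>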
</section>
<section id="sql" data-type="sect1">
@@ -343,7 +343,7 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
@@ -354,8 +354,8 @@ FROM `planes`</pre></div>
<section id="select" data-type="sect2">
<h2>
SELECT</h2>
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code>, and, as you’ll learn in the next section, <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>.</p>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/rename" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/rename</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/relocate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/relocate</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
<p>The <code>SELECT</code> clause is the workhorse of queries and performs the same job as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code>, and, as you’ll learn in the next section, <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>.</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/rename.html">rename()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/relocate.html">relocate()</a></code> have very direct translations to <code>SELECT</code> as they just affect where a column appears (if at all) along with its name:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">planes |&gt;
select(tailnum, type, manufacturer, model, year) |&gt;
@@ -380,7 +380,7 @@ planes |&gt;
#&gt; SELECT tailnum, manufacturer, model, "type", "year"
#&gt; FROM planes</pre>
</div>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code>, the old name is on the left and the new name is on the right.</p>
<p>This example also shows you how SQL does renaming. In SQL terminology renaming is called <strong>aliasing</strong> and is done with <code>AS</code>. Note that unlike <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>, the old name is on the left and the new name is on the right.</p>
<div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
@@ -397,13 +397,13 @@ diamonds_db &lt;- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pr
<pre data-type="programlisting" data-code-language="downlit">diamonds_db &lt;- tbl(con, sql("SELECT * FROM diamonds"))</pre>
</div>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="#chp-https://github.com/tidyverse/dbplyr/issues/" data-type="xref">#chp-https://github.com/tidyverse/dbplyr/issues/</a> to help us do better.</p>
<p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p>
<p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so the rest quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year"
FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
FROM `planes`</pre></div>
<p>The translations for <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<p>The translations for <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> are similarly straightforward: each variable becomes a new expression in <code>SELECT</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(
@@ -426,7 +426,7 @@ FROM</h2>
<section id="group-by" data-type="sect2">
<h2>
GROUP BY</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/group_by" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/group_by</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> is translated to the <code>SELECT</code> clause:</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt;
group_by(cut) |&gt;
@@ -440,13 +440,13 @@ GROUP BY</h2>
#&gt; FROM diamonds
#&gt; GROUP BY cut</pre>
</div>
<p>We’ll come back to what’s happening with the translation of <code><a href="#chp-https://dplyr.tidyverse.org/reference/context" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/context</a></code> and <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
<p>We’ll come back to what’s happening with the translation of <code><a href="https://dplyr.tidyverse.org/reference/context.html">n()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> in <a href="#sec-sql-expressions" data-type="xref">#sec-sql-expressions</a>.</p>
</section>
<section id="where" data-type="sect2">
<h2>
WHERE</h2>
<p><code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> is translated to the <code>WHERE</code> clause:</p>
<p><code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> is translated to the <code>WHERE</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
filter(dest == "IAH" | dest == "HOU") |&gt;
@@ -499,7 +499,7 @@ flights |&gt;
#&gt; 6 LAX 0.547
#&gt; # … with more rows</pre>
</div>
<p>If you want to learn more about how NULLs work, you might enjoy “<a href="#chp-https://modern-sql.com/concept/three-valued-logic" data-type="xref">#chp-https://modern-sql.com/concept/three-valued-logic</a>” by Markus Winand.</p>
<p>If you want to learn more about how NULLs work, you might enjoy “<a href="https://modern-sql.com/concept/three-valued-logic"><em>Three valued logic</em></a>” by Markus Winand.</p>
<p>In general, you can work with <code>NULL</code>s using the functions you’d use for <code>NA</code>s in R:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
@@ -512,7 +512,7 @@ flights |&gt;
</div>
<p>This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:</p>
<pre data-type="programlisting" data-code-language="sql">WHERE "dep_delay" IS NOT NULL</pre>
<p>Note that if you <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL created because <code>WHERE</code> is evaluated before <code>SELECT</code>, so it needs another clause that’s evaluated afterwards.</p>
<p>Note that if you <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you created using a summarize, dbplyr will generate a <code>HAVING</code> clause, rather than a <code>WHERE</code> clause. This is one of the idiosyncrasies of SQL created because <code>WHERE</code> is evaluated before <code>SELECT</code>, so it needs another clause that’s evaluated afterwards.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds_db |&gt;
group_by(cut) |&gt;
@@ -530,7 +530,7 @@ flights |&gt;
<section id="order-by" data-type="sect2">
<h2>
ORDER BY</h2>
<p>Ordering rows involves a straightforward translation from <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> to the <code>ORDER BY</code> clause:</p>
<p>Ordering rows involves a straightforward translation from <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> to the <code>ORDER BY</code> clause:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
arrange(year, month, day, desc(dep_delay)) |&gt;
@@ -540,7 +540,7 @@ ORDER BY</h2>
#&gt; FROM flights
#&gt; ORDER BY "year", "month", "day", dep_delay DESC</pre>
</div>
<p>Notice how <code><a href="#chp-https://dplyr.tidyverse.org/reference/desc" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/desc</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p>
<p>Notice how <code><a href="https://dplyr.tidyverse.org/reference/desc.html">desc()</a></code> is translated to <code>DESC</code>: this is one of the many dplyr functions whose name was directly inspired by SQL.</p>
</section>
<section id="subqueries" data-type="sect2">
@@ -562,7 +562,7 @@ Subqueries</h2>
#&gt; FROM flights
#&gt; ) q01</pre>
</div>
<p>You’ll also see this if you attempt to <code><a href="#chp-https://dplyr.tidyverse.org/reference/filter" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/filter</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
<p>You’ll also see this if you attempt to <code><a href="https://dplyr.tidyverse.org/reference/filter.html">filter()</a></code> a variable that you just created. Remember, even though <code>WHERE</code> is written after <code>SELECT</code>, it’s evaluated before it, so we need a subquery in this (silly) example:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate(year1 = year + 1) |&gt;
@@ -603,7 +603,7 @@ Joins</h2>
#&gt; ON (flights.tailnum = planes.tailnum)</pre>
</div>
<p>The main thing to notice here is the syntax: SQL joins use sub-clauses of the <code>FROM</code> clause to bring in additional tables, using <code>ON</code> to define how the tables are related.</p>
<p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>, and <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate-joins" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate-joins</a></code>:</p>
<p>dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">inner_join()</a></code>, <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">right_join()</a></code>, and <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">full_join()</a></code>:</p>
<pre data-type="programlisting" data-code-language="sql">SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
INNER JOIN planes ON (flights.tailnum = planes.tailnum)
@@ -615,19 +615,19 @@ RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
FROM flights
FULL JOIN planes ON (flights.tailnum = planes.tailnum)</pre>
<p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, so to assemble a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="#chp-https://cynkra.github.io/dm/" data-type="xref">#chp-https://cynkra.github.io/dm/</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a lifesaver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p>
<p>You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place, so to assemble a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the <a href="https://cynkra.github.io/dm/">dm package</a>, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a lifesaver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.</p>
</section>
<section id="other-verbs" data-type="sect2">
<h2>
Other verbs</h2>
<p>dbplyr also translates other verbs like <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code>, <code>slice_*()</code>, and <code><a href="#chp-https://generics.r-lib.org/reference/setops" data-type="xref">#chp-https://generics.r-lib.org/reference/setops</a></code>, and a growing selection of tidyr functions like <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
<p>dbplyr also translates other verbs like <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>, <code>slice_*()</code>, and <code><a href="https://generics.r-lib.org/reference/setops.html">intersect()</a></code>, and a growing selection of tidyr functions like <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: <a href="https://dbplyr.tidyverse.org/reference/" class="uri">https://dbplyr.tidyverse.org/reference/</a>.</p>
</section>
<section id="exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What is <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code> translated to? How about <code><a href="#chp-https://rdrr.io/r/utils/head" data-type="xref">#chp-https://rdrr.io/r/utils/head</a></code>?</p></li>
<ol type="1"><li><p>What is <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code> translated to? How about <code><a href="https://rdrr.io/r/utils/head.html">head()</a></code>?</p></li>
<li>
<p>Explain what each of the following SQL queries does and try to recreate them using dbplyr.</p>
<pre data-type="programlisting" data-code-language="sql">SELECT *
@@ -643,8 +643,8 @@ FROM flights</pre>
<section id="sec-sql-expressions" data-type="sect1">
<h1>
Function translations</h1>
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code>?</p>
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="#chp-https://dplyr.tidyverse.org/reference/summarise" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/summarise</a></code> or <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
<p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">summarize_query &lt;- function(df, ...) {
df |&gt;
@@ -657,7 +657,7 @@ mutate_query &lt;- function(df, ...) {
show_query()
}</pre>
</div>
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code>, have a relatively simple translation while others, like <code><a href="#chp-https://rdrr.io/r/stats/median" data-type="xref">#chp-https://rdrr.io/r/stats/median</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
<p>Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code>, have a relatively simple translation while others, like <code><a href="https://rdrr.io/r/stats/median.html">median()</a></code>, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
@@ -677,7 +677,7 @@ mutate_query &lt;- function(df, ...) {
#&gt; FROM flights
#&gt; GROUP BY "year", "month", "day"</pre>
</div>
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
<p>The translation of summary functions becomes more complicated when you use them inside a <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding <code>OVER</code> after it:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(year, month, day) |&gt;
@@ -693,7 +693,7 @@ mutate_query &lt;- function(df, ...) {
#&gt; FROM flights</pre>
</div>
<p>In SQL, the <code>GROUP BY</code> clause is used exclusively for summaries, so here you can see that the grouping has moved to the <code>PARTITION BY</code> argument of <code>OVER</code>.</p>
<p>Window functions include all functions that look forward or backwards, like <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/lead-lag" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/lead-lag</a></code>:</p>
<p>Window functions include all functions that look forward or backwards, like <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lead()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/lead-lag.html">lag()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
group_by(dest) |&gt;
@@ -710,8 +710,8 @@ mutate_query &lt;- function(df, ...) {
#&gt; FROM flights
#&gt; ORDER BY time_hour</pre>
</div>
<p>Here it’s important to <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="#chp-https://dplyr.tidyverse.org/reference/arrange" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/arrange</a></code> you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> and <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p>
<p>Here it’s important to <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> the data, because SQL tables have no intrinsic order. In fact, if you don’t use <code><a href="https://dplyr.tidyverse.org/reference/arrange.html">arrange()</a></code> you might get the rows back in a different order every time! Notice that for window functions, the ordering information is repeated: the <code>ORDER BY</code> clause of the main query doesn’t automatically apply to window functions.</p>
<p>Another important SQL function is <code>CASE WHEN</code>. It’s used as the translation of <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> and <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>, the dplyr function that it directly inspired. Here are a couple of simple examples:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate_query(
@@ -737,7 +737,7 @@ flights |&gt;
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="#chp-https://rdrr.io/r/base/cut" data-type="xref">#chp-https://rdrr.io/r/base/cut</a></code>:</p>
<p><code>CASE WHEN</code> is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is <code><a href="https://rdrr.io/r/base/cut.html">cut()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">flights |&gt;
mutate_query(
@@ -755,16 +755,16 @@ flights |&gt;
#&gt; END AS description
#&gt; FROM flights</pre>
</div>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="#chp-https://dbplyr.tidyverse.org/articles/translation-function" data-type="xref">#chp-https://dbplyr.tidyverse.org/articles/translation-function</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
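<p>As a rough sketch of how you might explore these string translations yourself, you could reuse the <code>mutate_query()</code> helper defined earlier with a couple of base R string functions. The new column names <code>carrier_upper</code> and <code>dest_prefix</code> are made up for illustration, and the exact SQL you see will depend on the database backend:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Illustrative only: carrier_upper and dest_prefix are hypothetical names.
# toupper() and substr() are base R string functions that dbplyr translates;
# mutate_query() is the helper defined earlier in this section.
flights |&gt;
  mutate_query(
    carrier_upper = toupper(carrier),
    dest_prefix = substr(dest, 1, 2)
  )</pre>
</div>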
<section id="learning-more" data-type="sect2">
<h2>
Learning more</h2>
<p>If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<ul><li>
<a href="#chp-https://sqlfordatascientists" data-type="xref">#chp-https://sqlfordatascientists</a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
<a href="https://sqlfordatascientists.com"><em>SQL for Data Scientists</em></a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
<li>
<a href="#chp-https://www.practicalsql" data-type="xref">#chp-https://www.practicalsql</a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
<a href="https://www.practicalsql.com"><em>Practical SQL</em></a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
</ul></section>
</section>
</section>