Re-render book for O'Reilly
@@ -164,7 +164,7 @@ dbplyr basics</h1>
 <pre data-type="programlisting" data-code-language="r">diamonds_db <- tbl(con, "diamonds")
 diamonds_db
 #> # Source: table<diamonds> [?? x 10]
-#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
 #> carat cut color clarity depth table price x y z
 #> <dbl> <fct> <fct> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 #> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
@@ -203,7 +203,7 @@ FROM `planes`</pre></div>

 big_diamonds_db
 #> # Source: SQL [?? x 5]
-#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
 #> carat cut color clarity price
 #> <dbl> <fct> <fct> <fct> <int>
 #> 1 1.54 Premium E VS2 15002
@@ -293,7 +293,7 @@ planes |> show_query()
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">flights |>
   group_by(dest) |>
-  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
+  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
   show_query()
 #> <SQL>
 #> SELECT dest, AVG(dep_delay) AS dep_delay
@@ -399,11 +399,11 @@ FROM</h2>
 <section id="group-by" data-type="sect2">
 <h2>
 GROUP BY</h2>
-<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> is translated to the <code>SELECT</code> clause:</p>
+<p><code><a href="https://dplyr.tidyverse.org/reference/group_by.html">group_by()</a></code> is translated to the <code>GROUP BY</code><span data-type="footnote">This is no coincidence: the dplyr function name was inspired by the SQL clause.</span> clause and <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> is translated to the <code>SELECT</code> clause:</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">diamonds_db |>
   group_by(cut) |>
-  summarise(
+  summarize(
     n = n(),
     avg_price = mean(price, na.rm = TRUE)
   ) |>
@@ -456,12 +456,12 @@ flights |>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">flights |>
   group_by(dest) |>
-  summarise(delay = mean(arr_delay))
+  summarize(delay = mean(arr_delay))
 #> Warning: Missing values are always removed in SQL aggregation functions.
 #> Use `na.rm = TRUE` to silence this warning
 #> This warning is displayed once every 8 hours.
 #> # Source: SQL [?? x 2]
-#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#> # Database: DuckDB 0.6.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
 #> dest delay
 #> <chr> <dbl>
 #> 1 ATL 11.3
@@ -489,7 +489,7 @@ flights |>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">diamonds_db |>
   group_by(cut) |>
-  summarise(n = n()) |>
+  summarize(n = n()) |>
   filter(n > 100) |>
   show_query()
 #> <SQL>
@@ -617,11 +617,11 @@ FROM flights</pre>
 <h1>
 Function translations</h1>
 <p>So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use <code>mean(x)</code> in a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code>?</p>
-<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarise()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
+<p>To help see what’s going on, we’ll use a couple of little helper functions that run a <code><a href="https://dplyr.tidyverse.org/reference/summarise.html">summarize()</a></code> or <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">summarize_query <- function(df, ...) {
   df |>
-    summarise(...) |>
+    summarize(...) |>
     show_query()
 }
 mutate_query <- function(df, ...) {
@@ -729,15 +729,18 @@ flights |>
#> FROM flights</pre>
</div>
<p>dbplyr also translates common string and date-time manipulation functions, which you can learn about in <code><a href="https://dbplyr.tidyverse.org/articles/translation-function.html">vignette("translation-function", package = "dbplyr")</a></code>. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.</p>
</section>

<section id="learning-more" data-type="sect2">
<h2>
Learning more</h2>
<p>If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned how to access data from databases. We focused on dbplyr, a dplyr “backend” that allows you to write the dplyr code you’re familiar with, and have it be automatically translated to SQL. We used that translation to teach you a little SQL; it’s important to learn some SQL because it’s <em>the</em> most commonly used language for working with data, and knowing some will make it easier for you to communicate with other data folks who don’t use R. If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:</p>
<ul><li>
<a href="https://sqlfordatascientists.com"><em>SQL for Data Scientists</em></a> by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.</li>
<li>
<a href="https://www.practicalsql.com"><em>Practical SQL</em></a> by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.</li>
</ul></section>
</ul><p>In the next chapter, we’ll learn about another dplyr backend for working with large data: arrow. Arrow is designed for working with large files on disk, and is a natural complement to databases.</p>


</section>
</section>