More work on O'Reilly book
* Make width narrower
* Convert deps to table
* Strip chapter status
README.md (11 lines changed)
							| @@ -47,6 +47,17 @@ devtools::install_github("hadley/r4ds") | ||||
|     knitr::include_graphics("screenshots/rstudio-wg.png") | ||||
|     ``` | ||||
|  | ||||
| +### O'Reilly | ||||
| + | ||||
| +To generate book for O'Reilly, build the book then: | ||||
| + | ||||
| +```{r} | ||||
| +devtools::load_all("../minibook/"); process_book() | ||||
| + | ||||
| +html <- list.files("oreilly", pattern = "[.]html$", full.names = TRUE) | ||||
| +file.copy(html, "../r-for-data-science-2e/", overwrite = TRUE) | ||||
| +``` | ||||
| + | ||||
| ## Code of Conduct | ||||
|  | ||||
| Please note that r4ds uses a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). | ||||
|   | ||||
| @@ -17,7 +17,7 @@ options( | ||||
|   # Activate crayon output - temporarily disabled for quarto | ||||
|   # crayon.enabled = TRUE, | ||||
|   pillar.bold = TRUE, | ||||
| -  width = 80 | ||||
| +  width = 77 # 80 - 3 for #> comment | ||||
| ) | ||||
|  | ||||
| ggplot2::theme_set(ggplot2::theme_gray(12)) | ||||
| @@ -39,7 +39,7 @@ status <- function(type) { | ||||
|   ) | ||||
|  | ||||
|   cat(paste0( | ||||
| -    "::: callout-", class, "\n", | ||||
| +    "::: status callout-", class, "\n", | ||||
|     "You are reading the work-in-progress second edition of R for Data Science. ", | ||||
|     "This chapter ", status, ". ", | ||||
|     "You can find the complete first edition at <https://r4ds.had.co.nz>.\n", | ||||
|   | ||||
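A note on the `width = 77` change in the options hunk above: each line of printed output in this diff carries the three-character comment prefix `#> `, so output wrapped at 77 columns fills exactly 80 columns once the prefix is added. The sketch below checks that arithmetic in base R. The names `prefix`, `longest_line`, and `printed` are illustrative and not part of the commit.

```r
# Assumption: output lines are prefixed with "#> ", as in the rendered
# chunks shown elsewhere in this diff.
prefix <- "#> "
width <- 77  # the new value passed to options(width = ...)

# Build the widest line R would print at this width, then add the prefix.
longest_line <- strrep("x", width)
printed <- paste0(prefix, longest_line)

nchar(prefix)   # 3
nchar(printed)  # 80
```

This is also why the `── Column specification ──` rules in the regenerated HTML become three characters shorter.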
intro.qmd (18 lines changed)
							| @@ -340,6 +340,22 @@ The book is powered by [Quarto](https://quarto.org) which makes it easy to write | ||||
| This book was built with: | ||||
|  | ||||
| ```{r} | ||||
| -sessioninfo::session_info(c("tidyverse")) | ||||
| +#| echo: false | ||||
| +#| results: asis | ||||
| + | ||||
| +pkgs <- sessioninfo::package_info( | ||||
| +  tidyverse:::tidyverse_packages(), | ||||
| +  dependencies = FALSE | ||||
| +) | ||||
| +df <- tibble( | ||||
| +  package = pkgs$package, | ||||
| +  version = pkgs$ondiskversion, | ||||
| +  source = gsub("@", "\\\\@", pkgs$source) | ||||
| +) | ||||
| +knitr::kable(df, format = "markdown") | ||||
| ``` | ||||
| + | ||||
| +```{r} | ||||
| +cli:::ruler() | ||||
| +``` | ||||
|  | ||||
|   | ||||
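The `gsub("@", "\\\\@", pkgs$source)` call in the hunk above puts a backslash in front of every `@` in the package source string. Four backslashes are needed because the escape is processed twice: the R parser turns the literal `"\\\\@"` into the three characters `\\@`, and `gsub()` then reads `\\` in a replacement as one literal backslash. A minimal base R sketch follows. The source string is made up for illustration, and the motive for escaping (presumably to stop `@` being treated as markup in the rendered table) is an assumption.

```r
# Hypothetical source string, shaped like what a GitHub install might
# report; not taken from the commit.
src <- "Github (tidyverse/dplyr@0123abc)"

escaped <- gsub("@", "\\\\@", src)

# cat() shows the raw characters: a single backslash before the "@".
cat(escaped, "\n")

# The result is one character longer than the input: the added backslash.
nchar(escaped) - nchar(src)
```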
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-EDA"> | ||||
| -<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| -<div class="callout-icon-container"> | ||||
| -<i class="callout-icon"/> | ||||
| -</div> | ||||
| - | ||||
| -</div> | ||||
| - | ||||
| -<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| - | ||||
| +<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-base-R"> | ||||
| -<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| -<div class="callout-icon-container"> | ||||
| -<i class="callout-icon"/> | ||||
| -</div> | ||||
| - | ||||
| -</div> | ||||
| - | ||||
| -<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| -<p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p> | ||||
| +<h1><span id="sec-base-r" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">A field guide to base R</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.</p><p>This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a <strong>lot</strong> of base R functions: from <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> to load packages, to <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code> and <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like <code>+</code>, <code>-</code>, <code>/</code>, <code>*</code>, <code>|</code>, <code>&</code>, and <code>!</code>. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.</p><p>After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!</p><p>In this chapter, we’ll focus on four big topics: subsetting with <code>[</code>, subsetting with <code>[[</code> and <code>$</code>, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.</p> | ||||
| <section id="prerequisites" data-type="sect2"> | ||||
| <h2> | ||||
| Prerequisites</h2> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-communicate-plots"> | ||||
| -<h1><span id="sec-graphics-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Graphics for communication</span></span></h1><div data-type="important"><div class="callout-body d-flex"> | ||||
| -<div class="callout-icon-container"> | ||||
| -<i class="callout-icon"/> | ||||
| -</div> | ||||
| - | ||||
| -</div> | ||||
| - | ||||
| -<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| - | ||||
| +<h1><span id="sec-graphics-communication" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Graphics for communication</span></span></h1><p>::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-data-import"> | ||||
| -<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| -<div class="callout-icon-container"> | ||||
| -<i class="callout-icon"/> | ||||
| -</div> | ||||
| - | ||||
| -</div> | ||||
| - | ||||
| -<p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| - | ||||
| +<h1><span id="sec-data-import" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data import</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -83,7 +75,7 @@ Reading data from a file</h1> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">students <- read_csv("data/students.csv") | ||||
| #> Rows: 6 Columns: 5 | ||||
| -#> ── Column specification ──────────────────────────────────────────────────────── | ||||
| +#> ── Column specification ───────────────────────────────────────────────────── | ||||
| #> Delimiter: "," | ||||
| #> chr (4): Full Name, favourite.food, mealPlan, AGE | ||||
| #> dbl (1): Student ID | ||||
| @@ -324,7 +316,7 @@ Guessing types</h2> | ||||
|   T,Inf,2021-02-16,ghi" | ||||
| ) | ||||
| #> Rows: 3 Columns: 4 | ||||
| -#> ── Column specification ──────────────────────────────────────────────────────── | ||||
| +#> ── Column specification ───────────────────────────────────────────────────── | ||||
| #> Delimiter: "," | ||||
| #> chr  (1): string | ||||
| #> dbl  (1): numeric | ||||
| @@ -360,7 +352,7 @@ Missing values, column types, and problems</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">df <- read_csv(csv) | ||||
| #> Rows: 4 Columns: 1 | ||||
| -#> ── Column specification ──────────────────────────────────────────────────────── | ||||
| +#> ── Column specification ───────────────────────────────────────────────────── | ||||
| #> Delimiter: "," | ||||
| #> chr (1): x | ||||
| #>  | ||||
| @@ -370,8 +362,8 @@ Missing values, column types, and problems</h2> | ||||
| <p>In this very small case, you can easily see the missing value <code>.</code>. But what happens if you have thousands of rows with only a few missing values represented by <code>.</code>s speckled amongst them? One approach is to tell readr that <code>x</code> is a numeric column, and then see where it fails. You can do that with the <code>col_types</code> argument, which takes a named list:</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">df <- read_csv(csv, col_types = list(x = col_double())) | ||||
| -#> Warning: One or more parsing issues, call `problems()` on your data frame for details, | ||||
| -#> e.g.: | ||||
| +#> Warning: One or more parsing issues, call `problems()` on your data frame for | ||||
| +#> details, e.g.: | ||||
| #>   dat <- vroom(...) | ||||
| #>   problems(dat)</pre> | ||||
| </div> | ||||
| @@ -381,13 +373,13 @@ Missing values, column types, and problems</h2> | ||||
| #> # A tibble: 1 × 5 | ||||
| #>     row   col expected actual file                                     | ||||
| #>   <int> <int> <chr>    <chr>  <chr>                                    | ||||
| -#> 1     3     1 a double .      /private/tmp/Rtmp43JYhG/file7cf337a06034</pre> | ||||
| +#> 1     3     1 a double .      /private/tmp/Rtmpc2nAIe/file8f2f488fc2f4</pre> | ||||
| </div> | ||||
| <p>This tells us that there was a problem in row 3, col 1 where readr expected a double but got a <code>.</code>. That suggests this dataset uses <code>.</code> for missing values. So then we set <code>na = "."</code>, the automatic guessing succeeds, giving us the numeric column that we want:</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">df <- read_csv(csv, na = ".") | ||||
| #> Rows: 4 Columns: 1 | ||||
| -#> ── Column specification ──────────────────────────────────────────────────────── | ||||
| +#> ── Column specification ───────────────────────────────────────────────────── | ||||
| #> Delimiter: "," | ||||
| #> dbl (1): x | ||||
| #>  | ||||
| @@ -447,7 +439,7 @@ Reading data from multiple files</h1> | ||||
| <pre data-type="programlisting" data-code-language="downlit">sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv") | ||||
| read_csv(sales_files, id = "file") | ||||
| #> Rows: 19 Columns: 6 | ||||
| -#> ── Column specification ──────────────────────────────────────────────────────── | ||||
| +#> ── Column specification ───────────────────────────────────────────────────── | ||||
| #> Delimiter: "," | ||||
| #> chr (1): month | ||||
| #> dbl (4): year, brand, item, n | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-data-tidy"> | ||||
| -<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| -<div class="callout-icon-container"> | ||||
| -<i class="callout-icon"/> | ||||
| -</div> | ||||
| - | ||||
| -</div> | ||||
| - | ||||
| -<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| - | ||||
| +<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -174,21 +166,21 @@ Data in column names</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">billboard | ||||
| #> # A tibble: 317 × 79 | ||||
| -#>   artist  track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9 | ||||
| -#>   <chr>   <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> | ||||
| -#> 1 2 Pac   Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA | ||||
| -#> 2 2Ge+her The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA | ||||
| -#> 3 3 Door… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51 | ||||
| -#> 4 3 Door… Loser 2000-10-21    76    76    72    69    67    65    55    59    62 | ||||
| -#> 5 504 Bo… Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53 | ||||
| -#> 6 98^0    Give… 2000-08-19    51    39    34    26    26    19     2     2     3 | ||||
| -#> # … with 311 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, | ||||
| -#> #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>, | ||||
| -#> #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>, | ||||
| -#> #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, | ||||
| -#> #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, | ||||
| -#> #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, | ||||
| -#> #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …</pre> | ||||
| +#>   artist     track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8 | ||||
| +#>   <chr>      <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> | ||||
| +#> 1 2 Pac      Baby… 2000-02-26    87    82    72    77    87    94    99    NA | ||||
| +#> 2 2Ge+her    The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA | ||||
| +#> 3 3 Doors D… Kryp… 2000-04-08    81    70    68    67    66    57    54    53 | ||||
| +#> 4 3 Doors D… Loser 2000-10-21    76    76    72    69    67    65    55    59 | ||||
| +#> 5 504 Boyz   Wobb… 2000-04-15    57    34    25    17    17    31    36    49 | ||||
| +#> 6 98^0       Give… 2000-08-19    51    39    34    26    26    19     2     2 | ||||
| +#> # … with 311 more rows, 68 more variables: wk9 <dbl>, wk10 <dbl>, | ||||
| +#> #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, | ||||
| +#> #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, | ||||
| +#> #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, | ||||
| +#> #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, | ||||
| +#> #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, | ||||
| +#> #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …</pre> | ||||
| </div> | ||||
| <p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p> | ||||
| <p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p> | ||||
| @@ -347,21 +339,21 @@ Many variables in column names</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">who2 | ||||
| #> # A tibble: 7,240 × 58 | ||||
| -#>   country   year sp_m_…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_…⁶ sp_m_65 sp_f_…⁷ | ||||
| -#>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> | ||||
| -#> 1 Afghani…  1980      NA      NA      NA      NA      NA      NA      NA      NA | ||||
| -#> 2 Afghani…  1981      NA      NA      NA      NA      NA      NA      NA      NA | ||||
| -#> 3 Afghani…  1982      NA      NA      NA      NA      NA      NA      NA      NA | ||||
| -#> 4 Afghani…  1983      NA      NA      NA      NA      NA      NA      NA      NA | ||||
| -#> 5 Afghani…  1984      NA      NA      NA      NA      NA      NA      NA      NA | ||||
| -#> 6 Afghani…  1985      NA      NA      NA      NA      NA      NA      NA      NA | ||||
| -#> # … with 7,234 more rows, 48 more variables: sp_f_1524 <dbl>, sp_f_2534 <dbl>, | ||||
| -#> #   sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>, | ||||
| -#> #   sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>, | ||||
| -#> #   sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>, | ||||
| -#> #   sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>, | ||||
| -#> #   sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>, | ||||
| -#> #   ep_m_2534 <dbl>, ep_m_3544 <dbl>, ep_m_4554 <dbl>, ep_m_5564 <dbl>, …</pre> | ||||
| +#>   country      year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65 | ||||
| +#>   <chr>       <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> | ||||
| +#> 1 Afghanistan  1980       NA       NA      NA      NA      NA      NA      NA | ||||
| +#> 2 Afghanistan  1981       NA       NA      NA      NA      NA      NA      NA | ||||
| +#> 3 Afghanistan  1982       NA       NA      NA      NA      NA      NA      NA | ||||
| +#> 4 Afghanistan  1983       NA       NA      NA      NA      NA      NA      NA | ||||
| +#> 5 Afghanistan  1984       NA       NA      NA      NA      NA      NA      NA | ||||
| +#> 6 Afghanistan  1985       NA       NA      NA      NA      NA      NA      NA | ||||
| +#> # … with 7,234 more rows, 49 more variables: sp_f_014 <dbl>, | ||||
| +#> #   sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, | ||||
| +#> #   sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, | ||||
| +#> #   sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, | ||||
| +#> #   sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, | ||||
| +#> #   sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, | ||||
| +#> #   ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, ep_m_3544 <dbl>, …</pre> | ||||
| </div> | ||||
| <p>This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code> is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2535</code>/<code>3544</code>/<code>4554</code>/<code>65</code> is the <code>age</code> range.</p> | ||||
| <p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell name. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p> | ||||
| @@ -456,12 +448,12 @@ Widening data</h2> | ||||
| #> # A tibble: 500 × 5 | ||||
| #>   org_pac_id org_nm                     measure_cd   measure_title    prf_r…¹ | ||||
| #>   <chr>      <chr>                      <chr>        <chr>              <dbl> | ||||
| -#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS SSM…      63 | ||||
| -#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS SSM…      87 | ||||
| -#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS SSM…      86 | ||||
| -#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS SSM…      57 | ||||
| -#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS SSM…      85 | ||||
| -#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM…      24 | ||||
| +#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS …      63 | ||||
| +#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS …      87 | ||||
| +#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS …      86 | ||||
| +#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS …      57 | ||||
| +#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS …      85 | ||||
| +#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS …      24 | ||||
| #> # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre> | ||||
| </div> | ||||
| <p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p> | ||||
| @@ -471,7 +463,7 @@ Widening data</h2> | ||||
| #> # A tibble: 6 × 2 | ||||
| #>   measure_cd   measure_title                                                  | ||||
| #>   <chr>        <chr>                                                          | ||||
| -#> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor… | ||||
| +#> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In… | ||||
| #> 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate             | ||||
| #> 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider               | ||||
| #> 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education             | ||||
| @@ -489,12 +481,12 @@ Widening data</h2> | ||||
| #> # A tibble: 500 × 9 | ||||
| #>   org_pac_id org_nm   measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷ | ||||
| #>   <chr>      <chr>    <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> | ||||
| -#> 1 0446157747 USC CARE M… CAHPS …      63      NA      NA      NA      NA      NA | ||||
| -#> 2 0446157747 USC CARE M… CAHPS …      NA      87      NA      NA      NA      NA | ||||
| -#> 3 0446157747 USC CARE M… CAHPS …      NA      NA      86      NA      NA      NA | ||||
| -#> 4 0446157747 USC CARE M… CAHPS …      NA      NA      NA      57      NA      NA | ||||
| -#> 5 0446157747 USC CARE M… CAHPS …      NA      NA      NA      NA      85      NA | ||||
| -#> 6 0446157747 USC CARE M… CAHPS …      NA      NA      NA      NA      NA      24 | ||||
| +#> 1 0446157747 USC CAR… CAHPS …      63      NA      NA      NA      NA      NA | ||||
| +#> 2 0446157747 USC CAR… CAHPS …      NA      87      NA      NA      NA      NA | ||||
| +#> 3 0446157747 USC CAR… CAHPS …      NA      NA      86      NA      NA      NA | ||||
| +#> 4 0446157747 USC CAR… CAHPS …      NA      NA      NA      57      NA      NA | ||||
| +#> 5 0446157747 USC CAR… CAHPS …      NA      NA      NA      NA      85      NA | ||||
| +#> 6 0446157747 USC CAR… CAHPS …      NA      NA      NA      NA      NA      24 | ||||
| #> # … with 494 more rows, and abbreviated variable names ¹measure_title, | ||||
| #> #   ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8, | ||||
| #> #   ⁷CAHPS_GRP_12</pre> | ||||
| @@ -510,11 +502,11 @@ Widening data</h2> | ||||
| #> # A tibble: 95 × 8 | ||||
| #>   org_pac_id org_nm           CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ | ||||
| #>   <chr>      <chr>              <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> | ||||
| -#> 1 0446157747 USC CARE MEDICAL G…      63      87      86      57      85      24 | ||||
| -#> 2 0446162697 ASSOCIATION OF UNI…      59      85      83      63      88      22 | ||||
| -#> 3 0547164295 BEAVER MEDICAL GRO…      49      NA      75      44      73      12 | ||||
| -#> 4 0749333730 CAPE PHYSICIANS AS…      67      84      85      65      82      24 | ||||
| -#> 5 0840104360 ALLIANCE PHYSICIAN…      66      87      87      64      87      28 | ||||
| +#> 1 0446157747 USC CARE MEDICA…      63      87      86      57      85      24 | ||||
| +#> 2 0446162697 ASSOCIATION OF …      59      85      83      63      88      22 | ||||
| +#> 3 0547164295 BEAVER MEDICAL …      49      NA      75      44      73      12 | ||||
| +#> 4 0749333730 CAPE PHYSICIANS…      67      84      85      65      82      24 | ||||
| +#> 5 0840104360 ALLIANCE PHYSIC…      66      87      87      64      87      28 | ||||
| #> 6 0840109864 REX HOSPITAL INC      73      87      84      67      91      30 | ||||
| #> # … with 89 more rows, and abbreviated variable names ¹CAHPS_GRP_1, | ||||
| #> #   ²CAHPS_GRP_2, ³CAHPS_GRP_3, ⁴CAHPS_GRP_5, ⁵CAHPS_GRP_8, ⁶CAHPS_GRP_12</pre> | ||||
| @@ -602,7 +594,8 @@ How does<code>pivot_wider()</code> work?</h2> | ||||
|   names_from = name, | ||||
|   values_from = value | ||||
| ) | ||||
| -#> Warning: Values from `value` are not uniquely identified; output will contain list-cols. | ||||
| +#> Warning: Values from `value` are not uniquely identified; output will contain | ||||
| +#> list-cols. | ||||
| #> • Use `values_fn = list` to suppress this warning. | ||||
| #> • Use `values_fn = {summary_fun}` to summarise duplicates. | ||||
| #> • Use the following dplyr code to identify duplicates. | ||||
| @@ -695,15 +688,16 @@ col_year <- gapminder |> | ||||
|   )  | ||||
| col_year | ||||
| #> # A tibble: 142 × 13 | ||||
| -#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997` | ||||
| -#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> | ||||
| -#> 1 Afghani…   2.89   2.91   2.93   2.92   2.87   2.90   2.99   2.93   2.81   2.80 | ||||
| -#> 2 Albania    3.20   3.29   3.36   3.44   3.52   3.55   3.56   3.57   3.40   3.50 | ||||
| -#> 3 Algeria    3.39   3.48   3.41   3.51   3.62   3.69   3.76   3.75   3.70   3.68 | ||||
| -#> 4 Angola     3.55   3.58   3.63   3.74   3.74   3.48   3.44   3.39   3.42   3.36 | ||||
| -#> 5 Argenti…   3.77   3.84   3.85   3.91   3.98   4.00   3.95   3.96   3.97   4.04 | ||||
| -#> 6 Austral…   4.00   4.04   4.09   4.16   4.23   4.26   4.29   4.34   4.37   4.43 | ||||
| -#> # … with 136 more rows, and 2 more variables: `2002` <dbl>, `2007` <dbl></pre> | ||||
| +#>   country     `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` | ||||
| +#>   <fct>        <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> | ||||
| +#> 1 Afghanistan   2.89   2.91   2.93   2.92   2.87   2.90   2.99   2.93   2.81 | ||||
| +#> 2 Albania       3.20   3.29   3.36   3.44   3.52   3.55   3.56   3.57   3.40 | ||||
| +#> 3 Algeria       3.39   3.48   3.41   3.51   3.62   3.69   3.76   3.75   3.70 | ||||
| +#> 4 Angola        3.55   3.58   3.63   3.74   3.74   3.48   3.44   3.39   3.42 | ||||
| +#> 5 Argentina     3.77   3.84   3.85   3.91   3.98   4.00   3.95   3.96   3.97 | ||||
| +#> 6 Australia     4.00   4.04   4.09   4.16   4.23   4.26   4.29   4.34   4.37 | ||||
| +#> # … with 136 more rows, and 3 more variables: `1997` <dbl>, `2002` <dbl>, | ||||
| +#> #   `2007` <dbl></pre> | ||||
| </div> | ||||
| <p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p> | ||||
| <div class="cell"> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-data-transform"> | ||||
| <h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-data-transform" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data transformation</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -21,12 +13,12 @@ Prerequisites</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">library(nycflights13) | ||||
| library(tidyverse) | ||||
| #> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── | ||||
| #> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ── | ||||
| #> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000 | ||||
| #> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000   | ||||
| #> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000    | ||||
| #> ✔ readr   2.1.3             ✔ forcats 0.5.2         | ||||
| #> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── | ||||
| #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ── | ||||
| #> ✖ dplyr::filter() masks stats::filter() | ||||
| #> ✖ dplyr::lag()    masks stats::lag()</pre> | ||||
| </div> | ||||
| @@ -40,7 +32,7 @@ nycflights13</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -92,7 +84,7 @@ Rows</h1> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   filter(arr_delay > 120) | ||||
| #> # A tibble: 10,034 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      811      630     101    1047     830     137 MQ      | ||||
| #> 2  2013     1     1      848     1835     853    1001    1950     851 MQ      | ||||
| @@ -111,7 +103,7 @@ Rows</h1> | ||||
| flights |>  | ||||
|   filter(month == 1 & day == 1) | ||||
| #> # A tibble: 842 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -128,7 +120,7 @@ flights |> | ||||
| flights |>  | ||||
|   filter(month == 1 | month == 2) | ||||
| #> # A tibble: 51,955 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -147,7 +139,7 @@ flights |> | ||||
| flights |>  | ||||
|   filter(month %in% c(1, 2)) | ||||
| #> # A tibble: 51,955 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -197,7 +189,7 @@ Common mistakes</h2> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   arrange(year, month, day, dep_time) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -215,7 +207,7 @@ Common mistakes</h2> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   arrange(desc(dep_delay)) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     9      641      900    1301    1242    1530    1272 HA      | ||||
| #> 2  2013     6    15     1432     1935    1137    1607    2120    1127 MQ      | ||||
| @@ -234,7 +226,7 @@ Common mistakes</h2> | ||||
|   filter(dep_delay <= 10 & dep_delay >= -10) |>  | ||||
|   arrange(desc(arr_delay)) | ||||
| #> # A tibble: 239,109 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013    11     1      658      700      -2    1329    1015     194 VX      | ||||
| #> 2  2013     4    18      558      600      -2    1149     850     179 AA      | ||||
| @@ -285,7 +277,7 @@ Columns</h1> | ||||
|     speed = distance / air_time * 60 | ||||
|   ) | ||||
| #> # A tibble: 336,776 × 21 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -308,18 +300,19 @@ Columns</h1> | ||||
|     .before = 1 | ||||
|   ) | ||||
| #> # A tibble: 336,776 × 21 | ||||
| #>    gain speed  year month   day dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ | ||||
| #>   <dbl> <dbl> <int> <int> <int>    <int>   <int>   <dbl>   <int>   <int>   <dbl> | ||||
| #> 1    -9  370.  2013     1     1      517     515       2     830     819      11 | ||||
| #> 2   -16  374.  2013     1     1      533     529       4     850     830      20 | ||||
| #> 3   -31  408.  2013     1     1      542     540       2     923     850      33 | ||||
| #> 4    17  517.  2013     1     1      544     545      -1    1004    1022     -18 | ||||
| #> 5    19  394.  2013     1     1      554     600      -6     812     837     -25 | ||||
| #> 6   -16  288.  2013     1     1      554     558      -4     740     728      12 | ||||
| #> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>, | ||||
| #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, | ||||
| #> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names | ||||
| #> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> | ||||
| #>    gain speed  year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴ | ||||
| #>   <dbl> <dbl> <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int> | ||||
| #> 1    -9  370.  2013     1     1      517          515       2     830     819 | ||||
| #> 2   -16  374.  2013     1     1      533          529       4     850     830 | ||||
| #> 3   -31  408.  2013     1     1      542          540       2     923     850 | ||||
| #> 4    17  517.  2013     1     1      544          545      -1    1004    1022 | ||||
| #> 5    19  394.  2013     1     1      554          600      -6     812     837 | ||||
| #> 6   -16  288.  2013     1     1      554          558      -4     740     728 | ||||
| #> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>, | ||||
| #> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, | ||||
| #> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, | ||||
| #> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time, | ||||
| #> #   ²dep_delay, ³arr_time, ⁴sched_arr_time</pre> | ||||
| </div> | ||||
| <p>The <code>.</code> is a sign that <code>.before</code> is an argument to the function, not the name of a new variable. You can also use <code>.after</code> to add after a variable, and in both <code>.before</code> and <code>.after</code> you can use a variable name instead of a position. For example, we could add the new variables after <code>day</code>:</p> | ||||
| <div class="cell"> | ||||
| @@ -330,18 +323,19 @@ Columns</h1> | ||||
|     .after = day | ||||
|   ) | ||||
| #> # A tibble: 336,776 × 21 | ||||
| #>    year month   day  gain speed dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ | ||||
| #>   <int> <int> <int> <dbl> <dbl>    <int>   <int>   <dbl>   <int>   <int>   <dbl> | ||||
| #> 1  2013     1     1    -9  370.      517     515       2     830     819      11 | ||||
| #> 2  2013     1     1   -16  374.      533     529       4     850     830      20 | ||||
| #> 3  2013     1     1   -31  408.      542     540       2     923     850      33 | ||||
| #> 4  2013     1     1    17  517.      544     545      -1    1004    1022     -18 | ||||
| #> 5  2013     1     1    19  394.      554     600      -6     812     837     -25 | ||||
| #> 6  2013     1     1   -16  288.      554     558      -4     740     728      12 | ||||
| #> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>, | ||||
| #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, | ||||
| #> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names | ||||
| #> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> | ||||
| #>    year month   day  gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴ | ||||
| #>   <int> <int> <int> <dbl> <dbl>    <int>        <int>   <dbl>   <int>   <int> | ||||
| #> 1  2013     1     1    -9  370.      517          515       2     830     819 | ||||
| #> 2  2013     1     1   -16  374.      533          529       4     850     830 | ||||
| #> 3  2013     1     1   -31  408.      542          540       2     923     850 | ||||
| #> 4  2013     1     1    17  517.      544          545      -1    1004    1022 | ||||
| #> 5  2013     1     1    19  394.      554          600      -6     812     837 | ||||
| #> 6  2013     1     1   -16  288.      554          558      -4     740     728 | ||||
| #> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>, | ||||
| #> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, | ||||
| #> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, | ||||
| #> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time, | ||||
| #> #   ²dep_delay, ³arr_time, ⁴sched_arr_time</pre> | ||||
| </div> | ||||
| <p>Alternatively, you can control which variables are kept with the <code>.keep</code> argument. A particularly useful value is <code>"used"</code>, which allows you to see the inputs and outputs from your calculations:</p> | ||||
| <div class="cell"> | ||||
| @@ -403,18 +397,18 @@ flights |> | ||||
| flights |>  | ||||
|   select(!year:day) | ||||
| #> # A tibble: 336,776 × 16 | ||||
| #>   dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin | ||||
| #>      <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  | ||||
| #> 1      517     515       2     830     819      11 UA        1545 N14228  EWR    | ||||
| #> 2      533     529       4     850     830      20 UA        1714 N24211  LGA    | ||||
| #> 3      542     540       2     923     850      33 AA        1141 N619AA  JFK    | ||||
| #> 4      544     545      -1    1004    1022     -18 B6         725 N804JB  JFK    | ||||
| #> 5      554     600      -6     812     837     -25 DL         461 N668DN  LGA    | ||||
| #> 6      554     558      -4     740     728      12 UA        1696 N39463  EWR    | ||||
| #> # … with 336,770 more rows, 6 more variables: dest <chr>, air_time <dbl>, | ||||
| #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated | ||||
| #> #   variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, | ||||
| #> #   ⁵arr_delay | ||||
| #>   dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum | ||||
| #>      <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   | ||||
| #> 1      517         515       2     830     819      11 UA        1545 N14228  | ||||
| #> 2      533         529       4     850     830      20 UA        1714 N24211  | ||||
| #> 3      542         540       2     923     850      33 AA        1141 N619AA  | ||||
| #> 4      544         545      -1    1004    1022     -18 B6         725 N804JB  | ||||
| #> 5      554         600      -6     812     837     -25 DL         461 N668DN  | ||||
| #> 6      554         558      -4     740     728      12 UA        1696 N39463  | ||||
| #> # … with 336,770 more rows, 7 more variables: origin <chr>, dest <chr>, | ||||
| #> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, | ||||
| #> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time, | ||||
| #> #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay | ||||
|  | ||||
| # Select all columns that are characters | ||||
| flights |>  | ||||
| @@ -466,7 +460,7 @@ flights |> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   rename(tail_num = tailnum) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -492,25 +486,25 @@ flights |> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   relocate(time_hour, air_time) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>   time_hour           air_time  year month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴ | ||||
| #>   <dttm>                 <dbl> <int> <int> <int>   <int>   <int>   <dbl>   <int> | ||||
| #> 1 2013-01-01 05:00:00      227  2013     1     1     517     515       2     830 | ||||
| #> 2 2013-01-01 05:00:00      227  2013     1     1     533     529       4     850 | ||||
| #> 3 2013-01-01 05:00:00      160  2013     1     1     542     540       2     923 | ||||
| #> 4 2013-01-01 05:00:00      183  2013     1     1     544     545      -1    1004 | ||||
| #> 5 2013-01-01 06:00:00      116  2013     1     1     554     600      -6     812 | ||||
| #> 6 2013-01-01 05:00:00      150  2013     1     1     554     558      -4     740 | ||||
| #> # … with 336,770 more rows, 10 more variables: sched_arr_time <int>, | ||||
| #> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, | ||||
| #> #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated | ||||
| #> #   variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time</pre> | ||||
| #>   time_hour           air_time  year month   day dep_time sched_dep…¹ dep_d…² | ||||
| #>   <dttm>                 <dbl> <int> <int> <int>    <int>       <int>   <dbl> | ||||
| #> 1 2013-01-01 05:00:00      227  2013     1     1      517         515       2 | ||||
| #> 2 2013-01-01 05:00:00      227  2013     1     1      533         529       4 | ||||
| #> 3 2013-01-01 05:00:00      160  2013     1     1      542         540       2 | ||||
| #> 4 2013-01-01 05:00:00      183  2013     1     1      544         545      -1 | ||||
| #> 5 2013-01-01 06:00:00      116  2013     1     1      554         600      -6 | ||||
| #> 6 2013-01-01 05:00:00      150  2013     1     1      554         558      -4 | ||||
| #> # … with 336,770 more rows, 11 more variables: arr_time <int>, | ||||
| #> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>, | ||||
| #> #   tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>, | ||||
| #> #   minute <dbl>, and abbreviated variable names ¹sched_dep_time, ²dep_delay</pre> | ||||
| </div> | ||||
| <p>But you can use the same <code>.before</code> and <code>.after</code> arguments as <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to choose where to put them:</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   relocate(year:dep_time, .after = time_hour) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>   sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest  | ||||
| #>   sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest  | ||||
| #>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr> | ||||
| #> 1     515       2     830     819      11 UA        1545 N14228  EWR    IAH   | ||||
| #> 2     529       4     850     830      20 UA        1714 N24211  LGA    IAH   | ||||
| @@ -518,14 +512,14 @@ flights |> | ||||
| #> 4     545      -1    1004    1022     -18 B6         725 N804JB  JFK    BQN   | ||||
| #> 5     600      -6     812     837     -25 DL         461 N668DN  LGA    ATL   | ||||
| #> 6     558      -4     740     728      12 UA        1696 N39463  EWR    ORD   | ||||
| #> # … with 336,770 more rows, 9 more variables: air_time <dbl>, distance <dbl>, | ||||
| #> #   hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>, month <int>, | ||||
| #> #   day <int>, dep_time <int>, and abbreviated variable names ¹sched_dep_time, | ||||
| #> #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay | ||||
| #> # … with 336,770 more rows, 9 more variables: air_time <dbl>, | ||||
| #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>, | ||||
| #> #   month <int>, day <int>, dep_time <int>, and abbreviated variable names | ||||
| #> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay | ||||
| flights |>  | ||||
|   relocate(starts_with("arr"), .before = dep_time) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>    year month   day arr_time arr_delay dep_time sched_…¹ dep_d…² sched…³ carrier | ||||
| #>    year month   day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <dbl>   <int>   <int>   <dbl>   <int> <chr>   | ||||
| #> 1  2013     1     1      830       11     517     515       2     819 UA      | ||||
| #> 2  2013     1     1      850       20     533     529       4     830 UA      | ||||
| @@ -536,7 +530,7 @@ flights |> | ||||
| #> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>, | ||||
| #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, | ||||
| #> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names | ||||
| #> #   ¹sched_dep_time, ²dep_delay, ³sched_arr_time</pre> | ||||
| #> #   ¹arr_delay, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵sched_arr_time</pre> | ||||
| </div> | ||||
| </section> | ||||
|  | ||||
| @@ -580,7 +574,7 @@ Groups</h1> | ||||
|   group_by(month) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #> # Groups:   month [12] | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -679,7 +673,7 @@ The<code>slice_</code> functions</h2> | ||||
|   slice_max(arr_delay, n = 1) | ||||
| #> # A tibble: 108 × 19 | ||||
| #> # Groups:   dest [105] | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     7    22     2145     2007      98     132    2259     153 B6      | ||||
| #> 2  2013     7    23     1139      800     219    1250     909     221 B6      | ||||
| @@ -725,7 +719,7 @@ Grouping by multiple variables</h2> | ||||
| daily | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #> # Groups:   year, month, day [365] | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -744,8 +738,8 @@ daily | ||||
|   summarize( | ||||
|     n = n() | ||||
|   ) | ||||
| #> `summarise()` has grouped output by 'year', 'month'. You can override using the | ||||
| #> `.groups` argument.</pre> | ||||
| #> `summarise()` has grouped output by 'year', 'month'. You can override using | ||||
| #> the `.groups` argument.</pre> | ||||
| </div> | ||||
| <p>If you’re happy with this behavior, you can explicitly request it in order to suppress the message:</p> | ||||
| <div class="cell"> | ||||
|   | ||||
| @@ -14,12 +14,12 @@ Prerequisites</h2> | ||||
| <p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">library(tidyverse) | ||||
| #> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── | ||||
| #> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ── | ||||
| #> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000 | ||||
| #> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000   | ||||
| #> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000    | ||||
| #> ✔ readr   2.1.3             ✔ forcats 0.5.2         | ||||
| #> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── | ||||
| #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ── | ||||
| #> ✖ dplyr::filter() masks stats::filter() | ||||
| #> ✖ dplyr::lag()    masks stats::lag()</pre> | ||||
| </div> | ||||
| @@ -47,12 +47,12 @@ The<code>mpg</code> data frame</h2> | ||||
| #> # A tibble: 234 × 11 | ||||
| #>   manufacturer model displ  year   cyl trans    drv     cty   hwy fl    class | ||||
| #>   <chr>        <chr> <dbl> <int> <int> <chr>    <chr> <int> <int> <chr> <chr> | ||||
| #> 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa… | ||||
| #> 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa… | ||||
| #> 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa… | ||||
| #> 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa… | ||||
| #> 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa… | ||||
| #> 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa… | ||||
| #> 1 audi         a4      1.8  1999     4 auto(l5) f        18    29 p     comp… | ||||
| #> 2 audi         a4      1.8  1999     4 manual(… f        21    29 p     comp… | ||||
| #> 3 audi         a4      2    2008     4 manual(… f        20    31 p     comp… | ||||
| #> 4 audi         a4      2    2008     4 auto(av) f        21    30 p     comp… | ||||
| #> 5 audi         a4      2.8  1999     6 auto(l5) f        16    26 p     comp… | ||||
| #> 6 audi         a4      2.8  1999     6 manual(… f        18    26 p     comp… | ||||
| #> # … with 228 more rows</pre> | ||||
| </div> | ||||
| <p>Among the variables in <code>mpg</code> are:</p> | ||||
|   | ||||
| @@ -1,26 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-databases"> | ||||
| <h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds")) | ||||
| diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre> | ||||
| </div><p>Other times you might want to use your own SQL query as a starting point:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))</pre> | ||||
| </div> | ||||
|  | ||||
| <p>Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue <a href="https://github.com/tidyverse/dbplyr/issues/">on GitHub</a> to help us do better.</p> | ||||
|  | ||||
| <p>In the examples above note that <code>"year"</code> and <code>"type"</code> are wrapped in double quotes. That’s because these are <strong>reserved words</strong> in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.</p><p>When working with other databases you’re likely to see every variable name quoted because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.</p><pre data-type="programlisting" data-code-language="sql">SELECT "tailnum", "type", "manufacturer", "model", "year" | ||||
| FROM "planes"</pre><p>Some other database systems use backticks instead of quotes:</p><pre data-type="programlisting" data-code-language="sql">SELECT `tailnum`, `type`, `manufacturer`, `model`, `year` | ||||
| FROM `planes`</pre></div> | ||||
|  | ||||
| <h1><span id="sec-import-databases" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Databases</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -203,8 +182,6 @@ diamonds_db | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds")) | ||||
| diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre> | ||||
| @@ -334,8 +311,6 @@ planes |> show_query() | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds")) | ||||
| diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre> | ||||
| @@ -388,8 +363,6 @@ planes |> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">diamonds_db <- tbl(con, in_schema("sales", "diamonds")) | ||||
| diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))</pre> | ||||
| @@ -665,8 +638,8 @@ mutate_query <- function(df, ...) { | ||||
|     mean = mean(arr_delay, na.rm = TRUE), | ||||
|     median = median(arr_delay, na.rm = TRUE) | ||||
|   ) | ||||
| #> `summarise()` has grouped output by "year" and "month". You can override using | ||||
| #> the `.groups` argument. | ||||
| #> `summarise()` has grouped output by "year" and "month". You can override | ||||
| #> using the `.groups` argument. | ||||
| #> <SQL> | ||||
| #> SELECT | ||||
| #>   "year", | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-datetimes"> | ||||
| <h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-dates-and-times" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Dates and times</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -43,7 +35,7 @@ Creating date/times</h1> | ||||
| <pre data-type="programlisting" data-code-language="downlit">today() | ||||
| #> [1] "2022-11-18" | ||||
| now() | ||||
| #> [1] "2022-11-18 10:21:36 CST"</pre> | ||||
| #> [1] "2022-11-18 10:59:07 CST"</pre> | ||||
| </div> | ||||
| <p>Otherwise, the following sections describe the four ways you’re likely to create a date/time:</p> | ||||
| <ul><li>While reading a file with readr.</li> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-factors"> | ||||
| <h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -124,12 +116,12 @@ General Social Survey</h1> | ||||
| #> # A tibble: 21,483 × 9 | ||||
| #>    year marital         age race  rincome        partyid  relig denom tvhours | ||||
| #>   <int> <fct>         <int> <fct> <fct>          <fct>    <fct> <fct>   <int> | ||||
| #> 1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12 | ||||
| #> 2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA | ||||
| #> 3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2 | ||||
| #> 4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4 | ||||
| #> 5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1 | ||||
| #> 6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA | ||||
| #> 1  2000 Never married    26 White $8000 to 9999  Ind,nea… Prot… Sout…      12 | ||||
| #> 2  2000 Divorced         48 White $8000 to 9999  Not str… Prot… Bapt…      NA | ||||
| #> 3  2000 Widowed          67 White Not applicable Indepen… Prot… No d…       2 | ||||
| #> 4  2000 Never married    39 White Not applicable Ind,nea… Orth… Not …       4 | ||||
| #> 5  2000 Divorced         25 White Not applicable Not str… None  Not …       1 | ||||
| #> 6  2000 Married          25 White $20000 - 24999 Strong … Prot… Sout…      NA | ||||
| #> # … with 21,477 more rows</pre> | ||||
| </div> | ||||
| <p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p> | ||||
|   | ||||
| @@ -1,17 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-functions"> | ||||
| <h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div><h1> | ||||
| RStudio | ||||
| </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you’ve written, place the cursor on the name of the function and press <code>F2</code>.</p></li> | ||||
| <li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li> | ||||
| </ul></div> | ||||
|  | ||||
| <h1><span id="sec-functions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Functions</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -278,9 +266,7 @@ mape <- function(actual, predicted) { | ||||
| </div> | ||||
| <div data-type="note"><h1> | ||||
| RStudio | ||||
| </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you’ve written, place the cursor on the name of the function and press <code>F2</code>.</p></li> | ||||
| </h1><p>Once you start writing functions, there are two RStudio shortcuts that are super useful:</p><ul><li><p>To find the definition of a function that you’ve written, place the cursor on the name of the function and press <code>F2</code>.</p></li> | ||||
| <li><p>To quickly jump to a function, press <code>Ctrl + .</code> to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.</p></li> | ||||
| </ul></div> | ||||
|  | ||||
| @@ -490,7 +476,7 @@ flights |> unique_where(tailnum == "N14228", month) | ||||
|  | ||||
| flights_sub(dest == "IAH", contains("time")) | ||||
| #> # A tibble: 7,198 × 8 | ||||
| #>   time_hour           carrier flight dep_time sched_de…¹ arr_t…² sched…³ air_t…⁴ | ||||
| #>   time_hour           carrier flight dep_time sched…¹ arr_t…² sched…³ air_t…⁴ | ||||
| #>   <dttm>              <chr>    <int>    <int>   <int>   <int>   <int>   <dbl> | ||||
| #> 1 2013-01-01 05:00:00 UA        1545      517     515     830     819     227 | ||||
| #> 2 2013-01-01 05:00:00 UA        1714      533     529     850     830     227 | ||||
| @@ -529,8 +515,8 @@ flights |> | ||||
| } | ||||
| flights |>  | ||||
|   count_missing(c(year, month, day), dep_time) | ||||
| #> `summarise()` has grouped output by 'year', 'month'. You can override using the | ||||
| #> `.groups` argument. | ||||
| #> `summarise()` has grouped output by 'year', 'month'. You can override using | ||||
| #> the `.groups` argument. | ||||
| #> # A tibble: 365 × 4 | ||||
| #> # Groups:   year, month [12] | ||||
| #>    year month   day n_miss | ||||
|   | ||||
| @@ -98,12 +98,12 @@ The tidyverse</h2> | ||||
| <p>You will not be able to use the functions, objects, or help files in a package until you load it with <code><a href="https://rdrr.io/r/base/library.html">library()</a></code>. Once you have installed a package, you can load it using the <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> function:</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">library(tidyverse) | ||||
| #> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── | ||||
| #> ── Attaching packages ──────────────────────────────────── tidyverse 1.3.2 ── | ||||
| #> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000 | ||||
| #> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000   | ||||
| #> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000    | ||||
| #> ✔ readr   2.1.3             ✔ forcats 0.5.2         | ||||
| #> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── | ||||
| #> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ── | ||||
| #> ✖ dplyr::filter() masks stats::filter() | ||||
| #> ✖ dplyr::lag()    masks stats::lag()</pre> | ||||
| </div> | ||||
| @@ -162,134 +162,105 @@ Acknowledgements</h1> | ||||
| Colophon</h1> | ||||
| <p>An online version of this book is available at <a href="https://r4ds.hadley.nz" class="uri">https://r4ds.hadley.nz</a>. It will continue to evolve in between reprints of the physical book. The source of the book is available at <a href="https://github.com/hadley/r4ds" class="uri">https://github.com/hadley/r4ds</a>. The book is powered by <a href="https://quarto.org">Quarto</a>, which makes it easy to write books that combine text and executable code.</p> | ||||
| <p>This book was built with:</p> | ||||
| <div class="cell-output-display"> | ||||
| <table class="table"><colgroup><col style="width: 14%"/><col style="width: 14%"/><col style="width: 71%"/></colgroup><thead><tr class="header"><th style="text-align: left;">package</th> | ||||
| <th style="text-align: left;">version</th> | ||||
| <th style="text-align: left;">source</th> | ||||
| </tr></thead><tbody><tr class="odd"><td style="text-align: left;">broom</td> | ||||
| <td style="text-align: left;">1.0.1</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">cli</td> | ||||
| <td style="text-align: left;">3.4.1</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.1)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">crayon</td> | ||||
| <td style="text-align: left;">1.5.2</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">dbplyr</td> | ||||
| <td style="text-align: left;">2.2.1.9000</td> | ||||
| <td style="text-align: left;">Github (tidyverse/dbplyr@f7b5596f6125011ab0dcd4eccbfe56c5294214da)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">dplyr</td> | ||||
| <td style="text-align: left;">1.0.99.9000</td> | ||||
| <td style="text-align: left;">local</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">dtplyr</td> | ||||
| <td style="text-align: left;">1.2.2</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">forcats</td> | ||||
| <td style="text-align: left;">0.5.2</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">ggplot2</td> | ||||
| <td style="text-align: left;">3.4.0.9000</td> | ||||
| <td style="text-align: left;">Github (tidyverse/ggplot2@4fea51b1eb2cdacebeacf425627dcbc1d61a5d3e)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">googledrive</td> | ||||
| <td style="text-align: left;">2.0.0</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">googlesheets4</td> | ||||
| <td style="text-align: left;">1.0.1</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">haven</td> | ||||
| <td style="text-align: left;">2.5.1</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">hms</td> | ||||
| <td style="text-align: left;">1.1.2</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">httr</td> | ||||
| <td style="text-align: left;">1.4.4</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">jsonlite</td> | ||||
| <td style="text-align: left;">1.8.3</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.1)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">lubridate</td> | ||||
| <td style="text-align: left;">1.9.0</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.1)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">magrittr</td> | ||||
| <td style="text-align: left;">2.0.3</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">modelr</td> | ||||
| <td style="text-align: left;">0.1.9</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">pillar</td> | ||||
| <td style="text-align: left;">1.8.1</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">purrr</td> | ||||
| <td style="text-align: left;">0.9000.0.9000</td> | ||||
| <td style="text-align: left;">Github (tidyverse/purrr@aaaa58a571cc449dbcc4348e77e589b373e1e059)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">readr</td> | ||||
| <td style="text-align: left;">2.1.3</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.1)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">readxl</td> | ||||
| <td style="text-align: left;">1.4.1</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">reprex</td> | ||||
| <td style="text-align: left;">2.0.2</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">rlang</td> | ||||
| <td style="text-align: left;">1.0.6</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">rstudioapi</td> | ||||
| <td style="text-align: left;">0.14</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">rvest</td> | ||||
| <td style="text-align: left;">1.0.3</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">stringr</td> | ||||
| <td style="text-align: left;">1.4.1.9000</td> | ||||
| <td style="text-align: left;">Github (tidyverse/stringr@ebf38238cbb80bf0e852d5d8d056c04e36d7c20c)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">tibble</td> | ||||
| <td style="text-align: left;">3.1.8</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">tidyr</td> | ||||
| <td style="text-align: left;">1.2.1.9001</td> | ||||
| <td style="text-align: left;">Github (tidyverse/tidyr@91747952f10c961be747c0de1026d109c920e4fc)</td> | ||||
| </tr><tr class="odd"><td style="text-align: left;">tidyverse</td> | ||||
| <td style="text-align: left;">1.3.2</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr><tr class="even"><td style="text-align: left;">xml2</td> | ||||
| <td style="text-align: left;">1.3.3</td> | ||||
| <td style="text-align: left;">CRAN (R 4.2.0)</td> | ||||
| </tr></tbody></table></div> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">sessioninfo::session_info(c("tidyverse")) | ||||
| #> ─ Session info ─────────────────────────────────────────────────────────────── | ||||
| #>  setting  value | ||||
| #>  version  R version 4.2.1 (2022-06-23) | ||||
| #>  os       macOS Ventura 13.0.1 | ||||
| #>  system   aarch64, darwin20 | ||||
| #>  ui       X11 | ||||
| #>  language (EN) | ||||
| #>  collate  en_US.UTF-8 | ||||
| #>  ctype    en_US.UTF-8 | ||||
| #>  tz       America/Chicago | ||||
| #>  date     2022-11-18 | ||||
| #>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown) | ||||
| #>  | ||||
| #> ─ Packages ─────────────────────────────────────────────────────────────────── | ||||
| #>  package       * version       date (UTC) lib source | ||||
| #>  askpass         1.1           2019-01-13 [1] CRAN (R 4.2.0) | ||||
| #>  assertthat      0.2.1         2019-03-21 [1] CRAN (R 4.2.0) | ||||
| #>  backports       1.4.1         2021-12-13 [1] CRAN (R 4.2.0) | ||||
| #>  base64enc       0.1-3         2015-07-28 [1] CRAN (R 4.2.0) | ||||
| #>  bit             4.0.4         2020-08-04 [1] CRAN (R 4.2.0) | ||||
| #>  bit64           4.0.5         2020-08-30 [1] CRAN (R 4.2.0) | ||||
| #>  blob            1.2.3         2022-04-10 [1] CRAN (R 4.2.0) | ||||
| #>  broom           1.0.1         2022-08-29 [1] CRAN (R 4.2.0) | ||||
| #>  bslib           0.4.1         2022-11-02 [1] CRAN (R 4.2.0) | ||||
| #>  cachem          1.0.6         2021-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  callr           3.7.3         2022-11-02 [1] CRAN (R 4.2.1) | ||||
| #>  cellranger      1.1.0         2016-07-27 [1] CRAN (R 4.2.0) | ||||
| #>  cli             3.4.1         2022-09-23 [1] CRAN (R 4.2.1) | ||||
| #>  clipr           0.8.0         2022-02-22 [1] CRAN (R 4.2.0) | ||||
| #>  colorspace      2.0-3         2022-02-21 [1] CRAN (R 4.2.0) | ||||
| #>  cpp11           0.4.3         2022-10-12 [1] CRAN (R 4.2.0) | ||||
| #>  crayon          1.5.2         2022-09-29 [1] CRAN (R 4.2.0) | ||||
| #>  curl            4.3.3         2022-10-06 [1] CRAN (R 4.2.0) | ||||
| #>  data.table      1.14.4        2022-10-17 [1] CRAN (R 4.2.1) | ||||
| #>  DBI             1.1.3         2022-06-18 [1] CRAN (R 4.2.0) | ||||
| #>  dbplyr          2.2.1.9000    2022-11-03 [1] Github (tidyverse/dbplyr@f7b5596) | ||||
| #>  digest          0.6.30        2022-10-18 [1] CRAN (R 4.2.0) | ||||
| #>  dplyr         * 1.0.99.9000   2022-11-17 [1] local | ||||
| #>  dtplyr          1.2.2         2022-08-20 [1] CRAN (R 4.2.0) | ||||
| #>  ellipsis        0.3.2         2021-04-29 [1] CRAN (R 4.2.0) | ||||
| #>  evaluate        0.18          2022-11-07 [1] CRAN (R 4.2.1) | ||||
| #>  fansi           1.0.3         2022-03-24 [1] CRAN (R 4.2.0) | ||||
| #>  farver          2.1.1         2022-07-06 [1] CRAN (R 4.2.0) | ||||
| #>  fastmap         1.1.0         2021-01-25 [1] CRAN (R 4.2.0) | ||||
| #>  forcats       * 0.5.2         2022-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  fs              1.5.2         2021-12-08 [1] CRAN (R 4.2.0) | ||||
| #>  gargle          1.2.1.9000    2022-10-27 [1] Github (r-lib/gargle@69d3f28) | ||||
| #>  generics        0.1.3         2022-07-05 [1] CRAN (R 4.2.0) | ||||
| #>  ggplot2       * 3.4.0.9000    2022-11-10 [1] Github (tidyverse/ggplot2@4fea51b) | ||||
| #>  glue            1.6.2         2022-02-24 [1] CRAN (R 4.2.0) | ||||
| #>  googledrive     2.0.0         2021-07-08 [1] CRAN (R 4.2.0) | ||||
| #>  googlesheets4   1.0.1         2022-08-13 [1] CRAN (R 4.2.0) | ||||
| #>  gtable          0.3.1.9000    2022-09-25 [1] local | ||||
| #>  haven           2.5.1         2022-08-22 [1] CRAN (R 4.2.0) | ||||
| #>  highr           0.9           2021-04-16 [1] CRAN (R 4.2.0) | ||||
| #>  hms             1.1.2         2022-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  htmltools       0.5.3         2022-07-18 [1] CRAN (R 4.2.0) | ||||
| #>  httr            1.4.4         2022-08-17 [1] CRAN (R 4.2.0) | ||||
| #>  ids             1.0.1         2017-05-31 [1] CRAN (R 4.2.0) | ||||
| #>  isoband         0.2.6         2022-10-06 [1] CRAN (R 4.2.0) | ||||
| #>  jquerylib       0.1.4         2021-04-26 [1] CRAN (R 4.2.0) | ||||
| #>  jsonlite        1.8.3         2022-10-21 [1] CRAN (R 4.2.1) | ||||
| #>  knitr           1.40          2022-08-24 [1] CRAN (R 4.2.0) | ||||
| #>  labeling        0.4.2         2020-10-20 [1] CRAN (R 4.2.0) | ||||
| #>  lattice         0.20-45       2021-09-22 [2] CRAN (R 4.2.1) | ||||
| #>  lifecycle       1.0.3.9000    2022-10-10 [1] Github (r-lib/lifecycle@80a1e52) | ||||
| #>  lubridate       1.9.0         2022-11-06 [1] CRAN (R 4.2.1) | ||||
| #>  magrittr        2.0.3         2022-03-30 [1] CRAN (R 4.2.0) | ||||
| #>  MASS            7.3-58.1      2022-08-03 [1] CRAN (R 4.2.0) | ||||
| #>  Matrix          1.5-1         2022-09-13 [1] CRAN (R 4.2.0) | ||||
| #>  memoise         2.0.1         2021-11-26 [1] CRAN (R 4.2.0) | ||||
| #>  mgcv            1.8-41        2022-10-21 [1] CRAN (R 4.2.0) | ||||
| #>  mime            0.12          2021-09-28 [1] CRAN (R 4.2.0) | ||||
| #>  modelr          0.1.9         2022-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  munsell         0.5.0         2018-06-12 [1] CRAN (R 4.2.0) | ||||
| #>  nlme            3.1-160       2022-10-10 [1] CRAN (R 4.2.0) | ||||
| #>  openssl         2.0.4         2022-10-17 [1] CRAN (R 4.2.1) | ||||
| #>  pillar          1.8.1         2022-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  pkgconfig       2.0.3         2019-09-22 [1] CRAN (R 4.2.0) | ||||
| #>  prettyunits     1.1.1         2020-01-24 [1] CRAN (R 4.2.0) | ||||
| #>  processx        3.8.0         2022-10-26 [1] CRAN (R 4.2.1) | ||||
| #>  progress        1.2.2         2019-05-16 [1] CRAN (R 4.2.0) | ||||
| #>  ps              1.7.2         2022-10-26 [1] CRAN (R 4.2.1) | ||||
| #>  purrr         * 0.9000.0.9000 2022-11-10 [1] Github (tidyverse/purrr@aaaa58a) | ||||
| #>  R6              2.5.1         2021-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  rappdirs        0.3.3         2021-01-31 [1] CRAN (R 4.2.0) | ||||
| #>  RColorBrewer    1.1-3         2022-04-03 [1] CRAN (R 4.2.0) | ||||
| #>  readr         * 2.1.3         2022-10-01 [1] CRAN (R 4.2.1) | ||||
| #>  readxl          1.4.1         2022-08-17 [1] CRAN (R 4.2.0) | ||||
| #>  rematch         1.0.1         2016-04-21 [1] CRAN (R 4.2.0) | ||||
| #>  rematch2        2.1.2         2020-05-01 [1] CRAN (R 4.2.0) | ||||
| #>  reprex          2.0.2         2022-08-17 [1] CRAN (R 4.2.0) | ||||
| #>  rlang           1.0.6         2022-09-24 [1] CRAN (R 4.2.0) | ||||
| #>  rmarkdown       2.18          2022-11-09 [1] CRAN (R 4.2.1) | ||||
| #>  rstudioapi      0.14          2022-08-22 [1] CRAN (R 4.2.0) | ||||
| #>  rvest           1.0.3         2022-08-19 [1] CRAN (R 4.2.0) | ||||
| #>  sass            0.4.2         2022-07-16 [1] CRAN (R 4.2.0) | ||||
| #>  scales          1.2.1         2022-08-20 [1] CRAN (R 4.2.0) | ||||
| #>  selectr         0.4-2         2019-11-20 [1] CRAN (R 4.2.0) | ||||
| #>  stringi         1.7.8         2022-07-11 [1] CRAN (R 4.2.0) | ||||
| #>  stringr       * 1.4.1.9000    2022-11-10 [1] Github (tidyverse/stringr@ebf3823) | ||||
| #>  sys             3.4.1         2022-10-18 [1] CRAN (R 4.2.0) | ||||
| #>  tibble        * 3.1.8         2022-07-22 [1] CRAN (R 4.2.0) | ||||
| #>  tidyr         * 1.2.1.9001    2022-11-05 [1] Github (tidyverse/tidyr@9174795) | ||||
| #>  tidyselect      1.2.0         2022-10-10 [1] CRAN (R 4.2.1) | ||||
| #>  tidyverse     * 1.3.2         2022-07-18 [1] CRAN (R 4.2.0) | ||||
| #>  timechange      0.1.1         2022-11-04 [1] CRAN (R 4.2.1) | ||||
| #>  tinytex         0.42          2022-09-27 [1] CRAN (R 4.2.1) | ||||
| #>  tzdb            0.3.0         2022-03-28 [1] CRAN (R 4.2.0) | ||||
| #>  utf8            1.2.2         2021-07-24 [1] CRAN (R 4.2.0) | ||||
| #>  uuid            1.1-0         2022-04-19 [1] CRAN (R 4.2.0) | ||||
| #>  vctrs           0.5.0         2022-10-22 [1] CRAN (R 4.2.0) | ||||
| #>  viridisLite     0.4.1         2022-08-22 [1] CRAN (R 4.2.0) | ||||
| #>  vroom           1.6.0         2022-09-30 [1] CRAN (R 4.2.0) | ||||
| #>  withr           2.5.0         2022-03-03 [1] CRAN (R 4.2.0) | ||||
| #>  xfun            0.34          2022-10-18 [1] CRAN (R 4.2.1) | ||||
| #>  xml2            1.3.3         2021-11-30 [1] CRAN (R 4.2.0) | ||||
| #>  yaml            2.3.6         2022-10-18 [1] CRAN (R 4.2.0) | ||||
| #>  | ||||
| #>  [1] /Users/hadleywickham/Library/R/arm64/4.2/library | ||||
| #>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library | ||||
| #>  | ||||
| #> ────────────────────────────────────────────────────────────────────────────── | ||||
| cli:::ruler() | ||||
| #> ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8 | ||||
| #> 12345678901234567890123456789012345678901234567890123456789012345678901234567890</pre> | ||||
| <pre data-type="programlisting" data-code-language="downlit">cli:::ruler() | ||||
| #> ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+-- | ||||
| #> 12345678901234567890123456789012345678901234567890123456789012345678901234567</pre> | ||||
| </div> | ||||
|  | ||||
|  | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-iteration"> | ||||
| <h1><span id="sec-iteration" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Iteration</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-iteration" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Iteration</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -226,9 +218,10 @@ df_miss |> | ||||
|     n = n() | ||||
|   ) | ||||
| #> # A tibble: 1 × 9 | ||||
| #>   a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss     n | ||||
| #>   a_median a_n_miss b_median b_n_miss c_median c_n_miss d_med…¹ d_n_m…²     n | ||||
| #>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>   <dbl>   <int> <int> | ||||
| #> 1    0.429        1   -0.721        1   -0.796        2    0.704        0     5</pre> | ||||
| #> 1    0.429        1   -0.721        1   -0.796        2   0.704       0     5 | ||||
| #> # … with abbreviated variable names ¹d_median, ²d_n_miss</pre> | ||||
| </div> | ||||
| <p>If you look carefully, you might intuit that the columns are named using a glue specification (<a href="#sec-glue" data-type="xref">#sec-glue</a>) like <code>{.col}_{.fn}</code> where <code>.col</code> is the name of the original column and <code>.fn</code> is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the <code>.names</code> argument to supply your own glue spec.</p> | ||||
| </section> | ||||
| @@ -251,9 +244,10 @@ Column names</h2> | ||||
|     n = n(), | ||||
|   ) | ||||
| #> # A tibble: 1 × 9 | ||||
| #>   median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d     n | ||||
| #>   median_a n_miss_a median_b n_miss_b median_c n_miss_c media…¹ n_mis…²     n | ||||
| #>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>   <dbl>   <int> <int> | ||||
| #> 1    0.429        1   -0.721        1   -0.796        2    0.704        0     5</pre> | ||||
| #> 1    0.429        1   -0.721        1   -0.796        2   0.704       0     5 | ||||
| #> # … with abbreviated variable names ¹median_d, ²n_miss_d</pre> | ||||
| </div> | ||||
| <p>The <code>.names</code> argument is particularly important when you use <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> with <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code>. By default the output of <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> is given the same names as the inputs. This means that <code><a href="https://dplyr.tidyverse.org/reference/across.html">across()</a></code> inside of <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> will replace existing columns. For example, here we use <code><a href="https://dplyr.tidyverse.org/reference/coalesce.html">coalesce()</a></code> to replace <code>NA</code>s with <code>0</code>:</p> | ||||
| <div class="cell"> | ||||
| @@ -930,8 +924,8 @@ DBI::dbCreateTable(con, "gapminder", template)</pre> | ||||
| <pre data-type="programlisting" data-code-language="downlit">con |> tbl("gapminder") | ||||
| #> # Source:   table<gapminder> [0 x 6] | ||||
| #> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:] | ||||
| #> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>, | ||||
| #> #   gdpPercap <dbl>, year <dbl></pre> | ||||
| #> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, | ||||
| #> #   pop <dbl>, gdpPercap <dbl>, year <dbl></pre> | ||||
| </div> | ||||
| <p>Next, we need a function that takes a single file path, reads it into R, and adds the result to the <code>gapminder</code> table. We can do that by combining <code>read_excel()</code> with <code><a href="https://dbi.r-dbi.org/reference/dbAppendTable.html">DBI::dbAppendTable()</a></code>:</p> | ||||
| <div class="cell"> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-joins"> | ||||
| <h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-joins" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Joins</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -59,12 +51,12 @@ Primary and foreign keys</h2> | ||||
| #> # A tibble: 1,458 × 8 | ||||
| #>   faa   name                             lat   lon   alt    tz dst   tzone    | ||||
| #>   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>    | ||||
| #> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne… | ||||
| #> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch… | ||||
| #> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch… | ||||
| #> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne… | ||||
| #> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne… | ||||
| #> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne… | ||||
| #> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America… | ||||
| #> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America… | ||||
| #> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America… | ||||
| #> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America… | ||||
| #> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America… | ||||
| #> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America… | ||||
| #> # … with 1,452 more rows</pre> | ||||
| </div> | ||||
| </li> | ||||
| @@ -75,12 +67,12 @@ Primary and foreign keys</h2> | ||||
| #> # A tibble: 3,322 × 9 | ||||
| #>   tailnum  year type                 manuf…¹ model engines seats speed engine | ||||
| #>   <chr>   <int> <chr>                <chr>   <chr>   <int> <int> <int> <chr>  | ||||
| #> 1 N10156   2004 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo… | ||||
| #> 2 N102UW   1998 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 3 N103US   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 4 N104UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 5 N10575   2002 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo… | ||||
| #> 6 N105UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 1 N10156   2004 Fixed wing multi en… EMBRAER EMB-…       2    55    NA Turbo… | ||||
| #> 2 N102UW   1998 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 3 N103US   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 4 N104UW   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> 5 N10575   2002 Fixed wing multi en… EMBRAER EMB-…       2    55    NA Turbo… | ||||
| #> 6 N105UW   1999 Fixed wing multi en… AIRBUS… A320…       2   182    NA Turbo… | ||||
| #> # … with 3,316 more rows, and abbreviated variable name ¹manufacturer</pre> | ||||
| </div> | ||||
| </li> | ||||
| @@ -89,7 +81,7 @@ Primary and foreign keys</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">weather | ||||
| #> # A tibble: 26,115 × 15 | ||||
| #>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust | ||||
| #>   origin  year month   day  hour  temp  dewp humid wind_dir wind_sp…¹ wind_…² | ||||
| #>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>     <dbl>   <dbl> | ||||
| #> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270     10.4       NA | ||||
| #> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250      8.06      NA | ||||
| @@ -97,8 +89,9 @@ Primary and foreign keys</h2> | ||||
| #> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250     12.7       NA | ||||
| #> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260     12.7       NA | ||||
| #> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240     11.5       NA | ||||
| #> # … with 26,109 more rows, and 4 more variables: precip <dbl>, pressure <dbl>, | ||||
| #> #   visib <dbl>, time_hour <dttm></pre> | ||||
| #> # … with 26,109 more rows, 4 more variables: precip <dbl>, pressure <dbl>, | ||||
| #> #   visib <dbl>, time_hour <dttm>, and abbreviated variable names | ||||
| #> #   ¹wind_speed, ²wind_gust</pre> | ||||
| </div> | ||||
| </li> | ||||
| </ul><p>A <strong>foreign key</strong> is a variable (or set of variables) that corresponds to a primary key in another table. For example:</p> | ||||
| @@ -147,8 +140,8 @@ weather |> | ||||
|   filter(is.na(tailnum)) | ||||
| #> # A tibble: 0 × 9 | ||||
| #> # … with 9 variables: tailnum <chr>, year <int>, type <chr>, | ||||
| #> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>, | ||||
| #> #   engine <chr> | ||||
| #> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>, | ||||
| #> #   speed <int>, engine <chr> | ||||
|  | ||||
| weather |>  | ||||
|   filter(is.na(time_hour) | is.na(origin)) | ||||
| @@ -189,7 +182,7 @@ Surrogate keys</h2> | ||||
|   mutate(id = row_number(), .before = 1) | ||||
| flights2 | ||||
| #> # A tibble: 336,776 × 20 | ||||
| #>      id  year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ | ||||
| #>      id  year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ | ||||
| #>   <int> <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> | ||||
| #> 1     1  2013     1     1      517        515       2     830     819      11 | ||||
| #> 2     2  2013     1     1      533        529       4     850     830      20 | ||||
| @@ -199,8 +192,9 @@ flights2 | ||||
| #> 6     6  2013     1     1      554        558      -4     740     728      12 | ||||
| #> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>, | ||||
| #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, | ||||
| #> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names | ||||
| #> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay</pre> | ||||
| #> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable | ||||
| #> #   names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, | ||||
| #> #   ⁵arr_delay</pre> | ||||
| </div> | ||||
| <p>Surrogate keys can be particularly useful when communicating to other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430, which departed at 9am on 2013-01-03.</p> | ||||
| </section> | ||||
| @@ -249,12 +243,12 @@ flights2 | ||||
| #> # A tibble: 336,776 × 7 | ||||
| #>    year time_hour           origin dest  tailnum carrier name                 | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc.  | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc.  | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc. | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines In… | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines In… | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines I… | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways      | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc. | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc.  | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines In… | ||||
| #> # … with 336,770 more rows</pre> | ||||
| </div> | ||||
| <p>Or we could find out the temperature and wind speed when each plane departed:</p> | ||||
| @@ -281,12 +275,12 @@ flights2 | ||||
| #> # A tibble: 336,776 × 9 | ||||
| #>    year time_hour           origin dest  tailnum carrier type   engines seats | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <int> <int> | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed wi…       2   149 | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed wi…       2   149 | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed wi…       2   178 | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed wi…       2   200 | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed wi…       2   178 | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed wi…       2   191 | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed…       2   149 | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed…       2   149 | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed…       2   178 | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed…       2   200 | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed…       2   178 | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed…       2   191 | ||||
| #> # … with 336,770 more rows</pre> | ||||
| </div> | ||||
| <p>When <code><a href="https://dplyr.tidyverse.org/reference/mutate-joins.html">left_join()</a></code> fails to find a match for a row in <code>x</code>, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number <code>N3ALAA</code> so the <code>type</code>, <code>engines</code>, and <code>seats</code> will be missing:</p> | ||||
| @@ -318,7 +312,7 @@ Specifying join keys</h2> | ||||
|   left_join(planes) | ||||
| #> Joining with `by = join_by(year, tailnum)` | ||||
| #> # A tibble: 336,776 × 13 | ||||
| #>    year time_hour           origin dest  tailnum carrier type  manufactu…¹ model | ||||
| #>    year time_hour           origin dest  tailnum carrier type  manufa…¹ model | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>    <chr> | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>     <NA>  | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>     <NA>  | ||||
| @@ -334,17 +328,16 @@ Specifying join keys</h2> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights2 |>  | ||||
|   left_join(planes, join_by(tailnum)) | ||||
| #> # A tibble: 336,776 × 14 | ||||
| #>   year.x time_hour           origin dest  tailnum carrier year.y type    manuf…¹ | ||||
| #>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>   <chr>   | ||||
| #> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed … BOEING  | ||||
| #> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed … BOEING  | ||||
| #> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed … BOEING  | ||||
| #> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed … AIRBUS  | ||||
| #> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed … BOEING  | ||||
| #> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed … BOEING  | ||||
| #> # … with 336,770 more rows, 5 more variables: model <chr>, engines <int>, | ||||
| #> #   seats <int>, speed <int>, engine <chr>, and abbreviated variable name | ||||
| #> #   ¹manufacturer</pre> | ||||
| #>   year.x time_hour           origin dest  tailnum carrier year.y type         | ||||
| #>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>        | ||||
| #> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing … | ||||
| #> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing … | ||||
| #> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing … | ||||
| #> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing … | ||||
| #> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing … | ||||
| #> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing … | ||||
| #> # … with 336,770 more rows, and 6 more variables: manufacturer <chr>, | ||||
| #> #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr></pre> | ||||
| </div> | ||||
| <p>Note that the <code>year</code> variables are disambiguated in the output with a suffix (<code>year.x</code> and <code>year.y</code>), which tells you whether the variable came from the <code>x</code> or <code>y</code> argument. You can override the default suffixes with the <code>suffix</code> argument.</p> | ||||
| <p><code>join_by(tailnum)</code> is short for <code>join_by(tailnum == tailnum)</code>. It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an <strong>equi-join</strong>. You’ll learn about non-equi-joins in <a href="#sec-non-equi-joins" data-type="xref">#sec-non-equi-joins</a>.</p> | ||||
| @@ -353,30 +346,30 @@ Specifying join keys</h2> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights2 |>  | ||||
|   left_join(airports, join_by(dest == faa)) | ||||
| #> # A tibble: 336,776 × 13 | ||||
| #>    year time_hour           origin dest  tailnum carrier name    lat   lon   alt | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl> | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Geor…  30.0 -95.3    97 | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Geor…  30.0 -95.3    97 | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3     8 | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>   NA    NA      NA | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4  1026 | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chic…  42.0 -87.9   668 | ||||
| #> # … with 336,770 more rows, and 3 more variables: tz <dbl>, dst <chr>, | ||||
| #> #   tzone <chr> | ||||
| #>    year time_hour           origin dest  tailnum carrier name       lat   lon | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl> | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George …  30.0 -95.3 | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George …  30.0 -95.3 | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami I…  25.8 -80.3 | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>      NA    NA   | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfi…  33.6 -84.4 | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago…  42.0 -87.9 | ||||
| #> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>, | ||||
| #> #   dst <chr>, tzone <chr> | ||||
|  | ||||
| flights2 |>  | ||||
|   left_join(airports, join_by(origin == faa)) | ||||
| #> # A tibble: 336,776 × 13 | ||||
| #>    year time_hour           origin dest  tailnum carrier name    lat   lon   alt | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl> | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18 | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22 | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13 | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13 | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22 | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18 | ||||
| #> # … with 336,770 more rows, and 3 more variables: tz <dbl>, dst <chr>, | ||||
| #> #   tzone <chr></pre> | ||||
| #>    year time_hour           origin dest  tailnum carrier name       lat   lon | ||||
| #>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>    <dbl> <dbl> | ||||
| #> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newark …  40.7 -74.2 | ||||
| #> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La Guar…  40.8 -73.9 | ||||
| #> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John F …  40.6 -73.8 | ||||
| #> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John F …  40.6 -73.8 | ||||
| #> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La Guar…  40.8 -73.9 | ||||
| #> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newark …  40.7 -74.2 | ||||
| #> # … with 336,770 more rows, and 4 more variables: alt <dbl>, tz <dbl>, | ||||
| #> #   dst <chr>, tzone <chr></pre> | ||||
| </div> | ||||
| <p>In older code you might see a different way of specifying the join keys, using a character vector:</p> | ||||
| <ul><li> | ||||
| @@ -407,12 +400,12 @@ Filtering joins</h2> | ||||
| #> # A tibble: 101 × 8 | ||||
| #>   faa   name                               lat    lon   alt    tz dst   tzone | ||||
| #>   <chr> <chr>                            <dbl>  <dbl> <dbl> <dbl> <chr> <chr> | ||||
| #> 1 ABQ   Albuquerque International Sunport  35.0 -107.   5355    -7 A     Americ… | ||||
| #> 2 ACK   Nantucket Mem                      41.3  -70.1    48    -5 A     Americ… | ||||
| #> 3 ALB   Albany Intl                        42.7  -73.8   285    -5 A     Americ… | ||||
| #> 4 ANC   Ted Stevens Anchorage Intl         61.2 -150.    152    -9 A     Americ… | ||||
| #> 5 ATL   Hartsfield Jackson Atlanta Intl    33.6  -84.4  1026    -5 A     Americ… | ||||
| #> 6 AUS   Austin Bergstrom Intl              30.2  -97.7   542    -6 A     Americ… | ||||
| #> 1 ABQ   Albuquerque International Sunpo…  35.0 -107.   5355    -7 A     Amer… | ||||
| #> 2 ACK   Nantucket Mem                     41.3  -70.1    48    -5 A     Amer… | ||||
| #> 3 ALB   Albany Intl                       42.7  -73.8   285    -5 A     Amer… | ||||
| #> 4 ANC   Ted Stevens Anchorage Intl        61.2 -150.    152    -9 A     Amer… | ||||
| #> 5 ATL   Hartsfield Jackson Atlanta Intl   33.6  -84.4  1026    -5 A     Amer… | ||||
| #> 6 AUS   Austin Bergstrom Intl             30.2  -97.7   542    -6 A     Amer… | ||||
| #> # … with 95 more rows</pre> | ||||
| </div> | ||||
| <p><strong>Anti-joins</strong> are the opposite: they return all rows in <code>x</code> that don’t have a match in <code>y</code>. They’re useful for finding missing values that are <strong>implicit</strong> in the data, the topic of <a href="#sec-missing-implicit" data-type="xref">#sec-missing-implicit</a>. Implicitly missing values don’t show up as <code>NA</code>s but instead only exist as an absence. For example, we can find rows that are missing from <code>airports</code> by looking for flights that don’t have a matching destination airport:</p> | ||||
| @@ -666,12 +659,12 @@ plane_flights | ||||
| #> # A tibble: 284,170 × 9 | ||||
| #>   tailnum type   engines seats  year time_hour           origin dest  carrier | ||||
| #>   <chr>   <chr>    <int> <int> <int> <dttm>              <chr>  <chr> <chr>   | ||||
| #> 1 N10156  Fixed wi…       2    55  2013 2013-01-10 06:00:00 EWR    PIT   EV      | ||||
| #> 2 N10156  Fixed wi…       2    55  2013 2013-01-10 10:00:00 EWR    CHS   EV      | ||||
| #> 3 N10156  Fixed wi…       2    55  2013 2013-01-10 15:00:00 EWR    MSP   EV      | ||||
| #> 4 N10156  Fixed wi…       2    55  2013 2013-01-11 06:00:00 EWR    CMH   EV      | ||||
| #> 5 N10156  Fixed wi…       2    55  2013 2013-01-11 11:00:00 EWR    MCI   EV      | ||||
| #> 6 N10156  Fixed wi…       2    55  2013 2013-01-11 18:00:00 EWR    PWM   EV      | ||||
| #> 1 N10156  Fixed…       2    55  2013 2013-01-10 06:00:00 EWR    PIT   EV      | ||||
| #> 2 N10156  Fixed…       2    55  2013 2013-01-10 10:00:00 EWR    CHS   EV      | ||||
| #> 3 N10156  Fixed…       2    55  2013 2013-01-10 15:00:00 EWR    MSP   EV      | ||||
| #> 4 N10156  Fixed…       2    55  2013 2013-01-11 06:00:00 EWR    CMH   EV      | ||||
| #> 5 N10156  Fixed…       2    55  2013 2013-01-11 11:00:00 EWR    MCI   EV      | ||||
| #> 6 N10156  Fixed…       2    55  2013 2013-01-11 18:00:00 EWR    PWM   EV      | ||||
| #> # … with 284,164 more rows</pre> | ||||
| </div> | ||||
| </section> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-logicals"> | ||||
| <h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-logicals" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Logical vectors</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -55,7 +47,7 @@ Comparisons</h1> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20) | ||||
| #> # A tibble: 172,286 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      601      600       1     844     850      -6 B6      | ||||
| #> 2  2013     1     1      602      610      -8     812     820      -8 DL      | ||||
| @@ -185,7 +177,7 @@ is.na(c("a", NA, "b")) | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   filter(is.na(dep_time)) | ||||
| #> # A tibble: 8,255 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1       NA     1630      NA      NA    1815      NA EV      | ||||
| #> 2  2013     1     1       NA     1935      NA      NA    2240      NA AA      | ||||
| @@ -204,7 +196,7 @@ is.na(c("a", NA, "b")) | ||||
|   filter(month == 1, day == 1) |>  | ||||
|   arrange(dep_time) | ||||
| #> # A tibble: 842 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -221,7 +213,7 @@ flights |> | ||||
|   filter(month == 1, day == 1) |>  | ||||
|   arrange(desc(is.na(dep_time)), dep_time) | ||||
| #> # A tibble: 842 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1       NA     1630      NA      NA    1815      NA EV      | ||||
| #> 2  2013     1     1       NA     1935      NA      NA    2240      NA AA      | ||||
| @@ -294,7 +286,7 @@ Order of operations</h2> | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|    filter(month == 11 | 12) | ||||
| #> # A tibble: 336,776 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      533      529       4     850     830      20 UA      | ||||
| @@ -356,7 +348,7 @@ c(1, 2, NA) %in% NA | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   filter(dep_time %in% c(NA, 0800)) | ||||
| #> # A tibble: 8,803 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      800      800       0    1022    1014       8 DL      | ||||
| #> 2  2013     1     1      800      810     -10     949     955      -6 MQ      | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-missing-values"> | ||||
| <h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-missing-values" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Missing values</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-numbers"> | ||||
| <h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-numbers" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Numbers</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -218,7 +210,7 @@ x * c(1, 2, 3) | ||||
| <pre data-type="programlisting" data-code-language="downlit">flights |>  | ||||
|   filter(month == c(1, 2)) | ||||
| #> # A tibble: 25,977 × 19 | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1      542      540       2     923     850      33 AA      | ||||
| @@ -759,8 +751,8 @@ Positions</h2> | ||||
|     fifth_dep = nth(dep_time, 5), | ||||
|     last_dep = last(dep_time) | ||||
|   ) | ||||
| #> `summarise()` has grouped output by 'year', 'month'. You can override using the | ||||
| #> `.groups` argument. | ||||
| #> `summarise()` has grouped output by 'year', 'month'. You can override using | ||||
| #> the `.groups` argument. | ||||
| #> # A tibble: 365 × 6 | ||||
| #> # Groups:   year, month [12] | ||||
| #>    year month   day first_dep fifth_dep last_dep | ||||
| @@ -783,7 +775,7 @@ Positions</h2> | ||||
|   filter(r %in% c(1, max(r))) | ||||
| #> # A tibble: 1,195 × 20 | ||||
| #> # Groups:   year, month, day [365] | ||||
| #>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier | ||||
| #>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>   | ||||
| #> 1  2013     1     1      517      515       2     830     819      11 UA      | ||||
| #> 2  2013     1     1     2353     2359      -6     425     445     -20 B6      | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-quarto-formats"> | ||||
| <h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-quarto-formats" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto formats</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-quarto-workflow"> | ||||
| <h1><span id="sec-quarto-workflow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto workflow</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| <p>Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the <em>console</em>, then capture what works in the <em>script editor</em>. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.</p><p>Quarto is also important because it so tightly integrates prose and code. This makes it a great <strong>analysis notebook</strong> because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:</p><ul><li><p>Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!</p></li> | ||||
| <h1><span id="sec-quarto-workflow" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto workflow</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the <em>console</em>, then capture what works in the <em>script editor</em>. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.</p><p>Quarto is also important because it so tightly integrates prose and code. This makes it a great <strong>analysis notebook</strong> because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:</p><ul><li><p>Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!</p></li> | ||||
| <li><p>Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.</p></li> | ||||
| <li><p>Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.</p></li> | ||||
| </ul><p>Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (<a href="https://colinpurrington.com/tips/lab-notebooks" class="uri">https://colinpurrington.com/tips/lab-notebooks</a>) to come up with the following tips:</p><ul><li><p>Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.</p></li> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-quarto"> | ||||
| <h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-quarto" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Quarto</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
|   | ||||
| @@ -1,29 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-rectangling"> | ||||
| <h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data rectangling</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div><h1> | ||||
| Base R | ||||
| </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5)) | ||||
| #>   x.1.3 x.3.5 | ||||
| #> 1     1     3 | ||||
| #> 2     2     4 | ||||
| #> 3     3     5</pre> | ||||
| </div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn’t print particularly well:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">data.frame( | ||||
|   x = I(list(1:2, 3:5)),  | ||||
|   y = c("1, 2", "3, 4, 5") | ||||
| ) | ||||
| #>         x       y | ||||
| #> 1    1, 2    1, 2 | ||||
| #> 2 3, 4, 5 3, 4, 5</pre> | ||||
| </div><p>It’s easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors, and the print method has been designed with lists in mind.</p></div> | ||||
|  | ||||
| <h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data rectangling</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -198,9 +174,7 @@ df | ||||
| <p>Similarly, if you <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> a data frame in RStudio, you’ll get the standard tabular view, which doesn’t allow you to selectively expand list columns. To explore those fields you’ll need to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> and view, e.g. <code>df |> pull(z) |> View()</code>.</p> | ||||
| <div data-type="note"><h1> | ||||
| Base R | ||||
| </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell"> | ||||
| </h1><p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5)) | ||||
| #>   x.1.3 x.3.5 | ||||
| #> 1     1     3 | ||||
| @@ -486,15 +460,15 @@ repos | ||||
|   unnest_longer(json) |>  | ||||
|   unnest_wider(json)  | ||||
| #> # A tibble: 176 × 68 | ||||
| #>        id name  full_…¹ owner        private html_…² descr…³ fork  url   forks…⁴ | ||||
| #>     <int> <chr> <chr>   <list>       <lgl>   <chr>   <chr>   <lgl> <chr> <chr>   | ||||
| #> 1  6.12e7 after gaborc… <named list> FALSE   https:… Run Co… FALSE http… https:… | ||||
| #> 2  4.05e7 argu… gaborc… <named list> FALSE   https:… Declar… FALSE http… https:… | ||||
| #> 3  3.64e7 ask   gaborc… <named list> FALSE   https:… Friend… FALSE http… https:… | ||||
| #> 4  3.49e7 base… gaborc… <named list> FALSE   https:… Do we … FALSE http… https:… | ||||
| #> 5  6.16e7 cite… gaborc… <named list> FALSE   https:… Test R… TRUE  http… https:… | ||||
| #> 6  3.39e7 clis… gaborc… <named list> FALSE   https:… Unicod… FALSE http… https:… | ||||
| #> # … with 170 more rows, 58 more variables: keys_url <chr>, | ||||
| #>         id name      full_…¹ owner        private html_…² descr…³ fork  url   | ||||
| #>      <int> <chr>     <chr>   <list>       <lgl>   <chr>   <chr>   <lgl> <chr> | ||||
| #> 1 61160198 after     gaborc… <named list> FALSE   https:… Run Co… FALSE http… | ||||
| #> 2 40500181 argufy    gaborc… <named list> FALSE   https:… Declar… FALSE http… | ||||
| #> 3 36442442 ask       gaborc… <named list> FALSE   https:… Friend… FALSE http… | ||||
| #> 4 34924886 baseimpo… gaborc… <named list> FALSE   https:… Do we … FALSE http… | ||||
| #> 5 61620661 citest    gaborc… <named list> FALSE   https:… Test R… TRUE  http… | ||||
| #> 6 33907457 clisymbo… gaborc… <named list> FALSE   https:… Unicod… FALSE http… | ||||
| #> # … with 170 more rows, 59 more variables: forks_url <chr>, keys_url <chr>, | ||||
| #> #   collaborators_url <chr>, teams_url <chr>, hooks_url <chr>, | ||||
| #> #   issue_events_url <chr>, events_url <chr>, assignees_url <chr>, | ||||
| #> #   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>, | ||||
| @@ -541,12 +515,12 @@ repos | ||||
| #> # A tibble: 176 × 4 | ||||
| #>         id full_name               owner             description              | ||||
| #>      <int> <chr>                   <list>            <chr>                    | ||||
| #> 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Background  | ||||
| #> 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function argum… | ||||
| #> 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interaction i… | ||||
| #> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for und… | ||||
| #> 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo fo… | ||||
| #> 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI ap… | ||||
| #> 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Backgro… | ||||
| #> 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function ar… | ||||
| #> 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interactio… | ||||
| #> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for … | ||||
| #> 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo… | ||||
| #> 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI… | ||||
| #> # … with 170 more rows</pre> | ||||
| </div> | ||||
| <p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p> | ||||
| @@ -572,21 +546,21 @@ repos | ||||
|   select(id, full_name, owner, description) |>  | ||||
|   unnest_wider(owner, names_sep = "_") | ||||
| #> # A tibble: 176 × 20 | ||||
| #>       id full_…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷ owner…⁸ owner…⁹ | ||||
| #>    <int> <chr>   <chr>     <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   | ||||
| #> 1 6.12e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:… | ||||
| #> 2 4.05e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:… | ||||
| #> 3 3.64e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:… | ||||
| #> 4 3.49e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:… | ||||
| #> 5 6.16e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:… | ||||
| #> 6 3.39e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:… | ||||
| #> # … with 170 more rows, 10 more variables: owner_gists_url <chr>, | ||||
| #> #   owner_starred_url <chr>, owner_subscriptions_url <chr>, | ||||
| #> #   owner_organizations_url <chr>, owner_repos_url <chr>, | ||||
| #> #   owner_events_url <chr>, owner_received_events_url <chr>, owner_type <chr>, | ||||
| #> #   owner_site_admin <lgl>, description <chr>, and abbreviated variable names | ||||
| #> #   ¹full_name, ²owner_login, ³owner_id, ⁴owner_avatar_url, ⁵owner_gravatar_id, | ||||
| #> #   ⁶owner_url, ⁷owner_html_url, ⁸owner_followers_url, ⁹owner_following_url</pre> | ||||
| #>         id full_name  owner…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷ | ||||
| #>      <int> <chr>      <chr>     <int> <chr>   <chr>   <chr>   <chr>   <chr>   | ||||
| #> 1 61160198 gaborcsar… gaborc…  660288 https:… ""      https:… https:… https:… | ||||
| #> 2 40500181 gaborcsar… gaborc…  660288 https:… ""      https:… https:… https:… | ||||
| #> 3 36442442 gaborcsar… gaborc…  660288 https:… ""      https:… https:… https:… | ||||
| #> 4 34924886 gaborcsar… gaborc…  660288 https:… ""      https:… https:… https:… | ||||
| #> 5 61620661 gaborcsar… gaborc…  660288 https:… ""      https:… https:… https:… | ||||
| #> 6 33907457 gaborcsar… gaborc…  660288 https:… ""      https:… https:… https:… | ||||
| #> # … with 170 more rows, 11 more variables: owner_following_url <chr>, | ||||
| #> #   owner_gists_url <chr>, owner_starred_url <chr>, | ||||
| #> #   owner_subscriptions_url <chr>, owner_organizations_url <chr>, | ||||
| #> #   owner_repos_url <chr>, owner_events_url <chr>, | ||||
| #> #   owner_received_events_url <chr>, owner_type <chr>, | ||||
| #> #   owner_site_admin <lgl>, description <chr>, and abbreviated variable | ||||
| #> #   names ¹owner_login, ²owner_id, ³owner_avatar_url, ⁴owner_gravatar_id, …</pre> | ||||
| </div> | ||||
| <p>This gives another wide dataset, but you can see that <code>owner</code> appears to contain a lot of additional data about the person who “owns” the repository.</p> | ||||
| </section> | ||||
| @@ -616,12 +590,12 @@ chars | ||||
| #> # A tibble: 30 × 18 | ||||
| #>   url         id name  gender culture born  died  alive titles aliases father | ||||
| #>   <chr>    <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr>  | ||||
| #> 1 https://ww…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 2 https://ww…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 3 https://ww…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 4 https://ww…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""     | ||||
| #> 5 https://ww…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 6 https://ww…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""     | ||||
| #> 1 https:/…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 2 https:/…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 3 https:/…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 4 https:/…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""     | ||||
| #> 5 https:/…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""     | ||||
| #> 6 https:/…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""     | ||||
| #> # … with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>, | ||||
| #> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>, | ||||
| #> #   playedBy <list></pre> | ||||
| @@ -635,11 +609,11 @@ characters | ||||
| #> # A tibble: 30 × 7 | ||||
| #>      id name              gender culture    born                  died  alive | ||||
| #>   <int> <chr>             <chr>  <chr>      <chr>                 <chr> <lgl> | ||||
| #> 1  1022 Theon Greyjoy     Male   "Ironborn" "In 278 AC or 279 AC, a… ""    TRUE  | ||||
| #> 2  1052 Tyrion Lannister  Male   ""         "In 273 AC, at Casterly… ""    TRUE  | ||||
| #> 3  1074 Victarion Greyjoy Male   "Ironborn" "In 268 AC or before, a… ""    TRUE  | ||||
| #> 1  1022 Theon Greyjoy     Male   "Ironborn" "In 278 AC or 279 AC… ""    TRUE  | ||||
| #> 2  1052 Tyrion Lannister  Male   ""         "In 273 AC, at Caste… ""    TRUE  | ||||
| #> 3  1074 Victarion Greyjoy Male   "Ironborn" "In 268 AC or before… ""    TRUE  | ||||
| #> 4  1109 Will              Male   ""         ""                    "In … FALSE | ||||
| #> 5  1166 Areo Hotah        Male   "Norvoshi" "In 257 AC or before, a… ""    TRUE  | ||||
| #> 5  1166 Areo Hotah        Male   "Norvoshi" "In 257 AC or before… ""    TRUE  | ||||
| #> 6  1267 Chett             Male   ""         "At Hag's Mire"       "In … FALSE | ||||
| #> # … with 24 more rows</pre> | ||||
| </div> | ||||
| @@ -649,15 +623,15 @@ characters | ||||
|   unnest_wider(json) |>  | ||||
|   select(id, where(is.list)) | ||||
| #> # A tibble: 30 × 8 | ||||
| #>      id titles    aliases    allegiances books     povBooks  tvSeries  playedBy  | ||||
| #>      id titles    aliases    allegiances books     povBooks  tvSeries playe…¹ | ||||
| #>   <int> <list>    <list>     <list>      <list>    <list>    <list>   <list>  | ||||
| #> 1  1022 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]> | ||||
| #> 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]> | ||||
| #> 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]> | ||||
| #> 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr [1]> | ||||
| #> 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]> | ||||
| #> 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]> | ||||
| #> # … with 24 more rows</pre> | ||||
| #> 1  1022 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr>    <chr>   | ||||
| #> 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr>    <chr>   | ||||
| #> 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr>    <chr>   | ||||
| #> 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr>    <chr>   | ||||
| #> 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr>    <chr>   | ||||
| #> 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr>    <chr>   | ||||
| #> # … with 24 more rows, and abbreviated variable name ¹playedBy</pre> | ||||
| </div> | ||||
| <p>Let’s explore the <code>titles</code> column. It’s an unnamed list-column, so we’ll unnest it into rows:</p> | ||||
| <div class="cell"> | ||||
| @@ -717,7 +691,7 @@ characters |> | ||||
| #>   <int> <chr>             <chr>                                               | ||||
| #> 1  1022 Theon Greyjoy     Prince of Winterfell                                | ||||
| #> 2  1022 Theon Greyjoy     Captain of Sea Bitch                                | ||||
| #> 3  1022 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands) | ||||
| #> 3  1022 Theon Greyjoy     Lord of the Iron Islands (by law of the green land… | ||||
| #> 4  1052 Tyrion Lannister  Acting Hand of the King (former)                    | ||||
| #> 5  1052 Tyrion Lannister  Master of Coin (former)                             | ||||
| #> 6  1074 Victarion Greyjoy Lord Captain of the Iron Fleet                      | ||||
| @@ -855,15 +829,15 @@ Deeply nested</h2> | ||||
|   unnest_wider(results) | ||||
| locations | ||||
| #> # A tibble: 7 × 6 | ||||
| #>   city       address_components formatted_address   geometry     place_id types  | ||||
| #>   city       address_components formatted_address geometry     place…¹ types  | ||||
| #>   <chr>      <list>             <chr>             <list>       <chr>   <list> | ||||
| #> 1 Houston    <list [4]>         Houston, TX, USA    <named list> ChIJAYW… <list> | ||||
| #> 2 Washington <list [2]>         Washington, USA     <named list> ChIJ-bD… <list> | ||||
| #> 3 Washington <list [4]>         Washington, DC, USA <named list> ChIJW-T… <list> | ||||
| #> 4 New York   <list [3]>         New York, NY, USA   <named list> ChIJOwg… <list> | ||||
| #> 5 Chicago    <list [4]>         Chicago, IL, USA    <named list> ChIJ7cv… <list> | ||||
| #> 6 Arlington  <list [4]>         Arlington, TX, USA  <named list> ChIJ05g… <list> | ||||
| #> # … with 1 more row</pre> | ||||
| #> 1 Houston    <list [4]>         Houston, TX, USA  <named list> ChIJAY… <list> | ||||
| #> 2 Washington <list [2]>         Washington, USA   <named list> ChIJ-b… <list> | ||||
| #> 3 Washington <list [4]>         Washington, DC, … <named list> ChIJW-… <list> | ||||
| #> 4 New York   <list [3]>         New York, NY, USA <named list> ChIJOw… <list> | ||||
| #> 5 Chicago    <list [4]>         Chicago, IL, USA  <named list> ChIJ7c… <list> | ||||
| #> 6 Arlington  <list [4]>         Arlington, TX, U… <named list> ChIJ05… <list> | ||||
| #> # … with 1 more row, and abbreviated variable name ¹place_id</pre> | ||||
| </div> | ||||
| <p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p> | ||||
| <p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p> | ||||
| @@ -872,14 +846,14 @@ locations | ||||
|   select(city, formatted_address, geometry) |>  | ||||
|   unnest_wider(geometry) | ||||
| #> # A tibble: 7 × 6 | ||||
| #>   city       formatted_address   bounds       location     locati…¹ viewport     | ||||
| #>   city       formatted_address bounds       location     locat…¹ viewport     | ||||
| #>   <chr>      <chr>             <list>       <list>       <chr>   <list>       | ||||
| #> 1 Houston    Houston, TX, USA    <named list> <named list> APPROXI… <named list> | ||||
| #> 2 Washington Washington, USA     <named list> <named list> APPROXI… <named list> | ||||
| #> 3 Washington Washington, DC, USA <named list> <named list> APPROXI… <named list> | ||||
| #> 4 New York   New York, NY, USA   <named list> <named list> APPROXI… <named list> | ||||
| #> 5 Chicago    Chicago, IL, USA    <named list> <named list> APPROXI… <named list> | ||||
| #> 6 Arlington  Arlington, TX, USA  <named list> <named list> APPROXI… <named list> | ||||
| #> 1 Houston    Houston, TX, USA  <named list> <named list> APPROX… <named list> | ||||
| #> 2 Washington Washington, USA   <named list> <named list> APPROX… <named list> | ||||
| #> 3 Washington Washington, DC, … <named list> <named list> APPROX… <named list> | ||||
| #> 4 New York   New York, NY, USA <named list> <named list> APPROX… <named list> | ||||
| #> 5 Chicago    Chicago, IL, USA  <named list> <named list> APPROX… <named list> | ||||
| #> 6 Arlington  Arlington, TX, U… <named list> <named list> APPROX… <named list> | ||||
| #> # … with 1 more row, and abbreviated variable name ¹location_type</pre> | ||||
| </div> | ||||
| <p>That gives us new <code>bounds</code> (a rectangular region) and <code>location</code> (a point). We can unnest <code>location</code> to see the latitude (<code>lat</code>) and longitude (<code>lng</code>):</p> | ||||
| @@ -889,14 +863,14 @@ locations | ||||
|   unnest_wider(geometry) |>  | ||||
|   unnest_wider(location) | ||||
| #> # A tibble: 7 × 7 | ||||
| #>   city       formatted_address   bounds         lat    lng locati…¹ viewport     | ||||
| #>   city       formatted_address bounds         lat    lng locat…¹ viewport     | ||||
| #>   <chr>      <chr>             <list>       <dbl>  <dbl> <chr>   <list>       | ||||
| #> 1 Houston    Houston, TX, USA    <named list>  29.8  -95.4 APPROXI… <named list> | ||||
| #> 2 Washington Washington, USA     <named list>  47.8 -121.  APPROXI… <named list> | ||||
| #> 3 Washington Washington, DC, USA <named list>  38.9  -77.0 APPROXI… <named list> | ||||
| #> 4 New York   New York, NY, USA   <named list>  40.7  -74.0 APPROXI… <named list> | ||||
| #> 5 Chicago    Chicago, IL, USA    <named list>  41.9  -87.6 APPROXI… <named list> | ||||
| #> 6 Arlington  Arlington, TX, USA  <named list>  32.7  -97.1 APPROXI… <named list> | ||||
| #> 1 Houston    Houston, TX, USA  <named list>  29.8  -95.4 APPROX… <named list> | ||||
| #> 2 Washington Washington, USA   <named list>  47.8 -121.  APPROX… <named list> | ||||
| #> 3 Washington Washington, DC, … <named list>  38.9  -77.0 APPROX… <named list> | ||||
| #> 4 New York   New York, NY, USA <named list>  40.7  -74.0 APPROX… <named list> | ||||
| #> 5 Chicago    Chicago, IL, USA  <named list>  41.9  -87.6 APPROX… <named list> | ||||
| #> 6 Arlington  Arlington, TX, U… <named list>  32.7  -97.1 APPROX… <named list> | ||||
| #> # … with 1 more row, and abbreviated variable name ¹location_type</pre> | ||||
| </div> | ||||
| <p>Extracting the bounds requires a few more steps:</p> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-regexps"> | ||||
| <h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-regular-expressions" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Regular expressions</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -1006,8 +998,9 @@ Base R</h2> | ||||
| <p><code>apropos(pattern)</code> searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">apropos("replace") | ||||
| #> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod" | ||||
| #> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"</pre> | ||||
| #> [1] "%+replace%"       "replace"          "replace_na"       | ||||
| #> [4] "setReplaceMethod" "str_replace"      "str_replace_all"  | ||||
| #> [7] "str_replace_na"   "theme_replace"</pre> | ||||
| </div> | ||||
| <p><code>list.files(path, pattern)</code> lists all files in <code>path</code> that match a regular expression <code>pattern</code>. For example, you can find all the R Markdown files in the current directory with:</p> | ||||
| <div class="cell"> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-spreadsheets"> | ||||
| <h1><span id="sec-import-spreadsheets" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Spreadsheets</span></span></h1><div data-type="important"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-import-spreadsheets" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Spreadsheets</span></span></h1><p>::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
| @@ -197,16 +189,16 @@ Reading individual sheets</h2> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">read_excel("data/penguins.xlsx", sheet = "Torgersen Island") | ||||
| #> # A tibble: 52 × 8 | ||||
| #>   species island    bill_length_mm     bill_depth_mm flipp…¹ body_…² sex    year | ||||
| #>   species island    bill_length_mm     bill_dep…¹ flipp…² body_…³ sex    year | ||||
| #>   <chr>   <chr>     <chr>              <chr>      <chr>   <chr>   <chr> <dbl> | ||||
| #> 1 Adelie  Torgersen 39.1               18.7       181     3750    male   2007 | ||||
| #> 2 Adelie  Torgersen 39.5               17.399999999… 186     3800    fema…  2007 | ||||
| #> 2 Adelie  Torgersen 39.5               17.399999… 186     3800    fema…  2007 | ||||
| #> 3 Adelie  Torgersen 40.299999999999997 18         195     3250    fema…  2007 | ||||
| #> 4 Adelie  Torgersen NA                 NA         NA      NA      NA     2007 | ||||
| #> 5 Adelie  Torgersen 36.700000000000003 19.3       193     3450    fema…  2007 | ||||
| #> 6 Adelie  Torgersen 39.299999999999997 20.6       190     3650    male   2007 | ||||
| #> # … with 46 more rows, and abbreviated variable names ¹flipper_length_mm, | ||||
| #> #   ²body_mass_g</pre> | ||||
| #> # … with 46 more rows, and abbreviated variable names ¹bill_depth_mm, | ||||
| #> #   ²flipper_length_mm, ³body_mass_g</pre> | ||||
| </div> | ||||
| <p>Some variables that appear to contain numerical data are read in as characters due to the character string <code>"NA"</code> not being recognized as a true <code>NA</code>.</p> | ||||
| <div class="cell"> | ||||
| @@ -214,7 +206,7 @@ Reading individual sheets</h2> | ||||
|  | ||||
| penguins_torgersen | ||||
| #> # A tibble: 52 × 8 | ||||
| #>   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year | ||||
| #>   species island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex    year | ||||
| #>   <chr>   <chr>              <dbl>         <dbl>    <dbl>   <dbl> <chr> <dbl> | ||||
| #> 1 Adelie  Torgersen           39.1          18.7      181    3750 male   2007 | ||||
| #> 2 Adelie  Torgersen           39.5          17.4      186    3800 fema…  2007 | ||||
| @@ -249,7 +241,7 @@ dim(penguins_dream) | ||||
| <pre data-type="programlisting" data-code-language="downlit">penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream) | ||||
| penguins | ||||
| #> # A tibble: 344 × 8 | ||||
| #>   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year | ||||
| #>   species island    bill_length_mm bill_depth_mm flippe…¹ body_…² sex    year | ||||
| #>   <chr>   <chr>              <dbl>         <dbl>    <dbl>   <dbl> <chr> <dbl> | ||||
| #> 1 Adelie  Torgersen           39.1          18.7      181    3750 male   2007 | ||||
| #> 2 Adelie  Torgersen           39.5          17.4      186    3800 fema…  2007 | ||||
| @@ -289,10 +281,10 @@ deaths | ||||
| #> # A tibble: 18 × 6 | ||||
| #>   `Lots of people`             ...2       ...3  ...4     ...5          ...6   | ||||
| #>   <chr>                        <chr>      <chr> <chr>    <chr>         <chr>  | ||||
| #> 1 simply cannot resist writing <NA>       <NA>  <NA>     <NA>          some not… | ||||
| #> 2 at                           the        top   <NA>     of            their sp… | ||||
| #> 1 simply cannot resist writing <NA>       <NA>  <NA>     <NA>          some … | ||||
| #> 2 at                           the        top   <NA>     of            their… | ||||
| #> 3 or                           merging    <NA>  <NA>     <NA>          cells  | ||||
| #> 4 Name                         Profession Age   Has kids Date of birth Date of … | ||||
| #> 4 Name                         Profession Age   Has kids Date of birth Date … | ||||
| #> 5 David Bowie                  musician   69    TRUE     17175         42379  | ||||
| #> 6 Carrie Fisher                actor      60    TRUE     20749         42731  | ||||
| #> # … with 12 more rows</pre> | ||||
| @@ -302,7 +294,7 @@ deaths | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4) | ||||
| #> # A tibble: 14 × 6 | ||||
| #>   Name          Profession Age   `Has kids` `Date of birth`     `Date of death` | ||||
| #>   Name          Profession Age   `Has kids` `Date of birth`     Date of dea…¹ | ||||
| #>   <chr>         <chr>      <chr> <chr>      <dttm>              <chr>         | ||||
| #> 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00 42379         | ||||
| #> 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00 42731         | ||||
| @@ -310,21 +302,22 @@ deaths | ||||
| #> 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00 42791         | ||||
| #> 5 Prince        musician   57    TRUE       1958-06-07 00:00:00 42481         | ||||
| #> 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00 42383         | ||||
| #> # … with 8 more rows</pre> | ||||
| #> # … with 8 more rows, and abbreviated variable name ¹`Date of death`</pre> | ||||
| </div> | ||||
| <p>We could also set <code>n_max</code> to omit the extraneous rows at the bottom.</p> | ||||
| <div class="cell"> | ||||
| <pre data-type="programlisting" data-code-language="downlit">read_excel(deaths_path, skip = 4, n_max = 10) | ||||
| #> # A tibble: 10 × 6 | ||||
| #>   Name          Profession   Age Has k…¹ `Date of birth`     `Date of death`     | ||||
| #>   Name          Profe…¹   Age Has k…² `Date of birth`     `Date of death`     | ||||
| #>   <chr>         <chr>   <dbl> <lgl>   <dttm>              <dttm>              | ||||
| #> 1 David Bowie   musician      69 TRUE    1947-01-08 00:00:00 2016-01-10 00:00:00 | ||||
| #> 1 David Bowie   musici…    69 TRUE    1947-01-08 00:00:00 2016-01-10 00:00:00 | ||||
| #> 2 Carrie Fisher actor      60 TRUE    1956-10-21 00:00:00 2016-12-27 00:00:00 | ||||
| #> 3 Chuck Berry   musician      90 TRUE    1926-10-18 00:00:00 2017-03-18 00:00:00 | ||||
| #> 3 Chuck Berry   musici…    90 TRUE    1926-10-18 00:00:00 2017-03-18 00:00:00 | ||||
| #> 4 Bill Paxton   actor      61 TRUE    1955-05-17 00:00:00 2017-02-25 00:00:00 | ||||
| #> 5 Prince        musician      57 TRUE    1958-06-07 00:00:00 2016-04-21 00:00:00 | ||||
| #> 5 Prince        musici…    57 TRUE    1958-06-07 00:00:00 2016-04-21 00:00:00 | ||||
| #> 6 Alan Rickman  actor      69 FALSE   1946-02-21 00:00:00 2016-01-14 00:00:00 | ||||
| #> # … with 4 more rows, and abbreviated variable name ¹`Has kids`</pre> | ||||
| #> # … with 4 more rows, and abbreviated variable names ¹Profession, | ||||
| #> #   ²`Has kids`</pre> | ||||
| </div> | ||||
| <p>Another approach is using cell ranges. In Excel, the top left cell is <code>A1</code>. As you move across columns to the right, the cell label moves down the alphabet, i.e. <code>B1</code>, <code>C1</code>, etc. And as you move down a column, the number in the cell label increases, i.e. <code>A2</code>, <code>A3</code>, etc.</p> | ||||
| <p>The data we want to read in starts in cell <code>A5</code> and ends in cell <code>F15</code>. In spreadsheet notation, this is <code>A5:F15</code>.</p> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-strings"> | ||||
| <h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
|  | ||||
| <h1><span id="sec-strings" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Strings</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p> | ||||
| <section id="introduction" data-type="sect1"> | ||||
| <h1> | ||||
| Introduction</h1> | ||||
|   | ||||
| @@ -1,10 +1,2 @@ | ||||
| <section data-type="chapter" id="chp-webscraping"> | ||||
| <h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><div data-type="important"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| </section> | ||||
| <h1><span id="sec-scraping" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Web scraping</span></span></h1><p>::: status callout-important You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p></section> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-workflow-basics"> | ||||
| <h1><span id="sec-workflow-basics" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: basics</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| <p>You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.</p><p>Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.</p> | ||||
| <h1><span id="sec-workflow-basics" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: basics</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.</p><p>Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.</p> | ||||
| <section id="coding-basics" data-type="sect1"> | ||||
| <h1> | ||||
| Coding basics</h1> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-workflow-help"> | ||||
| <h1><span id="sec-workflow-getting-help" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Getting help</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| <p>This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.</p> | ||||
| <h1><span id="sec-workflow-getting-help" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Getting help</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help, and to help you keep learning.</p> | ||||
| <section id="google-is-your-friend" data-type="sect1"> | ||||
| <h1> | ||||
| Google is your friend</h1> | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-workflow-pipes"> | ||||
| <h1><span id="sec-workflow-pipes" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Pipes</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| <p>The pipe, <code>|></code>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going too much farther, we want to give a few more details and discuss <code>%>%</code>, a predecessor to <code>|></code>.</p><p>To add the pipe to your code, we recommend using the build-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use <code>|></code> instead of <code>%>%</code> as shown in <a href="#fig-pipe-options" data-type="xref">#fig-pipe-options</a>; more on <code>%>%</code> shortly.</p><div class="cell"> | ||||
| <h1><span id="sec-workflow-pipes" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: Pipes</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>The pipe, <code>|></code>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going too much farther, we want to give a few more details and discuss <code>%>%</code>, a predecessor to <code>|></code>.</p><p>To add the pipe to your code, we recommend using the build-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use <code>|></code> instead of <code>%>%</code> as shown in <a href="#fig-pipe-options" data-type="xref">#fig-pipe-options</a>; more on <code>%>%</code> shortly.</p><div class="cell"> | ||||
| <div class="cell-output-display"> | ||||
|  | ||||
| <figure id="fig-pipe-options"><p><img src="screenshots/rstudio-pipe-options.png" alt="Screenshot showing the "Use native pipe operator" option which can be found on the "Editing" panel of the "Code" options." width="616"/></p> | ||||
|   | ||||
| @@ -1,15 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-workflow-scripts"> | ||||
| <h1><span id="sec-workflow-scripts-projects" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: scripts and projects</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div><h1> | ||||
| RStudio server | ||||
| </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a refresh slate.</p></div> | ||||
| <p>This chapter will introduce you to two very important tools for organizing your code: scripts and projects.</p> | ||||
| <h1><span id="sec-workflow-scripts-projects" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: scripts and projects</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>This chapter will introduce you to two very important tools for organizing your code: scripts and projects.</p> | ||||
| <section id="scripts" data-type="sect1"> | ||||
| <h1> | ||||
| Scripts</h1> | ||||
| @@ -126,9 +116,7 @@ What is the source of truth?</h2> | ||||
| </ol><p>We collectively use this pattern hundreds of times a week.</p> | ||||
| <div data-type="note"><h1> | ||||
| RStudio server | ||||
| </h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p> | ||||
|  | ||||
| <p>If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a refresh slate.</p></div> | ||||
| </h1><p>If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a refresh slate.</p></div> | ||||
|  | ||||
| </section> | ||||
|  | ||||
|   | ||||
| @@ -1,13 +1,5 @@ | ||||
| <section data-type="chapter" id="chp-workflow-style"> | ||||
| <h1><span id="sec-workflow-style" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: code style</span></span></h1><div data-type="note"><div class="callout-body d-flex"> | ||||
| <div class="callout-icon-container"> | ||||
| <i class="callout-icon"/> | ||||
| </div> | ||||
|  | ||||
| </div> | ||||
|  | ||||
| <p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div> | ||||
| <p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce to the most important points of the <a href="https://style.tidyverse.org">tidyverse style guide</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="https://styler.r-lib.org">styler</a> package by Lorenz Walthert. Once you’ve installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio’s <strong>command palette</strong>. The command palette lets you use any build-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell"> | ||||
| <h1><span id="sec-workflow-style" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: code style</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p><p>Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce to the most important points of the <a href="https://style.tidyverse.org">tidyverse style guide</a>, which is used throughout this book.</p><p>Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the <a href="https://styler.r-lib.org">styler</a> package by Lorenz Walthert. Once you’ve installed it with <code>install.packages("styler")</code>, an easy way to use it is via RStudio’s <strong>command palette</strong>. The command palette lets you use any build-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. <a href="#fig-styler" data-type="xref">#fig-styler</a> shows the results.</p><div class="cell"> | ||||
| <div class="cell-output-display"> | ||||
|  | ||||
| <figure id="fig-styler"><p><img src="screenshots/rstudio-palette.png" alt="A screenshot showing the command palette after typing "styler", showing the four styling tool provided by the package." width="638"/></p> | ||||
|   | ||||