Convert to Quarto book (#1026)

* Use _quarto.yml
* Rmd -> qmd
* Update build type to None
* Update styling Styling to get close to bs4_book + color
* Convert crossrefs
* Covert chunk options
* Switch to plausible for analytics
* Update action
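
Both conversion bullets are mechanical rewrites applied across every chapter. A minimal sketch in R of the kind of transformation involved; these regexes are illustrative assumptions, not the script actually used for the conversion:

```r
# Illustrative only: bookdown -> Quarto rewrites of the sort this commit makes.
convert_crossrefs <- function(lines) {
  # Chapter \@ref(functions)  ->  [Chapter -@sec-functions]
  gsub("Chapter \\\\@ref\\(([^)]+)\\)", "[Chapter -@sec-\\1]", lines)
}

convert_chunk_options <- function(lines) {
  # knitr dot-style chunk options -> Quarto dash-style, e.g. fig.alt -> fig-alt
  gsub("^(#\\| fig)\\.", "\\1-", lines)
}

convert_crossrefs("see Chapter \\@ref(functions) for details")
#> [1] "see [Chapter -@sec-functions] for details"
```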
Mine Cetinkaya-Rundel 2022-05-13 16:46:49 -04:00 committed by GitHub
parent 12474765cf
commit 262f4ba02f
61 changed files with 1558 additions and 831 deletions


@ -1,27 +1,34 @@
on:
push:
branches: [main, master]
branches: main
pull_request:
branches: [main, master]
branches: main
# to be able to trigger a manual build
workflow_dispatch:
schedule:
# run every day at 11 PM
- cron: '0 23 * * *'
name: Build book
name: Render and deploy Book to Netlify
env:
isExtPR: ${{ github.event.pull_request.head.repo.fork == true }}
jobs:
build:
build-deploy:
runs-on: ubuntu-latest
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
steps:
- uses: actions/checkout@v2
- uses: r-lib/actions/setup-pandoc@v2
- name: Install Quarto
uses: quarto-dev/quarto-actions/install-quarto@v1
with:
# To install LaTeX to build PDF book
tinytex: true
# uncomment below and fill to pin a version
# version: 0.9.105
- uses: r-lib/actions/setup-r@v2
with:
@ -29,20 +36,10 @@ jobs:
- uses: r-lib/actions/setup-r-dependencies@v2
- name: Cache bookdown results
uses: actions/cache@v2
with:
path: _bookdown_files
key: bookdown-3-${{ hashFiles('**/*Rmd') }}
restore-keys: bookdown-3-
- name: Build site
- name: Render book to all formats
# Add any command line argument needed
run: |
# Allows [implicit heading links] to work; will need to convert
# to explicit before switching to visual editor
options(bookdown.render.file_scope = FALSE)
bookdown::render_book("index.Rmd")
shell: Rscript {0}
quarto render
- name: Deploy to Netlify
if: contains(env.isExtPR, 'false')
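
The build step above now renders with the Quarto CLI (`quarto render`) instead of `bookdown::render_book()`. For a local check, a rough equivalent from R, assuming the quarto package (a thin wrapper around the CLI) is installed:

```r
# a sketch, assuming the quarto R package is available
# install.packages("quarto")
quarto::quarto_render()  # renders the formats declared in _quarto.yml
```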

.gitignore

@ -7,11 +7,12 @@ _book
*.md
!CODE_OF_CONDUCT.md
*.html
!ga_script.html
!plausible.html
search_index.json
libs
*.rds
_main.*
bookdown*
tmp-pdfcrop-*
figures
/.quarto/


@ -47,9 +47,6 @@ Suggests:
tidymodels,
xml2
Remotes:
r-lib/downlit,
rstudio/bookdown,
rstudio/bslib,
tidyverse/stringr,
tidyverse/tidyr
Encoding: UTF-8


@ -1,7 +1,10 @@
# Exploratory Data Analysis
# Exploratory Data Analysis {#sec-exploratory-data-analysis}
```{r, results = "asis", echo = FALSE}
status("restructuring")
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
## Introduction
@ -78,7 +81,7 @@ To make the discussion easier, let's define some terms:
Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
So far, all of the data that you've seen has been tidy.
In real-life, most data isn't tidy, so we'll come back to these ideas again in Chapter \@ref(tidy-intro).
In real-life, most data isn't tidy, so we'll come back to these ideas again in [Chapter -@sec-list-columns] and [Chapter -@sec-rectangle-data].
## Variation
@ -98,7 +101,7 @@ In R, categorical variables are usually saved as factors or character vectors.
To examine the distribution of a categorical variable, you can use a bar chart:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A bar chart of cuts of diamonds. The cuts are presented in increasing
#| order of frequency: Fair (less than 2500), Good (approximately 5000),
#| Very Good (approximately 12500), Premium (approximately 14000), and Ideal
@ -121,7 +124,7 @@ Numbers and date-times are two examples of continuous variables.
To examine the distribution of a continuous variable, you can use a histogram:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5
#| and the y-axis ranging from 0 to 30000. The distribution is right skewed
#| with very few diamonds in the bin centered at 0, almost 30000 diamonds in
@ -150,7 +153,7 @@ You should always explore a variety of binwidths when working with histograms, a
For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
```{r}
#| fig.alt: >
#| fig-alt: >
#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and
#| the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1),
#| resulting in many bars. The distribution is right skewed but there are lots
@ -168,7 +171,7 @@ If you wish to overlay multiple histograms in the same plot, I recommend using `
It's much easier to understand overlapping lines than bars.
```{r}
#| fig.alt: >
#| fig-alt: >
#| A frequency polygon of carats of diamonds where each cut of carat (Fair,
#| Good, Very Good, Premium, and Ideal) is represented with a different color
#| line. The x-axis ranges from 0 to 3 and the y-axis ranges from 0 to almost
@ -183,7 +186,7 @@ ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
We've also customized the thickness of the lines using the `size` argument in order to make them stand out a bit more against the background.
There are a few challenges with this type of plot, which we will come back to in Section \@ref(cat-cont) on visualizing a categorical and a continuous variable.
There are a few challenges with this type of plot, which we will come back to in @sec-cat-cont on visualizing a categorical and a continuous variable.
Now that you can visualize variation, what should you look for in your plots?
And what type of follow-up questions should you ask?
@ -213,7 +216,7 @@ As an example, the histogram below suggests several interesting questions:
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
```{r}
#| fig.alt: >
#| fig-alt: >
#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and
#| the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow
#| (0.01), resulting in a very large number of skinny bars. The distribution
@ -239,7 +242,7 @@ The histogram below shows the length (in minutes) of 272 eruptions of the Old Fa
Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
```{r}
#| fig.alt: >
#| fig-alt: >
#| A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5,
#| and the y-axis ranges from 0 to roughly 40. The distribution is bimodal
#| with peaks around 1.75 and 4.5.
@ -260,7 +263,7 @@ For example, take the distribution of the `y` variable from the diamonds dataset
The only evidence of outliers is the unusually wide limits on the x-axis.
```{r}
#| fig.alt: >
#| fig-alt: >
#| A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the
#| y-axis ranges from 0 to 12000. There is a peak around 5, and the data
#| appear to be completely clustered around the peak.
@ -273,7 +276,7 @@ There are so many observations in the common bins that the rare bins are very sh
To make it easy to see the unusual values, we need to zoom to small values of the y-axis with `coord_cartesian()`:
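
The diff keeps only the updated `fig-alt` text for this chunk; a hedged sketch of the zoomed histogram it describes:

```r
library(ggplot2)
# zoom the y-axis without dropping any observations
ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
```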
```{r}
#| fig.alt: >
#| fig-alt: >
#| A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the
#| y-axis ranges from 0 to 50. There is a peak around 5, and the data
#| appear to be completely clustered around the peak. Other than those data,
@ -338,7 +341,7 @@ You'll need to figure out what caused them (e.g. a data entry error) and disclos
What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
## Missing values {#missing-values-eda}
## Missing values {#sec-missing-values-eda}
If you've encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.
@ -374,7 +377,7 @@ It's not obvious where you should plot missing values, so ggplot2 doesn't includ
```{r}
#| dev: "png"
#| fig.alt: >
#| fig-alt: >
#| A scatterplot of widths vs. lengths of diamonds. There is a strong,
#| linear association between the two variables. All but one of the diamonds
#| has length greater than 3. The one outlier has a length of 0 and a width
@ -387,7 +390,7 @@ ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
To suppress that warning, set `na.rm = TRUE`:
```{r}
#| eval: FALSE
#| eval: false
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
@ -399,7 +402,7 @@ So you might want to compare the scheduled departure times for cancelled and non
You can do this by making a new variable with `is.na()`.
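
The chunk body is elided in this hunk; a hedged sketch of the `is.na()` comparison it describes, assuming nycflights13:

```r
library(tidyverse)
library(nycflights13)
# compare scheduled departure times of cancelled vs. non-cancelled flights
flights |>
  mutate(
    cancelled = is.na(dep_time),
    sched_dep_hour = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  ) |>
  ggplot(aes(x = sched_dep_hour)) +
  geom_freqpoly(aes(color = cancelled), binwidth = 1/4)
```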
```{r}
#| fig.alt: >
#| fig-alt: >
#| A frequency polygon of scheduled departure times of flights. Two lines
#| represent flights that are cancelled and not cancelled. The x-axis ranges
#| from 0 to 25 minutes and the y-axis ranges from 0 to 10000. The number of
@ -434,7 +437,7 @@ If variation describes the behavior *within* a variable, covariation describes t
The best way to spot covariation is to visualize the relationship between two or more variables.
How you do that depends again on the types of variables involved.
### A categorical and continuous variable {#cat-cont}
### A categorical and continuous variable {#sec-cat-cont}
It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon.
The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count.
@ -442,7 +445,7 @@ That means if one of the groups is much smaller than the others, it's hard to se
For example, let's explore how the price of a diamond varies with its quality (measured by `cut`):
```{r}
#| fig.alt: >
#| fig-alt: >
#| A frequency polygon of prices of diamonds where each cut of carat (Fair,
#| Good, Very Good, Premium, and Ideal) is represented with a different color
#| line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to
@ -457,7 +460,7 @@ ggplot(data = diamonds, mapping = aes(x = price)) +
It's hard to see the difference in distribution because the overall counts differ so much:
```{r}
#| fig.alt: >
#| fig-alt: >
#| Bar chart of cuts of diamonds showing large variability between the
#| frequencies of various cuts. Fair diamonds have the lowest frequency,
#| then Good, then Very Good, then Premium, and then Ideal.
@ -470,7 +473,7 @@ To make the comparison easier we need to swap what is displayed on the y-axis.
Instead of displaying count, we'll display the **density**, which is the count standardized so that the area under each frequency polygon is one.
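
Only the `fig-alt` conversion is shown below; a sketch of the density frequency polygon itself, assuming ggplot2 >= 3.3 for `after_stat()`:

```r
library(ggplot2)
# density standardizes each polygon so the area under it is one
ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_freqpoly(aes(color = cut), binwidth = 500)
```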
```{r}
#| fig.alt: >
#| fig-alt: >
#| A frequency polygon of densities of prices of diamonds where each cut of
#| carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a
#| different color line. The x-axis ranges from 0 to 20000. The lines overlap
@ -503,8 +506,7 @@ Each boxplot consists of:
```{r}
#| echo: false
#| out.width: "100%"
#| fig.alt: >
#| fig-alt: >
#| A diagram depicting how a boxplot is created following the steps outlined
#| above.
@ -513,9 +515,9 @@ knitr::include_graphics("images/EDA-boxplot.png")
Let's take a look at the distribution of price by cut using `geom_boxplot()`:
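
The hunk below only converts the chunk options; the boxplot it belongs to would read roughly:

```r
library(ggplot2)
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
```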
```{r fig.height = 3}
#| fig.height: 3
#| fig.alt: >
```{r}
#| fig-height: 3
#| fig-alt: >
#| Side-by-side boxplots of prices of diamonds by cut. The distribution of
#| prices is right skewed for each cut (Fair, Good, Very Good, Premium, and
#| Ideal). The medians are close to each other, with the median for Ideal
@ -537,7 +539,7 @@ For example, take the `class` variable in the `mpg` dataset.
You might be interested to know how highway mileage varies across classes:
```{r}
#| fig.alt: >
#| fig-alt: >
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
#| on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact,
#| and suv).
@ -549,8 +551,8 @@ ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:
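
A sketch of the reordering (the plot code is elided in the diff); `fct_reorder()` from forcats reorders by the median by default:

```r
library(tidyverse)
ggplot(mpg, aes(x = fct_reorder(class, hwy), y = hwy)) +
  geom_boxplot()
```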
```{r}
#| fig.height: 3
#| fig.alt: >
#| fig-height: 3
#| fig-alt: >
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
#| on the x-axis and ordered by increasing median highway mileage (pickup,
#| suv, minivan, 2seater, subcompact, compact, and midsize).
@ -564,7 +566,7 @@ If you have long variable names, `geom_boxplot()` will work better if you flip i
You can do that by exchanging the x and y aesthetic mappings.
```{r}
#| fig.alt: >
#| fig-alt: >
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
#| on the y-axis and ordered by increasing median highway mileage.
@ -603,7 +605,7 @@ To visualize the covariation between categorical variables, you'll need to count
One way to do that is to rely on the built-in `geom_count()`:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A scatterplot of color vs. cut of diamonds. There is one point for each
#| combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal)
#| and color (D, E, F, G, H, I, and J). The sizes of the points represent
@ -621,7 +623,7 @@ A more commonly used way of representing the covariation between two categorical
In creating this bar chart, we map the variable we want to divide the data into first to the `x` aesthetic and the variable we then further want to divide each group into to the `fill` aesthetic.
```{r}
#| fig.alt: >
#| fig-alt: >
#| A bar chart of cuts of diamonds, segmented by color. The number of diamonds
#| for each level of cut increases from Fair to Ideal and the heights
#| of the segments within each bar represent the number of diamonds that fall
@ -635,7 +637,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.
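
The proportion version is elided here; a sketch using `position = "fill"`, which stacks every bar to height 1 so segment heights read as within-cut proportions:

```r
library(ggplot2)
ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill")
```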
```{r}
#| fig.alt: >
#| fig-alt: >
#| A bar chart of cuts of diamonds, segmented by color. The heights of each
#| of the bars representing each cut of diamond are the same, 1. The heights
#| of the segments within each bar represent the proportion of diamonds that
@ -656,7 +658,7 @@ diamonds |>
Then visualize with `geom_tile()` and the fill aesthetic:
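
A hedged reconstruction of the elided `count()` plus `geom_tile()` pipeline:

```r
library(tidyverse)
diamonds |>
  count(color, cut) |>
  ggplot(aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))
```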
```{r}
#| fig.alt: >
#| fig-alt: >
#| A tile plot of cut vs. color of diamonds. Each tile represents a
#| cut/color combination and tiles are colored according to the number of
#| observations in each tile. There are more Ideal diamonds than other cuts,
@ -693,7 +695,7 @@ For example, you can see an exponential relationship between the carat size and
```{r}
#| dev: "png"
#| fig.alt: >
#| fig-alt: >
#| A scatterplot of price vs. carat. The relationship is positive, somewhat
#| strong, and exponential.
@ -706,7 +708,7 @@ You've already seen one way to fix the problem: using the `alpha` aesthetic to a
```{r}
#| dev: "png"
#| fig.alt: >
#| fig-alt: >
#| A scatterplot of price vs. carat. The relationship is positive, somewhat
#| strong, and exponential. The points are transparent, showing clusters where
#| the number of points is higher than in other areas. The most obvious clusters
@ -727,10 +729,9 @@ Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimens
You will need to install the hexbin package to use `geom_hex()`.
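
The chunk is now marked `eval: false` and its body is elided; a sketch of the two binned plots, assuming the chapter's `smaller` subset:

```r
library(tidyverse)
smaller <- diamonds |> filter(carat < 3)

ggplot(smaller, aes(x = carat, y = price)) +
  geom_bin2d()

# geom_hex() needs install.packages("hexbin")
ggplot(smaller, aes(x = carat, y = price)) +
  geom_hex()
```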
```{r}
#| fig.asp: 1
#| out.width: "50%"
#| message: FALSE
#| fig.alt: >
#| layout-ncol: 2
#| eval: false
#| fig-alt: >
#| Plot 1: A binned density plot of price vs. carat. Plot 2: A hexagonal bin
#| plot of price vs. carat. Both plots show that the highest density of
#| diamonds have low carats and low prices.
@ -748,7 +749,7 @@ Then you can use one of the techniques for visualizing the combination of a cate
For example, you could bin `carat` and then for each group, display a boxplot:
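
Sketched with `cut_width()`, which bins carat into fixed-width groups:

```r
library(tidyverse)
smaller <- diamonds |> filter(carat < 3)  # assumption: the chapter's subset
ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))
```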
```{r}
#| fig.alt: >
#| fig-alt: >
#| Side-by-side box plots of price by carat. Each box plot represents diamonds
#| that are 0.1 carats apart in weight. The box plots show that as carat
#| increases the median price increases as well. Additionally, diamonds with
@ -769,7 +770,7 @@ Another approach is to display approximately the same number of points in each b
That's the job of `cut_number()`:
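
And the `cut_number()` variant, sketched under the same assumptions:

```r
library(tidyverse)
smaller <- diamonds |> filter(carat < 3)
ggplot(smaller, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_number(carat, 20)))
```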
```{r}
#| fig.alt: >
#| fig-alt: >
#| Side-by-side box plots of price by carat. Each box plot represents 20
#| diamonds. The box plots show that as carat increases the median price
#| increases as well. Cheaper, smaller diamonds have outliers on the higher
@ -797,7 +798,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
```{r}
#| dev: "png"
#| fig.alt: >
#| fig-alt: >
#| A scatterplot of widths vs. lengths of diamonds. There is a positive,
#| strong, linear relationship. There are a few unusual observations
#| above and below the bulk of the data, more below it than above.
@ -829,8 +830,8 @@ A scatterplot of Old Faithful eruption lengths versus the wait time between erup
The scatterplot also displays the two clusters that we noticed above.
```{r}
#| fig.height: 2
#| fig.alt: >
#| fig-height: 2
#| fig-alt: >
#| A scatterplot of eruption time vs. waiting time to next eruption of the
#| Old Faithful geyser. There are two clusters of points: one with low
#| eruption times and short waiting times and one with long eruption times and
@ -855,9 +856,9 @@ Note that instead of using the raw values of `price` and `carat`, we log transfo
Then, we exponentiate the residuals to put them back in the scale of raw prices.
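
The modeling code is elided in this hunk; a hedged base-R stand-in (`lm()` here is an assumption, not necessarily the approach the chapter uses):

```r
library(tidyverse)
# fit on the log scale, then exponentiate residuals back to the price scale
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds_aug <- diamonds |> mutate(.resid = exp(resid(mod)))
ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
  geom_point()
```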
```{r}
#| dev: "png"
#| message: false
#| fig.alt: >
#| dev: "png"
#| fig-alt: >
#| A scatter plot of residuals vs. carat of diamonds. The x-axis ranges from 0
#| to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered
#| around low values of carat and residuals. There is a clear, curved pattern
@ -884,7 +885,7 @@ ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
```{r}
#| fig.alt: >
#| fig-alt: >
#| Side-by-side box plots of residuals by cut. The x-axis displays the various
#| cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are
#| quite similar, between roughly 0.75 and 1.25. Each of the distributions of
@ -902,8 +903,8 @@ As we move on from these introductory chapters, we'll transition to a more conci
So far we've been very explicit, which is helpful when you are learning:
```{r}
#| eval: FALSE
#| fig.alt: >
#| eval: false
#| fig-alt: >
#| A frequency polygon plot of eruption times for the Old Faithful geyser.
#| The distribution of eruption times is bimodal with one mode around 1.75
#| and the other around 4.5.
@ -916,13 +917,13 @@ Typically, the first one or two arguments to a function are so important that yo
The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`.
In the remainder of the book, we won't supply those names.
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
That's a really important programming concern that we'll come back to in Chapter \@ref(functions).
That's a really important programming concern that we'll come back to in [Chapter -@sec-functions].
Rewriting the previous plot more concisely yields:
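
The rewritten chunk is `eval: false` and elided; the concise form would read roughly:

```r
library(ggplot2)
ggplot(faithful, aes(eruptions)) +
  geom_freqpoly(binwidth = 0.25)
```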
```{r}
#| eval: FALSE
#| fig.alt: >
#| eval: false
#| fig-alt: >
#| A frequency polygon plot of eruption times for the Old Faithful geyser.
#| The distribution of eruption times is bimodal with one mode around 1.75
#| and the other around 4.5.
@ -936,8 +937,8 @@ Watch for the transition from `|>` to `+`.
I wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered.
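
A sketch of such a pipeline (the chunk body is elided), showing the hand-off:

```r
library(tidyverse)
diamonds |>
  count(cut, clarity) |>                          # data verbs chained with |>
  ggplot(aes(x = clarity, y = cut, fill = n)) +   # layers added with +
  geom_tile()
```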
```{r}
#| eval: FALSE
#| fig.alt: >
#| eval: false
#| fig-alt: >
#| A tile plot of cut vs. clarity of diamonds. Each tile represents a
#| cut/clarity combination and tiles are colored according to the number of
#| observations in each tile. There are more Ideal diamonds than other cuts,


@ -1,58 +0,0 @@
delete_merged_file: true
new_session: yes
rmd_files: [
"index.Rmd",
"preface-2e.Rmd",
"intro.Rmd",
"whole-game.Rmd",
"data-visualize.Rmd",
"workflow-basics.Rmd",
"data-transform.Rmd",
"workflow-pipes.Rmd",
"data-tidy.Rmd",
"workflow-style.Rmd",
"data-import.Rmd",
"workflow-scripts.Rmd",
"EDA.Rmd",
"workflow-help.Rmd",
"transform.Rmd",
"tibble.Rmd",
"relational-data.Rmd",
"logicals.Rmd",
"numbers.Rmd",
"strings.Rmd",
"regexps.Rmd",
"factors.Rmd",
"datetimes.Rmd",
"missing-values.Rmd",
"column-wise.Rmd",
"import.Rmd",
"import-rectangular.Rmd",
"import-spreadsheets.Rmd",
"import-databases.Rmd",
"import-webscrape.Rmd",
"import-other.Rmd",
"tidy.Rmd",
"list-columns.Rmd",
"rectangle.Rmd",
"program.Rmd",
"functions.Rmd",
"vectors.Rmd",
"iteration.Rmd",
"prog-strings.Rmd",
"communicate.Rmd",
"rmarkdown.Rmd",
"communicate-plots.Rmd",
"rmarkdown-formats.Rmd",
"rmarkdown-workflow.Rmd",
]
before_chapter_script: "_common.R"


@ -14,7 +14,7 @@ options(dplyr.print_min = 6, dplyr.print_max = 6)
# Activate crayon output
options(
crayon.enabled = TRUE,
#crayon.enabled = TRUE,
pillar.bold = TRUE,
stringr.html = FALSE
)
@ -31,7 +31,7 @@ status <- function(type) {
)
cat(paste0(
"::: {.rmdnote}\n",
"::: status\n",
"You are reading the work-in-progress second edition of R for Data Science. ",
"This chapter ", status, ". ",
"You can find the complete first edition at <https://r4ds.had.co.nz>.\n",


@ -1,12 +0,0 @@
bookdown::bs4_book:
theme:
primary: "#637238"
repo:
base: https://github.com/hadley/r4ds
branch: main
includes:
in_header: [ga_script.html]
bookdown::pdf_book:
latex_engine: "xelatex"

_quarto.yml (new file)

@ -0,0 +1,103 @@
project:
type: book
output-dir: _book
book:
title: "R for Data Science (2e)"
author-meta: "Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund"
date-meta: today
description-meta: |
This book will teach you how to do data science with R: You'll learn how to
get your data into R, get it into the most useful structure, transform it,
visualise it, and model it. In this book, you will find a practicum of
skills for data science. Just as a chemist learns how to clean test tubes
and stock a lab, you'll learn how to clean data and draw plots---and many
other things besides. These are the skills that allow data science to
happen, and here you will find the best practices for doing each of these
things with R. You'll learn how to use the grammar of graphics and literate
programming to save time and make your work reproducible. Along the way,
you'll also learn how to manage cognitive resources to facilitate
discoveries when wrangling, visualising, and exploring data.
page-footer:
left: |
R for Data Science (2e) was written by Hadley Wickham, Mine
Çetinkaya-Rundel, and Garrett Grolemund.
right: |
This book was built with <a href="https://quarto.org/">Quarto</a>.
cover-image: cover.png
site-url: https://r4ds.hadley.nz/
repo-url: https://github.com/hadley/r4ds/
repo-branch: main
repo-actions: [edit, issue]
chapters:
- index.qmd
- preface-2e.qmd
- intro.qmd
- part: whole-game.qmd
chapters:
- data-visualize.qmd
- workflow-basics.qmd
- data-transform.qmd
- workflow-pipes.qmd
- data-tidy.qmd
- workflow-style.qmd
- data-import.qmd
- workflow-scripts.qmd
- EDA.qmd
- workflow-help.qmd
- part: transform.qmd
chapters:
- tibble.qmd
- relational-data.qmd
- logicals.qmd
- numbers.qmd
- strings.qmd
- regexps.qmd
- factors.qmd
- datetimes.qmd
- missing-values.qmd
- column-wise.qmd
- part: import.qmd
chapters:
- import-rectangular.qmd
- import-spreadsheets.qmd
- import-databases.qmd
- import-webscrape.qmd
- import-other.qmd
- part: tidy.qmd
chapters:
- list-columns.qmd
- rectangle.qmd
- part: program.qmd
chapters:
- functions.qmd
- vectors.qmd
- iteration.qmd
- prog-strings.qmd
- part: communicate.qmd
chapters:
- rmarkdown.qmd
- communicate-plots.qmd
- rmarkdown-formats.qmd
- rmarkdown-workflow.qmd
format:
html:
theme:
- cosmo
- r4ds.scss
cover-image: cover.png
code-link: true
include-in-header: "plausible.html"
editor: visual


@ -1,6 +1,9 @@
# Column-wise operations {#column-wise}
# Column-wise operations {#sec-column-wise}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
@ -13,7 +16,10 @@ status("drafting")
In this chapter we'll continue using dplyr.
dplyr is a member of the core tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```


@ -1,8 +1,14 @@
# Graphics for communication
# Graphics for communication {#sec-graphics-communication}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
In [exploratory data analysis], you learned how to use plots as tools for *exploration*.
In [Chapter -@sec-exploratory-data-analysis], you learned how to use plots as tools for *exploration*.
When you make exploratory plots, you know---even before looking---which variables the plot will display.
You made each plot for a purpose, could quickly look at it, and then move on to the next plot.
In the course of most analyses, you'll produce tens or hundreds of plots, most of which are immediately thrown away.
@ -26,7 +32,9 @@ Rather than loading those extensions here, we'll refer to their functions explic
This will help make it clear which functions are built into ggplot2, and which come from other packages.
Don't forget you'll need to install those packages with `install.packages()` if you don't already have them.
```{r, message = FALSE}
```{r}
#| message: false
library(tidyverse)
```
@ -36,7 +44,9 @@ The easiest place to start when turning an exploratory graphic into an expositor
You add labels with the `labs()` function.
This example adds a plot title:
```{r, message = FALSE}
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
@ -52,7 +62,9 @@ If you need to add more text, there are two other useful labels that you can use
- `caption` adds text at the bottom right of the plot, often used to describe the source of the data.
```{r, message = FALSE}
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
@ -66,7 +78,9 @@ ggplot(mpg, aes(displ, hwy)) +
You can also use `labs()` to replace the axis and legend titles.
It's usually a good idea to replace short variable names with more detailed descriptions, and to include the units.
```{r, message = FALSE}
```{r}
#| message: false
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
@ -80,7 +94,11 @@ ggplot(mpg, aes(displ, hwy)) +
It's possible to use mathematical equations instead of text strings.
Just switch `""` out for `quote()` and read about the available options in `?plotmath`:
```{r, fig.asp = 1, out.width = "50%", fig.width = 3}
```{r}
#| fig-asp: 1
#| out-width: "50%"
#| fig-width: 3
df <- tibble(
x = runif(10),
y = runif(10)
@ -213,9 +231,17 @@ Another approach is to use `stringr::str_wrap()` to automatically add line break
```
Note the use of `hjust` and `vjust` to control the alignment of the label.
Figure \@ref(fig:just) shows all nine possible combinations.
@fig-just shows all nine possible combinations.
```{r}
#| label: fig-just
#| echo: false
#| fig-width: 4.5
#| fig-asp: 0.5
#| out-width: "60%"
#| fig-cap: >
#| All nine combinations of `hjust` and `vjust`.
```{r just, echo = FALSE, fig.cap = "All nine combinations of `hjust` and `vjust`.", fig.asp = 0.5, fig.width = 4.5, out.width = "60%"}
vjust <- c(bottom = 0, center = 0.5, top = 1)
hjust <- c(left = 0, center = 0.5, right = 1)
@ -273,14 +299,19 @@ Scales control the mapping from data values to things that you can perceive.
Normally, ggplot2 automatically adds scales for you.
For example, when you type:
```{r default-scales, fig.show = "hide"}
```{r}
#| label: default-scales
#| fig-show: "hide"
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
```
ggplot2 automatically adds default scales behind the scenes:
```{r, fig.show = "hide"}
```{r}
#| fig-show: "hide"
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
@ -355,7 +386,11 @@ To control the overall position of the legend, you need to use a `theme()` setti
We'll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot.
The theme setting `legend.position` controls where the legend is drawn:
```{r fig.asp = 1, fig.align = "default", out.width = "50%", fig.width = 4}
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-asp: 1
base <- ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
@ -388,7 +423,12 @@ Fortunately, the same principles apply to all the other aesthetics, so once you'
It's very useful to plot transformations of your variable.
For example, as we've seen in [diamond prices](diamond-prices) it's easier to see the precise relationship between `carat` and `price` if we log transform them:
```{r, fig.align = "default", out.width = "50%"}
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
ggplot(diamonds, aes(carat, price)) +
geom_bin2d()
@ -412,7 +452,12 @@ The default categorical scale picks colours that are evenly spaced around the co
Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness.
The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.
```{r, fig.align = "default", out.width = "50%"}
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv))
@ -432,11 +477,16 @@ ggplot(mpg, aes(displ, hwy)) +
```
The ColorBrewer scales are documented online at <http://colorbrewer2.org/> and made available in R via the **RColorBrewer** package, by Erich Neuwirth.
Figure \@ref(fig:brewer) shows the complete list of all palettes.
@fig-brewer shows the complete list of all palettes.
The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a "middle".
This often arises if you've used `cut()` to make a continuous variable into a categorical variable.
```{r brewer, fig.asp = 2.5, echo = FALSE, fig.cap = "All ColourBrewer scales."}
```{r}
#| label: fig-brewer
#| echo: false
#| fig-cap: All ColourBrewer scales.
#| fig-asp: 2.5
par(mar = c(0, 3, 0, 0))
RColorBrewer::display.brewer.all()
```
@ -463,7 +513,12 @@ It's a continuous analog of the categorical ColorBrewer scales.
The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored a continuous colour scheme that has good perceptual properties.
Here's an example from the viridis vignette.
```{r, fig.align = "default", fig.asp = 1, out.width = "50%", fig.width = 4}
```{r}
#| fig-align: default
#| layout-ncol: 2
#| fig-width: 4
#| fig-asp: 1
df <- tibble(
x = rnorm(10000),
y = rnorm(10000)
@ -484,7 +539,9 @@ Note that all colour scales come in two variety: `scale_colour_x()` and `scale_f
1. Why doesn't the following code override the default scale?
```{r fig.show = "hide"}
```{r}
#| fig-show: "hide"
ggplot(df, aes(x, y)) +
geom_hex() +
scale_colour_gradient(low = "white", high = "red") +
@ -504,7 +561,10 @@ Note that all colour scales come in two variety: `scale_colour_x()` and `scale_f
4. Use `override.aes` to make the legend on the following plot easier to see.
```{r, dev = "png", out.width = "50%"}
```{r}
#| dev: "png"
#| out-width: "50%"
ggplot(diamonds, aes(carat, price)) +
geom_point(aes(colour = cut), alpha = 1/20)
```
@ -520,7 +580,12 @@ There are three ways to control the plot limits:
To zoom in on a region of the plot, it's generally best to use `coord_cartesian()`.
Compare the following two plots:
```{r out.width = "50%", fig.align = "default", message = FALSE}
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
#| message: false
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
@ -538,7 +603,11 @@ Reducing the limits is basically equivalent to subsetting the data.
It is generally more useful if you want to *expand* the limits, for example, to match scales across different plots.
For example, if we extract two classes of cars and plot them separately, it's difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.
```{r out.width = "50%", fig.align = "default", fig.width = 4}
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
suv <- mpg |> filter(class == "suv")
compact <- mpg |> filter(class == "compact")
@ -551,7 +620,11 @@ ggplot(compact, aes(displ, hwy, colour = drv)) +
One way to overcome this problem is to share scales across multiple plots, training the scales with the `limits` of the full data.
```{r out.width = "50%", fig.align = "default", fig.width = 4}
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 3
x_scale <- scale_x_continuous(limits = range(mpg$displ))
y_scale <- scale_y_continuous(limits = range(mpg$hwy))
col_scale <- scale_colour_discrete(limits = unique(mpg$drv))
@ -573,19 +646,25 @@ In this particular case, you could have simply used faceting, but this technique
## Themes
Finally, you can customise the non-data elements of your plot with a theme:
Finally, you can customize the non-data elements of your plot with a theme:
```{r}
#| message: false
```{r, message = FALSE}
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
```
ggplot2 includes eight themes by default, as shown in Figure \@ref(fig:themes).
ggplot2 includes eight themes by default, as shown in @fig-themes.
Many more are included in add-on packages like **ggthemes** (<https://github.com/jrnold/ggthemes>), by Jeffrey Arnold.
```{r themes, echo = FALSE, fig.cap = "The eight themes built-in to ggplot2."}
```{r}
#| label: fig-themes
#| echo: false
#| fig-cap: The eight themes built into ggplot2.
knitr::include_graphics("images/visualization-themes.png")
```
@ -604,12 +683,16 @@ You can also create your own themes, if you are trying to match a particular cor
There are two main ways to get your plots out of R and into your final write-up: `ggsave()` and knitr.
`ggsave()` will save the most recent plot to disk:
```{r, fig.show = "none"}
```{r}
#| fig-show: "hide"
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
```
```{r, include = FALSE}
```{r}
#| include: false
file.remove("my-plot.pdf")
```
@ -645,19 +728,30 @@ If `fig.width` is larger than the size the figure is rendered in the final doc,
You'll often need to do a little experimentation to figure out the right ratio between the `fig.width` and the eventual width in your document.
To illustrate the principle, the following three plots have `fig.width` of 4, 6, and 8 respectively:
```{r, include = FALSE}
```{r}
#| include: false
plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()
```
```{r, fig.width = 4, echo = FALSE}
```{r}
#| echo: false
#| fig-width: 4
plot
```
```{r, fig.width = 6, echo = FALSE}
```{r}
#| echo: false
#| fig-width: 6
plot
```
```{r, fig.width = 8, echo = FALSE}
```{r}
#| echo: false
#| fig-width: 8
plot
```


@ -1,11 +1,18 @@
# (PART) Communicate {.unnumbered}
# Communicate {#sec-communicate-intro .unnumbered}
# Introduction {#communicate-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualisation.
However, it doesn't matter how great your analysis is unless you can explain it to others: you need to **communicate** your results.
```{r echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("diagrams/data-science-communicate.png")
```


@ -1,4 +1,4 @@
# Contributing
# Contributing {#sec-contributing}
This book has been developed in the open, and it wouldn't be nearly as good without your contributions.
There are a number of ways you can help make the book even better:


@ -1,7 +1,10 @@
# Data import {#data-import}
# Data import {#sec-data-import}
```{r, results = "asis", echo = FALSE}
status("restructuring")
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
## Introduction
@ -51,15 +54,16 @@ read_lines("data/students.csv") |> cat(sep = "\n")
```
Note that the `,`s separate the columns.
Table \@ref(tab:students-table) shows a representation of the same data as a table.
@tbl-students-table shows a representation of the same data as a table.
```{r}
#| label: students-table
#| label: tbl-students-table
#| echo: false
#| message: false
#| tbl-cap: Data from the students.csv file as a table.
read_csv("data/students.csv") |>
knitr::kable(caption = "Data from the students.csv file as a table.")
knitr::kable()
```
The first argument to `read_csv()` is the most important: it's the path to the file to read.
@ -72,7 +76,7 @@ students <- read_csv("data/students.csv")
When you run `read_csv()` it prints out a message that tells you how many rows (excluding the header row) and columns the data has along with the delimiter used, and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about how to retrieve the full column specification as well as how to quiet this message.
This message is an important part of readr, which we'll come back to in Section \@ref(parsing-a-file) on parsing a file.
This message is an important part of readr, which we'll come back to in @sec-parsing-a-file on parsing a file.
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
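
The inline example is elided in the diff; it looks roughly like:

```r
library(readr)
read_csv("a,b,c
1,2,3
4,5,6")
```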
@ -113,7 +117,7 @@ There are two cases where you might want to tweak this behavior:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in Chapter \@ref(strings).)
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [Chapter -@sec-strings].)
Alternatively you can pass `col_names` a character vector which will be used as the column names:
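
Sketched, with hypothetical column names:

```r
library(readr)
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```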
@ -168,7 +172,7 @@ Another common task after reading in data is to consider variable types.
For example, `meal_type` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
You'll learn more about factors in Chapter \@ref(factors).
You'll learn more about factors in [Chapter -@sec-factors].
```{r}
students <- students |>
@ -181,7 +185,7 @@ students
Note that the values in the `meal_type` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss the details of fixing this issue in Chapter \@ref(import-spreadsheets) in further detail.
We discuss the details of fixing this issue in [Chapter -@sec-import-spreadsheets].
### Compared to base R
@ -253,7 +257,7 @@ sales_files <- dir_ls("data", glob = "*sales.csv")
sales_files
```
## Writing to a file
## Writing to a file {#sec-writing-to-a-file}
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:
@ -316,7 +320,7 @@ There are two alternatives:
```
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in Chapter \@ref(list-columns)); feather currently does not.
RDS supports list-columns (which you'll learn about in [Chapter -@sec-list-columns]); feather currently does not.
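
A sketch of the two alternatives, assuming readr for RDS and the arrow package for feather:

```r
library(readr)
write_rds(mtcars, "mtcars.rds")  # readr's wrapper around saveRDS()
read_rds("mtcars.rds")

# install.packages("arrow") if needed
arrow::write_feather(mtcars, "mtcars.feather")
arrow::read_feather("mtcars.feather")
```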
```{r}
#| include: false
@ -332,13 +336,13 @@ They're certainly not perfect, but they are a good place to start.
For rectangular data:
- **readxl** reads Excel files (both `.xls` and `.xlsx`).
See Chapter \@ref(import-spreadsheets) for more on working with data stored in Excel spreadsheets.
See [Chapter -@sec-import-spreadsheets] for more on working with data stored in Excel spreadsheets.
- **googlesheets4** reads Google Sheets.
Also see Chapter \@ref(import-spreadsheets) for more on working with data stored in Google Sheets.
Also see [Chapter -@sec-import-spreadsheets] for more on working with data stored in Google Sheets.
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
See Chapter \@ref(import-databases) for more on working with databases.
See [Chapter -@sec-import-databases] for more on working with databases.
- **haven** reads SPSS, Stata, and SAS files.


@ -1,6 +1,9 @@
# Data tidying {#data-tidy}
# Data tidying {#sec-data-tidy}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
@ -36,7 +39,7 @@ library(tidyverse)
From this chapter on, we'll suppress the loading message from `library(tidyverse)`.
## Tidy data
## Tidy data {#sec-tidy-data}
You can represent the same underlying data in multiple ways.
The example below shows the same data organised in four different ways.
@ -63,16 +66,15 @@ There are three interrelated rules that make a dataset tidy:
2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value.
Figure \@ref(fig:tidy-structure) shows the rules visually.
@fig-tidy-structure shows the rules visually.
```{r}
#| label: tidy-structure
#| echo: FALSE
#| out.width: NULL
#| fig.cap: >
#| label: fig-tidy-structure
#| echo: false
#| fig-cap: >
#| The following three rules make a dataset tidy: variables are columns,
#| observations are rows, and values are cells.
#| fig.alt: >
#| fig-alt: >
#| Three panels, each representing a tidy data frame. The first panel
#| shows that each variable is a column. The second panel shows that each
#| observation is a row. The third panel shows that each value is
@ -95,8 +97,8 @@ dplyr, ggplot2, and all the other packages in the tidyverse are designed to work
Here are a couple of small examples showing how you might work with `table1`.
```{r}
#| fig.width: 5
#| fig.alt: >
#| fig-width: 5
#| fig-alt: >
#| This figure shows the numbers of cases in 1999 and 2000 for
#| Afghanistan, Brazil, and China, with year on the x-axis and number
#| of cases on the y-axis. Each point on the plot represents the number
@ -165,7 +167,7 @@ These examples are drawn from `vignette("pivot", package = "tidyr")`, which you
Let's dive in.
### Data in column names {#billboard}
### Data in column names {#sec-billboard}
The `billboard` dataset records the billboard rank of songs in the year 2000:
@ -201,7 +203,7 @@ Take 2 Pac's "Baby Don't Cry", for example.
The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
These `NA`s don't really represent unknown observations; they're forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
[^data-tidy-1]: We'll come back to this idea in Chapter \@ref(missing-values).
[^data-tidy-1]: We'll come back to this idea in [Chapter -@sec-missing-values].
```{r}
billboard |>
@ -217,7 +219,7 @@ You might also wonder what happens if a song is in the top 100 for more than 76
We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
This data is now tidy, but we could make future computation a bit easier by converting `week` into a number using `mutate()` and `parse_number()`.
You'll learn more about `parse_number()` and friends in Chapter \@ref(data-import).
You'll learn more about `parse_number()` and friends in [Chapter -@sec-data-import].
```{r}
billboard_tidy <- billboard |>
@ -234,13 +236,13 @@ billboard_tidy
```
Now we're in a good position to look at how song ranks vary over time by drawing a plot.
The code is shown below and the result is Figure \@ref(fig:billboard-ranks).
The code is shown below and the result is @fig-billboard-ranks.
```{r}
#| label: billboard-ranks
#| fig.cap: >
#| label: fig-billboard-ranks
#| fig-cap: >
#| A line plot showing how the rank of a song changes over time.
#| fig.alt: >
#| fig-alt: >
#| A line plot with week on the x-axis and rank on the y-axis, where
#| each line represents a song. Most songs appear to start at a high rank,
#| rapidly accelerate to a low rank, and then decay again. There are
@ -281,59 +283,56 @@ df |>
How does this transformation take place?
It's easier to see if we take it component by component.
Columns that are already variables need to be repeated, once for each column in `cols`, as shown in Figure \@ref(fig:pivot-variables).
Columns that are already variables need to be repeated, once for each column in `cols`, as shown in @fig-pivot-variables.
```{r}
#| label: pivot-variables
#| echo: FALSE
#| out.width: NULL
#| fig.alt: >
#| label: fig-pivot-variables
#| echo: false
#| fig-cap: >
#| Columns that are already variables need to be repeated, once for
#| each column that is pivoted.
#| fig-alt: >
#| A diagram showing how `pivot_longer()` transforms a simple
#| dataset, using color to highlight how the values in the `var` column
#| ("A", "B", "C") are each repeated twice in the output because there are
#| two columns being pivoted ("col1" and "col2").
#| fig.cap: >
#| Columns that are already variables need to be repeated, once for
#| each column that is pivotted.
knitr::include_graphics("diagrams/tidy-data/variables.png", dpi = 270)
```
The column names become values in a new variable, whose name is given by `names_to`, as shown in Figure \@ref(fig:pivot-names).
The column names become values in a new variable, whose name is given by `names_to`, as shown in @fig-pivot-names.
They need to be repeated once for each row in the original dataset.
```{r}
#| label: pivot-names
#| label: fig-pivot-names
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| fig-cap: >
#| The column names of pivoted columns become a new column.
#| fig-alt: >
#| A diagram showing how `pivot_longer()` transforms a simple
#| data set, using color to highlight how column names ("col1" and
#| "col2") become the values in a new `var` column. They are repeated
#| three times because there were three rows in the input.
#| fig.cap: >
#| The column names of pivoted columns become a new column.
knitr::include_graphics("diagrams/tidy-data/column-names.png", dpi = 270)
```
The cell values also become values in a new variable, with a name given by `values_to`.
They are unwound row by row.
Figure \@ref(fig:pivot-values) illustrates the process.
@fig-pivot-values illustrates the process.
```{r}
#| label: pivot-values
#| label: fig-pivot-values
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| fig-cap: >
#| The number of values is preserved (not repeated), but unwound
#| row-by-row.
#| fig-alt: >
#| A diagram showing how `pivot_longer()` transforms data,
#| using color to highlight how the cell values (the numbers 1 to 6)
#| become the values in a new `value` column. They are unwound row-by-row,
#| so the original rows (1,2), then (3,4), then (5,6), become a column
#| running from 1 to 6.
#| fig.cap: >
#| The number of values is preserved (not repeated), but unwound
#| row-by-row.
knitr::include_graphics("diagrams/tidy-data/cell-values.png", dpi = 270)
```
@ -367,26 +366,25 @@ who2 |>
)
```
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in Chapter \@ref(regular-expressions).
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in [Chapter -@sec-regular-expressions].
Conceptually, this is only a minor variation on the simpler case you've already seen.
Figure \@ref(fig:pivot-multiple-names) shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns.
@fig-pivot-multiple-names shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns.
You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that gives better performance.
```{r}
#| label: pivot-multiple-names
#| echo: FALSE
#| out.width: NULL
#| fig.alt: >
#| label: fig-pivot-multiple-names
#| echo: false
#| fig-cap: >
#| Pivoting with many variables in the column names means that each
#| column name now fills in values in multiple output columns.
#| fig-alt: >
#| A diagram that uses color to illustrate how supplying `names_sep`
#| and multiple `names_to` creates multiple variables in the output.
#| The input has variable names "x_1" and "y_2" which are split up
#| by "_" to create name and number columns in the output. This is
#| is similar case with a single `names_to`, but what would have been a
#| single output variable is now separated into multiple variables.
#| fig.cap: >
#| Pivotting with many variables in the column names means that each
#| column name now fills in values in multiple output columns.
knitr::include_graphics("diagrams/tidy-data/multiple-names.png", dpi = 270)
```
@ -420,23 +418,22 @@ household |>
We again use `values_drop_na = TRUE`, since the shape of the input forces the creation of explicit missing variables (e.g. for families with only one child), and `parse_number()` to convert (e.g.) `child1` into 1.
Figure \@ref(fig:pivot-names-and-values) illustrates the basic idea with a simpler example.
@fig-pivot-names-and-values illustrates the basic idea with a simpler example.
When you use `".value"` in `names_to`, the column names in the input contribute to both values and variable names in the output.
```{r}
#| label: pivot-names-and-values
#| label: fig-pivot-names-and-values
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| A diagram that uses color to illustrate how the special ".value"
#| sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#| and we want to use the first component ("x", "y") as a variable name
#| and the second ("1", "2") as the value for a new "id" column.
#| fig.cap: >
#| fig-cap: >
#| Pivoting with `names_to = c(".value", "id")` splits the column names
#| into two components: the first part determines the output column
#| name (`x` or `y`), and the second part determines the value of the
#| `id` column.
#| fig-alt: >
#| A diagram that uses color to illustrate how the special ".value"
#| sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
#| and we want to use the first component ("x", "y") as a variable name
#| and the second ("1", "2") as the value for a new "id" column.
knitr::include_graphics("diagrams/tidy-data/names-and-values.png", dpi = 270)
```
@ -544,7 +541,7 @@ df |>
It then fills in all the missing values using the data in the input.
In this case, not every cell in the output has a corresponding value in the input as there's no entry for id "B" and name "z", so that cell remains missing.
We'll come back to this idea that `pivot_wider()` can "make" missing values in Chapter \@ref(missing-values).
We'll come back to this idea that `pivot_wider()` can "make" missing values in [Chapter -@sec-missing-values].
You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output.
The example below has two rows that correspond to id "A" and name "x":
@ -560,7 +557,7 @@ df <- tribble(
)
```
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in Chapter \@ref(list-columns):
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in [Chapter -@sec-list-columns]:
```{r}
df |> pivot_wider(
@ -669,7 +666,7 @@ cluster_id <- cluster$cluster |>
cluster_id
```
You could then combine this back with the original data using one of the joins you'll learn about in Chapter \@ref(relational-data).
You could then combine this back with the original data using one of the joins you'll learn about in [Chapter -@sec-relational-data].
```{r}
gapminder |> left_join(cluster_id)


@ -1,6 +1,9 @@
# Data transformation {#data-transform}
# Data transformation {#sec-data-transform}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
@ -43,7 +46,7 @@ If you've used R before, you might notice that this data frame prints a little d
That's because it's a **tibble**, a special type of data frame used by the tidyverse to avoid some common gotchas.
The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen.
To see everything, use `View(flights)` to open the dataset in the RStudio viewer.
We'll come back to other important differences in Chapter \@ref(tibbles).
We'll come back to other important differences in [Chapter -@sec-tibbles].
You might have noticed the short abbreviations that follow each column name.
These tell you the type of each variable: `<int>` is short for integer, `<dbl>` is short for double (aka real numbers), `<chr>` for character (aka strings), and `<dttm>` for date-time.
@ -65,7 +68,9 @@ The pipe takes the thing on its left and passes it along to the function on its
The easiest way to pronounce the pipe is "then".
That makes it possible to get a sense of the following code even though you haven't yet learnt the details:
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
filter(dest == "IAH") |>
group_by(year, month, day) |>
@ -75,10 +80,10 @@ flights |>
```
The code starts with the flights dataset, then filters it, then groups it, then summarizes it.
We'll come back to the pipe and its alternatives in Chapter \@ref(pipes).
We'll come back to the pipe and its alternatives in @sec-pipes.
dplyr's verbs are organised into four groups based on what they operate on: **rows**, **columns**, **groups**, or **tables**.
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to verbs that work on tables in [Chapter -@sec-relational-data].
In the following sections you'll learn the most important verbs for rows, columns, and groups, then we'll come back to verb that work on tables in [Chapter -@sec-relational-data].
Let's dive in!
## Rows
@ -122,7 +127,7 @@ flights |>
filter(month %in% c(1, 2))
```
We'll come back to these comparisons and logical operators in more detail in Chapter \@ref(logical).
We'll come back to these comparisons and logical operators in more detail in [Chapter -@sec-logicals].
When you run `filter()` dplyr executes the filtering operation, creating a new data frame, and then prints it.
It doesn't modify the existing `flights` dataset because dplyr functions never modify their inputs.
@ -138,20 +143,24 @@ jan1 <- flights |>
When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
`filter()` will let you know when this happens:
```{r, error = TRUE}
```{r}
#| error: true
flights |>
filter(month = 1)
```
Another mistake is writing "or" statements like you would in English:
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
filter(month == 1 | 2)
```
This works, in the sense that it doesn't throw an error, but it doesn't do what you want.
We'll come back to what it does and why in Section \@ref(boolean-operations).
We'll come back to what it does and why in @sec-boolean-operations.
### `arrange()`
@ -210,7 +219,7 @@ flights |>
There are four important verbs that affect the columns without changing the rows: `mutate()`, `select()`, `rename()`, and `relocate()`.
`mutate()` creates new columns that are functions of the existing columns; `select()`, `rename()`, and `relocate()` change which columns are present, their names, or their positions.
### `mutate()` {#mutate}
### `mutate()` {#sec-mutate}
The job of `mutate()` is to add new columns that are calculated from the existing columns.
In the transform chapters, you'll learn a large set of functions that you can use to manipulate different types of variables.
@ -264,7 +273,7 @@ flights |>
)
```
### `select()` {#select}
### `select()` {#sec-select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
In this situation, the first challenge is often just focusing on the variables you're interested in.
@ -297,7 +306,7 @@ There are a number of helper functions you can use within `select()`:
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be use `matches()` to select variables that match a pattern.
Once you know regular expressions (the topic of [Chapter -@sec-regular-expressions]) you'll also be able to use `matches()` to select variables that match a pattern.
You can rename variables as you `select()` them by using `=`.
The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
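For example, a sketch combining a helper with renaming:

```{r}
#| eval: false
flights |> select(starts_with("dep"))  # dep_time and dep_delay
flights |> select(tail_num = tailnum)  # select and rename in one step
```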
@ -341,7 +350,10 @@ flights |>
### Exercises
```{r, eval = FALSE, echo = FALSE}
```{r}
#| eval: false
#| echo: false
# For data checking, not used in results shown in book
flights <- flights |> mutate(
dep_time = hour * 60 + minute,
@ -378,7 +390,9 @@ ggplot(flights, aes(air_time - airtime2)) + geom_histogram()
How do the select helpers deal with case by default?
How can you change that default?
```{r, eval = FALSE}
```{r}
#| eval: false
select(flights, contains("TIME"))
```
@ -400,7 +414,7 @@ flights |>
`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that it's now "grouped by" month.
This means subsequent operations will now work "by month".
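A sketch of what that means in practice:

```{r}
#| eval: false
flights |>
  group_by(month) |>
  summarize(n = n()) # one row per month, rather than one row overall
```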
### `summarize()` {#summarize}
### `summarize()` {#sec-summarize}
The most important grouped operation is a summary.
It collapses each group to a single row[^data-transform-3].
@ -418,7 +432,7 @@ flights |>
Uhoh!
Something has gone wrong and all of our results are `NA` (pronounced "N-A"), R's symbol for missing value.
We'll come back to discuss missing values in Chapter \@ref(missing-values), but for now we'll remove them by using `na.rm = TRUE`:
We'll come back to discuss missing values in [Chapter -@sec-missing-values], but for now we'll remove them by using `na.rm = TRUE`:
```{r}
flights |>
@ -538,7 +552,7 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
5.  Explain what `count()` does in terms of the dplyr verbs you just learned.
What does the `sort` argument to `count()` do?
## Case study: aggregates and sample size {#sample-size}
## Case study: aggregates and sample size {#sec-sample-size}
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
That way, you can ensure that you're not drawing conclusions based on very small amounts of data.
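For example, a sketch of an aggregate paired with its count:

```{r}
#| eval: false
flights |>
  group_by(tailnum) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n() # how many flights each average is based on
  )
```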

View File

@ -1,4 +1,10 @@
# Data visualization {#data-visualisation}
# Data visualization {#sec-data-visualisation}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
@ -166,7 +172,8 @@ Here we change the levels of a point's size, shape, and color to make the point
```{r}
#| echo: false
#| fig.asp: 1/4
#| fig.asp: 0.25
#| fig-width: 8
#| fig-alt: >
#| Diagram that shows four plotting characters next to each other. The first
#| is a large circle, the second is a small circle, the third is a triangle,
@ -227,9 +234,9 @@ ggplot(data = mpg) +
Similarly, we could have mapped `class` to the *alpha* aesthetic, which controls the transparency of the points, or to the *shape* aesthetic, which controls the shape of the points.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| out-width: "50%"
#| fig-align: "default"
#| fig-height: 2
#| warning: false
#| fig-alt: >
#| Two scatterplots next to each other, both visualizing highway fuel
@ -282,14 +289,14 @@ You'll need to pick a value that makes sense for that aesthetic:
- The name of a color as a character string.
- The size of a point in mm.
- The shape of a point as a number, as shown in Figure \@ref(fig:shapes).
- The shape of a point as a number, as shown in @fig-shapes.
```{r}
#| label: shapes
#| label: fig-shapes
#| echo: false
#| warning: false
#| fig.asp: 1/2.75
#| fig.align: "center"
#| fig.asp: 0.364
#| fig-align: "center"
#| fig-cap: >
#| R has 25 built in shapes that are identified by numbers. There are some
#| seeming duplicates: for example, 0, 15, and 22 are all squares. The
@ -333,7 +340,7 @@ ggplot(shapes, aes(x, y)) +
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars
#| that shows a negative association. All points are red and
#| the legend shows a red point that is mapped to the word 'blue'.
#| the legend shows a red point that is mapped to the word blue.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
@ -514,9 +521,9 @@ How are these two plots similar?
```{r}
#| echo: false
#| message: false
#| layout-ncol: 2
#| fig-width: 4
#| out-width: "50%"
#| fig-align: "default"
#| fig-height: 2
#| fig-alt: >
#| There are two plots. The plot on the left is a scatterplot of highway fuel
#| efficiency versus engine size of cars and the plot on the right shows a
@ -610,9 +617,9 @@ In practice, ggplot2 will automatically group the data for these geoms whenever
It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.
```{r}
#| layout-ncol: 3
#| fig-width: 3
#| fig-align: "default"
#| out-width: "33%"
#| fig-height: 3
#| message: false
#| fig-alt: >
#| Three plots, each with highway fuel efficiency on the y-axis and engine
@ -749,9 +756,9 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```{r}
#| echo: false
#| message: false
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt: >
#| There are six scatterplots in this figure, arranged in a 3x2 grid.
#| In all plots highway fuel efficiency of cars are on the y-axis and
@ -958,8 +965,9 @@ There's one more piece of magic associated with bar charts.
You can color a bar chart using either the `color` aesthetic, or, more usefully, `fill`:
```{r}
#| out-width: "50%"
#| fig-align: "default"
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt: >
#| Two bar charts of cut of diamonds. In the first plot, the bars have colored
#| borders. In the second plot, they're filled with colors. Heights of the
@ -994,8 +1002,9 @@ If you don't want a stacked bar chart, you can use one of three other options: `
To see that overlapping we either need to make the bars slightly transparent by setting `alpha` to a small value, or completely transparent by setting `fill = NA`.
```{r}
#| out-width: "50%"
#| fig-align: "default"
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt: >
#| Two segmented bar charts of cut of diamonds, where each bar is filled
#| with colors for the levels of clarity. Heights of the bars correspond
@ -1112,16 +1121,16 @@ There are a three other coordinate systems that are occasionally helpful.
It's also useful for long labels: it's hard to get them to fit without overlapping on the x-axis.
```{r}
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| fig-width: 4
#| fig-height: 2
#| layout-ncol: 2
#| fig-alt: >
#| Two side-by-side box plots of highway fuel efficiency of cars. A
#| separate box plot is created for cars in each level of class (2seater,
#| compact, midsize, minivan, pickup, subcompact, and suv). In the first
#| plot class is on the x-axis, in the second plot class is on the y-axis.
#| The second plot makes it easier to read the names of the levels of class
#| since they're listed down the y-axis, avoiding overlap.
#| since they are listed down the y-axis, avoiding overlap.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
@ -1133,8 +1142,6 @@ There are a three other coordinate systems that are occasionally helpful.
However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables.
```{r}
#| fig-width: 3
#| fig-align: "default"
#| fig-alt: >
#| Side-by-side box plots of highway fuel efficiency of cars. A separate
#| box plot is drawn along the y-axis for cars in each level of class
@ -1148,13 +1155,13 @@ There are a three other coordinate systems that are occasionally helpful.
This is very important if you're plotting spatial data with ggplot2 (which unfortunately we don't have the space to cover in this book).
```{r}
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| message: false
#| fig-alt: >
#| Two maps of the boundaries of New Zealand. In the first plot the aspect
#| ratio is incorrect, in the second plot it's correct.
#| ratio is incorrect, in the second plot it is correct.
nz <- map_data("nz")
@ -1170,10 +1177,9 @@ There are a three other coordinate systems that are occasionally helpful.
Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
```{r}
#| fig-width: 3
#| out-width: "50%"
#| fig-align: "default"
#| fig.asp: 1
#| layout-ncol: 2
#| fig-width: 4
#| fig-asp: 1
#| fig-alt: >
#| There are two plots. On the left is a bar chart of cut of diamonds,
#| on the right is a Coxcomb chart of the same data.
@ -1205,13 +1211,11 @@ There are a three other coordinate systems that are occasionally helpful.
What does `geom_abline()` do?
```{r}
#| fig.asp: 1
#| out-width: "50%"
#| fig-alt: >
#| Scatterplot of highway fuel efficiency versus engine size of cars that
#| shows a negative association. The plot also has a straight line that
#| follows the trend of the relationship between the variables but doesn't
#| go through the cloud of points, it's beneath it.
#| follows the trend of the relationship between the variables but does not
#| go through the cloud of points, it is beneath it.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
@ -1244,7 +1248,6 @@ To see how this works, consider how you could build a basic plot from scratch: y
```{r}
#| echo: false
#| out-width: "100%"
#| fig-alt: >
#| A figure demonstrating the steps for going from raw data to table of counts
#| where each row represents one level of cut and a count column shows how many
@ -1261,7 +1264,6 @@ You would map the values of each variable to the levels of an aesthetic.
```{r}
#| echo: false
#| out-width: "100%"
#| fig-alt: >
#| A figure demonstrating the steps for going from raw data to table of counts
#| where each row represents one level of cut and a count column shows how
@ -1278,7 +1280,6 @@ You could also extend the plot by adding one or more additional layers, where ea
```{r}
#| echo: false
#| out-width: "100%"
#| fig-alt: >
#| A figure demonstrating the steps for going from raw data to bar chart where
#| each bar represents one level of cut and filled in with a different color.

View File

@ -1,4 +1,11 @@
# Dates and times
# Dates and times {#sec-dates-and-times}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
## Introduction
@ -25,7 +32,9 @@ This chapter will focus on the **lubridate** package, which makes it easier to w
lubridate is not part of core tidyverse because you only need it when you're working with dates/times.
We will also need nycflights13 for practice data.
```{r setup, message = FALSE}
```{r}
#| message: false
library(tidyverse)
library(lubridate)
@ -188,7 +197,9 @@ as_date(365 * 10 + 2)
1. What happens if you parse a string that contains invalid dates?
```{r, eval = FALSE}
```{r}
#| eval: false
ymd(c("2010-10-10", "bananas"))
```
@ -531,9 +542,14 @@ How do you pick between duration, periods, and intervals?
As always, pick the simplest data structure that solves your problem.
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
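A sketch of the three span types side by side:

```{r}
#| eval: false
ddays(1)                   # a duration: exactly 86,400 seconds
days(1)                    # a period: one day in human (calendar) units
(today() - 1) %--% today() # an interval: a span anchored to specific dates
```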
Figure \@ref(fig:dt-algebra) summarises permitted arithmetic operations between the different data types.
@fig-dt-algebra summarizes permitted arithmetic operations between the different data types.
```{r}
#| label: fig-dt-algebra
#| echo: false
#| fig-cap: >
#| The allowed arithmetic operations between pairs of date/time classes.
```{r dt-algebra, echo = FALSE, fig.cap = "The allowed arithmetic operations between pairs of date/time classes."}
knitr::include_graphics("diagrams/datetimes-arithmetic.png")
```

View File

@ -1,6 +1,9 @@
# Factors
# Factors {#sec-factors}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
@ -19,7 +22,10 @@ Base R some basic tools for creating and manipulating factors.
We'll supplement these with the **forcats** package, which is part of the core tidyverse.
It provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -122,7 +128,7 @@ gss_cat |>
Or with a bar chart:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A bar chart showing the distribution of race. There are ~2000
#| records with race "Other", 3000 with race "Black", and another
#| 15,000 with race "White".
@ -152,7 +158,7 @@ It's often useful to change the order of the factor levels in a visualization.
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
```{r}
#| fig.alt: >
#| fig-alt: >
#| A scatterplot of with tvhours on the x-axis and religion on the y-axis.
#| The y-axis is ordered seemingly arbitrarily, making it hard to get
#| any sense of overall pattern.
@ -177,7 +183,7 @@ We can improve it by reordering the levels of `relig` using `fct_reorder()`.
- Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
```{r}
#| fig.alt: >
#| fig-alt: >
#| The same scatterplot as above, but now the religion is displayed in
#| increasing order of tvhours. "Other eastern" has the fewest tvhours
#| under 2, and "Don't know" has the highest (over 5).
@ -190,7 +196,9 @@ Reordering religion makes it much easier to see that people in the "Don't know"
As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step.
For example, you could rewrite the plot above as:
```{r, eval = FALSE}
```{r}
#| eval: false
relig_summary |>
mutate(
relig = fct_reorder(relig, tvhours)
@ -202,7 +210,7 @@ relig_summary |>
What if we create a similar plot looking at how average age varies across reported income level?
```{r}
#| fig.alt: >
#| fig-alt: >
#| A scatterplot with age on the x-axis and income on the y-axis. Income
#| has been reordered in order of average age which doesn't make much
#| sense. One section of the y-axis goes from $6000-6999, then <$1000,
@ -228,7 +236,7 @@ You can use `fct_relevel()`.
It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
```{r}
#| fig.alt: >
#| fig-alt: >
#| The same scatterplot but now "Not Applicable" is displayed at the
#| bottom of the y-axis. Generally there is a positive association
#| between income and age, and the income band with the highest average
@ -243,8 +251,11 @@ Another type of reordering is useful when you are coloring the lines on a plot.
`fct_reorder2(f, x, y)` reorders the factor `f` by the `y` values associated with the largest `x` values.
This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
#| fig.alt:
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt:
#| - >
#| A line plot with age on the x-axis and proportion on the y-axis.
#| There is one line for each category of marital status: no answer,
@ -278,7 +289,7 @@ Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing
Combine it with `fct_rev()` if you want them in increasing frequency so that in the bar plot the largest values are on the right, not the left.
```{r}
#| fig.alt: >
#| fig-alt: >
#| A bar chart of marital status ordered from least to most common:
#| no answer (~0), separated (~1,000), widowed (~2,000), divorced
#| (~3,000), never married (~5,000), married (~10,000).

View File

@ -1,4 +1,10 @@
# Functions
# Functions {#sec-functions}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
@ -58,7 +64,9 @@ Extracting repeated code out into a function is a good idea because it prevents
To write a function you need to first analyse the code.
How many inputs does it have?
```{r, eval = FALSE}
```{r}
#| eval: false
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
```
@ -127,7 +135,7 @@ df$d <- rescale01(df$d)
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors.
There is still quite a bit of duplication since we're doing the same thing to multiple columns.
We'll learn how to eliminate that duplication with iteration in Chapter \@ref(iteration), once you've learned more about R's data structures in Chapter \@ref(vectors).
We'll learn how to eliminate that duplication with iteration in [Chapter -@sec-iteration], once you've learned more about R's data structures in [Chapter -@sec-vectors].
Another advantage of functions is that if our requirements change, we only need to make the change in one place.
For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
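A sketch of the failure, plus one possible fix using `range()`'s `finite` argument:

```{r}
#| eval: false
x <- c(1:10, Inf)
rescale01(x) # the infinite value makes the range infinite, so the output is useless

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE) # ignore Inf when finding the range
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x) # finite values now rescale to [0, 1]; Inf stays Inf
```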
@ -164,7 +172,9 @@ The more repetition you have in your code, the more places you need to remember
How many arguments does it need?
Can you rewrite it to be more expressive or less duplicative?
```{r, eval = FALSE}
```{r}
#| eval: false
mean(is.na(x))
x / sum(x, na.rm = TRUE)
@ -210,7 +220,9 @@ There are some exceptions: nouns are ok if the function computes a very well kno
A good sign that a noun might be a better choice is if you're using a very broad verb like "get", "compute", "calculate", or "determine".
Use your best judgement and don't be afraid to rename a function if you figure out a better name later.
```{r, eval = FALSE}
```{r}
#| eval: false
# Too short
f()
@ -228,7 +240,9 @@ It doesn't really matter which one you pick, the important thing is to be consis
R itself is not very consistent, but there's nothing you can do about that.
Make sure you don't fall into the same trap by making your code as consistent as possible.
```{r, eval = FALSE}
```{r}
#| eval: false
# Never do this!
col_mins <- function(x, y) {}
rowMaxes <- function(y, x) {}
@ -238,7 +252,9 @@ If you have a family of functions that do similar things, make sure they have co
Use a common prefix to indicate that they are connected.
That's better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
```{r, eval = FALSE}
```{r}
#| eval: false
# Good
input_select()
input_checkbox()
@ -255,7 +271,9 @@ A good example of this design is the stringr package: if you don't remember exac
Where possible, avoid overriding existing functions and variables.
It's impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.
```{r, eval = FALSE}
```{r}
#| eval: false
# Don't do this!
T <- FALSE
c <- 10
@ -296,12 +314,14 @@ It's a great idea to capture that sort of thinking in a comment.
4. Make a case for why `norm_r()`, `norm_d()` etc would be better than `rnorm()`, `dnorm()`.
Make a case for the opposite.
## Conditional execution
## Conditional execution {#sec-conditional-execution}
An `if` statement allows you to conditionally execute code.
It looks like this:
```{r, eval = FALSE}
```{r}
#| eval: false
if (condition) {
# code executed when condition is TRUE
} else {
@ -335,7 +355,9 @@ The `condition` must evaluate to either `TRUE` or `FALSE`.
If it's a vector, you'll get a warning message; if it's an `NA`, you'll get an error.
Watch out for these messages in your own code:
```{r, error = TRUE}
```{r}
#| error: true
if (c(TRUE, FALSE)) {}
if (NA) {}
@ -374,7 +396,9 @@ And remember, `x == NA` doesn't do anything useful!
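A sketch of why, and what to use instead:

```{r}
#| eval: false
x <- NA
x == NA  # NA: comparing two unknowns is itself unknown
is.na(x) # TRUE: the right way to test for missingness
```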
You can chain multiple if statements together:
```{r, eval = FALSE}
```{r}
#| eval: false
if (this) {
# do that
} else if (that) {
@ -388,7 +412,9 @@ But if you end up with a very long series of chained `if` statements, you should
One useful technique is the `switch()` function.
It allows you to evaluate selected code based on position or name.
```{r, echo = FALSE}
```{r}
#| echo: false
function(x, y, op) {
switch(op,
plus = x + y,
@ -412,7 +438,9 @@ An opening curly brace should never go on its own line and should always be foll
A closing curly brace should always go on its own line, unless it's followed by `else`.
Always indent the code inside curly braces.
```{r, eval = FALSE}
```{r}
#| eval: false
# Good
if (y < 0 && debug) {
message("Y is negative")
@ -473,7 +501,9 @@ if (y < 20) {
4. How could you use `cut()` to simplify this set of nested if-else statements?
```{r, eval = FALSE}
```{r}
#| eval: false
if (temp <= 0) {
"freezing"
} else if (temp <= 10) {
@ -496,7 +526,9 @@ if (y < 20) {
6. What does this `switch()` call do?
What happens if `x` is "e"?
```{r, eval = FALSE}
```{r}
#| eval: false
switch(x,
a = ,
b = "ab",
@ -545,7 +577,9 @@ Even though `na.rm = TRUE` is what you usually put in your code, it's a bad idea
When you call a function, you typically omit the names of the data arguments, because they are used so commonly.
If you override the default value of a detail argument, you should use the full name:
```{r, eval = FALSE}
```{r}
#| eval: false
# Good
mean(1:10, na.rm = TRUE)
@ -559,7 +593,9 @@ You can refer to an argument by its unique prefix (e.g. `mean(x, n = TRUE)`), bu
Notice that when you call a function, you should place spaces around `=`, and always put a space after a comma, not before (just like in regular English).
Using whitespace makes it easier to skim the function for the important components.
```{r, eval = FALSE}
```{r}
#| eval: false
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
@ -651,7 +687,9 @@ wt_mean <- function(x, w, na.rm = FALSE) {
This is a lot of extra work for little additional gain.
A useful compromise is the built-in `stopifnot()`: it checks that each argument is `TRUE`, and produces a generic error message if not.
```{r, error = TRUE}
```{r}
#| error: true
wt_mean <- function(x, w, na.rm = FALSE) {
stopifnot(is.logical(na.rm), length(na.rm) == 1)
stopifnot(length(x) == length(w))
@ -761,7 +799,9 @@ complicated_function <- function(x, y, z) {
Another reason is that you have an `if` statement with one complex block and one simple block.
For example, you might write an if statement like this:
```{r, eval = FALSE}
```{r}
#| eval: false
f <- function() {
if (x) {
# Do
@ -781,7 +821,8 @@ f <- function() {
But if the first block is very long, by the time you get to the `else`, you've forgotten the `condition`.
One way to rewrite it is to use an early return for the simple case:
```{r, eval = FALSE}
```{r}
#| eval: false
f <- function() {
if (!x) {
@ -839,7 +880,9 @@ dim(x)
And we can still use it in a pipe:
```{r, include = FALSE}
```{r}
#| include: false
library(dplyr)
```

View File

@ -1,9 +0,0 @@
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-115082821-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-115082821-1');
</script>

View File

@ -1,6 +1,9 @@
# Databases {#import-databases}
# Databases {#sec-import-databases}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
@ -21,7 +24,10 @@ But as we go along, we'll also point out a few tips and tricks for getting your
### Prerequisites
```{r, message = FALSE}
```{r}
#| label: setup
#| message: false
library(DBI)
library(tidyverse)
```
@ -57,7 +63,9 @@ Many commercial databases use the odbc standard for communication so if you're u
In most cases connecting to the database looks something like this:
```{r, eval = FALSE}
```{r}
#| eval: false
con <- DBI::dbConnect(RMariaDB::MariaDB(), username = "foo")
con <- DBI::dbConnect(RPostgres::Postgres(), hostname = "databases.mycompany.com", port = 1234)
```
@ -81,9 +89,11 @@ con <- DBI::dbConnect(duckdb::duckdb())
```
If you want to use duckdb for a real data analysis project, you'll also need to supply the `dbdir` argument to tell duckdb where to store the database files.
Assuming you're using a project (Chapter \@ref(rstudio-projects)), it's reasonable to store it in the `duckdb` directory of the current project:
Assuming you're using a project ([Chapter -@sec-workflow-scripts-projects]), it's reasonable to store it in the `duckdb` directory of the current project:
```{r, eval = FALSE}
```{r}
#| eval: false
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
```
@ -128,7 +138,7 @@ dbExistsTable(con, "foo")
The simplest way to get data out of a database is with `dbReadTable()`:
```{r}
as_tibble(dbReadTable(con, "mtcars"))
as_tibble(dbReadTable(con, "mpg"))
as_tibble(dbReadTable(con, "diamonds"))
```
@ -260,7 +270,7 @@ diamonds_db |> relocate(x:z) |> show_query()
```
The translations for `mutate()` are similarly straightforward.
We'll come back to the translation of individual components in Section \@ref(sql-expressions).
We'll come back to the translation of individual components in @sec-sql-expressions.
```{r}
diamonds_db |> mutate(price_per_carat = price / carat) |> show_query()
@ -404,7 +414,7 @@ Most database will allow you to create temporary tables, even if you don't other
Rather than copying the data to the database, it builds SQL that generates the data inline.
It's useful if you don't have permission to create temporary tables, and is faster than `copy_to()` for small datasets.
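Assuming the function being described here is dbplyr's `copy_inline()`, a hypothetical sketch with a small lookup table:

```{r}
#| eval: false
lookup <- tibble(
  carrier   = c("AA", "UA"), # hypothetical lookup table
  full_name = c("American Airlines", "United Airlines")
)
# the rows are embedded directly in the generated SQL, not copied to a table
dbplyr::copy_inline(con, lookup) |> show_query()
```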
## SQL expressions
## SQL expressions {#sec-sql-expressions}
Now that you understand the big picture of a SQL query and the equivalence between the SELECT clauses and dplyr verbs, it's time to look more at the details of the conversion of the individual expressions, i.e. what happens when you use `mean(x)` in a `summarize()`?
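As a first taste, a sketch that peeks at one such translation:

```{r}
#| eval: false
diamonds_db |>
  group_by(cut) |>
  summarize(price = mean(price, na.rm = TRUE)) |>
  show_query()
```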

View File

@ -1,7 +0,0 @@
# Other types of data {#import-other}
```{r, results = "asis", echo = FALSE}
status("drafting")
```
<!--# TO DO: Write chapter. -->

10
import-other.qmd Normal file
View File

@ -0,0 +1,10 @@
# Other types of data {#sec-import-other}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
<!--# TO DO: Write chapter. -->

View File

@ -1,4 +1,11 @@
# Rectangular data {#import-rectangular}
# Rectangular data {#sec-import-rectangular}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
Things that should be mentioned in this chapter:
@ -8,6 +15,8 @@ Things that should be mentioned in this chapter:
<!--# Moved from original import chapter -->
```{r}
#| message: false
library(tidyverse)
```
@ -115,7 +124,7 @@ parse_number("123.456.789", locale = locale(grouping_mark = "."))
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
```
### Strings {#readr-strings}
### Strings {#sec-readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
@ -174,7 +183,7 @@ The first argument to `guess_encoding()` can either be a path to a file, or, as
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Factors {#readr-factors}
### Factors {#sec-readr-factors}
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
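A sketch, echoing the fruit example used in this section:

```{r}
#| eval: false
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit) # warns about "bananana"
```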
@ -186,7 +195,7 @@ parse_factor(c("apple", "banana", "bananana"), levels = fruit)
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#sec-readr-strings) and [factors](#sec-readr-factors) to clean them up.
### Dates, date-times, and times {#readr-datetimes}
### Dates, date-times, and times {#sec-readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments:
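A sketch of each parser with ISO8601-style input:

```{r}
#| eval: false
parse_datetime("2010-10-01T2010") # a date-time
parse_date("2010-10-01")          # a date
parse_time("20:10:01")            # a time
```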
@ -224,7 +233,7 @@ Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
: `%y` (2 digits); 00-69 -\> 2000-2069, 70-99 -\> 1970-1999.
Month
@ -315,7 +324,7 @@ parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
t2 <- "11:15:10.12 PM"
```
## Parsing a file
## Parsing a file {#sec-parsing-a-file}
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
@ -387,7 +396,9 @@ tail(challenge)
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r, eval = FALSE}
```{r}
#| eval: false
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
@ -458,10 +469,4 @@ There are a few other general strategies to help you parse files:
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
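For instance, a sketch of the `read_lines()` fallback:

```{r}
#| eval: false
lines <- read_lines(readr_example("challenge.csv"))
head(lines) # raw strings you can now attack with string tools
```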
```{r, results = "asis", echo = FALSE}
status("drafting")
```
<!--# TO DO: Write chapter. -->
>>>>>>> bfaa80ba44aec5248d15093a9c521a6e2acf27ed

View File

@ -1,11 +1,18 @@
# Spreadsheets {#import-spreadsheets}
# Spreadsheets {#sec-import-spreadsheets}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
## Introduction
So far you have learned about importing data from plain text files, e.g. `.csv` and `.tsv` files.
Sometimes you need to analyze data that lives in a spreadsheet.
In this chapter we will introduce you to tools for working with data in Excel spreadsheets and Google Sheets.
This will build on much of what you've learned in Chapter \@ref(data-import) and Chapter \@ref(import-rectangular), but we will also discuss additional considerations and complexities when working with data from spreadsheets.
This will build on much of what you've learned in [Chapter -@sec-data-import] and [Chapter -@sec-import-rectangular], but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: <https://doi.org/10.1080/00031305.2017.1375989>.
The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyse and visualise.
@ -18,6 +25,8 @@ In this chapter, you'll learn how to load data from Excel spreadsheets in R with
This package is not part of the core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.
```{r}
#| message: false
library(readxl)
library(tidyverse)
```
@ -37,11 +46,20 @@ Most of readxl's functions allow you to load Excel spreadsheets into R:
These functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g. `read_csv()`, `read_table()`, etc.
For the rest of the chapter we will focus on using `read_excel()`.
### Reading spreadsheets
### Reading spreadsheets {#sec-reading-spreadsheets}
Figure \@ref(fig:students-excel) shows what the spreadsheet we're going to read into R looks like in Excel.
@fig-students-excel shows what the spreadsheet we're going to read into R looks like in Excel.
```{r}
#| label: fig-students-excel
#| echo: false
#| fig-cap: >
#| Spreadsheet called students.xlsx in Excel.
#| fig-alt: >
#| A look at the students spreadsheet in Excel. The spreadsheet contains
#| information on 6 students, their ID, full name, favourite food, meal plan,
#| and age.
```{r students-excel, fig.alt = "A look at the students spreadsheet in Excel. The spreadsheet contains information on 6 students, their ID, full name, favourite food, meal plan, and age.", fig.cap = "Spreadsheet called students.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-students.png")
```
@ -143,11 +161,19 @@ That might be tempting, but it's strongly not recommended.
### Reading individual sheets
An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets.
Figure \@ref(fig:penguins-islands) shows an Excel spreadsheet with multiple sheets.
@fig-penguins-islands shows an Excel spreadsheet with multiple sheets.
The data come from the **palmerpenguins** package.
Each sheet contains information on penguins from a different island where data were collected.
```{r penguins-islands, fig.alt = "A look at the penguins spreadsheet in Excel. The spreadsheet contains has three sheets: Torgersen Island, Biscoe Island, and Dream Island.", fig.cap = "Spreadsheet called penguins.xlsx in Excel.", echo = FALSE}
```{r}
#| label: fig-penguins-islands
#| echo: false
#| fig-cap: >
#| Spreadsheet called penguins.xlsx in Excel.
#| fig-alt: >
#| A look at the penguins spreadsheet in Excel. The spreadsheet contains
#| three sheets: Torgersen Island, Biscoe Island, and Dream Island.
knitr::include_graphics("images/import-spreadsheets-penguins-islands.png")
```
@ -196,14 +222,29 @@ penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
```
In Chapter \@ref(iteration) we'll talk about ways of doing this sort of task without repetitive code <!--# Check to make sure that's the right place to present it -->.
In [Chapter -@sec-iteration] we'll talk about ways of doing this sort of task without repetitive code <!--# Check to make sure that's the right place to present it -->.
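As a preview, a sketch of that non-repetitive approach (the path is hypothetical):

```{r}
#| eval: false
path <- "data/penguins.xlsx"
penguins <- path |>
  excel_sheets() |>                # character vector of sheet names
  set_names() |>                   # name each element after itself
  map_dfr(\(sheet) read_excel(path, sheet = sheet), .id = "island")
```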
### Reading part of a sheet
Since many people use Excel spreadsheets for presentation as well as for data storage, it's quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R.
Figure \@ref(fig:deaths-excel) shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.
@fig-deaths-excel shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.
```{r}
#| label: fig-deaths-excel
#| echo: false
#| fig-cap: >
#| Spreadsheet called deaths.xlsx in Excel.
#| fig-alt: >
#| A look at the deaths spreadsheet in Excel. The spreadsheet has four rows
#| on top that contain non-data information; the text 'For the sake of
#| consistency in the data layout, which is really a beautiful thing, I will
#| keep making notes up here.' is spread across cells in these top four rows.
#| Then, there is a data frame that includes information on deaths of 10
#| famous people, including their names, professions, ages, whether they have
#| kids or not, date of birth and death. At the bottom, there are four more
#| rows of non-data information; the text 'This has been really fun, but
#| we're signing off now!' is spread across cells in these bottom four rows.
```{r deaths-excel, fig.alt = "A look at the deaths spreadsheet in Excel. The spreadsheet has four rows on top that contain non-data information; the text 'For the same of consistency in the data layout, which is really a beautiful thing, I will keep making notes up here.' is spread across cells in these top four rows. Then, there is a data frame that includes information on deaths of 10 famous people, including their names, professions, ages, whether they have kids r not, date of birth and death. At the bottom, there are four more rows of non-data information; the text 'This has been really fun, but we're signing off now!' is spread across cells in these bottom four rows.", fig.cap = "Spreadsheet called deaths.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-deaths.png")
```
@ -244,19 +285,25 @@ In spreadsheet notation, this is `A5:F15`.
- Supply this information to the `range` argument:
```{r results = "hide"}
```{r}
#| results: "hide"
read_excel(deaths_path, range = "A5:F15")
```
- Specify rows:
```{r results = "hide"}
```{r}
#| results: "hide"
read_excel(deaths_path, range = cell_rows(c(5, 15)))
```
- Specify cells that mark the top-left and bottom-right corners of the data -- the top-left corner, `A5`, translates to `c(5, 1)` (5th row down, 1st column) and the bottom-right corner, `F15`, translates to `c(15, 6)`:
```{r results = "hide"}
```{r}
#| results: "hide"
read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))
```
@ -294,7 +341,7 @@ Confusingly, it's also possible to have something that looks like a number but i
These differences between how the underlying data are stored vs. how they're displayed can cause surprises when the data are loaded into R.
By default readxl will guess the data type in a given column.
A recommended workflow is to let readxl guess the column types, confirm that you're happy with the guessed column types, and if not, go back and re-import specifying `col_types` as shown in Section \@ref(reading-spreadsheets).
A recommended workflow is to let readxl guess the column types, confirm that you're happy with the guessed column types, and if not, go back and re-import specifying `col_types` as shown in @sec-reading-spreadsheets.
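A sketch of the re-import step (the path and types are hypothetical):

```{r}
#| eval: false
read_excel(
  "data/students.xlsx",
  col_types = c("numeric", "text", "text", "text", "numeric")
)
```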
Another challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g. some cells are numeric, others text, others dates.
When importing the data into R, readxl has to make some decisions.
@ -322,22 +369,31 @@ bake_sale
You can write data back to disk as an Excel file using `write_xlsx()` from the **writexl** package.
```{r eval = FALSE}
```{r}
#| eval: false
library(writexl)
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
```
Figure \@ref(fig:bake-sale-excel) shows what the data looks like in Excel.
@fig-bake-sale-excel shows what the data looks like in Excel.
Note that column names are included and bolded.
These can be turned off by setting the `col_names` and `format_headers` arguments to `FALSE`.
```{r bake-sale-excel, fig.alt = "Bake sale data frame created earlier in Excel.", fig.cap = "Spreadsheet called bake_sale.xlsx in Excel.", echo = FALSE}
```{r}
#| label: fig-bake-sale-excel
#| echo: false
#| fig-cap: >
#| Spreadsheet called bake_sale.xlsx in Excel.
#| fig-alt: >
#| Bake sale data frame created earlier in Excel.
knitr::include_graphics("images/import-spreadsheets-bake-sale.png")
```
Just like reading from a CSV, information on data type is lost when we read the data back in.
This makes Excel files unreliable for caching interim results as well.
For alternatives, see Section \@ref(writing-to-a-file).
For alternatives, see @sec-writing-to-a-file.
```{r}
read_excel("data/bake-sale.xlsx")
@ -354,7 +410,9 @@ A good way of familiarizing yourself with the coding style used in a new package
Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the `penguins` data frame.
```{r message = FALSE}
```{r}
#| message: false
library(openxlsx)
library(palmerpenguins)
@ -392,14 +450,24 @@ penguins_species
And we can write this to a file with `saveWorkbook()`.
```{r eval = FALSE}
```{r}
#| eval: false
saveWorkbook(penguins_species, "data/penguins-species.xlsx")
```
The resulting spreadsheet is shown in Figure \@ref(fig:penguins-species).
The resulting spreadsheet is shown in @fig-penguins-species.
By default, openxlsx formats the data as an Excel table.
```{r penguins-species, fig.alt = "A look at the penguins spreadsheet in Excel. The spreadsheet contains has three sheets: Torgersen Island, Biscoe Island, and Dream Island.", fig.cap = "Spreadsheet called penguins.xlsx in Excel.", echo = FALSE}
```{r}
#| label: fig-penguins-species
#| echo: false
#| fig-cap: >
#| Spreadsheet called penguins.xlsx in Excel.
#| fig-alt: >
#| A look at the penguins spreadsheet in Excel. The spreadsheet contains
#| three sheets: Torgersen Island, Biscoe Island, and Dream Island.
knitr::include_graphics("images/import-spreadsheets-penguins-species.png")
```
@ -414,6 +482,8 @@ See <https://ycphs.github.io/openxlsx/articles/Formatting.html> for an extensive
## Google Sheets
<!--# TO DO: Write section. -->
### Prerequisites
TO DO:
@ -434,9 +504,3 @@ TO DO:
### Write sheets
### Exercises
```{r, results = "asis", echo = FALSE}
status("drafting")
```
<!--# TO DO: Write chapter. -->

View File

@ -1,7 +0,0 @@
# Web scraping {#import-webscrape}
```{r, results = "asis", echo = FALSE}
status("drafting")
```
<!--# TO DO: Write chapter. -->

10
import-webscrape.qmd Normal file
View File

@ -0,0 +1,10 @@
# Web scraping {#sec-import-webscrape}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
<!--# TO DO: Write chapter. -->

View File

@ -1,21 +0,0 @@
# (PART) Import {.unnumbered}
# Introduction {#import-intro .unnumbered}
In this part of the book, you'll learn how to get your into R.
We'll focus on plain-text rectangular formats, spreadsheets, databases, and web data.
<!--# TO DO: Decide if a diagram is needed, see wrangle-intro for reference. -->
This part of the book proceeds as follows:
- In Chapter \@ref(import-rectangular), you'll learn how to get plain-text data in rectangular formats from disk and into R.
- In Chapter \@ref(import-spreadsheets), you'll learn how to get data from Excel spreadsheets and Google Sheets into R.
- In Chapter \@ref(import-databases), you'll learn about getting data into R from databases.
<!--# TO DO: List which types of databases. -->
- In Chapter \@ref(import-webscrape), you'll learn about harvesting data off the web and getting it into R.
- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in Chapter \@ref(import-other).

25
import.qmd Normal file
View File

@ -0,0 +1,25 @@
# Import {#sec-import-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
In this part of the book, you'll learn how to get your data into R.
We'll focus on plain-text rectangular formats, spreadsheets, databases, and web data.
<!--# TO DO: Decide if a diagram is needed, see wrangle-intro for reference. -->
This part of the book proceeds as follows:
- In [Chapter -@sec-import-rectangular], you'll learn how to get plain-text data in rectangular formats from disk and into R.
- In [Chapter -@sec-import-spreadsheets], you'll learn how to get data from Excel spreadsheets and Google Sheets into R.
- In [Chapter -@sec-import-databases], you'll learn about getting data into R from databases.
<!--# TO DO: List which types of databases. -->
- In [Chapter -@sec-import-webscrape], you'll learn about harvesting data off the web and getting it into R.
- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in [Chapter -@sec-import-other].

View File

@ -1,19 +1,7 @@
---
knit: "bookdown::render_book"
title: "R for Data Science (2e)"
author: "Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund"
description: "This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it, and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You'll learn how to use the grammar of graphics, literate programming to save time and make your work reproducible. Along the way, you'll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data."
url: 'https\://r4ds.had.co.nz/'
github-repo: hadley/r4ds
twitter-handle: hadley
cover-image: cover.png
site: bookdown::bookdown_site
documentclass: book
---
# Welcome {.unnumbered}
[![Buy from amazon](cover.png){.cover width="250"}](http://amzn.to/2aHLAQ1) This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
This is the website for the work-in-progress 2nd edition of **"R for Data Science"**.
This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
<!--# TO DO: Should "model it" stay here? Omitted? Mentioned with an explanation as to where to go for modeling? --> In this book, you will find a practicum of skills for data science.
Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides.
These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R.

View File

@ -1,4 +1,10 @@
# Introduction
# Introduction {#sec-intro}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge.
The goal of "R for Data Science" is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly.
@ -12,9 +18,8 @@ Our model of the tools needed in a typical data science project looks something
```{r}
#| echo: false
#| out.width: "75%"
#| fig.align: "center"
#| fig.alt: >
#| fig-align: "center"
#| fig-alt: >
#| A diagram displaying the data science cycle: Import -> Tidy -> Understand
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
#| Communicate. Surrounding all of these is Communicate.
@ -149,9 +154,8 @@ When you start RStudio, you'll see two key regions in the interface: the console
```{r}
#| echo: false
#| out.width: "75%"
#| fig.align: "center"
#| fig.alt: >
#| fig-align: "center"
#| fig-alt: >
#| The RStudio IDE with the panes Console and Output highlighted.
knitr::include_graphics("diagrams/rstudio-console.png")
@ -271,7 +275,7 @@ contribs <- contribs |>
filter(!name %in% c("hadley", "Garrett", "Hadley Wickham",
"Garrett Grolemund", "Mine Cetinkaya-Rundel")) |>
arrange(name) |>
mutate(uname = ifelse(!grepl(" ", name), paste0("@", name), name))
mutate(uname = ifelse(!grepl(" ", name), paste0("\\@", name), name))
cat("Thanks go to all contributers in alphabetical order: ")
cat(paste0(contribs$uname, collapse = ", "))

View File

@ -1,8 +1,14 @@
# Iteration
# Iteration {#sec-iteration}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
In Chapter \@ref(functions), we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
In [Chapter -@sec-functions], we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
@ -24,7 +30,10 @@ Once you master the vocabulary of FP, you can solve many common iteration proble
Once you've mastered the for loops provided by base R, you'll learn some of the powerful programming tools provided by purrr, one of the tidyverse core packages.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -109,7 +118,9 @@ Then we'll move on to some variations of the for loop that help you solve other
2. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:
```{r, eval = FALSE}
```{r}
#| eval: false
out <- ""
for (x in letters) {
out <- stringr::str_c(out, x)
@ -138,7 +149,9 @@ Then we'll move on to some variations of the for loop that help you solve other
4. It's common to see for loops that don't preallocate the output and instead increase the length of a vector at each step:
```{r, eval = FALSE}
```{r}
#| eval: false
output <- vector("integer", 0)
for (i in seq_along(x)) {
output <- c(output, lengths(x[[i]]))
@ -164,7 +177,7 @@ There are four variations on the basic theme of the for loop:
### Modifying an existing object
Sometimes you want to use a for loop to modify an existing object.
For example, remember our challenge from Chapter \@ref(functions) on functions.
For example, remember our challenge from [Chapter -@sec-functions] on functions.
We wanted to rescale every column in a data frame:
```{r}
@ -218,14 +231,18 @@ There are two other forms:
This is useful if you want to use the name in a plot title or a file name.
If you're creating named output, make sure to name the results vector like so:
```{r, eval = FALSE}
```{r}
#| eval: false
results <- vector("list", length(x))
names(results) <- names(x)
```
Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:
```{r, eval = FALSE}
```{r}
#| eval: false
for (i in seq_along(x)) {
name <- names(x)[[i]]
value <- x[[i]]
@ -287,7 +304,9 @@ You can't do that sort of iteration with the for loop.
Instead, you can use a while loop.
A while loop is simpler than a for loop because it only has two components, a condition and a body:
```{r, eval = FALSE}
```{r}
#| eval: false
while (condition) {
# body
}
@ -295,7 +314,9 @@ while (condition) {
A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can't rewrite every while loop as a for loop:
```{r, eval = FALSE}
```{r}
#| eval: false
for (i in seq_along(x)) {
# body
}
@ -344,7 +365,9 @@ However, it is good to know they exist so that you're prepared for problems wher
3. Write a function that prints the mean of each numeric column in a data frame, along with its name.
For example, `show_mean(mpg)` would print:
```{r, eval = FALSE}
```{r}
#| eval: false
show_mean(mpg)
#> displ: 3.47
#> year: 2004
@ -357,7 +380,9 @@ However, it is good to know they exist so that you're prepared for problems wher
4. What does this code do?
How does it work?
```{r, eval = FALSE}
```{r}
#| eval: false
trans <- list(
disp = function(x) x * 0.0163871,
am = function(x) {
@ -755,7 +780,9 @@ map2(mu, sigma, rnorm, n = 5) |> str()
`map2()` generates this series of function calls:
```{r, echo = FALSE}
```{r}
#| echo: false
knitr::include_graphics("diagrams/lists-map2.png")
```
@ -787,14 +814,18 @@ args1 |>
That looks like:
```{r, echo = FALSE}
```{r}
#| echo: false
knitr::include_graphics("diagrams/lists-pmap-unnamed.png")
```
If you don't name the list's elements, `pmap()` will use positional matching when calling the function.
That's a little fragile, and makes the code harder to read, so it's better to name the arguments:
```{r, eval = FALSE}
```{r}
#| eval: false
args2 <- list(mean = mu, sd = sigma, n = n)
args2 |>
pmap(rnorm) |>
@ -803,7 +834,9 @@ args2 |>
That generates longer, but safer, calls:
```{r, echo = FALSE}
```{r}
#| echo: false
knitr::include_graphics("diagrams/lists-pmap-named.png")
```
@ -841,7 +874,10 @@ To handle this case, you can use `invoke_map()`:
invoke_map(f, param, n = 5) |> str()
```
```{r, echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| out-width: null
knitr::include_graphics("diagrams/lists-invoke.png")
```
@ -851,7 +887,9 @@ The subsequent arguments are passed on to every function.
And again, you can use `tribble()` to make creating these matching pairs a little easier:
```{r, eval = FALSE}
```{r}
#| eval: false
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
@ -862,7 +900,7 @@ sim |>
mutate(sim = invoke_map(f, params, n = 10))
```
## Walk {#walk}
## Walk {#sec-walk}
Walk is an alternative to map that you use when you want to call a function for its side effects, rather than for its return value.
You typically do this because you want to render output to the screen or save files to disk - the important thing is the action, not the return value.
@ -878,7 +916,9 @@ x |>
`walk()` is generally not that useful compared to `walk2()` or `pwalk()`.
For example, if you had a list of plots and a vector of file names, you could use `pwalk()` to save each file to the corresponding location on disk:
```{r, eval = FALSE}
```{r}
#| eval: false
library(ggplot2)
plots <- mtcars |>
split(.$cyl) |>
@ -1009,7 +1049,9 @@ x |> accumulate(`+`)
But it has a number of bugs as illustrated with the following inputs:
```{r, eval = FALSE}
```{r}
#| eval: false
df <- tibble(
x = 1:3,
y = 3:1,

View File

@ -1,6 +1,9 @@
# List columns
# List columns {#sec-list-columns}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
@ -13,7 +16,10 @@ status("drafting")
In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets.
tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```

View File

@ -1,6 +1,9 @@
# Logical vectors {#logicals}
# Logical vectors {#sec-logicals}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
@ -19,7 +22,10 @@ We'll finish off with some tools for making conditional changes, and a cool hack
Most of the functions you'll learn about in this chapter are provided by base R, so we don't need the tidyverse, but we'll still load it so we can use `mutate()`, `filter()`, and friends to work with data frames.
We'll also continue to draw examples from the nycflights13 dataset.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(nycflights13)
```
@ -67,7 +73,9 @@ This is particularly useful for more complicated logic because naming the interm
All up, the initial filter is equivalent to:
```{r, results = FALSE}
```{r}
#| results: false
flights |>
mutate(
daytime = dep_time > 600 & dep_time < 2000,
@ -111,7 +119,7 @@ One option is to use `dplyr::near()` which ignores small differences:
near(x, c(1, 2))
```
### Missing values {#na-comparison}
### Missing values {#sec-na-comparison}
Missing values represent the unknown so they are "contagious": almost any operation involving an unknown value will also be unknown:
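A few sketches of that contagion:

```{r}
#| eval: false
NA > 5   # NA
10 == NA # NA
NA + 10  # NA
```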
@ -188,20 +196,21 @@ flights |>
Once you have multiple logical vectors, you can combine them together using Boolean algebra.
In R, `&` is "and", `|` is "or", `!` is "not", and `xor()` is exclusive or[^logicals-2].
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations and how they work.
@fig-bool-ops shows the complete set of Boolean operations and how they work.
[^logicals-2]: That is, `xor(x, y)` is true if x is true, or y is true, but not both.
This is how we usually use "or" in English.
Both is not usually an acceptable answer to the question "would you like ice cream or cake?".
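A sketch of the operators applied to every combination of `TRUE` and `FALSE`:

```{r}
#| eval: false
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
x & y     #> TRUE FALSE FALSE FALSE
x | y     #> TRUE  TRUE  TRUE FALSE
xor(x, y) #> FALSE TRUE  TRUE FALSE
```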
```{r bool-ops}
```{r}
#| label: fig-bool-ops
#| echo: false
#| out.width: NULL
#| fig.cap: >
#| out-width: NULL
#| fig-cap: >
#| The complete set of boolean operations. `x` is the left-hand
#| circle, `y` is the right-hand circle, and the shaded regions show
#| which parts each operator selects.
#| fig.alt: >
#| fig-alt: >
#| Six Venn diagrams, each explaining a given logical operator. The
#| circles (sets) in each of the Venn diagrams represent x and y. 1. y &
#| !x is y but none of x; x & y is the intersection of x and y; x & !y is
@ -214,9 +223,9 @@ knitr::include_graphics("diagrams/transform.png", dpi = 270)
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them in dplyr functions!
These are called short-circuiting operators and only ever return a single `TRUE` or `FALSE`.
They're important for programming and you'll learn more about them in Section \@ref(conditional-execution).
They're important for programming and you'll learn more about them in @sec-conditional-execution.
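
A minimal sketch of the difference (the exact behavior of `&&` on longer vectors depends on your R version; recent versions raise an error):

```{r}
#| eval: false
TRUE && FALSE          # fine: scalar inputs, returns a single FALSE
c(TRUE, TRUE) && TRUE  # errors in recent R: `&&` is not vectorized
c(TRUE, TRUE) & TRUE   # the vectorized `&` recycles and returns two values
```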
### Missing values {#na-boolean}
### Missing values {#sec-na-boolean}
The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:
@ -240,7 +249,9 @@ Similar reasoning applies with `NA & FALSE`.
Note that the order of operations doesn't work like English.
For example, the following code finds all flights that departed in November or December:
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
filter(month == 11 | month == 12)
```
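
A sketch of the tempting but wrong English-like version, and why it fails:

```{r}
#| eval: false
# R first evaluates `month == 11`, then ORs that logical vector with 12.
# A non-zero number is treated as TRUE, so the condition is always TRUE
# and every row is returned.
flights |>
  filter(month == 11 | 12)
```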
@ -279,7 +290,9 @@ letters[1:10] %in% c("a", "e", "i", "o", "u")
So to find all flights in November and December we could write:
```{r, eval = FALSE}
```{r}
#| eval: false
flights |>
filter(month %in% c(11, 12))
```
@ -304,7 +317,7 @@ flights |>
2. How many flights have a missing `dep_time`? What other variables are missing in these rows? What might these rows represent?
3. Assuming that a missing `dep_time` implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and average delay of non-cancelled flights?
## Summaries {#logical-summaries}
## Summaries {#sec-logical-summaries}
The following sections describe some useful techniques for summarizing logical vectors.
As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.
@ -335,12 +348,13 @@ That leads us to the numeric summaries.
When you use a logical vector in a numeric context, `TRUE` becomes 1 and `FALSE` becomes 0.
This makes `sum()` and `mean()` very useful with logical vectors because `sum(x)` will give the number of `TRUE`s and `mean(x)` the proportion of `TRUE`s.
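
A toy example of both summaries (vector invented for illustration):

```{r}
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)  # 3: the number of TRUEs
mean(x) # 0.75: the proportion of TRUEs
```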
That lets us see the distribution of delays across the days of the year as shown in Figure \@ref(fig:prop-delayed-dist).
That lets us see the distribution of delays across the days of the year as shown in @fig-prop-delayed-dist.
```{r prop-delayed-dist}
#| fig.cap: >
```{r}
#| label: fig-prop-delayed-dist
#| fig-cap: >
#| A histogram showing the proportion of delayed flights each day.
#| fig.alt: >
#| fig-alt: >
#| The distribution is unimodal and mildly right skewed. The distribution
#| peaks around 30% delayed flights.
flights |>
@ -368,7 +382,7 @@ flights |>
### Logical subsetting
There's one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest.
This makes use of the base `[` (pronounced subset) operator, which you'll learn more about this in Section \@ref(vector-subsetting).
This makes use of the base `[` (pronounced subset) operator, which you'll learn more about in @sec-vector-subsetting.
Imagine we wanted to look at the average delay just for flights that were actually delayed.
One way to do so would be to first filter the flights:
@ -388,7 +402,7 @@ This works, but what if we wanted to also compute the average delay for flights
We'd need to perform a separate filter step, and then figure out how to combine the two data frames[^logicals-3].
Instead you could use `[` to perform inline filtering: `arr_delay[arr_delay > 0]` will yield only the positive arrival delays.
[^logicals-3]: We'll cover this in Chapter \@ref(relational-data)
[^logicals-3]: We'll cover this in [Chapter -@sec-relational-data]
This leads to:
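
(The code is truncated in this diff; a sketch of what the inline-filtered summary might look like, with grouping assumed:)

```{r}
#| eval: false
flights |>
  group_by(year, month, day) |>
  summarise(
    # average delay, computed only over the flights that were delayed
    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
    n = n()
  )
```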
@ -530,7 +544,7 @@ flights |>
)
```
## Making groups {#groups-from-logical}
## Making groups {#sec-groups-from-logical}
Before we move on to the next chapter, I want to show you one last trick.
I don't know exactly how to describe it, and it feels a little magical, but it's super handy so I wanted to make sure you knew about it.


@ -1,34 +1,41 @@
# Missing values {#missing-values}
# Missing values {#sec-missing-values}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
## Introduction
You've already learned the basics of missing values earlier in the the book: you first saw them in Section \@ref(summarize) where they interfered with computing summary statistics, and you learned about their their infectious nature and how to check for their presence in Section \@ref(na-comparison).
In this chapter, we'll come back to missing values in more depth, so you can learn more of the details.
You've already learned the basics of missing values earlier in the book.
You first saw them in @sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in @sec-na-comparison.
Now we'll come back to them in more depth, so you can learn more of the details.
We'll start by discussing some general tools for explicitly missing values that recorded as `NA`.
We'll start by discussing some general tools for working with missing values recorded as `NA`s.
We'll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit.
We'll finish off with a of empty groups, caused by factor levels that don't appear in the data.
We'll finish off with a related discussion of empty groups, caused by factor levels that don't appear in the data.
### Prerequisites
The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
## Explicit missing values
To begin, let's explore a few handy tools for creating or eliminating explicitly `NA`s.
In the following sections you'll learn how to carry the last observation forward, convert `NA`s to fixed values, convert some fixed value to `NA`s, and learn about the special variant of `NA` known as "not a number".
To begin, let's explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an `NA`.
### Last observation carried forward
Missing values are commonly used as data entry convenience where they indicate a repeat of the value in the previous row:
A common use for missing values is as a data entry convenience.
Sometimes, in data that has been entered by hand, missing values indicate that the value in the previous row has been repeated:
```{r}
treatment <- tribble(
@ -63,7 +70,9 @@ coalesce(x, 0)
You could use `mutate()` together with `across()` to apply this treatment to (say) every numeric column in a data frame:
```{r, eval = FALSE}
```{r}
#| eval: false
df |>
mutate(across(where(is.numeric), coalesce, 0))
```
@ -83,7 +92,9 @@ na_if(x, -99)
You could apply this transformation to every numeric column in a data frame with the following code.
```{r, eval = FALSE}
```{r}
#| eval: false
df |>
mutate(across(where(is.numeric), na_if, -99))
```
@ -156,7 +167,7 @@ stocks |>
```
By default, making data longer preserves explicit missing values, but if they are structural missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting `values_drop_na = TRUE`.
See the examples in Chapter \@ref(tidy-data) for more details.
See the examples in @sec-tidy-data for more details.
### Complete
@ -238,10 +249,10 @@ The same principle applies to ggplot2's discrete axes, which will also drop leve
You can force them to display by supplying `drop = FALSE` to the appropriate discrete axis:
```{r}
#| fig.align: default
#| out.width: "50%"
#| fig.width: 3
#| fig.alt:
#| layout-ncol: 2
#| fig-width: 3
#| fig-height: 2
#| fig-alt:
#| - >
#| A bar chart with a single value on the x-axis, "no".
#| - >


@ -1,6 +1,9 @@
# Numbers {#numbers}
# Numbers {#sec-numbers}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
@ -17,7 +20,10 @@ This chapter mostly uses functions from base R, which are available without load
But we still need the tidyverse because we'll use these base R functions inside of tidyverse functions like `mutate()` and `filter()`.
Like in the last chapter, we'll use real examples from nycflights13, as well as toy examples made with `c()` and `tribble()`.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(nycflights13)
```
@ -31,7 +37,7 @@ This function is great for quick exploration and checks during analysis:
flights |> count(dest)
```
(Despite the advice in Chapter \@ref(code-style), I usually put `count()` on a single line because I'm usually using it at the console for a quick check that my calculation is working as expected.)
(Despite the advice in [Chapter -@sec-workflow-style], I usually put `count()` on a single line because I'm usually using it at the console for a quick check that my calculation is working as expected.)
If you want to see the most common values add `sort = TRUE`:
@ -56,7 +62,9 @@ flights |>
`n()` is a special summary function that doesn't take any arguments and instead accesses information about the "current" group.
This means that it only works inside dplyr verbs:
```{r, error = TRUE}
```{r}
#| error: true
n()
```
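
Inside a dplyr verb it works as expected; for example, this sketch reproduces the counts from `flights |> count(dest)`:

```{r}
#| eval: false
flights |>
  group_by(dest) |>
  summarise(n = n())
```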
@ -115,7 +123,7 @@ As an example, while R provides all the trigonometric functions that you might d
### Arithmetic and recycling rules
We introduced the basics of arithmetic (`+`, `-`, `*`, `/`, `^`) in Chapter \@ref(workflow-basics) and have used them a bunch since.
We introduced the basics of arithmetic (`+`, `-`, `*`, `/`, `^`) in [Chapter -@sec-workflow-basics] and have used them a bunch since.
These functions don't need a huge amount of explanation because they do what you learned in grade school.
But we need to briefly talk about the **recycling rules** which determine what happens when the left and right hand sides have different lengths.
This is important for operations like `flights |> mutate(air_time = air_time / 60)` because there are 336,776 numbers on the left of `/` but only one on the right.
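
A small illustration of recycling with a toy vector:

```{r}
x <- c(1, 2, 4, 8)
x / 2       # the length-1 vector 2 is recycled to length 4
x * c(1, 2) # shorter vectors recycle too: 1, 2, 1, 2
```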
@ -205,16 +213,17 @@ flights |>
)
```
We can combine that with the `mean(is.na(x))` trick from Section \@ref(logical-summaries) to see how the proportion of cancelled flights varies over the course of the day.
The results are shown in Figure \@ref(fig:prop-cancelled).
We can combine that with the `mean(is.na(x))` trick from @sec-logical-summaries to see how the proportion of cancelled flights varies over the course of the day.
The results are shown in @fig-prop-cancelled.
```{r prop-cancelled}
#| fig.cap: >
```{r}
#| label: fig-prop-cancelled
#| fig-cap: >
#| A line plot with scheduled departure hour on the x-axis, and proportion
#| of cancelled flights on the y-axis. Cancellations seem to accumulate
#| over the course of the day until 8pm; very late flights are much
#| less likely to be cancelled.
#| fig.alt: >
#| fig-alt: >
#| A line plot showing how proportion of cancelled flights changes over
#| the course of the day. The proportion starts low at around 0.5% at
#| 6am, then steadily increases over the course of the day until peaking
@ -225,7 +234,7 @@ flights |>
summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |>
filter(hour > 1) |>
ggplot(aes(hour, prop_cancelled)) +
geom_line(colour = "grey50") +
geom_line(color = "grey50") +
geom_point(aes(size = n))
```
@ -270,7 +279,7 @@ I recommend using `log2()` or `log10()`.
The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
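
For example:

```{r}
log2(8)       # 3
2^log2(8)     # back to 8
10^log10(250) # back to 250
```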
### Rounding {#rounding}
### Rounding {#sec-rounding}
Use `round(x)` to round a number to the nearest integer:
@ -355,7 +364,7 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
### Exercises
1. Explain in words what each line of the code used to generate Figure \@ref(fig:prop-cancelled) does.
1. Explain in words what each line of the code used to generate @fig-prop-cancelled does.
2. What trigonometric functions does R provide?
Guess some names and look up the documentation.
@ -377,7 +386,7 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.
### Fill in missing values {#missing-values-numbers}
### Fill in missing values {#sec-missing-values-numbers}
You can fill in missing values with dplyr's `coalesce()`:
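
(The example is truncated in this diff; a minimal sketch:)

```{r}
x <- c(1, NA, 5, NA, 10)
coalesce(x, 0) # replaces each NA with 0
```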
@ -458,7 +467,7 @@ lead(x)
```
- `x == lag(x)` tells you when the current value changes.
This is often useful combined with the grouping trick described in Section \@ref(groups-from-logical).
This is often useful combined with the grouping trick described in @sec-groups-from-logical.
```{r}
x == lag(x)
@ -485,7 +494,9 @@ You can lead or lag by more than one position by using the second argument, `n`.
6. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
Using `lag()`, explore how the average flight delay for an hour is related to the average delay for the previous hour.
```{r, results = FALSE}
```{r}
#| results: false
flights |>
mutate(hour = dep_time %/% 100) |>
group_by(year, month, day, hour) |>
@ -519,14 +530,15 @@ An alternative is to use the `median()`, which finds a value that lies in the "m
Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center.
For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
Figure \@ref(fig:mean-vs-median) compares the mean vs the median when looking at the hourly vs median departure delay.
@fig-mean-vs-median compares the mean vs the median when summarizing hourly departure delay.
The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.
```{r mean-vs-median}
#| fig.cap: >
```{r}
#| label: fig-mean-vs-median
#| fig-cap: >
#| A scatterplot showing the difference when summarizing hourly departure
#| delay with the median instead of the mean.
#| fig.alt: >
#| fig-alt: >
#| All points fall below a 45° line, meaning that the median delay is
#| always less than the mean delay. Most points are clustered in a
#| dense region of mean [0, 20] and median [0, 5]. As the mean delay
@ -541,7 +553,7 @@ flights |>
.groups = "drop"
) |>
ggplot(aes(mean, median)) +
geom_abline(slope = 1, intercept = 0, colour = "white", size = 2) +
geom_abline(slope = 1, intercept = 0, color = "white", size = 2) +
geom_point()
```
@ -552,7 +564,7 @@ For these reasons, the mode tends not to be used by statisticians and there's no
[^numbers-1]: The `mode()` function does something quite different!
### Minimum, maximum, and quantiles {#min-max-summary}
### Minimum, maximum, and quantiles {#sec-min-max-summary}
What if you're interested in locations other than the center?
`min()` and `max()` will give you the smallest and largest values.
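
For example, with a toy vector:

```{r}
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
min(x)
max(x)
quantile(x, 0.25) # the value that 25% of x falls below
```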
@ -597,16 +609,18 @@ It's worth remembering that all of the summary statistics described above are a
This means that they're fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups.
That's why it's always a good idea to visualize the distribution before committing to your summary statistics.
Figure \@ref(fig:flights-dist) shows the overall distribution of departure delays.
@fig-flights-dist shows the overall distribution of departure delays.
The distribution is so skewed that we have to zoom in to see the bulk of the data.
This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.
```{r flights-dist}
#| fig.cap: >
#| The distribution of `dep_delay` is highly skewed. On the left we
#| see the full range of the data. Zooming into just delays less than
#| 2 hours continues to show a very skewed distribution.
#| fig.alt: >
```{r}
#| label: fig-flights-dist
#| fig-cap: >
#| The distribution of `dep_delay` appears highly skewed to the right in
#| both histograms.
#| fig-subcap: ["Histogram shows the full range of delays.",
#| "Histogram is zoomed in to show delays less than 2 hours."]
#| fig-alt: >
#| Two histograms of `dep_delay`. On the left, it's very hard to see
#| any pattern except that there's a very large spike around zero, the
#| bars rapidly decay in height, and for most of the plot, you can't
@ -615,10 +629,10 @@ This suggests that the mean is unlikely to be a good summary and we might prefer
#| see that the spike occurs slightly below zero (i.e. most flights
#| leave a couple of minutes early), but there's still a very steep
#| decay after that.
#| out.width: 50%
#| fig.align: default
#| fig.width: 4
#| fig.height: 2
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
flights |>
ggplot(aes(dep_delay)) +
geom_histogram(binwidth = 15)
@ -630,15 +644,16 @@ flights |>
```
It's also a good idea to check that distributions for subgroups resemble the whole.
Figure \@ref(fig:flights-dist-daily) overlays a frequency polygon for each day.
@fig-flights-dist-daily overlays a frequency polygon for each day.
The distributions seem to follow a common pattern, suggesting it's fine to use the same summary for each day.
```{r flights-dist-daily}
#| fig.cap: >
```{r}
#| label: fig-flights-dist-daily
#| fig-cap: >
#| 365 frequency polygons of `dep_delay`, one for each day. The frequency
#| polygons appear to have the same shape, suggesting that it's reasonable
#| to compare days by looking at just a few summary statistics.
#| fig.alt: >
#| fig-alt: >
#| The distribution of `dep_delay` is highly right skewed with a strong
#| peak slightly less than 0. The 365 frequency polygons are mostly
#| overlapping, forming a thick black band.
@ -650,12 +665,12 @@ flights |>
Don't be afraid to explore your own custom summaries specifically tailored for the data that you're working with.
In this case, that might mean separately summarizing the flights that left early vs the flights that left late, or, given that the values are so heavily skewed, trying a log-transformation.
Finally, don't forget what you learned in Section \@ref(sample-size): whenever creating numerical summaries, it's a good idea to include the number of observations in each group.
Finally, don't forget what you learned in @sec-sample-size: whenever creating numerical summaries, it's a good idea to include the number of observations in each group.
### Positions
There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position.
You can do this with the base R `[` function, but we won't cover it until @sec-vector-subsetting, because it's a very powerful and general function.
You can do this with the base R `[` function, but we're not cover it until @sec-vector-subsetting, because it's a very powerful and general function.
For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
For example, we can find the first and last departure for each day:
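
(The code is truncated in this diff; a sketch of what it might look like:)

```{r}
#| eval: false
flights |>
  group_by(year, month, day) |>
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
```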
@ -690,7 +705,7 @@ flights |>
### With `mutate()`
As the names suggest, the summary functions are typically paired with `summarise()`.
However, because of the recycling rules we discussed in Section \@ref(scalars-and-recycling-rules) they can also be usefully paired with `mutate()`, particularly when you want do some sort of group standardization.
However, because of the recycling rules we discussed in @sec-scalars-and-recycling-rules they can also be usefully paired with `mutate()`, particularly when you want to do some sort of group standardization.
For example:
- `x / sum(x)` calculates the proportion of a total.

plausible.html Normal file

@ -0,0 +1 @@
<script defer data-domain="r4ds.hadley.nz" src="https://plausible.io/js/plausible.js"></script>


@ -4,18 +4,22 @@ Welcome to the second edition of "R for Data Science".
## Major changes {.unnumbered}
- The first part is renamed to "whole game" to reflect the entire data science cycle. It gains a new chapter that briefly introduces the basics of reading data from csv files.
- The first part is renamed to "whole game" to reflect the entire data science cycle.
It gains a new chapter that briefly introduces the basics of reading data from csv files.
- The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.
- The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values.
These were previously parts of the data transformation chapter, but needed much more room.
- We've added new chapters on column-wise and row-wise operations.
- We've added a new set of chapters on import that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and scraping data from the web.
- The modeling part has been removed. For modeling, we recommend using packages from [tidymodels](https://www.tidymodels.org/) and reading [Tidy Modeling with R](https://www.tmwr.org/) by Max Kuhn and Julia Silge to learn more about them.
- The modeling part has been removed.
For modeling, we recommend using packages from [tidymodels](https://www.tidymodels.org/) and reading [Tidy Modeling with R](https://www.tmwr.org/) by Max Kuhn and Julia Silge to learn more about them.
- We've switched from the magrittr pipe to the base pipe.
## Acknowledgements {.unnumbered}
*TO DO: Add acknowledgements.*


@ -1,6 +1,9 @@
# Programming with strings
# Programming with strings {#sec-programming-with-strings}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```
@ -260,7 +263,9 @@ The main difference is the prefix: `str_` vs. `stri_`.
1. What do the `extra` and `fill` arguments do in `separate()`?
Experiment with the various options for the following two toy datasets.
```{r, eval = FALSE}
```{r}
#| eval: false
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) |>
separate(x, c("one", "two", "three"))
@ -278,7 +283,9 @@ The main difference is the prefix: `str_` vs. `stri_`.
4. In the following example we're using `unite()` to create a `date` column from `month` and `day` columns.
How would you achieve the same outcome using `mutate()` and `paste()` instead of unite?
```{r, eval = FALSE}
```{r}
#| eval: false
events <- tribble(
~month, ~day,
1 , 20,


@ -1,11 +1,18 @@
# (PART) Program {.unnumbered}
# Program {#sec-program-intro .unnumbered}
# Introduction {#program-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
In this part of the book, you'll improve your programming skills.
Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.
```{r echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("diagrams/data-science-program.png")
```
@ -28,18 +35,18 @@ But this doesn't mean you should rewrite every function: you need to balance wha
In the following four chapters, you'll learn skills that will allow you to both tackle new programs and to solve existing problems with greater clarity and ease:
1. In Chapter \@ref(pipes), you will dive deep into the **pipe**, `|>`, and learn more about how it works, what the alternatives are, and when not to use it.
1. In @sec-pipes, you will dive deep into the **pipe**, `|>`, and learn more about how it works, what the alternatives are, and when not to use it.
2. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice.
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
Instead, in Chapter \@ref(functions), you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
Instead, in [Chapter -@sec-functions], you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in Chapter \@ref(vectors).
3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in [Chapter -@sec-vectors].
You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame.
4. Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
You need tools for **iteration** that let you do similar things again and again.
These tools include for loops and functional programming, which you'll learn about in Chapter \@ref(iteration).
These tools include for loops and functional programming, which you'll learn about in [Chapter -@sec-iteration].
## Learning more


@ -15,7 +15,7 @@ LaTeX: XeLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
BuildType: Website
BuildType: None
MarkdownWrap: Sentence
MarkdownCanonical: Yes


@ -1,15 +0,0 @@
.book .book-header h1 {
opacity: 1;
text-align: left;
}
#header .title {
margin-bottom: 0em;
}
#header h4.author {
margin: 0;
color: #666;
}
#header h4.author em {
font-style: normal;
}

r4ds.scss Normal file

@ -0,0 +1,53 @@
/*-- scss:defaults --*/
$primary: #637238 !default;
/*-- scss:rules --*/
.sidebar-title {
color: #637238;
}
img.quarto-cover-image {
box-shadow: 0 .5rem 1rem rgba(0,0,0,.15);
}
/* status box styling */
.status {
border: 2px solid #637238;
padding: 1em;
margin-bottom: 1em;
}
.status p {
margin-bottom: 0;
}
/* Headings ------------------------------------------------------ */
h2 {
margin-top: 2rem;
margin-bottom: 1rem;
font-size: 1.5rem;
}
h3 { margin-top: 1.5em; font-size: 1.2rem; }
h4 { margin-top: 1.5em; font-size: 1.1rem; }
h5 { margin-top: 1.5em; font-size: 1rem; }
h1, h2, h3, h4, h5 {
line-height: 1.3;
}
.quarto-section-identifier {
color: #6C6C6C;
font-weight: normal;
}
/* Code ------------------------------------------------ */
$code-color: #373a3c !default;
pre {
background-image: linear-gradient(160deg,#f8f8f8 0,#f1f1f1 100%);
}


@ -1,4 +1,10 @@
# Data rectangling {#rectangle-data}
# Data rectangling {#sec-rectangle-data}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
@ -9,7 +15,10 @@
In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets.
tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```


@ -1,12 +1,15 @@
# Regular expressions
# Regular expressions {#sec-regular-expressions}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("restructuring")
```
## Introduction
You learned the basics of regular expressions in Chapter \@ref(strings), but regular expressions are fairly rich language so it's worth spending some extra time on the details.
You learned the basics of regular expressions in [Chapter -@sec-strings], but regular expressions are a fairly rich language, so it's worth spending some extra time on the details.
The chapter starts by expanding your knowledge of patterns, to cover six important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, and alternation).
Here we'll focus mostly on the language itself, not the functions that use it.
@ -22,7 +25,10 @@ We'll finish by discussing the various "flags" that allow you to tweak the opera
This chapter will use regular expressions as provided by the **stringr** package.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -45,7 +51,7 @@ It's not R specific, but it covers the most advanced features and explains how r
## Pattern language
You learned the very basics of the regular expression pattern language in Chapter \@ref(strings), and now its time to dig into more of the details.
You learned the very basics of the regular expression pattern language in [Chapter -@sec-strings], and now it's time to dig into more of the details.
First, we'll start with **escaping**, which allows you to match characters that the pattern language otherwise treats specially.
Next you'll learn about **anchors**, which allow you to match the start or end of the string.
Then you'll learn about **character classes** and their shortcuts, which allow you to match any character from a set.
@ -54,15 +60,15 @@ We'll finish up with **quantifiers**, which control how many times a pattern can
The terms I use here are the technical names for each component.
They're not always the most evocative of their purpose, but it's very helpful to know the correct terms if you later want to Google for more details.
I'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in Chapter \@ref(strings), i.e.:
I'll concentrate on showing how these patterns work with `str_view()` and `str_view_all()` but remember that you can use them with any of the functions that you learned about in [Chapter -@sec-strings], i.e.:
- `str_detect(x, pattern)` returns a logical vector the same length as `x`, indicating whether each element matches (`TRUE`) or doesn't match (`FALSE`) the pattern.
- `str_count(x, pattern)` returns the number of times `pattern` matches in each element of `x`.
- `str_replace_all(x, pattern, replacement)` replaces every instance of `pattern` with `replacement`.
### Escaping {#regexp-escaping}
### Escaping {#sec-regexp-escaping}
In Chapter \@ref(strings), you'll learned how to match a literal `.` by using `fixed(".")`.
In [Chapter -@sec-strings], you learned how to match a literal `.` by using `fixed(".")`.
But what if you want to match a literal `.` as part of a bigger regular expression?
You'll need to use an **escape**, which tells the regular expression you want it to match exactly, not use its special behavior.
Like strings, regexps use the backslash for escaping, so to match a `.`, you need the regexp `\.`.
@ -94,7 +100,7 @@ str_view(x)
str_view(x, "\\\\")
```
Alternatively, you might find it easier to use the raw strings you learned about in Section \@ref(raw-strings)).
Alternatively, you might find it easier to use the raw strings you learned about in @sec-raw-strings.
That lets you avoid one layer of escaping:
```{r}
@ -195,7 +201,7 @@ str_view_all("abcd12345!@#%. ", "\\S+")
### Quantifiers
The **quantifiers** control how many times a pattern matches.
In Chapter \@ref(strings) you learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
In [Chapter -@sec-strings] you learned about `?` (0 or 1 matches), `+` (1 or more matches), and `*` (0 or more matches).
For example, `colou?r` will match American or British spelling, `\d+` will match one or more digits, and `\s?` will optionally match a single whitespace.
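
A couple of quick sketches:

```{r}
#| eval: false
str_view(c("color", "colour"), "colou?r") # matches both spellings
str_view("page 123 of 456", "\\d+")       # one or more digits
```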
You can also specify the number of matches precisely:
@ -326,7 +332,7 @@ It's typically much easier to come up with positive examples than negative examp
Nevertheless they're still useful; even if you don't get them correct right away, you can slowly accumulate them as you work on your problem.
If you later get more into programming and learn about unit tests, you can then turn these examples into automated tests that ensure you never make the same mistake twice.)
### Boolean operations
### Boolean operations {#sec-boolean-operations}
Imagine we want to find words that only contain consonants.
One technique is to create a character class that contains all letters except for the vowels (`[^aeiou]`), then allow that to match any number of letters (`[^aeiou]+`), then force it to match the whole string by anchoring to the beginning and the end (`^[^aeiou]+$`):
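
With stringr's built-in `words` vector, that looks something like this (a sketch):

```{r}
#| eval: false
str_view(words, "^[^aeiou]+$", match = TRUE)
```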
@ -558,7 +564,9 @@ Typically, however, you'll find it easier to just ignore that result by setting
There are a number of settings, often called **flags** in other programming languages, that you can use to control some of the details of the regex.
In stringr, you can use these by wrapping the pattern in a call to `regex()`:
```{r, eval = FALSE}
```{r}
#| eval: false
# The regular call:
str_view(fruit, "nana")
# is shorthand for


@ -1,6 +1,9 @@
# Two-table verbs {#relational-data}
# Two-table verbs {#sec-relational-data}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("restructuring")
```
@ -31,20 +34,23 @@ If so, you should find the concepts in this chapter familiar, although their exp
One other major terminology difference between databases and R is that what we generally refer to as a data frame in R is referred to as a "table" in databases.
Hence you'll see references to one-table and two-table verbs in dplyr documentation.
Generally, dplyr is a little easier to use than SQL because dplyr is specialized to do data analysis: it makes common data analysis operations easier, at the expense of making it more difficult to do other things that aren't commonly needed for data analysis.
If you're not familiar with databases or SQL, you'll learn more about them in Chapter \@ref(import-databases).
If you're not familiar with databases or SQL, you'll learn more about them in [Chapter -@sec-import-databases].
### Prerequisites
We will explore relational data from `nycflights13` using the two-table verbs from dplyr.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(nycflights13)
```
## nycflights13 {#nycflights13-relational}
## nycflights13 {#sec-nycflights13-relational}
nycflights13 contains five tibbles : `airlines`, `airports`, `weather` and `planes` which are all related to the `flights` data frame that you used in Chapter \@ref(data-transform) on data transformation:
nycflights13 contains five tibbles. Four of them, `airlines`, `airports`, `weather`, and `planes`, are all related to the `flights` data frame that you used in [Chapter -@sec-data-transform] on data transformation:
- `airlines` lets you look up the full carrier name from its abbreviated code:
@ -80,16 +86,17 @@ These datasets are connected as follows:
- `flights` connects to `weather` via `origin` (the location), and `year`, `month`, `day` and `hour` (the time).
One way to show the relationships between the different data frames is with a diagram, as in Figure \@ref(fig:flights-relationships).
One way to show the relationships between the different data frames is with a diagram, as in @fig-flights-relationships.
This diagram is a little overwhelming, but it's simple compared to some you'll see in the wild!
The key to understanding diagrams like this is that you'll solve real problems by working with pairs of data frames.
You don't need to understand the whole thing; you just need to understand the chain of connections between the two data frames that you're interested in.
```{r flights-relationships, echo = FALSE}
```{r}
#| label: fig-flights-relationships
#| echo: false
#| fig.cap: >
#| fig-cap: >
#| Connections between all six data frames in the nycflights package.
#| fig.alt: >
#| fig-alt: >
#| Diagram showing the relationships between airports, planes, flights,
#| weather, and airlines datasets from the nycflights13 package. The faa
#| variable in the airports data frame is connected to the origin and dest
@ -200,7 +207,7 @@ For example, in this data there's a many-to-many relationship between airlines a
How would you characterize the relationship between the `Batting`, `Pitching`, and `Fielding` data frames?
## Mutating joins {#mutating-joins}
## Mutating joins {#sec-mutating-joins}
The first tool we'll look at for combining a pair of data frames is the **mutating join**.
A mutating join allows you to combine variables from two data frames.
@ -252,11 +259,12 @@ To help you learn how joins work, I'm going to use a visual representation:
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| out-width: null
#| fig-alt: >
#| x and y are two data frames with 2 columns and 3 rows each. The first
#| column in each is the key and the second is the value. The contents of
#| these data frames are given in the subsequent code chunk.
knitr::include_graphics("diagrams/join-setup.png")
```
@ -282,7 +290,13 @@ In these examples I'll show a single key variable, but the idea generalises in a
A join is a way of connecting each row in `x` to zero, one, or more rows in `y`.
The following diagram shows each potential match as an intersection of a pair of lines.
```{r, echo = FALSE, out.width = NULL, fig.alt = "x and y data frames placed next to each other. with the key variable moved up front in y so that the key variable in x and key variable in y appear next to each other."}
```{r}
#| echo: false
#| fig-alt: >
#| x and y data frames placed next to each other. with the key variable
#| moved up front in y so that the key variable in x and key variable
#| in y appear next to each other.
knitr::include_graphics("diagrams/join-setup2.png")
```
@ -292,16 +306,31 @@ This is to emphasize that joins match based on the key; the other columns are ju
In an actual join, matches will be indicated with dots.
The number of dots = the number of matches = the number of rows in the output.
```{r join-inner, echo = FALSE, out.width = NULL, fig.alt = "Keys 1 and 2 in x and y data frames are matched and indicated with lines joining these rows with dot in the middle. Hence, there are two dots in this diagram. The resulting joined data frame has two rows and 3 columns: key, val_x, and val_y. Values in the key column are 1 and 2 (the matched values)."}
```{r}
#| label: join-inner
#| echo: false
#| out-width: null
#| fig-alt: >
#| Keys 1 and 2 in x and y data frames are matched and indicated with lines
#| joining these rows with dot in the middle. Hence, there are two dots in
#| this diagram. The resulting joined data frame has two rows and 3 columns:
#| key, val_x, and val_y. Values in the key column are 1 and 2, the matched
#| values.
knitr::include_graphics("diagrams/join-inner.png")
```
### Inner join {#inner-join}
### Inner join {#sec-inner-join}
The simplest type of join is the **inner join**.
An inner join matches pairs of observations whenever their keys are equal:
```{r, echo = FALSE, ref.label = "join-inner", opts.label = TRUE}
```{r}
#| ref.label: join-inner
#| echo: false
#| out-width: null
#| opts.label: true
knitr::include_graphics("diagrams/join-inner.png")
```
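
In code, with the `x` and `y` data frames defined above, that's (a sketch):

```{r}
#| eval: false
x |>
  inner_join(y, by = "key")
```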
@ -318,7 +347,7 @@ x |>
The most important property of an inner join is that unmatched rows are not included in the result.
This means that inner joins are usually not appropriate for use in analysis because it's too easy to lose observations.
### Outer joins {#outer-join}
### Outer joins {#sec-outer-join}
An inner join keeps observations that appear in both data frames.
An **outer join** keeps observations that appear in at least one of the data frames.
@ -333,7 +362,28 @@ This observation has a key that always matches (if no other key matches), and a
Graphically, that looks like:
```{r, echo = FALSE, out.width = NULL, fig.alt = "Three diagrams for left, right, and full joins. In each diagram data frame x is on the left and y is on the right. The result of the join is always a data frame with three columns (key, val_x, and val_y). Left join: keys 1 and 2 from x are matched to those in y, key 3 is also carried along to the joined result since it's on the left data frame, but key 4 from y is not carried along since it's on the right but not on the left. The result is a data frame with 3 rows: keys 1, 2, and 3, all values from val_x, and the corresponding values from val_y for keys 1 and 2 with an NA for key 3, val_y. Right join: keys 1 and 2 from x are matched to those in y, key 4 is also carried along to the joined result since it's on the right data frame, but key 3 from x is not carried along since it's on the left but not on the right. The result is a data frame with 3 rows: keys 1, 2, and 4, all values from val_y, and the corresponding values from val_x for keys 1 and 2 with an NA for key 4, val_x. Full join: The resulting data frame has 4 rows: keys 1, 2, 3, and 4 with all values from val_x and val_y, however key 2, val_y and key 4, val_x are NAs since those keys aren't present in their respective data frames."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Three diagrams for left, right, and full joins. In each diagram data frame
#| x is on the left and y is on the right. The result of the join is always a
#| data frame with three columns (key, val_x, and val_y). Left join: keys 1
#| and 2 from x are matched to those in y, key 3 is also carried along to the
#| joined result since it's on the left data frame, but key 4 from y is not
#| carried along since it's on the right but not on the left. The result is
#| a data frame with 3 rows: keys 1, 2, and 3, all values from val_x, and
#| the corresponding values from val_y for keys 1 and 2 with an NA for key 3,
#| val_y. Right join: keys 1 and 2 from x are matched to those in y, key 4 is
#| also carried along to the joined result since it's on the right data frame,
#| but key 3 from x is not carried along since it's on the left but not on the
#| right. The result is a data frame with 3 rows: keys 1, 2, and 4, all values
#| from val_y, and the corresponding values from val_x for keys 1 and 2 with
#| an NA for key 4, val_x. Full join: The resulting data frame has 4 rows:
#| keys 1, 2, 3, and 4 with all values from val_x and val_y, however key 2,
#| val_y and key 4, val_x are NAs since those keys aren't present in their
#| respective data frames.
knitr::include_graphics("diagrams/join-outer.png")
```
@ -342,14 +392,25 @@ The left join should be your default join: use it unless you have a strong reaso
Another way to depict the different types of joins is with a Venn diagram:
```{r, echo = FALSE, out.width = NULL, fig.alt = "Venn diagrams for inner, full, left, and right joins. Each join represented with two intersecting circles representing data frames x and y, with x on the right and y on the left. Shading indicates the result of the join. Inner join: Only intersection is shaded. Full join: Everything is shaded. Left join: Only x is shaded, but not the area in y that doesn't intersect with x. Right join: Only y is shaded, but not the area in x that doesn't intersect with y."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Venn diagrams for inner, full, left, and right joins. Each join represented
#| with two intersecting circles representing data frames x and y, with x on
#| the right and y on the left. Shading indicates the result of the join.
#| Inner join: Only intersection is shaded. Full join: Everything is shaded.
#| Left join: Only x is shaded, but not the area in y that doesn't intersect
#| with x. Right join: Only y is shaded, but not the area in x that doesn't
#| intersect with y.
knitr::include_graphics("diagrams/join-venn.png")
```
However, this is not a great representation.
It might jog your memory about which join preserves the observations in which data frame, but it suffers from a major limitation: a Venn diagram can't show what happens when keys don't uniquely identify an observation.
### Duplicate keys {#join-matches}
### Duplicate keys {#sec-join-matches}
So far all the diagrams have assumed that the keys are unique.
But that's not always the case.
@ -361,7 +422,18 @@ TODO: update for new warnings
1. One data frame has duplicate keys.
This is useful when you want to add additional information, as there is typically a one-to-many relationship.
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram describing a left join where one of the data frames (x) has duplicate keys. Data frame x is on the left, has 4 rows and 2 columns (key, val_x), and has the keys 1, 2, 2, and 1. Data frame y is on the right, has 2 rows and 2 columns (key, val_y), and has the keys 1 and 2. Left joining these two data frames yields a data frame with 4 rows (keys 1, 2, 2, and 1) and 3 columns (val_x, key, val_y). All values from x$val_x are carried along, values in y for key 1 and 2 are duplicated."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Diagram describing a left join where one of the data frames (x) has
#| duplicate keys. Data frame x is on the left, has 4 rows and 2 columns
#| (key, val_x), and has the keys 1, 2, 2, and 1. Data frame y is on the
#| right, has 2 rows and 2 columns (key, val_y), and has the keys 1 and 2.
#| Left joining these two data frames yields a data frame with 4 rows
#| (keys 1, 2, 2, and 1) and 3 columns (val_x, key, val_y). All values
#| from x$val_x are carried along, values in y for key 1 and 2 are duplicated.
knitr::include_graphics("diagrams/join-one-to-many.png")
```
@ -388,7 +460,18 @@ TODO: update for new warnings
This is usually an error because in neither data frame do the keys uniquely identify an observation.
When you join duplicated keys, you get all possible combinations, the Cartesian product:
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram describing a left join where both data frames (x and y) have duplicate keys. Data frame x is on the left, has 4 rows and 2 columns (key, val_x), and has the keys 1, 2, 2, and 3. Data frame y is on the right, has 4 rows and 2 columns (key, val_y), and has the keys 1, 2, 2, and 3 as well. Left joining these two data frames yields a data frame with 6 rows (keys 1, 2, 2, 2, 2, and 3) and 3 columns (key, val_x, val_y). All values from both datasets are included."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Diagram describing a left join where both data frames (x and y) have
#| duplicate keys. Data frame x is on the left, has 4 rows and 2 columns
#| (key, val_x), and has the keys 1, 2, 2, and 3. Data frame y is on the
#| right, has 4 rows and 2 columns (key, val_y), and has the keys 1, 2, 2,
#| and 3 as well. Left joining these two data frames yields a data frame
#| with 6 rows (keys 1, 2, 2, 2, 2, and 3) and 3 columns (key, val_x,
#| val_y). All values from both datasets are included.
knitr::include_graphics("diagrams/join-many-to-many.png")
```
@ -410,7 +493,7 @@ TODO: update for new warnings
left_join(x, y, by = "key")
```
### Defining the key columns {#join-by}
### Defining the key columns {#sec-join-by}
So far, the pairs of data frames have always been joined by a single variable, and that variable has the same name in both data frames.
That constraint was encoded by `by = "key"`.
@ -455,7 +538,9 @@ You can use other values for `by` to connect the data frames in other ways:
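
For example (a sketch using nycflights13; the full list of variants is truncated in this diff):

```{r}
#| eval: false
flights |> left_join(weather)                          # by all common columns
flights |> left_join(planes, by = "tailnum")           # by one named column
flights |> left_join(airports, by = c("dest" = "faa")) # keys with different names
```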
1. Compute the average delay by destination, then join on the `airports` data frame so you can show the spatial distribution of delays.
Here's an easy way to draw a map of the United States:
```{r, eval = FALSE}
```{r}
#| eval: false
airports |>
semi_join(flights, c("faa" = "dest")) |>
ggplot(aes(lon, lat)) +
@ -477,7 +562,10 @@ You can use other values for `by` to connect the data frames in other ways:
5. What happened on June 13 2013?
Display the spatial pattern of delays, and then use Google to cross-reference with the weather.
```{r, eval = FALSE, include = FALSE}
```{r}
#| eval: false
#| include: false
worst <- filter(flights, !is.na(dep_time), month == 6, day == 13)
worst |>
group_by(dest) |>
@ -498,7 +586,7 @@ Rolling joins
Overlap joins
## Filtering joins {#filtering-joins}
## Filtering joins {#sec-filtering-joins}
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables.
There are two types:
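
In code, the two types look something like this (a sketch; the list itself is truncated in this diff):

```{r}
#| eval: false
top_dest <- flights |>
  count(dest, sort = TRUE) |>
  head(10)

flights |> semi_join(top_dest)               # keep flights to a top destination
flights |> anti_join(planes, by = "tailnum") # keep flights with no matching plane
```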
@ -537,21 +625,49 @@ flights |>
Graphically, a semi-join looks like this:
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram of a semi join. Data frame x is on the left and has two columns (key and val_x) with keys 1, 2, and 3. Diagram y is on the right and also has two columns (key and val_y) with keys 1, 2, and 4. Semi joining these two results in a data frame with two rows and two columns (key and val_x), with keys 1 and 2 (the only keys that match between the two data frames)."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Diagram of a semi join. Data frame x is on the left and has two columns
#| (key and val_x) with keys 1, 2, and 3. Diagram y is on the right and also
#| has two columns (key and val_y) with keys 1, 2, and 4. Semi joining these
#| two results in a data frame with two rows and two columns (key and val_x),
#| with keys 1 and 2 (the only keys that match between the two data frames).
knitr::include_graphics("diagrams/join-semi.png")
```
Only the existence of a match is important; it doesn't matter which observation is matched.
This means that filtering joins never duplicate rows like mutating joins do:
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram of a semi join with data frames with duplicated keys. Data frame x is on the left and has two columns (key and val_x) with keys 1, 2, 2, and 3. Diagram y is on the right and also has two columns (key and val_y) with keys 1, 2, 2, and 3 as well. Semi joining these two results in a data frame with four rows and two columns (key and val_x), with keys 1, 2, 2, and 3 (the matching keys, each appearing as many times as they do in x)."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Diagram of a semi join with data frames with duplicated keys. Data frame
#| x is on the left and has two columns (key and val_x) with keys 1, 2, 2,
#| and 3. Diagram y is on the right and also has two columns (key and val_y)
#| with keys 1, 2, 2, and 3 as well. Semi joining these two results in a data
#| frame with four rows and two columns (key and val_x), with keys 1, 2, 2,
#| and 3 (the matching keys, each appearing as many times as they do in x).
knitr::include_graphics("diagrams/join-semi-many.png")
```
The inverse of a semi-join is an anti-join.
An anti-join keeps the rows that *don't* have a match:
```{r, echo = FALSE, out.width = NULL, fig.alt = "Diagram of an anti join. Data frame x is on the left and has two columns (key and val_x) with keys 1, 2, and 3. Diagram y is on the right and also has two columns (key and val_y) with keys 1, 2, and 4. Anti joining these two results in a data frame with one row and two columns (key and val_x), with keys 3 only (the only key in x that is not in y)."}
```{r}
#| echo: false
#| out-width: null
#| fig-alt: >
#| Diagram of an anti join. Data frame x is on the left and has two columns
#| (key and val_x) with keys 1, 2, and 3. Diagram y is on the right and also
#| has two columns (key and val_y) with keys 1, 2, and 4. Anti joining these
#| two results in a data frame with one row and two columns (key and val_x),
#| with keys 3 only (the only key in x that is not in y).
knitr::include_graphics("diagrams/join-anti.png")
```
@ -612,7 +728,7 @@ Your own data is unlikely to be so nice, so there are a few things that you shou
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
If you have an inner join with duplicate keys in both data frames, you might get unlucky as the number of dropped rows might exactly equal the number of duplicated rows!
## Set operations {#set-operations}
## Set operations {#sec-set-operations}
The final type of two-table verb is the set operation.
Generally, I use these least frequently, but they're occasionally useful when you want to break a single complex filter into simpler pieces.


@ -1,4 +1,10 @@
# R Markdown formats
# R Markdown formats {#sec-rmarkdown-formats}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
@ -24,7 +30,10 @@ There are two ways to set the output of a document:
RStudio's knit button renders a file to the first format listed in its `output` field.
You can render to additional formats by clicking the dropdown menu beside the knit button.
```{r, echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| out-width: null
knitr::include_graphics("screenshots/rmarkdown-knit.png")
```
@ -79,7 +88,9 @@ There are a number of basic variations on that theme, generating different types
Remember, when generating a document to share with decision makers, you can turn off the default display of code by setting global options in the setup chunk:
```{r, eval = FALSE}
```{r}
#| eval: false
knitr::opts_chunk$set(echo = FALSE)
```
@ -158,7 +169,10 @@ Flexdashboard makes it particularly easy to create dashboards using R Markdown a
For example, you can produce this dashboard:
```{r, echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("screenshots/rmarkdown-flexdashboard.png")
```
@ -222,14 +236,19 @@ runtime: shiny
Then you can use the "input" functions to add interactive components to the document:
```{r, eval = FALSE}
```{r}
#| eval: false
library(shiny)
textInput("name", "What is your name?")
numericInput("age", "How old are you?", NA, min = 0, max = 150)
```
```{r, echo = FALSE, out.width = NULL}
```{r}
#| echo: false
#| out-width: null
knitr::include_graphics("screenshots/rmarkdown-shiny.png")
```


@ -1,4 +1,10 @@
# R Markdown workflow
# R Markdown workflow {#sec-rmarkdown-workflow}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the *console*, then capture what works in the *script editor*.
R Markdown brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture.
@ -29,7 +35,7 @@ I've drawn on my own experiences and Colin Purrington's advice on lab notebooks
- Use the YAML header date field to record the date you started working on the notebook:
``` {.yaml}
``` yaml
date: 2016-08-23
```


@ -1,4 +1,10 @@
# R Markdown
# R Markdown {#sec-rmarkdown}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
@ -27,7 +33,10 @@ Both cheatsheets are also available at <https://rstudio.com/resources/cheatsheet
You need the **rmarkdown** package, but you don't need to explicitly install it or load it, as RStudio automatically does both when needed.
```{r setup, include = FALSE}
```{r}
#| label: setup
#| message: false
chunk <- "```"
inline <- function(x = "") paste0("`` `r ", x, "` ``")
library(tidyverse)
@ -51,7 +60,10 @@ When you open an `.Rmd`, you get a notebook interface where code and output are
You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter.
RStudio executes the code and displays the results inline with the code:
```{r, echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("rmarkdown/diamond-sizes-notebook.png")
```
@ -59,15 +71,21 @@ To produce a complete report containing all text, code, and results, click "Knit
You can also do this programmatically with `rmarkdown::render("1-example.Rmd")`.
This will display the report in the viewer pane, and create a self-contained HTML file that you can share with others.
```{r, echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("rmarkdown/diamond-sizes-report.png")
```
When you **knit** the document, R Markdown sends the .Rmd file to **knitr**, <http://yihui.name/knitr/>, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output.
The markdown file generated by knitr is then processed by **pandoc**, <http://pandoc.org/>, which is responsible for creating the finished file.
The advantage of this two step workflow is that you can create a very wide range of output formats, as you'll learn about in [R Markdown formats].
The advantage of this two step workflow is that you can create a very wide range of output formats, as you'll learn about in \[R Markdown formats\].
```{r, echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("images/RMarkdownFlow.png")
```
@ -107,7 +125,10 @@ Markdown is designed to be easy to read and easy to write.
It is also very easy to learn.
The guide below shows how to use Pandoc's Markdown, a slightly extended version of Markdown that R Markdown understands.
```{r, echo = FALSE, comment = ""}
```{r}
#| echo: false
#| comment: ""
cat(readr::read_file("rmarkdown/markdown.Rmd"))
```
@ -160,12 +181,15 @@ This has three advantages:
1. You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:
```{r, echo = FALSE, out.width = "30%"}
```{r}
#| echo: false
#| out-width: "30%"
knitr::include_graphics("screenshots/rmarkdown-chunk-nav.png")
```
2. Graphics produced by the chunks will have useful names that make them easier to use elsewhere.
More on that in [other important options].
More on that in \[other important options\].
3. You can set up networks of cached chunks to avoid re-performing expensive computations on every run.
More on that below.
@ -222,21 +246,21 @@ mtcars[1:5, ]
```
If you prefer that data be displayed with additional formatting you can use the `knitr::kable` function.
The code below generates Table \@ref(tab:kable).
The code below generates @tbl-kable.
```{r kable}
knitr::kable(
mtcars[1:5, ],
caption = "A knitr kable."
)
```{r}
#| label: tbl-kable
#| tbl-cap: A knitr kable.
knitr::kable(mtcars[1:5, ])
```
Read the documentation for `?knitr::kable` to see the other ways in which you can customise the table.
For even deeper customisation, consider the **xtable**, **stargazer**, **pander**, **tables**, and **ascii** packages.
Read the documentation for `?knitr::kable` to see the other ways in which you can customize the table.
For even deeper customization, consider the **xtable**, **stargazer**, **pander**, **tables**, and **ascii** packages.
Each provides a set of tools for returning formatted tables from R code.
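For instance, a couple of `kable()`'s options in action (the values here are purely illustrative):

``` r
knitr::kable(
  mtcars[1:5, 1:4],
  digits = 1,      # round numeric columns to one decimal place
  align = "lccc",  # left-align the first column, center the rest
  caption = "A customized kable."
)
```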
There is also a rich set of options for controlling how figures are embedded.
You'll learn about these in [saving your plots].
You'll learn about these in \[saving your plots\].
### Caching
@ -293,7 +317,9 @@ As you work more with knitr, you will discover that some of the default chunk op
You can do this by calling `knitr::opts_chunk$set()` in a code chunk.
For example, when writing books and tutorials I set:
```{r, eval = FALSE}
```{r}
#| eval: false
knitr::opts_chunk$set(
comment = "#>",
collapse = TRUE
@ -384,7 +410,11 @@ To declare one or more parameters, use the `params` field.
This example uses a `my_class` parameter to determine which class of cars to display:
```{r, echo = FALSE, out.width = "100%", comment = ""}
```{r}
#| echo: false
#| out-width: "100%"
#| comment: ""
cat(readr::read_file("rmarkdown/fuel-economy.Rmd"))
```
@ -394,7 +424,7 @@ You can write atomic vectors directly into the YAML header.
You can also run arbitrary R expressions by prefacing the parameter value with `!r`.
This is a good way to specify date/time parameters.
``` {.yaml}
``` yaml
params:
start: !r lubridate::ymd("2015-01-01")
snapshot: !r lubridate::ymd_hms("2015-01-01 12:30:00")
@ -425,7 +455,9 @@ reports
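For context, the `reports` data frame used below might be built along these lines — a sketch only, with the columns inferred from the `select()` call and the file names purely illustrative:

``` r
library(tidyverse)

reports <- tibble(
  class = unique(mpg$class),
  filename = str_c("fuel-economy-", class, ".html"),
  params = map(class, ~ list(my_class = .))
)
```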
Then we match the column names to the argument names of `render()`, and use purrr's **parallel** walk to call `render()` once for each row:
```{r, eval = FALSE}
```{r}
#| eval: false
reports |>
select(output_file = filename, params) |>
purrr::pwalk(rmarkdown::render, input = "fuel-economy.Rmd")
@ -437,7 +469,7 @@ Pandoc can automatically generate citations and a bibliography in a number of st
To use this feature, specify a bibliography file using the `bibliography` field in your file's header.
The field should contain a path from the directory that contains your .Rmd file to the bibliography file:
``` {.yaml}
``` yaml
bibliography: rmarkdown.bib
```
@ -447,7 +479,7 @@ To create a citation within your .Rmd file, use a key composed of '\@' + the cit
Then place the citation in square brackets.
Here are some examples:
``` {.markdown}
``` markdown
Separate multiple citations with a `;`: Blah blah [@smith04; @doe99].
You can add arbitrary comments inside the square brackets:
@ -466,7 +498,7 @@ As a result it is common practice to end your file with a section header for the
You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the `csl` field:
``` {.yaml}
``` yaml
bibliography: rmarkdown.bib
csl: apa.csl
```

View File

@ -1,6 +1,9 @@
# Strings
# Strings {#sec-strings}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("restructuring")
```
@ -15,15 +18,17 @@ Next, we'll discuss the basics of regular expressions, a powerful tool for descr
The chapter finishes up with functions that work with individual letters, a brief discussion of where your expectations from English might steer you wrong when working with other languages, and a few useful non-stringr functions.
This chapter is paired with two other chapters.
Regular expression are a big topic, so we'll come back to them again in Chapter \@ref(regular-expressions).
We'll also come back to strings again in Chapter \@ref(programming-with-strings) where we'll look at them from a programming perspective rather than a data analysis perspective.
Regular expressions are a big topic, so we'll come back to them again in [Chapter -@sec-regular-expressions]. We'll also come back to strings again in [Chapter -@sec-programming-with-strings] where we'll look at them from a programming perspective rather than a data analysis perspective.
### Prerequisites
In this chapter, we'll use functions from the stringr package which is part of the core tidyverse.
We'll also use the babynames data since it provides some fun strings to manipulate.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(babynames)
```
@ -32,7 +37,9 @@ Similar functionality is available in base R (through functions like `grepl()`,
You can easily tell when you're using a stringr function because all stringr functions start with `str_`.
This is particularly useful if you use RStudio, because typing `str_` will trigger autocomplete, allowing you to jog your memory of which functions are available.
```{r, echo = FALSE}
```{r}
#| echo: false
knitr::include_graphics("screenshots/stringr-autocomplete.png")
```
@ -82,7 +89,7 @@ x
str_view(x)
```
### Raw strings
### Raw strings {#sec-raw-strings}
Creating a string with multiple quotes or backslashes gets confusing quickly.
To illustrate the problem, let's create a string that contains the contents of the chunk where I define the `double_quote` and `single_quote` variables:
@ -119,7 +126,7 @@ str_view(x)
Note that `str_view()` shows special whitespace characters (i.e. everything except spaces and newlines) with a blue background to make them easier to spot.
### Vectors {#string-vector}
### Vectors {#sec-string-vector}
You can combine multiple strings into a character vector by using `c()`:
@ -129,7 +136,7 @@ x
```
Technically, a string is a length-1 character vector, but this doesn't have much bearing on your data analysis life.
We'll come back to this idea is more detail when we think about vectors as a programming tool in Chapter \@ref(vectors).
We'll come back to this idea in more detail when we think about vectors as a programming tool in [Chapter -@sec-vectors].
### Exercises
@ -230,7 +237,9 @@ df |>
1. Compare and contrast the results of `paste0()` with `str_c()` for the following inputs:
```{r, eval = FALSE}
```{r}
#| eval: false
str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])
```
@ -278,7 +287,12 @@ We can also use `str_detect()` with `summarize()` by remembering that when you u
That means `sum(str_detect(x, pattern))` will tell you the number of observations that match, while `mean(str_detect(x, pattern))` tells you the proportion of observations that match.
For example, the following snippet computes and visualizes the proportion of baby names that contain "x", broken down by year:
```{r, fig.alt = "A timeseries showing the proportion of baby names that contain the letter x. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019."}
```{r}
#| fig-alt: >
#| A timeseries showing the proportion of baby names that contain the letter x.
#| The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in
#| 1980, then increases rapidly to 16 per 1000 in 2019.
babynames |>
group_by(year) |>
summarise(prop_x = mean(str_detect(name, "x"))) |>
@ -302,7 +316,7 @@ For example, `.`
will match any character[^strings-8], so `"a."` will match any string that contains an "a" followed by another character:
[^strings-7]: You'll learn how to escape this special behaviour in Section \@ref(regexp-escaping)
[^strings-7]: You'll learn how to escape this special behaviour in @sec-regexp-escaping
[^strings-8]: Well, any character apart from `\n`.
@ -317,7 +331,7 @@ This shows which characters are matched by surrounding it with `<>` and coloring
str_view_all(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
```
Regular expressions are a powerful and flexible language which we'll come back to in Chapter \@ref(regular-expressions).
Regular expressions are a powerful and flexible language which we'll come back to in [Chapter -@sec-regular-expressions].
Here I'll introduce only the most important components: quantifiers and character classes.
**Quantifiers** control how many times a pattern can match: `?` makes a pattern optional (i.e. it matches 0 or 1 times), `+` lets a pattern repeat (i.e. it matches at least once), and `*` lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
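A quick sketch of the three quantifiers side by side, using `str_view_all()` as above:

``` r
x <- c("a", "ab", "abb")
str_view_all(x, "ab?")  # an "a", optionally followed by one "b"
str_view_all(x, "ab+")  # an "a" followed by at least one "b"
str_view_all(x, "ab*")  # an "a" followed by any number of "b"s
```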
@ -391,7 +405,7 @@ There are three ways we could fix this:
- Add the upper case vowels to the character class: `str_count(name, "[aeiouAEIOU]")`.
- Tell the regular expression to ignore case: `str_count(regex(name, ignore.case = TRUE), "[aeiou]")`. We'll talk about this next.
- Use `str_lower()` to convert the names to lower case: `str_count(to_lower(name), "[aeiou]")`. We'll come back to this function in Section \@ref(other-languages).
- Use `str_to_lower()` to convert the names to lower case: `str_count(str_to_lower(name), "[aeiou]")`. We'll come back to this function in @sec-other-languages.
This is pretty typical when working with strings --- there are often multiple ways to reach your goal, either making your pattern more complicated or by doing some preprocessing on your string.
If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.
@ -427,7 +441,7 @@ str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
Alternatively, you can provide a replacement function: it's called with a vector of matches, and should return what to replace them with.
We'll come back to this powerful tool in Chapter \@ref(programming-with-strings).
We'll come back to this powerful tool in [Chapter -@sec-programming-with-strings].
```{r}
x <- c("1 house", "1 person has 2 cars", "3 people")
@ -478,7 +492,7 @@ In this section you'll learn how to use various functions tidyr to extract them.
Waiting on: <https://github.com/tidyverse/tidyups/pull/15>
## Locale dependent operations {#other-languages}
## Locale dependent operations {#sec-other-languages}
So far all of our examples have been using English.
The details of the many ways other languages are different to English are too diverse to detail here, but I wanted to give a quick outline of the functions whose behavior differs based on your **locale**, the set of settings that vary from country to country.
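For example, capitalization rules depend on the language: Turkish distinguishes dotted and dotless i, so upper-casing "i" gives different results in different locales. A small sketch:

``` r
str_to_upper("i")                 # "I" under the default (English) rules
str_to_upper("i", locale = "tr")  # "İ" under Turkish rules
```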

View File

@ -1,6 +1,9 @@
# Tibbles
# Tibbles {#sec-tibbles}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
@ -19,7 +22,10 @@ If this chapter leaves you wanting to learn more about tibbles, you might enjoy
In this chapter we'll explore the **tibble** package, part of the core tidyverse.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -47,7 +53,9 @@ tibble(
Every column in a data frame or tibble must be the same length, so you'll get an error if the lengths are different:
```{r, error = TRUE}
```{r}
#| error: true
tibble(
x = c(1, 5),
y = c("a", "b", "c")
@ -155,7 +163,9 @@ You can see a complete list of options by looking at the package help with `pack
A final option is to use RStudio's built-in data viewer to get a scrollable view of the complete dataset.
This is also often useful at the end of a long chain of manipulations.
```{r, eval = FALSE}
```{r}
#| eval: false
flights |> View()
```
@ -175,7 +185,7 @@ tb |> pull(x1) # by name
tb |> pull(1) # by position
```
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in Chapter \@ref(vectors).
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in [Chapter -@sec-vectors].
```{r}
tb |> pull(x1, name = id)
@ -223,9 +233,11 @@ If you hit one of those functions, just use `as.data.frame()` to turn your tibbl
2. Compare and contrast the following operations on a `data.frame` and equivalent tibble.
What is different?
Why might the default `data.frame` behaviours cause you frustration?
Why might the default `data.frame` behaviors cause you frustration?
```{r}
#| eval: false
```{r, eval = FALSE}
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]

View File

@ -1,21 +0,0 @@
# (PART) Tidy {.unnumbered}
# Introduction {#tidy-intro .unnumbered}
In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualisation and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-wrangle.png")
```
<!--# TO DO: Redo the diagram without highlighting import. -->
This part of the book proceeds as follows:
- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists.
- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
<!--# TO DO: Revisit bullet points about new chapters. -->

28
tidy.qmd Normal file
View File

@ -0,0 +1,28 @@
# Tidy {#sec-tidy-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualization and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("diagrams/data-science-wrangle.png")
```
<!--# TO DO: Redo the diagram without highlighting import. -->
This part of the book proceeds as follows:
- [Chapter -@sec-list-columns] will give you tools for working with list columns --- data stored in columns of a tibble as lists.
- In [Chapter -@sec-rectangle-data], you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
<!--# TO DO: Revisit bullet points about new chapters. -->

View File

@ -1,6 +1,10 @@
# (PART) Transform {.unnumbered}
# Transform {#sec-transform-intro .unnumbered}
# Introduction {#data-types-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
In this part of the book, you'll learn about the various types of data that the columns of a data frame can contain and how to transform them.
The transformations you might want to apply to a column depend on the type of data you're working with: for example, if you have text strings you might want to extract or remove certain pieces, while if you have numerical data you might want to rescale it.
@ -11,26 +15,26 @@ Now we'll focus on new skills for specific types of data you will frequently enc
This part of the book proceeds as follows:
- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**.
- In [Chapter -@sec-tibbles], you'll learn about the variant of the data frame that we use in this book: the **tibble**.
You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets.
- [Chapter -@sec-relational-data] will give you tools for working with multiple interrelated datasets.
- Chapter \@ref(numbers) ...
- [Chapter -@sec-numbers] ...
- Chapter \@ref(logicals) ...
- [Chapter -@sec-logicals] ...
- Chapter \@ref(missing-values)...
- [Chapter -@sec-missing-values]...
- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- [Chapter -@sec-strings] will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- Chapter \@ref(regular-expressions) ...
- [Chapter -@sec-regular-expressions] ...
- Chapter \@ref(factors) will introduce factors -- how R stores categorical data.
- [Chapter -@sec-factors] will introduce factors -- how R stores categorical data.
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times.
- [Chapter -@sec-dates-and-times] will give you the key tools for working with dates and date-times.
- Chapter \@ref(column-wise) will give you tools for performing the same operation on multiple columns.
- [Chapter -@sec-column-wise] will give you tools for performing the same operation on multiple columns.
<!-- TO DO: Add chapter descriptions -->

View File

@ -1,4 +1,10 @@
# Vectors {#vectors}
# Vectors {#sec-vectors}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
## Introduction
@ -17,7 +23,10 @@ Even when complete, you'll still need to understand vectors, it'll just make it
The focus of this chapter is on base R data structures, so it isn't essential to load any packages.
We will, however, use a handful of functions from the **purrr** package to avoid some inconsistencies in base R.
```{r setup, message = FALSE}
```{r}
#| label: setup
#| message: false
library(tidyverse)
```
@ -34,9 +43,15 @@ The chief difference between atomic vectors and lists is that atomic vectors are
There's one other related object: `NULL`.
`NULL` is often used to represent the absence of a vector (as opposed to `NA` which is used to represent the absence of a value in a vector).
`NULL` typically behaves like a vector of length 0.
Figure \@ref(fig:datatypes) summarises the interrelationships.
@fig-datatypes summarises the interrelationships.
```{r}
#| label: fig-datatypes
#| echo: false
#| out-width: "50%"
#| fig-cap: >
#| The hierarchy of R's vector types.
```{r datatypes, echo = FALSE, out.width = "50%", fig.cap = "The hierarchy of R's vector types"}
knitr::include_graphics("diagrams/data-structures-overview.png")
```
@ -75,7 +90,7 @@ Raw and complex are rarely used during a data analysis, so I won't discuss them
### Logical
Logical vectors are the simplest type of atomic vector because they can take only three possible values: `FALSE`, `TRUE`, and `NA`.
Logical vectors are usually constructed with comparison operators, as described in [comparisons].
Logical vectors are usually constructed with comparison operators, as described in \[comparisons\].
You can also create them by hand with `c()`:
```{r}
@ -133,7 +148,7 @@ The distinction between integers and doubles is not usually important, but there
Character vectors are the most complex type of atomic vector, because each element of a character vector is a string, and a string can contain an arbitrary amount of data.
You've already learned a lot about working with strings in [strings].
You've already learned a lot about working with strings in \[strings\].
Here I wanted to mention one important feature of the underlying string implementation: R uses a global string pool.
This means that each unique string is only stored in memory once, and every use of the string points to that representation.
This reduces the amount of memory needed by duplicated strings.
@ -150,7 +165,7 @@ lobstr::obj_size(y)
`y` doesn't take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to that same string.
A pointer is 8 bytes, so 1000 pointers to a 152 B string is 8 \* 1000 + 152 = 8,144 B.
### Missing values {#missing-values-vectors}
### Missing values {#sec-missing-values-vectors}
Note that each type of atomic vector has its own missing value:
@ -223,7 +238,9 @@ mean(y) # what proportion are greater than 10?
You may see some code (typically older) that relies on implicit coercion in the opposite direction, from integer to logical:
```{r, eval = FALSE}
```{r}
#| eval: false
if (length(x)) {
# do something
}
@ -263,7 +280,7 @@ Instead, it's safer to use the `is_*` functions provided by purrr, which are sum
| `is_list()` | | | | | x |
| `is_vector()` | x | x | x | x | x |
### Scalars and recycling rules
### Scalars and recycling rules {#sec-scalars-and-recycling-rules}
As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors.
This is called vector **recycling**, because the shorter vector is repeated, or recycled, to the same length as the longer vector.
@ -298,7 +315,9 @@ While vector recycling can be used to create very succinct, clever code, it can
For this reason, the vectorised functions in tidyverse will throw errors when you recycle anything other than a scalar.
If you do want to recycle, you'll need to do it yourself with `rep()`:
```{r, error = TRUE}
```{r}
#| error: true
tibble(x = 1:4, y = 1:2)
tibble(x = 1:4, y = rep(1:2, 2))
@ -323,7 +342,7 @@ set_names(1:3, c("a", "b", "c"))
Named vectors are most useful for subsetting, described next.
### Subsetting {#vector-subsetting}
### Subsetting {#sec-vector-subsetting}
So far we've used `dplyr::filter()` to filter the rows in a tibble.
`filter()` only works with tibble, so we'll need a new tool for vectors: `[`.
@ -354,7 +373,9 @@ There are four types of things that you can subset a vector with:
It's an error to mix positive and negative values:
```{r, error = TRUE}
```{r}
#| error: true
x[c(1, -1)]
```
@ -422,7 +443,7 @@ The distinction between `[` and `[[` is most important for lists, as we'll see s
6. What happens when you subset with a positive integer that's bigger than the length of the vector?
What happens when you subset with a name that doesn't exist?
## Recursive vectors (lists) {#lists}
## Recursive vectors (lists) {#sec-lists}
Lists are a step up in complexity from atomic vectors, because lists can contain other lists.
This makes them suitable for representing hierarchical or tree-like structures.
@ -469,7 +490,10 @@ x3 <- list(1, list(2, list(3)))
I'll draw them as follows:
```{r, echo = FALSE, out.width = "75%"}
```{r}
#| echo: false
#| out-width: "75%"
knitr::include_graphics("diagrams/lists-structure.png")
```
@ -517,9 +541,15 @@ a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
```
The distinction between `[` and `[[` is really important for lists, because `[[` drills down into the list while `[` returns a new, smaller list.
Compare the code and output above with the visual representation in Figure \@ref(fig:lists-subsetting).
Compare the code and output above with the visual representation in @fig-lists-subsetting.
```{r}
#| label: fig-lists-subsetting
#| echo: false
#| out-width: "75%"
#| fig-cap: >
#| Subsetting a list, visually.
```{r lists-subsetting, echo = FALSE, out.width = "75%", fig.cap = "Subsetting a list, visually."}
knitr::include_graphics("diagrams/lists-subsetting.png")
```
@ -528,13 +558,19 @@ knitr::include_graphics("diagrams/lists-subsetting.png")
The difference between `[` and `[[` is very important, but it's easy to get confused.
To help you remember, let me show you an unusual pepper shaker.
```{r, echo = FALSE, out.width = "25%"}
```{r}
#| echo: false
#| out-width: "25%"
knitr::include_graphics("images/pepper.jpg")
```
If this pepper shaker is your list `x`, then `x[1]` is a pepper shaker containing a single pepper packet:
```{r, echo = FALSE, out.width = "25%"}
```{r}
#| echo: false
#| out-width: "25%"
knitr::include_graphics("images/pepper-1.jpg")
```
@ -543,13 +579,19 @@ knitr::include_graphics("images/pepper-1.jpg")
`x[[1]]` is:
```{r, echo = FALSE, out.width = "25%"}
```{r}
#| echo: false
#| out-width: "25%"
knitr::include_graphics("images/pepper-2.jpg")
```
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
```{r, echo = FALSE, out.width = "25%"}
```{r}
#| echo: false
#| out-width: "25%"
knitr::include_graphics("images/pepper-3.jpg")
```

View File

@ -1,6 +1,10 @@
# (PART) Whole game {.unnumbered}
# Whole game {#sec-whole-game-intro .unnumbered}
# Introduction {#explore-intro .unnumbered}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
```
The goal of the first part of this book is to introduce you to the data science workflow, including data **importing**, **tidying**, and **exploration**, as quickly as possible.
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
@ -8,8 +12,7 @@ The goal of data exploration is to generate many promising leads that you can la
```{r}
#| echo: false
#| out.width: "75%"
#| fig.alt: >
#| fig-alt: >
#| A diagram displaying the data science cycle: Import -> Tidy -> Explore
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
#| Communicate. Surrounding all of these is Program. Explore is highlighted.
@ -22,20 +25,20 @@ knitr::include_graphics("diagrams/data-science-explore.png")
In this part of the book, you will learn several useful tools that have an immediate payoff:
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
In Chapter \@ref(data-visualisation) you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
In [Chapter -@sec-data-visualisation] you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
- Visualisation alone is typically not enough, so in Chapter \@ref(data-transform), you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- Visualisation alone is typically not enough, so in [Chapter -@sec-data-transform], you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- In Chapter \@ref(data-tidy), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier.
- In [Chapter -@sec-data-tidy], you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.
- Before you can transform and visualize your data, you need to first get your data into R.
In Chapter \@ref(data-import) you'll learn the basics of getting plain-text, rectangular data into R.
In [Chapter -@sec-data-import] you'll learn the basics of getting plain-text, rectangular data into R.
- Finally, in Chapter \@ref(exploratory-data-analysis), you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.
- Finally, in [Chapter -@sec-exploratory-data-analysis], you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.
Modeling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet, and the details of modeling fall outside the scope of this book.
Nestled among these five chapters that teach you the tools for doing data science are four chapters that focus on your R workflow.
In Chapters \@ref(workflow-basics), \@ref(workflow-pipes), \@ref(workflow-style), and \@ref(workflow-scripts-projects), you'll learn good workflow practices for writing and organizing your R code.
In [Chapter -@sec-workflow-basics], [Chapter -@sec-workflow-pipes], [Chapter -@sec-workflow-style], and [Chapter -@sec-workflow-scripts-projects], you'll learn good workflow practices for writing and organizing your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.

View File

@ -1,4 +1,11 @@
# Workflow: basics
# Workflow: basics {#sec-workflow-basics}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
You now have some experience running R code.
We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration!
@ -80,7 +87,7 @@ Suppose you decide to change the value of `span`, and set it to 0.3.
It would be very useful to add a comment noting why you decided to make this change, both for your future self and for others reviewing your code.
In the following example, the first comment on the code is not as good as the second because it doesn't say why the span was changed.
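A sketch of what that contrast might look like (the code and values here are illustrative, not the chapter's actual example):

``` r
library(ggplot2)

# Not so good: restates what the code already says
# Set span to 0.3
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span = 0.3)

# Better: records why the value was chosen
# Reduced span to 0.3: the default span oversmoothed the dip around displ = 5
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span = 0.3)
```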
## What's in a name?
## What's in a name? {#sec-whats-in-a-name}
Object names must start with a letter, and can only contain letters, numbers, `_` and `.`.
You want your object names to be descriptive, so you'll need to adopt a convention for multiple words.
@ -95,7 +102,7 @@ some.people.use.periods
And_aFew.People_RENOUNCEconvention
```
We'll come back to names again when we talk more about code style in Chapter \@ref(workflow-style).
We'll come back to names again when we talk more about code style in [Chapter -@sec-workflow-style].
You can inspect an object by typing its name:

View File

@ -1,7 +1,10 @@
# Workflow: Getting help
# Workflow: Getting help {#sec-workflow-getting-help}
```{r, results = "asis", echo = FALSE}
status("restructuring")
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
This book is not an island; there is no single resource that will allow you to master R.

View File

@ -1,6 +1,9 @@
# Workflow: Pipes {#workflow-pipes}
# Workflow: Pipes {#sec-workflow-pipes}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("complete")
```
@ -8,14 +11,14 @@ The pipe, `|>`, is a powerful tool for clearly expressing a sequence of operatio
We briefly introduced pipes in the previous chapter, but before going too much further, we want to give a few more details and discuss `%>%`, a predecessor to `|>`.
To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in Figure \@ref(fig:pipe-options); more on `%>%` shortly.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in @fig-pipe-options; more on `%>%` shortly.
```{r}
#| label: pipe-options
#| label: fig-pipe-options
#| echo: false
#| fig.cap: >
#| fig-cap: >
#| To insert `|>`, make sure the "Use native pipe" option is checked.
#| fig.alt: >
#| fig-alt: >
#| Screenshot showing the "Use native pipe operator" option which can
#| be found on the "Editing" panel of the "Code" options.
@ -112,7 +115,7 @@ But they're still good to know about even if you've never used `%>%` because you
- The `|>` placeholder is deliberately simple and can't replicate many features of the `%>%` placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function.
For example, `df %>% split(.$var)` is equivalent to `split(df, df$var)` and `df %>% {split(.$x, .$y)}` is equivalent to `split(df$x, df$y)`.
With `%>%` you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in Chapter \@ref(vectors)), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`.
With `%>%` you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in [Chapter -@sec-vectors]), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`.
A future version of R may add similar support for `|>` and `_`.
For the special case of extracting a column out of a data frame, you can also use `dplyr::pull():`
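A sketch of that equivalence:

``` r
# With magrittr:    mtcars %>% .$cyl
# With either pipe:
mtcars |> dplyr::pull(cyl)
```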

View File

@ -1,7 +1,10 @@
# Workflow: scripts and projects {#workflow-scripts-projects}
# Workflow: scripts and projects {#sec-workflow-scripts-projects}
```{r, results = "asis", echo = FALSE}
status("restructuring")
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
So far, you have used the console to run code.
@ -12,7 +15,7 @@ Now you'll see four panes:
```{r}
#| echo: false
#| out.width: "75%"
#| out-width: "75%"
#| fig-alt: >
#| RStudio IDE with Editor, Console, and Output highlighted.
@ -104,9 +107,9 @@ The script editor will also highlight syntax errors with a red squiggly line and
```{r}
#| echo: false
#| out.width: NULL
#| out-width: NULL
#| fig-alt: >
#| Script editor with the script `x y <- 10`. A red X indicates that there is
#| Script editor with the script x y <- 10. A red X indicates that there is
#| a syntax error. The syntax error is also highlighted with a red squiggly line.
knitr::include_graphics("screenshots/rstudio-diagnostic.png")
@ -116,12 +119,11 @@ Hover over the cross to see what the problem is:
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| Script editor with the script `x y <- 10`. A red X indicates that there is
#| fig-alt: >
#| Script editor with the script x y <- 10. A red X indicates that there is
#| a syntax error. The syntax error is also highlighted with a red squiggly line.
#| Hovering over the X shows a text box with the text 'unexpected token y' and
#| unexpected token <-'.
#| Hovering over the X shows a text box with the text unexpected token y and
#| unexpected token <-.
knitr::include_graphics("screenshots/rstudio-diagnostic-tip.png")
```
@ -130,12 +132,11 @@ RStudio will also let you know about potential problems:
```{r}
#| echo: false
#| out.width: NULL
#| fig.alt: >
#| Script editor with the script `3 == NA`. A yellow exclamation park
#| fig-alt: >
#| Script editor with the script 3 == NA. A yellow exclamation mark
#| indicates that there may be a potential problem. Hovering over the
#| exclamation mark shows a text box with the text 'use is.na to check
#| whether expression evaluates to NA'.
#| exclamation mark shows a text box with the text use is.na to check
#| whether expression evaluates to NA.
knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
```
@ -164,11 +165,10 @@ To encourage this behavior, I highly recommend that you instruct RStudio not to
```{r}
#| echo: false
#| out.width: "75%"
#| fig.alt: >
#| RStudio preferences window where the option 'Restore .RData into workspace
#| at startup' is not checked. Also, the option 'Save workspace to .RData
#| on exit' is set to 'Never'.
#| fig-alt: >
#| RStudio preferences window where the option Restore .RData into workspace
#| at startup is not checked. Also, the option Save workspace to .RData
#| on exit is set to Never.
knitr::include_graphics("screenshots/rstudio-workspace.png")
```
@ -192,10 +192,9 @@ RStudio shows your current working directory at the top of the console:
```{r}
#| echo: false
#| out.width: "50%"
#| fig-alt: >
#| The Console tab shows the current working directory as
#| '~/Documents/r4ds/r4ds'.
#| ~/Documents/r4ds/r4ds.
knitr::include_graphics("screenshots/rstudio-wd.png")
```
@ -251,13 +250,13 @@ Click File \> New Project, then:
```{r}
#| echo: false
#| out.width: "50%"
#| layout-ncol: 2
#| fig-alt: >
#| There are three screenshots of the New Project menu. In the first screenshot,
#| the `Create Project` window is shown and 'New Directory' is selected.
#| In the second screenshot, the `Project Type` window is shown and
#| 'Empty Project' is selected. In the third screenshot, the 'Create New Project'
#| window is shown and the directory name is given as 'r4ds' and the project
#| the Create Project window is shown and New Directory is selected.
#| In the second screenshot, the Project Type window is shown and
#| Empty Project is selected. In the third screenshot, the Create New Project
#| window is shown and the directory name is given as r4ds and the project
#| is being created as subdirectory of the Desktop.
knitr::include_graphics("screenshots/rstudio-project-1.png")

View File

@ -1,6 +1,9 @@
# Workflow: code style {#workflow-style}
# Workflow: code style {#sec-workflow-style}
```{r, results = "asis", echo = FALSE}
```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("polishing")
```
@ -14,16 +17,16 @@ Additionally, there are some great tools to quickly restyle existing code, like
Once you've installed it with `install.packages("styler")`, an easy way to use it is via RStudio's **command palette**.
The command palette lets you use any built-in RStudio command, as well as many addins provided by packages.
Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all the shortcuts provided by styler.
Figure \@ref(fig:styler) shows the results.
@fig-styler shows the results.
```{r}
#| label: styler
#| label: fig-styler
#| echo: false
#| out.width: NULL
#| fig.cap: >
#| out-width: null
#| fig-cap: >
#| RStudio's command palette makes it easy to access every RStudio command
#| using only the keyboard.
#| fig.alt: >
#| fig-alt: >
#| A screenshot showing the command palette after typing "styler", with
#| the four styling tools provided by the package.
@ -32,6 +35,7 @@ knitr::include_graphics("screenshots/rstudio-palette.png")
```{r}
#| label: setup
#| message: false
library(tidyverse)
library(nycflights13)
@ -39,7 +43,7 @@ library(nycflights13)
## Names
We talked briefly about names in Section \@ref(whats-in-a-name).
We talked briefly about names in @sec-whats-in-a-name.
Remember that variable names (those created by `<-` and those created by `mutate()`) should use only lowercase letters, numbers, and `_`.
Use `_` to separate words within a name.
@ -102,7 +106,7 @@ flights |>
)
```
## Pipes
## Pipes {#sec-pipes}
`|>` should always have a space before it and should typically be the last thing on a line.
This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 ft view by skimming the verbs on the left-hand side.
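A small sketch of the difference, using the flights data loaded above:

``` r
# Strive for
flights |>
  filter(!is.na(arr_delay)) |>
  count(dest)

# Avoid
flights|>filter(!is.na(arr_delay))|>count(dest)
```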
@ -252,13 +256,13 @@ As your scripts get longer, use **sectioning** comments to break up your file in
# Plot data --------------------------------------
```
RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in Figure \@ref(fig:rstudio-sections).
RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in @fig-rstudio-sections.
```{r}
#| label: rstudio-sections
#| label: fig-rstudio-sections
#| echo: false
#| out.width: NULL
#| fig.cap: >
#| out-width: null
#| fig-cap: >
#| After adding sectioning comments to your script, you can
#| easily navigate to them using the code navigation tool in the
#| bottom-left of the script editor.