More TR edits (#1343)
* Data tidy edits
* Style edits
* Bit more clarification based on review comments
* Import review edits
* Scripts edits
* Help edits
* Insert edits throughout
* Update data-import.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-import.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-tidy.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-tidy.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-tidy.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-transform.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-transform.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-import.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>

---------

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
This commit is contained in:
parent dfb8b5602b
commit c6edfb977e
4 EDA.qmd
@@ -23,7 +23,7 @@ EDA is not a formal process with a strict set of rules.
 More than anything, EDA is a state of mind.
 During the initial phases of EDA you should feel free to investigate every idea that occurs to you.
 Some of these ideas will pan out, and some will be dead ends.
-As your exploration continues, you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.
+As your exploration continues, you will home in on a few particularly productive insights that you'll eventually write up and communicate to others.
 
 EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data.
 Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not.

@@ -181,7 +181,7 @@ We'll get to that shortly.
 ### Unusual values
 
 Outliers are observations that are unusual; data points that don't seem to fit the pattern.
-Sometimes outliers are data entry errors; other times outliers suggest important new science.
+Sometimes outliers are data entry errors; other times outliers suggest important new discoveries.
 When you have a lot of data, outliers are sometimes difficult to see in a histogram.
 For example, take the distribution of the `y` variable from the diamonds dataset.
 The only evidence of outliers is the unusually wide limits on the x-axis.
@@ -3,6 +3,7 @@
 ```{r}
+#| results: "asis"
 #| echo: false
 
 source("_common.R")
 status("complete")
 ```

@@ -30,13 +31,15 @@ library(tidyverse)
 
 ## Reading data from a file
 
-To begin, we'll focus on the most rectangular data file type: CSV, which is short for comma-separated values.
+To begin, we'll focus on the most common rectangular data file type: CSV, which is short for comma-separated values.
 Here is what a simple CSV file looks like.
 The first row, commonly called the header row, gives the column names, and the following six rows provide the data.
 The columns are separated, aka delimited, by commas.
 
 ```{r}
 #| echo: false
 #| message: false
+#| comment: ""
 
 read_lines("data/students.csv") |> cat(sep = "\n")
 ```

@@ -54,7 +57,9 @@ read_csv("data/students.csv") |>
 ```
 
 We can read this file into R using `read_csv()`.
-The first argument is the most important: it's the path to the file.
+The first argument is the most important: the path to the file.
+You can think about the path as the address of the file.
+The following says that the file is called `students.csv` and that it's in the `data` folder.
 
 ```{r}
 #| message: true

@@ -62,6 +67,15 @@ The first argument is the most important: it's the path to the file.
 students <- read_csv("data/students.csv")
 ```
 
+The code above will work if you have the `students.csv` file in a `data` folder in your project.
+You can download the `students.csv` file from <https://pos.it/r4ds-students-csv> or you can read it directly from that URL with:
+
+```{r}
+#| eval: false
+
+students <- read_csv("https://pos.it/r4ds-students-csv")
+```
+
 When you run `read_csv()`, it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains).
 It also prints out some information about retrieving the full column specification and how to quiet this message.
 This message is an integral part of readr, and we'll return to it in @sec-col-types.
@@ -71,8 +85,13 @@ This message is an integral part of readr, and we'll return to it in @sec-col-ty
 Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
 Let's take another look at the `students` data with that in mind.
 
 ```{r}
 students
 ```
 
+In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
+This is something we can address using the `na` argument.
+By default `read_csv()` only recognizes empty strings (`""`) as `NA`s; we want it to also recognize the character string `"N/A"`:
 
 ```{r}
 #| message: false

@@ -82,8 +101,8 @@ students
 ```
 
 You might also notice that the `Student ID` and `Full Name` columns are surrounded by backticks.
-That's because they contain spaces, breaking R's usual rules for variable names.
-To refer to them, you need to use those backticks:
+That's because they contain spaces, breaking R's usual rules for variable names; they're **non-syntactic** names.
+To refer to these variables, you need to surround them with backticks, `` ` ``:
 
 ```{r}
 students |>

@@ -117,8 +136,8 @@ students |>
 Note that the values in the `meal_plan` variable have stayed the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
 You'll learn more about factors in @sec-factors.
 
-Before you analyze these data, you'll probably want to fix the `age` column.
-Currently, it's a character variable because one of the observations is typed out as `five` instead of a numeric `5`.
+Before you analyze these data, you'll probably want to fix the `age` and `id` columns.
+Currently, `age` is a character variable because one of the observations is typed out as `five` instead of a numeric `5`.
 We discuss the details of fixing this issue in @sec-import-spreadsheets.
 
 ```{r}

@@ -134,7 +153,7 @@ students
 
 ### Other arguments
 
-There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read CSV files that you've created in a string:
+There are a couple of other important arguments that we need to mention, and they'll be easier to demonstrate if we first show you a handy trick: `read_csv()` can read text strings that you've created and formatted like a CSV file:
 
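As a minimal illustration of that trick before the commit's own demo chunk (the literal string below is invented for this note, not taken from the book's files):

``` r
library(readr)

# A string with embedded newlines is treated as literal CSV data,
# not as a file path
read_csv("x,y\n1,2\n3,4")
```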
 ```{r}
 #| message: false

@@ -300,7 +319,7 @@ One of the most common causes for this is a missing value, recorded using someth
 Take this simple 1 column CSV file as an example:
 
 ```{r}
-csv <- "
+simple_csv <- "
 x
 10
 .

@@ -312,16 +331,19 @@ csv <- "
 If we read it without any additional arguments, `x` becomes a character column:
 
 ```{r}
-df <- read_csv(csv)
+df <- read_csv(simple_csv)
 ```
 
 In this very small case, you can easily see the missing value `.`.
 But what happens if you have thousands of rows with only a few missing values represented by `.`s speckled among them?
 One approach is to tell readr that `x` is a numeric column, and then see where it fails.
-You can do that with the `col_types` argument, which takes a named list:
+You can do that with the `col_types` argument, which takes a named list where the names match the column names in the CSV file:
 
 ```{r}
-df <- read_csv(csv, col_types = list(x = col_double()))
+df <- read_csv(
+  simple_csv,
+  col_types = list(x = col_double())
+)
 ```
 
 Now `read_csv()` reports that there was a problem, and tells us we can find out more with `problems()`:
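For context, here's a minimal sketch of the `problems()` workflow that sentence refers to, reusing the `simple_csv` string from the hunk above (the exact printed output isn't reproduced here):

``` r
library(readr)

simple_csv <- "
x
10
.
20
30"

df <- read_csv(simple_csv, col_types = list(x = col_double()))

# problems() returns a tibble describing each parsing failure:
# the row and column where it happened, the type readr expected,
# and the actual value it found (here, the ".")
problems(df)
```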
@@ -335,7 +357,7 @@ That suggests this dataset uses `.` for missing values.
 So then we set `na = "."`, the automatic guessing succeeds, giving us the numeric column that we want:
 
 ```{r}
-df <- read_csv(csv, na = ".")
+df <- read_csv(simple_csv, na = ".")
 ```
 
 ### Column types
 

@@ -344,19 +366,22 @@ readr provides a total of nine column types for you to use:
 
 - `col_logical()` and `col_double()` read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.
 - `col_integer()` reads integers. We distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
-- `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn't make sense to (e.g.) divide it in half.
+- `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn't make sense to (e.g.) divide it in half, for example a phone number, social security number, credit card number, etc.
- `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-dates-and-times.
 - `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in @sec-numbers.
-- `col_skip()` skips a column so it's not included in the result.
+- `col_skip()` skips a column so it's not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.
 
 It's also possible to override the default column type by switching from `list()` to `cols()`:
 
 ```{r}
-csv <- "
+another_csv <- "
 x,y,z
 1,2,3"
 
-read_csv(csv, col_types = cols(.default = col_character()))
+read_csv(
+  another_csv,
+  col_types = cols(.default = col_character())
+)
 ```
 
 Another useful helper is `cols_only()` which will read in only the columns you specify:
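The hunk cuts off before the commit's own `cols_only()` demo, so here's a minimal sketch of what such a call looks like (reusing `another_csv` from above):

``` r
library(readr)

another_csv <- "
x,y,z
1,2,3"

# Only x is read; y and z are dropped from the result
read_csv(
  another_csv,
  col_types = cols_only(x = col_character())
)
```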
@@ -380,6 +405,20 @@ sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
 read_csv(sales_files, id = "file")
 ```
 
+Once again, the code above will work if you have the CSV files in a `data` folder in your project.
+You can download these files from <https://pos.it/r4ds-01-sales>, <https://pos.it/r4ds-02-sales>, and <https://pos.it/r4ds-03-sales> or you can read them directly with:
+
+```{r}
+#| eval: false
+
+sales_files <- c(
+  "https://pos.it/r4ds-01-sales",
+  "https://pos.it/r4ds-02-sales",
+  "https://pos.it/r4ds-03-sales"
+)
+read_csv(sales_files, id = "file")
+```
+
 With the additional `id` parameter we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
 This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.

@@ -407,7 +446,7 @@ write_csv(students, "students.csv")
 ```
 
 Now let's read that csv file back in.
-Note that the type information is lost when you save to csv:
+Note that the variable type information that you just set up is lost when you save to CSV because you're starting over with reading from a plain text file again:
 
 ```{r}
 #| warning: false

@@ -422,7 +461,8 @@ This makes CSVs a little unreliable for caching interim results---you need to re
 There are two main alternatives:
 
 1.  `write_rds()` and `read_rds()` are uniform wrappers around the base functions `readRDS()` and `saveRDS()`.
-    These store data in R's custom binary format called RDS:
+    These store data in R's custom binary format called RDS.
+    This means that when you reload the object, you are loading the *exact same* R object that you stored.
 
     ```{r}
     write_rds(students, "students.rds")

@@ -496,8 +536,6 @@ tribble(
 )
 ```
 
 We'll use `tibble()` and `tribble()` later in the book to construct small examples to demonstrate how various functions work.
 
 ## Summary
 
 In this chapter, you've learned how to load CSV files with `read_csv()` and to do your own data entry with `tibble()` and `tribble()`.
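To make that closing sentence concrete, here's a small sketch of the two construction styles (the column names and values are made up for this note):

``` r
library(tibble)

# tibble() lays the data out column by column
tibble(
  x = c(1, 2, 3),
  y = c("a", "b", "c")
)

# tribble() lays the data out row by row, with a header of ~names
tribble(
  ~x, ~y,
  1, "a",
  2, "b",
  3, "c"
)
```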
113 data-tidy.qmd
@@ -3,6 +3,7 @@
 ```{r}
+#| results: "asis"
 #| echo: false
 
 source("_common.R")
 status("complete")
 ```

@@ -119,20 +120,21 @@ Here are a few small examples showing how you might work with `table1`.
 table1 |>
   mutate(rate = cases / population * 10000)
 
-# Compute cases per year
-table1 |>
-  count(year, wt = cases)
+# Compute total cases per year
+table1 |>
+  group_by(year) |>
+  summarize(total_cases = sum(cases))
 
 # Visualize changes over time
 ggplot(table1, aes(x = year, y = cases)) +
   geom_line(aes(group = country), color = "grey50") +
   geom_point(aes(color = country, shape = country)) +
-  scale_x_continuous(breaks = c(1999, 2000))
+  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
 ```
 
 ### Exercises
 
-1. Using words, describe how the variables and observations are organized in each of the sample tables.
+1. For each of the sample tables, describe what each observation and each column represents.
 
 2. Sketch out the process you'd use to calculate the `rate` for `table2` and `table3`.
    You will need to perform four operations:

@@ -177,14 +179,7 @@ The first three columns (`artist`, `track` and `date.entered`) are variables tha
 Then we have 76 columns (`wk1`-`wk76`) that describe the rank of the song in each week.
 Here, the column names are one variable (the `week`) and the cell values are another (the `rank`).
 
-To tidy this data, we'll use `pivot_longer()`.
-After the data, there are three key arguments:
-
-- `cols` specifies which columns need to be pivoted, i.e. which columns aren't variables. This argument uses the same syntax as `select()` so here we could use `!c(artist, track, date.entered)` or `starts_with("wk")`.
-- `names_to` names of the variable stored in the column names, here `"week"`.
-- `values_to` names the variable stored in the cell values, here `"rank"`.
-
-That gives the following call:
+To tidy this data, we'll use `pivot_longer()`:
 
 ```{r, R.options=list(pillar.print_min = 10)}
 billboard |>

@@ -195,10 +190,19 @@ billboard |>
 )
 ```
 
+After the data, there are three key arguments:
+
+- `cols` specifies which columns need to be pivoted, i.e. which columns aren't variables. This argument uses the same syntax as `select()` so here we could use `!c(artist, track, date.entered)` or `starts_with("wk")`.
+- `names_to` names of the variable stored in the column names, we named that variable `week`.
+- `values_to` names the variable stored in the cell values, we named that variable `rank`.
+
+Note that in the code `"week"` and `"rank"` are quoted because those are new variables we're creating; they don't yet exist in the data when we run the `pivot_longer()` call.
+
 Now let's turn our attention to the resulting, longer data frame.
 What happens if a song is in the top 100 for less than 76 weeks?
 Take 2 Pac's "Baby Don't Cry", for example.
-The above output suggests that it was only the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
-These `NA`s don't really represent unknown observations; they're forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
+The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values.
+These `NA`s don't really represent unknown observations; they were forced to exist by the structure of the dataset[^data-tidy-1], so we can ask `pivot_longer()` to get rid of them by setting `values_drop_na = TRUE`:
 
 [^data-tidy-1]: We'll come back to this idea in @sec-missing-values.

@@ -212,14 +216,16 @@ billboard |>
 )
 ```
 
 The number of rows is now much lower, indicating that the rows with `NA`s were dropped.
 
 You might also wonder what happens if a song is in the top 100 for more than 76 weeks?
 We can't tell from this data, but you might guess that additional columns `wk77`, `wk78`, ... would be added to the dataset.
 
-This data is now tidy, but we could make future computation a bit easier by converting `week` into a number using `mutate()` and `readr::parse_number()`.
+This data is now tidy, but we could make future computation a bit easier by converting values of `week` from character strings to numbers using `mutate()` and `readr::parse_number()`.
 `parse_number()` is a handy function that will extract the first number from a string, ignoring all other text.
 
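A tiny illustration of that behavior on week-style strings (the inputs are invented for this note):

``` r
library(readr)

# parse_number() keeps the first number and ignores the surrounding text
parse_number(c("wk1", "wk10", "wk76"))
#> [1]  1 10 76
```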
 ```{r}
-billboard_tidy <- billboard |>
+billboard_longer <- billboard |>
   pivot_longer(
     cols = starts_with("wk"),
     names_to = "week",

@@ -229,11 +235,12 @@ billboard_tidy <- billboard |>
   mutate(
     week = parse_number(week)
   )
-billboard_tidy
+billboard_longer
 ```
 
-Now we're in a good position to look at how song ranks vary over time by drawing a plot.
-The code is shown below and the result is @fig-billboard-ranks.
+Now that we have all the week numbers in one variable and all the rank values in another, we're in a good position to visualize how song ranks vary over time.
+The code is shown below and the result is in @fig-billboard-ranks.
 We can see that very few songs stay in the top 100 for more than 20 weeks.
 
 ```{r}
 #| label: fig-billboard-ranks

@@ -246,16 +253,17 @@ The code is shown below and the result is @fig-billboard-ranks.
 #|   surprisingly few tracks in the region when week is >20 and rank is
 #|   >50.
 
-billboard_tidy |>
+billboard_longer |>
   ggplot(aes(x = week, y = rank, group = track)) +
-  geom_line(alpha = 1/3) +
+  geom_line(alpha = 0.25) +
   scale_y_reverse()
 ```
 
 ### How does pivoting work?
 
-Now that you've seen what pivoting can do for you, it's worth taking a little time to gain some intuition about what it does to the data.
-Let's start with a very simple dataset to make it easier to see what's happening:
+Now that you've seen how we can use pivoting to reshape our data, let's take a little time to gain some intuition about what pivoting does to the data.
+Let's start with a very simple dataset to make it easier to see what's happening.
+We'll create it with `tribble()`, a handy function for creating small tibbles by hand:
 
 ```{r}
 df <- tribble(

@@ -266,8 +274,8 @@ df <- tribble(
 )
 ```
 
-Here we'll say there are three variables: `var` (already in a variable), `name` (the column names in the column names), and `value` (the cell values).
-So we can tidy it with:
+We want our new dataset to have three variables: `var` (already exists), `name` (the column names), and `value` (the cell values).
+So we can tidy `df` with:
 
 ```{r}
 df |>

@@ -278,9 +286,9 @@ df |>
 )
 ```
 
-How does this transformation take place?
-It's easier to see if we take it component by component.
-Columns that are already variables need to be repeated, once for each column in `cols`, as shown in @fig-pivot-variables.
+How does the reshaping work?
+It's easier to see if we think about it column by column.
+As shown in @fig-pivot-variables, the values in the column that was already a variable in the original dataset (`var`) need to be repeated, once for each column that is pivoted.
 
 ```{r}
 #| label: fig-pivot-variables

@@ -297,15 +305,15 @@ Columns that are already variables need to be repeated, once for each column in
 knitr::include_graphics("diagrams/tidy-data/variables.png", dpi = 270)
 ```
 
-The column names become values in a new variable, whose name is given by `names_to`, as shown in @fig-pivot-names.
+The column names become values in a new variable, whose name is defined by `names_to`, as shown in @fig-pivot-names.
 They need to be repeated once for each row in the original dataset.
 
 ```{r}
 #| label: fig-pivot-names
 #| echo: false
 #| fig-cap: >
-#|   The column names of pivoted columns become a new column. The values
-#|   need to be repeated once for each row of the original dataset.
+#|   The column names of pivoted columns become values in a new column. The
+#|   values need to be repeated once for each row of the original dataset.
 #| fig-alt: >
 #|   A diagram showing how `pivot_longer()` transforms a simple
 #|   data set, using color to highlight how column names ("col1" and

@@ -315,7 +323,7 @@ They need to be repeated once for each row in the original dataset.
 knitr::include_graphics("diagrams/tidy-data/column-names.png", dpi = 270)
 ```
 
-The cell values also become values in a new variable, with a name given by `values_to`.
+The cell values also become values in a new variable, with a name defined by `values_to`.
 They are unwound row by row.
 @fig-pivot-values illustrates the process.
 

@@ -337,22 +345,22 @@ knitr::include_graphics("diagrams/tidy-data/cell-values.png", dpi = 270)
 
 ### Many variables in column names
 
-A more challenging situation occurs when you have multiple variables crammed into the column names.
+A more challenging situation occurs when you have multiple pieces of information crammed into the column names, and you would like to store these in separate new variables.
 For example, take the `who2` dataset, the source of `table1` and friends that you saw above:
 
 ```{r}
 who2
 ```
 
-This dataset records information about tuberculosis data collected by the WHO.
+This dataset, collected by the World Health Organisation, records information about tuberculosis diagnoses.
 There are two columns that are already variables and are easy to interpret: `country` and `year`.
 They are followed by 56 columns like `sp_m_014`, `ep_m_4554`, and `rel_m_3544`.
 If you stare at these columns for long enough, you'll notice there's a pattern.
 Each column name is made up of three pieces separated by `_`.
-The first piece, `sp`/`rel`/`ep`, describes the method used for the `diagnosis`, the second piece, `m`/`f` is the `gender`, and the third piece, `014`/`1524`/`2535`/`3544`/`4554`/`65` is the `age` range.
+The first piece, `sp`/`rel`/`ep`, describes the method used for the diagnosis, the second piece, `m`/`f` is the `gender` (coded as a binary variable in this dataset), and the third piece, `014`/`1524`/`2535`/`3544`/`4554`/`65` is the `age` range (`014` represents 0-14, for example).
 
-So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell value.
-This requires two changes to our call to `pivot_longer()`: `names_to` gets a vector of column names and `names_sep` describes how to split the variable name up into pieces:
+So in this case we have six pieces of information recorded in `who2`: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values).
+To organize these six pieces of information in six separate columns, we use `pivot_longer()` with a vector of column names for `names_to` and instructions for splitting the original variable names into pieces for `names_sep` as well as a column name for `values_to`:
 
 ```{r}
 who2 |>

@@ -374,8 +382,9 @@ You can imagine this happening in two steps (first pivoting and then separating)
 #| label: fig-pivot-multiple-names
 #| echo: false
 #| fig-cap: >
-#|   Pivotting with many variables in the column names means that each
-#|   column name now fills in values in multiple output columns.
+#|   Pivoting columns with multiple pieces of information in the names
+#|   means that each column name now fills in values in multiple output
+#|   columns.
 #| fig-alt: >
 #|   A diagram that uses color to illustrate how supplying `names_sep`
 #|   and multiple `names_to` creates multiple variables in the output.

@@ -448,7 +457,7 @@ We'll start by looking at `cms_patient_experience`, a dataset from the Centers o
 cms_patient_experience
 ```
 
-An observation is an organization, but each organization is spread across six rows, with one row for each variable, or measure.
+The core unit being studied is an organization, but each organization is spread across six rows, with one row for each measurement taken in the survey organization.
 We can see the complete set of values for `measure_cd` and `measure_title` by using `distinct()`:
 
 ```{r}

@@ -457,9 +466,9 @@ cms_patient_experience |>
 ```
 
 Neither of these columns will make particularly great variable names: `measure_cd` doesn't hint at the meaning of the variable and `measure_title` is a long sentence containing spaces.
-We'll use `measure_cd` for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.
+We'll use `measure_cd` as the source for our new column names for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.
 
-`pivot_wider()` has the opposite interface to `pivot_longer()`: we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from`):
+`pivot_wider()` has the opposite interface to `pivot_longer()`: instead of choosing new column names, we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from`):
 
 ```{r}
 cms_patient_experience |>

@@ -470,8 +479,7 @@ cms_patient_experience |>
 ```
 
 The output doesn't look quite right; we still seem to have multiple rows for each organization.
-That's because, by default, `pivot_wider()` will attempt to preserve all the existing columns including `measure_title` which has six distinct observations for each organizations.
-To fix this problem we need to tell `pivot_wider()` which columns identify each row; in this case those are the variables starting with `"org"`:
+That's because we also need to tell `pivot_wider()` which column or columns have values that uniquely identify each row; in this case those are the variables starting with `"org"`:
 
 ```{r}
 cms_patient_experience |>

@@ -512,7 +520,7 @@ df |>
 The connection between the position of the row in the input and the cell in the output is weaker than in `pivot_longer()` because the rows and columns in the output are primarily determined by the values of variables, not their locations.
 
 To begin the process `pivot_wider()` needs to first figure out what will go in the rows and columns.
-Finding the column names is easy: it's just the unique values of `name`.
+Finding the new column names is easy: it's just the unique values of `name`.
 
 ```{r}
 df |>

@@ -520,7 +528,7 @@ df |>
   pull()
 ```
 
-By default, the rows in the output are formed by all the variables that aren't going into the names or values.
+By default, the rows in the output are determined by all the variables that aren't going into the new names or values.
 These are called the `id_cols`.
 Here there is only one column, but in general there can be any number.
 
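To ground that description, here's a minimal sketch with a made-up `df` (not the one from the chapter): `id` is left out of `names_from`/`values_from`, so it becomes the `id_cols` that define the output rows.

``` r
library(tidyverse)

df <- tribble(
  ~id, ~name, ~value,
  "A", "x",   1,
  "A", "y",   2,
  "B", "x",   3,
  "B", "y",   4
)

# One output row per id; the unique values of `name` become columns
df |>
  pivot_wider(names_from = name, values_from = value)
```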
@@ -580,11 +588,16 @@ It's then up to you to figure out what's gone wrong with your data and either re
 ## Summary
 
 In this chapter you learned about tidy data: data that has variables in columns and observations in rows.
-Tidy data makes working in the tidyverse easier, because it's a consistent structure understood by most functions: the main challenge is data from whatever structure you receive it in to a tidy format.
+Tidy data makes working in the tidyverse easier, because it's a consistent structure understood by most functions; the main challenge is transforming the data from whatever structure you receive it in to a tidy format.
 To that end, you learned about `pivot_longer()` and `pivot_wider()` which allow you to tidy up many untidy datasets.
-The examples we used here are just a selection of those from `vignette("pivot", package = "tidyr")`, so if you encounter a problem that this chapter doesn't help you with, that vignette is a good place to try next.
+The examples we presented here are a selection of those from `vignette("pivot", package = "tidyr")`, so if you encounter a problem that this chapter doesn't help you with, that vignette is a good place to try next.
 
-If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
+Another challenge is that, for a given dataset, it can be impossible to label the longer or the wider version as the "tidy" one.
+This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn't actually define what a variable is (and it's surprisingly hard to do so).
+It's totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.
+So if you're stuck figuring out how to do some computation, consider switching up the organisation of your data; don't be afraid to untidy, transform, and re-tidy as needed!
+
+If you enjoyed this chapter and want to learn more about the underlying theory, you can learn more about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
 
 Now that you're writing a substantial amount of R code, it's time to learn more about organizing your code into files and directories.
 In the next chapter, you'll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.
@@ -10,7 +10,7 @@ status("complete")
 
 ## Introduction
 
-Visualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need to visualize it.
+Visualization is an important tool for generating insight, but it's rare that you get the data in exactly the right form you need to make the graph you want.
 Often you'll need to create some new variables or summaries to answer your questions with your data, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with.
 You'll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the **dplyr** package and a new dataset on flights that departed New York City in 2013.

@@ -69,7 +69,7 @@ But before we discuss their individual differences, it's worth stating what they
 
 2.  The subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
 
-3.  The result is always a new data frame.
+3.  The output is always a new data frame.
 
 Because each verb does one thing well, solving complex problems will usually require combining multiple verbs, and we'll do so with the pipe, `|>`.
 We'll discuss the pipe more in @sec-the-pipe, but in brief, the pipe takes the thing on its left and passes it along to the function on its right so that `x |> f(y)` is equivalent to `f(x, y)`, and `x |> f(y) |> g(z)` is equivalent to `g(f(x, y), z)`.
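A quick sketch of that equivalence with a concrete pair of calls (any dplyr verb would do; this uses the `flights` data the chapter introduces):

``` r
library(dplyr)
library(nycflights13)

# x |> f(y) is the same as f(x, y):
flights |> filter(dest == "IAH")
filter(flights, dest == "IAH")
```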
@@ -112,7 +112,7 @@ flights |>
 ```
 
 As well as `>` (greater than), you can use `>=` (greater than or equal to), `<` (less than), `<=` (less than or equal to), `==` (equal to), and `!=` (not equal to).
-You can also use `&` (and) or `|` (or) to combine multiple conditions:
+You can also combine conditions with `&` or `,` to indicate "and" (check for both conditions) or with `|` to indicate "or" (check for either condition):
 
 ```{r}
 # Flights that departed on January 1

@@ -181,8 +181,8 @@ flights |>
   arrange(year, month, day, dep_time)
 ```
 
-You can use `desc()` to re-order the data frame based a column, in descending order.
-For example, this code shows the most delayed flights first:
+You can use `desc()` on a column inside of `arrange()` to re-order the data frame based on that column in descending (big-to-small) order.
+For example, this code orders flights from most to least delayed:
 
 ```{r}
 flights |>

@@ -197,16 +197,31 @@ Note that the number of rows has not changed -- we're only arranging the data, w
 Most of the time, however, you'll want the distinct combination of some variables, so you can also optionally supply column names:
 
 ```{r}
-# This would remove any duplicate rows if there were any
+# Remove duplicate rows, if any
 flights |>
   distinct()
 
-# This finds all unique origin and destination pairs
+# Find all unique origin and destination pairs
 flights |>
   distinct(origin, dest)
 ```
 
-Note that if you want to find the number of duplicates, or rows that weren't duplicated, you're better off swapping `distinct()` for `count()`, which will give the number of observations per unique level, and then filtering as needed.
+Alternatively, if you want to keep the other columns when filtering for unique rows, you can use the `.keep_all = TRUE` option.
+
+```{r}
+flights |>
+  distinct(origin, dest, .keep_all = TRUE)
+```
+
+It's not a coincidence that all of these distinct flights are on January 1: `distinct()` will find the first occurrence of a unique row in the dataset and discard the rest.
+
+If you want to find the number of occurrences instead, you're better off swapping `distinct()` for `count()`, and with the `sort = TRUE` argument you can arrange them in descending order of number of occurrences.
+You'll learn more about count in @sec-counts.
+
+```{r}
+flights |>
+  count(origin, dest, sort = TRUE)
+```
 
 ### Exercises
 

@@ -282,7 +297,7 @@ flights |>
 ```
 
 Alternatively, you can control which variables are kept with the `.keep` argument.
-A particularly useful argument is `"used"` which allows you to keep only the inputs and outputs from your calculations.
+A particularly useful argument is `"used"` which specifies that we only keep the columns that were involved or created in the `mutate()` step.
 For example, the following output will contain only the variables `dep_delay`, `arr_delay`, `air_time`, `gain`, `hours`, and `gain_per_hour`.
 
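The hunk cuts off just as the example chunk opens, so here's a sketch consistent with the variables named in that sentence (my reconstruction of the spirit of the example, not the commit's actual code):

``` r
library(dplyr)
library(nycflights13)

# .keep = "used" keeps only the columns referenced or created here
flights |>
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
```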
 ```{r}

@@ -335,6 +350,8 @@ In this situation, the first challenge is often just focusing on the variables y
     select(!year:day)
     ```
 
+    You can also use `-` instead of `!` (and you're likely to see that in the wild); we recommend `!` because it reads as "not", and combines well with `&` and `|`.
+
 - Select all columns that are characters:
 
     ```{r}

@@ -419,7 +436,7 @@ ggplot(flights, aes(x = air_time - airtime2)) + geom_histogram()
 
 2.  Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
 
-3.  What happens if you specify the name of a variable multiple times in a `select()` call?
+3.  What happens if you specify the name of the same variable multiple times in a `select()` call?
 
 4.  What does the `any_of()` function do?
     Why might it be helpful in conjunction with this vector?

@@ -552,9 +569,9 @@ flights |>
   group_by(month)
 ```
 
-`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that it's now "grouped by" month.
+`group_by()` doesn't change the data but, if you look closely at the output, you'll notice that the output indicates that it is "grouped by" month (`Groups: month [12]`).
 This means subsequent operations will now work "by month".
-`group_by()` doesn't do anything by itself; instead it changes the behavior of the subsequent verbs.
+`group_by()` adds this grouped feature (referred to as class) to the data frame, which changes the behavior of the subsequent verbs applied to the data.
 
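To see the "grouped feature" the new text mentions, a quick sketch of inspecting the class (the printed result is shown as a comment):

``` r
library(dplyr)
library(nycflights13)

flights |>
  group_by(month) |>
  class()
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
```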
 ### `summarize()` {#sec-summarize}
 

@@ -623,7 +640,7 @@ This is similar to computing the max delay with `summarize()`, but you get the w
 ### Grouping by multiple variables
 
 You can create groups using more than one variable.
-For example, we could make a group for each day.
+For example, we could make a group for each date.
 
 ```{r}
 daily <- flights |>

@@ -792,7 +809,7 @@ When we plot the skill of the batter (measured by the batting average, `performa
 1.  The variation in `performance` is larger among players with fewer at-bats.
     The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].
 
-2.  There's a positive correlation between skill (`perf`) and opportunities to hit the ball (`n`) because teams want to give their best batters the most opportunities to hit the ball.
+2.  There's a positive correlation between skill (`performance`) and opportunities to hit the ball (`n`) because teams want to give their best batters the most opportunities to hit the ball.
 
 [^data-transform-4]: \*cough\* the law of large numbers \*cough\*.

@@ -816,7 +833,7 @@ Note the handy pattern for combining ggplot2 and dplyr.
 You just have to remember to switch from `|>`, for dataset processing, to `+` for adding layers to your plot.
 
 This also has important implications for ranking.
-If you naively sort on `desc(performance)`, the people with the best batting averages are clearly lucky, not skilled:
+If you naively sort on `desc(performance)`, the people with the best batting averages are clearly the ones who tried to put the ball in play very few times and happened to get a hit; they're not necessarily the most skilled players:
 
 ```{r}
 batters |>
@@ -155,7 +155,7 @@ ggplot(data = penguins)
 
 Next, we need to tell `ggplot()` how the information from our data will be visually represented.
 The `mapping` argument of the `ggplot()` function defines how variables in your dataset are mapped to visual properties (**aesthetics**) of your plot.
-The `mapping` argument is always defined using the `aes()` function, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes.
+The `mapping` argument is always defined in the `aes()` function, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes.
 For now, we will only map flipper length to the `x` aesthetic and body mass to the `y` aesthetic.
 ggplot2 looks for the mapped variables in the `data` argument, in this case, `penguins`.
10 intro.qmd
@@ -14,7 +14,7 @@ After reading this book, you'll have the tools to tackle a wide variety of data
 
 Data science is a vast field, and there's no way you can master it all by reading a single book.
 This book aims to give you a solid foundation in the most important tools and enough knowledge to find the resources to learn more when necessary.
-Our model of the tools needed in a typical data science project looks something like @fig-ds-diagram.
+Our model of the steps of a typical data science project looks something like @fig-ds-diagram.
 
 ```{r}
 #| label: fig-ds-diagram

@@ -47,7 +47,7 @@ Transformation includes narrowing in on observations of interest (like all peopl
 Together, tidying and transforming are called **wrangling** because getting your data in a form that's natural to work with often feels like a fight!
 
 Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualization and modeling.
-These have complementary strengths and weaknesses, so any real analysis will iterate between them many times.
+These have complementary strengths and weaknesses, so any real data analysis will iterate between them many times.
 
 **Visualization** is a fundamentally human activity.
 A good visualization will show you things you did not expect or raise new questions about the data.

@@ -75,12 +75,12 @@ Throughout this book, we'll point you to resources where you can learn more.
 ## How this book is organized
 
 The previous description of the tools of data science is organized roughly according to the order in which you use them in an analysis (although, of course, you'll iterate through them multiple times).
-In our experience, however, learning data ingesting and tidying first is sub-optimal because 80% of the time, it's routine and boring, and the other 20% of the time, it's weird and frustrating.
+In our experience, however, learning data importing and tidying first is sub-optimal because 80% of the time, it's routine and boring, and the other 20% of the time, it's weird and frustrating.
 That's a bad place to start learning a new subject!
 Instead, we'll start with visualization and transformation of data that's already been imported and tidied.
 That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.
 
-Within each chapter, we try and adhere to a similar pattern: start with some motivating examples so you can see the bigger picture and then dive into the details.
+Within each chapter, we try and adhere to a consistent pattern: start with some motivating examples so you can see the bigger picture and then dive into the details.
 Each section of the book is paired with exercises to help you practice what you've learned.
 Although it can be tempting to skip the exercises, there's no better way to learn than practicing on real problems.

@@ -197,7 +197,7 @@ In other words, the complement to the tidyverse is not the messyverse but many o
 As you tackle more data science projects with R, you'll learn new packages and new ways of thinking about data.
 
 We'll use many packages from outside the tidyverse in this book.
-For example, we use the following four data packages to provide interesting applications:
+For example, we use the following packages that provide interesting data sets:
 
 ```{r}
 #| eval: false
@@ -10,7 +10,7 @@ status("complete")
 
 ## Introduction
 
-In the @sec-data-visualization, you learned much more than just how to make scatterplots, bar charts, and boxplots.
+In @sec-data-visualization, you learned much more than just how to make scatterplots, bar charts, and boxplots.
 You learned a foundation that you can use to make *any* type of plot with ggplot2.
 
 In this chapter, you'll expand on that foundation as you learn about the layered grammar of graphics.
@@ -3,6 +3,7 @@
 ```{r}
+#| results: "asis"
 #| echo: false
 
 source("_common.R")
 status("complete")
 ```

@@ -51,7 +52,7 @@ x <- c("$1,234", "USD 3,513", "59%")
 parse_number(x)
 ```
 
-## Counts
+## Counts {#sec-counts}
 
 It's surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with `count()`.
 This function is great for quick exploration and checks during analysis:
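The hunk stops before the commit's own demo, so here's a minimal sketch of the kind of quick check that sentence is pointing at (using the `flights` data from earlier chapters):

``` r
library(dplyr)
library(nycflights13)

# How many flights left from each New York airport?
flights |>
  count(origin)

# sort = TRUE puts the most frequent combinations first
flights |>
  count(dest, sort = TRUE)
```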
Binary file not shown.
Before: 26 KiB | After: 24 KiB
@@ -41,7 +41,7 @@ You can **c**ombine multiple elements into a vector with `c()`:
 primes <- c(2, 3, 5, 7, 11, 13)
 ```
 
-And operations applied to the vector are applied to every element of it:
+And basic arithmetic on vectors is applied to every element of the vector:
 
 ```{r}
 primes * 2

@@ -72,7 +72,7 @@ We'll sometimes include comments in examples explaining what's happening with th
 Comments can be helpful for briefly describing what the following code does.
 
 ```{r}
-# define primes
+# create vector of primes
 primes <- c(2, 3, 5, 7, 11, 13)
 
 # multiply primes by 2

@@ -88,7 +88,7 @@ If you describe every step in the comments, and then change the code, you will h
 
 Figuring out *why* something was done is much more difficult, if not impossible.
 For example, `geom_smooth()` has an argument called `span`, which controls the smoothness of the curve, with larger values yielding a smoother curve.
-Suppose you decide to change the value of `span` from its default of 0.75 to 0.3: it's easy for a future reader to understand *what* is happening, but unless you note your thinking in a comment, no one will understand *why* you changed the default.
+Suppose you decide to change the value of `span` from its default of 0.75 to 0.9: it's easy for a future reader to understand *what* is happening, but unless you note your thinking in a comment, no one will understand *why* you changed the default.
 
 For data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them.
 There's no way to re-capture this knowledge from the code itself.
@@ -59,14 +59,14 @@ y <- 1:4
 mean(y)
 ```
 
-Then call `reprex()`, where the default target venue is GitHub:
+Then call `reprex()`, where the default output is formatted for GitHub:
 
 ``` r
 reprex::reprex()
 ```
 
 A nicely rendered HTML preview will display in RStudio's Viewer (if you're in RStudio) or your default browser otherwise.
-The relevant bit of GitHub-flavored Markdown is ready to be pasted from your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):
+The reprex is automatically copied to your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):
 
 ``` r
 y <- 1:4

@@ -74,7 +74,8 @@ The relevant bit of GitHub-flavored Markdown is ready to be pasted from your cli
 #> [1] 2.5
 ```
 
-Here's what that Markdown would look like rendered in a GitHub issue:
+This text is formatted in a special way, called Markdown, which can be pasted to sites like StackOverflow or GitHub and they will automatically render it to look like code.
+Here's what that Markdown would look like rendered on GitHub:
 
 ```{r}
 #| eval: false

@@ -99,7 +100,7 @@ There are three things you need to include to make your example reproducible: re
     2. Copy the output
     3. In reprex, type `mtcars <-`, then paste.
 
-    Try and find the smallest subset of your data that still reveals the problem.
+    Try to use the smallest subset of your data that still reveals the problem.
 
 3.  Spend a little bit of time ensuring that your **code** is easy for others to read:
 

@@ -123,10 +124,6 @@ Investing a little time in learning R each day will pay off handsomely in the lo
 One way is to follow what the tidyverse team is doing on the [tidyverse blog](https://www.tidyverse.org/blog/).
 To keep up with the R community more broadly, we recommend reading [R Weekly](https://rweekly.org): it's a community effort to aggregate the most interesting news in the R community each week.
 
-If you're an active Twitter user, you might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)), Mine ([\@minebocek](https://twitter.com/minebocek)), Garrett ([\@statgarrett](https://twitter.com/statgarrett)), or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
-If you want the full fire hose of new developments, you can also read the ([`#rstats`](https://twitter.com/search?q=%23rstats)) hashtag.
-This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.
-
 ## Summary
 
 This chapter concludes the Whole Game part of the book.
@@ -16,8 +16,9 @@ That's a great place to start, but you'll find it gets cramped pretty quickly as
 To give yourself more room to work, use the script editor.
 Open it up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
 Now you'll see four panes, as in @fig-rstudio-script.
-The script editor is a great place to put code you care about.
-Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.
+The script editor is a great place to experiment with your code.
+When you want to change something, you don't have to re-type the whole thing; you can just edit the script and re-run it.
+And once you have written code that works and does what you want, you can save it as a script file to easily return to later.
 
 ```{r}
 #| label: fig-rstudio-script

@@ -37,9 +38,6 @@ The script editor is an excellent place for building complex ggplot2 plots or lo
 The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter.
 This executes the current R expression in the console.
 For example, take the code below.
-If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
-It will also move the cursor to the following statement (beginning with `not_cancelled |>`).
-That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.
 
 ```{r}
 #| eval: false
@@ -55,13 +53,17 @@ not_cancelled |>
   summarize(mean = mean(dep_delay))
 ```
 
+If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
+It will also move the cursor to the following statement (beginning with `not_cancelled |>`).
+That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.
+
 Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S.
 Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.
 
 We recommend you always start your script with the packages you need.
 That way, if you share your code with others, they can easily see which packages they need to install.
 Note, however, that you should never include `install.packages()` in a script you share.
-It's very antisocial to change settings on someone else's computer!
+It's inconsiderate to hand off a script that will install something on their computer if they're not being careful!
 
 When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts.
 Over time, sending code to the console in this way will become so natural that you won't even think about it.
@ -134,7 +136,7 @@ For example, suppose you have the following files in a project folder.
|
|||
|
||||
There are a variety of problems here: it's hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (`finalreport` vs. `FinalReport`[^workflow-scripts-1]), and some names don't describe their contents (`run-first` and `temp`).
|
||||
|
||||
[^workflow-scripts-1]: Not to mention that you're tempting fate by using "final" in the name 😆 The comic piled higher and deeper has a [fun strip on this](https://phdcomics.com/comics/archive.php?comicid=1531).
|
||||
[^workflow-scripts-1]: Not to mention that you're tempting fate by using "final" in the name 😆 The comic Piled Higher and Deeper has a [fun strip on this](https://phdcomics.com/comics/archive.php?comicid=1531).
|
||||
|
||||
Here's better way of naming and organizing the same set of files:
|
||||
|
||||
|
@ -150,6 +152,7 @@ Here's better way of naming and organizing the same set of files:
|
|||
|
||||
Numbering the key scripts make it obvious in which order to run them and a consistent naming scheme makes it easier to see what varies.
|
||||
Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and `temp` is renamed to `report-draft-notes` to better describe its contents.
|
||||
If you have a lot of files in a directory, taking organization one step further and placing different types of files (scripts, figures, etc.) in different directories is recommended.
|
||||
|
||||
## Projects
|
||||
|
||||
|
@@ -166,15 +169,15 @@ To handle these real life situations, you need to make two decisions:

### What is the source of truth?

As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis.
However, in the long run, you'll be much better off if you ensure that your R scripts are the source of truth.
As a beginner, it's okay to rely on your current Environment to contain all the objects you have created throughout your analysis.
However, to make it easier to work on larger projects or collaborate with others, your source of truth should be the R scripts.
With your R scripts (and your data files), you can recreate the environment.
With only your environment, it's much harder to recreate your R scripts: you'll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you'll have to carefully mine your R history.

To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions.
You can do this either by running `usethis::use_blank_slate()`[^workflow-scripts-2] or by mimicking the options shown in @fig-blank-slate. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time.
But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
You can do this either by running `usethis::use_blank_slate()`[^workflow-scripts-2] or by mimicking the options shown in @fig-blank-slate. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time, nor will the objects you created or the datasets you read be available to use.
But this short-term pain saves you long-term agony because it forces you to capture all important procedures in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your environment, not the calculation itself in your code.

[^workflow-scripts-2]: If you don't have usethis installed, you can install it with `install.packages("usethis")`.

@@ -185,7 +188,7 @@ There's nothing worse than discovering three months after the fact that you've o
#|   Copy these options in your RStudio options to always start your
#|   RStudio session with a clean slate.
#| fig-alt: >
#|   RStudio preferences window where the option Restore .RData into workspace
#|   RStudio Global Options window where the option Restore .RData into workspace
#|   at startup is not checked. Also, the option Save workspace to .RData
#|   on exit is set to Never.
#| out-width: ~

@@ -200,6 +203,8 @@ There is a great pair of keyboard shortcuts that will work together to make sure

We collectively use this pattern hundreds of times a week.

Alternatively, if you don't use keyboard shortcuts, you can go to Session \> Restart R and then highlight and re-run your current script.

::: callout-note
## RStudio server

@@ -219,7 +224,7 @@ RStudio shows your current working directory at the top of the console:
#| echo: false
#| fig-alt: >
#|   The Console tab shows the current working directory as
#|   ~/Documents/r4ds/r4ds.
#|   ~/Documents/r4ds.
#| out-width: ~
knitr::include_graphics("screenshots/rstudio-wd.png")
```

@@ -229,11 +234,14 @@ And you can print this out in R code by running `getwd()`:

```{r}
#| eval: false
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
#> [1] "/Users/hadley/Documents/r4ds"
```

In this R session, the current working directory (think of it as "home") is in Hadley's Documents folder, in a subfolder called r4ds.
This code will return a different result when you run it, because your computer has a different directory structure than Hadley's!

As a beginning R user, it's OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer.
But you're nine chapters into this book, and you're no longer a rank beginner.
But you're seven chapters into this book, and you're no longer a beginner.
Very soon now you should evolve to organizing your projects into directories and, when working on a project, setting R's working directory to the associated directory.

You can set the working directory from within R but **we do not recommend it**:

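The code block this colon introduces is cut off by the hunk boundary; it presumably shows a call like the following (the path is illustrative, not taken from the diff):

```{r}
#| eval: false
# Works, but hard-codes a location that exists only on your machine
setwd("/path/to/my/CoolProject")
```
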
@@ -285,11 +293,13 @@ Check that the "home" of your project is the current working directory:

```{r}
#| eval: false
getwd()
#> [1] /Users/hadley/Documents/r4ds/r4ds
#> [1] /Users/hadley/Documents/r4ds
```

Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script, which will save a PDF and CSV file into your project directory.
Then, create a new folder called "data".
You can do this by clicking on the "New Folder" button in the Files pane in RStudio.
Finally, run the complete script, which will save a PNG and CSV file into your project directory.
Don't worry about the details; you'll learn them later in the book.

```{r}
@@ -300,9 +310,9 @@ library(tidyverse)

ggplot(diamonds, aes(x = carat, y = price)) +
  geom_hex()
ggsave("diamonds.pdf")
ggsave("diamonds.png")

write_csv(diamonds, "diamonds.csv")
write_csv(diamonds, "data/diamonds.csv")
```

Quit RStudio.

@@ -311,7 +321,7 @@ Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.

In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
In your favorite OS-specific way, search your computer for `diamonds.png` and you will find the PNG (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day, you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!

@@ -320,10 +330,10 @@ If you rigorously save figures to files **with R code** and never with the mouse

Once you're inside a project, you should only ever use relative paths, not absolute paths.
What's the difference?
A relative path is **relative** to the working directory, i.e. the project's home.
When Hadley wrote `diamonds.R` above it was a shortcut for `/Users/hadley/Documents/r4ds/r4ds/diamonds.R`.
But importantly, if Mine ran this code on her computer, it would point to `/Users/Mine/Documents/r4ds/r4ds/diamonds.R`.
This is why relative paths are important: they'll work regardless of where the project ends up.
A relative path is relative to the working directory, i.e. the project's home.
When Hadley wrote `data/diamonds.csv` above it was a shortcut for `/Users/hadley/Documents/r4ds/data/diamonds.csv`.
But importantly, if Mine ran this code on her computer, it would point to `/Users/Mine/Documents/r4ds/data/diamonds.csv`.
This is why relative paths are important: they'll work regardless of where the R project folder ends up.

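A minimal sketch of the difference in practice, assuming the `data/diamonds.csv` file written earlier:

```{r}
#| eval: false
library(tidyverse)

# Relative path: resolved against the project home, so it works for
# anyone who opens the project
diamonds <- read_csv("data/diamonds.csv")

# Absolute path: only works on Hadley's machine
diamonds <- read_csv("/Users/hadley/Documents/r4ds/data/diamonds.csv")
```
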
Absolute paths point to the same place regardless of your working directory.
They look a little different depending on your operating system.

@@ -331,7 +341,7 @@ On Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.

There's another important difference between operating systems: how you separate the components of the path.
Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
Mac and Linux use slashes (e.g. `data/diamonds.csv`) and Windows uses backslashes (e.g. `data\diamonds.csv`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.

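To make the backslash point concrete, a quick sketch:

```{r}
#| eval: false
# To spell one literal backslash in an R string you must type two:
"data\\diamonds.csv"  # the path contains a single backslash
# The forward-slash form needs no escaping, and R accepts it on Windows too:
"data/diamonds.csv"
```
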
@@ -345,6 +355,10 @@ That makes life frustrating, so we recommend always using the Linux/Mac style wi

## Summary

In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories).
Much like code style, this may feel like busywork at first.
But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up-front organization can save you a bunch of time down the road.

In summary, scripts and projects give you a solid workflow that will serve you well in the future:

- Create one RStudio project for each data analysis project.

@@ -34,6 +34,8 @@ Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all
knitr::include_graphics("screenshots/rstudio-palette.png")
```

We'll use the tidyverse and nycflights13 packages for code examples in this chapter.

```{r}
#| label: setup
#| message: false
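# The hunk cuts this setup chunk short; given the sentence above, the body
# presumably continues with the two library calls (an assumption -- the
# rest of the chunk is not shown in this diff):
library(tidyverse)
library(nycflights13)
```
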
@@ -112,7 +114,7 @@ flights |>

## Pipes {#sec-pipes}

`|>` should always have a space before it and should typically be the last thing on a line.
This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 50,000 ft view by skimming the verbs on the left-hand side.
This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 10,000 ft view by skimming the verbs on the left-hand side.

```{r}
#| eval: false
@@ -149,6 +151,7 @@ flights |>
```

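The example itself is elided by the hunk boundary; it presumably contrasts something like the following sketch (the verbs used are illustrative):

```{r}
#| eval: false
# Strive for: a space before |>, pipe last on the line
flights |>
  filter(!is.na(arr_delay), !is.na(dep_delay)) |>
  group_by(dest) |>
  summarize(distance = mean(distance))

# Avoid: no space before the pipe, everything crammed on one line
flights|>filter(!is.na(arr_delay))|>group_by(dest)|>summarize(distance=mean(distance))
```
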
After the first step of the pipeline, indent each line by two spaces.
RStudio will automatically put the spaces in for you after a line break following a `|>`.
If you're putting each argument on its own line, indent by an extra two spaces.
Make sure `)` is on its own line, and un-indented to match the horizontal position of the function name.

@@ -171,6 +174,7 @@ flights|>
    n = n()
  )

# Avoid
flights|>
  group_by(tailnum) |>
  summarize(

@@ -199,7 +203,7 @@ df |>

Finally, be wary of writing very long pipes, say longer than 10-15 lines.
Try to break them up into smaller sub-tasks, giving each task an informative name.
The names will help cue the reader into what's happening and make it easier to check that intermediate results are as expected.
Whenever you can give something an informative name, you should give it an informative name.
Whenever you can give something an informative name, you should give it an informative name, for example when you fundamentally change the structure of the data, e.g., after pivoting or summarizing.
Don't expect to get it right the first time!
This means breaking up long pipelines if there are intermediate states that can get good names.

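A sketch of what this might look like in practice (the pipeline and the intermediate name are made up for illustration):

```{r}
#| eval: false
# daily_delays has a fundamentally different shape (one row per day),
# so it earns its own informative name
daily_delays <- flights |>
  filter(!is.na(dep_delay)) |>
  group_by(year, month, day) |>
  summarize(mean_delay = mean(dep_delay), .groups = "drop")

daily_delays |>
  arrange(desc(mean_delay))
```
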