Data transformation (#940)

* Minor edit + link to style guide

* Fix reference

* If you don't know order of operations, not clear

* Alt text + minor edits

* Add median and fix reference

* Move mult groups up to discuss summarise msg

* Go over grouping again

* Part rename

* Chapter rename

* Clean up section labels to avoid dups

* Update comment

* Switch part order

* Move columnwise to transform
Mine Cetinkaya-Rundel 2021-03-29 14:58:27 +01:00 committed by GitHub
parent 50703875cc
commit 27db40ad65
14 changed files with 150 additions and 119 deletions

View File

@ -272,7 +272,7 @@ You'll need to figure out what caused them (e.g. a data entry error) and disclos
What happens if you leave `binwidth` unset?
What happens if you try and zoom so only half a bar shows?
## Missing values
## Missing values {#missing-values-eda}
If you've encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.

View File

@ -17,7 +17,7 @@ rmd_files: [
"EDA.Rmd",
"workflow-projects.Rmd",
"data-types.Rmd",
"transform.Rmd",
"tibble.Rmd",
"relational-data.Rmd",
"logicals-numbers.Rmd",
@ -26,11 +26,7 @@ rmd_files: [
"strings.Rmd",
"factors.Rmd",
"datetimes.Rmd",
"wrangle.Rmd",
"column-wise.Rmd",
"list-columns.Rmd",
"rectangle.Rmd",
"import.Rmd",
"import-rectangular.Rmd",
@ -39,6 +35,10 @@ rmd_files: [
"import-webscrape.Rmd",
"import-other.Rmd",
"tidy.Rmd",
"list-columns.Rmd",
"rectangle.Rmd",
"program.Rmd",
"pipes.Rmd",
"functions.Rmd",

View File

@ -1,4 +1,4 @@
# Column-wise operations
# Column-wise operations {#column-wise}
## Introduction

View File

@ -563,7 +563,7 @@ table1 %>%
baker
```
## Missing values
## Missing values {#missing-values-tidy}
Changing the representation of a dataset brings up an important subtlety of missing values.
Surprisingly, a value can be missing in one of two possible ways:

View File

@ -31,10 +31,11 @@ flights
```
You might notice that this data frame prints a little differently from other data frames you might have used in the past: it only shows the first few rows and all the columns that fit on one screen.
It also displays the number of rows (`r format(nrow(nycflights13::flights), big.mark = ",")`) and columns (`r ncol(nycflights13::flights)`).
(To see the whole dataset, you can run `View(flights)`, which will open the dataset in the RStudio viewer.)
It prints differently because it's a **tibble**.
Tibbles are data frames, but slightly tweaked to work better in the tidyverse.
For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in [wrangle](#wrangle-intro).
For now, you don't need to worry about the differences; we'll come back to tibbles in more detail in Chapter \@ref(tibbles).
You might also have noticed the row of three (or four) letter abbreviations under the column names.
These describe the type of each variable:
@ -43,7 +44,7 @@ These describe the type of each variable:
- `dbl` stands for doubles, or real numbers.
- `chr` stands for character vectors, or strings.
- `chr` stands for characters, or strings.
- `dttm` stands for date-times (a date + a time).
@ -120,8 +121,8 @@ There's another common problem you might encounter when using `==`: floating poi
These results might surprise you!
```{r}
sqrt(2) ^ 2 == 2
1 / 49 * 49 == 1
(sqrt(2) ^ 2) == 2
(1 / 49 * 49) == 1
```
Computers use finite precision arithmetic (they obviously can't store an infinite number of digits!) so remember that every number you see is an approximation.
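One way around this, for example, is dplyr's `near()`, which compares two numbers with a built-in tolerance:

```{r}
# Both comparisons succeed despite floating point error,
# because near() allows for a small tolerance
near(sqrt(2) ^ 2, 2)
near(1 / 49 * 49, 1)
```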
@ -138,7 +139,7 @@ Multiple arguments to `filter()` are combined with "and": every expression must
For other types of combinations, you'll need to use Boolean operators yourself: `&` is "and", `|` is "or", and `!` is "not".
Figure \@ref(fig:bool-ops) shows the complete set of Boolean operations.
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects."}
```{r bool-ops, echo = FALSE, fig.cap = "Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded regions show which parts each operator selects.", fig.alt = "Seven Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. y & !x is y but none of x, x & y is the intersection of x and y, x & !y is x but none of y, x is all of x and none of y, xor(x, y) is everything except the intersection of x and y, y is all of y and none of x, and x | y is everything."}
knitr::include_graphics("diagrams/transform-logical.png")
```
@ -151,7 +152,7 @@ filter(flights, month == 11 | month == 12)
The order of operations doesn't work like English.
You can't write `filter(flights, month == (11 | 12))`, which you might literally translate into "finds all flights that departed in November or December".
Instead it finds all months that equal `11 | 12`, an expression that evaluates to `TRUE`.
In a numeric context (like here), `TRUE` becomes one, so this finds all flights in January, not November or December.
In a numeric context (like here), `TRUE` becomes `1`, so this finds all flights in January, not November or December.
This is quite confusing!
A useful short-hand for this problem is `x %in% y`.
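For example, a quick sketch of the November-or-December condition rewritten with `%in%`:

```{r}
# Select every row where month is one of the values in c(11, 12)
filter(flights, month %in% c(11, 12))
```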
@ -172,15 +173,15 @@ filter(flights, arr_delay <= 120, dep_delay <= 120)
As well as `&` and `|`, R also has `&&` and `||`.
Don't use them here!
You'll learn when you should use them in [conditional execution].
You'll learn when you should use them in Section \@ref(conditional-execution) on conditional execution.
Whenever you start using complicated, multipart expressions in `filter()`, consider making them explicit variables instead.
That makes it much easier to check your work.
You'll learn how to create new variables shortly.
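As a preview (you'll meet `mutate()` below), one way to make a multipart condition explicit is to store it as its own logical column first; the column name here is invented for illustration:

```{r}
# Store the combined condition in a named column, then filter on it
flights_checked <- mutate(flights, small_delay = arr_delay <= 120 & dep_delay <= 120)
filter(flights_checked, small_delay)
```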
### Missing values
### Missing values {#missing-values-filter}
One important feature of R that can make comparison tricky are missing values, or `NA`s ("not availables").
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables").
`NA` represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown.
```{r}
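# A few illustrations of how NA propagates (values chosen purely for illustration):
NA > 5
10 == NA
NA + 10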
@ -277,17 +278,17 @@ arrange(df, desc(x))
### Exercises
1. How could you use `arrange()` to sort all missing values to the start?
(Hint: use `is.na()`).
2. Sort `flights` to find the most delayed flights.
1. Sort `flights` to find the flights with the longest departure delays.
Find the flights that left earliest.
3. Sort `flights` to find the fastest (highest speed) flights.
2. Sort `flights` to find the fastest (highest speed) flights.
4. Which flights travelled the farthest?
3. Which flights travelled the farthest?
Which travelled the shortest?
4. How could you use `arrange()` to sort all missing values to the start?
(Hint: use `!is.na()`).
## Select columns with `select()` {#select}
It's not uncommon to get datasets with hundreds or even thousands of variables.
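For example, a minimal sketch of picking a few columns of `flights` by name:

```{r}
select(flights, year, month, day)
```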
@ -326,11 +327,11 @@ Instead, use `rename()`, which is a variant of `select()` that keeps all the var
rename(flights, tail_num = tailnum)
```
Another option is to use `select()` in conjunction with the `everything()` helper.
This is useful if you have a handful of variables you'd like to move to the start of the data frame.
If you want to move certain variables to the start of the data frame but not drop the others, you can do this in two ways: using `select()` in conjunction with the `everything()` helper or using `relocate()`.
```{r}
select(flights, time_hour, air_time, everything())
relocate(flights, time_hour, air_time)
```
### Exercises
@ -343,7 +344,7 @@ select(flights, time_hour, air_time, everything())
Why might it be helpful in conjunction with this vector?
```{r}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
```
4. Does the result of running the following code surprise you?
@ -446,7 +447,7 @@ There's no way to list every possible function that you might use, but here's a
cummean(x)
```
- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
- Logical comparisons: `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier.
If you're doing a complex sequence of logical operations it's often a good idea to store the interim values in new variables so you can check that each step is working as expected.
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
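A small illustration, with an input vector invented for the example:

```{r}
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
min_rank(desc(y))
```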
@ -472,6 +473,7 @@ There's no way to list every possible function that you might use, but here's a
### Exercises
```{r, eval = FALSE, echo = FALSE}
# For data checking, not used in results shown in book
flights <- flights %>% mutate(
dep_time = hour * 60 + minute,
arr_time = (arr_time %/% 100) * 60 + (arr_time %% 100),
@ -518,11 +520,11 @@ summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
`summarise()` is not terribly useful unless we pair it with `group_by()`.
This changes the unit of analysis from the complete dataset to individual groups.
Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group".
For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:
For example, if we applied exactly the same code to a data frame grouped by month, we get the average delay per month:
```{r}
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
by_month <- group_by(flights, month)
summarise(by_month, delay = mean(dep_delay, na.rm = TRUE))
```
Together `group_by()` and `summarise()` provide one of the tools that you'll use most commonly when working with dplyr: grouped summaries.
@ -558,7 +560,7 @@ There are three steps to prepare this data:
3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about it.
This code is a little frustrating to write because we have to give each intermediate data frame a name, even though we don't care about them.
Naming things is hard, so this slows down our analysis.
There's another way to tackle the same problem with the pipe, `%>%`:
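A sketch of the piped version, mirroring the three steps above (the summary column names are chosen for illustration):

```{r}
flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
```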
@ -586,14 +588,14 @@ Working with the pipe is one of the key criteria for belonging to the tidyverse.
The only exception is ggplot2: it was written before the pipe was discovered.
Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn't quite ready for prime time yet.
### Missing values
### Missing values {#missing-values-summarise}
You may have wondered about the `na.rm` argument we used above.
What happens if we don't set it?
```{r}
flights %>%
group_by(year, month, day) %>%
group_by(month) %>%
summarise(mean = mean(dep_delay))
```
@ -603,11 +605,11 @@ Fortunately, all aggregation functions have an `na.rm` argument which removes th
```{r}
flights %>%
group_by(year, month, day) %>%
group_by(month) %>%
summarise(mean = mean(dep_delay, na.rm = TRUE))
```
In this case, where missing values represent cancelled flights, we could also tackle the problem by first removing the cancelled flights.
In this case, missing values represent cancelled flights, so we could also tackle the problem by first removing the cancelled flights.
We'll save this dataset so we can reuse it in the next few examples.
```{r}
@ -615,10 +617,63 @@ not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
group_by(month) %>%
summarise(mean = mean(dep_delay))
```
### Grouping by multiple variables
You can group a data frame by multiple variables as well.
Note that the grouping information is printed at the top of the output.
The number in the square brackets indicates how many groups are created.
```{r}
daily <- group_by(flights, year, month, day)
daily
```
When you group by multiple variables, each summary peels off one level of the grouping by default, and a message is printed that tells you how you can change this behaviour.
```{r}
summarise(daily, flights = n())
```
If you're happy with this behaviour, you can also explicitly define it, in which case the message won't be printed out.
```{r}
summarise(daily, flights = n(), .groups = "drop_last")
```
Or you can change the default behaviour by setting a different value, e.g. `"drop"` to drop all levels of grouping or `"keep"` to keep the same grouping structure as `daily`.
```{r}
# Note the difference between the grouping structures
summarise(daily, flights = n(), .groups = "drop")
summarise(daily, flights = n(), .groups = "keep")
```
The fact that each summary peels off one level of the grouping by default makes it easy to progressively roll up a dataset:
```{r}
(per_day <- summarise(daily, flights = n()))
(per_month <- summarise(per_day, flights = sum(flights)))
(per_year <- summarise(per_month, flights = sum(flights)))
```
Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median.
In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
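A tiny illustration with invented numbers and deliberately unequal group sizes:

```{r}
g1 <- c(1, 2, 3, 4, 5)
g2 <- c(100)
sum(c(sum(g1), sum(g2))) == sum(c(g1, g2))  # sums roll up
median(c(g1, g2))                           # overall median: 3.5
median(c(median(g1), median(g2)))           # median of medians: 51.5
```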
### Ungrouping
You might also want to remove grouping outside of `summarise()`.
You can do this and return to operations on ungrouped data using `ungroup()`.
```{r}
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
```
### Counts
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`), or a count of non-missing values (`sum(!is.na(x))`).
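For example, a sketch of carrying a count along with a grouped mean, using the `not_cancelled` data from above (column names chosen for illustration):

```{r}
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay),
    n = n()
  )
```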
@ -664,7 +719,7 @@ It's a bit painful that you have to switch from `%>%` to `+`, but once you get t
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
geom_point(alpha = 1/10)
```
------------------------------------------------------------------------
@ -722,8 +777,17 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
- Measures of location: we've used `mean(x)`, but `median(x)` is also useful.
The mean is the sum divided by the length; the median is a value where 50% of `x` is above it, and 50% is below it.
```{r}
not_cancelled %>%
group_by(month) %>%
summarise(
med_arr_delay = median(arr_delay),
med_dep_delay = median(dep_delay)
)
```
It's sometimes useful to combine aggregation with logical subsetting.
We haven't talked about this sort of subsetting yet, but you'll learn more about it in [subsetting].
We haven't talked about this sort of subsetting yet, but you'll learn more about it in Section \@ref(vector-subsetting).
```{r}
not_cancelled %>%
@ -802,6 +866,13 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
count(dest)
```
Just like with `group_by()`, you can also provide multiple variables to `count()`.
```{r}
not_cancelled %>%
count(carrier, dest)
```
You can optionally provide a weight variable.
For example, you could use this to "count" (sum) the total number of miles a plane flew:
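A sketch of that weighted count, summing `distance` over each plane (`tailnum`):

```{r}
not_cancelled %>%
  count(tailnum, wt = distance)
```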
@ -827,31 +898,6 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
summarise(hour_prop = mean(arr_delay > 60))
```
### Grouping by multiple variables
When you group by multiple variables, each summary peels off one level of the grouping.
That makes it easy to progressively roll up a dataset:
```{r}
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
(per_month <- summarise(per_day, flights = sum(flights)))
(per_year <- summarise(per_month, flights = sum(flights)))
```
Be careful when progressively rolling up summaries: it's OK for sums and counts, but you need to think about weighting means and variances, and it's not possible to do it exactly for rank-based statistics like the median.
In other words, the sum of groupwise sums is the overall sum, but the median of groupwise medians is not the overall median.
### Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
```{r}
daily %>%
ungroup() %>% # no longer grouped by date
summarise(flights = n()) # all flights
```
### Exercises
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.
@ -886,7 +932,7 @@ daily %>%
6. What does the `sort` argument to `count()` do?
When might you use it?
## Grouped mutates (and filters)
## Grouped mutates and filters
Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`:
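For instance, a sketch of a grouped filter that keeps only the worst arrival delays within each day:

```{r}
flights %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
```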

View File

@ -1,3 +1,3 @@
# Logicals and numbers
# Logicals and numbers {#logicals-numbers}
## Introduction

View File

@ -1,3 +1,3 @@
# Missing values
# Missing values {#missing-values}
## Introduction

View File

@ -1,4 +1,4 @@
# Rectangling data
# Data rectangling {#rectangle-data}
## Introduction

tidy.Rmd (new file)
View File

@ -0,0 +1,21 @@
# (PART) Tidy {.unnumbered}
# Introduction {#wrangle-intro}
In this part of the book, you'll learn about data tidying, the art of getting your data into R in a useful form for visualisation and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-wrangle.png")
```
<!--# TO DO: Redo the diagram without highlighting import. -->
This part of the book proceeds as follows:
- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists.
- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
<!--# TO DO: Revisit bullet points about new chapters. -->

View File

@ -1,24 +1,28 @@
# (PART) Data types {.unnumbered}
# (PART) Transform {.unnumbered}
# Introduction {#data-types-intro}
In this part of the book, you'll learn about data types, ...
In this part of the book, you'll learn about the various types of data that the columns of a data frame can contain and how to transform them.
The transformations you might want to apply to a column vary depending on the type of data you're working with: for example, if you have text strings you might want to extract or remove certain pieces, while if you have numerical data you might want to rescale it.
You've already learned a little about data wrangling in the previous part.
Now we'll focus on new skills for specific types of data you will frequently encounter in practice.
<!--# TO DO: Add a diagram? -->
This part of the book proceeds as follows:
- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**. You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
Data wrangling also encompasses data transformation, which you've already learned a little about.
Now we'll focus on new skills for specific types of data you will frequently encounter in practice:
- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets.
- Chapter \@ref(logicals-numbers) ...
<!--# TO DO: Something about logicals and numbers -->
- Chapter \@ref(vector-tools) ...
<!--# TO DO: Something about general vector tools -->
- Chapter \@ref(missing-values)...
<!--# TO DO: Something about missing values -->
- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
@ -27,3 +31,5 @@ Now we'll focus on new skills for specific types of data you will frequently enc
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times.
- Chapter \@ref(column-wise) will give you tools for performing the same operation on multiple columns.

View File

@ -1,3 +1,3 @@
# General vector tools
# Vector tools
## Introduction

View File

@ -150,7 +150,7 @@ pryr::object_size(y)
`y` doesn't take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to that same string.
A pointer is 8 bytes, so 1000 pointers to a 152 B string is 8 \* 1000 + 152 = 8.14 kB.
### Missing values
### Missing values {#missing-values-vectors}
Note that each type of atomic vector has its own missing value:

View File

@ -52,6 +52,7 @@ And_aFew.People_RENOUNCEconvention
```
We'll come back to code style later, in Chapter \@ref(functions) on functions.
If you're interested in learning more about best practices for code style, I also recommend The tidyverse style guide: [https://style.tidyverse.org](https://style.tidyverse.org/).
You can inspect an object by typing its name:
@ -105,7 +106,7 @@ function_name(arg1 = val1, arg2 = val2, ...)
Let's try using `seq()` which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio.
Type `se` and hit TAB.
A popup shows you possible completions.
Specify `seq()` by typing more (a "q") to disambiguate, or by using ↑/↓ arrows to select.
Specify `seq()` by typing more (a `q`) to disambiguate, or by using ↑/↓ arrows to select.
Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose.
If you want more help, press F1 to get all the details in the help tab in the lower right pane.
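For example, a minimal call you might end up with:

```{r}
seq(1, 10)
```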

View File

@ -1,43 +0,0 @@
# (PART) Wrangle {.unnumbered}
# Introduction {#wrangle-intro}
In this part of the book, you'll learn about data wrangling, the art of getting your data into R in a useful form for visualisation and modelling.
Data wrangling is very important: without it you can't work with your own data!
There are three main parts to data wrangling:
```{r echo = FALSE, out.width = "75%"}
knitr::include_graphics("diagrams/data-science-wrangle.png")
```
<!--# TO DO: Redo the diagram without highlighting import. -->
This part of the book proceeds as follows:
- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**.
You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
- In Chapter \@ref(tidy-data), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.
- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
- Chapter \@ref(column-wise-operations) will give you tools for performing the same operation on multiple columns.
- Chapter \@ref(row-wise-operations) will give you tools for performing operations over rows.
Data wrangling also encompasses data transformation, which you've already learned a little about.
Now we'll focus on new skills for three specific types of data you will frequently encounter in practice:
- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets.
- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists.
- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- Chapter \@ref(factors) will introduce factors --- how R stores categorical data.
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times.
<!--# TO DO: Revisit bullet points about new chapters. -->