Minor polishing to get back into the swing of things

2021-12-01 08:34:16 -06:00 · 2021-12-01 08:34:16 -06:00 · 821b51d536
parent e80ed2d577
commit 821b51d536
1 changed files with 26 additions and 33 deletions
--- a/data-transform.Rmd
+++ b/data-transform.Rmd
@@ -62,16 +62,6 @@ There are three other common types that aren't used here but you'll encounter la
 ### dplyr basics

 In this chapter you are going to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges.
-They are organised into four camps:
-
--   Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns, `slice()` and friends subsets rows based on their position, and `arrange()` changes the order of the rows.
-
--   Functions that operate on **columns**: `mutate()` creates new columns, `select()` columns, `rename()` changes their names, and `relocate()` which changes their positions.
-
--   Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
-
-Later, in Chapter \@ref(relational-data), you'll learn about other verbs that work with **tables**, like the join functions and the set operations.
-
 All dplyr verbs work the same way:

 1.  The first argument is a data frame.
@@ -81,11 +71,22 @@ All dplyr verbs work the same way:
 3.  The result is a new data frame.

 Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
+The verbs are organised into four groups:
+
+-   Functions that operate on **rows**: `filter()` subsets rows based on the values of the columns and `arrange()` changes the order of the rows.
+
+-   Functions that operate on **columns**: `mutate()` creates new columns, `select()` changes which columns are present, `rename()` changes their names, and `relocate()` changes their positions.
+
+-   Functions that operate on **groups**: `group_by()` divides data up into groups for analysis, and `summarise()` reduces each group to a single row.
+
+-   Functions that operate on **tables**, like the join functions and the set operations.
+    We'll come back to these in Chapter \@ref(relational-data).
+
 Let's dive in and see how these verbs work.

 ## Rows

-These functions affect the rows (the observations), leaving the columns (the variables) unchanged.
+`filter()` and `arrange()` affect the rows (the observations), leaving the columns (the variables) unchanged.
 `filter()` changes which rows are included without changing the order; `arrange()` changes the order without changing the membership.

 ### `filter()`
@@ -111,6 +112,7 @@ jan1 <- filter(flights, month == 1, day == 1)
 To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
 R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
 It also provides `%in%`: `filter(df, x %in% c(a, b, c))` will return all rows where `x` is `a`, `b`, or `c`.
+We'll come back to these operations in Chapter \@ref(logicals-numbers).

 When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
 `filter()` will let you know when this happens:
@@ -158,7 +160,7 @@ arrange(flights, desc(dep_delay))

 ## Columns

-These functions affect the columns (the variables) without changing the rows (the observations).
+`mutate()`, `select()`, `rename()`, and `relocate()` affect the columns (the variables) without changing the rows (the observations).
 `mutate()` creates new variables that are functions of the existing variables; `select()`, `rename()`, and `relocate()` change which variables are present, their names, and their positions.

 ### `mutate()`
@@ -187,8 +189,8 @@ mutate(flights,
 )
 ```

-The leading `.` is a sign that `.before` is an argument to the function, not a new variable being created.
-You can also use `.after` to add after a variable, and use a variable name instead of a position:
+The leading `.` is a sign that `.before` is an argument to the function, not the name of a new variable.
+You can also use `.after` to add after a variable, and in both `.before` and `.after` you can use the name of a variable instead of a position:

 ```{r}
 mutate(flights,
@@ -212,7 +214,7 @@ mutate(flights,
 ### `select()` {#select}

 It's not uncommon to get datasets with hundreds or even thousands of variables.
-In this case, the first challenge is often narrowing in on the variables you're actually interested in.
+In this case, the first challenge is often focussing on just the variables you're interested in.
 `select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

 `select()` is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
@@ -239,7 +241,7 @@ There are a number of helper functions you can use within `select()`:
 -   `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.

 See `?select` for more details.
-Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be use `matches()` to select variables that match a regexp.
+Once you know regular expressions (the topic of Chapter \@ref(regular-expressions)) you'll also be able to use `matches()` to select variables that match a pattern.

 You can rename variables as you `select()` them by using `=`.
 The new name appears on the left hand side of the `=`, and the old variable appears on the right hand side:
@@ -267,7 +269,7 @@ By default it moves variables to the front:
 relocate(flights, time_hour, air_time)
 ```

-But you can use the `.before` and `.after` arguments to choose where to place them:
+But like with `mutate()`, you can use the `.before` and `.after` arguments to choose where to place them:

 ```{r}
 relocate(flights, year:dep_time, .after = time_hour)
@@ -406,13 +408,13 @@ daily %>% summarise(n = n())

 If you're happy with this behaviour, you can explicitly define it in order to suppress the message:

-```{r results = FALSE}
+```{r, results = FALSE}
 daily %>% summarise(n = n(), .groups = "drop_last")
 ```

 Alternatively, you can change the default behaviour by setting a different value, e.g. `"drop"` for dropping all levels of grouping or `"keep"` for the same grouping structure as `daily`:

-```{r results = FALSE}
+```{r, results = FALSE}
 daily %>% summarise(n = n(), .groups = "drop")
 daily %>% summarise(n = n(), .groups = "keep")
 ```
@@ -433,26 +435,17 @@ daily %>%

 For the purposes of summarising, ungrouped data is treated as if all your data was in a single group, so you get one row back.

-### Selecting rows
-
-`arrange()` and `filter()` are mostly unaffected by grouping.
-But the slice functions are super useful:
-
--   `slice_head()` and `slice_tail()` select the first or last rows in each group.
-
--   `slice_max()` and `slice_min()` select the rows in each group with highest or lowest values.
-
--   `slice_sample()` random selects rows from each group.
-
-Each of these verbs takes either a `n` or `prop` argument depending on whether you want to select a fixed number of rows, or a number of rows proportional to the group size.
-
 ### Other verbs

+`group_by()` is usually paired with `summarise()`, but it's good to know how it affects other verbs:
+
 -   `select()`, `rename()`, `relocate()`: grouping has no effect

--   `filter()`, `mutate()`: computation happens per group.
+-   `mutate()`: computation happens per group.
     This doesn't affect the functions you currently know but is very useful once you learn about window functions in Section \@ref(window-functions).

+-   `arrange()` and `filter()` are mostly unaffected by grouping, unless you are doing computation (e.g. `filter(flights, dep_delay == min(dep_delay))`), in which case the `mutate()` caveat applies.
+
 ### Exercises

 1.  Which carrier has the worst delays?