parent
1d2246f4f6
commit
82d293bb50
|
@@ -18,7 +18,7 @@ rmd_files: [
|
|||
"functions.Rmd",
|
||||
"iteration.Rmd",
|
||||
"data-structures.Rmd",
|
||||
"lists.Rmd",
|
||||
"hierarchy.Rmd",
|
||||
"robust-code.Rmd",
|
||||
"science.Rmd",
|
||||
"model-vis.Rmd",
|
||||
|
|
|
@@ -1,5 +1,9 @@
|
|||
# Data structures
|
||||
|
||||
```{r, include = FALSE}
|
||||
library(purrr)
|
||||
```
|
||||
|
||||
Might be quite brief.
|
||||
|
||||
Atomic vectors and lists + data frames.
|
||||
|
@@ -15,6 +19,8 @@ Most important data types:
|
|||
|
||||
<http://adv-r.had.co.nz/OO-essentials.html>
|
||||
|
||||
## Vectors
|
||||
|
||||
Every vector has three key properties:
|
||||
|
||||
1. Type: e.g. integer, double, list. Retrieve with `typeof()`.
|
||||
|
@@ -23,9 +29,51 @@ Every vector has three key properties:
|
|||
attribute used to build more complex data structures (like factors and
|
||||
dates) up from simpler components. Get with `attributes()`.
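
As a rough sketch of those three properties in action, here's a factor taken apart with base R only. The object `f` is just an illustrative name I've picked, not something defined elsewhere in the chapter:

```{r}
# `f` is a hypothetical example object, not from the text above
f <- factor(c("a", "b", "a"))

typeof(f)      # the underlying type: a factor is built on an integer vector
length(f)      # the number of elements
attributes(f)  # the extra metadata (levels and class) that makes it a factor
```

The attributes are what turn a plain integer vector into something that prints and behaves like a factor.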
|
||||
|
||||
(Need function to show these? `vector_str()`?)
|
||||
|
||||
### Predicates
|
||||
|
||||
| | lgl | int | dbl | chr | list | null |
|
||||
|------------------|-----|-----|-----|-----|------|------|
|
||||
| `is_logical()` | x | | | | | |
|
||||
| `is_integer()` | | x | | | | |
|
||||
| `is_double()` | | | x | | | |
|
||||
| `is_numeric()` | | x | x | | | |
|
||||
| `is_character()` | | | | x | | |
|
||||
| `is_atomic()` | x | x | x | x | | |
|
||||
| `is_list()` | | | | | x | |
|
||||
| `is_vector()` | x | x | x | x | x | |
|
||||
| `is_null()` | | | | | | x |
|
||||
|
||||
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
|
||||
|
||||
```{r}
|
||||
is.atomic(NULL)
|
||||
is_atomic(NULL)
|
||||
|
||||
is.vector(factor("a"))
|
||||
is_vector(factor("a"))
|
||||
```
|
||||
|
||||
I recommend using these instead of the base functions.
|
||||
|
||||
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
|
||||
|
||||
```{r}
|
||||
y <- factor(c("a", "b", "c"))
|
||||
is_integer(y)
|
||||
is_scalar_integer(y)
|
||||
is_bare_integer(y)
|
||||
```
|
||||
|
||||
|
||||
### Exercises
|
||||
1. Carefully read the documentation of `is.vector()`. What does it actually
|
||||
test for?
|
||||
|
||||
## Atomic vectors
|
||||
|
||||
### Doubles
|
||||
### Numbers
|
||||
|
||||
```{r}
|
||||
sqrt(2) ^ 2 - 2
|
||||
|
@@ -37,7 +85,9 @@ sqrt(2) ^ 2 - 2
|
|||
mean(numeric())
|
||||
```
|
||||
|
||||
## Non-atomic vectors
|
||||
## Elemental vectors
|
||||
|
||||
All built on top of atomic vectors.
|
||||
|
||||
`class()`
|
||||
|
||||
|
@@ -49,10 +99,142 @@ mean(numeric())
|
|||
|
||||
### Date times
|
||||
|
||||
## Lists
|
||||
## Recursive vectors (lists)
|
||||
|
||||
Lists are the data structure R uses for hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
|
||||
|
||||
You create a list with `list()`:
|
||||
|
||||
```{r}
|
||||
x <- list(1, 2, 3)
|
||||
str(x)
|
||||
|
||||
x_named <- list(a = 1, b = 2, c = 3)
|
||||
str(x_named)
|
||||
```
|
||||
|
||||
Unlike atomic vectors, lists can contain a mix of objects:
|
||||
|
||||
```{r}
|
||||
y <- list("a", 1L, 1.5, TRUE)
|
||||
str(y)
|
||||
```
|
||||
|
||||
Lists can even contain other lists!
|
||||
|
||||
```{r}
|
||||
z <- list(list(1, 2), list(3, 4))
|
||||
str(z)
|
||||
```
|
||||
|
||||
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
|
||||
|
||||
### Visualising lists
|
||||
|
||||
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:
|
||||
|
||||
```{r}
|
||||
x1 <- list(c(1, 2), c(3, 4))
|
||||
x2 <- list(list(1, 2), list(3, 4))
|
||||
x3 <- list(1, list(2, list(3)))
|
||||
```
|
||||
|
||||
I draw them as follows:
|
||||
|
||||
```{r, echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/lists-structure.png")
|
||||
```
|
||||
|
||||
* Lists are rounded rectangles that contain their children.
|
||||
|
||||
* I draw each child a little darker than its parent to make it easier to see
|
||||
the hierarchy.
|
||||
|
||||
* The orientation of the children (i.e. rows or columns) isn't important,
|
||||
so I'll pick a row or column orientation to either save space or illustrate
|
||||
an important property in the example.
|
||||
|
||||
### Subsetting
|
||||
|
||||
There are three ways to subset a list, which I'll illustrate with `a`:
|
||||
|
||||
```{r}
|
||||
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
|
||||
```
|
||||
|
||||
* `[` extracts a sub-list. The result will always be a list.
|
||||
|
||||
```{r}
|
||||
str(a[1:2])
|
||||
str(a[4])
|
||||
```
|
||||
|
||||
Like subsetting vectors, you can use an integer vector to select by
|
||||
position, or a character vector to select by name.
|
||||
|
||||
* `[[` extracts a single component from a list. It removes a level of
|
||||
hierarchy from the list.
|
||||
|
||||
```{r}
|
||||
str(a[[1]])
str(a[[4]])
|
||||
```
|
||||
|
||||
* `$` is a shorthand for extracting named elements of a list. It works
|
||||
similarly to `[[` except that you don't need to use quotes.
|
||||
|
||||
```{r}
|
||||
a$a
|
||||
a[["b"]]
|
||||
```
|
||||
|
||||
Or visually:
|
||||
|
||||
```{r, echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/lists-subsetting.png")
|
||||
```
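
To make the difference concrete, here's a small sketch that rebuilds `a` so it runs on its own and checks what each operator gives back. It only uses base R:

```{r}
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# `[` always returns a list, even when you ask for a single element
is.list(a["a"])
# `[[` and `$` return the element itself (here, an integer vector)
is.list(a[["a"]])
identical(a[["a"]], a$a)

# A character vector with `[` selects several elements by name
str(a[c("a", "c")])
```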
|
||||
|
||||
### Lists of condiments
|
||||
|
||||
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper.jpg")
|
||||
```
|
||||
|
||||
If this pepper shaker is your list `x`, then `x[1]` is a pepper shaker containing a single pepper packet:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper-1.jpg")
|
||||
```
|
||||
|
||||
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
|
||||
|
||||
`x[[1]]` is:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper-2.jpg")
|
||||
```
|
||||
|
||||
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper-3.jpg")
|
||||
```
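
If you'd like to try the analogy in code, here's one way you might model the shaker. The name `shaker` and the `"pepper"` strings are made up for illustration:

```{r}
# Hypothetical model: a shaker (list) holding packets (lists) holding pepper
shaker <- list(list("pepper 1"), list("pepper 2"))

str(shaker[1])         # a shaker containing one packet
str(shaker[[1]])       # a single packet
str(shaker[[1]][[1]])  # the pepper inside the first packet
```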
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Draw the following lists as nested sets.
|
||||
|
||||
1. Generate the lists corresponding to these nested set diagrams.
|
||||
|
||||
1. What happens if you subset a data frame as if you're subsetting a list?
|
||||
What are the key differences between a list and a data frame?
|
||||
|
||||
|
||||
## Data frames
|
||||
|
||||
## Subsetting
|
||||
|
||||
Not sure where else this should be covered.
|
||||
|
||||
|
|
|
@@ -0,0 +1,124 @@
|
|||
# Handling hierarchy {#hierarchy}
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
library(purrr)
|
||||
```
|
||||
|
||||
<!--
|
||||
## Warm ups
|
||||
|
||||
* What does this for loop do?
|
||||
* How is a data frame like a list?
|
||||
* What does `mean()` mean? What does `mean` mean?
|
||||
* How do you get help about the $ function? How do you normally write
|
||||
`[[`(mtcars, 1) ?
|
||||
* Argument order
|
||||
-->
|
||||
|
||||
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
|
||||
|
||||
* You can extract deeply nested elements in a single call by supplying
|
||||
a character vector to the map functions.
|
||||
|
||||
* You can remove a level of the hierarchy with the flatten functions.
|
||||
|
||||
* You can flip levels of the hierarchy with the transpose function.
|
||||
|
||||
## Extracting deeply nested elements
|
||||
|
||||
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
|
||||
|
||||
```{r}
|
||||
# From https://api.github.com/repos/hadley/r4ds/issues
|
||||
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
|
||||
```
|
||||
|
||||
There are eight issues, and each issue is a nested list:
|
||||
|
||||
```{r}
|
||||
length(issues)
|
||||
str(issues[[1]])
|
||||
```
|
||||
|
||||
To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:
|
||||
|
||||
```{r}
|
||||
issues %>% map_int("id")
|
||||
issues %>% map_lgl("locked")
|
||||
issues %>% map_chr("state")
|
||||
```
|
||||
|
||||
You can use the same technique to extract more deeply nested structure. For example, imagine you want to extract the name and id of the user. You could do that in two steps:
|
||||
|
||||
```{r}
|
||||
users <- issues %>% map("user")
|
||||
users %>% map_chr("login")
|
||||
users %>% map_int("id")
|
||||
```
|
||||
|
||||
But by supplying a character _vector_ to `map_*`, you can do it in one:
|
||||
|
||||
```{r}
|
||||
issues %>% map_chr(c("user", "login"))
|
||||
issues %>% map_int(c("user", "id"))
|
||||
```
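
If you don't have `issues.json` to hand, you can try the same idea on a small hand-built list. This is only a sketch: `toy_issues` and its contents are invented here and just mimic the shape of the real data:

```{r}
library(purrr)

# Hypothetical stand-in for the parsed JSON: two issues, each with a nested user
toy_issues <- list(
  list(id = 1L, user = list(login = "alice", id = 10L)),
  list(id = 2L, user = list(login = "bob", id = 20L))
)

# Each string in the character vector descends one level into the list
map_chr(toy_issues, c("user", "login"))
map_int(toy_issues, c("user", "id"))
```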
|
||||
|
||||
## Removing a level of hierarchy
|
||||
|
||||
As well as indexing deeply into hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
|
||||
|
||||
```{r}
|
||||
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
|
||||
str(x)
|
||||
|
||||
y <- flatten(x)
|
||||
str(y)
|
||||
flatten_dbl(y)
|
||||
```
|
||||
|
||||
Graphically, that sequence of operations looks like:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-flatten.png")
|
||||
```
|
||||
|
||||
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
|
||||
|
||||
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will continue to work, silently returning the wrong type of output. This tends to create problems that are frustrating to debug.
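
Here's a minimal sketch of that failure mode. The lists `good` and `bad` are invented for illustration, and I use base R's `vapply()` purely as a stand-in for a type-strict function, so the example doesn't depend on any particular purrr version:

```{r}
good <- list(1, 2, 3)
bad  <- list(1, "2", 3)  # one element has accidentally become a string

typeof(unlist(good))
typeof(unlist(bad))      # no error and no warning, just a different type

# A type-strict alternative fails loudly instead of changing type
strict <- function(x) vapply(x, identity, numeric(1))
strict(good)
tryCatch(strict(bad), error = function(e) "error: not all elements are doubles")
```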
|
||||
|
||||
## Switching levels in the hierarchy
|
||||
|
||||
Other times the hierarchy feels "inside out". You can use `transpose()` to flip the first and second levels of a list:
|
||||
|
||||
```{r}
|
||||
x <- list(
|
||||
x = list(a = 1, b = 3, c = 5),
|
||||
y = list(a = 2, b = 4, c = 6)
|
||||
)
|
||||
x %>% str()
|
||||
x %>% transpose() %>% str()
|
||||
```
|
||||
|
||||
Graphically, this looks like:
|
||||
|
||||
```{r, echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/lists-transpose.png")
|
||||
```
|
||||
|
||||
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
|
||||
|
||||
It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
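
You can check both identities on a small list. This sketch redefines `x` so that it runs on its own; `tx` is just a name I've chosen for the transposed copy:

```{r}
library(purrr)

x <- list(
  x = list(a = 1, b = 3, c = 5),
  y = list(a = 2, b = 4, c = 6)
)
tx <- transpose(x)

# The indices swap places
x[["y"]][["b"]]
tx[["b"]][["y"]]

# Transposing twice gets you back where you started
all.equal(transpose(tx), x)
```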
|
||||
|
||||
Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
|
||||
|
||||
```{r}
|
||||
df <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))
|
||||
df %>% transpose() %>% str()
|
||||
```
|
||||
|
||||
## Turning lists into data frames
|
||||
|
||||
* Have a deeply nested list with missing pieces
|
||||
* Need a tidy data frame so you can visualise, transform, model etc.
|
||||
* What do you do?
|
||||
* By hand with purrr, talk about `fromJSON` and `tidyJSON`
|
|
@@ -132,6 +132,23 @@ for (i in seq_along(x)) {
|
|||
|
||||
## For loops vs functionals
|
||||
|
||||
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specifics. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc.) solves a similar problem, but purrr is more consistent and easier to learn.
|
||||
|
||||
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
|
||||
|
||||
1. How can you solve the problem for a single element of the list? Once
|
||||
you've solved that problem, purrr takes care of generalising your
|
||||
solution to every element in the list.
|
||||
|
||||
1. If you're solving a complex problem, how can you break it down into
|
||||
bite-sized pieces that allow you to advance one small step towards a
|
||||
solution? With purrr, you get lots of small pieces that you can
|
||||
compose together with the pipe.
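
As a sketch of what those two steps might look like in practice, here I solve the problem for one column of the built-in `mtcars` data and then let purrr generalise it. The helper `trimmed_mean()` is a name I've invented for this example:

```{r}
library(purrr)

# Step 1: solve the problem for a single element (one column)
trimmed_mean <- function(x) mean(x, trim = 0.1)
trimmed_mean(mtcars$mpg)

# Step 2: let purrr apply that solution to every element
by_map <- mtcars %>% map_dbl(trimmed_mean)

# The equivalent for loop needs more bookkeeping for the same result
by_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  by_loop[[i]] <- trimmed_mean(mtcars[[i]])
}
all.equal(unname(by_map), by_loop)
```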
|
||||
|
||||
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
|
||||
|
||||
In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
|
||||
|
||||
Imagine you have a data frame and you want to compute the mean of each column. You might write code like this:
|
||||
|
||||
```{r}
|
||||
|
@@ -583,3 +600,73 @@ pwalk(list(paths, plots), ggsave, path = tempdir())
|
|||
```
|
||||
|
||||
`walk()`, `walk2()` and `pwalk()` all invisibly return `.x`, the first argument. This makes them suitable for use in the middle of pipelines.
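
For instance, a pipeline along these lines should carry on past the `walk()` step. It's a toy sketch with made-up input, not code from earlier in the chapter:

```{r}
library(purrr)

# walk() prints each element for its side effect, then hands the original
# vector on to the next step of the pipeline
doubled <- c(1, 2, 3) %>%
  walk(print) %>%
  map_dbl(function(v) v * 2)
doubled
```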
|
||||
|
||||
|
||||
## Predicates
|
||||
|
||||
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:
|
||||
|
||||
1. Find all numeric columns.
|
||||
1. Summarise each column.
|
||||
|
||||
In code, that would look like:
|
||||
|
||||
```{r}
|
||||
col_sum <- function(df, f) {
|
||||
is_num <- df %>% map_lgl(is_numeric)
|
||||
df[is_num] %>% map_dbl(f)
|
||||
}
|
||||
```
|
||||
|
||||
`is_numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:
|
||||
|
||||
* `keep()` and `discard()` keep/discard list elements where the predicate is
  true.
|
||||
|
||||
* `head_while()` and `tail_while()` keep the first/last elements of a list until
  you get the first element where the predicate is false.
|
||||
|
||||
* `some()` and `every()` determine if the predicate is true for any or all of
|
||||
the elements.
|
||||
|
||||
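* `detect()` and `detect_index()` find the first element where the predicate
  is true: `detect()` returns the element and `detect_index()` its position.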
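
Here's a quick sketch of these functions side by side. The list `x` is a made-up input, and I use the base predicates `is.numeric()` and `is.character()` so the example doesn't rely on purrr's own predicates:

```{r}
library(purrr)

x <- list(1, 2, "a", 3)  # hypothetical input

keep(x, is.numeric) %>% str()        # elements 1, 2 and 3
discard(x, is.numeric) %>% str()     # just "a"
head_while(x, is.numeric) %>% str()  # 1 and 2: stops at "a"
tail_while(x, is.numeric) %>% str()  # 3: works backwards from the end
some(x, is.character)                # is any element a string?
every(x, is.numeric)                 # are all elements numeric?
detect(x, is.character)              # the first matching element
detect_index(x, is.character)        # ...and its position
```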
|
||||
|
||||
We could use `keep()` to simplify the summary function to:
|
||||
|
||||
```{r}
|
||||
col_sum <- function(df, f) {
|
||||
df %>%
|
||||
keep(is.numeric) %>%
|
||||
map_dbl(f)
|
||||
}
|
||||
```
|
||||
|
||||
I like this formulation because you can easily read the sequence of steps.
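
To see it at work, here's a small usage sketch. It repeats the `keep()` version of `col_sum()` so it runs on its own, and the data frame `df` is a toy example:

```{r}
library(purrr)

col_sum <- function(df, f) {
  df %>%
    keep(is.numeric) %>%
    map_dbl(f)
}

df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
col_sum(df, mean)       # means of x and y
col_sum(df[1:2], mean)  # only x is numeric
col_sum(df[1], mean)    # no numeric columns: a zero-length double vector
```

Whatever the input, the result stays a double vector, which is the property you'd want from a summary function like this.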
|
||||
|
||||
|
||||
### Exercises
|
||||
|
||||
1. A possible base R equivalent of `col_sum()` is:
|
||||
|
||||
```{r}
|
||||
col_sum3 <- function(df, f) {
|
||||
is_num <- sapply(df, is.numeric)
|
||||
df_num <- df[, is_num]
|
||||
|
||||
sapply(df_num, f)
|
||||
}
|
||||
```
|
||||
|
||||
But it has a number of bugs as illustrated with the following inputs:
|
||||
|
||||
```{r, eval = FALSE}
|
||||
df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
|
||||
# OK
|
||||
col_sum3(df, mean)
|
||||
# Has problems: doesn't always return a numeric vector
|
||||
col_sum3(df[1:2], mean)
|
||||
col_sum3(df[1], mean)
|
||||
col_sum3(df[0], mean)
|
||||
```
|
||||
|
||||
What causes the bugs?
|
||||
|
|
383
lists.Rmd
|
@@ -1,383 +0,0 @@
|
|||
# Lists
|
||||
|
||||
```{r setup-lists, include=FALSE}
|
||||
library(purrr)
|
||||
```
|
||||
|
||||
In this chapter, you'll learn how to handle lists, the data structure R uses for complex, hierarchical objects. You're already familiar with vectors, R's data structure for 1d objects. Lists extend these ideas to model objects that are like trees. You can create a hierarchical structure with a list because unlike vectors, a list can contain other lists.
|
||||
|
||||
If you've worked with list-like objects before, you're probably familiar with the for loop. I'll talk a little bit about for loops here, but the focus will be functions from the __purrr__ package. purrr makes it easier to work with lists by eliminating common for loop boilerplate so you can focus on the specifics. The apply family of functions in base R (`apply()`, `lapply()`, `tapply()`, etc) solve a similar problem, but purrr is more consistent and easier to learn.
|
||||
|
||||
The goal of using purrr functions instead of for loops is to allow you to break common list manipulation challenges into independent pieces:
|
||||
|
||||
1. How can you solve the problem for a single element of the list? Once
|
||||
you've solved that problem, purrr takes care of generalising your
|
||||
solution to every element in the list.
|
||||
|
||||
1. If you're solving a complex problem, how can you break it down into
|
||||
bite sized pieces that allow you to advance one small step towards a
|
||||
solution? With purrr, you get lots of small pieces that you can
|
||||
compose together with the pipe.
|
||||
|
||||
This structure makes it easier to solve new problems. It also makes it easier to understand your solutions to old problems when you re-read your old code.
|
||||
|
||||
In later chapters you'll learn how to apply these ideas when modelling. You can often use multiple simple models to help understand a complex dataset, or you might have multiple models because you're bootstrapping or cross-validating. The techniques you'll learn in this chapter will be invaluable.
|
||||
|
||||
<!--
|
||||
## Warm ups
|
||||
|
||||
* What does this for loop do?
|
||||
* How is a data frame like a list?
|
||||
* What does `mean()` mean? What does `mean` mean?
|
||||
* How do you get help about the $ function? How do you normally write
|
||||
`[[`(mtcars, 1) ?
|
||||
* Argument order
|
||||
-->
|
||||
|
||||
## List basics
|
||||
|
||||
You create a list with `list()`:
|
||||
|
||||
```{r}
|
||||
x <- list(1, 2, 3)
|
||||
str(x)
|
||||
|
||||
x_named <- list(a = 1, b = 2, c = 3)
|
||||
str(x_named)
|
||||
```
|
||||
|
||||
Unlike atomic vectors, lists can contain a mix of objects:
|
||||
|
||||
```{r}
|
||||
y <- list("a", 1L, 1.5, TRUE)
|
||||
str(y)
|
||||
```
|
||||
|
||||
Lists can even contain other lists!
|
||||
|
||||
```{r}
|
||||
z <- list(list(1, 2), list(3, 4))
|
||||
str(z)
|
||||
```
|
||||
|
||||
`str()` is very helpful when looking at lists because it focusses on the structure, not the contents.
|
||||
|
||||
### Visualising lists
|
||||
|
||||
To explain more complicated list manipulation functions, it's helpful to have a visual representation of lists. For example, take these three lists:
|
||||
|
||||
```{r}
|
||||
x1 <- list(c(1, 2), c(3, 4))
|
||||
x2 <- list(list(1, 2), list(3, 4))
|
||||
x3 <- list(1, list(2, list(3)))
|
||||
```
|
||||
|
||||
I draw them as follows:
|
||||
|
||||
```{r, echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/lists-structure.png")
|
||||
```
|
||||
|
||||
* Lists are rounded rectangles that contain their children.
|
||||
|
||||
* I draw each child a little darker than its parent to make it easier to see
|
||||
the hierarchy.
|
||||
|
||||
* The orientation of the children (i.e. rows or columns) isn't important,
|
||||
so I'll pick a row or column orientation to either save space or illustrate
|
||||
an important property in the example.
|
||||
|
||||
### Subsetting
|
||||
|
||||
There are three ways to subset a list, which I'll illustrate with `a`:
|
||||
|
||||
```{r}
|
||||
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
|
||||
```
|
||||
|
||||
* `[` extracts a sub-list. The result will always be a list.
|
||||
|
||||
```{r}
|
||||
str(a[1:2])
|
||||
str(a[4])
|
||||
```
|
||||
|
||||
Like subsetting vectors, you can use an integer vector to select by
|
||||
position, or a character vector to select by name.
|
||||
|
||||
* `[[` extracts a single component from a list. It removes a level of
|
||||
hierarchy from the list.
|
||||
|
||||
```{r}
|
||||
str(a[[1]])
str(a[[4]])
|
||||
```
|
||||
|
||||
* `$` is a shorthand for extracting named elements of a list. It works
|
||||
similarly to `[[` except that you don't need to use quotes.
|
||||
|
||||
```{r}
|
||||
a$a
|
||||
a[["b"]]
|
||||
```
|
||||
|
||||
Or visually:
|
||||
|
||||
```{r, echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/lists-subsetting.png")
|
||||
```
|
||||
|
||||
### Lists of condiments
|
||||
|
||||
It's easy to get confused between `[` and `[[`, but it's important to understand the difference. A few months ago I stayed at a hotel with a pretty interesting pepper shaker that I hope will help you remember these differences:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper.jpg")
|
||||
```
|
||||
|
||||
If this pepper shaker is your list `x`, then, `x[1]` is a pepper shaker containing a single pepper packet:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper-1.jpg")
|
||||
```
|
||||
|
||||
`x[2]` would look the same, but would contain the second packet. `x[1:2]` would be a pepper shaker containing two pepper packets.
|
||||
|
||||
`x[[1]]` is:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper-2.jpg")
|
||||
```
|
||||
|
||||
If you wanted to get the content of the pepper packet, you'd need `x[[1]][[1]]`:
|
||||
|
||||
```{r, echo = FALSE, out.width = "25%"}
|
||||
knitr::include_graphics("images/pepper-3.jpg")
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Draw the following lists as nested sets.
|
||||
|
||||
1. Generate the lists corresponding to these nested set diagrams.
|
||||
|
||||
1. What happens if you subset a data frame as if you're subsetting a list?
|
||||
What are the key differences between a list and a data frame?
|
||||
|
||||
## Handling hierarchy {#hierarchy}
|
||||
|
||||
The map functions apply a function to every element in a list. They are the most commonly used part of purrr, but not the only part. Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:
|
||||
|
||||
* You can extract deeply nested elements in a single call by supplying
|
||||
a character vector to the map functions.
|
||||
|
||||
* You can remove a level of the hierarchy with the flatten functions.
|
||||
|
||||
* You can flip levels of the hierarchy with the transpose function.
|
||||
|
||||
### Extracting deeply nested elements
|
||||
|
||||
Sometimes you get data structures that are very deeply nested. A common source of such data is JSON from a web API. I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. Now I'm going to load it into a list with jsonlite. By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:
|
||||
|
||||
```{r}
|
||||
# From https://api.github.com/repos/hadley/r4ds/issues
|
||||
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
|
||||
```
|
||||
|
||||
There are eight issues, and each issue is a nested list:
|
||||
|
||||
```{r}
|
||||
length(issues)
|
||||
str(issues[[1]])
|
||||
```
|
||||
|
||||
To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:
|
||||
|
||||
```{r}
|
||||
issues %>% map_int("id")
|
||||
issues %>% map_lgl("locked")
|
||||
issues %>% map_chr("state")
|
||||
```
|
||||
|
||||
You can use the same technique to extract more deeply nested structure. For example, imagine you want to extract the name and id of the user. You could do that in two steps:
|
||||
|
||||
```{r}
|
||||
users <- issues %>% map("user")
|
||||
users %>% map_chr("login")
|
||||
users %>% map_int("id")
|
||||
```
|
||||
|
||||
But by supplying a character _vector_ to `map_*`, you can do it in one:
|
||||
|
||||
```{r}
|
||||
issues %>% map_chr(c("user", "login"))
|
||||
issues %>% map_int(c("user", "id"))
|
||||
```
|
||||
|
||||
### Removing a level of hierarchy
|
||||
|
||||
As well as indexing deeply into hierarchy, it's sometimes useful to flatten it. That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.
|
||||
|
||||
```{r}
|
||||
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
|
||||
str(x)
|
||||
|
||||
y <- flatten(x)
|
||||
str(y)
|
||||
flatten_dbl(y)
|
||||
```
|
||||
|
||||
Graphically, that sequence of operations looks like:
|
||||
|
||||
```{r, echo = FALSE}
|
||||
knitr::include_graphics("diagrams/lists-flatten.png")
|
||||
```
|
||||
|
||||
Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.
|
||||
|
||||
Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. Even if your data structure accidentally changes, `unlist()` will continue to work, silently returning the wrong type of output. This tends to create problems that are frustrating to debug.
|
||||
|
||||
### Switching levels in the hierarchy
|
||||
|
||||
Other times the hierarchy feels "inside out". You can use `transpose()` to flip the first and second levels of a list:
|
||||
|
||||
```{r}
|
||||
x <- list(
|
||||
x = list(a = 1, b = 3, c = 5),
|
||||
y = list(a = 2, b = 4, c = 6)
|
||||
)
|
||||
x %>% str()
|
||||
x %>% transpose() %>% str()
|
||||
```
|
||||
|
||||
Graphically, this looks like:
|
||||
|
||||
```{r, echo = FALSE, out.width = "75%"}
|
||||
knitr::include_graphics("diagrams/lists-transpose.png")
|
||||
```
|
||||
|
||||
You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.
|
||||
|
||||
It's called transpose by analogy to matrices. When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`.
|
||||
|
||||
Transpose is also useful when working with JSON APIs. Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. `transpose()` makes it easy to switch between the two:
|
||||
|
||||
```{r}
|
||||
df <- dplyr::data_frame(x = 1:3, y = c("a", "b", "c"))
|
||||
df %>% transpose() %>% str()
|
||||
```
|
||||
|
||||
### Turning lists into data frames
|
||||
|
||||
* Have a deeply nested list with missing pieces
|
||||
* Need a tidy data frame so you can visualise, transform, model etc.
|
||||
* What do you do?
|
||||
* By hand with purrr, talk about `fromJSON` and `tidyJSON`
|
||||
|
||||
### Exercises
|
||||
|
||||
## Predicates
|
||||
|
||||
Imagine we want to summarise each numeric column of a data frame. We could do it in two steps:
|
||||
|
||||
1. Find all numeric columns.
|
||||
1. Summarise each column.
|
||||
|
||||
In code, that would look like:
|
||||
|
||||
```{r}
|
||||
col_sum <- function(df, f) {
|
||||
is_num <- df %>% map_lgl(is_numeric)
|
||||
df[is_num] %>% map_dbl(f)
|
||||
}
|
||||
```
|
||||
|
||||
`is_numeric()` is a __predicate__: a function that returns either `TRUE` or `FALSE`. There are a number of purrr functions designed to work specifically with predicates:
|
||||
|
||||
* `keep()` and `discard()` keep/discard list elements where the predicate is
  true.
|
||||
|
||||
* `head_while()` and `tail_while()` keep the first/last elements of a list until
  you get the first element where the predicate is false.
|
||||
|
||||
* `some()` and `every()` determine if the predicate is true for any or all of
|
||||
the elements.
|
||||
|
||||
* `detect()` and `detect_index()`
|
||||
|
||||
We could use `keep()` to simplify the summary function to:
|
||||
|
||||
```{r}
|
||||
col_sum <- function(df, f) {
|
||||
df %>%
|
||||
keep(is.numeric) %>%
|
||||
map_dbl(f)
|
||||
}
|
||||
```
|
||||
|
||||
I like this formulation because you can easily read the sequence of steps.
|
||||
|
||||
### Built-in predicates
|
||||
|
||||
Purrr comes with a number of predicate functions built-in:
|
||||
|
||||
| | lgl | int | dbl | chr | list | null |
|
||||
|------------------|-----|-----|-----|-----|------|------|
|
||||
| `is_logical()` | x | | | | | |
|
||||
| `is_integer()` | | x | | | | |
|
||||
| `is_double()` | | | x | | | |
|
||||
| `is_numeric()` | | x | x | | | |
|
||||
| `is_character()` | | | | x | | |
|
||||
| `is_atomic()` | x | x | x | x | | |
|
||||
| `is_list()` | | | | | x | |
|
||||
| `is_vector()` | x | x | x | x | x | |
|
||||
| `is_null()` | | | | | | x |
|
||||
|
||||
Compared to the base R functions, they only inspect the type of the object, not its attributes. This means they tend to be less surprising:
|
||||
|
||||
```{r}
|
||||
is.atomic(NULL)
|
||||
is_atomic(NULL)
|
||||
|
||||
is.vector(factor("a"))
|
||||
is_vector(factor("a"))
|
||||
```
|
||||
|
||||
Each predicate also comes with "scalar" and "bare" versions. The scalar version checks that the length is 1 and the bare version checks that the object is a bare vector with no S3 class.
|
||||
|
||||
```{r}
|
||||
y <- factor(c("a", "b", "c"))
|
||||
is_integer(y)
|
||||
is_scalar_integer(y)
|
||||
is_bare_integer(y)
|
||||
```
|
||||
|
||||
### Exercises
|
||||
|
||||
1. A possible base R equivalent of `col_sum()` is:
|
||||
|
||||
```{r}
|
||||
col_sum3 <- function(df, f) {
|
||||
is_num <- sapply(df, is.numeric)
|
||||
df_num <- df[, is_num]
|
||||
|
||||
sapply(df_num, f)
|
||||
}
|
||||
```
|
||||
|
||||
But it has a number of bugs as illustrated with the following inputs:
|
||||
|
||||
```{r, eval = FALSE}
df <- data.frame(z = c("a", "b", "c"), x = 1:3, y = 3:1)
# OK
col_sum3(df, mean)
# Has problems: doesn't always return a numeric vector
col_sum3(df[1:2], mean)
col_sum3(df[1], mean)
col_sum3(df[0], mean)
```
|
||||
|
||||
What causes the bugs?
|
||||
|
||||
1. Carefully read the documentation of `is.vector()`. What does it actually
    test for?
|
40
program.Rmd
|
@ -2,24 +2,36 @@
|
|||
|
||||
Code is a tool of communication, not just to the computer, but to other people. This is important because every project you undertake is fundamentally collaborative. Even if you're not working with other people, you'll definitely be working with future-you. You want to write clear code so that future-you doesn't curse present-you when you look at a project again after several months have passed.
|
||||
|
||||
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. In this part of the book, you'll learn three important skills that help you move in this direction:
|
||||
To me, improving your communication skills is a key part of mastering R as a programming language. Over time, you want your code to become more and more clear, and easier to write. Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
|
||||
|
||||
1. We'll dive deep into the __pipe__, `%>%`, talking more about how it works
|
||||
and how it gives you a new tool for rewriting your code. You'll also learn
|
||||
about when not to use the pipe!
|
||||
In the following chapters, you'll learn important programming skills:
|
||||
|
||||
1. Repeating yourself in code is dangerous because it can easily lead to
|
||||
errors and inconsistencies. We'll talk about how to write __functions__
|
||||
in order to remove duplication in your logic.
|
||||
1. We'll start by diving deep into the __pipe__, `%>%`, talking more about how
|
||||
it works, what the alternatives are, and when not to use the pipe.
|
||||
|
||||
1. Copy-and-paste is a powerful tool, but you should avoid doing it more than
    twice. Repeating yourself in code is dangerous because it can easily lead
    to errors and inconsistencies. Instead, write __functions__ which let
    you extract out repeated code so that it can be easily reused.
|
||||
|
||||
1. Functions extract out repeated code, but you often need to repeat the
    same actions on multiple inputs. You need tools for __iteration__ that
    let you do similar things again and again. These tools include for loops
    and functional programming.
|
||||
|
||||
1. Another important tool for removing duplication is the __for loop__ which
|
||||
allows you to repeat the same action again and again and again. You tend to
|
||||
use for-loops less often in R than in other programming languages because R
|
||||
is a functional programming language which means that you can extract out
|
||||
common patterns of for loops and put them in a function. We'll come back to
|
||||
that idea in XYZ.
|
||||
1. As you start to write more powerful functions, you'll need a solid
    grounding in R's data structures. You must master the four common atomic
    vectors, the three important S3 classes built on top of them, and
    understand the mysteries of the list and data frame.
|
||||
|
||||
1. One of the particularly important data structures in R is the list.
    Lists are important because a list can contain other lists, so is
    __hierarchical__. Two common scenarios where hierarchical structures
    arise are JSON and fitting many models. You'll need to learn some new
    tools from the purrr package to make handling these cases as easy as
    possible.
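
The progression from copy-and-paste to functions to iteration can be sketched in a few lines of base R. The data frame `df` and the helper `rescale01()` are made up for illustration; this is one possible shape of the idea, not a prescribed recipe.

```{r}
# Made-up data for illustration.
df <- data.frame(a = c(1, 5, 10), b = c(2, 4, 6))

# Copy-and-paste: easy to write, easy to get wrong when editing one copy.
a_scaled <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
b_scaled <- (df$b - min(df$b)) / (max(df$b) - min(df$b))

# A function extracts the repeated logic (rescale01 is a hypothetical name).
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Iteration applies the function to every column: first with a for loop...
scaled <- df
for (nm in names(df)) {
  scaled[[nm]] <- rescale01(df[[nm]])
}

# ...then with a functional, which hides the loop bookkeeping.
scaled2 <- as.data.frame(lapply(df, rescale01))
```

Each step removes a little more duplication, so what remains on the page is the part that actually differs between operations.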
|
||||
|
||||
Removing duplication is an important part of expressing yourself clearly because it lets the reader (i.e. future-you!) focus on what's different between operations rather than what's the same. The goal is not just to write better functions or to do things that you couldn't do before, but to code with more "ease". As you internalise the ideas in this chapter, you should find it easier to re-tackle problems that you've solved in the past with much effort.
|
||||
The goal of these chapters is to teach you the minimum about programming that a practising data scientist must know. It turns out this is a reasonable amount, and I think it's worth investing in your programming skills. It's an investment that won't pay off immediately, but over time it will allow you to solve new problems more quickly, and reuse your insights from previous problems in new scenarios.
|
||||
|
||||
Writing code is similar in many ways to writing prose. One parallel which I find particularly useful is that in both cases rewriting is key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it's often worth looking at your code and thinking about whether or not it's obvious what you've done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn't mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions, the more likely your first attempt will be clear.)
|
||||
|
||||
|
|