Remove extra chapter about hierarchical data
Since now superseded by rectangling
parent 027848d806
commit f37e134d74

@@ -1,217 +0,0 @@
# Hierarchical data {#hierarchy}

## Introduction

This chapter belongs in [wrangle](#wrangle-intro): it will give you a set of tools for working with hierarchical data, such as the deeply nested lists you often get when working with JSON.
However, you can only learn it now because working with hierarchical structures requires some programming skills, particularly an understanding of data structures, functions, and iteration.
Now that you have those tools under your belt, you can learn how to work with hierarchical data.

The

As well as tools to simplify iteration, purrr provides tools for handling deeply nested lists.
There are three common sources of such data:

- JSON and XML
-

The map functions apply a function to every element in a list.
They are the most commonly used part of purrr, but not the only part.
Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy:

- You can extract deeply nested elements in a single call by supplying a character vector to the map functions.

- You can remove a level of the hierarchy with the flatten functions.

- You can flip levels of the hierarchy with the transpose function.

### Prerequisites

This chapter focusses mostly on purrr.
As well as the tools for iteration that you've already learned about, purrr also provides a number of tools specifically designed to manipulate hierarchical data.

```{r setup}
library(purrr)
```

## Initial exploration

Sometimes you get data structures that are very deeply nested.
A common source of such data is JSON from a web API.
I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`.
Now I'm going to load it into a list with jsonlite.
By default, `fromJSON()` tries to be helpful and simplifies the structure a little for you.
Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`:

```{r}
# From https://api.github.com/repos/hadley/r4ds/issues
issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE)
```

You might be tempted to use `str()` on this data.
Unfortunately, `str()` is not designed for lists that are both deep and wide, and you'll tend to get overwhelmed by the output.
A better strategy is to pull the list apart piece by piece.

First, figure out how many elements are in the list, take a look at one, and then check that they all have the same structure.
In this case there are eight elements, and the first element is another list.

```{r}
length(issues)
str(issues[[1]])
```

(In this case we got lucky and the structure is (just) simple enough to print out with `str()`.
If you're unlucky, you may need to repeat this procedure.)

```{r}
tibble::tibble(
  i = seq_along(issues),
  names = issues |> map(names)
) |>
  tidyr::unnest(names) |>
  table() |>
  t()
```

Another alternative is the **listviewer** package, <https://github.com/timelyportfolio/listviewer>.

## Extracting deeply nested elements

To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in:

```{r}
issues |> map_int("id")
issues |> map_lgl("locked")
issues |> map_chr("state")
```

You can use the same technique to extract more deeply nested structure.
For example, imagine you want to extract the name and id of the user.
You could do that in two steps:

```{r}
users <- issues |> map("user")
users |> map_chr("login")
users |> map_int("id")
```

But by supplying a character *vector* to `map_*`, you can do it in one:

```{r}
issues |> map_chr(c("user", "login"))
issues |> map_int(c("user", "id"))
```

What happens if that path is missing in some of the elements?
For example, let's try to extract the HTML URL of the pull request:

```{r, error = TRUE}
issues |> map_chr(c("pull_request", "html_url"))
```

Unfortunately, that doesn't work.
Whenever you see an error from purrr complaining about the "type" of the result, it's because it's trying to shove it into a simple vector (here a character vector).
You can diagnose the problem more easily if you use `map()`:

```{r}
issues |> map(c("pull_request", "html_url"))
```

To get the results into a character vector, we need to tell purrr what it should change `NULL` to.
You can do that with the `.default` argument (called `.null` in older versions of purrr).
The most common value to use is `NA`:

```{r}
issues |> map_chr(c("pull_request", "html_url"), .default = NA)
```

(You might wonder why that isn't the default value since it's so useful. Well, if it were the default, you'd never get an error message if you had a typo in the names. You'd just get a vector of missing values. That would be annoying to debug because it's a silent failure.)
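
The silent failure described above can be sketched on a small made-up list. The `issues` data and the misspelled name `"logn"` below are hypothetical stand-ins, not the real GitHub response, and the sketch assumes a version of purrr where the argument is named `.default`:

```r
library(purrr)

# Hypothetical stand-in for the real issues list
issues <- list(
  list(user = list(login = "alice")),
  list(user = list(login = "bob"))
)

# Without a default, a misspelled path fails loudly
strict <- tryCatch(
  map_chr(issues, c("user", "logn")),
  error = function(e) "error"
)

# With a default, the same typo silently yields missing values
lenient <- map_chr(issues, c("user", "logn"), .default = NA_character_)
```

Here `strict` records that an error was raised, while `lenient` is a vector of missing values that gives no hint that the path was misspelled.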

It's possible to mix position and named indexing by using a list:

```{r}
issues |> map_chr(list("pull_request", 1), .default = NA)
```

## Removing a level of hierarchy

As well as indexing deeply into hierarchy, it's sometimes useful to flatten it.
That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`.
In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector.

```{r}
x <- list(list(a = 1, b = 2), list(c = 3, d = 4))
str(x)

y <- flatten(x)
str(y)
flatten_dbl(y)
```

Graphically, that sequence of operations looks like:

```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-flatten.png")
```

Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on.

Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds.
Even if your data structure accidentally changes, `unlist()` will continue to work, silently returning the wrong type of output.
This tends to create problems that are frustrating to debug.
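
As a minimal illustration of that point, the sketch below uses only base R and a made-up variation of the list above in which one leaf has accidentally become a string:

```r
# unlist() returns a double vector when every leaf is a number...
ok <- unlist(list(list(a = 1, b = 2), list(c = 3, d = 4)))

# ...and still "succeeds" when one leaf turns into a string,
# silently coercing every value to character
changed <- unlist(list(list(a = 1, b = 2), list(c = 3, d = "4")))

typeof(ok)
typeof(changed)
```

Nothing in the second call signals that the output type changed, which is why the problem tends to surface far from its cause.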

## Switching levels in the hierarchy {#transpose}

Other times the hierarchy feels "inside out".
You can use `transpose()` to flip the first and second levels of a list:

```{r}
x <- list(
  x = list(a = 1, b = 3, c = 5),
  y = list(a = 2, b = 4, c = 6)
)
x |> str()
x |> transpose() |> str()
```

Graphically, this looks like:

```{r, echo = FALSE}
knitr::include_graphics("diagrams/lists-transpose.png")
```

You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`.

It's called transpose by analogy to matrices.
When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`.
It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`.
Similarly, a transpose is its own inverse, so `transpose(transpose(x))` is equal to `x`.
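
These two properties can be checked directly. The sketch below reuses the small list from the example above and assumes a version of purrr that still provides `transpose()`; the name `tx` is introduced here only for illustration:

```r
library(purrr)

x <- list(
  x = list(a = 1, b = 3, c = 5),
  y = list(a = 2, b = 4, c = 6)
)
tx <- transpose(x)

# Swapping the order of the indices reaches the same element,
# whether you index by name or by position
same_by_name <- identical(x[["y"]][["b"]], tx[["b"]][["y"]])
same_by_position <- identical(x[[2]][[3]], tx[[3]][[2]])

# Transposing twice gives back the original list
round_trip <- transpose(tx)
```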

Transpose is also useful when working with JSON APIs.
Many JSON APIs represent data frames in a row-based format, rather than R's column-based format.
`transpose()` makes it easy to switch between the two:

```{r}
df <- tibble::tibble(x = 1:3, y = c("a", "b", "c"))
df |> transpose() |> str()
```

## Turning lists into data frames

- Have a deeply nested list with missing pieces
- Need a tidy data frame so you can visualise, transform, model etc.
- What do you do?
- By hand with purrr, talk about `fromJSON` and `tidyJSON`
- tidyjson
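
One way the "by hand with purrr" route might look is sketched below. The `issues` list, its field values, and the column names are all made up for illustration (the real GitHub response has many more fields), and the sketch assumes a version of purrr where the fallback argument is named `.default`:

```r
library(purrr)

# Hypothetical nested list with a missing piece:
# the second issue has no pull_request entry
issues <- list(
  list(
    id = 1L, state = "open",
    user = list(login = "alice"),
    pull_request = list(html_url = "https://example.com/pull/1")
  ),
  list(
    id = 2L, state = "closed",
    user = list(login = "bob")
  )
)

# Extract one vector per column, supplying a default where a path may be absent
issues_df <- tibble::tibble(
  id = map_int(issues, "id"),
  state = map_chr(issues, "state"),
  login = map_chr(issues, c("user", "login")),
  pr_url = map_chr(issues, c("pull_request", "html_url"), .default = NA_character_)
)
```

Only the column that is expected to have gaps gets a default, so a typo in any other path still raises an error instead of quietly producing missing values.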

### Exercises

1.  Challenge: read all the CSV files in a directory.
    Which ones failed and why?

    ```{r, eval = FALSE}
    # full.names = TRUE keeps the directory in each path so the files can be found
    files <- dir("data", pattern = "\\.csv$", full.names = TRUE)
    files |>
      set_names(basename(files)) |>
      map_df(safely(readr::read_csv), .id = "filename")
    ```