From f37e134d74a23d329bbd0730e40f4beec65f066d Mon Sep 17 00:00:00 2001 From: Hadley Wickham Date: Mon, 8 Aug 2022 13:56:14 -0500 Subject: [PATCH] Remove extra chapter about hierarchical data Since now superseded by rectangling --- extra/hierarchy.Rmd | 217 -------------------------------------------- 1 file changed, 217 deletions(-) delete mode 100644 extra/hierarchy.Rmd diff --git a/extra/hierarchy.Rmd b/extra/hierarchy.Rmd deleted file mode 100644 index 46210b5..0000000 --- a/extra/hierarchy.Rmd +++ /dev/null @@ -1,217 +0,0 @@ -# Hierarchical data {#hierarchy} - -## Introduction - -This chapter belongs in [wrangle](#wrangle-intro): it will give you a set of tools for working with hierarchical data, such as the deeply nested lists you often get when working with JSON. -However, you can only learn it now because working with hierarchical structures requires some programming skills, particularly an understanding of data structures, functions, and iteration. -Now you have those tools under your belt, you can learn how to work with hierarchical data. - -The - -As well as tools to simplify iteration, purrr provides tools for handling deeply nested lists. -There are three common sources of such data: - -- JSON and XML -- - -The map functions apply a function to every element in a list. -They are the most commonly used part of purrr, but not the only part. -Since lists are often used to represent complex hierarchies, purrr also provides tools to work with hierarchy: - -- You can extract deeply nested elements in a single call by supplying a character vector to the map functions. - -- You can remove a level of the hierarchy with the flatten functions. - -- You can flip levels of the hierarchy with the transpose function. - -### Prerequisites - -This chapter focusses mostly on purrr. -As well as the tools for iteration that you've already learned about, purrr also provides a number of tools specifically designed to manipulate hierarchical data. - -```{r setup} -library(purrr) -``` - -## Initial exploration - -Sometimes you get data structures that are very deeply nested. -A common source of such data is JSON from a web API. -I've previously downloaded a list of GitHub issues related to this book and saved it as `issues.json`. -Now I'm going to load it into a list with jsonlite. -By default `fromJSON()` tries to be helpful and simplifies the structure a little for you. -Here I'm going to show you how to do it with purrr, so I set `simplifyVector = FALSE`: - -```{r} -# From https://api.github.com/repos/hadley/r4ds/issues -issues <- jsonlite::fromJSON("issues.json", simplifyVector = FALSE) -``` - -You might be tempted to use `str()` on this data. -Unfortunately, however, `str()` is not designed for lists that are both deep and wide, and you'll tend to get overwhelmed by the output. -A better strategy is to pull the list apart piece by piece. - -First, figure out how many elements are in the list, take a look at one, and then check they're all the same structure. -In this case there are eight elements, and the first element is another list. - -```{r} -length(issues) -str(issues[[1]]) -``` - -(In this case we got lucky and the structure is (just) simple enough to print out with `str()`. -If you're unlucky, you may need to repeat this procedure.) - -```{r} -tibble::tibble( - i = seq_along(issues), - names = issues |> map(names) -) |> - tidyr::unnest(names) |> - table() |> - t() -``` - -Another alternative is the **listviewer** package, . - -## Extracting deeply nested elements - -To work with this sort of data, you typically want to turn it into a data frame by extracting the related vectors that you're most interested in: - -```{r} - -issues |> map_int("id") -issues |> map_lgl("locked") -issues |> map_chr("state") -``` - -You can use the same technique to extract more deeply nested structure. -For example, imagine you want to extract the name and id of the user. -You could do that in two steps: - -```{r} -users <- issues |> map("user") -users |> map_chr("login") -users |> map_int("id") -``` - -But by supplying a character *vector* to `map_*`, you can do it in one: - -```{r} -issues |> map_chr(c("user", "login")) -issues |> map_int(c("user", "id")) -``` - -What happens if that path is missing in some of the elements? -For example, lets try and extract the HTML url to the pull request: - -```{r, error = TRUE} -issues |> map_chr(c("pull_request", "html_url")) -``` - -Unfortunately that doesn't work. -Whenever you see an error from purrr complaining about the "type" of the result, it's because it's trying to shove it into a simple vector (here a character). -You can diagnose the problem more easily if you use `map()`: - -```{r} -issues |> map(c("pull_request", "html_url")) -``` - -To get the results into a character vector, we need to tell purrr what it should change `NULL` to. -You can do that with the `.null` argument. -The most common value to use is `NA`: - -```{r} -issues |> map_chr(c("pull_request", "html_url"), .null = NA) -``` - -(You might wonder why that isn't the default value since it's so useful. Well, if it was the default, you'd never get an error message if you had a typo in the names. You'd just get a vector of missing values. That would be annoying to debug because it's a silent failure.) - -It's possible to mix position and named indexing by using a list - -```{r} -issues |> map_chr(list("pull_request", 1), .null = NA) -``` - -## Removing a level of hierarchy - -As well as indexing deeply into hierarchy, it's sometimes useful to flatten it. -That's the job of the flatten family of functions: `flatten()`, `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, and `flatten_chr()`. -In the code below we take a list of lists of double vectors, then flatten it to a list of double vectors, then to a double vector. - -```{r} -x <- list(list(a = 1, b = 2), list(c = 3, d = 4)) -str(x) - -y <- flatten(x) -str(y) -flatten_dbl(y) -``` - -Graphically, that sequence of operations looks like: - -```{r, echo = FALSE} -knitr::include_graphics("diagrams/lists-flatten.png") -``` - -Whenever I get confused about a sequence of flattening operations, I'll often draw a diagram like this to help me understand what's going on. - -Base R has `unlist()`, but I recommend avoiding it for the same reason I recommend avoiding `sapply()`: it always succeeds. -Even if your data structure accidentally changes, `unlist()` will continue to work silently the wrong type of output. -This tends to create problems that are frustrating to debug. - -## Switching levels in the hierarchy {#transpose} - -Other times the hierarchy feels "inside out". -You can use `transpose()` to flip the first and second levels of a list: - -```{r} -x <- list( - x = list(a = 1, b = 3, c = 5), - y = list(a = 2, b = 4, c = 6) -) -x |> str() -x |> transpose() |> str() -``` - -Graphically, this looks like: - -```{r, echo = FALSE} -knitr::include_graphics("diagrams/lists-transpose.png") -``` - -You'll see an example of this in the next section, as `transpose()` is particularly useful in conjunction with adverbs like `safely()` and `quietly()`. - -It's called transpose by analogy to matrices. -When you subset a transposed matrix, you switch indices: `x[i, j]` is the same as `t(x)[j, i]`. -It's the same idea when transposing a list, but the subsetting looks a little different: `x[[i]][[j]]` is equivalent to `transpose(x)[[j]][[i]]`. -Similarly, a transpose is its own inverse so `transpose(transpose(x))` is equal to `x`. - -Transpose is also useful when working with JSON APIs. -Many JSON APIs represent data frames in a row-based format, rather than R's column-based format. -`transpose()` makes it easy to switch between the two: - -```{r} -df <- tibble::tibble(x = 1:3, y = c("a", "b", "c")) -df |> transpose() |> str() -``` - -## Turning lists into data frames - -- Have a deeply nested list with missing pieces -- Need a tidy data frame so you can visualise, transform, model etc. -- What do you do? -- By hand with purrr, talk about `fromJSON` and `tidyJSON` -- tidyjson - -### Exercises - -1. Challenge: read all the CSV files in a directory. - Which ones failed and why? - - ```{r, eval = FALSE} - files <- dir("data", pattern = "\\.csv$") - files |> - set_names(., basename(.)) |> - map_df(safely(readr::read_csv), .id = "filename") - ```