Rectangling polish

2022-09-01 08:27:21 -05:00 · 2022-09-01 08:27:21 -05:00 · fc3641a376
parent 279611af8a
commit fc3641a376
1 changed file with 82 additions and 53 deletions
--- a/rectangling.qmd
+++ b/rectangling.qmd
@@ -10,17 +10,17 @@ status("polishing")
 ## Introduction

 In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
-This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
+This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.

-To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
-Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns.
-We'll then show you a few case studies, applying these simple function multiple times to solve real problems.
+To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
+Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()` and `tidyr::unnest_wider()`.
+We'll then show you a few case studies, applying these simple functions again and again to solve real problems.
 We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.

 ### Prerequisites

 In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
-We'll also use repurrrsive to provide some interesting datasets rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
+We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.

 ```{r}
 #| label: setup
@@ -33,8 +33,8 @@ library(jsonlite)

 ## Lists

-So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
-These vectors are simple because they're homogeneous: every element is same type.
+So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors.
+These vectors are simple because they're homogeneous: every element is the same type.
 If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:

 ```{r}
@@ -86,16 +86,21 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
 str(x5)
 ```

-As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
-@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
+As lists get even larger and more complex, `str()` eventually starts to fail, and you'll need to switch to `View()`[^rectangling-1].
+@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-lists.

 [^rectangling-1]: This is an RStudio feature.

 ```{r}
 #| label: fig-view-collapsed
 #| fig.cap: >
-#|   The RStudio allows you to interactively explore a complex list.  
+#|   The RStudio view lets you interactively explore a complex list.  
 #|   The viewer opens showing only the top level of the list.
+#| fig.alt: >
+#|   A screenshot of RStudio showing the list-viewer. It shows the
+#|   two children of x4: the first child is a double vector and the
+#|   second child is a list. A rightward facing triangle indicates that the
+#|   second child itself has children but you can't see them.
 #| echo: false
 #| out-width: NULL
 knitr::include_graphics("screenshots/View-1.png", dpi = 220)
@@ -106,6 +111,10 @@ knitr::include_graphics("screenshots/View-1.png", dpi = 220)
 #| fig.cap: >
 #|   Clicking on the rightward facing triangle expands that component
 #|   of the list so that you can also see its children.
+#| fig.alt: >
+#|   Another screenshot of the list-viewer having expanded the second
+#|   child of x4. It also has two children, a double vector and another
+#|   list.
 #| echo: false
 #| out-width: NULL
 knitr::include_graphics("screenshots/View-2.png", dpi = 220)
@@ -115,9 +124,12 @@ knitr::include_graphics("screenshots/View-2.png", dpi = 220)
 #| label: fig-view-expand-2
 #| fig.cap: >
 #|   You can repeat this operation as many times as needed to get to the 
-#|   data you're interested in. Note the bottom-right corner: if you click
+#|   data you're interested in. Note the bottom-left corner: if you click
 #|   an element of the list, RStudio will give you the subsetting code
 #|   needed to access it, in this case `x4[[2]][[2]][[2]]`.
+#| fig.alt: >
+#|   Another screenshot, having expanded the grandchild of x4 to see its
+#|   two children, again a double vector and a list.
 #| echo: false
 #| out-width: NULL
 knitr::include_graphics("screenshots/View-3.png", dpi = 220)
@@ -173,11 +185,11 @@ It's possible to put a list in a column of a `data.frame`, but it's a lot fiddli
 data.frame(x = list(1:3, 3:5))
 ```

-You can force `data.frame()` to treat a list as a list of rows by wrapping it in list `I()`, but the result doesn't print particularly usefully:
+You can force `data.frame()` to treat a list as a list of rows by wrapping it in `I()`, but the result doesn't print particularly well:

 ```{r}
 data.frame(
-  x = I(list(1:3, 3:5)), 
+  x = I(list(1:2, 3:5)), 
  y = c("1, 2", "3, 4, 5")
 )
 ```
@@ -188,14 +200,12 @@ It's easier to use list-columns with tibbles because `tibble()` treats lists lik
 ## Unnesting

 Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
-We'll start with very simple sample data so you can get the basic idea, and then switch to more realistic examples in the next section.
+Here we'll use very simple sample data so you can get the basic idea; in the next section we'll switch to real data.

 List-columns tend to come in two basic forms: named and unnamed.
 When the children are **named**, they tend to have the same names in every row.
-When the children are **unnamed**, the number of elements tends to vary from row-to-row.
-The following code creates an example of each.
-In `df1`, every element of list-column `y` has two elements named `a` and `b`.
-In `df2`, the elements of list-column `y` are unnamed and vary in length.
+For example, in `df1`, every element of list-column `y` has two elements named `a` and `b`.
+Named list-columns naturally unnest into columns: each named element becomes a new named column.

 ```{r}
 df1 <- tribble(
@@ -204,6 +214,13 @@ df1 <- tribble(
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
 )
+```
+
+When the children are **unnamed**, the number of elements tends to vary from row-to-row.
+For example, in `df2`, the elements of list-column `y` are unnamed and vary in length from one to three.
+Unnamed list-columns naturally unnest into rows: you'll get one row for each child.
+
+```{r}

 df2 <- tribble(
  ~x, ~y,
@@ -213,9 +230,7 @@ df2 <- tribble(
 )
 ```

-Named list-columns naturally unnest into columns: each named element becomes a new named column.
-Unnamed list-columns naturally unnested in to rows: you'll get one row for each child.
-tidyr provides two functions for these two case: `unnest_wider()` and `unnest_longer()`.
+tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`.
 The following sections explain how they work.

 ### `unnest_wider()`
@@ -227,7 +242,7 @@ df1 |>
  unnest_wider(y)
 ```

-By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
+By default, the names of the new columns come exclusively from the names of the list elements, but you can use the `names_sep` argument to request that they combine the column name and the element name.
 This is useful for disambiguating repeated names.

 ```{r}
@@ -255,7 +270,7 @@ df2 |>
 ```

 Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
-But what happens if the list-column is empty, as in the following example?
+But what happens if one of the elements is empty, as in the following example?

 ```{r}
 df6 <- tribble(
@@ -270,15 +285,15 @@ df6 |> unnest_longer(y)
 We get zero rows in the output, so the row effectively disappears.
 Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.

-You can also unnest named list-columns, like `df1$y` into the rows.
-Because the elements are named, and those names might be useful data, puts them in a new column with the suffix `_id`:
+You can also unnest named list-columns, like `df1$y`, into rows.
+Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:

 ```{r}
 df1 |> 
  unnest_longer(y)
 ```

-If you don't want these `ids`, you can suppress this with `indices_include = FALSE`.
+If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
 On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
 You can do this with `indices_include = TRUE`:

@@ -311,7 +326,7 @@ df4 |>

 As you can see, the output contains a list-column, but every element of the list-column contains a single element.
 Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
-You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different.
+You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is still a list, even though the contents of each element are a different type.

 What happens if you find this problem in a dataset you're trying to rectangle?
 There are two basic options.
@@ -328,8 +343,7 @@ Another option would be to filter down to the rows that have values of a specifi
 ```{r}
 df4 |> 
  unnest_longer(y) |> 
-  rowwise() |> 
-  filter(is.numeric(y))
+  filter(map_lgl(y, is.numeric))
 ```

 Then you can call `unnest_longer()` once more:
@@ -337,20 +351,21 @@ Then you can call `unnest_longer()` once more:
 ```{r}
 df4 |> 
  unnest_longer(y) |> 
-  rowwise() |> 
-  filter(is.numeric(y)) |> 
+  filter(map_lgl(y, is.numeric)) |> 
  unnest_longer(y)
 ```

+You'll learn more about `map_lgl()` in @sec-iteration.
+
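The switch from `rowwise()` to `map_lgl()` in the hunk above can be illustrated with a small sketch. The tibble `df_mixed` is hypothetical (it is not one of the chapter's datasets), and the example assumes dplyr, purrr, and tibble are installed. It shows why the predicate has to be applied to each element of the list-column and not to the column as a whole.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical list-column whose elements have different types
df_mixed <- tibble(
  x = c("a", "b", "c"),
  y = list(1, "two", 3)
)

# Asking about the whole column is not useful: a list is never numeric
whole_column <- is.numeric(df_mixed$y)

# map_lgl() calls is.numeric() on each element and returns a logical vector
keep <- map_lgl(df_mixed$y, is.numeric)

# The logical vector can be used directly inside filter()
numeric_rows <- df_mixed |>
  filter(map_lgl(y, is.numeric))

numeric_rows
```

Because `map_lgl()` returns one logical value per row, `filter()` needs no grouping step, and the result is an ordinary (ungrouped) tibble.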
 ### Other functions

 tidyr has a few other useful rectangling functions that we're not going to cover in this book:

 -   `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
-   `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
+-   `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.
 -   `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.

-These are good to know about when you're other people's code and for tackling rarer rectangling challenges.
+These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.

 ### Exercises

@@ -370,13 +385,12 @@ These are good to know about when you're other people's code and for tackling ra

 ## Case studies

-So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`.
-The main difference between real data and these simple examples is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and `unnest_wider()`.
-This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
+The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`.
+This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.

 ### Very wide data

-We'll start by exploring `gh_repos`.
+We'll start with `gh_repos`.
 This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.

 `gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
@@ -389,7 +403,7 @@ repos

 This tibble contains 6 rows, one row for each child of `gh_repos`.
 Each row contains an unnamed list with either 26 or 30 rows.
-Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
+Since these are unnamed, we'll start with `unnest_longer()` to put each child in its own row:

 ```{r}
 repos |> 
@@ -437,6 +451,8 @@ repos |>
  unnest_wider(owner)
 ```

+<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
+
 Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
 Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:

@@ -461,14 +477,14 @@ chars <- tibble(json = got_chars)
 chars
 ```

-The `json` column contains named values, so we'll start by widening it:
+The `json` column contains named elements, so we'll start by widening it:

 ```{r}
 chars |> 
  unnest_wider(json)
 ```

-And selecting a few columns just to make it easier to read:
+And selecting a few columns to make it easier to read:

 ```{r}
 characters <- chars |> 
@@ -508,16 +524,15 @@ titles <- chars |>
 titles
 ```

-Now, for example, we could use this table to all the characters that are captains and see all their titles:
+Now, for example, we could use this table to find all the characters that are captains and see all their titles:

 ```{r}
 captains <- titles |> filter(str_detect(title, "Captain"))
 captains

 characters |> 
-  semi_join(captains, by = "id") |> 
  select(id, name) |> 
-  left_join(titles, by = "id", multiple = "all")
+  inner_join(titles, by = "id", multiple = "all")
 ```

 You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
@@ -540,7 +555,7 @@ titles |>
  unnest_longer(word)
 ```

-And then we can count that column to find the most common:
+And then we can count that column to find the most common words:

 ```{r}
 titles |> 
@@ -680,6 +695,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
    Why does it work for `got_chars` but might not work in general?

    ```{r}
+    #| results: false
    tibble(json = got_chars) |> 
      unnest_wider(json) |> 
      select(id, where(is.list)) %>% 
@@ -699,7 +715,7 @@ If these case studies have whetted your appetite for more real-life rectangling,

 ## JSON

-All of the case studies in the previous section were sourced from wild-caught JSON files.
+All of the case studies in the previous section were sourced from wild-caught JSON.
 JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
 It's important to understand it because while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, so it's good to understand a bit about JSON if things go wrong.

@@ -709,27 +725,28 @@ JSON is a simple format designed to be easily read and written by machines, not
 It has six key data types.
 Four of them are scalars:

-   The simplest type is a null, which is written `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
-   A **string** is much like a string in R, but must use double quotes, not single quotes.
-   A **number** is similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
-   A **boolean** is similar to R's `TRUE` and `FALSE`, but use lower case `true` and `false`.
+-   The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
+-   A **string** is much like a string in R, but must always use double quotes.
+-   A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
+-   A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.

 JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
 The main difference is that JSON's scalars can only represent a single value.
-To represent multiple values you need to use one of the two remaining types, arrays and objects.
+To represent multiple values you need to use one of the two remaining types: arrays and objects.

 Both arrays and objects are similar to lists in R; the difference is whether or not they're named.
 An **array** is like an unnamed list, and is written with `[]`.
 For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
-An **object** is like a named list, and it's written with `{}`.
+An **object** is like a named list, and is written with `{}`.
+The names (keys in JSON terminology) are strings, so must be surrounded by quotes.
 For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
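The mapping between JSON types and R lists described above can be checked with a minimal sketch. It assumes the jsonlite package is installed and uses `parse_json()` with its default settings; the two JSON strings are the examples from the text.

```r
library(jsonlite)

# A JSON array becomes an unnamed list; JSON null becomes NULL
arr <- parse_json('[null, 1, "string", false]')
str(arr)

# A JSON object becomes a named list, with the keys as names
obj <- parse_json('{"x": 1, "y": 2}')
str(obj)
```

Calling `str()` on each result is a quick way to confirm which elements are named and what type each scalar was converted to.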

 ### jsonlite

-To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Oooms.
+To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms.
 We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
 In real life, you'll use `read_json()` to read a JSON file from disk.
-For example, the repurrsive package also provides the source for `gh_user` as a JSON file:
+For example, the repurrrsive package also provides the source for `gh_user` as a JSON file and you can read it with `read_json()`:

 ```{r}
 # A path to a json file inside the package:
@@ -767,6 +784,7 @@ json <- '[
 ]'
 df <- tibble(json = parse_json(json))
 df
+
 df |> 
  unnest_wider(json)
 ```
@@ -785,6 +803,7 @@ json <- '{
 '
 df <- tibble(json = list(parse_json(json)))
 df
+
 df |> 
  unnest_wider(json) |> 
  unnest_longer(results) |> 
@@ -828,3 +847,13 @@ Apply `readr::parse_double()` as needed to get the correct variable type.
    df_col <- tibble(json = list(json_col)) 
    df_row <- tibble(json = json_row)
    ```
+
+## Summary
+
+In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames.
+Surprisingly we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns.
+It doesn't matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions.
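As a compact recap of that workflow, here is a minimal sketch that goes from a JSON string to a rectangular tibble. The JSON content and the names `people` and `rect` are made up for illustration; the example assumes jsonlite, tibble, and tidyr are installed.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Hypothetical JSON: an array of objects, each holding a nested array
json <- '[
  {"name": "Ana", "langs": ["R", "SQL"]},
  {"name": "Bo", "langs": ["Python"]}
]'

# Each element of the parsed array becomes one row of a list-column
people <- tibble(json = parse_json(json))

rect <- people |>
  unnest_wider(json) |>   # named children become columns: name, langs
  unnest_longer(langs)    # unnamed children become one row per language

rect
```

The order of the two calls follows the structure of the data: the outer objects are named, so they are widened first, and the inner arrays are unnamed, so they are lengthened second.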
+
+JSON is the most common data format returned by web APIs.
+What happens if the website doesn't have an API, but you can see data you want on the website?
+That's the topic of the next chapter: web scraping, extracting data from HTML webpages.