Rectangling polish
parent
279611af8a
commit
fc3641a376
135
rectangling.qmd
|
@ -10,17 +10,17 @@ status("polishing")
|
|||
## Introduction
|
||||
|
||||
In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
|
||||
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
|
||||
This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
|
||||
|
||||
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
|
||||
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns.
|
||||
We'll then show you a few case studies, applying these simple function multiple times to solve real problems.
|
||||
To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
|
||||
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()` and `tidyr::unnest_wider()`.
|
||||
We'll then show you a few case studies, applying these simple functions again and again to solve real problems.
|
||||
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
|
||||
We'll also use repurrrsive to provide some interesting datasets rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
|
||||
We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.
|
||||
|
||||
```{r}
|
||||
#| label: setup
|
||||
|
@ -33,8 +33,8 @@ library(jsonlite)
|
|||
|
||||
## Lists
|
||||
|
||||
So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
|
||||
These vectors are simple because they're homogeneous: every element is same type.
|
||||
So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors.
|
||||
These vectors are simple because they're homogeneous: every element is the same type.
|
||||
If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:
|
||||
|
||||
```{r}
|
||||
|
@ -86,16 +86,21 @@ x5 <- list(1, list(2, list(3, list(4, list(5)))))
|
|||
str(x5)
|
||||
```
|
||||
|
||||
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangling-1].
|
||||
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
|
||||
As lists get even larger and more complex, `str()` eventually starts to fail, and you'll need to switch to `View()`[^rectangling-1].
|
||||
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-lists.
|
||||
|
||||
[^rectangling-1]: This is an RStudio feature.
|
||||
|
||||
```{r}
|
||||
#| label: fig-view-collapsed
|
||||
#| fig.cap: >
|
||||
#| The RStudio allows you to interactively explore a complex list.
|
||||
#| The RStudio view lets you interactively explore a complex list.
|
||||
#| The viewer opens showing only the top level of the list.
|
||||
#| fig.alt: >
|
||||
#| A screenshot of RStudio showing the list-viewer. It shows the
|
||||
#| two children of x4: the first child is a double vector and the
|
||||
#| second child is a list. A rightward facing triangle indicates that the
|
||||
#| second child itself has children but you can't see them.
|
||||
#| echo: false
|
||||
#| out-width: NULL
|
||||
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
||||
|
@ -106,6 +111,10 @@ knitr::include_graphics("screenshots/View-1.png", dpi = 220)
|
|||
#| fig.cap: >
|
||||
#| Clicking on the rightward facing triangle expands that component
|
||||
#| of the list so that you can also see its children.
|
||||
#| fig.alt: >
|
||||
#| Another screenshot of the list-viewer having expanded the second
|
||||
#| child of x4. It also has two children, a double vector and another
|
||||
#| list.
|
||||
#| echo: false
|
||||
#| out-width: NULL
|
||||
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
||||
|
@ -115,9 +124,12 @@ knitr::include_graphics("screenshots/View-2.png", dpi = 220)
|
|||
#| label: fig-view-expand-2
|
||||
#| fig.cap: >
|
||||
#| You can repeat this operation as many times as needed to get to the
|
||||
#| data you're interested in. Note the bottom-right corner: if you click
|
||||
#| data you're interested in. Note the bottom-left corner: if you click
|
||||
#| an element of the list, RStudio will give you the subsetting code
|
||||
#| needed to access it, in this case `x4[[2]][[2]][[2]]`.
|
||||
#| fig.alt: >
|
||||
#| Another screenshot, having expanded the grandchild of x4 to see its
|
||||
#| two children, again a double vector and a list.
|
||||
#| echo: false
|
||||
#| out-width: NULL
|
||||
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
|
||||
|
@ -173,11 +185,11 @@ It's possible to put a list in a column of a `data.frame`, but it's a lot fiddli
|
|||
data.frame(x = list(1:3, 3:5))
|
||||
```
|
||||
|
||||
You can force `data.frame()` to treat a list as a list of rows by wrapping it in list `I()`, but the result doesn't print particularly usefully:
|
||||
You can force `data.frame()` to treat a list as a list of rows by wrapping it in `I()`, but the result doesn't print particularly well:
|
||||
|
||||
```{r}
|
||||
data.frame(
|
||||
x = I(list(1:3, 3:5)),
|
||||
x = I(list(1:2, 3:5)),
|
||||
y = c("1, 2", "3, 4, 5")
|
||||
)
|
||||
```
|
||||
|
@ -188,14 +200,12 @@ It's easier to use list-columns with tibbles because `tibble()` treats lists lik
|
|||
## Unnesting
|
||||
|
||||
Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
|
||||
We'll start with very simple sample data so you can get the basic idea, and then switch to more realistic examples in the next section.
|
||||
Here we'll use very simple sample data so you can get the basic idea; in the next section we'll switch to real data.
|
||||
|
||||
List-columns tend to come in two basic forms: named and unnamed.
|
||||
When the children are **named**, they tend to have the same names in every row.
|
||||
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
|
||||
The following code creates an example of each.
|
||||
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
||||
In `df2`, the elements of list-column `y` are unnamed and vary in length.
|
||||
For example, in `df1`, every element of list-column `y` has two elements named `a` and `b`.
|
||||
Named list-columns naturally unnest into columns: each named element becomes a new named column.
|
||||
|
||||
```{r}
|
||||
df1 <- tribble(
|
||||
|
@ -204,6 +214,13 @@ df1 <- tribble(
|
|||
2, list(a = 21, b = 22),
|
||||
3, list(a = 31, b = 32),
|
||||
)
|
||||
```
|
||||
|
||||
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
|
||||
For example, in `df2`, the elements of list-column `y` are unnamed and vary in length from one to three.
|
||||
Unnamed list-columns naturally unnest into rows: you'll get one row for each child.
|
||||
|
||||
```{r}
|
||||
|
||||
df2 <- tribble(
|
||||
~x, ~y,
|
||||
|
@ -213,9 +230,7 @@ df2 <- tribble(
|
|||
)
|
||||
```
|
||||
|
||||
Named list-columns naturally unnest into columns: each named element becomes a new named column.
|
||||
Unnamed list-columns naturally unnested in to rows: you'll get one row for each child.
|
||||
tidyr provides two functions for these two case: `unnest_wider()` and `unnest_longer()`.
|
||||
tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`.
|
||||
The following sections explain how they work.
|
||||
|
||||
### `unnest_wider()`
|
||||
|
@ -227,7 +242,7 @@ df1 |>
|
|||
unnest_wider(y)
|
||||
```
|
||||
|
||||
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
|
||||
By default, the names of the new columns come exclusively from the names of the list elements, but you can use the `names_sep` argument to request that they combine the column name and the element name.
|
||||
This is useful for disambiguating repeated names.
|
||||
|
||||
```{r}
|
||||
|
@ -255,7 +270,7 @@ df2 |>
|
|||
```
|
||||
|
||||
Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
|
||||
But what happens if the list-column is empty, as in the following example?
|
||||
But what happens if one of the elements is empty, as in the following example?
|
||||
|
||||
```{r}
|
||||
df6 <- tribble(
|
||||
|
@ -270,15 +285,15 @@ df6 |> unnest_longer(y)
|
|||
We get zero rows in the output, so the row effectively disappears.
|
||||
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
|
||||
|
||||
You can also unnest named list-columns, like `df1$y` into the rows.
|
||||
Because the elements are named, and those names might be useful data, puts them in a new column with the suffix `_id`:
|
||||
You can also unnest named list-columns, like `df1$y`, into rows.
|
||||
Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:
|
||||
|
||||
```{r}
|
||||
df1 |>
|
||||
unnest_longer(y)
|
||||
```
|
||||
|
||||
If you don't want these `ids`, you can suppress this with `indices_include = FALSE`.
|
||||
If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
|
||||
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
|
||||
You can do this with `indices_include = TRUE`:
|
||||
|
||||
|
@ -311,7 +326,7 @@ df4 |>
|
|||
|
||||
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
|
||||
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
|
||||
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different.
|
||||
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is still a list, even though the contents of each element are a different type.
|
||||
|
||||
What happens if you find this problem in a dataset you're trying to rectangle?
|
||||
There are two basic options.
|
||||
|
@ -328,8 +343,7 @@ Another option would be to filter down to the rows that have values of a specifi
|
|||
```{r}
|
||||
df4 |>
|
||||
unnest_longer(y) |>
|
||||
rowwise() |>
|
||||
filter(is.numeric(y))
|
||||
filter(map_lgl(y, is.numeric))
|
||||
```
|
||||
|
||||
Then you can call `unnest_longer()` once more:
|
||||
|
@ -337,20 +351,21 @@ Then you can call `unnest_longer()` once more:
|
|||
```{r}
|
||||
df4 |>
|
||||
unnest_longer(y) |>
|
||||
rowwise() |>
|
||||
filter(is.numeric(y)) |>
|
||||
filter(map_lgl(y, is.numeric)) |>
|
||||
unnest_longer(y)
|
||||
```
|
||||
|
||||
You'll learn more about `map_lgl()` in @sec-iteration.
|
||||
|
||||
### Other functions
|
||||
|
||||
tidyr has a few other useful rectangling functions that we're not going to cover in this book:
|
||||
|
||||
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
|
||||
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
|
||||
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.
|
||||
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
|
||||
|
||||
These are good to know about when you're other people's code and for tackling rarer rectangling challenges.
|
||||
These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.
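To make the description of `hoist()` a little more concrete, here is a minimal sketch. The tibble `df` and its `meta` list-column are hypothetical data invented for illustration, not one of the chapter's datasets. Each new column is described by a path into the list: either a single name, or a list of names for deeper components.

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: `meta` is a list-column of named lists, one of whose
# components (`owner`) is itself a named list.
df <- tibble(
  id = 1:2,
  meta = list(
    list(name = "a", owner = list(login = "x", id = 10)),
    list(name = "b", owner = list(login = "y", id = 20))
  )
)

# Pull out just two components. `name` sits at the top level of each element;
# `owner_login` is reached by following the path owner -> login.
hoisted <- df |>
  hoist(meta, name = "name", owner_login = list("owner", "login"))
hoisted
```

Whatever was not hoisted stays behind in `meta`, so you can keep pulling out pieces later if you need them.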
|
||||
|
||||
### Exercises
|
||||
|
||||
|
@ -370,13 +385,12 @@ These are good to know about when you're other people's code and for tackling ra
|
|||
|
||||
## Case studies
|
||||
|
||||
So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`.
|
||||
The main difference between real data and these simple examples is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and `unnest_wider()`.
|
||||
This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
|
||||
The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`.
|
||||
This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.
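Before turning to the real datasets, the following sketch shows the general pattern on a small made-up example. The `nested` tibble and its field names (`name`, `repos`, `repo`, `stars`) are assumptions chosen for illustration. Named levels are widened, unnamed levels are lengthened, and the calls are chained until no list-columns remain.

```{r}
library(tidyr)
library(tibble)

# Hypothetical two-level data: each user is a named list, and `repos` is an
# unnamed list of named lists.
nested <- tibble(
  user = list(
    list(
      name = "ann",
      repos = list(list(repo = "a", stars = 1), list(repo = "b", stars = 2))
    ),
    list(
      name = "bob",
      repos = list(list(repo = "c", stars = 3))
    )
  )
)

flat <- nested |>
  unnest_wider(user) |>    # named elements become columns: name, repos
  unnest_longer(repos) |>  # unnamed elements become rows: one per repo
  unnest_wider(repos)      # named elements become columns: repo, stars
flat
```

The order of the calls mirrors the structure of the data from the outside in, which is why inspecting the list first tends to pay off.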
|
||||
|
||||
### Very wide data
|
||||
|
||||
We'll start by exploring `gh_repos`.
|
||||
We'll start with `gh_repos`.
|
||||
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
|
||||
|
||||
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
|
||||
|
@ -389,7 +403,7 @@ repos
|
|||
|
||||
This tibble contains 6 rows, one row for each child of `gh_repos`.
|
||||
Each row contains an unnamed list with either 26 or 30 elements.
|
||||
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:
|
||||
Since these are unnamed, we'll start with `unnest_longer()` to put each child in its own row:
|
||||
|
||||
```{r}
|
||||
repos |>
|
||||
|
@ -437,6 +451,8 @@ repos |>
|
|||
unnest_wider(owner)
|
||||
```
|
||||
|
||||
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
|
||||
|
||||
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
|
||||
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
|
||||
|
||||
|
@ -461,14 +477,14 @@ chars <- tibble(json = got_chars)
|
|||
chars
|
||||
```
|
||||
|
||||
The `json` column contains named values, so we'll start by widening it:
|
||||
The `json` column contains named elements, so we'll start by widening it:
|
||||
|
||||
```{r}
|
||||
chars |>
|
||||
unnest_wider(json)
|
||||
```
|
||||
|
||||
And selecting a few columns just to make it easier to read:
|
||||
And selecting a few columns to make it easier to read:
|
||||
|
||||
```{r}
|
||||
characters <- chars |>
|
||||
|
@ -508,16 +524,15 @@ titles <- chars |>
|
|||
titles
|
||||
```
|
||||
|
||||
Now, for example, we could use this table to all the characters that are captains and see all their titles:
|
||||
Now, for example, we could use this table to find all the characters that are captains and see all their titles:
|
||||
|
||||
```{r}
|
||||
captains <- titles |> filter(str_detect(title, "Captain"))
|
||||
captains
|
||||
|
||||
characters |>
|
||||
semi_join(captains, by = "id") |>
|
||||
select(id, name) |>
|
||||
left_join(titles, by = "id", multiple = "all")
|
||||
inner_join(titles, by = "id", multiple = "all")
|
||||
```
|
||||
|
||||
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
|
||||
|
@ -540,7 +555,7 @@ titles |>
|
|||
unnest_longer(word)
|
||||
```
|
||||
|
||||
And then we can count that column to find the most common:
|
||||
And then we can count that column to find the most common words:
|
||||
|
||||
```{r}
|
||||
titles |>
|
||||
|
@ -680,6 +695,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
|
|||
Why does it work for `got_chars` but might not work in general?
|
||||
|
||||
```{r}
|
||||
#| results: false
|
||||
tibble(json = got_chars) |>
|
||||
unnest_wider(json) |>
|
||||
select(id, where(is.list)) %>%
|
||||
|
@ -699,7 +715,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
|
|||
|
||||
## JSON
|
||||
|
||||
All of the case studies in the previous section were sourced from wild-caught JSON files.
|
||||
All of the case studies in the previous section were sourced from wild-caught JSON.
|
||||
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
|
||||
It's important to understand it because while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, so it's good to understand a bit about JSON if things go wrong.
|
||||
|
||||
|
@ -709,27 +725,28 @@ JSON is a simple format designed to be easily read and written by machines, not
|
|||
It has six key data types.
|
||||
Four of them are scalars:
|
||||
|
||||
- The simplest type is a null, which is written `null`, which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
|
||||
- A **string** is much like a string in R, but must use double quotes, not single quotes.
|
||||
- A **number** is similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
|
||||
- A **boolean** is similar to R's `TRUE` and `FALSE`, but use lower case `true` and `false`.
|
||||
- The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
|
||||
- A **string** is much like a string in R, but must always use double quotes.
|
||||
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
|
||||
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
|
||||
|
||||
JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
|
||||
The main difference is that JSON's scalars can only represent a single value.
|
||||
To represent multiple values you need to use one of the two remaining types, arrays and objects.
|
||||
To represent multiple values you need to use one of the two remaining types: arrays and objects.
|
||||
|
||||
Both arrays and objects are similar to lists in R; the difference is whether or not they're named.
|
||||
An **array** is like an unnamed list, and is written with `[]`.
|
||||
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
|
||||
An **object** is like a named list, and it's written with `{}`.
|
||||
An **object** is like a named list, and is written with `{}`.
|
||||
The names (keys in JSON terminology) are strings, so must be surrounded by quotes.
|
||||
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
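As a quick check of this mapping, the sketch below parses the example strings above with `jsonlite::parse_json()`, the function the next section introduces, and inspects the results with `str()`. It assumes jsonlite's default of not simplifying arrays into vectors, so every array and object comes back as a list.

```{r}
library(jsonlite)

# An array becomes an unnamed list.
arr <- parse_json('[1, 2, 3]')
str(arr)

# An object becomes a named list; the keys become the names.
obj <- parse_json('{"x": 1, "y": 2}')
str(obj)

# A mixed array keeps each scalar's own type, and JSON null becomes NULL.
mixed <- parse_json('[null, 1, "string", false]')
str(mixed)
```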
|
||||
|
||||
### jsonlite
|
||||
|
||||
To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Oooms.
|
||||
To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms.
|
||||
We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
|
||||
In real life, you'll use `read_json()` to read a JSON file from disk.
|
||||
For example, the repurrsive package also provides the source for `gh_user` as a JSON file:
|
||||
For example, the repurrrsive package also provides the source for `gh_user` as a JSON file and you can read it with `read_json()`:
|
||||
|
||||
```{r}
|
||||
# A path to a json file inside the package:
|
||||
|
@ -767,6 +784,7 @@ json <- '[
|
|||
]'
|
||||
df <- tibble(json = parse_json(json))
|
||||
df
|
||||
|
||||
df |>
|
||||
unnest_wider(json)
|
||||
```
|
||||
|
@ -785,6 +803,7 @@ json <- '{
|
|||
'
|
||||
df <- tibble(json = list(parse_json(json)))
|
||||
df
|
||||
|
||||
df |>
|
||||
unnest_wider(json) |>
|
||||
unnest_longer(results) |>
|
||||
|
@ -828,3 +847,13 @@ Apply `readr::parse_double()` as needed to the get correct variable type.
|
|||
df_col <- tibble(json = list(json_col))
|
||||
df_row <- tibble(json = json_row)
|
||||
```
|
||||
|
||||
## Summary
|
||||
|
||||
In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames.
|
||||
Surprisingly we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns.
|
||||
It doesn't matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.
|
||||
|
||||
JSON is the most common data format returned by web APIs.
|
||||
What happens if the website doesn't have an API, but you can see data you want on the website?
|
||||
That's the topic of the next chapter: web scraping, extracting data from HTML webpages.
|
||||
|
|