In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children in rows, and `tidyr::unnest_wider()`, which converts children into columns.
We'll then show you a few case studies, applying these simple function multiple times to solve real complex problems.
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and common format for data exchange on the web.
We'll also use repurrrsive to supply some interesting datasets to practice your rectangling skills, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
As lists get even large and more complex, even `str()` starts to fail, you'll need to switch to `View()`[^rectangle-1].
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
When each row has the same number of elements with the same names, like `df1`, it's natural to put each component into its own column with `unnest_wider()`:
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
We get zero rows in the output, so the row effectively disappears.
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
What happens if you unnest a list-column contains different types of vector?
For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is a still a list, and each component of that list contains something different.
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's a great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so you read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
1. From time-to-time you encounter data frames with multiple list-columns with aligned values.
For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
What happens if you apply two `unnest_longer()` calls to this data frame?
How can you preserve the relationship between `x` and `y`?
So far you've learned about the simplest case of list-columns, where you need only a single call to `unnest_longer()` or `unnest_wider()`.
The main difference between real data and these simple examples, is with real data you'll see multiple levels of nesting.
For example, you might see named list nested inside an unnested list, or an unnamed list nested inside of another unnamed list nested inside a named list.
To handle these case you'll need to chain together multiple calls to `unnest_wider()` and/or `unnest_longer()`.
This section will work through some real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
These challenges share the common feature that they're mostly just a sequence of multiple `unnest_wider()` and/or `unnest_longer()` calls, with a dash of dplyr where needed.
We'll start by exploring `gh_repos` which contains data about some GitHub repositories retrived from the GitHub API. It's a very deeply nested list so it's to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it a tibble.
I call the column call `json` for reasons we'll get to later.
At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6) each element of `json` is still a list.
However, there's an important difference: now each element is a **named** list so we can use `unnamed_wider()` to put each element into its own column:
You can use this to work back to understand `gh_repos`: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
`owner` is another list-column, and since it a contains named list, we can use `unnest_wider()` to get at the values:
You might expect to see this data in its own table because you could then join back to the characters data as needed.
To make this table I'll do a little cleaning; removing the rows contain empty strings and renaming `titles` to `title` since each row now only contains a single title.
There are plenty of sophisticated ways to do this, but one simple way starts by using `str_split()` to break each element of `title` up into words by spitting on `" "`:
We'll finish off this case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
This is a two column tibble containing five city names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:
We'll drop the status column since they're all `OK`; in a real analysis, you'd also want capture all the rows where `status != "OK"` and figure out what went wrong.
`results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:
Now we can see why two cities got two results: Washington matched both the Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.
That gives us new `bounds` (which gives a rectangular region) and the midpoint in `location`, which we can unnest to get latitude (`lat`) and longitude (`lng`):
```{r}
locations |>
select(city, formatted_address, geometry) |>
unnest_wider(geometry) |>
unnest_wider(location)
```
Extracting the bounds requires a few more steps
```{r}
locations |>
select(city, formatted_address, geometry) |>
unnest_wider(geometry) |>
# focus on the variables of interest
select(!location:viewport) |>
unnest_wider(bounds)
```
I then rename `southwest` and `northeast` (the corners of the rectangle) so I can use `names_sep` to create short but evocative names:
If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in `vignette("rectangling", package = "tidyr")`.
JSON, short for javascript object notation, is a data format that grew out of the javascript programming language and has become an extremely common way of representing data.