In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children into rows, and `tidyr::unnest_wider()`, which converts children into columns.
We'll then show you a few case studies, applying these simple functions multiple times to solve complex, real-world problems.
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.
We'll also use repurrrsive to supply some interesting datasets to practice your rectangling skills, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
When each row has the same number of elements with the same names, like `df1`, it's natural to put each component into its own column with `unnest_wider()`:
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
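
As a minimal sketch of both naming behaviours, the following uses a small made-up tibble called `wide_demo` (the name and values are illustrative, and it is not necessarily identical to the `df1` used in this chapter):

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: every element of `y` is a named list with the
# same names (`a` and `b`).
wide_demo <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32)
)

# Default: the new columns are named after the list elements.
wide_plain <- wide_demo |> unnest_wider(y)

# With names_sep: the new columns combine the column name and the
# element names, joined by the separator.
wide_prefixed <- wide_demo |> unnest_wider(y, names_sep = "_")

names(wide_plain)
names(wide_prefixed)
```

The prefixed form is mostly useful when two list-columns share element names, or when the element names alone would be ambiguous.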
We get zero rows in the output, so the row effectively disappears.
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
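
To see the default behaviour in isolation, here is a small sketch using a made-up tibble, `empty_demo`, whose second row holds an empty vector. It only illustrates the default (dropping) behaviour and does not rely on `keep_empty`:

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: the list-column element for row "b" is empty.
empty_demo <- tibble(
  x = c("a", "b", "c"),
  y = list(1:2, integer(), 3L)
)

# Row "b" contributes zero rows, so it is absent from the result.
empty_long <- empty_demo |> unnest_longer(y)
empty_long
```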
What happens if you unnest a list-column that contains different types of vector?
For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though each of those lists contains something different.
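
The following sketch reproduces this situation with a made-up tibble, `mixed_demo`, assuming a recent version of tidyr. The data are illustrative only; the point is that the unnested column stays a list-column when no common type exists:

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: `y` holds two numbers, a factor, and a logical.
mixed_demo <- tribble(
  ~x, ~y,
  "a", list(1),
  "b", list(factor("z"), TRUE, 5)
)

mixed_long <- mixed_demo |> unnest_longer(y)
mixed_long

# Every element is still a list element, but each keeps its own class.
sapply(mixed_long$y, class)
```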
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()`, so you should read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
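
As a brief sketch of the last of these functions, the example below uses `hoist()` on a made-up list-column (`hoist_demo` and its field names are hypothetical) to pull out one top-level field and one nested field by giving the path to each:

```{r}
library(tidyr)
library(tibble)

# Hypothetical data: each element is a named list with a nested `owner`.
hoist_demo <- tibble(
  json = list(
    list(name = "repo-a", stars = 10, owner = list(login = "alice", id = 1)),
    list(name = "repo-b", stars = 3, owner = list(login = "bob", id = 2))
  )
)

hoisted <- hoist_demo |>
  hoist(
    json,
    name = "name",                         # a top-level component
    owner_login = list("owner", "login")   # a path into a nested component
  )
hoisted
```

Anything that isn't hoisted stays behind in the original list-column, so you can come back for it later if needed.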
1. From time to time you encounter data frames with multiple list-columns with aligned values.
For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
What happens if you apply two `unnest_longer()` calls to this data frame?
How can you preserve the relationship between `y` and `z`?
So far you've learned about the simplest case of list-columns, where you need only a single call to `unnest_longer()` or `unnest_wider()`.
The main difference between real data and these simple examples is that with real data you'll see multiple levels of nesting.
For example, you might see a named list nested inside an unnamed list, or an unnamed list nested inside another unnamed list nested inside a named list.
To handle these cases you'll need to chain together multiple calls to `unnest_wider()` and/or `unnest_longer()`.
This section will work through some real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
These challenges share the common feature that they're mostly just a sequence of multiple `unnest_wider()` and/or `unnest_longer()` calls, with a dash of dplyr where needed.
We'll start by exploring `gh_repos`, which contains data about some GitHub repositories retrieved from the GitHub API. It's a very deeply nested list, so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
I call the column `json` for reasons we'll get to later.
At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6), each element of `json` is still a list.
However, there's an important difference: now each element is a **named** list so we can use `unnest_wider()` to put each element into its own column:
You can use this to work back to understand `gh_repos`: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
`owner` is another list-column, and since it contains a named list, we can use `unnest_wider()` to get at the values:
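
The whole sequence can be sketched on a tiny stand-in for `gh_repos`. Here `toy_repos` is a made-up list with the same shape (users, then repositories, then fields), not the real dataset, and `names_sep` is used on the last step because both the repository and its owner have an `id`:

```{r}
library(tidyr)
library(tibble)

# Hypothetical stand-in for `gh_repos`: an unnamed list of users, each an
# unnamed list of repositories, each a named list with a nested `owner`.
toy_repos <- list(
  list(
    list(id = 1, name = "repo-one", owner = list(login = "alice", id = 100)),
    list(id = 2, name = "repo-two", owner = list(login = "alice", id = 100))
  ),
  list(
    list(id = 3, name = "repo-three", owner = list(login = "bob", id = 200))
  )
)

toy_flat <- tibble(json = toy_repos) |>
  unnest_longer(json) |>                 # unnamed list: one row per repository
  unnest_wider(json) |>                  # named list: one column per field
  unnest_wider(owner, names_sep = "_")   # prefix to avoid a duplicate `id`
toy_flat
```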
You might expect to see this data in its own table because you could then join back to the characters data as needed.
To make this table I'll do a little cleaning: removing the rows that contain empty strings and renaming `titles` to `title`, since each row now only contains a single title.
There are plenty of sophisticated ways to do this, but one simple way starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:
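
As a rough sketch of this step, using a made-up `titles_demo` tibble rather than the real characters data: `str_split()` returns a list-column of character vectors, which `unnest_longer()` then turns into one row per word.

```{r}
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# Hypothetical data: one title per row.
titles_demo <- tibble(
  name = c("Person A", "Person B"),
  title = c("Lord of Winterfell", "Hand of the King")
)

title_words <- titles_demo |>
  # str_split() gives a list-column: one character vector per title.
  mutate(word = str_split(title, " "), .keep = "unused") |>
  # One row per word.
  unnest_longer(word)
title_words
```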
We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
This is a two column tibble containing five city names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:
We'll drop the `status` column since the values are all `OK`; in a real analysis, you'd also want to capture all the rows where `status != "OK"` and figure out what went wrong.
`results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:
Now we can see why two cities got two results: Washington matched both the Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.
That gives us new `bounds` (which gives a rectangular region) and the midpoint in `location`, which we can unnest to get latitude (`lat`) and longitude (`lng`):
```{r}
locations |>
  select(city, formatted_address, geometry) |>
  unnest_wider(geometry) |>
  unnest_wider(location)
```
Extracting the bounds requires a few more steps:
```{r}
locations |>
  select(city, formatted_address, geometry) |>
  unnest_wider(geometry) |>
  # focus on the variables of interest
  select(!location:viewport) |>
  unnest_wider(bounds)
```
I then rename `southwest` and `northeast` (the corners of the rectangle) so I can use `names_sep` to create short but evocative names:
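
A minimal sketch of that rename-then-unnest step is shown here on made-up data. `bounds_demo` and its coordinates are hypothetical, and unnesting two columns in one call assumes a recent version of tidyr:

```{r}
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data shaped like the `bounds` component of a geocoding result.
bounds_demo <- tibble(
  city = "Example City",
  bounds = list(
    list(
      northeast = list(lat = 30.5, lng = -97.5),
      southwest = list(lat = 30.0, lng = -98.0)
    )
  )
)

bounds_flat <- bounds_demo |>
  unnest_wider(bounds) |>
  # Shorter names now mean shorter column names after unnesting.
  rename(ne = northeast, sw = southwest) |>
  unnest_wider(c(ne, sw), names_sep = "_")
names(bounds_flat)
```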
If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in `vignette("rectangling", package = "tidyr")`.
All of the case studies in the previous section came from data stored in JSON format.
JSON is short for **j**ava**s**cript **o**bject **n**otation and is the way that most web APIs return data.
In this section, you'll learn a little more about JSON and how to read it into R; once you've done that you can use the rectangling tools described above to get it into a data frame for further analysis.
- The simplest type is `null`, which is equivalent to both `NULL` and `NA` in R. It represents the absence of data.
- Strings are written much like in R, but can only use double quotes, not single quotes.
- Numbers are similar to R's numbers: they can be integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3). JSON doesn't support Inf, -Inf, or NaN.
- Booleans are similar to R's logical vectors, but use `true` and `false` instead of `TRUE` and `FALSE`.
JSON represents more complex data by nesting into arrays and objects.
An array is like an unnamed list in R, and is written with `[]`.
For example, `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
Objects are like a named list in R and are written with `{}`.
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
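
To make the correspondence concrete, here is a small sketch that parses one array and one object from strings with `jsonlite::parse_json()`. The variable names are only for illustration:

```{r}
library(jsonlite)

# An array becomes an unnamed list ...
json_array <- parse_json("[1, 2, 3]")
# ... and an object becomes a named list.
json_object <- parse_json('{"x": 1, "y": 2}')

str(json_array)
str(json_object)
```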
Most of the time you won't deal with JSON directly; instead you'll use the jsonlite package, by Jeroen Ooms, to load it into R as a nested list.
We'll focus on two functions from jsonlite.
Most of the time you'll use `read_json()` to read a JSON file from disk, but sometimes you'll also need `parse_json()`, which takes JSON stored in a string in R.
Note that these functions have an important difference to `fromJSON()` --- they set the default value of `simplifyVector = FALSE`.
`fromJSON()` uses `simplifyVector = TRUE`, which attempts to automatically unnest the JSON into a data frame.
This can work well for simple cases[^rectangling-2], but we think you're better off doing the simplification yourself so you know exactly what's happening and can more easily handle arbitrarily complicated structures.
[^rectangling-2]: Doing it yourself also means you'll use the standard tidyverse rules for recycling and vector coercion.
There's nothing wrong with jsonlite's rules, but they're different and we don't want to get into the details here.
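
The difference between the two defaults can be seen in a short sketch that feeds the same made-up JSON string (`people_json`, illustrative only) to both functions:

```{r}
library(jsonlite)

people_json <- '[{"name": "John", "age": 34}, {"name": "Susan", "age": 27}]'

# parse_json() leaves the structure as a nested list ...
people_list <- parse_json(people_json)
# ... whereas fromJSON() simplifies the same text to a data frame by default.
people_df <- fromJSON(people_json)

class(people_list)
class(people_df)
```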
Note that the rectangling approach described above is designed around the most common case where the API returns multiple "things", e.g. multiple pages, or multiple records, or multiple results.
In this case, you just do `tibble(json)` and each element becomes a row.
If the JSON returns a single "thing", then you'll need to do `tibble(json = list(json))` so you start with a data frame containing a single row.
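
Both starting points can be sketched as follows, using two small made-up JSON strings. The field names `status` and `results` are only illustrative:

```{r}
library(jsonlite)
library(tidyr)
library(tibble)

# Multiple "things": each element of the parsed list becomes a row.
many <- parse_json('[
  {"name": "John", "age": 34},
  {"name": "Susan", "age": 27}
]')
many_df <- tibble(json = many) |>
  unnest_wider(json)

# A single "thing": wrap it in list() so the tibble starts with one row.
single <- parse_json('{
  "status": "OK",
  "results": [
    {"name": "John", "age": 34},
    {"name": "Susan", "age": 27}
  ]
}')
single_df <- tibble(json = list(single)) |>
  unnest_wider(json) |>
  unnest_longer(results) |>
  unnest_wider(results)

many_df
single_df
```

In the second case the top-level fields (here `status`) are repeated on every row once `results` has been unnested into rows.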
There isn't a perfect match between JSON's data types and R's data types.
So when reading a JSON file into R, we have to make some assumptions:
- Inside an array, `null` is translated to `NA`, so `[true, null, false]` is translated to `c(TRUE, NA, FALSE)` but `{"x": null}` is translated to `list(x = NULL)`.
- JSON doesn't have any way to represent dates or date-times, so they're normally stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
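
These translations can be checked with a short sketch. It assumes that the atomic vector in the first case comes from jsonlite's simplification (as performed by `fromJSON()`), and the date strings are made up for illustration:

```{r}
library(jsonlite)
library(readr)

# With simplification, a null inside an array becomes NA ...
null_in_array <- fromJSON("[true, null, false]")
# ... while a null value inside an object becomes NULL in the parsed list.
null_in_object <- parse_json('{"x": null}')

# ISO 8601 strings must be parsed explicitly to become dates or date-times.
a_date <- parse_date("2023-01-15")
a_datetime <- parse_datetime("2023-01-15T10:30:00Z")

null_in_array
str(null_in_object)
class(a_date)
class(a_datetime)
```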