In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.
To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children into rows, and `tidyr::unnest_wider()`, which converts children into columns.
In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.
@fig-view-collapsed shows the result of calling `View(x4)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.
List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a tibble.
In particular, list-columns are used a lot in the [tidymodels](https://www.tidymodels.org) ecosystem, because they allow you to store things like models or resamples in a data frame.
It's easier to use list-columns with tibbles because `tibble()` treats lists like vectors and the print method has been designed with lists in mind.
When each row has the same number of elements with the same names, like `df1`, it's natural to put each component into its own column with `unnest_wider()`:
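Here's a minimal sketch of what a data frame like `df1` might look like (the values are made up for illustration), along with the result of widening it:

```{r}
library(tidyverse)

df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32)
)

# Each named element of y becomes its own column
df1 |> unnest_wider(y)
```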
By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
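For example, with the hypothetical `df1` from above:

```{r}
# Produces columns y_a and y_b instead of a and b
df1 |> unnest_wider(y, names_sep = "_")
```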
We get zero rows in the output, so the row effectively disappears.
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
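Here's a minimal sketch of the disappearing row (`df_empty` is a made-up example):

```{r}
df_empty <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list()
)

# "b" has no elements to turn into rows, so it vanishes from the output
df_empty |> unnest_longer(y)
```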
What happens if you unnest a list-column that contains different types of vector?
For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
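Here's a sketch of such a dataset (the specific values are invented), along with the result of lengthening it:

```{r}
df_mixed <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list(factor("low")),
  "c", list(TRUE)
)

df_mixed |> unnest_longer(y)
```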
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite, because every element is still a list, and each component of that list contains something different.
- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.
These are good to know about when you're reading other people's code and for tackling rarer rectangling challenges.
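For example, here's a minimal sketch of `hoist()` on a made-up list-column (the `metadata`, `species`, and `films` names are invented):

```{r}
df <- tribble(
  ~character, ~metadata,
  "Luke", list(species = "Human", films = list("A New Hope", "The Empire Strikes Back")),
  "Leia", list(species = "Human", films = list("A New Hope"))
)

# Pull species and the first film out into their own columns;
# the remaining components stay behind in metadata
df |> hoist(
  metadata,
  "species",
  first_film = list("films", 1)
)
```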
1. From time to time you encounter data frames with multiple list-columns with aligned values.
For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
What happens if you apply two `unnest_longer()` calls to this data frame?
How can you preserve the relationship between `y` and `z`?
So far you've learned about the simplest case of list-columns, where rectangling only requires a single call to `unnest_longer()` or `unnest_wider()`.
The main difference between real data and these simple examples is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and `unnest_wider()`.
This section will work through four real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
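Here's a sketch of a natural first step: wrap the list in a tibble, then use `unnest_longer()` so that each repository gets its own row (`repos` is just an arbitrary name for the intermediate tibble):

```{r}
library(repurrrsive)

repos <- tibble(json = gh_repos)
repos |> 
  unnest_longer(json)
```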
At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6), each element of `json` is still a list.
However, there's an important difference: now each element is a **named** list, so we can use `unnest_wider()` to put each element into its own column.
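Continuing the sketch from above, that looks roughly like this:

```{r}
repos |> 
  unnest_longer(json) |> 
  unnest_wider(json)
```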
You can use this to work back to understand how `gh_repos` was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.
You might expect to see this data in its own table because it would be easy to join to the characters data as needed.
To do so, we'll do a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.
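Here's a sketch of that cleaning, assuming a `titles` data frame built by widening `got_chars`, keeping each character's `id`, and unnesting their `titles` into rows:

```{r}
titles <- tibble(json = got_chars) |> 
  unnest_wider(json) |> 
  select(id, titles) |> 
  unnest_longer(titles)

titles |> 
  filter(titles != "") |> 
  rename(title = titles)
```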
We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
This is a two-column tibble containing five city names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location.
We'll drop the `status` column since every value is `OK`; in a real analysis, you'd also want to capture all the rows where `status != "OK"` and figure out what went wrong.
`results` is an unnamed list, with either one or two elements (we'll see why shortly), so we'll unnest it into rows.
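Here's a sketch of those steps, also widening each result so that the matched addresses are visible:

```{r}
gmaps_cities |> 
  unnest_wider(json) |> 
  select(-status) |> 
  unnest_longer(results) |> 
  unnest_wider(results)
```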
Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.
If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in `vignette("rectangling", package = "tidyr")`.
It's worth understanding a little about JSON because, while JSON and R's data types are pretty similar, there isn't a perfect 1-to-1 mapping, and knowing where they differ will help you diagnose problems when things go wrong.
- The simplest type is a **null** (written `null`), which plays the same role as both `NULL` and `NA` in R: it represents the absence of data.
- A **string** is much like a string in R, but must use double quotes, not single quotes.
- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support `Inf`, `-Inf`, or `NaN`.
- A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.
For example `[1, 2, 3]` is an array containing 3 numbers, and `[null, 1, "string", false]` is an array that contains a null, a number, a string, and a boolean.
To convert JSON into R data structures, we recommend that you use the jsonlite package, by Jeroen Ooms.
We'll use only two jsonlite functions: `read_json()` and `parse_json()`.
In real life, you'll use `read_json()` to read a JSON file from disk.
For example, the repurrrsive package also provides the source for `gh_users` as a JSON file:
```{r}
# A path to a json file inside the package:
gh_users_json()
# Read it with read_json()
gh_users2 <- read_json(gh_users_json())
# Check it's the same as the data we were using previously
identical(gh_users, gh_users2)
```
In this book, we'll also use `parse_json()`, since it takes a string containing JSON, which makes it good for generating simple examples.
To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object.
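Here's a sketch of those three, using `str()` to show the structure of what `parse_json()` returns:

```{r}
str(parse_json('1'))
str(parse_json('[1, 2, 3]'))
str(parse_json('{"x": [1, 2, 3]}'))
```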
jsonlite has another important function called `fromJSON()`.
We don't use it here because it performs automatic simplification (`simplifyVector = TRUE`).
This often works well, particularly in simple cases, but we think you're better off doing the rectangling yourself so you know exactly what's happening and can more easily handle the most complicated nested structures.
In most cases, JSON files contain a single top-level array, because they're designed to provide data about multiple "things", e.g. multiple pages, or multiple records, or multiple results.
In this case, you'll start your rectangling with `tibble(json)` so that each element becomes a row, as in the sketch below.
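The `name` and `age` fields in this sketch are made up for illustration:

```{r}
json <- parse_json('
  [
    {"name": "John", "age": 34},
    {"name": "Susan", "age": 27}
  ]
')

df <- tibble(json)
df |> 
  unnest_wider(json)
```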
Since JSON doesn't have any way to represent dates or date-times, they're often stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
Similarly, JSON's rules for representing floating point numbers are a little imprecise, so you'll also sometimes find numbers stored in strings.
Apply `readr::parse_double()` as needed to get the correct variable type.
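For example, here's a sketch with made-up string columns of both kinds:

```{r}
df <- tibble(
  created = c("2023-05-01T10:30:00Z", "2023-06-15T08:00:00Z"),
  score = c("1.5", "2.25")
)

df |> mutate(
  created = readr::parse_datetime(created),
  score = readr::parse_double(score)
)
```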