# Data rectangling {#sec-rectangling}

```{r}
#| results: "asis"
#| echo: false
source("_common.R")
status("drafting")
```

## Introduction

In this chapter, you'll learn the art of data **rectangling**: taking data that is fundamentally tree-like and converting it into a rectangular data frame made up of rows and columns.
This is important because hierarchical data is surprisingly common, especially when working with data that comes from a web API.

To learn about rectangling, you'll first learn about lists, the data structure that makes hierarchical data possible in R.
Then you'll learn about two crucial tidyr functions: `tidyr::unnest_longer()`, which converts children into rows, and `tidyr::unnest_wider()`, which converts children into columns.
We'll then show you a few case studies, applying these simple functions multiple times to solve real complex problems.
We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.

### Prerequisites

In this chapter we'll continue using tidyr.
We'll also use repurrrsive to supply some interesting datasets to practice your rectangling skills, and we'll finish up with a little jsonlite, which we'll use to read JSON files into R lists.

```{r}
#| label: setup
#| message: false

library(tidyverse)
library(repurrrsive)
library(jsonlite)
```

## Lists

So far we've used simple vectors like integers, numbers, characters, date-times, and factors.
These vectors are simple because they're homogeneous: every element is the same type.
If you want to store elements of different types, you need a **list**, which you create with `list()`:

```{r}
x1 <- list(1:4, "a", TRUE)
x1
```

It's often convenient to name the components, or **children**, of a list, which you can do in the same way as naming the columns of a tibble:

```{r}
x2 <- list(a = 1:2, b = 1:3, c = 1:4)
x2
```

Even for these very simple lists, printing takes up quite a lot of space.
A useful alternative is `str()`, which generates a compact display of the **str**ucture, de-emphasizing the contents:

```{r}
str(x1)
str(x2)
```

As you can see, `str()` displays each child on its own line.
It displays the name, if present, then an abbreviation of the type, then the first few values.

### Hierarchy

Lists can contain any type of object, including other lists.
This makes them suitable for representing hierarchical or tree-like structures:

```{r}
x3 <- list(list(1, 2), list(3, 4))
str(x3)
```

This is notably different to `c()`, which generates a flat vector:

```{r}
c(c(1, 2), c(3, 4))

x4 <- c(list(1, 2), list(3, 4))
str(x4)
```

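One quick way to confirm the difference is to compare lengths.
The sketch below rebuilds the same two objects as `x3` and `x4` under the illustrative names `nested` and `flat`, so that it can be run on its own:

```{r}
# Same values as x3 and x4, under illustrative names
nested <- list(list(1, 2), list(3, 4))
flat <- c(list(1, 2), list(3, 4))

length(nested) # two children, each of which is itself a list
length(flat)   # four children, all at the top level
```
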
As lists get more complex, `str()` gets more useful, as it lets you see the hierarchy at a glance:

```{r}
x5 <- list(1, list(2, list(3, list(4, list(5)))))
str(x5)
```

As lists get even larger and more complex, `str()` eventually starts to fail, and you'll need to switch to `View()`[^rectangling-1].
@fig-view-collapsed shows the result of calling `View(x5)`. The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in @fig-view-expand-1. RStudio will also show you the code you need to access that element, as in @fig-view-expand-2. We'll come back to how this code works in @sec-vector-subsetting.

[^rectangling-1]: This is an RStudio feature.

```{r}
#| label: fig-view-collapsed
#| fig.cap: >
#|   The RStudio view lets you interactively explore a complex list.
#|   The viewer opens showing only the top level of the list.
#| echo: false
#| out-width: NULL
knitr::include_graphics("screenshots/View-1.png", dpi = 220)
```

```{r}
#| label: fig-view-expand-1
#| fig.cap: >
#|   Clicking on the rightward facing triangle expands that component
#|   of the list so that you can also see its children.
#| echo: false
#| out-width: NULL
knitr::include_graphics("screenshots/View-2.png", dpi = 220)
```

```{r}
#| label: fig-view-expand-2
#| fig.cap: >
#|   You can repeat this operation as many times as needed to get to the
#|   data you're interested in. Note the bottom-right corner: if you click
#|   an element of the list, RStudio will give you the subsetting code
#|   needed to access it, in this case `x5[[2]][[2]][[2]]`.
#| echo: false
#| out-width: NULL
knitr::include_graphics("screenshots/View-3.png", dpi = 220)
```

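Subsetting code of this kind can also be run directly, without opening the viewer.
This sketch recreates the deeply nested `x5` so that it stands alone; each `[[2]]` steps one level further into the hierarchy:

```{r}
x5 <- list(1, list(2, list(3, list(4, list(5)))))

# Each [[2]] selects the second child, moving one level down
str(x5[[2]][[2]][[2]])
```
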
### List-columns

Lists can also live inside a tibble, where we call them list-columns.
List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a data frame.
List-columns are used a lot in the tidymodels ecosystem, because they allow you to store things like models or resamples in a data frame.

Here's a simple example of a list-column:

```{r}
df <- tibble(
  x = 1:2,
  y = c("a", "b"),
  z = list(list(1, 2), list(3, 4, 5))
)
df
```

There's nothing special about lists in a tibble; they behave like any other column:

```{r}
df |>
  filter(x == 1)
```

Computing with them is harder, but that's because computing with lists is harder; we'll come back to that in @sec-iteration.

The default print method just displays a rough summary of the contents.
The list column could be arbitrarily complex, so there's no good way to print it.
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:

```{r}
df |>
  filter(x == 1) |>
  pull(z) |>
  str()
```

Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
To explore those fields you'll need to `pull()` and view, e.g. `View(pull(df, z))`.

::: callout-note
## Base R

It's possible to put a list in a column of a `data.frame`, but it's a lot fiddlier because `data.frame()` treats a list as a list of columns:

```{r}
data.frame(x = list(1:3, 3:5))
```

You can prevent `data.frame()` from doing this with `I()`, but the result doesn't print particularly informatively:

```{r}
data.frame(
  x = I(list(1:3, 3:5)),
  y = c("1, 2", "3, 4, 5")
)
```

Tibbles make it easier to work with list-columns because `tibble()` doesn't modify its inputs and the print method is designed with lists in mind.
:::

## Unnesting

Now that you've learned the basics of lists and list-columns, let's explore how you can turn them back into regular rows and columns.
We'll start with very simple sample data so you can get the basic idea, and then in the next section switch to more realistic examples.

List-columns tend to come in two basic forms: named and unnamed.
When the children are **named**, they tend to have the same names in every row.
When the children are **unnamed**, the number of elements tends to vary from row-to-row.
The following code creates an example of each.
In `df1`, every element of list-column `y` has two elements named `a` and `b`.
In `df2`, the elements of list-column `y` are unnamed and vary in length.

```{r}
df1 <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)
```

Named list-columns naturally unnest into columns: each named element becomes a new named column.
Unnamed list-columns naturally unnest into rows: you'll get one row for each child.
tidyr provides two functions for these two cases: `unnest_wider()` and `unnest_longer()`.
The following sections explain how they work.

### `unnest_wider()`

When each row has the same number of elements with the same names, like `df1`, it's natural to put each component into its own column with `unnest_wider()`:

```{r}
df1 |>
  unnest_wider(y)
```

By default, the names of the new columns come exclusively from the names of the list, but you can use the `names_sep` argument to request that they combine the column name and the list names.
This is useful for disambiguating repeated names.

```{r}
df1 |>
  unnest_wider(y, names_sep = "_")
```

We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
Since columns require names but the list lacks them, `unnest_wider()` will label them with consecutive integers:

```{r}
df2 |>
  unnest_wider(y, names_sep = "_")
```

You'll notice that `unnest_wider()`, much like `pivot_wider()`, turns implicit missing values into explicit missing values.

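To make those explicit missing values concrete, this sketch counts the `NA`s in each new column.
It rebuilds the unnamed example under the illustrative name `df_unnamed` so that it can be run on its own:

```{r}
library(tidyr)
library(dplyr)

df_unnamed <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

# Rows with fewer than three children are padded with NA
na_counts <- df_unnamed |>
  unnest_wider(y, names_sep = "_") |>
  summarize(across(starts_with("y_"), \(col) sum(is.na(col))))
na_counts
```

The shorter a row's list, the more of its trailing columns are filled with `NA`.
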
### `unnest_longer()`

When each row contains an unnamed list, it's most natural to put each element into its own row with `unnest_longer()`:

```{r}
df2 |>
  unnest_longer(y)
```

Note how `x` is duplicated for each element inside of `y`: we get one row of output for each element inside the list-column.
But what happens if the list-column is empty, as in the following example?

```{r}
df6 <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list(3),
  "c", list()
)
df6 |> unnest_longer(y)
```

We get zero rows in the output, so the row effectively disappears.
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.

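As a sketch of what that call looks like, the code below applies `keep_empty = TRUE` to the same data under the illustrative name `df_empty`.
Whether the empty row is actually retained depends on whether your installed tidyr includes the fix for the issue above, so treat the result as something to check rather than rely on:

```{r}
library(tidyr)

df_empty <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list(3),
  "c", list()
)

dropped <- df_empty |> unnest_longer(y)
kept <- df_empty |> unnest_longer(y, keep_empty = TRUE)

# If the fix is present, `kept` has one more row than `dropped`
nrow(dropped)
nrow(kept)
```
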
You can also unnest named list-columns, like `df1$y`, into rows.
Because the elements are named, and those names might be useful data, `unnest_longer()` puts them in a new column with the suffix `_id`:

```{r}
df1 |>
  unnest_longer(y)
```

If you don't want these `ids`, you can suppress this with `indices_include = FALSE`.
On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
You can do this with `indices_include = TRUE`:

```{r}
df2 |>
  unnest_longer(y, indices_include = TRUE)
```

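If the default column names aren't what you want, `unnest_longer()` also has `values_to` and `indices_to` arguments for naming the value and index columns.
A minimal sketch, using a copy of the named example under the illustrative name `df_named`; the column names `value` and `name` are arbitrary choices:

```{r}
library(tidyr)

df_named <- tribble(
  ~x, ~y,
  1, list(a = 11, b = 12),
  2, list(a = 21, b = 22),
  3, list(a = 31, b = 32),
)

# Name the value and index columns instead of accepting y and y_id
long <- df_named |>
  unnest_longer(y, values_to = "value", indices_to = "name")
long
```
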
### Inconsistent types

What happens if you unnest a list-column that contains different types of vector?
For example, take the following dataset where the list-column `y` contains two numbers, a character, a factor, and a logical, which can't normally be mixed in a single column.

```{r}
df4 <- tribble(
  ~x, ~y,
  "a", list(1, "a"),
  "b", list(TRUE, factor("a"), 5)
)
```

`unnest_longer()` always keeps the set of columns unchanged, while changing the number of rows.
So what happens?
How does `unnest_longer()` produce five rows while keeping everything in `y`?

```{r}
df4 |>
  unnest_longer(y)
```

As you can see, the output contains a list-column, but every element of the list-column contains a single element.
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
You might wonder if this breaks the commandment that every element of a column must be the same type.
It doesn't quite, because every element is still a list, even though each of those lists contains something different.

What happens if you find this problem in a dataset you're trying to rectangle?
I think there are two basic options.
You could use the `transform` argument to coerce all inputs to a common type.
It's not particularly useful here because there's only really one class that all five values can be converted to: character.

```{r}
df4 |>
  unnest_longer(y, transform = as.character)
```

Another option would be to filter down to the rows that have values of a specific type:

```{r}
df4 |>
  unnest_longer(y) |>
  rowwise() |>
  filter(is.numeric(y))
```

Then you can call `unnest_longer()` once more:

```{r}
df4 |>
  unnest_longer(y) |>
  rowwise() |>
  filter(is.numeric(y)) |>
  unnest_longer(y)
```

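An alternative sketch that avoids `rowwise()` uses `purrr::map_lgl()` to test each element of the list-column.
This is not the approach used above, just another way to express the same filter; `df_mixed` is an illustrative copy of the mixed-type data:

```{r}
library(tidyr)
library(dplyr)
library(purrr)

df_mixed <- tribble(
  ~x, ~y,
  "a", list(1, "a"),
  "b", list(TRUE, factor("a"), 5)
)

numbers <- df_mixed |>
  unnest_longer(y) |>
  # y is still a list-column here, so test each element in turn
  filter(map_lgl(y, is.numeric)) |>
  # the remaining elements share a type, so y simplifies to a vector
  unnest_longer(y)
numbers
```
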
### Other functions

tidyr has a few other useful rectangling functions that we're not going to cover in this book:

- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but I think it's ultimately a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
- `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which we don't see in this book.
- `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()`, so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.

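Although list-columns of data frames don't appear in this book, a small sketch shows what `unnest()` does with one.
The tibble `nested_df` and its contents are made up for illustration:

```{r}
library(tidyr)
library(tibble)

nested_df <- tibble(
  g = c("first", "second"),
  data = list(
    tibble(a = 1:2, b = c("x", "y")),
    tibble(a = 3L, b = "z")
  )
)

# Each inner tibble contributes both its rows and its columns
flat_df <- nested_df |> unnest(data)
flat_df
```
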
### Exercises

1. From time-to-time you encounter data frames with multiple list-columns with aligned values.
    For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
    What happens if you apply two `unnest_longer()` calls to this data frame?
    How can you preserve the relationship between `y` and `z`?
    (Hint: carefully read the docs).

    ```{r}
    df4 <- tribble(
      ~x, ~y, ~z,
      "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
      "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
    )
    ```

## Case studies

So far you've learned about the simplest case of list-columns, where you need only a single call to `unnest_longer()` or `unnest_wider()`.
The main difference between real data and these simple examples is that real data typically has multiple levels of nesting.
For example, you might see a named list nested inside an unnamed list, or an unnamed list nested inside of another unnamed list nested inside a named list.
To handle these cases you'll need to chain together multiple calls to `unnest_wider()` and/or `unnest_longer()`.

This section will work through some real rectangling challenges using datasets from the repurrrsive package that are inspired by datasets that we've encountered in the wild.
These challenges share the common feature that they're mostly just a sequence of multiple `unnest_wider()` and/or `unnest_longer()` calls, with a dash of dplyr where needed.

### Very wide data

We'll start by exploring `gh_repos` which contains data about some GitHub repositories retrieved from the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.

`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
I call the column `json` for reasons we'll get to later.

```{r}
repos <- tibble(json = gh_repos)
repos
```

This tibble contains 6 rows, one row for each child of `gh_repos`.
Each row contains an unnamed list with either 26 or 30 elements.
Since these are unnamed, we'll start with an `unnest_longer()` to put each child in its own row:

```{r}
repos |>
  unnest_longer(json)
```

At first glance, it might seem like we haven't improved the situation: while we have more rows (176 instead of 6) each element of `json` is still a list.
However, there's an important difference: now each element is a **named** list so we can use `unnest_wider()` to put each element into its own column:

```{r}
repos |>
  unnest_longer(json) |>
  unnest_wider(json)
```

This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them!
We can see them all with `names()`:

```{r}
repos |>
  unnest_longer(json) |>
  unnest_wider(json) |>
  names()
```

Let's select a few that look interesting:

```{r}
repos |>
  unnest_longer(json) |>
  unnest_wider(json) |>
  select(id, full_name, owner, description)
```

You can use this to work back to understand `gh_repos`: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.

`owner` is another list-column, and since it contains a named list, we can use `unnest_wider()` to get at the values:

```{r}
#| error: true
repos |>
  unnest_longer(json) |>
  unnest_wider(json) |>
  select(id, full_name, owner, description) |>
  unnest_wider(owner)
```

Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
Rather than following the advice to use `names_repair` (which would also work), I'll instead use `names_sep`:

```{r}
repos |>
  unnest_longer(json) |>
  unnest_wider(json) |>
  select(id, full_name, owner, description) |>
  unnest_wider(owner, names_sep = "_")
```

This gives another wide dataset, but you can see that `owner` appears to contain a lot of additional data about the person who "owns" the repository.

### Relational data

When you get nested data, it's not uncommon for it to contain data that we'd normally spread out into multiple data frames.
Take `got_chars`, for example.
Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:

```{r}
chars <- tibble(json = got_chars)
chars
```

The `json` column contains named values, so we'll start by widening it:

```{r}
chars |>
  unnest_wider(json)
```

And selecting a few columns just to make it easier to read:

```{r}
characters <- chars |>
  unnest_wider(json) |>
  select(id, name, gender, culture, born, died, alive)
characters
```

There are also many list-columns:

```{r}
chars |>
  unnest_wider(json) |>
  select(id, where(is.list))
```

Let's explore the `titles` column.
It's an unnamed list-column, so we'll unnest it into rows:

```{r}
chars |>
  unnest_wider(json) |>
  select(id, titles) |>
  unnest_longer(titles)
```

You might expect to see this data in its own table because you could then join back to the characters data as needed.
To make this table I'll do a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.

```{r}
titles <- chars |>
  unnest_wider(json) |>
  select(id, titles) |>
  unnest_longer(titles) |>
  filter(titles != "") |>
  rename(title = titles)
titles
```

Now, for example, we could use this table to find all the characters that are captains and see all their titles:

```{r}
captains <- titles |> filter(str_detect(title, "Captain"))
captains

characters |>
  semi_join(captains) |>
  select(id, name) |>
  left_join(titles)
```

You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.

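Joins written without a `by` argument rely on dplyr guessing the shared key and reporting it in a message.
As a hedged sketch of the same pattern with the key spelled out, the following uses two small made-up tables (`toy_characters` and `toy_titles`, not the real `got_chars` data) and `join_by()`, which assumes dplyr 1.1.0 or later:

```{r}
library(dplyr)
library(stringr)

# Made-up data standing in for the characters and titles tables
toy_characters <- tibble(id = 1:3, name = c("Ann", "Bo", "Cy"))
toy_titles <- tibble(
  id = c(1L, 1L, 2L, 3L),
  title = c("Captain of the Guard", "Ser", "Maester", "Lord Captain")
)

toy_captains <- toy_titles |> filter(str_detect(title, "Captain"))

# Keep characters with a captain title, then attach all of their titles
captain_titles <- toy_characters |>
  semi_join(toy_captains, by = join_by(id)) |>
  left_join(toy_titles, by = join_by(id))
captain_titles
```
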
### A dash of text analysis

What if we wanted to find the most common words in the title?
There are plenty of sophisticated ways to do this, but one simple way starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:

```{r}
titles |>
  mutate(word = str_split(title, " "), .keep = "unused")
```

This creates an unnamed, variable-length list-column, so we can use `unnest_longer()`:

```{r}
titles |>
  mutate(word = str_split(title, " "), .keep = "unused") |>
  unnest_longer(word)
```

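tidyr 1.3.0 and later also provide `separate_longer_delim()`, which performs the split and the unnest in one step.
This is a sketch of an alternative rather than the approach used in this chapter, shown on a made-up tibble called `title_demo`:

```{r}
library(tidyr)

title_demo <- tibble(
  id = 1:2,
  title = c("Lord of the Tides", "Ser")
)

# One row per space-separated piece; the column keeps its name
title_words <- title_demo |>
  separate_longer_delim(title, delim = " ")
title_words
```
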
And then we can count that column to find the most common:

```{r}
titles |>
  mutate(word = str_split(title, " "), .keep = "unused") |>
  unnest_longer(word) |>
  count(word, sort = TRUE)
```

Some of those words are not very interesting so we could create a list of common words to drop.
In text analysis these are commonly called stop words.

```{r}
stop_words <- tribble(
  ~ word,
  "of",
  "the"
)

titles |>
  mutate(word = str_split(title, " "), .keep = "unused") |>
  unnest_longer(word) |>
  anti_join(stop_words) |>
  count(word, sort = TRUE)
```

Breaking up text into individual fragments is a powerful idea that underlies much of text analysis.
If this sounds interesting, I'd recommend reading [Text Mining with R](https://www.tidytextmining.com) by Julia Silge and David Robinson.

### Deeply nested

We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
This is a two-column tibble containing five city names and the results of using Google's [geocoding API](https://developers.google.com/maps/documentation/geocoding) to determine their location:

```{r}
gmaps_cities
```

`json` is a list-column with internal names, so we start with an `unnest_wider()`:

```{r}
gmaps_cities |>
  unnest_wider(json)
```

This gives us the `status` and the `results`.
We'll drop the status column since they're all `OK`; in a real analysis, you'd also want to capture all the rows where `status != "OK"` and figure out what went wrong.
`results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:

```{r}
gmaps_cities |>
  unnest_wider(json) |>
  select(-status) |>
  unnest_longer(results)
```

Now `results` is a named list, so we'll use `unnest_wider()`:

```{r}
locations <- gmaps_cities |>
  unnest_wider(json) |>
  select(-status) |>
  unnest_longer(results) |>
  unnest_wider(results)
locations
```

Now we can see why two cities got two results: Washington matched both the Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.

There are a few different places we could go from here.
We might want to determine the exact location of the match, which is stored in the `geometry` list-column:

```{r}
locations |>
  select(city, formatted_address, geometry) |>
  unnest_wider(geometry)
```

That gives us new `bounds` (which gives a rectangular region) and the midpoint in `location`, which we can unnest to get latitude (`lat`) and longitude (`lng`):

```{r}
locations |>
  select(city, formatted_address, geometry) |>
  unnest_wider(geometry) |>
  unnest_wider(location)
```

Extracting the bounds requires a few more steps:

```{r}
locations |>
  select(city, formatted_address, geometry) |>
  unnest_wider(geometry) |>
  # focus on the variables of interest
  select(!location:viewport) |>
  unnest_wider(bounds)
```

I then rename `southwest` and `northeast` (the corners of the rectangle) so I can use `names_sep` to create short but evocative names:

```{r}
locations |>
  select(city, formatted_address, geometry) |>
  unnest_wider(geometry) |>
  select(!location:viewport) |>
  unnest_wider(bounds) |>
  rename(ne = northeast, sw = southwest) |>
  unnest_wider(c(ne, sw), names_sep = "_")
```

Note that I unnest the two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.

This is one place where `hoist()`, mentioned briefly above, can be useful.
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:

```{r}
locations |>
  select(city, formatted_address, geometry) |>
  hoist(
    geometry,
    ne_lat = c("bounds", "northeast", "lat"),
    sw_lat = c("bounds", "southwest", "lat"),
    ne_lng = c("bounds", "northeast", "lng"),
    sw_lng = c("bounds", "southwest", "lng"),
  )
```

If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in `vignette("rectangling", package = "tidyr")`.

### Exercises

1. Roughly estimate when `gh_repos` was created.
    Why can you only roughly estimate the date?

2. The `owner` column of `gh_repos` contains a lot of duplicated information because each owner can have many repos.
    Can you construct an `owners` data frame that contains one row for each owner?
    (Hint: does `distinct()` work with `list-cols`?)

3. Explain the following code line-by-line.
    Why is it interesting?
    Why does it work for this dataset but might not work in general?

    ```{r}
    tibble(json = got_chars) |>
      unnest_wider(json) |>
      select(id, where(is.list)) |>
      pivot_longer(
        where(is.list),
        names_to = "name",
        values_to = "value"
      ) |>
      unnest_longer(value)
    ```

4. In `gmaps_cities`, what does `address_components` contain?
    Why does the length vary between rows?
    Unnest it appropriately to figure it out.
    (Hint: `types` always appears to contain two elements. Does `unnest_wider()` make it easier to work with than `unnest_longer()`?)

## JSON

All of the case studies in the previous section came originally as JSON, one of the most common sources of hierarchical data.
In this section, you'll learn more about JSON and some common problems you might have.
JSON, short for JavaScript object notation, is a data format that grew out of the JavaScript programming language and has become an extremely common way of representing data.
A simple JSON object is a set of name-value pairs:

``` json
{
  "name1": "value1",
  "name2": "value2"
}
```

Which in R you might represent as:

```{r}
list(
  name1 = "value1",
  name2 = "value2"
)
```

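You can check this correspondence by parsing the JSON text yourself.
A minimal sketch using `jsonlite::parse_json()`, which by default returns a plain list without simplifying it:

```{r}
library(jsonlite)

# Parse a JSON object into a named R list
parsed <- parse_json('{"name1": "value1", "name2": "value2"}')
str(parsed)
```
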
There are six types of things that JSON can represent:

``` json
{
  "strings": "are surrounded by double quotes",
  "numbers": 123456,
  "boolean": [false, true],
  "arrays": [1, 2, 3, 4, 5],
  "objects": {
    "name1": "value1",
    "name2": "value2"
  },
  "null": null
}
```

You'll notice that these types don't embrace many of the types you've learned earlier in the book, like factors and date-times.
This is important: typically these data types will be encoded as strings, and you'll need to coerce them to the correct data type.

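For example, a date typically arrives as a string and has to be converted after parsing.
In this sketch the field names, values, and factor levels are all invented for illustration:

```{r}
library(jsonlite)

record <- parse_json('{"when": "2022-05-01", "level": "high"}')

# JSON only knows these as strings, so convert them explicitly
when <- as.Date(record$when)
level <- factor(record$level, levels = c("low", "high"))

class(when)
class(level)
```
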
Most of the time you won't deal with JSON directly; instead you'll use the jsonlite package, by Jeroen Ooms, to load it into R as a nested list.

### Data frames

JSON doesn't have any 2-dimensional data structures, so how would you represent a data frame?

```{r}
df <- tribble(
  ~x, ~y,
  "a", 10,
  "x", 3
)
```

There are two ways: you can either make a struct of arrays, or an array of structs.

``` json
{
  "x": ["a", "x"],
  "y": [10, 3]
}
```

``` json
[
  {"x": "a", "y": 10},
  {"x": "x", "y": 3}
]
```

Either form can be turned back into a tibble with the unnesting functions you've already seen:

```{r}
df_col <- jsonlite::fromJSON('
  {
    "x": ["a", "x"],
    "y": [10, 3]
  }
')
tibble(json = list(df_col)) |>
  unnest_wider(json) |>
  unnest_longer(everything())
```

```{r}
df_row <- jsonlite::fromJSON(simplifyVector = FALSE, '
  [
    {"x": "a", "y": 10},
    {"x": "x", "y": 3}
  ]
')
tibble(json = list(df_row)) |>
  unnest_longer(json) |>
  unnest_wider(json)
```

Note that we have to wrap it in a `list()` because we have a single "thing" to unnest.

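Going in the other direction, `jsonlite::toJSON()` can produce either layout through its `dataframe` argument.
A small sketch, using a base data frame with the illustrative name `df_json` and the same values as the example data frame:

```{r}
library(jsonlite)

df_json <- data.frame(x = c("a", "x"), y = c(10, 3))

# Struct of arrays: one array per column
toJSON(df_json, dataframe = "columns")

# Array of structs: one object per row
toJSON(df_json, dataframe = "rows")
```
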