Streamline hierarchical data
parent aa9d72a7c6
commit bd67dc7a62
165
rectangling.qmd
|
@ -164,17 +164,7 @@ In this chapter, we'll focus on unnesting list-columns out into regular variable
|
|||
|
||||
The default print method just displays a rough summary of the contents.
|
||||
The list column could be arbitrarily complex, so there's no good way to print it.
|
||||
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above:
|
||||
|
||||
```{r}
|
||||
df |>
|
||||
filter(x == 1) |>
|
||||
pull(z) |>
|
||||
str()
|
||||
```
|
||||
|
||||
Similarly, if you `View()` a data frame in RStudio, you'll get the standard tabular view, which doesn't allow you to selectively expand list columns.
|
||||
To explore those fields you'll need to `pull()` and view, e.g. `df |> pull(z) |> View()`.
|
||||
If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above, like `df |> pull(z) |> str()` or `df |> pull(z) |> View()`.
|
||||
|
||||
::: callout-note
|
||||
## Base R
|
||||
|
@ -250,14 +240,6 @@ df1 |>
|
|||
unnest_wider(y, names_sep = "_")
|
||||
```
|
||||
|
||||
We can also use `unnest_wider()` with unnamed list-columns, as in `df2`.
|
||||
Since columns require names but the list lacks them, `unnest_wider()` will label them with consecutive integers:
|
||||
|
||||
```{r}
|
||||
df2 |>
|
||||
unnest_wider(y, names_sep = "_")
|
||||
```
|
||||
|
||||
You'll notice that `unnest_wider()`, much like `pivot_wider()`, turns implicit missing values into explicit missing values.
|
||||
|
||||
### `unnest_longer()`
|
||||
|
@ -283,24 +265,7 @@ df6 |> unnest_longer(y)
|
|||
```
|
||||
|
||||
We get zero rows in the output, so the row effectively disappears.
|
||||
Once <https://github.com/tidyverse/tidyr/issues/1339> is fixed, you'll be able to keep this row, replacing `y` with `NA` by setting `keep_empty = TRUE`.
|
||||
|
||||
You can also unnest named list-columns, like `df1$y`, into rows.
|
||||
Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix `_id`:
|
||||
|
||||
```{r}
|
||||
df1 |>
|
||||
unnest_longer(y)
|
||||
```
|
||||
|
||||
If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
|
||||
On the other hand, sometimes the positions of the elements are meaningful, and even if the elements are unnamed, you might still want to track their indices.
|
||||
You can do this with `indices_include = TRUE`:
|
||||
|
||||
```{r}
|
||||
df2 |>
|
||||
unnest_longer(y, indices_include = TRUE)
|
||||
```
|
||||
If you want to preserve that row, filling `y` with `NA`, set `keep_empty = TRUE`.
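
As a minimal sketch of that behaviour, the following uses a small hypothetical tibble, `df_empty`, which is not part of the chapter. It assumes a tidyr version in which `unnest_longer()` supports `keep_empty` (believed to be 1.3.0 or later).

```r
library(tibble)
library(tidyr)

# Hypothetical data: row "b" holds an empty vector in the list-column `y`
df_empty <- tibble(
  x = c("a", "b"),
  y = list(1:2, integer())
)

# Default: the empty element contributes zero rows, so "b" disappears
dropped <- df_empty |> unnest_longer(y)

# keep_empty = TRUE retains the row and fills `y` with NA
kept <- df_empty |> unnest_longer(y, keep_empty = TRUE)
```

Comparing `dropped` and `kept` side by side shows the only difference is the extra row for `"b"`.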
|
||||
|
||||
### Inconsistent types
|
||||
|
||||
|
@ -310,8 +275,8 @@ For example, take the following dataset where the list-column `y` contains two n
|
|||
```{r}
|
||||
df4 <- tribble(
|
||||
~x, ~y,
|
||||
"a", list(1, "a"),
|
||||
"b", list(TRUE, factor("a"), 5)
|
||||
"a", list(1),
|
||||
"b", list("a", TRUE, 5)
|
||||
)
|
||||
```
|
||||
|
||||
|
@ -326,37 +291,10 @@ df4 |>
|
|||
|
||||
As you can see, the output contains a list-column, but every element of the list-column contains a single element.
|
||||
Because `unnest_longer()` can't find a common type of vector, it keeps the original types in a list-column.
|
||||
You might wonder if this breaks the commandment that every element of a column must be the same type --- not quite: every element is still a list, even though the contents of each element are of different types.
|
||||
You might wonder if this breaks the commandment that every element of a column must be the same type.
|
||||
It doesn't: every element is a list, even though the contents are of different types.
|
||||
|
||||
What happens if you find this problem in a dataset you're trying to rectangle?
|
||||
There are two basic options.
|
||||
You could use the `transform` argument to coerce all inputs to a common type.
|
||||
However, it's not particularly useful here because there's really only one class that these five classes can be converted to: character.
|
||||
|
||||
```{r}
|
||||
df4 |>
|
||||
unnest_longer(y, transform = as.character)
|
||||
```
|
||||
|
||||
Another option would be to filter down to the rows that have values of a specific type:
|
||||
|
||||
```{r}
|
||||
df4 |>
|
||||
unnest_longer(y) |>
|
||||
filter(map_lgl(y, is.numeric))
|
||||
```
|
||||
|
||||
Then you can call `unnest_longer()` once more.
|
||||
This gives us a rectangular dataset of just the numeric values.
|
||||
|
||||
```{r}
|
||||
df4 |>
|
||||
unnest_longer(y) |>
|
||||
filter(map_lgl(y, is.numeric)) |>
|
||||
unnest_longer(y)
|
||||
```
|
||||
|
||||
You'll learn more about `map_lgl()` in @sec-iteration.
|
||||
Dealing with inconsistent types is challenging and the details depend on the precise nature of the problem and your goals, but you'll most likely need tools from @sec-iteration.
|
||||
|
||||
### Other functions
|
||||
|
||||
|
@ -370,7 +308,14 @@ These functions are good to know about as you might encounter them when reading
|
|||
|
||||
### Exercises
|
||||
|
||||
1. From time to time you encounter data frames with multiple list-columns with aligned values.
|
||||
1. What happens when you use `unnest_wider()` with unnamed list-columns like `df2`?
|
||||
What argument is now necessary?
|
||||
|
||||
2. What happens when you use `unnest_longer()` with named list-columns like `df1`?
|
||||
What additional information do you get in the output?
|
||||
How can you suppress that extra detail?
|
||||
|
||||
3. From time to time you encounter data frames with multiple list-columns with aligned values.
|
||||
For example, in the following data frame, the values of `y` and `z` are aligned (i.e. `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`).
|
||||
What happens if you apply two `unnest_longer()` calls to this data frame?
|
||||
How can you preserve the relationship between `x` and `y`?
|
||||
|
@ -387,7 +332,7 @@ These functions are good to know about as you might encounter them when reading
|
|||
## Case studies
|
||||
|
||||
The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to `unnest_longer()` and/or `unnest_wider()`.
|
||||
This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we've encountered in the wild.
|
||||
To show that in action, this section works through three real rectangling challenges using datasets from the repurrrsive package.
|
||||
|
||||
### Very wide data
|
||||
|
||||
|
@ -395,7 +340,7 @@ We'll start with `gh_repos`.
|
|||
This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; we recommend exploring a little on your own with `View(gh_repos)` before we continue.
|
||||
|
||||
`gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
|
||||
We call the column `json` for reasons we'll get to later.
|
||||
We call this column `json` for reasons we'll get to later.
|
||||
|
||||
```{r}
|
||||
repos <- tibble(json = gh_repos)
|
||||
|
@ -431,7 +376,7 @@ repos |>
|
|||
head(10)
|
||||
```
|
||||
|
||||
Let's select a few that look interesting:
|
||||
Let's pull out a few that look interesting:
|
||||
|
||||
```{r}
|
||||
repos |>
|
||||
|
@ -453,10 +398,8 @@ repos |>
|
|||
unnest_wider(owner)
|
||||
```
|
||||
|
||||
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
|
||||
|
||||
Uh oh, this list column also contains an `id` column and we can't have two `id` columns in the same data frame.
|
||||
Rather than following the advice to use `names_repair` (which would also work), we'll instead use `names_sep`:
|
||||
As suggested, let's use `names_sep` to resolve the problem:
|
||||
|
||||
```{r}
|
||||
repos |>
|
||||
|
@ -466,12 +409,12 @@ repos |>
|
|||
unnest_wider(owner, names_sep = "_")
|
||||
```
|
||||
|
||||
This gives another wide dataset, but you can see that `owner` appears to contain a lot of additional data about the person who "owns" the repository.
|
||||
This gives another wide dataset, but you can get a sense that `owner` contains a lot of additional data about the person who "owns" the repository.
|
||||
|
||||
### Relational data
|
||||
|
||||
Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames.
|
||||
For example, take `got_chars` which contains data about characters that appear in Game of Thrones.
|
||||
Nested data is sometimes used to represent data that we'd usually spread across multiple data frames.
|
||||
For example, take `got_chars` which contains data about characters that appear in the Game of Thrones books and TV series.
|
||||
Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
|
||||
|
||||
```{r}
|
||||
|
@ -495,7 +438,7 @@ characters <- chars |>
|
|||
characters
|
||||
```
|
||||
|
||||
There are also many list-columns:
|
||||
This dataset also contains many list-columns:
|
||||
|
||||
```{r}
|
||||
chars |>
|
||||
|
@ -514,7 +457,7 @@ chars |>
|
|||
```
|
||||
|
||||
You might expect to see this data in its own table because it would be easy to join to the characters data as needed.
|
||||
To do so, we'll do a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.
|
||||
Let's do that, which requires a little cleaning: removing the rows containing empty strings and renaming `titles` to `title` since each row now only contains a single title.
|
||||
|
||||
```{r}
|
||||
titles <- chars |>
|
||||
|
@ -539,49 +482,6 @@ characters |>
|
|||
|
||||
You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
|
||||
|
||||
### A dash of text analysis
|
||||
|
||||
Sticking with the same data, what if we wanted to find the most common words in the title?
|
||||
One simple approach starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:
|
||||
|
||||
```{r}
|
||||
titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused")
|
||||
```
|
||||
|
||||
This creates an unnamed variable length list-column, so we can use `unnest_longer()`:
|
||||
|
||||
```{r}
|
||||
titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word)
|
||||
```
|
||||
|
||||
And then we can count that column to find the most common words:
|
||||
|
||||
```{r}
|
||||
titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word) |>
|
||||
count(word, sort = TRUE)
|
||||
```
|
||||
|
||||
Some of those words are not very interesting so we could create a list of common words to drop.
|
||||
In text analysis these are commonly called stop words.
|
||||
|
||||
```{r}
|
||||
stop_words <- tibble(word = c("of", "the"))
|
||||
|
||||
titles |>
|
||||
mutate(word = str_split(title, " "), .keep = "unused") |>
|
||||
unnest_longer(word) |>
|
||||
anti_join(stop_words) |>
|
||||
count(word, sort = TRUE)
|
||||
```
|
||||
|
||||
Breaking up text into individual fragments is a powerful idea that underlies much of text analysis.
|
||||
If this sounds interesting, a good place to learn more is [Text Mining with R](https://www.tidytextmining.com) by Julia Silge and David Robinson.
|
||||
|
||||
### Deeply nested
|
||||
|
||||
We'll finish off these case studies with a list-column that's very deeply nested and requires repeated rounds of `unnest_wider()` and `unnest_longer()` to unravel: `gmaps_cities`.
|
||||
|
@ -670,6 +570,7 @@ This is where `hoist()`, mentioned earlier in the chapter, can be useful.
|
|||
Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
|
||||
|
||||
```{r}
|
||||
#| results: false
|
||||
locations |>
|
||||
select(city, formatted_address, geometry) |>
|
||||
hoist(
|
||||
|
@ -692,7 +593,9 @@ If these case studies have whetted your appetite for more real-life rectangling,
|
|||
Can you construct a `owners` data frame that contains one row for each owner?
|
||||
(Hint: does `distinct()` work with `list-cols`?)
|
||||
|
||||
3. Explain the following code line-by-line.
|
||||
3. Follow the steps used for `titles` to create similar tables for the aliases, allegiances, books, and TV series for the Game of Thrones characters.
|
||||
|
||||
4. Explain the following code line-by-line.
|
||||
Why is it interesting?
|
||||
Why does it work for `got_chars` but might not work in general?
|
||||
|
||||
|
@ -709,7 +612,7 @@ If these case studies have whetted your appetite for more real-life rectangling,
|
|||
unnest_longer(value)
|
||||
```
|
||||
|
||||
4. In `gmaps_cities`, what does `address_components` contain?
|
||||
5. In `gmaps_cities`, what does `address_components` contain?
|
||||
Why does the length vary between rows?
|
||||
Unnest it appropriately to figure it out.
|
||||
(Hint: `types` always appears to contain two elements. Does `unnest_wider()` make it easier to work with than `unnest_longer()`?)
|
||||
|
@ -743,6 +646,10 @@ An **object** is like a named list, and is written with `{}`.
|
|||
The names (keys in JSON terminology) are strings, so must be surrounded by quotes.
|
||||
For example, `{"x": 1, "y": 2}` is an object that maps `x` to 1 and `y` to 2.
|
||||
|
||||
Note that JSON doesn't have any native way to represent dates or date-times, so they're often stored as strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
|
||||
Similarly, JSON's rules for representing floating point numbers are a little imprecise, so you'll also sometimes find numbers stored in strings.
|
||||
Apply `readr::parse_double()` as needed to get the correct variable type.
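
A minimal sketch of those conversions, assuming the strings have already been extracted from the JSON into character vectors. The values and the names `created` and `price` are hypothetical, chosen only for illustration.

```r
library(readr)

# Hypothetical values, as they might arrive from a JSON payload
created <- c("2023-01-15T08:30:00Z", "2023-02-01T17:45:10Z")
price <- c("1.5", "2.25")

# ISO8601 strings become date-times (POSIXct, in UTC by default)
created_dt <- parse_datetime(created)

# Numbers stored as strings become doubles
price_num <- parse_double(price)
```

In a data frame you would typically apply the same functions inside `mutate()` after unnesting.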
|
||||
|
||||
### jsonlite
|
||||
|
||||
To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms.
|
||||
|
@ -820,12 +727,6 @@ df |>
|
|||
unnest_wider(results)
|
||||
```
|
||||
|
||||
### Translation challenges
|
||||
|
||||
Since JSON doesn't have any way to represent dates or date-times, they're often stored as ISO8601 date times in strings, and you'll need to use `readr::parse_date()` or `readr::parse_datetime()` to turn them into the correct data structure.
|
||||
Similarly, JSON's rules for representing floating point numbers are a little imprecise, so you'll also sometimes find numbers stored in strings.
|
||||
Apply `readr::parse_double()` as needed to get the correct variable type.
|
||||
|
||||
### Exercises
|
||||
|
||||
1. Rectangle the `df_col` and `df_row` below.
|
||||
|
|