Title change + feedback from O'Reilly
parent 19c89ebf64
commit 31363dc23a
@@ -1,4 +1,4 @@
-# Data rectangling {#sec-rectangling}
+# Hierarchical data {#sec-rectangling}
 
 ```{r}
 #| results: "asis"
@@ -9,7 +9,7 @@ status("polishing")
 
 ## Introduction
 
-In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns.
+In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frames made up of rows and columns.
 This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.
 
 To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
@@ -294,7 +294,7 @@ df1 |>
 ```
 
 If you don't want these `ids`, you can suppress them with `indices_include = FALSE`.
-On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns.
+On the other hand, sometimes the positions of the elements is meaningful, and even if the elements are unnamed, you might still want to track their indices.
 You can do this with `indices_include = TRUE`:
 
 ```{r}
@@ -304,7 +304,7 @@ df2 |>
 
 ### Inconsistent types
 
-What happens if you unnest a list-column contains different types of vector?
+What happens if you unnest a list-column that contains different types of vector?
 For example, take the following dataset where the list-column `y` contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.
 
 ```{r}
@@ -346,7 +346,8 @@ df4 |>
 filter(map_lgl(y, is.numeric))
 ```
 
-Then you can call `unnest_longer()` once more:
+Then you can call `unnest_longer()` once more.
+This gives us a rectangular dataset of just the numeric values.
 
 ```{r}
 df4 |>
@@ -390,8 +391,8 @@ This section will work through four real rectangling challenges using datasets f
 
 ### Very wide data
 
-We'll with `gh_repos`.
-This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with `View(gh_repos)` before we continue.
+We'll start with `gh_repos`.
+This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; we recommend exploring a little on your own with `View(gh_repos)` before we continue.
 
 `gh_repos` is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble.
 We call the column `json` for reasons we'll get to later.
@@ -469,7 +470,7 @@ This gives another wide dataset, but you can see that `owner` appears to contain
 ### Relational data
 
 Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames.
-For example, take `got_chars`.
+For example, take `got_chars` which contains data about characters that appear in Game of Thrones.
 Like `gh_repos` it's a list, so we start by turning it into a list-column of a tibble:
 
 ```{r}
@@ -539,7 +540,7 @@ You could imagine creating a table like this for each of the list-columns, then
 
 ### A dash of text analysis
 
-What if we wanted to find the most common words in the title?
+Sticking with the same data, what if we wanted to find the most common words in the title?
 One simple approach starts by using `str_split()` to break each element of `title` up into words by spitting on `" "`:
 
 ```{r}
@@ -664,7 +665,7 @@ locations |>
 
 Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.
 
-This is somewhere that `hoist()`, mentioned briefly above, can be useful.
+This is somewhere that `hoist()`, mentioned earlier in the chapter, can be useful.
 Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:
 
 ```{r}