Update rectangling.qmd (#1172)
parent 4635426ec3
commit ae9680ecd7
@@ -9,7 +9,7 @@ status("polishing")

 ## Introduction

-In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frames made up of rows and columns.
+In this chapter, you'll learn the art of data **rectangling**, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns.
 This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.

 To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible.
@@ -19,7 +19,7 @@ We'll finish off by talking about JSON, the most frequent source of hierarchical

 ### Prerequisites

-In this chapter we'll use many functions from tidyr, a core member of the tidyverse.
+In this chapter, we'll use many functions from tidyr, a core member of the tidyverse.
 We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.

 ```{r}
@@ -34,8 +34,8 @@ library(jsonlite)
 ## Lists

 So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors.
-These vectors are simple because they're homogeneous: every element is the same type.
-If you want to store element of different types in the same vector, you'll need a **list**, which you create with `list()`:
+These vectors are simple because they're homogeneous: every element is of the same data type.
+If you want to store elements of different types in the same vector, you'll need a **list**, which you create with `list()`:

 ```{r}
 x1 <- list(1:4, "a", TRUE)
@@ -138,8 +138,8 @@ knitr::include_graphics("screenshots/View-3.png", dpi = 220)
 ### List-columns

 Lists can also live inside a tibble, where we call them list-columns.
-List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a tibble.
-In particular, list-columns are are used a lot in the [tidymodels](https://www.tidymodels.org) ecosystem, because they allow you to store things like models or resamples in a data frame.
+List-columns are useful because they allow you to place objects in a tibble that wouldn't usually belong in there.
+In particular, list-columns are used a lot in the [tidymodels](https://www.tidymodels.org) ecosystem, because they allow you to store things like model outputs or resamples in a data frame.

 Here's a simple example of a list-column:

@@ -164,7 +164,7 @@ In this chapter, we'll focus on unnesting list-columns out into regular variable

 The default print method just displays a rough summary of the contents.
 The list column could be arbitrarily complex, so there's no good way to print it.
-If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:
+If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above:

 ```{r}
 df |>
@@ -194,7 +194,7 @@ data.frame(
 )
 ```

-It's easier to use list-columns with tibbles because `tibble()` treats lists like either vectors and the print method has been designed with lists in mind.
+It's easier to use list-columns with tibbles because `tibble()` treats lists like vectors and the print method has been designed with lists in mind.
 :::

 ## Unnesting
@@ -315,7 +315,7 @@ df4 <- tribble(
 )
 ```

-`unnest_longer()` always keeps the set of columns change, while changing the number of rows.
+`unnest_longer()` always keeps the set of columns unchanged, while changing the number of rows.
 So what happens?
 How does `unnest_longer()` produce five rows while keeping everything in `y`?

@@ -331,7 +331,7 @@ You might wonder if this breaks the commandment that every element of a column m
 What happens if you find this problem in a dataset you're trying to rectangle?
 There are two basic options.
 You could use the `transform` argument to coerce all inputs to a common type.
-It's not particularly useful here because there's only really one class that these five class can be converted to character.
+However, it's not particularly useful here because there's only really one class that these five class can be converted to character.

 ```{r}
 df4 |>
@@ -362,11 +362,11 @@ You'll learn more about `map_lgl()` in @sec-iteration.

 tidyr has a few other useful rectangling functions that we're not going to cover in this book:

-- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's a great for rapid exploration, but ultimately its a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
+- `unnest_auto()` automatically picks between `unnest_longer()` and `unnest_wider()` based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.
 - `unnest()` expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.
 - `hoist()` allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of `unnest_wider()` + `select()` so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.

-These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.
+These functions are good to know about as you might encounter them when reading other people's code or tackling rarer rectangling challenges yourself.

 ### Exercises

@@ -525,7 +525,7 @@ titles <- chars |>
 titles
 ```

-Now, for example, we could use this table tofind all the characters that are captains and see all their titles:
+Now, for example, we could use this table to find all the characters that are captains and see all their titles:

 ```{r}
 captains <- titles |> filter(str_detect(title, "Captain"))
@@ -541,14 +541,14 @@ You could imagine creating a table like this for each of the list-columns, then
 ### A dash of text analysis

 Sticking with the same data, what if we wanted to find the most common words in the title?
-One simple approach starts by using `str_split()` to break each element of `title` up into words by spitting on `" "`:
+One simple approach starts by using `str_split()` to break each element of `title` up into words by splitting on `" "`:

 ```{r}
 titles |>
   mutate(word = str_split(title, " "), .keep = "unused")
 ```

-This creates a unnamed variable length list-column, so we can use `unnest_longer()`:
+This creates an unnamed variable length list-column, so we can use `unnest_longer()`:

 ```{r}
 titles |>
@@ -566,7 +566,7 @@ titles |>
 ```

 Some of those words are not very interesting so we could create a list of common words to drop.
-In text analysis these is commonly called stop words.
+In text analysis these are commonly called stop words.

 ```{r}
 stop_words <- tibble(word = c("of", "the"))
@@ -598,7 +598,7 @@ gmaps_cities |>
 ```

 This gives us the `status` and the `results`.
-We'll drop the status column since they're all `OK`; in a real analysis, you'd also want capture all the rows where `status != "OK"` and figure out what went wrong.
+We'll drop the status column since they're all `OK`; in a real analysis, you'd also want to capture all the rows where `status != "OK"` and figure out what went wrong.
 `results` is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:

 ```{r}
@@ -665,7 +665,7 @@ locations |>

 Note how we unnest two columns simultaneously by supplying a vector of variable names to `unnest_wider()`.

-This is somewhere that `hoist()`, mentioned earlier in the chapter, can be useful.
+This is where `hoist()`, mentioned earlier in the chapter, can be useful.
 Once you've discovered the path to get to the components you're interested in, you can extract them directly using `hoist()`:

 ```{r}
@@ -728,7 +728,7 @@ Four of them are scalars:

 - The simplest type is a null (`null`) which plays the same role as both `NULL` and `NA` in R. It represents the absence of data.
 - A **string** is much like a string in R, but must always use double quotes.
-- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support Inf, -Inf, or NaN.
+- A **number** is similar to R's numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn't support `Inf`, `-Inf`, or `NaN`.
 - A **boolean** is similar to R's `TRUE` and `FALSE`, but uses lowercase `true` and `false`.

 JSON's strings, numbers, and booleans are pretty similar to R's character, numeric, and logical vectors.
@@ -760,8 +760,8 @@ gh_users2 <- read_json(gh_users_json())
 identical(gh_users, gh_users2)
 ```

-In this book, I'll also use `parse_json()`, since it takes a string containing JSON, which makes it good for generating simple examples.
-To get started, here's three simple JSON datasets, starting with a number, then putting a few number in an array, then putting that array in an object:
+In this book, we'll also use `parse_json()`, since it takes a string containing JSON, which makes it good for generating simple examples.
+To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:

 ```{r}
 str(parse_json('1'))
@@ -790,8 +790,8 @@ df |>
   unnest_wider(json)
 ```

-In rarer cases, the JSON consists of a single top-level JSON object, representing one "thing".
-In this case, you'll need to kick off the rectangling process by wrapping it a list, before you put it in a tibble.
+In rarer cases, the JSON file consists of a single top-level JSON object, representing one "thing".
+In this case, you'll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.

 ```{r}
 json <- '{
@@ -851,7 +851,7 @@ Apply `readr::parse_double()` as needed to the get correct variable type.

 ## Summary

-In this chapter, you learned what lists are, how you can generate the from JSON files, and how turn them into rectangular data frames.
+In this chapter, you learned what lists are, how you can generate them from JSON files, and how turn them into rectangular data frames.
 Surprisingly we only need two new functions: `unnest_longer()` to put list elements into rows and `unnest_wider()` to put list elements into columns.
 It doesn't matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions.