Fix code language
		@@ -11,7 +11,7 @@ Introduction</h1>
 | 
			
		||||
Prerequisites</h2>
 | 
			
		||||
<p>In this chapter we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
 | 
			
		||||
library(repurrrsive)
 | 
			
		||||
library(jsonlite)</pre>
 | 
			
		||||
</div>
 | 
			
		||||
@@ -23,7 +23,7 @@ library(jsonlite)</pre>
 | 
			
		||||
Lists</h1>
 | 
			
		||||
<p>So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is the same type. If you want to store elements of different types in the same vector, you’ll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">x1 <- list(1:4, "a", TRUE)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">x1 <- list(1:4, "a", TRUE)
 | 
			
		||||
x1
 | 
			
		||||
#> [[1]]
 | 
			
		||||
#> [1] 1 2 3 4
 | 
			
		||||
@@ -36,7 +36,7 @@ x1
 | 
			
		||||
</div>
 | 
			
		||||
<p>It’s often convenient to name the components, or <strong>children</strong>, of a list, which you can do in the same way as naming the columns of a tibble:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">x2 <- list(a = 1:2, b = 1:3, c = 1:4)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">x2 <- list(a = 1:2, b = 1:3, c = 1:4)
 | 
			
		||||
x2
 | 
			
		||||
#> $a
 | 
			
		||||
#> [1] 1 2
 | 
			
		||||
@@ -49,7 +49,7 @@ x2
 | 
			
		||||
</div>
 | 
			
		||||
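<p>As a small sketch of why names are convenient (base R only, reusing <code>x2</code> from above), named children can be listed with <code>names()</code> and extracted with <code>$</code> or <code>[[</code>:</p>

```r
# Same named list as in the text above
x2 <- list(a = 1:2, b = 1:3, c = 1:4)

# The names of the children
child_names <- names(x2)
print(child_names)

# Extract a single child by name; both forms return the underlying vector
b_child <- x2$b
c_child <- x2[["c"]]
print(b_child)
print(c_child)
```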
<p>Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code>, which generates a compact display of the <strong>str</strong>ucture, de-emphasizing the contents:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">str(x1)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">str(x1)
 | 
			
		||||
#> List of 3
 | 
			
		||||
#>  $ : int [1:4] 1 2 3 4
 | 
			
		||||
#>  $ : chr "a"
 | 
			
		||||
@@ -67,7 +67,7 @@ str(x2)
 | 
			
		||||
Hierarchy</h2>
 | 
			
		||||
<p>Lists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">x3 <- list(list(1, 2), list(3, 4))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">x3 <- list(list(1, 2), list(3, 4))
 | 
			
		||||
str(x3)
 | 
			
		||||
#> List of 2
 | 
			
		||||
#>  $ :List of 2
 | 
			
		||||
@@ -79,7 +79,7 @@ str(x3)
 | 
			
		||||
</div>
 | 
			
		||||
<p>This is notably different to <code><a href="https://rdrr.io/r/base/c.html">c()</a></code>, which generates a flat vector:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">c(c(1, 2), c(3, 4))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">c(c(1, 2), c(3, 4))
 | 
			
		||||
#> [1] 1 2 3 4
 | 
			
		||||
 | 
			
		||||
x4 <- c(list(1, 2), list(3, 4))
 | 
			
		||||
@@ -92,7 +92,7 @@ str(x4)
 | 
			
		||||
</div>
 | 
			
		||||
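<p>One quick way to confirm the difference, sketched here with base R only, is to compare lengths: nesting with <code>list()</code> keeps two children, while <code>c()</code> splices the inner lists into a single four-element list.</p>

```r
# list() of lists keeps the hierarchy: two children, each itself a list
x3 <- list(list(1, 2), list(3, 4))

# c() of lists splices the children together into one flat list
x4 <- c(list(1, 2), list(3, 4))

len_nested <- length(x3)
len_flat <- length(x4)
print(len_nested)
print(len_flat)

# The first child of x3 is still a list; the first child of x4 is a number
print(is.list(x3[[1]]))
print(is.list(x4[[1]]))
```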
<p>As lists get more complex, <code><a href="https://rdrr.io/r/utils/str.html">str()</a></code> gets more useful, as it lets you see the hierarchy at a glance:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">x5 <- list(1, list(2, list(3, list(4, list(5)))))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">x5 <- list(1, list(2, list(3, list(4, list(5)))))
 | 
			
		||||
str(x5)
 | 
			
		||||
#> List of 2
 | 
			
		||||
#>  $ : num 1
 | 
			
		||||
@@ -138,7 +138,7 @@ List-columns</h2>
 | 
			
		||||
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn’t usually belong in a tibble. In particular, list-columns are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like models or resamples in a data frame.</p>
 | 
			
		||||
<p>Here’s a simple example of a list-column:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(
 | 
			
		||||
  x = 1:2, 
 | 
			
		||||
  y = c("a", "b"),
 | 
			
		||||
  z = list(list(1, 2), list(3, 4, 5))
 | 
			
		||||
@@ -152,7 +152,7 @@ df
 | 
			
		||||
</div>
 | 
			
		||||
<p>There’s nothing special about lists in a tibble; they behave like any other column:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df |> 
 | 
			
		||||
  filter(x == 1)
 | 
			
		||||
#> # A tibble: 1 × 3
 | 
			
		||||
#>       x y     z         
 | 
			
		||||
@@ -162,7 +162,7 @@ df
 | 
			
		||||
<p>Computing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.</p>
 | 
			
		||||
<p>The default print method just displays a rough summary of the contents. The list-column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull the list-column out and apply one of the techniques that you learned above:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df |> 
 | 
			
		||||
  filter(x == 1) |> 
 | 
			
		||||
  pull(z) |> 
 | 
			
		||||
  str()
 | 
			
		||||
@@ -175,13 +175,13 @@ df
 | 
			
		||||
<div data-type="note"><h1>
 | 
			
		||||
Base R
 | 
			
		||||
</h1><p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">data.frame(x = list(1:3, 3:5))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">data.frame(x = list(1:3, 3:5))
 | 
			
		||||
#>   x.1.3 x.3.5
 | 
			
		||||
#> 1     1     3
 | 
			
		||||
#> 2     2     4
 | 
			
		||||
#> 3     3     5</pre>
 | 
			
		||||
</div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn’t print particularly well:</p><div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">data.frame(
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">data.frame(
 | 
			
		||||
  x = I(list(1:2, 3:5)), 
 | 
			
		||||
  y = c("1, 2", "3, 4, 5")
 | 
			
		||||
)
 | 
			
		||||
@@ -199,7 +199,7 @@ Unnesting</h1>
 | 
			
		||||
<p>Now that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.</p>
 | 
			
		||||
<p>List-columns tend to come in two basic forms: named and unnamed. When the children are <strong>named</strong>, they tend to have the same names in every row. For example, in <code>df1</code>, every element of list-column <code>y</code> has two elements named <code>a</code> and <code>b</code>. Named list-columns naturally unnest into columns: each named element becomes a new named column.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df1 <- tribble(
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df1 <- tribble(
 | 
			
		||||
  ~x, ~y,
 | 
			
		||||
  1, list(a = 11, b = 12),
 | 
			
		||||
  2, list(a = 21, b = 22),
 | 
			
		||||
@@ -208,7 +208,7 @@ Unnesting</h1>
 | 
			
		||||
</div>
 | 
			
		||||
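<p>When you meet an unfamiliar list-column, it can help to check which of the two forms it has before choosing how to unnest it. The following is a minimal base R sketch; the helper <code>all_children_named()</code> and the two example lists are hypothetical and only serve the illustration:</p>

```r
# Illustrative list-columns: one with named children, one with unnamed children
y_named <- list(list(a = 11, b = 12), list(a = 21, b = 22))
y_unnamed <- list(list(11, 12, 13), list(21))

# Hypothetical helper: TRUE when every child of a list-column carries names
all_children_named <- function(col) {
  all(vapply(col, function(child) !is.null(names(child)), logical(1)))
}

named_check <- all_children_named(y_named)
unnamed_check <- all_children_named(y_unnamed)
print(named_check)
print(unnamed_check)

# Unnamed children often vary in length from row to row
child_lengths <- lengths(y_unnamed)
print(child_lengths)
```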
<p>When the children are <strong>unnamed</strong>, the number of elements tends to vary from row to row. For example, in <code>df2</code>, the elements of list-column <code>y</code> are unnamed and vary in length from one to three. Unnamed list-columns naturally unnest into rows: you’ll get one row for each child.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">
 | 
			
		||||
df2 <- tribble(
 | 
			
		||||
  ~x, ~y,
 | 
			
		||||
  1, list(11, 12, 13),
 | 
			
		||||
@@ -224,7 +224,7 @@ df2 <- tribble(
 | 
			
		||||
</h2>
 | 
			
		||||
<p>When each row has the same number of elements with the same names, like <code>df1</code>, it’s natural to put each component into its own column with <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df1 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df1 |> 
 | 
			
		||||
  unnest_wider(y)
 | 
			
		||||
#> # A tibble: 3 × 3
 | 
			
		||||
#>       x     a     b
 | 
			
		||||
@@ -235,7 +235,7 @@ df2 <- tribble(
 | 
			
		||||
</div>
 | 
			
		||||
<p>By default, the names of the new columns come exclusively from the names of the list elements, but you can use the <code>names_sep</code> argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df1 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df1 |> 
 | 
			
		||||
  unnest_wider(y, names_sep = "_")
 | 
			
		||||
#> # A tibble: 3 × 3
 | 
			
		||||
#>       x   y_a   y_b
 | 
			
		||||
@@ -246,7 +246,7 @@ df2 <- tribble(
 | 
			
		||||
</div>
 | 
			
		||||
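<p>To make the point about repeated names concrete, here is a small sketch with made-up data (<code>df_clash</code> is not part of the chapter) in which a child named <code>x</code> would collide with the existing column <code>x</code>. With <code>names_sep</code>, the new columns are prefixed with the list-column’s name, so the clash is avoided:</p>

```r
library(tidyr)
library(tibble)

# Illustrative data: the child named `x` clashes with the existing column `x`
df_clash <- tibble(
  x = 1:2,
  y = list(list(x = 10, z = 11), list(x = 20, z = 21))
)

# names_sep = "_" combines the column name `y` with each element name
out <- df_clash |>
  unnest_wider(y, names_sep = "_")

print(names(out))
```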
<p>We can also use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> with unnamed list-columns, as in <code>df2</code>. Since columns require names but the list lacks them, <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> will label them with consecutive integers:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df2 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df2 |> 
 | 
			
		||||
  unnest_wider(y, names_sep = "_")
 | 
			
		||||
#> # A tibble: 3 × 4
 | 
			
		||||
#>       x   y_1   y_2   y_3
 | 
			
		||||
@@ -264,7 +264,7 @@ df2 <- tribble(
 | 
			
		||||
</h2>
 | 
			
		||||
<p>When each row contains an unnamed list, it’s most natural to put each element into its own row with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df2 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df2 |> 
 | 
			
		||||
  unnest_longer(y)
 | 
			
		||||
#> # A tibble: 6 × 2
 | 
			
		||||
#>       x     y
 | 
			
		||||
@@ -278,7 +278,7 @@ df2 <- tribble(
 | 
			
		||||
</div>
 | 
			
		||||
<p>Note how <code>x</code> is duplicated for each element inside of <code>y</code>: we get one row of output for each element inside the list-column. But what happens if one of the elements is empty, as in the following example?</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df6 <- tribble(
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df6 <- tribble(
 | 
			
		||||
  ~x, ~y,
 | 
			
		||||
  "a", list(1, 2),
 | 
			
		||||
  "b", list(3),
 | 
			
		||||
@@ -295,7 +295,7 @@ df6 |> unnest_longer(y)
 | 
			
		||||
<p>We get zero rows in the output, so the row effectively disappears. Once <a href="https://github.com/tidyverse/tidyr/issues/1339" class="uri">https://github.com/tidyverse/tidyr/issues/1339</a> is fixed, you’ll be able to keep this row, replacing <code>y</code> with <code>NA</code> by setting <code>keep_empty = TRUE</code>.</p>
 | 
			
		||||
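<p>As a sketch of what this looks like, assuming a tidyr version in which <code>unnest_longer()</code> accepts <code>keep_empty</code> (to the best of our knowledge, 1.3.0 or later), the following compares the two behaviours. The data frame <code>df_empty</code> is illustrative, and its empty third row is an assumption made for this example:</p>

```r
library(tidyr)
library(tibble)

# Illustrative data: the last row holds an empty list
df_empty <- tribble(
  ~x, ~y,
  "a", list(1, 2),
  "b", list(3),
  "c", list()
)

# Default: the row with the empty list contributes zero rows
dropped <- df_empty |> unnest_longer(y)

# keep_empty = TRUE: the row is kept and y is filled with NA
kept <- df_empty |> unnest_longer(y, keep_empty = TRUE)

print(nrow(dropped))
print(nrow(kept))
```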
<p>You can also unnest named list-columns, like <code>df1$y</code>, into rows. Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix <code>_id</code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df1 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df1 |> 
 | 
			
		||||
  unnest_longer(y)
 | 
			
		||||
#> # A tibble: 6 × 3
 | 
			
		||||
#>       x     y y_id 
 | 
			
		||||
@@ -309,7 +309,7 @@ df6 |> unnest_longer(y)
 | 
			
		||||
</div>
 | 
			
		||||
<p>If you don’t want these <code>ids</code>, you can suppress them with <code>indices_include = FALSE</code>. On the other hand, it’s sometimes useful to retain the position of unnamed elements in unnamed list-columns. You can do this with <code>indices_include = TRUE</code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df2 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df2 |> 
 | 
			
		||||
  unnest_longer(y, indices_include = TRUE)
 | 
			
		||||
#> # A tibble: 6 × 3
 | 
			
		||||
#>       x     y  y_id
 | 
			
		||||
@@ -328,7 +328,7 @@ df6 |> unnest_longer(y)
 | 
			
		||||
Inconsistent types</h2>
 | 
			
		||||
<p>What happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column <code>y</code> contains two numbers, a character, a factor, and a logical, which can’t normally be mixed in a single column.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df4 <- tribble(
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df4 <- tribble(
 | 
			
		||||
  ~x, ~y,
 | 
			
		||||
  "a", list(1, "a"),
 | 
			
		||||
  "b", list(TRUE, factor("a"), 5)
 | 
			
		||||
@@ -336,7 +336,7 @@ Inconsistent types</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df4 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df4 |> 
 | 
			
		||||
  unnest_longer(y)
 | 
			
		||||
#> # A tibble: 5 × 2
 | 
			
		||||
#>   x     y        
 | 
			
		||||
@@ -350,7 +350,7 @@ Inconsistent types</h2>
 | 
			
		||||
<p>As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though the contents of each element are of different types.</p>
 | 
			
		||||
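<p>The same idea can be seen with base R alone. In this sketch, a plain list stands in for the unnested list-column: the container is a list throughout, while the class of each child differs.</p>

```r
# A plain list standing in for the unnested list-column y
y <- list(1, "a", TRUE, factor("a"), 5)

# The container itself is a list
container_is_list <- is.list(y)
print(container_is_list)

# Each child has its own class (first class only, for compactness)
child_classes <- vapply(y, function(child) class(child)[1], character(1))
print(child_classes)
```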
<p>What happens if you find this problem in a dataset you’re trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. It’s not particularly useful here because there’s really only one class that these five values can be converted to: character.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df4 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df4 |> 
 | 
			
		||||
  unnest_longer(y, transform = as.character)
 | 
			
		||||
#> # A tibble: 5 × 2
 | 
			
		||||
#>   x     y    
 | 
			
		||||
@@ -363,7 +363,7 @@ Inconsistent types</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Another option would be to filter down to the rows that have values of a specific type:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df4 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df4 |> 
 | 
			
		||||
  unnest_longer(y) |> 
 | 
			
		||||
  filter(map_lgl(y, is.numeric))
 | 
			
		||||
#> # A tibble: 2 × 2
 | 
			
		||||
@@ -374,7 +374,7 @@ Inconsistent types</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df4 |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df4 |> 
 | 
			
		||||
  unnest_longer(y) |> 
 | 
			
		||||
  filter(map_lgl(y, is.numeric)) |> 
 | 
			
		||||
  unnest_longer(y)
 | 
			
		||||
@@ -406,7 +406,7 @@ Exercises</h2>
 | 
			
		||||
<ol type="1"><li>
 | 
			
		||||
<p>From time to time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of <code>y</code> and <code>z</code> are aligned (i.e. <code>y</code> and <code>z</code> will always have the same length within a row, and the first value of <code>y</code> corresponds to the first value of <code>z</code>). What happens if you apply two <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> calls to this data frame? How can you preserve the relationship between <code>y</code> and <code>z</code>? (Hint: carefully read the docs).</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df4 <- tribble(
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df4 <- tribble(
 | 
			
		||||
  ~x, ~y, ~z,
 | 
			
		||||
  "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
 | 
			
		||||
  "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
 | 
			
		||||
@@ -427,7 +427,7 @@ Very wide data</h2>
 | 
			
		||||
<p>We’ll start with <code>gh_repos</code>. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; you might want to explore a little on your own with <code>View(gh_repos)</code> before we continue.</p>
 | 
			
		||||
<p><code>gh_repos</code> is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call the column <code>json</code> for reasons we’ll get to later.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos <- tibble(json = gh_repos)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos <- tibble(json = gh_repos)
 | 
			
		||||
repos
 | 
			
		||||
#> # A tibble: 6 × 1
 | 
			
		||||
#>   json       
 | 
			
		||||
@@ -441,7 +441,7 @@ repos
 | 
			
		||||
</div>
 | 
			
		||||
<p>This tibble contains 6 rows, one row for each child of <code>gh_repos</code>. Each row contains an unnamed list with either 26 or 30 elements. Since these are unnamed, we’ll start with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put each child in its own row:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos |> 
 | 
			
		||||
  unnest_longer(json)
 | 
			
		||||
#> # A tibble: 176 × 1
 | 
			
		||||
#>   json             
 | 
			
		||||
@@ -456,7 +456,7 @@ repos
 | 
			
		||||
</div>
 | 
			
		||||
<p>At first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6) each element of <code>json</code> is still a list. However, there’s an important difference: now each element is a <strong>named</strong> list so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put each element into its own column:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos |> 
 | 
			
		||||
  unnest_longer(json) |> 
 | 
			
		||||
  unnest_wider(json) 
 | 
			
		||||
#> # A tibble: 176 × 68
 | 
			
		||||
@@ -478,7 +478,7 @@ repos
 | 
			
		||||
</div>
 | 
			
		||||
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos |> 
 | 
			
		||||
  unnest_longer(json) |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  names()
 | 
			
		||||
@@ -508,7 +508,7 @@ repos
 | 
			
		||||
</div>
 | 
			
		||||
<p>Let’s select a few that look interesting:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos |> 
 | 
			
		||||
  unnest_longer(json) |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, full_name, owner, description)
 | 
			
		||||
@@ -526,7 +526,7 @@ repos
 | 
			
		||||
<p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
 | 
			
		||||
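<p>The longer-then-wider pattern can be tried on a much smaller list. The sketch below uses a hypothetical stand-in, <code>toy_repos</code>, with invented values that only mimic the shape described above (users containing repositories, each with a nested <code>owner</code>); it is not the real <code>gh_repos</code> data.</p>

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for gh_repos: two users, each an unnamed list of
# repositories; every repository is a named list with a nested `owner`
toy_repos <- list(
  list(
    list(id = 1, full_name = "user1/a", owner = list(login = "user1", id = 101)),
    list(id = 2, full_name = "user1/b", owner = list(login = "user1", id = 101))
  ),
  list(
    list(id = 3, full_name = "user2/c", owner = list(login = "user2", id = 102))
  )
)

toy <- tibble(json = toy_repos) |>
  unnest_longer(json) |>  # unnamed children: one row per repository
  unnest_wider(json)      # named children: one column per field

print(dim(toy))
print(names(toy))
```

<p>Note that <code>owner</code> is still a list-column after these two steps, which mirrors the situation in the real data.</p>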
<p><code>owner</code> is another list-column, and since it contains a named list, we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to get at the values:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos |> 
 | 
			
		||||
  unnest_longer(json) |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, full_name, owner, description) |> 
 | 
			
		||||
@@ -540,7 +540,7 @@ repos
 | 
			
		||||
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
 | 
			
		||||
<p>Uh oh, this list column also contains an <code>id</code> column and we can’t have two <code>id</code> columns in the same data frame. Rather than following the advice to use <code>names_repair</code> (which would also work), we’ll instead use <code>names_sep</code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">repos |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">repos |> 
 | 
			
		||||
  unnest_longer(json) |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, full_name, owner, description) |> 
 | 
			
		||||
@@ -570,7 +570,7 @@ repos
 | 
			
		||||
Relational data</h2>
 | 
			
		||||
<p>Nested data is sometimes used to represent data that we’d usually spread out into multiple data frames. For example, take <code>got_chars</code>. Like <code>gh_repos</code> it’s a list, so we start by turning it into a list-column of a tibble:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">chars <- tibble(json = got_chars)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">chars <- tibble(json = got_chars)
 | 
			
		||||
chars
 | 
			
		||||
#> # A tibble: 30 × 1
 | 
			
		||||
#>   json             
 | 
			
		||||
@@ -585,7 +585,7 @@ chars
 | 
			
		||||
</div>
 | 
			
		||||
<p>The <code>json</code> column contains named elements, so we’ll start by widening it:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">chars |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">chars |> 
 | 
			
		||||
  unnest_wider(json)
 | 
			
		||||
#> # A tibble: 30 × 18
 | 
			
		||||
#>   url         id name  gender culture born  died  alive titles aliases father
 | 
			
		||||
@@ -602,7 +602,7 @@ chars
 | 
			
		||||
</div>
 | 
			
		||||
<p>And selecting a few columns to make it easier to read:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">characters <- chars |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">characters <- chars |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, name, gender, culture, born, died, alive)
 | 
			
		||||
characters
 | 
			
		||||
@@ -619,7 +619,7 @@ characters
 | 
			
		||||
</div>
 | 
			
		||||
<p>There are also many list-columns:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">chars |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">chars |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, where(is.list))
 | 
			
		||||
#> # A tibble: 30 × 8
 | 
			
		||||
@@ -635,7 +635,7 @@ characters
 | 
			
		||||
</div>
 | 
			
		||||
<p>Let’s explore the <code>titles</code> column. It’s an unnamed list-column, so we’ll unnest it into rows:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">chars |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">chars |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, titles) |> 
 | 
			
		||||
  unnest_longer(titles)
 | 
			
		||||
@@ -652,7 +652,7 @@ characters
 | 
			
		||||
</div>
 | 
			
		||||
<p>You might expect to see this data in its own table because it would be easy to join to the characters data as needed. To do so, we’ll do a little cleaning: removing the rows containing empty strings and renaming <code>titles</code> to <code>title</code> since each row now only contains a single title.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">titles <- chars |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">titles <- chars |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, titles) |> 
 | 
			
		||||
  unnest_longer(titles) |> 
 | 
			
		||||
@@ -672,7 +672,7 @@ titles
 | 
			
		||||
</div>
 | 
			
		||||
<p>Now, for example, we could use this table to find all the characters that are captains and see all their titles:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">captains <- titles |> filter(str_detect(title, "Captain"))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">captains <- titles |> filter(str_detect(title, "Captain"))
 | 
			
		||||
captains
 | 
			
		||||
#> # A tibble: 5 × 2
 | 
			
		||||
#>      id title                                 
 | 
			
		||||
@@ -705,7 +705,7 @@ characters |>
 | 
			
		||||
A dash of text analysis</h2>
 | 
			
		||||
<p>What if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by splitting on <code>" "</code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">titles |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">titles |> 
 | 
			
		||||
  mutate(word = str_split(title, " "), .keep = "unused")
 | 
			
		||||
#> # A tibble: 53 × 2
 | 
			
		||||
#>      id word      
 | 
			
		||||
@@ -720,7 +720,7 @@ A dash of text analysis</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>This creates an unnamed, variable-length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">titles |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">titles |> 
 | 
			
		||||
  mutate(word = str_split(title, " "), .keep = "unused") |> 
 | 
			
		||||
  unnest_longer(word)
 | 
			
		||||
#> # A tibble: 202 × 2
 | 
			
		||||
@@ -736,7 +736,7 @@ A dash of text analysis</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>And then we can count that column to find the most common words:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">titles |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">titles |> 
 | 
			
		||||
  mutate(word = str_split(title, " "), .keep = "unused") |> 
 | 
			
		||||
  unnest_longer(word) |> 
 | 
			
		||||
  count(word, sort = TRUE)
 | 
			
		||||
@@ -753,7 +753,7 @@ A dash of text analysis</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Some of those words are not very interesting, so we could create a list of common words to drop. In text analysis these are commonly called stop words.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">stop_words <- tibble(word = c("of", "the"))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">stop_words <- tibble(word = c("of", "the"))
 | 
			
		||||
 | 
			
		||||
titles |> 
 | 
			
		||||
  mutate(word = str_split(title, " "), .keep = "unused") |> 
 | 
			
		||||
@@ -780,7 +780,7 @@ titles |>
 | 
			
		||||
Deeply nested</h2>
 | 
			
		||||
<p>We’ll finish off these case studies with a list-column that’s very deeply nested and requires repeated rounds of <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to unravel: <code>gmaps_cities</code>. This is a two-column tibble containing five city names and the results of using Google’s <a href="https://developers.google.com/maps/documentation/geocoding">geocoding API</a> to determine their location:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities
 | 
			
		||||
#> # A tibble: 5 × 2
 | 
			
		||||
#>   city       json            
 | 
			
		||||
#>   <chr>      <list>          
 | 
			
		||||
@@ -792,7 +792,7 @@ Deeply nested</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p><code>json</code> is a list-column with internal names, so we start with an <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities |> 
 | 
			
		||||
  unnest_wider(json)
 | 
			
		||||
#> # A tibble: 5 × 3
 | 
			
		||||
#>   city       results    status
 | 
			
		||||
@@ -805,7 +805,7 @@ Deeply nested</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>This gives us the <code>status</code> and the <code>results</code>. We’ll drop the <code>status</code> column since every value is <code>OK</code>; in a real analysis, you’d also want to capture all the rows where <code>status != "OK"</code> and figure out what went wrong. <code>results</code> is an unnamed list, with either one or two elements (we’ll see why shortly), so we’ll unnest it into rows:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">gmaps_cities |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">gmaps_cities |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(-status) |> 
 | 
			
		||||
  unnest_longer(results)
 | 
			
		||||
@@ -822,7 +822,7 @@ Deeply nested</h2>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Now <code>results</code> is a named list, so we’ll use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">locations <- gmaps_cities |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">locations <- gmaps_cities |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(-status) |> 
 | 
			
		||||
  unnest_longer(results) |> 
 | 
			
		||||
@@ -842,7 +842,7 @@ locations
 | 
			
		||||
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
 | 
			
		||||
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">locations |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">locations |> 
 | 
			
		||||
  select(city, formatted_address, geometry) |> 
 | 
			
		||||
  unnest_wider(geometry)
 | 
			
		||||
#> # A tibble: 7 × 6
 | 
			
		||||
@@ -858,7 +858,7 @@ locations
 | 
			
		||||
</div>
 | 
			
		||||
<p>That gives us new <code>bounds</code> (a rectangular region) and <code>location</code> (a point). We can unnest <code>location</code> to see the latitude (<code>lat</code>) and longitude (<code>lng</code>):</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">locations |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">locations |> 
 | 
			
		||||
  select(city, formatted_address, geometry) |> 
 | 
			
		||||
  unnest_wider(geometry) |> 
 | 
			
		||||
  unnest_wider(location)
 | 
			
		||||
@@ -875,7 +875,7 @@ locations
 | 
			
		||||
</div>
 | 
			
		||||
<p>Extracting the bounds requires a few more steps:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">locations |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">locations |> 
 | 
			
		||||
  select(city, formatted_address, geometry) |> 
 | 
			
		||||
  unnest_wider(geometry) |> 
 | 
			
		||||
  # focus on the variables of interest
 | 
			
		||||
@@ -894,7 +894,7 @@ locations
 | 
			
		||||
</div>
 | 
			
		||||
<p>We then rename <code>southwest</code> and <code>northeast</code> (the corners of the rectangle) so we can use <code>names_sep</code> to create short but evocative names:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">locations |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">locations |> 
 | 
			
		||||
  select(city, formatted_address, geometry) |> 
 | 
			
		||||
  unnest_wider(geometry) |> 
 | 
			
		||||
  select(!location:viewport) |>
 | 
			
		||||
@@ -915,7 +915,7 @@ locations
 | 
			
		||||
<p>Note how we unnest two columns simultaneously by supplying a vector of variable names to <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>.</p>
 | 
			
		||||
<p>This is somewhere that <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned briefly above, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">locations |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">locations |> 
 | 
			
		||||
  select(city, formatted_address, geometry) |> 
 | 
			
		||||
  hoist(
 | 
			
		||||
    geometry,
 | 
			
		||||
@@ -946,7 +946,7 @@ Exercises</h2>
 | 
			
		||||
<li>
 | 
			
		||||
<p>Explain the following code line-by-line. Why is it interesting? Why does it work for <code>got_chars</code> but might not work in general?</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">tibble(json = got_chars) |> 
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">tibble(json = got_chars) |> 
 | 
			
		||||
  unnest_wider(json) |> 
 | 
			
		||||
  select(id, where(is.list)) |> 
 | 
			
		||||
  pivot_longer(
 | 
			
		||||
@@ -983,7 +983,7 @@ Data types</h2>
 | 
			
		||||
jsonlite</h2>
 | 
			
		||||
<p>To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code> and <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>. In real life, you’ll use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code> to read a JSON file from disk. For example, the repurrrsive package also provides the source for <code>gh_user</code> as a JSON file and you can read it with <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">read_json()</a></code>:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit"># A path to a json file inside the package:
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r"># A path to a json file inside the package:
 | 
			
		||||
gh_users_json()
 | 
			
		||||
#> [1] "/Users/hadleywickham/Library/R/arm64/4.2/library/repurrrsive/extdata/gh_users.json"
 | 
			
		||||
 | 
			
		||||
@@ -996,7 +996,7 @@ identical(gh_users, gh_users2)
 | 
			
		||||
</div>
 | 
			
		||||
<p>In this book, I’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">str(parse_json('1'))
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">str(parse_json('1'))
 | 
			
		||||
#>  int 1
 | 
			
		||||
str(parse_json('[1, 2, 3]'))
 | 
			
		||||
#> List of 3
 | 
			
		||||
@@ -1018,7 +1018,7 @@ str(parse_json('{"x": [1, 2, 3]}'))
 | 
			
		||||
Starting the rectangling process</h2>
 | 
			
		||||
<p>In most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g. multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with <code>tibble(json)</code> so that each element becomes a row:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">json <- '[
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">json <- '[
 | 
			
		||||
  {"name": "John", "age": 34},
 | 
			
		||||
  {"name": "Susan", "age": 27}
 | 
			
		||||
]'
 | 
			
		||||
@@ -1040,7 +1040,7 @@ df |>
 | 
			
		||||
</div>
 | 
			
		||||
<p>In rarer cases, the JSON consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list before you put it in a tibble.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">json <- '{
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">json <- '{
 | 
			
		||||
  "status": "OK", 
 | 
			
		||||
  "results": [
 | 
			
		||||
    {"name": "John", "age": 34},
 | 
			
		||||
@@ -1067,7 +1067,7 @@ df |>
 | 
			
		||||
</div>
 | 
			
		||||
<p>Alternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">df <- tibble(results = parse_json(json)$results)
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">df <- tibble(results = parse_json(json)$results)
 | 
			
		||||
df |> 
 | 
			
		||||
  unnest_wider(results)
 | 
			
		||||
#> # A tibble: 2 × 2
 | 
			
		||||
@@ -1090,7 +1090,7 @@ Exercises</h2>
 | 
			
		||||
<ol type="1"><li>
 | 
			
		||||
<p>Rectangle the <code>df_col</code> and <code>df_row</code> below. They represent the two ways of encoding a data frame in JSON.</p>
 | 
			
		||||
<div class="cell">
 | 
			
		||||
<pre data-type="programlisting" data-code-language="downlit">json_col <- parse_json('
 | 
			
		||||
<pre data-type="programlisting" data-code-language="r">json_col <- parse_json('
 | 
			
		||||
  {
 | 
			
		||||
    "x": ["a", "x", "z"],
 | 
			
		||||
    "y": [10, null, 3]
 | 
			
		||||
 
 | 
			
		||||