Re-render book for O'Reilly

Hadley Wickham
2023-01-12 17:22:57 -06:00
parent 28671ed8bd
commit 360d65ae47
113 changed files with 4957 additions and 2997 deletions


@@ -1,15 +1,15 @@
<section data-type="chapter" id="chp-rectangling">
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data rectangling</span></span></h1>
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Hierarchical data</span></span></h1>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you'll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally tree-like and converting it into a rectangular data frames made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
<p>In this chapter, you'll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
<p>To learn about rectangling, you'll need to first learn about lists, the data structure that makes hierarchical data possible. Then you'll learn about two crucial tidyr functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">tidyr::unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">tidyr::unnest_wider()</a></code>. We'll then show you a few case studies, applying these simple functions again and again to solve real problems. We'll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.</p>
<section id="prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we'll use many functions from tidyr, a core member of the tidyverse. We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.</p>
<p>In this chapter, we'll use many functions from tidyr, a core member of the tidyverse. We'll also use repurrrsive to provide some interesting datasets for rectangling practice, and we'll finish by using jsonlite to read JSON files into R lists.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
library(repurrrsive)
@@ -21,7 +21,7 @@ library(jsonlite)</pre>
<section id="lists" data-type="sect1">
<h1>
Lists</h1>
<p>So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they're homogeneous: every element is the same type. If you want to store element of different types in the same vector, you'll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
<p>So far you've worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they're homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you'll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x1 &lt;- list(1:4, "a", TRUE)
x1
@@ -135,7 +135,7 @@ str(x5)
<section id="list-columns" data-type="sect2">
<h2>
List-columns</h2>
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn't usually belong in a tibble. In particular, list-columns are are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like models or resamples in a data frame.</p>
<p>Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to place objects in a tibble that wouldn't usually belong in there. In particular, list-columns are used a lot in the <a href="https://www.tidymodels.org">tidymodels</a> ecosystem, because they allow you to store things like model outputs or resamples in a data frame.</p>
<p>Here's a simple example of a list-column:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df &lt;- tibble(
@@ -160,7 +160,7 @@ df
#&gt; 1 1 a &lt;list [2]&gt;</pre>
</div>
<p>Computing with list-columns is harder, but that's because computing with lists is harder in general; we'll come back to that in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>. In this chapter, we'll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.</p>
<p>The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there's no good way to print it. If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you learned above:</p>
<p>The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there's no good way to print it. If you want to see it, you'll need to pull the list-column out and apply one of the techniques that you've learned above:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df |&gt;
filter(x == 1) |&gt;
@@ -188,7 +188,7 @@ Base R
#&gt; x y
#&gt; 1 1, 2 1, 2
#&gt; 2 3, 4, 5 3, 4, 5</pre>
</div><p>It's easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like either vectors and the print method has been designed with lists in mind.</p></div>
</div><p>It's easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
</section>
</section>
@@ -307,7 +307,7 @@ df6 |&gt; unnest_longer(y)
#&gt; 5 3 31 a
#&gt; 6 3 32 b</pre>
</div>
<p>If you don't want these <code>ids</code>, you can suppress them with <code>indices_include = FALSE</code>. On the other hand, it's sometimes useful to retain the position of unnamed elements in unnamed list-columns. You can do this with <code>indices_include = TRUE</code>:</p>
<p>If you don't want these <code>ids</code>, you can suppress them with <code>indices_include = FALSE</code>. On the other hand, sometimes the positions of the elements are meaningful, and even if the elements are unnamed, you might still want to track their indices. You can do this with <code>indices_include = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df2 |&gt;
unnest_longer(y, indices_include = TRUE)
@@ -326,7 +326,7 @@ df6 |&gt; unnest_longer(y)
<section id="inconsistent-types" data-type="sect2">
<h2>
Inconsistent types</h2>
<p>What happens if you unnest a list-column contains different types of vector? For example, take the following dataset where the list-column <code>y</code> contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.</p>
<p>What happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column <code>y</code> contains two numbers, a factor, and a logical, which can't normally be mixed in a single column.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df4 &lt;- tribble(
~x, ~y,
@@ -334,7 +334,7 @@ Inconsistent types</h2>
"b", list(TRUE, factor("a"), 5)
)</pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns change, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> produce five rows while keeping everything in <code>y</code>?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df4 |&gt;
unnest_longer(y)
@@ -348,7 +348,7 @@ Inconsistent types</h2>
#&gt; 5 b &lt;dbl [1]&gt;</pre>
</div>
<p>As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> can't find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though the contents of each element are of a different type.</p>
<p>What happens if you find this problem in a dataset you're trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. It's not particularly useful here because there's really only one class that these five elements can be converted to: character.</p>
<p>What happens if you find this problem in a dataset you're trying to rectangle? There are two basic options. You could use the <code>transform</code> argument to coerce all inputs to a common type. However, it's not particularly useful here because there's really only one class that these five elements can be converted to: character.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df4 |&gt;
unnest_longer(y, transform = as.character)
@@ -372,7 +372,7 @@ Inconsistent types</h2>
#&gt; 1 a &lt;dbl [1]&gt;
#&gt; 2 b &lt;dbl [1]&gt;</pre>
</div>
<p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more:</p>
<p>Then you can call <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> once more. This gives us a rectangular dataset of just the numeric values.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df4 |&gt;
unnest_longer(y) |&gt;
@@ -392,12 +392,12 @@ Inconsistent types</h2>
Other functions</h2>
<p>tidyr has a few other useful rectangling functions that we're not going to cover in this book:</p>
<ul><li>
<code><a href="https://tidyr.tidyverse.org/reference/unnest_auto.html">unnest_auto()</a></code> automatically picks between <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> based on the structure of the list-column. It's a great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.</li>
<code><a href="https://tidyr.tidyverse.org/reference/unnest_auto.html">unnest_auto()</a></code> automatically picks between <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> based on the structure of the list-column. It's great for rapid exploration, but ultimately it's a bad idea because it doesn't force you to understand how your data is structured, and makes your code harder to understand.</li>
<li>
<code><a href="https://tidyr.tidyverse.org/reference/unnest.html">unnest()</a></code> expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book.</li>
<code><a href="https://tidyr.tidyverse.org/reference/unnest.html">unnest()</a></code> expands both rows and columns. It's useful when you have a list-column that contains a 2d structure like a data frame, which you don't see in this book, but you might encounter if you use the <a href="https://www.tmwr.org/base-r.html#combining-base-r-models-and-the-tidyverse">tidymodels</a> ecosystem.</li>
<li>
<code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code> allows you to reach into a deeply nested list and extract just the components that you need. It's mostly equivalent to repeated invocations of <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> + <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> so read up on it if you're trying to extract just a couple of important variables embedded in a bunch of data that you don't care about.</li>
</ul><p>These are good to know about when you're reading other people's code or tackling rarer rectangling challenges.</p>
</ul><p>These functions are good to know about as you might encounter them when reading other people's code or tackling rarer rectangling challenges yourself.</p>
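<p>As a minimal sketch of what <code>hoist()</code> does, the example below pulls two components out of a nested list-column. The tibble <code>df_nested</code> and its fields (<code>name</code>, <code>info</code>, <code>score</code>, <code>tag</code>) are invented for illustration and don't come from any dataset used in this chapter:</p>

```r
library(tidyr)
library(tibble)

# Hypothetical data: each row carries a small nested list in `meta`.
df_nested <- tibble(
  id = 1:2,
  meta = list(
    list(name = "a", info = list(score = 10, tag = "x")),
    list(name = "b", info = list(score = 20, tag = "y"))
  )
)

# Extract only the pieces we care about:
# - `name` is a top-level component of each list
# - `score` lives one level down, so we give the path as a list
hoisted <- df_nested |>
  hoist(meta, name = "name", score = list("info", "score"))

hoisted
```

<p>Anything that isn't hoisted stays behind in <code>meta</code>, so you can come back for it later if it turns out to matter.</p>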
</section>
<section id="exercises" data-type="sect2">
@@ -424,7 +424,7 @@ Case studies</h1>
<section id="very-wide-data" data-type="sect2">
<h2>
Very wide data</h2>
<p>We'll with <code>gh_repos</code>. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; you might want to explore a little on your own with <code>View(gh_repos)</code> before we continue.</p>
<p>We'll start with <code>gh_repos</code>. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It's a very deeply nested list so it's difficult to show the structure in this book; we recommend exploring a little on your own with <code>View(gh_repos)</code> before we continue.</p>
<p><code>gh_repos</code> is a list, but our tools work with list-columns, so we'll begin by putting it into a tibble. We call the column <code>json</code> for reasons we'll get to later.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">repos &lt;- tibble(json = gh_repos)
@@ -460,21 +460,21 @@ repos
unnest_longer(json) |&gt;
unnest_wider(json)
#&gt; # A tibble: 176 × 68
#&gt; id name full_…¹ owner private html_…² descr…³ fork url
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;lgl&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt; &lt;chr&gt;
#&gt; 1 61160198 after gaborc… &lt;named list&gt; FALSE https:… Run Co… FALSE http…
#&gt; 2 40500181 argufy gaborc… &lt;named list&gt; FALSE https:… Declar… FALSE http…
#&gt; 3 36442442 ask gaborc… &lt;named list&gt; FALSE https:… Friend… FALSE http…
#&gt; 4 34924886 baseimpo… gaborc… &lt;named list&gt; FALSE https:… Do we … FALSE http…
#&gt; 5 61620661 citest gaborc… &lt;named list&gt; FALSE https:… Test R… TRUE http…
#&gt; 6 33907457 clisymbo… gaborc… &lt;named list&gt; FALSE https:… Unicod… FALSE http…
#&gt; # … with 170 more rows, 59 more variables: forks_url &lt;chr&gt;, keys_url &lt;chr&gt;,
#&gt; # collaborators_url &lt;chr&gt;, teams_url &lt;chr&gt;, hooks_url &lt;chr&gt;,
#&gt; # issue_events_url &lt;chr&gt;, events_url &lt;chr&gt;, assignees_url &lt;chr&gt;,
#&gt; # branches_url &lt;chr&gt;, tags_url &lt;chr&gt;, blobs_url &lt;chr&gt;, git_tags_url &lt;chr&gt;,
#&gt; # git_refs_url &lt;chr&gt;, trees_url &lt;chr&gt;, statuses_url &lt;chr&gt;,
#&gt; # languages_url &lt;chr&gt;, stargazers_url &lt;chr&gt;, contributors_url &lt;chr&gt;,
#&gt; # subscribers_url &lt;chr&gt;, subscription_url &lt;chr&gt;, commits_url &lt;chr&gt;, …</pre>
#&gt; id name full_name owner private html_url description fork
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;lgl&gt; &lt;chr&gt; &lt;chr&gt; &lt;lgl&gt;
#&gt; 1 61160198 after gaborcsa&lt;named list&gt; FALSE https:/… Run Code i… FALSE
#&gt; 2 40500181 argufy gaborcsa&lt;named list&gt; FALSE https:/… Declarativ… FALSE
#&gt; 3 36442442 ask gaborcsa&lt;named list&gt; FALSE https:/… Friendly C… FALSE
#&gt; 4 34924886 baseimp… gaborcsa&lt;named list&gt; FALSE https:/… Do we get … FALSE
#&gt; 5 61620661 citest gaborcsa&lt;named list&gt; FALSE https:/… Test R pac… TRUE
#&gt; 6 33907457 clisymb… gaborcsa&lt;named list&gt; FALSE https:/… Unicode sy… FALSE
#&gt; # … with 170 more rows, and 60 more variables: url &lt;chr&gt;, forks_url &lt;chr&gt;,
#&gt; # keys_url &lt;chr&gt;, collaborators_url &lt;chr&gt;, teams_url &lt;chr&gt;,
#&gt; # hooks_url &lt;chr&gt;, issue_events_url &lt;chr&gt;, events_url &lt;chr&gt;,
#&gt; # assignees_url &lt;chr&gt;, branches_url &lt;chr&gt;, tags_url &lt;chr&gt;,
#&gt; # blobs_url &lt;chr&gt;, git_tags_url &lt;chr&gt;, git_refs_url &lt;chr&gt;,
#&gt; # trees_url &lt;chr&gt;, statuses_url &lt;chr&gt;, languages_url &lt;chr&gt;,
#&gt; # stargazers_url &lt;chr&gt;, contributors_url &lt;chr&gt;, subscribers_url &lt;chr&gt;, …</pre>
</div>
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn't even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
<div class="cell">
@@ -531,7 +531,7 @@ repos
unnest_wider(json) |&gt;
select(id, full_name, owner, description) |&gt;
unnest_wider(owner)
#&gt; Error in `unpack()`:
#&gt; Error in `unpack()` at tidyr/R/unnest-wider.R:121:2:
#&gt; ! Names must be unique.
#&gt; ✖ These names are duplicated:
#&gt; * "id" at locations 1 and 4.
@@ -546,21 +546,21 @@ repos
select(id, full_name, owner, description) |&gt;
unnest_wider(owner, names_sep = "_")
#&gt; # A tibble: 176 × 20
#&gt; id full_name owner…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsar… gaborc 660288 https:… "" https:… https:… https:…
#&gt; 2 40500181 gaborcsar… gaborc 660288 https:… "" https:… https:… https:…
#&gt; 3 36442442 gaborcsar… gaborc 660288 https:… "" https:… https:… https:…
#&gt; 4 34924886 gaborcsar… gaborc 660288 https:… "" https:… https:… https:…
#&gt; 5 61620661 gaborcsar… gaborc 660288 https:… "" https:… https:… https:…
#&gt; 6 33907457 gaborcsar… gaborc 660288 https:… "" https:… https:… https:…
#&gt; # … with 170 more rows, 11 more variables: owner_following_url &lt;chr&gt;,
#&gt; # owner_gists_url &lt;chr&gt;, owner_starred_url &lt;chr&gt;,
#&gt; # owner_subscriptions_url &lt;chr&gt;, owner_organizations_url &lt;chr&gt;,
#&gt; # owner_repos_url &lt;chr&gt;, owner_events_url &lt;chr&gt;,
#&gt; # owner_received_events_url &lt;chr&gt;, owner_type &lt;chr&gt;,
#&gt; # owner_site_admin &lt;lgl&gt;, description &lt;chr&gt;, and abbreviated variable
#&gt; # names ¹owner_login, ²owner_id, ³owner_avatar_url, ⁴owner_gravatar_id, …</pre>
#&gt; id full_name owner_login owner_id owner_avatar_url owner_gravatar_id
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 61160198 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 2 40500181 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 3 36442442 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 4 34924886 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 5 61620661 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; 6 33907457 gaborcsar… gaborcsardi 660288 https://avatars… ""
#&gt; # … with 170 more rows, and 14 more variables: owner_url &lt;chr&gt;,
#&gt; # owner_html_url &lt;chr&gt;, owner_followers_url &lt;chr&gt;,
#&gt; # owner_following_url &lt;chr&gt;, owner_gists_url &lt;chr&gt;,
#&gt; # owner_starred_url &lt;chr&gt;, owner_subscriptions_url &lt;chr&gt;,
#&gt; # owner_organizations_url &lt;chr&gt;, owner_repos_url &lt;chr&gt;,
#&gt; # owner_events_url &lt;chr&gt;, owner_received_events_url &lt;chr&gt;,
#&gt; # owner_type &lt;chr&gt;, owner_site_admin &lt;lgl&gt;, description &lt;chr&gt;</pre>
</div>
<p>This gives another wide dataset, but you can see that <code>owner</code> appears to contain a lot of additional data about the person who “owns” the repository.</p>
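<p>To see in isolation why <code>names_sep</code> resolves the duplicated-name error, here is a small sketch. The tibble <code>repos_toy</code> is a made-up stand-in for the real data: both the outer tibble and the nested <code>owner</code> list have an <code>id</code>, which is the same clash reported above.</p>

```r
library(tidyr)
library(tibble)

# Hypothetical data with an `id` at both levels, mimicking the clash above.
repos_toy <- tibble(
  id = 1:2,
  owner = list(
    list(id = 10L, login = "alice"),
    list(id = 20L, login = "bob")
  )
)

# names_sep prefixes every new column with the name of the list-column,
# so the inner `id` becomes `owner_id` and no longer collides.
wide <- repos_toy |>
  unnest_wider(owner, names_sep = "_")

names(wide)
```

<p>The prefix makes the column names longer, but it also records where each variable came from, which is handy when several list-columns share component names.</p>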
</section>
@@ -568,7 +568,7 @@ repos
<section id="relational-data" data-type="sect2">
<h2>
Relational data</h2>
<p>Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames. For example, take <code>got_chars</code>. Like <code>gh_repos</code> it's a list, so we start by turning it into a list-column of a tibble:</p>
<p>Nested data is sometimes used to represent data that we'd usually spread out into multiple data frames. For example, take <code>got_chars</code> which contains data about characters that appear in Game of Thrones. Like <code>gh_repos</code> it's a list, so we start by turning it into a list-column of a tibble:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">chars &lt;- tibble(json = got_chars)
chars
@@ -623,15 +623,15 @@ characters
unnest_wider(json) |&gt;
select(id, where(is.list))
#&gt; # A tibble: 30 × 8
#&gt; id titles aliases allegiances books povBooks tvSeries playe…¹
#&gt; &lt;int&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 1022 &lt;chr [3]&gt; &lt;chr [4]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 2 1052 &lt;chr [2]&gt; &lt;chr [11]&gt; &lt;chr [1]&gt; &lt;chr [2]&gt; &lt;chr [4]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 3 1074 &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 4 1109 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 5 1166 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 6 1267 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; # … with 24 more rows, and abbreviated variable name ¹playedBy</pre>
#&gt; id titles aliases allegiances books povBooks tvSeries playedBy
#&gt; &lt;int&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt; &lt;list&gt;
#&gt; 1 1022 &lt;chr [2]&gt; &lt;chr [4]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 2 1052 &lt;chr [2]&gt; &lt;chr [11]&gt; &lt;chr [1]&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 3 1074 &lt;chr [2]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 4 1109 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [1]&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 5 1166 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;chr [3]&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 6 1267 &lt;chr [1]&gt; &lt;chr [1]&gt; &lt;NULL&gt; &lt;chr [2]&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; # … with 24 more rows</pre>
</div>
<p>Let's explore the <code>titles</code> column. It's an unnamed list-column, so we'll unnest it into rows:</p>
<div class="cell">
@@ -639,16 +639,16 @@ characters
unnest_wider(json) |&gt;
select(id, titles) |&gt;
unnest_longer(titles)
#&gt; # A tibble: 60 × 2
#&gt; # A tibble: 59 × 2
#&gt; id titles
#&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1022 Prince of Winterfell
#&gt; 2 1022 Captain of Sea Bitch
#&gt; 3 1022 Lord of the Iron Islands (by law of the green lands)
#&gt; 4 1052 Acting Hand of the King (former)
#&gt; 5 1052 Master of Coin (former)
#&gt; 6 1074 Lord Captain of the Iron Fleet
#&gt; # … with 54 more rows</pre>
#&gt; 2 1022 Lord of the Iron Islands (by law of the green lands)
#&gt; 3 1052 Acting Hand of the King (former)
#&gt; 4 1052 Master of Coin (former)
#&gt; 5 1074 Lord Captain of the Iron Fleet
#&gt; 6 1074 Master of the Iron Victory
#&gt; # … with 53 more rows</pre>
</div>
<p>You might expect to see this data in its own table because it would be easy to join to the characters data as needed. To do so, we'll do a little cleaning: removing the rows containing empty strings and renaming <code>titles</code> to <code>title</code> since each row now only contains a single title.</p>
<div class="cell">
@@ -659,43 +659,42 @@ characters
filter(titles != "") |&gt;
rename(title = titles)
titles
#&gt; # A tibble: 53 × 2
#&gt; # A tibble: 52 × 2
#&gt; id title
#&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1022 Prince of Winterfell
#&gt; 2 1022 Captain of Sea Bitch
#&gt; 3 1022 Lord of the Iron Islands (by law of the green lands)
#&gt; 4 1052 Acting Hand of the King (former)
#&gt; 5 1052 Master of Coin (former)
#&gt; 6 1074 Lord Captain of the Iron Fleet
#&gt; # … with 47 more rows</pre>
#&gt; 2 1022 Lord of the Iron Islands (by law of the green lands)
#&gt; 3 1052 Acting Hand of the King (former)
#&gt; 4 1052 Master of Coin (former)
#&gt; 5 1074 Lord Captain of the Iron Fleet
#&gt; 6 1074 Master of the Iron Victory
#&gt; # … with 46 more rows</pre>
</div>
<p>Now, for example, we could use this table tofind all the characters that are captains and see all their titles:</p>
<p>Now, for example, we could use this table to find all the characters that are captains and see all their titles:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">captains &lt;- titles |&gt; filter(str_detect(title, "Captain"))
captains
#&gt; # A tibble: 5 × 2
#&gt; # A tibble: 4 × 2
#&gt; id title
#&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1022 Captain of Sea Bitch
#&gt; 2 1074 Lord Captain of the Iron Fleet
#&gt; 3 1166 Captain of the Guard at Sunspear
#&gt; 4 150 Captain of the Black Wind
#&gt; 5 60 Captain of the Golden Storm (formerly)
#&gt; 1 1074 Lord Captain of the Iron Fleet
#&gt; 2 1166 Captain of the Guard at Sunspear
#&gt; 3 150 Captain of the Black Wind
#&gt; 4 60 Captain of the Golden Storm (formerly)
characters |&gt;
select(id, name) |&gt;
inner_join(titles, by = "id", multiple = "all")
#&gt; # A tibble: 53 × 3
#&gt; # A tibble: 52 × 3
#&gt; id name title
#&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt;
#&gt; 1 1022 Theon Greyjoy Prince of Winterfell
#&gt; 2 1022 Theon Greyjoy Captain of Sea Bitch
#&gt; 3 1022 Theon Greyjoy Lord of the Iron Islands (by law of the green land…
#&gt; 4 1052 Tyrion Lannister Acting Hand of the King (former)
#&gt; 5 1052 Tyrion Lannister Master of Coin (former)
#&gt; 6 1074 Victarion Greyjoy Lord Captain of the Iron Fleet
#&gt; # … with 47 more rows</pre>
#&gt; 2 1022 Theon Greyjoy Lord of the Iron Islands (by law of the green land…
#&gt; 3 1052 Tyrion Lannister Acting Hand of the King (former)
#&gt; 4 1052 Tyrion Lannister Master of Coin (former)
#&gt; 5 1074 Victarion Greyjoy Lord Captain of the Iron Fleet
#&gt; 6 1074 Victarion Greyjoy Master of the Iron Victory
#&gt; # … with 46 more rows</pre>
</div>
<p>You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.</p>
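<p>The same recipe could be sketched for another list-column, such as <code>aliases</code>. To keep the example self-contained, it uses a tiny invented tibble, <code>characters_toy</code>, rather than the real <code>characters</code> data, so the names and values below are assumptions made for illustration:</p>

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the `characters` tibble used above.
characters_toy <- tibble(
  id = c(1L, 2L, 3L),
  name = c("Ann", "Bo", "Cy"),
  aliases = list(c("Red", "The Quick"), "", "Shadow")
)

# One row per alias: unnest into rows, drop empty strings, and rename
# the column to the singular since each row now holds a single alias.
aliases <- characters_toy |>
  select(id, aliases) |>
  unnest_longer(aliases) |>
  filter(aliases != "") |>
  rename(alias = aliases)

# Join back to the character names only when they're needed.
joined <- characters_toy |>
  select(id, name) |>
  inner_join(aliases, by = "id")

joined
```

<p>Characters without any alias drop out of the inner join; a <code>left_join()</code> would keep them with a missing <code>alias</code> instead.</p>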
</section>
@@ -703,36 +702,36 @@ characters |&gt;
<section id="a-dash-of-text-analysis" data-type="sect2">
<h2>
A dash of text analysis</h2>
<p>What if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by spitting on <code>" "</code>:</p>
<p>Sticking with the same data, what if we wanted to find the most common words in the title? One simple approach starts by using <code><a href="https://stringr.tidyverse.org/reference/str_split.html">str_split()</a></code> to break each element of <code>title</code> up into words by splitting on <code>" "</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">titles |&gt;
mutate(word = str_split(title, " "), .keep = "unused")
#&gt; # A tibble: 53 × 2
#&gt; # A tibble: 52 × 2
#&gt; id word
#&gt; &lt;int&gt; &lt;list&gt;
#&gt; 1 1022 &lt;chr [3]&gt;
#&gt; 2 1022 &lt;chr [4]&gt;
#&gt; 3 1022 &lt;chr [11]&gt;
#&gt; 4 1052 &lt;chr [6]&gt;
#&gt; 5 1052 &lt;chr [4]&gt;
#&gt; 6 1074 &lt;chr [6]&gt;
#&gt; # … with 47 more rows</pre>
#&gt; 2 1022 &lt;chr [11]&gt;
#&gt; 3 1052 &lt;chr [6]&gt;
#&gt; 4 1052 &lt;chr [4]&gt;
#&gt; 5 1074 &lt;chr [6]&gt;
#&gt; 6 1074 &lt;chr [5]&gt;
#&gt; # … with 46 more rows</pre>
</div>
<p>This creates a unnamed variable length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
<p>This creates an unnamed variable length list-column, so we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">titles |&gt;
mutate(word = str_split(title, " "), .keep = "unused") |&gt;
unnest_longer(word)
#&gt; # A tibble: 202 × 2
#&gt; # A tibble: 198 × 2
#&gt; id word
#&gt; &lt;int&gt; &lt;chr&gt;
#&gt; 1 1022 Prince
#&gt; 2 1022 of
#&gt; 3 1022 Winterfell
#&gt; 4 1022 Captain
#&gt; 4 1022 Lord
#&gt; 5 1022 of
#&gt; 6 1022 Sea
#&gt; # … with 196 more rows</pre>
#&gt; 6 1022 the
#&gt; # … with 192 more rows</pre>
</div>
<p>And then we can count that column to find the most common words:</p>
<div class="cell">
@@ -740,18 +739,18 @@ A dash of text analysis</h2>
mutate(word = str_split(title, " "), .keep = "unused") |&gt;
unnest_longer(word) |&gt;
count(word, sort = TRUE)
#&gt; # A tibble: 78 × 2
#&gt; word n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 of 41
#&gt; 2 the 29
#&gt; 3 Lord 9
#&gt; 4 Hand 6
#&gt; 5 Captain 5
#&gt; 6 King 5
#&gt; # … with 72 more rows</pre>
#&gt; # A tibble: 77 × 2
#&gt; word n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 of 40
#&gt; 2 the 29
#&gt; 3 Lord 9
#&gt; 4 Hand 6
#&gt; 5 King 5
#&gt; 6 Princess 5
#&gt; # … with 71 more rows</pre>
</div>
<p>Some of those words are not very interesting so we could create a list of common words to drop. In text analysis these is commonly called stop words.</p>
<p>Some of those words are not very interesting so we could create a list of common words to drop. In text analysis these are commonly called stop words.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">stop_words &lt;- tibble(word = c("of", "the"))
@@ -761,16 +760,16 @@ titles |&gt;
anti_join(stop_words) |&gt;
count(word, sort = TRUE)
#&gt; Joining with `by = join_by(word)`
#&gt; # A tibble: 76 × 2
#&gt; # A tibble: 75 × 2
#&gt; word n
#&gt; &lt;chr&gt; &lt;int&gt;
#&gt; 1 Lord 9
#&gt; 2 Hand 6
#&gt; 3 Captain 5
#&gt; 4 King 5
#&gt; 5 Princess 5
#&gt; 6 Queen 5
#&gt; # … with 70 more rows</pre>
#&gt; 3 King 5
#&gt; 4 Princess 5
#&gt; 5 Queen 5
#&gt; 6 Ser 5
#&gt; # … with 69 more rows</pre>
</div>
<p>Breaking up text into individual fragments is a powerful idea that underlies much of text analysis. If this sounds interesting, a good place to learn more is <a href="https://www.tidytextmining.com">Text Mining with R</a> by Julia Silge and David Robinson.</p>
</section>
@@ -803,7 +802,7 @@ Deeply nested</h2>
#&gt; 4 Chicago &lt;list [1]&gt; OK
#&gt; 5 Arlington &lt;list [2]&gt; OK</pre>
</div>
<p>This gives us the <code>status</code> and the <code>results</code>. We'll drop the status column since they're all <code>OK</code>; in a real analysis, you'd also want capture all the rows where <code>status != "OK"</code> and figure out what went wrong. <code>results</code> is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:</p>
<p>This gives us the <code>status</code> and the <code>results</code>. We'll drop the status column since they're all <code>OK</code>; in a real analysis, you'd also want to capture all the rows where <code>status != "OK"</code> and figure out what went wrong. <code>results</code> is an unnamed list, with either one or two elements (we'll see why shortly) so we'll unnest it into rows:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gmaps_cities |&gt;
unnest_wider(json) |&gt;
@@ -829,15 +828,15 @@ Deeply nested</h2>
unnest_wider(results)
locations
#&gt; # A tibble: 7 × 6
#&gt; city address_components formatted_address geometry place…¹ types
#&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston &lt;list [4]&gt; Houston, TX, USA &lt;named list&gt; ChIJAY… &lt;list&gt;
#&gt; 2 Washington &lt;list [2]&gt; Washington, USA &lt;named list&gt; ChIJ-b… &lt;list&gt;
#&gt; 3 Washington &lt;list [4]&gt; Washington, DC, … &lt;named list&gt; ChIJW-… &lt;list&gt;
#&gt; 4 New York &lt;list [3]&gt; New York, NY, USA &lt;named list&gt; ChIJOw… &lt;list&gt;
#&gt; 5 Chicago &lt;list [4]&gt; Chicago, IL, USA &lt;named list&gt; ChIJ7c… &lt;list&gt;
#&gt; 6 Arlington &lt;list [4]&gt; Arlington, TX, U… &lt;named list&gt; ChIJ05… &lt;list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹place_id</pre>
#&gt; city address_compone…¹ formatted_address geometry place_id types
#&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston &lt;list [4]&gt; Houston, TX, USA &lt;named list&gt; ChIJAYW&lt;list&gt;
#&gt; 2 Washington &lt;list [2]&gt; Washington, USA &lt;named list&gt; ChIJ-bD&lt;list&gt;
#&gt; 3 Washington &lt;list [4]&gt; Washington, DC, … &lt;named list&gt; ChIJW-T&lt;list&gt;
#&gt; 4 New York &lt;list [3]&gt; New York, NY, USA &lt;named list&gt; ChIJOwg&lt;list&gt;
#&gt; 5 Chicago &lt;list [4]&gt; Chicago, IL, USA &lt;named list&gt; ChIJ7cv&lt;list&gt;
#&gt; 6 Arlington &lt;list [4]&gt; Arlington, TX, U… &lt;named list&gt; ChIJ05g&lt;list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹address_components</pre>
</div>
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
@@ -846,15 +845,15 @@ locations
select(city, formatted_address, geometry) |&gt;
unnest_wider(geometry)
#&gt; # A tibble: 7 × 6
#&gt; city formatted_address bounds location locat…¹ viewport
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list&gt; &lt;named list&gt; APPROX&lt;named list&gt;
#&gt; 2 Washington Washington, USA &lt;named list&gt; &lt;named list&gt; APPROX&lt;named list&gt;
#&gt; 3 Washington Washington, DC, &lt;named list&gt; &lt;named list&gt; APPROX&lt;named list&gt;
#&gt; 4 New York New York, NY, USA &lt;named list&gt; &lt;named list&gt; APPROX&lt;named list&gt;
#&gt; 5 Chicago Chicago, IL, USA &lt;named list&gt; &lt;named list&gt; APPROX&lt;named list&gt;
#&gt; 6 Arlington Arlington, TX, U &lt;named list&gt; &lt;named list&gt; APPROX&lt;named list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹location_type</pre>
#&gt; city formatted_address bounds location location_type
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;list&gt; &lt;chr&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE
#&gt; 2 Washington Washington, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE
#&gt; 3 Washington Washington, DC, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE
#&gt; 4 New York New York, NY, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE
#&gt; 5 Chicago Chicago, IL, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE
#&gt; 6 Arlington Arlington, TX, USA &lt;named list [2]&gt; &lt;named list&gt; APPROXIMATE
#&gt; # … with 1 more row, and 1 more variable: viewport &lt;list&gt;</pre>
</div>
<p>That gives us new <code>bounds</code> (a rectangular region) and <code>location</code> (a point). We can unnest <code>location</code> to see the latitude (<code>lat</code>) and longitude (<code>lng</code>):</p>
<div class="cell">
@@ -863,15 +862,15 @@ locations
unnest_wider(geometry) |&gt;
unnest_wider(location)
#&gt; # A tibble: 7 × 7
#&gt; city formatted_address bounds lat lng locat…¹ viewport
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;list&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list&gt; 29.8 -95.4 APPROX&lt;named list&gt;
#&gt; 2 Washington Washington, USA &lt;named list&gt; 47.8 -121. APPROX&lt;named list&gt;
#&gt; 3 Washington Washington, DC, &lt;named list&gt; 38.9 -77.0 APPROX&lt;named list&gt;
#&gt; 4 New York New York, NY, USA &lt;named list&gt; 40.7 -74.0 APPROX&lt;named list&gt;
#&gt; 5 Chicago Chicago, IL, USA &lt;named list&gt; 41.9 -87.6 APPROX&lt;named list&gt;
#&gt; 6 Arlington Arlington, TX, U &lt;named list&gt; 32.7 -97.1 APPROX&lt;named list&gt;
#&gt; # … with 1 more row, and abbreviated variable name ¹location_type</pre>
#&gt; city formatted_address bounds lat lng location_type
#&gt; &lt;chr&gt; &lt;chr&gt; &lt;list&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt;
#&gt; 1 Houston Houston, TX, USA &lt;named list [2]&gt; 29.8 -95.4 APPROXIMATE
#&gt; 2 Washington Washington, USA &lt;named list [2]&gt; 47.8 -121. APPROXIMATE
#&gt; 3 Washington Washington, DC, USA &lt;named list [2]&gt; 38.9 -77.0 APPROXIMATE
#&gt; 4 New York New York, NY, USA &lt;named list [2]&gt; 40.7 -74.0 APPROXIMATE
#&gt; 5 Chicago Chicago, IL, USA &lt;named list [2]&gt; 41.9 -87.6 APPROXIMATE
#&gt; 6 Arlington Arlington, TX, USA &lt;named list [2]&gt; 32.7 -97.1 APPROXIMATE
#&gt; # … with 1 more row, and 1 more variable: viewport &lt;list&gt;</pre>
</div>
<p>Extracting the bounds requires a few more steps:</p>
<div class="cell">
@@ -913,7 +912,7 @@ locations
#&gt; # … with 1 more row</pre>
</div>
<p>Note how we unnest two columns simultaneously by supplying a vector of variable names to <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>.</p>
<p>This is somewhere that <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned briefly above, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
<p>This is where <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>, mentioned earlier in the chapter, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using <code><a href="https://tidyr.tidyverse.org/reference/hoist.html">hoist()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">locations |&gt;
select(city, formatted_address, geometry) |&gt;
@@ -972,7 +971,7 @@ Data types</h2>
<p>JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:</p>
<ul><li>The simplest type is a null (<code>null</code>) which plays the same role as both <code>NULL</code> and <code>NA</code> in R. It represents the absence of data.</li>
<li>A <strong>string</strong> is much like a string in R, but must always use double quotes.</li>
<li>A <strong>number</strong> is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.</li>
<li>A <strong>number</strong> is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support <code>Inf</code>, <code>-Inf</code>, or <code>NaN</code>.</li>
<li>A <strong>boolean</strong> is similar to R’s <code>TRUE</code> and <code>FALSE</code>, but uses lowercase <code>true</code> and <code>false</code>.</li>
</ul><p>JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.</p>
<p>Both arrays and objects are similar to lists in R; the difference is whether or not they’re named. An <strong>array</strong> is like an unnamed list, and is written with <code>[]</code>. For example <code>[1, 2, 3]</code> is an array containing 3 numbers, and <code>[null, 1, "string", false]</code> is an array that contains a null, a number, a string, and a boolean. An <strong>object</strong> is like a named list, and is written with <code><a href="https://rdrr.io/r/base/Paren.html">{}</a></code>. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, <code>{"x": 1, "y": 2}</code> is an object that maps <code>x</code> to 1 and <code>y</code> to 2.</p>
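<p>As a rough illustration of how these types map onto R, the sketch below parses two of the JSON snippets from the paragraph above with <code>parse_json()</code> from the jsonlite package. It assumes jsonlite is installed; the names <code>arr</code> and <code>obj</code> are made up for this example.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(jsonlite)

# An array is parsed into an unnamed list; each element keeps its own type
# (arr is a hypothetical name used only for this sketch)
arr &lt;- parse_json('[null, 1, "string", false]')
str(arr)

# An object is parsed into a named list; the JSON keys become the names
# (obj is likewise a hypothetical name)
obj &lt;- parse_json('{"x": 1, "y": 2}')
names(obj)
str(obj)</pre>
</div>
<p>Because <code>parse_json()</code> does not simplify its result by default, the array comes back as a list rather than an atomic vector, which is why a null can sit alongside a number, a string, and a boolean.</p>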
@@ -994,7 +993,7 @@ gh_users2 &lt;- read_json(gh_users_json())
identical(gh_users, gh_users2)
#&gt; [1] TRUE</pre>
</div>
<p>In this book, I’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here’s three simple JSON datasets, starting with a number, then putting a few number in an array, then putting that array in an object:</p>
<p>In this book, we’ll also use <code><a href="https://rdrr.io/pkg/jsonlite/man/read_json.html">parse_json()</a></code>, since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">str(parse_json('1'))
#&gt; int 1
@@ -1038,7 +1037,7 @@ df |&gt;
#&gt; 1 John 34
#&gt; 2 Susan 27</pre>
</div>
<p>In rarer cases, the JSON consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it a list, before you put it in a tibble.</p>
<p>In rarer cases, the JSON file consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list, before you put it in a tibble.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">json &lt;- '{
"status": "OK",
@@ -1114,7 +1113,7 @@ df_row &lt;- tibble(json = json_row)</pre>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you learned what lists are, how you can generate the from JSON files, and how turn them into rectangular data frames. Surprisingly we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is, all you need to do is repeatedly call these two functions.</p>
<p>In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.</p>
<p>JSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.</p>