More minor page count tweaks & fixes
And re-convert with latest htmlbook
@@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-rectangling">
<h1><span id="sec-rectangling" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Hierarchical data</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="rectangling-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>In this chapter, you’ll learn the art of data <strong>rectangling</strong>, taking data that is fundamentally hierarchical, or tree-like, and converting it into a rectangular data frame made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.</p>
<p>To learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">tidyr::unnest_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">tidyr::unnest_wider()</a></code>. We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.</p>
<section id="prerequisites" data-type="sect2">
<section id="rectangling-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.</p>
@@ -18,7 +18,7 @@ library(jsonlite)</pre>
</section>
</section>
<section id="lists" data-type="sect1">
<section id="rectangling-lists" data-type="sect1">
<h1>
Lists</h1>
<p>So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is of the same data type. If you want to store elements of different types in the same vector, you’ll need a <strong>list</strong>, which you create with <code><a href="https://rdrr.io/r/base/list.html">list()</a></code>:</p>
@@ -174,13 +174,19 @@ df
<p>Similarly, if you <code><a href="https://rdrr.io/r/utils/View.html">View()</a></code> a data frame in RStudio, you’ll get the standard tabular view, which doesn’t allow you to selectively expand list columns. To explore those fields you’ll need to <code><a href="https://dplyr.tidyverse.org/reference/pull.html">pull()</a></code> and view, e.g. <code>df |> pull(z) |> View()</code>.</p>
<div data-type="note"><h1>
Base R
</h1><p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p><div class="cell">
</h1>
<p>It’s possible to put a list in a column of a <code>data.frame</code>, but it’s a lot fiddlier because <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> treats a list as a list of columns:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">data.frame(x = list(1:3, 3:5))
#> x.1.3 x.3.5
#> 1 1 3
#> 2 2 4
#> 3 3 5</pre>
</div><p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn’t print particularly well:</p><div class="cell">
</div>
<p>You can force <code><a href="https://rdrr.io/r/base/data.frame.html">data.frame()</a></code> to treat a list as a list of rows by wrapping it in <code><a href="https://rdrr.io/r/base/AsIs.html">I()</a></code>, but the result doesn’t print particularly well:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">data.frame(
x = I(list(1:2, 3:5)),
y = c("1, 2", "3, 4, 5")
@@ -188,7 +194,10 @@ Base R
#> x y
#> 1 1, 2 1, 2
#> 2 3, 4, 5 3, 4, 5</pre>
</div><p>It’s easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p></div>
</div>
<p>It’s easier to use list-columns with tibbles because <code><a href="https://tibble.tidyverse.org/reference/tibble.html">tibble()</a></code> treats lists like vectors and the print method has been designed with lists in mind.</p>
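<p>As a minimal sketch of that point, the same values used in the <code>data.frame()</code> example can be passed straight to <code>tibble()</code> without <code>I()</code>. The name <code>df_list</code> is only illustrative and does not appear elsewhere in the chapter:</p>

```r
library(tibble)

# tibble() keeps a list as a single list-column; no I() wrapper is needed
df_list <- tibble(
  x = list(1:2, 3:5),
  y = c("1, 2", "3, 4, 5")
)

# x is a list-column with one element per row
is.list(df_list$x)
lengths(df_list$x)
```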
</div>
</section>
</section>
@@ -220,7 +229,7 @@ df2 <- tribble(
<section id="unnest_wider" data-type="sect2">
<h2>
<code>unnest_wider()</code>
unnest_wider()
</h2>
<p>When each row has the same number of elements with the same names, like <code>df1</code>, it’s natural to put each component into its own column with <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code>:</p>
<div class="cell">
@@ -260,7 +269,7 @@ df2 <- tribble(
<section id="unnest_longer" data-type="sect2">
<h2>
<code>unnest_longer()</code>
unnest_longer()
</h2>
<p>When each row contains an unnamed list, it’s most natural to put each element into its own row with <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code>:</p>
<div class="cell">
@@ -387,7 +396,7 @@ Inconsistent types</h2>
<p>You’ll learn more about <code><a href="https://purrr.tidyverse.org/reference/map.html">map_lgl()</a></code> in <a href="#chp-iteration" data-type="xref">#chp-iteration</a>.</p>
</section>
<section id="other-functions" data-type="sect2">
<section id="rectangling-other-functions" data-type="sect2">
<h2>
Other functions</h2>
<p>tidyr has a few other useful rectangling functions that we’re not going to cover in this book:</p>
@@ -400,7 +409,7 @@ Other functions</h2>
</ul><p>These functions are good to know about as you might encounter them when reading other people’s code or tackling rarer rectangling challenges yourself.</p>
</section>
<section id="exercises" data-type="sect2">
<section id="rectangling-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -460,51 +469,26 @@ repos
unnest_longer(json) |>
unnest_wider(json)
#> # A tibble: 176 × 68
#> id name full_name owner private html_url description fork
#> <int> <chr> <chr> <list> <lgl> <chr> <chr> <lgl>
#> 1 61160198 after gaborcsa… <named list> FALSE https:/… Run Code i… FALSE
#> 2 40500181 argufy gaborcsa… <named list> FALSE https:/… Declarativ… FALSE
#> 3 36442442 ask gaborcsa… <named list> FALSE https:/… Friendly C… FALSE
#> 4 34924886 baseimp… gaborcsa… <named list> FALSE https:/… Do we get … FALSE
#> 5 61620661 citest gaborcsa… <named list> FALSE https:/… Test R pac… TRUE
#> 6 33907457 clisymb… gaborcsa… <named list> FALSE https:/… Unicode sy… FALSE
#> # … with 170 more rows, and 60 more variables: url <chr>, forks_url <chr>,
#> # keys_url <chr>, collaborators_url <chr>, teams_url <chr>,
#> # hooks_url <chr>, issue_events_url <chr>, events_url <chr>,
#> # assignees_url <chr>, branches_url <chr>, tags_url <chr>,
#> # blobs_url <chr>, git_tags_url <chr>, git_refs_url <chr>,
#> # trees_url <chr>, statuses_url <chr>, languages_url <chr>,
#> # stargazers_url <chr>, contributors_url <chr>, subscribers_url <chr>, …</pre>
#> id name full_name owner private html_url
#> <int> <chr> <chr> <list> <lgl> <chr>
#> 1 61160198 after gaborcsardi/after <named list> FALSE https://github…
#> 2 40500181 argufy gaborcsardi/argu… <named list> FALSE https://github…
#> 3 36442442 ask gaborcsardi/ask <named list> FALSE https://github…
#> 4 34924886 baseimports gaborcsardi/base… <named list> FALSE https://github…
#> 5 61620661 citest gaborcsardi/cite… <named list> FALSE https://github…
#> 6 33907457 clisymbols gaborcsardi/clis… <named list> FALSE https://github…
#> # … with 170 more rows, and 62 more variables: description <chr>,
#> # fork <lgl>, url <chr>, forks_url <chr>, keys_url <chr>, …</pre>
</div>
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>:</p>
<p>This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with <code><a href="https://rdrr.io/r/base/names.html">names()</a></code>; here we look at the first 10:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">repos |>
unnest_longer(json) |>
unnest_wider(json) |>
names()
#> [1] "id" "name" "full_name"
#> [4] "owner" "private" "html_url"
#> [7] "description" "fork" "url"
#> [10] "forks_url" "keys_url" "collaborators_url"
#> [13] "teams_url" "hooks_url" "issue_events_url"
#> [16] "events_url" "assignees_url" "branches_url"
#> [19] "tags_url" "blobs_url" "git_tags_url"
#> [22] "git_refs_url" "trees_url" "statuses_url"
#> [25] "languages_url" "stargazers_url" "contributors_url"
#> [28] "subscribers_url" "subscription_url" "commits_url"
#> [31] "git_commits_url" "comments_url" "issue_comment_url"
#> [34] "contents_url" "compare_url" "merges_url"
#> [37] "archive_url" "downloads_url" "issues_url"
#> [40] "pulls_url" "milestones_url" "notifications_url"
#> [43] "labels_url" "releases_url" "deployments_url"
#> [46] "created_at" "updated_at" "pushed_at"
#> [49] "git_url" "ssh_url" "clone_url"
#> [52] "svn_url" "homepage" "size"
#> [55] "stargazers_count" "watchers_count" "language"
#> [58] "has_issues" "has_downloads" "has_wiki"
#> [61] "has_pages" "forks_count" "mirror_url"
#> [64] "open_issues_count" "forks" "open_issues"
#> [67] "watchers" "default_branch"</pre>
names() |>
head(10)
#> [1] "id" "name" "full_name" "owner" "private"
#> [6] "html_url" "description" "fork" "url" "forks_url"</pre>
</div>
<p>Let’s select a few that look interesting:</p>
<div class="cell">
@@ -523,7 +507,7 @@ repos
#> 6 33907457 gaborcsardi/clisymbols <named list [17]> Unicode symbols for CLI…
#> # … with 170 more rows</pre>
</div>
<p>You can use this to work back to understand how <code>gh_repos</code> was strucured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
<p>You can use this to work back to understand how <code>gh_repos</code> was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.</p>
<p><code>owner</code> is another list-column, and since it contains a named list, we can use <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to get at the values:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">repos |>
@@ -531,11 +515,13 @@ repos
unnest_wider(json) |>
select(id, full_name, owner, description) |>
unnest_wider(owner)
#> Error in `unpack()` at tidyr/R/unnest-wider.R:121:2:
#> ! Names must be unique.
#> Error in `unnest_wider()`:
#> ! Can't duplicate names between the affected columns and the original
#> data.
#> ✖ These names are duplicated:
#> * "id" at locations 1 and 4.
#> ℹ Use argument `names_repair` to specify repair strategy.</pre>
#> ℹ `id`, from `owner`.
#> ℹ Use `names_sep` to disambiguate using the column name.
#> ℹ Or use `names_repair` to specify a repair strategy.</pre>
</div>
<!--# TODO: https://github.com/tidyverse/tidyr/issues/1390 -->
<p>Uh oh, this list column also contains an <code>id</code> column and we can’t have two <code>id</code> columns in the same data frame. Rather than following the advice to use <code>names_repair</code> (which would also work), we’ll instead use <code>names_sep</code>:</p>
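<p>To compare the two strategies side by side, here is a small sketch on a made-up tibble. The object <code>mini</code> and its values are hypothetical stand-ins for the repository data, chosen only so that the list-column repeats the <code>id</code> name:</p>

```r
library(tidyr)
library(tibble)

# Hypothetical miniature of the repos data: `owner` has its own `id` field
mini <- tibble(
  id = c(1L, 2L),
  owner = list(
    list(login = "a", id = 10L),
    list(login = "b", id = 20L)
  )
)

# names_sep prefixes the new columns with the list-column's name
by_sep <- mini |> unnest_wider(owner, names_sep = "_")
names(by_sep)

# names_repair = "unique" instead renames the clashing columns
by_repair <- mini |> unnest_wider(owner, names_repair = "unique")
names(by_repair)
```

<p>Prefixing with <code>names_sep</code> tends to keep the origin of each column obvious, which is likely why it reads better in a wide result.</p>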
@@ -546,21 +532,16 @@ repos
select(id, full_name, owner, description) |>
unnest_wider(owner, names_sep = "_")
#> # A tibble: 176 × 20
#> id full_name owner_login owner_id owner_avatar_url owner_gravatar_id
#> <int> <chr> <chr> <int> <chr> <chr>
#> 1 61160198 gaborcsar… gaborcsardi 660288 https://avatars… ""
#> 2 40500181 gaborcsar… gaborcsardi 660288 https://avatars… ""
#> 3 36442442 gaborcsar… gaborcsardi 660288 https://avatars… ""
#> 4 34924886 gaborcsar… gaborcsardi 660288 https://avatars… ""
#> 5 61620661 gaborcsar… gaborcsardi 660288 https://avatars… ""
#> 6 33907457 gaborcsar… gaborcsardi 660288 https://avatars… ""
#> # … with 170 more rows, and 14 more variables: owner_url <chr>,
#> # owner_html_url <chr>, owner_followers_url <chr>,
#> # owner_following_url <chr>, owner_gists_url <chr>,
#> # owner_starred_url <chr>, owner_subscriptions_url <chr>,
#> # owner_organizations_url <chr>, owner_repos_url <chr>,
#> # owner_events_url <chr>, owner_received_events_url <chr>,
#> # owner_type <chr>, owner_site_admin <lgl>, description <chr></pre>
#> id full_name owner_login owner_id owner_avatar_url
#> <int> <chr> <chr> <int> <chr>
#> 1 61160198 gaborcsardi/after gaborcsardi 660288 https://avatars.gith…
#> 2 40500181 gaborcsardi/argufy gaborcsardi 660288 https://avatars.gith…
#> 3 36442442 gaborcsardi/ask gaborcsardi 660288 https://avatars.gith…
#> 4 34924886 gaborcsardi/baseimports gaborcsardi 660288 https://avatars.gith…
#> 5 61620661 gaborcsardi/citest gaborcsardi 660288 https://avatars.gith…
#> 6 33907457 gaborcsardi/clisymbols gaborcsardi 660288 https://avatars.gith…
#> # … with 170 more rows, and 15 more variables: owner_gravatar_id <chr>,
#> # owner_url <chr>, owner_html_url <chr>, owner_followers_url <chr>, …</pre>
</div>
<p>This gives another wide dataset, but you can see that <code>owner</code> appears to contain a lot of additional data about the person who “owns” the repository.</p>
</section>
@@ -588,17 +569,16 @@ chars
<pre data-type="programlisting" data-code-language="r">chars |>
unnest_wider(json)
#> # A tibble: 30 × 18
#> url id name gender culture born died alive titles aliases father
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
#> 1 https:/… 1022 Theo… Male "Ironb… "In … "" TRUE <chr> <chr> ""
#> 2 https:/… 1052 Tyri… Male "" "In … "" TRUE <chr> <chr> ""
#> 3 https:/… 1074 Vict… Male "Ironb… "In … "" TRUE <chr> <chr> ""
#> 4 https:/… 1109 Will Male "" "" "In … FALSE <chr> <chr> ""
#> 5 https:/… 1166 Areo… Male "Norvo… "In … "" TRUE <chr> <chr> ""
#> 6 https:/… 1267 Chett Male "" "At … "In … FALSE <chr> <chr> ""
#> # … with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> # playedBy <list></pre>
#> url id name gender culture born
#> <chr> <int> <chr> <chr> <chr> <chr>
#> 1 https://www.anapio… 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or …
#> 2 https://www.anapio… 1052 Tyrion Lannist… Male "" "In 273 AC, at…
#> 3 https://www.anapio… 1074 Victarion Grey… Male "Ironborn" "In 268 AC or …
#> 4 https://www.anapio… 1109 Will Male "" ""
#> 5 https://www.anapio… 1166 Areo Hotah Male "Norvoshi" "In 257 AC or …
#> 6 https://www.anapio… 1267 Chett Male "" "At Hag's Mire"
#> # … with 24 more rows, and 12 more variables: died <chr>, alive <lgl>,
#> # titles <list>, aliases <list>, father <chr>, mother <chr>, …</pre>
</div>
<p>And selecting a few columns to make it easier to read:</p>
<div class="cell">
@@ -607,15 +587,15 @@ chars
select(id, name, gender, culture, born, died, alive)
characters
#> # A tibble: 30 × 7
#> id name gender culture born died alive
#> <int> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 279 AC… "" TRUE
#> 2 1052 Tyrion Lannister Male "" "In 273 AC, at Caste… "" TRUE
#> 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or before… "" TRUE
#> 4 1109 Will Male "" "" "In … FALSE
#> 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or before… "" TRUE
#> 6 1267 Chett Male "" "At Hag's Mire" "In … FALSE
#> # … with 24 more rows</pre>
#> id name gender culture born died
#> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1022 Theon Greyjoy Male "Ironborn" "In 278 AC or 27… ""
#> 2 1052 Tyrion Lannister Male "" "In 273 AC, at C… ""
#> 3 1074 Victarion Greyjoy Male "Ironborn" "In 268 AC or be… ""
#> 4 1109 Will Male "" "" "In 297 AC, at…
#> 5 1166 Areo Hotah Male "Norvoshi" "In 257 AC or be… ""
#> 6 1267 Chett Male "" "At Hag's Mire" "In 299 AC, at…
#> # … with 24 more rows, and 1 more variable: alive <lgl></pre>
</div>
<p>There are also many list-columns:</p>
<div class="cell">
@@ -828,15 +808,16 @@ Deeply nested</h2>
unnest_wider(results)
locations
#> # A tibble: 7 × 6
#> city address_compone…¹ formatted_address geometry place_id types
#> <chr> <list> <chr> <list> <chr> <list>
#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAYW… <list>
#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-bD… <list>
#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-T… <list>
#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOwg… <list>
#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7cv… <list>
#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05g… <list>
#> # … with 1 more row, and abbreviated variable name ¹address_components</pre>
#> city address_compone…¹ formatted_address geometry place_id
#> <chr> <list> <chr> <list> <chr>
#> 1 Houston <list [4]> Houston, TX, USA <named list> ChIJAYWNSLS4QI…
#> 2 Washington <list [2]> Washington, USA <named list> ChIJ-bDD5__lhV…
#> 3 Washington <list [4]> Washington, DC, … <named list> ChIJW-T2Wt7Gt4…
#> 4 New York <list [3]> New York, NY, USA <named list> ChIJOwg_06VPwo…
#> 5 Chicago <list [4]> Chicago, IL, USA <named list> ChIJ7cv00DwsDo…
#> 6 Arlington <list [4]> Arlington, TX, U… <named list> ChIJ05gI5NJiTo…
#> # … with 1 more row, 1 more variable: types <list>, and abbreviated variable
#> # name ¹address_components</pre>
</div>
<p>Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.</p>
<p>There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the <code>geometry</code> list-column:</p>
@@ -937,7 +918,7 @@ locations
<p>If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in <code>vignette("rectangling", package = "tidyr")</code>.</p>
</section>
<section id="exercises-1" data-type="sect2">
<section id="rectangling-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Roughly estimate when <code>gh_repos</code> was created. Why can you only roughly estimate the date?</p></li>
@@ -965,7 +946,7 @@ Exercises</h2>
JSON</h1>
<p>All of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for <strong>j</strong>ava<strong>s</strong>cript <strong>o</strong>bject <strong>n</strong>otation and is the way that most web APIs return data. It’s worth understanding because, while JSON’s and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so knowing a bit about JSON helps when things go wrong.</p>
<section id="data-types" data-type="sect2">
<section id="rectangling-data-types" data-type="sect2">
<h2>
Data types</h2>
<p>JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:</p>
@@ -1083,7 +1064,7 @@ Translation challenges</h2>
<p>Since JSON doesn’t have any way to represent dates or date-times, they’re often stored as ISO8601 date-times in strings, and you’ll need to use <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">readr::parse_date()</a></code> or <code><a href="https://readr.tidyverse.org/reference/parse_datetime.html">readr::parse_datetime()</a></code> to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply <code><a href="https://readr.tidyverse.org/reference/parse_atomic.html">readr::parse_double()</a></code> as needed to get the correct variable type.</p>
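<p>A minimal sketch of those conversions, assuming the values have already been pulled out of the JSON as strings. The variable names and values here are made up for illustration:</p>

```r
library(readr)

# Illustrative values as they might arrive from JSON: everything is a string
created_at <- "2015-09-18T14:11:24Z"
released_on <- "2015-09-18"
price <- "19.99"

parse_datetime(created_at)  # ISO8601 date-time string -> POSIXct
parse_date(released_on)     # ISO8601 date string -> Date
parse_double(price)         # number stored as a string -> double
```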
</section>
<section id="exercises-2" data-type="sect2">
<section id="rectangling-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -1110,7 +1091,7 @@ df_row <- tibble(json = json_row)</pre>
</ol></section>
</section>
<section id="summary" data-type="sect1">
<section id="rectangling-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: <code><a href="https://tidyr.tidyverse.org/reference/unnest_longer.html">unnest_longer()</a></code> to put list elements into rows and <code><a href="https://tidyr.tidyverse.org/reference/unnest_wider.html">unnest_wider()</a></code> to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.</p>
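<p>As a compact recap, the sketch below runs the whole workflow on a tiny hand-written JSON string. The JSON content and the names <code>df_json</code> and <code>rect</code> are hypothetical and not taken from the chapter’s datasets:</p>

```r
library(tidyr)
library(tibble)
library(jsonlite)

# Hypothetical JSON: an array of objects, each holding an array of tags
json <- '[
  {"name": "a", "tags": ["x", "y"]},
  {"name": "b", "tags": ["z"]}
]'

# parse_json() returns nested R lists; store them in a list-column
df_json <- tibble(json = parse_json(json))

rect <- df_json |>
  unnest_wider(json) |>   # named elements -> columns (name, tags)
  unnest_longer(tags)     # unnamed elements -> rows
rect
```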