More minor page count tweaks & fixes
And re-convert with latest htmlbook
@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-data-tidy">
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="data-tidy-introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
@@ -14,7 +14,7 @@ Introduction</h1>
<p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
<p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.</p>
<section id="prerequisites" data-type="sect2">
<section id="data-tidy-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
@@ -35,7 +35,7 @@ Tidy data</h1>
<pre data-type="programlisting" data-code-language="r">table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <int> <int> <int>
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
@@ -45,7 +45,7 @@ Tidy data</h1>
table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <int> <chr> <int>
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
@@ -56,7 +56,7 @@ table2
table3
#> # A tibble: 6 × 3
#> country year rate
#> * <chr> <int> <chr>
#> <chr> <dbl> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
@@ -68,14 +68,14 @@ table3
table4a # cases
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> <chr> <dbl> <dbl>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table4b # population
#> # A tibble: 3 × 3
#> country `1999` `2000`
#> * <chr> <int> <int>
#> <chr> <dbl> <dbl>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583</pre>
@@ -106,7 +106,7 @@ table1 |>
)
#> # A tibble: 6 × 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071 0.373
#> 2 Afghanistan 2000 2666 20595360 1.29
#> 3 Brazil 1999 37737 172006362 2.19
@@ -119,7 +119,7 @@ table1 |>
count(year, wt = cases)
#> # A tibble: 2 × 2
#> year n
#> <int> <int>
#> <dbl> <dbl>
#> 1 1999 250740
#> 2 2000 296920
@@ -133,7 +133,7 @@ ggplot(table1, aes(x = year, y = cases)) +
</div>
</div>
<section id="exercises" data-type="sect2">
<section id="data-tidy-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Using prose, describe how the variables and observations are organised in each of the sample tables.</p></li>
@@ -166,21 +166,16 @@ Data in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">billboard
#> # A tibble: 317 × 79
#> artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#> 3 3 Doors… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#> 4 3 Doors… Loser 2000-10-21 76 76 72 69 67 65 55 59
#> 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#> # … with 311 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
#> # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
#> # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
#> # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
#> # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
#> # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
#> # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …</pre>
#> artist track date.entered wk1 wk2 wk3 wk4 wk5
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87
#> 2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA
#> 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66
#> 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67
#> 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17
#> 6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26
#> # … with 311 more rows, and 71 more variables: wk6 <dbl>, wk7 <dbl>,
#> # wk8 <dbl>, wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
@@ -339,21 +334,16 @@ Many variables in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">who2
#> # A tibble: 7,240 × 58
#> country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanist… 1980 NA NA NA NA NA NA
#> 2 Afghanist… 1981 NA NA NA NA NA NA
#> 3 Afghanist… 1982 NA NA NA NA NA NA
#> 4 Afghanist… 1983 NA NA NA NA NA NA
#> 5 Afghanist… 1984 NA NA NA NA NA NA
#> 6 Afghanist… 1985 NA NA NA NA NA NA
#> # … with 7,234 more rows, and 50 more variables: sp_m_65 <dbl>,
#> # sp_f_014 <dbl>, sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>,
#> # sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>,
#> # sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>,
#> # sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>,
#> # sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>,
#> # sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, …</pre>
#> country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1980 NA NA NA NA NA
#> 2 Afghanistan 1981 NA NA NA NA NA
#> 3 Afghanistan 1982 NA NA NA NA NA
#> 4 Afghanistan 1983 NA NA NA NA NA
#> 5 Afghanistan 1984 NA NA NA NA NA
#> 6 Afghanistan 1985 NA NA NA NA NA
#> # … with 7,234 more rows, and 51 more variables: sp_m_5564 <dbl>,
#> # sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>, sp_f_2534 <dbl>, …</pre>
</div>
<p>This dataset, collected by the WHO, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>sn</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column names, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
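<p>As a minimal sketch of such a call, the following uses a small made-up table whose column names follow the same three-part pattern. The object names <code>toy</code> and <code>long</code> and all of the values are hypothetical and are not taken from <code>who2</code>; only the <code>names_to</code> and <code>names_sep</code> arguments come from the text above.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)

# Hypothetical toy data: two identifying columns plus three columns
# whose names encode diagnosis, gender, and age separated by "_".
toy <- tribble(
  ~country, ~year, ~sp_m_014, ~sp_f_014, ~ep_m_1524,
  "A",      1999,  1,         2,         3,
  "B",      2000,  4,         5,         6
)

long <- toy |>
  pivot_longer(
    cols = !c(country, year),                    # every column except the two ids
    names_to = c("diagnosis", "gender", "age"),  # one new column per name piece
    names_sep = "_",                             # split each column name on "_"
    values_to = "count"                          # cell values go here
  )

# long has 6 rows and the columns
# country, year, diagnosis, gender, age, count
long</pre>
</div>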
@@ -479,16 +469,16 @@ Widening data</h2>
values_from = prf_rate
)
#> # A tibble: 500 × 9
#> org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 0446157747 USC CARE MEDI… CAHPS for MI… 63 NA NA
#> 2 0446157747 USC CARE MEDI… CAHPS for MI… NA 87 NA
#> 3 0446157747 USC CARE MEDI… CAHPS for MI… NA NA 86
#> 4 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#> 5 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#> 6 0446157747 USC CARE MEDI… CAHPS for MI… NA NA NA
#> # … with 494 more rows, and 3 more variables: CAHPS_GRP_5 <dbl>,
#> # CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl></pre>
#> org_pac_id org_nm measure_title CAHPS_GRP_1 CAHPS_GRP_2
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… 63 NA
#> 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA 87
#> 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#> 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#> 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#> 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS… NA NA
#> # … with 494 more rows, and 4 more variables: CAHPS_GRP_3 <dbl>,
#> # CAHPS_GRP_5 <dbl>, CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl></pre>
</div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organization. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
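<p>The same effect can be reproduced on a much smaller table before applying the fix to the real data. This is only a sketch: the tibble <code>toy</code>, its column values, and the result names <code>wide_default</code> and <code>wide_ids</code> are made up for illustration, while <code>id_cols</code> is the <code>pivot_wider()</code> argument that names the identifying columns.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyr)

# Hypothetical toy data: one organization measured twice, with a
# descriptive title column that differs between the two measures.
toy <- tribble(
  ~org_id, ~org_nm, ~measure_cd, ~measure_title,   ~prf_rate,
  "1",     "Org A", "M_1",       "First measure",  63,
  "1",     "Org A", "M_2",       "Second measure", 87
)

# Default: measure_title is kept as an identifying column, so the
# organization still occupies two rows, padded with NA.
wide_default <- toy |>
  pivot_wider(names_from = measure_cd, values_from = prf_rate)

# With id_cols, only the columns starting with "org" identify a row,
# so the organization collapses to a single row.
wide_ids <- toy |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )

nrow(wide_default)  # 2
nrow(wide_ids)      # 1</pre>
</div>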
<div class="cell">
@@ -515,7 +505,7 @@ Widening data</h2>
<section id="how-does-pivot_wider-work" data-type="sect2">
<h2>
How does<code>pivot_wider()</code> work?</h2>
How does pivot_wider() work?</h2>
<p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">df <- tribble(
@@ -849,7 +839,7 @@ Pragmatic computation</h2>
</ul></section>
</section>
<section id="summary" data-type="sect1">
<section id="data-tidy-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is transforming data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present it to humans, feed it into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the <a href="https://www.jstatsoft.org/article/view/v059i10">Tidy Data</a> paper published in the Journal of Statistical Software.</p>