More minor page count tweaks & fixes

And re-convert with latest htmlbook
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions
--- a/oreilly/data-tidy.html
+++ b/oreilly/data-tidy.html
@@ -1,6 +1,6 @@
 <section data-type="chapter" id="chp-data-tidy">
 <h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1>
-<section id="introduction" data-type="sect1">
+<section id="data-tidy-introduction" data-type="sect1">
 <h1>
 Introduction</h1>
 <blockquote class="blockquote">
@@ -14,7 +14,7 @@ Introduction</h1>
 <p>In this chapter, you will learn a consistent way to organize your data in R using a system called <strong>tidy data</strong>. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.</p>
 <p>In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the primary tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish with a discussion of usefully untidy data and how you can create it if needed.</p>

-<section id="prerequisites" data-type="sect2">
+<section id="data-tidy-prerequisites" data-type="sect2">
 <h2>
 Prerequisites</h2>
 <p>In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.</p>
@@ -35,7 +35,7 @@ Tidy data</h1>
 <pre data-type="programlisting" data-code-language="r">table1
 #&gt; # A tibble: 6 × 4
 #&gt;   country      year  cases population
-#&gt;   &lt;chr&gt;       &lt;int&gt;  &lt;int&gt;      &lt;int&gt;
+#&gt;   &lt;chr&gt;       &lt;dbl&gt;  &lt;dbl&gt;      &lt;dbl&gt;
 #&gt; 1 Afghanistan  1999    745   19987071
 #&gt; 2 Afghanistan  2000   2666   20595360
 #&gt; 3 Brazil       1999  37737  172006362
@@ -45,7 +45,7 @@ Tidy data</h1>
 table2
 #&gt; # A tibble: 12 × 4
 #&gt;   country      year type           count
-#&gt;   &lt;chr&gt;       &lt;int&gt; &lt;chr&gt;          &lt;int&gt;
+#&gt;   &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;          &lt;dbl&gt;
 #&gt; 1 Afghanistan  1999 cases            745
 #&gt; 2 Afghanistan  1999 population  19987071
 #&gt; 3 Afghanistan  2000 cases           2666
@@ -56,7 +56,7 @@ table2
 table3
 #&gt; # A tibble: 6 × 3
 #&gt;   country      year rate             
-#&gt; * &lt;chr&gt;       &lt;int&gt; &lt;chr&gt;            
+#&gt;   &lt;chr&gt;       &lt;dbl&gt; &lt;chr&gt;            
 #&gt; 1 Afghanistan  1999 745/19987071     
 #&gt; 2 Afghanistan  2000 2666/20595360    
 #&gt; 3 Brazil       1999 37737/172006362  
@@ -68,14 +68,14 @@ table3
 table4a # cases
 #&gt; # A tibble: 3 × 3
 #&gt;   country     `1999` `2000`
-#&gt; * &lt;chr&gt;        &lt;int&gt;  &lt;int&gt;
+#&gt;   &lt;chr&gt;        &lt;dbl&gt;  &lt;dbl&gt;
 #&gt; 1 Afghanistan    745   2666
 #&gt; 2 Brazil       37737  80488
 #&gt; 3 China       212258 213766
 table4b # population
 #&gt; # A tibble: 3 × 3
 #&gt;   country         `1999`     `2000`
-#&gt; * &lt;chr&gt;            &lt;int&gt;      &lt;int&gt;
+#&gt;   &lt;chr&gt;            &lt;dbl&gt;      &lt;dbl&gt;
 #&gt; 1 Afghanistan   19987071   20595360
 #&gt; 2 Brazil       172006362  174504898
 #&gt; 3 China       1272915272 1280428583</pre>
@@ -106,7 +106,7 @@ table1 |&gt;
  )
 #&gt; # A tibble: 6 × 5
 #&gt;   country      year  cases population  rate
-#&gt;   &lt;chr&gt;       &lt;int&gt;  &lt;int&gt;      &lt;int&gt; &lt;dbl&gt;
+#&gt;   &lt;chr&gt;       &lt;dbl&gt;  &lt;dbl&gt;      &lt;dbl&gt; &lt;dbl&gt;
 #&gt; 1 Afghanistan  1999    745   19987071 0.373
 #&gt; 2 Afghanistan  2000   2666   20595360 1.29 
 #&gt; 3 Brazil       1999  37737  172006362 2.19 
@@ -119,7 +119,7 @@ table1 |&gt;
  count(year, wt = cases)
 #&gt; # A tibble: 2 × 2
 #&gt;    year      n
-#&gt;   &lt;int&gt;  &lt;int&gt;
+#&gt;   &lt;dbl&gt;  &lt;dbl&gt;
 #&gt; 1  1999 250740
 #&gt; 2  2000 296920

@@ -133,7 +133,7 @@ ggplot(table1, aes(x = year, y = cases)) +
 </div>
 </div>

-<section id="exercises" data-type="sect2">
+<section id="data-tidy-exercises" data-type="sect2">
 <h2>
 Exercises</h2>
 <ol type="1"><li><p>Using prose, describe how the variables and observations are organised in each of the sample tables.</p></li>
@@ -166,21 +166,16 @@ Data in column names</h2>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">billboard
 #&gt; # A tibble: 317 × 79
-#&gt;   artist   track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
-#&gt;   &lt;chr&gt;    &lt;chr&gt; &lt;date&gt;       &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
-#&gt; 1 2 Pac    Baby… 2000-02-26      87    82    72    77    87    94    99    NA
-#&gt; 2 2Ge+her  The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
-#&gt; 3 3 Doors… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
-#&gt; 4 3 Doors… Loser 2000-10-21      76    76    72    69    67    65    55    59
-#&gt; 5 504 Boyz Wobb… 2000-04-15      57    34    25    17    17    31    36    49
-#&gt; 6 98^0     Give… 2000-08-19      51    39    34    26    26    19     2     2
-#&gt; # … with 311 more rows, and 68 more variables: wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;,
-#&gt; #   wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, wk14 &lt;dbl&gt;, wk15 &lt;dbl&gt;, wk16 &lt;dbl&gt;,
-#&gt; #   wk17 &lt;dbl&gt;, wk18 &lt;dbl&gt;, wk19 &lt;dbl&gt;, wk20 &lt;dbl&gt;, wk21 &lt;dbl&gt;, wk22 &lt;dbl&gt;,
-#&gt; #   wk23 &lt;dbl&gt;, wk24 &lt;dbl&gt;, wk25 &lt;dbl&gt;, wk26 &lt;dbl&gt;, wk27 &lt;dbl&gt;, wk28 &lt;dbl&gt;,
-#&gt; #   wk29 &lt;dbl&gt;, wk30 &lt;dbl&gt;, wk31 &lt;dbl&gt;, wk32 &lt;dbl&gt;, wk33 &lt;dbl&gt;, wk34 &lt;dbl&gt;,
-#&gt; #   wk35 &lt;dbl&gt;, wk36 &lt;dbl&gt;, wk37 &lt;dbl&gt;, wk38 &lt;dbl&gt;, wk39 &lt;dbl&gt;, wk40 &lt;dbl&gt;,
-#&gt; #   wk41 &lt;dbl&gt;, wk42 &lt;dbl&gt;, wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, …</pre>
+#&gt;   artist       track               date.entered   wk1   wk2   wk3   wk4   wk5
+#&gt;   &lt;chr&gt;        &lt;chr&gt;               &lt;date&gt;       &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
+#&gt; 1 2 Pac        Baby Don't Cry (Ke… 2000-02-26      87    82    72    77    87
+#&gt; 2 2Ge+her      The Hardest Part O… 2000-09-02      91    87    92    NA    NA
+#&gt; 3 3 Doors Down Kryptonite          2000-04-08      81    70    68    67    66
+#&gt; 4 3 Doors Down Loser               2000-10-21      76    76    72    69    67
+#&gt; 5 504 Boyz     Wobble Wobble       2000-04-15      57    34    25    17    17
+#&gt; 6 98^0         Give Me Just One N… 2000-08-19      51    39    34    26    26
+#&gt; # … with 311 more rows, and 71 more variables: wk6 &lt;dbl&gt;, wk7 &lt;dbl&gt;,
+#&gt; #   wk8 &lt;dbl&gt;, wk9 &lt;dbl&gt;, wk10 &lt;dbl&gt;, wk11 &lt;dbl&gt;, wk12 &lt;dbl&gt;, wk13 &lt;dbl&gt;, …</pre>
 </div>
 <p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
 <p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
@@ -339,21 +334,16 @@ Many variables in column names</h2>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">who2
 #&gt; # A tibble: 7,240 × 58
-#&gt;   country     year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
-#&gt;   &lt;chr&gt;      &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;
-#&gt; 1 Afghanist…  1980       NA        NA        NA        NA        NA        NA
-#&gt; 2 Afghanist…  1981       NA        NA        NA        NA        NA        NA
-#&gt; 3 Afghanist…  1982       NA        NA        NA        NA        NA        NA
-#&gt; 4 Afghanist…  1983       NA        NA        NA        NA        NA        NA
-#&gt; 5 Afghanist…  1984       NA        NA        NA        NA        NA        NA
-#&gt; 6 Afghanist…  1985       NA        NA        NA        NA        NA        NA
-#&gt; # … with 7,234 more rows, and 50 more variables: sp_m_65 &lt;dbl&gt;,
-#&gt; #   sp_f_014 &lt;dbl&gt;, sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, sp_f_3544 &lt;dbl&gt;,
-#&gt; #   sp_f_4554 &lt;dbl&gt;, sp_f_5564 &lt;dbl&gt;, sp_f_65 &lt;dbl&gt;, sn_m_014 &lt;dbl&gt;,
-#&gt; #   sn_m_1524 &lt;dbl&gt;, sn_m_2534 &lt;dbl&gt;, sn_m_3544 &lt;dbl&gt;, sn_m_4554 &lt;dbl&gt;,
-#&gt; #   sn_m_5564 &lt;dbl&gt;, sn_m_65 &lt;dbl&gt;, sn_f_014 &lt;dbl&gt;, sn_f_1524 &lt;dbl&gt;,
-#&gt; #   sn_f_2534 &lt;dbl&gt;, sn_f_3544 &lt;dbl&gt;, sn_f_4554 &lt;dbl&gt;, sn_f_5564 &lt;dbl&gt;,
-#&gt; #   sn_f_65 &lt;dbl&gt;, ep_m_014 &lt;dbl&gt;, ep_m_1524 &lt;dbl&gt;, ep_m_2534 &lt;dbl&gt;, …</pre>
+#&gt;   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
+#&gt;   &lt;chr&gt;       &lt;dbl&gt;    &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;     &lt;dbl&gt;
+#&gt; 1 Afghanistan  1980       NA        NA        NA        NA        NA
+#&gt; 2 Afghanistan  1981       NA        NA        NA        NA        NA
+#&gt; 3 Afghanistan  1982       NA        NA        NA        NA        NA
+#&gt; 4 Afghanistan  1983       NA        NA        NA        NA        NA
+#&gt; 5 Afghanistan  1984       NA        NA        NA        NA        NA
+#&gt; 6 Afghanistan  1985       NA        NA        NA        NA        NA
+#&gt; # … with 7,234 more rows, and 51 more variables: sp_m_5564 &lt;dbl&gt;,
+#&gt; #   sp_m_65 &lt;dbl&gt;, sp_f_014 &lt;dbl&gt;, sp_f_1524 &lt;dbl&gt;, sp_f_2534 &lt;dbl&gt;, …</pre>
 </div>
 <p>This dataset, collected by the WHO, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>sn</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
 <p>So in this case we have six variables: two variables are already columns, three variables are contained in the column names, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the original column name into pieces:</p>
@@ -479,16 +469,16 @@ Widening data</h2>
    values_from = prf_rate
  )
 #&gt; # A tibble: 500 × 9
-#&gt;   org_pac_id org_nm         measure_title CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3
-#&gt;   &lt;chr&gt;      &lt;chr&gt;          &lt;chr&gt;               &lt;dbl&gt;       &lt;dbl&gt;       &lt;dbl&gt;
-#&gt; 1 0446157747 USC CARE MEDI… CAHPS for MI…          63          NA          NA
-#&gt; 2 0446157747 USC CARE MEDI… CAHPS for MI…          NA          87          NA
-#&gt; 3 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          86
-#&gt; 4 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          NA
-#&gt; 5 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          NA
-#&gt; 6 0446157747 USC CARE MEDI… CAHPS for MI…          NA          NA          NA
-#&gt; # … with 494 more rows, and 3 more variables: CAHPS_GRP_5 &lt;dbl&gt;,
-#&gt; #   CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;</pre>
+#&gt;   org_pac_id org_nm                   measure_title   CAHPS_GRP_1 CAHPS_GRP_2
+#&gt;   &lt;chr&gt;      &lt;chr&gt;                    &lt;chr&gt;                 &lt;dbl&gt;       &lt;dbl&gt;
+#&gt; 1 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          63          NA
+#&gt; 2 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          87
+#&gt; 3 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#&gt; 4 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#&gt; 5 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#&gt; 6 0446157747 USC CARE MEDICAL GROUP … CAHPS for MIPS…          NA          NA
+#&gt; # … with 494 more rows, and 4 more variables: CAHPS_GRP_3 &lt;dbl&gt;,
+#&gt; #   CAHPS_GRP_5 &lt;dbl&gt;, CAHPS_GRP_8 &lt;dbl&gt;, CAHPS_GRP_12 &lt;dbl&gt;</pre>
 </div>
 <p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns, including <code>measure_title</code>, which has six distinct values for each organization. To fix this problem, we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
 <div class="cell">
@@ -515,7 +505,7 @@ Widening data</h2>

 <section id="how-does-pivot_wider-work" data-type="sect2">
 <h2>
-How does<code>pivot_wider()</code> work?</h2>
+How does pivot_wider() work?</h2>
 <p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
 <div class="cell">
 <pre data-type="programlisting" data-code-language="r">df &lt;- tribble(
@@ -849,7 +839,7 @@ Pragmatic computation</h2>
 </ul></section>
 </section>

-<section id="summary" data-type="sect1">
+<section id="data-tidy-summary" data-type="sect1">
 <h1>
 Summary</h1>
 <p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions; the main challenge is transforming the data from whatever structure you receive it in to a tidy format. To that end, you learned about <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present to humans, feed into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter, you can learn more about the history and theoretical underpinnings in the <a href="https://www.jstatsoft.org/article/view/v059i10">Tidy Data</a> paper published in the Journal of Statistical Software.</p>