More work on O'Reilly book
* Make width narrower
* Convert deps to table
* Strip chapter status
@@ -1,13 +1,5 @@
<section data-type="chapter" id="chp-data-tidy">
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><div data-type="note"><div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"/>
</div>
</div>
<p>You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p></div>
<h1><span id="sec-data-tidy" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data tidying</span></span></h1><p>::: status callout-note You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>. :::</p>
<section id="introduction" data-type="sect1">
<h1>
Introduction</h1>
@@ -174,21 +166,21 @@ Data in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard
#> # A tibble: 317 × 79
#> artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA NA
#> 3 3 Door… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51
#> 4 3 Door… Loser 2000-10-21 76 76 72 69 67 65 55 59 62
#> 5 504 Bo… Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3
#> # … with 311 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#> # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#> # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#> # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#> # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#> # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#> # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …</pre>
#> artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
#> 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
#> 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
#> 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
#> 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
#> # … with 311 more rows, 68 more variables: wk9 <dbl>, wk10 <dbl>,
#> # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
#> # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
#> # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
#> # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
#> # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
#> # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
@@ -347,21 +339,21 @@ Many variables in column names</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">who2
#> # A tibble: 7,240 × 58
#> country year sp_m_…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_…⁶ sp_m_65 sp_f_…⁷
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghani… 1980 NA NA NA NA NA NA NA NA
#> 2 Afghani… 1981 NA NA NA NA NA NA NA NA
#> 3 Afghani… 1982 NA NA NA NA NA NA NA NA
#> 4 Afghani… 1983 NA NA NA NA NA NA NA NA
#> 5 Afghani… 1984 NA NA NA NA NA NA NA NA
#> 6 Afghani… 1985 NA NA NA NA NA NA NA NA
#> # … with 7,234 more rows, 48 more variables: sp_f_1524 <dbl>, sp_f_2534 <dbl>,
#> # sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>,
#> # sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>,
#> # sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>,
#> # sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>,
#> # sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>,
#> # ep_m_2534 <dbl>, ep_m_3544 <dbl>, ep_m_4554 <dbl>, ep_m_5564 <dbl>, …</pre>
#> country year sp_m_014 sp_m_1…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_65
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1980 NA NA NA NA NA NA NA
#> 2 Afghanistan 1981 NA NA NA NA NA NA NA
#> 3 Afghanistan 1982 NA NA NA NA NA NA NA
#> 4 Afghanistan 1983 NA NA NA NA NA NA NA
#> 5 Afghanistan 1984 NA NA NA NA NA NA NA
#> 6 Afghanistan 1985 NA NA NA NA NA NA NA
#> # … with 7,234 more rows, 49 more variables: sp_f_014 <dbl>,
#> # sp_f_1524 <dbl>, sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>,
#> # sp_f_5564 <dbl>, sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>,
#> # sn_m_2534 <dbl>, sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>,
#> # sn_m_65 <dbl>, sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>,
#> # sn_f_3544 <dbl>, sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>,
#> # ep_m_014 <dbl>, ep_m_1524 <dbl>, ep_m_2534 <dbl>, ep_m_3544 <dbl>, …</pre>
</div>
<p>This dataset, collected by the WHO, records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>sn</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>; the second piece, <code>m</code>/<code>f</code>, is the <code>gender</code>; and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>5564</code>/<code>65</code>, is the <code>age</code> range.</p>
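<p>As a quick check of this naming pattern, the following base R sketch splits a few of the column names quoted above on <code>_</code>. It is only an illustration: the object names <code>cols</code> and <code>pieces</code> are made up, and the labels <code>diagnosis</code>, <code>gender</code>, and <code>age</code> are simply the names used in this paragraph.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit"># Three of the column names mentioned in the text (not the full set of 56).
cols <- c("sp_m_014", "ep_m_4554", "rel_m_3544")

# Split each name on the literal "_" and stack the pieces into a character
# matrix with one row per column name.
pieces <- do.call(rbind, strsplit(cols, "_", fixed = TRUE))

# Label the three pieces with the variable names used in the text.
colnames(pieces) <- c("diagnosis", "gender", "age")
pieces</pre>
</div>
<p>Every name splits into exactly three pieces, which suggests that a single separator is enough to pull the variables apart.</p>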
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the original column name into pieces:</p>
@@ -454,14 +446,14 @@ Widening data</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience
#> # A tibble: 500 × 5
#> org_pac_id org_nm measure_cd measure_title prf_r…¹
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS SSM… 63
#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS SSM… 87
#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS SSM… 86
#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS SSM… 57
#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS SSM… 85
#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM… 24
#> org_pac_id org_nm measure_cd measure_title prf_r…¹
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS … 63
#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS … 87
#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS … 86
#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS … 57
#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS … 85
#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS … 24
#> # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
</div>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
@@ -469,13 +461,13 @@ Widening data</h2>
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |>
distinct(measure_cd, measure_title)
#> # A tibble: 6 × 2
#> measure_cd measure_title
#> <chr> <chr>
#> 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor…
#> 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#> 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#> 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#> 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#> measure_cd measure_title
#> <chr> <chr>
#> 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
#> 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#> 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#> 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#> 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre>
</div>
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
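<p>One way to create such names is a lookup vector keyed by <code>measure_cd</code>. The sketch below uses a small hand-built table, <code>cms_small</code>, that copies a few values from the output above; the short names in <code>short_names</code> and the new <code>measure</code> column are hypothetical choices, not names used elsewhere in this chapter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(dplyr)

# Hypothetical miniature of cms_patient_experience: two organisations and
# two measures, with values copied from the output shown in this chapter.
cms_small <- tibble(
  org_pac_id = c("0446157747", "0446157747", "0446162697", "0446162697"),
  measure_cd = c("CAHPS_GRP_1", "CAHPS_GRP_2", "CAHPS_GRP_1", "CAHPS_GRP_2"),
  prf_rate = c(63, 87, 59, 85)
)

# Hypothetical short names, keyed by measure code.
short_names <- c(CAHPS_GRP_1 = "timely_care", CAHPS_GRP_2 = "communication")

# Look up the short name for each row; unname() drops the lookup keys.
cms_named <- cms_small |>
  mutate(measure = unname(short_names[measure_cd]))
cms_named</pre>
</div>
<p>Keeping the mapping in a single named vector makes it easy to review and extend as more measures are added.</p>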
@@ -487,14 +479,14 @@ Widening data</h2>
values_from = prf_rate
)
#> # A tibble: 500 × 9
#> org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0446157747 USC CARE M… CAHPS … 63 NA NA NA NA NA
#> 2 0446157747 USC CARE M… CAHPS … NA 87 NA NA NA NA
#> 3 0446157747 USC CARE M… CAHPS … NA NA 86 NA NA NA
#> 4 0446157747 USC CARE M… CAHPS … NA NA NA 57 NA NA
#> 5 0446157747 USC CARE M… CAHPS … NA NA NA NA 85 NA
#> 6 0446157747 USC CARE M… CAHPS … NA NA NA NA NA 24
#> org_pac_id org_nm measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0446157747 USC CAR… CAHPS … 63 NA NA NA NA NA
#> 2 0446157747 USC CAR… CAHPS … NA 87 NA NA NA NA
#> 3 0446157747 USC CAR… CAHPS … NA NA 86 NA NA NA
#> 4 0446157747 USC CAR… CAHPS … NA NA NA 57 NA NA
#> 5 0446157747 USC CAR… CAHPS … NA NA NA NA 85 NA
#> 6 0446157747 USC CAR… CAHPS … NA NA NA NA NA 24
#> # … with 494 more rows, and abbreviated variable names ¹measure_title,
#> # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
#> # ⁷CAHPS_GRP_12</pre>
@@ -508,14 +500,14 @@ Widening data</h2>
values_from = prf_rate
)
#> # A tibble: 95 × 8
#> org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0446157747 USC CARE MEDICAL G… 63 87 86 57 85 24
#> 2 0446162697 ASSOCIATION OF UNI… 59 85 83 63 88 22
#> 3 0547164295 BEAVER MEDICAL GRO… 49 NA 75 44 73 12
#> 4 0749333730 CAPE PHYSICIANS AS… 67 84 85 65 82 24
#> 5 0840104360 ALLIANCE PHYSICIAN… 66 87 87 64 87 28
#> 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
#> org_pac_id org_nm CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0446157747 USC CARE MEDICA… 63 87 86 57 85 24
#> 2 0446162697 ASSOCIATION OF … 59 85 83 63 88 22
#> 3 0547164295 BEAVER MEDICAL … 49 NA 75 44 73 12
#> 4 0749333730 CAPE PHYSICIANS… 67 84 85 65 82 24
#> 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64 87 28
#> 6 0840109864 REX HOSPITAL INC 73 87 84 67 91 30
#> # … with 89 more rows, and abbreviated variable names ¹CAHPS_GRP_1,
#> # ²CAHPS_GRP_2, ³CAHPS_GRP_3, ⁴CAHPS_GRP_5, ⁵CAHPS_GRP_8, ⁶CAHPS_GRP_12</pre>
</div>
@@ -602,7 +594,8 @@ How does <code>pivot_wider()</code> work?</h2>
names_from = name,
values_from = value
)
#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
#> Warning: Values from `value` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
@@ -695,15 +688,16 @@ col_year <- gapminder |>
)
col_year
#> # A tibble: 142 × 13
#> country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghani… 2.89 2.91 2.93 2.92 2.87 2.90 2.99 2.93 2.81 2.80
#> 2 Albania 3.20 3.29 3.36 3.44 3.52 3.55 3.56 3.57 3.40 3.50
#> 3 Algeria 3.39 3.48 3.41 3.51 3.62 3.69 3.76 3.75 3.70 3.68
#> 4 Angola 3.55 3.58 3.63 3.74 3.74 3.48 3.44 3.39 3.42 3.36
#> 5 Argenti… 3.77 3.84 3.85 3.91 3.98 4.00 3.95 3.96 3.97 4.04
#> 6 Austral… 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37 4.43
#> # … with 136 more rows, and 2 more variables: `2002` <dbl>, `2007` <dbl></pre>
#> country `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 2.89 2.91 2.93 2.92 2.87 2.90 2.99 2.93 2.81
#> 2 Albania 3.20 3.29 3.36 3.44 3.52 3.55 3.56 3.57 3.40
#> 3 Algeria 3.39 3.48 3.41 3.51 3.62 3.69 3.76 3.75 3.70
#> 4 Angola 3.55 3.58 3.63 3.74 3.74 3.48 3.44 3.39 3.42
#> 5 Argentina 3.77 3.84 3.85 3.91 3.98 4.00 3.95 3.96 3.97
#> 6 Australia 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37
#> # … with 136 more rows, and 3 more variables: `1997` <dbl>, `2002` <dbl>,
#> # `2007` <dbl></pre>
</div>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<div class="cell">