Don't transform non-crossref links

Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions


@@ -29,7 +29,7 @@ Prerequisites</h2>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
</div>
<p>From this chapter on, we’ll suppress the loading message from <code><a href="#chp-https://tidyverse.tidyverse" data-type="xref">#chp-https://tidyverse.tidyverse</a></code>.</p>
<p>From this chapter on, we’ll suppress the loading message from <code><a href="https://tidyverse.tidyverse.org">library(tidyverse)</a></code>.</p>
</section>
</section>
@@ -164,7 +164,7 @@ Pivoting</h1>
<ol type="1"><li><p>Data is often organised to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.</p></li>
<li><p>Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.</p></li>
</ol><p>This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll <strong>pivot</strong> your data into a tidy form, with variables in the columns and observations in the rows.</p>
<p>tidyr provides two functions for pivoting data: <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>, which makes datasets <strong>longer</strong> by increasing rows and reducing columns, and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> which makes datasets <strong>wider</strong> by increasing columns and reducing rows. The following sections work through the use of <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> to tackle a wide range of realistic datasets. These examples are drawn from <code><a href="#chp-https://tidyr.tidyverse.org/articles/pivot" data-type="xref">#chp-https://tidyr.tidyverse.org/articles/pivot</a></code>, which you should check out if you want to see more variations and more challenging problems.</p>
<p>tidyr provides two functions for pivoting data: <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>, which makes datasets <strong>longer</strong> by increasing rows and reducing columns, and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which makes datasets <strong>wider</strong> by increasing columns and reducing rows. The following sections work through the use of <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to tackle a wide range of realistic datasets. These examples are drawn from <code><a href="https://tidyr.tidyverse.org/articles/pivot.html">vignette("pivot", package = "tidyr")</a></code>, which you should check out if you want to see more variations and more challenging problems.</p>
<p>Let’s dive in.</p>
<section id="sec-billboard" data-type="sect2">
@@ -191,9 +191,9 @@ Data in column names</h2>
#&gt; # wk43 &lt;dbl&gt;, wk44 &lt;dbl&gt;, wk45 &lt;dbl&gt;, wk46 &lt;dbl&gt;, wk47 &lt;dbl&gt;, wk48 &lt;dbl&gt;, …</pre>
</div>
<p>In this dataset, each observation is a song. The first three columns (<code>artist</code>, <code>track</code> and <code>date.entered</code>) are variables that describe the song. Then we have 76 columns (<code>wk1</code>-<code>wk76</code>) that describe the rank of the song in each week. Here, the column names are one variable (the <code>week</code>) and the cell values are another (the <code>rank</code>).</p>
<p>To tidy this data, we’ll use <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>. After the data, there are three key arguments:</p>
<p>To tidy this data, we’ll use <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>. After the data, there are three key arguments:</p>
<ul><li>
<code>cols</code> specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as <code><a href="#chp-https://dplyr.tidyverse.org/reference/select" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/select</a></code> so here we could use <code>!c(artist, track, date.entered)</code> or <code>starts_with("wk")</code>.</li>
<code>cols</code> specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as <code><a href="https://dplyr.tidyverse.org/reference/select.html">select()</a></code> so here we could use <code>!c(artist, track, date.entered)</code> or <code>starts_with("wk")</code>.</li>
<li>
<code>names_to</code> names the variable stored in the column names, here <code>"week"</code>.</li>
<li>
@@ -221,7 +221,7 @@ Data in column names</h2>
#&gt; 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
#&gt; # … with 24,082 more rows</pre>
</div>
<p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p>
<p>What happens if a song is in the top 100 for fewer than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These <code>NA</code>s don’t really represent unknown observations; they’re forced to exist by the structure of the dataset<span data-type="footnote">We’ll come back to this idea in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</span>, so we can ask <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to get rid of them by setting <code>values_drop_na = TRUE</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard |&gt;
pivot_longer(
@@ -242,7 +242,7 @@ Data in column names</h2>
#&gt; # … with 5,301 more rows</pre>
</div>
<p>You might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns <code>wk77</code>, <code>wk78</code>, … would be added to the dataset.</p>
<p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code>. <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p>
<p>This data is now tidy, but we could make future computation a bit easier by converting <code>week</code> into a number using <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">readr::parse_number()</a></code>. <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> is a handy function that will extract the first number from a string, ignoring all other text.</p>
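<p>As a quick illustration of that behaviour, here is a minimal sketch that assumes readr is installed. The input strings are made up to mirror the <code>wk</code> column names; they are not read from the dataset.</p>

```r
library(readr)

# Hypothetical week labels shaped like the billboard column names
weeks <- c("wk1", "wk10", "wk76")

# parse_number() skips the non-numeric prefix and returns a numeric vector
week_numbers <- parse_number(weeks)
week_numbers
```

<p>Because the result is numeric, the weeks sort as 1, 10, 76 rather than alphabetically as text.</p>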
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">billboard_tidy &lt;- billboard |&gt;
pivot_longer(
@@ -364,7 +364,7 @@ Many variables in column names</h2>
#&gt; # ep_m_2534 &lt;dbl&gt;, ep_m_3544 &lt;dbl&gt;, ep_m_4554 &lt;dbl&gt;, ep_m_5564 &lt;dbl&gt;, …</pre>
</div>
<p>This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: <code>country</code> and <code>year</code>. They are followed by 56 columns like <code>sp_m_014</code>, <code>ep_m_4554</code>, and <code>rel_m_3544</code>. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by <code>_</code>. The first piece, <code>sp</code>/<code>rel</code>/<code>ep</code>, describes the method used for the <code>diagnosis</code>, the second piece, <code>m</code>/<code>f</code> is the <code>gender</code>, and the third piece, <code>014</code>/<code>1524</code>/<code>2534</code>/<code>3544</code>/<code>4554</code>/<code>65</code> is the <code>age</code> range.</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
<p>So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: <code>names_to</code> gets a vector of column names and <code>names_sep</code> describes how to split the variable name up into pieces:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">who2 |&gt;
pivot_longer(
@@ -434,7 +434,7 @@ Data and variable names in the column headers</h2>
#&gt; 6 4 1 2004-10-10 Craig
#&gt; # … with 3 more rows</pre>
</div>
<p>We again use <code>values_drop_na = TRUE</code>, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and <code><a href="#chp-https://readr.tidyverse.org/reference/parse_number" data-type="xref">#chp-https://readr.tidyverse.org/reference/parse_number</a></code> to convert (e.g.) <code>child1</code> into 1.</p>
<p>We again use <code>values_drop_na = TRUE</code>, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and <code><a href="https://readr.tidyverse.org/reference/parse_number.html">parse_number()</a></code> to convert (e.g.) <code>child1</code> into 1.</p>
<p><a href="#fig-pivot-names-and-values" data-type="xref">#fig-pivot-names-and-values</a> illustrates the basic idea with a simpler example. When you use <code>".value"</code> in <code>names_to</code>, the column names in the input contribute to both values and variable names in the output.</p>
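<p>The same idea can be sketched in code. The table below is made up for illustration (the <code>families</code> name and its values are assumptions, not the chapter’s dataset); it assumes tidyr and tibble are installed.</p>

```r
library(tidyr)
library(tibble)

# Made-up data: each column name holds a variable name (dob/name)
# and a value (child1/child2), separated by "_"
families <- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~name_child1, ~name_child2,
  1,       "1998-11-26", "2000-01-29", "Susan",      "Jose",
  2,       "1996-06-22", NA,           "Mark",       NA
)

families_long <- families |>
  pivot_longer(
    cols = !family,
    # ".value" sends the first piece (dob/name) to output column names;
    # the second piece (child1/child2) becomes values of `child`
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
families_long
```

<p>Here the first piece of each column name becomes an output column (<code>dob</code>, <code>name</code>), while the second piece becomes a value in the new <code>child</code> column.</p>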
<div class="cell">
<div class="cell-output-display">
@@ -449,7 +449,7 @@ Data and variable names in the column headers</h2>
<section id="widening-data" data-type="sect2">
<h2>
Widening data</h2>
<p>So far we’ve used <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p>
<p>So far we’ve used <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.</p>
<p>We’ll start by looking at <code>cms_patient_experience</code>, a dataset from the Centers for Medicare and Medicaid Services that collects data about patient experiences:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience
@@ -464,7 +464,7 @@ Widening data</h2>
#&gt; 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM… 24
#&gt; # … with 494 more rows, and abbreviated variable name ¹prf_rate</pre>
</div>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="#chp-https://dplyr.tidyverse.org/reference/distinct" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/distinct</a></code>:</p>
<p>An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for <code>measure_cd</code> and <code>measure_title</code> by using <code><a href="https://dplyr.tidyverse.org/reference/distinct.html">distinct()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
distinct(measure_cd, measure_title)
@@ -479,7 +479,7 @@ Widening data</h2>
#&gt; 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources</pre>
</div>
<p>Neither of these columns will make particularly great variable names: <code>measure_cd</code> doesn’t hint at the meaning of the variable and <code>measure_title</code> is a long sentence containing spaces. We’ll use <code>measure_cd</code> for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.</p>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> has the opposite interface to <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> has the opposite interface to <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code>: we need to provide the existing columns that define the values (<code>values_from</code>) and the column name (<code>names_from</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
pivot_wider(
@@ -499,7 +499,7 @@ Widening data</h2>
#&gt; # ²CAHPS_GRP_1, ³CAHPS_GRP_2, ⁴CAHPS_GRP_3, ⁵CAHPS_GRP_5, ⁶CAHPS_GRP_8,
#&gt; # ⁷CAHPS_GRP_12</pre>
</div>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> will attempt to preserve all the existing columns including <code>measure_title</code> which has six distinct observations for each organisation. To fix this problem we need to tell <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<p>The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> will attempt to preserve all the existing columns including <code>measure_title</code> which has six distinct observations for each organisation. To fix this problem we need to tell <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> which columns identify each row; in this case those are the variables starting with <code>"org"</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cms_patient_experience |&gt;
pivot_wider(
@@ -525,7 +525,7 @@ Widening data</h2>
<section id="how-does-pivot_wider-work" data-type="sect2">
<h2>
How does <code>pivot_wider()</code> work?</h2>
<p>To understand how <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> works, let’s again start with a very simple dataset:</p>
<p>To understand how <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> works, let’s again start with a very simple dataset:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
~id, ~name, ~value,
@@ -549,8 +549,8 @@ How does<code>pivot_wider()</code> work?</h2>
#&gt; 1 A 1 4 5
#&gt; 2 B 3 2 NA</pre>
</div>
<p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p>
<p>To begin the process, <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> needs to first figure out what will go in the rows and columns. Finding the column names is easy: it’s just the values of <code>name</code>.</p>
<p>The connection between the position of the row in the input and the cell in the output is weaker than in <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> because the rows and columns in the output are primarily determined by the values of variables, not their locations.</p>
<p>To begin the process, <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> needs to first figure out what will go in the rows and columns. Finding the column names is easy: it’s just the values of <code>name</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
distinct(name)
@@ -572,7 +572,7 @@ How does<code>pivot_wider()</code> work?</h2>
#&gt; 1 A
#&gt; 2 B</pre>
</div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> then combines these results to generate an empty data frame:</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> then combines these results to generate an empty data frame:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df |&gt;
select(-name, -value) |&gt;
@@ -584,7 +584,7 @@ How does<code>pivot_wider()</code> work?</h2>
#&gt; 1 A NA NA NA
#&gt; 2 B NA NA NA</pre>
</div>
<p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
<p>It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can “make” missing values in <a href="#chp-missing-values" data-type="xref">#chp-missing-values</a>.</p>
<p>You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">df &lt;- tribble(
@@ -634,13 +634,13 @@ How does<code>pivot_wider()</code> work?</h2>
<section id="untidy-data" data-type="sect1">
<h1>
Untidy data</h1>
<p>While <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> is occasionally useful for making tidy data, its real strength is making <strong>untidy</strong> data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.</p>
<p>The following sections will show a few examples of <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.</p>
<p>While <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> is occasionally useful for making tidy data, its real strength is making <strong>untidy</strong> data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.</p>
<p>The following sections will show a few examples of <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.</p>
<section id="presenting-data-to-humans" data-type="sect2">
<h2>
Presenting data to humans</h2>
<p>As you’ve seen, <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p>
<p>As you’ve seen, <code><a href="https://dplyr.tidyverse.org/reference/count.html">dplyr::count()</a></code> produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(clarity, color)
@@ -655,7 +655,7 @@ Presenting data to humans</h2>
#&gt; 6 I1 I 92
#&gt; # … with 50 more rows</pre>
</div>
<p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> to create a form more suitable for display to other humans:</p>
<p>This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> to create a form more suitable for display to other humans:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(clarity, color) |&gt;
@@ -674,8 +674,8 @@ Presenting data to humans</h2>
#&gt; 6 VVS2 553 991 975 1443 608 365 131
#&gt; # … with 2 more rows</pre>
</div>
<p>This display also makes it easy to compare in two directions, horizontally and vertically, much like <code><a href="#chp-https://ggplot2.tidyverse.org/reference/facet_grid" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/facet_grid</a></code>.</p>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like <a href="#chp-https://gt.rstudio" data-type="xref">#chp-https://gt.rstudio</a>. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.</p>
<p>This display also makes it easy to compare in two directions, horizontally and vertically, much like <code><a href="https://ggplot2.tidyverse.org/reference/facet_grid.html">facet_grid()</a></code>.</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like <a href="https://gt.rstudio.com">gt</a>. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.</p>
</section>
<section id="multivariate-statistics" data-type="sect2">
@@ -705,7 +705,7 @@ col_year
#&gt; 6 Austral… 4.00 4.04 4.09 4.16 4.23 4.26 4.29 4.34 4.37 4.43
#&gt; # … with 136 more rows, and 2 more variables: `2002` &lt;dbl&gt;, `2007` &lt;dbl&gt;</pre>
</div>
<p><code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<p><code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code> produces a tibble where each row is labelled by the <code>country</code> variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a <strong>row name</strong>. We can turn the <code>country</code> variable into row names with <code>column_to_rownames()</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">col_year &lt;- col_year |&gt;
column_to_rownames("country")
@@ -727,7 +727,7 @@ head(col_year)
#&gt; Australia 4.340224 4.369675 4.431331 4.486965 4.537005</pre>
</div>
<p>This makes a data frame, because tibbles don’t support row names<span data-type="footnote">Tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.</span>.</p>
<p>We’re now ready to cluster with (e.g.) <code><a href="#chp-https://rdrr.io/r/stats/kmeans" data-type="xref">#chp-https://rdrr.io/r/stats/kmeans</a></code>:</p>
<p>We’re now ready to cluster with (e.g.) <code><a href="https://rdrr.io/r/stats/kmeans.html">kmeans()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">cluster &lt;- stats::kmeans(col_year, centers = 6)</pre>
</div>
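<p>To see why the row names matter here, consider a small base R sketch. The data frame, its values, and the names <code>col_year_demo</code> and <code>cluster_demo</code> are made up for illustration and stand in for the real <code>col_year</code>; the row names are set with base R rather than <code>column_to_rownames()</code>.</p>

```r
# Made-up wide data: one row per (hypothetical) country, one column per year
col_year_demo <- data.frame(
  country = c("A", "B", "C", "D"),
  `1952` = c(3.1, 3.2, 4.5, 4.6),
  `2007` = c(3.4, 3.5, 4.9, 5.0),
  check.names = FALSE
)

# Move the identifier into row names so only numeric columns remain
rownames(col_year_demo) <- col_year_demo$country
col_year_demo$country <- NULL

set.seed(1)  # kmeans() picks its starting centers at random
cluster_demo <- stats::kmeans(col_year_demo, centers = 2)

# The assignments come back as a vector named by the row names
cluster_demo$cluster
```

<p>Because the identifier lives in the row names, it stays out of the distance calculation but still labels each cluster assignment.</p>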
@@ -859,7 +859,7 @@ Pragmatic computation</h2>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions; the main challenge is transforming data from whatever structure you receive it in into a tidy format. To that end, you learned about <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_longer" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_longer</a></code> and <code><a href="#chp-https://tidyr.tidyverse.org/reference/pivot_wider" data-type="xref">#chp-https://tidyr.tidyverse.org/reference/pivot_wider</a></code>, which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present to humans, feed into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the <a href="#chp-https://www.jstatsoft.org/article/view/v059i10" data-type="xref">#chp-https://www.jstatsoft.org/article/view/v059i10</a> paper published in the Journal of Statistical Software.</p>
<p>In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions; the main challenge is transforming data from whatever structure you receive it in into a tidy format. To that end, you learned about <code><a href="https://tidyr.tidyverse.org/reference/pivot_longer.html">pivot_longer()</a></code> and <code><a href="https://tidyr.tidyverse.org/reference/pivot_wider.html">pivot_wider()</a></code>, which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present to humans, feed into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the <a href="https://www.jstatsoft.org/article/view/v059i10">Tidy Data</a> paper published in the Journal of Statistical Software.</p>
<p>In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code “tidy” (ha!) in order to make it easy for you and others to read and understand your code.</p>