Fix code language

This commit is contained in:
Hadley Wickham
2022-11-18 11:26:25 -06:00
parent 69b4597f3b
commit 868a35ca71
29 changed files with 912 additions and 907 deletions

@@ -11,7 +11,7 @@ Introduction</h1>
Prerequisites</h2>
<p>Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the <strong>forcats</strong> package, which is part of the core tidyverse. It provides tools for dealing with <strong>cat</strong>egorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
</div>
</section>
</section>
@@ -21,32 +21,32 @@ Prerequisites</h2>
Factor basics</h1>
<p>Imagine that you have a variable that records month:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x1 &lt;- c("Dec", "Apr", "Jan", "Mar")</pre>
<pre data-type="programlisting" data-code-language="r">x1 &lt;- c("Dec", "Apr", "Jan", "Mar")</pre>
</div>
<p>Using a string to record this variable has two problems:</p>
<ol type="1"><li>
<p>There are only twelve possible months, and there’s nothing saving you from typos:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">x2 &lt;- c("Dec", "Apr", "Jam", "Mar")</pre>
<pre data-type="programlisting" data-code-language="r">x2 &lt;- c("Dec", "Apr", "Jam", "Mar")</pre>
</div>
</li>
<li>
<p>It doesn’t sort in a useful way:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">sort(x1)
<pre data-type="programlisting" data-code-language="r">sort(x1)
#&gt; [1] "Apr" "Dec" "Jan" "Mar"</pre>
</div>
</li>
</ol><p>You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid <strong>levels</strong>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">month_levels &lt;- c(
<pre data-type="programlisting" data-code-language="r">month_levels &lt;- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)</pre>
</div>
<p>Now you can create a factor:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y1 &lt;- factor(x1, levels = month_levels)
<pre data-type="programlisting" data-code-language="r">y1 &lt;- factor(x1, levels = month_levels)
y1
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
@@ -57,27 +57,27 @@ sort(y1)
</div>
<p>And any values not in the levels will be silently converted to NA:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2 &lt;- factor(x2, levels = month_levels)
<pre data-type="programlisting" data-code-language="r">y2 &lt;- factor(x2, levels = month_levels)
y2
#&gt; [1] Dec Apr &lt;NA&gt; Mar
#&gt; Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec</pre>
</div>
<p>This seems risky, so you might want to use <code><a href="https://forcats.tidyverse.org/reference/fct.html">fct()</a></code> instead:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">y2 &lt;- fct(x2, levels = month_levels)
<pre data-type="programlisting" data-code-language="r">y2 &lt;- fct(x2, levels = month_levels)
#&gt; Error in `fct()`:
#&gt; ! All values of `x` must appear in `levels` or `na`
#&gt; Missing level: "Jam"</pre>
</div>
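<p>A base R check can also surface such values before conversion. The following is a minimal sketch, not part of the original chapter, that uses <code>setdiff()</code> to list the values that <code>factor()</code> would turn into NA; the name <code>unmatched</code> is chosen purely for illustration:</p>

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values that do not appear among the valid levels (hypothetical helper name)
unmatched <- setdiff(x2, month_levels)
unmatched

# factor() converts exactly those values to NA
y2 <- factor(x2, levels = month_levels)
sum(is.na(y2))
```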
<p>If you omit the levels, they’ll be taken from the data in alphabetical order:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">factor(x1)
<pre data-type="programlisting" data-code-language="r">factor(x1)
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Apr Dec Jan Mar</pre>
</div>
<p>Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to <code>unique(x)</code>, or after the fact, with <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_inorder()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">f1 &lt;- factor(x1, levels = unique(x1))
<pre data-type="programlisting" data-code-language="r">f1 &lt;- factor(x1, levels = unique(x1))
f1
#&gt; [1] Dec Apr Jan Mar
#&gt; Levels: Dec Apr Jan Mar
@@ -89,12 +89,12 @@ f2
</div>
<p>If you ever need to access the set of valid levels directly, you can do so with <code><a href="https://rdrr.io/r/base/levels.html">levels()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">levels(f2)
<pre data-type="programlisting" data-code-language="r">levels(f2)
#&gt; [1] "Dec" "Apr" "Jan" "Mar"</pre>
</div>
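<p><code>levels()</code> is also a convenient way to confirm that the two approaches to first-appearance ordering agree. A small sketch, assuming the forcats package is installed:</p>

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# Approach 1: set the levels when creating the factor
f1 <- factor(x1, levels = unique(x1))

# Approach 2: reorder an existing factor by first appearance
f2 <- fct_inorder(factor(x1))

# Both factors should carry the same levels in the same order
identical(levels(f1), levels(f2))
```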
<p>You can also create a factor while reading your data with readr, using <code><a href="https://readr.tidyverse.org/reference/parse_factor.html">col_factor()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">csv &lt;- "
<pre data-type="programlisting" data-code-language="r">csv &lt;- "
month,value
Jan,12
Feb,56
@@ -112,7 +112,7 @@ df$month
General Social Survey</h1>
<p>For the rest of this chapter, we’re going to use <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">forcats::gss_cat</a></code>. It’s a sample of data from the <a href="https://gss.norc.org">General Social Survey</a>, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in <code>gss_cat</code> Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat
<pre data-type="programlisting" data-code-language="r">gss_cat
#&gt; # A tibble: 21,483 × 9
#&gt; year marital age race rincome partyid relig denom tvhours
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
@@ -127,7 +127,7 @@ General Social Survey</h1>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
count(race)
#&gt; # A tibble: 3 × 2
#&gt; race n
@@ -138,7 +138,7 @@ General Social Survey</h1>
</div>
<p>Or with a bar chart:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(gss_cat, aes(race)) +
<pre data-type="programlisting" data-code-language="r">ggplot(gss_cat, aes(race)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A bar chart showing the distribution of race. There are ~2000 records with race &quot;Other&quot;, 3000 with race &quot;Black&quot;, and over 15,000 with race &quot;White&quot;." width="576"/></p>
@@ -160,7 +160,7 @@ Exercise</h2>
Modifying factor order</h1>
<p>It’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">relig_summary &lt;- gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">relig_summary &lt;- gss_cat |&gt;
group_by(relig) |&gt;
summarise(
age = mean(age, na.rm = TRUE),
@@ -181,7 +181,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
<code>x</code>, a numeric vector that you want to use to reorder the levels.</li>
<li>Optionally, <code>fun</code>, a function that’s used if there are multiple values of <code>x</code> for each value of <code>f</code>. The default value is <code>median</code>.</li>
</ul><div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
@@ -190,7 +190,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism &amp; Other Eastern religions watch much less.</p>
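<p>To see what the summary function does when a level has several values, here is a small sketch on made-up data (the vectors <code>f</code> and <code>x</code> are illustrative, not from <code>gss_cat</code>). In forcats itself the arguments are spelled <code>.f</code>, <code>.x</code>, and <code>.fun</code>:</p>

```r
library(forcats)

# Toy data: three levels, two values each
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 100, 3, 4)

# Default summary is the median: a = 6, b = 50.5, c = 3.5
by_median <- fct_reorder(f, x)
levels(by_median)

# With .fun = min the summaries are a = 5, b = 1, c = 3
by_min <- fct_reorder(f, x, .fun = min)
levels(by_min)
```

<p>The outlier in level <code>b</code> pushes it to the end under the median but to the front under the minimum, so the choice of summary function can matter.</p>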
<p>As you start making more complicated transformations, we recommend moving them out of <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> and into a separate <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> step. For example, you could rewrite the plot above as:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">relig_summary |&gt;
<pre data-type="programlisting" data-code-language="r">relig_summary |&gt;
mutate(
relig = fct_reorder(relig, tvhours)
) |&gt;
@@ -199,7 +199,7 @@ ggplot(relig_summary, aes(tvhours, relig)) +
</div>
<p>What if we create a similar plot looking at how average age varies across reported income level?</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">rincome_summary &lt;- gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">rincome_summary &lt;- gss_cat |&gt;
group_by(rincome) |&gt;
summarise(
age = mean(age, na.rm = TRUE),
@@ -216,7 +216,7 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
<p>However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use <code><a href="https://forcats.tidyverse.org/reference/fct_relevel.html">fct_relevel()</a></code>. It takes a factor, <code>f</code>, and then any number of levels that you want to move to the front of the line.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
@@ -225,7 +225,7 @@ ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
<p>Why do you think the average age for “Not applicable” is so high?</p>
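<p>The effect of <code>fct_relevel()</code> on the levels themselves can be checked on a small made-up factor (the <code>income</code> vector below is illustrative only). The sketch also uses the <code>after</code> argument, which places the named levels after a given position instead of at the front:</p>

```r
library(forcats)

income <- factor(
  c("Lt $1000", "$25000 or more", "Not applicable"),
  levels = c("$25000 or more", "Lt $1000", "Not applicable")
)

# Move "Not applicable" to the front
to_front <- fct_relevel(income, "Not applicable")
levels(to_front)

# Move "$25000 or more" to the end instead
to_end <- fct_relevel(income, "$25000 or more", after = Inf)
levels(to_end)
```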
<p>Another type of reordering is useful when you are coloring the lines on a plot. <code>fct_reorder2(f, x, y)</code> reorders the factor <code>f</code> by the <code>y</code> values associated with the largest <code>x</code> values. This makes the plot easier to read because the colors of the lines at the far right of the plot will line up with the legend.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">#|
<pre data-type="programlisting" data-code-language="r">#|
#| Rearranging the legend makes the plot easier to read because the
#| legend colours now match the order of the lines on the far right
#| of the plot. You can see some unsurprising patterns: the proportion
@@ -259,7 +259,7 @@ ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
</div>
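<p>The ordering rule behind <code>fct_reorder2()</code> is easier to follow on a tiny made-up example (the vectors below are illustrative). Each level is summarised by the <code>y</code> value at its largest <code>x</code>, and the levels are then sorted in decreasing order of that value:</p>

```r
library(forcats)

f <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(10, 1, 5, 8)

# At the largest x (x = 2), y is 1 for "a" and 8 for "b",
# so "b" comes first in the reordered levels
reordered <- fct_reorder2(f, x, y)
levels(reordered)
```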
<p>Finally, for bar plots, you can use <code><a href="https://forcats.tidyverse.org/reference/fct_inorder.html">fct_infreq()</a></code> to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with <code><a href="https://forcats.tidyverse.org/reference/fct_rev.html">fct_rev()</a></code> if you want them in increasing frequency so that in the bar plot the largest values are on the right, not the left.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
mutate(marital = marital |&gt; fct_infreq() |&gt; fct_rev()) |&gt;
ggplot(aes(marital)) +
geom_bar()</pre>
@@ -282,7 +282,7 @@ Exercises</h2>
Modifying factor levels</h1>
<p>More powerful than changing the order of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. It allows you to recode, or change, the value of each level. For example, take <code>gss_cat$partyid</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt; count(partyid)
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt; count(partyid)
#&gt; # A tibble: 10 × 2
#&gt; partyid n
#&gt; &lt;fct&gt; &lt;int&gt;
@@ -296,7 +296,7 @@ Modifying factor levels</h1>
</div>
<p>The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
mutate(
partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
@@ -322,7 +322,7 @@ Modifying factor levels</h1>
<p><code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code> will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.</p>
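<p>A small sketch on a made-up factor shows both behaviours on the levels themselves: the renamed levels keep their positions, and the unmentioned level is untouched. The <code>party</code> vector is illustrative and only borrows a few level names from <code>partyid</code>:</p>

```r
library(forcats)

party <- factor(
  c("Strong republican", "Ind,near rep", "Independent", "Strong republican"),
  levels = c("Strong republican", "Ind,near rep", "Independent")
)

# New value on the left, old value on the right
recoded <- fct_recode(party,
  "Republican, strong"    = "Strong republican",
  "Independent, near rep" = "Ind,near rep"
)

# "Independent" was not mentioned, so it is left as is
levels(recoded)
```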
<p>To combine groups, you can assign multiple old levels to the same new level:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
mutate(
partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
@@ -351,7 +351,7 @@ Modifying factor levels</h1>
<p>Use this technique with care: if you group together categories that are truly different you will end up with misleading results.</p>
<p>If you want to collapse a lot of levels, <code><a href="https://forcats.tidyverse.org/reference/fct_collapse.html">fct_collapse()</a></code> is a useful variant of <code><a href="https://forcats.tidyverse.org/reference/fct_recode.html">fct_recode()</a></code>. For each new level, you can provide a vector of old levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
mutate(
partyid = fct_collapse(partyid,
"other" = c("No answer", "Don't know", "Other party"),
@@ -371,7 +371,7 @@ Modifying factor levels</h1>
</div>
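<p>The same idea on a small made-up factor makes the bookkeeping easy to verify by hand: the counts of the collapsed levels are the sums of the counts of the old levels. The <code>party</code> vector is illustrative only:</p>

```r
library(forcats)

party <- factor(c(
  "Strong republican", "Not str republican", "Independent",
  "Strong democrat", "Not str democrat", "Strong democrat"
))

# Each new level takes a vector of old levels
collapsed <- fct_collapse(party,
  "rep" = c("Strong republican", "Not str republican"),
  "dem" = c("Not str democrat", "Strong democrat")
)

# Two "rep", three "dem", and one untouched "Independent"
table(collapsed)
```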
<p>Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the <code>fct_lump_*()</code> family of functions. <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_lowfreq()</a></code> is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
mutate(relig = fct_lump_lowfreq(relig)) |&gt;
count(relig)
#&gt; # A tibble: 2 × 2
@@ -382,7 +382,7 @@ Modifying factor levels</h1>
</div>
<p>In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more details! Instead, we can use <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_n()</a></code> to specify that we want exactly 10 groups:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">gss_cat |&gt;
mutate(relig = fct_lump_n(relig, n = 10)) |&gt;
count(relig, sort = TRUE) |&gt;
print(n = Inf)
@@ -416,7 +416,7 @@ Exercises</h2>
Ordered factors</h1>
<p>Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with <code><a href="https://rdrr.io/r/base/factor.html">ordered()</a></code>, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use <code>&lt;</code> between the factor levels:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ordered(c("a", "b", "c"))
<pre data-type="programlisting" data-code-language="r">ordered(c("a", "b", "c"))
#&gt; [1] a b c
#&gt; Levels: a &lt; b &lt; c</pre>
</div>
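<p>Because the levels are ordered, comparison operators and functions such as <code>max()</code> work on ordered factors. A minimal base R sketch with made-up data (the <code>sizes</code> vector is illustrative), supplying the levels explicitly so the order is not alphabetical:</p>

```r
sizes <- ordered(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large")
)

# Element-wise comparison against a level
smaller_than_large <- sizes < "large"
smaller_than_large

# The largest level present in the data
largest <- as.character(max(sizes))
largest
```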