More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions

@@ -1,12 +1,12 @@
<section data-type="chapter" id="chp-factors">
<h1><span id="sec-factors" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Factors</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="factors-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.</p>
<p>We’ll start by motivating why factors are needed for data analysis and how you can create them with <code><a href="https://rdrr.io/r/base/factor.html">factor()</a></code>. We’ll then introduce you to the <code>gss_cat</code> dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.</p>
<section id="prerequisites" data-type="sect2">
<section id="factors-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the <strong>forcats</strong> package, which is part of the core tidyverse. It provides tools for dealing with <strong>cat</strong>egorical variables (and it’s an anagram of factors!) using a wide range of helpers for working with factors.</p>
@@ -114,15 +114,16 @@ General Social Survey</h1>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">gss_cat
#&gt; # A tibble: 21,483 × 9
#&gt; year marital age race rincome partyid relig denom tvhours
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt; &lt;int&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,nea… Prot… Sout… 12
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str… Prot… Bapt… NA
#&gt; 3 2000 Widowed 67 White Not applicable Indepen… Prot… No d… 2
#&gt; 4 2000 Never married 39 White Not applicable Ind,nea… Orth… Not … 4
#&gt; 5 2000 Divorced 25 White Not applicable Not str… None Not … 1
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong … Prot… Sout… NA
#&gt; # … with 21,477 more rows</pre>
#&gt; year marital age race rincome partyid
#&gt; &lt;int&gt; &lt;fct&gt; &lt;int&gt; &lt;fct&gt; &lt;fct&gt; &lt;fct&gt;
#&gt; 1 2000 Never married 26 White $8000 to 9999 Ind,near rep
#&gt; 2 2000 Divorced 48 White $8000 to 9999 Not str republican
#&gt; 3 2000 Widowed 67 White Not applicable Independent
#&gt; 4 2000 Never married 39 White Not applicable Ind,near rep
#&gt; 5 2000 Divorced 25 White Not applicable Not str democrat
#&gt; 6 2000 Married 25 White $20000 - 24999 Strong democrat
#&gt; # … with 21,477 more rows, and 3 more variables: relig &lt;fct&gt;, denom &lt;fct&gt;,
#&gt; # tvhours &lt;int&gt;</pre>
</div>
<p>(Remember, since this dataset is provided by a package, you can get more information about the variables with <code><a href="https://forcats.tidyverse.org/reference/gss_cat.html">?gss_cat</a></code>.)</p>
<p>When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
@@ -136,14 +137,6 @@ General Social Survey</h1>
#&gt; 2 Black 3129
#&gt; 3 White 16395</pre>
</div>
<p>Or with a bar chart:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(gss_cat, aes(x = race)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A bar chart showing the distribution of race. There are ~2000 records with race &quot;Other&quot;, 3000 with race &quot;Black&quot;, and other 15,000 with race &quot;White&quot;." width="576"/></p>
</div>
</div>
<p>When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.</p>
<section id="exercise" data-type="sect2">
@@ -171,7 +164,7 @@ Modifying factor order</h1>
ggplot(relig_summary, aes(x = tvhours, y = relig)) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern." width="576"/></p>
</div>
</div>
<p>It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of <code>relig</code> using <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code>. <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> takes three arguments:</p>
@@ -184,7 +177,7 @@ ggplot(relig_summary, aes(x = tvhours, y = relig)) +
<pre data-type="programlisting" data-code-language="r">ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-17-1.png" class="img-fluid" alt="The same scatterplot as above, but now the religion is displayed in increasing order of tvhours. &quot;Other eastern&quot; has the fewest tvhours under 2, and &quot;Don't know&quot; has the highest (over 5)." width="576"/></p>
</div>
</div>
<p>Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism &amp; Other Eastern religions watch much less.</p>
@@ -210,7 +203,7 @@ ggplot(relig_summary, aes(x = tvhours, y = relig)) +
ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-19-1.png" class="img-fluid" alt="A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then &lt;$1000, then $8000-9999." width="576"/></p>
</div>
</div>
<p>Here, arbitrarily reordering the levels isn’t a good idea! That’s because <code>rincome</code> already has a principled order that we shouldn’t mess with. Reserve <code><a href="https://forcats.tidyverse.org/reference/fct_reorder.html">fct_reorder()</a></code> for factors whose levels are arbitrarily ordered.</p>
@@ -219,20 +212,13 @@ ggplot(rincome_summary, aes(x = age, y = fct_reorder(rincome, age))) +
<pre data-type="programlisting" data-code-language="r">ggplot(rincome_summary, aes(x = age, y = fct_relevel(rincome, "Not applicable"))) +
geom_point()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-20-1.png" class="img-fluid" alt="The same scatterplot but now &quot;Not Applicable&quot; is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is &quot;Not applicable&quot;." width="576"/></p>
</div>
</div>
<p>Why do you think the average age for “Not applicable” is so high?</p>
<p>Another type of reordering is useful when you are coloring the lines on a plot. <code>fct_reorder2(f, x, y)</code> reorders the factor <code>f</code> by the <code>y</code> values associated with the largest <code>x</code> values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.</p>
<div>
<pre data-type="programlisting" data-code-language="r">#|
#| Rearranging the legend makes the plot easier to read because the
#| legend colors now match the order of the lines on the far right
#| of the plot. You can see some unsurprising patterns: the proportion
#| never married decreases with age, married forms an upside down U
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age &lt;- gss_cat |&gt;
<pre data-type="programlisting" data-code-language="r">by_age &lt;- gss_cat |&gt;
filter(!is.na(age)) |&gt;
count(age, marital) |&gt;
group_by(age) |&gt;
@@ -249,10 +235,10 @@ ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)))
<div class="cell quarto-layout-panel">
<div class="quarto-layout-row quarto-layout-valign-top">
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot." width="384"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60." width="384"/></p>
</div>
<div class="cell-output-display quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><img src="factors_files/figure-html/unnamed-chunk-22-2.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot." width="384"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-21-2.png" class="img-fluid" alt="A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot. Rearranging the legend makes the plot easier to read because the legend colors now match the order of the lines on the far right of the plot. You can see some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60." width="384"/></p>
</div>
</div>
</div>
@@ -264,11 +250,11 @@ ggplot(by_age, aes(x = age, y = prop, color = fct_reorder2(marital, age, prop)))
ggplot(aes(x = marital)) +
geom_bar()</pre>
<div class="cell-output-display">
<p><img src="factors_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
<p><img src="factors_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000)." width="576"/></p>
</div>
</div>
<section id="exercises" data-type="sect2">
<section id="factors-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>There are some suspiciously high numbers in <code>tvhours</code>. Is the mean a good summary?</p></li>
@@ -402,7 +388,7 @@ Modifying factor levels</h1>
</div>
<p>Read the documentation to learn about <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_min()</a></code> and <code><a href="https://forcats.tidyverse.org/reference/fct_lump.html">fct_lump_prop()</a></code> which are useful in other cases.</p>
<section id="exercises-1" data-type="sect2">
<section id="factors-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?</p></li>
@@ -426,7 +412,7 @@ Ordered factors</h1>
</ul><p>Given the arguable utility of these differences, we don’t generally recommend using ordered factors.</p>
</section>
<section id="summary" data-type="sect1">
<section id="factors-summary" data-type="sect1">
<h1>
Summary</h1>
<p>This chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, I highly recommend skimming the <a href="https://forcats.tidyverse.org/reference/index.html">reference index</a> to see if there’s a canned function that can help solve your problem.</p>