More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions


@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-data-visualize">
<h1><span id="sec-data-visualization" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Data visualization</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="data-visualize-introduction" data-type="sect1">
<h1>
Introduction</h1>
<blockquote class="blockquote">
@@ -9,30 +9,30 @@ Introduction</h1>
<p>R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the <strong>grammar of graphics</strong>, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.</p>
<p>This chapter will teach you how to visualize your data using <strong>ggplot2</strong>. We will start by creating a simple scatterplot and use that to introduce aesthetic mappings and geometric objects, the fundamental building blocks of ggplot2. We will then walk you through visualizing distributions of single variables as well as visualizing relationships between two or more variables. We’ll finish off with saving your plots and troubleshooting tips.</p>
<section id="prerequisites" data-type="sect2">
<section id="data-visualize-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:</p>
<p>This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)
#&gt; ── Attaching core tidyverse packages ──────────────── tidyverse 1.3.2.9000 ──
#&gt; ✔ dplyr 1.0.99.9000 ✔ readr 2.1.3
#&gt; ✔ forcats 0.5.2.9000 ✔ stringr 1.5.0.9000
#&gt; ✔ forcats 0.5.2 ✔ stringr 1.5.0
#&gt; ✔ ggplot2 3.4.0.9000 ✔ tibble 3.1.8
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.2.1.9001
#&gt; ✔ lubridate 1.9.0 ✔ tidyr 1.3.0
#&gt; ✔ purrr 1.0.1
#&gt; ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#&gt; ✖ dplyr::filter() masks stats::filter()
#&gt; ✖ dplyr::lag() masks stats::lag()
#&gt; Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors</pre>
#&gt; Use the conflicted package (&lt;http://conflicted.r-lib.org/&gt;) to force all conflicts to become errors</pre>
</div>
<p>That one line of code loads the core tidyverse; packages which you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).</p>
<p>That one line of code loads the core tidyverse; the packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded)<span data-type="footnote">You can eliminate that message and force conflict resolution to happen on demand by using the conflicted package, which becomes more important as you load more packages. You can learn more about conflicted at <a href="https://conflicted.r-lib.org" class="uri">https://conflicted.r-lib.org</a>.</span>.</p>
<p>If you run this code and get the error message <code>there is no package called 'tidyverse'</code>, you’ll need to first install it, then run <code><a href="https://rdrr.io/r/base/library.html">library()</a></code> once again.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">install.packages("tidyverse")
library(tidyverse)</pre>
</div>
<p>You only need to install a package once, but you need to reload it every time you start a new session.</p>
<p>You only need to install a package once, but you need to load it every time you start a new session.</p>
<p>In addition to tidyverse, we will also use the <strong>palmerpenguins</strong> package, which includes the <code>penguins</code> dataset containing body measurements for penguins on three islands in the Palmer Archipelago.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(palmerpenguins)</pre>
@@ -47,20 +47,21 @@ First steps</h1>
<section id="the-penguins-data-frame" data-type="sect2">
<h2>
The<code>penguins</code> data frame</h2>
The penguins data frame</h2>
<p>You can test your answer with the <code>penguins</code> <strong>data frame</strong> found in palmerpenguins (a.k.a. <code><a href="https://allisonhorst.github.io/palmerpenguins/reference/penguins.html">palmerpenguins::penguins</a></code>). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). <code>penguins</code> contains 344 observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER<span data-type="footnote">Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. <a href="https://allisonhorst.github.io/palmerpenguins/" class="uri">https://allisonhorst.github.io/palmerpenguins/</a>. doi: 10.5281/zenodo.3960218.</span>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">penguins
#&gt; # A tibble: 344 × 8
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt; &lt;int&gt;
#&gt; 1 Adelie Torgers 39.1 18.7 181 3750
#&gt; 2 Adelie Torgers 39.5 17.4 186 3800
#&gt; 3 Adelie Torgers 40.3 18 195 3250
#&gt; 4 Adelie Torgers NA NA NA NA
#&gt; 5 Adelie Torgers 36.7 19.3 193 3450
#&gt; 6 Adelie Torgers 39.3 20.6 190 3650
#&gt; # … with 338 more rows, and 2 more variables: sex &lt;fct&gt;, year &lt;int&gt;</pre>
#&gt; species island bill_length_mm bill_depth_mm flipper_length_mm
#&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;int&gt;
#&gt; 1 Adelie Torgersen 39.1 18.7 181
#&gt; 2 Adelie Torgersen 39.5 17.4 186
#&gt; 3 Adelie Torgersen 40.3 18 195
#&gt; 4 Adelie Torgersen NA NA NA
#&gt; 5 Adelie Torgersen 36.7 19.3 193
#&gt; 6 Adelie Torgersen 39.3 20.6 190
#&gt; # … with 338 more rows, and 3 more variables: body_mass_g &lt;int&gt;, sex &lt;fct&gt;,
#&gt; # year &lt;int&gt;</pre>
</div>
<p>This data frame contains 8 columns. For an alternative view, where you can see all variables and the first few observations of each variable, use <code><a href="https://pillar.r-lib.org/reference/glimpse.html">glimpse()</a></code>. Or, if you’re in RStudio, run <code>View(penguins)</code> to open an interactive data viewer.</p>
<div class="cell">
@@ -239,7 +240,7 @@ Adding aesthetics and layers</h2>
<p>We finally have a plot that perfectly matches our “ultimate goal”!</p>
</section>
<section id="exercises" data-type="sect2">
<section id="data-visualize-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>How many rows are in <code>penguins</code>? How many columns?</p></li>
@@ -410,7 +411,7 @@ ggplot(penguins, aes(x = body_mass_g)) +
</div>
</section>
<section id="exercises-1" data-type="sect2">
<section id="data-visualize-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Make a bar plot of <code>species</code> of <code>penguins</code>, where you assign <code>species</code> to the <code>y</code> aesthetic. How is this plot different?</p></li>
@@ -479,7 +480,7 @@ A numerical and a categorical variable</h2>
<li>Otherwise, we <em>set</em> the value of an aesthetic.</li>
</ul></section>
<section id="two-categorical-variables" data-type="sect2">
<section id="data-visualize-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>We can use segmented bar plots to visualize the relationship between two categorical variables. In creating this bar chart, we map the variable that defines the bars to the <code>x</code> aesthetic and the variable that subdivides each bar to the <code>fill</code> aesthetic.</p>
@@ -498,7 +499,7 @@ ggplot(penguins, aes(x = island, fill = species)) +
</div>
</section>
<section id="two-numerical-variables" data-type="sect2">
<section id="data-visualize-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>So far you’ve learned about scatterplots (created with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>) and smooth curves (created with <code><a href="https://ggplot2.tidyverse.org/reference/geom_smooth.html">geom_smooth()</a></code>) for visualizing the relationship between two numerical variables. A scatterplot is probably the most commonly used plot for visualizing the relationship between two variables.</p>
@@ -535,7 +536,7 @@ Three or more variables</h2>
<p>You will learn about many other geoms for visualizing distributions of variables and relationships between them in <a href="#chp-layers" data-type="xref">#chp-layers</a>.</p>
</section>
<section id="exercises-2" data-type="sect2">
<section id="data-visualize-exercises-2" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Which variables in <code>mpg</code> are categorical? Which variables are continuous? (Hint: type <code><a href="https://ggplot2.tidyverse.org/reference/mpg.html">?mpg</a></code> to read the documentation for the dataset). How can you see this information when you run <code>mpg</code>?</p></li>
@@ -576,7 +577,7 @@ ggsave(filename = "my-plot.png")</pre>
<p>If you don’t specify the <code>width</code> and <code>height</code> they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them. You can learn more about <code><a href="https://ggplot2.tidyverse.org/reference/ggsave.html">ggsave()</a></code> in the documentation.</p>
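<p>As a minimal sketch of setting the dimensions explicitly, you might pass <code>width</code>, <code>height</code>, and <code>units</code> to <code>ggsave()</code> along with the plot to save. The object names <code>p</code> and <code>out_file</code>, the file name <code>mpg-plot.png</code>, the temporary directory, and the 6 by 4 inch size are illustrative assumptions, not values from this chapter:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Build a plot from mpg, a dataset that ships with ggplot2
p &lt;- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output path in the session's temporary directory
out_file &lt;- file.path(tempdir(), "mpg-plot.png")

# Explicit width, height, and units make the saved size independent
# of the current plotting device
ggsave(filename = out_file, plot = p, width = 6, height = 4, units = "in")</pre>
</div>
<p>Passing <code>plot = p</code> is optional, since <code>ggsave()</code> defaults to the most recent plot, but naming the plot keeps the call from depending on what was drawn last.</p>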
<p>Generally, however, we recommend that you assemble your final reports using Quarto, a reproducible authoring system that allows you to interleave your code and your prose and automatically include your plots in your write-ups. You will learn more about Quarto in <a href="#chp-quarto" data-type="xref">#chp-quarto</a>.</p>
<section id="exercises-3" data-type="sect2">
<section id="data-visualize-exercises-3" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li>
@@ -607,7 +608,7 @@ Common problems</h1>
<p>If that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, even if the answer is in the error message, you might not yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.</p>
</section>
<section id="summary" data-type="sect1">
<section id="data-visualize-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, color, size and shape. You then learned about increasing the complexity and improving the presentation of your plots layer-by-layer. You also learned about commonly used plots for visualizing the distribution of a single variable as well as for visualizing relationships between two or more variables, by leveraging additional aesthetic mappings and/or splitting your plot into small multiples using faceting.</p>