More minor page count tweaks & fixes

And re-convert with latest htmlbook
Hadley Wickham
2023-01-26 10:36:07 -06:00
parent d9afa135fc
commit aa9d72a7c6
38 changed files with 838 additions and 1093 deletions


@@ -1,6 +1,6 @@
<section data-type="chapter" id="chp-EDA">
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1>
<section id="introduction" data-type="sect1">
<section id="EDA-introduction" data-type="sect1">
<h1>
Introduction</h1>
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
@@ -10,7 +10,7 @@ Introduction</h1>
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you'll eventually write up and communicate to others.</p>
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you'll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>
<section id="prerequisites" data-type="sect2">
<section id="EDA-prerequisites" data-type="sect2">
<h2>
Prerequisites</h2>
<p>In this chapter we'll combine what you've learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
@@ -137,7 +137,7 @@ unusual
<p>It's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
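<p>A minimal base R sketch of that with-and-without comparison. The vector <code>y</code>, its values, and the cut-offs of 3 and 20 are all made up for illustration; they are not taken from the <code>diamonds</code> data or prescribed by the chapter.</p>

```r
# Hypothetical measurements; 58.9 stands in for a data entry error
y <- c(3.9, 4.1, 4.3, 4.4, 4.6, 58.9)

# Flag values outside an assumed plausible range
is_outlier <- y < 3 | y > 20

# Run the same summary with and without the flagged values
mean_all <- mean(y)
mean_trimmed <- mean(y[!is_outlier])
median_all <- median(y)
median_trimmed <- median(y[!is_outlier])

# A large gap between the two versions means the outliers matter
# and should not be dropped without justification
print(c(mean_all = mean_all, mean_trimmed = mean_trimmed))
print(c(median_all = median_all, median_trimmed = median_trimmed))
```

<p>In this made-up case the mean shifts far more than the median, which is one reason to check more than one summary before deciding how much an outlier affects your results.</p>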
</section>
<section id="exercises" data-type="sect2">
<section id="EDA-exercises" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
@@ -198,7 +198,7 @@ Unusual values</h1>
</div>
<p>However, this plot isn't great because there are many more non-cancelled flights than cancelled flights. In the next section we'll explore some techniques for improving this comparison.</p>
<section id="exercises-1" data-type="sect2">
<section id="EDA-exercises-1" data-type="sect2">
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
@@ -217,9 +217,7 @@ A categorical and a numerical variable</h2>
<p>For example, let's explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
#&gt; Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#&gt; Please use `linewidth` instead.</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
</div>
@@ -235,7 +233,7 @@ A categorical and a numerical variable</h2>
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we'll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
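<p>To make the standardization concrete, here is a minimal base R sketch that computes a density by hand. The <code>prices</code> vector and the bin edges are hypothetical, and the binning uses <code>cut()</code> rather than ggplot2's own bin placement, so the numbers will not match the plot exactly.</p>

```r
# Hypothetical prices, made up for illustration (not the diamonds data)
prices <- c(320, 450, 480, 900, 1200, 1250, 1400, 2100, 2600, 3900)

binwidth <- 500
breaks <- seq(0, 4000, by = binwidth)

# Count the observations that fall in each bin
counts <- as.vector(table(cut(prices, breaks)))

# Density: each count divided by (total count * bin width)
density <- counts / (sum(counts) * binwidth)

# The bar areas (density * bin width) add up to one
total_area <- sum(density * binwidth)
print(total_area)
```

<p>Because every group is rescaled by its own total, groups with very different numbers of observations end up on a comparable scale.</p>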
<div class="cell">
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)</pre>
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
<div class="cell-output-display">
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
@@ -279,7 +277,7 @@ A categorical and a numerical variable</h2>
</div>
</div>
<section id="exercises-2" data-type="sect3">
<section id="EDA-exercises-2" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Use what you've learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
@@ -291,7 +289,7 @@ Exercises</h3>
</ol></section>
</section>
<section id="two-categorical-variables" data-type="sect2">
<section id="EDA-two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you'll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
@@ -330,7 +328,7 @@ Two categorical variables</h2>
</div>
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>
<section id="exercises-3" data-type="sect3">
<section id="EDA-exercises-3" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
@@ -340,7 +338,7 @@ Exercises</h3>
</ol></section>
</section>
<section id="two-numerical-variables" data-type="sect2">
<section id="EDA-two-numerical-variables" data-type="sect2">
<h2>
Two numerical variables</h2>
<p>You've already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
@@ -390,7 +388,7 @@ ggplot(smaller, aes(x = carat, y = price)) +
</div>
</div>
<section id="exercises-4" data-type="sect3">
<section id="EDA-exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
@@ -464,7 +462,7 @@ ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
<p>We're not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
</section>
<section id="summary" data-type="sect1">
<section id="EDA-summary" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter you've learned a variety of tools to help you understand the variation within your data. You've seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they're the foundation upon which all other techniques are built.</p>