More minor page count tweaks & fixes
And re-convert with latest htmlbook
@@ -1,6 +1,6 @@
|
||||
<section data-type="chapter" id="chp-EDA">
|
||||
<h1><span id="sec-exploratory-data-analysis" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Exploratory data analysis</span></span></h1>
|
||||
<section id="introduction" data-type="sect1">
|
||||
<section id="EDA-introduction" data-type="sect1">
|
||||
<h1>
|
||||
Introduction</h1>
|
||||
<p>This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:</p>
|
||||
@@ -10,7 +10,7 @@ Introduction</h1>
|
||||
</ol><p>EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.</p>
|
||||
<p>EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.</p>
|
||||
|
||||
<section id="prerequisites" data-type="sect2">
|
||||
<section id="EDA-prerequisites" data-type="sect2">
|
||||
<h2>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
|
||||
@@ -137,7 +137,7 @@ unusual
|
||||
<p>It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.</p>
|
||||
</section>
|
||||
|
||||
<section id="exercises" data-type="sect2">
|
||||
<section id="EDA-exercises" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
|
||||
@@ -198,7 +198,7 @@ Unusual values</h1>
|
||||
</div>
|
||||
<p>However, this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.</p>
|
||||
|
||||
<section id="exercises-1" data-type="sect2">
|
||||
<section id="EDA-exercises-1" data-type="sect2">
|
||||
<h2>
|
||||
Exercises</h2>
|
||||
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
|
||||
@@ -217,9 +217,7 @@ A categorical and a numerical variable</h2>
|
||||
<p>For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>) using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price)) +
|
||||
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)
|
||||
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
|
||||
#> ℹ Please use `linewidth` instead.</pre>
|
||||
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-16-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
|
||||
</div>
|
||||
@@ -235,7 +233,7 @@ A categorical and a numerical variable</h2>
|
||||
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(diamonds, aes(x = price, y = after_stat(density))) +
|
||||
geom_freqpoly(aes(color = cut), binwidth = 500, size = 0.75)</pre>
|
||||
geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-18-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
|
||||
</div>
|
||||
@@ -279,7 +277,7 @@ A categorical and a numerical variable</h2>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section id="exercises-2" data-type="sect3">
|
||||
<section id="EDA-exercises-2" data-type="sect3">
|
||||
<h3>
|
||||
Exercises</h3>
|
||||
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
|
||||
@@ -291,7 +289,7 @@ Exercises</h3>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="two-categorical-variables" data-type="sect2">
|
||||
<section id="EDA-two-categorical-variables" data-type="sect2">
|
||||
<h2>
|
||||
Two categorical variables</h2>
|
||||
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
|
||||
@@ -330,7 +328,7 @@ Two categorical variables</h2>
|
||||
</div>
|
||||
<p>If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.</p>
|
||||
|
||||
<section id="exercises-3" data-type="sect3">
|
||||
<section id="EDA-exercises-3" data-type="sect3">
|
||||
<h3>
|
||||
Exercises</h3>
|
||||
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
|
||||
@@ -340,7 +338,7 @@ Exercises</h3>
|
||||
</ol></section>
|
||||
</section>
|
||||
|
||||
<section id="two-numerical-variables" data-type="sect2">
|
||||
<section id="EDA-two-numerical-variables" data-type="sect2">
|
||||
<h2>
|
||||
Two numerical variables</h2>
|
||||
<p>You’ve already seen one great way to visualize the covariation between two numerical variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
|
||||
@@ -390,7 +388,7 @@ ggplot(smaller, aes(x = carat, y = price)) +
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<section id="exercises-4" data-type="sect3">
|
||||
<section id="EDA-exercises-4" data-type="sect3">
|
||||
<h3>
|
||||
Exercises</h3>
|
||||
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs. <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
|
||||
@@ -464,7 +462,7 @@ ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
|
||||
<p>We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.</p>
|
||||
</section>
|
||||
|
||||
<section id="summary" data-type="sect1">
|
||||
<section id="EDA-summary" data-type="sect1">
|
||||
<h1>
|
||||
Summary</h1>
|
||||
<p>In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but they’re the foundation upon which all other techniques are built.</p>
|
||||
|
||||