Fix code language
@@ -15,7 +15,7 @@ Introduction</h1>
|
||||
Prerequisites</h2>
|
||||
<p>In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)</pre>
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidyverse)</pre>
|
||||
</div>
|
||||
</section>
|
||||
</section>
|
||||
@@ -52,7 +52,7 @@ Variation</h1>
|
||||
Visualizing distributions</h2>
|
||||
<p>How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is <strong>categorical</strong> if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, you can use a bar chart:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (approximately 12500), Premium (approximately 14000), and Ideal (approximately 21500)." width="576"/></p>
|
||||
@@ -60,7 +60,7 @@ Visualizing distributions</h2>
|
||||
</div>
|
||||
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut)
|
||||
#> # A tibble: 5 × 2
|
||||
#> cut n
|
||||
@@ -73,7 +73,7 @@ Visualizing distributions</h2>
|
||||
</div>
|
||||
<p>A variable is <strong>continuous</strong> if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, you can use a histogram:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and far fewer, approximately 5000 diamonds, in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
|
||||
@@ -81,7 +81,7 @@ Visualizing distributions</h2>
|
||||
</div>
|
||||
<p>You can compute this by hand by combining <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut_width(carat, 0.5))
|
||||
#> # A tibble: 11 × 2
|
||||
#> `cut_width(carat, 0.5)` n
|
||||
@@ -97,7 +97,7 @@ Visualizing distributions</h2>
|
||||
<p>A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. Note that even though it’s not possible to have a <code>carat</code> value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0. This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25. The tallest bar shows that almost 30,000 observations have a <code>carat</code> value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5.</p>
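<p>As a rough illustration of how those bin edges arise, the sketch below applies <code>cut_width()</code> to three made-up carat values (hypothetical, not taken from <code>diamonds</code>). With a width of 0.5 and the default alignment, the bins are centered on multiples of 0.5, so the first bin should start below zero even though every value is positive:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)

# Three hypothetical carat values, chosen only for illustration
carats <- c(0.2, 0.5, 1.0)

# Assign each value to a bin of width 0.5 (default alignment)
bins <- cut_width(carats, width = 0.5)

# Show the bin each value landed in, then all bin labels
data.frame(carat = carats, bin = bins)
levels(bins)</pre>
</div>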
|
||||
<p>You can set the width of the intervals in a histogram with the <code>binwidth</code> argument, which is measured in the units of the <code>x</code> variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">smaller <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">smaller <- diamonds |>
|
||||
filter(carat < 3)
|
||||
|
||||
ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
@@ -108,7 +108,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
</div>
|
||||
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than overlapping bars.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
|
||||
geom_freqpoly(binwidth = 0.1, linewidth = 0.75)</pre>
|
||||
@@ -132,7 +132,7 @@ Typical values</h2>
|
||||
<ul><li><p>Why are there more diamonds at whole carats and common fractions of carats?</p></li>
|
||||
<li><p>Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?</p></li>
|
||||
</ul><div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat)) +
|
||||
geom_histogram(binwidth = 0.01)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-9-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak." width="576"/></p>
|
||||
@@ -145,7 +145,7 @@ Typical values</h2>
|
||||
<li><p>Why might the appearance of clusters be misleading?</p></li>
|
||||
</ul><p>The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
geom_histogram(binwidth = 0.25)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-10-1.png" class="img-fluid" alt="A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5." width="576"/></p>
|
||||
@@ -159,7 +159,7 @@ Typical values</h2>
|
||||
Unusual values</h2>
|
||||
<p>Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the <code>y</code> variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
geom_histogram(binwidth = 0.5)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
|
||||
@@ -167,7 +167,7 @@ Unusual values</h2>
|
||||
</div>
|
||||
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = y)) +
|
||||
geom_histogram(binwidth = 0.5) +
|
||||
coord_cartesian(ylim = c(0, 50))</pre>
|
||||
<div class="cell-output-display">
|
||||
@@ -177,7 +177,7 @@ Unusual values</h2>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> also has an <code>xlim</code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits before any statistics, such as the bin counts, are computed.</p>
|
||||
<p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">unusual <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">unusual <- diamonds |>
|
||||
filter(y < 3 | y > 20) |>
|
||||
select(price, x, y, z) |>
|
||||
arrange(y)
|
||||
@@ -216,7 +216,7 @@ Missing values</h1>
|
||||
<ol type="1"><li>
|
||||
<p>Drop the entire row with the strange values:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds2 <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds2 <- diamonds |>
|
||||
filter(between(y, 3, 20))</pre>
|
||||
</div>
|
||||
<p>We don’t recommend this option: one invalid measurement doesn’t mean that all the measurements in that row are invalid. Additionally, if you have low-quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!</p>
|
||||
@@ -224,14 +224,14 @@ Missing values</h1>
|
||||
<li>
|
||||
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to replace the variable with a modified copy. You can use the <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> function to replace unusual values with <code>NA</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds2 <- diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds2 <- diamonds |>
|
||||
mutate(y = if_else(y < 3 | y > 20, NA, y))</pre>
|
||||
</div>
|
||||
</li>
|
||||
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three arguments. The first argument <code>test</code> should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another.</p>
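<p>A minimal sketch of the difference, using a small made-up tibble rather than <code>diamonds</code>. The names <code>sizes</code>, <code>y_clean</code>, and <code>y_status</code> are hypothetical, and the <code>.default</code> argument assumes dplyr 1.1.0 or later:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)

# Hypothetical measurements: two plausible values and two unusual ones
sizes <- tibble(y = c(0, 4.2, 31.8, 5.7))

labelled <- sizes |>
  mutate(
    # Two-way choice: replace unusual values with NA, keep the rest
    y_clean = if_else(y < 3 | y > 20, NA, y),
    # Multi-way choice: conditions are checked in order, first match wins
    y_status = case_when(
      y < 3  ~ "too small",
      y > 20 ~ "too large",
      .default = "plausible"
    )
  )

labelled</pre>
</div>
<p>Writing the three-way label with <code>if_else()</code> alone would require nesting one call inside another, which is the situation where <code>case_when()</code> tends to read more clearly.</p>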
|
||||
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
geom_point()
|
||||
#> Warning: Removed 9 rows containing missing values (`geom_point()`).</pre>
|
||||
<div class="cell-output-display">
|
||||
@@ -240,12 +240,12 @@ Missing values</h1>
|
||||
</div>
|
||||
<p>To suppress that warning, set <code>na.rm = TRUE</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
|
||||
geom_point(na.rm = TRUE)</pre>
|
||||
</div>
|
||||
<p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">nycflights13::flights |>
|
||||
<pre data-type="programlisting" data-code-language="r">nycflights13::flights |>
|
||||
mutate(
|
||||
cancelled = is.na(dep_time),
|
||||
sched_hour = sched_dep_time %/% 100,
|
||||
@@ -278,7 +278,7 @@ Covariation</h1>
|
||||
A categorical and continuous variable</h2>
|
||||
<p>It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = price)) +
|
||||
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-21-1.png" class="img-fluid" alt="A frequency polygon of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500." width="576"/></p>
|
||||
@@ -286,7 +286,7 @@ A categorical and continuous variable</h2>
|
||||
</div>
|
||||
<p>It’s hard to see the difference in distribution because the overall counts differ so much:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-22-1.png" class="img-fluid" alt="Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal." width="576"/></p>
|
||||
@@ -294,7 +294,7 @@ A categorical and continuous variable</h2>
|
||||
</div>
|
||||
<p>To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the <strong>density</strong>, which is the count standardized so that the area under each frequency polygon is one.</p>
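<p>To make that standardization concrete, here is a sketch that computes a histogram-style density by hand with dplyr, assuming the same binwidth of 500 used in this section. The names <code>density_by_cut</code> and <code>areas</code> are ours, and this approximates the idea rather than reproducing ggplot2’s internal calculation: within each cut, the count in a bin is divided by the group’s total count times the binwidth, so the areas in each group sum to one.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(ggplot2)
library(dplyr)

binwidth <- 500

# Count diamonds per cut and price bin, then standardize within each cut
density_by_cut <- diamonds |>
  count(cut, bin = cut_width(price, binwidth)) |>
  group_by(cut) |>
  mutate(density = n / (sum(n) * binwidth)) |>
  ungroup()

# Area per cut = sum of (bin height * bin width); each should be 1
areas <- density_by_cut |>
  group_by(cut) |>
  summarize(area = sum(density * binwidth))

areas</pre>
</div>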
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
|
||||
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
|
||||
@@ -313,7 +313,7 @@ A categorical and continuous variable</h2>
|
||||
</div>
|
||||
<p>Let’s take a look at the distribution of price by cut using <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-25-1.png" class="img-fluid" alt="Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest." width="576"/></p>
|
||||
@@ -323,7 +323,7 @@ A categorical and continuous variable</h2>
|
||||
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://rdrr.io/r/stats/reorder.factor.html">reorder()</a></code> function.</p>
|
||||
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-26-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv)." width="576"/></p>
|
||||
@@ -331,7 +331,7 @@ A categorical and continuous variable</h2>
|
||||
</div>
|
||||
<p>To make the trend easier to see, we can reorder <code>class</code> based on the median value of <code>hwy</code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg,
|
||||
mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
@@ -340,7 +340,7 @@ A categorical and continuous variable</h2>
|
||||
</div>
|
||||
<p>If you have long variable names, <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = mpg,
|
||||
mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
@@ -365,7 +365,7 @@ Exercises</h3>
|
||||
Two categorical variables</h2>
|
||||
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
|
||||
geom_count()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-29-1.png" class="img-fluid" alt="A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000." width="576"/></p>
|
||||
@@ -374,7 +374,7 @@ Two categorical variables</h2>
|
||||
<p>The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.</p>
|
||||
<p>A more commonly used way of representing the covariation between two categorical variables is a segmented bar chart. To create one, we map the variable that defines the primary groups to the <code>x</code> aesthetic and the variable that subdivides each group to the <code>fill</code> aesthetic.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
geom_bar()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-30-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The number of diamonds for each level of cut increases from Fair to Ideal and the heights of the segments within each bar represent the number of diamonds that fall within each color/cut combination. There appear to be some of each color of diamonds within each level of cut of diamonds." width="576"/></p>
|
||||
@@ -382,7 +382,7 @@ Two categorical variables</h2>
|
||||
</div>
|
||||
<p>However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
|
||||
geom_bar(position = "fill")</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-31-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds, segmented by color. The heights of each of the bars representing each cut of diamond are the same, 1. The heights of the segments within each bar represent the proportion of diamonds that fall within each color/cut combination. The proportions don't appear to be very different across the levels of cut." width="576"/></p>
|
||||
@@ -390,7 +390,7 @@ Two categorical variables</h2>
|
||||
</div>
|
||||
<p>Another approach for exploring the relationship between these variables is computing the counts with dplyr:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(color, cut)
|
||||
#> # A tibble: 35 × 3
|
||||
#> color cut n
|
||||
@@ -405,7 +405,7 @@ Two categorical variables</h2>
|
||||
</div>
|
||||
<p>Then visualize with <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> and the fill aesthetic:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(color, cut) |>
|
||||
ggplot(mapping = aes(x = color, y = cut)) +
|
||||
geom_tile(mapping = aes(fill = n))</pre>
|
||||
@@ -430,7 +430,7 @@ Exercises</h3>
|
||||
Two continuous variables</h2>
|
||||
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-34-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential." width="576"/></p>
|
||||
@@ -438,7 +438,7 @@ Two continuous variables</h2>
|
||||
</div>
|
||||
<p>Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the <code>alpha</code> aesthetic to add transparency.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
|
||||
geom_point(alpha = 1 / 100)</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
|
||||
@@ -447,7 +447,7 @@ Two continuous variables</h2>
|
||||
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> to bin in two dimensions.</p>
|
||||
<p><code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> creates rectangular bins. <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code>.</p>
|
||||
<div>
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_bin2d()
|
||||
|
||||
# install.packages("hexbin")
|
||||
@@ -459,7 +459,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
</div>
|
||||
<p>Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin <code>carat</code> and then for each group, display a boxplot:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-37-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
@@ -468,7 +468,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p>
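<p>As a minimal sketch of the <code>varwidth = TRUE</code> option described above, the following redraws the previous plot with box widths scaled by the number of observations in each bin. It assumes that <code>smaller</code> is the diamonds data restricted to stones under 3 carats, as defined earlier in the chapter; the definition is repeated here only so the example runs on its own.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(tidyverse)

# Assumption: `smaller` matches the subset used earlier in the chapter
smaller <- diamonds |>
  filter(carat < 3)

# Same binned boxplot as before, but wider boxes mean more diamonds in the bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(
    mapping = aes(group = cut_width(carat, 0.1)),
    varwidth = TRUE
  )</pre>
</div>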
|
||||
<p>Another approach is to display approximately the same number of points in each bin. That’s the job of <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
|
||||
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-38-1.png" class="img-fluid" alt="Side-by-side box plots of price by carat. Carat is divided into 20 bins, each containing approximately the same number of diamonds. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end." width="576"/></p>
|
||||
@@ -485,7 +485,7 @@ Exercises</h3>
|
||||
<li>
|
||||
<p>Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of <code>x</code> and <code>y</code> values, which makes the points outliers even though their <code>x</code> and <code>y</code> values appear normal when examined separately.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
|
||||
geom_point() +
|
||||
coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))</pre>
|
||||
<div class="cell-output-display">
|
||||
@@ -509,7 +509,7 @@ Patterns and models</h1>
|
||||
<li><p>Does the relationship change if you look at individual subgroups of the data?</p></li>
|
||||
</ul><p>A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
|
||||
geom_point()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-40-1.png" class="img-fluid" alt="A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times." width="576"/></p>
|
||||
@@ -518,7 +518,7 @@ Patterns and models</h1>
|
||||
<p>Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.</p>
|
||||
<p>Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price, are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts <code>price</code> from <code>carat</code> and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of <code>price</code> and <code>carat</code>, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back on the scale of raw prices.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">library(tidymodels)
|
||||
<pre data-type="programlisting" data-code-language="r">library(tidymodels)
|
||||
|
||||
diamonds <- diamonds |>
|
||||
mutate(
|
||||
@@ -540,7 +540,7 @@ ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
|
||||
</div>
|
||||
<p>Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) +
|
||||
geom_boxplot()</pre>
|
||||
<div class="cell-output-display">
|
||||
<p><img src="EDA_files/figure-html/unnamed-chunk-42-1.png" class="img-fluid" alt="Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end." width="576"/></p>
|
||||
@@ -554,18 +554,18 @@ ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) +
|
||||
ggplot2 calls</h1>
|
||||
<p>As we move on from these introductory chapters, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(data = faithful, mapping = aes(x = eruptions)) +
|
||||
geom_freqpoly(binwidth = 0.25)</pre>
|
||||
</div>
|
||||
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
|
||||
<p>Rewriting the previous plot more concisely yields:</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">ggplot(faithful, aes(eruptions)) +
|
||||
<pre data-type="programlisting" data-code-language="r">ggplot(faithful, aes(eruptions)) +
|
||||
geom_freqpoly(binwidth = 0.25)</pre>
|
||||
</div>
|
||||
<p>Sometimes we’ll turn the end of a pipeline of data transformation into a plot. Watch for the transition from <code>|></code> to <code>+</code>. We wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.</p>
|
||||
<div class="cell">
|
||||
<pre data-type="programlisting" data-code-language="downlit">diamonds |>
|
||||
<pre data-type="programlisting" data-code-language="r">diamonds |>
|
||||
count(cut, clarity) |>
|
||||
ggplot(aes(clarity, cut, fill = n)) +
|
||||
geom_tile()</pre>
|
||||
|
||||