Don't transform non-crossref links

Hadley Wickham
2022-11-18 10:30:32 -06:00
parent 4caea5281b
commit 78a1c12fe7
32 changed files with 693 additions and 693 deletions

@@ -66,7 +66,7 @@ Visualizing distributions</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid" alt="A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (approximately 12500), Premium (approximately 14000), and Ideal (approximately 21500)." width="576"/></p>
</div>
</div>
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code>:</p>
<p>The height of the bars displays how many observations occurred with each x value. You can compute these values manually with <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(cut)
@@ -87,7 +87,7 @@ Visualizing distributions</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail." width="576"/></p>
</div>
</div>
<p>You can compute this by hand by combining <code><a href="#chp-https://dplyr.tidyverse.org/reference/count" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/count</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>:</p>
<p>You can compute this by hand by combining <code><a href="https://dplyr.tidyverse.org/reference/count.html">count()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(cut_width(carat, 0.5))
@@ -114,7 +114,7 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
<p><img src="EDA_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid" alt="A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), resulting in many bars. The distribution is right skewed but there are lots of ups and downs in the heights of the bins, creating a jagged outline." width="576"/></p>
</div>
</div>
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> instead of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> performs the same calculation as <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>, but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than bars.</p>
<p>If you wish to overlay multiple histograms in the same plot, we recommend using <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> instead of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>. <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> performs the same calculation as <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, but displays the counts with lines instead of bars. It’s much easier to understand overlapping lines than bars.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1, size = 0.75)
@@ -173,7 +173,7 @@ Unusual values</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-11-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak." width="576"/></p>
</div>
</div>
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code>:</p>
<p>There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = y)) +
geom_histogram(binwidth = 0.5) +
@@ -182,7 +182,7 @@ Unusual values</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-12-1.png" class="img-fluid" alt="A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1." width="576"/></p>
</div>
</div>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code> also has an <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
<p><code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> also has an <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> argument for when you need to zoom into the x-axis. ggplot2 also has <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> functions that work slightly differently: they throw away the data outside the limits.</p>
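<p>As a hedged illustration of that difference, the sketch below builds the same histogram twice, once zoomed with <code>coord_cartesian()</code> and once limited with <code>ylim()</code>. The object names <code>base</code>, <code>zoomed</code>, and <code>limited</code> are ours, and the limits of 0 to 50 are an assumption chosen for illustration.</p>

```r
library(ggplot2)

# Shared starting point: a histogram of the y dimension of each diamond.
base <- ggplot(data = diamonds, mapping = aes(x = y)) +
  geom_histogram(binwidth = 0.5)

# Zoom only: every bar is still computed from all the data,
# and the view is simply cropped to counts between 0 and 50.
zoomed <- base + coord_cartesian(ylim = c(0, 50))

# Scale limits: bars taller than 50 fall outside the limits, so ggplot2
# drops them (with a warning) when the plot is drawn.
limited <- base + ylim(0, 50)
```

<p>Printing <code>zoomed</code> and <code>limited</code> side by side is one way to see which bars each approach keeps.</p>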
<p>This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">unusual &lt;- diamonds |&gt;
@@ -213,7 +213,7 @@ Exercises</h2>
<ol type="1"><li><p>Explore the distribution of each of the <code>x</code>, <code>y</code>, and <code>z</code> variables in <code>diamonds</code>. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.</p></li>
<li><p>Explore the distribution of <code>price</code>. Do you discover anything unusual or surprising? (Hint: Carefully think about the <code>binwidth</code> and make sure you try a wide range of values.)</p></li>
<li><p>How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?</p></li>
<li><p>Compare and contrast <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_cartesian</a></code> vs <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> or <code><a href="#chp-https://ggplot2.tidyverse.org/reference/lims" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/lims</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try and zoom so only half a bar shows?</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/coord_cartesian.html">coord_cartesian()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">xlim()</a></code> or <code><a href="https://ggplot2.tidyverse.org/reference/lims.html">ylim()</a></code> when zooming in on a histogram. What happens if you leave <code>binwidth</code> unset? What happens if you try and zoom so only half a bar shows?</p></li>
</ol></section>
</section>
@@ -230,13 +230,13 @@ Missing values</h1>
<p>We don’t recommend this option: just because one measurement is invalid doesn’t mean all the measurements are. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!</p>
</li>
<li>
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="#chp-https://dplyr.tidyverse.org/reference/mutate" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/mutate</a></code> to replace the variable with a modified copy. You can use the <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> function to replace unusual values with <code>NA</code>:</p>
<p>Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use <code><a href="https://dplyr.tidyverse.org/reference/mutate.html">mutate()</a></code> to replace the variable with a modified copy. You can use the <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> function to replace unusual values with <code>NA</code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds2 &lt;- diamonds |&gt;
mutate(y = if_else(y &lt; 3 | y &gt; 20, NA, y))</pre>
</div>
</li>
</ol><p><code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> has three arguments. The first argument <code>test</code> should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code>, you can use <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code>. <code><a href="#chp-https://dplyr.tidyverse.org/reference/case_when" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/case_when</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="#chp-https://dplyr.tidyverse.org/reference/if_else" data-type="xref">#chp-https://dplyr.tidyverse.org/reference/if_else</a></code> statements nested inside one another.</p>
</ol><p><code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> has three arguments. The first argument <code>test</code> should be a logical vector. The result will contain the value of the second argument, <code>yes</code>, when <code>test</code> is <code>TRUE</code>, and the value of the third argument, <code>no</code>, when it is <code>FALSE</code>. As an alternative to <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code>, you can use <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code>. <code><a href="https://dplyr.tidyverse.org/reference/case_when.html">case_when()</a></code> is particularly useful inside <code>mutate()</code> when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple <code><a href="https://dplyr.tidyverse.org/reference/if_else.html">if_else()</a></code> statements nested inside one another.</p>
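<p>To make the comparison concrete, here is a minimal sketch on a small made-up vector. The values in <code>lengths_mm</code> and the labels are hypothetical and chosen only to exercise each branch; the thresholds of 3 and 20 are the ones used in the <code>mutate()</code> call above.</p>

```r
library(dplyr)

# Hypothetical measurements, not taken from the diamonds dataset.
lengths_mm <- c(0, 4.5, 6.2, 31.8, 58.9)

# if_else(): one test, two outcomes. Unusual values become missing.
# NA_real_ is a typed NA, which should also work on older dplyr versions.
cleaned <- if_else(lengths_mm < 3 | lengths_mm > 20, NA_real_, lengths_mm)

# case_when(): several tests checked in order, with a final catch-all,
# instead of nesting one if_else() inside another.
label <- case_when(
  lengths_mm < 3  ~ "too small",
  lengths_mm > 20 ~ "too large",
  TRUE            ~ "plausible"
)
```

<p>Each condition in <code>case_when()</code> is tried in turn, so the catch-all only applies to values that matched none of the earlier tests.</p>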
<p>Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
@@ -251,7 +251,7 @@ Missing values</h1>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)</pre>
</div>
<p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="#chp-https://rdrr.io/pkg/nycflights13/man/flights" data-type="xref">#chp-https://rdrr.io/pkg/nycflights13/man/flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="#chp-https://rdrr.io/r/base/NA" data-type="xref">#chp-https://rdrr.io/r/base/NA</a></code>.</p>
<p>Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in <code><a href="https://rdrr.io/pkg/nycflights13/man/flights.html">nycflights13::flights</a></code><span data-type="footnote">Remember that when we need to be explicit about where a function (or dataset) comes from, we’ll use the special form <code>package::function()</code> or <code>package::dataset</code>.</span>, missing values in the <code>dep_time</code> variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled flights. You can do this by making a new variable with <code><a href="https://rdrr.io/r/base/NA.html">is.na()</a></code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">nycflights13::flights |&gt;
mutate(
@@ -272,7 +272,7 @@ Missing values</h1>
<h2>
Exercises</h2>
<ol type="1"><li><p>What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?</p></li>
<li><p>What does <code>na.rm = TRUE</code> do in <code><a href="#chp-https://rdrr.io/r/base/mean" data-type="xref">#chp-https://rdrr.io/r/base/mean</a></code> and <code><a href="#chp-https://rdrr.io/r/base/sum" data-type="xref">#chp-https://rdrr.io/r/base/sum</a></code>?</p></li>
<li><p>What does <code>na.rm = TRUE</code> do in <code><a href="https://rdrr.io/r/base/mean.html">mean()</a></code> and <code><a href="https://rdrr.io/r/base/sum.html">sum()</a></code>?</p></li>
</ol></section>
</section>
@@ -284,7 +284,7 @@ Covariation</h1>
<section id="sec-cat-cont" data-type="sect2">
<h2>
A categorical and continuous variable</h2>
<p>It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
<p>It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by <code>cut</code>):</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(color = cut), binwidth = 500, size = 0.75)</pre>
@@ -308,7 +308,7 @@ A categorical and continuous variable</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-23-1.png" class="img-fluid" alt="A frequency polygon of densities of prices of diamonds where each cut of carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others." width="576"/></p>
</div>
</div>
<p>Note that we’re mapping the density to <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes_eval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes_eval</a></code> function to do so.</p>
<p>Note that we’re mapping the density to <code>y</code>, but since <code>density</code> is not a variable in the <code>diamonds</code> dataset, we need to first calculate it. We use the <code><a href="https://ggplot2.tidyverse.org/reference/aes_eval.html">after_stat()</a></code> function to do so.</p>
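<p>A minimal sketch of that mapping, assuming the same frequency polygon of <code>price</code> by <code>cut</code> and the <code>binwidth</code> of 500 used in the count version earlier in this section. The object name <code>density_plot</code> is ours.</p>

```r
library(ggplot2)

# after_stat(density) asks ggplot2 for the density computed by the binning
# stat, rather than looking for a column called `density` in `diamonds`.
density_plot <- ggplot(
  data = diamonds,
  mapping = aes(x = price, y = after_stat(density))
) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
```

<p>Because the density is standardized within each group, the area under each line is one, so the shapes can be compared even when the groups differ greatly in size.</p>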
<p>There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.</p>
<p>Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A <strong>boxplot</strong> is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:</p>
<ul><li><p>A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.</p></li>
@@ -319,7 +319,7 @@ A categorical and continuous variable</h2>
<p><img src="images/EDA-boxplot.png" class="img-fluid" alt="A diagram depicting how a boxplot is created following the steps outlined above." width="1066"/></p>
</div>
</div>
<p>Let’s take a look at the distribution of price by cut using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot</a></code>:</p>
<p>Let’s take a look at the distribution of price by cut using <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()</pre>
@@ -328,7 +328,7 @@ A categorical and continuous variable</h2>
</div>
</div>
<p>We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you’ll be challenged to figure out why.</p>
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="#chp-https://rdrr.io/r/stats/reorder.factor" data-type="xref">#chp-https://rdrr.io/r/stats/reorder.factor</a></code> function.</p>
<p><code>cut</code> is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the <code><a href="https://rdrr.io/r/stats/reorder.factor.html">reorder()</a></code> function.</p>
<p>For example, take the <code>class</code> variable in the <code>mpg</code> dataset. You might be interested to know how highway mileage varies across classes:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
@@ -346,7 +346,7 @@ A categorical and continuous variable</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-27-1.png" class="img-fluid" alt="Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize)." width="576"/></p>
</div>
</div>
<p>If you have long variable names, <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_boxplot</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
<p>If you have long variable names, <code><a href="https://ggplot2.tidyverse.org/reference/geom_boxplot.html">geom_boxplot()</a></code> will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = mpg,
mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
@@ -361,17 +361,17 @@ A categorical and continuous variable</h2>
Exercises</h3>
<ol type="1"><li><p>Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.</p></li>
<li><p>What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?</p></li>
<li><p>Instead of exchanging the x and y variables, add <code><a href="#chp-https://ggplot2.tidyverse.org/reference/coord_flip" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/coord_flip</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li>
<li><p>Instead of exchanging the x and y variables, add <code><a href="https://ggplot2.tidyverse.org/reference/coord_flip.html">coord_flip()</a></code> as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?</p></li>
<li><p>One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using <code>geom_lv()</code> to display the distribution of price vs cut. What do you learn? How do you interpret the plots?</p></li>
<li><p>Compare and contrast <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_violin" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_violin</a></code> with a faceted <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>, or a coloured <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code>. What are the pros and cons of each method?</p></li>
<li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_jitter" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_jitter</a></code>. List them and briefly describe what each one does.</p></li>
<li><p>Compare and contrast <code><a href="https://ggplot2.tidyverse.org/reference/geom_violin.html">geom_violin()</a></code> with a faceted <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code>, or a coloured <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code>. What are the pros and cons of each method?</p></li>
<li><p>If you have a small dataset, it’s sometimes useful to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code> to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to <code><a href="https://ggplot2.tidyverse.org/reference/geom_jitter.html">geom_jitter()</a></code>. List them and briefly describe what each one does.</p></li>
</ol></section>
</section>
<section id="two-categorical-variables" data-type="sect2">
<h2>
Two categorical variables</h2>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_count" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_count</a></code>:</p>
<p>To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in <code><a href="https://ggplot2.tidyverse.org/reference/geom_count.html">geom_count()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
geom_count()</pre>
@@ -411,7 +411,7 @@ Two categorical variables</h2>
#&gt; 6 E Fair 224
#&gt; # … with 29 more rows</pre>
</div>
<p>Then visualize with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_tile</a></code> and the fill aesthetic:</p>
<p>Then visualize with <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> and the fill aesthetic:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">diamonds |&gt;
count(color, cut) |&gt;
@@ -428,7 +428,7 @@ Two categorical variables</h2>
Exercises</h3>
<ol type="1"><li><p>How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?</p></li>
<li><p>How does the segmented bar chart change if color is mapped to the <code>x</code> aesthetic and <code>cut</code> is mapped to the <code>fill</code> aesthetic? Calculate the counts that fall into each of the segments.</p></li>
<li><p>Use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_tile" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_tile</a></code> together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?</p></li>
<li><p>Use <code><a href="https://ggplot2.tidyverse.org/reference/geom_tile.html">geom_tile()</a></code> together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?</p></li>
<li><p>Why is it slightly better to use <code>aes(x = color, y = cut)</code> rather than <code>aes(x = cut, y = color)</code> in the example above?</p></li>
</ol></section>
</section>
@@ -436,7 +436,7 @@ Exercises</h3>
<section id="two-continuous-variables" data-type="sect2">
<h2>
Two continuous variables</h2>
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_point" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_point</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
<p>You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with <code><a href="https://ggplot2.tidyverse.org/reference/geom_point.html">geom_point()</a></code>. You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
geom_point()</pre>
@@ -452,8 +452,8 @@ Two continuous variables</h2>
<p><img src="EDA_files/figure-html/unnamed-chunk-35-1.png" class="img-fluid" alt="A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats." width="576"/></p>
</div>
</div>
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_histogram" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_histogram</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code> to bin in two dimensions.</p>
<p><code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d</a></code> and <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_bin_2d</a></code> creates rectangular bins. <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="#chp-https://ggplot2.tidyverse.org/reference/geom_hex" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/geom_hex</a></code>.</p>
<p>But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_histogram()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_histogram.html">geom_freqpoly()</a></code> to bin in one dimension. Now you’ll learn how to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> to bin in two dimensions.</p>
<p><code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> and <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. <code><a href="https://ggplot2.tidyverse.org/reference/geom_bin_2d.html">geom_bin2d()</a></code> creates rectangular bins. <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code> creates hexagonal bins. You will need to install the hexbin package to use <code><a href="https://ggplot2.tidyverse.org/reference/geom_hex.html">geom_hex()</a></code>.</p>
<div>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_bin2d()
@@ -474,7 +474,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
</div>
</div>
<p><code>cut_width(x, width)</code>, as used above, divides <code>x</code> into bins of width <code>width</code>. By default, boxplots look roughly the same (apart from the number of outliers) regardless of how many observations there are, so it's difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with <code>varwidth = TRUE</code>.</p>
<p>Another approach is to display approximately the same number of points in each bin. That's the job of <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>:</p>
<p>Another approach is to display approximately the same number of points in each bin. That's the job of <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, 20)))</pre>
@@ -486,7 +486,7 @@ ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
<section id="exercises-4" data-type="sect3">
<h3>
Exercises</h3>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code> vs <code><a href="#chp-https://ggplot2.tidyverse.org/reference/cut_interval" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/cut_interval</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
<ol type="1"><li><p>Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_width()</a></code> vs <code><a href="https://ggplot2.tidyverse.org/reference/cut_interval.html">cut_number()</a></code>? How does that impact a visualization of the 2d distribution of <code>carat</code> and <code>price</code>?</p></li>
<li><p>Visualize the distribution of <code>carat</code>, partitioned by <code>price</code>.</p></li>
<li><p>How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?</p></li>
<li><p>Combine two of the techniques you've learned to visualize the combined distribution of cut, carat, and price.</p></li>
@@ -565,7 +565,7 @@ ggplot2 calls</h1>
<pre data-type="programlisting" data-code-language="downlit">ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)</pre>
</div>
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/ggplot" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/ggplot</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="#chp-https://ggplot2.tidyverse.org/reference/aes" data-type="xref">#chp-https://ggplot2.tidyverse.org/reference/aes</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won't supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. That's a really important programming concern that we'll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
<p>Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/ggplot.html">ggplot()</a></code> are <code>data</code> and <code>mapping</code>, and the first two arguments to <code><a href="https://ggplot2.tidyverse.org/reference/aes.html">aes()</a></code> are <code>x</code> and <code>y</code>. In the remainder of the book, we won't supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots. That's a really important programming concern that we'll come back to in <a href="#chp-functions" data-type="xref">#chp-functions</a>.</p>
<p>Rewriting the previous plot more concisely yields:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">ggplot(faithful, aes(eruptions)) +