Review edits (#1330)
* Cut case study down a bit
* Intro feedback
* Visualize feedback
* Update data-transform.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-visualize.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update data-visualize.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update intro.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Update intro.qmd
  Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
* Incorporate review feedback

---------

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
parent fc631a4509
commit 70687bf94c
|
@ -3,6 +3,7 @@
|
|||
```{r}
|
||||
#| results: "asis"
|
||||
#| echo: false
|
||||
|
||||
source("_common.R")
|
||||
status("complete")
|
||||
```
|
||||
|
@ -753,89 +754,32 @@ As you can see, when you summarize an ungrouped data frame, you get a single row
|
|||
|
||||
Whenever you do any aggregation, it's always a good idea to include a count (`n()`).
|
||||
That way, you can ensure that you're not drawing conclusions based on very small amounts of data.
|
||||
For example, let's look at the planes (identified by their tail number) that have the highest average delays:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A frequency histogram showing the distribution of flight delays.
|
||||
#| The distribution is unimodal, with a large spike around 0, and
|
||||
#| asymmetric: very few flights leave more than 30 minutes early,
|
||||
#| but flights are delayed up to 5 hours.
|
||||
|
||||
delays <- flights |>
|
||||
filter(!is.na(arr_delay), !is.na(tailnum)) |>
|
||||
group_by(tailnum) |>
|
||||
summarize(
|
||||
delay = mean(arr_delay, na.rm = TRUE),
|
||||
n = n()
|
||||
)
|
||||
|
||||
ggplot(delays, aes(x = delay)) +
|
||||
geom_freqpoly(binwidth = 10)
|
||||
```
|
||||
|
||||
Wow, there are some planes that have an *average* delay of 5 hours (300 minutes)!
|
||||
That seems pretty surprising, so let's draw a scatterplot of number of flights vs. average delay:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| A scatterplot showing number of flights versus average arrival delay. Delays
|
||||
#| for planes with a very small number of flights have very high variability
|
||||
#| (from -50 to ~300), but the variability rapidly decreases as the
|
||||
#| number of flights increases.
|
||||
|
||||
ggplot(delays, aes(x = delay, y = n)) +
|
||||
geom_point(alpha = 1/10)
|
||||
```
|
||||
|
||||
Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane.
|
||||
The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].
|
||||
|
||||
[^data-transform-4]: \*cough\* the law of large numbers \*cough\*.
|
||||
|
||||
When looking at this sort of plot, it's often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| Scatterplot of number of flights of a given plane vs. the average delay
|
||||
#| for those flights, for planes with more than 25 flights. As average delay
|
||||
#| increases from -20 to 10, the number of flights also increases. For
|
||||
#| larger average delays, the number of flights decreases.
|
||||
|
||||
delays |>
|
||||
filter(n > 25) |>
|
||||
ggplot(aes(x = delay, y = n)) +
|
||||
geom_point(alpha = 1/10) +
|
||||
geom_smooth(se = FALSE)
|
||||
```
|
||||
|
||||
Note the handy pattern for combining ggplot2 and dplyr.
|
||||
It's a bit annoying that you have to switch from `|>` to `+`, but it's not too much of a hassle once you get the hang of it.
|
||||
|
||||
There's another common variation on this pattern that we can see in some data about baseball players.
|
||||
The following code uses data from the **Lahman** package to compare what proportion of times a player gets a hit vs. the number of times they try to put the ball in play:
|
||||
We'll demonstrate this with some baseball data from the **Lahman** package.
|
||||
Specifically, we will compare what proportion of times a player gets a hit vs. the number of times they try to put the ball in play:
|
||||
|
||||
```{r}
|
||||
batters <- Lahman::Batting |>
|
||||
group_by(playerID) |>
|
||||
summarize(
|
||||
perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
|
||||
performance = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
|
||||
n = sum(AB, na.rm = TRUE)
|
||||
)
|
||||
batters
|
||||
```
|
||||
|
||||
When we plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
|
||||
When we plot the skill of the batter (measured by the batting average, `performance`) against the number of opportunities to hit the ball (measured by times at bat, `n`), you see two patterns:
|
||||
|
||||
1. As above, the variation in our aggregate decreases as we get more data points.
|
||||
1. The variation in our aggregate decreases as we get more data points.
|
||||
The shape of this plot is very characteristic: whenever you plot a mean (or other summary statistics) vs. group size, you'll see that the variation decreases as the sample size increases[^data-transform-4].
|
||||
|
||||
2. There's a positive correlation between skill (`perf`) and opportunities to hit the ball (`n`) because obviously teams want to give their best batters the most opportunities to hit the ball.
|
||||
2. There's a positive correlation between skill (`performance`) and opportunities to hit the ball (`n`) because teams want to give their best batters the most opportunities to hit the ball.
|
||||
|
||||
[^data-transform-4]: \*cough\* the law of large numbers \*cough\*.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A scatterplot of number of batting opportunities vs. batting performance
|
||||
#| A scatterplot of batting performance vs. number of batting opportunities
|
||||
#| overlaid with a smoothed line. Average performance increases sharply
|
||||
#| from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance
|
||||
#| continues to increase linearly at a much shallower slope reaching
|
||||
|
@ -843,17 +787,20 @@ When we plot the skill of the batter (measured by the batting average, `ba`) aga
|
|||
|
||||
batters |>
|
||||
filter(n > 100) |>
|
||||
ggplot(aes(x = n, y = perf)) +
|
||||
geom_point(alpha = 1 / 10) +
|
||||
geom_smooth(se = FALSE)
|
||||
ggplot(aes(x = n, y = performance)) +
|
||||
geom_point(alpha = 1 / 10) +
|
||||
geom_smooth(se = FALSE)
|
||||
```
|
||||
|
||||
Note the handy pattern for combining ggplot2 and dplyr.
|
||||
It's a bit annoying that you have to switch from `|>` to `+`, but it's not too much of a hassle once you get the hang of it.
|
||||
|
||||
This also has important implications for ranking.
|
||||
If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
|
||||
If you naively sort on `desc(performance)`, the people with the best batting averages are clearly lucky, not skilled:
|
||||
|
||||
```{r}
|
||||
batters |>
|
||||
arrange(desc(perf))
|
||||
arrange(desc(performance))
|
||||
```
|
||||
|
||||
You can find a good explanation of this problem and how to overcome it at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <https://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
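
One simple guard against this, sketched here with made-up numbers rather than the Lahman data, is to require a minimum number of at-bats before ranking. The toy `batters` tibble and its player IDs are hypothetical, and the cutoff of 100 simply mirrors the `filter(n > 100)` used for the plot above; the linked posts describe more principled approaches.

```r
library(dplyr)

# Toy data (hypothetical players, not from Lahman): one lucky player with a
# single at-bat, and two regulars with thousands of at-bats.
batters <- tibble(
  playerID = c("lucky01", "steady01", "steady02"),
  performance = c(1.000, 0.310, 0.285),
  n = c(1, 5200, 4100)
)

# Naive ranking: the one-at-bat player comes out on top.
naive <- batters |>
  arrange(desc(performance))

# Ranking after dropping players with too few at-bats to judge.
ranked <- batters |>
  filter(n > 100) |>
  arrange(desc(performance))

ranked
```

A hard cutoff is crude, since it treats a player with 101 at-bats the same as one with 5,000, but it is often enough to keep obviously lucky results out of a top-ten list.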
|
||||
|
|
|
@ -58,7 +58,7 @@ library(ggthemes)
|
|||
|
||||
## First steps
|
||||
|
||||
Let's use our first graph to answer a question: Do penguins with longer flippers weigh more or less than penguins with shorter flippers?
|
||||
Do penguins with longer flippers weigh more or less than penguins with shorter flippers?
|
||||
You probably already have an answer, but try to make your answer precise.
|
||||
What does the relationship between flipper length and body mass look like?
|
||||
Is it positive?
|
||||
|
@ -67,6 +67,7 @@ Linear?
|
|||
Nonlinear?
|
||||
Does the relationship vary by the species of the penguin?
|
||||
And how about by the island where the penguin lives?
|
||||
Let's create visualizations that we can use to answer these questions.
|
||||
|
||||
### The `penguins` data frame
|
||||
|
||||
|
@ -80,6 +81,10 @@ A data frame is a rectangular collection of variables (in the columns) and obser
|
|||
<https://allisonhorst.github.io/palmerpenguins/>.
|
||||
doi: 10.5281/zenodo.3960218.
|
||||
|
||||
Type the name of the data frame in the console and R will print a preview of its contents.
|
||||
Note that it says `tibble` on top of this preview.
|
||||
In the tidyverse, we use special data frames called tibbles that you will learn more about soon.
|
||||
|
||||
```{r}
|
||||
penguins
|
||||
```
|
||||
|
@ -110,8 +115,8 @@ Our ultimate goal in this chapter is to recreate the following visualization dis
|
|||
#| echo: false
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A scatterplot of body mass vs. flipper length of penguins, with a
|
||||
#| smooth curve displaying the relationship between these two variables
|
||||
#| A scatterplot of body mass vs. flipper length of penguins, with a
|
||||
#| best fit line of the relationship between these two variables
|
||||
#| overlaid. The plot displays a positive, fairly linear, and relatively
|
||||
#| strong relationship between these two variables. Species (Adelie,
|
||||
#| Chinstrap, and Gentoo) are represented with different colors and
|
||||
|
@ -121,7 +126,7 @@ Our ultimate goal in this chapter is to recreate the following visualization dis
|
|||
|
||||
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
|
||||
geom_point(aes(color = species, shape = species)) +
|
||||
geom_smooth() +
|
||||
geom_smooth(method = "lm") +
|
||||
labs(
|
||||
title = "Body mass and flipper length",
|
||||
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
|
||||
|
@ -135,10 +140,10 @@ ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
|
|||
|
||||
### Creating a ggplot
|
||||
|
||||
Let's recreate this plot layer-by-layer.
|
||||
Let's recreate this plot step-by-step.
|
||||
|
||||
With ggplot2, you begin a plot with the function `ggplot()`, defining a plot object that you then add layers to.
|
||||
The first argument of `ggplot()` is the dataset to use in the graph and so `ggplot(data = penguins)` creates an empty graph.
|
||||
With ggplot2, you begin a plot with the function `ggplot()`, defining a plot object that you then add **layers** to.
|
||||
The first argument of `ggplot()` is the dataset to use in the graph, so `ggplot(data = penguins)` creates a graph that is primed to display the `penguins` data; since we haven't yet told it how to visualize that data, it's empty for now.
|
||||
This is not a very exciting plot, but you can think of it like an empty canvas you'll paint the remaining layers of your plot onto.
|
||||
|
||||
```{r}
|
||||
|
@ -148,13 +153,13 @@ This is not a very exciting plot, but you can think of it like an empty canvas y
|
|||
ggplot(data = penguins)
|
||||
```
|
||||
|
||||
Next, we need to tell `ggplot()` the variables from this data frame that we want to map to visual properties (**aesthetics**) of the plot.
|
||||
The `mapping` argument of the `ggplot()` function defines how variables in your dataset are mapped to visual properties of your plot.
|
||||
Next, we need to tell `ggplot()` how the information from our data will be visually represented.
|
||||
The `mapping` argument of the `ggplot()` function defines how variables in your dataset are mapped to visual properties (**aesthetics**) of your plot.
|
||||
The `mapping` argument is always paired with the `aes()` function, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes.
|
||||
For now, we will only map flipper length to the `x` aesthetic and body mass to the `y` aesthetic.
|
||||
ggplot2 looks for the mapped variables in the `data` argument, in this case, `penguins`.
|
||||
|
||||
The following plots show the result of adding these mappings, one at a time.
|
||||
The following plot shows the result of adding these mappings.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
|
@ -175,8 +180,7 @@ This is because we have not yet articulated, in our code, how to represent the o
|
|||
To do so, we need to define a **geom**: the geometrical object that a plot uses to represent data.
|
||||
These geometric objects are made available in ggplot2 with functions that start with `geom_`.
|
||||
People often describe plots by the type of geom that the plot uses.
|
||||
For example, bar charts use bar geoms (`geom_bar()`), line charts use line geoms (`geom_line()`), boxplots use boxplot geoms (`geom_boxplot()`), and so on.
|
||||
Scatterplots break the trend; they use the point geom: `geom_point()`.
|
||||
For example, bar charts use bar geoms (`geom_bar()`), line charts use line geoms (`geom_line()`), boxplots use boxplot geoms (`geom_boxplot()`), scatterplots use point geoms (`geom_point()`), and so on.
|
||||
|
||||
The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot.
|
||||
ggplot2 comes with many geom functions that each adds a different type of layer to a plot.
|
||||
|
@ -203,27 +207,19 @@ Before we add more layers to this plot, let's pause for a moment and review the
|
|||
|
||||
> Removed 2 rows containing missing values (`geom_point()`).
|
||||
|
||||
We're seeing this message because there are two penguins in our dataset with missing body mass and flipper length values and ggplot2 has no way of representing them on the plot.
|
||||
You don't need to worry about understanding the following code yet (you will learn about it in @sec-data-transform), but it's one way of identifying the observations with `NA`s for either body mass or flipper length.
|
||||
|
||||
```{r}
|
||||
penguins |>
|
||||
select(species, flipper_length_mm, body_mass_g) |>
|
||||
filter(is.na(body_mass_g) | is.na(flipper_length_mm))
|
||||
```
|
||||
|
||||
We're seeing this message because there are two penguins in our dataset with missing body mass and/or flipper length values and ggplot2 has no way of representing them on the plot without both of these values.
|
||||
Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing.
|
||||
This type of warning is probably one of the most common types of warnings you will see when working with real data -- missing values are a very common issue and you'll learn more about them throughout the book, particularly in @sec-missing-values.
|
||||
For the remaining plots in this chapter we will suppress this warning so it's not printed alongside every single plot we make.
|
||||
|
||||
### Adding aesthetics and layers
|
||||
### Adding aesthetics and layers {#sec-adding-aesthetics-layers}
|
||||
|
||||
Scatterplots are useful for displaying the relationship between two variables, but it's always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.
|
||||
Scatterplots are useful for displaying the relationship between two numerical variables, but it's always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.
|
||||
For example, does the relationship between flipper length and body mass differ by species?
|
||||
Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between these variables.
|
||||
We will do this by representing species with different colored points.
|
||||
|
||||
To achieve this, where should `species` go in the ggplot call from earlier?
|
||||
To achieve this, will we need to modify the aesthetic or the geom?
|
||||
If you guessed "in the aesthetic mapping, inside of `aes()`", you're already getting the hang of creating data visualizations with ggplot2!
|
||||
And if not, don't worry.
|
||||
Throughout the book you will make many more ggplots and have many more opportunities to check your intuition as you make them.
|
||||
|
@ -243,13 +239,14 @@ ggplot(
|
|||
geom_point()
|
||||
```
|
||||
|
||||
When a variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as **scaling**.
|
||||
When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here a unique color) to each unique level of the variable (each of the three species), a process known as **scaling**.
|
||||
ggplot2 will also add a legend that explains which values correspond to which levels.
|
||||
|
||||
Now let's add one more layer: a smooth curve displaying the relationship between body mass and flipper length.
|
||||
Before you proceed, refer back to the code above, and think about how we can add this to our existing plot.
|
||||
|
||||
Since this is a new geometric object representing our data, we will add a new geom: `geom_smooth()`.
|
||||
Since this is a new geometric object representing our data, we will add a new geom as a layer on top of our point geom: `geom_smooth()`.
|
||||
And we will specify that we want to draw the line of best fit based on a `l`inear `m`odel with `method = "lm"`.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
|
@ -265,20 +262,20 @@ ggplot(
|
|||
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
|
||||
) +
|
||||
geom_point() +
|
||||
geom_smooth()
|
||||
geom_smooth(method = "lm")
|
||||
```
|
||||
|
||||
We have successfully added smooth curves, but this plot doesn't look like the plot from @sec-ultimate-goal, which only has one curve for the entire dataset as opposed to separate curves for each of the penguin species.
|
||||
We have successfully added lines, but this plot doesn't look like the plot from @sec-ultimate-goal, which only has one line for the entire dataset as opposed to separate lines for each of the penguin species.
|
||||
|
||||
When aesthetic mappings are defined in `ggplot()`, at the *global* level, they're inherited by each of the subsequent geom layers of the plot.
|
||||
However, each geom function in ggplot2 can also take a `mapping` argument, which allows for aesthetic mappings at the *local* level.
|
||||
Since we want points to be colored based on species but don't want the smooth curves to be separated out for them, we should specify `color = species` for `geom_point()` only.
|
||||
When aesthetic mappings are defined in `ggplot()`, at the *global* level, they're passed down to each of the subsequent geom layers of the plot.
|
||||
However, each geom function in ggplot2 can also take a `mapping` argument, which allows for aesthetic mappings at the *local* level that are added to those inherited from the global level.
|
||||
Since we want points to be colored based on species but don't want the lines to be separated out for them, we should specify `color = species` for `geom_point()` only.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. Overlaid
|
||||
#| on the scatterplot are is a single smooth curve displaying the
|
||||
#| on the scatterplot is a single line of best fit displaying the
|
||||
#| relationship between these variables for each species (Adelie,
|
||||
#| Chinstrap, and Gentoo). Different penguin species are plotted in
|
||||
#| different colors for the points only.
|
||||
|
@ -288,7 +285,7 @@ ggplot(
|
|||
mapping = aes(x = flipper_length_mm, y = body_mass_g)
|
||||
) +
|
||||
geom_point(mapping = aes(color = species)) +
|
||||
geom_smooth()
|
||||
geom_smooth(method = "lm")
|
||||
```
|
||||
|
||||
Voila!
|
||||
|
@ -302,7 +299,7 @@ Therefore, in addition to color, we can also map `species` to the `shape` aesthe
|
|||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. Overlaid
|
||||
#| on the scatterplot are is a single smooth curve displaying the
|
||||
#| on the scatterplot is a single line of best fit displaying the
|
||||
#| relationship between these variables for each species (Adelie,
|
||||
#| Chinstrap, and Gentoo). Different penguin species are plotted in
|
||||
#| different colors and shapes for the points only.
|
||||
|
@ -312,7 +309,7 @@ ggplot(
|
|||
mapping = aes(x = flipper_length_mm, y = body_mass_g)
|
||||
) +
|
||||
geom_point(mapping = aes(color = species, shape = species)) +
|
||||
geom_smooth()
|
||||
geom_smooth(method = "lm")
|
||||
```
|
||||
|
||||
Note that the legend is automatically updated to reflect the different shapes of the points as well.
|
||||
|
@ -326,7 +323,7 @@ In addition, we can improve the color palette to be colorblind safe with the `sc
|
|||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A scatterplot of body mass vs. flipper length of penguins, with a
|
||||
#| smooth curve displaying the relationship between these two variables
|
||||
#| line of best fit displaying the relationship between these two variables
|
||||
#| overlaid. The plot displays a positive, fairly linear, and relatively
|
||||
#| strong relationship between these two variables. Species (Adelie,
|
||||
#| Chinstrap, and Gentoo) are represented with different colors and
|
||||
|
@ -339,7 +336,7 @@ ggplot(
|
|||
mapping = aes(x = flipper_length_mm, y = body_mass_g)
|
||||
) +
|
||||
geom_point(aes(color = species, shape = species)) +
|
||||
geom_smooth() +
|
||||
geom_smooth(method = "lm") +
|
||||
labs(
|
||||
title = "Body mass and flipper length",
|
||||
subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
|
||||
|
@ -363,7 +360,7 @@ We finally have a plot that perfectly matches our "ultimate goal"!
|
|||
Describe the relationship between these two variables.
|
||||
|
||||
4. What happens if you make a scatterplot of `species` vs. `bill_depth_mm`?
|
||||
Why is the plot not useful?
|
||||
What might be a better choice of geom?
|
||||
|
||||
5. Why does the following give an error and how would you fix it?
|
||||
|
||||
|
@ -456,7 +453,7 @@ ggplot(
|
|||
|
||||
Typically, the first one or two arguments to a function are so important that you should know them by heart.
|
||||
The first two arguments to `ggplot()` are `data` and `mapping`; in the remainder of the book, we won't supply those names.
|
||||
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
|
||||
That saves typing, and, by reducing the amount of extra text, makes it easier to see what's different between plots.
|
||||
That's a really important programming concern that we'll come back to in @sec-functions.
|
||||
|
||||
Rewriting the previous plot more concisely yields:
|
||||
|
@ -516,8 +513,9 @@ You will learn more about factors and functions for dealing with factors (like `
|
|||
|
||||
### A numerical variable
|
||||
|
||||
A variable is **numerical** if it can take any of an infinite set of ordered values.
|
||||
Numbers and date-times are two examples of continuous variables.
|
||||
A variable is **numerical** (or quantitative) if it can take on a wide range of numerical values, and it is sensible to add, subtract, or take averages with those values.
|
||||
Numerical variables can be continuous or discrete.
|
||||
|
||||
One commonly used visualization for distributions of continuous variables is a histogram.
|
||||
|
||||
```{r}
|
||||
|
@ -534,15 +532,11 @@ ggplot(penguins, aes(x = body_mass_g)) +
|
|||
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
|
||||
In the graph above, the tallest bar shows that 39 observations have a `body_mass_g` value between 3,500 and 3,700 grams, which are the left and right edges of the bar.
|
||||
|
||||
```{r}
|
||||
penguins |>
|
||||
count(cut_width(body_mass_g, 200))
|
||||
```
|
||||
|
||||
You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable.
|
||||
You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.
|
||||
In the plots below a binwidth of 20 is too narrow, resulting in too many bars, making it difficult to determine the shape of the distribution.
|
||||
Similarly, a binwidth of 2,000 is too high, resulting in all data being binned into only three bars, and also making it difficult to determine the shape of the distribution.
|
||||
A binwidth of 200 provides a sensible balance.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
|
@ -599,7 +593,7 @@ ggplot(penguins, aes(x = body_mass_g)) +
|
|||
|
||||
3. What does the `bins` argument in `geom_histogram()` do?
|
||||
|
||||
4. Make a histogram of the `carat` variable in the `diamonds` dataset.
|
||||
4. Make a histogram of the `carat` variable in the `diamonds` dataset that is available when you load the tidyverse package.
|
||||
Experiment with different binwidths.
|
||||
What binwidth reveals the most interesting patterns?
|
||||
|
||||
|
@ -611,7 +605,8 @@ In the following sections you will learn about commonly used plots for visualizi
|
|||
### A numerical and a categorical variable
|
||||
|
||||
To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots.
|
||||
A **boxplot** is a type of visual shorthand for measures of position (percentiles) that describe a distribution that are commonly used in statistical analysis of data.
|
||||
A **boxplot** is a type of visual shorthand for measures of position (percentiles) that describe a distribution.
|
||||
It is also useful for identifying potential outliers.
|
||||
As shown in @fig-eda-boxplot, each boxplot consists of:
|
||||
|
||||
- A box that indicates the range of the middle half of the data, a distance known as the interquartile range (IQR), stretching from the 25th percentile of the distribution to the 75th percentile.
|
||||
|
@ -651,36 +646,32 @@ ggplot(penguins, aes(x = species, y = body_mass_g)) +
|
|||
geom_boxplot()
|
||||
```
|
||||
|
||||
Alternatively, we can make frequency polygons with `geom_freqpoly()`.
|
||||
`geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, it uses lines instead.
|
||||
It's much easier to understand overlapping lines than bars of `geom_histogram()`.
|
||||
There are a few challenges with this type of plot, which we will come back to in @sec-cat-num on exploring the covariation between a categorical and a numerical variable.
|
||||
Alternatively, we can make density plots with `geom_density()`.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A frequency polygon of body masses of penguins by species of
|
||||
#| penguins. Each species (Adelie, Chinstrap, and Gentoo) is
|
||||
#| represented with different colored outlines for the polygons.
|
||||
#| A density plot of body masses of penguins by species of penguins. Each
|
||||
#| species (Adelie, Chinstrap, and Gentoo) is represented with different
|
||||
#| colored outlines for the density curves.
|
||||
|
||||
ggplot(penguins, aes(x = body_mass_g, color = species)) +
|
||||
geom_freqpoly(binwidth = 200, linewidth = 0.75)
|
||||
geom_density(linewidth = 0.75)
|
||||
```
|
||||
|
||||
We've also customized the thickness of the lines using the `linewidth` argument in order to make them stand out a bit more against the background.
|
||||
|
||||
We can also use overlaid density plots, with `species` mapped to both `color` and `fill` aesthetics and using the `alpha` aesthetic to add transparency to the filled density curves.
|
||||
Alternatively, we can map `species` to both `color` and `fill` aesthetics and use the `alpha` aesthetic to add transparency to the filled density curves.
|
||||
This aesthetic takes values between 0 (completely transparent) and 1 (completely opaque).
|
||||
In the following plot it's *set* to 0.5.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| A frequency polygon of body masses of penguins (on the left) and density
|
||||
#| plot (on the right). Each species of penguins (Adelie, Chinstrap, and
|
||||
#| Gentoo) are represented in different colored outlines for the frequency
|
||||
#| polygons and the density curves. The density curves are also filled with
|
||||
#| the same colors, with some transparency added.
|
||||
#| A density plot of body masses of penguins by species of penguins. Each
|
||||
#| species (Adelie, Chinstrap, and Gentoo) is represented in different
|
||||
#| colored outlines for the density curves. The density curves are also
|
||||
#| filled with the same colors, with some transparency added.
|
||||
|
||||
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
|
||||
geom_density(alpha = 0.5)
|
||||
|
@ -693,20 +684,23 @@ Note the terminology we have used here:
|
|||
|
||||
### Two categorical variables
|
||||
|
||||
We can use segmented bar plots to visualize the distribution between two categorical variables.
|
||||
In creating this bar chart, we map the variable we want to divide the data into first to the `x` aesthetic and the variable we then further want to divide each group into to the `fill` aesthetic.
|
||||
We can use stacked bar plots to visualize the relationship between two categorical variables.
|
||||
|
||||
Below are two segmented bar plots, both displaying the relationship between `island` and `species`, or specifically, visualizing the distribution of `species` within each island.
|
||||
The two stacked bar plots below both display the relationship between `island` and `species` or, more specifically, the distribution of `species` within each island.
|
||||
The plot on the left shows the frequencies of each species of penguins on each island and the plot on the right shows the relative frequencies (proportions) of each species within each island (despite the incorrectly labeled y-axis that says "count").
|
||||
The plot of frequencies shows that there are equal numbers of Adelies on each island.
|
||||
But we don't have a good sense of the percentage balance within each island.
|
||||
In the proportions plot, we've lost our notion of total penguins, but we've gained the advantage of "breakdown by island".
|
||||
|
||||
The relative frequency plot, created by setting `position = "fill"` in the geom, is more useful for comparing species distributions across islands since it's not affected by the unequal numbers of penguins across the islands.
|
||||
Based on the plot on the right, we can see that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen.
|
||||
|
||||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-alt: >
|
||||
#| Bar plots of penguin species by island (Biscoe, Dream, and Torgersen).
|
||||
#| On the left, frequencies of species are shown. On the right, relative
|
||||
#| frequencies of species are shown.
|
||||
#| layout-ncol: 2
|
||||
|
||||
ggplot(penguins, aes(x = island, fill = species)) +
|
||||
geom_bar()
|
||||
|
@ -714,10 +708,12 @@ ggplot(penguins, aes(x = island, fill = species)) +
|
|||
geom_bar(position = "fill")
|
||||
```
|
||||
|
||||
In creating these bar charts, we map the variable that will be separated into bars to the `x` aesthetic, and the variable that will change the colors inside the bars to the `fill` aesthetic.
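
To see what `position = "fill"` is plotting, one rough sketch is to compute the within-bar proportions by hand. The `toy` tibble below is invented for illustration (it is not the `penguins` data), and the column name `prop` is our own choice.

```r
library(dplyr)

# Invented data: 3 penguins on island A, 2 on island B.
toy <- tibble(
  island  = c("A", "A", "A", "B", "B"),
  species = c("x", "x", "y", "x", "y")
)

props <- toy |>
  count(island, species) |>     # segment heights in a plain geom_bar()
  group_by(island) |>
  mutate(prop = n / sum(n)) |>  # segment heights with position = "fill"
  ungroup()

props
```

Within each island the `prop` values sum to 1, which is why every bar in the relative frequency plot has the same height regardless of how many penguins live there.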
|
||||
|
||||
### Two numerical variables
|
||||
|
||||
So far you've learned about scatterplots (created with `geom_point()`) and smooth curves (created with `geom_smooth()`) for visualizing the relationship between two numerical variables.
|
||||
A scatterplot is probably the most commonly used plot for visualizing the relationship between two variables.
|
||||
A scatterplot is probably the most commonly used plot for visualizing the relationship between two numerical variables.
|
||||
|
||||
```{r}
|
||||
#| warning: false
|
||||
|
@ -732,7 +728,7 @@ ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
|
|||
|
||||
### Three or more variables
|
||||
|
||||
One way to add additional variables to a plot is by mapping them to an aesthetic.
|
||||
As we saw in @sec-adding-aesthetics-layers, we can incorporate more variables into a plot by mapping them to additional aesthetics.
|
||||
For example, in the following scatterplot the colors of points represent species and the shapes of points represent islands.
|
||||
|
||||
```{r}
|
||||
|
@ -779,9 +775,8 @@ You will learn about many other geoms for visualizing distributions of variables
|
|||
### Exercises
|
||||
|
||||
1. Which variables in `mpg` are categorical?
|
||||
Which variables are continuous?
|
||||
(Hint: type `?mpg` to read the documentation for the dataset).
|
||||
How can you see this information when you run `mpg`?
|
||||
Which variables are numerical?
|
||||
(Hint: Type `?mpg` to read the documentation for the dataset.) How can you see this information when you run `mpg`?
|
||||
|
||||
2. Make a scatterplot of `hwy` vs. `displ` using the `mpg` data frame.
|
||||
Next, map a third, numerical variable to `color`, then `size`, then both `color` and `size`, then `shape`.
|
||||
|
@ -793,6 +788,7 @@ You will learn about many other geoms for visualizing distributions of variables
|
|||
|
||||
5. Make a scatterplot of `bill_depth_mm` vs. `bill_length_mm` and color the points by `species`.
|
||||
What does adding coloring by species reveal about the relationship between these two variables?
|
||||
What about faceting by species?
|
||||
|
||||
6. Why does the following yield two separate legends?
|
||||
How would you fix it to combine the two legends?
|
||||
|
@ -812,7 +808,7 @@ You will learn about many other geoms for visualizing distributions of variables
|
|||
labs(color = "Species")
|
||||
```
|
||||
|
||||
7. Create the two following segmented bar plots.
|
||||
7. Create the two following stacked bar plots.
|
||||
Which question can you answer with the first one?
|
||||
Which question can you answer with the second one?
|
||||
|
||||
|
@ -828,7 +824,7 @@ You will learn about many other geoms for visualizing distributions of variables
|
|||
## Saving your plots {#sec-ggsave}
|
||||
|
||||
Once you've made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere.
|
||||
That's the job of `ggsave()`, which will save the most recent plot to disk:
|
||||
That's the job of `ggsave()`, which will save the plot most recently created to disk:
|
||||
|
||||
```{r}
|
||||
#| fig-show: hide
|
||||
|
@ -836,7 +832,7 @@ That's the job of `ggsave()`, which will save the most recent plot to disk:
|
|||
|
||||
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
|
||||
geom_point()
|
||||
ggsave(filename = "my-plot.png")
|
||||
ggsave(filename = "penguin-plot.png")
|
||||
```
|
||||
|
||||
```{r}
|
||||
|
@ -871,12 +867,13 @@ You will learn more about Quarto in @sec-quarto.
|
|||
```
|
||||
|
||||
2. What do you need to change in the code above to save the plot as a PDF instead of a PNG?
|
||||
How could you find out what types of image files would work in `ggsave()`?
|
||||
|
||||
## Common problems
|
||||
|
||||
As you start to run R code, you're likely to run into problems.
|
||||
Don't worry --- it happens to everyone.
|
||||
We have all been writing R code for years, but every day we still write code that doesn't work!
|
||||
We have all been writing R code for years, but every day we still write code that doesn't work on the first try!
|
||||
|
||||
Start by carefully comparing the code that you're running to the code in the book.
|
||||
R is extremely picky, and a misplaced character can make all the difference.
|
||||
|
@ -896,7 +893,7 @@ ggplot(data = mpg)
|
|||
```
|
||||
|
||||
If you're still stuck, try the help.
|
||||
You can get help about any R function by running `?function_name` in the console, or selecting the function name and pressing F1 in RStudio.
|
||||
You can get help about any R function by running `?function_name` in the console, or highlighting the function name and pressing F1 in RStudio.
|
||||
Don't worry if the help doesn't seem that helpful; instead, skip down to the examples and look for code that matches what you're trying to do.
|
||||
|
||||
If that doesn't help, carefully read the error message.
|
||||
|
|
|
@ -573,7 +573,7 @@ Here is a selection that you might find useful.
|
|||
### Center
|
||||
|
||||
So far, we've mostly used `mean()` to summarize the center of a vector of values.
|
||||
Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values.
|
||||
As we've seen in @sec-sample-size, because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values.
|
||||
An alternative is to use the `median()`, which finds a value that lies in the "middle" of the vector, i.e. 50% of the values are above it and 50% are below it.
|
||||
Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center.
|
||||
For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
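
As a small illustration of that sensitivity, the sketch below uses a made-up vector of delays (the values are hypothetical, not taken from `flights`) in which a single extreme value pulls the mean far from the bulk of the data while the median barely moves.

```{r}
# Hypothetical delays in minutes; the last value is an extreme outlier.
delays <- c(2, 5, 7, 8, 10, 300)

mean(delays)    # pulled upward by the single 300-minute delay
median(delays)  # stays near the typical values

# Dropping the outlier changes the mean a lot and the median only slightly.
mean(delays[delays < 300])
median(delays[delays < 300])
```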
|
||||
|
|