TR edits - Chp 1-9 (#1312)

* Mention parquet and databases
* Simplify language
* Explain what var and obs mean
* Data View() alternative
* Explain density
* Boxplot definition
* Clarify IQR, hide figure, add exercise
* will -> can
* Transform edits
* Fix typo
* Clarify cases
2023-02-27 21:54:34 -05:00
parent c0f0375d44
commit 9887705f43
5 changed files with 67 additions and 48 deletions
--- a/data-visualize.qmd
+++ b/data-visualize.qmd
@@ -72,7 +72,7 @@ And how about by the island where the penguin lives.

 You can test your answer with the `penguins` **data frame** found in palmerpenguins (a.k.a. `palmerpenguins::penguins`).
 A data frame is a rectangular collection of variables (in the columns) and observations (in the rows).
-`penguins` contains `r nrow(penguins)` observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER[^data-visualize-2].
+`penguins` contains `r nrow(penguins)` observations collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER[^data-visualize-2]. In this context, a variable refers to an attribute of all the penguins, and an observation refers to all the attributes of a single penguin.

 [^data-visualize-2]: Horst AM, Hill AP, Gorman KB (2020).
    palmerpenguins: Palmer Archipelago (Antarctica) penguin data.
@@ -86,7 +86,7 @@ penguins

 This data frame contains `r ncol(penguins)` columns.
 For an alternative view, where you can see all variables and the first few observations of each variable, use `glimpse()`.
-Or, if you're in RStudio, run `View(penguins)` to open an interactive data viewer.
+Or, if you're in RStudio, click on the name of the data frame in the Environment pane or run `View(penguins)` to open an interactive data viewer.

 ```{r}
 glimpse(penguins)
@@ -157,17 +157,11 @@ ggplot2 looks for the mapped variables in the `data` argument, in this case, `pe
 The following plots show the result of adding these mappings, one at a time.

 ```{r}
-#| layout-ncol: 2
 #| fig-alt: >
-#|   There are two plots. The plot on the left is shows flipper length on 
-#|   the x-axis. The values range from 170 to 230 The plot on the right 
-#|   also shows body mass on the y-axis. The values range from 3000 to 
-#|   6000.
+#|   The plot shows flipper length on the x-axis, with values that range from 
+#|   170 to 230, and body mass on the y-axis, with values that range from 3000 
+#|   to 6000.

-ggplot(
-  data = penguins,
-  mapping = aes(x = flipper_length_mm)
-)
 ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
@@ -202,7 +196,7 @@ ggplot(
 ```

 Now we have something that looks like what we might think of as a "scatter plot".
-It doesn't yet match our "ultimate goal" plot, but using this plot we can start answering the question that motivated our exploration: "What does the relationship between flipper length and body mass look like?" The relationship appears to be positive, fairly linear, and moderately strong.
+It doesn't yet match our "ultimate goal" plot, but using this plot we can start answering the question that motivated our exploration: "What does the relationship between flipper length and body mass look like?" The relationship appears to be positive (as flipper length increases, so does body mass), fairly linear (the points are clustered around a line instead of a curve), and moderately strong (there isn't too much scatter around such a line).
 Penguins with longer flippers are generally larger in terms of their body mass.

 Before we add more layers to this plot, let's pause for a moment and review the warning message we got:
@@ -225,7 +219,8 @@ For the remaining plots in this chapter we will suppress this warning so it's no
 ### Adding aesthetics and layers

 Scatterplots are useful for displaying the relationship between two variables, but it's always a good idea to be skeptical of any apparent relationship between two variables and ask if there may be other variables that explain or change the nature of this apparent relationship.
-Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between flipper length and body mass.
+For example, does the relationship between flipper length and body mass differ by species?
+Let's incorporate species into our plot and see if this reveals any additional insights into the apparent relationship between these variables.
 We will do this by representing species with different colored points.

 To achieve this, where should `species` go in the ggplot call from earlier?
@@ -483,8 +478,6 @@ penguins |>
  geom_point()
 ```

-This is the most common syntax you'll see in the wild.
-
 ## Visualizing distributions

 How you visualize the distribution of a variable depends on the type of variable: categorical or numerical.
@@ -525,20 +518,17 @@ You will learn more about factors and functions for dealing with factors (like `

 A variable is **numerical** if it can take any of an infinite set of ordered values.
 Numbers and date-times are two examples of continuous variables.
-To visualize the distribution of a continuous variable, you can use a histogram or a density plot.
+One commonly used visualization for distributions of continuous variables is a histogram.

 ```{r}
 #| warning: false
 #| layout-ncol: 2
 #| fig-alt: >
-#|   A histogram (on the left) and density plot (on the right) of body masses 
-#|   of penguins. The distribution is unimodal and right skewed, ranging 
-#|   between approximately 2500 to 6500 grams.
+#|   A histogram of body masses of penguins. The distribution is unimodal 
+#|   and right skewed, ranging between approximately 2500 to 6500 grams.

 ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
-ggplot(penguins, aes(x = body_mass_g)) +
-  geom_density()
 ```

 A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin.
@@ -572,6 +562,23 @@ ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)
 ```

+An alternative visualization for distributions of numerical variables is a density plot.
+A density plot is a smoothed-out version of a histogram and a practical alternative, particularly for continuous data that comes from an underlying smooth distribution.
+We won't go into how `geom_density()` estimates the density (you can read more about that in the function documentation), but let's explain how the density curve is drawn with an analogy.
+Imagine a histogram made out of wooden blocks.
+Then, imagine that you drop a cooked spaghetti string over it.
+The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve.
+It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.
+
+```{r}
+#| fig-alt: >
+#|   A density plot of body masses of penguins. The distribution is unimodal 
+#|   and right skewed, ranging between approximately 2500 to 6500 grams.
+
+ggplot(penguins, aes(x = body_mass_g)) +
+  geom_density()
+```
+
 ### Exercises

 1.  Make a bar plot of `species` of `penguins`, where you assign `species` to the `y` aesthetic.
@@ -604,10 +611,10 @@ In the following sections you will learn about commonly used plots for visualizi
 ### A numerical and a categorical variable

 To visualize the relationship between a numerical and a categorical variable we can use side-by-side box plots.
-A **boxplot** is a type of visual shorthand for a distribution of values that is popular among statisticians.
+A **boxplot** is a type of visual shorthand for the measures of position (percentiles) that describe a distribution and are commonly used in statistical analysis of data.
 As shown in @fig-eda-boxplot, each boxplot consists of:

--   A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR).
+-   A box that indicates the range of the middle half of the data, a distance known as the interquartile range (IQR), stretching from the 25th percentile of the distribution to the 75th percentile.
    In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution.
    These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side.

@@ -792,11 +799,7 @@ You will learn about many other geoms for visualizing distributions of variables

    ```{r}
    #| warning: false
-    #| fig-alt: >
-    #|   Scatterplot of bill depth vs. bill length where different color and 
-    #|   shape pairings represent each species. The plot has two legends, 
-    #|   one labelled "species" which shows the shape scale and the other
-    #|   that shows the color scale.
+    #| fig-show: hide

    ggplot(
      data = penguins,
@@ -809,6 +812,19 @@ You will learn about many other geoms for visualizing distributions of variables
      labs(color = "Species")
    ```

+7.  Create the two following segmented bar plots.
+    Which question can you answer with the first one?
+    Which question can you answer with the second one?
+
+    ```{r}
+    #| fig-show: hide
+
+    ggplot(penguins, aes(x = island, fill = species)) +
+      geom_bar(position = "fill")
+    ggplot(penguins, aes(x = species, fill = island)) +
+      geom_bar(position = "fill")
+    ```
+
 ## Saving your plots {#sec-ggsave}

 Once you've made a plot, you might want to get it out of R by saving it as an image that you can use elsewhere.