Make syntax for multi-line fig-cap and fig-alt consistent
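For context, `>` and `|` are the two YAML block scalar styles: `>` is folded (single line breaks inside the text become spaces) and `|` is literal (line breaks are kept). Below is a minimal sketch of the convention after this change, assuming ggplot2 is installed. The alt text wording and the `binwidth` value are illustrative and not taken from the diff.

```{r}
#| fig-alt: |
#|   A histogram of carats of diamonds. The distribution is right skewed,
#|   with most diamonds weighing less than 1.5 carats.

# Illustrative plot only: `diamonds` ships with ggplot2.
library(ggplot2)

ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
```

Multi-line `fig-cap` values follow the same pattern: `fig-cap: |` on the first line, then the indented text on the following `#|` lines.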
parent
cbb5b1b01f
commit
205c9922f4
40
EDA.qmd
|
@ -81,7 +81,7 @@ We'll start our exploration by visualizing the distribution of weights (`carat`)
|
|||
Since `carat` is a numerical variable, we can use a histogram:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5
|
||||
#| and the y-axis ranging from 0 to 30000. The distribution is right skewed
|
||||
#| with very few diamonds in the bin centered at 0, almost 30000 diamonds in
|
||||
|
@ -117,7 +117,7 @@ To turn this information into useful questions, look for anything unexpected:
|
|||
Let's take a look at the distribution of `carat` for smaller diamonds.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and
|
||||
#| the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow
|
||||
#| (0.01), resulting in a very large number of skinny bars. The distribution
|
||||
|
@ -161,7 +161,7 @@ For example, take the distribution of the `y` variable from the diamonds dataset
|
|||
The only evidence of outliers is the unusually wide limits on the x-axis.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and
|
||||
#| the y-axis ranges from 0 to 12000. There is a peak around 5, and the
|
||||
#| data appear to be completely clustered around the peak.
|
||||
|
@ -174,7 +174,7 @@ There are so many observations in the common bins that the rare bins are very sh
|
|||
To make it easy to see the unusual values, we need to zoom to small values of the y-axis with `coord_cartesian()`:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the
|
||||
#| y-axis ranges from 0 to 50. There is a peak around 5, and the data
|
||||
#| appear to be completely clustered around the peak. Other than those data,
|
||||
|
@ -270,7 +270,7 @@ It's not obvious where you should plot missing values, so ggplot2 doesn't includ
|
|||
|
||||
```{r}
|
||||
#| dev: "png"
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of widths vs. lengths of diamonds. There is a strong,
|
||||
#| linear association between the two variables. All but one of the diamonds
|
||||
#| has length greater than 3. The one outlier has a length of 0 and a width
|
||||
|
@ -297,7 +297,7 @@ You can do this by making a new variable, using `is.na()` to check if `dep_time`
|
|||
[^eda-1]: Remember that when we need to be explicit about where a function (or dataset) comes from, we'll use the special form `package::function()` or `package::dataset`.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A frequency polygon of scheduled departure times of flights. Two lines
|
||||
#| represent flights that are cancelled and not cancelled. The x-axis ranges
|
||||
#| from 0 to 25 minutes and the y-axis ranges from 0 to 10000. The number of
|
||||
|
@ -340,7 +340,7 @@ The best way to spot covariation is to visualize the relationship between two or
|
|||
For example, let's explore how the price of a diamond varies with its quality (measured by `cut`) using `geom_freqpoly()`:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A frequency polygon of prices of diamonds where each cut of carat (Fair,
|
||||
#| Good, Very Good, Premium, and Ideal) is represented with a different color
|
||||
#| line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to
|
||||
|
@ -361,7 +361,7 @@ To make the comparison easier we need to swap what is displayed on the y-axis.
|
|||
Instead of displaying count, we'll display the **density**, which is the count standardized so that the area under each frequency polygon is one.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A frequency polygon of densities of prices of diamonds where each cut of
|
||||
#| carat (Fair, Good, Very Good, Premium, and Ideal) is represented with a
|
||||
#| different color line. The x-axis ranges from 0 to 20000. The lines overlap
|
||||
|
@ -382,7 +382,7 @@ But maybe that's because frequency polygons are a little hard to interpret - the
|
|||
A visually simpler plot for exploring this relationship is using side-by-side boxplots.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side boxplots of prices of diamonds by cut. The distribution of
|
||||
#| prices is right skewed for each cut (Fair, Good, Very Good, Premium, and
|
||||
#| Ideal). The medians are close to each other, with the median for Ideal
|
||||
|
@ -404,7 +404,7 @@ For example, take the `class` variable in the `mpg` dataset.
|
|||
You might be interested to know how highway mileage varies across classes:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
|
||||
#| on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact,
|
||||
#| and suv).
|
||||
|
@ -416,7 +416,7 @@ ggplot(mpg, aes(x = class, y = hwy)) +
|
|||
To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
|
||||
#| on the x-axis and ordered by increasing median highway mileage (pickup,
|
||||
#| suv, minivan, 2seater, subcompact, compact, and midsize).
|
||||
|
@ -429,7 +429,7 @@ If you have long variable names, `geom_boxplot()` will work better if you flip i
|
|||
You can do that by exchanging the x and y aesthetic mappings.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side boxplots of highway mileages of cars by class. Classes are
|
||||
#| on the y-axis and ordered by increasing median highway mileage.
|
||||
|
||||
|
@ -468,7 +468,7 @@ To visualize the covariation between categorical variables, you'll need to count
|
|||
One way to do that is to rely on the built-in `geom_count()`:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of color vs. cut of diamonds. There is one point for each
|
||||
#| combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal)
|
||||
#| and color (D, E, F, G, H, I, and J). The sizes of the points represent
|
||||
|
@ -492,7 +492,7 @@ diamonds |>
|
|||
Then visualize with `geom_tile()` and the fill aesthetic:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A tile plot of cut vs. color of diamonds. Each tile represents a
|
||||
#| cut/color combination and tiles are colored according to the number of
|
||||
#| observations in each tile. There are more Ideal diamonds than other cuts,
|
||||
|
@ -528,7 +528,7 @@ The relationship is exponential.
|
|||
|
||||
```{r}
|
||||
#| dev: "png"
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of price vs. carat. The relationship is positive, somewhat
|
||||
#| strong, and exponential.
|
||||
|
||||
|
@ -543,7 +543,7 @@ You've already seen one way to fix the problem: using the `alpha` aesthetic to a
|
|||
|
||||
```{r}
|
||||
#| dev: "png"
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of price vs. carat. The relationship is positive, somewhat
|
||||
#| strong, and exponential. The points are transparent, showing clusters where
|
||||
#| the number of points is higher than other areas. The most obvious clusters
|
||||
|
@ -566,7 +566,7 @@ You will need to install the hexbin package to use `geom_hex()`.
|
|||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Plot 1: A binned density plot of price vs. carat. Plot 2: A hexagonal bin
|
||||
#| plot of price vs. carat. Both plots show that the highest density of
|
||||
#| diamonds have low carats and low prices.
|
||||
|
@ -584,7 +584,7 @@ Then you can use one of the techniques for visualizing the combination of a cate
|
|||
For example, you could bin `carat` and then for each group, display a boxplot:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side box plots of price by carat. Each box plot represents diamonds
|
||||
#| that are 0.1 carats apart in weight. The box plots show that as carat
|
||||
#| increases the median price increases as well. Additionally, diamonds with
|
||||
|
@ -668,7 +668,7 @@ Then, we exponentiate the residuals to put them back in the scale of raw prices.
|
|||
```{r}
|
||||
#| message: false
|
||||
#| dev: "png"
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of residuals vs. carat of diamonds. The x-axis ranges from 0
|
||||
#| to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered
|
||||
#| around low values of carat and residuals. There is a clear, curved pattern
|
||||
|
@ -695,7 +695,7 @@ ggplot(diamonds_aug, aes(x = carat, y = .resid)) +
|
|||
Once you've removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side box plots of residuals by cut. The x-axis displays the various
|
||||
#| cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are
|
||||
#| quite similar, between roughly 0.75 to 1.25. Each of the distributions of
|
||||
|
|
|
@ -346,11 +346,11 @@ If this pepper shaker is your list `pepper`, then, `pepper[1]` is a pepper shake
|
|||
#| label: fig-pepper
|
||||
#| echo: false
|
||||
#| out-width: "100%"
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| (Left) A pepper shaker that Hadley once found in his hotel room.
|
||||
#| (Middle) `pepper[1]`.
|
||||
#| (Right) `pepper[[1]]`
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three photos. On the left is a photo of a glass pepper shaker. Instead of
|
||||
#| the pepper shaker containing pepper, it contains a single packet of pepper.
|
||||
#| In the middle is a photo of a single packet of pepper. On the right is a
|
||||
|
|
|
@ -12,11 +12,11 @@ However, it doesn't matter how great your analysis is unless you can explain it
|
|||
```{r}
|
||||
#| label: fig-ds-communicate
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Communication is the final part of the data science process; if you
|
||||
#| can't communicate your results to other humans, it doesn't matter how
|
||||
#| great your analysis is.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram displaying the data science cycle with
|
||||
#| communicate highlighted in blue.
|
||||
#| out.width: NULL
|
||||
|
|
|
@ -48,7 +48,7 @@ You add labels with the `labs()` function.
|
|||
|
||||
```{r}
|
||||
#| message: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
|
||||
#| points are colored according to the car class. A smooth curve following
|
||||
#| the trajectory of the relationship between highway fuel efficiency versus
|
||||
|
@ -86,7 +86,7 @@ Just switch `""` out for `quote()` and read about the available options in `?plo
|
|||
#| fig-asp: 1
|
||||
#| out-width: "50%"
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot with math text on the x and y axis labels. X-axis label
|
||||
#| says x_i, y-axis label says sum of x_i squared, for i from 1 to n.
|
||||
|
||||
|
@ -112,7 +112,7 @@ ggplot(df, aes(x, y)) +
|
|||
|
||||
```{r}
|
||||
#| echo: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway versus city fuel efficiency. Shapes and
|
||||
#| colors of points are determined by type of drive train.
|
||||
|
||||
|
@ -162,7 +162,7 @@ They're larger than the rest of the text on the plot and bolded.
|
|||
(`theme(legend.position = "none")` turns all the legends off --- we'll talk about it more shortly.)
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway mileage versus engine size where points are colored
|
||||
#| by drive type. Smooth curves for each drive type are overlaid.
|
||||
#| Text labels identify the curves as front-wheel, rear-wheel, and 4-wheel.
|
||||
|
@ -185,7 +185,7 @@ We can use the `geom_label_repel()` function from the ggrepel package to address
|
|||
This useful package will automatically adjust labels so that they don't overlap:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
|
||||
#| points are colored according to the car class. Some points are labelled
|
||||
#| with the car's name. The labels are boxes with a white, transparent background
|
||||
|
@ -206,7 +206,7 @@ You can also use the same idea to highlight certain points on a plot with `geom_
|
|||
Note another handy technique used here: we added a second layer of large, hollow points to further highlight the labelled points.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars. Points
|
||||
#| where highway mileage is above 40 as well as above 20 with engine size
|
||||
#| above 5 are red, with a hollow red circle, and labelled with model name
|
||||
|
@ -256,7 +256,7 @@ The `x` and `y` aesthetics in both define where the annotation should start, and
|
|||
Note also that the segment is styled as an arrow.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars. A red
|
||||
#| arrow pointing down follows the trend of the points and the annotation
|
||||
#| placed next to the arrow reads "Larger engine sizes tend to have lower
|
||||
|
@ -352,7 +352,7 @@ Labels controls the text label associated with each tick/key.
|
|||
The most common use of `breaks` is to override the default choice:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars,
|
||||
#| colored by drive. The y-axis has breaks starting at 15 and ending at 40,
|
||||
#| increasing by 5.
|
||||
|
@ -368,7 +368,7 @@ You can also use `breaks` and `labels` to control the appearance of legends.
|
|||
For discrete scales for categorical variables, `labels` can be a named list of the existing level names and the desired labels for them.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, colored
|
||||
#| by drive. The x and y-axes do not have any labels at the axis ticks.
|
||||
#| The legend has custom labels: 4-wheel, front, rear.
|
||||
|
@ -388,7 +388,7 @@ Note that `breaks` is in the original scale of the data.
|
|||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 4
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two side-by-side box plots of price versus cut of diamonds. The outliers
|
||||
#| are transparent. On both plots the x-axis labels are formatted as dollars.
|
||||
#| The x-axis labels on the plot start at $0 and go to $15,000, increasing
|
||||
|
@ -412,7 +412,7 @@ ggplot(diamonds, aes(x = price, y = cut)) +
|
|||
Another handy label function is `label_percent()`:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Segmented bar plots of cut, filled with levels of clarity. The y-axis
|
||||
#| labels start at 0% and go to 100%, increasing by 25%. The y-axis label
|
||||
#| name is "Percentage".
|
||||
|
@ -426,7 +426,7 @@ Another use of `breaks` is when you have relatively few data points and want to
|
|||
For example, take this plot that shows when each US president started and ended their term.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Line plot of id number of presidents versus the year they started their
|
||||
#| presidency. Start year is marked with a point and a segment that starts
|
||||
#| there and ends at the end of the presidency. The x-axis labels are
|
||||
|
@ -459,7 +459,7 @@ The theme setting `legend.position` controls where the legend is drawn:
|
|||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 4
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Four scatterplots of highway fuel efficiency versus engine size of cars
|
||||
#| where points are colored based on class of car. Clockwise, the legend
|
||||
#| is placed on the right, left, top, and bottom of the plot.
|
||||
|
@ -485,7 +485,7 @@ The following example shows two important settings: controlling the number of ro
|
|||
This is particularly useful if you have used a low `alpha` to display many points on a plot.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars
|
||||
#| where points are colored based on class of car. Overlaid on the plot is a
|
||||
#| smooth curve. The legend is in the bottom and classes are listed
|
||||
|
@ -514,7 +514,7 @@ For example, it's easier to see the precise relationship between `carat` and `pr
|
|||
#| fig-align: default
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two plots of price versus carat of diamonds. Data binned and the color of
|
||||
#| the rectangles representing each bin based on the number of points that
|
||||
#| fall into that bin. In the plot on the right, price and carat values
|
||||
|
@ -534,7 +534,7 @@ Instead of doing the transformation in the aesthetic mapping, we can instead do
|
|||
This is visually identical, except the axes are labelled on the original data scale.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Plot of price versus carat of diamonds. Data binned and the color of
|
||||
#| the rectangles representing each bin based on the number of points that
|
||||
#| fall into that bin. The axis labels are on the original data scale.
|
||||
|
@ -556,7 +556,7 @@ The two plots below look similar, but there is enough difference in the shades o
|
|||
#| fig-align: default
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two scatterplots of highway mileage versus engine size where points are
|
||||
#| colored by drive type. The plot on the left uses the default
|
||||
#| ggplot2 color palette and the plot on the right uses a different color
|
||||
|
@ -575,7 +575,7 @@ If there are just a few colors, you can add a redundant shape mapping.
|
|||
This will also help ensure your plot is interpretable in black and white.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two scatterplots of highway mileage versus engine size where both color
|
||||
#| and shape of points are based on drive type. The color palette is not
|
||||
#| the default ggplot2 palette.
|
||||
|
@ -595,7 +595,7 @@ This often arises if you've used `cut()` to make a continuous variable into a ca
|
|||
#| echo: false
|
||||
#| fig-cap: All colorBrewer scales.
|
||||
#| fig-asp: 2.5
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| All colorBrewer scales. One group goes from light to dark colors.
|
||||
#| Another group is a set of non ordinal colors. And the last group has
|
||||
#| diverging scales (from dark to light to dark again). Within each set
|
||||
|
@ -610,7 +610,7 @@ For example, if we map presidential party to color, we want to use the standard
|
|||
One approach for assigning these colors is using hex color codes:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Line plot of id number of presidents versus the year they started their
|
||||
#| presidency. Start year is marked with a point and a segment that starts
|
||||
#| there and ends at the end of the presidency. Democratic presidents are
|
||||
|
@ -638,7 +638,7 @@ These scales are available as continuous (`c`), discrete (`d`), and binned (`b`)
|
|||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-asp: 0.75
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three hex plots where the color of the hexes show the number of observations
|
||||
#| that fall into that hex bin. The first plot uses the default, continuous
|
||||
#| ggplot2 scale. The second plot uses the viridis, continuous scale, and the
|
||||
|
@ -872,7 +872,7 @@ You can also create your own themes, if you are trying to match a particular cor
|
|||
#| label: fig-themes
|
||||
#| echo: false
|
||||
#| fig-cap: The eight themes built-in to ggplot2.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Eight barplots created with ggplot2, each
|
||||
#| with one of the eight built-in themes:
|
||||
#| theme_bw() - White background with grid lines,
|
||||
|
@ -898,7 +898,7 @@ In the following plot these are set to `"plot"` to indicate these elements are a
|
|||
A few other helpful `theme()` components are used to change the placement or format of the title and caption text.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, colored
|
||||
#| by drive. The plot is titled 'Larger engine sizes tend to have lower fuel
|
||||
#| economy' with the caption pointing to the source of the data, fueleconomy.gov.
|
||||
|
@ -944,7 +944,7 @@ Then, you place them next to each other with `+`.
|
|||
```{r}
|
||||
#| fig-width: 6
|
||||
#| fig-asp: 0.5
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two plots (a scatterplot of highway mileage versus engine size and a
|
||||
#| side-by-side boxplot of highway mileage versus drive train) placed next
|
||||
#| to each other.
|
||||
|
@ -967,7 +967,7 @@ In the following, `|` places the `p1` and `p3` next to each other and `/` moves
|
|||
```{r}
|
||||
#| fig-width: 6
|
||||
#| fig-asp: 0.8
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three plots laid out such that first and third plot are next to each other
|
||||
#| and the second plot stretched beneath them. The first plot is a
|
||||
#| scatterplot of highway mileage versus engine size, third plot is a
|
||||
|
@ -993,7 +993,7 @@ Patchwork divides up the area you have allotted for your plot using this scale a
|
|||
```{r}
|
||||
#| fig-width: 8
|
||||
#| fig-asp: 1
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Five plots laid out such that first two plots are next to each other. Plots
|
||||
#| three and four are underneath them. And the fifth plot stretches under them.
|
||||
#| The patchworked plot is titled "City and highway mileage for cars with
|
||||
|
@ -1065,7 +1065,7 @@ If you'd like to learn more about combining and laying out multiple plots with p
|
|||
#| fig-width: 7
|
||||
#| fig-asp: 0.8
|
||||
#| echo: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three plots: Plot 1 is a scatterplot of highway mileage versus engine size.
|
||||
#| Plot 2 is side-by-side box plots of highway mileage versus drive train.
|
||||
#| Plot 3 is side-by-side box plots of city mileage versus drive train.
|
||||
|
|
|
@ -66,10 +66,10 @@ There are three interrelated rules that make a dataset tidy:
|
|||
```{r}
|
||||
#| label: fig-tidy-structure
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| The following three rules make a dataset tidy: variables are columns,
|
||||
#| observations are rows, and values are cells.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three panels, each representing a tidy data frame. The first panel
|
||||
#| shows that each variable is a column. The second panel shows that each
|
||||
#| observation is a row. The third panel shows that each value is
|
||||
|
@ -93,7 +93,7 @@ Here are a few small examples showing how you might work with `table1`.
|
|||
|
||||
```{r}
|
||||
#| fig-width: 5
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| This figure shows the number of cases in 1999 and 2000 for
|
||||
#| Afghanistan, Brazil, and China, with year on the x-axis and number
|
||||
#| of cases on the y-axis. Each point on the plot represents the number
|
||||
|
@ -236,9 +236,9 @@ We can see that very few songs stay in the top 100 for more than 20 weeks.
|
|||
|
||||
```{r}
|
||||
#| label: fig-billboard-ranks
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A line plot showing how the rank of a song changes over time.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line plot with week on the x-axis and rank on the y-axis, where
|
||||
#| each line represents a song. Most songs appear to start at a high rank,
|
||||
#| rapidly accelerate to a low rank, and then decay again. There are
|
||||
|
@ -286,10 +286,10 @@ As shown in @fig-pivot-variables, the values in column that was already a variab
|
|||
```{r}
|
||||
#| label: fig-pivot-variables
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Columns that are already variables need to be repeated, once for
|
||||
#| each column that is pivoted.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram showing how `pivot_longer()` transforms a simple
|
||||
#| dataset, using color to highlight how the values in the `id` column
|
||||
#| ("A", "B", "C") are each repeated twice in the output because there are
|
||||
|
@ -304,10 +304,10 @@ They need to be repeated once for each row in the original dataset.
|
|||
```{r}
|
||||
#| label: fig-pivot-names
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| The column names of pivoted columns become values in a new column. The
|
||||
#| values need to be repeated once for each row of the original dataset.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram showing how `pivot_longer()` transforms a simple
|
||||
#| data set, using color to highlight how column names ("bp1" and
|
||||
#| "bp2") become the values in a new `measurement` column. They are repeated
|
||||
|
@ -323,10 +323,10 @@ They are unwound row by row.
|
|||
```{r}
|
||||
#| label: fig-pivot-values
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| The number of values is preserved (not repeated), but unwound
|
||||
#| row-by-row.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram showing how `pivot_longer()` transforms data,
|
||||
#| using color to highlight how the cell values (blood pressure measurements)
|
||||
#| become the values in a new `value` column. They are unwound row-by-row,
|
||||
|
@ -374,11 +374,11 @@ You can imagine this happening in two steps (first pivoting and then separating)
|
|||
```{r}
|
||||
#| label: fig-pivot-multiple-names
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Pivoting columns with multiple pieces of information in the names
|
||||
#| means that each column name now fills in values in multiple output
|
||||
#| columns.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram that uses color to illustrate how supplying `names_sep`
|
||||
#| and multiple `names_to` creates multiple variables in the output.
|
||||
#| The input has variable names "x_1" and "y_2" which are split up
|
||||
|
@ -421,12 +421,12 @@ When you use `".value"` in `names_to`, the column names in the input contribute
|
|||
```{r}
|
||||
#| label: fig-pivot-names-and-values
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Pivoting with `names_to = c(".value", "num")` splits the column names
|
||||
#| into two components: the first part determines the output column
|
||||
#| name (`x` or `y`), and the second part determines the value of the
|
||||
#| `num` column.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram that uses color to illustrate how the special ".value"
|
||||
#| sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2",
|
||||
#| and we want to use the first component ("x", "y") as a variable name
|
||||
|
|
|
@ -521,9 +521,9 @@ You'll need to make one change to your RStudio options to use `|>` instead of `%
|
|||
```{r}
|
||||
#| label: fig-pipe-options
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| To insert `|>`, make sure the "Use native pipe operator" option is checked.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Screenshot showing the "Use native pipe operator" option which can
|
||||
#| be found on the "Editing" panel of the "Code" options.
|
||||
|
||||
|
@ -852,7 +852,7 @@ When we plot the skill of the batter (measured by the batting average, `performa
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of batting performance vs. batting opportunities
|
||||
#| overlaid with a smoothed line. Average performance increases sharply
|
||||
#| from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance
|
||||
|
|
|
@ -130,7 +130,7 @@ Our ultimate goal in this chapter is to recreate the following visualization dis
|
|||
```{r}
|
||||
#| echo: false
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins, with a
|
||||
#| best fit line of the relationship between these two variables
|
||||
#| overlaid. The plot displays a positive, fairly linear, and relatively
|
||||
|
@ -163,7 +163,7 @@ The first argument of `ggplot()` is the dataset to use in the graph and so `ggpl
|
|||
This is not a very exciting plot, but you can think of it like an empty canvas you'll paint the remaining layers of your plot onto.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A blank, gray plot area.
|
||||
|
||||
ggplot(data = penguins)
|
||||
|
@ -178,7 +178,7 @@ ggplot2 looks for the mapped variables in the `data` argument, in this case, `pe
|
|||
The following plot shows the result of adding these mappings.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| The plot shows flipper length on the x-axis, with values that range from
|
||||
#| 170 to 230, and body mass on the y-axis, with values that range from 3000
|
||||
#| to 6000.
|
||||
|
@ -203,7 +203,7 @@ ggplot2 comes with many geom functions that each adds a different type of layer
|
|||
You'll learn a whole bunch of geoms throughout the book, particularly in @sec-layers.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. The plot
|
||||
#| displays a positive, linear, and relatively strong relationship between
|
||||
#| these two variables.
|
||||
|
@ -242,7 +242,7 @@ Throughout the book you will make many more ggplots and have many more opportuni
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. The plot
|
||||
#| displays a positive, fairly linear, and relatively strong relationship
|
||||
#| between these two variables. Species (Adelie, Chinstrap, and Gentoo)
|
||||
|
@ -266,7 +266,7 @@ And we will specify that we want to draw the line of best fit based on a `l`inea
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. Overlaid
|
||||
#| on the scatterplot are three smooth curves displaying the
|
||||
#| relationship between these variables for each species (Adelie,
|
||||
|
@ -289,7 +289,7 @@ Since we want points to be colored based on species but don't want the lines to
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. Overlaid
|
||||
#| on the scatterplot is a single line of best fit displaying the
|
||||
#| relationship between these variables for each species (Adelie,
|
||||
|
@ -313,7 +313,7 @@ Therefore, in addition to color, we can also map `species` to the `shape` aesthe
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. Overlaid
|
||||
#| on the scatterplot is a single line of best fit displaying the
|
||||
#| relationship between these variables for each species (Adelie,
|
||||
|
@ -337,7 +337,7 @@ In addition, we can improve the color palette to be colorblind safe with the `sc
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins, with a
|
||||
#| line of best fit displaying the relationship between these two variables
|
||||
#| overlaid. The plot displays a positive, fairly linear, and relatively
|
||||
|
@ -401,7 +401,7 @@ We finally have a plot that perfectly matches our "ultimate goal"!
|
|||
```{r}
|
||||
#| echo: false
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins, colored
|
||||
#| by bill depth. A smooth curve of the relationship between body mass
|
||||
#| and flipper length is overlaid. The relationship is positive,
|
||||
|
@ -503,7 +503,7 @@ To examine the distribution of a categorical variable, you can use a bar chart.
|
|||
The height of the bars displays how many observations occurred with each `x` value.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A bar chart of frequencies of species of penguins: Adelie
|
||||
#| (approximately 150), Chinstrap (approximately 90), Gentoo
|
||||
#| (approximately 125).
|
||||
|
@ -516,7 +516,7 @@ In bar plots of categorical variables with non-ordered levels, like the penguin
|
|||
Doing so requires transforming the variable to a factor (how R handles categorical data) and then reordering the levels of that factor.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A bar chart of frequencies of species of penguins, where the bars are
|
||||
#| ordered in decreasing order of their heights (frequencies): Adelie
|
||||
#| (approximately 150), Gentoo (approximately 125), Chinstrap
|
||||
|
@ -537,7 +537,7 @@ One commonly used visualization for distributions of continuous variables is a h
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A histogram of body masses of penguins. The distribution is unimodal
|
||||
#| and right skewed, ranging between approximately 2500 to 6500 grams.
|
||||
|
||||
|
@ -558,7 +558,7 @@ A binwidth of 200 provides a sensible balance.
|
|||
#| warning: false
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two histograms of body masses of penguins, one with binwidth of 20
|
||||
#| (left) and one with binwidth of 2000 (right). The histogram with binwidth
|
||||
#| of 20 shows lots of ups and downs in the heights of the bins, creating a
|
||||
|
@ -579,7 +579,7 @@ The shape the spaghetti will take draped over blocks can be thought of as the sh
|
|||
It shows fewer details than a histogram but can make it easier to quickly glean the shape of the distribution, particularly with respect to modes and skewness.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A density plot of body masses of penguins. The distribution is unimodal
|
||||
#| and right skewed, ranging between approximately 2500 to 6500 grams.
|
||||
|
||||
|
@ -635,9 +635,9 @@ As shown in @fig-eda-boxplot, each boxplot consists of:
|
|||
```{r}
|
||||
#| label: fig-eda-boxplot
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Diagram depicting how a boxplot is created.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram depicting how a boxplot is created following the steps outlined
|
||||
#| above.
|
||||
|
||||
|
@ -648,7 +648,7 @@ Let's take a look at the distribution of body mass by species using `geom_boxplo
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Side-by-side box plots of distributions of body masses of Adelie,
|
||||
#| Chinstrap, and Gentoo penguins. The distribution of Adelie and
|
||||
#| Chinstrap penguins' body masses appear to be symmetric with
|
||||
|
@ -664,7 +664,7 @@ Alternatively, we can make density plots with `geom_density()`.
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A density plot of body masses of penguins by species of penguins. Each
|
||||
#| species (Adelie, Chinstrap, and Gentoo) is represented with different
|
||||
#| colored outlines for the density curves.
|
||||
|
@ -681,7 +681,7 @@ In the following plot it's *set* to 0.5.
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A density plot of body masses of penguins by species of penguins. Each
|
||||
#| species (Adelie, Chinstrap, and Gentoo) is represented in different
|
||||
#| colored outlines for the density curves. The density curves are also
|
||||
|
@ -706,7 +706,7 @@ The plot of frequencies show that there are equal numbers of Adelies on each isl
|
|||
But we don't have a good sense of the percentage balance within each island.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Bar plots of penguin species by island (Biscoe, Dream, and Torgersen)
|
||||
ggplot(penguins, aes(x = island, fill = species)) +
|
||||
geom_bar()
|
||||
|
@ -716,7 +716,7 @@ The second plot is a relative frequency plot, created by setting `position = "fi
|
|||
Using this plot we can see that Gentoo penguins all live on Biscoe island and make up roughly 75% of the penguins on that island, Chinstrap all live on Dream island and make up roughly 50% of the penguins on that island, and Adelie live on all three islands and make up all of the penguins on Torgersen.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Bar plots of penguin species by island (Biscoe, Dream, and Torgersen)
|
||||
#| the bars are scaled to the same height, making it a relative frequencies
|
||||
#| plot
|
||||
|
@ -734,7 +734,7 @@ A scatterplot is probably the most commonly used plot for visualizing the relati
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. The plot
|
||||
#| displays a positive, linear, relatively strong relationship between
|
||||
#| these two variables.
|
||||
|
@ -750,7 +750,7 @@ For example, in the following scatterplot the colors of points represent species
|
|||
|
||||
```{r}
|
||||
#| warning: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. The plot
|
||||
#| displays a positive, linear, relatively strong relationship between
|
||||
#| these two variables. The points are colored based on the species of the
|
||||
|
@ -776,7 +776,7 @@ The variable that you pass to `facet_wrap()` should be categorical.
|
|||
#| warning: false
|
||||
#| fig-width: 8
|
||||
#| fig-asp: 0.33
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of body mass vs. flipper length of penguins. The shapes and
|
||||
#| colors of points represent species. Penguins from each island are on a
|
||||
#| separate facet. Within each facet, the relationship between body mass and
|
||||
|
|
|
@ -333,7 +333,7 @@ wday(datetime, label = TRUE, abbr = FALSE)
|
|||
We can use `wday()` to see that more flights depart during the week than on the weekend:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A bar chart with days of the week on the x-axis and number of
|
||||
#| flights on the y-axis. Monday-Friday have roughly the same number of
|
||||
#| flights, ~48,0000, decreasingly slightly over the course of the week.
|
||||
|
@ -349,7 +349,7 @@ We can also look at the average departure delay by minute within the hour.
|
|||
There's an interesting pattern: flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line chart with minute of actual departure (0-60) on the x-axis and
|
||||
#| average delay (4-20) on the y-axis. Average delay starts at (0, 12),
|
||||
#| steadily increases to (18, 20), then sharply drops, hitting at minimum
|
||||
|
@ -370,7 +370,7 @@ flights_dt |>
|
|||
Interestingly, if we look at the *scheduled* departure time we don't see such a strong pattern:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line chart with minute of scheduled departure (0-60) on the x-axis
|
||||
#| and average delay (4-16). There is relatively little pattern, just a
|
||||
#| small suggestion that the average delay decreases from maybe 10 minutes
|
||||
|
@ -393,11 +393,11 @@ Always be alert for this sort of pattern whenever you work with data that involv
|
|||
|
||||
```{r}
|
||||
#| label: fig-human-rounding
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A frequency polygon showing the number of flights scheduled to
|
||||
#| depart each hour. You can see a strong preference for round numbers
|
||||
#| like 0 and 30 and generally for numbers that are a multiple of five.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line plot with departure minute (0-60) on the x-axis and number of
|
||||
#| flights (0-60000) on the y-axis. Most flights are scheduled to depart
|
||||
#| on either the hour (~60,000) or the half hour (~35,000). Otherwise,
|
||||
|
@ -415,7 +415,7 @@ Each function takes a vector of dates to adjust and then the name of the unit to
|
|||
This, for example, allows us to plot the number of flights per week:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line plot with week (Jan-Dec 2013) on the x-axis and number of
|
||||
#| flights (2,000-7,000) on the y-axis. The pattern is fairly flat from
|
||||
#| February to November with around 7,000 flights per week. There are
|
||||
|
@ -431,7 +431,7 @@ flights_dt |>
|
|||
You can use rounding to show the distribution of flights across the course of a day by computing the difference between `dep_time` and the earliest instant of that day:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line plot with depature time on the x-axis. This is units of seconds
|
||||
#| since midnight so it's hard to interpret.
|
||||
flights_dt |>
|
||||
|
@ -444,7 +444,7 @@ Computing the difference between a pair of date-times yields a difftime (more on
|
|||
We can convert that to an `hms` object to get a more useful x-axis:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line plot with depature time (midnight to midnight) on the x-axis
|
||||
#| and number of flights on the y-axis (0 to 15,000). There are very few
|
||||
#| (<100) flights before 5am. The number of flights then rises rapidly
|
||||
|
|
12
factors.qmd
|
@ -159,7 +159,7 @@ It's often useful to change the order of the factor levels in a visualization.
|
|||
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot of with tvhours on the x-axis and religion on the y-axis.
|
||||
#| The y-axis is ordered seemingly aribtrarily making it hard to get
|
||||
#| any sense of overall pattern.
|
||||
|
@ -183,7 +183,7 @@ We can improve it by reordering the levels of `relig` using `fct_reorder()`.
|
|||
- Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| The same scatterplot as above, but now the religion is displayed in
|
||||
#| increasing order of tvhours. "Other eastern" has the fewest tvhours
|
||||
#| under 2, and "Don't know" has the highest (over 5).
|
||||
|
@ -210,7 +210,7 @@ relig_summary |>
|
|||
What if we create a similar plot looking at how average age varies across reported income level?
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A scatterplot with age on the x-axis and income on the y-axis. Income
|
||||
#| has been reordered in order of average age which doesn't make much
|
||||
#| sense. One section of the y-axis goes from $6000-6999, then <$1000,
|
||||
|
@ -235,7 +235,7 @@ You can use `fct_relevel()`.
|
|||
It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| The same scatterplot but now "Not Applicable" is displayed at the
|
||||
#| bottom of the y-axis. Generally there is a positive association
|
||||
#| between income and age, and the income band with the highethst average
|
||||
|
@ -254,7 +254,7 @@ This makes the plot easier to read because the colors of the line at the far rig
|
|||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A line plot with age on the x-axis and proportion on the y-axis.
|
||||
#| There is one line for each category of marital status: no answer,
|
||||
#| never married, separated, divorced, widowed, and married. It is
|
||||
|
@ -289,7 +289,7 @@ Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing
|
|||
Combine it with `fct_rev()` if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A bar char of marital status ordered in from least to most common:
|
||||
#| no answer (~0), separated (~1,000), widowed (~2,000), divorced
|
||||
#| (~3,000), never married (~5,000), married (~10,000).
|
||||
|
|
|
@ -13,10 +13,10 @@ But in more complex cases it might require both tidying and transformation in or
|
|||
```{r}
|
||||
#| label: fig-ds-import
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Data import is the beginning of the data science process; without
|
||||
#| data you can't do data science!
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Our data science model with import highlighted in blue.
|
||||
#| out.width: NULL
|
||||
|
||||
|
|
|
@ -19,12 +19,12 @@ Our model of the steps of a typical data science project looks something like @f
|
|||
```{r}
|
||||
#| label: fig-ds-diagram
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| In our model of the data science process, you start with data import
|
||||
#| and tidying. Next, you understand your data with an iterative cycle of
|
||||
#| transforming, visualizing, and modeling. You finish the process
|
||||
#| by communicating your results to other humans.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A diagram displaying the data science cycle: Import -> Tidy -> Understand
|
||||
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
|
||||
#| Communicate. Surrounding all of these is Communicate.
|
||||
|
@ -151,10 +151,10 @@ You'll learn more as we go along![^intro-1]
|
|||
#| label: fig-rstudio-console
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| The RStudio IDE has two key regions: type R code in the console pane
|
||||
#| on the left, and look for plots in the output pane on the right.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| The RStudio IDE with the panes Console and Output highlighted.
|
||||
knitr::include_graphics("diagrams/rstudio/console.png", dpi = 270)
|
||||
```
|
||||
|
|
60
joins.qmd
|
@ -94,11 +94,11 @@ These relationships are summarized visually in @fig-flights-relationships.
|
|||
#| label: fig-flights-relationships
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Connections between all five data frames in the nycflights13 package.
|
||||
#| Variables making up a primary key are colored grey, and are connected
|
||||
#| to their corresponding foreign keys with arrows.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| The relationships between airports, planes, flights, weather, and
|
||||
#| airlines datasets from the nycflights13 package. airports$faa
|
||||
#| connected to the flights$origin and flights$dest. planes$tailnum
|
||||
|
@ -434,11 +434,11 @@ y <- tribble(
|
|||
#| label: fig-join-setup
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Graphical representation of two simple tables. The colored `key`
|
||||
#| columns map background color to key value. The grey columns represent
|
||||
#| the "value" columns that are carried along for the ride.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| x and y are two data frames with 2 columns and 3 rows, with contents
|
||||
#| as described in the text. The values of the keys are colored:
|
||||
#| 1 is green, 2 is purple, 3 is orange, and 4 is yellow.
|
||||
|
@ -454,10 +454,10 @@ The rows and columns in the output are primarily determined by `x`, so the `x` t
|
|||
#| label: fig-join-setup2
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| To understand how joins work, it's useful to think of every possible
|
||||
#| match. Here we show that with a grid of connecting lines.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| x and y are placed at right-angles, with horizonal lines extending
|
||||
#| from x and vertical lines extending from y. There are 3 rows in x and
|
||||
#| 3 rows in y, which leads to nine intersections representing nine
|
||||
|
@ -474,10 +474,10 @@ For example, @fig-join-inner shows an inner join, where rows are retained if and
|
|||
#| label: fig-join-inner
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| An inner join matches each row in `x` to the row in `y` that has the
|
||||
#| same value of `key`. Each match becomes a row in the output.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| x and y are placed at right-angles with lines forming a grid of
|
||||
#| potential matches. Keys 1 and 2 appear in both x and y, so we
|
||||
#| get a match, indicated by a dot. Each dot corresponds to a row
|
||||
|
@ -498,10 +498,10 @@ There are three types of outer joins:
|
|||
#| label: fig-join-left
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A visual representation of the left join where every row in `x`
|
||||
#| appears in the output.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Compared to the previous diagram showing an inner join, the y table
|
||||
#| gets a new virtual row containin NA that will match any row in x
|
||||
#| that didn't otherwise match. This means that the output now has
|
||||
|
@ -519,10 +519,10 @@ There are three types of outer joins:
|
|||
#| label: fig-join-right
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A visual representation of the right join where every row of `y`
|
||||
#| appears in the output.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Compared to the previous diagram showing an left join, the x table
|
||||
#| now gains a virtual row so that every row in y gets a match in x.
|
||||
#| val_x contains NA for the row in y that didn't match x.
|
||||
|
@ -538,10 +538,10 @@ There are three types of outer joins:
|
|||
#| label: fig-join-full
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A visual representation of the full join where every row in `x`
|
||||
#| and `y` appears in the output.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Now both x and y have a virtual row that always matches.
|
||||
#| The result has 4 rows: keys 1, 2, 3, and 4 with all values
|
||||
#| from val_x and val_y, however key 2, val_y and key 4, val_x are NAs
|
||||
|
@ -557,10 +557,10 @@ However, this is not a great representation because while it might jog your memo
|
|||
#| label: fig-join-venn
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| Venn diagrams showing the difference between inner, left, right, and
|
||||
#| full joins.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Venn diagrams for inner, full, left, and right joins. Each join
|
||||
#| represented with two intersecting circles representing data frames x
|
||||
#| and y, with x on the right and y on the left. Shading indicates the
|
||||
|
@ -588,13 +588,13 @@ To understand what's going let's first narrow our focus to the `inner_join()` an
|
|||
#| label: fig-join-match-types
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| The three ways a row in `x` can match. `x1` matches
|
||||
#| one row in `y`, `x2` matches two rows in `y`, `x3` matches
|
||||
#| zero rows in y. Note that while there are three rows in
|
||||
#| `x` and three rows in the output, there isn't a direct
|
||||
#| correspondence between the rows.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A join diagram where x has key values 1, 2, and 3, and y has
|
||||
#| key values 1, 2, 2. The output has three rows because key 1 matches
|
||||
#| one row, key 2 matches two rows, and key 3 matches zero rows.
|
||||
|
@ -639,10 +639,10 @@ This means that filtering joins never duplicate rows like mutating joins do.
|
|||
#| label: fig-join-semi
|
||||
#| echo: false
|
||||
#| out-width: null
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| In a semi-join it only matters that there is a match; otherwise
|
||||
#| values in `y` don't affect the output.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A join diagram with old friends x and y. In a semi join, only the
|
||||
#| presence of a match matters so the output contains the same columns
|
||||
#| as x.
|
||||
|
@ -654,10 +654,10 @@ knitr::include_graphics("diagrams/join/semi.png", dpi = 270)
|
|||
#| label: fig-join-anti
|
||||
#| echo: false
|
||||
#| out-width: null
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| An anti-join is the inverse of a semi-join, dropping rows from `x`
|
||||
#| that have a match in `y`.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| An anti-join is the inverse of a semi-join so matches are drawn with
|
||||
#| red lines indicating that they will be dropped from the output.
|
||||
|
||||
|
@ -679,9 +679,9 @@ x |> left_join(y, by = "key", keep = TRUE)
|
|||
|
||||
```{r}
|
||||
#| label: fig-inner-both
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| An inner join showing both `x` and `y` keys in the output.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A join diagram showing an inner join betwen x and y. The result
|
||||
#| now includes four columns: key.x, val_x, key.y, and val_y. The
|
||||
#| values of key.x and key.y are identical, which is why we usually
|
||||
|
@ -699,10 +699,10 @@ dplyr's join functions understand this distinction equi and non-equi joins so wi
|
|||
```{r}
|
||||
#| label: fig-join-gte
|
||||
#| echo: false
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A non-equi join where the `x` key must be greater than or equal to
|
||||
#| the `y` key. Many rows generate multiple matches.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A join diagram illustrating join_by(key >= key). The first row
|
||||
#| of x matches one row of y and the second and thirds rows each match
|
||||
#| two rows. This means the output has five rows containing each of the
|
||||
|
@ -729,9 +729,9 @@ This means the output will have `nrow(x) * nrow(y)` rows.
|
|||
#| label: fig-join-cross
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A cross join matches each row in `x` with every row in `y`.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A join diagram showing a dot for every combination of x and y.
|
||||
knitr::include_graphics("diagrams/join/cross.png", dpi = 270)
|
||||
```
|
||||
|
@ -785,10 +785,10 @@ For example `join_by(closest(x <= y))` matches the smallest `y` that's greater t
|
|||
#| label: fig-join-closest
|
||||
#| echo: false
|
||||
#| out-width: ~
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| A rolling join is similar to a greater-than-or-equal inequality join
|
||||
#| but only matches the first value.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A rolling join is a subset of an inequality join so some matches are
|
||||
#| grayed out indicating that they're not used because they're not the
|
||||
#| "closest".
|
||||
|
|
66
layers.qmd
|
@ -62,7 +62,7 @@ We can do this with a scatterplot where the numerical variables are mapped to th
|
|||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 4
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two scatterplots next to each other, both visualizing highway fuel
|
||||
#| efficiency versus engine size of cars and showing a negative
|
||||
#| association. In the plot on the left class is mapped to the color
|
||||
|
@ -96,7 +96,7 @@ Similarly, we can map `class` to `size` or `alpha` aesthetics as well, which con
|
|||
```{r}
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 4
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two scatterplots next to each other, both visualizing highway fuel
|
||||
#| efficiency versus engine size of cars and showing a negative
|
||||
#| association. In the plot on the left class is mapped to the size
|
||||
|
@ -130,7 +130,7 @@ You can also set the visual properties of your geom manually as an argument of y
|
|||
For example, we can make all of the points in our plot blue:
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars
|
||||
#| that shows a negative association. All points are blue.
|
||||
|
||||
|
@ -151,7 +151,7 @@ You'll need to pick a value that makes sense for that aesthetic:
|
|||
#| warning: false
|
||||
#| fig.asp: 0.364
|
||||
#| fig-align: "center"
|
||||
#| fig-cap: >
|
||||
#| fig-cap: |
|
||||
#| R has 25 built-in shapes that are identified by numbers. There are some
|
||||
#| seeming duplicates: for example, 0, 15, and 22 are all squares. The
|
||||
#| difference comes from the interaction of the `color` and `fill`
|
||||
|
@ -159,7 +159,7 @@ You'll need to pick a value that makes sense for that aesthetic:
|
|||
#| the solid shapes (15--20) are filled with `color`; the filled shapes
|
||||
#| (21--24) have a border of `color` and are filled with `fill`. Shapes are
|
||||
#| arranged to keep similar shapes next to each other.
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Mapping between shapes and the numbers that represent them: 0 - square,
|
||||
#| 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond,
|
||||
#| 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus,
|
||||
|
@ -200,7 +200,7 @@ In the next section we dive deeper into geoms.
|
|||
|
||||
```{r}
|
||||
#| fig-show: hide
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars
|
||||
#| that shows a negative association. All points are red and
|
||||
#| the legend shows a red point that is mapped to the word blue.
|
||||
|
@ -225,7 +225,7 @@ How are these two plots similar?
|
|||
#| message: false
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| There are two plots. The plot on the left is a scatterplot of highway
|
||||
#| fuel efficiency versus engine size of cars and the plot on the right
|
||||
#| shows a smooth curve that follows the trajectory of the relationship
|
||||
|
@ -270,7 +270,7 @@ On the other hand, you *could* set the linetype of a line.
|
|||
#| message: false
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Two plots of highway fuel efficiency versus engine size of cars.
|
||||
#| The data are represented with smooth curves. On the left, three
|
||||
#| smooth curves, all with the same linetype. On the right, three
|
||||
|
@ -295,7 +295,7 @@ If this sounds strange, we can make it clearer by overlaying the lines on top of
|
|||
|
||||
```{r}
|
||||
#| message: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| A plot of highway fuel efficiency versus engine size of cars. The data
|
||||
#| are represented with points (colored by drive train) as well as smooth
|
||||
#| curves (where line type is determined based on drive train as well).
|
||||
|
@ -319,7 +319,7 @@ It is convenient to rely on this feature because the `group` aesthetic by itself
|
|||
#| fig-width: 3
|
||||
#| fig-asp: 1
|
||||
#| message: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three plots, each with highway fuel efficiency on the y-axis and engine
|
||||
#| size of cars, where data are represented by a smooth curve. The first plot
|
||||
#| only has these two variables, the center plot has three separate smooth
|
||||
|
@ -348,7 +348,7 @@ This makes it possible to display different aesthetics in different layers.
|
|||
|
||||
```{r}
|
||||
#| message: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
|
||||
#| points are colored according to the car class. A smooth curve following
|
||||
#| the trajectory of the relationship between highway fuel efficiency versus
|
||||
|
@ -365,7 +365,7 @@ The local data argument in `geom_point()` overrides the global data argument in
|
|||
|
||||
```{r}
|
||||
#| message: false
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, where
|
||||
#| points are colored according to the car class. A smooth curve following
|
||||
#| the trajectory of the relationship between highway fuel efficiency versus
|
||||
|
@ -391,7 +391,7 @@ For example, the histogram and density plot below reveal that the distribution o
|
|||
```{r}
|
||||
#| layout-ncol: 3
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Three plots: histogram, density plot, and box plot of highway
|
||||
#| mileage.
|
||||
|
||||
|
@ -462,7 +462,7 @@ To learn more about any single geom, use the help (e.g., `?geom_smooth`).
|
|||
#| message: false
|
||||
#| layout-ncol: 2
|
||||
#| fig-width: 3
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| There are six scatterplots in this figure, arranged in a 3x2 grid.
|
||||
#| In all plots highway fuel efficiency of cars are on the y-axis and
|
||||
#| engine size is on the x-axis. The first plot shows all points in black
|
||||
|
@ -503,7 +503,7 @@ To learn more about any single geom, use the help (e.g., `?geom_smooth`).
|
|||
In @sec-data-visualization you learned about faceting with `facet_wrap()`, which splits a plot into subplots that each display one subset of the data based on a categorical variable.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars,
|
||||
#| faceted by class, with facets spanning two rows.
|
||||
|
||||
|
@ -516,7 +516,7 @@ To facet your plot with the combination of two variables, switch from `facet_wra
|
|||
The first argument of `facet_grid()` is also a formula, but now it's a double sided formula: `rows ~ cols`.
|
||||
|
||||
```{r}
|
||||
#| fig-alt: >
|
||||
#| fig-alt: |
|
||||
#| Scatterplot of highway fuel efficiency versus engine size of cars, faceted
|
||||
#| by number of cylinders across rows and by type of drive train across
|
||||
#| columns. This results in a 4x3 grid of 12 facets. Some of these facets have
|
||||
|
@@ -533,7 +533,7 @@ This is useful when you want to compare data across facets but it can be limitin
 Setting the `scales` argument in a faceting function to `"free"` will allow for different axis scales across both rows and columns, `"free_x"` will allow for different scales across rows, and `"free_y"` will allow for different scales across columns.

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| Scatterplot of highway fuel efficiency versus engine size of cars,
 #| faceted by number of cylinders across rows and by type of drive train
 #| across columns. This results in a 4x3 grid of 12 facets. Some of these
@@ -631,7 +631,7 @@ The `diamonds` dataset is in the ggplot2 package and contains information on \~5
 The chart shows that more diamonds are available with high quality cuts than with low quality cuts.

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| Bar chart of number of each cut of diamond. There are roughly 1500
 #| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut
 #| diamonds.
@@ -659,11 +659,11 @@ The algorithm used to calculate new values for a graph is called a **stat**, sho
 #| label: fig-vis-stat-bar
 #| echo: false
 #| out-width: "100%"
-#| fig-cap: >
+#| fig-cap: |
 #| When creating a bar chart we first start with the raw data, then
 #| aggregate it to count the number of observations in each bar,
 #| and finally map those computed variables to plot aesthetics.
-#| fig-alt: >
+#| fig-alt: |
 #| A figure demonstrating three steps of creating a bar chart.
 #| Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar()
 #| transforms the data with the count stat, which returns a data set of
@@ -688,7 +688,7 @@ However, there are three reasons why you might need to use a stat explicitly:

 ```{r}
 #| warning: false
-#| fig-alt: >
+#| fig-alt: |
 #| Bar chart of number of each cut of diamond. There are roughly 1500
 #| Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut
 #| diamonds.
@@ -703,7 +703,7 @@ However, there are three reasons why you might need to use a stat explicitly:
 For example, you might want to display a bar chart of proportions, rather than counts:

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| Bar chart of proportion of each cut of diamond. Roughly, Fair
 #| diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and
 #| Ideal 0.40.
@@ -718,7 +718,7 @@ However, there are three reasons why you might need to use a stat explicitly:
 For example, you might use `stat_summary()`, which summarizes the y values for each unique x value, to draw attention to the summary that you're computing:

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| A plot with depth on the y-axis and cut on the x-axis (with levels
 #| fair, good, very good, premium, and ideal) of diamonds. For each level
 #| of cut, vertical lines extend from minimum to maximum depth for diamonds
@@ -774,7 +774,7 @@ You can color a bar chart using either the `color` aesthetic, or, more usefully,
 ```{r}
 #| layout-ncol: 2
 #| fig-width: 4
-#| fig-alt: >
+#| fig-alt: |
 #| Two bar charts of drive types of cars. In the first plot, the bars have
 #| colored borders. In the second plot, they're filled with colors. Heights
 #| of the bars correspond to the number of cars in each cut category.
@@ -792,7 +792,7 @@ Note what happens if you map the fill aesthetic to another variable, like `class
 Each colored rectangle represents a combination of `drv` and `class`.

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| Segmented bar chart of drive types of cars, where each bar is filled with
 #| colors for the classes of cars. Heights of the bars correspond to the
 #| number of cars in each drive category, and heights of the colored
@@ -813,7 +813,7 @@ If you don't want a stacked bar chart, you can use one of three other options: `
 ```{r}
 #| layout-ncol: 2
 #| fig-width: 4
-#| fig-alt: >
+#| fig-alt: |
 #| Segmented bar chart of drive types of cars, where each bar is filled with
 #| colors for the classes of cars. Heights of the bars correspond to the
 #| number of cars in each drive category, and heights of the colored
@@ -842,7 +842,7 @@ If you don't want a stacked bar chart, you can use one of three other options: `
 ```{r}
 #| layout-ncol: 2
 #| fig-width: 4
-#| fig-alt: >
+#| fig-alt: |
 #| On the left, segmented bar chart of drive types of cars, where each bar is
 #| filled with colors for the levels of class. Height of each bar is 1 and
 #| heights of the colored segments represent the proportions of cars
@@ -869,7 +869,7 @@ Did you notice that the plot displays only 126 points, even though there are 234

 ```{r}
 #| echo: false
-#| fig-alt: >
+#| fig-alt: |
 #| Scatterplot of highway fuel efficiency versus engine size of cars that
 #| shows a negative association.

@@ -887,7 +887,7 @@ You can avoid this gridding by setting the position adjustment to "jitter".
 This spreads the points out because no two points are likely to receive the same amount of random noise.

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| Jittered scatterplot of highway fuel efficiency versus engine size of cars.
 #| The plot shows a negative association.

@@ -945,7 +945,7 @@ There are two other coordinate systems that are occasionally helpful.
 #| layout-ncol: 2
 #| fig-width: 3
 #| message: false
-#| fig-alt: >
+#| fig-alt: |
 #| Two maps of the boundaries of New Zealand. In the first plot the aspect
 #| ratio is incorrect, in the second plot it is correct.

@@ -966,7 +966,7 @@ There are two other coordinate systems that are occasionally helpful.
 #| layout-ncol: 2
 #| fig-width: 3
 #| fig-asp: 1
-#| fig-alt: >
+#| fig-alt: |
 #| There are two plots. On the left is a bar chart of clarity of diamonds,
 #| on the right is a Coxcomb chart of the same data.

@@ -1030,12 +1030,12 @@ You'd then select a coordinate system to place the geoms into, using the locatio
 ```{r}
 #| label: fig-visualization-grammar
 #| echo: false
-#| fig-alt: >
+#| fig-alt: |
 #| A figure demonstrating the steps for going from raw data to table of
 #| frequencies where each row represents one level of cut and a count column
 #| shows how many diamonds are in that cut level. Then, these values are
 #| mapped to heights of bars.
-#| fig-cap: >
+#| fig-cap: |
 #| Steps for going from raw data to a table of frequencies to a bar plot where
 #| the heights of the bar represent the frequencies.

@@ -210,11 +210,11 @@ For example, `df |> filter(!is.na(x))` finds all rows where `x` is not missing a
 #| label: fig-bool-ops
 #| echo: false
 #| out-width: NULL
-#| fig-cap: >
+#| fig-cap: |
 #| The complete set of Boolean operations. `x` is the left-hand
 #| circle, `y` is the right-hand circle, and the shaded region show
 #| which parts each operator selects.
-#| fig-alt: >
+#| fig-alt: |
 #| Six Venn diagrams, each explaining a given logical operator. The
 #| circles (sets) in each of the Venn diagrams represent x and y. 1. y &
 #| !x is y but none of x; x & y is the intersection of x and y; x & !y is
@@ -229,7 +229,7 @@ You can force them to display by supplying `drop = FALSE` to the appropriate dis
 ```{r}
 #| layout-ncol: 2
 #| fig-width: 3
-#| fig-alt: >
+#| fig-alt: |
 #| A bar chart with a single value on the x-axis, "no".
 #|
 #| The same bar chart as the last plot, but now with two values on
14
numbers.qmd
@@ -241,12 +241,12 @@ The results are shown in @fig-prop-cancelled.

 ```{r}
 #| label: fig-prop-cancelled
-#| fig-cap: >
+#| fig-cap: |
 #| A line plot with scheduled departure hour on the x-axis, and proportion
 #| of cancelled flights on the y-axis. Cancellations seem to accumulate
 #| over the course of the day until 8pm, very late flights are much
 #| less likely to be cancelled.
-#| fig-alt: >
+#| fig-alt: |
 #| A line plot showing how proportion of cancelled flights changes over
 #| the course of the day. The proportion starts low at around 0.5% at
 #| 6am, then steadily increases over the course of the day until peaking
@@ -584,10 +584,10 @@ The median delay is always smaller than the mean delay because flights sometimes

 ```{r}
 #| label: fig-mean-vs-median
-#| fig-cap: >
+#| fig-cap: |
 #| A scatterplot showing the differences of summarizing daily depature
 #| delay with median instead of mean.
-#| fig-alt: >
+#| fig-alt: |
 #| All points fall below a 45° line, meaning that the median delay is
 #| always less than the mean delay. Most points are clustered in a
 #| dense region of mean [0, 20] and median [0, 5]. As the mean delay
@@ -665,12 +665,12 @@ This suggests that the mean is unlikely to be a good summary and we might prefer
 ```{r}
 #| echo: false
 #| label: fig-flights-dist
-#| fig-cap: >
+#| fig-cap: |
 #| (Left) The histogram of the full data is extremely skewed making it
 #| hard to get any details. (Right) Zooming into delays of less than two
 #| hours makes it possible to see what's happening with the bulk of the
 #| observations.
-#| fig-alt: >
+#| fig-alt: |
 #| Two histograms of `dep_delay`. On the left, it's very hard to see
 #| any pattern except that there's a very large spike around zero, the
 #| bars rapidly decay in height, and for most of the plot, you can't
@@ -700,7 +700,7 @@ In the following plot 365 frequency polygons of `dep_delay`, one for each day, a
 The distributions seem to follow a common pattern, suggesting it's fine to use the same summary for each day.

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| The distribution of `dep_delay` is highly right skewed with a strong
 #| peak slightly less than 0. The 365 frequency polygons are mostly
 #| overlapping forming a thick black bland.
@@ -13,9 +13,9 @@ Programming is a cross-cutting skill needed for all data science work: you must
 #| label: fig-ds-program
 #| echo: false
 #| out.width: ~
-#| fig-cap: >
+#| fig-cap: |
 #| Programming is the water in which all the other components swim.
-#| fig-alt: >
+#| fig-alt: |
 #| Our model of the data science process with program (import, tidy,
 #| transform, visualize, model, and communicate, i.e. everything)
 #| highlighted in blue.
@@ -127,10 +127,10 @@ The advantage of this two step workflow is that you can create a very wide range
 #| label: fig-quarto-flow
 #| echo: false
 #| out-width: "75%"
-#| fig-alt: >
+#| fig-alt: |
 #| Workflow diagram starting with a qmd file, then knitr, then md,
 #| then pandoc, then PDF, MS Word, or HTML.
-#| fig-cap: >
+#| fig-cap: |
 #| Diagram of Quarto workflow from qmd, to knitr, to md, to pandoc,
 #| to output in PDF, MS Word, or HTML formats.

@@ -136,7 +136,7 @@ It looks like they've radically increased in popularity lately!
 [^regexps-4]: This gives us the proportion of **names** that contain an "x"; if you wanted the proportion of babies with a name containing an x, you'd need to perform a weighted mean.

 ```{r}
-#| fig-alt: >
+#| fig-alt: |
 #| A time series showing the proportion of baby names that contain the letter x.
 #| The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in
 #| 1980, then increases rapidly to 16 per 1000 in 2019.
@@ -54,9 +54,9 @@ For the rest of the chapter we will focus on using `read_excel()`.
 #| label: fig-students-excel
 #| echo: false
 #| fig-width: 5
-#| fig-cap: >
+#| fig-cap: |
 #| Spreadsheet called students.xlsx in Excel.
-#| fig-alt: >
+#| fig-alt: |
 #| A look at the students spreadsheet in Excel. The spreadsheet contains
 #| information on 6 students, their ID, full name, favourite food, meal plan,
 #| and age.
@@ -189,9 +189,9 @@ Each worksheet contains information on penguins from a different island where da
 ```{r}
 #| label: fig-penguins-islands
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| Spreadsheet called penguins.xlsx in Excel containing three worksheets.
-#| fig-alt: >
+#| fig-alt: |
 #| A look at the penguins spreadsheet in Excel. The spreadsheet contains has
 #| three worksheets: Torgersen Island, Biscoe Island, and Dream Island.

@@ -252,9 +252,9 @@ Since many use Excel spreadsheets for presentation as well as for data storage,
 ```{r}
 #| label: fig-deaths-excel
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| Spreadsheet called deaths.xlsx in Excel.
-#| fig-alt: >
+#| fig-alt: |
 #| A look at the deaths spreadsheet in Excel. The spreadsheet has four rows
 #| on top that contain non-data information; the text 'For the same of
 #| consistency in the data layout, which is really a beautiful thing, I will
@@ -360,9 +360,9 @@ These can be turned off by setting `col_names` and `format_headers` arguments to
 #| label: fig-bake-sale-excel
 #| echo: false
 #| fig-width: 5
-#| fig-cap: >
+#| fig-cap: |
 #| Spreadsheet called bake_sale.xlsx in Excel.
-#| fig-alt: >
+#| fig-alt: |
 #| Bake sale data frame created earlier in Excel.

 knitr::include_graphics("screenshots/import-spreadsheets-bake-sale.png")
@@ -395,7 +395,7 @@ A good way of familiarizing yourself with the coding style used in a new package
 ```{r}
 #| echo: false
 #| fig-width: 4
-#| fig-alt: >
+#| fig-alt: |
 #| A spreadsheet with 3 columns (group, subgroup, and id) and 12 rows.
 #| The group column has two values: 1 (spanning 7 merged rows) and 2
 #| (spanning 5 merged rows). The subgroup column has four values: A
@@ -428,7 +428,7 @@ A good way of familiarizing yourself with the coding style used in a new package
 ```{r}
 #| echo: false
 #| fig-width: 4
-#| fig-alt: >
+#| fig-alt: |
 #| A spreadsheet with 3 columns (group, subgroup, and id) and 12 rows. The
 #| group column has two values: 1 (spanning 7 merged rows) and 2 (spanning
 #| 5 merged rows). The subgroup column has four values: A (spanning 3 merged
@@ -456,7 +456,7 @@ A good way of familiarizing yourself with the coding style used in a new package

 ```{r}
 #| echo: false
-#| fig-alt: >
+#| fig-alt: |
 #| A spreadsheet with 2 columns and 13 rows. The first two rows have text
 #| containing information about the sheet. Row 1 says "This file contains
 #| information on sales". Row 2 says "Data are organized by brand name, and
@@ -543,9 +543,9 @@ This is the same dataset as in @fig-students-excel, except it's stored in a Goog
 ```{r}
 #| label: fig-students-googlesheets
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| Google Sheet called students in a browser window.
-#| fig-alt: >
+#| fig-alt: |
 #| A look at the students spreadsheet in Google Sheets. The spreadsheet contains
 #| information on 6 students, their ID, full name, favourite food, meal plan,
 #| and age.
@@ -12,10 +12,10 @@ In this part of the book, you'll learn about the most important types of variabl
 ```{r}
 #| label: fig-ds-transform
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| The options for data transformation depends heavily on the type of
 #| data involved, the subject of this part of the book.
-#| fig-alt: >
+#| fig-alt: |
 #| Our data science model, with transform highlighted in blue.
 #| out.width: NULL

@@ -14,9 +14,9 @@ In this part of the book, you'll learn about visualizing data in further depth.
 ```{r}
 #| label: fig-ds-visualize
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| Data visualization is often the first step in data exploration.
-#| fig-alt: >
+#| fig-alt: |
 #| Our data science model, with visualize highlighted in blue.
 #| out.width: NULL

@@ -450,9 +450,9 @@ At the time we wrote this chapter, the page looked like @fig-scraping-imdb.
 ```{r}
 #| label: fig-scraping-imdb
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| Screenshot of the IMDb top movies web page taken on 2022-12-05.
-#| fig-alt: >
+#| fig-alt: |
 #| The screenshot shows a table with columns "Rank and Title",
 #| "IMDb Rating", and "Your Rating". 9 movies out of the top 250
 #| are shown. The top 5 are the Shawshank Redemption, The Godfather,
@@ -14,10 +14,10 @@ The later parts of the book will hit each of these topics in more depth, increas
 #| label: fig-ds-whole-game
 #| echo: false
 #| out.width: NULL
-#| fig-cap: >
+#| fig-cap: |
 #| In this section of the book, you'll learn how to import,
 #| tidy, transform, and visualize data.
-#| fig-alt: >
+#| fig-alt: |
 #| A diagram displaying the data science cycle: Import -> Tidy ->
 #| Understand (which has the phases Transform -> Visualize -> Model in a
 #| cycle) -> Communicate. Surrounding all of these is Program
@@ -206,7 +206,7 @@ Note that the environment tab in the upper right pane displays all of the object

 ```{r}
 #| echo: false
-#| fig-alt: >
+#| fig-alt: |
 #| Environment tab of RStudio which shows r_rocks, this_is_a_really_long_name,
 #| x, and y in the Global Environment.

@@ -24,10 +24,10 @@ And once you have written code that works and does what you want, you can save i
 #| label: fig-rstudio-script
 #| echo: false
 #| out-width: ~
-#| fig-cap: >
+#| fig-cap: |
 #| Opening the script editor adds a new pane at the top-left of the
 #| IDE.
-#| fig-alt: >
+#| fig-alt: |
 #| RStudio IDE with Editor, Console, and Output highlighted.
 knitr::include_graphics("diagrams/rstudio/script.png", dpi = 270)
 ```
@@ -75,7 +75,7 @@ In the script editor, RStudio will highlight syntax errors with a red squiggly l
 ```{r}
 #| echo: false
 #| out-width: ~
-#| fig-alt: >
+#| fig-alt: |
 #| Script editor with the script x y <- 10. A red X indicates that there is
 #| syntax error. The syntax error is also highlighted with a red squiggly line.

@@ -87,7 +87,7 @@ Hover over the cross to see what the problem is:
 ```{r}
 #| echo: false
 #| out-width: ~
-#| fig-alt: >
+#| fig-alt: |
 #| Script editor with the script x y <- 10. A red X indicates that there is
 #| syntax error. The syntax error is also highlighted with a red squiggly line.
 #| Hovering over the X shows a text box with the text unexpected token y and
@@ -101,7 +101,7 @@ RStudio will also let you know about potential problems:
 ```{r}
 #| echo: false
 #| out-width: ~
-#| fig-alt: >
+#| fig-alt: |
 #| Script editor with the script 3 == NA. A yellow exclamation mark
 #| indicates that there may be a potential problem. Hovering over the
 #| exclamation mark shows a text box with the text use is.na to check
@@ -184,10 +184,10 @@ There's nothing worse than discovering three months after the fact that you've o
 ```{r}
 #| label: fig-blank-slate
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| Copy these options in your RStudio options to always start your
 #| RStudio session with a clean slate.
-#| fig-alt: >
+#| fig-alt: |
 #| RStudio Global Options window where the option Restore .RData into workspace
 #| at startup is not checked. Also, the option Save workspace to .RData
 #| on exit is set to Never.
@@ -222,7 +222,7 @@ RStudio shows your current working directory at the top of the console:

 ```{r}
 #| echo: false
-#| fig-alt: >
+#| fig-alt: |
 #| The Console tab shows the current working directory as
 #| ~/Documents/r4ds.
 #| out-width: ~
@@ -263,11 +263,11 @@ Click File \> New Project, then follow the steps shown in @fig-new-project.
 ```{r}
 #| label: fig-new-project
 #| echo: false
-#| fig-cap: >
+#| fig-cap: |
 #| To create new project: (top) first click New Directory, then (middle)
 #| click New Project, then (bottom) fill in the directory (project) name,
 #| choose a good subdirectory for its home and click Create Project.
-#| fig-alt: >
+#| fig-alt: |
 #| Three screenshots of the New Project menu. In the first screenshot,
 #| the Create Project window is shown and New Directory is selected.
 #| In the second screenshot, the Project Type window is shown and
@@ -24,10 +24,10 @@ Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all
 #| label: fig-styler
 #| echo: false
 #| out-width: null
-#| fig-cap: >
+#| fig-cap: |
 #| RStudio's command palette makes it easy to access every RStudio command
 #| using only the keyboard.
-#| fig-alt: >
+#| fig-alt: |
 #| A screenshot showing the command palette after typing "styler", showing
 #| the four styling tool provided by the package.

@@ -267,7 +267,7 @@ RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift +
 #| label: fig-rstudio-sections
 #| echo: false
 #| out-width: null
-#| fig-cap: >
+#| fig-cap: |
 #| After adding sectioning comments to your script, you can
 #| easily navigate to them using the code navigation tool in the
 #| bottom-left of the script editor.