diff --git a/.gitignore b/.gitignore
index 657351e..fd1091b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,7 +6,7 @@ _main.rds
 _book
 *.md
 !CODE_OF_CONDUCT.md
-*.html
+/*.html
 !plausible.html
 search_index.json
 libs
diff --git a/oreilly/.gitignore b/oreilly/.gitignore
new file mode 100644
index 0000000..225c381
--- /dev/null
+++ b/oreilly/.gitignore
@@ -0,0 +1,2 @@
+*.png
+*.jpg
diff --git a/oreilly/EDA.html b/oreilly/EDA.html
new file mode 100644
index 0000000..5f96732
--- /dev/null
+++ b/oreilly/EDA.html
@@ -0,0 +1,591 @@

Exploratory data analysis


You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

Introduction

This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. EDA is an iterative cycle. You:

  1. Generate questions about your data.

  2. Search for answers by visualizing, transforming, and modelling your data.

  3. Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you’ll eventually write up and communicate to others.


EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualization, transformation, and modelling.

Prerequisites

In this chapter we’ll combine what you’ve learned about dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions.

library(tidyverse)

Questions

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox


“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey


Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.


EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data—and develop a set of thought-provoking questions—if you follow up each question with a new question based on what you find.


There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

The rest of this chapter will look at these two questions. We’ll explain what variation and covariation are, and we’ll show you several ways to answer each question. To make the discussion easier, let’s define some terms:

  • A variable is a quantity, quality, or property that you can measure.

  • A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

  • An observation is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We’ll sometimes refer to an observation as a data point.

  • Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy if each value is placed in its own “cell”, each variable in its own column, and each observation in its own row.

So far, all of the data that you’ve seen has been tidy. In real life, most data isn’t tidy, so we’ll come back to these ideas again in the chapter on rectangling data.

Variation

Variation is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements will include a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colors of different people) or different times (e.g. the energy levels of an electron at different moments). Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualize the distribution of the variable’s values.

Visualizing distributions

How you visualize the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is categorical if it can only take one of a small set of values. In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, you can use a bar chart:

ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()

A bar chart of cuts of diamonds. The cuts are presented in increasing order of frequency: Fair (less than 2500), Good (approximately 5000), Very Good (approximately 12500), Premium (approximately 14000), and Ideal (approximately 21500).


The height of the bars displays how many observations occurred with each x value. You can compute these values manually with count():

diamonds |> 
  count(cut)
#> # A tibble: 5 × 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551

A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, you can use a histogram:

ggplot(data = diamonds, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

A histogram of carats of diamonds, with the x-axis ranging from 0 to 4.5 and the y-axis ranging from 0 to 30000. The distribution is right skewed with very few diamonds in the bin centered at 0, almost 30000 diamonds in the bin centered at 0.5, approximately 15000 diamonds in the bin centered at 1, and much fewer, approximately 5000 diamonds in the bin centered at 1.5. Beyond this, there's a trailing tail.


You can compute this by hand by combining count() and cut_width():

diamonds |> 
  count(cut_width(carat, 0.5))
#> # A tibble: 11 × 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # … with 5 more rows

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. Note that even though it’s not possible to have a carat value that is smaller than 0 (since weights of diamonds, by definition, are positive values), the bins start at a negative value (-0.25) in order to create bins of equal width across the range of the data with the center of the first bin at 0. This behavior is also apparent in the histogram above, where the first bar ranges from -0.25 to 0.25. The tallest bar shows that almost 30,000 observations have a carat value between 0.25 and 0.75, which are the left and right edges of the bar centered at 0.5.


You can set the width of the intervals in a histogram with the binwidth argument, which is measured in the units of the x variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.

smaller <- diamonds |> 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to 10000. The binwidth is quite narrow (0.1), resulting in many bars. The distribution is right skewed but there are lots of ups and downs in the heights of the bins, creating a jagged outline.


If you wish to overlay multiple histograms in the same plot, we recommend using geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but instead of displaying the counts with bars, it uses lines. It’s much easier to understand overlapping lines than bars.

ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1, linewidth = 0.75)

A frequency polygon of carats of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 3 and the y-axis ranges from 0 to almost 6000. Ideal diamonds have a much higher peak than the others around 0.25 carats. All cuts of diamonds have right skewed distributions with local peaks at 1 carat and 2 carats. As the cut level increases (from Fair to Ideal), so does the number of diamonds that fall into that category.


We’ve also customized the thickness of the lines using the linewidth argument in order to make them stand out a bit more against the background.


There are a few challenges with this type of plot, which we will come back to in the section below on visualizing a categorical and a continuous variable.


Now that you can visualize variation, what should you look for in your plots? And what type of follow-up questions should you ask? We’ve put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).

Typical values

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values. Places that do not have bars reveal values that were not seen in your data. To turn this information into useful questions, look for anything unexpected:

  • Which values are the most common? Why?

  • Which values are rare? Why? Does that match your expectations?

  • Can you see any unusual patterns? What might explain them?

As an example, the histogram below suggests several interesting questions:

  • Why are there more diamonds at whole carats and common fractions of carats?

  • Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

A histogram of carats of diamonds, with the x-axis ranging from 0 to 3 and the y-axis ranging from 0 to roughly 2500. The binwidth is quite narrow (0.01), resulting in a very large number of skinny bars. The distribution is right skewed, with many peaks followed by bars in decreasing heights, until a sharp increase at the next peak.


Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:

  • How are the observations within each cluster similar to each other?

  • How are the observations in separate clusters different from each other?

  • How can you explain or describe the clusters?

  • Why might the appearance of clusters be misleading?

The histogram below shows the length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park. Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.

ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)

A histogram of eruption times. The x-axis ranges from roughly 1.5 to 5, and the y-axis ranges from 0 to roughly 40. The distribution is bimodal with peaks around 1.75 and 4.5.


Many of the questions above will prompt you to explore a relationship between variables, for example, to see if the values of one variable can explain the behavior of another variable. We’ll get to that shortly.

Unusual values

Outliers are observations that are unusual; data points that don’t seem to fit the pattern. Sometimes outliers are data entry errors; other times outliers suggest important new science. When you have a lot of data, outliers are sometimes difficult to see in a histogram. For example, take the distribution of the y variable from the diamonds dataset. The only evidence of outliers is the unusually wide limits on the x-axis.

ggplot(data = diamonds, mapping = aes(x = y)) + 
  geom_histogram(binwidth = 0.5)

A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 12000. There is a peak around 5, and the data appear to be completely clustered around the peak.


There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values, we need to zoom to small values of the y-axis with coord_cartesian():

ggplot(data = diamonds, mapping = aes(x = y)) + 
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

A histogram of lengths of diamonds. The x-axis ranges from 0 to 60 and the y-axis ranges from 0 to 50. There is a peak around 5, and the data appear to be completely clustered around the peak. Other than those data, there is one bin at 0 with a height of about 8, one a little over 30 with a height of 1 and another one a little below 60 with a height of 1.


coord_cartesian() also has an xlim argument for when you need to zoom into the x-axis. ggplot2 also has xlim() and ylim() functions that work slightly differently: they throw away the data outside the limits.
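
To see the difference, here is a minimal sketch (output not shown): because ylim() is a scale operation, bars taller than the limit are converted to NA and dropped with a warning, while the coord_cartesian() version above merely zooms in on the full histogram.

# A sketch: the same histogram as above, but zoomed with ylim() instead
# of coord_cartesian(). Bars taller than 50 are removed, not cropped.
ggplot(data = diamonds, mapping = aes(x = y)) + 
  geom_histogram(binwidth = 0.5) +
  ylim(0, 50)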


This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:

unusual <- diamonds |> 
  filter(y < 3 | y > 20) |> 
  select(price, x, y, z) |>
  arrange(y)
unusual
#> # A tibble: 9 × 4
#>   price     x     y     z
#>   <int> <dbl> <dbl> <dbl>
#> 1  5139  0      0    0   
#> 2  6381  0      0    0   
#> 3 12800  0      0    0   
#> 4 15686  0      0    0   
#> 5 18034  0      0    0   
#> 6  2130  0      0    0   
#> 7  2130  0      0    0   
#> 8  2075  5.15  31.8  5.12
#> 9 12210  8.09  58.9  8.06

The y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don’t cost hundreds of thousands of dollars!


It’s good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to omit them, and move on. However, if they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
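
As a minimal sketch of that practice (the summary statistics here are just one choice), you can compute the same numeric summary with and without the suspect rows and compare:

# A sketch: compare a summary of y computed on all diamonds against one
# computed after dropping the implausible widths identified above.
diamonds |> 
  summarize(mean_y = mean(y), sd_y = sd(y))

diamonds |> 
  filter(between(y, 3, 20)) |> 
  summarize(mean_y = mean(y), sd_y = sd(y))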

Exercises
  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

  2. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

  3. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

  4. Compare and contrast coord_cartesian() vs. xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
Missing values

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, you have two options.

  1. Drop the entire row with the strange values:

     diamonds2 <- diamonds |> 
       filter(between(y, 3, 20))

     We don’t recommend this option because one invalid measurement doesn’t mean that all the measurements for that observation are invalid. Additionally, if you have low quality data, by the time you’ve applied this approach to every variable you might find that you don’t have any data left!

  2. Instead, we recommend replacing the unusual values with missing values. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use if_else() to replace unusual values with NA:

     diamonds2 <- diamonds |> 
       mutate(y = if_else(y < 3 | y > 20, NA, y))

if_else() has three arguments. The first argument, test, should be a logical vector. The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is FALSE. As an alternative to if_else(), you can use case_when(). case_when() is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables or would otherwise require multiple if_else() statements nested inside one another.
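
As a sketch of that equivalence (assuming dplyr 1.1.0 or later for case_when()’s .default argument), the replacement above can be written either way:

# A sketch: the same NA replacement with if_else() and with case_when().
diamonds |>
  mutate(
    y_if_else = if_else(y < 3 | y > 20, NA, y),
    y_case_when = case_when(
      y < 3 | y > 20 ~ NA,
      .default = y
    )
  )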


Like R, ggplot2 subscribes to the philosophy that missing values should never silently go missing. It’s not obvious where you should plot missing values, so ggplot2 doesn’t include them in the plot, but it does warn that they’ve been removed:

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()
#> Warning: Removed 9 rows containing missing values (`geom_point()`).

A scatterplot of widths vs. lengths of diamonds. There is a strong, linear association between the two variables. All but one of the diamonds has length greater than 3. The one outlier has a length of 0 and a width of about 6.5.


To suppress that warning, set na.rm = TRUE:

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)

Other times you want to understand what makes observations with missing values different to observations with recorded values. For example, in nycflights13::flights (remember that when we need to be explicit about where a function or dataset comes from, we use the special form package::function() or package::dataset), missing values in the dep_time variable indicate that the flight was cancelled. So you might want to compare the scheduled departure times for cancelled and non-cancelled times. You can do this by making a new variable with is.na():

nycflights13::flights |> 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + (sched_min / 60)
  ) |> 
  ggplot(mapping = aes(sched_dep_time)) + 
  geom_freqpoly(mapping = aes(color = cancelled), binwidth = 1/4)

A frequency polygon of scheduled departure times of flights. Two lines represent flights that are cancelled and not cancelled. The x-axis ranges from 0 to 25 (hours) and the y-axis ranges from 0 to 10000. The number of flights that were not cancelled is much higher than the number that were cancelled.


However this plot isn’t great because there are many more non-cancelled flights than cancelled flights. In the next section we’ll explore some techniques for improving this comparison.

Exercises
  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference in how missing values are handled in histograms and bar charts?

  2. What does na.rm = TRUE do in mean() and sum()?
Covariation

If variation describes the behavior within a variable, covariation describes the behavior between variables. Covariation is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualize the relationship between two or more variables. How you do that depends again on the types of variables involved.

A categorical and a continuous variable

It’s common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of geom_freqpoly() is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it’s hard to see the differences in the shapes of their distributions. For example, let’s explore how the price of a diamond varies with its quality (measured by cut):

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)

A frequency polygon of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 30000 and the y-axis ranges from 0 to 5000. The lines overlap a great deal, suggesting similar frequency distributions of prices of diamonds. One notable feature is that Ideal diamonds have the highest peak around 1500.


It’s hard to see the difference in distribution because the overall counts differ so much:

ggplot(data = diamonds, mapping = aes(x = cut)) + 
  geom_bar()

Bar chart of cuts of diamonds showing large variability between the frequencies of various cuts. Fair diamonds have the lowest frequency, then Good, then Very Good, then Premium, and then Ideal.


To make the comparison easier we need to swap what is displayed on the y-axis. Instead of displaying count, we’ll display the density, which is the count standardized so that the area under each frequency polygon is one.

ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) + 
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500, linewidth = 0.75)

A frequency polygon of densities of prices of diamonds where each cut (Fair, Good, Very Good, Premium, and Ideal) is represented with a different color line. The x-axis ranges from 0 to 20000. The lines overlap a great deal, suggesting similar density distributions of prices of diamonds. One notable feature is that all but Fair diamonds have high peaks around a price of 1500 and Fair diamonds have a higher mean than others.


Note that we’re mapping the density to y, but since density is not a variable in the diamonds dataset, we need to calculate it first. We use after_stat() to do so.


There’s something rather surprising about this plot - it appears that fair diamonds (the lowest quality) have the highest average price! But maybe that’s because frequency polygons are a little hard to interpret - there’s a lot going on in this plot.


Another alternative to display the distribution of a continuous variable broken down by a categorical variable is the boxplot. A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians. Each boxplot consists of:

  • A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR). In the middle of the box is a line that displays the median, i.e. 50th percentile, of the distribution. These three lines give you a sense of the spread of the distribution and whether or not the distribution is symmetric about the median or skewed to one side. (A sketch of computing these quantities directly follows the diagram below.)

  • Visual points that display observations that fall more than 1.5 times the IQR from either edge of the box. These outlying points are unusual, so they are plotted individually.

  • A line (or whisker) that extends from each end of the box and goes to the farthest non-outlier point in the distribution.

A diagram depicting how a boxplot is created following the steps outlined above.

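The quantities in that description are easy to compute directly; here is a minimal sketch using base R’s quantile() and IQR() on the variable we’re about to plot:

# A sketch: the three lines of a boxplot of price are the quartiles, and
# the 1.5 * IQR rule for outliers is based on the interquartile range.
quantile(diamonds$price, c(0.25, 0.5, 0.75))
IQR(diamonds$price)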

Let’s take a look at the distribution of price by cut using geom_boxplot():

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

Side-by-side boxplots of prices of diamonds by cut. The distribution of prices is right skewed for each cut (Fair, Good, Very Good, Premium, and Ideal). The medians are close to each other, with the median for Ideal diamonds lowest and that for Fair highest.


We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counter-intuitive finding that better quality diamonds are cheaper on average! In the exercises, you’ll be challenged to figure out why.


cut is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the fct_reorder() function.


For example, take the class variable in the mpg dataset. You might be interested to know how highway mileage varies across classes:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis (2seaters, compact, midsize, minivan, pickup, subcompact, and suv).


To make the trend easier to see, we can reorder class based on the median value of hwy:

ggplot(data = mpg,
       mapping = aes(x = fct_reorder(class, hwy, median), y = hwy)) +
  geom_boxplot()

Side-by-side boxplots of highway mileages of cars by class. Classes are on the x-axis and ordered by increasing median highway mileage (pickup, suv, minivan, 2seater, subcompact, compact, and midsize).


If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that by exchanging the x and y aesthetic mappings.

ggplot(data = mpg,
       mapping = aes(y = fct_reorder(class, hwy, median), x = hwy)) +
  geom_boxplot()

Side-by-side boxplots of highway mileages of cars by class. Classes are on the y-axis and ordered by increasing median highway mileage.

Exercises
  1. Use what you’ve learned to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

  2. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

  3. Instead of exchanging the x and y variables, add coord_flip() as a new layer to the vertical boxplot to create a horizontal one. How does this compare to exchanging the variables?

  4. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs. cut. What do you learn? How do you interpret the plots?

  5. Compare and contrast geom_violin() with a faceted geom_histogram(), or a colored geom_freqpoly(). What are the pros and cons of each method?

  6. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.
Two categorical variables

To visualize the covariation between categorical variables, you’ll need to count the number of observations for each combination of levels of these categorical variables. One way to do that is to rely on the built-in geom_count():

ggplot(data = diamonds, mapping = aes(x = cut, y = color)) +
  geom_count()

A scatterplot of color vs. cut of diamonds. There is one point for each combination of levels of cut (Fair, Good, Very Good, Premium, and Ideal) and color (D, E, F, G, H, I, and J). The sizes of the points represent the number of observations for that combination. The legend indicates that these sizes range between 1000 and 4000.


The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.


A more commonly used way of representing the covariation between two categorical variables is using a segmented bar chart. In creating this bar chart, we map the variable we want to divide the data into first to the x aesthetic and the variable we then further want to divide each group into to the fill aesthetic.

ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
  geom_bar()

A bar chart of cuts of diamonds, segmented by color. The number of diamonds for each level of cut increases from Fair to Ideal and the heights of the segments within each bar represent the number of diamonds that fall within each color/cut combination. There appear to be some of each color of diamonds within each level of cut of diamonds.


However, in order to get a better sense of the relationship between these two variables, you should compare proportions instead of counts across groups.

ggplot(data = diamonds, mapping = aes(x = cut, fill = color)) +
  geom_bar(position = "fill")

A bar chart of cuts of diamonds, segmented by color. The heights of each of the bars representing each cut of diamond are the same, 1. The heights of the segments within each bar represent the proportion of diamonds that fall within each color/cut combination. The proportions don't appear to be very different across the levels of cut.


Another approach for exploring the relationship between these variables is computing the counts with dplyr:

diamonds |> 
  count(color, cut)
#> # A tibble: 35 × 3
#>   color cut           n
#>   <ord> <ord>     <int>
#> 1 D     Fair        163
#> 2 D     Good        662
#> 3 D     Very Good  1513
#> 4 D     Premium    1603
#> 5 D     Ideal      2834
#> 6 E     Fair        224
#> # … with 29 more rows

Then visualize with geom_tile() and the fill aesthetic:

diamonds |> 
  count(color, cut) |>  
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))

A tile plot of cut vs. color of diamonds. Each tile represents a cut/color combination and tiles are colored according to the number of observations in each tile. There are more Ideal diamonds than other cuts, with the highest number being Ideal diamonds with color G. Fair diamonds and diamonds with color I are the lowest in frequency.


If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the heatmaply package, which creates interactive plots.
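
As a rough sketch of the heatmaply idea (this assumes the package is installed; heatmaply() is its main plotting function, and the reshaping below is just one way to build the matrix it expects):

# A sketch, not a definitive recipe: reshape the counts into a matrix
# with colors as rows and cuts as columns, then draw an interactive
# heatmap.
counts <- diamonds |>
  count(color, cut) |>
  pivot_wider(names_from = cut, values_from = n) |>
  column_to_rownames("color") |>
  as.matrix()

heatmaply::heatmaply(counts)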

Exercises
  1. How could you rescale the count dataset above to more clearly show the distribution of cut within color, or color within cut?

  2. How does the segmented bar chart change if color is mapped to the x aesthetic and cut is mapped to the fill aesthetic? Calculate the counts that fall into each of the segments.

  3. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?

  4. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
Two continuous variables

You’ve already seen one great way to visualize the covariation between two continuous variables: draw a scatterplot with geom_point(). You can see covariation as a pattern in the points. For example, you can see an exponential relationship between the carat size and price of a diamond.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point()

A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential.


Scatterplots become less useful as the size of your dataset grows, because points begin to overplot, and pile up into areas of uniform black (as above). You’ve already seen one way to fix the problem: using the alpha aesthetic to add transparency.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + 
  geom_point(alpha = 1 / 100)

A scatterplot of price vs. carat. The relationship is positive, somewhat strong, and exponential. The points are transparent, showing clusters where the number of points is higher than in other areas. The most obvious clusters are for diamonds with 1, 1.5, and 2 carats.


But using transparency can be challenging for very large datasets. Another solution is to use binning. Previously you used geom_histogram() and geom_freqpoly() to bin in one dimension. Now you’ll learn how to use geom_bin2d() and geom_hex() to bin in two dimensions.


geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin. geom_bin2d() creates rectangular bins. geom_hex() creates hexagonal bins. You will need to install the hexbin package to use geom_hex().

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_bin2d()

# install.packages("hexbin")
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_hex()

Another option is to bin one continuous variable so it acts like a categorical variable. Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about. For example, you could bin carat and then for each group, display a boxplot:

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Side-by-side box plots of price by carat. Each box plot represents diamonds that are 0.1 carats apart in weight. The box plots show that as carat increases the median price increases as well. Additionally, diamonds with 1.5 carats or lower have right skewed price distributions, 1.5 to 2 have roughly symmetric price distributions, and diamonds that weigh more have left skewed distributions. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end.


cut_width(x, width), as used above, divides x into bins of width width. By default, boxplots look roughly the same (apart from number of outliers) regardless of how many observations there are, so it’s difficult to tell that each boxplot summarizes a different number of points. One way to show that is to make the width of the boxplot proportional to the number of points with varwidth = TRUE, as sketched below.
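
A minimal sketch of that option (varwidth is a standard geom_boxplot() argument):

# A sketch: the same binned boxplots, with box widths proportional to
# the number of points each one summarizes.
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)), varwidth = TRUE)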


Another approach is to display approximately the same number of points in each bin. That’s the job of cut_number():

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))

Side-by-side box plots of price by carat. Each box plot represents 20 diamonds. The box plots show that as carat increases the median price increases as well. Cheaper, smaller diamonds have outliers on the higher end, more expensive, bigger diamonds have outliers on the lower end.

Exercises
  1. Instead of summarizing the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs. cut_number()? How does that impact a visualization of the 2d distribution of carat and price?

  2. Visualize the distribution of carat, partitioned by price.

  3. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

  4. Combine two of the techniques you’ve learned to visualize the combined distribution of cut, carat, and price.

  5. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.

     ggplot(data = diamonds, mapping = aes(x = x, y = y)) +
       geom_point() +
       coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

     A scatterplot of widths vs. lengths of diamonds. There is a positive, strong, linear relationship. There are a few unusual observations above and below the bulk of the data, more below it than above.

     Why is a scatterplot a better display than a binned plot for this case?

Patterns and models

Patterns in your data provide clues about relationships. If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

  • Could this pattern be due to coincidence (i.e. random chance)?

  • How can you describe the relationship implied by the pattern?

  • How strong is the relationship implied by the pattern?

  • What other variables might affect the relationship?

  • Does the relationship change if you look at individual subgroups of the data?

A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.

ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) + 
  geom_point()

A scatterplot of eruption time vs. waiting time to next eruption of the Old Faithful geyser. There are two clusters of points: one with low eruption times and short waiting times and one with long eruption times and long waiting times.


Patterns provide one of the most useful tools for data scientists because they reveal covariation. If you think of variation as a phenomenon that creates uncertainty, covariation is a phenomenon that reduces it. If two variables covary, you can use the values of one variable to make better predictions about the values of the second. If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.


Models are a tool for extracting patterns out of data. For example, consider the diamonds data. It’s hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related. It’s possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain. The following code fits a model that predicts price from carat and then computes the residuals (the difference between the predicted value and the actual value). The residuals give us a view of the price of the diamond, once the effect of carat has been removed. Note that instead of using the raw values of price and carat, we log transform them first, and fit a model to the log-transformed values. Then, we exponentiate the residuals to put them back in the scale of raw prices.

library(tidymodels)

diamonds <- diamonds |>
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  )

diamonds_fit <- linear_reg() |>
  fit(log_price ~ log_carat, data = diamonds)

diamonds_aug <- augment(diamonds_fit, new_data = diamonds) |>
  mutate(.resid = exp(.resid))

ggplot(data = diamonds_aug, mapping = aes(x = carat, y = .resid)) + 
  geom_point()

A scatter plot of residuals vs. carat of diamonds. The x-axis ranges from 0 to 5, the y-axis ranges from 0 to almost 4. Much of the data are clustered around low values of carat and residuals. There is a clear, curved pattern showing decrease in residuals as carat increases.


Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

ggplot(data = diamonds_aug, mapping = aes(x = cut, y = .resid)) + 
  geom_boxplot()

Side-by-side box plots of residuals by cut. The x-axis displays the various cuts (Fair to Ideal), the y-axis ranges from 0 to almost 5. The medians are quite similar, between roughly 0.75 to 1.25. Each of the distributions of residuals is right skewed, with many outliers on the higher end.


We’re not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.

ggplot2 calls

As we move on from these introductory chapters, we’ll transition to a more concise expression of ggplot2 code. So far we’ve been very explicit, which is helpful when you are learning:

ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

Typically, the first one or two arguments to a function are so important that you should know them by heart. The first two arguments to ggplot() are data and mapping, and the first two arguments to aes() are x and y. In the remainder of the book, we won’t supply those names. That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what’s different between plots. That’s a really important programming concern that we’ll come back to in the chapter on functions.


Rewriting the previous plot more concisely yields:

ggplot(faithful, aes(eruptions)) + 
  geom_freqpoly(binwidth = 0.25)

Sometimes we’ll turn the end of a pipeline of data transformation into a plot. Watch for the transition from |> to +. We wish this transition wasn’t necessary but unfortunately ggplot2 was created before the pipe was discovered.

diamonds |> 
  count(cut, clarity) |> 
  ggplot(aes(clarity, cut, fill = n)) + 
  geom_tile()

Summary

In this chapter you’ve learned a variety of tools to help you understand the variation within your data. You’ve seen techniques that work with a single variable at a time and with a pair of variables. This might seem painfully restrictive if you have tens or hundreds of variables in your data, but these techniques are the foundation upon which all others are built.


In the next chapter, we’ll tackle our final piece of workflow advice: how to get help when you’re stuck.

diff --git a/oreilly/base-R.html b/oreilly/base-R.html
new file mode 100644
index 0000000..88eb92c
--- /dev/null
+++ b/oreilly/base-R.html
@@ -0,0 +1,527 @@

A field guide to base R


You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.


To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code that you’ll encounter in the wild.

This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.

After you read this book you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll certainly encounter these other approaches when you start reading R code written by other people, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!

In this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two important plotting functions.

Prerequisites
library(tidyverse)
Selecting multiple elements with [

[ is used to extract sub-components from vectors and data frames, and is called like x[i] or x[i, j]. In this section, we’ll introduce you to the power of [, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of [.

Subsetting vectors

There are five main types of things that you can subset a vector with, i.e. that can be the i in x[i]:

  1. A vector of positive integers. Subsetting with positive integers keeps the elements at those positions:

     x <- c("one", "two", "three", "four", "five")
     x[c(3, 2, 5)]
     #> [1] "three" "two"   "five"

     By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.

     x[c(1, 1, 5, 5, 5, 2)]
     #> [1] "one"  "one"  "five" "five" "five" "two"

  2. A vector of negative integers. Negative values drop the elements at the specified positions:

     x[c(-1, -3, -5)]
     #> [1] "two"  "four"

  3. A logical vector. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.

     x <- c(10, 3, NA, 5, 8, 1, NA)

     # All non-missing values of x
     !is.na(x)
     #> [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
     x[!is.na(x)]
     #> [1] 10  3  5  8  1

     # All even (or missing!) values of x
     x %% 2 == 0
     #> [1]  TRUE FALSE    NA FALSE  TRUE FALSE    NA
     x[x %% 2 == 0]
     #> [1] 10 NA  8 NA

     Note that, unlike filter(), NA indices will be included in the output as NAs (see the sketch after this list).

  4. A character vector. If you have a named vector, you can subset it with a character vector:

     x <- c(abc = 1, def = 2, xyz = 5)
     x[c("xyz", "def")]
     #> xyz def 
     #>   5   2

     As with subsetting with positive integers, you can use a character vector to duplicate individual entries.

  5. Nothing. The final type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but as we’ll see shortly it is useful when subsetting 2d structures like tibbles.
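
Here is a minimal sketch of that difference between [ and dplyr::filter():

# A sketch: filter() drops elements where the condition is NA, while [
# keeps them as NA in the result.
x <- c(10, 3, NA)
x[x > 5]
#> [1] 10 NA

tibble(x = c(10, 3, NA)) |> filter(x > 5)
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1    10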

Subsetting data frames

There are quite a few different ways you can use [ with a data frame (read https://adv-r.hadley.nz/subsetting.html#subset-multiple to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix), but the most important is selecting rows and columns independently with df[rows, cols]. Here rows and cols are vectors as described above. For example, df[rows, ] and df[, cols] select just rows or just columns, using the empty subset to preserve the other dimension.


Here are a couple of examples:

df <- tibble(
  x = 1:3, 
  y = c("a", "e", "f"), 
  z = runif(3)
)

# Select first row and second column
df[1, 2]
#> # A tibble: 1 × 1
#>   y    
#>   <chr>
#> 1 a

# Select all rows and columns x and y
df[, c("x" , "y")]
#> # A tibble: 3 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 e    
#> 3     3 f

# Select rows where `x` is greater than 1 and all columns
df[df$x > 1, ]
#> # A tibble: 2 × 3
#>       x y         z
#>   <int> <chr> <dbl>
#> 1     2 e     0.834
#> 2     3 f     0.601

We’ll come back to $ shortly, but you should be able to guess what df$x does from the context: it extracts the x variable from df. We need to use it here because [ doesn’t use tidy evaluation, so you need to be explicit about the source of the x variable.


There’s an important difference between tibbles and data frames when it comes to [. In this book we’ve mostly used tibbles, which are data frames, but they tweak some older behaviors to make your life a little easier. In most places, you can use tibbles and data frames interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write data.frames. So if df is a data.frame, then df[, cols] will return a vector if cols selects a single column and a data frame if it selects more than one column. If df is a tibble, then [ will always return a tibble.

df1 <- data.frame(x = 1:3)
df1[, "x"]
#> [1] 1 2 3

df2 <- tibble(x = 1:3)
df2[, "x"]
#> # A tibble: 3 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3

One way to avoid this ambiguity with data.frames is to explicitly specify drop = FALSE:

df1[, "x", drop = FALSE]
#>   x
#> 1 1
#> 2 2
#> 3 3

dplyr equivalents

A number of dplyr verbs are special cases of [; a rough sketch of a few correspondences follows.
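
These base translations are our own approximations, using the df defined above; note that filter() also drops rows where the condition is NA, which a plain logical subset would keep:

# Rough base R equivalents of some dplyr verbs (a sketch).
df |> filter(x > 1)    # ~ df[!is.na(df$x) & df$x > 1, ]
df |> arrange(x)       # ~ df[order(df$x), ]
df |> select(x, z)     # ~ df[, c("x", "z")]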

Base R also provides a function called subset() that combines the features of filter() and select() (although it doesn’t handle grouped data frames differently and it doesn’t support selection helpers like starts_with()):

df |> 
  filter(x > 1) |> 
  select(y, z)
#> # A tibble: 2 × 2
#>   y         z
#>   <chr> <dbl>
#> 1 e     0.834
#> 2 f     0.601

# same as
df |> subset(x > 1, c(y, z))
#> # A tibble: 2 × 2
#>   y         z
#>   <chr> <dbl>
#> 1 e     0.834
#> 2 f     0.601

This function was the inspiration for much of dplyr’s syntax.

Exercises
  1. Create functions that take a vector as input and return:

     1. The elements at even numbered positions.
     2. Every element except the last value.
     3. Only even values (and no missing values).

  2. Why is x[-which(x > 0)] not the same as x[x <= 0]? Read the documentation for which() and do some experiments to figure it out.
Selecting a single element with $ and [[

[, which selects many elements, is paired with [[ and $, which extract a single element. In this section, we’ll show you how to use [[ and $ to pull columns out of a data frame, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [ and [[ when used with lists.

Data frames

[[ and $ can be used like pull() to extract columns out of a data frame. [[ can access by position or by name, and $ is specialized for access by name:

tb <- tibble(
  x = 1:4,
  y = c(10, 4, 1, 21)
)

# by position
tb[[1]]
#> [1] 1 2 3 4

# by name
tb[["x"]]
#> [1] 1 2 3 4
tb$x
#> [1] 1 2 3 4

They can also be used to create new columns, the base R equivalent of mutate():

tb$z <- tb$x + tb$y
tb
#> # A tibble: 4 × 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1    10    11
#> 2     2     4     6
#> 3     3     1     4
#> 4     4    21    25

There are a number of other base approaches to creating new columns, including transform(), with(), and within(); a sketch of transform() follows. Hadley collected a few examples at https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf.
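
As a minimal sketch of one of these (transform() is base R; note that it returns a plain data.frame even if you pass it a tibble):

# A sketch: transform() evaluates its arguments in the context of the
# data frame, like mutate(), but returns a base data.frame.
df_base <- data.frame(x = 1:4, y = c(10, 4, 1, 21))
transform(df_base, z = x + y)
#>   x  y  z
#> 1 1 10 11
#> 2 2  4  6
#> 3 3  1  4
#> 4 4 21 25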


Using $ directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of cut, there’s no need to use summarize():

+
+
max(diamonds$carat)
+#> [1] 5.01
+
+levels(diamonds$cut)
+#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"
+
+
+ +
+

+Tibbles

+

There are a couple of important differences between tibbles and base data.frames when it comes to $. Data frames match the prefix of any variable names (so-called partial matching) and don’t complain if a column doesn’t exist:

+
+
df <- data.frame(x1 = 1)
+df$x
+#> Warning in df$x: partial match of 'x' to 'x1'
+#> [1] 1
+df$z
+#> NULL
+
+

Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:

+
+
tb <- tibble(x1 = 1)
+
+tb$x
+#> Warning: Unknown or uninitialised column: `x`.
+#> NULL
+tb$z
+#> Warning: Unknown or uninitialised column: `z`.
+#> NULL
+
+

For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.

+
+ +
+

+Lists

+

[[ and $ are also really important for working with lists, and it’s important to understand how they differ from [. Let’s illustrate the differences with a list named l:

+
+
l <- list(
+  a = 1:3, 
+  b = "a string", 
+  c = pi, 
+  d = list(-1, -5)
+)
+
+
  • +

    [ extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.

    +
    +
    str(l[1:2])
    +#> List of 2
    +#>  $ a: int [1:3] 1 2 3
    +#>  $ b: chr "a string"
    +str(l[4])
    +#> List of 1
    +#>  $ d:List of 2
    +#>   ..$ : num -1
    +#>   ..$ : num -5
    +
    +

    Like with vectors, you can subset with a logical, integer, or character vector.

    +
  • +
  • +

    [[ and $ extract a single component from a list. They remove a level of hierarchy from the list.

    +
    +
    str(l[[1]])
    +#>  int [1:3] 1 2 3
    +str(l[[4]])
    +#> List of 2
    +#>  $ : num -1
    +#>  $ : num -5
    +
    +str(l$a)
    +#>  int [1:3] 1 2 3
    +
    +
  • +

The difference between [ and [[ is particularly important for lists because [[ drills down into the list while [ returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in #fig-pepper-1. If this pepper shaker is your list pepper, then pepper[1] is a pepper shaker containing a single pepper packet, as in #fig-pepper-2. pepper[2] would look the same, but would contain the second packet. pepper[1:2] would be a pepper shaker containing two pepper packets. pepper[[1]] would extract the pepper packet itself, as in #fig-pepper-3.

+
+
+ +

A photo of a glass pepper shaker. Instead of the pepper shaker containing pepper, it contains many packets of pepper.

+
Figure 26.1: A pepper shaker that Hadley once found in his hotel room.
+
+
+
+
+
+ +

A photo of the glass pepper shaker containing just one packet of pepper.

+
Figure 26.2: pepper[1]
+
+
+
+
+
+ +

A photo of single packet of pepper.

+
Figure 26.3: pepper[[1]]
+
+
+
+

This same principle applies when you use 1d [ with a data frame:

+
+
df <- tibble(x = 1:3, y = 3:5)
+
+# returns a one-column data frame
+df["x"]
+#> # A tibble: 3 × 1
+#>       x
+#>   <int>
+#> 1     1
+#> 2     2
+#> 3     3
+
+# returns the contents of x
+df[["x"]]
+#> [1] 1 2 3
+
+
+ +
+

+Exercises

+
  1. What happens when you use [[ with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

  2. +
  3. What would pepper[[1]][1] be? What about pepper[[1]][[1]]?

  4. +
+
+ +
+

+Apply family

+

In #chp-iteration, you learned tidyverse techniques for iteration like across() and the map family of functions. In this section, you’ll learn about their base equivalents, the apply family. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.

+

The most important member of this family is lapply(), which is very similar to map(). (It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.) In fact, because we haven’t used any of map()’s more advanced features, you can replace every map() call in #chp-iteration with lapply().
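For example, here’s a quick sketch of the equivalence (our own example, not from #chp-iteration):

# both return a list with one element per input
lapply(1:3, \(x) x * 2)
#> [[1]]
#> [1] 2
#> 
#> [[2]]
#> [1] 4
#> 
#> [[3]]
#> [1] 6

# map() returns exactly the same list
purrr::map(1:3, \(x) x * 2)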

+

There’s no exact base R equivalent to across(), but you can get close by using [ with lapply(). This works because under the hood, data frames are lists of columns, so calling lapply() on a data frame applies the function to each column.

+
+
df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)
+
+# First find numeric columns
+num_cols <- sapply(df, is.numeric)
+num_cols
+#>     a     b     c     d     e 
+#>  TRUE  TRUE FALSE FALSE  TRUE
+
+# Then transform each column with lapply() then replace the original values
+df[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
+df
+#> # A tibble: 1 × 5
+#>       a     b c     d         e
+#>   <dbl> <dbl> <chr> <chr> <dbl>
+#> 1     2     4 a     b         8
+
+

The code above uses a new function, sapply(). It’s similar to lapply() but it always tries to simplify the result, hence the s in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called map_vec() that we didn’t mention in #chp-iteration.

+

Base R provides a stricter version of sapply() called vapply(), short for vector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the sapply() call above with this vapply() call, where we specify that we expect is.numeric() to return a logical vector of length 1:

+
+
vapply(df, is.numeric, logical(1))
+#>     a     b     c     d     e 
+#>  TRUE  TRUE FALSE FALSE  TRUE
+
+

The distinction between sapply() and vapply() is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.
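As a small sketch of why that robustness matters (our own example): given a zero-length input, sapply() can’t simplify and silently falls back to a list, while vapply() still returns the promised type:

sapply(list(), is.numeric)
#> list()
vapply(list(), is.numeric, logical(1))
#> logical(0)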

+

Another important member of the apply family is tapply(), which computes a single grouped summary:

+
+
diamonds |> 
+  group_by(cut) |> 
+  summarise(price = mean(price))
+#> # A tibble: 5 × 2
+#>   cut       price
+#>   <ord>     <dbl>
+#> 1 Fair      4359.
+#> 2 Good      3929.
+#> 3 Very Good 3982.
+#> 4 Premium   4584.
+#> 5 Ideal     3458.
+
+tapply(diamonds$price, diamonds$cut, mean)
+#>      Fair      Good Very Good   Premium     Ideal 
+#>  4358.758  3928.864  3981.760  4584.258  3457.542
+
+

Unfortunately tapply() returns its results in a named vector, which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free-floating vectors, but in our experience that just delays the work). If you want to see how you might use tapply() or other base techniques to perform other grouped summaries, Hadley has collected a few techniques at https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec.
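Here’s a minimal sketch of those gymnastics (our own example, not from the gist):

prices <- tapply(diamonds$price, diamonds$cut, mean)
# the group labels live in names(prices), not in a column
data.frame(cut = names(prices), price = as.vector(prices))
#>         cut    price
#> 1      Fair 4358.758
#> 2      Good 3928.864
#> 3 Very Good 3981.760
#> 4   Premium 4584.258
#> 5     Ideal 3457.542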

+

The final member of the apply family is the titular apply(), which works with matrices and arrays. In particular, watch out for apply(df, 2, something), which is a slow and potentially dangerous way of doing lapply(df, something). This rarely comes up in data science because we usually work with data frames and not matrices.
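apply() does shine when you really do have a matrix (a minimal sketch of our own):

m <- matrix(1:6, nrow = 2)
# the second argument picks the dimension to loop over: 1 = rows, 2 = columns
apply(m, 2, mean)
#> [1] 1.5 3.5 5.5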

+
+ +
+

+For loops

+

For loops are the fundamental building block of iteration that both the apply and map families use under the hood. For loops are a powerful and general tool that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:

+
+
for (element in vector) {
+  # do something with element
+}
+
+

The most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a list. For example, in #sec-save-database instead of using walk():

+
+
paths |> walk(append_file)
+
+

We could have used a for loop:

+
+
for (path in paths) {
+  append_file(path)
+}
+
+

Things get a little trickier if you want to save the output of the for loop, for example reading all of the Excel files in a directory like we did in #chp-iteration:

+
+
paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
+files <- map(paths, readxl::read_excel)
+
+

There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as paths, which we can create with vector():

+
+
files <- vector("list", length(paths))
+
+

Then instead of iterating over the elements of paths, we’ll iterate over their indices, using seq_along() to generate one index for each element of paths:

+
+
seq_along(paths)
+#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12
+
+

Using the indices is important because it allows us to link each position in the input with the corresponding position in the output:

+
+
for (i in seq_along(paths)) {
+  files[[i]] <- readxl::read_excel(paths[[i]])
+}
+
+

To combine the list of tibbles into a single tibble you can use do.call() + rbind():

+
+
do.call(rbind, files)
+#> # A tibble: 1,704 × 5
+#>   country     continent lifeExp      pop gdpPercap
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 Afghanistan Asia         28.8  8425333      779.
+#> 2 Albania     Europe       55.2  1282697     1601.
+#> 3 Algeria     Africa       43.1  9279525     2449.
+#> 4 Angola      Africa       30.0  4232095     3521.
+#> 5 Argentina   Americas     62.5 17876956     5911.
+#> 6 Australia   Oceania      69.1  8691212    10040.
+#> # … with 1,698 more rows
+
+

Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:

+
+
out <- NULL
+for (path in paths) {
+  out <- rbind(out, readxl::read_excel(path))
+}
+
+

We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that for loops are slow: they’re not, but iteratively growing a vector is.

+
+ +
+

+Plots

+

Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise: it’s very little typing to do a basic exploratory plot.

+

There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with plot() and hist() respectively. Here’s a quick example from the diamonds dataset:

+
+
hist(diamonds$carat)
+
+plot(diamonds$carat, diamonds$price)
+
+

+
+
+

+
+
+

Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using $ or some other technique.

+
+ +
+

+Summary

+

In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming, and so becomes more important as you write more functions and begin to write your own packages.

+

This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can program in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book.

+ + +
+
diff --git a/oreilly/communicate-plots.html b/oreilly/communicate-plots.html new file mode 100644 index 0000000..86ef67c --- /dev/null +++ b/oreilly/communicate-plots.html @@ -0,0 +1,624 @@ +
+

Graphics for communication

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

In #chp-EDA, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, could quickly look at it, and then move on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.

+

Now that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that ggplot2 provides to do so.

+

This chapter focuses on the tools you need to create good graphics. We assume that you know what you want, and just need to know how to do it. For that reason, we highly recommend pairing this chapter with a good general visualization book. We particularly like The Truthful Art (https://www.amazon.com/gp/product/0321934075/), by Alberto Cairo. It doesn’t teach the mechanics of creating visualizations, but instead focuses on what you need to think about in order to create effective graphics.

+ +
+

+Prerequisites

+

In this chapter, we’ll focus once again on ggplot2. We’ll also use a little dplyr for data manipulation, and a few ggplot2 extension packages, including ggrepel and patchwork. Rather than loading those extensions here, we’ll refer to their functions explicitly, using the :: notation. This will help make it clear which functions are built into ggplot2, and which come from other packages. Don’t forget you’ll need to install those packages with install.packages() if you don’t already have them.

+
+
library(tidyverse)
+
+
+
+ +
+

+Label

+

The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the labs() function. This example adds a plot title:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth(se = FALSE) +
+  labs(title = "Fuel efficiency generally decreases with engine size")
+
+

+
+
+

The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g. “A scatterplot of engine displacement vs. fuel economy”.

+

If you need to add more text, there are two other useful labels that you can use in ggplot2 2.2.0 and above:

+
  • subtitle adds additional detail in a smaller font beneath the title.

  • +
  • caption adds text at the bottom right of the plot, often used to describe the source of the data.

  • +
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth(se = FALSE) +
+  labs(
+    title = "Fuel efficiency generally decreases with engine size",
+    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
+    caption = "Data from fueleconomy.gov"
+  )
+
+

+
+
+

You can also use labs() to replace the axis and legend titles. It’s usually a good idea to replace short variable names with more detailed descriptions, and to include the units.

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class)) +
+  geom_smooth(se = FALSE) +
+  labs(
+    x = "Engine displacement (L)",
+    y = "Highway fuel economy (mpg)",
+    colour = "Car type"
+  )
+
+

+
+
+

It’s possible to use mathematical equations instead of text strings. Just switch "" out for quote() and read about the available options in ?plotmath:

+
+
df <- tibble(
+  x = runif(10),
+  y = runif(10)
+)
+ggplot(df, aes(x, y)) +
+  geom_point() +
+  labs(
+    x = quote(sum(x[i] ^ 2, i == 1, n)),
+    y = quote(alpha + beta + frac(delta, theta))
+  )
+
+

+
+
+ +
+

+Exercises

+
  1. Create one plot on the fuel economy data with customized title, subtitle, caption, x, y, and colour labels.

  2. +
  3. +

    Recreate the following plot using the fuel economy data. Note that both the colors and shapes of points vary by type of drive train.

    +
    +
    +

    +
    +
    +
  4. +
  5. Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.

  6. +
+
+ +
+

+Annotations

+

In addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is geom_text(). geom_text() is similar to geom_point(), but it has an additional aesthetic: label. This makes it possible to add textual labels to your plots.

+

There are two possible sources of labels. First, you might have a tibble that provides labels. The plot below isn’t terribly useful, but it illustrates a useful approach: pull out the most efficient car in each class with dplyr, and then label it on the plot:

+
+
best_in_class <- mpg |>
+  group_by(class) |>
+  filter(row_number(desc(hwy)) == 1)
+
+ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class)) +
+  geom_text(aes(label = model), data = best_in_class)
+
+

+
+
+

This is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to geom_label(), which draws a rectangle behind the text. We also use the nudge_y parameter to move the labels slightly above the corresponding points:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class)) +
+  geom_label(aes(label = model), data = best_in_class, nudge_y = 2, alpha = 0.5)
+
+

+
+
+

That helps a bit, but if you look closely in the top-left hand corner, you’ll notice that there are two labels practically on top of each other. This happens because the highway mileage and displacement for the best cars in the compact and subcompact categories are exactly the same. There’s no way that we can fix these by applying the same transformation for every label. Instead, we can use the ggrepel package by Kamil Slowikowski. This useful package will automatically adjust labels so that they don’t overlap:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class)) +
+  geom_point(size = 3, shape = 1, data = best_in_class) +
+  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
+
+

+
+
+

Note another handy technique used here: we added a second layer of large, hollow points to highlight the labelled points.

+

You can sometimes use the same idea to replace the legend with labels placed directly on the plot. It’s not wonderful for this plot, but it isn’t too bad. (theme(legend.position = "none") turns the legend off — we’ll talk about it more shortly.)

+
+
class_avg <- mpg |>
+  group_by(class) |>
+  summarise(
+    displ = median(displ),
+    hwy = median(hwy)
+  )
+
+ggplot(mpg, aes(displ, hwy, colour = class)) +
+  ggrepel::geom_label_repel(aes(label = class),
+    data = class_avg,
+    size = 6,
+    label.size = 0,
+    segment.color = NA
+  ) +
+  geom_point() +
+  theme(legend.position = "none")
+
+

+
+
+

Alternatively, you might just want to add a single label to the plot, but you’ll still need to create a data frame. Often, you want the label in the corner of the plot, so it’s convenient to create a new data frame using summarise() to compute the maximum values of x and y.

+
+
label_info <- mpg |>
+  summarise(
+    displ = max(displ),
+    hwy = max(hwy),
+    label = "Increasing engine size is \nrelated to decreasing fuel economy."
+  )
+
+ggplot(mpg, aes(displ, hwy)) +
+  geom_point() +
+  geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")
+
+

+
+
+

If you want to place the text exactly on the borders of the plot, you can use +Inf and -Inf. Since we’re no longer computing the positions from mpg, we can use tibble() to create the data frame:

+
+
label_info <- tibble(
+  displ = Inf,
+  hwy = Inf,
+  label = "Increasing engine size is \nrelated to decreasing fuel economy."
+)
+
+ggplot(mpg, aes(displ, hwy)) +
+  geom_point() +
+  geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right")
+
+

+
+
+

In these examples, we manually broke the label up into lines using "\n". Another approach is to use str_wrap() to automatically add line breaks, given the number of characters you want per line:

+
+
"Increasing engine size is related to decreasing fuel economy." |>
+  str_wrap(width = 40) |>
+  writeLines()
+#> Increasing engine size is related to
+#> decreasing fuel economy.
+
+

Note the use of hjust and vjust to control the alignment of the label. #fig-just shows all nine possible combinations.

+
+
+ +

+
Figure 28.1: All nine combinations of hjust and vjust.
+
+
+
+

Remember, in addition to geom_text(), you have many other geoms in ggplot2 available to help annotate your plot. A few ideas: use geom_hline() and geom_vline() to add reference lines, geom_rect() to draw a rectangle around points of interest, and geom_segment() with the arrow argument to draw attention to a point with an arrow (see the sketch after the next paragraph).

+

The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!
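For example, here’s a minimal sketch (our own; the coordinates are eyeballed for this plot, not computed) using annotate() to draw an arrow towards the two-seater cluster:

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  annotate(
    "segment",
    x = 5.5, y = 40, xend = 6.3, yend = 27,
    arrow = arrow(length = unit(0.1, "inches"))
  )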

+ +
+

+Exercises

+
  1. Use geom_text() with infinite positions to place text at the four corners of the plot.

  2. +
  3. Read the documentation for annotate(). How can you use it to add a text label to a plot without having to create a tibble?

  4. +
  5. How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the underlying data.)

  6. +
  7. What arguments to geom_label() control the appearance of the background box?

  8. +
  9. What are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options.

  10. +
+
+ +
+

+Scales

+

The third way you can make your plot better for communication is to adjust the scales. Scales control the mapping from data values to things that you can perceive. Normally, ggplot2 automatically adds scales for you. For example, when you type:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class))
+
+

ggplot2 automatically adds default scales behind the scenes:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class)) +
+  scale_x_continuous() +
+  scale_y_continuous() +
+  scale_colour_discrete()
+
+

Note the naming scheme for scales: scale_ followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. There are lots of non-default scales which you’ll learn about below.

+

The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:

+
  • You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.

  • +
  • You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.

  • +
+
+

+Axis ticks and legend keys

+

There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels. Breaks controls the position of the ticks, or the values associated with the keys. Labels controls the text label associated with each tick/key. The most common use of breaks is to override the default choice:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point() +
+  scale_y_continuous(breaks = seq(15, 40, by = 5))
+
+

+
+
+

You can use labels in the same way (a character vector the same length as breaks), but you can also set it to NULL to suppress the labels altogether. This is useful for maps, or for publishing plots where you can’t share the absolute numbers.

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point() +
+  scale_x_continuous(labels = NULL) +
+  scale_y_continuous(labels = NULL)
+
+

+
+
+

You can also use breaks and labels to control the appearance of legends. Collectively axes and legends are called guides. Axes are used for x and y aesthetics; legends are used for everything else.

+

Another use of breaks is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.

+
+
presidential |>
+  mutate(id = 33 + row_number()) |>
+  ggplot(aes(start, id)) +
+    geom_point() +
+    geom_segment(aes(xend = end, yend = id)) +
+    scale_x_date(NULL, breaks = presidential$start, date_labels = "'%y")
+
+

+
+
+

Note that the specification of breaks and labels for date and datetime scales is a little different: date_labels takes a format specification, in the same form as parse_datetime(), and date_breaks takes a string like “2 weeks” or “1 month” (see the sketch below).
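For example, here’s a sketch of ours that uses date_breaks on the presidential plot (note that these breaks fall on rounded calendar dates rather than on term starts, which is why a later exercise calls this trickier than it seems):

presidential |>
  mutate(id = 33 + row_number()) |>
  ggplot(aes(start, id)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  scale_x_date(NULL, date_breaks = "4 years", date_labels = "'%y")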

+
+ +
+

+Legend layout

+

You will most often use breaks and labels to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.

+

To control the overall position of the legend, you need to use a theme() setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting legend.position controls where the legend is drawn:

+
+
base <- ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class))
+
+base + theme(legend.position = "left")
+base + theme(legend.position = "top")
+base + theme(legend.position = "bottom")
+base + theme(legend.position = "right") # the default
+
+
+
+

+
+
+

+
+
+
+
+

+
+
+

+
+
+
+
+

You can also use legend.position = "none" to suppress the display of the legend altogether.

+

To control the display of individual legends, use guides() along with guide_legend() or guide_colourbar(). The following example shows two important settings: controlling the number of rows the legend uses with nrow, and overriding one of the aesthetics to make the points bigger. This is particularly useful if you have used a low alpha to display many points on a plot.

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(colour = class)) +
+  geom_smooth(se = FALSE) +
+  theme(legend.position = "bottom") +
+  guides(colour = guide_legend(nrow = 1, override.aes = list(size = 4)))
+#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
+
+

+
+
+
+ +
+

+Replacing a scale

+

Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and colour, you’ll be able to quickly pick up other scale replacements.

+

It’s very useful to plot transformations of your variable. For example, as we’ve seen in #chp-diamond-prices it’s easier to see the precise relationship between carat and price if we log transform them:

+
+
ggplot(diamonds, aes(carat, price)) +
+  geom_bin2d()
+
+ggplot(diamonds, aes(log10(carat), log10(price))) +
+  geom_bin2d()
+
+
+
+

+
+
+

+
+
+
+
+

However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of doing the transformation in the aesthetic mapping, we can instead do it with the scale. This is visually identical, except the axes are labelled on the original data scale.

+
+
ggplot(diamonds, aes(carat, price)) +
+  geom_bin2d() + 
+  scale_x_log10() + 
+  scale_y_log10()
+
+

+
+
+

Another scale that is frequently customized is colour. The default categorical scale picks colors that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots on the right can be distinguished even by people with red-green colour blindness.

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(color = drv))
+
+ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(color = drv)) +
+  scale_colour_brewer(palette = "Set1")
+
+
+
+

+
+
+

+
+
+
+
+

Don’t forget simpler techniques. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(color = drv, shape = drv)) +
+  scale_colour_brewer(palette = "Set1")
+
+

+
+
+

The ColorBrewer scales are documented online at https://colorbrewer2.org/ and made available in R via the RColorBrewer package, by Erich Neuwirth. #fig-brewer shows the complete list of all palettes. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used cut() to make a continuous variable into a categorical variable.

+
+
+ +

+
Figure 28.2: All ColourBrewer scales.
+
+
+
+

When you have a predefined mapping between values and colors, use scale_colour_manual(). For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats:

+
+
presidential |>
+  mutate(id = 33 + row_number()) |>
+  ggplot(aes(start, id, colour = party)) +
+  geom_point() +
+  geom_segment(aes(xend = end, yend = id)) +
+  scale_colour_manual(values = c(Republican = "red", Democratic = "blue"))
+
+

+
+
+

For continuous colour, you can use the built-in scale_colour_gradient() or scale_fill_gradient(). If you have a diverging scale, you can use scale_colour_gradient2(). That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.

+

Another option is to use the viridis color scales. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous colour schemes that are perceptible to people with various forms of colour blindness as well as perceptually uniform in both color and black and white. These scales are available as continuous (c), discrete (d), and binned (b) palettes in ggplot2.

+
+
df <- tibble(
+  x = rnorm(10000),
+  y = rnorm(10000)
+)
+ggplot(df, aes(x, y)) +
+  geom_hex() +
+  coord_fixed() +
+  labs(title = "Default, continuous")
+
+ggplot(df, aes(x, y)) +
+  geom_hex() +
+  coord_fixed() +
+  scale_fill_viridis_c() +
+  labs(title = "Viridis, continuous")
+
+ggplot(df, aes(x, y)) +
+  geom_hex() +
+  coord_fixed() +
+  scale_fill_viridis_b() +
+  labs(title = "Viridis, binned")
+
+
+
+

+
+
+

+
+
+
+
+

+
+
+
+
+

Note that all colour scales come in two varieties: scale_colour_x() and scale_fill_x() for the colour and fill aesthetics respectively (the colour scales are available in both UK and US spellings).

+
+ +
+

+Exercises

+
  1. +

    Why doesn’t the following code override the default scale?

    +
    +
    ggplot(df, aes(x, y)) +
    +  geom_hex() +
    +  scale_colour_gradient(low = "white", high = "red") +
    +  coord_fixed()
    +
    +
  2. +
  3. What is the first argument to every scale? How does it compare to labs()?

  4. +
  5. +

    Change the display of the presidential terms by:

    +
    1. Combining the two variants shown above.
    2. +
    3. Improving the display of the y axis.
    4. +
    5. Labelling each term with the name of the president.
    6. +
    7. Adding informative plot labels.
    8. +
    9. Placing breaks every 4 years (this is trickier than it seems!).
    10. +
  6. +
  7. +

    Use override.aes to make the legend on the following plot easier to see.

    +
    +
    ggplot(diamonds, aes(carat, price)) +
    +  geom_point(aes(colour = cut), alpha = 1/20)
    +
    +

    +
    +
    +
  8. +
+
+ +
+

+Zooming

+

There are three ways to control the plot limits:

+
  1. Adjusting what data are plotted
  2. +
  3. Setting the limits in each scale
  4. +
  5. Setting xlim and ylim in coord_cartesian()
  6. +

To zoom in on a region of the plot, it’s generally best to use coord_cartesian(). Compare the following two plots:

+
+
ggplot(mpg, mapping = aes(displ, hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth() +
+  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
+
+mpg |>
+  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) |>
+  ggplot(aes(displ, hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth()
+
+
+
+

+
+
+

+
+
+
+
+

You can also set the limits on individual scales. Reducing the limits is basically equivalent to subsetting the data. It is generally more useful if you want to expand the limits, for example, to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.

+
+
suv <- mpg |> filter(class == "suv")
+compact <- mpg |> filter(class == "compact")
+
+ggplot(suv, aes(displ, hwy, colour = drv)) +
+  geom_point()
+
+ggplot(compact, aes(displ, hwy, colour = drv)) +
+  geom_point()
+
+
+
+

+
+
+

+
+
+
+
+

One way to overcome this problem is to share scales across multiple plots, training the scales with the limits of the full data.

+
+
x_scale <- scale_x_continuous(limits = range(mpg$displ))
+y_scale <- scale_y_continuous(limits = range(mpg$hwy))
+col_scale <- scale_colour_discrete(limits = unique(mpg$drv))
+
+ggplot(suv, aes(displ, hwy, colour = drv)) +
+  geom_point() +
+  x_scale +
+  y_scale +
+  col_scale
+
+ggplot(compact, aes(displ, hwy, colour = drv)) +
+  geom_point() +
+  x_scale +
+  y_scale +
+  col_scale
+
+
+
+

+
+
+

+
+
+
+
+

In this particular case, you could have simply used faceting, but this technique is useful more generally if, for instance, you want to spread plots over multiple pages of a report.

+
+ +
+

+Themes

+

Finally, you can customize the non-data elements of your plot with a theme:

+
+
ggplot(mpg, aes(displ, hwy)) +
+  geom_point(aes(color = class)) +
+  geom_smooth(se = FALSE) +
+  theme_bw()
+
+

+
+
+

ggplot2 includes eight themes by default, as shown in #fig-themes. Many more are included in add-on packages like ggthemes (https://jrnold.github.io/ggthemes), by Jeffrey Arnold.

+
+
+ +

Eight barplots created with ggplot2, each with one of the eight built-in themes: theme_bw() - White background with grid lines, theme_light() - Light axes and grid lines, theme_classic() - Classic theme, axes but no grid lines, theme_linedraw() - Only black lines, theme_dark() - Dark background for contrast, theme_minimal() - Minimal theme, no background, theme_gray() - Gray background (default theme), theme_void() - Empty theme, only geoms are visible.

+
Figure 28.3: The eight themes built-in to ggplot2.
+
+
+
+

Many people wonder why the default theme has a gray background. This was a deliberate choice because it puts the data forward while still making the grid lines visible. The white grid lines are visible (which is important because they significantly aid position judgments), but they have little visual impact and we can easily tune them out. The grey background gives the plot a similar typographic colour to the text, ensuring that the graphics fit in with the flow of a document without jumping out with a bright white background. Finally, the grey background creates a continuous field of colour which ensures that the plot is perceived as a single visual entity.

+

It’s also possible to control individual components of each theme, like the size and colour of the font used for the y axis. Unfortunately, this level of detail is outside the scope of this book, so you’ll need to read the ggplot2 book (https://ggplot2-book.org/) for the full details. You can also create your own themes, if you are trying to match a particular corporate or journal style.

+
+ +
+

+Saving your plots

+

There are two main ways to get your plots out of R and into your final write-up: ggsave() and knitr. ggsave() will save the most recent plot to disk:

+
+
ggplot(mpg, aes(displ, hwy)) + geom_point()
+ggsave("my-plot.pdf")
+#> Saving 6 x 4 in image
+
+

If you don’t specify the width and height they will be taken from the dimensions of the current plotting device. For reproducible code, you’ll want to specify them.
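For example, a minimal sketch of ours:

# specifying the dimensions explicitly makes the output reproducible
ggsave("my-plot.pdf", width = 6, height = 4)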

+

Generally, however, we recommend that you assemble your final reports using Quarto, so we focus on the important code chunk options that you should know about for graphics. You can learn more about ggsave() in the documentation.

+
+ +
+

+Learning more

+

The absolute best place to learn more is the ggplot2 book, ggplot2: Elegant Graphics for Data Analysis (https://ggplot2-book.org/). It goes into much more depth about the underlying theory, and has many more examples of how to combine the individual pieces to solve practical problems.

+

Another great resource is the ggplot2 extensions gallery https://exts.ggplot2.tidyverse.org/gallery/. This site lists many of the packages that extend ggplot2 with new geoms and scales. It’s a great place to start if you’re trying to do something that seems hard with ggplot2.

+ + +
+
diff --git a/oreilly/communicate.html b/oreilly/communicate.html new file mode 100644 index 0000000..bf449e2 --- /dev/null +++ b/oreilly/communicate.html @@ -0,0 +1,12 @@ +
+

Communicate

So far, you’ve learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualization. However, it doesn’t matter how great your analysis is unless you can explain it to others: you need to communicate your results.

+
+ +

A diagram displaying the data science cycle with visualize and communicate highlighted in blue.

+
Figure 1: Communication is the final part of the data science process; if you can’t communicate your results to other humans, it doesn’t matter how great your analysis is.
+
+
+

Communication is the theme of the following three chapters:

  • In #chp-quarto, you will learn about Quarto, a tool for integrating prose, code, and results. You can use Quarto for analyst-to-analyst communication as well as analyst-to-decision-maker communication. Thanks to the power of Quarto formats, you can even use the same document for both purposes.

  • +
  • In #chp-quarto-formats, you’ll learn a little about the many other varieties of outputs you can produce using Quarto, including dashboards, websites, and books.

  • +
  • We’ll finish up with #chp-quarto-workflow, where you’ll learn about the “analysis notebook” and how to systematically record your successes and failures so that you can learn from them.

  • +

These chapters focus mostly on the technical mechanics of communication, not the really hard problems of communicating your thoughts to other humans. However, there are a lot of other great books about communication, which we’ll point you to at the end of each chapter.

diff --git a/oreilly/cover.png b/oreilly/cover.png new file mode 100644 index 0000000..a7150bd Binary files /dev/null and b/oreilly/cover.png differ diff --git a/oreilly/data-import.html b/oreilly/data-import.html new file mode 100644 index 0000000..c1b0aa8 --- /dev/null +++ b/oreilly/data-import.html @@ -0,0 +1,602 @@ +
+

Data import

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data. In this chapter, you’ll learn how to read plain-text rectangular files into R.

+ +
+

+Prerequisites

+

In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse.

+
+
library(tidyverse)
+
+
+
+ +
+

+Reading data from a file

+

To begin we’ll focus on the most common rectangular data file type: the CSV, which is short for comma-separated values. Here is what a simple CSV file looks like. The first row, commonly called the header row, gives the column names, and the following six rows give the data.

+
+
#> Student ID,Full Name,favourite.food,mealPlan,AGE
+#> 1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
+#> 2,Barclay Lynn,French fries,Lunch only,5
+#> 3,Jayendra Lyne,N/A,Breakfast and lunch,7
+#> 4,Leon Rossini,Anchovies,Lunch only,
+#> 5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
+#> 6,Güvenç Attila,Ice cream,Lunch only,6
+
+

#tbl-students-table shows a representation of the same data as a table.

+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Table 8.1: Data from the students.csv file as a table.
Student IDFull Namefavourite.foodmealPlanAGE
1Sunil HuffmannStrawberry yoghurtLunch only4
2Barclay LynnFrench friesLunch only5
3Jayendra LyneN/ABreakfast and lunch7
4Leon RossiniAnchoviesLunch onlyNA
5Chidiegwu DunkelPizzaBreakfast and lunchfive
6Güvenç AttilaIce creamLunch only6
+
+
+

We can read this file into R using read_csv(). The first argument is the most important: it’s the path to the file.

+
+
students <- read_csv("data/students.csv")
+#> Rows: 6 Columns: 5
+#> ── Column specification ────────────────────────────────────────────────────────
+#> Delimiter: ","
+#> chr (4): Full Name, favourite.food, mealPlan, AGE
+#> dbl (1): Student ID
+#> 
+#> ℹ Use `spec()` to retrieve the full column specification for this data.
+#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+

When you run read_csv() it prints out a message telling you the number of rows and columns of data, the delimiter that was used, and the column specifications (names of columns organized by the type of data the column contains). It also prints out some information about how to retrieve the full column specification as well as how to quiet this message. This message is an important part of readr, and we’ll come back to it in #sec-col-types.
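For example, following the advice in that message (our own sketch), spec() retrieves the full column specification that readr guessed:

spec(students)
#> cols(
#>   `Student ID` = col_double(),
#>   `Full Name` = col_character(),
#>   favourite.food = col_character(),
#>   mealPlan = col_character(),
#>   AGE = col_character()
#> )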

+ +
+

+Practical advice

+

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. Let’s take another look at the students data with that in mind.

+

In the favourite.food column, there are a bunch of food items, and then the character string N/A, which should have been a real NA that R will recognize as “not available”. This is something we can address using the na argument.

+
+
students <- read_csv("data/students.csv", na = c("N/A", ""))
+
+students
+#> # A tibble: 6 × 5
+#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
+#>          <dbl> <chr>            <chr>              <chr>               <chr>
+#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
+#> 2            2 Barclay Lynn     French fries       Lunch only          5    
+#> 3            3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
+#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
+#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
+#> 6            6 Güvenç Attila    Ice cream          Lunch only          6
+
+

You might also notice that the Student ID and Full Name columns are surrounded by backticks. That’s because they contain spaces, breaking R’s usual rules for variable names. To refer to them, you need to use those backticks:

+
+
students |> 
+  rename(
+    student_id = `Student ID`,
+    full_name = `Full Name`
+  )
+#> # A tibble: 6 × 5
+#>   student_id full_name        favourite.food     mealPlan            AGE  
+#>        <dbl> <chr>            <chr>              <chr>               <chr>
+#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
+#> 2          2 Barclay Lynn     French fries       Lunch only          5    
+#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
+#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
+#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
+#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
+
+

An alternative approach is to use janitor::clean_names(), which uses some heuristics to turn the names into snake case at once. (The janitor package (http://sfirke.github.io/janitor/) is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use |>.)

+
+
students |> janitor::clean_names()
+#> # A tibble: 6 × 5
+#>   student_id full_name        favourite_food     meal_plan           age  
+#>        <dbl> <chr>            <chr>              <chr>               <chr>
+#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
+#> 2          2 Barclay Lynn     French fries       Lunch only          5    
+#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
+#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
+#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
+#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
+
+

Another common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable with a known set of possible values, which in R should be represented as a factor:

+
+
students |>
+  janitor::clean_names() |>
+  mutate(
+    meal_plan = factor(meal_plan)
+  )
+#> # A tibble: 6 × 5
+#>   student_id full_name        favourite_food     meal_plan           age  
+#>        <dbl> <chr>            <chr>              <fct>               <chr>
+#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
+#> 2          2 Barclay Lynn     French fries       Lunch only          5    
+#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
+#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
+#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
+#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
+
+

Note that the values in the meal_plan variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (<chr>) to factor (<fct>). You’ll learn more about factors in #chp-factors.

+

Before you move on to analyzing these data, you’ll probably want to fix the age column as well: currently it’s a character variable because of the one observation that is typed out as five instead of a numeric 5. We discuss the details of fixing this issue in #chp-spreadsheets.

+
+
students <- students |>
+  janitor::clean_names() |>
+  mutate(
+    meal_plan = factor(meal_plan),
+    age = parse_number(if_else(age == "five", "5", age))
+  )
+
+students
+#> # A tibble: 6 × 5
+#>   student_id full_name        favourite_food     meal_plan             age
+#>        <dbl> <chr>            <chr>              <fct>               <dbl>
+#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
+#> 2          2 Barclay Lynn     French fries       Lunch only              5
+#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
+#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
+#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
+#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
+
+
+ +
+

+Other arguments

+

There are a couple of other important arguments that we need to mention, and they’ll be easier to demonstrate if we first show you a handy trick: read_csv() can read CSV files that you’ve created in a string:

+
+
read_csv(
+  "a,b,c
+  1,2,3
+  4,5,6"
+)
+#> # A tibble: 2 × 3
+#>       a     b     c
+#>   <dbl> <dbl> <dbl>
+#> 1     1     2     3
+#> 2     4     5     6
+
+

Usually read_csv() uses the first line of the data for the column names, which is a very common convention. But sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines or use comment = "#" to drop all lines that start with (e.g.) #:

+
+
read_csv(
+  "The first line of metadata
+  The second line of metadata
+  x,y,z
+  1,2,3",
+  skip = 2
+)
+#> # A tibble: 1 × 3
+#>       x     y     z
+#>   <dbl> <dbl> <dbl>
+#> 1     1     2     3
+
+read_csv(
+  "# A comment I want to skip
+  x,y,z
+  1,2,3",
+  comment = "#"
+)
+#> # A tibble: 1 × 3
+#>       x     y     z
+#>   <dbl> <dbl> <dbl>
+#> 1     1     2     3
+
+

In other cases, the data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

+
+
read_csv(
+  "1,2,3
+  4,5,6",
+  col_names = FALSE
+)
+#> # A tibble: 2 × 3
+#>      X1    X2    X3
+#>   <dbl> <dbl> <dbl>
+#> 1     1     2     3
+#> 2     4     5     6
+
+

Alternatively you can pass col_names a character vector which will be used as the column names:

+
+
read_csv(
+  "1,2,3
+  4,5,6",
+  col_names = c("x", "y", "z")
+)
+#> # A tibble: 2 × 3
+#>       x     y     z
+#>   <dbl> <dbl> <dbl>
+#> 1     1     2     3
+#> 2     4     5     6
+
+

These arguments are all you need to know to read the majority of CSV files that you’ll encounter in practice. (For the rest, you’ll need to carefully inspect your .csv file and carefully read the documentation for read_csv()’s many other arguments.)

+
+ +
+

+Other file types

+

Once you’ve mastered read_csv(), using readr’s other functions is straightforward; it’s just a matter of knowing which function to reach for:

  • read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.
  • read_tsv() reads tab-delimited files.
  • read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.
  • read_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().
  • read_table() reads a common variation of fixed-width files where columns are separated by white space.
  • read_log() reads Apache-style log files.

+
+ +
+

+Exercises

+
  1. What function would you use to read a file where fields were separated with “|”?

  2. +
  3. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

  4. +
  5. What are the most important arguments to read_fwf()?

  6. +
  7. +

    Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By default, read_csv() assumes that the quoting character will be ". What argument to read_csv() do you need to specify to read the following text into a data frame?

    +
    +
    "x,y\n1,'a,b'"
    +
    +
  8. +
  9. +

    Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

    +
    +
    read_csv("a,b\n1,2,3\n4,5,6")
    +read_csv("a,b,c\n1,2\n1,2,3,4")
    +read_csv("a,b\n\"1")
    +read_csv("a,b\n1,2\na,b")
    +read_csv("a;b\n1;3")
    +
    +
  10. +
  11. +

    Practice referring to non-syntactic names in the following data frame by:

    +
    1. Extracting the variable called 1.
    2. +
    3. Plotting a scatterplot of 1 vs 2.
    4. +
    5. Creating a new column called 3 which is 2 divided by 1.
    6. +
    7. Renaming the columns to one, two and three.
    8. +
    +
    annoying <- tibble(
    +  `1` = 1:10,
    +  `2` = `1` * 2 + rnorm(length(`1`))
    +)
    +
    +
  12. +
+
+ +
+

+Controlling column types

+

A CSV file doesn’t contain any information about the type of each variable (i.e. whether it’s a logical, number, string, etc.), so readr will try to guess the type. This section describes how the guessing process works, how to resolve some common problems that cause it to fail, and, if needed, how to supply the column types yourself. Finally, we’ll mention a couple of general strategies that are useful if readr is failing catastrophically and you need to get more insight into the structure of your file.

+ +
+

+Guessing types

+

readr uses a heuristic to figure out the column types. For each column, it pulls the values of 1,000 rows spaced evenly from the first row to the last, ignoring any missing values. (You can override the default of 1,000 with the guess_max argument.) It then works through the following questions:

+
  • Does it contain only F, T, FALSE, or TRUE (ignoring case)? If so, it’s a logical.
  • +
  • Does it contain only numbers (e.g. 1, -4.5, 5e6, Inf)? If so, it’s a number.
  • +
  • Does it match the ISO8601 standard? If so, it’s a date or date-time. (We’ll come back to date/times in more detail in #sec-creating-datetimes).
  • +
  • Otherwise, it must be a string.
  • +

You can see that behavior in action in this simple example:

+
+
read_csv("
+  logical,numeric,date,string
+  TRUE,1,2021-01-15,abc
+  false,4.5,2021-02-15,def
+  T,Inf,2021-02-16,ghi"
+)
+#> Rows: 3 Columns: 4
+#> ── Column specification ────────────────────────────────────────────────────────
+#> Delimiter: ","
+#> chr  (1): string
+#> dbl  (1): numeric
+#> lgl  (1): logical
+#> date (1): date
+#> 
+#> ℹ Use `spec()` to retrieve the full column specification for this data.
+#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+#> # A tibble: 3 × 4
+#>   logical numeric date       string
+#>   <lgl>     <dbl> <date>     <chr> 
+#> 1 TRUE        1   2021-01-15 abc   
+#> 2 FALSE       4.5 2021-02-15 def   
+#> 3 TRUE      Inf   2021-02-16 ghi
+
+

This heuristic works well if you have a clean dataset, but in real life you’ll encounter a selection of weird and wonderful failures.

+
+ +
+

+Missing values, column types, and problems

+

The most common way column detection fails is that a column contains unexpected values and you get a character column instead of a more specific type. One of the most common causes for this is a missing value recorded using something other than the NA that readr expects.

+

Take this simple one-column CSV file as an example:

+
+
csv <- "
+  x
+  10
+  .
+  20
+  30"
+
+

If we read it without any additional arguments, x becomes a character column:

+
+
df <- read_csv(csv)
+#> Rows: 4 Columns: 1
+#> ── Column specification ────────────────────────────────────────────────────────
+#> Delimiter: ","
+#> chr (1): x
+#> 
+#> ℹ Use `spec()` to retrieve the full column specification for this data.
+#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+

In this very small case, you can easily see the missing value ".". But what happens if you have thousands of rows with only a few missing values represented by .s speckled amongst them? One approach is to tell readr that x is a numeric column, and then see where it fails. You can do that with the col_types argument, which takes a named list:

+
+
df <- read_csv(csv, col_types = list(x = col_double()))
+#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
+#> e.g.:
+#>   dat <- vroom(...)
+#>   problems(dat)
+
+

Now #chp-https://readr.tidyverse.org/reference/read_delim reports that there was a problem, and tells us we can find out more with #chp-https://readr.tidyverse.org/reference/problems:

+
+
problems(df)
+#> # A tibble: 1 × 5
+#>     row   col expected actual file                                    
+#>   <int> <int> <chr>    <chr>  <chr>                                   
+#> 1     3     1 a double .      /private/tmp/Rtmp43JYhG/file7cf337a06034
+
+

This tells us that there was a problem in row 3, col 1 where readr expected a double but got a ".". That suggests this dataset uses "." for missing values. So we set na = ".", and the automatic guessing succeeds, giving us the numeric column that we want:

+
+
df <- read_csv(csv, na = ".")
+#> Rows: 4 Columns: 1
+#> ── Column specification ────────────────────────────────────────────────────────
+#> Delimiter: ","
+#> dbl (1): x
+#> 
+#> ℹ Use `spec()` to retrieve the full column specification for this data.
+#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+
+
+ +
+

+Column types

+

readr provides a total of nine column types for you to use:

  • col_logical() and col_double() read logicals and real numbers. They’re relatively rarely needed, since readr will usually guess them for you.
  • col_integer() reads integers.
  • col_character() reads strings. This is useful to specify explicitly when you have a column that is a numeric identifier, like a phone number, that you don’t want to do maths on.
  • col_factor(), col_date(), and col_datetime() create factors, dates, and date-times respectively.
  • col_number() is a permissive numeric parser that ignores non-numeric components, which is particularly useful for currencies.
  • col_skip() skips a column so it’s not included in the result.

+
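
For instance, you can supply a type for every column up front; here is a minimal sketch with made-up data, using the same named-list form you saw above:

+
+
csv <- "
+  x,y,z
+  1,2021-01-15,abc"
+read_csv(csv, col_types = list(
+  x = col_integer(),
+  y = col_date(),
+  z = col_character()
+))
+#> # A tibble: 1 × 3
+#>       x y          z    
+#>   <int> <date>     <chr>
+#> 1     1 2021-01-15 abc
+
+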

It’s also possible to override the default column type by switching from #chp-https://rdrr.io/r/base/list to #chp-https://readr.tidyverse.org/reference/cols:

+
+
csv <- "
+x,y,z
+1,2,3"
+
+read_csv(csv, col_types = cols(.default = col_character()))
+#> # A tibble: 1 × 3
+#>   x     y     z    
+#>   <chr> <chr> <chr>
+#> 1 1     2     3
+
+

Another useful helper is cols_only(), which will read in only the columns you specify:

+
+
read_csv(
+  "x,y,z
+  1,2,3",
+  col_types = cols_only(x = col_character())
+)
+#> # A tibble: 1 × 1
+#>   x    
+#>   <chr>
+#> 1 1
+
+
+
+ +
+

+Reading data from multiple files

+

Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March. With #chp-https://readr.tidyverse.org/reference/read_delim you can read these data in at once and stack them on top of each other in a single data frame.

+
+
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
+read_csv(sales_files, id = "file")
+#> Rows: 19 Columns: 6
+#> ── Column specification ────────────────────────────────────────────────────────
+#> Delimiter: ","
+#> chr (1): month
+#> dbl (4): year, brand, item, n
+#> 
+#> ℹ Use `spec()` to retrieve the full column specification for this data.
+#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
+#> # A tibble: 19 × 6
+#>   file              month    year brand  item     n
+#>   <chr>             <chr>   <dbl> <dbl> <dbl> <dbl>
+#> 1 data/01-sales.csv January  2019     1  1234     3
+#> 2 data/01-sales.csv January  2019     1  8721     9
+#> 3 data/01-sales.csv January  2019     1  1822     2
+#> 4 data/01-sales.csv January  2019     2  3333     1
+#> 5 data/01-sales.csv January  2019     2  2156     9
+#> 6 data/01-sales.csv January  2019     2  3987     6
+#> # … with 13 more rows
+
+

With the additional id parameter we have added a new column called file to the resulting data frame that identifies the file the data come from. This is especially helpful in circumstances where the files you’re reading in do not have an identifying column that can help you trace the observations back to their original sources.

+

If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the base #chp-https://rdrr.io/r/base/list.files function to find the files for you by matching a pattern in the file names. You’ll learn more about these patterns in #chp-regexps.

+
+
sales_files <- list.files("data", pattern = "sales\\.csv$", full.names = TRUE)
+sales_files
+#> [1] "data/01-sales.csv" "data/02-sales.csv" "data/03-sales.csv"
+
+
+ +
+

+Writing to a file

+

readr also comes with two useful functions for writing data back to disk: write_csv() and write_tsv(). Both functions increase the chances of the output file being read back in correctly by using the standard UTF-8 encoding for strings and ISO8601 format for date-times.

+

The most important arguments are x (the data frame to save) and file (the location to save it). You can also specify how missing values are written with na, and whether you want to append to an existing file.

+
+
write_csv(students, "students.csv")
+
+
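
As a minimal sketch of those two arguments (the file name here is hypothetical and used only for illustration):

+
+
write_csv(students, "students-dots.csv", na = ".")       # write missing values as "."
+write_csv(students, "students-dots.csv", append = TRUE)  # append rows, without repeating the header
+
+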

Now let’s read that CSV file back in. Note that the type information is lost when you save to CSV:

+
+
students
+#> # A tibble: 6 × 5
+#>   student_id full_name        favourite_food     meal_plan             age
+#>        <dbl> <chr>            <chr>              <fct>               <dbl>
+#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
+#> 2          2 Barclay Lynn     French fries       Lunch only              5
+#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
+#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
+#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
+#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
+write_csv(students, "students-2.csv")
+read_csv("students-2.csv")
+#> # A tibble: 6 × 5
+#>   student_id full_name        favourite_food     meal_plan             age
+#>        <dbl> <chr>            <chr>              <chr>               <dbl>
+#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
+#> 2          2 Barclay Lynn     French fries       Lunch only              5
+#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
+#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
+#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
+#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
+
+

This makes CSVs a little unreliable for caching interim results—you need to recreate the column specification every time you load the data back in. There are two main options:

+
  1. +

    read_rds() and write_rds() are uniform wrappers around the base functions readRDS() and saveRDS(). These store data in R’s custom binary format called RDS:

    +
    +
    write_rds(students, "students.rds")
    +read_rds("students.rds")
    +#> # A tibble: 6 × 5
    +#>   student_id full_name        favourite_food     meal_plan             age
    +#>        <dbl> <chr>            <chr>              <fct>               <dbl>
    +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    +#> 2          2 Barclay Lynn     French fries       Lunch only              5
    +#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    +#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    +#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    +
    +
  2. +
  3. +

    The arrow package allows you to read and write parquet files, a fast binary file format that can be shared across programming languages:

    +
    +
    library(arrow)
    +write_parquet(students, "students.parquet")
    +read_parquet("students.parquet")
    +#> # A tibble: 6 × 5
    +#>   student_id full_name        favourite_food     meal_plan             age
    +#>        <dbl> <chr>            <chr>              <fct>               <dbl>
    +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    +#> 2          2 Barclay Lynn     French fries       Lunch only              5
    +#> 3          3 Jayendra Lyne    NA                 Breakfast and lunch     7
    +#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    +#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    +
    +
  4. +

Parquet tends to be much faster than RDS and is usable outside of R, but does require you to install the arrow package.

+
+ +
+

+Data entry

+

Sometimes you’ll need to assemble a tibble “by hand”, doing a little data entry in your R script. There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows. #chp-https://tibble.tidyverse.org/reference/tibble works by column:

+
+
tibble(
+  x = c(1, 2, 5), 
+  y = c("h", "m", "g"),
+  z = c(0.08, 0.83, 0.60)
+)
+#> # A tibble: 3 × 3
+#>       x y         z
+#>   <dbl> <chr> <dbl>
+#> 1     1 h      0.08
+#> 2     2 m      0.83
+#> 3     5 g      0.6
+
+

Note that every column in a tibble must be the same size, so you’ll get an error if they’re not:

+
+
tibble(
+  x = c(1, 2),
+  y = c("h", "m", "g"),
+  z = c(0.08, 0.83, 0.6)
+)
+#> Error:
+#> ! Tibble columns must have compatible sizes.
+#> • Size 2: Existing data.
+#> • Size 3: Column `y`.
+#> ℹ Only values of size one are recycled.
+
+
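
As the error message notes, the one exception is values of size one, which are recycled to match the longest column:

+
+
tibble(
+  x = c(1, 2, 3),
+  y = 0  # a single value is recycled to the length of the other columns
+)
+#> # A tibble: 3 × 2
+#>       x     y
+#>   <dbl> <dbl>
+#> 1     1     0
+#> 2     2     0
+#> 3     3     0
+
+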

Laying out the data by column can make it hard to see how the rows are related, so an alternative is #chp-https://tibble.tidyverse.org/reference/tribble, short for transposed tibble, which lets you lay out your data row by row. #chp-https://tibble.tidyverse.org/reference/tribble is customized for data entry in code: column headings start with ~ and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form:

+
+
tribble(
+  ~x, ~y, ~z,
+  "h", 1, 0.08,
+  "m", 2, 0.83,
+  "g", 5, 0.60,
+)
+#> # A tibble: 3 × 3
+#>   x         y     z
+#>   <chr> <dbl> <dbl>
+#> 1 h         1  0.08
+#> 2 m         2  0.83
+#> 3 g         5  0.6
+
+

We’ll use #chp-https://tibble.tidyverse.org/reference/tibble and #chp-https://tibble.tidyverse.org/reference/tribble later in the book to construct small examples to demonstrate how various functions work.

+
+ +
+

+Summary

+

In this chapter, you’ve learned how to load CSV files with #chp-https://readr.tidyverse.org/reference/read_delim and to do your own data entry with #chp-https://tibble.tidyverse.org/reference/tibble and #chp-https://tibble.tidyverse.org/reference/tribble. You’ve learned how CSV files work, some of the problems you might encounter, and how to overcome them. We’ll come back to data import a few times in this book: #chp-databases will show you how to load data from databases, #chp-spreadsheets from Excel and googlesheets, #chp-rectangling from JSON, and #chp-webscraping from websites.

+

Now that you’re writing a substantial amount of R code, it’s time to learn more about organizing your code into files and directories. In the next chapter, you’ll learn all about the advantages of scripts and projects, and some of the many tools that they provide to make your life easier.

+ + +
+
diff --git a/oreilly/data-tidy.html b/oreilly/data-tidy.html new file mode 100644 index 0000000..0136726 --- /dev/null +++ b/oreilly/data-tidy.html @@ -0,0 +1,867 @@ +
+

Data tidying

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+
+

“Happy families are all alike; every unhappy family is unhappy in its own way.”
+— Leo Tolstoy

+
+
+

“Tidy datasets are all alike, but every messy dataset is messy in its own way.”
+— Hadley Wickham

+
+

In this chapter, you will learn a consistent way to organize your data in R using a system called tidy data. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.

+

In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the main tool you’ll use for tidying data: pivoting. Pivoting allows you to change the form of your data without changing any of the values. We’ll finish up with a discussion of usefully untidy data, and how you can create it if needed.

+ +
+

+Prerequisites

+

In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.

+
+
library(tidyverse)
+
+

From this chapter on, we’ll suppress the loading message from library(tidyverse).

+
+
+ +
+

+Tidy data

+

You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables: country, year, population, and cases of TB (tuberculosis), but each dataset organizes the values in a different way.

+ + +
+
table1
+#> # A tibble: 6 × 4
+#>   country      year  cases population
+#>   <chr>       <int>  <int>      <int>
+#> 1 Afghanistan  1999    745   19987071
+#> 2 Afghanistan  2000   2666   20595360
+#> 3 Brazil       1999  37737  172006362
+#> 4 Brazil       2000  80488  174504898
+#> 5 China        1999 212258 1272915272
+#> 6 China        2000 213766 1280428583
+table2
+#> # A tibble: 12 × 4
+#>   country      year type           count
+#>   <chr>       <int> <chr>          <int>
+#> 1 Afghanistan  1999 cases            745
+#> 2 Afghanistan  1999 population  19987071
+#> 3 Afghanistan  2000 cases           2666
+#> 4 Afghanistan  2000 population  20595360
+#> 5 Brazil       1999 cases          37737
+#> 6 Brazil       1999 population 172006362
+#> # … with 6 more rows
+table3
+#> # A tibble: 6 × 3
+#>   country      year rate             
+#> * <chr>       <int> <chr>            
+#> 1 Afghanistan  1999 745/19987071     
+#> 2 Afghanistan  2000 2666/20595360    
+#> 3 Brazil       1999 37737/172006362  
+#> 4 Brazil       2000 80488/174504898  
+#> 5 China        1999 212258/1272915272
+#> 6 China        2000 213766/1280428583
+
+# Spread across two tibbles
+table4a # cases
+#> # A tibble: 3 × 3
+#>   country     `1999` `2000`
+#> * <chr>        <int>  <int>
+#> 1 Afghanistan    745   2666
+#> 2 Brazil       37737  80488
+#> 3 China       212258 213766
+table4b # population
+#> # A tibble: 3 × 3
+#>   country         `1999`     `2000`
+#> * <chr>            <int>      <int>
+#> 1 Afghanistan   19987071   20595360
+#> 2 Brazil       172006362  174504898
+#> 3 China       1272915272 1280428583
+
+

These are all representations of the same underlying data, but they are not equally easy to use. One of them, table1, will be much easier to work with inside the tidyverse because it’s tidy.

+

There are three interrelated rules that make a dataset tidy:

+
  1. Each variable is a column; each column is a variable.
  2. +
  3. Each observation is a row; each row is an observation.
  4. +
  5. Each value is a cell; each cell is a single value.
  6. +

#fig-tidy-structure shows the rules visually.

+
+
+ +

Three panels, each representing a tidy data frame. The first panel shows that each variable is a column. The second panel shows that each observation is a row. The third panel shows that each value is a cell.

+
Figure 6.1: The following three rules make a dataset tidy: variables are columns, observations are rows, and values are cells.
+
+
+
+

Why ensure that your data is tidy? There are two main advantages:

+
  1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

  2. +
  3. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in #sec-mutate and #sec-summarize, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

  4. +

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. Here are a few small examples showing how you might work with table1.

+
+
# Compute rate per 10,000
+table1 |>
+  mutate(
+    rate = cases / population * 10000
+  )
+#> # A tibble: 6 × 5
+#>   country      year  cases population  rate
+#>   <chr>       <int>  <int>      <int> <dbl>
+#> 1 Afghanistan  1999    745   19987071 0.373
+#> 2 Afghanistan  2000   2666   20595360 1.29 
+#> 3 Brazil       1999  37737  172006362 2.19 
+#> 4 Brazil       2000  80488  174504898 4.61 
+#> 5 China        1999 212258 1272915272 1.67 
+#> 6 China        2000 213766 1280428583 1.67
+
+# Compute cases per year
+table1 |>
+  count(year, wt = cases)
+#> # A tibble: 2 × 2
+#>    year      n
+#>   <int>  <int>
+#> 1  1999 250740
+#> 2  2000 296920
+
+# Visualise changes over time
+ggplot(table1, aes(year, cases)) +
+  geom_line(aes(group = country), color = "grey50") +
+  geom_point(aes(color = country, shape = country)) +
+  scale_x_continuous(breaks = c(1999, 2000))
+
+

This figure shows the numbers of cases in 1999 and 2000 for Afghanistan, Brazil, and China, with year on the x-axis and number of cases on the y-axis. Each point on the plot represents the number of cases in a given country in a given year. The points for each country are differentiated from others by color and shape and connected with a line, resulting in three, non-parallel, non-intersecting lines. The numbers of cases in China are highest for both 1999 and 2000, with values above 200,000 for both years. The number of cases in Brazil is approximately 40,000 in 1999 and approximately 75,000 in 2000. The numbers of cases in Afghanistan are lowest for both 1999 and 2000, with values that appear to be very close to 0 on this scale.

+
+
+ +
+

+Exercises

+
  1. Using prose, describe how the variables and observations are organised in each of the sample tables.

  2. +
  3. +

    Sketch out the process you’d use to calculate the rate for table2 and table4a + table4b. You will need to perform four operations:

    +
    1. Extract the number of TB cases per country per year.
    2. +
    3. Extract the matching population per country per year.
    4. +
    5. Divide cases by population, and multiply by 10000.
    6. +
    7. Store back in the appropriate place.
    8. +

    You haven’t yet learned all the functions you’d need to actually perform these operations, but you should still be able to think through the transformations you’d need.

    +
  4. +
  5. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?

  6. +
+
+ +
+

+Pivoting

+

The principles of tidy data might seem so obvious that you wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most real data is untidy. There are two main reasons:

+
  1. Data is often organised to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.

  2. +
  3. Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

  4. +

This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data. Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.

+

tidyr provides two functions for pivoting data: #chp-https://tidyr.tidyverse.org/reference/pivot_longer, which makes datasets longer by increasing rows and reducing columns, and #chp-https://tidyr.tidyverse.org/reference/pivot_wider which makes datasets wider by increasing columns and reducing rows. The following sections work through the use of #chp-https://tidyr.tidyverse.org/reference/pivot_longer and #chp-https://tidyr.tidyverse.org/reference/pivot_wider to tackle a wide range of realistic datasets. These examples are drawn from #chp-https://tidyr.tidyverse.org/articles/pivot, which you should check out if you want to see more variations and more challenging problems.

+

Let’s dive in.

+ +
+

+Data in column names

+

The billboard dataset records the billboard rank of songs in the year 2000:

+
+
billboard
+#> # A tibble: 317 × 79
+#>   artist  track date.ent…¹   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9
+#>   <chr>   <chr> <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1 2 Pac   Baby… 2000-02-26    87    82    72    77    87    94    99    NA    NA
+#> 2 2Ge+her The … 2000-09-02    91    87    92    NA    NA    NA    NA    NA    NA
+#> 3 3 Door… Kryp… 2000-04-08    81    70    68    67    66    57    54    53    51
+#> 4 3 Door… Loser 2000-10-21    76    76    72    69    67    65    55    59    62
+#> 5 504 Bo… Wobb… 2000-04-15    57    34    25    17    17    31    36    49    53
+#> 6 98^0    Give… 2000-08-19    51    39    34    26    26    19     2     2     3
+#> # … with 311 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
+#> #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
+#> #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
+#> #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
+#> #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
+#> #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
+#> #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
+
+

In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week. Here, the column names are one variable (the week) and the cell values are another (the rank).

+

To tidy this data, we’ll use #chp-https://tidyr.tidyverse.org/reference/pivot_longer. After the data, there are three key arguments:

+
  • +cols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as #chp-https://dplyr.tidyverse.org/reference/select so here we could use !c(artist, track, date.entered) or starts_with("wk").
  • +
  • +names_to names the variable stored in the column names, here "week".
  • +
  • +values_to names the variable stored in the cell values, here "rank".
  • +

That gives the following call:

+
+
billboard |> 
+  pivot_longer(
+    cols = starts_with("wk"), 
+    names_to = "week", 
+    values_to = "rank"
+  )
+#> # A tibble: 24,092 × 5
+#>    artist track                   date.entered week   rank
+#>    <chr>  <chr>                   <date>       <chr> <dbl>
+#>  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
+#>  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
+#>  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
+#>  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
+#>  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
+#>  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
+#>  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
+#>  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
+#>  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
+#> 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
+#> # … with 24,082 more rows
+
+

What happens if a song is in the top 100 for less than 76 weeks? Take 2 Pac’s “Baby Don’t Cry”, for example. The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These NAs don’t really represent unknown observations; they’re forced to exist by the structure of the dataset (we’ll come back to this idea in #chp-missing-values), so we can ask #chp-https://tidyr.tidyverse.org/reference/pivot_longer to get rid of them by setting values_drop_na = TRUE:

+
+
billboard |> 
+  pivot_longer(
+    cols = starts_with("wk"), 
+    names_to = "week", 
+    values_to = "rank",
+    values_drop_na = TRUE
+  )
+#> # A tibble: 5,307 × 5
+#>   artist track                   date.entered week   rank
+#>   <chr>  <chr>                   <date>       <chr> <dbl>
+#> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
+#> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
+#> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
+#> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
+#> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
+#> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
+#> # … with 5,301 more rows
+
+

You might also wonder what happens if a song is in the top 100 for more than 76 weeks. We can’t tell from this data, but you might guess that additional columns wk77, wk78, … would be added to the dataset.

+

This data is now tidy, but we could make future computation a bit easier by converting week into a number using #chp-https://dplyr.tidyverse.org/reference/mutate and #chp-https://readr.tidyverse.org/reference/parse_number. #chp-https://readr.tidyverse.org/reference/parse_number is a handy function that will extract the first number from a string, ignoring all other text.
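
A quick demonstration of that behavior:

+
+
parse_number(c("wk1", "wk10", "$1,234"))
+#> [1]    1   10 1234
+
+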

+
+
billboard_tidy <- billboard |> 
+  pivot_longer(
+    cols = starts_with("wk"), 
+    names_to = "week", 
+    values_to = "rank",
+    values_drop_na = TRUE
+  ) |> 
+  mutate(
+    week = parse_number(week)
+  )
+billboard_tidy
+#> # A tibble: 5,307 × 5
+#>   artist track                   date.entered  week  rank
+#>   <chr>  <chr>                   <date>       <dbl> <dbl>
+#> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1    87
+#> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2    82
+#> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3    72
+#> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26       4    77
+#> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26       5    87
+#> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26       6    94
+#> # … with 5,301 more rows
+
+

Now we’re in a good position to look at how song ranks vary over time by drawing a plot. The code is shown below and the result is #fig-billboard-ranks.

+
+
billboard_tidy |> 
+  ggplot(aes(week, rank, group = track)) + 
+  geom_line(alpha = 1/3) + 
+  scale_y_reverse()
+
+ +

A line plot with week on the x-axis and rank on the y-axis, where each line represents a song. Most songs appear to start at a high rank, rapidly accelerate to a low rank, and then decay again. There are surprisingly few tracks in the region where week is >20 and rank is >50.

+
Figure 6.2: A line plot showing how the rank of a song changes over time.
+
+
+
+
+ +
+

+How does pivoting work?

+

Now that you’ve seen what pivoting can do for you, it’s worth taking a little time to gain some intuition about what it does to the data. Let’s start with a very simple dataset to make it easier to see what’s happening:

+
+
df <- tribble(
+  ~var, ~col1, ~col2,
+   "A",     1,     2,
+   "B",     3,     4,
+   "C",     5,     6
+)
+
+

Here we’ll say there are three variables: var (already a column), name (currently spread across the column names), and value (the cell values). So we can tidy it with:

+
+
df |> 
+  pivot_longer(
+    cols = col1:col2,
+    names_to = "names",
+    values_to = "values"
+  )
+#> # A tibble: 6 × 3
+#>   var   names values
+#>   <chr> <chr>  <dbl>
+#> 1 A     col1       1
+#> 2 A     col2       2
+#> 3 B     col1       3
+#> 4 B     col2       4
+#> 5 C     col1       5
+#> 6 C     col2       6
+
+

How does this transformation take place? It’s easier to see if we take it component by component. Columns that are already variables need to be repeated, once for each column in cols, as shown in #fig-pivot-variables.

+
+
+ +

A diagram showing how `pivot_longer()` transforms a simple dataset, using color to highlight how the values in the `var` column ("A", "B", "C") are each repeated twice in the output because there are two columns being pivoted ("col1" and "col2").

+
Figure 6.3: Columns that are already variables need to be repeated, once for each column that is pivoted.
+
+
+
+

The column names become values in a new variable, whose name is given by names_to, as shown in #fig-pivot-names. They need to be repeated once for each row in the original dataset.

+
+
+ +

A diagram showing how `pivot_longer()` transforms a simple data set, using color to highlight how column names ("col1" and "col2") become the values in a new `names` column. They are repeated three times because there were three rows in the input.

+
Figure 6.4: The column names of pivoted columns become a new column.
+
+
+
+

The cell values also become values in a new variable, with a name given by values_to. They are unwound row by row. #fig-pivot-values illustrates the process.

+
+
+ +

A diagram showing how `pivot_longer()` transforms data, using color to highlight how the cell values (the numbers 1 to 6) become the values in a new `value` column. They are unwound row-by-row, so the original rows (1,2), then (3,4), then (5,6), become a column running from 1 to 6.

+
Figure 6.5: The number of values is preserved (not repeated), but unwound row-by-row.
+
+
+
+
+ +
+

+Many variables in column names

+

A more challenging situation occurs when you have multiple variables crammed into the column names. For example, take the who2 dataset:

+
+
who2
+#> # A tibble: 7,240 × 58
+#>   country   year sp_m_…¹ sp_m_…² sp_m_…³ sp_m_…⁴ sp_m_…⁵ sp_m_…⁶ sp_m_65 sp_f_…⁷
+#>   <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
+#> 1 Afghani…  1980      NA      NA      NA      NA      NA      NA      NA      NA
+#> 2 Afghani…  1981      NA      NA      NA      NA      NA      NA      NA      NA
+#> 3 Afghani…  1982      NA      NA      NA      NA      NA      NA      NA      NA
+#> 4 Afghani…  1983      NA      NA      NA      NA      NA      NA      NA      NA
+#> 5 Afghani…  1984      NA      NA      NA      NA      NA      NA      NA      NA
+#> 6 Afghani…  1985      NA      NA      NA      NA      NA      NA      NA      NA
+#> # … with 7,234 more rows, 48 more variables: sp_f_1524 <dbl>, sp_f_2534 <dbl>,
+#> #   sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>, sp_f_65 <dbl>,
+#> #   sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>, sn_m_3544 <dbl>,
+#> #   sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>, sn_f_014 <dbl>,
+#> #   sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>, sn_f_4554 <dbl>,
+#> #   sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, ep_m_1524 <dbl>,
+#> #   ep_m_2534 <dbl>, ep_m_3544 <dbl>, ep_m_4554 <dbl>, ep_m_5564 <dbl>, …
+
+

This dataset records information about tuberculosis data collected by the WHO. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. The first piece, sp/rel/ep, describes the method used for the diagnosis, the second piece, m/f, is the gender, and the third piece, 014/1524/2534/3544/4554/5564/65, is the age range.

+

So in this case we have six variables: two variables are already columns, three variables are contained in the column name, and one variable is in the cell values. This requires two changes to our call to #chp-https://tidyr.tidyverse.org/reference/pivot_longer: names_to gets a vector of column names and names_sep describes how to split the variable name up into pieces:

+
+
who2 |> 
+  pivot_longer(
+    cols = !(country:year),
+    names_to = c("diagnosis", "gender", "age"), 
+    names_sep = "_",
+    values_to = "count"
+  )
+#> # A tibble: 405,440 × 6
+#>   country      year diagnosis gender age   count
+#>   <chr>       <dbl> <chr>     <chr>  <chr> <dbl>
+#> 1 Afghanistan  1980 sp        m      014      NA
+#> 2 Afghanistan  1980 sp        m      1524     NA
+#> 3 Afghanistan  1980 sp        m      2534     NA
+#> 4 Afghanistan  1980 sp        m      3544     NA
+#> 5 Afghanistan  1980 sp        m      4554     NA
+#> 6 Afghanistan  1980 sp        m      5564     NA
+#> # … with 405,434 more rows
+
+

An alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in #chp-regexps.

+
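
As a preview, here’s a minimal sketch with a made-up dataset whose column names carry a constant prefix; the parenthesised groups in the regular expression pick out the pieces to keep:

+
+
df <- tribble(
+  ~country,      ~new_sp_m014, ~new_ep_f1524,
+  "Afghanistan",           10,            20
+)
+df |> 
+  pivot_longer(
+    cols = !country,
+    names_to = c("diagnosis", "gender", "age"),
+    names_pattern = "new_(.*)_(.)(.*)",
+    values_to = "count"
+  )
+#> # A tibble: 2 × 5
+#>   country     diagnosis gender age   count
+#>   <chr>       <chr>     <chr>  <chr> <dbl>
+#> 1 Afghanistan sp        m      014      10
+#> 2 Afghanistan ep        f      1524     20
+
+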

Conceptually, this is only a minor variation on the simpler case you’ve already seen. #fig-pivot-multiple-names shows the basic idea: now, instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it happens in a single step because that gives better performance.

+
+
+ +

A diagram that uses color to illustrate how supplying `names_sep` and multiple `names_to` creates multiple variables in the output. The input has variable names "x_1" and "y_2" which are split up by "_" to create name and number columns in the output. This is similar to the case with a single `names_to`, but what would have been a single output variable is now separated into multiple variables.

+
Figure 6.6: Pivoting with many variables in the column names means that each column name now fills in values in multiple output columns.
+
+
+
+
+ +
+

+Data and variable names in the column headers

+

The next step up in complexity is when the column names include a mix of variable values and variable names. For example, take the household dataset:

+
+
household
+#> # A tibble: 5 × 5
+#>   family dob_child1 dob_child2 name_child1 name_child2
+#>    <int> <date>     <date>     <chr>       <chr>      
+#> 1      1 1998-11-26 2000-01-29 Susan       Jose       
+#> 2      2 1996-06-22 NA         Mark        <NA>       
+#> 3      3 2002-07-11 2004-04-05 Sam         Seth       
+#> 4      4 2004-10-10 2009-08-27 Craig       Khai       
+#> 5      5 2000-12-05 2005-02-28 Parker      Gracie
+
+

This dataset contains data about five families, with the names and dates of birth of up to two children. The new challenge in this dataset is that the column names contain the names of two variables (dob, name) and the values of another (child, with values 1 and 2). To solve this problem we again need to supply a vector to names_to but this time we use the special ".value" sentinel. This overrides the usual values_to argument to use the first component of the pivoted column name as a variable name in the output.

+
+
household |> 
+  pivot_longer(
+    cols = !family, 
+    names_to = c(".value", "child"), 
+    names_sep = "_", 
+    values_drop_na = TRUE
+  ) |> 
+  mutate(
+    child = parse_number(child)
+  )
+#> # A tibble: 9 × 4
+#>   family child dob        name 
+#>    <int> <dbl> <date>     <chr>
+#> 1      1     1 1998-11-26 Susan
+#> 2      1     2 2000-01-29 Jose 
+#> 3      2     1 1996-06-22 Mark 
+#> 4      3     1 2002-07-11 Sam  
+#> 5      3     2 2004-04-05 Seth 
+#> 6      4     1 2004-10-10 Craig
+#> # … with 3 more rows
+
+

We again use values_drop_na = TRUE, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and #chp-https://readr.tidyverse.org/reference/parse_number to convert (e.g.) child1 into 1.

+

#fig-pivot-names-and-values illustrates the basic idea with a simpler example. When you use ".value" in names_to, the column names in the input contribute to both values and variable names in the output.

+
+
+ +

A diagram that uses color to illustrate how the special ".value" sentinel works. The input has names "x_1", "x_2", "y_1", and "y_2", and we want to use the first component ("x", "y") as a variable name and the second ("1", "2") as the value for a new "id" column.

+
Figure 6.7: Pivoting with names_to = c(".value", "id") splits the column names into two components: the first part determines the output column name (x or y), and the second part determines the value of the id column.
+
+
+
+
+ +
+

+Widening data

+

So far we’ve used #chp-https://tidyr.tidyverse.org/reference/pivot_longer to solve the common class of problems where values have ended up in column names. Next we’ll pivot (HA HA) to #chp-https://tidyr.tidyverse.org/reference/pivot_wider, which helps when one observation is spread across multiple rows. This seems to arise less commonly in the wild, but it does seem to crop up a lot when dealing with governmental data.

+

We’ll start by looking at cms_patient_experience, a dataset from the Centers of Medicare and Medicaid Services that collects data about patient experiences:

+
+
cms_patient_experience
+#> # A tibble: 500 × 5
+#>   org_pac_id org_nm                     measure_cd   measure_title       prf_r…¹
+#>   <chr>      <chr>                      <chr>        <chr>                 <dbl>
+#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS SSM…      63
+#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS SSM…      87
+#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS SSM…      86
+#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS SSM…      57
+#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS SSM…      85
+#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS SSM…      24
+#> # … with 494 more rows, and abbreviated variable name ¹​prf_rate
+
+

An observation is an organisation, but each organisation is spread across six rows, with one row for each variable, or measure. We can see the complete set of values for measure_cd and measure_title by using #chp-https://dplyr.tidyverse.org/reference/distinct:

+
+
cms_patient_experience |> 
+  distinct(measure_cd, measure_title)
+#> # A tibble: 6 × 2
+#>   measure_cd   measure_title                                                    
+#>   <chr>        <chr>                                                            
+#> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and Infor…
+#> 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate               
+#> 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider                 
+#> 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education               
+#> 5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff           
+#> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources
+
+

Neither of these columns will make particularly great variable names: measure_cd doesn’t hint at the meaning of the variable and measure_title is a long sentence containing spaces. We’ll use measure_cd for now, but in a real analysis you might want to create your own variable names that are both short and meaningful.

+
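
For example, you might recode the measures to short names of your own devising (the names below are made up; we stick with the original codes in what follows):

+
+
cms_patient_experience |> 
+  mutate(
+    measure_cd = recode(measure_cd,
+      CAHPS_GRP_1 = "timely_care",   # hypothetical short name
+      CAHPS_GRP_2 = "communication"  # hypothetical short name
+      # ...and so on for the remaining measures
+    )
+  )
+
+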

#chp-https://tidyr.tidyverse.org/reference/pivot_wider has the opposite interface to #chp-https://tidyr.tidyverse.org/reference/pivot_longer: we need to provide the existing columns that define the values (values_from) and the column name (names_from):

+
+
cms_patient_experience |> 
+  pivot_wider(
+    names_from = measure_cd,
+    values_from = prf_rate
+  )
+#> # A tibble: 500 × 9
+#>   org_pac_id org_nm      measu…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶ CAHPS…⁷
+#>   <chr>      <chr>       <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
+#> 1 0446157747 USC CARE M… CAHPS …      63      NA      NA      NA      NA      NA
+#> 2 0446157747 USC CARE M… CAHPS …      NA      87      NA      NA      NA      NA
+#> 3 0446157747 USC CARE M… CAHPS …      NA      NA      86      NA      NA      NA
+#> 4 0446157747 USC CARE M… CAHPS …      NA      NA      NA      57      NA      NA
+#> 5 0446157747 USC CARE M… CAHPS …      NA      NA      NA      NA      85      NA
+#> 6 0446157747 USC CARE M… CAHPS …      NA      NA      NA      NA      NA      24
+#> # … with 494 more rows, and abbreviated variable names ¹​measure_title,
+#> #   ²​CAHPS_GRP_1, ³​CAHPS_GRP_2, ⁴​CAHPS_GRP_3, ⁵​CAHPS_GRP_5, ⁶​CAHPS_GRP_8,
+#> #   ⁷​CAHPS_GRP_12
+
+

The output doesn’t look quite right; we still seem to have multiple rows for each organization. That’s because, by default, #chp-https://tidyr.tidyverse.org/reference/pivot_wider will attempt to preserve all the existing columns, including measure_title, which has six distinct observations for each organisation. To fix this problem we need to tell #chp-https://tidyr.tidyverse.org/reference/pivot_wider which columns identify each row; in this case those are the variables starting with "org":

+
+
cms_patient_experience |> 
+  pivot_wider(
+    id_cols = starts_with("org"),
+    names_from = measure_cd,
+    values_from = prf_rate
+  )
+#> # A tibble: 95 × 8
+#>   org_pac_id org_nm              CAHPS…¹ CAHPS…² CAHPS…³ CAHPS…⁴ CAHPS…⁵ CAHPS…⁶
+#>   <chr>      <chr>                 <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
+#> 1 0446157747 USC CARE MEDICAL G…      63      87      86      57      85      24
+#> 2 0446162697 ASSOCIATION OF UNI…      59      85      83      63      88      22
+#> 3 0547164295 BEAVER MEDICAL GRO…      49      NA      75      44      73      12
+#> 4 0749333730 CAPE PHYSICIANS AS…      67      84      85      65      82      24
+#> 5 0840104360 ALLIANCE PHYSICIAN…      66      87      87      64      87      28
+#> 6 0840109864 REX HOSPITAL INC         73      87      84      67      91      30
+#> # … with 89 more rows, and abbreviated variable names ¹​CAHPS_GRP_1,
+#> #   ²​CAHPS_GRP_2, ³​CAHPS_GRP_3, ⁴​CAHPS_GRP_5, ⁵​CAHPS_GRP_8, ⁶​CAHPS_GRP_12
+
+

This gives us the output that we’re looking for.

+
+ +
+

+How does pivot_wider() work?

+

To understand how #chp-https://tidyr.tidyverse.org/reference/pivot_wider works, let’s again start with a very simple dataset:

+
+
df <- tribble(
+  ~id, ~name, ~value,
+  "A", "x", 1,
+  "B", "y", 2,
+  "B", "x", 3, 
+  "A", "y", 4,
+  "A", "z", 5,
+)
+
+

We’ll take the values from the value column and the names from the name column:

+
+
df |> 
+  pivot_wider(
+    names_from = name,
+    values_from = value
+  )
+#> # A tibble: 2 × 4
+#>   id        x     y     z
+#>   <chr> <dbl> <dbl> <dbl>
+#> 1 A         1     4     5
+#> 2 B         3     2    NA
+
+

The connection between the position of the row in the input and the cell in the output is weaker than in #chp-https://tidyr.tidyverse.org/reference/pivot_longer because the rows and columns in the output are primarily determined by the values of variables, not their locations.

+

To begin the process, #chp-https://tidyr.tidyverse.org/reference/pivot_wider needs to first figure out what will go in the rows and columns. Finding the new column names is easy: it’s just the unique values of name.

+
+
df |> 
+  distinct(name)
+#> # A tibble: 3 × 1
+#>   name 
+#>   <chr>
+#> 1 x    
+#> 2 y    
+#> 3 z
+
+

By default, the rows in the output are formed by all the variables that aren’t going into the names or values. These are called the id_cols.

+
+
df |> 
+  select(-name, -value) |> 
+  distinct()
+#> # A tibble: 2 × 1
+#>   id   
+#>   <chr>
+#> 1 A    
+#> 2 B
+
+

#chp-https://tidyr.tidyverse.org/reference/pivot_wider then combines these results to generate an empty data frame:

+
+
df |> 
+  select(-name, -value) |> 
+  distinct() |> 
+  mutate(x = NA, y = NA, z = NA)
+#> # A tibble: 2 × 4
+#>   id    x     y     z    
+#>   <chr> <lgl> <lgl> <lgl>
+#> 1 A     NA    NA    NA   
+#> 2 B     NA    NA    NA
+
+

It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input, as there’s no entry for id “B” and name “z”, so that cell remains missing. We’ll come back to this idea that #chp-https://tidyr.tidyverse.org/reference/pivot_wider can “make” missing values in #chp-missing-values.

+

You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and name “x”:

+
+
df <- tribble(
+  ~id, ~name, ~value,
+  "A", "x", 1,
+  "A", "x", 2,
+  "A", "y", 3,
+  "B", "x", 4, 
+  "B", "y", 5, 
+)
+
+

If we attempt to pivot this we get an output that contains list-columns, which you’ll learn more about in #chp-rectangling:

+
+
df |> pivot_wider(
+  names_from = name,
+  values_from = value
+)
+#> Warning: Values from `value` are not uniquely identified; output will contain list-cols.
+#> • Use `values_fn = list` to suppress this warning.
+#> • Use `values_fn = {summary_fun}` to summarise duplicates.
+#> • Use the following dplyr code to identify duplicates.
+#>   {data} %>%
+#>   dplyr::group_by(id, name) %>%
+#>   dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
+#>   dplyr::filter(n > 1L)
+#> # A tibble: 2 × 3
+#>   id    x         y        
+#>   <chr> <list>    <list>   
+#> 1 A     <dbl [2]> <dbl [1]>
+#> 2 B     <dbl [1]> <dbl [1]>
+
+

Since you don’t know how to work with this sort of data yet, you’ll want to follow the hint in the warning to figure out where the problem is:

+
+
df |> 
+  group_by(id, name) |> 
+  summarize(n = n(), .groups = "drop") |> 
+  filter(n > 1L) 
+#> # A tibble: 1 × 3
+#>   id    name      n
+#>   <chr> <chr> <int>
+#> 1 A     x         2
+
+

It’s then up to you to figure out what’s gone wrong with your data and either repair the underlying damage or use your grouping and summarizing skills to ensure that each combination of row and column values only has a single row.

+
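
Alternatively, if duplicates are expected and a summary is acceptable, you can collapse them as part of the pivot with values_fn (here mean, one of the options the warning mentions):

+
+
df |> pivot_wider(
+  names_from = name,
+  values_from = value,
+  values_fn = mean  # average duplicate values instead of making list-columns
+)
+#> # A tibble: 2 × 3
+#>   id        x     y
+#>   <chr> <dbl> <dbl>
+#> 1 A       1.5     3
+#> 2 B       4       5
+
+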
+
+ +
+

+Untidy data

+

While #chp-https://tidyr.tidyverse.org/reference/pivot_wider is occasionally useful for making tidy data, its real strength is making untidy data. While that sounds like a bad thing, untidy isn’t a pejorative term: there are many untidy data structures that are extremely useful. Tidy data is a great starting point for most analyses but it’s not the only data format you’ll ever need.

+

The following sections will show a few examples of #chp-https://tidyr.tidyverse.org/reference/pivot_wider making usefully untidy data for presenting data to other humans, for input to multivariate statistics algorithms, and for pragmatically solving data manipulation challenges.

+ +
+

+Presenting data to humans

+

As you’ve seen, #chp-https://dplyr.tidyverse.org/reference/count produces tidy data: it makes one row for each group, with one column for each grouping variable, and one column for the number of observations.

+
+
diamonds |> 
+  count(clarity, color)
+#> # A tibble: 56 × 3
+#>   clarity color     n
+#>   <ord>   <ord> <int>
+#> 1 I1      D        42
+#> 2 I1      E       102
+#> 3 I1      F       143
+#> 4 I1      G       150
+#> 5 I1      H       162
+#> 6 I1      I        92
+#> # … with 50 more rows
+
+

This is easy to visualize or summarize further, but it’s not the most compact form for display. You can use #chp-https://tidyr.tidyverse.org/reference/pivot_wider to create a form more suitable for display to other humans:

+
+
diamonds |> 
+  count(clarity, color) |> 
+  pivot_wider(
+    names_from = color, 
+    values_from = n
+  )
+#> # A tibble: 8 × 8
+#>   clarity     D     E     F     G     H     I     J
+#>   <ord>   <int> <int> <int> <int> <int> <int> <int>
+#> 1 I1         42   102   143   150   162    92    50
+#> 2 SI2      1370  1713  1609  1548  1563   912   479
+#> 3 SI1      2083  2426  2131  1976  2275  1424   750
+#> 4 VS2      1697  2470  2201  2347  1643  1169   731
+#> 5 VS1       705  1281  1364  2148  1169   962   542
+#> 6 VVS2      553   991   975  1443   608   365   131
+#> # … with 2 more rows
+
+

This display also makes it easy to compare in two directions, horizontally and vertically, much like #chp-https://ggplot2.tidyverse.org/reference/facet_grid.

+

#chp-https://tidyr.tidyverse.org/reference/pivot_wider can be great for quickly sketching out a table. But for real presentation tables, we highly suggest learning a package like #chp-https://gt.rstudio. gt is similar to ggplot2 in that it provides an extremely powerful grammar for laying out tables. It takes some work to learn but the payoff is the ability to make just about any table you can imagine.

+
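
As a small taste (a sketch only; gt can do far more), you could pipe the widened counts into gt():

+
+
library(gt)
+
+diamonds |> 
+  count(clarity, color) |> 
+  pivot_wider(names_from = color, values_from = n) |> 
+  gt() |> 
+  tab_header(title = "Number of diamonds by clarity and color")
+
+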
+ +
+

+Multivariate statistics

+

Most classical multivariate statistical methods (like dimension reduction and clustering) require your data in matrix form, where each column is a time point, or a location, or a gene, or a species, but definitely not a variable. Sometimes these formats have substantial performance or space advantages, or sometimes they’re just necessary to get closer to the underlying matrix mathematics.

+

We’re not going to cover these statistical methods here, but it is useful to know how to get your data into the form that they need. For example, let’s imagine you wanted to cluster the gapminder data to find countries that had similar progression of gdpPercap over time. To do this, we need one row for each country and one column for each year:

+
+
library(gapminder)
+
+col_year <- gapminder |> 
+  mutate(gdpPercap = log10(gdpPercap)) |> 
+  pivot_wider(
+    id_cols = country, 
+    names_from = year,
+    values_from = gdpPercap
+  ) 
+col_year
+#> # A tibble: 142 × 13
+#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
+#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
+#> 1 Afghani…   2.89   2.91   2.93   2.92   2.87   2.90   2.99   2.93   2.81   2.80
+#> 2 Albania    3.20   3.29   3.36   3.44   3.52   3.55   3.56   3.57   3.40   3.50
+#> 3 Algeria    3.39   3.48   3.41   3.51   3.62   3.69   3.76   3.75   3.70   3.68
+#> 4 Angola     3.55   3.58   3.63   3.74   3.74   3.48   3.44   3.39   3.42   3.36
+#> 5 Argenti…   3.77   3.84   3.85   3.91   3.98   4.00   3.95   3.96   3.97   4.04
+#> 6 Austral…   4.00   4.04   4.09   4.16   4.23   4.26   4.29   4.34   4.37   4.43
+#> # … with 136 more rows, and 2 more variables: `2002` <dbl>, `2007` <dbl>
+
+

#chp-https://tidyr.tidyverse.org/reference/pivot_wider produces a tibble where each row is labelled by the country variable. But most classic statistical algorithms don’t want the identifier as an explicit variable; they want it as a row name. We can turn the country variable into row names with column_to_rownames():

+
+
col_year <- col_year |> 
+  column_to_rownames("country") 
+
+head(col_year)
+#>                 1952     1957     1962     1967     1972     1977     1982
+#> Afghanistan 2.891786 2.914265 2.931000 2.922309 2.869221 2.895485 2.990344
+#> Albania     3.204407 3.288313 3.364155 3.440940 3.520277 3.548144 3.560012
+#> Algeria     3.388990 3.479140 3.406679 3.511481 3.621453 3.691118 3.759302
+#> Angola      3.546618 3.582965 3.630354 3.742157 3.738248 3.478371 3.440429
+#> Argentina   3.771684 3.836125 3.853282 3.905955 3.975112 4.003419 3.954141
+#> Australia   4.001716 4.039400 4.086973 4.162150 4.225015 4.263262 4.289522
+#>                 1987     1992     1997     2002     2007
+#> Afghanistan 2.930641 2.812473 2.803007 2.861376 2.988818
+#> Albania     3.572748 3.397495 3.504206 3.663155 3.773569
+#> Algeria     3.754452 3.700982 3.680996 3.723295 3.794025
+#> Angola      3.385644 3.419600 3.357390 3.442995 3.680991
+#> Argentina   3.960931 3.968876 4.040099 3.944366 4.106510
+#> Australia   4.340224 4.369675 4.431331 4.486965 4.537005
+
+

This makes a data frame, because tibbles don’t support row names (tibbles don’t use row names because they only work for a subset of important cases: when observations can be identified by a single character vector).

+

We’re now ready to cluster with (e.g.) #chp-https://rdrr.io/r/stats/kmeans:

+
+
cluster <- stats::kmeans(col_year, centers = 6)
+
+

Extracting the data out of this object into a form you can work with is a challenge you’ll need to come back to later in the book, once you’ve learned more about lists. But for now, you can get the clustering membership out with this code:

+
+
cluster_id <- cluster$cluster |> 
+  enframe() |> 
+  rename(country = name, cluster_id = value)
+cluster_id
+#> # A tibble: 142 × 2
+#>   country     cluster_id
+#>   <chr>            <int>
+#> 1 Afghanistan          4
+#> 2 Albania              2
+#> 3 Algeria              6
+#> 4 Angola               2
+#> 5 Argentina            5
+#> 6 Australia            1
+#> # … with 136 more rows
+
+

You could then combine this back with the original data using one of the joins you’ll learn about in #chp-joins.

+
+
gapminder |> left_join(cluster_id)
+#> Joining with `by = join_by(country)`
+#> # A tibble: 1,704 × 7
+#>   country     continent  year lifeExp      pop gdpPercap cluster_id
+#>   <chr>       <fct>     <int>   <dbl>    <int>     <dbl>      <int>
+#> 1 Afghanistan Asia       1952    28.8  8425333      779.          4
+#> 2 Afghanistan Asia       1957    30.3  9240934      821.          4
+#> 3 Afghanistan Asia       1962    32.0 10267083      853.          4
+#> 4 Afghanistan Asia       1967    34.0 11537966      836.          4
+#> 5 Afghanistan Asia       1972    36.1 13079460      740.          4
+#> 6 Afghanistan Asia       1977    38.4 14880372      786.          4
+#> # … with 1,698 more rows
+
+
+ +
+

+Pragmatic computation

+

Sometimes it’s just easier to answer a question using untidy data. For example, if you’re interested in just the total number of missing values in cms_patient_experience, it’s easier to work with the untidy form:

+
+
cms_patient_experience |> 
+  group_by(org_pac_id) |> 
+  summarize(
+    n_miss = sum(is.na(prf_rate)),
+    n = n(),
+  )
+#> # A tibble: 95 × 3
+#>   org_pac_id n_miss     n
+#>   <chr>       <int> <int>
+#> 1 0446157747      0     6
+#> 2 0446162697      0     6
+#> 3 0547164295      1     6
+#> 4 0749333730      0     6
+#> 5 0840104360      0     6
+#> 6 0840109864      0     6
+#> # … with 89 more rows
+
+

This is partly a reflection of our definition of tidy data, where we said tidy data has one variable in each column, but we didn’t actually define what a variable is (and it’s surprisingly hard to do so). It’s totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.

+

So if you’re stuck figuring out how to do some computation, maybe it’s time to switch up the organisation of your data. For computations involving a fixed number of values (like computing differences or ratios), it’s usually easier if the data is in columns; for those with a variable number of values (like sums or means) it’s usually easier in rows. Don’t be afraid to untidy, transform, and re-tidy if needed.

+

Let’s explore this idea by looking at cms_patient_care, which has a similar structure to cms_patient_experience:

+
+
cms_patient_care
+#> # A tibble: 252 × 5
+#>   ccn    facility_name   measure_abbr      score type       
+#>   <chr>  <chr>           <chr>             <dbl> <chr>      
+#> 1 011500 BAPTIST HOSPICE beliefs_addressed 202   denominator
+#> 2 011500 BAPTIST HOSPICE beliefs_addressed 100   observed   
+#> 3 011500 BAPTIST HOSPICE composite_process 202   denominator
+#> 4 011500 BAPTIST HOSPICE composite_process  88.1 observed   
+#> 5 011500 BAPTIST HOSPICE dyspena_treatment 110   denominator
+#> 6 011500 BAPTIST HOSPICE dyspena_treatment  99.1 observed   
+#> # … with 246 more rows
+
+

It contains information about 9 measures (beliefs_addressed, composite_process, dyspena_treatment, …) on 14 different facilities (identified by ccn with a name given by facility_name). Compared to cms_patient_experience, however, each measurement is recorded in two rows with a score, the percentage of patients who answered yes to the survey question, and a denominator, the number of patients that the question applies to. Depending on what you want to do next, you may find any of the following three structures useful:

+
  • +

    If you want to compute the number of patients that answered yes to the question, you may pivot type into the columns:

    +
    +
    cms_patient_care |> 
    +  pivot_wider(
    +    names_from = type,
    +    values_from = score
    +  ) |> 
    +  mutate(
    +    numerator = round(observed / 100 * denominator)
    +  )
    +#> # A tibble: 126 × 6
    +#>   ccn    facility_name   measure_abbr      denominator observed numerator
    +#>   <chr>  <chr>           <chr>                   <dbl>    <dbl>     <dbl>
    +#> 1 011500 BAPTIST HOSPICE beliefs_addressed         202    100         202
    +#> 2 011500 BAPTIST HOSPICE composite_process         202     88.1       178
    +#> 3 011500 BAPTIST HOSPICE dyspena_treatment         110     99.1       109
    +#> 4 011500 BAPTIST HOSPICE dyspnea_screening         202    100         202
    +#> 5 011500 BAPTIST HOSPICE opioid_bowel               61    100          61
    +#> 6 011500 BAPTIST HOSPICE pain_assessment           107    100         107
    +#> # … with 120 more rows
    +
    +
  • +
  • +

    If you want to display the distribution of each metric, you may keep the data as is so you can facet by measure_abbr.

    +
    +
    cms_patient_care |> 
    +  filter(type == "observed") |> 
    +  ggplot(aes(score)) + 
    +  geom_histogram(binwidth = 2) + 
    +  facet_wrap(vars(measure_abbr))
    +#> Warning: Removed 1 rows containing non-finite values (`stat_bin()`).
    +
    +
  • +
  • +

    If you want to explore how different metrics are related, you may put the measure names in the columns so you can compare them in scatterplots.

    +
    +
    cms_patient_care |> 
    +  filter(type == "observed") |> 
    +  select(-type) |> 
    +  pivot_wider(
    +    names_from = measure_abbr,
    +    values_from = score
    +  ) |> 
    +  ggplot(aes(dyspnea_screening, dyspena_treatment)) + 
    +  geom_point() + 
    +  coord_equal()
    +
    +
  • +
+
+ +
+

+Summary

+

In this chapter you learned about tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions: the main challenge is getting your data from whatever structure you receive it in into a tidy format. To that end, you learned about #chp-https://tidyr.tidyverse.org/reference/pivot_longer and #chp-https://tidyr.tidyverse.org/reference/pivot_wider, which allow you to tidy up many untidy datasets. Of course, tidy data can’t solve every problem, so we also showed you some places where you might want to deliberately untidy your data in order to present it to humans, feed it into statistical models, or just pragmatically get shit done. If you particularly enjoyed this chapter, you can learn more about the history and theoretical underpinnings in the #chp-https://www.jstatsoft.org/article/view/v059i10 paper published in the Journal of Statistical Software.

+

In the next chapter, we’ll pivot back to workflow to discuss the importance of code style, keeping your code “tidy” (ha!) in order to make it easy for you and others to read and understand your code.

+ + +
+
diff --git a/oreilly/data-transform.html b/oreilly/data-transform.html new file mode 100644 index 0000000..fb21fdd --- /dev/null +++ b/oreilly/data-transform.html @@ -0,0 +1,890 @@ +
+

Data transformation

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

Visualisation is an important tool for generating insight, but it’s rare that you get the data in exactly the right form you need for it. Often you’ll need to create some new variables or summaries to see the most important patterns, or maybe you just want to rename the variables or reorder the observations to make the data a little easier to work with. You’ll learn how to do all that (and more!) in this chapter, which will introduce you to data transformation using the dplyr package and a new dataset on flights that departed New York City in 2013.

+

The goal of this chapter is to give you an overview of all the key tools for transforming a data frame. We’ll come back to these functions in more detail in later chapters, as we start to dig into specific types of data (e.g. numbers, strings, dates).

+ +
+

+Prerequisites

+

In this chapter we’ll focus on the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

+
+
library(nycflights13)
+library(tidyverse)
+#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+#> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
+#> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
+#> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
+#> ✔ readr   2.1.3             ✔ forcats 0.5.2        
+#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ✖ dplyr::filter() masks stats::filter()
+#> ✖ dplyr::lag()    masks stats::lag()
+
+

Take careful note of the conflicts message that’s printed when you load the tidyverse. It tells you that dplyr overwrites some functions in base R. If you want to use the base version of these functions after loading dplyr, you’ll need to use their full names: #chp-https://rdrr.io/r/stats/filter and #chp-https://rdrr.io/r/stats/lag. So far we’ve mostly ignored which package a function comes from because most of the time it doesn’t matter. However, knowing the package can help you find help and find related functions, so when we need to be precise about which package a function comes from, we’ll use the same syntax as R: packagename::functionname().

+
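
For example, a quick sketch using two time series that ship with base R (presidents and ldeaths), calling the masked base functions by their full names:

+
+
stats::filter(presidents, rep(1/3, 3))  # base R's filter(): a moving average
+stats::lag(ldeaths, 1)                  # base R's lag(): a shifted time series
+
+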
+ +
+

+nycflights13

+

To explore the basic dplyr verbs, we’re going to use #chp-https://rdrr.io/pkg/nycflights13/man/flights. This dataset contains all 336,776 flights that departed from New York City in 2013. The data comes from the US #chp-http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0, and is documented in #chp-https://rdrr.io/pkg/nycflights13/man/flights.

+
+
flights
+#> # A tibble: 336,776 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

If you’ve used R before, you might notice that this data frame prints a little differently to other data frames you’ve seen. That’s because it’s a tibble, a special type of data frame used by the tidyverse to avoid some common gotchas. The most important difference is the way it prints: tibbles are designed for large datasets, so they only show the first few rows and only the columns that fit on one screen. To see everything, you can use print(flights, width = Inf) to show all columns in the console, but it’s generally more convenient to instead use View(flights) to open the dataset in the scrollable RStudio viewer.

+
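
For example, either of the calls named above (output not shown here, since it’s wide):

+
+
print(flights, width = Inf)  # print every column in the console
+View(flights)                # open the scrollable viewer (in RStudio)
+
+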

You might have noticed the short abbreviations that follow each column name. These tell you the type of each variable: <int> is short for integer, <dbl> is short for double (aka real numbers), <chr> for character (aka strings), and <dttm> for date-time. These are important because the operations you can perform on a column depend so much on its “type”, and these types are used to organize the chapters in the next section of the book.

+
+ +
+

+dplyr basics

+

You’re about to learn the primary dplyr verbs which will allow you to solve the vast majority of your data manipulation challenges. But before we discuss their individual differences, it’s worth stating what they have in common:

+
  1. The first argument is always a data frame.

  2. +
  3. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  4. +
  5. The result is always a new data frame.

  6. +

Because the first argument is a data frame and the output is a data frame, dplyr verbs work well with the pipe, |>. The pipe takes the thing on its left and passes it along to the function on its right so that x |> f(y) is equivalent to f(x, y), and x |> f(y) |> g(z) is equivalent to g(f(x, y), z). The easiest way to pronounce the pipe is “then”. That makes it possible to get a sense of the following code even though you haven’t yet learned the details:

+
+
flights |>
+  filter(dest == "IAH") |> 
+  group_by(year, month, day) |> 
+  summarize(
+    arr_delay = mean(arr_delay, na.rm = TRUE)
+  )
+
+

The code starts with the flights dataset, then filters it, then groups it, then summarizes it. We’ll come back to the pipe and its alternatives in #sec-pipes.

+

dplyr’s verbs are organised into four groups based on what they operate on: rows, columns, groups, or tables. In the following sections you’ll learn the most important verbs for rows, columns, and groups, then we’ll come back to the verbs that work on tables in #chp-joins. Let’s dive in!

+
+
+ +
+

+Rows

+

The most important verbs that operate on rows are #chp-https://dplyr.tidyverse.org/reference/filter, which changes which rows are present without changing their order, and #chp-https://dplyr.tidyverse.org/reference/arrange, which changes the order of the rows without changing which are present. Both functions only affect the rows, and the columns are left unchanged.

+ +
+

+filter() +

+

#chp-https://dplyr.tidyverse.org/reference/filter allows you to keep rows based on the values of the columns (later, you’ll learn about the slice_*() family, which allows you to choose rows based on their positions). The first argument is the data frame. The second and subsequent arguments are the conditions that must be true to keep the row. For example, we could find all flights that arrived more than 120 minutes (two hours) late:

+
+
flights |> 
+  filter(arr_delay > 120)
+#> # A tibble: 10,034 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      811         630     101    1047     830     137 MQ     
+#> 2  2013     1     1      848        1835     853    1001    1950     851 MQ     
+#> 3  2013     1     1      957         733     144    1056     853     123 UA     
+#> 4  2013     1     1     1114         900     134    1447    1222     145 UA     
+#> 5  2013     1     1     1505        1310     115    1638    1431     127 EV     
+#> 6  2013     1     1     1525        1340     105    1831    1626     125 B6     
+#> # … with 10,028 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

As well as > (greater than), you can use >= (greater than or equal to), < (less than), <= (less than or equal to), == (equal to), and != (not equal to). You can also use & (and) or | (or) to combine multiple conditions:

+
+
# Flights that departed on January 1
+flights |> 
+  filter(month == 1 & day == 1)
+#> # A tibble: 842 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+# Flights that departed in January or February
+flights |> 
+  filter(month == 1 | month == 2)
+#> # A tibble: 51,955 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

There’s a useful shortcut when you’re combining | and ==: %in%. It keeps rows where the variable equals one of the values on the right:

+
+
# A shorter way to select flights that departed in January or February
+flights |> 
+  filter(month %in% c(1, 2))
+#> # A tibble: 51,955 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 51,949 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

We’ll come back to these comparisons and logical operators in more detail in #chp-logicals.

+

When you run #chp-https://dplyr.tidyverse.org/reference/filter, dplyr executes the filtering operation, creating a new data frame, and then prints it. It doesn’t modify the existing flights dataset because dplyr functions never modify their inputs. To save the result, you need to use the assignment operator, <-:

+
+
jan1 <- flights |> 
+  filter(month == 1 & day == 1)
+
+
+ +
+

+Common mistakes

+

When you’re starting out with R, the easiest mistake to make is to use = instead of == when testing for equality. #chp-https://dplyr.tidyverse.org/reference/filter will let you know when this happens:

+
+
flights |> 
+  filter(month = 1)
+#> Error in `filter()`:
+#> ! We detected a named input.
+#> ℹ This usually means that you've used `=` instead of `==`.
+#> ℹ Did you mean `month == 1`?
+
+

Another mistake is writing “or” statements like you would in English:

+
+
flights |> 
+  filter(month == 1 | 2)
+
+

This works, in the sense that it doesn’t throw an error, but it doesn’t do what you want. We’ll come back to what it does and why in #sec-boolean-operations.

+
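
Here’s a brief preview of why (we’ll return to the details in #sec-boolean-operations): R parses month == 1 | 2 as (month == 1) | 2, and a non-zero number on the right of | counts as TRUE, so the condition is TRUE for every row:

+
+
flights |> 
+  filter(month == 1 | 2) |> 
+  nrow()  # 336776: every row passes, so nothing was filtered out
+
+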
+ +
+

+arrange() +

+

#chp-https://dplyr.tidyverse.org/reference/arrange changes the order of the rows based on the value of the columns. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. For example, the following code sorts by the departure time, which is spread over four columns.

+
+
flights |> 
+  arrange(year, month, day, dep_time)
+#> # A tibble: 336,776 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

You can use #chp-https://dplyr.tidyverse.org/reference/desc to re-order by a column in descending order. For example, this code shows the most delayed flights:

+
+
flights |> 
+  arrange(desc(dep_delay))
+#> # A tibble: 336,776 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     9      641         900    1301    1242    1530    1272 HA     
+#> 2  2013     6    15     1432        1935    1137    1607    2120    1127 MQ     
+#> 3  2013     1    10     1121        1635    1126    1239    1810    1109 MQ     
+#> 4  2013     9    20     1139        1845    1014    1457    2210    1007 AA     
+#> 5  2013     7    22      845        1600    1005    1044    1815     989 MQ     
+#> 6  2013     4    10     1100        1900     960    1342    2211     931 DL     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

You can combine #chp-https://dplyr.tidyverse.org/reference/arrange and #chp-https://dplyr.tidyverse.org/reference/filter to solve more complex problems. For example, we could look for the flights that were most delayed on arrival but left roughly on time:

+
+
flights |> 
+  filter(dep_delay <= 10 & dep_delay >= -10) |> 
+  arrange(desc(arr_delay))
+#> # A tibble: 239,109 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013    11     1      658         700      -2    1329    1015     194 VX     
+#> 2  2013     4    18      558         600      -2    1149     850     179 AA     
+#> 3  2013     7     7     1659        1700      -1    2050    1823     147 US     
+#> 4  2013     7    22     1606        1615      -9    2056    1831     145 DL     
+#> 5  2013     9    19      648         641       7    1035     810     145 UA     
+#> 6  2013     4    18      655         700      -5    1213     950     143 AA     
+#> # … with 239,103 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+
+ +
+

+Exercises

+
  1. +

    Find all flights that

    +
    1. Had an arrival delay of two or more hours
    2. +
    3. Flew to Houston (IAH or HOU)
    4. +
    5. Were operated by United, American, or Delta
    6. +
    7. Departed in summer (July, August, and September)
    8. +
    9. Arrived more than two hours late, but didn’t leave late
    10. +
    11. Were delayed by at least an hour, but made up over 30 minutes in flight
    12. +
  2. +
  3. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

  4. +
  5. Sort flights to find the fastest flights (Hint: try sorting by a calculation).

  6. +
  7. Which flights traveled the farthest? Which traveled the shortest?

  8. +
  9. Does it matter what order you used #chp-https://dplyr.tidyverse.org/reference/filter and #chp-https://dplyr.tidyverse.org/reference/arrange in if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

  10. +
+
+ +
+

+Columns

+

There are four important verbs that affect the columns without changing the rows: #chp-https://dplyr.tidyverse.org/reference/mutate, #chp-https://dplyr.tidyverse.org/reference/select, #chp-https://dplyr.tidyverse.org/reference/rename, and #chp-https://dplyr.tidyverse.org/reference/relocate. #chp-https://dplyr.tidyverse.org/reference/mutate creates new columns that are functions of the existing columns; #chp-https://dplyr.tidyverse.org/reference/select, #chp-https://dplyr.tidyverse.org/reference/rename, and #chp-https://dplyr.tidyverse.org/reference/relocate change which columns are present, their names, or their positions.

+ +
+

+mutate() +

+

The job of #chp-https://dplyr.tidyverse.org/reference/mutate is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:

+
+
flights |> 
+  mutate(
+    gain = dep_delay - arr_delay,
+    speed = distance / air_time * 60
+  )
+#> # A tibble: 336,776 × 21
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 11 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>, and abbreviated
+#> #   variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time,
+#> #   ⁵​arr_delay
+
+

By default, #chp-https://dplyr.tidyverse.org/reference/mutate adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. We can use the .before argument to instead add the variables to the left hand side (remember that in RStudio, the easiest way to see a dataset with many columns is #chp-https://rdrr.io/r/utils/View):

+
+
flights |> 
+  mutate(
+    gain = dep_delay - arr_delay,
+    speed = distance / air_time * 60,
+    .before = 1
+  )
+#> # A tibble: 336,776 × 21
+#>    gain speed  year month   day dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
+#>   <dbl> <dbl> <int> <int> <int>    <int>   <int>   <dbl>   <int>   <int>   <dbl>
+#> 1    -9  370.  2013     1     1      517     515       2     830     819      11
+#> 2   -16  374.  2013     1     1      533     529       4     850     830      20
+#> 3   -31  408.  2013     1     1      542     540       2     923     850      33
+#> 4    17  517.  2013     1     1      544     545      -1    1004    1022     -18
+#> 5    19  394.  2013     1     1      554     600      -6     812     837     -25
+#> 6   -16  288.  2013     1     1      554     558      -4     740     728      12
+#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
+#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
+#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

The . is a sign that .before is an argument to the function, not the name of a new variable. You can also use .after to add after a variable, and in both .before and .after you can use the name of a variable instead of a position. For example, we could add the new variables after day:

+
+
flights |> 
+  mutate(
+    gain = dep_delay - arr_delay,
+    speed = distance / air_time * 60,
+    .after = day
+  )
+#> # A tibble: 336,776 × 21
+#>    year month   day  gain speed dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
+#>   <int> <int> <int> <dbl> <dbl>    <int>   <int>   <dbl>   <int>   <int>   <dbl>
+#> 1  2013     1     1    -9  370.      517     515       2     830     819      11
+#> 2  2013     1     1   -16  374.      533     529       4     850     830      20
+#> 3  2013     1     1   -31  408.      542     540       2     923     850      33
+#> 4  2013     1     1    17  517.      544     545      -1    1004    1022     -18
+#> 5  2013     1     1    19  394.      554     600      -6     812     837     -25
+#> 6  2013     1     1   -16  288.      554     558      -4     740     728      12
+#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
+#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
+#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

Alternatively, you can control which variables are kept with the .keep argument. A particularly useful value is “used”, which allows you to see the inputs and outputs of your calculations:

+
+
flights |> 
+  mutate(
+    gain = dep_delay - arr_delay,
+    hours = air_time / 60,
+    gain_per_hour = gain / hours,
+    .keep = "used"
+  )
+#> # A tibble: 336,776 × 6
+#>   dep_delay arr_delay air_time  gain hours gain_per_hour
+#>       <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
+#> 1         2        11      227    -9  3.78         -2.38
+#> 2         4        20      227   -16  3.78         -4.23
+#> 3         2        33      160   -31  2.67        -11.6 
+#> 4        -1       -18      183    17  3.05          5.57
+#> 5        -6       -25      116    19  1.93          9.83
+#> 6        -4        12      150   -16  2.5          -6.4 
+#> # … with 336,770 more rows
+
+
+ +
+

+select() +

+

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. #chp-https://dplyr.tidyverse.org/reference/select allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. #chp-https://dplyr.tidyverse.org/reference/select is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:

+
+
# Select columns by name
+flights |> 
+  select(year, month, day)
+#> # A tibble: 336,776 × 3
+#>    year month   day
+#>   <int> <int> <int>
+#> 1  2013     1     1
+#> 2  2013     1     1
+#> 3  2013     1     1
+#> 4  2013     1     1
+#> 5  2013     1     1
+#> 6  2013     1     1
+#> # … with 336,770 more rows
+
+# Select all columns between year and day (inclusive)
+flights |> 
+  select(year:day)
+#> # A tibble: 336,776 × 3
+#>    year month   day
+#>   <int> <int> <int>
+#> 1  2013     1     1
+#> 2  2013     1     1
+#> 3  2013     1     1
+#> 4  2013     1     1
+#> 5  2013     1     1
+#> 6  2013     1     1
+#> # … with 336,770 more rows
+
+# Select all columns except those from year to day (inclusive)
+flights |> 
+  select(!year:day)
+#> # A tibble: 336,776 × 16
+#>   dep_time sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin
+#>      <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
+#> 1      517     515       2     830     819      11 UA        1545 N14228  EWR   
+#> 2      533     529       4     850     830      20 UA        1714 N24211  LGA   
+#> 3      542     540       2     923     850      33 AA        1141 N619AA  JFK   
+#> 4      544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
+#> 5      554     600      -6     812     837     -25 DL         461 N668DN  LGA   
+#> 6      554     558      -4     740     728      12 UA        1696 N39463  EWR   
+#> # … with 336,770 more rows, 6 more variables: dest <chr>, air_time <dbl>,
+#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
+#> #   variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time,
+#> #   ⁵​arr_delay
+
+# Select all columns that are characters
+flights |> 
+  select(where(is.character))
+#> # A tibble: 336,776 × 4
+#>   carrier tailnum origin dest 
+#>   <chr>   <chr>   <chr>  <chr>
+#> 1 UA      N14228  EWR    IAH  
+#> 2 UA      N24211  LGA    IAH  
+#> 3 AA      N619AA  JFK    MIA  
+#> 4 B6      N804JB  JFK    BQN  
+#> 5 DL      N668DN  LGA    ATL  
+#> 6 UA      N39463  EWR    ORD  
+#> # … with 336,770 more rows
+
+

There are a number of helper functions you can use within #chp-https://dplyr.tidyverse.org/reference/select:

+
  • +starts_with("abc"): matches names that begin with “abc”.
  • +
  • +ends_with("xyz"): matches names that end with “xyz”.
  • +
  • +contains("ijk"): matches names that contain “ijk”.
  • +
  • +num_range("x", 1:3): matches x1, x2 and x3.
  • +
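
For example, a quick sketch combining two of these helpers on flights (any of the four works the same way):

+
+
flights |> 
+  select(starts_with("dep"), ends_with("delay"))
+
+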

See #chp-https://dplyr.tidyverse.org/reference/select for more details. Once you know regular expressions (the topic of #chp-regexps) you’ll also be able to use #chp-https://tidyselect.r-lib.org/reference/starts_with to select variables that match a pattern.

+

You can rename variables as you #chp-https://dplyr.tidyverse.org/reference/select them by using =. The new name appears on the left hand side of the =, and the old variable appears on the right hand side:

+
+
flights |> 
+  select(tail_num = tailnum)
+#> # A tibble: 336,776 × 1
+#>   tail_num
+#>   <chr>   
+#> 1 N14228  
+#> 2 N24211  
+#> 3 N619AA  
+#> 4 N804JB  
+#> 5 N668DN  
+#> 6 N39463  
+#> # … with 336,770 more rows
+
+
+ +
+

+rename() +

+

If you want to keep all the existing variables and just rename a few, you can use #chp-https://dplyr.tidyverse.org/reference/rename instead of #chp-https://dplyr.tidyverse.org/reference/select:

+
+
flights |> 
+  rename(tail_num = tailnum)
+#> # A tibble: 336,776 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tail_num <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

It works exactly the same way as #chp-https://dplyr.tidyverse.org/reference/select, but keeps all the variables that aren’t explicitly selected.

+

If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out #chp-https://rdrr.io/pkg/janitor/man/clean_names which provides some useful automated cleaning.

+
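
A minimal sketch, assuming the janitor package is installed (the data frame and its messy names are invented for illustration):

+
+
df <- tibble(`Flight Num` = 1:2, `dep.Delay` = c(3, 5))
+janitor::clean_names(df)  # names become flight_num and dep_delay
+
+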
+ +
+

+relocate() +

+

Use #chp-https://dplyr.tidyverse.org/reference/relocate to move variables around. You might want to collect related variables together or move important variables to the front. By default #chp-https://dplyr.tidyverse.org/reference/relocate moves variables to the front:

+
+
flights |> 
+  relocate(time_hour, air_time)
+#> # A tibble: 336,776 × 19
+#>   time_hour           air_time  year month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴
+#>   <dttm>                 <dbl> <int> <int> <int>   <int>   <int>   <dbl>   <int>
+#> 1 2013-01-01 05:00:00      227  2013     1     1     517     515       2     830
+#> 2 2013-01-01 05:00:00      227  2013     1     1     533     529       4     850
+#> 3 2013-01-01 05:00:00      160  2013     1     1     542     540       2     923
+#> 4 2013-01-01 05:00:00      183  2013     1     1     544     545      -1    1004
+#> 5 2013-01-01 06:00:00      116  2013     1     1     554     600      -6     812
+#> 6 2013-01-01 05:00:00      150  2013     1     1     554     558      -4     740
+#> # … with 336,770 more rows, 10 more variables: sched_arr_time <int>,
+#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
+#> #   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>, and abbreviated
+#> #   variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time
+
+

But you can use the same .before and .after arguments as #chp-https://dplyr.tidyverse.org/reference/mutate to choose where to put them:

+
+
flights |> 
+  relocate(year:dep_time, .after = time_hour)
+#> # A tibble: 336,776 × 19
+#>   sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest 
+#>        <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr>
+#> 1        515       2     830     819      11 UA        1545 N14228  EWR    IAH  
+#> 2        529       4     850     830      20 UA        1714 N24211  LGA    IAH  
+#> 3        540       2     923     850      33 AA        1141 N619AA  JFK    MIA  
+#> 4        545      -1    1004    1022     -18 B6         725 N804JB  JFK    BQN  
+#> 5        600      -6     812     837     -25 DL         461 N668DN  LGA    ATL  
+#> 6        558      -4     740     728      12 UA        1696 N39463  EWR    ORD  
+#> # … with 336,770 more rows, 9 more variables: air_time <dbl>, distance <dbl>,
+#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>, month <int>,
+#> #   day <int>, dep_time <int>, and abbreviated variable names ¹​sched_dep_time,
+#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+flights |> 
+  relocate(starts_with("arr"), .before = dep_time)
+#> # A tibble: 336,776 × 19
+#>    year month   day arr_time arr_delay dep_time sched_…¹ dep_d…² sched…³ carrier
+#>   <int> <int> <int>    <int>     <dbl>    <int>    <int>   <dbl>   <int> <chr>  
+#> 1  2013     1     1      830        11      517      515       2     819 UA     
+#> 2  2013     1     1      850        20      533      529       4     830 UA     
+#> 3  2013     1     1      923        33      542      540       2     850 AA     
+#> 4  2013     1     1     1004       -18      544      545      -1    1022 B6     
+#> 5  2013     1     1      812       -25      554      600      -6     837 DL     
+#> 6  2013     1     1      740        12      554      558      -4     728 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​sched_arr_time
+
+
+ +
+

+Exercises

+
+ +
+
  1. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

  2. +
  3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

  4. +
  5. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

  6. +
  7. What happens if you include the name of a variable multiple times in a #chp-https://dplyr.tidyverse.org/reference/select call?

  8. +
  9. +

    What does the #chp-https://tidyselect.r-lib.org/reference/all_of function do? Why might it be helpful in conjunction with this vector?

    +
    +
    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
    +
    +
  10. +
  11. +

    Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    +
    +
    select(flights, contains("TIME"))
    +
    +
  12. +
+
+ +
+

+Groups

+

So far you’ve learned about functions that work with rows and columns. dplyr gets even more powerful when you add in the ability to work with groups. In this section, we’ll focus on the most important functions: #chp-https://dplyr.tidyverse.org/reference/group_by, #chp-https://dplyr.tidyverse.org/reference/summarise, and the slice family of functions.

+ +
+

+group_by() +

+

Use #chp-https://dplyr.tidyverse.org/reference/group_by to divide your dataset into groups meaningful for your analysis:

+
+
flights |> 
+  group_by(month)
+#> # A tibble: 336,776 × 19
+#> # Groups:   month [12]
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

#chp-https://dplyr.tidyverse.org/reference/group_by doesn’t change the data but, if you look closely at the output, you’ll notice that it’s now “grouped by” month. This means subsequent operations will now work “by month”.

+
+ +
+

+summarize() +

+

The most important grouped operation is a summary. It collapses each group to a single row (this is a slight simplification; later on you’ll learn how to use #chp-https://dplyr.tidyverse.org/reference/summarise to produce multiple summary rows for each group). Here we compute the average departure delay by month:

+
+
flights |> 
+  group_by(month) |> 
+  summarize(
+    delay = mean(dep_delay)
+  )
+#> # A tibble: 12 × 2
+#>   month delay
+#>   <int> <dbl>
+#> 1     1    NA
+#> 2     2    NA
+#> 3     3    NA
+#> 4     4    NA
+#> 5     5    NA
+#> 6     6    NA
+#> # … with 6 more rows
+
+

Uhoh! Something has gone wrong and all of our results are NA (pronounced “N-A”), R’s symbol for missing value. We’ll come back to discuss missing values in #chp-missing-values, but for now we’ll remove them by using na.rm = TRUE:

+
+
flights |> 
+  group_by(month) |> 
+  summarize(
+    delay = mean(dep_delay, na.rm = TRUE)
+  )
+#> # A tibble: 12 × 2
+#>   month delay
+#>   <int> <dbl>
+#> 1     1  10.0
+#> 2     2  10.8
+#> 3     3  13.2
+#> 4     4  13.9
+#> 5     5  13.0
+#> 6     6  20.8
+#> # … with 6 more rows
+
+

You can create any number of summaries in a single call to #chp-https://dplyr.tidyverse.org/reference/summarise. You’ll learn various useful summaries in the upcoming chapters, but one very useful summary is #chp-https://dplyr.tidyverse.org/reference/context, which returns the number of rows in each group:

+
+
flights |> 
+  group_by(month) |> 
+  summarize(
+    delay = mean(dep_delay, na.rm = TRUE), 
+    n = n()
+  )
+#> # A tibble: 12 × 3
+#>   month delay     n
+#>   <int> <dbl> <int>
+#> 1     1  10.0 27004
+#> 2     2  10.8 24951
+#> 3     3  13.2 28834
+#> 4     4  13.9 28330
+#> 5     5  13.0 28796
+#> 6     6  20.8 28243
+#> # … with 6 more rows
+
+

Means and counts can get you a surprisingly long way in data science!

+
+ +
+

+The slice_ functions

+

There are five handy functions that allow you to pick off specific rows within each group:

+
  • +df |> slice_head(n = 1) takes the first row from each group.
  • +
  • +df |> slice_tail(n = 1) takes the last row in each group.
  • +
  • +df |> slice_min(x, n = 1) takes the row with the smallest value of x.
  • +
  • +df |> slice_max(x, n = 1) takes the row with the largest value of x.
  • +
  • +df |> slice_sample(n = 1) takes one random row from each group.
  • +

You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group. For example, the following code finds the most delayed flight to each destination:

+
+
flights |> 
+  group_by(dest) |> 
+  slice_max(arr_delay, n = 1)
+#> # A tibble: 108 × 19
+#> # Groups:   dest [105]
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     7    22     2145        2007      98     132    2259     153 B6     
+#> 2  2013     7    23     1139         800     219    1250     909     221 B6     
+#> 3  2013     1    25      123        2000     323     229    2101     328 EV     
+#> 4  2013     8    17     1740        1625      75    2042    2003      39 UA     
+#> 5  2013     7    22     2257         759     898     121    1026     895 DL     
+#> 6  2013     7    10     2056        1505     351    2347    1758     349 UA     
+#> # … with 102 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

This is similar to computing the max delay with #chp-https://dplyr.tidyverse.org/reference/summarise, but you get the whole row instead of the single summary:

+
+
flights |> 
+  group_by(dest) |> 
+  summarize(max_delay = max(arr_delay, na.rm = TRUE))
+#> Warning: There was 1 warning in `summarize()`.
+#> ℹ In argument `max_delay = max(arr_delay, na.rm = TRUE)`.
+#> ℹ In group 52: `dest = "LGA"`.
+#> Caused by warning in `max()`:
+#> ! no non-missing arguments to max; returning -Inf
+#> # A tibble: 105 × 2
+#>   dest  max_delay
+#>   <chr>     <dbl>
+#> 1 ABQ         153
+#> 2 ACK         221
+#> 3 ALB         328
+#> 4 ANC          39
+#> 5 ATL         895
+#> 6 AUS         349
+#> # … with 99 more rows
+
+
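
And as a quick sketch of the prop = argument mentioned above, here’s how you might keep the top 10% most-delayed flights within each destination (the exact number of rows returned depends on group sizes and ties):

+
+
flights |> 
+  group_by(dest) |> 
+  slice_max(arr_delay, prop = 0.1)
+
+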
+ +
+

+Grouping by multiple variables

+

You can create groups using more than one variable. For example, we could make a group for each day:

+
+
daily <- flights |>  
+  group_by(year, month, day)
+daily
+#> # A tibble: 336,776 × 19
+#> # Groups:   year, month, day [365]
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:

+
+
daily_flights <- daily |> 
+  summarize(
+    n = n()
+  )
+#> `summarise()` has grouped output by 'year', 'month'. You can override using the
+#> `.groups` argument.
+
+

If you’re happy with this behavior, you can explicitly request it in order to suppress the message:

+
+
daily_flights <- daily |> 
+  summarize(
+    n = n(), 
+    .groups = "drop_last"
+  )
+
+

Alternatively, change the default behavior by setting a different value, e.g. "drop" to drop all grouping or "keep" to preserve the same groups.

+
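
For instance, a sketch of requesting a fully ungrouped result, so no message is printed and no grouping remains:

+
+
daily |> 
+  summarize(
+    n = n(), 
+    .groups = "drop"
+  )
+
+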
+ +
+

+Ungrouping

+

You might also want to remove grouping from a data frame outside of #chp-https://dplyr.tidyverse.org/reference/summarise. You can do this with ungroup() (documented alongside #chp-https://dplyr.tidyverse.org/reference/group_by).

+
+
daily |> 
+  ungroup() |>
+  summarize(
+    delay = mean(dep_delay, na.rm = TRUE), 
+    flights = n()
+  )
+#> # A tibble: 1 × 2
+#>   delay flights
+#>   <dbl>   <int>
+#> 1  12.6  336776
+
+

As you can see, when you summarize an ungrouped data frame, you get a single row back because dplyr treats all the rows in an ungrouped data frame as belonging to one group.

+
+ +
+

+Exercises

+
  1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

  2. +
  3. Find the most delayed flight to each destination.

  4. +
  5. How do delays vary over the course of the day? Illustrate your answer with a plot.

  6. +
  7. What happens if you supply a negative n to #chp-https://dplyr.tidyverse.org/reference/slice and friends?

  8. +
  9. Explain what #chp-https://dplyr.tidyverse.org/reference/count does in terms of the dplyr verbs you just learned. What does the sort argument to #chp-https://dplyr.tidyverse.org/reference/count do?

  10. +
+
+ +
+

+Case study: aggregates and sample size

+

Whenever you do any aggregation, it’s always a good idea to include a count (#chp-https://dplyr.tidyverse.org/reference/context). That way, you can ensure that you’re not drawing conclusions based on very small amounts of data. For example, let’s look at the planes (identified by their tail number) that have the highest average delays:

+
+
delays <- flights |>  
+  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
+  group_by(tailnum) |> 
+  summarize(
+    delay = mean(arr_delay, na.rm = TRUE),
+    n = n()
+  )
+
+ggplot(delays, aes(delay)) + 
+  geom_freqpoly(binwidth = 10)
+
+

A frequency histogram showing the distribution of flight delays. The distribution is unimodal, with a large spike around 0, and asymmetric: very few flights leave more than 30 minutes early, but flights are delayed up to 5 hours.

+
+
+

Wow, there are some planes that have an average delay of 5 hours (300 minutes)! That seems pretty surprising, so let’s draw a scatterplot of number of flights vs. average delay:

+
+
ggplot(delays, aes(n, delay)) + 
+  geom_point(alpha = 1/10)
+
+

A scatterplot showing number of flights versus average delay. Delays for planes with a very small number of flights have very high variability (from -50 to ~300), but the variability rapidly decreases as the number of flights increases.

+
+
+

Not surprisingly, there is much greater variation in the average delay when there are few flights for a given plane. The shape of this plot is very characteristic: whenever you plot a mean (or other summary) vs. group size, you’ll see that the variation decreases as the sample size increases (*cough* the central limit theorem *cough*).

+

When looking at this sort of plot, it’s often useful to filter out the groups with the smallest numbers of observations, so you can see more of the pattern and less of the extreme variation in the smallest groups:

+
+
delays |>  
+  filter(n > 25) |> 
+  ggplot(aes(n, delay)) + 
+  geom_point(alpha = 1/10) + 
+  geom_smooth(se = FALSE)
+
+

Now that the y-axis (average delay) is smaller (-20 to 60 minutes), we can see a more complicated story. The smooth line suggests an initial decrease in average delay from 10 minutes to 0 minutes as number of flights per plane increases from 25 to 100. This is followed by a gradual increase up to 10 minutes for 250 flights, then a gradual decrease to ~5 minutes at 500 flights.

+
+
+

Note the handy pattern for combining ggplot2 and dplyr. It’s a bit annoying that you have to switch from |> to +, but it’s not too much of a hassle once you get the hang of it.

+

There’s another common variation on this pattern that we can see in some data about baseball players. The following code uses data from the Lahman package to compare what proportion of times a player hits the ball vs. the number of attempts they take:

+
+
batters <- Lahman::Batting |> 
+  group_by(playerID) |> 
+  summarize(
+    perf = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
+    n = sum(AB, na.rm = TRUE)
+  )
+batters
+#> # A tibble: 20,166 × 3
+#>   playerID    perf     n
+#>   <chr>      <dbl> <int>
+#> 1 aardsda01 0          4
+#> 2 aaronha01 0.305  12364
+#> 3 aaronto01 0.229    944
+#> 4 aasedo01  0          5
+#> 5 abadan01  0.0952    21
+#> 6 abadfe01  0.111      9
+#> # … with 20,160 more rows
+
+

When we plot the skill of the batter (measured by the batting average, perf) against the number of opportunities to hit the ball (measured by times at bat, n), you see two patterns:

+
  1. As above, the variation in our aggregate decreases as we get more data points.

  2. +
  3. There’s a positive correlation between skill (perf) and opportunities to hit the ball (n) because obviously teams want to give their best batters the most opportunities to hit the ball.

  4. +
+
batters |> 
+  filter(n > 100) |> 
+  ggplot(aes(n, perf)) +
+    geom_point(alpha = 1 / 10) + 
+    geom_smooth(se = FALSE)
+
+

A scatterplot of number of batting opportunities vs batting performance overlaid with a smoothed line. Average performance increases sharply from 0.2 when n is 1 to 0.25 when n is ~1000. Average performance continues to increase linearly at a much shallower slope, reaching ~0.3 when n is ~15,000.

+
+
+

This also has important implications for ranking. If you naively sort on desc(perf), the people with the best batting averages are clearly lucky, not skilled:

+
+
batters |> 
+  arrange(desc(perf))
+#> # A tibble: 20,166 × 3
+#>   playerID   perf     n
+#>   <chr>     <dbl> <int>
+#> 1 abramge01     1     1
+#> 2 alberan01     1     1
+#> 3 banisje01     1     1
+#> 4 bartocl01     1     1
+#> 5 bassdo01      1     1
+#> 6 birasst01     1     2
+#> # … with 20,160 more rows
+
+

You can find a good explanation of this problem and how to overcome it at http://varianceexplained.org/r/empirical_bayes_baseball/ and https://www.evanmiller.org/how-not-to-sort-by-average-rating.html.

+
+ +
+

+Summary

+

In this chapter, you’ve learned the tools that dplyr provides for working with data frames. The tools are roughly grouped into three categories: those that manipulate the rows (like #chp-https://dplyr.tidyverse.org/reference/filter and #chp-https://dplyr.tidyverse.org/reference/arrange), those that manipulate the columns (like #chp-https://dplyr.tidyverse.org/reference/select and #chp-https://dplyr.tidyverse.org/reference/mutate), and those that manipulate groups (like #chp-https://dplyr.tidyverse.org/reference/group_by and #chp-https://dplyr.tidyverse.org/reference/summarise). In this chapter, we’ve focused on these “whole data frame” tools, but you haven’t yet learned much about what you can do with individual variables. We’ll come back to that in the Transform part of the book, where each chapter will give you tools for a specific type of variable.

+

For now, we’ll pivot back to workflow, and in the next chapter you’ll learn more about the pipe, |>, why we recommend it, and a little of the history that led from magrittr’s %>% to base R’s |>.

+ + +
+
diff --git a/oreilly/data-visualize.html b/oreilly/data-visualize.html new file mode 100644 index 0000000..b218f45 --- /dev/null +++ b/oreilly/data-visualize.html @@ -0,0 +1,838 @@ +
+

Data visualization

+
+

+Introduction

+
+

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

+
+

This chapter will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can do more and faster by learning one system and applying it in many places.

+ +
+

+Prerequisites

+

This chapter focuses on ggplot2, one of the core packages in the tidyverse. To access the datasets, help pages, and functions used in this chapter, load the tidyverse by running this code:

+
+
library(tidyverse)
+#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+#> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
+#> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
+#> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
+#> ✔ readr   2.1.3             ✔ forcats 0.5.2        
+#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ✖ dplyr::filter() masks stats::filter()
+#> ✖ dplyr::lag()    masks stats::lag()
+
+

That one line of code loads the core tidyverse: packages that you will use in almost every data analysis. It also tells you which functions from the tidyverse conflict with functions in base R (or from other packages you might have loaded).

+

If you run this code and get the error message “there is no package called ‘tidyverse’”, you’ll need to first install it, then run #chp-https://rdrr.io/r/base/library once again.

+
+
install.packages("tidyverse")
+library(tidyverse)
+
+

You only need to install a package once, but you need to reload it every time you start a new session.

+
+
+ +
+

+First steps

+

Let’s use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Nonlinear?

+ +
+

+The mpg data frame

+

You can test your answer with the mpg data frame found in ggplot2 (a.k.a. #chp-https://ggplot2.tidyverse.org/reference/mpg). A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). mpg contains observations collected by the US Environmental Protection Agency on 38 car models.

+
+
mpg
+#> # A tibble: 234 × 11
+#>   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
+#>   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
+#> 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
+#> 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
+#> 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
+#> 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
+#> 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
+#> 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…
+#> # … with 228 more rows
+
+

Among the variables in mpg are:

+
  1. displ, a car’s engine size, in liters.

  2. hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

To learn more about mpg, open its help page by running #chp-https://ggplot2.tidyverse.org/reference/mpg.
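In the console, that’s one line (a sketch; ? is R’s help shortcut, and ggplot2 must be loaded first):

# Open the help page for the mpg dataset
?mpg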

+
+ +
+

+Creating a ggplot

+

To plot mpg, run this code to put displ on the x-axis and hwy on the y-axis:

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy))
+
+

Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association.

+
+
+

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy). In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. Does this confirm or refute your hypothesis about fuel efficiency and engine size?

+

With ggplot2, you begin a plot with the function #chp-https://ggplot2.tidyverse.org/reference/ggplot. #chp-https://ggplot2.tidyverse.org/reference/ggplot creates a coordinate system that you can add layers to. The first argument of #chp-https://ggplot2.tidyverse.org/reference/ggplot is the dataset to use in the graph. So ggplot(data = mpg) creates an empty graph, but it’s not very interesting so we won’t show it here.

+

You complete your graph by adding one or more layers to #chp-https://ggplot2.tidyverse.org/reference/ggplot. The function #chp-https://ggplot2.tidyverse.org/reference/geom_point adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You’ll learn a whole bunch of them throughout this chapter.

+

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties of your plot. The mapping argument is always paired with #chp-https://ggplot2.tidyverse.org/reference/aes, and the x and y arguments of #chp-https://ggplot2.tidyverse.org/reference/aes specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg.

+
+ +
+

+A graphing template

+

Let’s turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a collection of mappings.

+
+
ggplot(data = <DATA>) + 
+  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
+
+

The rest of this chapter will show you how to complete and extend this template to make different types of graphs. We will begin with the <MAPPINGS> component.
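For instance, filling in the template with a dataset (mpg), a geom function (geom_point()), and a collection of mappings reproduces our first scatterplot:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))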

+
+ +
+

+Exercises

+
  1. Run ggplot(data = mpg). What do you see?

  2. How many rows are in mpg? How many columns?

  3. What does the drv variable describe? Read the help for #chp-https://ggplot2.tidyverse.org/reference/mpg to find out.

  4. Make a scatterplot of hwy vs. cyl.

  5. What happens if you make a scatterplot of class vs. drv? Why is the plot not useful?
+
+ +
+

+Aesthetic mappings

+
+

“The greatest value of a picture is when it forces us to notice what we never expected to see.” — John Tukey

+
+

In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher fuel efficiency than you might expect. That is, they have a higher miles per gallon than other cars with similar engine sizes. How can you explain these cars?

+
+
+

Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. Cars with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon stand out from the rest of the data and are highlighted in red.

+
+
+

Let’s hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the class value for each car. The class variable of the mpg dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

+

You can add a third variable, like class, to a two dimensional scatterplot by mapping it to an aesthetic. An aesthetic is a visual property of the objects in your plot. Aesthetics include things like the size, the shape, or the color of your points. You can display a point (like the one below) in different ways by changing the values of its aesthetic properties. Since we already use the word “value” to describe data, let’s use the word “level” to describe aesthetic properties. Here we change the levels of a point’s size, shape, and color to make the point small, triangular, or blue:

+
+
+

Diagram that shows four plotting characters next to each other. The first is a large circle, the second is a small circle, the third is a triangle, and the fourth is a blue circle.

+
+
+

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. For example, you can map the colors of your points to the class variable to reveal the class of each car.

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy, color = class))
+
+

Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. The points representing each car are colored according to the class of the car. The legend on the right of the plot shows the mapping between colors and levels of the class variable: 2seater, compact, midsize, minivan, pickup, or suv.

+
+
+

(If you prefer British English, like Hadley, you can use colour instead of color.)

+

To map an aesthetic to a variable, associate the name of the aesthetic with the name of the variable inside #chp-https://ggplot2.tidyverse.org/reference/aes. ggplot2 will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. ggplot2 will also add a legend that explains which levels correspond to which values.

+

The colors reveal that many of the unusual points (with engine size greater than 5 liters and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don’t seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.

+

In the above example, we mapped class to the color aesthetic, but we could have mapped class to the size aesthetic in the same way. In this case, the exact size of each point would reveal its class affiliation. We get a warning here: mapping an unordered variable (class) to an ordered aesthetic (size) is generally not a good idea because it implies a ranking that does not in fact exist.

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy, size = class))
+#> Warning: Using size for a discrete variable is not advised.
+
+

Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. The points representing each car are sized according to the class of the car. The legend on the right of the plot shows the mapping between sizes and levels of the class variable -- going from small to large: 2seater, compact, midsize, minivan, pickup, or suv.

+
+
+

Similarly, we could have mapped class to the alpha aesthetic, which controls the transparency of the points, or to the shape aesthetic, which controls the shape of the points.

+
+
# Left
+ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
+
+# Right
+ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
+
+
+
+

Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the alpha aesthetic, resulting in different transparency levels for each level of class. In the plot on the right class is mapped the shape aesthetic, resulting in different plotting character shapes for each level of class. Each plot comes with a legend that shows the mapping between alpha level or shape and levels of the class variable.

+
+
+


+
+
+
+
+

What happened to the SUVs? ggplot2 will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic.
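If you do need more than six shapes, one workaround (not covered further in this chapter, and shown here only as a hedged sketch) is to supply a shape code for every level yourself with scale_shape_manual():

# A sketch: supply one shape code per level of class (7 levels) so none are dropped
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 1:7)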

+

For each aesthetic, you use #chp-https://ggplot2.tidyverse.org/reference/aes to associate the name of the aesthetic with a variable to display. The #chp-https://ggplot2.tidyverse.org/reference/aes function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. The syntax highlights a useful insight about x and y: the x and y locations of a point are themselves aesthetics, visual properties that you can map to variables to display information about the data.

+

Once you map an aesthetic, ggplot2 takes care of the rest. It selects a reasonable scale to use with the aesthetic, and it constructs a legend that explains the mapping between levels and values. For x and y aesthetics, ggplot2 does not create a legend, but it creates an axis line with tick marks and a label. The axis line acts as a legend; it explains the mapping between locations and values.

+

You can also set the aesthetic properties of your geom manually. For example, we can make all of the points in our plot blue:

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
+
+

Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are blue.

+
+
+

Here, the color doesn’t convey information about a variable, but only changes the appearance of the plot. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function. In other words, it goes outside of #chp-https://ggplot2.tidyverse.org/reference/aes. You’ll need to pick a value that makes sense for that aesthetic:

+
  • The name of a color as a character string.

  • The size of a point in mm.

  • The shape of a point as a number, as shown in #fig-shapes. (For an example that sets several of these at once, see the sketch just before the exercises below.)
+
+ +

Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue.

+
Figure 2.1: R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the color and fill aesthetics. The hollow shapes (0–14) have a border determined by color; the solid shapes (15–20) are filled with color; the filled shapes (21–24) have a border of color and are filled with fill.
+
+
+
+ +
+
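Putting those three kinds of values together, here is a small sketch that sets all three aesthetics manually (the particular values are arbitrary choices):

# All three aesthetics set manually, outside of aes():
ggplot(data = mpg) + 
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "red",    # a color name as a character string
    size = 3,         # point size in mm
    shape = 17        # 17 is a filled triangle (see Figure 2.1)
  )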

+Exercises

+
  1. +

    What’s gone wrong with this code? Why are the points not blue?

    ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

    Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association. All points are red and the legend shows a red point that is mapped to the word blue.

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type #chp-https://ggplot2.tidyverse.org/reference/mpg to read the documentation for the dataset). How can you see this information when you run mpg?

  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

  4. What happens if you map the same variable to multiple aesthetics?

  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use #chp-https://ggplot2.tidyverse.org/reference/geom_point)

  6. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? Note, you’ll also need to specify x and y.
+
+ +
+

+Common problems

+

As you start to run R code, you’re likely to run into problems. Don’t worry — it happens to everyone. We have all been writing R code for years, but every day we still write code that doesn’t work!

+

Start by carefully comparing the code that you’re running to the code in the book. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Sometimes you’ll run the code and nothing happens. Check the left-hand side of your console: if it’s a +, it means that R doesn’t think you’ve typed a complete expression and is waiting for you to finish it. In this case, it’s usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.

+

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start. In other words, make sure you haven’t accidentally written code like this:

+
ggplot(data = mpg) 
++ geom_point(mapping = aes(x = displ, y = hwy))
+
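The corrected version puts the + at the end of the first line, so R knows the expression continues:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))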

If you’re still stuck, try the help. You can get help about any R function by running ?function_name in the console, or by selecting the function name and pressing F1 in RStudio. Don’t worry if the help doesn’t seem that helpful; instead, skip down to the examples and look for code that matches what you’re trying to do.
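For example, either of these opens the documentation for geom_point():

?geom_point
help(geom_point)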

+

If that doesn’t help, carefully read the error message. Sometimes the answer will be buried there! But when you’re new to R, the answer might be in the error message but you don’t yet know how to understand it. Another great tool is Google: try googling the error message, as it’s likely someone else has had the same problem, and has gotten help online.

+
+ +
+

+Facets

+

One way to add additional variables to a plot is by mapping them to an aesthetic. Another way, which is particularly useful for categorical variables, is to split your plot into facets, subplots that each display one subset of the data.

+

To facet your plot by a single variable, use #chp-https://ggplot2.tidyverse.org/reference/facet_wrap. The first argument of #chp-https://ggplot2.tidyverse.org/reference/facet_wrap is a formula (here “formula” is the name of the type of thing created by ~, not a synonym for “equation”), which you create with ~ followed by a variable name. The variable that you pass to #chp-https://ggplot2.tidyverse.org/reference/facet_wrap should be discrete.

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy)) + 
+  facet_wrap(~ class, nrow = 2)
+
+

Scatterplot of highway fuel efficiency versus engine size of cars, faceted by class, with facets spanning two rows.

+
+
+

To facet your plot with the combination of two variables, switch from #chp-https://ggplot2.tidyverse.org/reference/facet_wrap to #chp-https://ggplot2.tidyverse.org/reference/facet_grid. The first argument of #chp-https://ggplot2.tidyverse.org/reference/facet_grid is also a formula, but now it’s a double sided formula: rows ~ cols.

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy)) + 
+  facet_grid(drv ~ cyl)
+
+

Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows and by number of cylinders across columns. This results in a 3x4 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and rear wheel drive.

+
+
+ +
+

+Exercises

+
  1. What happens if you facet on a continuous variable?

  2. What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

    ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = drv, y = cyl))

    Scatterplot of number of cylinders versus type of drive train of cars. The plot shows that there are no cars with 5 cylinders that are 4 wheel drive or with 4 or 5 cylinders that are rear wheel drive.

  3. What plots does the following code make? What does . do?

    ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy)) +
    +  facet_grid(drv ~ .)
    +
    +ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy)) +
    +  facet_grid(. ~ cyl)

  4. Take the first faceted plot in this section:

    ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy)) + 
    +  facet_wrap(~ class, nrow = 2)

    What are the advantages to using faceting instead of the color aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

  5. Read #chp-https://ggplot2.tidyverse.org/reference/facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t #chp-https://ggplot2.tidyverse.org/reference/facet_grid have nrow and ncol arguments?

  6. Which of the following two plots makes it easier to compare engine size (displ) across cars with different drive trains? What does this say about when to place a faceting variable across rows or columns?

    ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy)) + 
    +  facet_grid(drv ~ .)
    +
    +ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy)) + 
    +  facet_grid(. ~ drv)

    Two faceted plots, both visualizing highway fuel efficiency versus engine size of cars, faceted by drive train. In the first plot, facets are organized across rows and in the second, across columns.

  7. Recreate this plot using #chp-https://ggplot2.tidyverse.org/reference/facet_wrap instead of #chp-https://ggplot2.tidyverse.org/reference/facet_grid. How do the positions of the facet labels change?

    ggplot(data = mpg) + 
    +  geom_point(mapping = aes(x = displ, y = hwy)) +
    +  facet_grid(drv ~ .)

    Scatterplot of highway fuel efficiency versus engine size of cars, faceted by type of drive train across rows.
+
+ +
+

+Geometric objects

+

How are these two plots similar?

+
+
+
+

There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed.

+
+
+


+
+
+
+

Both plots contain the same x variable, the same y variable, and both describe the same data. But the plots are not identical. Each plot uses a different visual object to represent the data. In ggplot2 syntax, we say that they use different geoms.

+

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can use different geoms to plot the same data. The plot on the left uses the point geom, and the plot on the right uses the smooth geom, a smooth line fitted to the data.

+

To change the geom in your plot, change the geom function that you add to #chp-https://ggplot2.tidyverse.org/reference/ggplot. For instance, to make the plots above, you can use this code:

+
+
# Left
+ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy))
+
+# Right
+ggplot(data = mpg) + 
+  geom_smooth(mapping = aes(x = displ, y = hwy))
+
+

Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you could set the linetype of a line. #chp-https://ggplot2.tidyverse.org/reference/geom_smooth will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.

+
+
ggplot(data = mpg) + 
+  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
+
+

A plot of highway fuel efficiency versus engine size of cars. The data are represented with smooth curves, which use a different line type (solid, dashed, or long dashed) for each type of drive train. Confidence intervals around the smooth curves are also displayed.

+
+
+

Here, #chp-https://ggplot2.tidyverse.org/reference/geom_smooth separates the cars into three lines based on their drv value, which describes a car’s drive train. One line describes all of the points that have a 4 value, one line describes all of the points that have an f value, and one line describes all of the points that have an r value. Here, 4 stands for four-wheel drive, f for front-wheel drive, and r for rear-wheel drive.

+

If this sounds strange, we can make it more clear by overlaying the lines on top of the raw data and then coloring everything according to drv.
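The book shows the resulting plot without its code; here is a hedged sketch of code that would produce something like it (assumptions: color mapped in both layers, linetype in the smooth layer):

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv, linetype = drv))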

+
+
+

A plot of highway fuel efficiency versus engine size of cars. The data are represented with points (colored by drive train) as well as smooth curves (where line type is determined based on drive train as well). Confidence intervals around the smooth curves are also displayed.

+
+
+

Notice that this plot contains two geoms in the same graph! If this makes you excited, buckle up. You will learn how to place multiple geoms in the same plot very soon.

+

ggplot2 provides more than 40 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org/gallery/ for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://rstudio.com/resources/cheatsheets. To learn more about any single geom, use the help (e.g. #chp-https://ggplot2.tidyverse.org/reference/geom_smooth).

+

Many geoms, like #chp-https://ggplot2.tidyverse.org/reference/geom_smooth, use a single geometric object to display multiple rows of data. For these geoms, you can set the group aesthetic to a categorical variable to draw multiple objects. ggplot2 will draw a separate object for each unique value of the grouping variable. In practice, ggplot2 will automatically group the data for these geoms whenever you map an aesthetic to a discrete variable (as in the linetype example). It is convenient to rely on this feature because the group aesthetic by itself does not add a legend or distinguishing features to the geoms.

+
+
ggplot(data = mpg) +
+  geom_smooth(mapping = aes(x = displ, y = hwy))
+              
+ggplot(data = mpg) +
+  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
+    
+ggplot(data = mpg) +
+  geom_smooth(
+    mapping = aes(x = displ, y = hwy, color = drv),
+    show.legend = FALSE
+  )
+
+
+
+

Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, without a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed.

+
+
+


+
+
+


+
+
+
+
+

To display multiple geoms in the same plot, add multiple geom functions to #chp-https://ggplot2.tidyverse.org/reference/ggplot:

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy)) +
+  geom_smooth(mapping = aes(x = displ, y = hwy))
+
+

Scatterplot of highway fuel efficiency versus engine size of cars with a smooth curve overlaid. A confidence interval around the smooth curves is also displayed.

+
+
+

This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display cty instead of hwy. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to #chp-https://ggplot2.tidyverse.org/reference/ggplot. ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:

+
+
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
+  geom_point() + 
+  geom_smooth()
+
+

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.

+
+
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
+  geom_point(mapping = aes(color = class)) + 
+  geom_smooth()
+
+

Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of cars is overlaid along with a confidence interval around it.

+
+
+

You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in #chp-https://ggplot2.tidyverse.org/reference/geom_smooth overrides the global data argument in #chp-https://ggplot2.tidyverse.org/reference/ggplot for that layer only.

+
+
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
+  geom_point(mapping = aes(color = class)) + 
+  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
+
+

Scatterplot of highway fuel efficiency versus engine size of cars, where points are colored according to the car class. A smooth curve following the trajectory of the relationship between highway fuel efficiency versus engine size of subcompact cars is overlaid along with a confidence interval around it.

+
+
+

(You’ll learn how #chp-https://dplyr.tidyverse.org/reference/filter works in the chapter on data transformations: for now, just know that this command selects only the subcompact cars.)

+ +
+

+Exercises

+
  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
    +  geom_point() + 
    +  geom_smooth(se = FALSE)

  3. Earlier in this chapter we used show.legend without explaining it:

    ggplot(data = mpg) +
    +  geom_smooth(
    +    mapping = aes(x = displ, y = hwy, color = drv),
    +    show.legend = FALSE
    +  )

    What does show.legend = FALSE do here? What happens if you remove it? Why do you think we used it earlier?

  4. What does the se argument to #chp-https://ggplot2.tidyverse.org/reference/geom_smooth do?

  5. Will these two graphs look different? Why/why not?

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    +  geom_point() + 
    +  geom_smooth()
    +
    +ggplot() + 
    +  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
    +  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

  6. Recreate the R code necessary to generate the following graphs. Note that wherever a categorical variable is used in the plot, it’s drv.

    There are six scatterplots in this figure, arranged in a 3x2 grid. In all plots highway fuel efficiency of cars is on the y-axis and engine size is on the x-axis. The first plot shows all points in black with a smooth curve overlaid on them. In the second plot points are also all black, with separate smooth curves overlaid for each level of drive train. In the third plot, points and the smooth curves are represented in different colors for each level of drive train. In the fourth plot the points are represented in different colors for each level of drive train but there is only a single smooth line fitted to the whole data. In the fifth plot, points are represented in different colors for each level of drive train, and separate smooth curves with different line types are fitted to each level of drive train. And finally in the sixth plot points are represented in different colors for each level of drive train and they have a thick white border.
+
+ +
+

+Statistical transformations

+

Next, let’s take a look at a bar chart. Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with #chp-https://ggplot2.tidyverse.org/reference/geom_bar. The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset is in the ggplot2 package and contains information on ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.

+
+
ggplot(data = diamonds) + 
+  geom_bar(mapping = aes(x = cut))
+
+

Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds.

+
+
+

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

+
  • bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

  • smoothers fit a model to your data and then plot predictions from the model.

  • boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. #fig-vis-stat-bar shows how this process works with #chp-https://ggplot2.tidyverse.org/reference/geom_bar.

+
+
+ +

A figure demonstrating three steps of creating a bar chart. Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar() transforms the data with the count stat, which returns a data set of cut values and counts. Step 3. geom_bar() uses the transformed data to build the plot. cut is mapped to the x-axis, count is mapped to the y-axis.

+
Figure 2.2: When creating a bar chart, we first start with the raw data, then aggregate it to count the number of observations in each bar, and finally map those computed variables to plot aesthetics.
+
+
+
+

You can learn which stat a geom uses by inspecting the default value for the stat argument. For example, the help for #chp-https://ggplot2.tidyverse.org/reference/geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count(). stat_count() is documented on the same page as geom_bar(), and if you scroll down, the section called “Computed variables” explains that it computes two new variables: count and prop.
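One quick way to check this from the console (a sketch; args() prints a function’s formal arguments and their defaults):

args(geom_bar)
# Among the printed defaults you should see stat = "count"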

+

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

+
+
ggplot(data = diamonds) + 
+  stat_count(mapping = aes(x = cut))
+
+

Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds.

+
+
+

This works because every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. However, there are three reasons why you might need to use a stat explicitly:

+
  1. You might want to override the default stat. In the code below, we change the stat of #chp-https://ggplot2.tidyverse.org/reference/geom_bar from count (the default) to identity. This lets us map the height of the bars to the raw values of a \(y\) variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.

    demo <- tribble(
    +  ~cut,         ~freq,
    +  "Fair",       1610,
    +  "Good",       4906,
    +  "Very Good",  12082,
    +  "Premium",    13791,
    +  "Ideal",      21551
    +)
    +
    +ggplot(data = demo) +
    +  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

    Bar chart of number of each cut of diamond. There are roughly 1500 Fair, 5000 Good, 12000 Very Good, 14000 Premium, and 22000 Ideal cut diamonds.

    (Don’t worry that you haven’t seen <- or #chp-https://tibble.tidyverse.org/reference/tribble before. You might be able to guess their meaning from the context, and you’ll learn exactly what they do soon!)

  2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:

    ggplot(data = diamonds) + 
    +  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

    Bar chart of proportion of each cut of diamond. Roughly, Fair diamonds make up 0.03, Good 0.09, Very Good 0.22, Premium 0.26, and Ideal 0.40.

    To find the variables computed by the stat, look for the section titled “computed variables” in the help for #chp-https://ggplot2.tidyverse.org/reference/geom_bar.

  3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use #chp-https://ggplot2.tidyverse.org/reference/stat_summary, which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing:

    ggplot(data = diamonds) + 
    +  stat_summary(
    +    mapping = aes(x = cut, y = depth),
    +    fun.min = min,
    +    fun.max = max,
    +    fun = median
    +  )

    A plot with depth on the y-axis and cut on the x-axis (with levels fair, good, very good, premium, and ideal) of diamonds. For each level of cut, vertical lines extend from minimum to maximum depth for diamonds in that cut category, and the median depth is indicated on the line with a point.

ggplot2 provides more than 20 stats for you to use. Each stat is a function, so you can get help in the usual way, e.g. #chp-https://ggplot2.tidyverse.org/reference/geom_histogram. To see a complete list of stats, try the #chp-https://rstudio.com/resources/cheatsheets.

+ +
+

+Exercises

+
  1. What is the default geom associated with #chp-https://ggplot2.tidyverse.org/reference/stat_summary? How could you rewrite the previous plot to use that geom function instead of the stat function?

  2. +
  3. What does #chp-https://ggplot2.tidyverse.org/reference/geom_bar do? How is it different from #chp-https://ggplot2.tidyverse.org/reference/geom_bar?

  4. +
  5. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

  6. +
  7. What variables does #chp-https://ggplot2.tidyverse.org/reference/geom_smooth compute? What parameters control its behaviour?

  8. +
  9. +

    In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

    +
    +
    ggplot(data = diamonds) + 
    +  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
    +ggplot(data = diamonds) + 
    +  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
    +
    +
  10. +
+
+ +
+

+Position adjustments

+

There’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic, or, more usefully, fill:

+
+
ggplot(data = diamonds) + 
+  geom_bar(mapping = aes(x = cut, color = cut))
+ggplot(data = diamonds) + 
+  geom_bar(mapping = aes(x = cut, fill = cut))
+
+
+
+

Two bar charts of cut of diamonds. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of diamonds in each cut category.

+
+
+


+
+
+
+
+

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.

+
+
ggplot(data = diamonds) + 
+  geom_bar(mapping = aes(x = cut, fill = clarity))
+
+

Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level.

+
+
+

The stacking is performed automatically using the position adjustment specified by the position argument. If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".

+
  • position = "identity" will place each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA.

    ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
    +  geom_bar(alpha = 1/5, position = "identity")
    +ggplot(data = diamonds, mapping = aes(x = cut, color = clarity)) + 
    +  geom_bar(fill = NA, position = "identity")

    Two segmented bar charts of cut of diamonds, where each bar is filled with colors for the levels of clarity. Heights of the bars correspond to the number of diamonds in each cut category, and heights of the colored segments are proportional to the number of diamonds with a given clarity level within a given cut level. However the segments overlap. In the first plot the segments are filled with transparent colors, in the second plot the segments are only outlined with colors.

    The identity position adjustment is more useful for 2d geoms, like points, where it is the default.

  • position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

    ggplot(data = diamonds) + 
    +  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

    Segmented bar chart of cut of diamonds, where each bar is filled with colors for the levels of clarity. Height of each bar is 1 and heights of the colored segments are proportional to the proportion of diamonds with a given clarity level within a given cut level.

  • position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

    ggplot(data = diamonds) + 
    +  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

    Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels of cut (fair, good, very good, premium, and ideal). In each group there are eight bars, one for each level of clarity, and filled with a different color for each level. Heights of these bars represent the number of diamonds with a given level of cut and clarity.

There’s one other type of adjustment that’s not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset?

+
+
+

Scatterplot of highway fuel efficiency versus engine size of cars that shows a negative association.

+
+
+

The underlying values of hwy and displ are rounded so the points appear on a grid and many points overlap each other. This problem is known as overplotting. This arrangement makes it difficult to see the distribution of the data. Are the data points spread equally throughout the graph, or is there one special combination of hwy and displ that contains 109 values?

+

You can avoid this gridding by setting the position adjustment to “jitter”. position = "jitter" adds a small amount of random noise to each point. This spreads the points out because no two points are likely to receive the same amount of random noise.

+
+
ggplot(data = mpg) + 
+  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
+
+

Jittered scatterplot of highway fuel efficiency versus engine size of cars. The plot shows a negative association.

+
+
+

Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph more revealing at large scales. Because this is such a useful operation, ggplot2 comes with a shorthand for geom_point(position = "jitter"): #chp-https://ggplot2.tidyverse.org/reference/geom_jitter.
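So the jittered plot above can be written more compactly as:

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy))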

+

To learn more about a position adjustment, look up the help page associated with each adjustment: #chp-https://ggplot2.tidyverse.org/reference/position_dodge, #chp-https://ggplot2.tidyverse.org/reference/position_identity, #chp-https://ggplot2.tidyverse.org/reference/position_jitter, and #chp-https://ggplot2.tidyverse.org/reference/position_stack (which documents both position_stack and position_fill).

+ +
+

+Exercises

+
  1. What is the problem with this plot? How could you improve it?

    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
    +  geom_point()

    Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a positive association. The number of points visible in this plot is less than the number of points in the dataset.

  2. What parameters to #chp-https://ggplot2.tidyverse.org/reference/geom_jitter control the amount of jittering?

  3. Compare and contrast #chp-https://ggplot2.tidyverse.org/reference/geom_jitter with #chp-https://ggplot2.tidyverse.org/reference/geom_count.

  4. What’s the default position adjustment for #chp-https://ggplot2.tidyverse.org/reference/geom_boxplot? Create a visualization of the mpg dataset that demonstrates it.
+
+ +
+

+Coordinate systems

+

Coordinate systems are probably the most complicated part of ggplot2. The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are three other coordinate systems that are occasionally helpful.

+
  • #chp-https://ggplot2.tidyverse.org/reference/coord_flip switches the x and y axes. This is useful (for example) if you want horizontal boxplots. It’s also useful for long labels: it’s hard to get them to fit without overlapping on the x-axis.

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
    +  geom_boxplot()
    +ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
    +  geom_boxplot() +
    +  coord_flip()

    Two side-by-side box plots of highway fuel efficiency of cars. A separate box plot is created for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv). In the first plot class is on the x-axis, in the second plot class is on the y-axis. The second plot makes it easier to read the names of the levels of class since they are listed down the y-axis, avoiding overlap.

    However, note that you can achieve the same result by flipping the aesthetic mappings of the two variables.

    ggplot(data = mpg, mapping = aes(y = class, x = hwy)) + 
    +  geom_boxplot()

    Side-by-side box plots of highway fuel efficiency of cars. A separate box plot is drawn along the y-axis for cars in each level of class (2seater, compact, midsize, minivan, pickup, subcompact, and suv).

  • coord_quickmap() (documented at #chp-https://ggplot2.tidyverse.org/reference/coord_map) sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2. We don’t have the space to discuss maps in this book, but you can learn more in the #chp-https://ggplot2-book.org/maps chapter of ggplot2: Elegant graphics for data analysis.

    nz <- map_data("nz")
    +
    +ggplot(nz, aes(long, lat, group = group)) +
    +  geom_polygon(fill = "white", color = "black")
    +
    +ggplot(nz, aes(long, lat, group = group)) +
    +  geom_polygon(fill = "white", color = "black") +
    +  coord_quickmap()

    Two maps of the boundaries of New Zealand. In the first plot the aspect ratio is incorrect, in the second plot it is correct.

  • #chp-https://ggplot2.tidyverse.org/reference/coord_polar uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

    bar <- ggplot(data = diamonds) + 
    +  geom_bar(
    +    mapping = aes(x = cut, fill = cut), 
    +    show.legend = FALSE,
    +    width = 1
    +  ) + 
    +  theme(aspect.ratio = 1) +
    +  labs(x = NULL, y = NULL)
    +
    +bar + coord_flip()
    +bar + coord_polar()

    There are two plots. On the left is a bar chart of cut of diamonds, on the right is a Coxcomb chart of the same data.
+
+

+Exercises

+
  1. Turn a stacked bar chart into a pie chart using #chp-https://ggplot2.tidyverse.org/reference/coord_polar.

  2. +
  3. What does #chp-https://ggplot2.tidyverse.org/reference/labs do? Read the documentation.

  4. +
  5. What’s the difference between #chp-https://ggplot2.tidyverse.org/reference/coord_map and #chp-https://ggplot2.tidyverse.org/reference/coord_map?

  6. +
  7. +

    What does the plot below tell you about the relationship between city and highway mpg? Why is #chp-https://ggplot2.tidyverse.org/reference/coord_fixed important? What does #chp-https://ggplot2.tidyverse.org/reference/geom_abline do?

    +
    +
    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
    +  geom_point() + 
    +  geom_abline() +
    +  coord_fixed()
    +
    +

    Scatterplot of highway fuel efficiency versus city fuel efficiency of cars that shows a strong positive association. The plot also has a straight line with slope 1 and intercept 0 that follows the trend of the relationship between the variables but does not go through the cloud of points; it is beneath it.

    +
    +
    +
  8. +
+
+ +
+

+The layered grammar of graphics

+

In the previous sections, you learned much more than just how to make scatterplots, bar charts, and boxplots. You learned a foundation that you can use to make any type of plot with ggplot2. To see this, let’s add position adjustments, stats, coordinate systems, and faceting to our code template:

+
ggplot(data = <DATA>) + 
+  <GEOM_FUNCTION>(
+     mapping = aes(<MAPPINGS>),
+     stat = <STAT>, 
+     position = <POSITION>
+  ) +
+  <COORDINATE_FUNCTION> +
+  <FACET_FUNCTION>
+

Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.
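To make this concrete, here’s a sketch that fills in all seven parameters at once. The stat and position shown are just geom_bar()’s defaults spelled out, while the coordinate system and faceting are deliberate, non-default choices:

ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    stat = "count",      # the default stat for geom_bar()
    position = "stack"   # the default position adjustment
  ) + 
  coord_flip() +         # a non-default coordinate system
  facet_wrap(~clarity)   # one panel for each level of clarity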

+

The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.

+

To see how this works, consider how you could build a basic plot from scratch: you could start with a dataset and then transform it into the information that you want to display (with a stat).

+
+
+

A figure demonstrating the steps for going from raw data to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Steps 1 and 2 are annotated. Step 1. Begin with the diamonds dataset. Step 2. Compute counts for each cut value with stat_count().

+
+
+

Next, you could choose a geometric object to represent each observation in the transformed data. You could then use the aesthetic properties of the geoms to represent variables in the data. You would map the values of each variable to the levels of an aesthetic.

+
+
+

A figure demonstrating the steps for going from raw data to table of counts where each row represents one level of cut and a count column shows how many diamonds are in that cut level. Each level is also mapped to a color. Steps 3 and 4 are annotated. Step 3. Represent each observation with a bar. Step 4. Map the fill of each bar to the ..count.. variable.

+
+
+

You’d then select a coordinate system to place the geoms into, using the location of the objects (which is itself an aesthetic property) to display the values of the x and y variables. At that point, you would have a complete graph, but you could further adjust the positions of the geoms within the coordinate system (a position adjustment) or split the graph into subplots (faceting). You could also extend the plot by adding one or more additional layers, where each additional layer uses a dataset, a geom, a set of mappings, a stat, and a position adjustment.

+
+
+

A figure demonstrating the steps for going from raw data to bar chart where each bar represents one level of cut and filled in with a different color. Steps 5 and 6 are annotated. Step 5. Place geoms in a Cartesian coordinate system. Step 6. Map the y values to ..count.. and the x values to cut.

+
+
+

You could use this method to build any plot that you imagine. In other words, you can use the code template that you’ve learned in this chapter to build hundreds of thousands of unique plots.

+

If you’d like to learn more about the theoretical underpinnings of ggplot2, you might enjoy reading “#chp-https://vita.had.co.nz/papers/layered-grammar”, the scientific paper that describes the theory of ggplot2 in detail.

+
+ +
+

+Summary

+

In this chapter, you’ve learned the basics of data visualization with ggplot2. We started with the basic idea that underpins ggplot2: a visualization is a mapping from variables in your data to aesthetic properties like position, colour, size, and shape. You then learned about facets, which allow you to create small multiples, where each panel contains a subgroup from your data. We then gave you a whirlwind tour of the geoms and stats which control the “type” of graph you get, whether it’s a scatterplot, line plot, histogram, or something else. Position adjustments control the fine details of position when geoms might otherwise overlap, and coordinate systems allow you to fundamentally change what x and y mean.

+

We’ll use visualizations again and again throughout this book, introducing new techniques as we need them. If you want to get a comprehensive understanding of ggplot2, we recommend reading the book, #chp-https://ggplot2-book. Other useful resources are the #chp-https://r-graphics by Winston Chang and #chp-https://clauswilke.com/dataviz/ by Claus Wilke.

+

With the basics of visualization under your belt, in the next chapter we’re going to switch gears a little and give you some practical workflow advice. We intersperse workflow advice with data science tools throughout this part of the book because it’ll help you stay organized as you write increasing amounts of R code.

+ + +
+
diff --git a/oreilly/databases.html b/oreilly/databases.html new file mode 100644 index 0000000..88f54e2 --- /dev/null +++ b/oreilly/databases.html @@ -0,0 +1,770 @@ +
+

Databases

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+

+Introduction

+

A huge amount of data lives in databases, so it’s essential that you know how to access it. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.

+

In this chapter, you’ll first learn the basics of the DBI package: how to use it to connect to a database and then retrieve data with a SQL query (SQL is either pronounced “s”-“q”-“l” or “sequel”). SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. That said, we’re not going to start with SQL, but instead we’ll teach you dbplyr, which can translate your dplyr code to SQL. We’ll use that as a way to teach you some of the most important features of SQL. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.

+ +
+

+Prerequisites

+

In this chapter, we’ll introduce DBI and dbplyr. DBI is a low-level interface that connects to databases and executes SQL; dbplyr is a high-level interface that translates your dplyr code to SQL queries then executes them with DBI.

+
+
library(DBI)
+library(dbplyr)
+library(tidyverse)
+
+
+
+ +
+

+Database basics

+

At the simplest level, you can think about a database as a collection of data frames, called tables in database terminology. Like a data.frame, a database table is a collection of named columns, where every value in the column is the same type. There are three high-level differences between data frames and database tables:

+
  • Database tables are stored on disk and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still plenty large for many problems).

  • +
  • Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row. Data frames and tibbles don’t have indexes, but data.tables do, which is one of the reasons that they’re so fast.

  • +
  • Most classical databases are optimized for rapidly collecting data, not analyzing existing data. These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column like R. More recently, there’s been much development of column-oriented databases that make analyzing the existing data much faster.

  • +

Databases are run by database management systems (DBMS’s for short), which come in three basic forms:

+
  • +Client-server DBMS’s run on a powerful central server, which you connect to from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.
  • +
  • +Cloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.
  • +
  • +In-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.
  • +
+ +
+

+Connecting to a database

+

To connect to the database from R, you’ll use a pair of packages:

+
  • You’ll always use DBI (database interface) because it provides a set of generic functions that connect to the database, upload data, run SQL queries, etc.

  • +
  • You’ll also use a package tailored for the DBMS you’re connecting to. This package translates the generic DBI commands into the specifics needed for a given DBMS. There’s usually one package for each DBMS, e.g. RPostgres for Postgres and RMariaDB for MySQL.

  • +

If you can’t find a specific package for your DBMS, you can usually use the odbc package instead. This uses the ODBC protocol supported by many DBMSs. odbc requires a little more setup because you’ll also need to install an ODBC driver and tell the odbc package where to find it.

+

Concretely, you create a database connection using #chp-https://dbi.r-dbi.org/reference/dbConnect. The first argument selects the DBMS (typically, this is the only function you’ll use from the client package, so we recommend using :: to pull out that one function, rather than loading the complete package with #chp-https://rdrr.io/r/base/library), then the second and subsequent arguments describe how to connect to it (i.e. where it lives and the credentials that you need to access it). The following code shows a couple of typical examples:

+
+
con <- DBI::dbConnect(
+  RMariaDB::MariaDB(), 
+  username = "foo"
+)
+con <- DBI::dbConnect(
+  RPostgres::Postgres(), 
+  hostname = "databases.mycompany.com", 
+  port = 1234
+)
+
+

The precise details of the connection vary a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. This means you’ll need to do a little research on your own. Typically you can ask the other data scientists in your team or talk to your DBA (database administrator). The initial setup will often take a little fiddling (and maybe some googling) to get right, but you’ll generally only need to do it once.
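For example, here’s what an odbc connection might look like (a rough sketch; every value below is a placeholder you’d replace with the details of your own ODBC setup):

con <- DBI::dbConnect(
  odbc::odbc(),
  driver   = "PostgreSQL Driver",          # placeholder: name of your installed ODBC driver
  server   = "databases.mycompany.com",    # placeholder: where the database lives
  database = "sales",                      # placeholder: which database to connect to
  uid      = "foo",                        # placeholder: your username
  pwd      = rstudioapi::askForPassword()  # one way to avoid typing a password into your script
)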

+ +
+

+In this book

+

Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-process DBMS that lives entirely in an R package: duckdb. Thanks to the magic of DBI, the only difference between using duckdb and any other DBMS is how you’ll connect to the database. This makes it great to teach with because you can easily run this code yourself, and just as easily take what you learn and apply it elsewhere.

+

Connecting to duckdb is particularly simple because the defaults create a temporary database that is deleted when you quit R. That’s great for learning because it guarantees that you’ll start from a clean slate every time you restart R:

+
+
con <- DBI::dbConnect(duckdb::duckdb())
+
+

duckdb is a high-performance database that’s designed very much for the needs of a data scientist. We use it here because it’s very easy to get started with, but it’s also capable of handling gigabytes of data with great speed. If you want to use duckdb for a real data analysis project, you’ll also need to supply the dbdir argument to make a persistent database and tell duckdb where to save it. Assuming you’re using a project (#chp-workflow-scripts), it’s reasonable to store it in the duckdb directory of the current project:

+
+
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "duckdb")
+
+
+ +
+

+Load some data

+

Since this is a new database, we need to start by adding some data. Here we’ll add the mpg and diamonds datasets from ggplot2 using #chp-https://dbi.r-dbi.org/reference/dbWriteTable. The simplest usage of #chp-https://dbi.r-dbi.org/reference/dbWriteTable needs three arguments: a database connection, the name of the table to create in the database, and a data frame of data.

+
+
dbWriteTable(con, "mpg", ggplot2::mpg)
+dbWriteTable(con, "diamonds", ggplot2::diamonds)
+
+

If you’re using duckdb in a real project, we highly recommend learning about duckdb_read_csv() and duckdb_register_arrow(). These give you powerful and performant ways to quickly load data directly into duckdb, without having to first load it into R.
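For example, here’s a minimal sketch of duckdb_read_csv() (the file name is hypothetical):

# load a csv directly into a new database table, without going through R first
duckdb::duckdb_read_csv(con, name = "diamonds_csv", files = "diamonds.csv")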

+

We’ll also show off a useful technique for loading multiple files into a database in #sec-save-database.

+
+
+ +
+

+DBI basics

+

Now that we’ve connected to a database with some data in it, let’s perform some basic operations with DBI.

+ +
+

+What’s there?

+

The most important database objects for data scientists are tables. DBI provides two useful functions to either list all the tables in the database (at least, all the tables that you have permission to see) or to check if a specific table already exists:

+
+
dbListTables(con)
+#> [1] "diamonds" "mpg"
+dbExistsTable(con, "foo")
+#> [1] FALSE
+
+
+ +
+

+Extract some data

+

Once you’ve determined a table exists, you can retrieve it with #chp-https://dbi.r-dbi.org/reference/dbReadTable:

+
+
con |> 
+  dbReadTable("diamonds") |> 
+  as_tibble()
+#> # A tibble: 53,940 × 10
+#>   carat cut       color clarity depth table price     x     y     z
+#>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
+#> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
+#> 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
+#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
+#> 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
+#> 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
+#> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
+#> # … with 53,934 more rows
+
+

#chp-https://dbi.r-dbi.org/reference/dbReadTable returns a data.frame so we use #chp-https://tibble.tidyverse.org/reference/as_tibble to convert it into a tibble so that it prints nicely.

+

In real life, it’s rare that you’ll use #chp-https://dbi.r-dbi.org/reference/dbReadTable because often database tables are too big to fit in memory, and you want to bring back only a subset of the rows and columns.

+
+ +
+

+Run a query

+

The way you’ll usually retrieve data is with #chp-https://dbi.r-dbi.org/reference/dbGetQuery. It takes a database connection and some SQL code and returns a data frame:

+
+
sql <- "
+  SELECT carat, cut, clarity, color, price 
+  FROM diamonds 
+  WHERE price > 15000
+"
+as_tibble(dbGetQuery(con, sql))
+#> # A tibble: 1,655 × 5
+#>   carat cut       clarity color price
+#>   <dbl> <fct>     <fct>   <fct> <int>
+#> 1  1.54 Premium   VS2     E     15002
+#> 2  1.19 Ideal     VVS1    F     15005
+#> 3  2.1  Premium   SI1     I     15007
+#> 4  1.69 Ideal     SI1     D     15011
+#> 5  1.5  Very Good VVS2    G     15013
+#> 6  1.73 Very Good VS1     G     15014
+#> # … with 1,649 more rows
+
+

Don’t worry if you’ve never seen SQL before; you’ll learn more about it shortly. But if you read it carefully, you might guess that it selects five columns of the diamonds dataset and all the rows where price is greater than 15,000.

+

You’ll need to be a little careful with #chp-https://dbi.r-dbi.org/reference/dbGetQuery since it can potentially return more data than you have memory. We won’t discuss it further here, but if you’re dealing with very large datasets it’s possible to deal with a “page” of data at a time by using #chp-https://dbi.r-dbi.org/reference/dbSendQuery to get a “result set” which you can page through by calling #chp-https://dbi.r-dbi.org/reference/dbFetch until #chp-https://dbi.r-dbi.org/reference/dbHasCompleted returns TRUE.
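Here’s a minimal sketch of that paging pattern, reusing the sql query from above:

res <- dbSendQuery(con, sql)
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 1000)  # bring back up to 1000 rows at a time
  # ... process each chunk here ...
}
dbClearResult(res)  # release the result set when you're done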

+
+ +
+

+Other functions

+

There are lots of other functions in DBI that you might find useful if you’re managing your own data (like #chp-https://dbi.r-dbi.org/reference/dbWriteTable which we used in #sec-load-data), but we’re going to skip past them in the interest of staying focused on working with data that already lives in a database.

+
+
+ +
+

+dbplyr basics

+

Now that you’ve learned the low-level basics for connecting to a database and running a query, we’re going to switch it up a bit and learn about dbplyr. dbplyr is a dplyr backend, which means that you keep writing dplyr code but the backend executes it differently. In this case, dbplyr translates to SQL; other backends include #chp-https://dtplyr.tidyverse which translates to #chp-https://r-datatable, and #chp-https://multidplyr.tidyverse which executes your code on multiple cores.

+

To use dbplyr, you must first use #chp-https://dplyr.tidyverse.org/reference/tbl to create an object that represents a database table:

+
+
diamonds_db <- tbl(con, "diamonds")
+diamonds_db
+#> # Source:   table<diamonds> [?? x 10]
+#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#>   carat cut       color clarity depth table price     x     y     z
+#>   <dbl> <fct>     <fct> <fct>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
+#> 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
+#> 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
+#> 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
+#> 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
+#> 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
+#> 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
+#> # … with more rows
+
+
+
+ +
+ +
+ +


There are two other common ways to interact with a database. First, many corporate databases are very large so you need some hierarchy to keep all the tables organised. In that case you might need to supply a schema, or a catalog and a schema, in order to pick the table you’re interested in:

+
diamonds_db <- tbl(con, in_schema("sales", "diamonds"))
+diamonds_db <- tbl(con, in_catalog("north_america", "sales", "diamonds"))
+

Other times you might want to use your own SQL query as a starting point:

+
diamonds_db <- tbl(con, sql("SELECT * FROM diamonds"))
+
+ +

+ +

This object is lazy; when you use dplyr verbs on it, dplyr doesn’t do any work: it just records the sequence of operations that you want to perform and only performs them when needed. For example, take the following pipeline:

+
+
big_diamonds_db <- diamonds_db |> 
+  filter(price > 15000) |> 
+  select(carat:clarity, price)
+
+big_diamonds_db
+#> # Source:   SQL [?? x 5]
+#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#>   carat cut       color clarity price
+#>   <dbl> <fct>     <fct> <fct>   <int>
+#> 1  1.54 Premium   E     VS2     15002
+#> 2  1.19 Ideal     F     VVS1    15005
+#> 3  2.1  Premium   I     SI1     15007
+#> 4  1.69 Ideal     D     SI1     15011
+#> 5  1.5  Very Good G     VVS2    15013
+#> 6  1.73 Very Good G     VS1     15014
+#> # … with more rows
+
+

You can tell this object represents a database query because it prints the DBMS name at the top, and while it tells you the number of columns, it typically doesn’t know the number of rows. This is because finding the total number of rows usually requires executing the complete query, something we’re trying to avoid.

+

You can see the SQL code generated by dbplyr with show_query() (#chp-https://dplyr.tidyverse.org/reference/explain):

+
+
big_diamonds_db |>
+  show_query()
+#> <SQL>
+#> SELECT carat, cut, color, clarity, price
+#> FROM diamonds
+#> WHERE (price > 15000.0)
+
+

To get all the data back into R, you call collect() (#chp-https://dplyr.tidyverse.org/reference/compute). Behind the scenes, this generates the SQL, calls #chp-https://dbi.r-dbi.org/reference/dbGetQuery to get the data, then turns the result into a tibble:

+
+
big_diamonds <- big_diamonds_db |> 
+  collect()
+big_diamonds
+#> # A tibble: 1,655 × 5
+#>   carat cut       color clarity price
+#>   <dbl> <fct>     <fct> <fct>   <int>
+#> 1  1.54 Premium   E     VS2     15002
+#> 2  1.19 Ideal     F     VVS1    15005
+#> 3  2.1  Premium   I     SI1     15007
+#> 4  1.69 Ideal     D     SI1     15011
+#> 5  1.5  Very Good G     VVS2    15013
+#> 6  1.73 Very Good G     VS1     15014
+#> # … with 1,649 more rows
+
+

Typically, you’ll use dbplyr to select the data you want from the database, performing basic filtering and aggregation using the translations described below. Then, once you’re ready to analyse the data with functions that are unique to R, you’ll collect() (#chp-https://dplyr.tidyverse.org/reference/compute) the data to get an in-memory tibble, and continue your work with pure R code.

+
+ +
+

+SQL

+

The rest of the chapter will teach you a little SQL through the lens of dbplyr. It’s a rather non-traditional introduction to SQL but we hope it will get you quickly up to speed with the basics. Luckily, if you understand dplyr you’re in a great place to quickly pick up SQL because so many of the concepts are the same.

+

We’ll explore the relationship between dplyr and SQL using a couple of old friends from the nycflights13 package: flights and planes. These datasets are easy to get into our learning database because dbplyr has a function designed for this exact scenario:

+
+
dbplyr::copy_nycflights13(con)
+#> Creating table: airlines
+#> Creating table: airports
+#> Creating table: flights
+#> Creating table: planes
+#> Creating table: weather
+flights <- tbl(con, "flights")
+planes <- tbl(con, "planes")
+
+
+ +
+ +
+

+SQL basics

+

The top-level components of SQL are called statements. Common statements include CREATE for defining new tables, INSERT for adding data, and SELECT for retrieving data. We will focus on SELECT statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.

+

A query is made up of clauses. There are five important clauses: SELECT, FROM, WHERE, ORDER BY, and GROUP BY. Every query must have the SELECT and FROM clauses. (Confusingly, depending on the context, SELECT is either a statement or a clause; to avoid this confusion, we’ll generally say query instead of SELECT statement. And technically, only the SELECT is required, since you can write queries like SELECT 1+1 to perform basic calculations, but if you want to work with data, as you always do, you’ll also need a FROM clause.) The simplest query is SELECT * FROM table, which selects all columns from the specified table. This is what dbplyr generates for an unadulterated table:

+
+
flights |> show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+planes |> show_query()
+#> <SQL>
+#> SELECT *
+#> FROM planes
+
+

WHERE and ORDER BY control which rows are included and how they are ordered:

+
+
flights |> 
+  filter(dest == "IAH") |> 
+  arrange(dep_delay) |>
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+#> WHERE (dest = 'IAH')
+#> ORDER BY dep_delay
+
+

GROUP BY converts the query to a summary, causing aggregation to happen:

+
+
flights |> 
+  group_by(dest) |> 
+  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) |> 
+  show_query()
+#> <SQL>
+#> SELECT dest, AVG(dep_delay) AS dep_delay
+#> FROM flights
+#> GROUP BY dest
+
+

There are two important differences between dplyr verbs and SELECT clauses:

+
  • In SQL, case doesn’t matter: you can write select, SELECT, or even SeLeCt. In this book we’ll stick with the common convention of writing SQL keywords in uppercase to distinguish them from table or variable names.
  • +
  • In SQL, order matters: you must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses are actually evaluated, which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.
  • +

The following sections explore each clause in more detail.

+
+
+ +
+ +
+ +


Note that while SQL is a standard, it is extremely complex and no database follows it exactly. While the main components that we’ll focus on in this book are very similar between DBMSs, there are many minor variations. Fortunately, dbplyr is designed to handle this problem and generates different translations for different databases. It’s not perfect, but it’s continually improving, and if you hit a problem you can file an issue #chp-https://github.com/tidyverse/dbplyr/issues/ to help us do better.

+ +

+ +
+ +
+

+SELECT

+

The SELECT clause is the workhorse of queries and performs the same job as #chp-https://dplyr.tidyverse.org/reference/select, #chp-https://dplyr.tidyverse.org/reference/mutate, #chp-https://dplyr.tidyverse.org/reference/rename, #chp-https://dplyr.tidyverse.org/reference/relocate, and, as you’ll learn in the next section, #chp-https://dplyr.tidyverse.org/reference/summarise.

+

#chp-https://dplyr.tidyverse.org/reference/select, #chp-https://dplyr.tidyverse.org/reference/rename, and #chp-https://dplyr.tidyverse.org/reference/relocate have very direct translations to SELECT as they just affect where a column appears (if at all) along with its name:

+
+
planes |> 
+  select(tailnum, type, manufacturer, model, year) |> 
+  show_query()
+#> <SQL>
+#> SELECT tailnum, "type", manufacturer, model, "year"
+#> FROM planes
+
+planes |> 
+  select(tailnum, type, manufacturer, model, year) |> 
+  rename(year_built = year) |> 
+  show_query()
+#> <SQL>
+#> SELECT tailnum, "type", manufacturer, model, "year" AS year_built
+#> FROM planes
+
+planes |> 
+  select(tailnum, type, manufacturer, model, year) |> 
+  relocate(manufacturer, model, .before = type) |> 
+  show_query()
+#> <SQL>
+#> SELECT tailnum, manufacturer, model, "type", "year"
+#> FROM planes
+
+

This example also shows you how SQL does renaming. In SQL terminology renaming is called aliasing and is done with AS. Note that unlike #chp-https://dplyr.tidyverse.org/reference/mutate, the old name is on the left and the new name is on the right.

+
+
+ +
+ +
+ +


In the examples above note that "year" and "type" are wrapped in double quotes. That’s because these are reserved words in duckdb, so dbplyr quotes them to avoid any potential confusion between column/table names and SQL operators.

When working with other databases you’re likely to see every variable name quoted, because only a handful of client packages, like duckdb, know what all the reserved words are, so they quote everything to be safe.

SELECT "tailnum", "type", "manufacturer", "model", "year"
+FROM "planes"

Some other database systems use backticks instead of quotes:

SELECT `tailnum`, `type`, `manufacturer`, `model`, `year`
+FROM `planes`
+ +

The translations for #chp-https://dplyr.tidyverse.org/reference/mutate are similarly straightforward: each variable becomes a new expression in SELECT:

+
+
flights |> 
+  mutate(
+    speed = distance / (air_time / 60)
+  ) |> 
+  show_query()
+#> <SQL>
+#> SELECT *, distance / (air_time / 60.0) AS speed
+#> FROM flights
+
+

We’ll come back to the translation of individual components (like /) in #sec-sql-expressions.

+
+ +
+

+FROM

+

The FROM clause defines the data source. It’s going to be rather uninteresting for a little while, because we’re just using single tables. You’ll see more complex examples once we hit the join functions.

+
+ +
+

+GROUP BY

+

#chp-https://dplyr.tidyverse.org/reference/group_by is translated to the GROUP BY clause (this is no coincidence: the dplyr function name was inspired by the SQL clause) and #chp-https://dplyr.tidyverse.org/reference/summarise is translated to the SELECT clause:

+
+
diamonds_db |> 
+  group_by(cut) |> 
+  summarise(
+    n = n(),
+    avg_price = mean(price, na.rm = TRUE)
+  ) |> 
+  show_query()
+#> <SQL>
+#> SELECT cut, COUNT(*) AS n, AVG(price) AS avg_price
+#> FROM diamonds
+#> GROUP BY cut
+
+

We’ll come back to what’s happening with the translation of #chp-https://dplyr.tidyverse.org/reference/context and #chp-https://rdrr.io/r/base/mean in #sec-sql-expressions.

+
+ +
+

+WHERE

+

#chp-https://dplyr.tidyverse.org/reference/filter is translated to the WHERE clause:

+
+
flights |> 
+  filter(dest == "IAH" | dest == "HOU") |> 
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+#> WHERE (dest = 'IAH' OR dest = 'HOU')
+
+flights |> 
+  filter(arr_delay > 0 & arr_delay < 20) |> 
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+#> WHERE (arr_delay > 0.0 AND arr_delay < 20.0)
+
+

There are a few important details to note here:

+
  • +| becomes OR and & becomes AND.
  • +
  • SQL uses = for comparison, not ==. SQL doesn’t have assignment, so there’s no potential for confusion there.
  • +
  • SQL uses only '' for strings, not "". In SQL, "" is used to identify variables, like R’s ``.
  • +

Another useful SQL operator is IN, which is very close to R’s %in%:

+
+
flights |> 
+  filter(dest %in% c("IAH", "HOU")) |> 
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+#> WHERE (dest IN ('IAH', 'HOU'))
+
+

SQL uses NULL instead of NA. NULLs behave similarly to NAs. The main difference is that while they’re “infectious” in comparisons and arithmetic, they are silently dropped when summarizing. dbplyr will remind you about this behavior the first time you hit it:

+
+
flights |> 
+  group_by(dest) |> 
+  summarise(delay = mean(arr_delay))
+#> Warning: Missing values are always removed in SQL aggregation functions.
+#> Use `na.rm = TRUE` to silence this warning
+#> This warning is displayed once every 8 hours.
+#> # Source:   SQL [?? x 2]
+#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#>   dest   delay
+#>   <chr>  <dbl>
+#> 1 ATL   11.3  
+#> 2 ORD    5.88 
+#> 3 RDU   10.1  
+#> 4 IAD   13.9  
+#> 5 DTW    5.43 
+#> 6 LAX    0.547
+#> # … with more rows
+
+

If you want to learn more about how NULLs work, you might enjoy “#chp-https://modern-sql.com/concept/three-valued-logic” by Markus Winand.

+

In general, you can work with NULLs using the functions you’d use for NAs in R:

+
+
flights |> 
+  filter(!is.na(dep_delay)) |> 
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+#> WHERE (NOT((dep_delay IS NULL)))
+
+

This SQL query illustrates one of the drawbacks of dbplyr: while the SQL is correct, it isn’t as simple as you might write by hand. In this case, you could drop the parentheses and use a special operator that’s easier to read:

+
WHERE "dep_delay" IS NOT NULL
+

Note that if you #chp-https://dplyr.tidyverse.org/reference/filter a variable that you created using a summarize, dbplyr will generate a HAVING clause, rather than a WHERE clause. This is one of the idiosyncrasies of SQL, created because WHERE is evaluated before SELECT, so it needs another clause that’s evaluated afterwards.

+
+
diamonds_db |> 
+  group_by(cut) |> 
+  summarise(n = n()) |> 
+  filter(n > 100) |> 
+  show_query()
+#> <SQL>
+#> SELECT cut, COUNT(*) AS n
+#> FROM diamonds
+#> GROUP BY cut
+#> HAVING (COUNT(*) > 100.0)
+
+
+ +
+

+ORDER BY

+

Ordering rows involves a straightforward translation from #chp-https://dplyr.tidyverse.org/reference/arrange to the ORDER BY clause:

+
+
flights |> 
+  arrange(year, month, day, desc(dep_delay)) |> 
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM flights
+#> ORDER BY "year", "month", "day", dep_delay DESC
+
+

Notice how #chp-https://dplyr.tidyverse.org/reference/desc is translated to DESC: this is one of the many dplyr functions whose name was directly inspired by SQL.

+
+ +
+

+Subqueries

+

Sometimes it’s not possible to translate a dplyr pipeline into a single SELECT statement and you need to use a subquery. A subquery is just a query used as a data source in the FROM clause, instead of the usual table.

+

dbplyr typically uses subqueries to work around limitations of SQL. For example, expressions in the SELECT clause can’t refer to columns that were just created. That means that the following (silly) dplyr pipeline needs to happen in two steps: the first (inner) query computes year1 and then the second (outer) query can compute year2.

+
+
flights |> 
+  mutate(
+    year1 = year + 1,
+    year2 = year1 + 1
+  ) |> 
+  show_query()
+#> <SQL>
+#> SELECT *, year1 + 1.0 AS year2
+#> FROM (
+#>   SELECT *, "year" + 1.0 AS year1
+#>   FROM flights
+#> ) q01
+
+

You’ll also see this if you attempted to #chp-https://dplyr.tidyverse.org/reference/filter a variable that you just created. Remember, even though WHERE is written after SELECT, it’s evaluated before it, so we need a subquery in this (silly) example:

+
+
flights |> 
+  mutate(year1 = year + 1) |> 
+  filter(year1 == 2014) |> 
+  show_query()
+#> <SQL>
+#> SELECT *
+#> FROM (
+#>   SELECT *, "year" + 1.0 AS year1
+#>   FROM flights
+#> ) q01
+#> WHERE (year1 = 2014.0)
+
+

Sometimes dbplyr will create a subquery where it’s not needed because it doesn’t yet know how to optimize that translation. As dbplyr improves over time, these cases will get rarer but will probably never go away.

+
+ +
+

+Joins

+

If you’re familiar with dplyr’s joins, SQL joins are very similar. Here’s a simple example:

+
+
flights |> 
+  left_join(planes |> rename(year_built = year), by = "tailnum") |> 
+  show_query()
+#> <SQL>
+#> SELECT
+#>   flights.*,
+#>   planes."year" AS year_built,
+#>   "type",
+#>   manufacturer,
+#>   model,
+#>   engines,
+#>   seats,
+#>   speed,
+#>   engine
+#> FROM flights
+#> LEFT JOIN planes
+#>   ON (flights.tailnum = planes.tailnum)
+
+

The main thing to notice here is the syntax: SQL joins use sub-clauses of the FROM clause to bring in additional tables, using ON to define how the tables are related.

+

dplyr’s names for these functions are so closely connected to SQL that you can easily guess the equivalent SQL for inner_join(), right_join(), and full_join() (#chp-https://dplyr.tidyverse.org/reference/mutate-joins):

+
SELECT flights.*, "type", manufacturer, model, engines, seats, speed
+FROM flights
+INNER JOIN planes ON (flights.tailnum = planes.tailnum)
+
+SELECT flights.*, "type", manufacturer, model, engines, seats, speed
+FROM flights
+RIGHT JOIN planes ON (flights.tailnum = planes.tailnum)
+
+SELECT flights.*, "type", manufacturer, model, engines, seats, speed
+FROM flights
+FULL JOIN planes ON (flights.tailnum = planes.tailnum)
+

You’re likely to need many joins when working with data from a database. That’s because database tables are often stored in a highly normalized form, where each “fact” is stored in a single place and to keep a complete dataset for analysis you need to navigate a complex network of tables connected by primary and foreign keys. If you hit this scenario, the #chp-https://cynkra.github.io/dm/ package, by Tobias Schieferdecker, Kirill Müller, and Darko Bergant, is a life saver. It can automatically determine the connections between tables using the constraints that DBAs often supply, visualize the connections so you can see what’s going on, and generate the joins you need to connect one table to another.

+
+ +
+

+Other verbs

+

dbplyr also translates other verbs like #chp-https://dplyr.tidyverse.org/reference/distinct, slice_*(), and #chp-https://generics.r-lib.org/reference/setops, and a growing selection of tidyr functions like #chp-https://tidyr.tidyverse.org/reference/pivot_longer and #chp-https://tidyr.tidyverse.org/reference/pivot_wider. The easiest way to see the full set of what’s currently available is to visit the dbplyr website: https://dbplyr.tidyverse.org/reference/.
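For example, you can see one of those translations for yourself; dbplyr renders distinct() as SELECT DISTINCT (output not shown):

planes |> 
  distinct(manufacturer) |> 
  show_query()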

+
+ +
+

+Exercises

+
  1. What is #chp-https://dplyr.tidyverse.org/reference/distinct translated to? How about #chp-https://rdrr.io/r/utils/head?

  2. +
  3. +

    Explain what each of the following SQL queries does and try to recreate them using dbplyr.

    +
    SELECT * 
    +FROM flights
    +WHERE dep_delay < arr_delay
    +
    +SELECT *, distance / (air_time / 60) AS speed
    +FROM flights
    +
  4. +
+
+ +
+

+Function translations

+

So far we’ve focused on the big picture of how dplyr verbs are translated to the clauses of a query. Now we’re going to zoom in a little and talk about the translation of the R functions that work with individual columns, e.g. what happens when you use mean(x) in a #chp-https://dplyr.tidyverse.org/reference/summarise?

+

To help see what’s going on, we’ll use a couple of little helper functions that run a #chp-https://dplyr.tidyverse.org/reference/summarise or #chp-https://dplyr.tidyverse.org/reference/mutate and show the generated SQL. That will make it a little easier to explore a few variations and see how summaries and transformations can differ.

+
+
summarize_query <- function(df, ...) {
+  df |> 
+    summarise(...) |> 
+    show_query()
+}
+mutate_query <- function(df, ...) {
+  df |> 
+    mutate(..., .keep = "none") |> 
+    show_query()
+}
+
+

Let’s dive in with some summaries! Looking at the code below you’ll notice that some summary functions, like #chp-https://rdrr.io/r/base/mean, have a relatively simple translation while others, like #chp-https://rdrr.io/r/stats/median, are much more complex. The complexity is typically higher for operations that are common in statistics but less common in databases.

+
+
flights |> 
+  group_by(year, month, day) |>  
+  summarize_query(
+    mean = mean(arr_delay, na.rm = TRUE),
+    median = median(arr_delay, na.rm = TRUE)
+  )
+#> `summarise()` has grouped output by "year" and "month". You can override using
+#> the `.groups` argument.
+#> <SQL>
+#> SELECT
+#>   "year",
+#>   "month",
+#>   "day",
+#>   AVG(arr_delay) AS mean,
+#>   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY arr_delay) AS median
+#> FROM flights
+#> GROUP BY "year", "month", "day"
+
+

The translation of summary functions becomes more complicated when you use them inside a #chp-https://dplyr.tidyverse.org/reference/mutate because they have to turn into a window function. In SQL, you turn an ordinary aggregation function into a window function by adding OVER after it:

+
+
flights |> 
+  group_by(year, month, day) |>  
+  mutate_query(
+    mean = mean(arr_delay, na.rm = TRUE),
+  )
+#> <SQL>
+#> SELECT
+#>   "year",
+#>   "month",
+#>   "day",
+#>   AVG(arr_delay) OVER (PARTITION BY "year", "month", "day") AS mean
+#> FROM flights
+
+

In SQL, the GROUP BY clause is used exclusively for summaries, so here you can see that the grouping has moved to the PARTITION BY argument to OVER.

+

Window functions include all functions that look forward or backwards, like #chp-https://dplyr.tidyverse.org/reference/lead-lag and #chp-https://dplyr.tidyverse.org/reference/lead-lag:

+
+
flights |> 
+  group_by(dest) |>  
+  arrange(time_hour) |> 
+  mutate_query(
+    lead = lead(arr_delay),
+    lag = lag(arr_delay)
+  )
+#> <SQL>
+#> SELECT
+#>   dest,
+#>   LEAD(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lead,
+#>   LAG(arr_delay, 1, NULL) OVER (PARTITION BY dest ORDER BY time_hour) AS lag
+#> FROM flights
+#> ORDER BY time_hour
+
+

Here it’s important to #chp-https://dplyr.tidyverse.org/reference/arrange the data, because SQL tables have no intrinsic order. In fact, if you don’t use #chp-https://dplyr.tidyverse.org/reference/arrange you might get the rows back in a different order every time! Notice for window functions, the ordering information is repeated: the ORDER BY clause of the main query doesn’t automatically apply to window functions.

+

Another important SQL function is CASE WHEN. It’s used as the translation of #chp-https://dplyr.tidyverse.org/reference/if_else and #chp-https://dplyr.tidyverse.org/reference/case_when, the dplyr function that it directly inspired. Here are a couple of simple examples:

+
+
flights |> 
+  mutate_query(
+    description = if_else(arr_delay > 0, "delayed", "on-time")
+  )
+#> <SQL>
+#> SELECT CASE WHEN (arr_delay > 0.0) THEN 'delayed' WHEN NOT (arr_delay > 0.0) THEN 'on-time' END AS description
+#> FROM flights
+flights |> 
+  mutate_query(
+    description = 
+      case_when(
+        arr_delay < -5 ~ "early", 
+        arr_delay < 5 ~ "on-time",
+        arr_delay >= 5 ~ "late"
+      )
+  )
+#> <SQL>
+#> SELECT CASE
+#> WHEN (arr_delay < -5.0) THEN 'early'
+#> WHEN (arr_delay < 5.0) THEN 'on-time'
+#> WHEN (arr_delay >= 5.0) THEN 'late'
+#> END AS description
+#> FROM flights
+
+

CASE WHEN is also used for some other functions that don’t have a direct translation from R to SQL. A good example of this is #chp-https://rdrr.io/r/base/cut:

+
+
flights |> 
+  mutate_query(
+    description =  cut(
+      arr_delay, 
+      breaks = c(-Inf, -5, 5, Inf), 
+      labels = c("early", "on-time", "late")
+    )
+  )
+#> <SQL>
+#> SELECT CASE
+#> WHEN (arr_delay <= -5.0) THEN 'early'
+#> WHEN (arr_delay <= 5.0) THEN 'on-time'
+#> WHEN (arr_delay > 5.0) THEN 'late'
+#> END AS description
+#> FROM flights
+
+

dbplyr also translates common string and date-time manipulation functions, which you can learn about in #chp-https://dbplyr.tidyverse.org/articles/translation-function. dbplyr’s translations are certainly not perfect, and there are many R functions that aren’t translated yet, but dbplyr does a surprisingly good job covering the functions that you’ll use most of the time.

+ +
+

+Learning more

+

If you’ve finished this chapter and would like to learn more about SQL, we have two recommendations:

+
  • +#chp-https://sqlfordatascientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.
  • +
  • +#chp-https://www.practicalsql by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialized in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.
  • +
+
+
diff --git a/oreilly/datetimes.html b/oreilly/datetimes.html new file mode 100644 index 0000000..bce55b4 --- /dev/null +++ b/oreilly/datetimes.html @@ -0,0 +1,771 @@ +
+

Dates and times

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

This chapter will show you how to work with dates and times in R. At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get!

+

To warm up, think about how many days there are in a year, and how many hours there are in a day. You probably remembered that most years have 365 days, but leap years have 366. Do you know the full rule for determining if a year is a leap year? (A year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400. In other words, in every set of 400 years, there are 97 leap years.) The number of hours in a day is a little less obvious: most days have 24 hours, but in places that use daylight saving time (DST), one day each year has 23 hours and another has 25.
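lubridate, which we’ll load in just a moment, encodes that leap year rule in leap_year(), so you can check your intuition:

leap_year(c(1900, 2000, 2020, 2023))
#> [1] FALSE  TRUE  TRUE FALSE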

+

Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST. This chapter won’t teach you every last detail about dates and times, but it will give you a solid grounding of practical skills that will help you with common data analysis challenges.

+

We’ll begin by showing you how to create date-times from various inputs, and then once you’ve got a date-time, how you can extract components like year, month, and day. We’ll then dive into the tricky topic of working with time spans, which come in a variety of flavors depending on what you’re trying to do. We’ll conclude with a brief discussion of the additional challenges posed by time zones.

+ +
+

+Prerequisites

+

This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you’re working with dates/times. We will also need nycflights13 for practice data.

+
+
library(tidyverse)
+
+library(lubridate)
+library(nycflights13)
+
+
+
+ +
+

+Creating date/times

+

There are three types of date/time data that refer to an instant in time:

+
  • A date. Tibbles print this as <date>.

  • +
  • A time within a day. Tibbles print this as <time>.

  • +
  • A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that name doesn’t exactly trip off the tongue.

  • +

In this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
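For example, a quick sketch with the hms package:

hms::hms(seconds = 56, minutes = 34, hours = 12)
#> 12:34:56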

+

You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.

+

To get the current date or date-time you can use today() or now() (#chp-https://lubridate.tidyverse.org/reference/now):

+
+
today()
+#> [1] "2022-11-18"
+now()
+#> [1] "2022-11-18 10:21:36 CST"
+
+

Otherwise, the following sections describe the four ways you’re likely to create a date/time:

+
  • While reading a file with readr.
  • +
  • From a string.
  • +
  • From individual date-time components.
  • +
  • From an existing date/time object.
  • +
+
+

+During import

+

If your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:

+
+
csv <- "
+  date,datetime
+  2022-01-02,2022-01-02 05:12
+"
+read_csv(csv)
+#> # A tibble: 1 × 2
+#>   date       datetime           
+#>   <date>     <dttm>             
+#> 1 2022-01-02 2022-01-02 05:12:00
+
+

If you haven’t heard of ISO8601 before, it’s an international standard (see https://xkcd.com/1179/) for writing dates where the components of a date are organised from biggest to smallest, separated by -. For example, in ISO8601 May 3 2022 is 2022-05-03. ISO8601 dates can also include times, where hour, minute, and second are separated by :, and the date and time components are separated by either a T or a space. For example, you could write 4:26pm on May 3 2022 as either 2022-05-03 16:26 or 2022-05-03T16:26.

+

For other date-time formats, you’ll need to use col_types plus #chp-https://readr.tidyverse.org/reference/parse_datetime or #chp-https://readr.tidyverse.org/reference/parse_datetime along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a % followed by a single character. For example, %Y-%m-%d specifies a date that’s a year, -, month (as number) -, day. Table #tbl-date-formats lists all the options.

+
Table 17.1: All date formats understood by readr

Type    Code    Meaning                          Example
Year    %Y      4 digit year                     2021
        %y      2 digit year                     21
Month   %m      Number                           2
        %b      Abbreviated name                 Feb
        %B      Full name                        February
Day     %d      Two digits                       02
        %e      One or two digits                2
Time    %H      24-hour hour                     13
        %I      12-hour hour                     1
        %p      AM/PM                            pm
        %M      Minutes                          35
        %S      Seconds                          45
        %OS     Seconds with decimal component   45.35
        %Z      Time zone name                   America/Chicago
        %z      Offset from UTC                  +0800
Other   %.      Skip one non-digit               :
        %*      Skip any number of non-digits
+

And this code shows a few options applied to a very ambiguous date:

+
+
csv <- "
+  date
+  01/02/15
+"
+
+read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
+#> # A tibble: 1 × 1
+#>   date      
+#>   <date>    
+#> 1 2015-01-02
+
+read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
+#> # A tibble: 1 × 1
+#>   date      
+#>   <date>    
+#> 1 2015-02-01
+
+read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))
+#> # A tibble: 1 × 1
+#>   date      
+#>   <date>    
+#> 1 2001-02-15
+
+

Note that no matter how you specify the date format, it’s always displayed the same way once you get it into R.

+

If you’re using %b or %B and working with non-English dates, you’ll also need to provide a #chp-https://readr.tidyverse.org/reference/locale. See the list of built-in languages in #chp-https://readr.tidyverse.org/reference/date_names, or create your own with #chp-https://readr.tidyverse.org/reference/date_names.
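For example, readr ships with French date names, so you can parse a French date like this:

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
#> [1] "2015-01-01"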

+
+ +
+

+From strings

+

The date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers, which attempt to automatically determine the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. For example:

+
+
ymd("2017-01-31")
+#> [1] "2017-01-31"
+mdy("January 31st, 2017")
+#> [1] "2017-01-31"
+dmy("31-Jan-2017")
+#> [1] "2017-01-31"
+
+

#chp-https://lubridate.tidyverse.org/reference/ymd and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

+
+
ymd_hms("2017-01-31 20:11:59")
+#> [1] "2017-01-31 20:11:59 UTC"
+mdy_hm("01/31/2017 08:01")
+#> [1] "2017-01-31 08:01:00 UTC"
+
+

You can also force the creation of a date-time from a date by supplying a timezone:

+
+
ymd("2017-01-31", tz = "UTC")
+#> [1] "2017-01-31 UTC"
+
+
+ +
+

+From individual components

+

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

+
+
flights |> 
+  select(year, month, day, hour, minute)
+#> # A tibble: 336,776 × 5
+#>    year month   day  hour minute
+#>   <int> <int> <int> <dbl>  <dbl>
+#> 1  2013     1     1     5     15
+#> 2  2013     1     1     5     29
+#> 3  2013     1     1     5     40
+#> 4  2013     1     1     5     45
+#> 5  2013     1     1     6      0
+#> 6  2013     1     1     5     58
+#> # … with 336,770 more rows
+
+

To create a date/time from this sort of input, use #chp-https://lubridate.tidyverse.org/reference/make_datetime for dates, or #chp-https://lubridate.tidyverse.org/reference/make_datetime for date-times:

+
+
flights |> 
+  select(year, month, day, hour, minute) |> 
+  mutate(departure = make_datetime(year, month, day, hour, minute))
+#> # A tibble: 336,776 × 6
+#>    year month   day  hour minute departure          
+#>   <int> <int> <int> <dbl>  <dbl> <dttm>             
+#> 1  2013     1     1     5     15 2013-01-01 05:15:00
+#> 2  2013     1     1     5     29 2013-01-01 05:29:00
+#> 3  2013     1     1     5     40 2013-01-01 05:40:00
+#> 4  2013     1     1     5     45 2013-01-01 05:45:00
+#> 5  2013     1     1     6      0 2013-01-01 06:00:00
+#> 6  2013     1     1     5     58 2013-01-01 05:58:00
+#> # … with 336,770 more rows
+
+

Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.

+
+
make_datetime_100 <- function(year, month, day, time) {
+  make_datetime(year, month, day, time %/% 100, time %% 100)
+}
+
+flights_dt <- flights |> 
+  filter(!is.na(dep_time), !is.na(arr_time)) |> 
+  mutate(
+    dep_time = make_datetime_100(year, month, day, dep_time),
+    arr_time = make_datetime_100(year, month, day, arr_time),
+    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
+    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
+  ) |> 
+  select(origin, dest, ends_with("delay"), ends_with("time"))
+
+flights_dt
+#> # A tibble: 328,063 × 9
+#>   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
+#>   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
+#> 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
+#> 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
+#> 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
+#> 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
+#> 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
+#> 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
+#> # … with 328,057 more rows, and 3 more variables: arr_time <dttm>,
+#> #   sched_arr_time <dttm>, air_time <dbl>
+
+

With this data, we can visualize the distribution of departure times across the year:

+
+
flights_dt |> 
+  ggplot(aes(dep_time)) + 
+  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
+
+

A frequency polygon with departure time (Jan-Dec 2013) on the x-axis and number of flights on the y-axis (0-1000). The frequency polygon is binned by day so you see a time series of flights by day. The pattern is dominated by a weekly pattern; there are fewer flights on weekends. A few days stand out as having surprisingly few flights: early February, early July, late November, and late December.

+
+
+

Or within a single day:

+
+
flights_dt |> 
+  filter(dep_time < ymd(20130102)) |> 
+  ggplot(aes(dep_time)) + 
+  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
+
+

A frequency polygon with departure time (6am - midnight Jan 1) on the x-axis, number of flights on the y-axis (0-17), binned into 10 minute increments. It's hard to see much pattern because of high variability, but most bins have 8-12 flights, and there are markedly fewer flights before 6am and after 8pm.

+
+
+

Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.

+
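For example (a small sketch to make the units concrete):

ymd("2023-07-01") + 1                 # a date: 1 means 1 day
#> [1] "2023-07-02"
ymd_hms("2023-07-01 12:00:00") + 1    # a date-time: 1 means 1 second
#> [1] "2023-07-01 12:00:01 UTC"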
+ +
+

+From other types

+

You may want to switch between a date-time and a date. That’s the job of #chp-https://lubridate.tidyverse.org/reference/as_date and #chp-https://lubridate.tidyverse.org/reference/as_date:

+
+
as_datetime(today())
+#> [1] "2022-11-18 UTC"
+as_date(now())
+#> [1] "2022-11-18"
+
+

Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use #chp-https://lubridate.tidyverse.org/reference/as_date; if it’s in days, use #chp-https://lubridate.tidyverse.org/reference/as_date.

+
+
as_datetime(60 * 60 * 10)
+#> [1] "1970-01-01 10:00:00 UTC"
+as_date(365 * 10 + 2)
+#> [1] "1980-01-01"
+
+
+ +
+

+Exercises

+
  1. +

    What happens if you parse a string that contains invalid dates?

    +
    +
    ymd(c("2010-10-10", "bananas"))
    +
    +
  2. +
  3. What does the tzone argument to #chp-https://lubridate.tidyverse.org/reference/now do? Why is it important?

  4. +
  5. +

    For each of the following date-times show how you’d parse it using a readr column-specification and a lubridate function.

    +
    +
    d1 <- "January 1, 2010"
    +d2 <- "2015-Mar-07"
    +d3 <- "06-Jun-2017"
    +d4 <- c("August 19 (2015)", "July 1 (2015)")
    +d5 <- "12/30/14" # Dec 30, 2014
    +t1 <- "1705"
    +t2 <- "11:15:10.12 PM"
    +
    +
  6. +
+
+ +
+

+Date-time components

+

Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.

+ +
+

+Getting components

+

You can pull out individual parts of the date with the accessor functions #chp-https://lubridate.tidyverse.org/reference/year, #chp-https://lubridate.tidyverse.org/reference/month, #chp-https://lubridate.tidyverse.org/reference/day (day of the month), #chp-https://lubridate.tidyverse.org/reference/day (day of the year), #chp-https://lubridate.tidyverse.org/reference/day (day of the week), #chp-https://lubridate.tidyverse.org/reference/hour, #chp-https://lubridate.tidyverse.org/reference/minute, and #chp-https://lubridate.tidyverse.org/reference/second.

+
+
datetime <- ymd_hms("2026-07-08 12:34:56")
+
+year(datetime)
+#> [1] 2026
+month(datetime)
+#> [1] 7
+mday(datetime)
+#> [1] 8
+
+yday(datetime)
+#> [1] 189
+wday(datetime)
+#> [1] 4
+
+

For #chp-https://lubridate.tidyverse.org/reference/month and #chp-https://lubridate.tidyverse.org/reference/day you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

+
+
month(datetime, label = TRUE)
+#> [1] Jul
+#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
+wday(datetime, label = TRUE, abbr = FALSE)
+#> [1] Wednesday
+#> 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
+
+

We can use #chp-https://lubridate.tidyverse.org/reference/day to see that more flights depart during the week than on the weekend:

+
+
flights_dt |> 
+  mutate(wday = wday(dep_time, label = TRUE)) |> 
+  ggplot(aes(x = wday)) +
+    geom_bar()
+
+

A bar chart with days of the week on the x-axis and number of flights on the y-axis. Monday-Friday have roughly the same number of flights, ~48,000, decreasing slightly over the course of the week. Sunday is a little lower (~45,000), and Saturday is much lower (~38,000).

+
+
+

There’s an interesting pattern if we look at the average departure delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!

+
+
flights_dt |> 
+  mutate(minute = minute(dep_time)) |> 
+  group_by(minute) |> 
+  summarise(
+    avg_delay = mean(dep_delay, na.rm = TRUE),
+    n = n()) |> 
+  ggplot(aes(minute, avg_delay)) +
+    geom_line()
+
+

A line chart with minute of actual departure (0-60) on the x-axis and average delay (4-20) on the y-axis. Average delay starts at (0, 12), steadily increases to (18, 20), then sharply drops, hitting a minimum at ~23 minutes past the hour with ~9 minutes of delay. It then increases again to (35, 17), and sharply decreases to (55, 4). It finishes off with an increase to (60, 9).

+
+
+

Interestingly, if we look at the scheduled departure time we don’t see such a strong pattern:

+
+
sched_dep <- flights_dt |> 
+  mutate(minute = minute(sched_dep_time)) |> 
+  group_by(minute) |> 
+  summarise(
+    avg_delay = mean(arr_delay, na.rm = TRUE),
+    n = n())
+
+ggplot(sched_dep, aes(minute, avg_delay)) +
+  geom_line()
+
+

A line chart with minute of scheduled departure (0-60) on the x-axis and average delay (4-16 minutes) on the y-axis. There is relatively little pattern, just a small suggestion that the average delay decreases from maybe 10 minutes to 8 minutes over the course of the hour.

+
+
+

So why do we see that pattern with the actual departure times? Well, like much data collected by humans, there’s a strong bias towards flights leaving at “nice” departure times. Always be alert for this sort of pattern whenever you work with data that involves human judgement!

+
+
ggplot(sched_dep, aes(minute, n)) +
+  geom_line()
+
+

A line plot with departure minute (0-60) on the x-axis and number of flights (0-60,000) on the y-axis. Most flights are scheduled to depart on either the hour (~60,000) or the half hour (~35,000). Otherwise, almost all flights are scheduled to depart on multiples of five, with a few extra at 15, 45, and 55 minutes.

+
+
+
+ +
+

+Rounding

+

An alternative approach to plotting individual components is to round the date to a nearby unit of time, with #chp-https://lubridate.tidyverse.org/reference/round_date, #chp-https://lubridate.tidyverse.org/reference/round_date, and #chp-https://lubridate.tidyverse.org/reference/round_date. Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to. This, for example, allows us to plot the number of flights per week:

+
+
flights_dt |> 
+  count(week = floor_date(dep_time, "week")) |> 
+  ggplot(aes(week, n)) +
+  geom_line() + 
+  geom_point()
+
+

A line plot with week (Jan-Dec 2013) on the x-axis and number of flights (2,000-7,000) on the y-axis. The pattern is fairly flat from February to November with around 7,000 flights per week. There are far fewer flights on the first (approximately 4,500 flights) and last weeks of the year (approximately 2,500 flights).

+
+
+

You can use rounding to show the distribution of flights across the course of a day by computing the difference between dep_time and the earliest instant of that day:

+
+
flights_dt |> 
+  mutate(dep_hour = dep_time - floor_date(dep_time, "day")) |> 
+  ggplot(aes(dep_hour)) +
+    geom_freqpoly(binwidth = 60 * 30)
+#> Don't know how to automatically pick scale for object of type <difftime>.
+#> Defaulting to continuous.
+
+

A line plot with departure time on the x-axis, in units of seconds since midnight, so it’s hard to interpret.

+
+
+

Computing the difference between a pair of date-times yields a difftime (more on that in #sec-intervals). We can convert that to an hms object to get a more useful x-axis:

+
+
flights_dt |> 
+  mutate(dep_hour = hms::as_hms(dep_time - floor_date(dep_time, "day"))) |> 
+  ggplot(aes(dep_hour)) +
+    geom_freqpoly(binwidth = 60 * 30)
+
+

A line plot with departure time (midnight to midnight) on the x-axis and number of flights on the y-axis (0 to 15,000). There are very few (<100) flights before 5am. The number of flights then rises rapidly to 12,000 / hour, peaking at 15,000 at 9am, before falling to around 8,000 / hour for 10am to 2pm. Number of flights then increases to around 12,000 per hour until 8pm, when they rapidly drop again.

+
+
+
+ +
+

+Modifying components

+

You can also use each accessor function to modify the components of a date/time:

+
+
(datetime <- ymd_hms("2026-07-08 12:34:56"))
+#> [1] "2026-07-08 12:34:56 UTC"
+
+year(datetime) <- 2030
+datetime
+#> [1] "2030-07-08 12:34:56 UTC"
+month(datetime) <- 01
+datetime
+#> [1] "2030-01-08 12:34:56 UTC"
+hour(datetime) <- hour(datetime) + 1
+datetime
+#> [1] "2030-01-08 13:34:56 UTC"
+
+

Alternatively, rather than modifying an existing variable, you can create a new date-time with #chp-https://rdrr.io/r/stats/update. This also allows you to set multiple values in one step:

+
+
update(datetime, year = 2030, month = 2, mday = 2, hour = 2)
+#> [1] "2030-02-02 02:34:56 UTC"
+
+

If values are too big, they will roll over:

+
+
update(ymd("2023-02-01"), mday = 30)
+#> [1] "2023-03-02"
+update(ymd("2023-02-01"), hour = 400)
+#> [1] "2023-02-17 16:00:00 UTC"
+
+
+ +
+

+Exercises

+
  1. How does the distribution of flight times within a day change over the course of the year?

  2. +
  3. Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings.

  4. +
  5. Compare air_time with the duration between the departure and arrival. Explain your findings. (Hint: consider the location of the airport.)

  6. +
  7. How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?

  8. +
  9. On what day of the week should you leave if you want to minimise the chance of a delay?

  10. +
  11. What makes the distribution of diamonds$carat and flights$sched_dep_time similar?

  12. +
  13. Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

  14. +
+
+ +
+

+Time spans

+

Next you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:

+
  • +Durations, which represent an exact number of seconds.
  • +
  • +Periods, which represent human units like weeks and months.
  • +
  • +Intervals, which represent a starting and ending point.
  • +

How do you pick between durations, periods, and intervals? As always, pick the simplest data structure that solves your problem. If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval. The sketch below previews the difference.

+ +
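Here is a quick preview of the next three sections, contrasting the three classes around the 2026 US DST transition (assuming, as in the example later in this section, that DST starts on March 8, 2026; start is a throwaway variable, and all of these functions are introduced below):

start <- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")

start + ddays(1)   # duration: exactly 86,400 seconds later
#> [1] "2026-03-08 14:00:00 EDT"
start + days(1)    # period: the same clock time on the next day
#> [1] "2026-03-08 13:00:00 EDT"
(start %--% (start + days(1))) / dhours(1)  # interval: actual elapsed hours
#> [1] 23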
+

+Durations

+

In R, when you subtract two dates, you get a difftime object:

+
+
# How old is Hadley?
+h_age <- today() - ymd("1979-10-14")
+h_age
+#> Time difference of 15741 days
+
+

A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.

+
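To see the ambiguity, compare the units that a difftime picks for spans of different sizes (a quick sketch):

ymd("2023-01-02") - ymd("2023-01-01")
#> Time difference of 1 days
ymd_hms("2023-01-01 12:00:00") - ymd_hms("2023-01-01 11:00:00")
#> Time difference of 1 hours

With that in mind, here’s h_age converted to a duration: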
+
as.duration(h_age)
+#> [1] "1360022400s (~43.1 years)"
+
+

Durations come with a bunch of convenient constructors:

+
+
dseconds(15)
+#> [1] "15s"
+dminutes(10)
+#> [1] "600s (~10 minutes)"
+dhours(c(12, 24))
+#> [1] "43200s (~12 hours)" "86400s (~1 days)"
+ddays(0:5)
+#> [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
+#> [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
+dweeks(3)
+#> [1] "1814400s (~3 weeks)"
+dyears(1)
+#> [1] "31557600s (~1 years)"
+
+

Durations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. There’s no way to convert a month to a duration, because there’s just too much variation.

+

You can add and multiply durations:

+
+
2 * dyears(1)
+#> [1] "63115200s (~2 years)"
+dyears(1) + dweeks(12) + dhours(15)
+#> [1] "38869200s (~1.23 years)"
+
+

You can add and subtract durations to and from dates:

+
+
tomorrow <- today() + ddays(1)
+last_year <- today() - dyears(1)
+
+

However, because durations represent an exact number of seconds, sometimes you might get an unexpected result:

+
+
one_pm <- ymd_hms("2026-03-07 13:00:00", tz = "America/New_York")
+
+one_pm
+#> [1] "2026-03-07 13:00:00 EST"
+one_pm + ddays(1)
+#> [1] "2026-03-08 14:00:00 EDT"
+
+

Why is one day after 1pm March 7, 2pm March 8? If you look carefully at the date you might also notice that the time zone has changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.

+
+ +
+

+Periods

+

To solve this problem, lubridate provides periods. Periods are time spans that don’t have a fixed length in seconds; instead, they work with “human” times, like days and months. That allows them to work in a more intuitive way:

+
+
one_pm
+#> [1] "2026-03-07 13:00:00 EST"
+one_pm + days(1)
+#> [1] "2026-03-08 13:00:00 EDT"
+
+

Like durations, periods can be created with a number of friendly constructor functions.

+
+
hours(c(12, 24))
+#> [1] "12H 0M 0S" "24H 0M 0S"
+days(7)
+#> [1] "7d 0H 0M 0S"
+months(1:6)
+#> [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
+#> [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
+
+

You can add and multiply periods:

+
+
10 * (months(6) + days(1))
+#> [1] "60m 10d 0H 0M 0S"
+days(50) + hours(25) + minutes(2)
+#> [1] "50d 25H 2M 0S"
+
+

And of course, add them to dates. Compared to durations, periods are more likely to do what you expect:

+
+
# A leap year
+ymd("2024-01-01") + dyears(1)
+#> [1] "2024-12-31 06:00:00 UTC"
+ymd("2024-01-01") + years(1)
+#> [1] "2025-01-01"
+
+# Daylight Saving Time
+one_pm + ddays(1)
+#> [1] "2026-03-08 14:00:00 EDT"
+one_pm + days(1)
+#> [1] "2026-03-08 13:00:00 EDT"
+
+

Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.

+
+
flights_dt |> 
+  filter(arr_time < dep_time) 
+#> # A tibble: 10,640 × 9
+#>   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
+#>   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
+#> 1 EWR    BQN           9        -4 2013-01-01 19:29:00 2013-01-01 19:20:00
+#> 2 JFK    DFW          59        NA 2013-01-01 19:39:00 2013-01-01 18:40:00
+#> 3 EWR    TPA          -2         9 2013-01-01 20:58:00 2013-01-01 21:00:00
+#> 4 EWR    SJU          -6       -12 2013-01-01 21:02:00 2013-01-01 21:08:00
+#> 5 EWR    SFO          11       -14 2013-01-01 21:08:00 2013-01-01 20:57:00
+#> 6 LGA    FLL         -10        -2 2013-01-01 21:20:00 2013-01-01 21:30:00
+#> # … with 10,634 more rows, and 3 more variables: arr_time <dttm>,
+#> #   sched_arr_time <dttm>, air_time <dbl>
+
+

These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.

+
+
flights_dt <- flights_dt |> 
+  mutate(
+    overnight = arr_time < dep_time,
+    arr_time = arr_time + days(if_else(overnight, 1, 0)),
+    sched_arr_time = sched_arr_time + days(overnight * 1)
+  )
+
+

Now all of our flights obey the laws of physics.

+
+
flights_dt |> 
+  filter(overnight, arr_time < dep_time) 
+#> # A tibble: 0 × 10
+#> # … with 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
+#> #   arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
+#> #   sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>
+
+
+ +
+

+Intervals

+

You might expect dyears(1) / ddays(365) to return exactly one, since durations are always represented by a number of seconds. But it’s not quite one: a duration of a year is defined as 365.25 days’ worth of seconds (the “average” year), so the result is slightly greater than one.

+

What should years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:

+
+
years(1) / days(1)
+#> [1] 365.25
+
+

If you want a more accurate measurement, you’ll have to use an interval. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.

+

You can create an interval by writing start %--% end:

+
+
y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01")
+y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01")
+
+y2023
+#> [1] 2023-01-01 UTC--2024-01-01 UTC
+y2024
+#> [1] 2024-01-01 UTC--2025-01-01 UTC
+
+

You could then divide it by #chp-https://lubridate.tidyverse.org/reference/period to find out how many days fit in the year:

+
+
y2023 / days(1)
+#> [1] 365
+y2024 / days(1)
+#> [1] 366
+
+
+ +
+

+Exercises

+
  1. Explain days(overnight * 1) to someone who has just started learning R. How does it work?

  2. +
  3. Create a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.

  4. +
  5. Write a function that given your birthday (as a date), returns how old you are in years.

  6. +
  7. Why can’t (today() %--% (today() + years(1))) / months(1) work?

  8. +
+
+ +
+

+Time zones

+

Time zones are an enormously complicated topic because of their interaction with geopolitical entities. Fortunately we don’t need to dig into all the details as they’re not all important for data analysis, but there are a few challenges we’ll need to tackle head on.

+ +

The first challenge is that everyday names of time zones tend to be ambiguous. For example, if you’re American you’re probably familiar with EST, or Eastern Standard Time. However, both Australia and Canada also have EST! To avoid confusion, R uses the international standard IANA time zones. These use a consistent naming scheme {area}/{location}, typically in the form {continent}/{city} or {ocean}/{city}. Examples include “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”.

+

You might wonder why the time zone uses a city, when typically you think of time zones as associated with a country or region within a country. This is because the IANA database has to record decades worth of time zone rules. Over the course of decades, countries change names (or break apart) fairly frequently, but city names tend to stay the same. Another problem is that the name needs to reflect not only the current behavior, but also the complete history. For example, there are time zones for both “America/New_York” and “America/Detroit”. These cities both currently use Eastern Standard Time but in 1969-1972 Michigan (the state in which Detroit is located) did not follow DST, so it needs a different name. It’s worth reading the raw time zone database (available at https://www.iana.org/time-zones) just for some of these stories!

+

You can find out what R thinks your current time zone is with #chp-https://rdrr.io/r/base/timezones:

+
+
Sys.timezone()
+#> [1] "America/Chicago"
+
+

(If R doesn’t know, you’ll get an NA.)

+

And see the complete list of all time zone names with #chp-https://rdrr.io/r/base/timezones:

+
+
length(OlsonNames())
+#> [1] 595
+head(OlsonNames())
+#> [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
+#> [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"
+
+

In R, the time zone is an attribute of the date-time that only controls printing. For example, these three objects represent the same instant in time:

+
+
x1 <- ymd_hms("2024-06-01 12:00:00", tz = "America/New_York")
+x1
+#> [1] "2024-06-01 12:00:00 EDT"
+
+x2 <- ymd_hms("2024-06-01 18:00:00", tz = "Europe/Copenhagen")
+x2
+#> [1] "2024-06-01 18:00:00 CEST"
+
+x3 <- ymd_hms("2024-06-02 04:00:00", tz = "Pacific/Auckland")
+x3
+#> [1] "2024-06-02 04:00:00 NZST"
+
+

You can verify that they’re the same time using subtraction:

+
+
x1 - x2
+#> Time difference of 0 secs
+x1 - x3
+#> Time difference of 0 secs
+
+

Unless otherwise specified, lubridate always uses UTC. UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to GMT (Greenwich Mean Time). It does not have DST, which makes it a convenient representation for computation. Operations that combine date-times, like #chp-https://rdrr.io/r/base/c, will often drop the time zone. In that case, the date-times will display in your local time zone:

+
+
x4 <- c(x1, x2, x3)
+x4
+#> [1] "2024-06-01 12:00:00 EDT" "2024-06-01 12:00:00 EDT"
+#> [3] "2024-06-01 12:00:00 EDT"
+
+

You can change the time zone in two ways:

+
  • +

    Keep the instant in time the same, and change how it’s displayed. Use this when the instant is correct, but you want a more natural display.

    +
    +
    x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
    +x4a
    +#> [1] "2024-06-02 02:30:00 +1030" "2024-06-02 02:30:00 +1030"
    +#> [3] "2024-06-02 02:30:00 +1030"
    +x4a - x4
    +#> Time differences in secs
    +#> [1] 0 0 0
    +
    +

(This also illustrates another challenge of time zones: they’re not all integer hour offsets!)

    +
  • +
  • +

    Change the underlying instant in time. Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.

    +
    +
    x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
    +x4b
    +#> [1] "2024-06-01 12:00:00 +1030" "2024-06-01 12:00:00 +1030"
    +#> [3] "2024-06-01 12:00:00 +1030"
    +x4b - x4
    +#> Time differences in hours
    +#> [1] -14.5 -14.5 -14.5
    +
    +
  • +
+ +
+

+Summary

+

This chapter has introduced you to the tools that lubridate provides to help you work with date-time data. Working with dates and times can seem harder than necessary, but hopefully this chapter has helped you see why: date-times are more complex than they seem at first glance, and handling every possible situation adds complexity. Even if your data never crosses a daylight saving time boundary or involves a leap year, the functions need to be able to handle it.

+

The next chapter gives a round up of missing values. You’ve seen them in a few places and have no doubt encountered them in your own analyses; now it’s time to provide a grab bag of useful techniques for dealing with them.

+ + +
+
diff --git a/oreilly/factors.html b/oreilly/factors.html new file mode 100644 index 0000000..e7f8145 --- /dev/null +++ b/oreilly/factors.html @@ -0,0 +1,446 @@ +
+

Factors

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

Factors are used for categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.

+

We’ll start by motivating why factors are needed for data analysis and how you can create them with #chp-https://rdrr.io/r/base/factor. We’ll then introduce you to the gss_cat dataset which contains a bunch of categorical variables to experiment with. You’ll then use that dataset to practice modifying the order and values of factors, before we finish up with a discussion of ordered factors.

+ +
+

+Prerequisites

+

Base R provides some basic tools for creating and manipulating factors. We’ll supplement these with the forcats package, which is part of the core tidyverse. It provides a wide range of helpers for working with categorical variables (and it’s an anagram of factors!).

+
+
library(tidyverse)
+
+
+
+ +
+

+Factor basics

+

Imagine that you have a variable that records month:

+
+
x1 <- c("Dec", "Apr", "Jan", "Mar")
+
+

Using a string to record this variable has two problems:

+
  1. +

    There are only twelve possible months, and there’s nothing saving you from typos:

    +
    +
    x2 <- c("Dec", "Apr", "Jam", "Mar")
    +
    +
  2. +
  3. +

    It doesn’t sort in a useful way:

    +
    +
    sort(x1)
    +#> [1] "Apr" "Dec" "Jan" "Mar"
    +
    +
  4. +

You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid levels:

+
+
month_levels <- c(
+  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
+  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
+)
+
+

Now you can create a factor:

+
+
y1 <- factor(x1, levels = month_levels)
+y1
+#> [1] Dec Apr Jan Mar
+#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
+
+sort(y1)
+#> [1] Jan Mar Apr Dec
+#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
+
+

And any values not in the levels will be silently converted to NA:

+
+
y2 <- factor(x2, levels = month_levels)
+y2
+#> [1] Dec  Apr  <NA> Mar 
+#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
+
+

This seems risky, so you might want to use #chp-https://forcats.tidyverse.org/reference/fct instead:

+
+
y2 <- fct(x2, levels = month_levels)
+#> Error in `fct()`:
+#> ! All values of `x` must appear in `levels` or `na`
+#> ℹ Missing level: "Jam"
+
+

If you omit the levels, they’ll be taken from the data in alphabetical order:

+
+
factor(x1)
+#> [1] Dec Apr Jan Mar
+#> Levels: Apr Dec Jan Mar
+
+

Sometimes you’d prefer that the order of the levels matches the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with #chp-https://forcats.tidyverse.org/reference/fct_inorder:

+
+
f1 <- factor(x1, levels = unique(x1))
+f1
+#> [1] Dec Apr Jan Mar
+#> Levels: Dec Apr Jan Mar
+
+f2 <- x1 |> factor() |> fct_inorder()
+f2
+#> [1] Dec Apr Jan Mar
+#> Levels: Dec Apr Jan Mar
+
+

If you ever need to access the set of valid levels directly, you can do so with #chp-https://rdrr.io/r/base/levels:

+
+
levels(f2)
+#> [1] "Dec" "Apr" "Jan" "Mar"
+
+

You can also create a factor when reading your data with readr with #chp-https://readr.tidyverse.org/reference/parse_factor:

+
+
csv <- "
+month,value
+Jan,12
+Feb,56
+Mar,12"
+
+df <- read_csv(csv, col_types = cols(month = col_factor(month_levels)))
+df$month
+#> [1] Jan Feb Mar
+#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
+
+
+ +
+

+General Social Survey

+

For the rest of this chapter, we’re going to use #chp-https://forcats.tidyverse.org/reference/gss_cat. It’s a sample of data from the #chp-https://gss.norc, a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in gss_cat Hadley selected a handful that will illustrate some common challenges you’ll encounter when working with factors.

+
+
gss_cat
+#> # A tibble: 21,483 × 9
+#>    year marital         age race  rincome        partyid     relig denom tvhours
+#>   <int> <fct>         <int> <fct> <fct>          <fct>       <fct> <fct>   <int>
+#> 1  2000 Never married    26 White $8000 to 9999  Ind,near r… Prot… Sout…      12
+#> 2  2000 Divorced         48 White $8000 to 9999  Not str re… Prot… Bapt…      NA
+#> 3  2000 Widowed          67 White Not applicable Independent Prot… No d…       2
+#> 4  2000 Never married    39 White Not applicable Ind,near r… Orth… Not …       4
+#> 5  2000 Divorced         25 White Not applicable Not str de… None  Not …       1
+#> 6  2000 Married          25 White $20000 - 24999 Strong dem… Prot… Sout…      NA
+#> # … with 21,477 more rows
+
+

(Remember, since this dataset is provided by a package, you can get more information about the variables with #chp-https://forcats.tidyverse.org/reference/gss_cat.)

+

When factors are stored in a tibble, you can’t see their levels so easily. One way to view them is with #chp-https://dplyr.tidyverse.org/reference/count:

+
+
gss_cat |>
+  count(race)
+#> # A tibble: 3 × 2
+#>   race      n
+#>   <fct> <int>
+#> 1 Other  1959
+#> 2 Black  3129
+#> 3 White 16395
+
+

Or with a bar chart:

+
+
ggplot(gss_cat, aes(race)) +
+  geom_bar()
+
+

A bar chart showing the distribution of race. There are ~2,000 records with race “Other”, ~3,000 with race “Black”, and ~16,000 with race “White”.

+
+
+

When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.

+ +
+

+Exercises

+
  1. Explore the distribution of rincome (reported income). What makes the default bar chart hard to understand? How could you improve the plot?

  2. +
  3. What is the most common relig in this survey? What’s the most common partyid?

  4. +
  5. Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualization?

  6. +
+
+ +
+

+Modifying factor order

+

It’s often useful to change the order of the factor levels in a visualization. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:

+
+
relig_summary <- gss_cat |>
+  group_by(relig) |>
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+
+ggplot(relig_summary, aes(tvhours, relig)) + 
+  geom_point()
+
+

A scatterplot with tvhours on the x-axis and religion on the y-axis. The y-axis is ordered seemingly arbitrarily, making it hard to get any sense of overall pattern.

+
+
+

It is hard to read this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using #chp-https://forcats.tidyverse.org/reference/fct_reorder. #chp-https://forcats.tidyverse.org/reference/fct_reorder takes three arguments:

+
  • +f, the factor whose levels you want to modify.
  • +
  • +x, a numeric vector that you want to use to reorder the levels.
  • +
  • Optionally, fun, a function that’s used if there are multiple values of x for each value of f. The default value is median.
  • +
+
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
+  geom_point()
+
+

The same scatterplot as above, but now the religions are displayed in increasing order of tvhours. “Other eastern” has the fewest tvhours (under 2), and “Don’t know” has the highest (over 5).

+
+
+

Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.

+

As you start making more complicated transformations, we recommend moving them out of #chp-https://ggplot2.tidyverse.org/reference/aes and into a separate #chp-https://dplyr.tidyverse.org/reference/mutate step. For example, you could rewrite the plot above as:

+
+
relig_summary |>
+  mutate(
+    relig = fct_reorder(relig, tvhours)
+  ) |>
+  ggplot(aes(tvhours, relig)) +
+  geom_point()
+
+

What if we create a similar plot looking at how average age varies across reported income level?

+
+
rincome_summary <- gss_cat |>
+  group_by(rincome) |>
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+
+ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + 
+  geom_point()
+
+

A scatterplot with age on the x-axis and income on the y-axis. Income has been reordered in order of average age which doesn't make much sense. One section of the y-axis goes from $6000-6999, then <$1000, then $8000-9999.

+
+
+

Here, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve #chp-https://forcats.tidyverse.org/reference/fct_reorder for factors whose levels are arbitrarily ordered.

+

However, it does make sense to pull “Not applicable” to the front with the other special levels. You can use #chp-https://forcats.tidyverse.org/reference/fct_relevel. It takes a factor, f, and then any number of levels that you want to move to the front of the line.

+
+
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
+  geom_point()
+
+

The same scatterplot but now "Not Applicable" is displayed at the bottom of the y-axis. Generally there is a positive association between income and age, and the income band with the highest average age is "Not applicable".

+
+
+

Why do you think the average age for “Not applicable” is so high?

+

Another type of reordering is useful when you are coloring the lines on a plot. fct_reorder2(f, x, y) reorders the factor f by the y values associated with the largest x values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.

+
+
+by_age <- gss_cat |>
+  filter(!is.na(age)) |>
+  count(age, marital) |>
+  group_by(age) |>
+  mutate(
+    prop = n / sum(n)
+  )
+
+ggplot(by_age, aes(age, prop, colour = marital)) +
+  geom_line(na.rm = TRUE)
+
+ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
+  geom_line() +
+  labs(colour = "marital")
+
+
+
+

A line plot with age on the x-axis and proportion on the y-axis. There is one line for each category of marital status: no answer, never married, separated, divorced, widowed, and married. It is a little hard to read the plot because the order of the legend is unrelated to the lines on the plot.

+
+
+

The same line plot, but with the legend reordered by fct_reorder2() so that the legend colors match the order of the lines at the far right of the plot. This makes the plot much easier to read and reveals some unsurprising patterns: the proportion never married decreases with age, married forms an upside down U shape, and widowed starts off low but increases steeply after age 60.

+
+
+
+
+

Finally, for bar plots, you can use #chp-https://forcats.tidyverse.org/reference/fct_inorder to order levels in decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combine it with #chp-https://forcats.tidyverse.org/reference/fct_rev if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.

+
+
gss_cat |>
+  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
+  ggplot(aes(marital)) +
+  geom_bar()
+
+

A bar chart of marital status ordered from least to most common: no answer (~0), separated (~1,000), widowed (~2,000), divorced (~3,000), never married (~5,000), married (~10,000).

+
+
+ +
+

+Exercises

+
  1. There are some suspiciously high numbers in tvhours. Is the mean a good summary?

  2. +
  3. For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

  4. +
  5. Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?

  6. +
+
+ +
+

+Modifying factor levels

+

More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays. The most general and powerful tool is #chp-https://forcats.tidyverse.org/reference/fct_recode. It allows you to recode, or change, the value of each level. For example, take the gss_cat$partyid:

+
+
gss_cat |> count(partyid)
+#> # A tibble: 10 × 2
+#>   partyid                n
+#>   <fct>              <int>
+#> 1 No answer            154
+#> 2 Don't know             1
+#> 3 Other party          393
+#> 4 Strong republican   2314
+#> 5 Not str republican  3032
+#> 6 Ind,near rep        1791
+#> # … with 4 more rows
+
+

The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction. Like most rename and recoding functions in the tidyverse, the new values go on the left and the old values go on the right:

+
+
gss_cat |>
+  mutate(
+    partyid = fct_recode(partyid,
+      "Republican, strong"    = "Strong republican",
+      "Republican, weak"      = "Not str republican",
+      "Independent, near rep" = "Ind,near rep",
+      "Independent, near dem" = "Ind,near dem",
+      "Democrat, weak"        = "Not str democrat",
+      "Democrat, strong"      = "Strong democrat"
+    )
+  ) |>
+  count(partyid)
+#> # A tibble: 10 × 2
+#>   partyid                   n
+#>   <fct>                 <int>
+#> 1 No answer               154
+#> 2 Don't know                1
+#> 3 Other party             393
+#> 4 Republican, strong     2314
+#> 5 Republican, weak       3032
+#> 6 Independent, near rep  1791
+#> # … with 4 more rows
+
+

#chp-https://forcats.tidyverse.org/reference/fct_recode will leave the levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.

+
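Here’s a small sketch of both behaviors (the exact warning text may differ between forcats versions):

x <- factor(c("apple", "banana", "bear"))
fct_recode(x, fruit = "apple", fruit = "banana")
#> [1] fruit fruit bear 
#> Levels: fruit bear
fct_recode(x, fruit = "aple")  # a misspelled old level triggers a warning
#> Warning: Unknown levels in `f`: aple
#> [1] apple banana bear
#> Levels: apple banana bear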

To combine groups, you can assign multiple old levels to the same new level:

+
+
gss_cat |>
+  mutate(
+    partyid = fct_recode(partyid,
+      "Republican, strong"    = "Strong republican",
+      "Republican, weak"      = "Not str republican",
+      "Independent, near rep" = "Ind,near rep",
+      "Independent, near dem" = "Ind,near dem",
+      "Democrat, weak"        = "Not str democrat",
+      "Democrat, strong"      = "Strong democrat",
+      "Other"                 = "No answer",
+      "Other"                 = "Don't know",
+      "Other"                 = "Other party"
+    )
+  ) |>
+  count(partyid)
+#> # A tibble: 8 × 2
+#>   partyid                   n
+#>   <fct>                 <int>
+#> 1 Other                   548
+#> 2 Republican, strong     2314
+#> 3 Republican, weak       3032
+#> 4 Independent, near rep  1791
+#> 5 Independent            4119
+#> 6 Independent, near dem  2499
+#> # … with 2 more rows
+
+

Use this technique with care: if you group together categories that are truly different you will end up with misleading results.

+

If you want to collapse a lot of levels, #chp-https://forcats.tidyverse.org/reference/fct_collapse is a useful variant of #chp-https://forcats.tidyverse.org/reference/fct_recode. For each new level, you can provide a vector of old levels:

+
+
gss_cat |>
+  mutate(
+    partyid = fct_collapse(partyid,
+      "other" = c("No answer", "Don't know", "Other party"),
+      "rep" = c("Strong republican", "Not str republican"),
+      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
+      "dem" = c("Not str democrat", "Strong democrat")
+    )
+  ) |>
+  count(partyid)
+#> # A tibble: 4 × 2
+#>   partyid     n
+#>   <fct>   <int>
+#> 1 other     548
+#> 2 rep      5346
+#> 3 ind      8409
+#> 4 dem      7180
+
+

Sometimes you just want to lump together the small groups to make a plot or table simpler. That’s the job of the fct_lump_*() family of functions. #chp-https://forcats.tidyverse.org/reference/fct_lump is a simple starting point that progressively lumps the smallest categories into “Other”, always keeping “Other” as the smallest category.

+
+
gss_cat |>
+  mutate(relig = fct_lump_lowfreq(relig)) |>
+  count(relig)
+#> # A tibble: 2 × 2
+#>   relig          n
+#>   <fct>      <int>
+#> 1 Protestant 10846
+#> 2 Other      10637
+
+

In this case it’s not very helpful: it is true that the majority of Americans in this survey are Protestant, but we’d probably like to see some more detail! Instead, we can use #chp-https://forcats.tidyverse.org/reference/fct_lump to specify that we want exactly 10 groups:

+
+
gss_cat |>
+  mutate(relig = fct_lump_n(relig, n = 10)) |>
+  count(relig, sort = TRUE) |>
+  print(n = Inf)
+#> # A tibble: 10 × 2
+#>    relig                       n
+#>    <fct>                   <int>
+#>  1 Protestant              10846
+#>  2 Catholic                 5124
+#>  3 None                     3523
+#>  4 Christian                 689
+#>  5 Other                     458
+#>  6 Jewish                    388
+#>  7 Buddhism                  147
+#>  8 Inter-nondenominational   109
+#>  9 Moslem/islam              104
+#> 10 Orthodox-christian         95
+
+

Read the documentation to learn about #chp-https://forcats.tidyverse.org/reference/fct_lump and #chp-https://forcats.tidyverse.org/reference/fct_lump which are useful in other cases.

+ +
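For example, here’s a sketch with fct_lump_prop(), which lumps levels that make up less than a given proportion of the data (10% here); based on the counts above, that would keep Protestant, Catholic, and None, and lump everything else into “Other”:

gss_cat |>
  mutate(relig = fct_lump_prop(relig, prop = 0.1)) |>
  count(relig, sort = TRUE)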
+

+Exercises

+
  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?

  2. +
  3. How could you collapse rincome into a small set of categories?

  4. +
  5. Notice there are 9 groups (excluding other) in the fct_lump example above. Why not 10? (Hint: type #chp-https://forcats.tidyverse.org/reference/fct_lump, and find the default for the argument other_level is “Other”.)

  6. +
+
+ +
+

+Ordered factors

+

Before we go on, there’s a special type of factor that needs to be mentioned briefly: ordered factors. Ordered factors, created with #chp-https://rdrr.io/r/base/factor, imply a strict ordering and equal distance between levels: the first level is “less than” the second level by the same amount that the second level is “less than” the third level, and so on. You can recognize them when printing because they use < between the factor levels:

+
+
ordered(c("a", "b", "c"))
+#> [1] a b c
+#> Levels: a < b < c
+
+

In practice, #chp-https://rdrr.io/r/base/factor factors behave very similarly to regular factors. There are only two places where you might notice different behavior:

+
  • If you map an ordered factor to color or fill in ggplot2, it will default to scale_color_viridis()/scale_fill_viridis(), a color scale that implies a ranking.
  • +
  • If you use an ordered factor in a linear model, it will use “polynomial contrasts”. These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don’t routinely interpret them. If you want to learn more, we recommend vignette("contrasts", package = "faux") by Lisa DeBruine.
  • +

Given the arguable utility of these differences, we don’t generally recommend using ordered factors.

+
+ +
+

+Summary

+

This chapter introduced you to the handy forcats package for working with factors, introducing you to the most commonly used functions. forcats contains a wide range of other helpers that we didn’t have space to discuss here, so whenever you’re facing a factor analysis challenge that you haven’t encountered before, we highly recommend skimming the #chp-https://forcats.tidyverse.org/reference/index to see if there’s a canned function that can help solve your problem.

+

If you want to learn more about factors after reading this chapter, we recommend reading Amelia McNamara and Nicholas Horton’s paper, #chp-https://peerj.com/preprints/3163/. This paper lays out some of the history discussed in #chp-https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/ and #chp-https://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh, and compares the tidy approaches to categorical data outlined in this book with base R methods. An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!

+

In the next chapter we’ll switch gears to start learning about dates and times in R. Dates and times seem deceptively simple, but as you’ll soon see, the more you learn about them, the more complex they seem to get!

+ + +
+
diff --git a/oreilly/functions.html b/oreilly/functions.html new file mode 100644 index 0000000..c363bfe --- /dev/null +++ b/oreilly/functions.html @@ -0,0 +1,932 @@ +
+

Functions

+
+ +
+ +

+RStudio +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +

Once you start writing functions, there are two RStudio shortcuts that are super useful:

  • To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

  • +
  • To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.

  • +
+ +
+

+Introduction

+

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:

+
  1. You can give a function an evocative name that makes your code easier to understand.

  2. +
  3. As requirements change, you only need to update code in one place, instead of many.

  4. +
  5. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

  6. +

A good rule of thumb is to consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). In this chapter, you’ll learn about three useful types of functions:

+
  • Vector functions take one or more vectors as input and return a vector as output.
  • +
  • Data frame functions take a data frame as input and return a data frame as output.
  • +
  • Plot functions that take a data frame as input and return a plot as output.
  • +

Each of these sections includes many examples to help you generalize the patterns that you see. These examples wouldn’t be possible without the help of folks on twitter, and we encourage you to follow the links in the comments to see the original inspirations. You might also want to read the original motivating tweets for #chp-https://twitter.com/hadleywickham/status/1571603361350164486 and #chp-https://twitter.com/hadleywickham/status/1574373127349575680 to see even more functions.

+ +
+

+Prerequisites

+

We’ll wrap up a variety of functions from around the tidyverse. We’ll also use nycflights13 as a source of familiar data to use our functions with.

+
+
library(tidyverse)
+library(nycflights13)
+
+
+
+ +
+

+Vector functions

+

We’ll begin with vector functions: functions that take one or more vectors and return a vector result. For example, take a look at this code. What does it do?

+
+
df <- tibble(
+  a = rnorm(5),
+  b = rnorm(5),
+  c = rnorm(5),
+  d = rnorm(5),
+)
+
+df |> mutate(
+  a = (a - min(a, na.rm = TRUE)) / 
+    (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
+  b = (b - min(b, na.rm = TRUE)) / 
+    (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
+  c = (c - min(c, na.rm = TRUE)) / 
+    (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
+  d = (d - min(d, na.rm = TRUE)) / 
+    (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
+)
+#> # A tibble: 5 × 4
+#>       a     b     c     d
+#>   <dbl> <dbl> <dbl> <dbl>
+#> 1 0.339  2.59 0.291 0    
+#> 2 0.880  0    0.611 0.557
+#> 3 0      1.37 1     0.752
+#> 4 0.795  1.37 0     1    
+#> 5 1      1.34 0.580 0.394
+
+

You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? When Hadley wrote this code he made an error when copying-and-pasting and forgot to change an a to a b. Preventing this type of mistake is one very good reason to learn how to write functions.

+ +
+

+Writing a function

+

To write a function you need to first analyse your repeated code to figure out which parts are constant and which parts vary. If we take the code above and pull it outside of #chp-https://dplyr.tidyverse.org/reference/mutate, it’s a little easier to see the pattern because each repetition is now one line:

+
+
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
+(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
+(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
+(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))  
+
+

To make this a bit clearer we can replace the bit that varies with █:

+
+
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
+
+

To turn this into a function you need three things:

+
  1. A name. Here we’ll use rescale01 because this function rescales a vector to lie between 0 and 1.

  2. +
  3. The arguments. The arguments are things that vary across calls, and our analysis above tells us that we have just one. We’ll call it x because this is the conventional name for a numeric vector.

  4. +
  5. The body. The body is the code that’s repeated across all the calls.

  6. +

Then you create a function by following the template:

+
+
name <- function(arguments) {
+  body
+}
+
+

For this case that leads to:

+
+
rescale01 <- function(x) {
+  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
+}
+
+

At this point you might test with a few simple inputs to make sure you’ve captured the logic correctly:

+
+
rescale01(c(-10, 0, 10))
+#> [1] 0.0 0.5 1.0
+rescale01(c(1, 2, 3, NA, 5))
+#> [1] 0.00 0.25 0.50   NA 1.00
+
+

Then you can rewrite the call to #chp-https://dplyr.tidyverse.org/reference/mutate as:

+
+
df |> mutate(
+  a = rescale01(a),
+  b = rescale01(b),
+  c = rescale01(c),
+  d = rescale01(d),
+)
+#> # A tibble: 5 × 4
+#>       a     b     c     d
+#>   <dbl> <dbl> <dbl> <dbl>
+#> 1 0.339 1     0.291 0    
+#> 2 0.880 0     0.611 0.557
+#> 3 0     0.530 1     0.752
+#> 4 0.795 0.531 0     1    
+#> 5 1     0.518 0.580 0.394
+
+

(In #chp-iteration, you’ll learn how to use #chp-https://dplyr.tidyverse.org/reference/across to reduce the duplication even further so all you need is df |> mutate(across(a:d, rescale01))).

+
+ +
+

+Improving our function

+

You might notice that the rescale01() function does some unnecessary work: instead of computing #chp-https://rdrr.io/r/base/Extremes twice and #chp-https://rdrr.io/r/base/Extremes once, we could compute both the minimum and maximum in one step with #chp-https://rdrr.io/r/base/range:

+
+
rescale01 <- function(x) {
+  rng <- range(x, na.rm = TRUE)
+  (x - rng[1]) / (rng[2] - rng[1])
+}
+
+

Or you might try this function on a vector that includes an infinite value:

+
+
x <- c(1:10, Inf)
+rescale01(x)
+#>  [1]   0   0   0   0   0   0   0   0   0   0 NaN
+
+

That result is not particularly useful so we could ask #chp-https://rdrr.io/r/base/range to ignore infinite values:

+
+
rescale01 <- function(x) {
+  rng <- range(x, na.rm = TRUE, finite = TRUE)
+  (x - rng[1]) / (rng[2] - rng[1])
+}
+rescale01(x)
+#>  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
+#>  [8] 0.7777778 0.8888889 1.0000000       Inf
+
+

These changes illustrate an important benefit of functions: because we’ve moved the repeated code into a function, we only need to make the change in one place.

+
+ +
+

+Mutate functions

+

Now that you’ve got the basic idea of functions, let’s take a look at a whole bunch of examples. We’ll start by looking at “mutate” functions, i.e. functions that work well inside #chp-https://dplyr.tidyverse.org/reference/mutate and #chp-https://dplyr.tidyverse.org/reference/filter because they return an output of the same length as the input.

+

Let’s start with a simple variation of rescale01(). Maybe you want to compute the Z-score, rescaling a vector to have a mean of zero and a standard deviation of one:

+
+
z_score <- function(x) {
+  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
+}
+
+
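A quick check (note that sd() computes the sample standard deviation):

z_score(c(1, 2, 3, 4, 5))
#> [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111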

Or maybe you want to wrap up a straightforward #chp-https://dplyr.tidyverse.org/reference/case_when in order to give it a useful name. For example, this clamp() function ensures all values of a vector lie between a minimum and a maximum:

+
+
clamp <- function(x, min, max) {
+  case_when(
+    x < min ~ min,
+    x > max ~ max,
+    .default = x
+  )
+}
+clamp(1:10, min = 3, max = 7)
+#>  [1] 3 3 3 4 5 6 7 7 7 7
+
+

Or maybe you’d rather mark those values as NAs:

+
+
na_outside <- function(x, min, max) {
+  case_when(
+    x < min ~ NA,
+    x > max ~ NA,
+    .default = x
+  )
+}
+na_outside(1:10, min = 3, max = 7)
+#>  [1] NA NA  3  4  5  6  7 NA NA NA
+
+

Of course functions don’t just need to work with numeric variables. You might want to extract out some repeated string manipulation. Maybe you need to make the first character upper case:

+
+
first_upper <- function(x) {
+  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
+  x
+}
+first_upper("hello")
+#> [1] "Hello"
+
+

Or maybe you want to strip percent signs, commas, and dollar signs from a string before converting it into a number:

+
+
# https://twitter.com/NVlabormarket/status/1571939851922198530
+clean_number <- function(x) {
+  is_pct <- str_detect(x, "%")
+  num <- x |> 
+    str_remove_all("%") |> 
+    str_remove_all(",") |> 
+    str_remove_all(fixed("$")) |> 
+    as.numeric()
+  if_else(is_pct, num / 100, num)
+}
+clean_number("$12,300")
+#> [1] 12300
+clean_number("45%")
+#> [1] 0.45
+
+

Sometimes your functions will be highly specialized for one data analysis. For example, if you have a bunch of variables that record missing values as 997, 998, or 999, you might want to write a function to replace them with NA:

+
+
fix_na <- function(x) {
+  if_else(x %in% c(997, 998, 999), NA, x)
+}
+
+

We’ve focused on examples that take a single vector because we think they’re the most common. But there’s no reason that your function can’t take multiple vector inputs. For example, you might want to compute the distance between two locations on the globe using the haversine formula. This requires four vectors:

+
+
# https://twitter.com/RosanaFerrero/status/1574722120428539906/photo/1
+haversine <- function(long1, lat1, long2, lat2, round = 3) {
+  # convert to radians
+  long1 <- long1 * pi / 180
+  lat1  <- lat1  * pi / 180
+  long2 <- long2 * pi / 180
+  lat2  <- lat2  * pi / 180
+  
+  R <- 6371 # Earth mean radius in km
+  a <- sin((lat2 - lat1) / 2)^2 + 
+    cos(lat1) * cos(lat2) * sin((long2 - long1) / 2)^2
+  d <- R * 2 * asin(sqrt(a))
+  
+  round(d, round)
+}
+
+
+ +
+

+Summary functions

+

Another important family of vector functions is summary functions, functions that return a single value for use in summarise(). Sometimes this can just be a matter of setting a default argument or two:

+
+
commas <- function(x) {
+  str_flatten(x, collapse = ", ", last = " and ")
+}
+commas(c("cat", "dog", "pigeon"))
+#> [1] "cat, dog and pigeon"
+
+

Or you might wrap up a simple computation, like the coefficient of variation, which divides the standard deviation by the mean:

+
+
cv <- function(x, na.rm = FALSE) {
+  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
+}
+cv(runif(100, min = 0, max = 50))
+#> [1] 0.5196276
+cv(runif(100, min = 0, max = 500))
+#> [1] 0.5652554
+
+

Or maybe you just want to make a common pattern easier to remember by giving it a memorable name:

+
+
# https://twitter.com/gbganalyst/status/1571619641390252033
+n_missing <- function(x) {
+  sum(is.na(x))
+} 
+
+

You can also write functions with multiple vector inputs. For example, maybe you want to compute the mean absolute prediction error to help you compare model predictions with actual values:

+
+
# https://twitter.com/neilgcurrie/status/1571607727255834625
+mape <- function(actual, predicted) {
+  sum(abs((actual - predicted) / actual)) / length(actual)
+}
+
+

+RStudio +


+ +

Once you start writing functions, there are two RStudio shortcuts that are super useful:

  • To find the definition of a function that you’ve written, place the cursor on the name of the function and press F2.

  • +
  • To quickly jump to a function, press Ctrl + . to open the fuzzy file and function finder and type the first few letters of your function name. You can also navigate to files, Quarto sections, and more, making it a very handy navigation tool.

  • +
+ +
+ +
+

+Exercises

+
  1. +

    Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need?

    +
    +
    mean(is.na(x))
    +mean(is.na(y))
    +mean(is.na(z))
    +
    +x / sum(x, na.rm = TRUE)
    +y / sum(y, na.rm = TRUE)
    +z / sum(z, na.rm = TRUE)
    +
    +round(x / sum(x, na.rm = TRUE) * 100, 1)
    +round(y / sum(y, na.rm = TRUE) * 100, 1)
    +round(z / sum(z, na.rm = TRUE) * 100, 1)
    +
    +
  2. +
  3. In the second variant of rescale01(), infinite values are left unchanged. Can you rewrite rescale01() so that -Inf is mapped to 0, and Inf is mapped to 1?

  4. +
  5. Given a vector of birthdates, write a function to compute the age in years.

  6. +
  7. Write your own functions to compute the variance and skewness of a numeric vector. Variance is defined as \[ \mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x})^2 \text{,} \] where \(\bar{x} = (\sum_{i=1}^n x_i) / n\) is the sample mean. Skewness is defined as \[ \mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.} \]

  8. +
  9. Write both_na(), a summary function that takes two vectors of the same length and returns the number of positions that have an NA in both vectors.

  10. +
  11. +

    Read the documentation to figure out what the following functions do. Why are they useful even though they are so short?

    +
    +
    is_directory <- function(x) file.info(x)$isdir
    +is_readable <- function(x) file.access(x, 4) == 0
    +
    +
  12. +
+
+ +
+

+Data frame functions

+

Vector functions are useful for pulling out code that’s repeated within a dplyr verb. But you’ll often also repeat the verbs themselves, particularly within a large pipeline. When you notice yourself copying and pasting multiple verbs multiple times, you might think about writing a data frame function. Data frame functions work like dplyr verbs: they take a data frame as the first argument, some extra arguments that say what to do with it, and return a data frame or vector.

+

To let you write a function that uses dplyr verbs, we’ll first introduce you to the challenge of indirection and how you can overcome it with embracing, {{ }}. With this theory under your belt, we’ll then show you a bunch of examples to illustrate what you might do with it.

+ +
+

+Indirection and tidy evaluation

+

When you start writing functions that use dplyr verbs, you rapidly hit the problem of indirection. Let’s illustrate the problem with a very simple function: pull_unique(). The goal of this function is to pull() the unique (distinct) values of a variable:

+
+
pull_unique <- function(df, var) {
+  df |> 
+    distinct(var) |> 
+    pull(var)
+}
+
+

If we try and use it, we get an error:

+
+
diamonds |> pull_unique(clarity)
+#> Error in `distinct()`:
+#> ! Must use existing variables.
+#> ✖ `var` not found in `.data`.
+
+

To make the problem a bit more clear, we can use a made-up data frame:

+
+
df <- tibble(var = "var", x = "x", y = "y")
+df |> pull_unique(x)
+#> [1] "var"
+df |> pull_unique(y)
+#> [1] "var"
+
+

Regardless of how we call pull_unique() it always does df |> distinct(var) |> pull(var), instead of df |> distinct(x) |> pull(x) or df |> distinct(y) |> pull(y). This is a problem of indirection, and it arises because dplyr uses tidy evaluation to allow you to refer to the names of variables inside your data frame without any special treatment.

+

Tidy evaluation is great 95% of the time because it makes your data analyses very concise as you never have to say which data frame a variable comes from; it’s obvious from the context. The downside of tidy evaluation comes when we want to wrap up repeated tidyverse code into a function. Here we need some way to tell distinct() and pull() not to treat var as the name of a variable, but instead look inside var for the variable we actually want to use.

+

Tidy evaluation includes a solution to this problem called embracing 🤗. Embracing a variable means to wrap it in braces so (e.g.) var becomes {{ var }}. Embracing a variable tells dplyr to use the value stored inside the argument, not the argument as the literal variable name. One way to remember what’s happening is to think of {{ }} as looking down a tunnel — {{ var }} will make a dplyr function look inside of var rather than looking for a variable called var.

+

So to make pull_unique() work we need to replace var with {{ var }}:

+
+
pull_unique <- function(df, var) {
+  df |> 
+    distinct({{ var }}) |> 
+    pull({{ var }})
+}
+diamonds |> pull_unique(clarity)
+#> [1] SI2  SI1  VS1  VS2  VVS2 VVS1 I1   IF  
+#> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF
+
+

Success!

+
+ +
+

+When to embrace?

+

So the key challenge in writing data frame functions is figuring out which arguments need to be embraced. Fortunately this is easy because you can look it up from the documentation 😄. There are two terms to look for in the docs, which correspond to the two most common sub-types of tidy evaluation:

  • Data-masking: used in functions like arrange(), filter(), and summarise() that compute with variables.

  • Tidy-selection: used in functions like select(), relocate(), and rename() that select variables.

+

Your intuition about which arguments use tidy evaluation should be good for many common functions — just think about whether you can compute (e.g. x + 1) or select (e.g. a:x).
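For example, mutate() computes with variables, so its arguments are data-masking, while select() picks variables, so its arguments are tidy-selection. A quick sketch of the contrast, using the diamonds dataset:

diamonds |> mutate(price_per_carat = price / carat)  # computes with variables: data-masking
diamonds |> select(carat:clarity)                    # selects variables: tidy-selection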

+

In the following sections we’ll explore the sorts of handy functions you might write once you understand embracing.

+
+ +
+

+Common use cases

+

If you commonly perform the same set of summaries when doing initial data exploration, you might consider wrapping them up in a helper function:

+
+
summary6 <- function(data, var) {
+  data |> summarise(
+    min = min({{ var }}, na.rm = TRUE),
+    mean = mean({{ var }}, na.rm = TRUE),
+    median = median({{ var }}, na.rm = TRUE),
+    max = max({{ var }}, na.rm = TRUE),
+    n = n(),
+    n_miss = sum(is.na({{ var }})),
+    .groups = "drop"
+  )
+}
+diamonds |> summary6(carat)
+#> # A tibble: 1 × 6
+#>     min  mean median   max     n n_miss
+#>   <dbl> <dbl>  <dbl> <dbl> <int>  <int>
+#> 1   0.2 0.798    0.7  5.01 53940      0
+
+

(Whenever you wrap summarise() in a helper, we think it’s good practice to set .groups = "drop" to both avoid the message and leave the data in an ungrouped state.)

+

The nice thing about this function is that because it wraps summarise() you can use it on grouped data:

+
+
diamonds |> 
+  group_by(cut) |> 
+  summary6(carat)
+#> # A tibble: 5 × 7
+#>   cut         min  mean median   max     n n_miss
+#>   <ord>     <dbl> <dbl>  <dbl> <dbl> <int>  <int>
+#> 1 Fair       0.22 1.05    1     5.01  1610      0
+#> 2 Good       0.23 0.849   0.82  3.01  4906      0
+#> 3 Very Good  0.2  0.806   0.71  4    12082      0
+#> 4 Premium    0.2  0.892   0.86  4.01 13791      0
+#> 5 Ideal      0.2  0.703   0.54  3.5  21551      0
+
+

Because the arguments to summarise() use data-masking, the var argument to summary6() is data-masking too. That means you can also summarize computed variables:

+
+
diamonds |> 
+  group_by(cut) |> 
+  summary6(log10(carat))
+#> # A tibble: 5 × 7
+#>   cut          min    mean  median   max     n n_miss
+#>   <ord>      <dbl>   <dbl>   <dbl> <dbl> <int>  <int>
+#> 1 Fair      -0.658 -0.0273  0      0.700  1610      0
+#> 2 Good      -0.638 -0.133  -0.0862 0.479  4906      0
+#> 3 Very Good -0.699 -0.164  -0.149  0.602 12082      0
+#> 4 Premium   -0.699 -0.125  -0.0655 0.603 13791      0
+#> 5 Ideal     -0.699 -0.225  -0.268  0.544 21551      0
+
+

To summarize multiple variables, you’ll need to wait until the iteration chapter, where you’ll learn how to use across().
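As a small preview, here’s a sketch of what that will look like (across() is covered properly in the iteration chapter):

diamonds |> 
  group_by(cut) |> 
  summarise(across(c(carat, depth, price), \(x) mean(x, na.rm = TRUE)))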

+

Another popular summarise() helper function is a version of count() that also computes proportions:

+
+
# https://twitter.com/Diabb6/status/1571635146658402309
+count_prop <- function(df, var, sort = FALSE) {
+  df |>
+    count({{ var }}, sort = sort) |>
+    mutate(prop = n / sum(n))
+}
+diamonds |> count_prop(clarity)
+#> # A tibble: 8 × 3
+#>   clarity     n   prop
+#>   <ord>   <int>  <dbl>
+#> 1 I1        741 0.0137
+#> 2 SI2      9194 0.170 
+#> 3 SI1     13065 0.242 
+#> 4 VS2     12258 0.227 
+#> 5 VS1      8171 0.151 
+#> 6 VVS2     5066 0.0939
+#> # … with 2 more rows
+
+

This function has three arguments: df, var, and sort, and only var needs to be embraced because it’s passed to count(), which uses data-masking for all of its variable arguments.
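For example, because sort is passed along to count() unchanged, you can use it to order the output by frequency:

diamonds |> count_prop(clarity, sort = TRUE)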

+

Or maybe you want to find the sorted unique values of a variable for a subset of the data. Rather than supplying a variable and a value to do the filtering, we’ll allow the user to supply a condition:

+
+
unique_where <- function(df, condition, var) {
+  df |> 
+    filter({{ condition }}) |> 
+    distinct({{ var }}) |> 
+    arrange({{ var }}) |> 
+    pull({{ var }})
+}
+
+# Find all the destinations in December
+flights |> unique_where(month == 12, dest)
+#>  [1] "ABQ" "ALB" "ATL" "AUS" "AVL" "BDL" "BGR" "BHM" "BNA" "BOS" "BQN" "BTV"
+#> [13] "BUF" "BUR" "BWI" "BZN" "CAE" "CAK" "CHS" "CLE" "CLT" "CMH" "CVG" "DAY"
+#> [25] "DCA" "DEN" "DFW" "DSM" "DTW" "EGE" "EYW" "FLL" "GRR" "GSO" "GSP" "HDN"
+#> [37] "HNL" "HOU" "IAD" "IAH" "ILM" "IND" "JAC" "JAX" "LAS" "LAX" "LGB" "MCI"
+#> [49] "MCO" "MDW" "MEM" "MHT" "MIA" "MKE" "MSN" "MSP" "MSY" "MTJ" "OAK" "OKC"
+#> [61] "OMA" "ORD" "ORF" "PBI" "PDX" "PHL" "PHX" "PIT" "PSE" "PSP" "PVD" "PWM"
+#> [73] "RDU" "RIC" "ROC" "RSW" "SAN" "SAT" "SAV" "SBN" "SDF" "SEA" "SFO" "SJC"
+#> [85] "SJU" "SLC" "SMF" "SNA" "SRQ" "STL" "STT" "SYR" "TPA" "TUL" "TYS" "XNA"
+# Which months did plane N14228 fly in?
+flights |> unique_where(tailnum == "N14228", month)
+#>  [1]  1  2  3  4  5  6  7  8  9 10 12
+
+

Here we embrace condition because it’s passed to filter() and var because it’s passed to distinct(), arrange(), and pull().

+

We’ve made all these examples take a data frame as the first argument, but if you’re working repeatedly with the same data, it can make sense to hardcode it. For example, the following function always works with the flights dataset and always selects time_hour, carrier, and flight since they form the compound primary key that allows you to identify a row.

+
+
flights_sub <- function(rows, cols) {
+  flights |> 
+    filter({{ rows }}) |> 
+    select(time_hour, carrier, flight, {{ cols }})
+}
+
+flights_sub(dest == "IAH", contains("time"))
+#> # A tibble: 7,198 × 8
+#>   time_hour           carrier flight dep_time sched_de…¹ arr_t…² sched…³ air_t…⁴
+#>   <dttm>              <chr>    <int>    <int>      <int>   <int>   <int>   <dbl>
+#> 1 2013-01-01 05:00:00 UA        1545      517        515     830     819     227
+#> 2 2013-01-01 05:00:00 UA        1714      533        529     850     830     227
+#> 3 2013-01-01 06:00:00 UA         496      623        627     933     932     229
+#> 4 2013-01-01 07:00:00 UA         473      728        732    1041    1038     238
+#> 5 2013-01-01 07:00:00 UA        1479      739        739    1104    1038     249
+#> 6 2013-01-01 09:00:00 UA        1220      908        908    1228    1219     233
+#> # … with 7,192 more rows, and abbreviated variable names ¹​sched_dep_time,
+#> #   ²​arr_time, ³​sched_arr_time, ⁴​air_time
+
+
+ +
+

+Data-masking vs tidy-selection

+

Sometimes you want to select variables inside a function that uses data-masking. For example, imagine you want to write a count_missing() that counts the number of missing observations in rows. You might try writing something like:

+
+
count_missing <- function(df, group_vars, x_var) {
+  df |> 
+    group_by({{ group_vars }}) |> 
+    summarise(n_miss = sum(is.na({{ x_var }})))
+}
+flights |> 
+  count_missing(c(year, month, day), dep_time)
+#> Error in `group_by()`:
+#> ℹ In argument: `..1 = c(year, month, day)`.
+#> Caused by error:
+#> ! `..1` must be size 336776 or 1, not 1010328.
+
+

This doesn’t work because group_by() uses data-masking, not tidy-selection. We can work around that problem by using the handy pick(), which allows you to use tidy-selection inside data-masking functions:

+
+
count_missing <- function(df, group_vars, x_var) {
+  df |> 
+    group_by(pick({{ group_vars }})) |> 
+    summarise(n_miss = sum(is.na({{ x_var }})))
+}
+flights |> 
+  count_missing(c(year, month, day), dep_time)
+#> `summarise()` has grouped output by 'year', 'month'. You can override using the
+#> `.groups` argument.
+#> # A tibble: 365 × 4
+#> # Groups:   year, month [12]
+#>    year month   day n_miss
+#>   <int> <int> <int>  <int>
+#> 1  2013     1     1      4
+#> 2  2013     1     2      8
+#> 3  2013     1     3     10
+#> 4  2013     1     4      6
+#> 5  2013     1     5      3
+#> 6  2013     1     6      1
+#> # … with 359 more rows
+
+

Another convenient use of pick() is to make a 2d table of counts. Here we count using all the variables in the rows and columns, then use pivot_wider() to rearrange the counts into a grid:

+
+
# https://twitter.com/pollicipes/status/1571606508944719876
+count_wide <- function(data, rows, cols) {
+  data |> 
+    count(pick(c({{ rows }}, {{ cols }}))) |> 
+    pivot_wider(
+      names_from = {{ cols }}, 
+      values_from = n,
+      names_sort = TRUE,
+      values_fill = 0
+    )
+}
+diamonds |> count_wide(clarity, cut)
+#> # A tibble: 8 × 6
+#>   clarity  Fair  Good `Very Good` Premium Ideal
+#>   <ord>   <int> <int>       <int>   <int> <int>
+#> 1 I1        210    96          84     205   146
+#> 2 SI2       466  1081        2100    2949  2598
+#> 3 SI1       408  1560        3240    3575  4282
+#> 4 VS2       261   978        2591    3357  5071
+#> 5 VS1       170   648        1775    1989  3589
+#> 6 VVS2       69   286        1235     870  2606
+#> # … with 2 more rows
+diamonds |> count_wide(c(clarity, color), cut)
+#> # A tibble: 56 × 7
+#>   clarity color  Fair  Good `Very Good` Premium Ideal
+#>   <ord>   <ord> <int> <int>       <int>   <int> <int>
+#> 1 I1      D         4     8           5      12    13
+#> 2 I1      E         9    23          22      30    18
+#> 3 I1      F        35    19          13      34    42
+#> 4 I1      G        53    19          16      46    16
+#> 5 I1      H        52    14          12      46    38
+#> 6 I1      I        34     9           8      24    17
+#> # … with 50 more rows
+
+

While our examples have mostly focused on dplyr, tidy evaluation also underpins tidyr, and if you look at the pivot_wider() docs you can see that names_from uses tidy-selection.
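For example, because names_from is tidy-selection, a sketch reusing count_wide() from above can supply a selection of several columns as cols, and pivot_wider() will combine their values into the column names:

diamonds |> count_wide(clarity, c(cut, color))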

+
+ +
+

+Exercises

+
  1. +

    Using the datasets from nycflights13, write functions that:

    +
    1. +

      Find all flights that were cancelled (i.e. is.na(arr_time)) or delayed by more than an hour.

      +
      +
      flights |> filter_severe()
      +
      +
    2. +
    3. +

      Counts the number of cancelled flights and the number of flights delayed by more than an hour.

      +
      +
      flights |> group_by(dest) |> summarise_severe()
      +
      +
    4. +
    5. +

      Finds all flights that were cancelled or delayed by more than a user supplied number of hours:

      +
      +
      flights |> filter_severe(hours = 2)
      +
      +
    6. +
    7. +

      Summarizes the weather to compute the minimum, mean, and maximum of a user-supplied variable:

      +
      +
      weather |> summarise_weather(temp)
      +
      +
    8. +
    9. +

      Converts a user-supplied variable that uses clock time (e.g. dep_time, arr_time, etc.) into a decimal time (i.e. hours + minutes / 60).

      +
      +
      weather |> standardise_time(sched_dep_time)
      +
      +
    10. +
  2. +
  3. For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-select: distinct(), count(), group_by(), rename_with(), slice_min(), slice_sample().

  4. +
  5. +

    Generalize the following function so that you can supply any number of variables to count.

    +
    +
    count_prop <- function(df, var, sort = FALSE) {
    +  df |>
    +    count({{ var }}, sort = sort) |>
    +    mutate(prop = n / sum(n))
    +}
    +
    +
  6. +
+
+ +
+

+Plot functions

+

Instead of returning a data frame, you might want to return a plot. Fortunately you can use the same techniques with ggplot2, because aes() is a data-masking function. For example, imagine that you’re making a lot of histograms:

+
+
diamonds |> 
+  ggplot(aes(carat)) +
+  geom_histogram(binwidth = 0.1)
+
+diamonds |> 
+  ggplot(aes(carat)) +
+  geom_histogram(binwidth = 0.05)
+
+

Wouldn’t it be nice if you could wrap this up into a histogram function? This is easy once you know that aes() is a data-masking function, so you just need to embrace:

+
+
histogram <- function(df, var, binwidth = NULL) {
+  df |> 
+    ggplot(aes({{ var }})) + 
+    geom_histogram(binwidth = binwidth)
+}
+
+diamonds |> histogram(carat, 0.1)
+
+

+
+
+

Note that histogram() returns a ggplot2 plot, so that you can still add on additional components if you want. Just remember to switch from |> to +:

+
+
diamonds |> 
+  histogram(carat, 0.1) +
+  labs(x = "Size (in carats)", y = "Number of diamonds")
+
+

+
+
+ +
+

+More variables

+

It’s straightforward to add more variables to the mix. For example, maybe you want an easy way to eyeball whether or not a data set is linear by overlaying a smooth line and a straight line:

+
+
# https://twitter.com/tyler_js_smith/status/1574377116988104704
+
+linearity_check <- function(df, x, y) {
+  df |>
+    ggplot(aes({{ x }}, {{ y }})) +
+    geom_point() +
+    geom_smooth(method = "loess", color = "red", se = FALSE) +
+    geom_smooth(method = "lm", color = "blue", se = FALSE) 
+}
+
+starwars |> 
+  filter(mass < 1000) |> 
+  linearity_check(mass, height)
+#> `geom_smooth()` using formula = 'y ~ x'
+#> `geom_smooth()` using formula = 'y ~ x'
+
+

+
+
+

Or maybe you want an alternative to colored scatterplots for very large datasets where overplotting is a problem:

+
+
# https://twitter.com/ppaxisa/status/1574398423175921665
+hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
+  df |> 
+    ggplot(aes({{ x }}, {{ y }}, z = {{ z }})) + 
+    stat_summary_hex(
+      aes(colour = after_scale(fill)), # make border same colour as fill
+      bins = bins, 
+      fun = fun,
+    )
+}
+diamonds |> hex_plot(carat, price, depth)
+
+

+
+
+
+ +
+

+Combining with dplyr

+

Some of the most useful helpers combine a dash of dplyr with ggplot2. For example, you might want to make a vertical bar chart where you automatically sort the bars in frequency order using fct_infreq(). Since the bar chart is vertical, we also need to reverse the usual order with fct_rev() to get the highest values at the top:

+
+
sorted_bars <- function(df, var) {
+  df |> 
+    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |> 
+    ggplot(aes(y = {{ var }})) + 
+    geom_bar()
+}
+diamonds |> sorted_bars(cut)
+
+

+
+
+

Or maybe you want to make it easy to draw a bar plot just for a subset of the data:

+
+
conditional_bars <- function(df, condition, var) {
+  df |> 
+    filter({{ condition }}) |> 
+    ggplot(aes({{ var }})) + 
+    geom_bar()
+}
+
+diamonds |> conditional_bars(cut == "Good", clarity)
+
+

+
+
+

You can also get creative and display data summaries in other ways. For example, this code uses the axis labels to display the highest value. As you learn more about ggplot2, the power of your functions will continue to increase.

+
+
# https://gist.github.com/GShotwell/b19ef520b6d56f61a830fabb3454965b
+
+fancy_ts <- function(df, val, group) {
+  labs <- df |> 
+    group_by({{group}}) |> 
+    summarize(breaks = max({{val}}))
+  
+  df |> 
+    ggplot(aes(date, {{val}}, group = {{group}}, color = {{group}})) +
+    geom_path() +
+    scale_y_continuous(
+      breaks = labs$breaks, 
+      labels = scales::label_comma(),
+      minor_breaks = NULL,
+      guide = guide_axis(position = "right")
+    ) 
+}
+
+df <- tibble(
+  dist1 = sort(rnorm(50, 5, 2)), 
+  dist2 = sort(rnorm(50, 8, 3)),
+  dist4 = sort(rnorm(50, 15, 1)),
+  date = seq.Date(as.Date("2022-01-01"), as.Date("2022-04-10"), by = "2 days")
+)
+df <- pivot_longer(df, cols = -date, names_to = "dist_name", values_to = "value")
+
+fancy_ts(df, value, dist_name)
+
+

+
+
+

Next we’ll discuss two more complicated cases: faceting and automatic labeling.

+
+ +
+

+Faceting

+

Unfortunately, programming with faceting is a special challenge, because faceting was implemented before we understood what tidy evaluation was and how it should work, so you have to learn a new syntax. When programming with facets, instead of writing ~ x, you need to write vars(x), and instead of ~ x + y you need to write vars(x, y). The only advantage of this syntax is that vars() uses tidy evaluation so you can embrace within it:

+
+
# https://twitter.com/sharoz/status/1574376332821204999
+
+foo <- function(x) {
+  ggplot(mtcars, aes(mpg, disp)) +
+    geom_point() +
+    facet_wrap(vars({{ x }}))
+}
+foo(cyl)
+
+

+
+
+

As with data frame functions, it can be useful to make your plotting functions tightly coupled to a specific dataset, or even a specific variable. For example, the following function makes it particularly easy to interactively explore the conditional distribution of carat from the diamonds dataset:

+
+
# https://twitter.com/yutannihilat_en/status/1574387230025875457
+density <- function(colour, facets, binwidth = 0.1) {
+  diamonds |> 
+    ggplot(aes(carat, after_stat(density), colour = {{ colour }})) +
+    geom_freqpoly(binwidth = binwidth) +
+    facet_wrap(vars({{ facets }}))
+}
+
+density()
+density(cut)
+density(cut, clarity)
+
+

+
+
+

+
+
+

+
+
+
+ +
+

+Labeling

+

Remember the histogram function we showed you earlier?

+
+
histogram <- function(df, var, binwidth = NULL) {
+  df |> 
+    ggplot(aes({{ var }})) + 
+    geom_histogram(binwidth = binwidth)
+}
+
+

Wouldn’t it be nice if we could label the output with the variable and the bin width that was used? To do so, we’re going to have to go under the covers of tidy evaluation and use a function from a package we haven’t talked about before: rlang. rlang is a low-level package that’s used by just about every other package in the tidyverse because it implements tidy evaluation (as well as many other useful tools).

+

To solve the labeling problem we can use rlang::englue(). This works similarly to str_glue(), so any value wrapped in { } will be inserted into the string. But it also understands {{ }}, which automatically inserts the appropriate variable name:

+
+
histogram <- function(df, var, binwidth) {
+  label <- rlang::englue("A histogram of {{var}} with binwidth {binwidth}")
+  
+  df |> 
+    ggplot(aes({{ var }})) + 
+    geom_histogram(binwidth = binwidth) + 
+    labs(title = label)
+}
+
+diamonds |> histogram(carat, 0.1)
+
+

+
+
+

You can use the same approach in any other place where you supply a string in a ggplot2 plot.
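For example, here’s a sketch that reuses the trick for the x-axis label instead of the title (histogram2 is just an illustrative name, not a function from the book):

histogram2 <- function(df, var, binwidth) {
  # same englue() trick as above, now building an axis label
  label <- rlang::englue("{{var}} (binwidth = {binwidth})")
  
  df |> 
    ggplot(aes({{ var }})) + 
    geom_histogram(binwidth = binwidth) + 
    labs(x = label)
}

diamonds |> histogram2(carat, 0.1)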

+
+ +
+

+Exercises

+
  1. Build up a rich plotting function by incrementally implementing each of the steps below. +
    1. Draw a scatterplot given dataset and x and y variables.

    2. +
    3. Add a line of best fit (i.e. a linear model with no standard errors).

    4. +
    5. Add a title.

    6. +
  2. +
+
+ +
+

+Style

+

R doesn’t care what your functions or arguments are called, but the names make a big difference for humans. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as RStudio’s autocomplete makes it easy to type long names.

+

Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (i.e. mean() is better than compute_mean()), or accesses some property of an object (i.e. coef() is better than get_coefficients()). Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.

+
+
# Too short
+f()
+
+# Not a verb, or descriptive
+my_awesome_function()
+
+# Long, but clear
+impute_missing()
+collapse_years()
+
+

R also doesn’t care about how you use white space in your functions but future readers will. Continue to follow the rules from the “Workflow: style” chapter. Additionally, function() should always be followed by squiggly brackets ({}), and the contents should be indented by an additional two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

+
+
# missing extra two spaces
+pull_unique <- function(df, var) {
+df |> 
+  distinct({{ var }}) |> 
+  pull({{ var }})
+}
+
+# Pipe indented incorrectly
+pull_unique <- function(df, var) {
+  df |> 
+  distinct({{ var }}) |> 
+  pull({{ var }})
+}
+
+# Missing {} and all one line
+pull_unique <- function(df, var) df |> distinct({{ var }}) |> pull({{ var }})
+
+

As you can see we recommend putting extra spaces inside of {{ }}. This makes it very obvious that something unusual is happening.
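For reference, here’s pull_unique() from earlier in the chapter, styled the way we recommend:

pull_unique <- function(df, var) {
  df |> 
    distinct({{ var }}) |> 
    pull({{ var }})
}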

+ +
+

+Exercises

+
  1. +

    Read the source code for each of the following two functions, puzzle out what they do, and then brainstorm better names.

    +
    +
    f1 <- function(string, prefix) {
    +  substr(string, 1, nchar(prefix)) == prefix
    +}
    +f3 <- function(x, y) {
    +  rep(y, length.out = length(x))
    +}
    +
    +
  2. +
  3. Take a function that you’ve written recently and spend 5 minutes brainstorming a better name for it and its arguments.

  4. +
  5. Make a case for why norm_r(), norm_d() etc. would be better than rnorm(), dnorm(). Make a case for the opposite.

  6. +
+
+ +
+

+Summary

+

In this chapter you learned how to write functions for three useful scenarios: creating a vector, creating a data frame, or creating a plot. Along the way you saw many examples, which hopefully started to get your creative juices flowing, and gave you some ideas for where functions might help your analysis code.

+

We have only shown you the bare minimum to get started with functions and there’s much more to learn. A few places to learn more are:

+

In the next chapter, we’ll dive into some of the details of R’s vector data structures that we’ve omitted so far. These are not immediately useful by themselves, but are a necessary foundation for the following chapter on iteration which gives you further tools for reducing code duplication.

+ + +
+
diff --git a/oreilly/intro.html b/oreilly/intro.html new file mode 100644 index 0000000..60d0dfb --- /dev/null +++ b/oreilly/intro.html @@ -0,0 +1,297 @@ +
+

Introduction

Data science is an exciting discipline that allows you to transform raw data into understanding, insight, and knowledge. The goal of “R for Data Science” is to help you learn the most important tools in R that will allow you to do data science efficiently and reproducibly. After reading this book, you’ll have the tools to tackle a wide variety of data science challenges, using the best parts of R.

+
+

+What you will learn

+

Data science is a huge field, and there’s no way you can master it all by reading a single book. The goal of this book is to give you a solid foundation in the most important tools, and enough knowledge to find the resources to learn more when necessary. Our model of the tools needed in a typical data science project looks something like Figure 1.1.

+
+
+ +

A diagram displaying the data science cycle: Import -> Tidy -> Understand (which has the phases Transform -> Visualize -> Model in a cycle) -> Communicate. Surrounding all of these is Program.

+
Figure 1.1: In our model of the data science process you start with data import and tidying. Next you understand your data with an iterative cycle of transforming, visualizing, and modeling. You finish the process by communicating your results to other humans.
+
+
+
+

First you must import your data into R. This typically means that you take data stored in a file, database, or web application programming interface (API), and load it into a data frame in R. If you can’t get your data into R, you can’t do data science on it!

+

Once you’ve imported your data, it is a good idea to tidy it. Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your efforts on answering questions about the data, not fighting to get the data into the right form for different functions.

+

Once you have tidy data, a common next step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one city, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data in a form that’s natural to work with often feels like a fight!

+

Once you have tidy data with the variables you need, there are two main engines of knowledge generation: visualisation and modelling. These have complementary strengths and weaknesses so any real analysis will iterate between them many times.

+

Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also hint that you’re asking the wrong question, or that you need to collect different data. Visualisations can surprise you and they don’t scale particularly well because they require a human to interpret them.

+

The last step of data science is communication, an absolutely critical part of any data analysis project. It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.

+

Surrounding all these tools is programming. Programming is a cross-cutting tool that you use in nearly every part of a data science project. You don’t need to be an expert programmer to be a successful data scientist, but learning more about programming pays off, because becoming a better programmer allows you to automate common tasks, and solve new problems with greater ease.

+

You’ll use these tools in every data science project, but for most projects they’re not enough. There’s a rough 80-20 rule at play; you can tackle about 80% of every project using the tools that you’ll learn in this book, but you’ll need other tools to tackle the remaining 20%. Throughout this book, we’ll point you to resources where you can learn more.

+
+ +
+

+How this book is organised

+

The previous description of the tools of data science is organised roughly according to the order in which you use them in an analysis (although of course you’ll iterate through them multiple times). In our experience, however, learning data ingest and tidying first is sub-optimal, because 80% of the time it’s routine and boring, and the other 20% of the time it’s weird and frustrating. That’s a bad place to start learning a new subject! Instead, we’ll start with visualisation and transformation of data that’s already been imported and tidied. That way, when you ingest and tidy your own data, your motivation will stay high because you know the pain is worth the effort.

+

Within each chapter, we try and adhere to a similar pattern: start with some motivating examples so you can see the bigger picture, and then dive into the details. Each section of the book is paired with exercises to help you practice what you’ve learned. Although it can be tempting to skip the exercises, there’s no better way to learn than practicing on real problems.

+
+ +
+

+What you won’t learn

+

There are a number of important topics that this book doesn’t cover. We believe it’s important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible. That means this book can’t cover every important topic.

+ +
+

+Modeling

+ +

To learn more about modeling, we highly recommend Tidy Modeling with R (https://www.tmwr.org) by our colleagues Max Kuhn and Julia Silge. This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.

+
+ +
+

+Big data

+

This book proudly focuses on small, in-memory datasets. This is the right place to start because you can’t tackle big data unless you have experience with small data. The tools you learn in this book will easily handle hundreds of megabytes of data, and with a little care, you can typically use them to work with 1-2 Gb of data. If you’re routinely working with larger data (10-100 Gb, say), you should learn more about data.table (https://github.com/Rdatatable/data.table). This book doesn’t teach data.table because it has a very concise interface that offers fewer linguistic cues, which makes it harder to learn. However, if you’re working with large data, the performance payoff is well worth the effort required to learn it.

+

If your data is bigger than this, carefully consider whether your big data problem is actually a small data problem in disguise. While the complete data set might be big, often the data needed to answer a specific question is small. You might be able to find a subset, subsample, or summary that fits in memory and still allows you to answer the question that you’re interested in. The challenge here is finding the right small data, which often requires a lot of iteration.

+

Another possibility is that your big data problem is actually a large number of small data problems in disguise. Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. This would be trivial if you had just 10 or 100 people, but instead you have a million. Fortunately, each problem is independent of the others (a setup that is sometimes called embarrassingly parallel), so you just need a system (like Hadoop (https://hadoop.apache.org/) or Spark (https://spark.apache.org/)) that allows you to send different datasets to different computers for processing. Once you’ve figured out how to answer your question for a single subset using the tools described in this book, you can learn new tools like sparklyr to solve it for the full dataset.

+
+ +
+

+Python, Julia, and friends

+

In this book, you won’t learn anything about Python, Julia, or any other programming language useful for data science. This isn’t because we think these tools are bad. They’re not! And in practice, most data science teams use a mix of languages, often at least R and Python.

+

However, we strongly believe that it’s best to master one tool at a time. You will get better faster if you dive deep, rather than spreading yourself thinly over many topics. This doesn’t mean you should only know one thing, just that you’ll generally learn faster if you stick to one thing at a time. You should strive to learn new things throughout your career, but make sure your understanding is solid before you move on to the next interesting thing.

+

We think R is a great place to start your data science journey because it is an environment designed from the ground up to support data science. R is not just a programming language, it is also an interactive environment for doing data science. To support interaction, R is a much more flexible language than many of its peers. This flexibility comes with its downsides, but the big upside is how easy it is to evolve tailored grammars for specific parts of the data science process. These mini languages help you think about problems as a data scientist, while supporting fluent interaction between your brain and the computer.

+
+
+ +
+

+Prerequisites

+

We’ve made a few assumptions about what you already know in order to get the most out of this book. You should be generally numerically literate, and it’s helpful if you have some programming experience already. If you’ve never programmed before, you might find Hands-On Programming with R (https://rstudio-education.github.io/hopr/) by Garrett to be a useful adjunct to this book.

+

There are four things you need to run the code in this book: R, RStudio, a collection of R packages called the tidyverse, and a handful of other packages. Packages are the fundamental units of reproducible R code. They include reusable functions, the documentation that describes how to use them, and sample data.

+ +
+

+R

+

To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.

+

A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly. Upgrading can be a bit of a hassle, especially for major versions, which require you to re-install all your packages, but putting it off only makes it worse. You’ll need at least R 4.1.0 for this book.

+
+ +
+

+RStudio

+

RStudio is an integrated development environment, or IDE, for R programming. Download and install it from https://www.rstudio.com/download. RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know. It’s a good idea to upgrade regularly so you can take advantage of the latest and greatest features. For this book, make sure you have at least RStudio 2022.02.0.

+

When you start RStudio (Figure 1.2), you’ll see two key regions in the interface: the console pane, and the output pane. For now, all you need to know is that you type R code in the console pane, and press enter to run it. You’ll learn more as we go along!

+
+
+ +

The RStudio IDE with the panes Console and Output highlighted.

+
Figure 1.2: The RStudio IDE has two key regions: type R code in the console pane on the left, and look for plots in the output pane on the right.
+
+
+
+
+ +
+

+The tidyverse

+

You’ll also need to install some R packages. An R package is a collection of functions, data, and documentation that extends the capabilities of base R. Using packages is key to the successful use of R. The majority of the packages that you will learn in this book are part of the so-called tidyverse. All packages in the tidyverse share a common philosophy of data and R programming, and are designed to work together naturally.

+

You can install the complete tidyverse with a single line of code:

+
+
install.packages("tidyverse")
+
+

On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer. If you have problems installing, make sure that you are connected to the internet, and that https://cloud.r-project.org/ isn’t blocked by your firewall or proxy.

+

You will not be able to use the functions, objects, or help files in a package until you load it with library(). Once you have installed a package, you can load it using the library() function:

+
+
library(tidyverse)
+#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
+#> ✔ ggplot2 3.4.0.9000        ✔ purrr   0.9000.0.9000
+#> ✔ tibble  3.1.8             ✔ dplyr   1.0.99.9000  
+#> ✔ tidyr   1.2.1.9001        ✔ stringr 1.4.1.9000   
+#> ✔ readr   2.1.3             ✔ forcats 0.5.2        
+#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
+#> ✖ dplyr::filter() masks stats::filter()
+#> ✖ dplyr::lag()    masks stats::lag()
+
+

This tells you that tidyverse is loading eight packages: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.

+

Packages in the tidyverse change fairly frequently. You can check whether updates are available, and optionally install them, by running tidyverse_update().

+
+ +
+

+Other packages

+

There are many other excellent packages that are not part of the tidyverse, because they solve problems in a different domain, or are designed with a different set of underlying principles. This doesn’t make them better or worse, just different. In other words, the complement to the tidyverse is not the messyverse, but many other universes of interrelated packages. As you tackle more data science projects with R, you’ll learn new packages and new ways of thinking about data.

+

In this book we’ll use three data packages from outside the tidyverse:

+
+
install.packages(c("nycflights13", "gapminder", "Lahman"))
+
+

These packages provide data on airline flights, world development, and baseball that we’ll use to illustrate key data science ideas.

+
+
+ +
+

+Running R code

+

The previous section showed you several examples of running R code. Code in the book looks like this:

+
+
1 + 2
+#> [1] 3
+
+

If you run the same code in your local console, it will look like this:

+
> 1 + 2
+[1] 3
+

There are two main differences. In your console, you type after the >, called the prompt; we don’t show the prompt in the book. In the book, output is commented out with #>; in your console it appears directly after your code. These two differences mean that if you’re working with an electronic version of the book, you can easily copy code out of the book and into the console.

+

Throughout the book, we use a consistent set of conventions to refer to code:

+
+ +
+

+Acknowledgements

+

This book isn’t just the product of Hadley, Mine, and Garrett, but is the result of many conversations (in person and online) that we’ve had with many people in the R community. There are a few people we’d like to thank in particular, because they have spent many hours answering our questions and helping us to better think about data science:

+
  • Jenny Bryan and Lionel Henry for many helpful discussions around working with lists and list-columns.

  • +
  • The three chapters on workflow were adapted (with permission), from https://stat545.com/block002_hello-r-workspace-wd-project.html by Jenny Bryan.

  • +
  • Yihui Xie for his work on the bookdown (https://github.com/rstudio/bookdown) package, and for tirelessly responding to my feature requests.

  • +
  • Bill Behrman for his thoughtful reading of the entire book, and for trying it out with his data science class at Stanford.

  • +
  • The #rstats Twitter community who reviewed all of the draft chapters and provided tons of useful feedback.

  • +

This book was written in the open, and many people contributed pull requests to fix minor problems. Special thanks goes to everyone who contributed via GitHub:

+
+ +
+

A big thank you to all 212 people who contributed specific improvements via GitHub pull requests (in alphabetical order by username): Alex (@ALShum), A. s. (@Adrianzo), @AlanFeder, Antti Rask (@AnttiRask), Oluwafemi OYEDELE (@BB1464), Brian G. Barkley (@BarkleyBG), Bianca Peterson (@BinxiePeterson), Birger Niklas (@BirgerNi), David Clark (@DDClark), @DSGeoff, Edwin Thoen (@EdwinTh), Eric Kitaif (@EricKit), Gerome Meyer (@GeroVanMi), Josh Goldberg (@GoldbergData), Iain (@Iain-S), Jeffrey Stevens (@JeffreyRStevens), 蒋雨蒙 (@JeldorPKU), @MJMarshall, Kara de la Marck (@MarckK), Matt Wittbrodt (@MattWittbrodt), Jakub Nowosad (@Nowosad), Y. Yu (@PursuitOfDataScience), Jajo (@RIngyao), Richard Knight (@RJHKnight), Ranae Dietzel (@Ranae), @ReeceGoding, Robin (@Robinlovelace), Rod Mazloomi (@RodAli), Rohan Alexander (@RohanAlexander), Romero Morais (@RomeroBarata), Shannon Ellis (@ShanEllis), Christian Heinrich (@Shurakai), Steven M. Mortimer (@StevenMMortimer), @a-rosenberg, Tim Becker (@a2800276), Adam Gruer (@adam-gruer), adi pradhan (@adidoit), Andrea Gilardi (@agila5), Ajay Deonarine (@ajay-d), @aleloi, pete (@alonzi), Andrew M. (@amacfarland), Andrew Landgraf (@andland), Angela Li (@angela-li), @ariespirgel, @august-18, Michael Henry (@aviast), Azza Ahmed (@azzaea), Steven Moran (@bambooforest), Mara Averick (@batpigandme), Brent Brewington (@bbrewington), Bill Behrman (@behrman), Ben Herbertson (@benherbertson), Ben Marwick (@benmarwick), Ben Steinberg (@bensteinberg), Benjamin Yeh (@bentyeh), Betul Turkoglu (@betulturkoglu), Brandon Greenwell (@bgreenwell), Brett Klamer (@bklamer), @boardtc, Christian (@c-hoh), Camille V Leonard (@camillevleonard), Christian Mongeau (@chrMongeau), Cooper Morris (@coopermor), Colin Gillespie (@csgillespie), Rademeyer Vermaak (@csrvermaak), Chris Saunders (@ctsa), Abhinav Singh (@curious-abhinav), Curtis Alexander (@curtisalexander), Christian G. Warden (@cwarden), Charlotte Wickham (@cwickham), Kenny Darrell (@darrkj), David Rubinger (@davidrubinger), Derwin McGeary (@derwinmcgeary), Daniel Gromer (@dgromer), @djbirke, Zhuoer Dong (@dongzhuoer), Devin Pastoor (@dpastoor), Julian During (@duju211), Dylan Cashman (@dylancashman), Dirk Eddelbuettel (@eddelbuettel), Ahmed El-Gabbas (@elgabbas), Henry Webel (@enryH), Eric Watt (@ericwatt), Erik Erhardt (@erikerhardt), Etienne B. Racine (@etiennebr), Everett Robinson (@evjrob), @fellennert, Flemming Miguel (@flemmingmiguel), Floris Vanderhaeghe (@florisvdh), @funkybluehen, @gabrivera, Garrick Aden-Buie (@gadenbuie), bahadir cankardes (@gridgrad), Gustav W Delius (@gustavdelius), Hao Chen (@hao-trivago), Harris McGehee (@harrismcgehee), @hendrikweisser, Hengni Cai (@hengnicai), Ian Sealy (@iansealy), Ian Lyttle (@ijlyttle), Ivan Krukov (@ivan-krukov), Jacob Kaplan (@jacobkap), Jazz Weisman (@jazzlw), John Blischak (@jdblischak), John D. 
Storey (@jdstorey), Jeff Boichuk (@jeffboichuk), Gregory Jefferis (@jefferis), Jennifer (Jenny) Bryan (@jennybc), Jen Ren (@jenren), Jeroen Janssens (@jeroenjanssens), Janet Wesner (@jilmun), Jim Hester (@jimhester), JJ Chen (@jjchern), Jacek Kolacz (@jkolacz), Joanne Jang (@joannejang), John Sears (@johnsears), @jonathanflint, Jon Calder (@jonmcalder), Jonathan Page (@jonpage), JooYoung Seo (@jooyoungseo), Justinas Petuchovas (@jpetuchovas), Jordan (@jrdnbradford), Jeffrey Arnold (@jrnold), Jose Roberto Ayala Solares (@jroberayalas), @juandering, Julia Stewart Lowndes (@jules32), Sonja (@kaetschap), Kara Woo (@karawoo), Katrin Leinweber (@katrinleinweber), Karandeep Singh (@kdpsingh), Kevin Perese (@kevinxperese), Kevin Ferris (@kferris10), Kirill Sevastyanenko (@kirillseva), @koalabearski, Kirill Müller (@krlmlr), Rafał Kucharski (@kucharsky), Noah Landesberg (@landesbergn), Lawrence Wu (@lawwu), @lindbrook, Luke W Johnston (@lwjohnst86), Kunal Marwaha (@marwahaha), Matan Hakim (@matanhakim), Mauro Lepore (@maurolepore), Mark Beveridge (@mbeveridge), @mcewenkhundi, Matt Herman (@mfherman), Michael Boerman (@michaelboerman), Mitsuo Shiota (@mitsuoxv), Matthew Hendrickson (@mjhendrickson), Mohammed Hamdy (@mmhamdy), Maxim Nazarov (@mnazarov), Maria Paula Caldas (@mpaulacaldas), Mustafa Ascha (@mustafaascha), Nelson Areal (@nareal), Nate Olson (@nate-d-olson), Nathanael (@nateaff), @nattalides, Nick Clark (@nickclark1000), @nickelas, Nirmal Patel (@nirmalpatel), Nischal Shrestha (@nischalshrestha), Nicholas Tierney (@njtierney), @olivier6088, Pablo E. Garcia (@pabloedug), Paul Adamson (@padamson), Peter Hurford (@peterhurford), Patrick Kennedy (@pkq), Pooya Taherkhani (@pooyataher), Radu Grosu (@radugrosu), Rayna M Harris (@raynamharris), Robin Gertenbach (@rgertenbach), Riva Quiroga (@rivaquiroga), Richard Zijdeman (@rlzijdeman), @robertchu03, Emily Robinson (@robinsones), Rob Tenorio (@robtenorio), Albert Y. Kim (@rudeboybert), Saghir (@saghirb), Hojjat Salmasian (@salmasian), Jonas (@sauercrowd), Vebash Naidoo (@sciencificity), Seamus McKinsey (@seamus-mckinsey), @seanpwilliams, Luke Smith (@seasmith), Matthew Sedaghatfar (@sedaghatfar), Sebastian Kraus (@sekR4), Sam Firke (@sfirke), @shoili, S’busiso Mkhondwane (@sibusiso16), Jakob Krigovsky (@sonicdoe), Stéphane Guillou (@stragu), Sergiusz Bleja (@svenski), Tal Galili (@talgalili), Tim Broderick (@timbroderick), Tim Waterhouse (@timwaterhouse), TJ Mahr (@tjmahr), Thomas Klebel (@tklebel), Tom Prior (@tomjamesprior), Terence Teo (@tteo), @twgardner2, Ulrik Lyngs (@ulyngs), Martin Van der Linden (@vanderlindenma), Walter Somerville (@waltersom), Will Beasley (@wibeasley), Yihui Xie (@yihui), Yiming (Paul) Li (@yimingli), Hiroaki Yutani (@yutannihilation), Yu Yu Aung (@yuyu-aung), Zach Bogart (@zachbogart), @zeal626, Zeki Akyol (@zekiakyol).

+
+ +
+

+Colophon

+

An online version of this book is available at https://r4ds.hadley.nz. It will continue to evolve in between reprints of the physical book. The source of the book is available at https://github.com/hadley/r4ds. The book is powered by Quarto (https://quarto.org), which makes it easy to write books that combine text and executable code.

+

This book was built with:

+
+
sessioninfo::session_info(c("tidyverse"))
+#> ─ Session info ───────────────────────────────────────────────────────────────
+#>  setting  value
+#>  version  R version 4.2.1 (2022-06-23)
+#>  os       macOS Ventura 13.0.1
+#>  system   aarch64, darwin20
+#>  ui       X11
+#>  language (EN)
+#>  collate  en_US.UTF-8
+#>  ctype    en_US.UTF-8
+#>  tz       America/Chicago
+#>  date     2022-11-18
+#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
+#> 
+#> ─ Packages ───────────────────────────────────────────────────────────────────
+#>  package       * version       date (UTC) lib source
+#>  askpass         1.1           2019-01-13 [1] CRAN (R 4.2.0)
+#>  assertthat      0.2.1         2019-03-21 [1] CRAN (R 4.2.0)
+#>  backports       1.4.1         2021-12-13 [1] CRAN (R 4.2.0)
+#>  base64enc       0.1-3         2015-07-28 [1] CRAN (R 4.2.0)
+#>  bit             4.0.4         2020-08-04 [1] CRAN (R 4.2.0)
+#>  bit64           4.0.5         2020-08-30 [1] CRAN (R 4.2.0)
+#>  blob            1.2.3         2022-04-10 [1] CRAN (R 4.2.0)
+#>  broom           1.0.1         2022-08-29 [1] CRAN (R 4.2.0)
+#>  bslib           0.4.1         2022-11-02 [1] CRAN (R 4.2.0)
+#>  cachem          1.0.6         2021-08-19 [1] CRAN (R 4.2.0)
+#>  callr           3.7.3         2022-11-02 [1] CRAN (R 4.2.1)
+#>  cellranger      1.1.0         2016-07-27 [1] CRAN (R 4.2.0)
+#>  cli             3.4.1         2022-09-23 [1] CRAN (R 4.2.1)
+#>  clipr           0.8.0         2022-02-22 [1] CRAN (R 4.2.0)
+#>  colorspace      2.0-3         2022-02-21 [1] CRAN (R 4.2.0)
+#>  cpp11           0.4.3         2022-10-12 [1] CRAN (R 4.2.0)
+#>  crayon          1.5.2         2022-09-29 [1] CRAN (R 4.2.0)
+#>  curl            4.3.3         2022-10-06 [1] CRAN (R 4.2.0)
+#>  data.table      1.14.4        2022-10-17 [1] CRAN (R 4.2.1)
+#>  DBI             1.1.3         2022-06-18 [1] CRAN (R 4.2.0)
+#>  dbplyr          2.2.1.9000    2022-11-03 [1] Github (tidyverse/dbplyr@f7b5596)
+#>  digest          0.6.30        2022-10-18 [1] CRAN (R 4.2.0)
+#>  dplyr         * 1.0.99.9000   2022-11-17 [1] local
+#>  dtplyr          1.2.2         2022-08-20 [1] CRAN (R 4.2.0)
+#>  ellipsis        0.3.2         2021-04-29 [1] CRAN (R 4.2.0)
+#>  evaluate        0.18          2022-11-07 [1] CRAN (R 4.2.1)
+#>  fansi           1.0.3         2022-03-24 [1] CRAN (R 4.2.0)
+#>  farver          2.1.1         2022-07-06 [1] CRAN (R 4.2.0)
+#>  fastmap         1.1.0         2021-01-25 [1] CRAN (R 4.2.0)
+#>  forcats       * 0.5.2         2022-08-19 [1] CRAN (R 4.2.0)
+#>  fs              1.5.2         2021-12-08 [1] CRAN (R 4.2.0)
+#>  gargle          1.2.1.9000    2022-10-27 [1] Github (r-lib/gargle@69d3f28)
+#>  generics        0.1.3         2022-07-05 [1] CRAN (R 4.2.0)
+#>  ggplot2       * 3.4.0.9000    2022-11-10 [1] Github (tidyverse/ggplot2@4fea51b)
+#>  glue            1.6.2         2022-02-24 [1] CRAN (R 4.2.0)
+#>  googledrive     2.0.0         2021-07-08 [1] CRAN (R 4.2.0)
+#>  googlesheets4   1.0.1         2022-08-13 [1] CRAN (R 4.2.0)
+#>  gtable          0.3.1.9000    2022-09-25 [1] local
+#>  haven           2.5.1         2022-08-22 [1] CRAN (R 4.2.0)
+#>  highr           0.9           2021-04-16 [1] CRAN (R 4.2.0)
+#>  hms             1.1.2         2022-08-19 [1] CRAN (R 4.2.0)
+#>  htmltools       0.5.3         2022-07-18 [1] CRAN (R 4.2.0)
+#>  httr            1.4.4         2022-08-17 [1] CRAN (R 4.2.0)
+#>  ids             1.0.1         2017-05-31 [1] CRAN (R 4.2.0)
+#>  isoband         0.2.6         2022-10-06 [1] CRAN (R 4.2.0)
+#>  jquerylib       0.1.4         2021-04-26 [1] CRAN (R 4.2.0)
+#>  jsonlite        1.8.3         2022-10-21 [1] CRAN (R 4.2.1)
+#>  knitr           1.40          2022-08-24 [1] CRAN (R 4.2.0)
+#>  labeling        0.4.2         2020-10-20 [1] CRAN (R 4.2.0)
+#>  lattice         0.20-45       2021-09-22 [2] CRAN (R 4.2.1)
+#>  lifecycle       1.0.3.9000    2022-10-10 [1] Github (r-lib/lifecycle@80a1e52)
+#>  lubridate       1.9.0         2022-11-06 [1] CRAN (R 4.2.1)
+#>  magrittr        2.0.3         2022-03-30 [1] CRAN (R 4.2.0)
+#>  MASS            7.3-58.1      2022-08-03 [1] CRAN (R 4.2.0)
+#>  Matrix          1.5-1         2022-09-13 [1] CRAN (R 4.2.0)
+#>  memoise         2.0.1         2021-11-26 [1] CRAN (R 4.2.0)
+#>  mgcv            1.8-41        2022-10-21 [1] CRAN (R 4.2.0)
+#>  mime            0.12          2021-09-28 [1] CRAN (R 4.2.0)
+#>  modelr          0.1.9         2022-08-19 [1] CRAN (R 4.2.0)
+#>  munsell         0.5.0         2018-06-12 [1] CRAN (R 4.2.0)
+#>  nlme            3.1-160       2022-10-10 [1] CRAN (R 4.2.0)
+#>  openssl         2.0.4         2022-10-17 [1] CRAN (R 4.2.1)
+#>  pillar          1.8.1         2022-08-19 [1] CRAN (R 4.2.0)
+#>  pkgconfig       2.0.3         2019-09-22 [1] CRAN (R 4.2.0)
+#>  prettyunits     1.1.1         2020-01-24 [1] CRAN (R 4.2.0)
+#>  processx        3.8.0         2022-10-26 [1] CRAN (R 4.2.1)
+#>  progress        1.2.2         2019-05-16 [1] CRAN (R 4.2.0)
+#>  ps              1.7.2         2022-10-26 [1] CRAN (R 4.2.1)
+#>  purrr         * 0.9000.0.9000 2022-11-10 [1] Github (tidyverse/purrr@aaaa58a)
+#>  R6              2.5.1         2021-08-19 [1] CRAN (R 4.2.0)
+#>  rappdirs        0.3.3         2021-01-31 [1] CRAN (R 4.2.0)
+#>  RColorBrewer    1.1-3         2022-04-03 [1] CRAN (R 4.2.0)
+#>  readr         * 2.1.3         2022-10-01 [1] CRAN (R 4.2.1)
+#>  readxl          1.4.1         2022-08-17 [1] CRAN (R 4.2.0)
+#>  rematch         1.0.1         2016-04-21 [1] CRAN (R 4.2.0)
+#>  rematch2        2.1.2         2020-05-01 [1] CRAN (R 4.2.0)
+#>  reprex          2.0.2         2022-08-17 [1] CRAN (R 4.2.0)
+#>  rlang           1.0.6         2022-09-24 [1] CRAN (R 4.2.0)
+#>  rmarkdown       2.18          2022-11-09 [1] CRAN (R 4.2.1)
+#>  rstudioapi      0.14          2022-08-22 [1] CRAN (R 4.2.0)
+#>  rvest           1.0.3         2022-08-19 [1] CRAN (R 4.2.0)
+#>  sass            0.4.2         2022-07-16 [1] CRAN (R 4.2.0)
+#>  scales          1.2.1         2022-08-20 [1] CRAN (R 4.2.0)
+#>  selectr         0.4-2         2019-11-20 [1] CRAN (R 4.2.0)
+#>  stringi         1.7.8         2022-07-11 [1] CRAN (R 4.2.0)
+#>  stringr       * 1.4.1.9000    2022-11-10 [1] Github (tidyverse/stringr@ebf3823)
+#>  sys             3.4.1         2022-10-18 [1] CRAN (R 4.2.0)
+#>  tibble        * 3.1.8         2022-07-22 [1] CRAN (R 4.2.0)
+#>  tidyr         * 1.2.1.9001    2022-11-05 [1] Github (tidyverse/tidyr@9174795)
+#>  tidyselect      1.2.0         2022-10-10 [1] CRAN (R 4.2.1)
+#>  tidyverse     * 1.3.2         2022-07-18 [1] CRAN (R 4.2.0)
+#>  timechange      0.1.1         2022-11-04 [1] CRAN (R 4.2.1)
+#>  tinytex         0.42          2022-09-27 [1] CRAN (R 4.2.1)
+#>  tzdb            0.3.0         2022-03-28 [1] CRAN (R 4.2.0)
+#>  utf8            1.2.2         2021-07-24 [1] CRAN (R 4.2.0)
+#>  uuid            1.1-0         2022-04-19 [1] CRAN (R 4.2.0)
+#>  vctrs           0.5.0         2022-10-22 [1] CRAN (R 4.2.0)
+#>  viridisLite     0.4.1         2022-08-22 [1] CRAN (R 4.2.0)
+#>  vroom           1.6.0         2022-09-30 [1] CRAN (R 4.2.0)
+#>  withr           2.5.0         2022-03-03 [1] CRAN (R 4.2.0)
+#>  xfun            0.34          2022-10-18 [1] CRAN (R 4.2.1)
+#>  xml2            1.3.3         2021-11-30 [1] CRAN (R 4.2.0)
+#>  yaml            2.3.6         2022-10-18 [1] CRAN (R 4.2.0)
+#> 
+#>  [1] /Users/hadleywickham/Library/R/arm64/4.2/library
+#>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
+#> 
+#> ──────────────────────────────────────────────────────────────────────────────
+cli:::ruler()
+#> ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
+#> 12345678901234567890123456789012345678901234567890123456789012345678901234567890
+
+ + +
+
diff --git a/oreilly/iteration.html b/oreilly/iteration.html new file mode 100644 index 0000000..2907946 --- /dev/null +++ b/oreilly/iteration.html @@ -0,0 +1,1092 @@ +
+

Iteration

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

In this chapter, you’ll learn tools for iteration, repeatedly performing the same action on different objects. Iteration in R generally looks rather different than it does in other programming languages because so much of it is implicit and we get it for free. For example, if you want to double a numeric vector x in R, you can just write 2 * x. In most other languages, you’d need to explicitly double each element of x using some sort of for loop.

+
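For example, here’s that implicit, vectorized iteration in action:

+
+
x <- c(1, 5, 10)
+2 * x
+#> [1]  2 10 20
+
+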

This book has already given you a small number of powerful tools that perform the same action for multiple “things”:

  • facet_wrap() and facet_grid() draw one plot for each subset.
  • group_by() plus summarise() compute summary statistics for each subset.
  • unnest_wider() and unnest_longer() create new rows and columns for each element of a list-column.

+

Now it’s time to learn some more general tools, often called functional programming tools because they are built around functions that take other functions as inputs. Learning functional programming can easily veer into the abstract, but in this chapter we’ll keep things concrete by focusing on three common tasks: modifying multiple columns, reading multiple files, and saving multiple objects.

+ +
+

+Prerequisites

+
+
+ +
+ +
+ +

This chapter relies on features only found in purrr 1.0.0 and dplyr 1.1.0, which are still in development. If you want to live life on the edge you can get the dev versions with devtools::install_github(c("tidyverse/purrr", "tidyverse/dplyr")).

+ +

In this chapter, we’ll focus on tools provided by dplyr and purrr, both core members of the tidyverse. You’ve seen dplyr before, but purrr (http://purrr.tidyverse.org/) is new. We’re going to use just a couple of purrr functions in this chapter, but it’s a great package to explore as you improve your programming skills.

+
+
library(tidyverse)
+
+
+
+ +
+

+Modifying multiple columns

+

Imagine you have this simple tibble and you want to count the number of observations and compute the median of every column.

+
+
df <- tibble(
+  a = rnorm(10),
+  b = rnorm(10),
+  c = rnorm(10),
+  d = rnorm(10)
+)
+
+

You could do it with copy-and-paste:

+
+
df |> summarise(
+  n = n(),
+  a = median(a),
+  b = median(b),
+  c = median(c),
+  d = median(d),
+)
+#> # A tibble: 1 × 5
+#>       n      a      b       c     d
+#>   <int>  <dbl>  <dbl>   <dbl> <dbl>
+#> 1    10 -0.246 -0.287 -0.0567 0.144
+
+

That breaks our rule of thumb to never copy and paste more than twice, and you can imagine that this will get very tedious if you have tens or even hundreds of columns. Instead you can use across():

+
+
df |> summarise(
+  n = n(),
+  across(a:d, median),
+)
+#> # A tibble: 1 × 5
+#>       n      a      b       c     d
+#>   <int>  <dbl>  <dbl>   <dbl> <dbl>
+#> 1    10 -0.246 -0.287 -0.0567 0.144
+
+

across() has three particularly important arguments, which we’ll discuss in detail in the following sections. You’ll use the first two every time you use across(): the first argument, .cols, specifies which columns you want to iterate over, and the second argument, .fns, specifies what to do with each column. You can use the .names argument when you need additional control over the names of output columns, which is particularly important when you use across() with mutate(). We’ll also discuss two important variations, if_any() and if_all(), which work with filter().

+ +
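As a quick preview (a sketch using the df tibble from above), here’s what a call looks like with all three arguments spelled out:

+
+
df |> summarise(across(.cols = a:d, .fns = median, .names = "median_{.col}"))
+
+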
+

+Selecting columns with .cols +

+

The first argument to across(), .cols, selects the columns to transform. This uses the same specifications as select(), #sec-select, so you can use functions like starts_with() and ends_with() to select columns based on their name.

+

There are two additional selection techniques that are particularly useful for across(): everything() and where(). everything() is straightforward: it selects every (non-grouping) column:

+
+
df <- tibble(
+  grp = sample(2, 10, replace = TRUE),
+  a = rnorm(10),
+  b = rnorm(10),
+  c = rnorm(10),
+  d = rnorm(10)
+)
+
+df |> 
+  group_by(grp) |> 
+  summarise(across(everything(), median))
+#> # A tibble: 2 × 5
+#>     grp       a       b     c     d
+#>   <int>   <dbl>   <dbl> <dbl> <dbl>
+#> 1     1 -0.0935 -0.0163 0.363 0.364
+#> 2     2  0.312  -0.0576 0.208 0.565
+
+

Note grouping columns (grp here) are not included in across(), because they’re automatically preserved by summarise().

+

where() allows you to select columns based on their type:

+
  • where(is.numeric) selects all numeric columns.
  • where(is.character) selects all string columns.
  • where(is.Date) selects all date columns.
  • where(is.POSIXct) selects all date-time columns.
  • where(is.logical) selects all logical columns.
+
df_types <- tibble(
+  x1 = 1:3,
+  x2 = runif(3),
+  y1 = sample(letters, 3),
+  y2 = c("banana", "apple", "egg")
+)
+
+df_types |> 
+  summarise(across(where(is.numeric), mean))
+#> # A tibble: 1 × 2
+#>      x1    x2
+#>   <dbl> <dbl>
+#> 1     2 0.370
+
+df_types |> 
+  summarise(across(where(is.character), str_flatten))
+#> # A tibble: 1 × 2
+#>   y1    y2            
+#>   <chr> <chr>         
+#> 1 kjh   bananaappleegg
+
+

Just like other selectors, you can combine these with Boolean algebra. For example, !where(is.numeric) selects all non-numeric columns and starts_with("a") & where(is.logical) selects all logical columns whose name starts with “a”.

+
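For example, a quick sketch reusing the df_types tibble from above; the ! inverts a selection, so this flattens only the non-numeric columns:

+
+
df_types |> 
+  summarise(across(!where(is.numeric), str_flatten))
+
+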
+ +
+

+Calling a single function

+

The second argument to across() defines how each column will be transformed. In simple cases, as above, this will be a single existing function. This is a pretty special feature of R: we’re passing one function (median, mean, str_flatten, …) to another function (across()). This is one of the features that makes R a functional programming language.

+
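You can see the same idea outside of across() with a toy helper (twice(), invented here for illustration): the function arrives as an ordinary argument and is called later:

+
+
twice <- function(f, x) f(f(x))  # f is a function passed as a value
+twice(sqrt, 16)
+#> [1] 2
+
+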

It’s important to note that we’re passing this function to across(), so across() can call it; we’re not calling it ourselves. That means the function name should never be followed by (). If you forget, you’ll get an error:

+
+
df |> 
+  group_by(grp) |> 
+  summarise(across(everything(), median()))
+#> Error in vapply(.x, .f, .mold, ..., USE.NAMES = FALSE): values must be length 1,
+#>  but FUN(X[[1]]) result is length 0
+
+

This error arises because you’re calling the function with no input, e.g.:

+
+
median()
+#> Error in is.factor(x): argument "x" is missing, with no default
+
+
+ +
+

+Calling multiple functions

+

In more complex cases, you might want to supply additional arguments or perform multiple transformations. Let’s motivate this problem with a simple example: what happens if we have some missing values in our data? median() propagates those missing values, giving us a suboptimal output:

+
+
rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
+  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
+}
+
+df_miss <- tibble(
+  a = rnorm_na(5, 1),
+  b = rnorm_na(5, 1),
+  c = rnorm_na(5, 2),
+  d = rnorm(5)
+)
+df_miss |> 
+  summarise(
+    across(a:d, median),
+    n = n()
+  )
+#> # A tibble: 1 × 5
+#>       a     b     c     d     n
+#>   <dbl> <dbl> <dbl> <dbl> <int>
+#> 1    NA    NA    NA 0.704     5
+
+

It would be nice if we could pass along na.rm = TRUE to median() to remove these missing values. To do so, instead of calling median() directly, we need to create a new function that calls median() with the desired arguments:

+
+
df_miss |> 
+  summarise(
+    across(a:d, function(x) median(x, na.rm = TRUE)),
+    n = n()
+  )
+#> # A tibble: 1 × 5
+#>       a      b      c     d     n
+#>   <dbl>  <dbl>  <dbl> <dbl> <int>
+#> 1 0.429 -0.721 -0.796 0.704     5
+
+

This is a little verbose, so R comes with a handy shortcut: for this sort of throw away, or anonymous, function you can replace function with \. (Anonymous, because we never explicitly gave it a name with <-; another term programmers use for this is “lambda function”. In older code you might see syntax that looks like ~ .x + 1. This is another way to write anonymous functions, but it only works inside tidyverse functions and always uses the variable name .x; we now recommend the base syntax, \(x) x + 1.)

+
+
df_miss |> 
+  summarise(
+    across(a:d, \(x) median(x, na.rm = TRUE)),
+    n = n()
+  )
+
+
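For comparison, here’s the same summary written with the older purrr-style lambda mentioned above; it computes exactly the same result:

+
+
df_miss |> 
+  summarise(
+    across(a:d, ~ median(.x, na.rm = TRUE)),
+    n = n()
+  )
+
+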

In all of these cases, across() effectively expands to the following code:

+
+
df_miss |> 
+  summarise(
+    a = median(a, na.rm = TRUE),
+    b = median(b, na.rm = TRUE),
+    c = median(c, na.rm = TRUE),
+    d = median(d, na.rm = TRUE),
+    n = n()
+  )
+
+

When we remove the missing values from the median(), it would be nice to know just how many values we were removing. We can find that out by supplying two functions to across(): one to compute the median and the other to count the missing values. You can supply multiple functions by passing a named list to .fns:

+
+
df_miss |> 
+  summarise(
+    across(a:d, list(
+      median = \(x) median(x, na.rm = TRUE),
+      n_miss = \(x) sum(is.na(x))
+    )),
+    n = n()
+  )
+#> # A tibble: 1 × 9
+#>   a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss     n
+#>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
+#> 1    0.429        1   -0.721        1   -0.796        2    0.704        0     5
+
+

If you look carefully, you might intuit that the columns are named using a glue specification (#sec-glue) like {.col}_{.fn}, where .col is the name of the original column and .fn is the name of the function. That’s not a coincidence! As you’ll learn in the next section, you can use the .names argument to supply your own glue spec.

+
+ +
+

+Column names

+

The result of across() is named according to the specification provided in the .names argument. We could specify our own if we wanted the name of the function to come first. (You can’t currently change the order of the columns, but you could reorder them after the fact using relocate() or similar.)

+
+
df_miss |> 
+  summarise(
+    across(
+      a:d,
+      list(
+        median = \(x) median(x, na.rm = TRUE),
+        n_miss = \(x) sum(is.na(x))
+      ),
+      .names = "{.fn}_{.col}"
+    ),
+    n = n(),
+  )
+#> # A tibble: 1 × 9
+#>   median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d     n
+#>      <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
+#> 1    0.429        1   -0.721        1   -0.796        2    0.704        0     5
+
+

The .names argument is particularly important when you use across() with mutate(). By default the output of across() is given the same names as the inputs. This means that across() inside of mutate() will replace existing columns. For example, here we use coalesce() to replace NAs with 0:

+
+
df_miss |> 
+  mutate(
+    across(a:d, \(x) coalesce(x, 0))
+  )
+#> # A tibble: 5 × 4
+#>        a      b      c      d
+#>    <dbl>  <dbl>  <dbl>  <dbl>
+#> 1  0     -0.463  0      2.13 
+#> 2 -0.382 -0.980  0      0.704
+#> 3  0.434  0     -1.06   0.715
+#> 4  1.06   1.21  -0.796 -1.09 
+#> 5  0.424 -1.28  -0.785  0.402
+
+

If you’d like to instead create new columns, you can use the .names argument to give the output new names:

+
+
df_miss |> 
+  mutate(
+    across(a:d, \(x) abs(x), .names = "{.col}_abs")
+  )
+#> # A tibble: 5 × 8
+#>        a      b      c      d  a_abs  b_abs  c_abs d_abs
+#>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
+#> 1 NA     -0.463 NA      2.13  NA      0.463 NA     2.13 
+#> 2 -0.382 -0.980 NA      0.704  0.382  0.980 NA     0.704
+#> 3  0.434 NA     -1.06   0.715  0.434 NA      1.06  0.715
+#> 4  1.06   1.21  -0.796 -1.09   1.06   1.21   0.796 1.09 
+#> 5  0.424 -1.28  -0.785  0.402  0.424  1.28   0.785 0.402
+
+
+ +
+

+Filtering

+

across() is a great match for summarise() and mutate() but it’s more awkward to use with filter(), because you usually combine multiple conditions with either | or &. It’s clear that across() can help to create multiple logical columns, but then what? So dplyr provides two variants of across() called if_any() and if_all():

+
+
df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
+#> # A tibble: 3 × 4
+#>        a      b     c     d
+#>    <dbl>  <dbl> <dbl> <dbl>
+#> 1 NA     -0.463 NA    2.13 
+#> 2 -0.382 -0.980 NA    0.704
+#> 3  0.434 NA     -1.06 0.715
+# same as:
+df_miss |> filter(if_any(a:d, is.na))
+#> # A tibble: 3 × 4
+#>        a      b     c     d
+#>    <dbl>  <dbl> <dbl> <dbl>
+#> 1 NA     -0.463 NA    2.13 
+#> 2 -0.382 -0.980 NA    0.704
+#> 3  0.434 NA     -1.06 0.715
+
+df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
+#> # A tibble: 0 × 4
+#> # … with 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>
+# same as:
+df_miss |> filter(if_all(a:d, is.na))
+#> # A tibble: 0 × 4
+#> # … with 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>
+
+
+ +
+

+across() in functions

+

across() is particularly useful to program with because it allows you to operate on multiple columns. For example, Jacob Scott (https://twitter.com/_wurli/status/1571836746899283969) uses this little helper, which wraps a bunch of lubridate functions to expand all date columns into year, month, and day columns:

+
+
library(lubridate)
+#> Loading required package: timechange
+#> 
+#> Attaching package: 'lubridate'
+#> The following objects are masked from 'package:base':
+#> 
+#>     date, intersect, setdiff, union
+
+expand_dates <- function(df) {
+  df |> 
+    mutate(
+      across(where(is.Date), list(year = year, month = month, day = mday))
+    )
+}
+
+df_date <- tibble(
+  name = c("Amy", "Bob"),
+  date = ymd(c("2009-08-03", "2010-01-16"))
+)
+
+df_date |> 
+  expand_dates()
+#> # A tibble: 2 × 5
+#>   name  date       date_year date_month date_day
+#>   <chr> <date>         <dbl>      <dbl>    <int>
+#> 1 Amy   2009-08-03      2009          8        3
+#> 2 Bob   2010-01-16      2010          1       16
+
+

across() also makes it easy to supply multiple columns in a single argument because the first argument uses tidy-select; you just need to remember to embrace that argument, as we discussed in #sec-embracing. For example, this function will compute the means of numeric columns by default. But by supplying the second argument you can choose to summarize just selected columns:

+
+
summarise_means <- function(df, summary_vars = where(is.numeric)) {
+  df |> 
+    summarise(
+      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
+      n = n()
+    )
+}
+diamonds |> 
+  group_by(clarity) |> 
+  summarise_means()
+#> # A tibble: 8 × 9
+#>   clarity carat depth table price     x     y     z     n
+#>   <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
+#> 1 I1      1.28   62.7  58.3 3924.  6.76  6.71  4.21   741
+#> 2 SI2     1.08   61.8  57.9 5063.  6.40  6.40  3.95  9194
+#> 3 SI1     0.850  61.9  57.7 3996.  5.89  5.89  3.64 13065
+#> 4 VS2     0.764  61.7  57.4 3925.  5.66  5.66  3.49 12258
+#> 5 VS1     0.727  61.7  57.3 3839.  5.57  5.58  3.44  8171
+#> 6 VVS2    0.596  61.7  57.0 3284.  5.22  5.23  3.22  5066
+#> # … with 2 more rows
+
+diamonds |> 
+  group_by(clarity) |> 
+  summarise_means(c(carat, x:z))
+#> # A tibble: 8 × 6
+#>   clarity carat     x     y     z     n
+#>   <ord>   <dbl> <dbl> <dbl> <dbl> <int>
+#> 1 I1      1.28   6.76  6.71  4.21   741
+#> 2 SI2     1.08   6.40  6.40  3.95  9194
+#> 3 SI1     0.850  5.89  5.89  3.64 13065
+#> 4 VS2     0.764  5.66  5.66  3.49 12258
+#> 5 VS1     0.727  5.57  5.58  3.44  8171
+#> 6 VVS2    0.596  5.22  5.23  3.22  5066
+#> # … with 2 more rows
+
+
+ +
+

+Vs. pivot_longer() +

+

Before we go on, it’s worth pointing out an interesting connection between across() and pivot_longer() (#sec-pivoting). In many cases, you can perform the same calculations by first pivoting the data and then performing the operations by group rather than by column. For example, take this multi-function summary:

+
+
df |> 
+  summarise(across(a:d, list(median = median, mean = mean)))
+#> # A tibble: 1 × 8
+#>   a_median a_mean b_median b_mean c_median c_mean d_median d_mean
+#>      <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
+#> 1   0.0380  0.205  -0.0163 0.0910    0.260 0.0716    0.540  0.508
+
+

We could compute the same values by pivoting longer and then summarizing:

+
+
long <- df |> 
+  pivot_longer(a:d) |> 
+  group_by(name) |> 
+  summarise(
+    median = median(value),
+    mean = mean(value)
+  )
+long
+#> # A tibble: 4 × 3
+#>   name   median   mean
+#>   <chr>   <dbl>  <dbl>
+#> 1 a      0.0380 0.205 
+#> 2 b     -0.0163 0.0910
+#> 3 c      0.260  0.0716
+#> 4 d      0.540  0.508
+
+

And if you wanted the same structure as across(), you could pivot again:

+
+
long |> 
+  pivot_wider(
+    names_from = name,
+    values_from = c(median, mean),
+    names_vary = "slowest",
+    names_glue = "{name}_{.value}"
+  )
+#> # A tibble: 1 × 8
+#>   a_median a_mean b_median b_mean c_median c_mean d_median d_mean
+#>      <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>    <dbl>  <dbl>
+#> 1   0.0380  0.205  -0.0163 0.0910    0.260 0.0716    0.540  0.508
+
+

This is a useful technique to know about because sometimes you’ll hit a problem that’s not currently possible to solve with across(): when you have groups of columns that you want to compute with simultaneously. For example, imagine that our data frame contains both values and weights and we want to compute a weighted mean:

+
+
df_paired <- tibble(
+  a_val = rnorm(10),
+  a_wts = runif(10),
+  b_val = rnorm(10),
+  b_wts = runif(10),
+  c_val = rnorm(10),
+  c_wts = runif(10),
+  d_val = rnorm(10),
+  d_wts = runif(10)
+)
+
+

There’s currently no way to do this with across() (maybe there will be one day, but currently we don’t see how), but it’s relatively straightforward with pivot_longer():

+
+
df_long <- df_paired |> 
+  pivot_longer(
+    everything(), 
+    names_to = c("group", ".value"), 
+    names_sep = "_"
+  )
+df_long
+#> # A tibble: 40 × 3
+#>   group    val   wts
+#>   <chr>  <dbl> <dbl>
+#> 1 a      0.404 0.678
+#> 2 b      1.74  0.650
+#> 3 c     -0.921 0.261
+#> 4 d     -0.953 0.327
+#> 5 a      2.04  0.665
+#> 6 b     -1.64  0.815
+#> # … with 34 more rows
+
+df_long |> 
+  group_by(group) |> 
+  summarise(mean = weighted.mean(val, wts))
+#> # A tibble: 4 × 2
+#>   group    mean
+#>   <chr>   <dbl>
+#> 1 a      0.109 
+#> 2 b      0.585 
+#> 3 c     -0.746 
+#> 4 d     -0.0142
+
+

If needed, you could pivot_wider() this back to the original form.

+
+ +
+

+Exercises

+
  1. Compute the number of unique values in each column of palmerpenguins::penguins.

  2. Compute the mean of every column in mtcars.

  3. Group diamonds by cut, clarity, and color then count the number of observations and compute the mean of each numeric column.

  4. What happens if you use a list of functions, but don’t name them? How is the output named?

  5. It is possible to use across() inside filter() where it’s equivalent to if_any(). Can you explain why?

  6. Adjust expand_dates() to automatically remove the date columns after they’ve been expanded. Do you need to embrace any arguments?

  7. Explain what each step of the pipeline in this function does. What special feature of where() are we taking advantage of?

    show_missing <- function(df, group_vars, summary_vars = everything()) {
    +  df |> 
    +    group_by(pick({{ group_vars }})) |> 
    +    summarise(
    +      across({{ summary_vars }}, \(x) sum(is.na(x))),
    +      .groups = "drop"
    +    ) |>
    +    select(where(\(x) any(x > 0)))
    +}
    +nycflights13::flights |> show_missing(c(year, month, day))
+
+ +
+

+Reading multiple files

+

In the previous section, you learned how to use across() to repeat a transformation on multiple columns. In this section, you’ll learn how to use map() to do something to every file in a directory. Let’s start with a little motivation: imagine you have a directory full of excel spreadsheets you want to read. (If you instead had a directory of csv files with the same format, you could use the technique from #sec-readr-directory.) You could do it with copy and paste:

+
+
data2019 <- readxl::read_excel("data/y2019.xlsx")
+data2020 <- readxl::read_excel("data/y2020.xlsx")
+data2021 <- readxl::read_excel("data/y2021.xlsx")
+data2022 <- readxl::read_excel("data/y2022.xlsx")
+
+

And then use bind_rows() to combine them all together:

+
+
data <- bind_rows(data2019, data2020, data2021, data2022)
+
+

You can imagine that this would get tedious quickly, especially if you had hundreds of files, not just four. The following sections show you how to automate this sort of task. There are three basic steps: use list.files() to list all the files in a directory, then use map() to read each of them into a list, and then use list_rbind() to combine them into a single data frame. We’ll then discuss how you can handle situations of increasing heterogeneity, where you can’t do exactly the same thing to every file.

+ +
+

+Listing files in a directory

+

As the name suggests, list.files() lists the files in a directory. You’ll almost always use three arguments:

+
  • The first argument, path, is the directory to look in.

  • pattern is a regular expression used to filter the file names. The most common pattern is something like [.]xlsx$ or [.]csv$ to find all files with a specified extension.

  • full.names determines whether or not the directory name should be included in the output. You almost always want this to be TRUE.

To make our motivating example concrete, this book contains a folder with 12 excel spreadsheets containing data from the gapminder package. Each file contains one year’s worth of data for 142 countries. We can list them all with the appropriate call to list.files():

+
+
paths <- list.files("data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
+paths
+#>  [1] "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx"
+#>  [3] "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx"
+#>  [5] "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx"
+#>  [7] "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx"
+#>  [9] "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx"
+#> [11] "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"
+
+
+ +
+

+Lists

+

Now that we have these 12 paths, we could call read_excel() 12 times to get 12 data frames:

+
+
gapminder_1952 <- readxl::read_excel("data/gapminder/1952.xlsx")
+gapminder_1957 <- readxl::read_excel("data/gapminder/1957.xlsx")
+gapminder_1962 <- readxl::read_excel("data/gapminder/1962.xlsx")
+ ...,
+gapminder_2007 <- readxl::read_excel("data/gapminder/2007.xlsx")
+
+

But putting each sheet into its own variable is going to make it hard to work with them a few steps down the road. Instead, they’ll be easier to work with if we put them into a single object. A list is the perfect tool for this job:

+
+
files <- list(
+  readxl::read_excel("data/gapminder/1952.xlsx"),
+  readxl::read_excel("data/gapminder/1957.xlsx"),
+  readxl::read_excel("data/gapminder/1962.xlsx"),
+  ...,
+  readxl::read_excel("data/gapminder/2007.xlsx")
+)
+
+

Now that you have these data frames in a list, how do you get one out? You can use files[[i]] to extract the i-th element:

+
+
files[[3]]
+#> # A tibble: 142 × 5
+#>   country     continent lifeExp      pop gdpPercap
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 Afghanistan Asia         32.0 10267083      853.
+#> 2 Albania     Europe       64.8  1728137     2313.
+#> 3 Algeria     Africa       48.3 11000948     2551.
+#> 4 Angola      Africa       34    4826015     4269.
+#> 5 Argentina   Americas     65.1 21283783     7133.
+#> 6 Australia   Oceania      70.9 10794968    12217.
+#> # … with 136 more rows
+
+

We’ll come back to [[ in more detail in #sec-subset-one.

+
+ +
+

+purrr::map() and list_rbind() +

+

The code to collect those data frames in a list “by hand” is basically just as tedious to type as code that reads the files one-by-one. Happily, we can use map() to make even better use of our paths vector. map() is similar to across(), but instead of doing something to each column in a data frame, it does something to each element of a vector. map(x, f) is shorthand for:

+
+
list(
+  f(x[[1]]),
+  f(x[[2]]),
+  ...,
+  f(x[[n]])
+)
+
+

So we can use map() to get a list of 12 data frames:

+
+
files <- map(paths, readxl::read_excel)
+length(files)
+#> [1] 12
+
+files[[1]]
+#> # A tibble: 142 × 5
+#>   country     continent lifeExp      pop gdpPercap
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 Afghanistan Asia         28.8  8425333      779.
+#> 2 Albania     Europe       55.2  1282697     1601.
+#> 3 Algeria     Africa       43.1  9279525     2449.
+#> 4 Angola      Africa       30.0  4232095     3521.
+#> 5 Argentina   Americas     62.5 17876956     5911.
+#> 6 Australia   Oceania      69.1  8691212    10040.
+#> # … with 136 more rows
+
+

(This is another data structure that doesn’t display particularly compactly with str(), so you might want to load it into RStudio and inspect it with View().)

+

Now we can use list_rbind() to combine that list of data frames into a single data frame:

+
+
list_rbind(files)
+#> # A tibble: 1,704 × 5
+#>   country     continent lifeExp      pop gdpPercap
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 Afghanistan Asia         28.8  8425333      779.
+#> 2 Albania     Europe       55.2  1282697     1601.
+#> 3 Algeria     Africa       43.1  9279525     2449.
+#> 4 Angola      Africa       30.0  4232095     3521.
+#> 5 Argentina   Americas     62.5 17876956     5911.
+#> 6 Australia   Oceania      69.1  8691212    10040.
+#> # … with 1,698 more rows
+
+

Or we could do both steps at once in a pipeline:

+
+
paths |> 
+  map(readxl::read_excel) |> 
+  list_rbind()
+
+

What if we want to pass in extra arguments to read_excel()? We use the same technique that we used with across(). For example, it’s often useful to peek at the first few rows of the data with n_max = 1:

+
+
paths |> 
+  map(\(path) readxl::read_excel(path, n_max = 1)) |> 
+  list_rbind()
+#> # A tibble: 12 × 5
+#>   country     continent lifeExp      pop gdpPercap
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 Afghanistan Asia         28.8  8425333      779.
+#> 2 Afghanistan Asia         30.3  9240934      821.
+#> 3 Afghanistan Asia         32.0 10267083      853.
+#> 4 Afghanistan Asia         34.0 11537966      836.
+#> 5 Afghanistan Asia         36.1 13079460      740.
+#> 6 Afghanistan Asia         38.4 14880372      786.
+#> # … with 6 more rows
+
+

This makes it clear that something is missing: there’s no year column because that value is recorded in the path, not the individual files. We’ll tackle that problem next.

+
+ +
+

+Data in the path

+

Sometimes the name of the file is itself data. In this example, the file name contains the year, which is not otherwise recorded in the individual files. To get that column into the final data frame, we need to do two things.

+

First, we name the vector of paths. The easiest way to do this is with the set_names() function, which can take a function. Here we use basename() to extract just the file name from the full path:

+
+
paths |> set_names(basename) 
+#>                  1952.xlsx                  1957.xlsx 
+#> "data/gapminder/1952.xlsx" "data/gapminder/1957.xlsx" 
+#>                  1962.xlsx                  1967.xlsx 
+#> "data/gapminder/1962.xlsx" "data/gapminder/1967.xlsx" 
+#>                  1972.xlsx                  1977.xlsx 
+#> "data/gapminder/1972.xlsx" "data/gapminder/1977.xlsx" 
+#>                  1982.xlsx                  1987.xlsx 
+#> "data/gapminder/1982.xlsx" "data/gapminder/1987.xlsx" 
+#>                  1992.xlsx                  1997.xlsx 
+#> "data/gapminder/1992.xlsx" "data/gapminder/1997.xlsx" 
+#>                  2002.xlsx                  2007.xlsx 
+#> "data/gapminder/2002.xlsx" "data/gapminder/2007.xlsx"
+
+

Those names are automatically carried along by all the map functions, so the list of data frames will have those same names:

+
+
files <- paths |> 
+  set_names(basename) |> 
+  map(readxl::read_excel)
+
+

That makes this call to map() shorthand for:

+
+
files <- list(
+  "1952.xlsx" = readxl::read_excel("data/gapminder/1952.xlsx"),
+  "1957.xlsx" = readxl::read_excel("data/gapminder/1957.xlsx"),
+  "1962.xlsx" = readxl::read_excel("data/gapminder/1962.xlsx"),
+  ...,
+  "2007.xlsx" = readxl::read_excel("data/gapminder/2007.xlsx")
+)
+
+

You can also use [[ to extract elements by name:

+
+
files[["1962.xlsx"]]
+#> # A tibble: 142 × 5
+#>   country     continent lifeExp      pop gdpPercap
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 Afghanistan Asia         32.0 10267083      853.
+#> 2 Albania     Europe       64.8  1728137     2313.
+#> 3 Algeria     Africa       48.3 11000948     2551.
+#> 4 Angola      Africa       34    4826015     4269.
+#> 5 Argentina   Americas     65.1 21283783     7133.
+#> 6 Australia   Oceania      70.9 10794968    12217.
+#> # … with 136 more rows
+
+

Then we use the names_to argument to list_rbind() to tell it to save the names into a new column called year, and then use parse_number() to extract the number from the string:

+
+
paths |> 
+  set_names(basename) |> 
+  map(readxl::read_excel) |> 
+  list_rbind(names_to = "year") |> 
+  mutate(year = parse_number(year))
+#> # A tibble: 1,704 × 6
+#>    year country     continent lifeExp      pop gdpPercap
+#>   <dbl> <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1  1952 Afghanistan Asia         28.8  8425333      779.
+#> 2  1952 Albania     Europe       55.2  1282697     1601.
+#> 3  1952 Algeria     Africa       43.1  9279525     2449.
+#> 4  1952 Angola      Africa       30.0  4232095     3521.
+#> 5  1952 Argentina   Americas     62.5 17876956     5911.
+#> 6  1952 Australia   Oceania      69.1  8691212    10040.
+#> # … with 1,698 more rows
+
+

In more complicated cases, there might be other variables stored in the directory name, or maybe the file name contains multiple bits of data. In that case, use set_names() (without any arguments) to record the full path, and then use separate_wider_delim() and friends to turn them into useful columns.

+
+
# NOTE: this chapter also depends on dev tidyr (in addition to dev purrr and dev dplyr)
+paths |> 
+  set_names() |> 
+  map(readxl::read_excel) |> 
+  list_rbind(names_to = "year") |> 
+  separate_wider_delim(year, delim = "/", names = c(NA, "dir", "file")) |> 
+  separate_wider_delim(file, delim = ".", names = c("file", "ext"))
+#> # A tibble: 1,704 × 8
+#>   dir       file  ext   country     continent lifeExp      pop gdpPercap
+#>   <chr>     <chr> <chr> <chr>       <chr>       <dbl>    <dbl>     <dbl>
+#> 1 gapminder 1952  xlsx  Afghanistan Asia         28.8  8425333      779.
+#> 2 gapminder 1952  xlsx  Albania     Europe       55.2  1282697     1601.
+#> 3 gapminder 1952  xlsx  Algeria     Africa       43.1  9279525     2449.
+#> 4 gapminder 1952  xlsx  Angola      Africa       30.0  4232095     3521.
+#> 5 gapminder 1952  xlsx  Argentina   Americas     62.5 17876956     5911.
+#> 6 gapminder 1952  xlsx  Australia   Oceania      69.1  8691212    10040.
+#> # … with 1,698 more rows
+
+
+ +
+

+Save your work

+

Now that you’ve done all this hard work to get to a nice tidy data frame, it’s a great time to save your work:

+
+
gapminder <- paths |> 
+  set_names(basename) |> 
+  map(readxl::read_excel) |> 
+  list_rbind(names_to = "year") |> 
+  mutate(year = parse_number(year))
+
+write_csv(gapminder, "gapminder.csv")
+
+

Now when you come back to this problem in the future, you can read in a single csv file.

+
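That is, a future session can skip all of the reading and binding work and start from:

+
+
gapminder <- read_csv("gapminder.csv")
+
+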

If you’re working in a project, we’d suggest calling the file that does this sort of data prep work something like 0-cleanup.R. The 0 in the file name suggests that this should be run before anything else.

+

If your input data files change over time, you might consider learning a tool like targets (https://docs.ropensci.org/targets/) to set up your data cleaning code to automatically re-run whenever one of the input files is modified.

+
+ +
+

+Many simple iterations

+

Here we’ve just loaded the data directly from disk, and were lucky enough to get a tidy dataset. In most cases, you’ll need to do some additional tidying, and you have two basic options: you can do one round of iteration with a complex function, or do multiple rounds of iteration with simple functions. In our experience most folks reach first for one complex iteration, but you’re often better off doing multiple simple iterations.

+

For example, imagine that you want to read in a bunch of files, filter out missing values, pivot, and then combine. One way to approach the problem is to write a function that takes a file and does all those steps, then call map() once:

+
+
process_file <- function(path) {
+  df <- read_csv(path)
+  
+  df |> 
+    filter(!is.na(id)) |> 
+    mutate(id = tolower(id)) |> 
+    pivot_longer(jan:dec, names_to = "month")
+}
+
+paths |> 
+  map(process_file) |> 
+  list_rbind()
+
+

Alternatively, you could apply each step of process_file() to every file:

+
+
paths |> 
+  map(read_csv) |> 
+  map(\(df) df |> filter(!is.na(id))) |> 
+  map(\(df) df |> mutate(id = tolower(id))) |> 
+  map(\(df) df |> pivot_longer(jan:dec, names_to = "month")) |> 
+  list_rbind()
+
+

We recommend this approach because it stops you from getting fixated on getting the first file right before moving on to the rest. By considering all of the data when doing tidying and cleaning, you’re more likely to think holistically and end up with a higher quality result.

+

In this particular example, there’s another optimization you could make: bind all the data frames together earlier, then rely on regular dplyr behavior:

+
+
paths |> 
+  map(read_csv) |> 
+  list_rbind() |> 
+  filter(!is.na(id)) |> 
+  mutate(id = tolower(id)) |> 
+  pivot_longer(jan:dec, names_to = "month")
+
+
+ +
+

+Heterogeneous data

+

Unfortunately, sometimes it’s not possible to go from map() straight to list_rbind() because the data frames are so heterogeneous that list_rbind() either fails or yields a data frame that’s not very useful. In that case, it’s still useful to start by loading all of the files:

+
+
files <- paths |> 
+  map(readxl::read_excel) 
+
+

Then a very useful strategy is to capture the structure of the data frames as data so that you can explore it using your data science skills. One way to do so is with this handy df_types function, which returns a tibble with one row for each column:

+
+
df_types <- function(df) {
+  tibble(
+    col_name = names(df), 
+    col_type = map_chr(df, vctrs::vec_ptype_full),
+    n_miss = map_int(df, \(x) sum(is.na(x)))
+  )
+}
+
+df_types(starwars)
+#> # A tibble: 14 × 3
+#>   col_name   col_type  n_miss
+#>   <chr>      <chr>      <int>
+#> 1 name       character      0
+#> 2 height     integer        6
+#> 3 mass       double        28
+#> 4 hair_color character      5
+#> 5 skin_color character      0
+#> 6 eye_color  character      0
+#> # … with 8 more rows
+df_types(nycflights13::flights)
+#> # A tibble: 19 × 3
+#>   col_name       col_type n_miss
+#>   <chr>          <chr>     <int>
+#> 1 year           integer       0
+#> 2 month          integer       0
+#> 3 day            integer       0
+#> 4 dep_time       integer    8255
+#> 5 sched_dep_time integer       0
+#> 6 dep_delay      double     8255
+#> # … with 13 more rows
+
+

You can then apply this function to all of the files, and maybe do some pivoting to make it easy to see where there are differences. For example, this makes it easy to verify that the gapminder spreadsheets that we’ve been working with are all quite homogeneous:

+
+
files |> 
+  map(df_types) |> 
+  list_rbind(names_to = "file_name") |> 
+  select(-n_miss) |> 
+  pivot_wider(names_from = col_name, values_from = col_type)
+#> # A tibble: 12 × 6
+#>   file_name country   continent lifeExp pop    gdpPercap
+#>   <chr>     <chr>     <chr>     <chr>   <chr>  <chr>    
+#> 1 1952.xlsx character character double  double double   
+#> 2 1957.xlsx character character double  double double   
+#> 3 1962.xlsx character character double  double double   
+#> 4 1967.xlsx character character double  double double   
+#> 5 1972.xlsx character character double  double double   
+#> 6 1977.xlsx character character double  double double   
+#> # … with 6 more rows
+
+

If the files have heterogeneous formats, you might need to do more processing before you can successfully merge them. Unfortunately we’re now going to leave you to figure that out on your own, but you might want to read about map_if() and map_at(). map_if() allows you to selectively modify elements of a list based on their values; map_at() allows you to selectively modify elements based on their names.

+
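As a rough sketch of the difference (using a toy list invented for illustration):

+
+
x <- list(a = 1:3, b = "banana", c = 4:6)
+x |> map_if(is.numeric, \(v) v * 2)  # modified where the value is numeric
+x |> map_at("b", toupper)            # modified where the name matches
+
+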
+ +
+

+Handling failures

+

Sometimes the structure of your data might be sufficiently wild that you can’t even read all the files with a single command. And then you’ll encounter one of the downsides of map(): it succeeds or fails as a whole. map() will either successfully read all of the files in a directory or fail with an error, reading zero files. This is annoying: why does one failure prevent you from accessing all the other successes?

+

Luckily, purrr comes with a helper to tackle this problem: possibly(). possibly() is what’s known as a function operator: it takes a function and returns a function with modified behavior. In particular, possibly() changes a function from erroring to returning a value that you specify:

+
+
files <- paths |> 
+  map(possibly(\(path) readxl::read_excel(path), NULL))
+
+data <- files |> list_rbind()
+
+

This works particularly well here because list_rbind(), like many tidyverse functions, automatically ignores NULLs.

+
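A minimal illustration of that NULL-dropping behavior:

+
+
list_rbind(list(tibble(x = 1), NULL, tibble(x = 2)))
+#> # A tibble: 2 × 1
+#>       x
+#>   <dbl>
+#> 1     1
+#> 2     2
+
+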

Now you have all the data that can be read easily, and it’s time to tackle the hard part of figuring out why some files failed to load and what to do about it. Start by getting the paths that failed:

+
+
failed <- map_vec(files, is.null)
+paths[failed]
+#> character(0)
+
+

Then call the import function again for each failure and figure out what went wrong.

+
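One way to dig in is a sketch using purrr’s safely(), which, like possibly(), wraps a function, but it keeps the error object so you can read the messages:

+
+
errors <- paths[failed] |> 
+  map(safely(readxl::read_excel)) |> 
+  map("error")  # pull out just the error component of each result
+
+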
+
+ +
+

+Saving multiple outputs

+

In the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll explore the opposite problem: how can you take one or more R objects and save them to one or more files? We’ll explore this challenge using three examples:

+
  • Saving multiple data frames into one database.
  • Saving multiple data frames into multiple csv files.
  • Saving multiple plots to multiple .png files.
+
+

+Writing to a database

+

Sometimes when working with many files at once, it’s not possible to fit all your data into memory, and you can’t do map(files, read_csv). One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.

+

If you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s duckdb_read_csv():

+
+
con <- DBI::dbConnect(duckdb::duckdb())
+duckdb::duckdb_read_csv(con, "gapminder", paths)
+
+

This would work well here, but we don’t have csv files; instead we have excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have one function that will load them all in.

+

We need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:

+
+
template <- readxl::read_excel(paths[[1]])
+template$year <- 1952
+template
+#> # A tibble: 142 × 6
+#>   country     continent lifeExp      pop gdpPercap  year
+#>   <chr>       <chr>       <dbl>    <dbl>     <dbl> <dbl>
+#> 1 Afghanistan Asia         28.8  8425333      779.  1952
+#> 2 Albania     Europe       55.2  1282697     1601.  1952
+#> 3 Algeria     Africa       43.1  9279525     2449.  1952
+#> 4 Angola      Africa       30.0  4232095     3521.  1952
+#> 5 Argentina   Americas     62.5 17876956     5911.  1952
+#> 6 Australia   Oceania      69.1  8691212    10040.  1952
+#> # … with 136 more rows
+
+

Now we can connect to the database and use DBI::dbCreateTable() to turn our template into a database table:

+
+
con <- DBI::dbConnect(duckdb::duckdb())
+DBI::dbCreateTable(con, "gapminder", template)
+
+

dbCreateTable() doesn’t use the data in template, just the variable names and types. So if we inspect the gapminder table now you’ll see that it’s empty but it has the variables we need with the types we expect:

+
+
con |> tbl("gapminder")
+#> # Source:   table<gapminder> [0 x 6]
+#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#> # … with 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,
+#> #   gdpPercap <dbl>, year <dbl>
+
+

Next, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():

+
+
append_file <- function(path) {
+  df <- readxl::read_excel(path)
+  df$year <- parse_number(basename(path))
+  
+  DBI::dbAppendTable(con, "gapminder", df)
+}
+
+

Now we need to call append_file() once for each element of paths. That’s certainly possible with map():

+
+
paths |> map(append_file)
+
+

But we don’t care about the output of append_file(), so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:

+
+
paths |> walk(append_file)
+
+

Now we can see if we have all the data in our table:

+
+
con |> 
+  tbl("gapminder") |> 
+  count(year)
+#> # Source:   SQL [?? x 2]
+#> # Database: DuckDB 0.5.1 [root@Darwin 22.1.0:R 4.2.1/:memory:]
+#>    year     n
+#>   <dbl> <dbl>
+#> 1  1952   142
+#> 2  1987   142
+#> 3  1957   142
+#> 4  1992   142
+#> 5  1962   142
+#> 6  1997   142
+#> # … with more rows
+
+
+ +
+

+Writing csv files

+

The same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the diamonds data and save one csv file for each clarity. First we need to make those individual datasets. There are many ways you could do that, but there’s one way we particularly like: group_nest().

+
+
by_clarity <- diamonds |> 
+  group_nest(clarity)
+
+by_clarity
+#> # A tibble: 8 × 2
+#>   clarity               data
+#>   <ord>   <list<tibble[,9]>>
+#> 1 I1               [741 × 9]
+#> 2 SI2            [9,194 × 9]
+#> 3 SI1           [13,065 × 9]
+#> 4 VS2           [12,258 × 9]
+#> 5 VS1            [8,171 × 9]
+#> 6 VVS2           [5,066 × 9]
+#> # … with 2 more rows
+
+

This gives us a new tibble with eight rows and two columns. clarity is our grouping variable and data is a list-column containing one tibble for each unique value of clarity:

+
+
by_clarity$data[[1]]
+#> # A tibble: 741 × 9
+#>   carat cut       color depth table price     x     y     z
+#>   <dbl> <ord>     <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
+#> 1  0.32 Premium   E      60.9    58   345  4.38  4.42  2.68
+#> 2  1.17 Very Good J      60.2    61  2774  6.83  6.9   4.13
+#> 3  1.01 Premium   F      61.8    60  2781  6.39  6.36  3.94
+#> 4  1.01 Fair      E      64.5    58  2788  6.29  6.21  4.03
+#> 5  0.96 Ideal     F      60.7    55  2801  6.37  6.41  3.88
+#> 6  1.04 Premium   G      62.2    58  2801  6.46  6.41  4   
+#> # … with 735 more rows
+
+

While we’re here, let’s create a column that gives the name of the output file, using mutate() and str_glue():

+
+
by_clarity <- by_clarity |> 
+  mutate(path = str_glue("diamonds-{clarity}.csv"))
+
+by_clarity
+#> # A tibble: 8 × 3
+#>   clarity               data path             
+#>   <ord>   <list<tibble[,9]>> <glue>           
+#> 1 I1               [741 × 9] diamonds-I1.csv  
+#> 2 SI2            [9,194 × 9] diamonds-SI2.csv 
+#> 3 SI1           [13,065 × 9] diamonds-SI1.csv 
+#> 4 VS2           [12,258 × 9] diamonds-VS2.csv 
+#> 5 VS1            [8,171 × 9] diamonds-VS1.csv 
+#> 6 VVS2           [5,066 × 9] diamonds-VVS2.csv
+#> # … with 2 more rows
+
+

So if we were going to save these data frames by hand, we might write something like:

+
+
write_csv(by_clarity$data[[1]], by_clarity$path[[1]])
+write_csv(by_clarity$data[[2]], by_clarity$path[[2]])
+write_csv(by_clarity$data[[3]], by_clarity$path[[3]])
+...
+write_csv(by_clarity$data[[8]], by_clarity$path[[8]])
+
+

This is a little different to our previous uses of map() because there are two arguments that are changing, not just one. That means we need a new function: map2(), which varies both the first and second arguments. And because we again don’t care about the output, we want walk2() rather than map2(). That gives us:

+
+
walk2(by_clarity$data, by_clarity$path, write_csv)
+
+
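By analogy with map(x, f) earlier, map2(x, y, f) is shorthand for the following; walk2() is identical except that it discards the return values:

+
+
list(
+  f(x[[1]], y[[1]]),
+  f(x[[2]], y[[2]]),
+  ...,
+  f(x[[n]], y[[n]])
+)
+
+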
+ +
+

+Saving plots

+

We can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:

+
+
carat_histogram <- function(df) {
+  ggplot(df, aes(carat)) + geom_histogram(binwidth = 0.1)  
+}
+
+carat_histogram(by_clarity$data[[1]])
+
+

+
+
+

Now we can use map() to create a list of many plots (printing by_clarity$plot shows each plot in turn, a crude animation of sorts) and their eventual file paths:

+
+
by_clarity <- by_clarity |> 
+  mutate(
+    plot = map(data, carat_histogram),
+    path = str_glue("clarity-{clarity}.png")
+  )
+
+

Then use walk2() with ggsave() to save each plot:

+
+
walk2(
+  by_clarity$path,
+  by_clarity$plot,
+  \(path, plot) ggsave(path, plot, width = 6, height = 6)
+)
+
+

This is shorthand for:

+
+
ggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)
+ggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)
+ggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)
+...
+ggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)
+
+ + +
+
+ +
+

+Summary

+

In this chapter you’ve seen how to use explicit iteration to solve three problems that come up frequently when doing data science: manipulating multiple columns, reading multiple files, and saving multiple outputs. But in general, iteration is a super power: if you know the right iteration technique, you can easily go from fixing one problem to fixing all the problems. Once you’ve mastered the techniques in this chapter, we highly recommend learning more by reading the Functionals chapter of Advanced R (https://adv-r.hadley.nz/functionals) and consulting the purrr website (https://purrr.tidyverse.org).

+

If you know much about iteration in other languages, you might be surprised that we didn’t discuss the for loop. That’s because R’s orientation towards data analysis changes how we iterate: in most cases you can rely on an existing idiom to do something to each column or each group. And when you can’t, you can often use a functional programming tool like #chp-https://purrr.tidyverse.org/reference/map that does something to each element of a list. However, you will see for loops in wild-caught code, so you’ll learn about them in the next chapter, where we’ll discuss some important base R tools.

+ + +
+
diff --git a/oreilly/joins.html b/oreilly/joins.html new file mode 100644 index 0000000..7f91269 --- /dev/null +++ b/oreilly/joins.html @@ -0,0 +1,972 @@ +
+

Joins

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in. This chapter will introduce you to two important types of joins:

+
  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • +
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.
  • +

We’ll begin by discussing keys, the variables used to connect a pair of data frames in a join. We cement the theory with an examination of the keys in the nycflights13 datasets, then use that knowledge to start joining data frames together. Next we’ll discuss how joins work, focusing on their action on the rows. We’ll finish up with a discussion of non-equi-joins, a family of joins that provide a more flexible way of matching keys than the default equality relationship.

+ +
+

+Prerequisites

+

In this chapter, we’ll explore the five related datasets from nycflights13 using the join functions from dplyr.

+
+
library(tidyverse)
+library(nycflights13)
+
+
+
+ +
+

+Keys

+

To understand joins, you need to first understand how two tables can be connected through a pair of keys, one on each table. In this section, you’ll learn about the two types of keys and see examples of both in the datasets of the nycflights13 package. You’ll also learn how to check that your keys are valid, and what to do if your table lacks a key.

+ +
+

+Primary and foreign keys

+

Every join involves a pair of keys: a primary key and a foreign key. A primary key is a variable or set of variables that uniquely identifies each observation. When more than one variable is needed, the key is called a compound key. For example, in nycflights13:

+
  • +

    airlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.

    +
    +
    airlines
    +#> # A tibble: 16 × 2
    +#>   carrier name                    
    +#>   <chr>   <chr>                   
    +#> 1 9E      Endeavor Air Inc.       
    +#> 2 AA      American Airlines Inc.  
    +#> 3 AS      Alaska Airlines Inc.    
    +#> 4 B6      JetBlue Airways         
    +#> 5 DL      Delta Air Lines Inc.    
    +#> 6 EV      ExpressJet Airlines Inc.
    +#> # … with 10 more rows
    +
    +
  • +
  • +

    airports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.

    +
    +
    airports
    +#> # A tibble: 1,458 × 8
    +#>   faa   name                             lat   lon   alt    tz dst   tzone      
    +#>   <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
    +#> 1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
    +#> 2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
    +#> 3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
    +#> 4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
    +#> 5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
    +#> 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…
    +#> # … with 1,452 more rows
    +
    +
  • +
  • +

    planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.

    +
    +
    planes
    +#> # A tibble: 3,322 × 9
    +#>   tailnum  year type                    manuf…¹ model engines seats speed engine
    +#>   <chr>   <int> <chr>                   <chr>   <chr>   <int> <int> <int> <chr> 
    +#> 1 N10156   2004 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo…
    +#> 2 N102UW   1998 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    +#> 3 N103US   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    +#> 4 N104UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    +#> 5 N10575   2002 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo…
    +#> 6 N105UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
    +#> # … with 3,316 more rows, and abbreviated variable name ¹​manufacturer
    +
    +
  • +
  • +

    weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.

    +
    +
    weather
    +#> # A tibble: 26,115 × 15
    +#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
    +#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
    +#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
    +#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
    +#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
    +#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
    +#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
    +#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
    +#> # … with 26,109 more rows, and 4 more variables: precip <dbl>, pressure <dbl>,
    +#> #   visib <dbl>, time_hour <dttm>
    +
    +
  • +

A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:

+
  • +flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.
  • +
  • +flights$carrier is a foreign key that corresponds to the primary key airlines$carrier.
  • +
  • +flights$origin is a foreign key that corresponds to the primary key airports$faa.
  • +
  • +flights$dest is a foreign key that corresponds to the primary key airports$faa .
  • +
  • +flights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.
  • +

These relationships are summarized visually in #fig-flights-relationships.

+
+
+ +

The relationships between the airports, planes, flights, weather, and airlines datasets from the nycflights13 package. airports$faa is connected to flights$origin and flights$dest. planes$tailnum is connected to flights$tailnum. weather$time_hour and weather$origin are jointly connected to flights$time_hour and flights$origin. airlines$carrier is connected to flights$carrier. There are no direct connections between the airports, planes, airlines, and weather data frames.

+
Figure 19.1: Connections between all five data frames in the nycflights13 package. Variables making up a primary key are coloured grey, and are connected to their corresponding foreign keys with arrows.
+
+
+
+

You’ll notice a nice feature in the design of these keys: the primary and foreign keys almost always have the same names, which, as you’ll see shortly, will make your joining life much easier. It’s also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There’s only one exception: year means year of departure in flights and year of manufacture in planes. This will become important when we start actually joining tables together.

+
+ +
+

+Checking primary keys

+

Now that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to #chp-https://dplyr.tidyverse.org/reference/count the primary keys and look for entries where n is greater than one. This reveals that planes and weather both look good:

+
+
planes |> 
+  count(tailnum) |> 
+  filter(n > 1)
+#> # A tibble: 0 × 2
+#> # … with 2 variables: tailnum <chr>, n <int>
+
+weather |> 
+  count(time_hour, origin) |> 
+  filter(n > 1)
+#> # A tibble: 0 × 3
+#> # … with 3 variables: time_hour <dttm>, origin <chr>, n <int>
+
+

You should also check for missing values in your primary keys — if a value is missing then it can’t identify an observation!

+
+
planes |> 
+  filter(is.na(tailnum))
+#> # A tibble: 0 × 9
+#> # … with 9 variables: tailnum <chr>, year <int>, type <chr>,
+#> #   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
+#> #   engine <chr>
+
+weather |> 
+  filter(is.na(time_hour) | is.na(origin))
+#> # A tibble: 0 × 15
+#> # … with 15 variables: origin <chr>, year <int>, month <int>, day <int>,
+#> #   hour <int>, temp <dbl>, dewp <dbl>, humid <dbl>, wind_dir <dbl>,
+#> #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
+#> #   visib <dbl>, time_hour <dttm>
+
+
+ +
+

+Surrogate keys

+

So far we haven’t talked about the primary key for flights. It’s not super important here, because there are no data frames that use it as a foreign key, but it’s still useful to consider because it’s easier to work with observations if we have some way to describe them to others.

+

After a little thinking and experimentation, we determined that there are three variables that together uniquely identify each flight:

+
+
flights |> 
+  count(time_hour, carrier, flight) |> 
+  filter(n > 1)
+#> # A tibble: 0 × 4
+#> # … with 4 variables: time_hour <dttm>, carrier <chr>, flight <int>, n <int>
+
+

Does the absence of duplicates automatically make time_hour-carrier-flight a primary key? It’s certainly a good start, but it doesn’t guarantee it. For example, are altitude and latitude a good primary key for airports?

+
+
airports |>
+  count(alt, lat) |> 
+  filter(n > 1)
+#> # A tibble: 1 × 3
+#>     alt   lat     n
+#>   <dbl> <dbl> <int>
+#> 1    13  40.6     2
+
+

Identifying an airport by its altitude and latitude is clearly a bad idea, and in general it’s not possible to know from the data alone whether or not a combination of variables makes a good primary key. But for flights, the combination of time_hour, carrier, and flight seems reasonable because it would be really confusing for an airline and its customers if there were multiple flights with the same flight number in the air at the same time.

+

That said, we might be better off introducing a simple numeric surrogate key using the row number:

+
+
flights2 <- flights |> 
+  mutate(id = row_number(), .before = 1)
+flights2
+#> # A tibble: 336,776 × 20
+#>      id  year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
+#>   <int> <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>   <dbl>
+#> 1     1  2013     1     1      517           515       2     830     819      11
+#> 2     2  2013     1     1      533           529       4     850     830      20
+#> 3     3  2013     1     1      542           540       2     923     850      33
+#> 4     4  2013     1     1      544           545      -1    1004    1022     -18
+#> 5     5  2013     1     1      554           600      -6     812     837     -25
+#> 6     6  2013     1     1      554           558      -4     740     728      12
+#> # … with 336,770 more rows, 10 more variables: carrier <chr>, flight <int>,
+#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
+#> #   hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

Surrogate keys can be particularly useful when communicating with other humans: it’s much easier to tell someone to take a look at flight 2001 than to say look at UA430, which departed at 9am on 2013-01-03.

+
+ +
+

+Exercises

+
  1. We forgot to draw the relationship between weather and airports in #fig-flights-relationships. What is the relationship and how should it appear in the diagram?

  2. +
  3. weather only contains information for the three origin airports in NYC. If it contained weather records for all airports in the USA, what additional connection would it make to flights?

  4. +
  5. The year, month, day, hour, and origin variables almost form a compound key for weather, but there’s one hour that has duplicate observations. Can you figure out what’s special about that hour?

  6. +
  7. We know that some days of the year are special and fewer people than usual fly on them (e.g. Christmas eve and Christmas day). How might you represent that data as a data frame? What would be the primary key? How would it connect to the existing data frames?

  8. +
  9. Draw a diagram illustrating the connections between the Batting, People, and Salaries data frames in the Lahman package. Draw another diagram that shows the relationship between People, Managers, AwardsManagers. How would you characterise the relationship between the Batting, Pitching, and Fielding data frames?

  10. +
+
+ +
+

+Basic joins

+

Now that you understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: left_join(), inner_join(), right_join(), and full_join() (#chp-https://dplyr.tidyverse.org/reference/mutate-joins), plus semi_join() and anti_join() (#chp-https://dplyr.tidyverse.org/reference/filter-joins). They all have the same interface: they take a pair of data frames (x and y) and return a data frame. The order of the rows and columns in the output is primarily determined by x.

+

In this section, you’ll learn how to use one mutating join, left_join(), and two filtering joins, semi_join() and anti_join(). In the next section, you’ll learn exactly how these functions work, and about the remaining inner_join(), right_join(), and full_join().

+ +
+

+Mutating joins

+

A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like #chp-https://dplyr.tidyverse.org/reference/mutate, the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables (remember that in RStudio you can also use #chp-https://rdrr.io/r/utils/View to avoid this problem):

+
+
flights2 <- flights |> 
+  select(year, time_hour, origin, dest, tailnum, carrier)
+flights2
+#> # A tibble: 336,776 × 6
+#>    year time_hour           origin dest  tailnum carrier
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>  
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     
+#> # … with 336,770 more rows
+
+

There are four types of mutating join, but there’s one that you’ll use almost all of the time: left_join() (#chp-https://dplyr.tidyverse.org/reference/mutate-joins). It’s special because the output will always have the same rows as x. (That’s not 100% true, but you’ll get a warning whenever it isn’t.) The primary use of left_join() is to add in additional metadata. For example, we can use it to add the full airline name to the flights2 data:

+
+
flights2 |>
+  left_join(airlines)
+#> Joining with `by = join_by(carrier)`
+#> # A tibble: 336,776 × 7
+#>    year time_hour           origin dest  tailnum carrier name                  
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                 
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc. 
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc. 
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways       
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.  
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc. 
+#> # … with 336,770 more rows
+
+

Or we could find out the temperature and wind speed when each plane departed:

+
+
flights2 |> 
+  left_join(weather |> select(origin, time_hour, temp, wind_speed))
+#> Joining with `by = join_by(time_hour, origin)`
+#> # A tibble: 336,776 × 8
+#>    year time_hour           origin dest  tailnum carrier  temp wind_speed
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <dbl>      <dbl>
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA       39.0       12.7
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA       39.9       15.0
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA       39.0       15.0
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6       39.0       15.0
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL       39.9       16.1
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA       39.0       12.7
+#> # … with 336,770 more rows
+
+

Or what size of plane was flying:

+
+
flights2 |> 
+  left_join(planes |> select(tailnum, type, engines, seats))
+#> Joining with `by = join_by(tailnum)`
+#> # A tibble: 336,776 × 9
+#>    year time_hour           origin dest  tailnum carrier type      engines seats
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>       <int> <int>
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed wi…       2   149
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed wi…       2   149
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed wi…       2   178
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed wi…       2   200
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed wi…       2   178
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed wi…       2   191
+#> # … with 336,770 more rows
+
+

When left_join() fails to find a match for a row in x, it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA, so the type, engines, and seats will be missing:

+
+
flights2 |> 
+  filter(tailnum == "N3ALAA") |> 
+  left_join(planes |> select(tailnum, type, engines, seats))
+#> Joining with `by = join_by(tailnum)`
+#> # A tibble: 63 × 9
+#>    year time_hour           origin dest  tailnum carrier type  engines seats
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>   <int> <int>
+#> 1  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
+#> 2  2013 2013-01-02 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
+#> 3  2013 2013-01-03 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
+#> 4  2013 2013-01-07 19:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
+#> 5  2013 2013-01-08 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
+#> 6  2013 2013-01-16 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
+#> # … with 57 more rows
+
+

We’ll come back to this problem a few times in the rest of the chapter.

+
+ +
+

+Specifying join keys

+

By default, left_join() will use all variables that appear in both data frames as the join key, the so-called natural join. This is a useful heuristic, but it doesn’t always work. For example, what happens if we try to join flights2 with the complete planes dataset?

+
+
flights2 |> 
+  left_join(planes)
+#> Joining with `by = join_by(year, tailnum)`
+#> # A tibble: 336,776 × 13
+#>    year time_hour           origin dest  tailnum carrier type  manufactu…¹ model
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       <chr>
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        <NA> 
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        <NA> 
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        <NA> 
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        <NA> 
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        <NA> 
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        <NA> 
+#> # … with 336,770 more rows, 4 more variables: engines <int>, seats <int>,
+#> #   speed <int>, engine <chr>, and abbreviated variable name ¹​manufacturer
+
+

We get a lot of missing matches because our join is trying to use tailnum and year as a compound key. Both flights and planes have a year column but they mean different things: flights$year is the year the flight occurred and planes$year is the year the plane was built. We only want to join on tailnum, so we need to provide an explicit specification with #chp-https://dplyr.tidyverse.org/reference/join_by:

+
+
flights2 |> 
+  left_join(planes, join_by(tailnum))
+#> # A tibble: 336,776 × 14
+#>   year.x time_hour           origin dest  tailnum carrier year.y type    manuf…¹
+#>    <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>   <chr>  
+#> 1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed … BOEING 
+#> 2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed … BOEING 
+#> 3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed … BOEING 
+#> 4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed … AIRBUS 
+#> 5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed … BOEING 
+#> 6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed … BOEING 
+#> # … with 336,770 more rows, 5 more variables: model <chr>, engines <int>,
+#> #   seats <int>, speed <int>, engine <chr>, and abbreviated variable name
+#> #   ¹​manufacturer
+
+

Note that the year variables are disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x or y argument. You can override the default suffixes with the suffix argument.

+

join_by(tailnum) is short for join_by(tailnum == tailnum). It’s important to know about this fuller form for two reasons. Firstly, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is often called an equi-join. You’ll learn about non-equi-joins in #sec-non-equi-joins.

+

Secondly, it’s how you specify different join keys in each table. For example, there are two ways to join the flights2 and airports tables: either by dest or origin:

+
+
flights2 |> 
+  left_join(airports, join_by(dest == faa))
+#> # A tibble: 336,776 × 13
+#>    year time_hour           origin dest  tailnum carrier name    lat   lon   alt
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Geor…  30.0 -95.3    97
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Geor…  30.0 -95.3    97
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miam…  25.8 -80.3     8
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>   NA    NA      NA
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hart…  33.6 -84.4  1026
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chic…  42.0 -87.9   668
+#> # … with 336,770 more rows, and 3 more variables: tz <dbl>, dst <chr>,
+#> #   tzone <chr>
+
+flights2 |> 
+  left_join(airports, join_by(origin == faa))
+#> # A tibble: 336,776 × 13
+#>    year time_hour           origin dest  tailnum carrier name    lat   lon   alt
+#>   <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <dbl> <dbl> <dbl>
+#> 1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Newa…  40.7 -74.2    18
+#> 2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      La G…  40.8 -73.9    22
+#> 3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      John…  40.6 -73.8    13
+#> 4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      John…  40.6 -73.8    13
+#> 5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      La G…  40.8 -73.9    22
+#> 6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Newa…  40.7 -74.2    18
+#> # … with 336,770 more rows, and 3 more variables: tz <dbl>, dst <chr>,
+#> #   tzone <chr>
+
+

In older code you might see a different way of specifying the join keys, using a character vector:

+
  • +by = "x" corresponds to join_by(x).
  • +
  • +by = c("a" = "x") corresponds to join_by(a == x).
  • +

Now that it exists, we prefer #chp-https://dplyr.tidyverse.org/reference/join_by since it provides a clearer and more flexible specification.

+
+ +
+

+Filtering joins

+

As you might guess, the primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins. Semi-joins keep all rows in x that have a match in y. For example, we could use a semi-join to filter the airports dataset to show just the origin airports:

+
+
airports |> 
+  semi_join(flights2, join_by(faa == origin))
+#> # A tibble: 3 × 8
+#>   faa   name                  lat   lon   alt    tz dst   tzone           
+#>   <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
+#> 1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
+#> 2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
+#> 3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York
+
+

Or just the destinations:

+
+
airports |> 
+  semi_join(flights2, join_by(faa == dest))
+#> # A tibble: 101 × 8
+#>   faa   name                                lat    lon   alt    tz dst   tzone  
+#>   <chr> <chr>                             <dbl>  <dbl> <dbl> <dbl> <chr> <chr>  
+#> 1 ABQ   Albuquerque International Sunport  35.0 -107.   5355    -7 A     Americ…
+#> 2 ACK   Nantucket Mem                      41.3  -70.1    48    -5 A     Americ…
+#> 3 ALB   Albany Intl                        42.7  -73.8   285    -5 A     Americ…
+#> 4 ANC   Ted Stevens Anchorage Intl         61.2 -150.    152    -9 A     Americ…
+#> 5 ATL   Hartsfield Jackson Atlanta Intl    33.6  -84.4  1026    -5 A     Americ…
+#> 6 AUS   Austin Bergstrom Intl              30.2  -97.7   542    -6 A     Americ…
+#> # … with 95 more rows
+
+

Anti-joins are the opposite: they return all rows in x that don’t have a match in y. They’re useful for finding missing values that are implicit in the data, the topic of #sec-missing-implicit. Implicitly missing values don’t show up as NAs but instead only exist as an absence. For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:

+
+
flights2 |> 
+  anti_join(airports, join_by(dest == faa)) |> 
+  distinct(dest)
+#> # A tibble: 4 × 1
+#>   dest 
+#>   <chr>
+#> 1 BQN  
+#> 2 SJU  
+#> 3 STT  
+#> 4 PSE
+
+

Or we can find which tailnums are missing from planes:

+
+
flights2 |>
+  anti_join(planes, join_by(tailnum)) |> 
+  distinct(tailnum)
+#> # A tibble: 722 × 1
+#>   tailnum
+#>   <chr>  
+#> 1 N3ALAA 
+#> 2 N3DUAA 
+#> 3 N542MQ 
+#> 4 N730MQ 
+#> 5 N9EAMQ 
+#> 6 N532UA 
+#> # … with 716 more rows
+
+
+ +
+

+Exercises

+
  1. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

  2. +
  3. +

    Imagine you’ve found the top 10 most popular destinations using this code:

    +
    +
    top_dest <- flights2 |>
    +  count(dest, sort = TRUE) |>
    +  head(10)
    +
    +

    How can you find all flights to those destinations?

    +
  4. +
  5. Does every departing flight have corresponding weather data for that hour?

  6. +
  7. What do the tail numbers that don’t have a matching record in planes have in common? (Hint: one variable explains ~90% of the problems.)

  8. +
  9. Add a column to planes that lists every carrier that has flown that plane. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned in previous chapters.

  10. +
  11. Add the latitude and the longitude of the origin and destination airport to flights. Is it easier to rename the columns before or after the join?

  12. +
  13. +

    Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

    +
    +
    airports |>
    +  semi_join(flights, join_by(faa == dest)) |>
    +  ggplot(aes(lon, lat)) +
    +    borders("state") +
    +    geom_point() +
    +    coord_quickmap()
    +
    +

    You might want to use the size or colour of the points to display the average delay for each airport.

    +
  14. +
  15. What happened on June 13 2013? Draw a map of the delays, and then use Google to cross-reference with the weather.

  16. +
+
+ +
+

+How do joins work?

+

Now that you’ve used joins a few times it’s time to learn more about how they work, focusing on how each row in x matches rows in y. We’ll begin by using #fig-join-setup to introduce a visual representation of the two simple tibbles defined below. In these examples we’ll use a single key called key and a single value column (val_x and val_y), but the ideas all generalize to multiple keys and multiple values.

+
+
x <- tribble(
+  ~key, ~val_x,
+     1, "x1",
+     2, "x2",
+     3, "x3"
+)
+y <- tribble(
+  ~key, ~val_y,
+     1, "y1",
+     2, "y2",
+     4, "y3"
+)
+
+
+
+ +

x and y are two data frames with 2 columns and 3 rows, with contents as described in the text. The values of the keys are coloured: 1 is green, 2 is purple, 3 is orange, and 4 is yellow.

+
Figure 19.2: Graphical representation of two simple tables. The coloured key columns map background colour to key value. The grey columns represent the “value” columns that are carried along for the ride.
+
+
+
+

#fig-join-setup2 shows all potential matches between x and y as the intersection between lines drawn from each row of x and each row of y. The rows and columns in the output are primarily determined by x, so the x table is horizontal and lines up with the output.

+
+
+ +

x and y are placed at right-angles, with horizontal lines extending from x and vertical lines extending from y. There are 3 rows in x and 3 rows in y, which leads to nine intersections representing nine potential matches.

+
Figure 19.3: To understand how joins work, it’s useful to think of every possible match. Here we show that with a grid of connecting lines.
+
+
+
+

In an actual join, matches will be indicated with dots, as in #fig-join-inner. The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values. The join shown here is a so-called equi inner join, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both x and y. Equi-joins are the most common type of join, so we’ll typically omit the equi prefix, and just call it an inner join. We’ll come back to non-equi joins in #sec-non-equi-joins.

+
+
+ +

x and y are placed at right-angles with lines forming a grid of potential matches. Keys 1 and 2 appear in both x and y, so we get a match, indicated by a dot. Each dot corresponds to a row in the output, so the resulting joined data frame has two rows.

+
Figure 19.4: An inner join matches each row in x to the row in y that has the same value of key. Each match becomes a row in the output.
+
+
+
+
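To see the same result on screen rather than in a diagram, you can run the join on the x and y tibbles defined above:

x |> 
  inner_join(y, join_by(key))
#> # A tibble: 2 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2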

An outer join keeps observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with NA. There are three types of outer joins:

+
  • +

    A left join keeps all observations in x, #fig-join-left. Every row of x is preserved in the output because it can fall back to matching a row of NAs in y.

    +
    +
    + +

Compared to the previous diagram showing an inner join, the y table gets a new virtual row containing NA that will match any row in x that didn't otherwise match. This means that the output now has three rows. For key = 3, which matches this virtual row, val_y takes value NA.

    +
    Figure 19.5: A visual representation of the left join where every row in x appears in the output.
    +
    +
    +
    +
  • +
  • +

    A right join keeps all observations in y, #fig-join-right. Every row of y is preserved in the output because it can fall back to matching a row of NAs in x. The output still matches x as much as possible; any extra rows from y are added to the end.

    +
    +
    + +

Compared to the previous diagram showing a left join, the x table now gains a virtual row so that every row in y gets a match in x. val_x contains NA for the row in y that didn't match x.

    +
    Figure 19.6: A visual representation of the right join where every row of y appears in the output.
    +
    +
    +
    +
  • +
  • +

    A full join keeps all observations that appear in x or y, #fig-join-full. Every row of x and y is included in the output because both x and y have a fall back row of NAs. Again, the output starts with all rows from x, followed by the remaining unmatched y rows.

    +
    +
    + +

Now both x and y have a virtual row that always matches. The result has 4 rows: keys 1, 2, 3, and 4 with all values from val_x and val_y; however, val_y for key 3 and val_x for key 4 are NA, since those keys don't have a match in the other data frame.

    +
    Figure 19.7: A visual representation of the full join where every row in x and y appears in the output.
    +
    +
    +
    +
  • +
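Again using x and y from above, here’s a quick sketch of all three outer joins and their row counts:

x |> left_join(y, join_by(key))   # 3 rows: every row of x; val_y is NA for key 3
x |> right_join(y, join_by(key))  # 3 rows: every row of y; val_x is NA for key 4
x |> full_join(y, join_by(key))   # 4 rows: every row of x and y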

Another way to show how the types of outer join differ is with a Venn diagram, as in #fig-join-venn. However, this is not a great representation because while it might jog your memory about which rows are preserved, it fails to illustrate what’s happening with the columns.

+
+
+ +

Venn diagrams for inner, full, left, and right joins. Each join represented with two intersecting circles representing data frames x and y, with x on the right and y on the left. Shading indicates the result of the join.

+
Figure 19.8: Venn diagrams showing the difference between inner, left, right, and full joins.
+
+
+
+ +
+

+Row matching

+

So far we’ve explored what happens if a row in x matches zero or one rows in y. What happens if it matches more than one row? To understand what’s going on, let’s first narrow our focus to the inner join (#chp-https://dplyr.tidyverse.org/reference/mutate-joins) and then draw a picture, #fig-join-match-types.

+
+
+ +

A join diagram where x has key values 1, 2, and 3, and y has key values 1, 2, 2. The output has three rows because key 1 matches one row, key 2 matches two rows, and key 3 matches zero rows.

+
Figure 19.9: The three ways a row in x can match. x1 matches one row in y, x2 matches two rows in y, x3 matches zero rows in y. Note that while there are three rows in x and three rows in the output, there isn’t a direct correspondence between the rows.
+
+
+
+

There are three possible outcomes for a row in x:

+
  • If it doesn’t match anything, it’s dropped.
  • +
  • If it matches 1 row in y, it’s preserved.
  • +
  • If it matches more than 1 row in y, it’s duplicated once for each match.
  • +

In principle, this means that there’s no guaranteed correspondence between the rows in the output and the rows in x:

+
  • There might be fewer rows if some rows in x don’t match any rows in y.
  • +
  • There might be more rows if some rows in x match multiple rows in y.
  • +
  • There might be the same number of rows if every row in x matches one row in y.
  • +
  • There might be the same number of rows if some rows don’t match any rows, and exactly the same number of rows match two rows in y!
  • +

Row expansion is a fundamental property of joins, but it’s dangerous because it might happen without you realizing it. To avoid this problem, dplyr will warn whenever there are multiple matches:

+
+
df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
+df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))
+
+df1 |> 
+  inner_join(df2, join_by(key))
+#> Warning in inner_join(df1, df2, join_by(key)): Each row in `x` is expected to match at most 1 row in `y`.
+#> ℹ Row 2 of `x` matches multiple rows.
+#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
+#>   warning.
+#> # A tibble: 3 × 3
+#>     key val_x val_y
+#>   <dbl> <chr> <chr>
+#> 1     1 x1    y1   
+#> 2     2 x2    y2   
+#> 3     2 x2    y3
+
+

This is one reason we like left_join() — if it runs without warning, you know that each row of the output matches the row in the same position in x.

+

You can gain further control over row matching with two arguments:

+
  • +unmatched controls what happens when a row in x fails to match any rows in y. It defaults to "drop" which will silently drop any unmatched rows.
  • +
  • +multiple controls what happens when a row in x matches more than one row in y. For equi-joins, it defaults to "warn" which emits a warning message if any rows have multiple matches.
  • +

There are two common cases in which you might want to override these defaults: enforcing a one-to-one mapping or deliberately allowing the rows to increase.

+
+ +
+

+One-to-one mapping

+

Both unmatched and multiple can take value "error" which means that the join will fail unless each row in x matches exactly one row in y:

+
+
df1 <- tibble(x = 1)
+df2 <- tibble(x = c(1, 1))
+df3 <- tibble(x = 3)
+
+df1 |> 
+  inner_join(df2, join_by(x), unmatched = "error", multiple = "error")
+#> Error in `inner_join()`:
+#> ! Each row in `x` must match at most 1 row in `y`.
+#> ℹ Row 1 of `x` matches multiple rows.
+df1 |> 
+  inner_join(df3, join_by(x), unmatched = "error", multiple = "error")
+#> Error in `inner_join()`:
+#> ! Each row of `x` must have a match in `y`.
+#> ℹ Row 1 of `x` does not have a match.
+
+

Note that unmatched = "error" is not useful with left_join() because, as described above, every row in x has a fallback match to a virtual row in y.

+
+ +
+

+Allow multiple rows

+

Sometimes it’s useful to deliberately expand the number of rows in the output. This can come about naturally if you “flip” the direction of the question you’re asking. For example, as we’ve seen above, it’s natural to supplement the flights data with information about the plane that flew each flight:

+
+
flights2 |> 
+  left_join(planes, by = "tailnum")
+
+

But it’s also reasonable to ask which flights each plane flew:

+
+
plane_flights <- planes |> 
+  select(tailnum, type, engines, seats) |> 
+  left_join(flights2, by = "tailnum")
+#> Warning in left_join(select(planes, tailnum, type, engines, seats), flights2, : Each row in `x` is expected to match at most 1 row in `y`.
+#> ℹ Row 1 of `x` matches multiple rows.
+#> ℹ If multiple matches are expected, set `multiple = "all"` to silence this
+#>   warning.
+
+

Since this duplicates rows in x (the planes), we need to explicitly say that we’re ok with the multiple matches by setting multiple = "all":

+
+
plane_flights <- planes |> 
+  select(tailnum, type, engines, seats) |> 
+  left_join(flights2, by = "tailnum", multiple = "all")
+
+plane_flights
+#> # A tibble: 284,170 × 9
+#>   tailnum type      engines seats  year time_hour           origin dest  carrier
+#>   <chr>   <chr>       <int> <int> <int> <dttm>              <chr>  <chr> <chr>  
+#> 1 N10156  Fixed wi…       2    55  2013 2013-01-10 06:00:00 EWR    PIT   EV     
+#> 2 N10156  Fixed wi…       2    55  2013 2013-01-10 10:00:00 EWR    CHS   EV     
+#> 3 N10156  Fixed wi…       2    55  2013 2013-01-10 15:00:00 EWR    MSP   EV     
+#> 4 N10156  Fixed wi…       2    55  2013 2013-01-11 06:00:00 EWR    CMH   EV     
+#> 5 N10156  Fixed wi…       2    55  2013 2013-01-11 11:00:00 EWR    MCI   EV     
+#> 6 N10156  Fixed wi…       2    55  2013 2013-01-11 18:00:00 EWR    PWM   EV     
+#> # … with 284,164 more rows
+
+
+ +
+

+Filtering joins

+

The number of matches also determines the behavior of the filtering joins. The semi-join keeps rows in x that have one or more matches in y, as in #fig-join-semi. The anti-join keeps rows in x that match zero rows in y, as in #fig-join-anti. In both cases, only the existence of a match is important; it doesn’t matter how many times it matches. This means that filtering joins never duplicate rows like mutating joins do.

+
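A small sketch makes this concrete. Since df1 and df2 were redefined above, we re-create the duplicated-key tibbles from the row-matching section:

df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 2), val_y = c("y1", "y2", "y3"))

df1 |> semi_join(df2, join_by(key))  # 2 rows: keys 1 and 2, each appearing once
df1 |> anti_join(df2, join_by(key))  # 1 row: key 3, which has no match in df2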
+
+ +

A join diagram with old friends x and y. In a semi join, only the presence of a match matters so the output contains the same columns as x.

+
Figure 19.10: In a semi-join it only matters that there is a match; otherwise values in y don’t affect the output.
+
+
+
+
+
+ +

An anti-join is the inverse of a semi-join so matches are drawn with red lines indicating that they will be dropped from the output.

+
Figure 19.11: An anti-join is the inverse of a semi-join, dropping rows from x that have a match in y.
+
+
+
+
+
+ +
+

+Non-equi joins

+

So far you’ve only seen equi-joins, joins where the rows match if the x key equals the y key. Now we’re going to relax that restriction and discuss other ways of determining if a pair of rows match.

+

But before we can do that, we need to revisit a simplification we made above. In equi-joins the x and y keys are always equal, so we only need to show one in the output. We can request that dplyr keep both keys with keep = TRUE, leading to the code below and the re-drawn left join in #fig-inner-both.

+
+
x |> left_join(y, by = "key", keep = TRUE)
+#> # A tibble: 3 × 4
+#>   key.x val_x key.y val_y
+#>   <dbl> <chr> <dbl> <chr>
+#> 1     1 x1        1 y1   
+#> 2     2 x2        2 y2   
+#> 3     3 x3       NA <NA>
+
+
+
+ +

A join diagram showing a left join between x and y. The result now includes four columns: key.x, val_x, key.y, and val_y. For matched rows, the values of key.x and key.y are identical, which is why we usually only show one.

+
Figure 19.12: A left join showing both x and y keys in the output.
+
+
+
+

When we move away from equi-joins we’ll always show the keys, because the key values will often be different. For example, instead of matching only when the x$key and y$key are equal, we could match whenever the x$key is greater than or equal to the y$key, leading to #fig-join-gte. dplyr’s join functions understand the distinction between equi and non-equi joins, so they will always show both keys when you perform a non-equi join.

+
+
+ +

A join diagram illustrating join_by(key >= key). The first row of x matches one row of y and the second and third rows each match two rows. This means the output has five rows containing each of the following (key.x, key.y) pairs: (1, 1), (2, 1), (2, 2), (3, 1), (3, 2).

+
Figure 19.13: A non-equi join where the x key must be greater than or equal to the y key. Many rows generate multiple matches.
+
+
+
+

Non-equi-join isn’t a particularly useful term because it only tells you what the join is not, not what it is. dplyr helps by identifying four particularly useful types of non-equi-join:

+
  • +Cross joins match every pair of rows.
  • +
  • +Inequality joins use <, <=, >, and >= instead of ==.
  • +
  • +Rolling joins are similar to inequality joins but only find the closest match.
  • +
  • +Overlap joins are a special type of inequality join designed to work with ranges.
  • +

Each of these is described in more detail in the following sections.

+ +
+

+Cross joins

+

A cross join matches everything, as in #fig-join-cross, generating the Cartesian product of rows. This means the output will have nrow(x) * nrow(y) rows.

+
+
+ +

A join diagram showing a dot for every combination of x and y.

+
Figure 19.14: A cross join matches each row in x with every row in y.
+
+
+
+

Cross joins are useful when generating permutations. For example, the code below generates every possible pair of names. Since we’re joining df to itself, this is sometimes called a self-join.

+
+
df <- tibble(name = c("John", "Simon", "Tracy", "Max"))
+df |> left_join(df, join_by())
+#> # A tibble: 16 × 2
+#>   name.x name.y
+#>   <chr>  <chr> 
+#> 1 John   John  
+#> 2 John   Simon 
+#> 3 John   Tracy 
+#> 4 John   Max   
+#> 5 Simon  John  
+#> 6 Simon  Simon 
+#> # … with 10 more rows
+
+
+ +
+

+Inequality joins

+

Inequality joins use <, <=, >=, or > to restrict the set of possible matches, as in #fig-join-gte and #fig-join-lt.

+
+
+ +

+
Figure 19.15: An inequality join where x is joined to y on rows where the key of x is less than the key of y. This makes a triangular shape in the top-left corner.
+
+
+
+

Inequality joins are extremely general, so general that it’s hard to come up with meaningful specific use cases. One small useful technique is to use them to restrict the cross join so that instead of generating all permutations, we generate all combinations:

+
+
df <- tibble(id = 1:4, name = c("John", "Simon", "Tracy", "Max"))
+
+df |> left_join(df, join_by(id < id))
+#> # A tibble: 7 × 4
+#>    id.x name.x  id.y name.y
+#>   <int> <chr>  <int> <chr> 
+#> 1     1 John       2 Simon 
+#> 2     1 John       3 Tracy 
+#> 3     1 John       4 Max   
+#> 4     2 Simon      3 Tracy 
+#> 5     2 Simon      4 Max   
+#> 6     3 Tracy      4 Max   
+#> # … with 1 more row
+
+
+ +
+

+Rolling joins

+

Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, you get just the closest row, as in #fig-join-closest. You can turn any inequality join into a rolling join by adding closest(). For example join_by(closest(x <= y)) matches the smallest y that’s greater than or equal to x, and join_by(closest(x > y)) matches the biggest y that’s less than x.

+
+
+ +

A rolling join is a subset of an inequality join so some matches are grayed out indicating that they're not used because they're not the "closest".

+
Figure 19.16: A rolling join is similar to a greater-than-or-equal inequality join, but only matches the closest value.
+
+
+
+
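Here’s a minimal sketch of that behavior with two made-up tibbles (df_x and df_y are hypothetical names):

df_x <- tibble(x = c(1, 5, 10))
df_y <- tibble(y = c(2, 4, 6, 8))

df_x |> 
  left_join(df_y, join_by(closest(x <= y)))
# x = 1 matches y = 2, x = 5 matches y = 6, and x = 10 finds no y >= 10, so y is NA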

Rolling joins are particularly useful when you have two tables of dates that don’t perfectly line up and you want to find (e.g.) the closest date in table 1 that comes before (or after) some date in table 2.

+

For example, imagine that you’re in charge of the party planning commission for your office. Your company is rather cheap so instead of having individual parties, you only have a party once each quarter. The rules for determining when a party will be held are a little complex: parties are always on a Monday, you skip the first week of January since a lot of people are on holiday, and the first Monday of Q3 2022 is July 4, so that has to be pushed back a week. That leads to the following party days:

+
+
parties <- tibble(
+  q = 1:4,
+  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
+)
+
+

Now imagine that you have a table of employee birthdays:

+
+
employees <- tibble(
+  name = wakefield::name(100),
+  birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
+)
+employees
+#> # A tibble: 100 × 2
+#>   name       birthday  
+#>   <variable> <date>    
+#> 1 Lindzy     2022-08-11
+#> 2 Santania   2022-03-01
+#> 3 Gardell    2022-03-04
+#> 4 Cyrille    2022-11-15
+#> 5 Kynli      2022-07-09
+#> 6 Sever      2022-02-03
+#> # … with 94 more rows
+
+

And for each employee we want to find the last party date that came before (or on) their birthday. We can express that with a rolling join:

+
+
employees |> 
+  left_join(parties, join_by(closest(birthday >= party)))
+#> # A tibble: 100 × 4
+#>   name       birthday       q party     
+#>   <variable> <date>     <int> <date>    
+#> 1 Lindzy     2022-08-11     3 2022-07-11
+#> 2 Santania   2022-03-01     1 2022-01-10
+#> 3 Gardell    2022-03-04     1 2022-01-10
+#> 4 Cyrille    2022-11-15     4 2022-10-03
+#> 5 Kynli      2022-07-09     2 2022-04-04
+#> 6 Sever      2022-02-03     1 2022-01-10
+#> # … with 94 more rows
+
+

There is, however, one problem with this approach: the folks with birthdays before January 10 don’t get a party:

+
+
employees |> 
+  anti_join(parties, join_by(closest(birthday >= party)))
+#> # A tibble: 4 × 2
+#>   name       birthday  
+#>   <variable> <date>    
+#> 1 Janeida    2022-01-04
+#> 2 Aires      2022-01-07
+#> 3 Mikalya    2022-01-06
+#> 4 Carlynn    2022-01-08
+
+

To resolve that issue we’ll need to tackle the problem a different way, with overlap joins.

+
+ +
+

+Overlap joins

+

Overlap joins provide three helpers that use inequality joins to make it easier to work with intervals:

+
  • +between(x, y_lower, y_upper) is short for x >= y_lower, x <= y_upper (see the sketch after this list).
  • +
  • +within(x_lower, x_upper, y_lower, y_upper) is short for x_lower >= y_lower, x_upper <= y_upper.
  • +
  • +overlaps(x_lower, x_upper, y_lower, y_upper) is short for x_lower <= y_upper, x_upper >= y_lower.
  • +
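For example, the between() helper used below is just shorthand for a pair of inequality conditions (a sketch of the equivalence):

# These two join specifications are equivalent:
join_by(between(birthday, start, end))
join_by(birthday >= start, birthday <= end)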

Let’s continue the birthday example to see how you might use them. There’s one problem with the strategy we used above: there’s no party preceding the birthdays Jan 1-9. So it might be better to be explicit about the date ranges that each party spans, and make a special case for those early birthdays:

+
+
parties <- tibble(
+  q = 1:4,
+  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
+  start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
+  end = lubridate::ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
+)
+parties
+#> # A tibble: 4 × 4
+#>       q party      start      end       
+#>   <int> <date>     <date>     <date>    
+#> 1     1 2022-01-10 2022-01-01 2022-04-03
+#> 2     2 2022-04-04 2022-04-04 2022-07-11
+#> 3     3 2022-07-11 2022-07-11 2022-10-02
+#> 4     4 2022-10-03 2022-10-03 2022-12-31
+
+

Hadley is hopelessly bad at data entry so he also wanted to check that the party periods don’t overlap. One way to do this is by using a self-join to check if any start-end interval overlaps with another:

+
+
parties |> 
+  inner_join(parties, join_by(overlaps(start, end, start, end), q < q)) |> 
+  select(start.x, end.x, start.y, end.y)
+#> # A tibble: 1 × 4
+#>   start.x    end.x      start.y    end.y     
+#>   <date>     <date>     <date>     <date>    
+#> 1 2022-04-04 2022-07-11 2022-07-11 2022-10-02
+
+

Oops, there is an overlap, so let’s fix that problem and continue:

+
+
parties <- tibble(
+  q = 1:4,
+  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
+  start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
+  end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
+)
+
+

Now we can match each employee to their party. This is a good place to use unmatched = "error" because we want to quickly find out if any employees didn’t get assigned a party.

+
+
employees |> 
+  inner_join(parties, join_by(between(birthday, start, end)), unmatched = "error")
+#> # A tibble: 100 × 6
+#>   name       birthday       q party      start      end       
+#>   <variable> <date>     <int> <date>     <date>     <date>    
+#> 1 Lindzy     2022-08-11     3 2022-07-11 2022-07-11 2022-10-02
+#> 2 Santania   2022-03-01     1 2022-01-10 2022-01-01 2022-04-03
+#> 3 Gardell    2022-03-04     1 2022-01-10 2022-01-01 2022-04-03
+#> 4 Cyrille    2022-11-15     4 2022-10-03 2022-10-03 2022-12-31
+#> 5 Kynli      2022-07-09     2 2022-04-04 2022-04-04 2022-07-10
+#> 6 Sever      2022-02-03     1 2022-01-10 2022-01-01 2022-04-03
+#> # … with 94 more rows
+
+
+ +
+

+Exercises

+
  1. +

    Can you explain what’s happening with the keys in this equi-join? Why are they different?

    +
    +
    x |> full_join(y, by = "key")
    +#> # A tibble: 4 × 3
    +#>     key val_x val_y
    +#>   <dbl> <chr> <chr>
    +#> 1     1 x1    y1   
    +#> 2     2 x2    y2   
    +#> 3     3 x3    <NA> 
    +#> 4     4 <NA>  y3
    +
    +x |> full_join(y, by = "key", keep = TRUE)
    +#> # A tibble: 4 × 4
    +#>   key.x val_x key.y val_y
    +#>   <dbl> <chr> <dbl> <chr>
    +#> 1     1 x1        1 y1   
    +#> 2     2 x2        2 y2   
    +#> 3     3 x3       NA <NA> 
    +#> 4    NA <NA>      4 y3
    +
    +
  2. +
  3. When checking whether any party period overlapped with another party period, we used q < q in the #chp-https://dplyr.tidyverse.org/reference/join_by call. Why? What happens if you remove this inequality?

  4. +
+
+ +
+

+Summary

+

In this chapter, you’ve learned how to use mutating and filtering joins to combine data from a pair of data frames. Along the way you learned how to identify keys, and the difference between primary and foreign keys. You also understand how joins work and how to figure out how many rows the output will have. Finally, you’ve gained a glimpse into the power of non-equi-joins and seen a few interesting use cases.

+

This chapter concludes the “Transform” part of the book where the focus was on the tools you could use with individual columns and tibbles. You learned about dplyr and base functions for working with logical vectors, numbers, and complete tables, stringr functions for working with strings, lubridate functions for working with date-times, and forcats functions for working with factors.

+

In the next part of the book, you’ll learn more about getting various types of data into R in a tidy form.

+ + +
+
diff --git a/oreilly/logicals.html b/oreilly/logicals.html new file mode 100644 index 0000000..433ee1d --- /dev/null +++ b/oreilly/logicals.html @@ -0,0 +1,633 @@ +
+

Logical vectors

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

In this chapter, you’ll learn tools for working with logical vectors. Logical vectors are the simplest type of vector because each element can only be one of three possible values: TRUE, FALSE, and NA. It’s relatively rare to find logical vectors in your raw data, but you’ll create and manipulate them in the course of almost every analysis.

+

We’ll begin by discussing the most common way of creating logical vectors: with numeric comparisons. Then you’ll learn about how you can use Boolean algebra to combine different logical vectors, as well as some useful summaries. We’ll finish off with #chp-https://dplyr.tidyverse.org/reference/if_else and #chp-https://dplyr.tidyverse.org/reference/case_when, two useful functions for making conditional changes powered by logical vectors.

+ +
+

+Prerequisites

+

Most of the functions you’ll learn about in this chapter are provided by base R, so we don’t need the tidyverse, but we’ll still load it so we can use #chp-https://dplyr.tidyverse.org/reference/mutate, #chp-https://dplyr.tidyverse.org/reference/filter, and friends to work with data frames. We’ll also continue to draw examples from the nycflights13 dataset.

+
+
library(tidyverse)
+library(nycflights13)
+
+

However, as we start to cover more tools, there won’t always be a perfect real example. So we’ll start making up some dummy data with c():

+
+
x <- c(1, 2, 3, 5, 7, 11, 13)
+x * 2
+#> [1]  2  4  6 10 14 22 26
+
+

This makes it easier to explain individual functions at the cost of making it harder to see how it might apply to your data problems. Just remember that any manipulation we do to a free-floating vector, you can do to a variable inside a data frame with mutate() and friends.

+
+
df <- tibble(x)
+df |> 
+  mutate(y = x *  2)
+#> # A tibble: 7 × 2
+#>       x     y
+#>   <dbl> <dbl>
+#> 1     1     2
+#> 2     2     4
+#> 3     3     6
+#> 4     5    10
+#> 5     7    14
+#> 6    11    22
+#> # … with 1 more row
+
+
+
+ +
+

+Comparisons

+

A very common way to create a logical vector is via a numeric comparison with <, <=, >, >=, !=, and ==. So far, we’ve mostly created logical variables transiently within filter(): they are computed, used, and then thrown away. For example, the following filter finds all daytime departures that leave roughly on time:

+
+
flights |> 
+  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
+#> # A tibble: 172,286 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      601         600       1     844     850      -6 B6     
+#> 2  2013     1     1      602         610      -8     812     820      -8 DL     
+#> 3  2013     1     1      602         605      -3     821     805      16 MQ     
+#> 4  2013     1     1      606         610      -4     858     910     -12 AA     
+#> 5  2013     1     1      606         610      -4     837     845      -8 DL     
+#> 6  2013     1     1      607         607       0     858     915     -17 UA     
+#> # … with 172,280 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

It’s useful to know that this is a shortcut and you can explicitly create the underlying logical variables with mutate():

+
+
flights |> 
+  mutate(
+    daytime = dep_time > 600 & dep_time < 2000,
+    approx_ontime = abs(arr_delay) < 20,
+    .keep = "used"
+  )
+#> # A tibble: 336,776 × 4
+#>   dep_time arr_delay daytime approx_ontime
+#>      <int>     <dbl> <lgl>   <lgl>        
+#> 1      517        11 FALSE   TRUE         
+#> 2      533        20 FALSE   FALSE        
+#> 3      542        33 FALSE   FALSE        
+#> 4      544       -18 FALSE   TRUE         
+#> 5      554       -25 FALSE   FALSE        
+#> 6      554        12 FALSE   TRUE         
+#> # … with 336,770 more rows
+
+

This is particularly useful for more complicated logic because naming the intermediate steps makes it easier to both read your code and check that each step has been computed correctly.

+

All up, the initial filter is equivalent to:

+
+
flights |> 
+  mutate(
+    daytime = dep_time > 600 & dep_time < 2000,
+    approx_ontime = abs(arr_delay) < 20,
+  ) |> 
+  filter(daytime & approx_ontime)
+
+ +
+

+Floating point comparison

+

Beware of using == with numbers. For example, it looks like this vector contains the numbers 1 and 2:

+
+
x <- c(1 / 49 * 49, sqrt(2) ^ 2)
+x
+#> [1] 1 2
+
+

But if you test them for equality, you get FALSE:

+
+
x == c(1, 2)
+#> [1] FALSE FALSE
+
+

What’s going on? Computers store numbers with a fixed number of decimal places so there’s no way to exactly represent 1/49 or sqrt(2), and subsequent computations will be very slightly off. We can see the exact values by calling print() with the digits argument (R normally calls print() for you, i.e. x is a shortcut for print(x), but calling it explicitly is useful if you want to provide other arguments):

+
+
print(x, digits = 16)
+#> [1] 0.9999999999999999 2.0000000000000004
+
+

You can see why R defaults to rounding these numbers; they really are very close to what you expect.

+

Now that you’ve seen why == is failing, what can you do about it? One option is to use dplyr::near(), which ignores small differences:

+
+
near(x, c(1, 2))
+#> [1] TRUE TRUE
+
+
+ +
+

+Missing values

+

Missing values represent the unknown so they are “contagious”: almost any operation involving an unknown value will also be unknown:

+
+
NA > 5
+#> [1] NA
+10 == NA
+#> [1] NA
+
+

The most confusing result is this one:

+
+
NA == NA
+#> [1] NA
+
+

It’s easiest to understand why this is true if we artificially supply a little more context:

+
+
# Let x be Mary's age. We don't know how old she is.
+x <- NA
+
+# Let y be John's age. We don't know how old he is.
+y <- NA
+
+# Are John and Mary the same age?
+x == y
+#> [1] NA
+# We don't know!
+
+

So if you want to find all flights where dep_time is missing, the following code doesn’t work because dep_time == NA will yield NA for every single row, and filter() automatically drops missing values:

+
+
flights |> 
+  filter(dep_time == NA)
+#> # A tibble: 0 × 19
+#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
+#> #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
+#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
+#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
+#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
+
+

Instead we’ll need a new tool: is.na().

+
+ +
+

+is.na() +

+

is.na(x) works with any type of vector and returns TRUE for missing values and FALSE for everything else:

+
+
is.na(c(TRUE, NA, FALSE))
+#> [1] FALSE  TRUE FALSE
+is.na(c(1, NA, 3))
+#> [1] FALSE  TRUE FALSE
+is.na(c("a", NA, "b"))
+#> [1] FALSE  TRUE FALSE
+
+

We can use is.na() to find all the rows with a missing dep_time:

+
+
flights |> 
+  filter(is.na(dep_time))
+#> # A tibble: 8,255 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1       NA        1630      NA      NA    1815      NA EV     
+#> 2  2013     1     1       NA        1935      NA      NA    2240      NA AA     
+#> 3  2013     1     1       NA        1500      NA      NA    1825      NA AA     
+#> 4  2013     1     1       NA         600      NA      NA     901      NA B6     
+#> 5  2013     1     2       NA        1540      NA      NA    1747      NA EV     
+#> 6  2013     1     2       NA        1620      NA      NA    1746      NA EV     
+#> # … with 8,249 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

is.na() can also be useful in arrange(). arrange() usually puts all the missing values at the end but you can override this default by first sorting by is.na():

+
+
flights |> 
+  filter(month == 1, day == 1) |> 
+  arrange(dep_time)
+#> # A tibble: 842 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+flights |> 
+  filter(month == 1, day == 1) |> 
+  arrange(desc(is.na(dep_time)), dep_time)
+#> # A tibble: 842 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1       NA        1630      NA      NA    1815      NA EV     
+#> 2  2013     1     1       NA        1935      NA      NA    2240      NA AA     
+#> 3  2013     1     1       NA        1500      NA      NA    1825      NA AA     
+#> 4  2013     1     1       NA         600      NA      NA     901      NA B6     
+#> 5  2013     1     1      517         515       2     830     819      11 UA     
+#> 6  2013     1     1      533         529       4     850     830      20 UA     
+#> # … with 836 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

We’ll come back to cover missing values in more depth in #chp-missing-values.

+
+ +
+

+Exercises

+
  1. How does dplyr::near() work? Type near to see the source code.
  2. +
  3. Use mutate(), is.na(), and count() together to describe how the missing values in dep_time, sched_dep_time and dep_delay are connected.
  4. +
+
+ +
+

+Boolean algebra

+

Once you have multiple logical vectors, you can combine them together using Boolean algebra. In R, & is “and”, | is “or”, ! is “not”, and xor() is exclusive or (that is, xor(x, y) is true if x is true or y is true, but not both; this is how we usually use “or” in English, since “both” is not usually an acceptable answer to the question “would you like ice cream or cake?”). #fig-bool-ops shows the complete set of Boolean operations and how they work.

+
+
+ +

Six Venn diagrams, each explaining a given logical operator. The circles (sets) in each of the Venn diagrams represent x and y. 1. y & !x is y but none of x; x & y is the intersection of x and y; x & !y is x but none of y; x is all of x none of y; xor(x, y) is everything except the intersection of x and y; y is all of y and none of x; and x | y is everything.

+
Figure 12.1: The complete set of Boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded regions show which parts each operator selects.
+
+
+
+

As well as & and |, R also has && and ||. Don’t use them in dplyr functions! These are called short-circuiting operators and only ever return a single TRUE or FALSE. They’re important for programming, not data science.
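
For example, here’s a minimal sketch of the difference (note that in recent versions of R, giving && or || a vector longer than one element is an error):

+
+
# | works element-wise, returning one result per element
+c(TRUE, FALSE) | c(FALSE, FALSE)
+#> [1]  TRUE FALSE
+
+# || looks only at single values and returns a single TRUE or FALSE
+TRUE || FALSE
+#> [1] TRUE
+
+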

+ +
+

+Missing values

+

The rules for missing values in Boolean algebra are a little tricky to explain because they seem inconsistent at first glance:

+
+
df <- tibble(x = c(TRUE, FALSE, NA))
+
+df |> 
+  mutate(
+    and = x & NA,
+    or = x | NA
+  )
+#> # A tibble: 3 × 3
+#>   x     and   or   
+#>   <lgl> <lgl> <lgl>
+#> 1 TRUE  NA    TRUE 
+#> 2 FALSE FALSE NA   
+#> 3 NA    NA    NA
+
+

To understand what’s going on, think about NA | TRUE. A missing value in a logical vector means that the value could either be TRUE or FALSE. TRUE | TRUE and FALSE | TRUE are both TRUE, so NA | TRUE must also be TRUE. Similar reasoning applies with NA & FALSE.
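
You can check all four combinations quickly at the console:

+
+
NA | TRUE
+#> [1] TRUE
+NA & FALSE
+#> [1] FALSE
+NA | FALSE
+#> [1] NA
+NA & TRUE
+#> [1] NA
+
+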

+
+ +
+

+Order of operations

+

Note that the order of operations doesn’t work like English. The following code finds all flights that departed in November or December:

+
+
flights |> 
+   filter(month == 11 | month == 12)
+
+

You might be tempted to write it like you’d say in English: “find all flights that departed in November or December”:

+
+
flights |> 
+   filter(month == 11 | 12)
+#> # A tibble: 336,776 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      533         529       4     850     830      20 UA     
+#> 3  2013     1     1      542         540       2     923     850      33 AA     
+#> 4  2013     1     1      544         545      -1    1004    1022     -18 B6     
+#> 5  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 6  2013     1     1      554         558      -4     740     728      12 UA     
+#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

This code doesn’t error, but it also doesn’t seem to have worked. What’s going on? Here R first evaluates month == 11, creating a logical vector that we’ll call nov. It then computes nov | 12. When you use a number with a logical operator, R converts everything apart from 0 to TRUE, so this is equivalent to nov | TRUE, which will always be TRUE, so every row will be selected:

+
+
flights |> 
+  mutate(
+    nov = month == 11,
+    final = nov | 12,
+    .keep = "used"
+  )
+#> # A tibble: 336,776 × 3
+#>   month nov   final
+#>   <int> <lgl> <lgl>
+#> 1     1 FALSE TRUE 
+#> 2     1 FALSE TRUE 
+#> 3     1 FALSE TRUE 
+#> 4     1 FALSE TRUE 
+#> 5     1 FALSE TRUE 
+#> 6     1 FALSE TRUE 
+#> # … with 336,770 more rows
+
+
+ +
+

+%in% +

+

An easy way to avoid the problem of getting your ==s and |s in the right order is to use %in%. x %in% y returns a logical vector the same length as x that is TRUE whenever a value in x is anywhere in y.

+
+
1:12 %in% c(1, 5, 11)
+#>  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
+letters[1:10] %in% c("a", "e", "i", "o", "u")
+#>  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
+
+

So to find all flights in November and December we could write:

+
+
flights |> 
+  filter(month %in% c(11, 12))
+
+

Note that %in% obeys different rules for NA than ==, as NA %in% NA is TRUE.

+
+
c(1, 2, NA) == NA
+#> [1] NA NA NA
+c(1, 2, NA) %in% NA
+#> [1] FALSE FALSE  TRUE
+
+

This can make for a useful shortcut:

+
+
flights |> 
+  filter(dep_time %in% c(NA, 0800))
+#> # A tibble: 8,803 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      800         800       0    1022    1014       8 DL     
+#> 2  2013     1     1      800         810     -10     949     955      -6 MQ     
+#> 3  2013     1     1       NA        1630      NA      NA    1815      NA EV     
+#> 4  2013     1     1       NA        1935      NA      NA    2240      NA AA     
+#> 5  2013     1     1       NA        1500      NA      NA    1825      NA AA     
+#> 6  2013     1     1       NA         600      NA      NA     901      NA B6     
+#> # … with 8,797 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+
+ +
+

+Exercises

+
  1. Find all flights where arr_delay is missing but dep_delay is not. Find all flights where neither arr_time nor sched_arr_time are missing, but arr_delay is.
  2. +
  3. How many flights have a missing dep_time? What other variables are missing in these rows? What might these rows represent?
  4. +
  5. Assuming that a missing dep_time implies that a flight is cancelled, look at the number of cancelled flights per day. Is there a pattern? Is there a connection between the proportion of cancelled flights and average delay of non-cancelled flights?
  6. +
+
+ +
+

+Summaries

+

The following sections describe some useful techniques for summarizing logical vectors. As well as functions that only work specifically with logical vectors, you can also use functions that work with numeric vectors.

+ +
+

+Logical summaries

+

There are two main logical summaries: any() and all(). any(x) is the equivalent of |; it’ll return TRUE if there are any TRUEs in x. all(x) is the equivalent of &; it’ll return TRUE only if all values of x are TRUE. Like all summary functions, they’ll return NA if there are any missing values present, and as usual you can make the missing values go away with na.rm = TRUE.
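
Here’s a minimal sketch with a toy vector, including how na.rm changes the result:

+
+
y <- c(TRUE, FALSE, NA)
+any(y)
+#> [1] TRUE
+all(y)
+#> [1] FALSE
+any(c(FALSE, NA))
+#> [1] NA
+any(c(FALSE, NA), na.rm = TRUE)
+#> [1] FALSE
+
+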

+

For example, we could use all() and any() to find out if there were days where every flight was delayed, or days where at least one flight was delayed:

+
+
flights |> 
+  group_by(year, month, day) |> 
+  summarise(
+    all_delayed = all(arr_delay >= 0, na.rm = TRUE),
+    any_delayed = any(arr_delay >= 0, na.rm = TRUE),
+    .groups = "drop"
+  )
+#> # A tibble: 365 × 5
+#>    year month   day all_delayed any_delayed
+#>   <int> <int> <int> <lgl>       <lgl>      
+#> 1  2013     1     1 FALSE       TRUE       
+#> 2  2013     1     2 FALSE       TRUE       
+#> 3  2013     1     3 FALSE       TRUE       
+#> 4  2013     1     4 FALSE       TRUE       
+#> 5  2013     1     5 FALSE       TRUE       
+#> 6  2013     1     6 FALSE       TRUE       
+#> # … with 359 more rows
+
+

In most cases, however, any() and all() are a little too crude, and it would be nice to be able to get a little more detail about how many values are TRUE or FALSE. That leads us to the numeric summaries.

+
+ +
+

+Numeric summaries of logical vectors

+

When you use a logical vector in a numeric context, TRUE becomes 1 and FALSE becomes 0. This makes sum() and mean() very useful with logical vectors because sum(x) will give the number of TRUEs and mean(x) the proportion of TRUEs. That lets us see the distribution of delays across the days of the year as shown in #fig-prop-delayed-dist.
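
On a toy vector, this looks like:

+
+
z <- c(TRUE, FALSE, TRUE, TRUE)
+sum(z)
+#> [1] 3
+mean(z)
+#> [1] 0.75
+
+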

+
+
flights |> 
+  group_by(year, month, day) |> 
+  summarise(
+    prop_delayed = mean(arr_delay > 0, na.rm = TRUE),
+    .groups = "drop"
+  ) |> 
+  ggplot(aes(prop_delayed)) + 
+  geom_histogram(binwidth = 0.05)
+
+ +

The distribution is unimodal and mildly right skewed. The distribution peaks around 30% delayed flights.

+
Figure 12.2: A histogram showing the proportion of delayed flights each day.
+
+
+
+

Or we could ask how many flights left before 5am, which are often flights that were delayed from the previous day:

+
+
flights |> 
+  group_by(year, month, day) |> 
+  summarise(
+    n_early = sum(dep_time < 500, na.rm = TRUE),
+    .groups = "drop"
+  ) |> 
+  arrange(desc(n_early))
+#> # A tibble: 365 × 4
+#>    year month   day n_early
+#>   <int> <int> <int>   <int>
+#> 1  2013     6    28      32
+#> 2  2013     4    10      30
+#> 3  2013     7    28      30
+#> 4  2013     3    18      29
+#> 5  2013     7     7      29
+#> 6  2013     7    10      29
+#> # … with 359 more rows
+
+
+ +
+

+Logical subsetting

+

There’s one final use for logical vectors in summaries: you can use a logical vector to filter a single variable to a subset of interest. This makes use of the base [ (pronounced subset) operator, which you’ll learn more about in #sec-subset-many.
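
On a free-floating vector, that looks like:

+
+
x <- c(-1, 4, -2, 7)
+x[x > 0]
+#> [1] 4 7
+
+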

+

Imagine we wanted to look at the average delay just for flights that were actually delayed. One way to do so would be to first filter the flights:

+
+
flights |> 
+  filter(arr_delay > 0) |> 
+  group_by(year, month, day) |> 
+  summarise(
+    behind = mean(arr_delay),
+    n = n(),
+    .groups = "drop"
+  )
+#> # A tibble: 365 × 5
+#>    year month   day behind     n
+#>   <int> <int> <int>  <dbl> <int>
+#> 1  2013     1     1   32.5   461
+#> 2  2013     1     2   32.0   535
+#> 3  2013     1     3   27.7   460
+#> 4  2013     1     4   28.3   297
+#> 5  2013     1     5   22.6   238
+#> 6  2013     1     6   24.4   381
+#> # … with 359 more rows
+
+

This works, but what if we wanted to also compute the average delay for flights that arrived early? We’d need to perform a separate filter step, and then figure out how to combine the two data frames together (we’ll cover this in #chp-joins). Instead you could use [ to perform an inline filtering: arr_delay[arr_delay > 0] will yield only the positive arrival delays.

+

This leads to:

+
+
flights |> 
+  group_by(year, month, day) |> 
+  summarise(
+    behind = mean(arr_delay[arr_delay > 0], na.rm = TRUE),
+    ahead = mean(arr_delay[arr_delay < 0], na.rm = TRUE),
+    n = n(),
+    .groups = "drop"
+  )
+#> # A tibble: 365 × 6
+#>    year month   day behind ahead     n
+#>   <int> <int> <int>  <dbl> <dbl> <int>
+#> 1  2013     1     1   32.5 -12.5   842
+#> 2  2013     1     2   32.0 -14.3   943
+#> 3  2013     1     3   27.7 -18.2   914
+#> 4  2013     1     4   28.3 -17.0   915
+#> 5  2013     1     5   22.6 -14.0   720
+#> 6  2013     1     6   24.4 -13.6   832
+#> # … with 359 more rows
+
+

Also note the difference in the group size: in the first chunk n() gives the number of delayed flights per day; in the second, n() gives the total number of flights.

+
+ +
+

+Exercises

+
  1. What will sum(is.na(x)) tell you? How about mean(is.na(x))?
  2. +
  3. What does prod() return when applied to a logical vector? What logical summary function is it equivalent to? What does min() return when applied to a logical vector? What logical summary function is it equivalent to? Read the documentation and perform a few experiments.
  4. +
+
+ +
+

+Conditional transformations

+

One of the most powerful features of logical vectors is their use for conditional transformations, i.e. doing one thing for condition x, and something different for condition y. There are two important tools for this: if_else() and case_when().

+ +
+

+if_else() +

+

If you want to use one value when a condition is true and another value when it’s FALSE, you can use if_else(). (dplyr’s if_else() is very similar to base R’s ifelse(). There are two main advantages of if_else() over ifelse(): you can choose what should happen to missing values, and if_else() is much more likely to give you a meaningful error if your variables have incompatible types.) You’ll always use the first three arguments of if_else(). The first argument, condition, is a logical vector; the second, true, gives the output when the condition is true; and the third, false, gives the output if the condition is false.

+

Let’s begin with a simple example of labeling a numeric vector as either “+ve” or “-ve”:

+
+
x <- c(-3:3, NA)
+if_else(x > 0, "+ve", "-ve")
+#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" NA
+
+

There’s an optional fourth argument, missing, which will be used if the input is NA:

+
+
if_else(x > 0, "+ve", "-ve", "???")
+#> [1] "-ve" "-ve" "-ve" "-ve" "+ve" "+ve" "+ve" "???"
+
+

You can also use vectors for the true and false arguments. For example, this allows us to create a minimal implementation of abs():

+
+
if_else(x < 0, -x, x)
+#> [1]  3  2  1  0  1  2  3 NA
+
+

So far all the arguments have used the same vectors, but you can of course mix and match. For example, you could implement a simple version of coalesce() like this:

+
+
x1 <- c(NA, 1, 2, NA)
+y1 <- c(3, NA, 4, 6)
+if_else(is.na(x1), y1, x1)
+#> [1] 3 1 2 6
+
+

You might have noticed a small infelicity in our labeling: zero is neither positive nor negative. We could resolve this by adding an additional if_else():

+
+
if_else(x == 0, "0", if_else(x < 0, "-ve", "+ve"), "???")
+#> [1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"
+
+

This is already a little hard to read, and you can imagine it would only get harder if you have more conditions. Instead, you can switch to case_when().

+
+ +
+

+case_when() +

+

dplyr’s case_when() is inspired by SQL’s CASE statement and provides a flexible way of performing different computations for different conditions. It has a special syntax that unfortunately looks like nothing else you’ll use in the tidyverse. It takes pairs that look like condition ~ output. condition must be a logical vector; when it’s TRUE, output will be used.

+

This means we could recreate our previous nested if_else() as follows:

+
+
case_when(
+  x == 0   ~ "0",
+  x < 0    ~ "-ve", 
+  x > 0    ~ "+ve",
+  is.na(x) ~ "???"
+)
+#> [1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"
+
+

This is more code, but it’s also more explicit.

+

To explain how case_when() works, let’s explore some simpler cases. If none of the cases match, the output gets an NA:

+
+
case_when(
+  x < 0 ~ "-ve",
+  x > 0 ~ "+ve"
+)
+#> [1] "-ve" "-ve" "-ve" NA    "+ve" "+ve" "+ve" NA
+
+

If you want to create a “default”/catch-all value, use TRUE on the left hand side:

+
+
case_when(
+  x < 0 ~ "-ve",
+  x > 0 ~ "+ve",
+  TRUE ~ "???"
+)
+#> [1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"
+
+

And note that if multiple conditions match, only the first will be used:

+
+
case_when(
+  x > 0 ~ "+ve",
+  x > 3 ~ "big"
+)
+#> [1] NA    NA    NA    NA    "+ve" "+ve" "+ve" NA
+
+

Just like with if_else() you can use variables on both sides of the ~ and you can mix and match variables as needed for your problem. For example, we could use case_when() to provide some human readable labels for the arrival delay:

+
+
flights |> 
+  mutate(
+    status = case_when(
+      is.na(arr_delay)      ~ "cancelled",
+      arr_delay > 60        ~ "very late",
+      arr_delay > 15        ~ "late",
+      abs(arr_delay) <= 15  ~ "on time",
+      arr_delay < -15       ~ "early",
+      arr_delay < -30       ~ "very early",
+    ),
+    .keep = "used"
+  )
+#> # A tibble: 336,776 × 2
+#>   arr_delay status 
+#>       <dbl> <chr>  
+#> 1        11 on time
+#> 2        20 late   
+#> 3        33 late   
+#> 4       -18 early  
+#> 5       -25 early  
+#> 6        12 on time
+#> # … with 336,770 more rows
+
+
+
+ +
+

+Summary

+

The definition of a logical vector is simple because each value must be either TRUE, FALSE, or NA. But logical vectors provide a huge amount of power. In this chapter, you learned how to create logical vectors with >, <, <=, >=, ==, !=, and is.na(), how to combine them with !, &, and |, and how to summarize them with any(), all(), sum(), and mean(). You also learned the powerful if_else() and case_when() functions that allow you to return values depending on the value of a logical vector.

+

We’ll see logical vectors again and again in the following chapters. For example, in #chp-strings you’ll learn about str_detect(x, pattern), which returns a logical vector that’s TRUE for the elements of x that match the pattern, and in #chp-datetimes you’ll create logical vectors from the comparison of dates and times. But for now, we’re going to move on to the next most important type of vector: numeric vectors.

+ + +
+
diff --git a/oreilly/missing-values.html b/oreilly/missing-values.html new file mode 100644 index 0000000..47ddd3c --- /dev/null +++ b/oreilly/missing-values.html @@ -0,0 +1,342 @@ +
+

Missing values

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

You’ve already learned the basics of missing values earlier in the book. You first saw them in #sec-summarize where they interfered with computing summary statistics, and you learned about their infectious nature and how to check for their presence in #sec-na-comparison. Now we’ll come back to them in more depth, so you can learn more of the details.

+

We’ll start by discussing some general tools for working with missing values recorded as NAs. We’ll then explore the idea of implicitly missing values, values that are simply absent from your data, and show some tools you can use to make them explicit. We’ll finish off with a related discussion of empty groups, caused by factor levels that don’t appear in the data.

+ +
+

+Prerequisites

+

The functions for working with missing data mostly come from dplyr and tidyr, which are core members of the tidyverse.

+
+
library(tidyverse)
+
+
+
+ +
+

+Explicit missing values

+

To begin, let’s explore a few handy tools for creating or eliminating explicit missing values, i.e. cells where you see an NA.

+ +
+

+Last observation carried forward

+

A common use for missing values is as a data entry convenience. When data is entered by hand, missing values sometimes indicate that the value in the previous row has been repeated (or carried forward):

+
+
treatment <- tribble(
+  ~person,           ~treatment, ~response,
+  "Derrick Whitmore", 1,         7,
+  NA,                 2,         10,
+  NA,                 3,         NA,
+  "Katherine Burke",  1,         4
+)
+
+

You can fill in these missing values with tidyr::fill(). It works like select(), taking a set of columns:

+
+
treatment |>
+  fill(everything())
+#> # A tibble: 4 × 3
+#>   person           treatment response
+#>   <chr>                <dbl>    <dbl>
+#> 1 Derrick Whitmore         1        7
+#> 2 Derrick Whitmore         2       10
+#> 3 Derrick Whitmore         3       10
+#> 4 Katherine Burke          1        4
+
+

This treatment is sometimes called “last observation carried forward”, or locf for short. You can use the .direction argument to fill in missing values that have been generated in more exotic ways.
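
For example, here’s a sketch of filling upwards instead of the default downwards direction. Notice that filling person upwards gives the wrong answer for this particular data, which is exactly why the direction matters:

+
+
treatment |>
+  fill(everything(), .direction = "up")
+#> # A tibble: 4 × 3
+#>   person           treatment response
+#>   <chr>                <dbl>    <dbl>
+#> 1 Derrick Whitmore         1        7
+#> 2 Katherine Burke          2       10
+#> 3 Katherine Burke          3        4
+#> 4 Katherine Burke          1        4
+
+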

+
+ +
+

+Fixed values

+

Sometimes missing values represent some fixed and known value, most commonly 0. You can use coalesce() to replace them:

+
+
x <- c(1, 4, 5, 7, NA)
+coalesce(x, 0)
+#> [1] 1 4 5 7 0
+
+

Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.

+

If possible, handle this when reading in the data, for example, by using the na argument to read_csv(). If you discover the problem later, or your data source doesn’t provide a way to handle it on read, you can use na_if():

+
+
x <- c(1, 4, 5, 7, -99)
+na_if(x, -99)
+#> [1]  1  4  5  7 NA
+
+
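
If you can instead handle it at read time, a sketch with readr’s na argument looks like this (I() lets us supply literal data inline, purely for illustration):

+
+
read_csv(I("x\n1\n-99\n3"), na = c("", "NA", "-99"), show_col_types = FALSE)
+#> # A tibble: 3 × 1
+#>       x
+#>   <dbl>
+#> 1     1
+#> 2    NA
+#> 3     3
+
+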
+ +
+

+NaN

+

Before we continue, there’s one special type of missing value that you’ll encounter from time to time: a NaN (pronounced “nan”), or not a number. It’s not that important to know about because it generally behaves just like NA:

+
+
x <- c(NA, NaN)
+x * 10
+#> [1]  NA NaN
+x == 1
+#> [1] NA NA
+is.na(x)
+#> [1] TRUE TRUE
+
+

In the rare case you need to distinguish an NA from a NaN, you can use is.nan(x).
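
For example, with the x defined above:

+
+
is.nan(x)
+#> [1] FALSE  TRUE
+
+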

+

You’ll generally encounter a NaN when you perform a mathematical operation that has an indeterminate result:

+
+
0 / 0 
+#> [1] NaN
+0 * Inf
+#> [1] NaN
+Inf - Inf
+#> [1] NaN
+sqrt(-1)
+#> Warning in sqrt(-1): NaNs produced
+#> [1] NaN
+
+
+
+ +
+

+Implicit missing values

+

So far we’ve talked about missing values that are explicitly missing, i.e. you can see an NA in your data. But missing values can also be implicitly missing, if an entire row of data is simply absent from the data. Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:

+
+
stocks <- tibble(
+  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
+  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
+  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
+)
+
+

This dataset has two missing observations:

+
  • The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.

  • +
  • The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

  • +

One way to think about the difference is with this Zen-like koan:

+
+

An explicit missing value is the presence of an absence.

+

An implicit missing value is the absence of a presence.

+
+

Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.

+ +
+

+Pivoting

+

You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value. For example, if we pivot stocks to put the quarter in the columns, both missing values become explicit:

+
+
stocks |>
+  pivot_wider(
+    names_from = qtr, 
+    values_from = price
+  )
+#> # A tibble: 2 × 5
+#>    year   `1`   `2`   `3`   `4`
+#>   <dbl> <dbl> <dbl> <dbl> <dbl>
+#> 1  2020  1.88  0.59  0.35 NA   
+#> 2  2021 NA     0.92  0.17  2.66
+
+

By default, making data longer preserves explicit missing values, but if they are structurally missing values that only exist because the data is not tidy, you can drop them (make them implicit) by setting values_drop_na = TRUE. See the examples in #sec-tidy-data for more details.
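
Here’s a sketch of the round trip: widening makes both missings explicit, and lengthening with values_drop_na = TRUE then makes them both implicit again:

+
+
stocks |>
+  pivot_wider(names_from = qtr, values_from = price) |>
+  pivot_longer(
+    cols = !year,
+    names_to = "qtr",
+    values_to = "price",
+    values_drop_na = TRUE
+  )
+#> # A tibble: 6 × 3
+#>    year qtr   price
+#>   <dbl> <chr> <dbl>
+#> 1  2020 1      1.88
+#> 2  2020 2      0.59
+#> 3  2020 3      0.35
+#> 4  2021 2      0.92
+#> 5  2021 3      0.17
+#> 6  2021 4      2.66
+
+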

+
+ +
+

+Complete

+

complete() allows you to generate explicit missing values by providing a set of variables that define the combination of rows that should exist. For example, we know that all combinations of year and qtr should exist in the stocks data:

+
+
stocks |>
+  complete(year, qtr)
+#> # A tibble: 8 × 3
+#>    year   qtr price
+#>   <dbl> <dbl> <dbl>
+#> 1  2020     1  1.88
+#> 2  2020     2  0.59
+#> 3  2020     3  0.35
+#> 4  2020     4 NA   
+#> 5  2021     1 NA   
+#> 6  2021     2  0.92
+#> # … with 2 more rows
+
+

Typically, you’ll call complete() with names of existing variables, filling in the missing combinations. However, sometimes the individual variables are themselves incomplete, so you can instead provide your own data. For example, you might know that the stocks dataset is supposed to run from 2019 to 2021, so you could explicitly supply those values for year:

+
+
stocks |>
+  complete(year = 2019:2021, qtr)
+#> # A tibble: 12 × 3
+#>    year   qtr price
+#>   <dbl> <dbl> <dbl>
+#> 1  2019     1 NA   
+#> 2  2019     2 NA   
+#> 3  2019     3 NA   
+#> 4  2019     4 NA   
+#> 5  2020     1  1.88
+#> 6  2020     2  0.59
+#> # … with 6 more rows
+
+

If the range of a variable is correct, but not all values are present, you could use full_seq(x, 1) to generate all values from min(x) to max(x) spaced out by 1.
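
For example:

+
+
full_seq(c(1, 2, 4, 7), 1)
+#> [1] 1 2 3 4 5 6 7
+
+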

+

In some cases, the complete set of observations can’t be generated by a simple combination of variables. In that case, you can do manually what complete() does for you: create a data frame that contains all the rows that should exist (using whatever combination of techniques you need), then combine it with your original dataset with a mutating join, as the sketch below shows.
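
Here’s a minimal sketch of that approach for stocks, assuming we want every year-quarter combination from 2019 to 2021 (all_combos is our own helper name):

+
+
all_combos <- expand_grid(
+  year = c(2019, 2020, 2021),
+  qtr = c(1, 2, 3, 4)
+)
+
+all_combos |>
+  full_join(stocks, by = c("year", "qtr"))
+#> # A tibble: 12 × 3
+#>    year   qtr price
+#>   <dbl> <dbl> <dbl>
+#> 1  2019     1 NA
+#> 2  2019     2 NA
+#> 3  2019     3 NA
+#> 4  2019     4 NA
+#> 5  2020     1  1.88
+#> 6  2020     2  0.59
+#> # … with 6 more rows
+
+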

+
+ +
+

+Joins

+

This brings us to another important way of revealing implicitly missing observations: joins. You’ll learn more about joins in #chp-joins, but we wanted to quickly mention them to you here since you can often only know that values are missing from one dataset when you compare it to another.

+

dplyr::anti_join(x, y) is a particularly useful tool here because it selects only the rows in x that don’t have a match in y. For example, we can use two anti-joins to reveal that we’re missing information for four airports and 722 planes mentioned in flights:

+
+
library(nycflights13)
+
+flights |> 
+  distinct(faa = dest) |> 
+  anti_join(airports)
+#> Joining with `by = join_by(faa)`
+#> # A tibble: 4 × 1
+#>   faa  
+#>   <chr>
+#> 1 BQN  
+#> 2 SJU  
+#> 3 STT  
+#> 4 PSE
+
+flights |> 
+  distinct(tailnum) |> 
+  anti_join(planes)
+#> Joining with `by = join_by(tailnum)`
+#> # A tibble: 722 × 1
+#>   tailnum
+#>   <chr>  
+#> 1 N3ALAA 
+#> 2 N3DUAA 
+#> 3 N542MQ 
+#> 4 N730MQ 
+#> 5 N9EAMQ 
+#> 6 N532UA 
+#> # … with 716 more rows
+
+
+ +
+

+Exercises

+
  1. Can you find any relationship between the carrier and the rows that appear to be missing from planes?
  2. +
+
+ +
+

+Factors and empty groups

+

A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with factors. For example, imagine we have a dataset that contains some health information about people:

+
+
health <- tibble(
+  name   = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
+  smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
+  age    = c(34L, 88L, 75L, 47L, 56L),
+)
+
+

And we want to count the number of smokers with count():

+
+
health |> count(smoker)
+#> # A tibble: 1 × 2
+#>   smoker     n
+#>   <fct>  <int>
+#> 1 no         5
+
+

This dataset only contains non-smokers, but we know that smokers exist; the group of smokers is empty. We can request that count() keep all the groups, even those not seen in the data, by using .drop = FALSE:

+
+
health |> count(smoker, .drop = FALSE)
+#> # A tibble: 2 × 2
+#>   smoker     n
+#>   <fct>  <int>
+#> 1 yes        0
+#> 2 no         5
+
+

The same principle applies to ggplot2’s discrete axes, which will also drop levels that don’t have any values. You can force them to display by supplying drop = FALSE to the appropriate discrete axis:

+
+
ggplot(health, aes(smoker)) +
+  geom_bar() +
+  scale_x_discrete()
+
+ggplot(health, aes(smoker)) +
+  geom_bar() +
+  scale_x_discrete(drop = FALSE)
+
+
+
+

A bar chart with a single value on the x-axis, "no". The same bar chart as the last plot, but now with two values on the x-axis, "yes" and "no". There is no bar for the "yes" category.

+
+
+


+
+
+
+
+

The same problem comes up more generally with group_by(). And again you can use .drop = FALSE to preserve all factor levels:

+
+
health |> 
+  group_by(smoker, .drop = FALSE) |> 
+  summarise(
+    n = n(),
+    mean_age = mean(age),
+    min_age = min(age),
+    max_age = max(age),
+    sd_age = sd(age)
+  )
+#> Warning: There were 2 warnings in `summarise()`.
+#> The first warning was:
+#> ℹ In argument `min_age = min(age)`.
+#> ℹ In group 1: `smoker = yes`.
+#> Caused by warning in `min()`:
+#> ! no non-missing arguments to min; returning Inf
+#> ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
+#> # A tibble: 2 × 6
+#>   smoker     n mean_age min_age max_age sd_age
+#>   <fct>  <int>    <dbl>   <dbl>   <dbl>  <dbl>
+#> 1 yes        0      NaN     Inf    -Inf   NA  
+#> 2 no         5       60      34      88   21.6
+
+

We get some interesting results here because when summarizing an empty group, the summary functions are applied to zero-length vectors. There’s an important distinction between empty vectors, which have length 0, and missing values, each of which has length 1.

+
+
# A vector containing two missing values
+x1 <- c(NA, NA)
+length(x1)
+#> [1] 2
+
+# A vector containing nothing
+x2 <- numeric()
+length(x2)
+#> [1] 0
+
+

All summary functions work with zero-length vectors, but they may return results that are surprising at first glance. Here we see mean(age) returning NaN because mean(age) = sum(age)/length(age), which here is 0/0. min() and max() return Inf and -Inf respectively for empty vectors, so if you combine the results with a non-empty vector of new data and recompute, you’ll get the minimum or maximum of the new data (in other words, min(c(x, y)) is always equal to min(min(x), min(y))).
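
You can verify this at the console:

+
+
min(numeric())
+#> Warning in min(numeric()): no non-missing arguments to min; returning Inf
+#> [1] Inf
+max(numeric())
+#> Warning in max(numeric()): no non-missing arguments to max; returning -Inf
+#> [1] -Inf
+
+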

+

Sometimes a simpler approach is to perform the summary and then make the implicit missings explicit with complete():

+
+
health |> 
+  group_by(smoker) |> 
+  summarise(
+    n = n(),
+    mean_age = mean(age),
+    min_age = min(age),
+    max_age = max(age),
+    sd_age = sd(age)
+  ) |> 
+  complete(smoker)
+#> # A tibble: 2 × 6
+#>   smoker     n mean_age min_age max_age sd_age
+#>   <fct>  <int>    <dbl>   <int>   <int>  <dbl>
+#> 1 yes       NA       NA      NA      NA   NA  
+#> 2 no         5       60      34      88   21.6
+
+

The main drawback of this approach is that you get an NA for the count, even though you know that it should be zero.

+
+ +
+

+Summary

+

Missing values are weird! Sometimes they’re recorded as an explicit NA, but other times you only notice them by their absence. This chapter has given you some tools for working with explicit missing values, tools for uncovering implicit missing values, and discussed some of the ways that implicit missing values can become explicit and vice versa.

+

In the next chapter, we tackle joins, the final chapter in this part of the book. This is a bit of a change from the chapters so far because we’re going to discuss tools that work with data frames as a whole, not something that you put inside a data frame.

+ + +
+
diff --git a/oreilly/numbers.html b/oreilly/numbers.html new file mode 100644 index 0000000..79cb344 --- /dev/null +++ b/oreilly/numbers.html @@ -0,0 +1,838 @@ +
+

Numbers

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

Numeric vectors are the backbone of data science, and you’ve already used them a bunch of times earlier in the book. Now it’s time to systematically survey what you can do with them in R, ensuring that you’re well situated to tackle any future problem involving numeric vectors.

+

We’ll start by giving you a couple of tools to make numbers if you have strings, and then go into a little more detail on count(). Then we’ll dive into various numeric transformations that pair well with mutate(), including more general transformations that can be applied to other types of vectors but are often used with numeric vectors. We’ll finish off by covering the summary functions that pair well with summarise() and show you how they can also be used with mutate().

+ +
+

+Prerequisites

+

This chapter mostly uses functions from base R, which are available without loading any packages. But we still need the tidyverse because we’ll use these base R functions inside of tidyverse functions like mutate() and filter(). Like in the last chapter, we’ll use real examples from nycflights13, as well as toy examples made with c() and tribble().

+
+
library(tidyverse)
+library(nycflights13)
+
+
+
+ +
+

+Making numbers

+

In most cases, you’ll get numbers already recorded in one of R’s numeric types: integer or double. In some cases, however, you’ll encounter them as strings, possibly because you’ve created them by pivoting from column headers or something has gone wrong in your data import process.

+

readr provides two useful functions for parsing strings into numbers: parse_double() and parse_number(). Use parse_double() when you have numbers that have been written as strings:

+
+
x <- c("1.2", "5.6", "1e3")
+parse_double(x)
+#> [1]    1.2    5.6 1000.0
+
+

Use parse_number() when the string contains non-numeric text that you want to ignore. This is particularly useful for currency data and percentages:

+
+
x <- c("$1,234", "USD 3,513", "59%")
+parse_number(x)
+#> [1] 1234 3513   59
+
+
+ +
+

+Counts

+

It’s surprising how much data science you can do with just counts and a little basic arithmetic, so dplyr strives to make counting as easy as possible with count(). This function is great for quick exploration and checks during analysis:

+
+
flights |> count(dest)
+#> # A tibble: 105 × 2
+#>   dest      n
+#>   <chr> <int>
+#> 1 ABQ     254
+#> 2 ACK     265
+#> 3 ALB     439
+#> 4 ANC       8
+#> 5 ATL   17215
+#> 6 AUS    2439
+#> # … with 99 more rows
+
+

(Despite the advice in #chp-workflow-style, we usually put count() on a single line because it’s usually used at the console for a quick check that a calculation is working as expected.)

+

If you want to see the most common values, add sort = TRUE:

+
+
flights |> count(dest, sort = TRUE)
+#> # A tibble: 105 × 2
+#>   dest      n
+#>   <chr> <int>
+#> 1 ORD   17283
+#> 2 ATL   17215
+#> 3 LAX   16174
+#> 4 BOS   15508
+#> 5 MCO   14082
+#> 6 CLT   14064
+#> # … with 99 more rows
+
+

And remember that if you want to see all the values, you can use |> View() or |> print(n = Inf).

+

You can perform the same computation “by hand” with group_by(), summarise() and n(). This is useful because it allows you to compute other summaries at the same time:

+
+
flights |> 
+  group_by(dest) |> 
+  summarise(
+    n = n(),
+    delay = mean(arr_delay, na.rm = TRUE)
+  )
+#> # A tibble: 105 × 3
+#>   dest      n delay
+#>   <chr> <int> <dbl>
+#> 1 ABQ     254  4.38
+#> 2 ACK     265  4.85
+#> 3 ALB     439 14.4 
+#> 4 ANC       8 -2.5 
+#> 5 ATL   17215 11.3 
+#> 6 AUS    2439  6.02
+#> # … with 99 more rows
+
+

n() is a special summary function that doesn’t take any arguments and instead accesses information about the “current” group. This means that it only works inside dplyr verbs:

+
+
n()
+#> Error in `n()`:
+#> ! Must only be used inside data-masking verbs like `mutate()`,
+#>   `filter()`, and `group_by()`.
+
+

There are a couple of variants of n() that you might find useful:

+
  • +

    n_distinct(x) counts the number of distinct (unique) values of one or more variables. For example, we could figure out which destinations are served by the most carriers:

    +
    +
    flights |> 
    +  group_by(dest) |> 
    +  summarise(
    +    carriers = n_distinct(carrier)
    +  ) |> 
    +  arrange(desc(carriers))
    +#> # A tibble: 105 × 2
    +#>   dest  carriers
    +#>   <chr>    <int>
    +#> 1 ATL          7
    +#> 2 BOS          7
    +#> 3 CLT          7
    +#> 4 ORD          7
    +#> 5 TPA          7
    +#> 6 AUS          6
    +#> # … with 99 more rows
    +
    +
  • +
  • +

    A weighted count is a sum. For example, you could “count” the number of miles each plane flew:

    +
    +
    flights |> 
    +  group_by(tailnum) |> 
    +  summarise(miles = sum(distance))
    +#> # A tibble: 4,044 × 2
    +#>   tailnum  miles
    +#>   <chr>    <dbl>
    +#> 1 D942DN    3418
    +#> 2 N0EGMQ  250866
    +#> 3 N10156  115966
    +#> 4 N102UW   25722
    +#> 5 N103US   24619
    +#> 6 N104UW   25157
    +#> # … with 4,038 more rows
    +
    +

    Weighted counts are a common problem so count() has a wt argument that does the same thing:

    +
    +
    flights |> count(tailnum, wt = distance)
    +#> # A tibble: 4,044 × 2
    +#>   tailnum      n
    +#>   <chr>    <dbl>
    +#> 1 D942DN    3418
    +#> 2 N0EGMQ  250866
    +#> 3 N10156  115966
    +#> 4 N102UW   25722
    +#> 5 N103US   24619
    +#> 6 N104UW   25157
    +#> # … with 4,038 more rows
    +
    +
  • +
  • +

    You can count missing values by combining sum() and is.na(). In the flights dataset this represents flights that are cancelled:

    +
    +
    flights |> 
    +  group_by(dest) |> 
    +  summarise(n_cancelled = sum(is.na(dep_time))) 
    +#> # A tibble: 105 × 2
    +#>   dest  n_cancelled
    +#>   <chr>       <int>
    +#> 1 ABQ             0
    +#> 2 ACK             0
    +#> 3 ALB            20
    +#> 4 ANC             0
    +#> 5 ATL           317
    +#> 6 AUS            21
    +#> # … with 99 more rows
    +
    +
  • +
+
+

+Exercises

+
  1. How can you use count() to count the number of rows with a missing value for a given variable?
  2. +
  3. Expand the following calls to count() to instead use group_by(), summarise(), and arrange():
    1. flights |> count(dest, sort = TRUE)

    2. +
    3. flights |> count(tailnum, wt = distance)

    4. +
  4. +
+
+ +
+

+Numeric transformations

+

Transformation functions work well with mutate() because their output is the same length as the input. The vast majority of transformation functions are already built into base R. It’s impractical to list them all so this section will show the most useful ones. As an example, while R provides all the trigonometric functions that you might dream of, we don’t list them here because they’re rarely needed for data science.

+ +
+

+Arithmetic and recycling rules

+

We introduced the basics of arithmetic (+, -, *, /, ^) in #chp-workflow-basics and have used them a bunch since. These functions don’t need a huge amount of explanation because they do what you learned in grade school. But we need to briefly talk about the recycling rules which determine what happens when the left and right hand sides have different lengths. This is important for operations like flights |> mutate(air_time = air_time / 60) because there are 336,776 numbers on the left of / but only one on the right.

+

R handles mismatched lengths by recycling, or repeating, the short vector. We can see this in operation more easily if we create some vectors outside of a data frame:

+
+
x <- c(1, 2, 10, 20)
+x / 5
+#> [1] 0.2 0.4 2.0 4.0
+# is shorthand for
+x / c(5, 5, 5, 5)
+#> [1] 0.2 0.4 2.0 4.0
+
+

Generally, you only want to recycle single numbers (i.e. vectors of length 1), but R will recycle any shorter length vector. It usually (but not always) gives you a warning if the longer vector isn’t a multiple of the shorter:

+
+
x * c(1, 2)
+#> [1]  1  4 10 40
+x * c(1, 2, 3)
+#> Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
+#> object length
+#> [1]  1  4 30 20
+
+

These recycling rules are also applied to logical comparisons (==, <, <=, >, >=, !=) and can lead to a surprising result if you accidentally use == instead of %in% and the data frame has an unfortunate number of rows. For example, take this code which attempts to find all flights in January and February:

+
+
flights |> 
+  filter(month == c(1, 2))
+#> # A tibble: 25,977 × 19
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1      542         540       2     923     850      33 AA     
+#> 3  2013     1     1      554         600      -6     812     837     -25 DL     
+#> 4  2013     1     1      555         600      -5     913     854      19 B6     
+#> 5  2013     1     1      557         600      -3     838     846      -8 B6     
+#> 6  2013     1     1      558         600      -2     849     851      -2 B6     
+#> # … with 25,971 more rows, 9 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+

The code runs without error, but it doesn’t return what you want. Because of the recycling rules it finds flights in odd numbered rows that departed in January and flights in even numbered rows that departed in February. And unfortunately there’s no warning because flights has an even number of rows.

+

To protect you from this type of silent failure, most tidyverse functions use a stricter form of recycling that only recycles single values, as the sketch below shows. Unfortunately that doesn’t help here, or in many other cases, because the key computation is performed by the base R function ==, not filter().
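
For example, here’s a sketch of that stricter behavior using tibble(): a single value is recycled happily, but any other length is an error:

+
+
tibble(x = 1:4, y = 1)
+#> # A tibble: 4 × 2
+#>       x     y
+#>   <int> <dbl>
+#> 1     1     1
+#> 2     2     1
+#> 3     3     1
+#> 4     4     1
+
+# tibble(x = 1:4, y = 1:2) errors, telling you that only
+# values of size one are recycled
+
+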

+
+ +
+

+Minimum and maximum

+

The arithmetic functions work with pairs of variables. Two closely related functions are pmin() and pmax(), which when given two or more variables will return the smallest or largest value in each row:

+
+
df <- tribble(
+  ~x, ~y,
+  1,  3,
+  5,  2,
+  7, NA,
+)
+
+df |> 
+  mutate(
+    min = pmin(x, y, na.rm = TRUE),
+    max = pmax(x, y, na.rm = TRUE)
+  )
+#> # A tibble: 3 × 4
+#>       x     y   min   max
+#>   <dbl> <dbl> <dbl> <dbl>
+#> 1     1     3     1     3
+#> 2     5     2     2     5
+#> 3     7    NA     7     7
+
+

Note that these are different to the summary functions min() and max() which take multiple observations and return a single value. You can tell that you’ve used the wrong form when all the minimums and all the maximums have the same value:

+
+
df |> 
+  mutate(
+    min = min(x, y, na.rm = TRUE),
+    max = max(x, y, na.rm = TRUE)
+  )
+#> # A tibble: 3 × 4
+#>       x     y   min   max
+#>   <dbl> <dbl> <dbl> <dbl>
+#> 1     1     3     1     7
+#> 2     5     2     1     7
+#> 3     7    NA     1     7
+
+
+ +
+

+Modular arithmetic

+

Modular arithmetic is the technical name for the type of math you did before you learned about real numbers, i.e. division that yields a whole number and a remainder. In R, %/% does integer division and %% computes the remainder:

+
+
1:10 %/% 3
+#>  [1] 0 0 1 1 1 2 2 2 3 3
+1:10 %% 3
+#>  [1] 1 2 0 1 2 0 1 2 0 1
+
+

Modular arithmetic is handy for the flights dataset, because we can use it to unpack the sched_dep_time variable into hour and minute:

+
+
flights |> 
+  mutate(
+    hour = sched_dep_time %/% 100,
+    minute = sched_dep_time %% 100,
+    .keep = "used"
+  )
+#> # A tibble: 336,776 × 3
+#>   sched_dep_time  hour minute
+#>            <int> <dbl>  <dbl>
+#> 1            515     5     15
+#> 2            529     5     29
+#> 3            540     5     40
+#> 4            545     5     45
+#> 5            600     6      0
+#> 6            558     5     58
+#> # … with 336,770 more rows
+
+

We can combine that with the mean(is.na(x)) trick from #sec-logical-summaries to see how the proportion of cancelled flights varies over the course of the day. The results are shown in #fig-prop-cancelled.

+
+
flights |> 
+  group_by(hour = sched_dep_time %/% 100) |> 
+  summarise(prop_cancelled = mean(is.na(dep_time)), n = n()) |> 
+  filter(hour > 1) |> 
+  ggplot(aes(hour, prop_cancelled)) +
+  geom_line(color = "grey50") + 
+  geom_point(aes(size = n))
+
+ +

A line plot showing how proportion of cancelled flights changes over the course of the day. The proportion starts low at around 0.5% at 6am, then steadily increases over the course of the day until peaking at 4% at 7pm. The proportion of cancelled flights then drops rapidly getting down to around 1% by midnight.

+
Figure 13.1: A line plot with scheduled departure hour on the x-axis, and proportion of cancelled flights on the y-axis. Cancellations seem to accumulate over the course of the day until 8pm; very late flights are much less likely to be cancelled.
+
+
+
+
+ +
+

+Logarithms

+

Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert exponential growth to linear growth. For example, take compound interest — the amount of money you have at year + 1 is the amount of money you had at year multiplied by the interest rate. That gives a formula like money = starting * interest ^ year:

+
+
starting <- 100
+interest <- 1.05
+
+money <- tibble(
+  year = 2000 + 1:50,
+  money = starting * interest^(1:50)
+)
+
+

If you plot this data, you’ll get an exponential curve:

+
+
ggplot(money, aes(year, money)) +
+  geom_line()
+
+

+
+
+

Log transforming the y-axis gives a straight line:

+
+
ggplot(money, aes(year, money)) +
+  geom_line() + 
+  scale_y_log10()
+
+

+
+
+

This is a straight line because a little algebra reveals that log(money) = log(starting) + year * log(interest), which matches the pattern for a line, y = m * x + b. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.

+

If you’re log-transforming your data with dplyr you have a choice of three logarithms provided by base R: #chp-https://rdrr.io/r/base/Log (the natural log, base e), #chp-https://rdrr.io/r/base/Log (base 2), and #chp-https://rdrr.io/r/base/Log (base 10). We recommend using #chp-https://rdrr.io/r/base/Log or #chp-https://rdrr.io/r/base/Log. #chp-https://rdrr.io/r/base/Log is easy to interpret because difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas #chp-https://rdrr.io/r/base/Log is easy to back-transform because (e.g) 3 is 10^3 = 1000.

+

The inverse of log() is exp(); to compute the inverse of log2() or log10() you'll need to use 2^ or 10^.
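As a quick sanity check of these pairs, using values where the answers are obvious:

log2(c(1, 2, 4, 8))
#> [1] 0 1 2 3
2^log2(100)
#> [1] 100
10^log10(100)
#> [1] 100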

+
+ +
+

+Rounding

+

Use round(x) to round a number to the nearest integer:

+
+
round(123.456)
+#> [1] 123
+
+

You can control the precision of the rounding with the second argument, digits. round(x, digits) rounds to the nearest 10^-digits, so digits = 2 will round to the nearest 0.01. This definition is useful because it implies round(x, -3) will round to the nearest thousand, which indeed it does:

+
+
round(123.456, 2)  # two digits
+#> [1] 123.46
+round(123.456, 1)  # one digit
+#> [1] 123.5
+round(123.456, -1) # round to nearest ten
+#> [1] 120
+round(123.456, -2) # round to nearest hundred
+#> [1] 100
+
+

There’s one weirdness with #chp-https://rdrr.io/r/base/Round that seems surprising at first glance:

+
+
round(c(1.5, 2.5))
+#> [1] 2 2
+
+

round() uses what's known as "round half to even" or banker's rounding: if a number is halfway between two integers, it will be rounded to the even integer. This is a good strategy because it keeps the rounding unbiased: half of all 0.5s are rounded up, and half are rounded down.

+

round() is paired with floor() which always rounds down and ceiling() which always rounds up:

+
+
x <- 123.456
+
+floor(x)
+#> [1] 123
+ceiling(x)
+#> [1] 124
+
+

These functions don’t have a digits argument, so you can instead scale down, round, and then scale back up:

+
+
# Round down to nearest two digits
+floor(x / 0.01) * 0.01
+#> [1] 123.45
+# Round up to nearest two digits
+ceiling(x / 0.01) * 0.01
+#> [1] 123.46
+
+

You can use the same technique if you want to round() to a multiple of some other number:

+
+
# Round to nearest multiple of 4
+round(x / 4) * 4
+#> [1] 124
+
+# Round to nearest 0.25
+round(x / 0.25) * 0.25
+#> [1] 123.5
+
+
+ +
+

+Cutting numbers into ranges

+

Use cut() to break up a numeric vector into discrete buckets (ggplot2 also provides helpers for common cases in cut_interval(), cut_number(), and cut_width(); ggplot2 is an admittedly weird place for these functions to live, but they are useful as part of histogram computation and were written before any other parts of the tidyverse existed):

+
+
x <- c(1, 2, 5, 10, 15, 20)
+cut(x, breaks = c(0, 5, 10, 15, 20))
+#> [1] (0,5]   (0,5]   (0,5]   (5,10]  (10,15] (15,20]
+#> Levels: (0,5] (5,10] (10,15] (15,20]
+
+

The breaks don’t need to be evenly spaced:

+
+
cut(x, breaks = c(0, 5, 10, 100))
+#> [1] (0,5]    (0,5]    (0,5]    (5,10]   (10,100] (10,100]
+#> Levels: (0,5] (5,10] (10,100]
+
+

You can optionally supply your own labels. Note that there should be one fewer label than there are breaks:

+
+
cut(x, 
+  breaks = c(0, 5, 10, 15, 20), 
+  labels = c("sm", "md", "lg", "xl")
+)
+#> [1] sm sm sm md lg xl
+#> Levels: sm md lg xl
+
+

Any values outside of the range of the breaks will become NA:

+
+
y <- c(NA, -10, 5, 10, 30)
+cut(y, breaks = c(0, 5, 10, 15, 20))
+#> [1] <NA>   <NA>   (0,5]  (5,10] <NA>  
+#> Levels: (0,5] (5,10] (10,15] (15,20]
+
+

See the documentation for other useful arguments like right and include.lowest, which control whether the intervals are [a, b) or (a, b] and whether the lowest interval should be [a, b].
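For example, with right = FALSE the intervals are closed on the left instead, so values equal to a break fall into the bucket above it (and the largest break itself becomes NA unless you also set include.lowest = TRUE):

cut(x, breaks = c(0, 5, 10, 15, 20), right = FALSE)
#> [1] [0,5)   [0,5)   [5,10)  [10,15) [15,20) <NA>   
#> Levels: [0,5) [5,10) [10,15) [15,20)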

+
+ +
+

+Cumulative and rolling aggregates

+

Base R provides cumsum(), cumprod(), cummin(), and cummax() for running, or cumulative, sums, products, mins and maxes. dplyr provides cummean() for cumulative means. Cumulative sums tend to come up the most in practice:

+
+
x <- 1:10
+cumsum(x)
+#>  [1]  1  3  6 10 15 21 28 36 45 55
+
+

If you need more complex rolling or sliding aggregates, try the slider package (https://davisvaughan.github.io/slider/) by Davis Vaughan. The following example illustrates some of its features.

+
+
library(slider)
+
+# Same as a cumulative sum
+slide_vec(x, sum, .before = Inf)
+#>  [1]  1  3  6 10 15 21 28 36 45 55
+# Sum the current element and the one before it
+slide_vec(x, sum, .before = 1)
+#>  [1]  1  3  5  7  9 11 13 15 17 19
+# Sum the current element and the two before and after it
+slide_vec(x, sum, .before = 2, .after = 2)
+#>  [1]  6 10 15 20 25 30 35 40 34 27
+# Only compute if the window is complete
+slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
+#>  [1] NA NA 15 20 25 30 35 40 NA NA
+
+
+ +
+

+Exercises

+
  1. Explain in words what each line of the code used to generate #fig-prop-cancelled does.

  2. +
  3. What trigonometric functions does R provide? Guess some names and look up the documentation. Do they use degrees or radians?

  4. +
  5. +

    Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. You can see the basic problem in this plot: there’s a gap between each hour.

    +
    +
    flights |> 
    +  filter(month == 1, day == 1) |> 
    +  ggplot(aes(sched_dep_time, dep_delay)) +
    +  geom_point()
    +#> Warning: Removed 4 rows containing missing values (`geom_point()`).
    +
    +

    +
    +
    +

    Convert them to a more truthful representation of time (either fractional hours or minutes since midnight).

    +
  6. +
+
+ +
+

+General transformations

+

The following sections describe some general transformations which are often used with numeric vectors, but can be applied to all other column types.

+ +
+

+Ranks

+

dplyr provides a number of ranking functions inspired by SQL, but you should always start with min_rank(). It uses the typical method for dealing with ties, e.g. 1st, 2nd, 2nd, 4th.

+
+
x <- c(1, 2, 2, 3, 4, NA)
+min_rank(x)
+#> [1]  1  2  2  4  5 NA
+
+

Note that the smallest values get the lowest ranks; use desc(x) to give the largest values the smallest ranks:

+
+
min_rank(desc(x))
+#> [1]  5  3  3  2  1 NA
+
+

If min_rank() doesn't do what you need, look at the variants row_number(), dense_rank(), percent_rank(), and cume_dist(). See the documentation for details.

+
+
df <- tibble(x = x)
+df |> 
+  mutate(
+    row_number = row_number(x),
+    dense_rank = dense_rank(x),
+    percent_rank = percent_rank(x),
+    cume_dist = cume_dist(x)
+  )
+#> # A tibble: 6 × 5
+#>       x row_number dense_rank percent_rank cume_dist
+#>   <dbl>      <int>      <int>        <dbl>     <dbl>
+#> 1     1          1          1         0          0.2
+#> 2     2          2          2         0.25       0.6
+#> 3     2          3          2         0.25       0.6
+#> 4     3          4          3         0.75       0.8
+#> 5     4          5          4         1          1  
+#> 6    NA         NA         NA        NA         NA
+
+

You can achieve many of the same results by picking the appropriate ties.method argument to base R's rank(); you'll probably also want to set na.last = "keep" to keep NAs as NA.
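For example, this base R call reproduces min_rank() on the x defined above:

rank(x, ties.method = "min", na.last = "keep")
#> [1]  1  2  2  4  5 NA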

+

row_number() can also be used without any arguments when inside a dplyr verb. In this case, it'll give the number of the "current" row. When combined with %% or %/% this can be a useful tool for dividing data into similarly sized groups:

+
+
df <- tibble(x = runif(10))
+
+df |> 
+  mutate(
+    row0 = row_number() - 1,
+    three_groups = row0 %% 3,
+    three_in_each_group = row0 %/% 3,
+  )
+#> # A tibble: 10 × 4
+#>         x  row0 three_groups three_in_each_group
+#>     <dbl> <dbl>        <dbl>               <dbl>
+#> 1 0.0808      0            0                   0
+#> 2 0.834       1            1                   0
+#> 3 0.601       2            2                   0
+#> 4 0.157       3            0                   1
+#> 5 0.00740     4            1                   1
+#> 6 0.466       5            2                   1
+#> # … with 4 more rows
+
+
+ +
+

+Offsets

+

lead() and lag() allow you to refer to the values just before or just after the "current" value. They return a vector of the same length as the input, padded with NAs at the start or end:

+
+
x <- c(2, 5, 11, 11, 19, 35)
+lag(x)
+#> [1] NA  2  5 11 11 19
+lead(x)
+#> [1]  5 11 11 19 35 NA
+
+
  • +

    x - lag(x) gives you the difference between the current and previous value.

    +
    +
    x - lag(x)
    +#> [1] NA  3  6  0  8 16
    +
    +
  • +
  • +

x == lag(x) tells you when the current value is the same as the previous one; use x != lag(x) to detect when it changes.

    +
    +
    x == lag(x)
    +#> [1]    NA FALSE FALSE  TRUE FALSE FALSE
    +
    +
  • +

You can lead or lag by more than one position by using the second argument, n.
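For example, using the same x as above:

lag(x, 2)
#> [1] NA NA  2  5 11 11
lead(x, 2)
#> [1] 11 11 19 35 NA NA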

+
+ +
+

+Exercises

+
1. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().

  2. +
  3. Which plane (tailnum) has the worst on-time record?

  4. +
  5. What time of day should you fly if you want to avoid delays as much as possible?

  6. +
7. What does flights |> group_by(dest) |> filter(row_number() < 4) do? What does flights |> group_by(dest) |> filter(row_number(dep_delay) < 4) do?

  8. +
  9. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

  10. +
  11. +

Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the average flight delay for an hour is related to the average delay for the previous hour.

    +
    +
    flights |> 
    +  mutate(hour = dep_time %/% 100) |> 
    +  group_by(year, month, day, hour) |> 
    +  summarise(
    +    dep_delay = mean(dep_delay, na.rm = TRUE),
    +    n = n(),
    +    .groups = "drop"
    +  ) |> 
    +  filter(n > 5)
    +
    +
  12. +
  13. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

  14. +
  15. Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.

  16. +
+
+ +
+

+Numeric summaries

+

Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions. Here is a selection that you might find useful.

+ +
+

+Center

+

So far, we’ve mostly used #chp-https://rdrr.io/r/base/mean to summarize the center of a vector of values. Because the mean is the sum divided by the count, it is sensitive to even just a few unusually high or low values. An alternative is to use the #chp-https://rdrr.io/r/stats/median, which finds a value that lies in the “middle” of the vector, i.e. 50% of the values is above it and 50% are below it. Depending on the shape of the distribution of the variable you’re interested in, mean or median might be a better measure of center. For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.

+

#fig-mean-vs-median compares the mean vs the median when looking at the daily departure delay. The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.

+
+
flights |>
+  group_by(year, month, day) |>
+  summarise(
+    mean = mean(dep_delay, na.rm = TRUE),
+    median = median(dep_delay, na.rm = TRUE),
+    n = n(),
+    .groups = "drop"
+  ) |> 
+  ggplot(aes(mean, median)) + 
+  geom_abline(slope = 1, intercept = 0, color = "white", size = 2) +
+  geom_point()
+#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
+#> ℹ Please use `linewidth` instead.
+
+ +

All points fall below a 45° line, meaning that the median delay is always less than the mean delay. Most points are clustered in a dense region of mean [0, 20] and median [0, 5]. As the mean delay increases, the spread of the median also increases. There are two outlying points with mean ~60, median ~50, and mean ~85, median ~55.

+
Figure 13.2: A scatterplot showing the differences of summarising daily departure delay with median instead of mean.
+
+
+
+

You might also wonder about the mode, or the most common value. This is a summary that only works well for very simple cases (which is why you might have learned about it in high school), but it doesn't work well for many real datasets. If the data is discrete, there may be multiple most common values, and if the data is continuous, there might be no most common value because every value is ever so slightly different. For these reasons, the mode tends not to be used by statisticians and there's no mode function included in base R (the base mode() function does something quite different!).

+
+ +
+

+Minimum, maximum, and quantiles

+

What if you’re interested in locations other than the center? #chp-https://rdrr.io/r/base/Extremes and #chp-https://rdrr.io/r/base/Extremes will give you the largest and smallest values. Another powerful tool is #chp-https://rdrr.io/r/stats/quantile which is a generalization of the median: quantile(x, 0.25) will find the value of x that is greater than 25% of the values, quantile(x, 0.5) is equivalent to the median, and quantile(x, 0.95) will find a value that’s greater than 95% of the values.

+

For the flights data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.

+
+
flights |>
+  group_by(year, month, day) |>
+  summarise(
+    max = max(dep_delay, na.rm = TRUE),
+    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
+    .groups = "drop"
+  )
+#> # A tibble: 365 × 5
+#>    year month   day   max   q95
+#>   <int> <int> <int> <dbl> <dbl>
+#> 1  2013     1     1   853  70.1
+#> 2  2013     1     2   379  85  
+#> 3  2013     1     3   291  68  
+#> 4  2013     1     4   288  60  
+#> 5  2013     1     5   327  41  
+#> 6  2013     1     6   202  51  
+#> # … with 359 more rows
+
+
+ +
+

+Spread

+

Sometimes you’re not so interested in where the bulk of the data lies, but in how it is spread out. Two commonly used summaries are the standard deviation, sd(x), and the inter-quartile range, #chp-https://rdrr.io/r/stats/IQR. We won’t explain #chp-https://rdrr.io/r/stats/sd here since you’re probably already familiar with it, but #chp-https://rdrr.io/r/stats/IQR might be new — it’s quantile(x, 0.75) - quantile(x, 0.25) and gives you the range that contains the middle 50% of the data.

+

We can use this to reveal a small oddity in the flights data. You might expect the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it look like one airport, Eagle County Regional Airport (EGE), might have moved.

+
+
flights |> 
+  group_by(origin, dest) |> 
+  summarise(
+    distance_sd = IQR(distance), 
+    n = n(),
+    .groups = "drop"
+  ) |> 
+  filter(distance_sd > 0)
+#> # A tibble: 2 × 4
+#>   origin dest  distance_sd     n
+#>   <chr>  <chr>       <dbl> <int>
+#> 1 EWR    EGE             1   110
+#> 2 JFK    EGE             1   103
+
+
+ +
+

+Distributions

+

It’s worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they’re fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That’s why it’s always a good idea to visualize the distribution before committing to your summary statistics.

+

#fig-flights-dist shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.

+
+
flights |>
+  ggplot(aes(dep_delay)) + 
+  geom_histogram(binwidth = 15)
+#> Warning: Removed 8255 rows containing non-finite values (`stat_bin()`).
+
+flights |>
+  filter(dep_delay < 120) |> 
+  ggplot(aes(dep_delay)) + 
+  geom_histogram(binwidth = 5)
+
+
+
+ +

Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that.

+
(a) Histogram shows the full range of delays.
+
+
+
+ +

Two histograms of `dep_delay`. On the left, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the right, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that.

+
(b) Histogram is zoomed in to show delays less than 2 hours.
+
+
+
+
Figure 13.3: The distribution of dep_delay appears highly skewed to the right in both histograms.
+
+
+

It’s also a good idea to check that distributions for subgroups resemble the whole. #fig-flights-dist-daily overlays a frequency polygon for each day. The distributions seem to follow a common pattern, suggesting it’s fine to use the same summary for each day.

+
+
flights |>
+  filter(dep_delay < 120) |> 
+  ggplot(aes(dep_delay, group = interaction(day, month))) + 
+  geom_freqpoly(binwidth = 5, alpha = 1/5)
+
+ +

The distribution of `dep_delay` is highly right skewed with a strong peak slightly less than 0. The 365 frequency polygons are mostly overlapping, forming a thick black band.

+
Figure 13.4: 365 frequency polygons of dep_delay, one for each day. The frequency polygons appear to have the same shape, suggesting that it’s reasonable to compare days by looking at just a few summary statistics.
+
+
+
+

Don’t be afraid to explore your own custom summaries specifically tailored for the data that you’re working with. In this case, that might mean separately summarizing the flights that left early vs the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don’t forget what you learned in #sec-sample-size: whenever creating numerical summaries, it’s a good idea to include the number of observations in each group.

+
+ +
+

+Positions

+

There’s one final type of summary that’s useful for numeric vectors, but also works with every other type of value: extracting a value at specific position. You can do this with the base R [ function, but we’re not going to cover it in detail until #sec-subset-many, because it’s a very powerful and general function. For now we’ll introduce three specialized functions that you can use to extract values at a specified position: first(x), last(x), and nth(x, n).

+

For example, we can find the first and last departure for each day:

+
+
flights |> 
+  group_by(year, month, day) |> 
+  summarise(
+    first_dep = first(dep_time), 
+    fifth_dep = nth(dep_time, 5),
+    last_dep = last(dep_time)
+  )
+#> `summarise()` has grouped output by 'year', 'month'. You can override using the
+#> `.groups` argument.
+#> # A tibble: 365 × 6
+#> # Groups:   year, month [12]
+#>    year month   day first_dep fifth_dep last_dep
+#>   <int> <int> <int>     <int>     <int>    <int>
+#> 1  2013     1     1       517       554       NA
+#> 2  2013     1     2        42       535       NA
+#> 3  2013     1     3        32       520       NA
+#> 4  2013     1     4        25       531       NA
+#> 5  2013     1     5        14       534       NA
+#> 6  2013     1     6        16       555       NA
+#> # … with 359 more rows
+
+

(These functions currently lack an na.rm argument but will hopefully be fixed by the time you read this book: https://github.com/tidyverse/dplyr/issues/6242).

+

If you’re familiar with [, you might wonder if you ever need these functions. There are two main reasons: the default argument and the order_by argument. default allows you to set a default value that’s used if the requested position doesn’t exist, e.g. you’re trying to get the 3rd element from a two element group. order_by lets you locally override the existing ordering of the rows, so you can get the element at the position in the ordering by #chp-https://dplyr.tidyverse.org/reference/order_by.

+

Extracting values at positions is complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

+
+
flights |> 
+  group_by(year, month, day) |> 
+  mutate(r = min_rank(desc(sched_dep_time))) |> 
+  filter(r %in% c(1, max(r)))
+#> # A tibble: 1,195 × 20
+#> # Groups:   year, month, day [365]
+#>    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
+#>   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
+#> 1  2013     1     1      517         515       2     830     819      11 UA     
+#> 2  2013     1     1     2353        2359      -6     425     445     -20 B6     
+#> 3  2013     1     1     2353        2359      -6     418     442     -24 B6     
+#> 4  2013     1     1     2356        2359      -3     425     437     -12 B6     
+#> 5  2013     1     2       42        2359      43     518     442      36 B6     
+#> 6  2013     1     2      458         500      -2     703     650      13 US     
+#> # … with 1,189 more rows, 10 more variables: flight <int>, tailnum <chr>,
+#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
+#> #   minute <dbl>, time_hour <dttm>, r <int>, and abbreviated variable names
+#> #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
+
+
+ +
+

+With mutate() +

+

As the names suggest, the summary functions are typically paired with summarise(). However, because of the recycling rules we discussed in #sec-recycling they can also be usefully paired with mutate(), particularly when you want to do some sort of group standardization. For example (a code sketch follows this list):

+
  • +x / sum(x) calculates the proportion of a total.
  • +
  • +(x - mean(x)) / sd(x) computes a Z-score (standardized to mean 0 and sd 1).
  • +
  • +x / first(x) computes an index based on the first observation.
  • +
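Here's a minimal sketch of the first two, applied per destination (the new column names are just illustrative):

flights |> 
  group_by(dest) |> 
  mutate(
    prop_delay = dep_delay / sum(dep_delay, na.rm = TRUE),
    z_delay = (dep_delay - mean(dep_delay, na.rm = TRUE)) / sd(dep_delay, na.rm = TRUE)
  )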
+ +
+

+Exercises

+
  1. +

    Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:

    +
    • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
    • +
    • A flight is always 10 minutes late.
    • +
    • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
    • +
    • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
    • +

    Which do you think is more important: arrival delay or departure delay?

    +
  2. +
  3. Which destinations show the greatest variation in air speed?

  4. +
  5. Create a plot to further explore the adventures of EGE. Can you find any evidence that the airport moved locations?

  6. +
+
+ +
+

+Summary

+

You’re already familiar with many tools for working with numbers, and after reading this chapter you now know how to use them in R. You’ve also learned a handful of useful general transformations that are commonly, but not exclusively, applied to numeric vectors like ranks and offsets. Finally, you worked through a number of numeric summaries, and discussed a few of the statistical challenges that you should consider.

+

Over the next two chapters, we’ll dive into working with strings with the stringr package. Strings are a big topic so they get two chapters, one on the fundamentals of strings and one on regular expressions.

+ + +
+
diff --git a/oreilly/preface-2e.html b/oreilly/preface-2e.html new file mode 100644 index 0000000..3773a73 --- /dev/null +++ b/oreilly/preface-2e.html @@ -0,0 +1,19 @@ +
+

Preface to the second edition

Welcome to the second edition of “R for Data Science”.

+
+

Major changes

+
  • The first part is renamed to “whole game” to reflect the entire data science cycle. It gains a new chapter that briefly introduces the basics of reading data from csv files.

  • +
  • The wrangle part is now transform and gains new chapters on numbers, logical vectors, and missing values. These were previously parts of the data transformation chapter, but needed much more room.

  • +
  • We’ve added new chapters on column-wise and row-wise operations.

  • +
  • We’ve added a new set of chapters on import that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and scraping data from the web.

  • +
  • The modeling part has been removed. For modeling, we recommend using packages from tidymodels (https://www.tidymodels.org/) and reading Tidy Modeling with R (https://www.tmwr.org/) by Max Kuhn and Julia Silge to learn more about them.

  • +
  • We’ve switched from the magrittr pipe to the base pipe.

  • +
+ +
+

Acknowledgements

+

TO DO: Add acknowledgements.

+ + +
+
diff --git a/oreilly/program.html b/oreilly/program.html new file mode 100644 index 0000000..26c1a8d --- /dev/null +++ b/oreilly/program.html @@ -0,0 +1,18 @@ +
+

Program

In this part of the book, you’ll improve your programming skills. Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.

+
+ +

Our model of the data science process with program (import, tidy, transform, visualize, model, and communicate, i.e. everything) highlighted in blue.

+
Figure 1: Programming is the water in which all the other components of the data science process swim.
+
+
+

Programming produces code, and code is a tool of communication. Obviously code tells the computer what you want it to do. But it also communicates meaning to other humans. Thinking about code as a vehicle for communication is important because every project you do is fundamentally collaborative. Even if you’re not working with other people, you’ll definitely be working with future-you! Writing clear code is important so that others (like future-you) can understand why you tackled an analysis in the way you did. That means getting better at programming also involves getting better at communicating. Over time, you want your code to become not just easier to write, but easier for others to read.

Writing code is similar in many ways to writing prose. One parallel which we find particularly useful is that in both cases rewriting is the key to clarity. The first expression of your ideas is unlikely to be particularly clear, and you may need to rewrite multiple times. After solving a data analysis challenge, it’s often worth looking at your code and thinking about whether or not it’s obvious what you’ve done. If you spend a little time rewriting your code while the ideas are fresh, you can save a lot of time later trying to recreate what your code did. But this doesn’t mean you should rewrite every function: you need to balance what you need to achieve now with saving time in the long run. (But the more you rewrite your functions the more likely your first attempt will be clear.)

In the following three chapters, you'll learn skills that will improve your programming:

  1. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice. Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies. Instead, in #chp-functions, you’ll learn how to write functions which let you extract out repeated code so that it can be easily reused.

  2. +
  3. Functions extract out repeated code, but you often need to repeat the same actions on different inputs. You need tools for iteration that let you do similar things again and again. These tools include for loops and functional programming, which you’ll learn about in #chp-iteration.

  4. +
  5. As you read more code written by others, you’ll see more code that doesn’t use the tidyverse. In #chp-base-R, you’ll learn some of the most important base R functions that you’ll see in the wild. These functions tend to be designed to use individual vectors, rather than data frames, often making them a good fit for your programming needs.

  6. +
+

Learning more

+

The goal of these chapters is to teach you the minimum about programming that you need to practice data science. Once you have mastered the material in this book, we strongly believe you should continue to invest in your programming skills. Learning more about programming is a long-term investment: it won’t pay off immediately, but in the long term it will allow you to solve new problems more quickly, and let you reuse your insights from previous problems in new scenarios.

+

To learn more you need to study R as a programming language, not just an interactive environment for data science. We have written two books that will help you do so:

+
  • Hands-On Programming with R (https://rstudio-education.github.io/hopr/), by Garrett Grolemund. This is an introduction to R as a programming language and is a great place to start if R is your first programming language. It covers similar material to these chapters, but with a different style and different motivating examples (based in the casino). It's a useful complement if you find that these three chapters go by too quickly.

  • +
  • Advanced R (https://adv-r.hadley.nz/) by Hadley Wickham. This dives into the details of R the programming language. This is a great place to start if you have existing programming experience. It's also a great next step once you've internalized the ideas in these chapters.

  • +
diff --git a/oreilly/quarto-formats.html b/oreilly/quarto-formats.html new file mode 100644 index 0000000..6311bae --- /dev/null +++ b/oreilly/quarto-formats.html @@ -0,0 +1,293 @@ +
+

Quarto formats

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

So far you’ve seen Quarto used to produce HTML documents. This chapter gives a brief overview of some of the many other types of output you can produce with Quarto.

+

There are two ways to set the output of a document:

+
  1. +

    Permanently, by modifying the YAML header:

    +
    title: "Diamond sizes"
    +format: html
    +
  2. +
  3. +

    Transiently, by calling quarto::quarto_render() by hand:

    +
    +
    quarto::quarto_render("diamond-sizes.qmd", output_format = "docx")
    +
    +

    This is useful if you want to programmatically produce multiple types of output since the output_format argument can also take a list of values.

    +
    +
    quarto::quarto_render("diamond-sizes.qmd", output_format = c("docx", "pdf"))
    +
    +
  4. +
+ +
+

+Output options

+

Quarto offers a wide range of output formats. You can find the complete list at https://quarto.org/docs/output-formats/all-formats.html. Many formats share some output options (e.g., toc: true for including a table of contents), but others have options that are format specific (e.g., code-fold: true collapses code chunks into a <details> tag for HTML output so the user can display it on demand; it's not applicable in a PDF or Word document).

+

To override the default options, you need to use an expanded format field. For example, if you wanted to render an html with a floating table of contents, you'd use:

+
format:
+  html:
+    toc: true
+    toc_float: true
+

You can even render to multiple outputs by supplying a list of formats:

+
format:
+  html:
+    toc: true
+    toc_float: true
+  pdf: default
+  docx: default
+

Note the special syntax (pdf: default) if you don’t want to override any of the default options.

+

To render to all formats specified in the YAML of a document, you can use output_format = "all".

+
+
quarto::quarto_render("diamond-sizes.qmd", output_format = "all")
+
+
+ +
+

+Documents

+

The previous chapter focused on the default html output. There are a number of basic variations on that theme, generating different types of documents. For example:

+
  • pdf makes a PDF with LaTeX (an open source document layout system), which you’ll need to install. RStudio will prompt you if you don’t already have it.

  • +
  • docx for Microsoft Word (.docx) documents.

  • +
  • odt for OpenDocument Text (.odt) documents.

  • +
  • rtf for Rich Text Format (.rtf) documents.

  • +
  • gfm for a GitHub Flavored Markdown (.md) document.

  • +
  • ipynb for Jupyter Notebooks (.ipynb).

  • +

Remember, when generating a document to share with decision makers, you can turn off the default display of code by setting global options in document YAML:

+
execute:
+  echo: false
+

For html documents another option is to make the code chunks hidden by default, but visible with a click:

+
format:
+  html:
+    code-fold: true
+
+ +
+

+Presentations

+

You can also use Quarto to produce presentations. You get less visual control than with a tool like Keynote or PowerPoint, but automatically inserting the results of your R code into a presentation can save a huge amount of time. Presentations work by dividing your content into slides, with a new slide beginning at each second (##) level header. Additionally, first (#) level headers can be used to indicate the beginning of a new section with a section title slide that is centered by default.
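For example, a minimal revealjs deck might look like the following (the title and bullet content are placeholders):

---
title: "Example talk"
format: revealjs
---

# Getting started

## First slide

-   Each second (##) level heading starts a new slide
-   Each first (#) level heading starts a new section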

+

Quarto supports a variety of presentation formats, including:

+
  1. revealjs - HTML presentation with revealjs

  2. +
  3. pptx - PowerPoint presentation

  4. +
  5. beamer - PDF presentation with LaTeX Beamer.

  6. +

You can read more about creating presentations with Quarto at https://quarto.org/docs/presentations.

+
+ +
+

+Dashboards

+

Dashboards are a useful way to communicate large amounts of information visually and quickly. A dashboard-like look can be achieved with Quarto using document layout options like sidebars, tabsets, multi-column layouts, etc.

+

For example, you can produce this dashboard:

+
+
+

Quarto dashboard with the title "Diamonds dashboard". The first tab shows four plots of the diamonds dataset. The second tab shows summary statistics for price and carat of diamonds. The third tab shows an interactive data table of the first 100 diamonds.

+
+
+

Using this code:

+
+
---
+title: "💍 Diamonds dashboard"
+format: html
+execute:
+  echo: false
+---
+
+```{r}
+#| label: setup
+#| include: false
+
+library(tidyverse)
+library(gt)
+```
+
+::: panel-tabset
+## Plots
+
+```{r}
+#| layout: [[30,-5, 30, -5, 30], [100]]
+
+ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.1)
+ggplot(diamonds, aes(x = price)) + geom_histogram(binwidth = 500)
+ggplot(diamonds, aes(x = cut, color = cut)) + geom_bar()
+
+ggplot(diamonds, aes(x = carat, y = price, color = cut)) + geom_point()
+```
+
+## Summaries
+
+```{r}
+diamonds |>
+  select(price, carat, cut) |>
+  group_by(cut) |>
+  summarize(
+    across(where(is.numeric), list(mean = mean, median = median, sd = sd, IQR = IQR))
+  ) |>
+  pivot_longer(cols = -cut) |>
+  pivot_wider(names_from = cut, values_from = value) |>
+  separate(name, into = c("var", "stat")) |>
+  mutate(
+    var = str_to_title(var),
+    stat = str_to_title(stat),
+    stat = if_else(stat == "Iqr", "IQR", stat)
+    ) |>
+  group_by(var) |>
+  gt() |>
+  fmt_currency(columns = -stat, rows = 1:4, decimals = 0) |>
+  fmt_number(columns = -stat, rows = 5:8) |>
+  cols_align(columns = -stat, align = "center") |>
+  cols_label(stat = "")
+```
+
+## Data
+
+```{r}
+diamonds |> 
+  arrange(desc(carat)) |> 
+  slice_head(n = 100) |> 
+  select(price, carat, cut) |> 
+  DT::datatable()
+```
+:::
+
+

To learn more about Quarto component layouts, visit https://quarto.org/docs/interactive/layout.html.

+
+ +
+

+Interactivity

+

Any HTML document can contain interactive components.

+ +
+

+htmlwidgets

+

HTML is an interactive format, and you can take advantage of that interactivity with htmlwidgets, R functions that produce interactive HTML visualizations. For example, take the leaflet map below. If you’re viewing this page on the web, you can drag the map around, zoom in and out, etc. You obviously can’t do that in a book, so Quarto automatically inserts a static screenshot for you.

+
+
library(leaflet)
+leaflet() |>
+  setView(174.764, -36.877, zoom = 16) |> 
+  addTiles() |>
+  addMarkers(174.764, -36.877, popup = "Maungawhau") 
+
+
+
+
+

The great thing about htmlwidgets is that you don’t need to know anything about HTML or JavaScript to use them. All the details are wrapped inside the package, so you don’t need to worry about it.

+

There are many packages that provide htmlwidgets.

+

To learn more about htmlwidgets and see a more complete list of packages that provide them visit https://www.htmlwidgets.org.

+
+ +
+

+Shiny

+

htmlwidgets provide client-side interactivity — all the interactivity happens in the browser, independently of R. On one hand, that’s great because you can distribute the HTML file without any connection to R. However, that fundamentally limits what you can do to things that have been implemented in HTML and JavaScript. An alternative approach is to use shiny, a package that allows you to create interactivity using R code, not JavaScript.

+

To call Shiny code from a Quarto document, add server: shiny to the YAML header:

+
title: "Shiny Web App"
+format: html
+server: shiny
+

Then you can use the “input” functions to add interactive components to the document:

+
+
library(shiny)
+
+textInput("name", "What is your name?")
+numericInput("age", "How old are you?", NA, min = 0, max = 150)
+
+

You also need a code chunk with the chunk option context: server, which contains the code that needs to run in a Shiny server.
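For example, here's a minimal sketch of a server chunk; it assumes the document also contains a textOutput("greeting") placeholder to display the result:

```{r}
#| context: server
output$greeting <- renderText({
  paste0("Hello ", input$name, "! You are ", input$age, " years old.")
})
```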

+
+
+

Two input boxes on top of each other. Top one says "What is your name?", the bottom one "How old are you?".

+
+
+

You can then refer to the values with input$name and input$age, and the code that uses them will be automatically re-run whenever they change.

+

We can’t show you a live shiny app here because shiny interactions occur on the server-side. This means that you can write interactive apps without knowing JavaScript, but you need a server to run them on. This introduces a logistical issue: Shiny apps need a Shiny server to be run online. When you run Shiny apps on your own computer, Shiny automatically sets up a Shiny server for you, but you need a public facing Shiny server if you want to publish this sort of interactivity online. That’s the fundamental trade-off of shiny: you can do anything in a shiny document that you can do in R, but it requires someone to be running R.

+

For learning more about Shiny, we recommend reading Mastering Shiny by Hadley Wickham, https://mastering-shiny.org.

+
+
+ +
+

+Websites and books

+

With a little additional infrastructure you can use Quarto to generate a complete website:

+
  • Put your .qmd files in a single directory. index.qmd will become the home page.

  • +
  • +

    Add a YAML file named _quarto.yml that provides the navigation for the site. In this file, set the project type:

    +
• For a book, set type: book:
    • +
    project:
    +  type: book
    +
    • For a website, set type: website:
    • +
    project:
    +  type: website
    +
  • +

For example, the following _quarto.yml file creates a website from three source files: index.qmd (the home page), viridis-colors.qmd, and terrain-colors.qmd.

+
+
project:
+  type: website
+
+website:
+  title: "A website on color scales"
+  navbar:
+    left:
+      - href: index.qmd
+        text: Home
+      - href: viridis-colors.qmd
+        text: Viridis colors
+      - href: terrain-colors.qmd
+        text: Terrain colors
+
+

The _quarto.yml file you need for a book is structured very similarly. The following example shows how you can create a book with four chapters that renders to three different outputs (html, pdf, and epub). Once again, the source files are .qmd files.

+
+
project:
+  type: book
+
+book:
+  title: "A book on color scales"
+  author: "Jane Coloriste"
+  chapters:
+    - index.qmd
+    - intro.qmd
+    - viridis-colors.qmd
+    - terrain-colors.qmd
+
+format:
+  html:
+    theme: cosmo
+  pdf: default
+  epub: default
+
+

We recommend that you use an RStudio project for your websites and books. Based on the _quarto.yml file, RStudio will recognize the type of project you're working on, and add a Build tab to the IDE that you can use to render and preview your websites and books. Both websites and books can also be rendered using quarto::quarto_render().

+

Read more at https://quarto.org/docs/websites about Quarto websites and https://quarto.org/docs/books about books.

+
+ +
+

+Other formats

+

Quarto offers even more output formats.

+

See https://quarto.org/docs/output-formats/all-formats.html for a list of even more formats.

+
+ +
+

+Learning more

+

To learn more about effective communication in these different formats we recommend the following resources:

+
+
diff --git a/oreilly/quarto-workflow.html b/oreilly/quarto-workflow.html new file mode 100644 index 0000000..0631881 --- /dev/null +++ b/oreilly/quarto-workflow.html @@ -0,0 +1,25 @@ +
+

Quarto workflow

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+

Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the console, then capture what works in the script editor. Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. When you’re happy, you move on and start a new chunk.

Quarto is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. It:

  • Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!

  • +
  • Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.

  • +
  • Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share not only what you’ve done, but why you did it with your colleagues or lab mates.

  • +

Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. We’ve drawn on our own experiences and Colin Purrington’s advice on lab notebooks (https://colinpurrington.com/tips/lab-notebooks) to come up with the following tips:

  • Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis.

  • +
  • +

    Use the YAML header date field to record the date you started working on the notebook:

    +
    date: 2016-08-23
    +

Use the ISO8601 YYYY-MM-DD format so that there's no ambiguity. Use it even if you don't normally write dates that way!

    +
  • +
  • If you spend a lot of time on an analysis idea and it turns out to be a dead end, don’t delete it! Write up a brief note about why it failed and leave it in the notebook. That will help you avoid going down the same dead end when you come back to the analysis in the future.

  • +
  • Generally, you’re better off doing data entry outside of R. But if you do need to record a small snippet of data, clearly lay it out using #chp-https://tibble.tidyverse.org/reference/tribble.

  • +
  • If you discover an error in a data file, never modify it directly, but instead write code to correct the value. Explain why you made the fix.

  • +
  • Before you finish for the day, make sure you can render the notebook. If you’re using caching, make sure to clear the caches. That will let you fix any problems while the code is still fresh in your mind.

  • +
  • If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you'll need to track the versions of the packages that your code uses. A rigorous approach is to use renv, https://rstudio.github.io/renv/index.html, which stores packages in your project directory. A quick and dirty hack is to include a chunk that runs sessionInfo() — that won't let you easily recreate your packages as they are today, but at least you'll know what they were.

  • +
  • You are going to create many, many, many analysis notebooks over the course of your career. How are you going to organize them so you can find them again in the future? We recommend storing them in individual projects, and coming up with a good naming scheme.

  • +
diff --git a/oreilly/quarto.html b/oreilly/quarto.html new file mode 100644 index 0000000..b3e38dd --- /dev/null +++ b/oreilly/quarto.html @@ -0,0 +1,682 @@ +
+

Quarto

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

Quarto provides a unified authoring framework for data science, combining your code, its results, and your prose. Quarto documents are fully reproducible and support dozens of output formats, like PDFs, Word files, presentations, and more.

+

Quarto files are designed to be used in three ways:

+
  1. For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.

  2. +
  3. For collaborating with other data scientists (including future you!), who are interested in both your conclusions, and how you reached them (i.e. the code).

  4. +
  5. As an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.

  6. +

Quarto is a command line interface tool, not an R package. This means that help is, by-and-large, not available through ?. Instead, as you work through this chapter, and use Quarto in the future, you should refer to the Quarto documentation page at https://quarto.org for help.

+

If you’re an R Markdown user, you might be thinking “Quarto sounds a lot like R Markdown”. You’re not wrong! Quarto unifies the functionality of many packages from the R Markdown ecosystem (rmarkdown, bookdown, distill, xaringan, etc.) into a single consistent system as well as extends it with native support for multiple programming languages like Python and Julia in addition to R. In a way, Quarto reflects everything that was learned from expanding and supporting the R Markdown ecosystem over a decade.

+ +
+

+Prerequisites

+

You need the Quarto command line interface (Quarto CLI), but you don’t need to explicitly install it or load it, as RStudio automatically does both when needed.

+
+
+ +
+

+Quarto basics

+

This is a Quarto file – a plain text file that has the extension .qmd:

+
+
---
+title: "Diamond sizes"
+date: 2022-09-12
+format: html
+---
+
+```{r}
+#| label: setup
+#| include: false
+
+library(tidyverse)
+
+smaller <- diamonds |> 
+  filter(carat <= 2.5)
+```
+
+We have data about `r nrow(diamonds)` diamonds.
+Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats.
+The distribution of the remainder is shown below:
+
+```{r}
+#| label: plot-smaller-diamonds
+#| echo: false
+
+smaller |> 
+  ggplot(aes(carat)) + 
+  geom_freqpoly(binwidth = 0.01)
+```
+
+

It contains three important types of content:

+
  1. An (optional) YAML header surrounded by ---s.
  2. +
  3. +Chunks of R code surrounded by ```.
  4. +
  5. Text mixed with simple text formatting like # heading and _italics_.
  6. +

When you open a .qmd, you get a notebook interface where code and output are interleaved. You can run each code chunk by clicking the Run icon (it looks like a play button at the top of the chunk), or by pressing Cmd/Ctrl + Shift + Enter. RStudio executes the code and displays the results inline with the code:

+
+
+

RStudio window with a Quarto document titled "diamond-sizes.qmd" on the left and a blank Viewer window on the right. The Quarto document has a code chunk that creates a frequency plot of diamonds that weigh less then 2.5 carats. The plot shows that the frequency decreases as the weight increases.

+
+
+

If you don’t like seeing your plots and output in your document and would rather make use of RStudio’s console and plot panes, you can click on the gear icon next to “Render” and switch to “Chunk Output in Console”.

+
+
+

RStudio window with a Quarto document titled "diamond-sizes.qmd" on the left and the Plot pane on the bottom right. The Quarto document has a code chunk that creates a frequency plot of diamonds that weigh less then 2.5 carats. The plot is displayed in the Plot pane and shows that the frequency decreases as the weight increases. The RStudio option to show Chunk Output in Console is also highlighted.

+
+
+

To produce a complete report containing all text, code, and results, click “Render” or press Cmd/Ctrl + Shift + K. You can also do this programmatically with quarto::quarto_render("diamond-sizes.qmd"). This will display the report in the viewer pane and create an HTML file.

+
+
+

RStudio window with a Quarto document titled "diamond-sizes.qmd" on the left and the Plot pane on the bottom right. The rendered document does not show any of the code, but the code is visible in the source document.

+
+
+

When you render the document, Quarto sends the .qmd file to knitr, https://yihui.name/knitr, which executes all of the code chunks and creates a new markdown (.md) document which includes the code and its output. The markdown file generated by knitr is then processed by pandoc, https://pandoc.org, which is responsible for creating the finished file. The advantage of this two step workflow is that you can create a very wide range of output formats, as you’ll learn about in #chp-quarto-formats.

+
+
+

Workflow diagram starting with a qmd file, then knitr, then md, then pandoc, then PDF, MS Word, or HTML.

+
+
+

To get started with your own .qmd file, select File > New File > Quarto Document… in the menu bar. RStudio will launch a wizard that you can use to pre-populate your file with useful content that reminds you how the key features of Quarto work.

+

The following sections dive into the three components of a Quarto document in more details: the markdown text, the code chunks, and the YAML header.

+ +
+

+Exercises

+
1. Create a new Quarto document using File > New File > Quarto Document. Read the instructions. Practice running the chunks individually. Then render the document by clicking the appropriate button and then by using the appropriate keyboard shortcut. Verify that you can modify the code, re-run it, and see modified output.

  2. +
  3. Create one new Quarto document for each of the three built-in formats: HTML, PDF and Word. Render each of the three documents. How do the outputs differ? How do the inputs differ? (You may need to install LaTeX in order to build the PDF output — RStudio will prompt you if this is necessary.)

  4. +
+
+ +
+

+Visual editor

+

The Visual editor in RStudio provides a WYSIWYM (https://en.wikipedia.org/wiki/WYSIWYM) interface for authoring Quarto documents. Under the hood, prose in Quarto documents (.qmd files) is written in Markdown, a lightweight set of conventions for formatting plain text files. In fact, Quarto uses Pandoc markdown (a slightly extended version of Markdown that Quarto understands), including tables, citations, cross-references, footnotes, divs/spans, definition lists, attributes, raw HTML/TeX, and more as well as support for executing code cells and viewing their output inline. While Markdown is designed to be easy to read and write, as you will see in #sec-source-editor, it still requires learning new syntax. Therefore, if you're new to computational documents like .qmd files but have experience using tools like Google Docs or MS Word, the easiest way to get started with Quarto in RStudio is the visual editor.

+

In the visual editor you can either use the buttons on the menu bar to insert images, tables, cross-references, etc. or you can use the catch-all ⌘ / shortcut to insert just about anything. If you are at the beginning of a line (as shown below), you can also enter just / to invoke the shortcut.

+
+
+

A Quarto document displaying various features of the visual editor such as text formatting (italic, bold, underline, small caps, code, superscript, and subscript), first through third level headings, bulleted and numbered lists, links, linked phrases, and images (along with a pop-up window for customizing image size, adding a caption and alt text, etc.), tables with a header row, and the insert anything tool with options to insert an R code chunk, a Python code chunk, a div, a bullet list, a numbered list, or a first level heading (the top few choices in the tool).

+
+
+

Inserting images and customizing how they are displayed is also facilitated with the visual editor. You can either paste an image from your clipboard directly into the visual editor (and RStudio will place a copy of that image in the project directory and link to it) or you can use the visual editor’s Insert > Figure / Image menu to browse to the image you want to insert or paste its URL. In addition, using the same menu you can resize the image as well as add a caption, alternative text, and a link.

+

The visual editor has many more features that we haven’t enumerated here that you might find useful as you gain experience authoring with it.

+

Most importantly, while the visual editor displays your content with formatting, under the hood, it saves your content in plain Markdown and you can switch back and forth between the visual and source editors to view and edit your content using either tool.

+ +
+

+Exercises

+ +
+
+ +
+

+Source editor

+

You can also edit Quarto documents using the Source editor in RStudio, without the assistance of the Visual editor. While the Visual editor will feel familiar to those with experience writing in tools like Google Docs, the Source editor will feel familiar to those with experience writing R scripts or R Markdown documents. The Source editor can also be useful for debugging any Quarto syntax errors since it’s often easier to catch these in plain text.

+

The guide below shows how to use Pandoc’s Markdown for authoring Quarto documents in the source editor.

+
+
## Text formatting
+
+*italic* **bold** [underline]{.underline} ~~strikeout~~ [small caps]{.smallcaps} `code` superscript^2^ and subscript~2~
+
+## Headings
+
+# 1st Level Header
+
+## 2nd Level Header
+
+### 3rd Level Header
+
+## Lists
+
+-   Bulleted list item 1
+
+-   Item 2
+
+    -   Item 2a
+
+    -   Item 2b
+
+1.  Numbered list item 1
+
+2.  Item 2.
+    The numbers are incremented automatically in the output.
+
+## Links and images
+
+<http://example.com>
+
+[linked phrase](http://example.com)
+
+![optional caption text](quarto.png){fig-alt="Quarto logo and the word quarto spelled in small case letters"}
+
+## Tables
+
+| First Header | Second Header |
+|--------------|---------------|
+| Content Cell | Content Cell  |
+| Content Cell | Content Cell  |
+
+
+

The best way to learn these is simply to try them out. It will take a few days, but soon they will become second nature, and you won’t need to think about them. If you forget, you can get to a handy reference sheet with Help > Markdown Quick Reference.

+ +
+

+Exercises

+
  1. Practice what you’ve learned by creating a brief CV. The title should be your name, and you should include headings for (at least) education or employment. Each of the sections should include a bulleted list of jobs/degrees. Highlight the year in bold.

  2. +
  3. +

    Using the visual editor, figure out how to:

    +
    1. Add a footnote.
    2. +
    3. Add a horizontal rule.
    4. +
    5. Add a block quote.
    6. +
  4. +
  5. +

    Now, using the source editor and the Markdown quick reference, figure out how to:

    +
    1. Add a footnote.
    2. +
    3. Add a horizontal rule.
    4. +
    5. Add a block quote.
    6. +
  6. +
  7. Copy and paste the contents of diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto into a local Quarto document. Check that you can run it, then add text after the frequency polygon that describes its most striking features.

  8. +
+
+ +
+

+Code chunks

+

To run code inside a Quarto document, you need to insert a chunk. There are three ways to do so:

+
  1. The keyboard shortcut Cmd + Option + I / Ctrl + Alt + I.

  2. +
  3. The “Insert” button icon in the editor toolbar.

  4. +
  5. By manually typing the chunk delimiters ```{r} and ```.

  6. +

We’d recommend you learn the keyboard shortcut. It will save you a lot of time in the long run!

+

You can continue to run the code using the keyboard shortcut that by now (we hope!) you know and love: Cmd/Ctrl + Enter. However, chunks get a new keyboard shortcut: Cmd/Ctrl + Shift + Enter, which runs all the code in the chunk. Think of a chunk like a function. A chunk should be relatively self-contained, and focused around a single task.

+

The following sections describe the chunk header which consists of ```{r}, followed by an optional chunk label and various other chunk options, each on their own line, marked by #|.

+ +
+

+Chunk label

+

Chunks can be given an optional label, e.g.

+
+
```{r}
+#| label: simple-addition
+
+1 + 1
+```
+
#> [1] 2
+
+

This has three advantages:

+
  1. +

    You can more easily navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor:

    +
    +
    +

    Snippet of RStudio IDE showing only the drop-down code navigator which shows three chunks. Chunk 1 is setup. Chunk 2 is cars and it is in a section called Quarto. Chunk 3 is pressure and it is in a section called Including plots.

    +
    +
    +
  2. +
  3. Graphics produced by the chunks will have useful names that make them easier to use elsewhere. More on that in #sec-figures.

  4. +
  5. You can set up networks of cached chunks to avoid re-performing expensive computations on every run. More on that in #sec-caching.

  6. +

Your chunk labels should be short but evocative and should not contain spaces. We recommend using dashes (-) to separate words (instead of underscores, _) and avoiding other special characters in chunk labels.

+

You are generally free to label your chunk however you like, but there is one chunk name that imbues special behavior: setup. When you’re in notebook mode, the chunk named setup will be run automatically once, before any other code is run.
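For example, a typical setup chunk loads packages and prepares objects used by later chunks. A sketch, consistent with the diamond-sizes example used in this chapter (smaller is the object the inline-code example below relies on):

```{r}
#| label: setup
#| include: false

library(tidyverse)

# Prepare the subset of diamonds used throughout the report
smaller <- diamonds |> 
  filter(carat <= 2.5)
```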

+

Additionally, chunk labels must be unique; duplicated labels will cause an error when you render the document.

+
+ +
+

+Chunk options

+

Chunk output can be customized with options, fields supplied to the chunk header. Knitr provides almost 60 options that you can use to customize your code chunks. Here we’ll cover the most important chunk options that you’ll use frequently. You can see the full list at https://yihui.name/knitr/options.

+

The most important set of options controls if your code block is executed and what results are inserted in the finished report:

+
  • eval: false prevents code from being evaluated. (And obviously if the code is not run, no results will be generated). This is useful for displaying example code, or for disabling a large block of code without commenting each line.

  • +
  • include: false runs the code, but doesn’t show the code or results in the final document. Use this for setup code that you don’t want cluttering your report.

  • +
  • echo: false prevents code, but not the results, from appearing in the finished file. Use this when writing reports aimed at people who don’t want to see the underlying R code.

  • +
  • message: false or warning: false prevents messages or warnings from appearing in the finished file.

  • +
  • results: hide hides printed output; fig-show: hide hides plots.

  • +
  • error: true causes the render to continue even if code returns an error. This is rarely something you’ll want to include in the final version of your report, but can be very useful if you need to debug exactly what is going on inside your .qmd. It’s also useful if you’re teaching R and want to deliberately include an error. The default, error: false causes rendering to fail if there is a single error in the document.

  • +

Each of these chunk options gets added to the header of the chunk, following #|, e.g., in the following chunk the result is not printed since eval is set to false.

+
+
```{r}
+#| label: simple-multiplication
+#| eval: false
+
+2 * 2
+```
+
+

The following table summarizes which types of output each option suppresses:

| Option           | Run code | Show code | Output | Plots | Messages | Warnings |
|------------------|:--------:|:---------:|:------:|:-----:|:--------:|:--------:|
| eval: false      |    -     |     +     |   -    |   -   |    -     |    -     |
| include: false   |    +     |     -     |   -    |   -   |    -     |    -     |
| echo: false      |    +     |     -     |   +    |   +   |    +     |    +     |
| results: hide    |    +     |     +     |   -    |   +   |    +     |    +     |
| fig-show: hide   |    +     |     +     |   +    |   -   |    +     |    +     |
| message: false   |    +     |     +     |   +    |   +   |    -     |    +     |
| warning: false   |    +     |     +     |   +    |   +   |    +     |    -     |

(A + means that type of output is kept; a - means it is suppressed.)
+ +
+

+Global options

+

As you work more with knitr, you will discover that some of the default chunk options don’t fit your needs and you want to change them.

+

You can do this by adding the preferred options in the document YAML, under execute. For example, if you are preparing a report for an audience who does not need to see your code but only your results and narrative, you might set echo: false at the document level. That will hide the code by default, only showing it in the chunks you deliberately choose to show (with echo: true). You might also consider setting message: false and warning: false, but that would make it harder to debug problems because you wouldn’t see any messages in the final document.

+
title: "My report"
+execute:
+  echo: false
+

Since Quarto is designed to be multi-lingual (it works with R as well as other languages like Python, Julia, etc.), not all of the knitr options are available at the document execution level, since some of them only work with knitr and not the other engines Quarto uses for running code in other languages (e.g., Jupyter). You can, however, still set these as global options for your document under the knitr field, under opts_chunk. For example, when writing books and tutorials we set:

+
title: "Tutorial"
+knitr:
+  opts_chunk:
+    comment: "#>"
+    collapse: true
+

This uses our preferred comment formatting and ensures that the code and output are kept closely entwined.

+
+ +
+

+Inline code

+

There is one other way to embed R code into a Quarto document: directly into the text, with: `r `. This can be very useful if you mention properties of your data in the text. For example, the example document used at the start of the chapter had:

+
+

We have data about `r nrow(diamonds)` diamonds. Only `r nrow(diamonds) - nrow(smaller)` are larger than 2.5 carats. The distribution of the remainder is shown below:

+
+

When the report is rendered, the results of these computations are inserted into the text:

+
+

We have data about 53940 diamonds. Only 126 are larger than 2.5 carats. The distribution of the remainder is shown below:

+
+

When inserting numbers into text, format() is your friend. It allows you to set the number of digits so you don’t print to a ridiculous degree of accuracy, and a big.mark to make numbers easier to read. You might combine these into a helper function:

+
+
comma <- function(x) format(x, digits = 2, big.mark = ",")
+comma(3452345)
+#> [1] "3,452,345"
+comma(.12358124331)
+#> [1] "0.12"
+
+
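You might then use the helper inline. A sketch:

```
We have data about `r comma(nrow(diamonds))` diamonds.
```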
+ +
+

+Exercises

+
  1. Add a section that explores how diamond sizes vary by cut, colour, and clarity. Assume you’re writing a report for someone who doesn’t know R, and instead of setting echo: false on each chunk, set a global option.

  2. +
  3. Download diamond-sizes.qmd from https://github.com/hadley/r4ds/tree/main/quarto. Add a section that describes the largest 20 diamonds, including a table that displays their most important attributes.

  4. +
  5. Modify diamond-sizes.qmd to use label_comma() to produce nicely formatted output. Also include the percentage of diamonds that are larger than 2.5 carats.

  6. +
+
+ +
+

+Figures

+

The figures in a Quarto document can be embedded (e.g., a PNG or JPEG file) or generated as a result of a code chunk.

+

To embed an image from an external file, you can use the Insert menu in RStudio and select Figure / Image. This will pop open a menu where you can browse to the image you want to insert as well as add alternative text or a caption to it and adjust its size. In the visual editor you can also simply paste an image from your clipboard into your document and RStudio will place a copy of that image in your project folder.

+

If you include a code chunk that generates a figure (e.g., includes a ggplot() call), the resulting figure will be automatically included in your Quarto document.

+ +
+

+Figure sizing

+

The biggest challenge of graphics in Quarto is getting your figures the right size and shape. There are five main options that control figure sizing: fig-width, fig-height, fig-asp, out-width and out-height. Image sizing is challenging because there are two sizes (the size of the figure created by R and the size at which it is inserted in the output document), and multiple ways of specifying the size (i.e., height, width, and aspect ratio: pick two of three).

+ +

We recommend three of the five options:

+
  • Plots tend to be more aesthetically pleasing if they have consistent width. To enforce this, set fig-width: 6 (6”) and fig-asp: 0.618 (the golden ratio) in the defaults. Then in individual chunks, only adjust fig-asp.

  • +
  • Control the output size with out-width and set it to a percentage of the line width. We suggest out-width: "70%" and fig-align: "center". That gives plots room to breathe, without taking up too much space. (A YAML sketch applying these defaults follows this list.)

  • +
  • To put multiple plots in a single row, set the out-width to 50% for two plots, 33% for three plots, or 25% for four plots, and set fig-align: "default". Depending on what you’re trying to illustrate (e.g., showing data or showing plot variations), you might also tweak fig-width, as discussed below.

  • +
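One way to apply these defaults document-wide is through the knitr field of the YAML header, in the same way as the comment and collapse options shown earlier. A sketch, assuming the knitr engine; note that opts_chunk uses dots rather than dashes in option names:

```yaml
title: "My report"
knitr:
  opts_chunk:
    fig.width: 6
    fig.asp: 0.618
    out.width: "70%"
    fig.align: "center"
```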

If you find that you’re having to squint to read the text in your plot, you need to tweak fig-width. If fig-width is larger than the size the figure is rendered in the final doc, the text will be too small; if fig-width is smaller, the text will be too big. You’ll often need to do a little experimentation to figure out the right ratio between the fig-width and the eventual width in your document. To illustrate the principle, the following three plots have fig-width of 4, 6, and 8 respectively:

+
+
+

+
+
+
+
+

+
+
+
+
+

+
+
+

If you want to make sure the font size is consistent across all your figures, whenever you set out-width, you’ll also need to adjust fig-width to maintain the same ratio with your default out-width. For example, if your default fig-width is 6 and your default out-width is "70%", when you set out-width: "50%" you’ll need to set fig-width to 4.3 (6 * 0.5 / 0.7).
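For example, a chunk meant to sit at half the line width under those defaults might look like the following sketch (the label and plot are purely illustrative):

```{r}
#| label: fig-half-width
#| fig-width: 4.3
#| out-width: "50%"

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()
```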

+
+ +
+

+Other important options

+

When mingling code and text, like in this book, you can set fig-show: "hold" so that plots are shown after the code. This has the pleasant side effect of forcing you to break up large blocks of code with their explanations.

+

To add a caption to the plot, use fig-cap. In Quarto this will change the figure from inline to “floating”.

+

If you’re producing PDF output, the default graphics type is PDF. This is a good default because PDFs are high quality vector graphics. However, they can produce very large and slow plots if you are displaying thousands of points. In that case, set fig-format: "png" to force the use of PNGs. They are slightly lower quality, but will be much more compact.

+

It’s a good idea to name code chunks that produce figures, even if you don’t routinely label other chunks. The chunk label is used to generate the file name of the graphic on disk, so naming your chunks makes it much easier to pick out plots and reuse them in other circumstances (e.g., if you want to quickly drop a single plot into an email or a tweet).
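Putting these options together, a figure chunk might pair an evocative label with a caption. A sketch (the names are illustrative, and smaller is the dataset from the diamond-sizes example):

```{r}
#| label: fig-diamonds-carat
#| fig-cap: "Frequency of diamonds, binned by carat."

ggplot(smaller, aes(x = carat)) + 
  geom_freqpoly(binwidth = 0.01)
```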

+
+ +
+

+Exercises

+ +
+
+ +
+

+Tables

+

Similar to figures, you can include two types of tables in a Quarto document. They can be markdown tables that you create directly in your Quarto document (using the Insert Table menu) or they can be tables generated as a result of a code chunk. In this section we will focus on the latter, tables generated via computation.

+

By default, Quarto prints data frames and matrices as you’d see them in the console:

+
+
mtcars[1:5, ]
+#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
+#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
+#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
+#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
+#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
+#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
+
+

If you prefer that data be displayed with additional formatting you can use the knitr::kable() function. The code below generates #tbl-kable.

+
+
knitr::kable(mtcars[1:5, ])
+
+
Table 27.1: A knitr kable.

|                   |  mpg | cyl | disp |  hp | drat |    wt |  qsec | vs | am | gear | carb |
|-------------------|-----:|----:|-----:|----:|-----:|------:|------:|---:|---:|-----:|-----:|
| Mazda RX4         | 21.0 |   6 |  160 | 110 | 3.90 | 2.620 | 16.46 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag     | 21.0 |   6 |  160 | 110 | 3.90 | 2.875 | 17.02 |  0 |  1 |    4 |    4 |
| Datsun 710        | 22.8 |   4 |  108 |  93 | 3.85 | 2.320 | 18.61 |  1 |  1 |    4 |    1 |
| Hornet 4 Drive    | 21.4 |   6 |  258 | 110 | 3.08 | 3.215 | 19.44 |  1 |  0 |    3 |    1 |
| Hornet Sportabout | 18.7 |   8 |  360 | 175 | 3.15 | 3.440 | 17.02 |  0 |  0 |    3 |    2 |
+
+
+

Read the documentation for knitr::kable() to see the other ways in which you can customize the table. For even deeper customization, consider the gt, huxtable, reactable, kableExtra, xtable, stargazer, pander, tables, and ascii packages. Each provides a set of tools for returning formatted tables from R code.
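For example, kable()’s documented arguments let you round numbers and add a caption. A sketch:

```{r}
knitr::kable(
  mtcars[1:5, ], 
  digits = 1, 
  caption = "Motor Trend car road tests."
)
```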

+

There is also a rich set of options for controlling how figures are embedded. You’ll learn about these in #chp-communicate-plots.

+ +
+

+Exercises

+ +
+
+ +
+

+Caching

+

Normally, each render of a document starts from a completely clean slate. This is great for reproducibility, because it ensures that you’ve captured every important computation in code. However, it can be painful if you have some computations that take a long time. The solution is cache: true.

+

You can enable the Knitr cache at the document level for caching the results of all computations in a document using standard YAML options:

+
---
+title: "My Document"
+execute: 
+  cache: true
+---
+

You can also enable caching at the chunk level for caching the results of computation in a specific chunk:

+
+
```{r}
+#| cache: true
+
+# code for lengthy computation...
+```
+
+

When set, this will save the output of the chunk to a specially named file on disk. On subsequent runs, knitr will check to see if the code has changed, and if it hasn’t, it will reuse the cached results.

+

The caching system must be used with care, because by default it is based on the code only, not its dependencies. For example, here the processed_data chunk depends on the raw-data chunk:

+
```{r}
+#| label: raw-data
+
+rawdata <- readr::read_csv("a_very_large_file.csv")
+```
+
+```{r}
+#| label: processed_data
+#| cache: true
+
+processed_data <- rawdata |> 
+  filter(!is.na(import_var)) |> 
+  mutate(new_variable = complicated_transformation(x, y, z))
+```
+

Caching the processed_data chunk means that it will get re-run if the dplyr pipeline is changed, but it won’t get rerun if the read_csv() call changes. You can avoid that problem with the dependson chunk option:

+
```{r}
+#| label: processed-data
+#| cache: true
+#| dependson: "raw-data"
+
+processed_data <- rawdata |> 
+  filter(!is.na(import_var)) |> 
+  mutate(new_variable = complicated_transformation(x, y, z))
+```
+

dependson should contain a character vector of every chunk that the cached chunk depends on. Knitr will update the results for the cached chunk whenever it detects that one of its dependencies has changed.

+

Note that the chunks won’t update if a_very_large_file.csv changes, because knitr caching only tracks changes within the .qmd file. If you want to also track changes to that file you can use the cache.extra option. This is an arbitrary R expression that will invalidate the cache whenever it changes. A good function to use is file.info(): it returns a bunch of information about the file including when it was last modified. Then you can write:

+
```{r}
+#| label: raw-data
+#| cache.extra: file.info("a_very_large_file.csv")
+
+rawdata <- readr::read_csv("a_very_large_file.csv")
+```
+

As your caching strategies get progressively more complicated, it’s a good idea to regularly clear out all your caches with knitr::clean_cache().

+

We’ve followed the advice of David Robinson (https://twitter.com/drob/status/738786604731490304) to name these chunks: each chunk is named after the primary object that it creates. This makes it easier to understand the dependson specification.

+ +
+

+Exercises

+
  1. Set up a network of chunks where d depends on c and b, and both b and c depend on a. Have each chunk print lubridate::now(), set cache: true, then verify your understanding of caching.
  2. +
+
+ +
+

+Troubleshooting

+

Troubleshooting Quarto documents can be challenging because you are no longer in an interactive R environment, and you will need to learn some new tricks. Additionally, the error could be due to issues with the Quarto document itself or due to the R code in the Quarto document.

+

One common error in documents with code chunks is duplicated chunk labels, which are especially pervasive if your workflow involves copying and pasting code chunks. To address this issue, all you need to do is to change one of your duplicated labels.

+

If the errors are due to the R code in the document, the first thing you should always try is to recreate the problem in an interactive session. Restart R, then “Run all chunks” (either from the Code menu, under Run Region, or with the keyboard shortcut Ctrl + Alt + R). If you’re lucky, that will recreate the problem, and you can figure out what’s going on interactively.

+

If that doesn’t help, there must be something different between your interactive environment and the Quarto environment. You’re going to need to systematically explore the options. The most common difference is the working directory: the working directory of a Quarto document is the directory in which it lives. Check that the working directory is what you expect by including getwd() in a chunk.

+

Next, brainstorm all the things that might cause the bug. You’ll need to systematically check that they’re the same in your R session and your Quarto session. The easiest way to do that is to set error: true on the chunk causing the problem, then use print() and str() to check that settings are as you expect.
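A minimal sketch of such a debugging chunk; exactly what you print will depend on your bug, so these calls are purely illustrative:

```{r}
#| label: debug-info
#| error: true

print(getwd())      # is the working directory what you expect?
print(.libPaths())  # are the library paths the same as your interactive session?
str(Sys.getenv("HOME"))  # check any environment variables your code relies on
```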

+
+ +
+

+YAML header

+

You can control many other “whole document” settings by tweaking the parameters of the YAML header. You might wonder what YAML stands for: it’s “YAML Ain’t Markup Language”, which is designed for representing hierarchical data in a way that’s easy for humans to read and write. Quarto uses it to control many details of the output. Here we’ll discuss three: self-contained documents, document parameters, and bibliographies.

+ +
+

+Self-contained

+

HTML documents typically have a number of external dependencies (e.g. images, CSS style sheets, JavaScript, etc.) and, by default, Quarto places these dependencies in a _files folder in the same directory as your .qmd file. If you publish the HTML file on a hosting platform (e.g., QuartoPub, https://quartopub.com/), the dependencies in this directory are published with your document and hence are available in the published report. However, if you want to email the report to a colleague, you might prefer to have a single, self-contained, HTML document that embeds all of its dependencies. You can do this by specifying the embed-resources option:

+


+
format:
+  html:
+    embed-resources: true
+

The resulting file will be self-contained, such that it will need no external files and no internet access to be displayed properly by a browser.

+
+ +
+

+Parameters

+

Quarto documents can include one or more parameters whose values can be set when you render the report. Parameters are useful when you want to re-render the same report with distinct values for various key inputs. For example, you might be producing sales reports per branch, exam results by student, or demographic summaries by country. To declare one or more parameters, use the params field.

+

This example uses a my_class parameter to determine which class of cars to display:

+
+
---
+format: html
+params:
+  my_class: "suv"
+---
+
+```{r}
+#| label: setup
+#| include: false
+
+library(tidyverse)
+
+class <- mpg |> filter(class == params$my_class)
+```
+
+# Fuel economy for `r params$my_class`s
+
+```{r}
+#| message: false
+
+ggplot(class, aes(displ, hwy)) + 
+  geom_point() + 
+  geom_smooth(se = FALSE)
+```
+
+

As you can see, parameters are available within the code chunks as a read-only list named params.
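To render the same report with a different parameter value, you can pass execute_params to quarto::quarto_render(). A sketch (the file name is hypothetical):

```{r}
#| eval: false

quarto::quarto_render(
  "fuel-economy.qmd", 
  execute_params = list(my_class = "pickup")
)
```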

+

You can write atomic vectors directly into the YAML header. You can also run arbitrary R expressions by prefacing the parameter value with !r. This is a good way to specify date/time parameters.

+
params:
+  start: !r lubridate::ymd("2015-01-01")
+  snapshot: !r lubridate::ymd_hms("2015-01-01 12:30:00")
+
+ +
+

+Bibliographies and Citations

+

Quarto can automatically generate citations and a bibliography in a number of styles. The most straightforward way of adding citations and bibliographies to a Quarto document is using the visual editor in RStudio.

+

To add a citation using the visual editor, go to Insert > Citation. Citations can be inserted from a variety of sources:

+
  1. DOI (Document Object Identifier) references.

  2. +
  3. Zotero personal or group libraries.

  4. +
  5. Searches of Crossref (https://www.crossref.org/), DataCite (https://datacite.org/), or PubMed (https://pubmed.ncbi.nlm.nih.gov/).

  6. +
  7. Your document bibliography (a .bib file in the directory of your document)

  8. +

Under the hood, the visual mode uses the standard Pandoc markdown representation for citations (e.g. [@citation]).

+

If you add a citation using one of the first three methods, the visual editor will automatically create a bibliography.bib file for you and add the reference to it. It will also add a bibliography field to the document YAML. As you add more references, this file will get populated with their citations. You can also directly edit this file using many common bibliography formats, including BibLaTeX, BibTeX, EndNote, and Medline.

+

To create a citation within your .qmd file in the source editor, use a key composed of ‘@’ + the citation identifier from the bibliography file. Then place the citation in square brackets. Here are some examples:

+
Separate multiple citations with a `;`: Blah blah [@smith04; @doe99].
+
+You can add arbitrary comments inside the square brackets: 
+Blah blah [see @doe99, pp. 33-35; also @smith04, ch. 1].
+
+Remove the square brackets to create an in-text citation: @smith04 
+says blah, or @smith04 [p. 33] says blah.
+
+Add a `-` before the citation to suppress the author's name: 
+Smith says blah [-@smith04].
+

When Quarto renders your file, it will build and append a bibliography to the end of your document. The bibliography will contain each of the cited references from your bibliography file, but it will not contain a section heading. As a result it is common practice to end your file with a section header for the bibliography, such as # References or # Bibliography.

+

You can change the style of your citations and bibliography by referencing a CSL (citation style language) file in the csl field:

+
bibliography: rmarkdown.bib
+csl: apa.csl
+

As with the bibliography field, the csl field should contain a path to the file. Here we assume that the csl file is in the same directory as the .qmd file. A good place to find CSL style files for common bibliography styles is https://github.com/citation-style-language/styles.

+
+
+ +
+

+Learning more

+

Quarto is still relatively young, and is still growing rapidly. The best place to stay on top of innovations is the official Quarto website: https://quarto.org.

+

There are two important topics that we haven’t covered here: collaboration and the details of accurately communicating your ideas to other humans. Collaboration is a vital part of modern data science, and you can make your life much easier by using version control tools, like Git and GitHub. We recommend “Happy Git with R”, a user-friendly introduction to Git and GitHub from R users, by Jenny Bryan. The book is freely available online: https://happygitwithr.com.

+

We have also not touched on what you should actually write in order to clearly communicate the results of your analysis. To improve your writing, we highly recommend reading either Style: Lessons in Clarity and Grace (https://www.amazon.com/Style-Lessons-Clarity-Grace-12th/dp/0134080416) by Joseph M. Williams & Joseph Bizup, or The Sense of Structure: Writing from the Reader’s Perspective (https://www.amazon.com/Sense-Structure-Writing-Readers-Perspective/dp/0205296327) by George Gopen. Both books will help you understand the structure of sentences and paragraphs, and give you the tools to make your writing more clear. (These books are rather expensive if purchased new, but they’re used by many English classes so there are plenty of cheap second-hand copies). George Gopen also has a number of short articles on writing at https://www.georgegopen.com/the-litigation-articles.html. They are aimed at lawyers, but almost everything applies to data scientists too.

+ + +
+
diff --git a/oreilly/rectangling.html b/oreilly/rectangling.html new file mode 100644 index 0000000..fe624d0 --- /dev/null +++ b/oreilly/rectangling.html @@ -0,0 +1,1148 @@ +
+

Data rectangling

+
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

In this chapter, you’ll learn the art of data rectangling: taking data that is fundamentally tree-like and converting it into rectangular data frames made up of rows and columns. This is important because hierarchical data is surprisingly common, especially when working with data that comes from the web.

+

To learn about rectangling, you’ll need to first learn about lists, the data structure that makes hierarchical data possible. Then you’ll learn about two crucial tidyr functions: unnest_longer() and unnest_wider(). We’ll then show you a few case studies, applying these simple functions again and again to solve real problems. We’ll finish off by talking about JSON, the most frequent source of hierarchical datasets and a common format for data exchange on the web.

+ +
+

+Prerequisites

+

In this chapter we’ll use many functions from tidyr, a core member of the tidyverse. We’ll also use repurrrsive to provide some interesting datasets for rectangling practice, and we’ll finish by using jsonlite to read JSON files into R lists.

+
+
library(tidyverse)
+library(repurrrsive)
+library(jsonlite)
+
+
+
+ +
+

+Lists

+

So far you’ve worked with data frames that contain simple vectors like integers, numbers, characters, date-times, and factors. These vectors are simple because they’re homogeneous: every element is the same type. If you want to store elements of different types in the same vector, you’ll need a list, which you create with list():

+
+
x1 <- list(1:4, "a", TRUE)
+x1
+#> [[1]]
+#> [1] 1 2 3 4
+#> 
+#> [[2]]
+#> [1] "a"
+#> 
+#> [[3]]
+#> [1] TRUE
+
+

It’s often convenient to name the components, or children, of a list, which you can do in the same way as naming the columns of a tibble:

+
+
x2 <- list(a = 1:2, b = 1:3, c = 1:4)
+x2
+#> $a
+#> [1] 1 2
+#> 
+#> $b
+#> [1] 1 2 3
+#> 
+#> $c
+#> [1] 1 2 3 4
+
+

Even for these very simple lists, printing takes up quite a lot of space. A useful alternative is str(), which generates a compact display of the structure, de-emphasizing the contents:

+
+
str(x1)
+#> List of 3
+#>  $ : int [1:4] 1 2 3 4
+#>  $ : chr "a"
+#>  $ : logi TRUE
+str(x2)
+#> List of 3
+#>  $ a: int [1:2] 1 2
+#>  $ b: int [1:3] 1 2 3
+#>  $ c: int [1:4] 1 2 3 4
+
+

As you can see, str() displays each child of the list on its own line. It displays the name, if present, then an abbreviation of the type, then the first few values.

+ +
+

+Hierarchy

+

Lists can contain any type of object, including other lists. This makes them suitable for representing hierarchical (tree-like) structures:

+
+
x3 <- list(list(1, 2), list(3, 4))
+str(x3)
+#> List of 2
+#>  $ :List of 2
+#>   ..$ : num 1
+#>   ..$ : num 2
+#>  $ :List of 2
+#>   ..$ : num 3
+#>   ..$ : num 4
+
+

This is notably different from c(), which generates a flat vector:

+
+
c(c(1, 2), c(3, 4))
+#> [1] 1 2 3 4
+
+x4 <- c(list(1, 2), list(3, 4))
+str(x4)
+#> List of 4
+#>  $ : num 1
+#>  $ : num 2
+#>  $ : num 3
+#>  $ : num 4
+
+

As lists get more complex, str() gets more useful, as it lets you see the hierarchy at a glance:

+
+
x5 <- list(1, list(2, list(3, list(4, list(5)))))
+str(x5)
+#> List of 2
+#>  $ : num 1
+#>  $ :List of 2
+#>   ..$ : num 2
+#>   ..$ :List of 2
+#>   .. ..$ : num 3
+#>   .. ..$ :List of 2
+#>   .. .. ..$ : num 4
+#>   .. .. ..$ :List of 1
+#>   .. .. .. ..$ : num 5
+
+

As lists get even larger and more complex, str() eventually starts to fail, and you’ll need to switch to View() (an RStudio feature). #fig-view-collapsed shows the result of calling View(x4). The viewer starts by showing just the top level of the list, but you can interactively expand any of the components to see more, as in #fig-view-expand-1. RStudio will also show you the code you need to access that element, as in #fig-view-expand-2. We’ll come back to how this code works in #sec-subset-one.

+
+
+ +

A screenshot of RStudio showing the list-viewer. It shows the two children of x4: the first child is a double vector and the second child is a list. A rightward facing triangle indicates that the second child itself has children but you can't see them.

+
Figure 22.1: The RStudio view lets you interactively explore a complex list. The viewer opens showing only the top level of the list.
+
+
+
+
+
+ +

Another screenshot of the list-viewer having expanded the second child of x4. It also has two children, a double vector and another list.

+
Figure 22.2: Clicking on the rightward facing triangle expands that component of the list so that you can also see its children.
+
+
+
+
+
+ +

Another screenshot, having expanded the grandchild of x4 to see its two children, again a double vector and a list.

+
Figure 22.3: You can repeat this operation as many times as needed to get to the data you’re interested in. Note the bottom-left corner: if you click an element of the list, RStudio will give you the subsetting code needed to access it, in this case x4[[2]][[2]][[2]].
+
+
+
+
+ +
+

+List-columns

+

Lists can also live inside a tibble, where we call them list-columns. List-columns are useful because they allow you to shoehorn in objects that wouldn’t usually belong in a tibble. In particular, list-columns are used a lot in the tidymodels ecosystem (https://www.tidymodels.org), because they allow you to store things like models or resamples in a data frame.

+

Here’s a simple example of a list-column:

+
+
df <- tibble(
+  x = 1:2, 
+  y = c("a", "b"),
+  z = list(list(1, 2), list(3, 4, 5))
+)
+df
+#> # A tibble: 2 × 3
+#>       x y     z         
+#>   <int> <chr> <list>    
+#> 1     1 a     <list [2]>
+#> 2     2 b     <list [3]>
+
+

There’s nothing special about lists in a tibble; they behave like any other column:

+
+
df |> 
+  filter(x == 1)
+#> # A tibble: 1 × 3
+#>       x y     z         
+#>   <int> <chr> <list>    
+#> 1     1 a     <list [2]>
+
+

Computing with list-columns is harder, but that’s because computing with lists is harder in general; we’ll come back to that in #chp-iteration. In this chapter, we’ll focus on unnesting list-columns out into regular variables so you can use your existing tools on them.

+

The default print method just displays a rough summary of the contents. The list column could be arbitrarily complex, so there’s no good way to print it. If you want to see it, you’ll need to pull the list-column out and apply one of the techniques that you learned above:

+
+
df |> 
+  filter(x == 1) |> 
+  pull(z) |> 
+  str()
+#> List of 1
+#>  $ :List of 2
+#>   ..$ : num 1
+#>   ..$ : num 2
+
+

Similarly, if you View() a data frame in RStudio, you’ll get the standard tabular view, which doesn’t allow you to selectively expand list columns. To explore those fields you’ll need to pull() and view, e.g. df |> pull(z) |> View().

+

+Base R +


+ +

It’s possible to put a list in a column of a data.frame, but it’s a lot fiddlier because data.frame() treats a list as a list of columns:

+
data.frame(x = list(1:3, 3:5))
+#>   x.1.3 x.3.5
+#> 1     1     3
+#> 2     2     4
+#> 3     3     5
+

You can force data.frame() to treat a list as a list of rows by wrapping it in I(), but the result doesn’t print particularly well:

+
data.frame(
+  x = I(list(1:2, 3:5)), 
+  y = c("1, 2", "3, 4, 5")
+)
+#>         x       y
+#> 1    1, 2    1, 2
+#> 2 3, 4, 5 3, 4, 5
+

It’s easier to use list-columns with tibbles because tibble() treats lists like vectors and the print method has been designed with lists in mind.
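For comparison, here’s a sketch of the tibble equivalent of the data.frame above, which keeps the list intact and prints compactly:

```{r}
tibble(
  x = list(1:2, 3:5), 
  y = c("1, 2", "3, 4, 5")
)
#> # A tibble: 2 × 2
#>   x         y      
#>   <list>    <chr>  
#> 1 <int [2]> 1, 2   
#> 2 <int [3]> 3, 4, 5
```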

+ +
+
+ +
+

+Unnesting

+

Now that you’ve learned the basics of lists and list-columns, let’s explore how you can turn them back into regular rows and columns. Here we’ll use very simple sample data so you can get the basic idea; in the next section we’ll switch to real data.

+

List-columns tend to come in two basic forms: named and unnamed. When the children are named, they tend to have the same names in every row. For example, in df1, every element of list-column y has two elements named a and b. Named list-columns naturally unnest into columns: each named element becomes a new named column.

+
+
df1 <- tribble(
+  ~x, ~y,
+  1, list(a = 11, b = 12),
+  2, list(a = 21, b = 22),
+  3, list(a = 31, b = 32),
+)
+
+

When the children are unnamed, the number of elements tends to vary from row to row. For example, in df2, the elements of list-column y are unnamed and vary in length from one to three. Unnamed list-columns naturally unnest into rows: you’ll get one row for each child.

+
+
+df2 <- tribble(
+  ~x, ~y,
+  1, list(11, 12, 13),
+  2, list(21),
+  3, list(31, 32),
+)
+
+

tidyr provides two functions for these two cases: unnest_wider() and unnest_longer(). The following sections explain how they work.

+ +
+

+unnest_wider() +

+

When each row has the same number of elements with the same names, like df1, it’s natural to put each component into its own column with unnest_wider():

+
+
df1 |> 
+  unnest_wider(y)
+#> # A tibble: 3 × 3
+#>       x     a     b
+#>   <dbl> <dbl> <dbl>
+#> 1     1    11    12
+#> 2     2    21    22
+#> 3     3    31    32
+
+

By default, the names of the new columns come exclusively from the names of the list elements, but you can use the names_sep argument to request that they combine the column name and the element name. This is useful for disambiguating repeated names.

+
+
df1 |> 
+  unnest_wider(y, names_sep = "_")
+#> # A tibble: 3 × 3
+#>       x   y_a   y_b
+#>   <dbl> <dbl> <dbl>
+#> 1     1    11    12
+#> 2     2    21    22
+#> 3     3    31    32
+
+

We can also use unnest_wider() with unnamed list-columns, as in df2. Since columns require names but the list lacks them, unnest_wider() will label them with consecutive integers:

+
+
df2 |> 
+  unnest_wider(y, names_sep = "_")
+#> # A tibble: 3 × 4
+#>       x   y_1   y_2   y_3
+#>   <dbl> <dbl> <dbl> <dbl>
+#> 1     1    11    12    13
+#> 2     2    21    NA    NA
+#> 3     3    31    32    NA
+
+

You’ll notice that unnest_wider(), much like pivot_wider(), turns implicit missing values into explicit missing values.

+
+ +
+

+unnest_longer() +

+

When each row contains an unnamed list, it’s most natural to put each element into its own row with unnest_longer():

+
+
df2 |> 
+  unnest_longer(y)
+#> # A tibble: 6 × 2
+#>       x     y
+#>   <dbl> <dbl>
+#> 1     1    11
+#> 2     1    12
+#> 3     1    13
+#> 4     2    21
+#> 5     3    31
+#> 6     3    32
+
+

Note how x is duplicated for each element inside of y: we get one row of output for each element inside the list-column. But what happens if one of the elements is empty, as in the following example?

+
+
df6 <- tribble(
+  ~x, ~y,
+  "a", list(1, 2),
+  "b", list(3),
+  "c", list()
+)
+df6 |> unnest_longer(y)
+#> # A tibble: 3 × 2
+#>   x         y
+#>   <chr> <dbl>
+#> 1 a         1
+#> 2 a         2
+#> 3 b         3
+
+

We get zero rows in the output, so the row effectively disappears. Once https://github.com/tidyverse/tidyr/issues/1339 is fixed, you’ll be able to keep this row, replacing y with NA by setting keep_empty = TRUE.

+

You can also unnest named list-columns, like df1$y, into rows. Because the elements are named, and those names might be useful data, tidyr puts them in a new column with the suffix _id:

+
+
df1 |> 
+  unnest_longer(y)
+#> # A tibble: 6 × 3
+#>       x     y y_id 
+#>   <dbl> <dbl> <chr>
+#> 1     1    11 a    
+#> 2     1    12 b    
+#> 3     2    21 a    
+#> 4     2    22 b    
+#> 5     3    31 a    
+#> 6     3    32 b
+
+

If you don’t want these ids, you can suppress them with indices_include = FALSE. On the other hand, it’s sometimes useful to retain the position of unnamed elements in unnamed list-columns. You can do this with indices_include = TRUE:

+
+
df2 |> 
+  unnest_longer(y, indices_include = TRUE)
+#> # A tibble: 6 × 3
+#>       x     y  y_id
+#>   <dbl> <dbl> <int>
+#> 1     1    11     1
+#> 2     1    12     2
+#> 3     1    13     3
+#> 4     2    21     1
+#> 5     3    31     1
+#> 6     3    32     2
+
+
+ +
+

+Inconsistent types

+

What happens if you unnest a list-column that contains different types of vector? For example, take the following dataset where the list-column y contains two numbers, a character, a factor, and a logical, which can’t normally be mixed in a single column.

+
+
df4 <- tribble(
+  ~x, ~y,
+  "a", list(1, "a"),
+  "b", list(TRUE, factor("a"), 5)
+)
+
+

unnest_longer() always keeps the set of columns unchanged, while changing the number of rows. So what happens? How does unnest_longer() produce five rows while keeping everything in y?

+
+
df4 |> 
+  unnest_longer(y)
+#> # A tibble: 5 × 2
+#>   x     y        
+#>   <chr> <list>   
+#> 1 a     <dbl [1]>
+#> 2 a     <chr [1]>
+#> 3 b     <lgl [1]>
+#> 4 b     <fct [1]>
+#> 5 b     <dbl [1]>
+
+

As you can see, the output contains a list-column, but every element of the list-column contains a single element. Because unnest_longer() can’t find a common type of vector, it keeps the original types in a list-column. You might wonder if this breaks the commandment that every element of a column must be the same type. Not quite: every element is still a list, even though the contents of each element can be of different types.

+

What happens if you find this problem in a dataset you’re trying to rectangle? There are two basic options. You could use the transform argument to coerce all inputs to a common type. It’s not particularly useful here because there’s only really one class that these five classes can all be converted to: character.

+
+
df4 |> 
+  unnest_longer(y, transform = as.character)
+#> # A tibble: 5 × 2
+#>   x     y    
+#>   <chr> <chr>
+#> 1 a     1    
+#> 2 a     a    
+#> 3 b     TRUE 
+#> 4 b     a    
+#> 5 b     5
+
+

Another option would be to filter down to the rows that have values of a specific type:

+
+
df4 |> 
+  unnest_longer(y) |> 
+  filter(map_lgl(y, is.numeric))
+#> # A tibble: 2 × 2
+#>   x     y        
+#>   <chr> <list>   
+#> 1 a     <dbl [1]>
+#> 2 b     <dbl [1]>
+
+

Then you can call unnest_longer() once more:

+
+
df4 |> 
+  unnest_longer(y) |> 
+  filter(map_lgl(y, is.numeric)) |> 
+  unnest_longer(y)
+#> # A tibble: 2 × 2
+#>   x         y
+#>   <chr> <dbl>
+#> 1 a         1
+#> 2 b         5
+
+

You’ll learn more about the purrr map functions in #chp-iteration.

+
+ +
+

+Other functions

+

tidyr has a few other useful rectangling functions that we’re not going to cover in this book; see the tidyr reference, https://tidyr.tidyverse.org/reference/, for the full list.

+

These are good to know about when you’re reading other people’s code or tackling rarer rectangling challenges.

+
+ +
+

+Exercises

+
  1. +

    From time to time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of y and z are aligned (i.e. y and z will always have the same length within a row, and the first value of y corresponds to the first value of z). What happens if you apply two unnest_longer() calls to this data frame? How can you preserve the relationship between x and y? (Hint: carefully read the docs).

    +
    +
    df4 <- tribble(
    +  ~x, ~y, ~z,
    +  "a", list("y-a-1", "y-a-2"), list("z-a-1", "z-a-2"),
    +  "b", list("y-b-1", "y-b-2", "y-b-3"), list("z-b-1", "z-b-2", "z-b-3")
    +)
    +
    +
  2. +
+
+ +
+

+Case studies

+

The main difference between the simple examples we used above and real data is that real data typically contains multiple levels of nesting that require multiple calls to unnest_longer() and/or unnest_wider(). This section will work through four real rectangling challenges using datasets from the repurrrsive package, inspired by datasets that we’ve encountered in the wild.

+ +
+

+Very wide data

+

We’ll start with gh_repos. This is a list that contains data about a collection of GitHub repositories retrieved using the GitHub API. It’s a very deeply nested list so it’s difficult to show the structure in this book; you might want to explore a little on your own with View(gh_repos) before we continue.

+

gh_repos is a list, but our tools work with list-columns, so we’ll begin by putting it into a tibble. We call the column json for reasons we’ll get to later.

+
+
repos <- tibble(json = gh_repos)
+repos
+#> # A tibble: 6 × 1
+#>   json       
+#>   <list>     
+#> 1 <list [30]>
+#> 2 <list [30]>
+#> 3 <list [30]>
+#> 4 <list [26]>
+#> 5 <list [30]>
+#> 6 <list [30]>
+
+

This tibble contains 6 rows, one row for each child of gh_repos. Each row contains an unnamed list with either 26 or 30 rows. Since these are unnamed, we’ll start with unnest_longer() to put each child in its own row:

+
+
repos |> 
+  unnest_longer(json)
+#> # A tibble: 176 × 1
+#>   json             
+#>   <list>           
+#> 1 <named list [68]>
+#> 2 <named list [68]>
+#> 3 <named list [68]>
+#> 4 <named list [68]>
+#> 5 <named list [68]>
+#> 6 <named list [68]>
+#> # … with 170 more rows
+
+

At first glance, it might seem like we haven’t improved the situation: while we have more rows (176 instead of 6), each element of json is still a list. However, there’s an important difference: now each element is a named list, so we can use unnest_wider() to put each element into its own column:

+
+
repos |> 
+  unnest_longer(json) |> 
+  unnest_wider(json) 
+#> # A tibble: 176 × 68
+#>        id name  full_…¹ owner        private html_…² descr…³ fork  url   forks…⁴
+#>     <int> <chr> <chr>   <list>       <lgl>   <chr>   <chr>   <lgl> <chr> <chr>  
+#> 1  6.12e7 after gaborc… <named list> FALSE   https:… Run Co… FALSE http… https:…
+#> 2  4.05e7 argu… gaborc… <named list> FALSE   https:… Declar… FALSE http… https:…
+#> 3  3.64e7 ask   gaborc… <named list> FALSE   https:… Friend… FALSE http… https:…
+#> 4  3.49e7 base… gaborc… <named list> FALSE   https:… Do we … FALSE http… https:…
+#> 5  6.16e7 cite… gaborc… <named list> FALSE   https:… Test R… TRUE  http… https:…
+#> 6  3.39e7 clis… gaborc… <named list> FALSE   https:… Unicod… FALSE http… https:…
+#> # … with 170 more rows, 58 more variables: keys_url <chr>,
+#> #   collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
+#> #   issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
+#> #   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
+#> #   git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
+#> #   languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
+#> #   subscribers_url <chr>, subscription_url <chr>, commits_url <chr>, …
+
+

This has worked but the result is a little overwhelming: there are so many columns that tibble doesn’t even print all of them! We can see them all with names():

+
+
repos |> 
+  unnest_longer(json) |> 
+  unnest_wider(json) |> 
+  names()
+#>  [1] "id"                "name"              "full_name"        
+#>  [4] "owner"             "private"           "html_url"         
+#>  [7] "description"       "fork"              "url"              
+#> [10] "forks_url"         "keys_url"          "collaborators_url"
+#> [13] "teams_url"         "hooks_url"         "issue_events_url" 
+#> [16] "events_url"        "assignees_url"     "branches_url"     
+#> [19] "tags_url"          "blobs_url"         "git_tags_url"     
+#> [22] "git_refs_url"      "trees_url"         "statuses_url"     
+#> [25] "languages_url"     "stargazers_url"    "contributors_url" 
+#> [28] "subscribers_url"   "subscription_url"  "commits_url"      
+#> [31] "git_commits_url"   "comments_url"      "issue_comment_url"
+#> [34] "contents_url"      "compare_url"       "merges_url"       
+#> [37] "archive_url"       "downloads_url"     "issues_url"       
+#> [40] "pulls_url"         "milestones_url"    "notifications_url"
+#> [43] "labels_url"        "releases_url"      "deployments_url"  
+#> [46] "created_at"        "updated_at"        "pushed_at"        
+#> [49] "git_url"           "ssh_url"           "clone_url"        
+#> [52] "svn_url"           "homepage"          "size"             
+#> [55] "stargazers_count"  "watchers_count"    "language"         
+#> [58] "has_issues"        "has_downloads"     "has_wiki"         
+#> [61] "has_pages"         "forks_count"       "mirror_url"       
+#> [64] "open_issues_count" "forks"             "open_issues"      
+#> [67] "watchers"          "default_branch"
+
+

Let’s select a few that look interesting:

+
+
repos |> 
+  unnest_longer(json) |> 
+  unnest_wider(json) |> 
+  select(id, full_name, owner, description)
+#> # A tibble: 176 × 4
+#>         id full_name               owner             description                
+#>      <int> <chr>                   <list>            <chr>                      
+#> 1 61160198 gaborcsardi/after       <named list [17]> Run Code in the Background 
+#> 2 40500181 gaborcsardi/argufy      <named list [17]> Declarative function argum…
+#> 3 36442442 gaborcsardi/ask         <named list [17]> Friendly CLI interaction i…
+#> 4 34924886 gaborcsardi/baseimports <named list [17]> Do we get warnings for und…
+#> 5 61620661 gaborcsardi/citest      <named list [17]> Test R package and repo fo…
+#> 6 33907457 gaborcsardi/clisymbols  <named list [17]> Unicode symbols for CLI ap…
+#> # … with 170 more rows
+
+

You can use this to work back to understand how gh_repos was structured: each child was a GitHub user containing a list of up to 30 GitHub repositories that they created.

+

owner is another list-column, and since it contains a named list, we can use unnest_wider() to get at the values:

+
+
repos |> 
+  unnest_longer(json) |> 
+  unnest_wider(json) |> 
+  select(id, full_name, owner, description) |> 
+  unnest_wider(owner)
+#> Error in `unpack()`:
+#> ! Names must be unique.
+#> ✖ These names are duplicated:
+#>   * "id" at locations 1 and 4.
+#> ℹ Use argument `names_repair` to specify repair strategy.
+
+ +

Uh oh, this list column also contains an id column and we can’t have two id columns in the same data frame. Rather than following the advice to use names_repair (which would also work), we’ll instead use names_sep:

+
+
repos |> 
+  unnest_longer(json) |> 
+  unnest_wider(json) |> 
+  select(id, full_name, owner, description) |> 
+  unnest_wider(owner, names_sep = "_")
+#> # A tibble: 176 × 20
+#>       id full_…¹ owner…² owner…³ owner…⁴ owner…⁵ owner…⁶ owner…⁷ owner…⁸ owner…⁹
+#>    <int> <chr>   <chr>     <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
+#> 1 6.12e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:…
+#> 2 4.05e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:…
+#> 3 3.64e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:…
+#> 4 3.49e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:…
+#> 5 6.16e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:…
+#> 6 3.39e7 gaborc… gaborc…  660288 https:… ""      https:… https:… https:… https:…
+#> # … with 170 more rows, 10 more variables: owner_gists_url <chr>,
+#> #   owner_starred_url <chr>, owner_subscriptions_url <chr>,
+#> #   owner_organizations_url <chr>, owner_repos_url <chr>,
+#> #   owner_events_url <chr>, owner_received_events_url <chr>, owner_type <chr>,
+#> #   owner_site_admin <lgl>, description <chr>, and abbreviated variable names
+#> #   ¹​full_name, ²​owner_login, ³​owner_id, ⁴​owner_avatar_url, ⁵​owner_gravatar_id,
+#> #   ⁶​owner_url, ⁷​owner_html_url, ⁸​owner_followers_url, ⁹​owner_following_url
+
+

This gives another wide dataset, but you can see that owner appears to contain a lot of additional data about the person who “owns” the repository.

+
+ +
+

+Relational data

+

Nested data is sometimes used to represent data that we’d usually spread out into multiple data frames. For example, take got_chars. Like gh_repos it’s a list, so we start by turning it into a list-column of a tibble:

+
+
chars <- tibble(json = got_chars)
+chars
+#> # A tibble: 30 × 1
+#>   json             
+#>   <list>           
+#> 1 <named list [18]>
+#> 2 <named list [18]>
+#> 3 <named list [18]>
+#> 4 <named list [18]>
+#> 5 <named list [18]>
+#> 6 <named list [18]>
+#> # … with 24 more rows
+
+

The json column contains named elements, so we’ll start by widening it:

+
+
chars |> 
+  unnest_wider(json)
+#> # A tibble: 30 × 18
+#>   url            id name  gender culture born  died  alive titles aliases father
+#>   <chr>       <int> <chr> <chr>  <chr>   <chr> <chr> <lgl> <list> <list>  <chr> 
+#> 1 https://ww…  1022 Theo… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
+#> 2 https://ww…  1052 Tyri… Male   ""      "In … ""    TRUE  <chr>  <chr>   ""    
+#> 3 https://ww…  1074 Vict… Male   "Ironb… "In … ""    TRUE  <chr>  <chr>   ""    
+#> 4 https://ww…  1109 Will  Male   ""      ""    "In … FALSE <chr>  <chr>   ""    
+#> 5 https://ww…  1166 Areo… Male   "Norvo… "In … ""    TRUE  <chr>  <chr>   ""    
+#> 6 https://ww…  1267 Chett Male   ""      "At … "In … FALSE <chr>  <chr>   ""    
+#> # … with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
+#> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
+#> #   playedBy <list>
+
+

And selecting a few columns to make it easier to read:

+
+
characters <- chars |> 
+  unnest_wider(json) |> 
+  select(id, name, gender, culture, born, died, alive)
+characters
+#> # A tibble: 30 × 7
+#>      id name              gender culture    born                     died  alive
+#>   <int> <chr>             <chr>  <chr>      <chr>                    <chr> <lgl>
+#> 1  1022 Theon Greyjoy     Male   "Ironborn" "In 278 AC or 279 AC, a… ""    TRUE 
+#> 2  1052 Tyrion Lannister  Male   ""         "In 273 AC, at Casterly… ""    TRUE 
+#> 3  1074 Victarion Greyjoy Male   "Ironborn" "In 268 AC or before, a… ""    TRUE 
+#> 4  1109 Will              Male   ""         ""                       "In … FALSE
+#> 5  1166 Areo Hotah        Male   "Norvoshi" "In 257 AC or before, a… ""    TRUE 
+#> 6  1267 Chett             Male   ""         "At Hag's Mire"          "In … FALSE
+#> # … with 24 more rows
+
+

There are also many list-columns:

+
+
chars |> 
+  unnest_wider(json) |> 
+  select(id, where(is.list))
+#> # A tibble: 30 × 8
+#>      id titles    aliases    allegiances books     povBooks  tvSeries  playedBy 
+#>   <int> <list>    <list>     <list>      <list>    <list>    <list>    <list>   
+#> 1  1022 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
+#> 2  1052 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
+#> 3  1074 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
+#> 4  1109 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr [1]>
+#> 5  1166 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
+#> 6  1267 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]>
+#> # … with 24 more rows
+
+

Let’s explore the titles column. It’s an unnamed list-column, so we’ll unnest it into rows:

+
+
chars |> 
+  unnest_wider(json) |> 
+  select(id, titles) |> 
+  unnest_longer(titles)
+#> # A tibble: 60 × 2
+#>      id titles                                              
+#>   <int> <chr>                                               
+#> 1  1022 Prince of Winterfell                                
+#> 2  1022 Captain of Sea Bitch                                
+#> 3  1022 Lord of the Iron Islands (by law of the green lands)
+#> 4  1052 Acting Hand of the King (former)                    
+#> 5  1052 Master of Coin (former)                             
+#> 6  1074 Lord Captain of the Iron Fleet                      
+#> # … with 54 more rows
+
+

You might expect to see this data in its own table because it would be easy to join to the characters data as needed. To do so, we’ll do a little cleaning: removing the rows containing empty strings and renaming titles to title since each row now only contains a single title.

+
+
titles <- chars |> 
+  unnest_wider(json) |> 
+  select(id, titles) |> 
+  unnest_longer(titles) |> 
+  filter(titles != "") |> 
+  rename(title = titles)
+titles
+#> # A tibble: 53 × 2
+#>      id title                                               
+#>   <int> <chr>                                               
+#> 1  1022 Prince of Winterfell                                
+#> 2  1022 Captain of Sea Bitch                                
+#> 3  1022 Lord of the Iron Islands (by law of the green lands)
+#> 4  1052 Acting Hand of the King (former)                    
+#> 5  1052 Master of Coin (former)                             
+#> 6  1074 Lord Captain of the Iron Fleet                      
+#> # … with 47 more rows
+
+

Now, for example, we could use this table to find all the characters that are captains and see all their titles:

+
+
captains <- titles |> filter(str_detect(title, "Captain"))
+captains
+#> # A tibble: 5 × 2
+#>      id title                                 
+#>   <int> <chr>                                 
+#> 1  1022 Captain of Sea Bitch                  
+#> 2  1074 Lord Captain of the Iron Fleet        
+#> 3  1166 Captain of the Guard at Sunspear      
+#> 4   150 Captain of the Black Wind             
+#> 5    60 Captain of the Golden Storm (formerly)
+
+characters |> 
+  select(id, name) |> 
+  inner_join(titles, by = "id", multiple = "all")
+#> # A tibble: 53 × 3
+#>      id name              title                                               
+#>   <int> <chr>             <chr>                                               
+#> 1  1022 Theon Greyjoy     Prince of Winterfell                                
+#> 2  1022 Theon Greyjoy     Captain of Sea Bitch                                
+#> 3  1022 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands)
+#> 4  1052 Tyrion Lannister  Acting Hand of the King (former)                    
+#> 5  1052 Tyrion Lannister  Master of Coin (former)                             
+#> 6  1074 Victarion Greyjoy Lord Captain of the Iron Fleet                      
+#> # … with 47 more rows
+
+

You could imagine creating a table like this for each of the list-columns, then using joins to combine them with the character data as you need it.
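For example, here’s the same recipe applied to the aliases list-column (a minimal sketch; as with titles, we drop empty strings):

aliases <- chars |> 
  unnest_wider(json) |> 
  select(id, aliases) |> 
  unnest_longer(aliases) |> 
  filter(aliases != "") |> 
  rename(alias = aliases)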

+
+ +
+

+A dash of text analysis

+

What if we wanted to find the most common words in the titles? One simple approach starts by using #chp-https://stringr.tidyverse.org/reference/str_split to break each element of title up into words by splitting on " ":

+
+
titles |> 
+  mutate(word = str_split(title, " "), .keep = "unused")
+#> # A tibble: 53 × 2
+#>      id word      
+#>   <int> <list>    
+#> 1  1022 <chr [3]> 
+#> 2  1022 <chr [4]> 
+#> 3  1022 <chr [11]>
+#> 4  1052 <chr [6]> 
+#> 5  1052 <chr [4]> 
+#> 6  1074 <chr [6]> 
+#> # … with 47 more rows
+
+

This creates an unnamed, variable-length list-column, so we can use #chp-https://tidyr.tidyverse.org/reference/unnest_longer:

+
+
titles |> 
+  mutate(word = str_split(title, " "), .keep = "unused") |> 
+  unnest_longer(word)
+#> # A tibble: 202 × 2
+#>      id word      
+#>   <int> <chr>     
+#> 1  1022 Prince    
+#> 2  1022 of        
+#> 3  1022 Winterfell
+#> 4  1022 Captain   
+#> 5  1022 of        
+#> 6  1022 Sea       
+#> # … with 196 more rows
+
+

And then we can count that column to find the most common words:

+
+
titles |> 
+  mutate(word = str_split(title, " "), .keep = "unused") |> 
+  unnest_longer(word) |> 
+  count(word, sort = TRUE)
+#> # A tibble: 78 × 2
+#>   word        n
+#>   <chr>   <int>
+#> 1 of         41
+#> 2 the        29
+#> 3 Lord        9
+#> 4 Hand        6
+#> 5 Captain     5
+#> 6 King        5
+#> # … with 72 more rows
+
+

Some of those words are not very interesting, so we could create a list of common words to drop; in text analysis, these are commonly called stop words.

+
+
stop_words <- tibble(word = c("of", "the"))
+
+titles |> 
+  mutate(word = str_split(title, " "), .keep = "unused") |> 
+  unnest_longer(word) |> 
+  anti_join(stop_words) |> 
+  count(word, sort = TRUE)
+#> Joining with `by = join_by(word)`
+#> # A tibble: 76 × 2
+#>   word         n
+#>   <chr>    <int>
+#> 1 Lord         9
+#> 2 Hand         6
+#> 3 Captain      5
+#> 4 King         5
+#> 5 Princess     5
+#> 6 Queen        5
+#> # … with 70 more rows
+
+

Breaking up text into individual fragments is a powerful idea that underlies much of text analysis. If this sounds interesting, a good place to learn more is #chp-https://www.tidytextmining.com by Julia Silge and David Robinson.

+
+ +
+

+Deeply nested

+

We’ll finish off these case studies with a list-column that’s very deeply nested and requires repeated rounds of #chp-https://tidyr.tidyverse.org/reference/unnest_wider and #chp-https://tidyr.tidyverse.org/reference/unnest_longer to unravel: gmaps_cities. This is a two-column tibble containing five city names and the results of using Google’s #chp-https://developers.google.com/maps/documentation/geocoding to determine their location:

+
+
gmaps_cities
+#> # A tibble: 5 × 2
+#>   city       json            
+#>   <chr>      <list>          
+#> 1 Houston    <named list [2]>
+#> 2 Washington <named list [2]>
+#> 3 New York   <named list [2]>
+#> 4 Chicago    <named list [2]>
+#> 5 Arlington  <named list [2]>
+
+

json is a list-column with internal names, so we start with an #chp-https://tidyr.tidyverse.org/reference/unnest_wider:

+
+
gmaps_cities |> 
+  unnest_wider(json)
+#> # A tibble: 5 × 3
+#>   city       results    status
+#>   <chr>      <list>     <chr> 
+#> 1 Houston    <list [1]> OK    
+#> 2 Washington <list [2]> OK    
+#> 3 New York   <list [1]> OK    
+#> 4 Chicago    <list [1]> OK    
+#> 5 Arlington  <list [2]> OK
+
+

This gives us the status and the results. We’ll drop the status column since they’re all OK; in a real analysis, you’d also want to capture all the rows where status != "OK" and figure out what went wrong.
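A minimal version of that check might look like this sketch (here it would return an empty tibble, since every status is OK):

gmaps_cities |> 
  unnest_wider(json) |> 
  filter(status != "OK")

results is an unnamed list, with either one or two elements (we’ll see why shortly), so we’ll unnest it into rows: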

+
+
gmaps_cities |> 
+  unnest_wider(json) |> 
+  select(-status) |> 
+  unnest_longer(results)
+#> # A tibble: 7 × 2
+#>   city       results         
+#>   <chr>      <list>          
+#> 1 Houston    <named list [5]>
+#> 2 Washington <named list [5]>
+#> 3 Washington <named list [5]>
+#> 4 New York   <named list [5]>
+#> 5 Chicago    <named list [5]>
+#> 6 Arlington  <named list [5]>
+#> # … with 1 more row
+
+

Now results is a named list, so we’ll use #chp-https://tidyr.tidyverse.org/reference/unnest_wider:

+
+
locations <- gmaps_cities |> 
+  unnest_wider(json) |> 
+  select(-status) |> 
+  unnest_longer(results) |> 
+  unnest_wider(results)
+locations
+#> # A tibble: 7 × 6
+#>   city       address_components formatted_address   geometry     place_id types 
+#>   <chr>      <list>             <chr>               <list>       <chr>    <list>
+#> 1 Houston    <list [4]>         Houston, TX, USA    <named list> ChIJAYW… <list>
+#> 2 Washington <list [2]>         Washington, USA     <named list> ChIJ-bD… <list>
+#> 3 Washington <list [4]>         Washington, DC, USA <named list> ChIJW-T… <list>
+#> 4 New York   <list [3]>         New York, NY, USA   <named list> ChIJOwg… <list>
+#> 5 Chicago    <list [4]>         Chicago, IL, USA    <named list> ChIJ7cv… <list>
+#> 6 Arlington  <list [4]>         Arlington, TX, USA  <named list> ChIJ05g… <list>
+#> # … with 1 more row
+
+

Now we can see why two cities got two results: Washington matched both Washington state and Washington, DC, and Arlington matched Arlington, Virginia and Arlington, Texas.

+

There are a few different places we could go from here. We might want to determine the exact location of the match, which is stored in the geometry list-column:

+
+
locations |> 
+  select(city, formatted_address, geometry) |> 
+  unnest_wider(geometry)
+#> # A tibble: 7 × 6
+#>   city       formatted_address   bounds       location     locati…¹ viewport    
+#>   <chr>      <chr>               <list>       <list>       <chr>    <list>      
+#> 1 Houston    Houston, TX, USA    <named list> <named list> APPROXI… <named list>
+#> 2 Washington Washington, USA     <named list> <named list> APPROXI… <named list>
+#> 3 Washington Washington, DC, USA <named list> <named list> APPROXI… <named list>
+#> 4 New York   New York, NY, USA   <named list> <named list> APPROXI… <named list>
+#> 5 Chicago    Chicago, IL, USA    <named list> <named list> APPROXI… <named list>
+#> 6 Arlington  Arlington, TX, USA  <named list> <named list> APPROXI… <named list>
+#> # … with 1 more row, and abbreviated variable name ¹​location_type
+
+

That gives us new bounds (a rectangular region) and location (a point). We can unnest location to see the latitude (lat) and longitude (lng):

+
+
locations |> 
+  select(city, formatted_address, geometry) |> 
+  unnest_wider(geometry) |> 
+  unnest_wider(location)
+#> # A tibble: 7 × 7
+#>   city       formatted_address   bounds         lat    lng locati…¹ viewport    
+#>   <chr>      <chr>               <list>       <dbl>  <dbl> <chr>    <list>      
+#> 1 Houston    Houston, TX, USA    <named list>  29.8  -95.4 APPROXI… <named list>
+#> 2 Washington Washington, USA     <named list>  47.8 -121.  APPROXI… <named list>
+#> 3 Washington Washington, DC, USA <named list>  38.9  -77.0 APPROXI… <named list>
+#> 4 New York   New York, NY, USA   <named list>  40.7  -74.0 APPROXI… <named list>
+#> 5 Chicago    Chicago, IL, USA    <named list>  41.9  -87.6 APPROXI… <named list>
+#> 6 Arlington  Arlington, TX, USA  <named list>  32.7  -97.1 APPROXI… <named list>
+#> # … with 1 more row, and abbreviated variable name ¹​location_type
+
+

Extracting the bounds requires a few more steps:

+
+
locations |> 
+  select(city, formatted_address, geometry) |> 
+  unnest_wider(geometry) |> 
+  # focus on the variables of interest
+  select(!location:viewport) |>
+  unnest_wider(bounds)
+#> # A tibble: 7 × 4
+#>   city       formatted_address   northeast        southwest       
+#>   <chr>      <chr>               <list>           <list>          
+#> 1 Houston    Houston, TX, USA    <named list [2]> <named list [2]>
+#> 2 Washington Washington, USA     <named list [2]> <named list [2]>
+#> 3 Washington Washington, DC, USA <named list [2]> <named list [2]>
+#> 4 New York   New York, NY, USA   <named list [2]> <named list [2]>
+#> 5 Chicago    Chicago, IL, USA    <named list [2]> <named list [2]>
+#> 6 Arlington  Arlington, TX, USA  <named list [2]> <named list [2]>
+#> # … with 1 more row
+
+

We then rename southwest and northeast (the corners of the rectangle) so we can use names_sep to create short but evocative names:

+
+
locations |> 
+  select(city, formatted_address, geometry) |> 
+  unnest_wider(geometry) |> 
+  select(!location:viewport) |>
+  unnest_wider(bounds) |> 
+  rename(ne = northeast, sw = southwest) |> 
+  unnest_wider(c(ne, sw), names_sep = "_") 
+#> # A tibble: 7 × 6
+#>   city       formatted_address   ne_lat ne_lng sw_lat sw_lng
+#>   <chr>      <chr>                <dbl>  <dbl>  <dbl>  <dbl>
+#> 1 Houston    Houston, TX, USA      30.1  -95.0   29.5  -95.8
+#> 2 Washington Washington, USA       49.0 -117.    45.5 -125. 
+#> 3 Washington Washington, DC, USA   39.0  -76.9   38.8  -77.1
+#> 4 New York   New York, NY, USA     40.9  -73.7   40.5  -74.3
+#> 5 Chicago    Chicago, IL, USA      42.0  -87.5   41.6  -87.9
+#> 6 Arlington  Arlington, TX, USA    32.8  -97.0   32.6  -97.2
+#> # … with 1 more row
+
+

Note how we unnest two columns simultaneously by supplying a vector of variable names to #chp-https://tidyr.tidyverse.org/reference/unnest_wider.

+

This is somewhere that #chp-https://tidyr.tidyverse.org/reference/hoist, mentioned briefly above, can be useful. Once you’ve discovered the path to get to the components you’re interested in, you can extract them directly using #chp-https://tidyr.tidyverse.org/reference/hoist:

+
+
locations |> 
+  select(city, formatted_address, geometry) |> 
+  hoist(
+    geometry,
+    ne_lat = c("bounds", "northeast", "lat"),
+    sw_lat = c("bounds", "southwest", "lat"),
+    ne_lng = c("bounds", "northeast", "lng"),
+    sw_lng = c("bounds", "southwest", "lng"),
+  )
+#> # A tibble: 7 × 7
+#>   city       formatted_address   ne_lat sw_lat ne_lng sw_lng geometry        
+#>   <chr>      <chr>                <dbl>  <dbl>  <dbl>  <dbl> <list>          
+#> 1 Houston    Houston, TX, USA      30.1   29.5  -95.0  -95.8 <named list [4]>
+#> 2 Washington Washington, USA       49.0   45.5 -117.  -125.  <named list [4]>
+#> 3 Washington Washington, DC, USA   39.0   38.8  -76.9  -77.1 <named list [4]>
+#> 4 New York   New York, NY, USA     40.9   40.5  -73.7  -74.3 <named list [4]>
+#> 5 Chicago    Chicago, IL, USA      42.0   41.6  -87.5  -87.9 <named list [4]>
+#> 6 Arlington  Arlington, TX, USA    32.8   32.6  -97.0  -97.2 <named list [4]>
+#> # … with 1 more row
+
+

If these case studies have whetted your appetite for more real-life rectangling, you can see a few more examples in vignette("rectangling", package = "tidyr").

+
+ +
+

+Exercises

+
  1. Roughly estimate when gh_repos was created. Why can you only roughly estimate the date?

  2. +
3. The owner column of gh_repos contains a lot of duplicated information because each owner can have many repos. Can you construct an owners data frame that contains one row for each owner? (Hint: does #chp-https://dplyr.tidyverse.org/reference/distinct work with list-cols?)

  4. +
  5. +

    Explain the following code line-by-line. Why is it interesting? Why does it work for got_chars but might not work in general?

    +
    +
    tibble(json = got_chars) |> 
    +  unnest_wider(json) |> 
    +  select(id, where(is.list)) |> 
    +  pivot_longer(
    +    where(is.list), 
    +    names_to = "name", 
    +    values_to = "value"
    +  ) |>  
    +  unnest_longer(value)
    +
    +
  6. +
7. In gmaps_cities, what does address_components contain? Why does the length vary between rows? Unnest it appropriately to figure it out. (Hint: types always appears to contain two elements. Does #chp-https://tidyr.tidyverse.org/reference/unnest_wider make it easier to work with than #chp-https://tidyr.tidyverse.org/reference/unnest_longer?)

  8. +
+
+ +
+

+JSON

+

All of the case studies in the previous section were sourced from wild-caught JSON. JSON is short for JavaScript Object Notation and is the way that most web APIs return data. While JSON and R’s data types are pretty similar, there isn’t a perfect 1-to-1 mapping, so it’s good to understand a bit about JSON in case things go wrong.

+ +
+

+Data types

+

JSON is a simple format designed to be easily read and written by machines, not humans. It has six key data types. Four of them are scalars:

+
  • The simplest type is a null (null) which plays the same role as both NULL and NA in R. It represents the absence of data.
  • +
  • A string is much like a string in R, but must always use double quotes.
  • +
  • A number is similar to R’s numbers: they can use integer (e.g. 123), decimal (e.g. 123.45), or scientific (e.g. 1.23e3) notation. JSON doesn’t support Inf, -Inf, or NaN.
  • +
  • A boolean is similar to R’s TRUE and FALSE, but uses lowercase true and false.
  • +

JSON’s strings, numbers, and booleans are pretty similar to R’s character, numeric, and logical vectors. The main difference is that JSON’s scalars can only represent a single value. To represent multiple values you need to use one of the two remaining types: arrays and objects.

+

Both arrays and objects are similar to lists in R; the difference is whether or not they’re named. An array is like an unnamed list, and is written with []. For example, [1, 2, 3] is an array containing 3 numbers, and [null, 1, "string", false] is an array that contains a null, a number, a string, and a boolean. An object is like a named list, and is written with {}. The names (keys in JSON terminology) are strings, so must be surrounded by quotes. For example, {"x": 1, "y": 2} is an object that maps x to 1 and y to 2.
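To preview how these types map to R, we can parse a tiny example with jsonlite’s parse_json(), which we’ll cover properly in the next section:

str(jsonlite::parse_json('[null, true, "a", 1.5]'))
#> List of 4
#>  $ : NULL
#>  $ : logi TRUE
#>  $ : chr "a"
#>  $ : num 1.5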

+
+ +
+

+jsonlite

+

To convert JSON into R data structures, we recommend the jsonlite package, by Jeroen Ooms. We’ll use only two jsonlite functions: #chp-https://rdrr.io/pkg/jsonlite/man/read_json and parse_json(). In real life, you’ll use read_json() to read a JSON file from disk. For example, the repurrrsive package also provides the source for gh_users as a JSON file and you can read it with read_json():

+
+
# A path to a json file inside the package:
+gh_users_json()
+#> [1] "/Users/hadleywickham/Library/R/arm64/4.2/library/repurrrsive/extdata/gh_users.json"
+
+# Read it with read_json()
+gh_users2 <- read_json(gh_users_json())
+
+# Check it's the same as the data we were using previously
+identical(gh_users, gh_users2)
+#> [1] TRUE
+
+

In this book, we’ll also use parse_json(), since it takes a string containing JSON, which makes it good for generating simple examples. To get started, here are three simple JSON datasets, starting with a number, then putting a few numbers in an array, then putting that array in an object:

+
+
str(parse_json('1'))
+#>  int 1
+str(parse_json('[1, 2, 3]'))
+#> List of 3
+#>  $ : int 1
+#>  $ : int 2
+#>  $ : int 3
+str(parse_json('{"x": [1, 2, 3]}'))
+#> List of 1
+#>  $ x:List of 3
+#>   ..$ : int 1
+#>   ..$ : int 2
+#>   ..$ : int 3
+
+

jsonlite has another important function called #chp-https://rdrr.io/pkg/jsonlite/man/fromJSON. We don’t use it here because it performs automatic simplification (simplifyVector = TRUE). This often works well, particularly in simple cases, but we think you’re better off doing the rectangling yourself so you know exactly what’s happening and can more easily handle the most complicated nested structures.
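To see the difference, compare it with parse_json() on the array we just parsed; a quick sketch:

jsonlite::fromJSON('[1, 2, 3]')
#> [1] 1 2 3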

+
+ +
+

+Starting the rectangling process

+

In most cases, JSON files contain a single top-level array, because they’re designed to provide data about multiple “things”, e.g. multiple pages, or multiple records, or multiple results. In this case, you’ll start your rectangling with tibble(json) so that each element becomes a row:

+
+
json <- '[
+  {"name": "John", "age": 34},
+  {"name": "Susan", "age": 27}
+]'
+df <- tibble(json = parse_json(json))
+df
+#> # A tibble: 2 × 1
+#>   json            
+#>   <list>          
+#> 1 <named list [2]>
+#> 2 <named list [2]>
+
+df |> 
+  unnest_wider(json)
+#> # A tibble: 2 × 2
+#>   name    age
+#>   <chr> <int>
+#> 1 John     34
+#> 2 Susan    27
+
+

In rarer cases, the JSON consists of a single top-level JSON object, representing one “thing”. In this case, you’ll need to kick off the rectangling process by wrapping it in a list before you put it in a tibble.

+
+
json <- '{
+  "status": "OK", 
+  "results": [
+    {"name": "John", "age": 34},
+    {"name": "Susan", "age": 27}
+ ]
+}
+'
+df <- tibble(json = list(parse_json(json)))
+df
+#> # A tibble: 1 × 1
+#>   json            
+#>   <list>          
+#> 1 <named list [2]>
+
+df |> 
+  unnest_wider(json) |> 
+  unnest_longer(results) |> 
+  unnest_wider(results)
+#> # A tibble: 2 × 3
+#>   status name    age
+#>   <chr>  <chr> <int>
+#> 1 OK     John     34
+#> 2 OK     Susan    27
+
+

Alternatively, you can reach inside the parsed JSON and start with the bit that you actually care about:

+
+
df <- tibble(results = parse_json(json)$results)
+df |> 
+  unnest_wider(results)
+#> # A tibble: 2 × 2
+#>   name    age
+#>   <chr> <int>
+#> 1 John     34
+#> 2 Susan    27
+
+
+ +
+

+Translation challenges

+

Since JSON doesn’t have any way to represent dates or date-times, they’re often stored as ISO8601 date-times in strings, and you’ll need to use readr’s parse_date() or #chp-https://readr.tidyverse.org/reference/parse_datetime to turn them into the correct data structure. Similarly, JSON’s rules for representing floating point numbers are a little imprecise, so you’ll also sometimes find numbers stored in strings. Apply #chp-https://readr.tidyverse.org/reference/parse_atomic as needed to get the correct variable type.
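For example, here’s a minimal sketch with made-up data showing both fixes:

df <- tibble(json = parse_json('[
  {"name": "John", "birthday": "1988-02-23", "score": "99.5"},
  {"name": "Susan", "birthday": "1995-10-02", "score": "82.1"}
]'))
df |> 
  unnest_wider(json) |> 
  mutate(
    birthday = readr::parse_date(birthday),
    score = readr::parse_double(score)
  )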

+
+ +
+

+Exercises

+
  1. +

    Rectangle the df_col and df_row below. They represent the two ways of encoding a data frame in JSON.

    +
    +
    json_col <- parse_json('
    +  {
    +    "x": ["a", "x", "z"],
    +    "y": [10, null, 3]
    +  }
    +')
    +json_row <- parse_json('
    +  [
    +    {"x": "a", "y": 10},
    +    {"x": "x", "y": null},
    +    {"x": "z", "y": 3}
    +  ]
    +')
    +
    +df_col <- tibble(json = list(json_col)) 
    +df_row <- tibble(json = json_row)
    +
    +
  2. +
+
+ +
+

+Summary

+

In this chapter, you learned what lists are, how you can generate them from JSON files, and how to turn them into rectangular data frames. Surprisingly, we only need two new functions: #chp-https://tidyr.tidyverse.org/reference/unnest_longer to put list elements into rows and #chp-https://tidyr.tidyverse.org/reference/unnest_wider to put list elements into columns. It doesn’t matter how deeply nested the list-column is; all you need to do is repeatedly call these two functions.

+

JSON is the most common data format returned by web APIs. What happens if the website doesn’t have an API, but you can see data you want on the website? That’s the topic of the next chapter: web scraping, extracting data from HTML webpages.

+ + +
+
diff --git a/oreilly/regexps.html b/oreilly/regexps.html new file mode 100644 index 0000000..53b9b11 --- /dev/null +++ b/oreilly/regexps.html @@ -0,0 +1,1032 @@ +
+

Regular expressions

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

In #chp-strings, you learned a whole bunch of useful functions for working with strings. In this chapter, we’ll focus on functions that use regular expressions, a concise and powerful language for describing patterns within strings. The term “regular expression” is a bit of a mouthful, so most people abbreviate it to “regex” (you can pronounce it with either a hard g, “reg-x”, or a soft g, “rej-x”) or “regexp”.

+

The chapter starts with the basics of regular expressions and the most useful stringr functions for data analysis. We’ll then expand your knowledge of patterns and cover seven important new topics (escaping, anchoring, character classes, shorthand classes, quantifiers, precedence, and grouping). Next, we’ll talk about some of the other types of patterns that stringr functions can work with, and the various “flags” that allow you to tweak the operation of regular expressions. We’ll finish up with a survey of other places in the tidyverse and base R where you might use regexes.

+ +
+

+Prerequisites

+
+
+ +
+ +
+ +

This chapter relies on features only found in stringr 1.5.0 and tidyr 1.3.0 which are still in development. If you want to live life on the edge, you can get the dev versions with devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr")).

+ +

In this chapter, we’ll use regular expression functions from stringr and tidyr, both core members of the tidyverse, as well as data from the babynames package.

+
+
library(tidyverse)
+library(babynames)
+
+

Throughout this chapter, we’ll use a mix of very simple inline examples (so you can get the basic idea), the baby names data, and three character vectors from stringr:

+
  • +fruit contains the names of 80 fruits.
  • +
  • +words contains 980 common English words.
  • +
  • +sentences contains 720 short sentences.
  • +
+
+ +
+

+Pattern basics

+

We’ll use #chp-https://stringr.tidyverse.org/reference/str_view to learn how regex patterns work. We used #chp-https://stringr.tidyverse.org/reference/str_view in the last chapter to better understand a string vs its printed representation, and now we’ll use it with its second argument, a regular expression. When this is supplied, #chp-https://stringr.tidyverse.org/reference/str_view will show only the elements of the string vector that match, surrounding each match with <>, and, where possible, highlighting the match in blue.

+

The simplest patterns consist of letters and numbers which match those characters exactly:

+
+
str_view(fruit, "berry")
+#>  [6] │ bil<berry>
+#>  [7] │ black<berry>
+#> [10] │ blue<berry>
+#> [11] │ boysen<berry>
+#> [19] │ cloud<berry>
+#> [21] │ cran<berry>
+#> [29] │ elder<berry>
+#> [32] │ goji <berry>
+#> [33] │ goose<berry>
+#> [38] │ huckle<berry>
+#> ... and 4 more
+
+str_view(fruit, "BERRY")
+
+

Letters and numbers match exactly and are called literal characters. Punctuation characters like ., +, *, [, ], ? have special meanings (you’ll learn how to escape them in #sec-regexp-escaping) and are called metacharacters. For example, . will match any character (well, any character apart from \n), so "a." will match any string that contains an “a” followed by another character:

+
+
str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
+#> [2] │ <ab>
+#> [3] │ <ae>
+#> [6] │ e<ab>
+
+

Or we could find all the fruits that contain an “a”, followed by three letters, followed by an “e”:

+
+
str_view(fruit, "a...e")
+#>  [1] │ <apple>
+#>  [7] │ bl<ackbe>rry
+#> [48] │ mand<arine>
+#> [51] │ nect<arine>
+#> [62] │ pine<apple>
+#> [64] │ pomegr<anate>
+#> [70] │ r<aspbe>rry
+#> [73] │ sal<al be>rry
+
+

Quantifiers control how many times a pattern can match:

+
  • +? makes a pattern optional (i.e. it matches 0 or 1 times)
  • +
  • ++ lets a pattern repeat (i.e. it matches at least once)
  • +
  • +* lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
  • +
+
# ab? matches an "a", optionally followed by a "b".
+str_view(c("a", "ab", "abb"), "ab?")
+#> [1] │ <a>
+#> [2] │ <ab>
+#> [3] │ <ab>b
+
+# ab+ matches an "a", followed by at least one "b".
+str_view(c("a", "ab", "abb"), "ab+")
+#> [2] │ <ab>
+#> [3] │ <abb>
+
+# ab* matches an "a", followed by any number of "b"s.
+str_view(c("a", "ab", "abb"), "ab*")
+#> [1] │ <a>
+#> [2] │ <ab>
+#> [3] │ <abb>
+
+

Character classes are defined by [] and let you match a set of characters, e.g. [abcd] matches “a”, “b”, “c”, or “d”. You can also invert the match by starting with ^: [^abcd] matches anything except “a”, “b”, “c”, or “d”. We can use this idea to find the words with three vowels in a row or with four consonants in a row:

+
+
str_view(words, "[aeiou][aeiou][aeiou]")
+#>  [79] │ b<eau>ty
+#> [565] │ obv<iou>s
+#> [644] │ prev<iou>s
+#> [670] │ q<uie>t
+#> [741] │ ser<iou>s
+#> [915] │ var<iou>s
+str_view(words, "[^aeiou][^aeiou][^aeiou][^aeiou]")
+#>  [45] │ a<pply>
+#> [198] │ cou<ntry>
+#> [424] │ indu<stry>
+#> [830] │ su<pply>
+#> [836] │ <syst>em
+
+

You can combine character classes and quantifiers. For example, the following regexp looks for two vowels followed by two or more consonants:

+
+
str_view(words, "[aeiou][aeiou][^aeiou][^aeiou]+")
+#>  [6] │ acc<ount>
+#> [21] │ ag<ainst>
+#> [31] │ alr<eady>
+#> [34] │ alth<ough>
+#> [37] │ am<ount>
+#> [46] │ app<oint>
+#> [47] │ appr<oach>
+#> [52] │ ar<ound>
+#> [61] │ <auth>ority
+#> [79] │ be<auty>
+#> ... and 62 more
+
+

(We’ll learn some more elegant ways to express these ideas in #sec-quantifiers.)

+

You can use alternation, |, to pick between one or more alternative patterns. For example, the following patterns look for fruits containing “apple”, “pear”, or “banana”, or a repeated vowel.

+
+
str_view(fruit, "apple|pear|banana")
+#>  [1] │ <apple>
+#>  [4] │ <banana>
+#> [59] │ <pear>
+#> [62] │ pine<apple>
+str_view(fruit, "aa|ee|ii|oo|uu")
+#>  [9] │ bl<oo>d orange
+#> [33] │ g<oo>seberry
+#> [47] │ lych<ee>
+#> [66] │ purple mangost<ee>n
+
+

Regular expressions are very compact and use a lot of punctuation characters, so they can seem overwhelming and hard to read at first. Don’t worry; you’ll get better with practice, and simple patterns will soon become second nature. Let’s kick off that process by practicing with some useful stringr functions.

+ +
+

+Exercises

+
+
+ +
+

+Key functions

+

Now that you’ve got the basics of regular expressions under your belt, let’s use them with some stringr and tidyr functions. In the following section, you’ll learn about how to detect the presence or absence of a match, how to count the number of matches, how to replace a match with fixed text, and how to extract text using a pattern.

+ +
+

+Detect matches

+

#chp-https://stringr.tidyverse.org/reference/str_detect returns a logical vector that is TRUE if the pattern matched an element of the character vector and FALSE otherwise:

+
+
str_detect(c("a", "b", "c"), "[aeiou]")
+#> [1]  TRUE FALSE FALSE
+
+

Since #chp-https://stringr.tidyverse.org/reference/str_detect returns a logical vector of the same length as the initial vector, it pairs well with #chp-https://dplyr.tidyverse.org/reference/filter. For example, this code finds all the most popular names containing a lower-case “x”:

+
+
babynames |> 
+  filter(str_detect(name, "x")) |> 
+  count(name, wt = n, sort = TRUE)
+#> # A tibble: 974 × 2
+#>   name           n
+#>   <chr>      <int>
+#> 1 Alexander 665492
+#> 2 Alexis    399551
+#> 3 Alex      278705
+#> 4 Alexandra 232223
+#> 5 Max       148787
+#> 6 Alexa     123032
+#> # … with 968 more rows
+
+

We can also use #chp-https://stringr.tidyverse.org/reference/str_detect with #chp-https://dplyr.tidyverse.org/reference/summarise by pairing it with #chp-https://rdrr.io/r/base/sum or #chp-https://rdrr.io/r/base/mean: sum(str_detect(x, pattern)) tells you the number of observations that match and mean(str_detect(x, pattern)) tells you the proportion that match. For example, with a toy vector:
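x <- c("apple", "banana", "pear")
sum(str_detect(x, "p"))
#> [1] 2
mean(str_detect(x, "p"))
#> [1] 0.6666667

Scaling this idea up, the following snippet computes and visualizes the proportion of baby names that contain “x”, broken down by year. (This gives us the proportion of names that contain an “x”; if you wanted the proportion of babies with a name containing an “x”, you’d need to perform a weighted mean.) It looks like they’ve radically increased in popularity lately!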

+
+
babynames |> 
+  group_by(year) |> 
+  summarise(prop_x = mean(str_detect(name, "x"))) |> 
+  ggplot(aes(year, prop_x)) + 
+  geom_line()
+
+ +

Figure 15.1: A time series showing the proportion of baby names that contain a lower case “x”. The proportion declines gradually from 8 per 1000 in 1880 to 4 per 1000 in 1980, then increases rapidly to 16 per 1000 in 2019.
+
+
+
+

There are two functions that are closely related to #chp-https://stringr.tidyverse.org/reference/str_detect: #chp-https://stringr.tidyverse.org/reference/str_subset, which returns just the strings that contain a match, and #chp-https://stringr.tidyverse.org/reference/str_which, which returns the indexes of strings that have a match:

+
+
str_subset(c("a", "b", "c"), "[aeiou]")
+#> [1] "a"
+str_which(c("a", "b", "c"), "[aeiou]")
+#> [1] 1
+
+
+ +
+

+Count matches

+

The next step up in complexity from #chp-https://stringr.tidyverse.org/reference/str_detect is #chp-https://stringr.tidyverse.org/reference/str_count: rather than a simple true or false, it tells you how many matches there are in each string.

+
+
x <- c("apple", "banana", "pear")
+str_count(x, "p")
+#> [1] 2 0 1
+
+

Note that each match starts at the end of the previous match; i.e. regex matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:

+
+
str_count("abababa", "aba")
+#> [1] 2
+str_view("abababa", "aba")
+#> [1] │ <aba>b<aba>
+
+

It’s natural to use #chp-https://stringr.tidyverse.org/reference/str_count with #chp-https://dplyr.tidyverse.org/reference/mutate. The following example uses #chp-https://stringr.tidyverse.org/reference/str_count with character classes to count the number of vowels and consonants in each name.

+
+
babynames |> 
+  count(name) |> 
+  mutate(
+    vowels = str_count(name, "[aeiou]"),
+    consonants = str_count(name, "[^aeiou]")
+  )
+#> # A tibble: 97,310 × 4
+#>   name          n vowels consonants
+#>   <chr>     <int>  <int>      <int>
+#> 1 Aaban        10      2          3
+#> 2 Aabha         5      2          3
+#> 3 Aabid         2      2          3
+#> 4 Aabir         1      2          3
+#> 5 Aabriella     5      4          5
+#> 6 Aada          1      2          2
+#> # … with 97,304 more rows
+
+

If you look closely, you’ll notice that there’s something off with our calculations: “Aaban” contains three “a”s, but our summary reports only two vowels. That’s because regular expressions are case sensitive. There are three ways we could fix this (compared side by side in the sketch after this list):

+
  • Add the upper case vowels to the character class: str_count(name, "[aeiouAEIOU]").
  • +
  • Tell the regular expression to ignore case: str_count(name, regex("[aeiou]", ignore_case = TRUE)). We’ll talk about this more in #sec-flags.
  • +
  • Use #chp-https://stringr.tidyverse.org/reference/case to convert the names to lower case: str_count(str_to_lower(name), "[aeiou]"). You learned about this function in #sec-other-languages.
  • +
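All three approaches give the same counts. Here’s a minimal sketch comparing them (the vector x is a made-up example):

x <- c("Aaban", "AIDA")
str_count(x, "[aeiouAEIOU]")
#> [1] 3 3
str_count(x, regex("[aeiou]", ignore_case = TRUE))
#> [1] 3 3
str_count(str_to_lower(x), "[aeiou]")
#> [1] 3 3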

This variety of approaches is pretty typical when working with strings — there are often multiple ways to reach your goal, either by making your pattern more complicated or by doing some preprocessing on your string. If you get stuck trying one approach, it can often be useful to switch gears and tackle the problem from a different perspective.

+

In this case, since we’re applying two functions to the name, we think it’s easier to transform it first:

+
+
babynames |> 
+  count(name) |> 
+  mutate(
+    name = str_to_lower(name),
+    vowels = str_count(name, "[aeiou]"),
+    consonants = str_count(name, "[^aeiou]")
+  )
+#> # A tibble: 97,310 × 4
+#>   name          n vowels consonants
+#>   <chr>     <int>  <int>      <int>
+#> 1 aaban        10      3          2
+#> 2 aabha         5      3          2
+#> 3 aabid         2      3          2
+#> 4 aabir         1      3          2
+#> 5 aabriella     5      5          4
+#> 6 aada          1      3          1
+#> # … with 97,304 more rows
+
+
+ +
+

+Replace values

+

As well as detecting and counting matches, we can also modify them with #chp-https://stringr.tidyverse.org/reference/str_replace and str_replace_all(): str_replace() replaces the first match, and as the name suggests, str_replace_all() replaces all matches.

+
+
x <- c("apple", "pear", "banana")
+str_replace_all(x, "[aeiou]", "-")
+#> [1] "-ppl-"  "p--r"   "b-n-n-"
+
+

str_remove() and str_remove_all() (#chp-https://stringr.tidyverse.org/reference/str_remove) are handy shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, "").

+
+
x <- c("apple", "pear", "banana")
+str_remove_all(x, "[aeiou]")
+#> [1] "ppl" "pr"  "bnn"
+
+

These functions are naturally paired with #chp-https://dplyr.tidyverse.org/reference/mutate when doing data cleaning, and you’ll often apply them repeatedly to peel off layers of inconsistent formatting.
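For example, here’s a minimal sketch with made-up messy prices (both the data and the cleaning steps are hypothetical):

df <- tibble(price = c("$1,000", "$2,500 ", "circa $3,000"))
df |> 
  mutate(
    price = price |> 
      str_remove_all("[$,]") |>   # strip dollar signs and commas
      str_remove_all("[a-z ]") |> # strip stray words and spaces
      as.numeric()
  )
#> # A tibble: 3 × 1
#>   price
#>   <dbl>
#> 1  1000
#> 2  2500
#> 3  3000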

+
+ +
+

+Extract variables

+

The last function we’ll discuss uses regular expressions to extract data out of one column into one or more new columns: separate_wider_regex() (#chp-https://tidyr.tidyverse.org/reference/separate_wider_delim). It’s a peer of the separate_wider_position() and separate_wider_delim() functions that you learned about in #sec-string-columns. These functions live in tidyr because they operate on (columns of) data frames, rather than individual vectors.

+

Let’s create a simple dataset to show how it works. Here we have some data derived from babynames where we have the name, gender, and age of a bunch of people in a rather weird format (we wish we could reassure you that you’d never see something this weird in real life, but unfortunately over the course of your career you’re likely to see much weirder!):

+
+
df <- tribble(
+  ~str,
+  "<Sheryl>-F_34",
+  "<Kisha>-F_45", 
+  "<Brandon>-N_33",
+  "<Sharon>-F_38", 
+  "<Penny>-F_58",
+  "<Justin>-M_41", 
+  "<Patricia>-F_84", 
+)
+
+

To extract this data using separate_wider_regex(), we just need to construct a sequence of regular expressions that match each piece. If we want the contents of that piece to appear in the output, we give it a name:

+
+
df |> 
+  separate_wider_regex(
+    str,
+    patterns = c(
+      "<", name = "[A-Za-z]+", ">-", 
+      gender = ".", "_", 
+      age = "[0-9]+"
+    )
+  )
+#> # A tibble: 7 × 3
+#>   name    gender age  
+#>   <chr>   <chr>  <chr>
+#> 1 Sheryl  F      34   
+#> 2 Kisha   F      45   
+#> 3 Brandon N      33   
+#> 4 Sharon  F      38   
+#> 5 Penny   F      58   
+#> 6 Justin  M      41   
+#> # … with 1 more row
+
+

If the match fails, you can use too_few = "debug" to figure out what went wrong, just like separate_wider_delim() and separate_wider_position().

+
+ +
+

+Exercises

+
  1. What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)

  2. +
  3. Replace all forward slashes in a string with backslashes.

  4. +
  5. Implement a simple version of #chp-https://stringr.tidyverse.org/reference/case using #chp-https://stringr.tidyverse.org/reference/str_replace.

  6. +
  7. Create a regular expression that will match telephone numbers as commonly written in your country.

  8. +
+
+ +
+

+Pattern details

+

Now that you understand the basics of the pattern language and how to use it with some stringr and tidyr functions, it’s time to dig into more of the details. First, we’ll start with escaping, which allows you to match metacharacters that would otherwise be treated specially. Next, you’ll learn about anchors, which allow you to match the start or end of the string. Then, you’ll learn more about character classes and their shortcuts, which allow you to match any character from a set. Next, you’ll learn the final details of quantifiers, which control how many times a pattern can match. Then, we have to cover the important (but complex) topic of operator precedence and parentheses. And we’ll finish off with some details of grouping components of the pattern.

+

The terms we use here are the technical names for each component. They’re not always the most evocative of their purpose, but it’s very helpful to know the correct terms if you later want to Google for more details.

+ +
+

+Escaping

+

In order to match a literal ., you need an escape which tells the regular expression to match metacharacters literally. Like strings, regexps use the backslash for escaping. So, to match a ., you need the regexp \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.", as the following example shows.

+
+
# To create the regular expression \., we need to use \\.
+dot <- "\\."
+
+# But the expression itself only contains one \
+str_view(dot)
+#> [1] │ \.
+
+# And this tells R to look for an explicit .
+str_view(c("abc", "a.c", "bef"), "a\\.c")
+#> [2] │ <a.c>
+
+

In this book, we’ll usually write regular expression without quotes, like \.. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like "\\.".

+

If \ is used as an escape character in regular expressions, how do you match a literal \? Well, you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" — you need four backslashes to match one!

+
+
x <- "a\\b"
+str_view(x)
+#> [1] │ a\b
+str_view(x, "\\\\")
+#> [1] │ a<\>b
+
+

Alternatively, you might find it easier to use the raw strings you learned about in #sec-raw-strings). That lets you to avoid one layer of escaping:

+
+
str_view(x, r"{\\}")
+#> [1] │ a<\>b
+
+

If you’re trying to match a literal ., $, |, *, +, ?, {, }, (, ), there’s an alternative to using a backslash escape: you can use a character class: [.], [$], [|], ... all match the literal values.

+
+
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
+#> [2] │ <a.c>
+str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
+#> [3] │ <a*c>
+
+

The full set of metacharacters is .^$\|*+?{}[](). In general, look at punctuation characters with suspicion; if your regular expression isn’t matching what you think it should, check if you’ve used any of these characters.

+
+ +
+

+Anchors

+

By default, regular expressions will match any part of a string. If you want to match at the start or end, you need to anchor the regular expression using ^ to match the start of the string or $ to match the end of the string:

+
+
str_view(fruit, "^a")
+#> [1] │ <a>pple
+#> [2] │ <a>pricot
+#> [3] │ <a>vocado
+str_view(fruit, "a$")
+#>  [4] │ banan<a>
+#> [15] │ cherimoy<a>
+#> [30] │ feijo<a>
+#> [36] │ guav<a>
+#> [56] │ papay<a>
+#> [74] │ satsum<a>
+
+

It’s tempting to think that $ should match the start of a string, because that’s how we write dollar amounts, but that’s not what regular expressions want.
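If you do want to match a literal dollar sign, escape it, as in this small sketch:

str_view(c("$100", "100 dollars"), "^\\$")
#> [1] │ <$>100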

+

To force a regular expression to match only the full string, anchor it with both ^ and $:

+
+
str_view(fruit, "apple")
+#>  [1] │ <apple>
+#> [62] │ pine<apple>
+str_view(fruit, "^apple$")
+#> [1] │ <apple>
+
+

You can also match the boundary between words (i.e. the start or end of a word) with \b. This can be particularly useful when using RStudio’s find and replace tool. For example, to find all uses of #chp-https://rdrr.io/r/base/sum, you can search for \bsum\b to avoid matching summarise, summary, rowsum, and so on:

+
+
x <- c("summary(x)", "summarise(df)", "rowsum(x)", "sum(x)")
+str_view(x, "sum")
+#> [1] │ <sum>mary(x)
+#> [2] │ <sum>marise(df)
+#> [3] │ row<sum>(x)
+#> [4] │ <sum>(x)
+str_view(x, "\\bsum\\b")
+#> [4] │ <sum>(x)
+
+

When used alone, anchors will produce a zero-width match:

+
+
str_view("abc", c("$", "^", "\\b"))
+#> [1] │ abc<>
+#> [2] │ <>abc
+#> [3] │ <>abc<>
+
+

This helps you understand what happens when you replace a standalone anchor:

+
+
str_replace_all("abc", c("$", "^", "\\b"), "--")
+#> [1] "abc--"   "--abc"   "--abc--"
+
+
+ +
+

+Character classes

+

A character class, or character set, allows you to match any character in a set. As we discussed above, you can construct your own sets with [], where [abc] matches a, b, or c. There are three characters that have special meaning inside of []:

+
  • +- defines a range, e.g. [a-z] matches any lower case letter and [0-9] matches any number.
  • +
  • +^ takes the inverse of the set, e.g. [^abc] matches anything except a, b, or c.
  • +
  • +\ escapes special characters, so [\^\-\]] matches ^, -, or ].
  • +

Here are few examples:

+
+
x <- "abcd ABCD 12345 -!@#%."
+str_view(x, "[abc]+")
+#> [1] │ <abc>d ABCD 12345 -!@#%.
+str_view(x, "[a-z]+")
+#> [1] │ <abcd> ABCD 12345 -!@#%.
+str_view(x, "[^a-z0-9]+")
+#> [1] │ abcd< ABCD >12345< -!@#%.>
+
+# You need an escape to match characters that are otherwise
+# special inside of []
+str_view("a-b-c", "[a-c]")
+#> [1] │ <a>-<b>-<c>
+str_view("a-b-c", "[a\\-c]")
+#> [1] │ <a><->b<-><c>
+
+

Some character classes are used so commonly that they get their own shortcut. You’ve already seen ., which matches any character apart from a newline. There are three other particularly useful pairs (remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type "\\d" or "\\s"):

+
  • +\d matches any digit;
    \D matches anything that isn’t a digit.
  • +
  • +\s matches any whitespace (e.g. space, tab, newline);
    \S matches anything that isn’t whitespace.
  • +
  • +\w matches any “word” character, i.e. letters and numbers;
    \W matches any “non-word” character.
  • +

The following code demonstrates the six shortcuts with a selection of letters, numbers, and punctuation characters.

+
+
x <- "abcd ABCD 12345 -!@#%."
+str_view(x, "\\d+")
+#> [1] │ abcd ABCD <12345> -!@#%.
+str_view(x, "\\D+")
+#> [1] │ <abcd ABCD >12345< -!@#%.>
+str_view(x, "\\w+")
+#> [1] │ <abcd> <ABCD> <12345> -!@#%.
+str_view(x, "\\W+")
+#> [1] │ abcd< >ABCD< >12345< -!@#%.>
+str_view(x, "\\s+")
+#> [1] │ abcd< >ABCD< >12345< >-!@#%.
+str_view(x, "\\S+")
+#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>
+
+
+ +
+

+Quantifiers

+

Quantifiers control how many times a pattern matches. In #sec-reg-basics you learned about ? (0 or 1 matches), + (1 or more matches), and * (0 or more matches). For example, colou?r will match American or British spelling, \d+ will match one or more digits, and \s? will optionally match a single item of whitespace. You can also specify the number of matches precisely with {}:

+
  • +{n} matches exactly n times.
  • +
  • +{n,} matches at least n times.
  • +
  • +{n,m} matches between n and m times.
  • +

The following code shows how this works for a few simple examples:

+
+
x <- "-- -x- -xx- -xxx- -xxxx- -xxxxx-"
+str_view(x, "-x?-")      # [0, 1]
+#> [1] │ <--> <-x-> -xx- -xxx- -xxxx- -xxxxx-
+str_view(x, "-x+-")      # [1, Inf)
+#> [1] │ -- <-x-> <-xx-> <-xxx-> <-xxxx-> <-xxxxx->
+str_view(x, "-x*-")      # [0, Inf)
+#> [1] │ <--> <-x-> <-xx-> <-xxx-> <-xxxx-> <-xxxxx->
+str_view(x, "-x{2}-")    # [2. 2]
+#> [1] │ -- -x- <-xx-> -xxx- -xxxx- -xxxxx-
+str_view(x, "-x{2,}-")   # [2, Inf)
+#> [1] │ -- -x- <-xx-> <-xxx-> <-xxxx-> <-xxxxx->
+str_view(x, "-x{2,3}-")  # [2, 3]
+#> [1] │ -- -x- <-xx-> <-xxx-> -xxxx- -xxxxx-
+
+
+ +
+

+Operator precedence and parentheses

+

What does ab+ match? Does it match “a” followed by one or more “b”s, or does it match “ab” repeated any number of times? What does ^a|b$ match? Does it match the complete string “a” or the complete string “b”, or does it match a string starting with “a” or a string ending with “b”?

+

The answer to these questions is determined by operator precedence, similar to the PEMDAS or BEDMAS rules you might have learned in school. You know that a + b * c is equivalent to a + (b * c) not (a + b) * c because * has higher precedence and + has lower precedence: you compute * before +.

+

Similarly, regular expressions have their own precedence rules: quantifiers have high precedence and alternation has low precedence which means that ab+ is equivalent to a(b+), and ^a|b$ is equivalent to (^a)|(b$). Just like with algebra, you can use parentheses to override the usual order. But unlike algebra you’re unlikely to remember the precedence rules for regexes, so feel free to use parentheses liberally.
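When in doubt, a quick check with str_view() settles it. Here’s a small sketch contrasting ab+ with (ab)+, with expected output shown as comments:

str_view(c("a", "abb", "abab"), "ab+")
#> [2] │ <abb>
#> [3] │ <ab><ab>
str_view(c("a", "abb", "abab"), "(ab)+")
#> [2] │ <ab>b
#> [3] │ <abab>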

+
+ +
+

+Grouping and capturing

+

As well as overriding operator precedence, parentheses have another important effect: they create capturing groups that allow you to use sub-components of the match.

+

The first way to use a capturing group is to refer back to it within a match with a back reference: \1 refers to the match contained in the first set of parentheses, \2 to the match in the second set, and so on. For example, the following pattern finds all fruits that have a repeated pair of letters:

+
+
str_view(fruit, "(..)\\1")
+#>  [4] │ b<anan>a
+#> [20] │ <coco>nut
+#> [22] │ <cucu>mber
+#> [41] │ <juju>be
+#> [56] │ <papa>ya
+#> [73] │ s<alal> berry
+
+

And this one finds all words that start and end with the same pair of letters:

+
+
str_view(words, "^(..).*\\1$")
+#> [152] │ <church>
+#> [217] │ <decide>
+#> [617] │ <photograph>
+#> [699] │ <require>
+#> [739] │ <sense>
+
+

You can also use back references in #chp-https://stringr.tidyverse.org/reference/str_replace. For example, this code switches the order of the second and third words in sentences:

+
+
sentences |> 
+  str_replace("(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2") |> 
+  str_view()
+#>  [1] │ The canoe birch slid on the smooth planks.
+#>  [2] │ Glue sheet the to the dark blue background.
+#>  [3] │ It's to easy tell the depth of a well.
+#>  [4] │ These a days chicken leg is a rare dish.
+#>  [5] │ Rice often is served in round bowls.
+#>  [6] │ The of juice lemons makes fine punch.
+#>  [7] │ The was box thrown beside the parked truck.
+#>  [8] │ The were hogs fed chopped corn and garbage.
+#>  [9] │ Four of hours steady work faced us.
+#> [10] │ A size large in stockings is hard to sell.
+#> ... and 710 more
+
+

If you want to extract the matches for each group, you can use #chp-https://stringr.tidyverse.org/reference/str_match. But #chp-https://stringr.tidyverse.org/reference/str_match returns a matrix, so it’s not particularly easy to work with (mostly because we never discuss matrices in this book!):

+
+
sentences |> 
+  str_match("the (\\w+) (\\w+)") |> 
+  head()
+#>      [,1]                [,2]     [,3]    
+#> [1,] "the smooth planks" "smooth" "planks"
+#> [2,] "the sheet to"      "sheet"  "to"    
+#> [3,] "the depth of"      "depth"  "of"    
+#> [4,] NA                  NA       NA      
+#> [5,] NA                  NA       NA      
+#> [6,] NA                  NA       NA
+
+

You could convert to a tibble and name the columns:

+
+
sentences |> 
+  str_match("the (\\w+) (\\w+)") |> 
+  as_tibble(.name_repair = "minimal") |> 
+  set_names("match", "word1", "word2")
+#> # A tibble: 720 × 3
+#>   match             word1  word2 
+#>   <chr>             <chr>  <chr> 
+#> 1 the smooth planks smooth planks
+#> 2 the sheet to      sheet  to    
+#> 3 the depth of      depth  of    
+#> 4 <NA>              <NA>   <NA>  
+#> 5 <NA>              <NA>   <NA>  
+#> 6 <NA>              <NA>   <NA>  
+#> # … with 714 more rows
+
+

But then you’ve basically recreated your own version of separate_wider_regex(). Indeed, behind the scenes, #chp-https://tidyr.tidyverse.org/reference/separate_wider_delim converts your vector of patterns to a single regex that uses grouping to capture the named components.
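Roughly speaking (this is a sketch of the idea, not the code tidyr actually generates), the patterns from the earlier df example collapse into a single regex with one capturing group per named component:

# Each parenthesized group corresponds to one named column: name, gender, age
str_match(df$str, "<([A-Za-z]+)>-(.)_([0-9]+)")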

+

Occasionally, you’ll want to use parentheses without creating matching groups. You can create a non-capturing group with (?:).

+
+
x <- c("a gray cat", "a grey dog")
+str_match(x, "gr(e|a)y")
+#>      [,1]   [,2]
+#> [1,] "gray" "a" 
+#> [2,] "grey" "e"
+str_match(x, "gr(?:e|a)y")
+#>      [,1]  
+#> [1,] "gray"
+#> [2,] "grey"
+
+
+ +
+

+Exercises

+
  1. How would you match the literal string "'\? How about "$^$"?

  2. +
  3. Explain why each of these patterns don’t match a \: "\", "\\", "\\\".

  4. +
  5. +

    Given the corpus of common words in #chp-https://stringr.tidyverse.org/reference/stringr-data, create regular expressions that find all words that:

    +
    1. Start with “y”.
    2. +
    3. Don’t start with “y”.
    4. +
    5. End with “x”.
    6. +
    7. Are exactly three letters long. (Don’t cheat by using #chp-https://stringr.tidyverse.org/reference/str_length!)
    8. +
    9. Have seven letters or more.
    10. +
    11. Contain a vowel-consonant pair.
    12. +
    13. Contain at least two vowel-consonant pairs in a row.
    14. +
    15. Only consist of repeated vowel-consonant pairs.
    16. +
  6. +
  7. Create 11 regular expressions that match the British or American spellings for each of the following words: grey/gray, modelling/modeling, summarize/summarise, aluminium/aluminum, defence/defense, analog/analogue, center/centre, sceptic/skeptic, aeroplane/airplane, arse/ass, doughnut/donut. Try and make the shortest possible regex!

  8. +
  9. Switch the first and last letters in words. Which of those strings are still words?

  10. +
  11. +

    Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)

    +
    1. ^.*$
    2. +
    3. "\\{.+\\}"
    4. +
    5. \d{4}-\d{2}-\d{2}
    6. +
    7. "\\\\{4}"
    8. +
    9. \..\..\..
    10. +
    11. (.)\1\1
    12. +
    13. "(..)\\1"
    14. +
  12. +
  13. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

  14. +
+
+ +
+

+Pattern control

+

It’s possible to exercise extra control over the details of the match by using a pattern object instead of just a string. This allows you to control the so-called regex flags and match various types of fixed strings, as described below.

+ +
+

+Regex flags

+

There are a number of settings that you can use to control the details of the regexp. These settings are often called flags in other programming languages. In stringr, you can use these by wrapping the pattern in a call to #chp-https://stringr.tidyverse.org/reference/modifiers. The most useful flag is probably ignore_case = TRUE because it allows characters to match either their uppercase or lowercase forms:

+
+
bananas <- c("banana", "Banana", "BANANA")
+str_view(bananas, "banana")
+#> [1] │ <banana>
+str_view(bananas, regex("banana", ignore_case = TRUE))
+#> [1] │ <banana>
+#> [2] │ <Banana>
+#> [3] │ <BANANA>
+
+

If you’re doing a lot of work with multiline strings (i.e. strings that contain \n), dotall and multiline may also be useful:

+
  • +

    dotall = TRUE lets . match everything, including \n:

    +
    +
    x <- "Line 1\nLine 2\nLine 3"
    +str_view(x, ".Line")
    +str_view(x, regex(".Line", dotall = TRUE))
    +#> [1] │ Line 1<
    +#>     │ Line> 2<
    +#>     │ Line> 3
    +
    +
  • +
  • +

    multiline = TRUE makes ^ and $ match the start and end of each line rather than the start and end of the complete string:

    +
    +
    x <- "Line 1\nLine 2\nLine 3"
    +str_view(x, "^Line")
    +#> [1] │ <Line> 1
    +#>     │ Line 2
    +#>     │ Line 3
    +str_view(x, regex("^Line", multiline = TRUE))
    +#> [1] │ <Line> 1
    +#>     │ <Line> 2
    +#>     │ <Line> 3
    +
    +
  • +

Finally, if you’re writing a complicated regular expression and you’re worried you might not understand it in the future, you might try comments = TRUE. It tweaks the pattern language to ignore spaces and new lines, as well as everything after #. This allows you to use comments and whitespace to make complex regular expressions more understandable (comments = TRUE is particularly effective in combination with a raw string, as we use here), as in the following example:

+
+
phone <- regex(
+  r"(
+    \(?     # optional opening parens
+    (\d{3}) # area code
+    [)\ -]? # optional closing parens, space, or dash
+    (\d{3}) # another three numbers
+    [\ -]?  # optional space or dash
+    (\d{3}) # three more numbers
+  )", 
+  comments = TRUE
+)
+
+str_match("514-791-8141", phone)
+#>      [,1]          [,2]  [,3]  [,4] 
+#> [1,] "514-791-814" "514" "791" "814"
+
+

If you’re using comments and want to match a space, newline, or #, you’ll need to escape it:

+
+
str_view("x x #", regex(r"(x #)", comments = TRUE))
+#> [1] │ <x> <x> #
+str_view("x x #", regex(r"(x\ \#)", comments = TRUE))
+#> [1] │ x <x #>
+
+
+ +
+

+Fixed matches

+

You can opt out of the regular expression rules by using #chp-https://stringr.tidyverse.org/reference/modifiers:

+
+
str_view(c("", "a", "."), fixed("."))
+#> [3] │ <.>
+
+

#chp-https://stringr.tidyverse.org/reference/modifiers also gives you the ability to ignore case:

+
+
str_view("x X", "X")
+#> [1] │ x <X>
+str_view("x X", fixed("X", ignore_case = TRUE))
+#> [1] │ <x> <X>
+
+

If you’re working with non-English text, you will probably want coll() (#chp-https://stringr.tidyverse.org/reference/modifiers) instead of fixed(), as it implements the full rules for capitalization as used by the locale you specify. See #sec-other-languages for more details on locales.

+
+
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
+#> [1] │ i <İ> ı I
+str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
+#> [1] │ <i> <İ> ı I
+
+
+
+ +
+

+Practice

+

To put these ideas into practice we’ll solve a few semi-authentic problems next. We’ll discuss three general techniques:

+
  1. checking your work by creating simple positive and negative controls
  2. +
  3. combining regular expressions with Boolean algebra
  4. +
  5. creating complex patterns using string manipulation
  6. +
+
+

+Check your work

+

First, let’s find all sentences that start with “The”. Using the ^ anchor alone is not enough:

+
+
str_view(sentences, "^The")
+#>  [1] │ <The> birch canoe slid on the smooth planks.
+#>  [4] │ <The>se days a chicken leg is a rare dish.
+#>  [6] │ <The> juice of lemons makes fine punch.
+#>  [7] │ <The> box was thrown beside the parked truck.
+#>  [8] │ <The> hogs were fed chopped corn and garbage.
+#> [11] │ <The> boy was there when the sun rose.
+#> [13] │ <The> source of the huge river is the clear spring.
+#> [18] │ <The> soft cushion broke the man's fall.
+#> [19] │ <The> salt breeze came across from the sea.
+#> [20] │ <The> girl at the booth sold fifty bonds.
+#> ... and 267 more
+
+

That’s because the pattern also matches sentences starting with words like They or These. We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:

+
+
str_view(sentences, "^The\\b")
+#>  [1] │ <The> birch canoe slid on the smooth planks.
+#>  [6] │ <The> juice of lemons makes fine punch.
+#>  [7] │ <The> box was thrown beside the parked truck.
+#>  [8] │ <The> hogs were fed chopped corn and garbage.
+#> [11] │ <The> boy was there when the sun rose.
+#> [13] │ <The> source of the huge river is the clear spring.
+#> [18] │ <The> soft cushion broke the man's fall.
+#> [19] │ <The> salt breeze came across from the sea.
+#> [20] │ <The> girl at the booth sold fifty bonds.
+#> [21] │ <The> small pup gnawed a hole in the sock.
+#> ... and 246 more
+
+

What about finding all sentences that begin with a pronoun?

+
+
str_view(sentences, "^She|He|It|They\\b")
+#>   [3] │ <It>'s easy to tell the depth of a well.
+#>  [15] │ <He>lp the woman get back to her feet.
+#>  [27] │ <He>r purse was full of useless trash.
+#>  [29] │ <It> snowed, rained, and hailed the same morning.
+#>  [63] │ <He> ran half way to the hardware store.
+#>  [90] │ <He> lay prone and hardly moved a limb.
+#> [116] │ <He> ordered peach pie with ice cream.
+#> [118] │ <He>mp is a weed found in parts of the tropics.
+#> [127] │ <It> caught its hind paw in a rusty trap.
+#> [132] │ <He> said the same phrase thirty times.
+#> ... and 53 more
+
+

A quick inspection of the results shows that we’re getting some spurious matches. That’s because we’ve forgotten to use parentheses:

+
+
str_view(sentences, "^(She|He|It|They)\\b")
+#>   [3] │ <It>'s easy to tell the depth of a well.
+#>  [29] │ <It> snowed, rained, and hailed the same morning.
+#>  [63] │ <He> ran half way to the hardware store.
+#>  [90] │ <He> lay prone and hardly moved a limb.
+#> [116] │ <He> ordered peach pie with ice cream.
+#> [127] │ <It> caught its hind paw in a rusty trap.
+#> [132] │ <He> said the same phrase thirty times.
+#> [153] │ <He> broke a new shoelace that day.
+#> [159] │ <She> sewed the torn coat quite neatly.
+#> [168] │ <He> knew the skill of the great young actress.
+#> ... and 47 more
+
+

You might wonder how you’d spot such a mistake if it didn’t occur in the first few matches. A good technique is to create a few positive and negative matches and use them to test that your pattern works as expected:

+
+
pos <- c("He is a boy", "She had a good time")
+neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")
+
+pattern <- "^(She|He|It|They)\\b"
+str_detect(pos, pattern)
+#> [1] TRUE TRUE
+str_detect(neg, pattern)
+#> [1] FALSE FALSE
+
+

It’s typically much easier to come up with good positive examples than negative examples, because it takes a while before you’re good enough with regular expressions to predict where your weaknesses are. Nevertheless, they’re still useful: as you work on the problem you can slowly accumulate a collection of your mistakes, ensuring that you never make the same mistake twice.
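For example, here’s a minimal sketch (ours, not part of the original text) of how you might bundle the controls into a check that you can re-run after every tweak to the pattern:

test_pattern <- function(pattern, pos, neg) {
  stopifnot(
    all(str_detect(pos, pattern)),  # every positive control must match
    !any(str_detect(neg, pattern))  # no negative control may match
  )
}
test_pattern("^(She|He|It|They)\\b", pos, neg)  # passes silently when all is well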

+
+ +
+

+Boolean operations

+

Imagine we want to find words that only contain consonants. One technique is to create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$):

+
+
str_view(words, "^[^aeiou]+$")
+#> [123] │ <by>
+#> [249] │ <dry>
+#> [328] │ <fly>
+#> [538] │ <mrs>
+#> [895] │ <try>
+#> [952] │ <why>
+
+

But you can make this a bit easier by flipping the problem around. Instead of looking for words that contain only consonants, we could look for words that don’t contain any vowels:

+
+
str_view(words[!str_detect(words, "[aeiou]")])
+#> [1] │ by
+#> [2] │ dry
+#> [3] │ fly
+#> [4] │ mrs
+#> [5] │ try
+#> [6] │ why
+
+

This is a useful technique whenever you’re dealing with logical combinations, particularly those involving “and” or “not”. For example, imagine you want to find all words that contain “a” and “b”. There’s no “and” operator built into regular expressions so we have to tackle it by looking for all words that contain an “a” followed by a “b”, or a “b” followed by an “a”:

+
+
str_view(words, "a.*b|b.*a")
+#>  [2] │ <ab>le
+#>  [3] │ <ab>out
+#>  [4] │ <ab>solute
+#> [62] │ <availab>le
+#> [66] │ <ba>by
+#> [67] │ <ba>ck
+#> [68] │ <ba>d
+#> [69] │ <ba>g
+#> [70] │ <bala>nce
+#> [71] │ <ba>ll
+#> ... and 20 more
+
+

It’s simpler to combine the results of two calls to #chp-https://stringr.tidyverse.org/reference/str_detect:

+
+
words[str_detect(words, "a") & str_detect(words, "b")]
+#>  [1] "able"      "about"     "absolute"  "available" "baby"      "back"     
+#>  [7] "bad"       "bag"       "balance"   "ball"      "bank"      "bar"      
+#> [13] "base"      "basis"     "bear"      "beat"      "beauty"    "because"  
+#> [19] "black"     "board"     "boat"      "break"     "brilliant" "britain"  
+#> [25] "debate"    "husband"   "labour"    "maybe"     "probable"  "table"
+
+

What if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:

+
+
words[str_detect(words, "a.*e.*i.*o.*u")]
+# ...
+words[str_detect(words, "u.*o.*i.*e.*a")]
+
+

It’s much simpler to combine five calls to #chp-https://stringr.tidyverse.org/reference/str_detect:

+
+
words[
+  str_detect(words, "a") &
+  str_detect(words, "e") &
+  str_detect(words, "i") &
+  str_detect(words, "o") &
+  str_detect(words, "u")
+]
+#> character(0)
+
+

In general, if you get stuck trying to create a single regexp that solves your problem, take a step back and think about whether you could break the problem down into smaller pieces, solving each challenge before moving on to the next one.

+
+ +
+

+Creating a pattern with code

+

What if we wanted to find all sentences that mention a color? The basic idea is simple: we just combine alternation with word boundaries.

+
+
str_view(sentences, "\\b(red|green|blue)\\b")
+#>   [2] │ Glue the sheet to the dark <blue> background.
+#>  [26] │ Two <blue> fish swam in the tank.
+#>  [92] │ A wisp of cloud hung in the <blue> air.
+#> [148] │ The spot on the blotter was made by <green> ink.
+#> [160] │ The sofa cushion is <red> and of light weight.
+#> [174] │ The sky that morning was clear and bright <blue>.
+#> [204] │ A <blue> crane is a tall wading bird.
+#> [217] │ It is hard to erase <blue> or <red> ink.
+#> [224] │ The lamp shone with a steady <green> flame.
+#> [247] │ The box is held by a bright <red> snapper.
+#> ... and 16 more
+
+

But as the number of colors grows, it would quickly get tedious to construct this pattern by hand. Wouldn’t it be nice if we could store the colors in a vector?

+
+
rgb <- c("red", "green", "blue")
+
+

Well, we can! We’d just need to create the pattern from the vector using #chp-https://stringr.tidyverse.org/reference/str_c and #chp-https://stringr.tidyverse.org/reference/str_flatten:

+
+
str_c("\\b(", str_flatten(rgb, "|"), ")\\b")
+#> [1] "\\b(red|green|blue)\\b"
+
+

We could make this pattern more comprehensive if we had a good list of colors. One place we could start from is the list of built-in colors that R can use for plots:

+
+
str_view(colors())
+#>  [1] │ white
+#>  [2] │ aliceblue
+#>  [3] │ antiquewhite
+#>  [4] │ antiquewhite1
+#>  [5] │ antiquewhite2
+#>  [6] │ antiquewhite3
+#>  [7] │ antiquewhite4
+#>  [8] │ aquamarine
+#>  [9] │ aquamarine1
+#> [10] │ aquamarine2
+#> ... and 647 more
+
+

But let’s first eliminate the numbered variants:

+
+
cols <- colors()
+cols <- cols[!str_detect(cols, "\\d")]
+str_view(cols)
+#>  [1] │ white
+#>  [2] │ aliceblue
+#>  [3] │ antiquewhite
+#>  [4] │ aquamarine
+#>  [5] │ azure
+#>  [6] │ beige
+#>  [7] │ bisque
+#>  [8] │ black
+#>  [9] │ blanchedalmond
+#> [10] │ blue
+#> ... and 133 more
+
+

Then we can turn this into one giant pattern. We won’t show the pattern here because it’s huge, but you can see it working:

+
+
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")
+str_view(sentences, pattern)
+#>   [2] │ Glue the sheet to the dark <blue> background.
+#>  [12] │ A rod is used to catch <pink> <salmon>.
+#>  [26] │ Two <blue> fish swam in the tank.
+#>  [66] │ Cars and busses stalled in <snow> drifts.
+#>  [92] │ A wisp of cloud hung in the <blue> air.
+#> [112] │ Leaves turn <brown> and <yellow> in the fall.
+#> [148] │ The spot on the blotter was made by <green> ink.
+#> [149] │ Mud was spattered on the front of his <white> shirt.
+#> [160] │ The sofa cushion is <red> and of light weight.
+#> [167] │ The office paint was a dull, sad <tan>.
+#> ... and 53 more
+
+

In this example, cols only contains numbers and letters so you don’t need to worry about metacharacters. But in general, whenever you create patterns from existing strings it’s wise to run them through #chp-https://stringr.tidyverse.org/reference/str_escape to ensure they match literally.
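As a small illustration (ours, with made-up strings): an unescaped . matches any character, while escaping makes it match only the literal dot:

str_detect(c("a.c", "abc"), ".")
#> [1] TRUE TRUE
str_detect(c("a.c", "abc"), str_escape("."))
#> [1]  TRUE FALSE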

+
+ +
+

+Exercises

+
  1. +

    For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple #chp-https://stringr.tidyverse.org/reference/str_detect calls.

    +
    1. Find all words that start or end with x.
    2. +
    3. Find all words that start with a vowel and end with a consonant.
    4. +
    5. Are there any words that contain at least one of each different vowel?
    6. +
  2. +
  3. Construct patterns to find evidence for and against the rule “i before e except after c”.

  4. +
  5. #chp-https://rdrr.io/r/grDevices/colors contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified.)

  6. +
  7. Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the #chp-https://rdrr.io/r/utils/data function: data(package = "datasets")$results[, "Item"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.

  8. +
+
+ +
+

+Regular expressions in other places

+

Just like in the stringr and tidyr functions, there are many other places in R where you can use regular expressions. The following sections describe some other useful functions in the wider tidyverse and base R.

+ +
+

+tidyverse

+

There are three other particularly useful places where you might want to use regular expressions:

+
+ +
+

+Base R

+

apropos(pattern) searches all objects available from the global environment that match the given pattern. This is useful if you can’t quite remember the name of a function:

+
+
apropos("replace")
+#> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
+#> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
+
+

list.files(path, pattern) lists all files in path that match a regular expression pattern. For example, you can find all the R Markdown files in the current directory with:

+
+
head(list.files(pattern = "\\.Rmd$"))
+#> character(0)
+
+

It’s worth noting that the pattern language used by base R is very slightly different to that used by stringr. That’s because stringr is built on top of the #chp-https://stringi.gagolewski, which is in turn built on top of the #chp-https://unicode-org.github.io/icu/userguide/strings/regexp, whereas base R functions use either the #chp-https://github.com/laurikari/tre or the #chp-https://www.pcre, depending on whether or not you’ve set perl = TRUE. Fortunately, the basics of regular expressions are so well established that you’ll encounter few variations when working with the patterns you’ll learn in this book. You only need to be aware of the difference when you start to rely on advanced features like complex Unicode character ranges or special features that use the (?…) syntax.
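For instance (our illustration), the inline (?i) flag is a PCRE feature, so in base R it only works when you set perl = TRUE:

grepl("(?i)banana", c("banana", "BANANA"), perl = TRUE)
#> [1] TRUE TRUE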

+
+
+ +
+

+Summary

+

With every punctuation character potentially overloaded with meaning, regular expressions are one of the most compact languages out there. They’re definitely confusing at first but as you train your eyes to read them and your brain to understand them, you unlock a powerful skill that you can use in R and in many other places.

+

In this chapter, you’ve started your journey to become a regular expression master by learning the most useful stringr functions and the most important components of the regular expression language. And there are plenty of resources to learn more.

+

A good place to start is #chp-https://stringr.tidyverse.org/articles/regular-expressions: it documents the full set of syntax supported by stringr. Another useful reference is https://www.regular-expressions.info/. It’s not R specific, but you can use it to learn about the most advanced features of regexes and how they work under the hood.

+

It’s also good to know that stringr is implemented on top of the stringi package by Marek Gagolewski. If you’re struggling to find a function that does what you need in stringr, don’t be afraid to look in stringi. You’ll find stringi very easy to pick up because it follows many of the same conventions as stringr.

+

In the next chapter, we’ll talk about a data structure closely related to strings: factors. Factors are used to represent categorical data in R, i.e. data with a fixed and known set of possible values identified by a vector of strings.

+ + +
+
diff --git a/oreilly/spreadsheets.html b/oreilly/spreadsheets.html new file mode 100644 index 0000000..51cb011 --- /dev/null +++ b/oreilly/spreadsheets.html @@ -0,0 +1,539 @@ +
+

Spreadsheets

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

So far you have learned about importing data from plain text files, e.g. .csv and .tsv files. Sometimes you need to analyze data that lives in a spreadsheet. In this chapter we will introduce you to tools for working with data in Excel spreadsheets and Google Sheets. This will build on much of what you’ve learned in #chp-data-import but we will also discuss additional considerations and complexities when working with data from spreadsheets.

+

If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper “Data Organization in Spreadsheets” by Karl Broman and Kara Woo: https://doi.org/10.1080/00031305.2017.1375989. The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyse and visualise.

+
+ +
+

+Excel

+ +
+

+Prerequisites

+

In this chapter, you’ll learn how to load data from Excel spreadsheets in R with the readxl package. This package is non-core tidyverse, so you need to load it explicitly but it is installed automatically when you install the tidyverse package.

+
+
library(readxl)
+library(tidyverse)
+
+

xlsx and XLConnect can be used for reading data from and writing data to Excel spreadsheets. However, these two packages require Java installed on your machine and the rJava package. Due to potential challenges with installation, we recommend using alternative packages we’ve introduced in this chapter.

+
+ +
+

+Getting started

+

Most of readxl’s functions allow you to load Excel spreadsheets into R: read_xls() reads files in the legacy xls format, read_xlsx() reads the newer xlsx format, and read_excel() reads both, guessing the format from the input.

+

These functions all have similar syntax just like other functions we have previously introduced for reading other types of files, e.g. #chp-https://readr.tidyverse.org/reference/read_delim, #chp-https://readr.tidyverse.org/reference/read_table, etc. For the rest of the chapter we will focus on using #chp-https://readxl.tidyverse.org/reference/read_excel.

+
+ +
+

+Reading spreadsheets

+

#fig-students-excel shows what the spreadsheet we’re going to read into R looks like in Excel.

+
+
+ +

A look at the students spreadsheet in Excel. The spreadsheet contains information on 6 students, their ID, full name, favourite food, meal plan, and age.

+
Figure 20.1: Spreadsheet called students.xlsx in Excel.
+
+
+
+

The first argument to #chp-https://readxl.tidyverse.org/reference/read_excel is the path to the file to read.

+
+
students <- read_excel("data/students.xlsx")
+
+

#chp-https://readxl.tidyverse.org/reference/read_excel will read the file in as a tibble.

+
+
students
+#> # A tibble: 6 × 5
+#>   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
+#>          <dbl> <chr>            <chr>              <chr>               <chr>
+#> 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
+#> 2            2 Barclay Lynn     French fries       Lunch only          5    
+#> 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
+#> 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
+#> 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
+#> 6            6 Güvenç Attila    Ice cream          Lunch only          6
+
+

We have six students in the data and five variables on each student. However there are a few things we might want to address in this dataset:

+
  1. +

    The column names are all over the place. You can provide column names that follow a consistent format; we recommend snake_case using the col_names argument.

    +
    +
    read_excel(
    +  "data/students.xlsx",
    +  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
    +)
    +#> # A tibble: 7 × 5
    +#>   student_id full_name        favourite_food     meal_plan           age  
    +#>   <chr>      <chr>            <chr>              <chr>               <chr>
    +#> 1 Student ID Full Name        favourite.food     mealPlan            AGE  
    +#> 2 1          Sunil Huffmann   Strawberry yoghurt Lunch only          4    
    +#> 3 2          Barclay Lynn     French fries       Lunch only          5    
    +#> 4 3          Jayendra Lyne    N/A                Breakfast and lunch 7    
    +#> 5 4          Leon Rossini     Anchovies          Lunch only          <NA> 
    +#> 6 5          Chidiegwu Dunkel Pizza              Breakfast and lunch five 
    +#> # … with 1 more row
    +
    +

    Unfortunately, this didn’t quite do the trick. You now have the variable names we want, but what was previously the header row now shows up as the first observation in the data. You can explicitly skip that row using the skip argument.

    +
    +
    read_excel(
    +  "data/students.xlsx",
    +  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
    +  skip = 1
    +)
    +#> # A tibble: 6 × 5
    +#>   student_id full_name        favourite_food     meal_plan           age  
    +#>        <dbl> <chr>            <chr>              <chr>               <chr>
    +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
    +#> 2          2 Barclay Lynn     French fries       Lunch only          5    
    +#> 3          3 Jayendra Lyne    N/A                Breakfast and lunch 7    
    +#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
    +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
    +#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
    +
    +
  2. +
  3. +

    In the favourite_food column, one of the observations is N/A, which stands for “not available” but it’s currently not recognized as an NA (note the contrast between this N/A and the age of the fourth student in the list). You can specify which character strings should be recognized as NAs with the na argument. By default, only "" (empty string, or, in the case of reading from a spreadsheet, an empty cell) is recognized as an NA.

    +
    +
    read_excel(
    +  "data/students.xlsx",
    +  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
    +  skip = 1,
    +  na = c("", "N/A")
    +)
    +#> # A tibble: 6 × 5
    +#>   student_id full_name        favourite_food     meal_plan           age  
    +#>        <dbl> <chr>            <chr>              <chr>               <chr>
    +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
    +#> 2          2 Barclay Lynn     French fries       Lunch only          5    
    +#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch 7    
    +#> 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
    +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
    +#> 6          6 Güvenç Attila    Ice cream          Lunch only          6
    +
    +
  4. +
  5. +

    One other remaining issue is that age is read in as a character variable, but it really should be numeric. Just like with #chp-https://readr.tidyverse.org/reference/read_delim and friends for reading data from flat files, you can supply a col_types argument to #chp-https://readxl.tidyverse.org/reference/read_excel and specify the column types for the variables you read in. The syntax is a bit different, though. Your options are "skip", "guess", "logical", "numeric", "date", "text" or "list".

    +
    +
    read_excel(
    +  "data/students.xlsx",
    +  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
    +  skip = 1,
    +  na = c("", "N/A"),
    +  col_types = c("numeric", "text", "text", "text", "numeric")
    +)
    +#> Warning: Expecting numeric in E6 / R6C5: got 'five'
    +#> # A tibble: 6 × 5
    +#>   student_id full_name        favourite_food     meal_plan             age
    +#>        <dbl> <chr>            <chr>              <chr>               <dbl>
    +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    +#> 2          2 Barclay Lynn     French fries       Lunch only              5
    +#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    +#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch    NA
    +#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    +
    +

    However, this didn’t quite produce the desired result either. By specifying that age should be numeric, we have turned the one cell with the non-numeric entry (which had the value five) into an NA. In this case, we should read age in as "text" and then make the change once the data is loaded in R.

    +
    +
    students <- read_excel(
    +  "data/students.xlsx",
    +  col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
    +  skip = 1,
    +  na = c("", "N/A"),
    +  col_types = c("numeric", "text", "text", "text", "text")
    +)
    +
    +students <- students |>
    +  mutate(
    +    age = if_else(age == "five", "5", age),
    +    age = parse_number(age)
    +  )
    +
    +students
    +#> # A tibble: 6 × 5
    +#>   student_id full_name        favourite_food     meal_plan             age
    +#>        <dbl> <chr>            <chr>              <chr>               <dbl>
    +#> 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only              4
    +#> 2          2 Barclay Lynn     French fries       Lunch only              5
    +#> 3          3 Jayendra Lyne    <NA>               Breakfast and lunch     7
    +#> 4          4 Leon Rossini     Anchovies          Lunch only             NA
    +#> 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch     5
    +#> 6          6 Güvenç Attila    Ice cream          Lunch only              6
    +
    +
  6. +

It took us multiple steps and trial-and-error to load the data in exactly the format we want, and this is not unexpected. Data science is an iterative process. There is no way to know exactly what the data will look like until you load it and take a look at it. Well, there is one way, actually. You can open the file in Excel and take a peek. That might be tempting, but we strongly recommend against it. Instead, you should not be afraid of doing what we did here: load the data, take a peek, make adjustments to your code, load it again, and repeat until you’re happy with the result.

+
+ +
+

+Reading individual sheets

+

An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets. #fig-penguins-islands shows an Excel spreadsheet with multiple sheets. The data come from the palmerpenguins package. Each sheet contains information on penguins from a different island where data were collected.

+
+
+ +

A look at the penguins spreadsheet in Excel. The spreadsheet has three sheets: Torgersen Island, Biscoe Island, and Dream Island.

+
Figure 20.2: Spreadsheet called penguins.xlsx in Excel.
+
+
+
+

You can read a single sheet from a spreadsheet with the sheet argument in #chp-https://readxl.tidyverse.org/reference/read_excel.

+
+
read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
+#> # A tibble: 52 × 8
+#>   species island    bill_length_mm     bill_depth_mm flipp…¹ body_…² sex    year
+#>   <chr>   <chr>     <chr>              <chr>         <chr>   <chr>   <chr> <dbl>
+#> 1 Adelie  Torgersen 39.1               18.7          181     3750    male   2007
+#> 2 Adelie  Torgersen 39.5               17.399999999… 186     3800    fema…  2007
+#> 3 Adelie  Torgersen 40.299999999999997 18            195     3250    fema…  2007
+#> 4 Adelie  Torgersen NA                 NA            NA      NA      NA     2007
+#> 5 Adelie  Torgersen 36.700000000000003 19.3          193     3450    fema…  2007
+#> 6 Adelie  Torgersen 39.299999999999997 20.6          190     3650    male   2007
+#> # … with 46 more rows, and abbreviated variable names ¹​flipper_length_mm,
+#> #   ²​body_mass_g
+
+

Some variables that appear to contain numerical data are read in as characters due to the character string "NA" not being recognized as a true NA.

+
+
penguins_torgersen <- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
+
+penguins_torgersen
+#> # A tibble: 52 × 8
+#>   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
+#>   <chr>   <chr>              <dbl>         <dbl>       <dbl>   <dbl> <chr> <dbl>
+#> 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
+#> 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
+#> 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
+#> 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
+#> 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
+#> 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
+#> # … with 46 more rows, and abbreviated variable names ¹​flipper_length_mm,
+#> #   ²​body_mass_g
+
+

However, we cheated here a bit. We looked inside the Excel spreadsheet, which is not a recommended workflow. Instead, you can use #chp-https://readxl.tidyverse.org/reference/excel_sheets to get information on all sheets in an Excel spreadsheet, and then read the one(s) you’re interested in.

+
+
excel_sheets("data/penguins.xlsx")
+#> [1] "Torgersen Island" "Biscoe Island"    "Dream Island"
+
+

Once you know the names of the sheets, you can read them in individually with #chp-https://readxl.tidyverse.org/reference/read_excel.

+
+
penguins_biscoe <- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
+penguins_dream  <- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")
+
+

In this case the full penguins dataset is spread across three sheets in the spreadsheet. Each sheet has the same number of columns but different numbers of rows.

+
+
dim(penguins_torgersen)
+#> [1] 52  8
+dim(penguins_biscoe)
+#> [1] 168   8
+dim(penguins_dream)
+#> [1] 124   8
+
+

We can put them together with #chp-https://dplyr.tidyverse.org/reference/bind_rows.

+
+
penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
+penguins
+#> # A tibble: 344 × 8
+#>   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
+#>   <chr>   <chr>              <dbl>         <dbl>       <dbl>   <dbl> <chr> <dbl>
+#> 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
+#> 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
+#> 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
+#> 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
+#> 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
+#> 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
+#> # … with 338 more rows, and abbreviated variable names ¹​flipper_length_mm,
+#> #   ²​body_mass_g
+
+

In #chp-iteration we’ll talk about ways of doing this sort of task without repetitive code.
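As a preview, here’s a sketch of that approach using purrr (loaded with the tidyverse; map() and list_rbind() as used here need purrr 1.0.0): read every sheet by name, then row-bind the results.

penguins2 <- excel_sheets("data/penguins.xlsx") |>
  map(\(sheet) read_excel("data/penguins.xlsx", sheet = sheet, na = "NA")) |>
  list_rbind()
# penguins2 contains the same 344 rows as the bind_rows() version above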

+
+ +
+

+Reading part of a sheet

+

Since many people use Excel spreadsheets for presentation as well as for data storage, it’s quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R. #fig-deaths-excel shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.

+
+
+ +

A look at the deaths spreadsheet in Excel. The spreadsheet has four rows on top that contain non-data information; the text 'For the sake of consistency in the data layout, which is really a beautiful thing, I will keep making notes up here.' is spread across cells in these top four rows. Then, there is a data frame that includes information on deaths of 10 famous people, including their names, professions, ages, whether they have kids or not, date of birth and death. At the bottom, there are four more rows of non-data information; the text 'This has been really fun, but we're signing off now!' is spread across cells in these bottom four rows.

+
Figure 20.3: Spreadsheet called deaths.xlsx in Excel.
+
+
+
+

This spreadsheet is one of the example spreadsheets provided in the readxl package. You can use the #chp-https://readxl.tidyverse.org/reference/readxl_example function to locate the spreadsheet on your system in the directory where the package is installed. This function returns the path to the spreadsheet, which you can use in #chp-https://readxl.tidyverse.org/reference/read_excel as usual.

+
+
deaths_path <- readxl_example("deaths.xlsx")
+deaths <- read_excel(deaths_path)
+#> New names:
+#> • `` -> `...2`
+#> • `` -> `...3`
+#> • `` -> `...4`
+#> • `` -> `...5`
+#> • `` -> `...6`
+deaths
+#> # A tibble: 18 × 6
+#>   `Lots of people`             ...2       ...3  ...4     ...5          ...6     
+#>   <chr>                        <chr>      <chr> <chr>    <chr>         <chr>    
+#> 1 simply cannot resist writing <NA>       <NA>  <NA>     <NA>          some not…
+#> 2 at                           the        top   <NA>     of            their sp…
+#> 3 or                           merging    <NA>  <NA>     <NA>          cells    
+#> 4 Name                         Profession Age   Has kids Date of birth Date of …
+#> 5 David Bowie                  musician   69    TRUE     17175         42379    
+#> 6 Carrie Fisher                actor      60    TRUE     20749         42731    
+#> # … with 12 more rows
+
+

The top four rows and the bottom four rows are not part of the data frame.

+

We could skip the rows of notes at the top with skip. Note that we set skip = 4 since there are four rows of notes before the row that contains the column names.

+
+
read_excel(deaths_path, skip = 4)
+#> # A tibble: 14 × 6
+#>   Name          Profession Age   `Has kids` `Date of birth`     `Date of death`
+#>   <chr>         <chr>      <chr> <chr>      <dttm>              <chr>          
+#> 1 David Bowie   musician   69    TRUE       1947-01-08 00:00:00 42379          
+#> 2 Carrie Fisher actor      60    TRUE       1956-10-21 00:00:00 42731          
+#> 3 Chuck Berry   musician   90    TRUE       1926-10-18 00:00:00 42812          
+#> 4 Bill Paxton   actor      61    TRUE       1955-05-17 00:00:00 42791          
+#> 5 Prince        musician   57    TRUE       1958-06-07 00:00:00 42481          
+#> 6 Alan Rickman  actor      69    FALSE      1946-02-21 00:00:00 42383          
+#> # … with 8 more rows
+
+

We could also set n_max to omit the extraneous rows at the bottom.

+
+
read_excel(deaths_path, skip = 4, n_max = 10)
+#> # A tibble: 10 × 6
+#>   Name          Profession   Age Has k…¹ `Date of birth`     `Date of death`    
+#>   <chr>         <chr>      <dbl> <lgl>   <dttm>              <dttm>             
+#> 1 David Bowie   musician      69 TRUE    1947-01-08 00:00:00 2016-01-10 00:00:00
+#> 2 Carrie Fisher actor         60 TRUE    1956-10-21 00:00:00 2016-12-27 00:00:00
+#> 3 Chuck Berry   musician      90 TRUE    1926-10-18 00:00:00 2017-03-18 00:00:00
+#> 4 Bill Paxton   actor         61 TRUE    1955-05-17 00:00:00 2017-02-25 00:00:00
+#> 5 Prince        musician      57 TRUE    1958-06-07 00:00:00 2016-04-21 00:00:00
+#> 6 Alan Rickman  actor         69 FALSE   1946-02-21 00:00:00 2016-01-14 00:00:00
+#> # … with 4 more rows, and abbreviated variable name ¹​`Has kids`
+
+

Another approach is using cell ranges. In Excel, the top left cell is A1. As you move across columns to the right, the cell label moves down the alphabet, i.e. B1, C1, etc. And as you move down a column, the number in the cell label increases, i.e. A2, A3, etc.

+

The data we want to read in starts in cell A5 and ends in cell F15. In spreadsheet notation, this is A5:F15.

+
  • +

    Supply this information to the range argument:

    +
    +
    read_excel(deaths_path, range = "A5:F15")
    +
    +
  • +
  • +

    Specify rows:

    +
    +
    read_excel(deaths_path, range = cell_rows(c(5, 15)))
    +
    +
  • +
  • +

    Specify cells that mark the top-left and bottom-right corners of the data – the top-left corner, A5, translates to c(5, 1) (5th row down, 1st column) and the bottom-right corner, F15, translates to c(15, 6):

    +
    +
    read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))
    +
    +
  • +

If you have control over the sheet, an even better way is to create a “named range”. This is useful within Excel because named ranges make repeated formulas easier to create, and they have some useful properties for creating dynamic charts and graphs as well. Even if you’re not working in Excel, named ranges can be useful for identifying which cells to read into R. In the example above, the table we’re reading in is named Table1, so we can read it in with the following.

+

TO DO: Add this once reading in named ranges is implemented in readxl.

+
+ +
+

+Data types

+

In CSV files, all values are strings. This is not particularly true to the data, but it is simple: everything is a string.

+

The underlying data in Excel spreadsheets is more complex. A cell can be one of five things:

+
  • A logical, like TRUE / FALSE

  • +
  • A number, like “10” or “10.5”

  • +
  • A date, which can also include time like “11/1/21” or “11/1/21 3:00 PM”

  • +
  • A string, like “ten”

  • +
  • A currency, which allows numeric values in a limited range and four decimal digits of fixed precision

  • +

When working with spreadsheet data, it’s important to keep in mind that how the underlying data is stored can be very different than what you see in the cell. For example, Excel has no notion of an integer. All numbers are stored as floating points, but you can choose to display the data with a customizable number of decimal points. Similarly, dates are actually stored as numbers, specifically the number of days since January 1, 1900. You can customize how you display the date by applying formatting in Excel. Confusingly, it’s also possible to have something that looks like a number but is actually a string (e.g. type '10 into a cell in Excel).
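You can see the serial-number storage in the deaths example above, where David Bowie’s Date of death came through as 42379 before we fixed the column types. Converting it by hand (our illustration) recovers the date; R expresses Excel’s Windows origin as 1899-12-30 to absorb Excel’s historical leap-year quirk:

as.Date(42379, origin = "1899-12-30")
#> [1] "2016-01-10"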

+

These differences between how the underlying data are stored vs. how they’re displayed can cause surprises when the data are loaded into R. By default readxl will guess the data type in a given column. A recommended workflow is to let readxl guess the column types, confirm that you’re happy with the guessed column types, and if not, go back and re-import specifying col_types as shown in #sec-reading-spreadsheets.

+

Another challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g. some cells are numeric, others text, others dates. When importing the data into R readxl has to make some decisions. In these cases you can set the type for this column to "list", which will load the column as a list of length 1 vectors, where the type of each element of the vector is guessed.
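For example, here’s a sketch based on readxl’s documentation (clippy.xlsx is one of the example files shipped with readxl; its value column mixes several types):

clippy <- read_excel(readxl_example("clippy.xlsx"), col_types = c("text", "list"))
# Each element of the value list column keeps its own guessed type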

+
+ +
+

+Data not in cell values

+

tidyxl is useful for importing non-tabular data from Excel files into R. For example, tidyxl doesn’t coerce a pivot table into a data frame. See https://nacnudus.github.io/spreadsheet-munging-strategies/ for more on strategies for working with non-tabular data from Excel.

+
+ +
+

+Writing to Excel

+

Let’s create a small data frame that we can then write out. Note that item is a factor and quantity is a double.

+
+
bake_sale <- tibble(
+  item     = factor(c("brownie", "cupcake", "cookie")),
+  quantity = c(10, 5, 8)
+)
+
+bake_sale
+#> # A tibble: 3 × 2
+#>   item    quantity
+#>   <fct>      <dbl>
+#> 1 brownie       10
+#> 2 cupcake        5
+#> 3 cookie         8
+
+

You can write data back to disk as an Excel file using the #chp-https://docs.ropensci.org/writexl/reference/write_xlsx function from the writexl package.

+
+
library(writexl)
+write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
+
+

#fig-bake-sale-excel shows what the data looks like in Excel. Note that column names are included and bolded. These can be turned off by setting the col_names and format_headers arguments to FALSE.
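For example, here’s a sketch that writes the same data without the header row (bake-sale-bare.xlsx is a hypothetical second file, so the one written above is left alone):

write_xlsx(
  bake_sale,
  path = "data/bake-sale-bare.xlsx",  # hypothetical file name
  col_names = FALSE,
  format_headers = FALSE
)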

+
+
+ +

Bake sale data frame created earlier in Excel.

+
Figure 20.4: Spreadsheet called bake-sale.xlsx in Excel.
+
+
+
+

Just like reading from a CSV, information on data type is lost when we read the data back in. This makes Excel files unreliable for caching interim results as well. For alternatives, see #sec-writing-to-a-file.

+
+
read_excel("data/bake-sale.xlsx")
+#> # A tibble: 3 × 2
+#>   item    quantity
+#>   <chr>      <dbl>
+#> 1 brownie       10
+#> 2 cupcake        5
+#> 3 cookie         8
+
+
+ +
+

+Formatted output

+

The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you’re interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the openxlsx package. Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. For example, function names are camelCase, multiple functions can’t be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. As your R usage expands beyond this book, you will encounter lots of different coding styles used in various R packages that you might need to accomplish specific goals. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.

+

Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the penguins data frame.

+
+
library(openxlsx)
+library(palmerpenguins)
+
+# Create a workbook (spreadsheet)
+penguins_species <- createWorkbook()
+
+# Add three sheets to the spreadsheet
+addWorksheet(penguins_species, sheetName = "Adelie")
+addWorksheet(penguins_species, sheetName = "Gentoo")
+addWorksheet(penguins_species, sheetName = "Chinstrap")
+
+# Write data to each sheet
+writeDataTable(
+  penguins_species, 
+  sheet = "Adelie", 
+  x = penguins |> filter(species == "Adelie")
+)
+writeDataTable(
+  penguins_species, 
+  sheet = "Gentoo", 
+  x = penguins |> filter(species == "Gentoo")
+)
+writeDataTable(
+  penguins_species, 
+  sheet = "Chinstrap", 
+  x = penguins |> filter(species == "Chinstrap")
+)
+
+

This creates a workbook object:

+
+
penguins_species
+#> A Workbook object.
+#>  
+#> Worksheets:
+#>  Sheet 1: "Adelie"
+#>  
+#> 
+#>  Sheet 2: "Gentoo"
+#>  
+#> 
+#>  Sheet 3: "Chinstrap"
+#>  
+#> 
+#>  
+#>  Worksheet write order: 1, 2, 3
+#>  Active Sheet 1: "Adelie" 
+#>  Position: 1
+
+

And we can write it to a file with #chp-https://rdrr.io/pkg/openxlsx/man/saveWorkbook.

+
+
saveWorkbook(penguins_species, "data/penguins-species.xlsx")
+
+

The resulting spreadsheet is shown in #fig-penguins-species. By default, openxlsx formats the data as an Excel table.

+
+
+ +

A look at the penguins-species spreadsheet in Excel. The spreadsheet has three sheets: Adelie, Gentoo, and Chinstrap.

+
Figure 20.5: Spreadsheet called penguins-species.xlsx in Excel.
+
+
+
+

See https://ycphs.github.io/openxlsx/articles/Formatting.html for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.

+
+ +
+

+Exercises

+
  1. Recreate the bake_sale data frame, and write it out to an Excel file using the #chp-https://rdrr.io/pkg/openxlsx/man/write.xlsx function from the openxlsx package.
  2. +
  3. What happens if you try to read in a file with .xlsx extension with #chp-https://readxl.tidyverse.org/reference/read_excel?
  4. +
+
+ +
+

+Google Sheets

+ + +
+

+Prerequisites

+

TO DO:

+
  • use googlesheets4
  • +
  • why 4?
  • +
+ +
+

+Getting started

+

TO DO:

+
  • reading from public sheet with read_sheet() and read_range() +
  • +
+ +
+

+Authentication

+
+ +
+

+Read sheets

+
+ +
+

+Write sheets

+
+ +
+

+Exercises

+ + +
+
+
diff --git a/oreilly/strings.html b/oreilly/strings.html new file mode 100644 index 0000000..c16b365 --- /dev/null +++ b/oreilly/strings.html @@ -0,0 +1,752 @@ +
+

Strings

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+ +
+

+Introduction

+

So far, you’ve used a bunch of strings without learning much about the details. Now it’s time to dive into them, learning what makes strings tick, and mastering some of the powerful string manipulation tools you have at your disposal.

+

We’ll begin with the details of creating strings and character vectors. You’ll then dive into creating strings from data, then the opposite: extracting strings from data. The chapter finishes with functions that work with individual letters and a brief discussion of where your expectations from English might steer you wrong when working with other languages.

+

We’ll keep working with strings in the next chapter, where you’ll learn more about the power of regular expressions.

+ +
+

+Prerequisites

+
+
+ +
+ +
+ +

This chapter relies on features only found in stringr 1.5.0 and tidyr 1.3.0 which are still in development. If you want to live life on the edge you can get the dev versions with devtools::install_github(c("tidyverse/stringr", "tidyverse/tidyr")).

+ +

In this chapter, we’ll use functions from the stringr package which is part of the core tidyverse. We’ll also use the babynames data since it provides some fun strings to manipulate.

+
+
library(tidyverse)
+library(babynames)
+
+

You can easily tell when you’re using a stringr function because all stringr functions start with str_. This is particularly useful if you use RStudio, because typing str_ will trigger autocomplete, allowing you to jog your memory of which functions are available.

+
+
+

+
+
+
+
+ +
+

+Creating a string

+

We’ve created strings in passing earlier in the book, but didn’t discuss the details. Firstly, you can create a string using either single quotes (') or double quotes ("). There’s no difference in behavior between the two, so in the interests of consistency the tidyverse style guide (#character-vectors) recommends using ", unless the string contains multiple ".

+
+
string1 <- "This is a string"
+string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
+
+

If you forget to close a quote, you’ll see +, the continuation character:

+
> "This is a string without a closing quote
++ 
++ 
++ HELP I'M STUCK IN A STRING
+

If this happens to you and you can’t figure out which quote you need to close, press Escape to cancel, and try again.

+ +
+

+Escapes

+

To include a literal single or double quote in a string you can use \ to “escape” it:

+
+
double_quote <- "\"" # or '"'
+single_quote <- '\'' # or "'"
+
+

So if you want to include a literal backslash in your string, you’ll need to escape it: "\\":

+
+
backslash <- "\\"
+
+

Beware that the printed representation of a string is not the same as the string itself, because the printed representation shows the escapes (in other words, when you print a string, you can copy and paste the output to recreate that string). To see the raw contents of the string, use #chp-https://stringr.tidyverse.org/reference/str_view (or the base R function #chp-https://rdrr.io/r/base/writeLines):

+
+
x <- c(single_quote, double_quote, backslash)
+x
+#> [1] "'"  "\"" "\\"
+
+str_view(x)
+#> [1] │ '
+#> [2] │ "
+#> [3] │ \
+
+
+ +
+

+Raw strings

+

Creating a string with multiple quotes or backslashes gets confusing quickly. To illustrate the problem, let’s create a string that contains the contents of the code block where we define the double_quote and single_quote variables:

+
+
tricky <- "double_quote <- \"\\\"\" # or '\"'
+single_quote <- '\\'' # or \"'\""
+str_view(tricky)
+#> [1] │ double_quote <- "\"" # or '"'
+#>     │ single_quote <- '\'' # or "'"
+
+

That’s a lot of backslashes! (This is sometimes called #chp-https://en.wikipedia.org/wiki/Leaning_toothpick_syndrome.) To eliminate the escaping you can instead use a raw string (available in R 4.0.0 and above):

+
+
tricky <- r"(double_quote <- "\"" # or '"'
+single_quote <- '\'' # or "'")"
+str_view(tricky)
+#> [1] │ double_quote <- "\"" # or '"'
+#>     │ single_quote <- '\'' # or "'"
+
+

A raw string usually starts with r"( and finishes with )". But if your string contains )" you can instead use r"[]" or r"{}", and if that’s still not enough, you can insert any number of dashes to make the opening and closing pairs unique, e.g. r"--()--", r"---()---", etc. Raw strings are flexible enough to handle any text.
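For example (our quick check), a dashed raw string can contain )" without ending the string:

str_view(r"--(a )" b)--")
#> [1] │ a )" b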

+
+ +
+

+Other special characters

+

As well as \", \', and \\ there are a handful of other special characters that may come in handy. The most common are \n, newline, and \t, tab. You’ll also sometimes see strings containing Unicode escapes that start with \u or \U. This is a way of writing non-English characters that works on all systems. You can see the complete list of other special characters in #chp-https://rdrr.io/r/base/Quotes.

+
+
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
+x
+#> [1] "one\ntwo" "one\ttwo" "µ"        "😄"
+str_view(x)
+#> [1] │ one
+#>     │ two
+#> [2] │ one{\t}two
+#> [3] │ µ
+#> [4] │ 😄
+
+

Note that #chp-https://stringr.tidyverse.org/reference/str_view uses a blue background for tabs to make them easier to spot. One of the challenges of working with text is that there’s a variety of ways that white space can end up in text, so this background helps you recognize that something strange is going on.

+
+ +
+

+Exercises

+
  1. +

    Create strings that contain the following values:

    +
    1. He said "That's amazing!"

    2. +
    3. \a\b\c\d

    4. +
    5. \\\\\\

    6. +
  2. +
  3. +

    Create the string in your R session and print it. What happens to the special “\u00a0”? How does #chp-https://stringr.tidyverse.org/reference/str_view display it? Can you do a little googling to figure out what this special character is?

    +
    +
    x <- "This\u00a0is\u00a0tricky"
    +
    +
  4. +
+
+ +
+

+Creating many strings from data

+

Now that you’ve learned the basics of creating a string or two by “hand”, we’ll go into the details of creating strings from other strings. This will help you solve the common problem where you have some text that you wrote that you want to combine with strings from a data frame. For example, to create a greeting you might combine “Hello” with a name variable. We’ll show you how to do this with #chp-https://stringr.tidyverse.org/reference/str_c and #chp-https://stringr.tidyverse.org/reference/str_glue and how you can use them with #chp-https://dplyr.tidyverse.org/reference/mutate. That naturally raises the question of what string functions you might use with #chp-https://dplyr.tidyverse.org/reference/summarise, so we’ll finish this section with a discussion of #chp-https://stringr.tidyverse.org/reference/str_flatten which is a summary function for strings.

+ +
+

+str_c() +

+

#chp-https://stringr.tidyverse.org/reference/str_c takes any number of vectors as arguments and returns a character vector (it is very similar to the base #chp-https://rdrr.io/r/base/paste; there are two main reasons we recommend it: it propagates NAs, rather than converting them to "NA", and it uses the tidyverse recycling rules):

+
+
str_c("x", "y")
+#> [1] "xy"
+str_c("x", "y", "z")
+#> [1] "xyz"
+str_c("Hello ", c("John", "Susan"))
+#> [1] "Hello John"  "Hello Susan"
+
+

#chp-https://stringr.tidyverse.org/reference/str_c is designed to be used with #chp-https://dplyr.tidyverse.org/reference/mutate so it obeys the usual rules for recycling and missing values:

+
+
set.seed(1410)
+df <- tibble(name = c(wakefield::name(3), NA))
+df |> mutate(greeting = str_c("Hi ", name, "!"))
+#> # A tibble: 4 × 2
+#>   name       greeting      
+#>   <chr>      <chr>         
+#> 1 Ilena      Hi Ilena!     
+#> 2 Sacramento Hi Sacramento!
+#> 3 Graylon    Hi Graylon!   
+#> 4 <NA>       <NA>
+
+

If you want missing values to display in some other way, use #chp-https://dplyr.tidyverse.org/reference/coalesce. Depending on what you want, you might use it either inside or outside of #chp-https://stringr.tidyverse.org/reference/str_c:

+
+
df |> 
+  mutate(
+    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
+    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
+  )
+#> # A tibble: 4 × 3
+#>   name       greeting1      greeting2     
+#>   <chr>      <chr>          <chr>         
+#> 1 Ilena      Hi Ilena!      Hi Ilena!     
+#> 2 Sacramento Hi Sacramento! Hi Sacramento!
+#> 3 Graylon    Hi Graylon!    Hi Graylon!   
+#> 4 <NA>       Hi you!        Hi!
+
+
+ +
+

+str_glue() +

+

If you are mixing many fixed and variable strings with #chp-https://stringr.tidyverse.org/reference/str_c, you’ll notice that you type a lot of "s, making it hard to see the overall goal of the code. An alternative approach is provided by the #chp-https://glue.tidyverse package via #chp-https://stringr.tidyverse.org/reference/str_glue (if you’re not using stringr, you can also access it directly with #chp-https://glue.tidyverse.org/reference/glue). You give it a single string that has a special feature: anything inside curly braces (#chp-https://rdrr.io/r/base/Paren) will be evaluated like it’s outside of the quotes:

+
+
df |> mutate(greeting = str_glue("Hi {name}!"))
+#> # A tibble: 4 × 2
+#>   name       greeting      
+#>   <chr>      <glue>        
+#> 1 Ilena      Hi Ilena!     
+#> 2 Sacramento Hi Sacramento!
+#> 3 Graylon    Hi Graylon!   
+#> 4 <NA>       Hi NA!
+
+

As you can see, #chp-https://stringr.tidyverse.org/reference/str_glue currently converts missing values to the string "NA" unfortunately making it inconsistent with #chp-https://stringr.tidyverse.org/reference/str_c.

+

You also might wonder what happens if you need to include a regular { or } in your string. If you guess that you’ll need to somehow escape it, you’re on the right track. The trick is that glue uses a slightly different escaping technique; instead of prefixing with a special character like \, you double up the special characters:

+
+
df |> mutate(greeting = str_glue("{{Hi {name}!}}"))
+#> # A tibble: 4 × 2
+#>   name       greeting        
+#>   <chr>      <glue>          
+#> 1 Ilena      {Hi Ilena!}     
+#> 2 Sacramento {Hi Sacramento!}
+#> 3 Graylon    {Hi Graylon!}   
+#> 4 <NA>       {Hi NA!}
+
+
+ +
+

+str_flatten() +

+

str_c() and str_glue() work well with mutate() because their output is the same length as their inputs. What if you want a function that works well with summarise(), i.e. something that always returns a single string? That’s the job of str_flatten() (the base R equivalent is paste() used with the collapse argument): it takes a character vector and combines each element of the vector into a single string:

+
+
str_flatten(c("x", "y", "z"))
+#> [1] "xyz"
+str_flatten(c("x", "y", "z"), ", ")
+#> [1] "x, y, z"
+str_flatten(c("x", "y", "z"), ", ", last = ", and ")
+#> [1] "x, y, and z"
+
+
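
For comparison, here’s the base R equivalent mentioned above: paste() with the collapse argument.

+
+
paste(c("x", "y", "z"), collapse = ", ")
+#> [1] "x, y, z"
+
+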

This makes it work well with summarise():

+
+
df <- tribble(
+  ~ name, ~ fruit,
+  "Carmen", "banana",
+  "Carmen", "apple",
+  "Marvin", "nectarine",
+  "Terence", "cantaloupe",
+  "Terence", "papaya",
+  "Terence", "madarine"
+)
+df |>
+  group_by(name) |> 
+  summarise(fruits = str_flatten(fruit, ", "))
+#> # A tibble: 3 × 2
+#>   name    fruits                      
+#>   <chr>   <chr>                       
+#> 1 Carmen  banana, apple               
+#> 2 Marvin  nectarine                   
+#> 3 Terence cantaloupe, papaya, mandarin
+
+
+ +
+

+Exercises

+
  1. +

    Compare and contrast the results of paste0() with str_c() for the following inputs:

    +
    +
    str_c("hi ", NA)
    +str_c(letters[1:2], letters[1:3])
    +
    +
  2. +
  3. +

    Convert the following expressions from str_c() to str_glue() or vice versa:

    +
    1. str_c("The price of ", food, " is ", price)

    2. +
    3. str_glue("I'm {age} years old and live in {country}")

    4. +
    5. str_c("\\section{", title, "}")

    6. +
  4. +
+
+ +
+

+Extracting data from strings

+

It’s very common for multiple variables to be crammed together into a single string. In this section you’ll learn how to use four tidyr functions to extract them:

+
  • df |> separate_longer_delim(col, delim)
  • +
  • df |> separate_longer_position(col, width)
  • +
  • df |> separate_wider_delim(col, delim, names)
  • +
  • df |> separate_wider_position(col, widths)
  • +

If you look closely you can see there’s a common pattern here: separate_, then longer or wider, then _, then delim or position. That’s because these four functions are composed from two simpler primitives:

+
  • +longer makes the input data frame longer, creating new rows; wider makes the input data frame wider, generating new columns.
  • +
  • +delim splits up a string with a delimiter like ", " or " "; position splits at specified widths, like c(3, 5, 2).
  • +

We’ll come back to the last member of this family, separate_wider_regex(), in #chp-regexps. It’s the most flexible of the wider functions, but you need to know something about regular expressions before you can use it.

+
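
To give you a flavor now, here’s a hedged sketch (assuming tidyr 1.3.0 or later, which introduced this function): named patterns become columns, unnamed patterns are matched but dropped.

+
+
df <- tibble(x = c("a10.1.2022", "b10.2.2011"))
+df |> 
+  separate_wider_regex(
+    x,
+    patterns = c(code = "[a-z0-9]+", "\\.", edition = "[0-9]+", "\\.", year = "[0-9]+")
+  )
+#> # A tibble: 2 × 3
+#>   code  edition year 
+#>   <chr> <chr>   <chr>
+#> 1 a10   1       2022 
+#> 2 b10   2       2011
+
+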

The next two sections will give you the basic idea behind these separate functions, first separating into rows (which is a little simpler) and then separating into columns. We’ll finish off by discussing the tools that the wider functions give you to diagnose problems.

+ +
+

+Separating into rows

+

Separating a string into rows tends to be most useful when the number of components varies from row to row. The most common case requires separate_longer_delim() to split based on a delimiter:

+
+
df1 <- tibble(x = c("a,b,c", "d,e", "f"))
+df1 |> 
+  separate_longer_delim(x, delim = ",")
+#> # A tibble: 6 × 1
+#>   x    
+#>   <chr>
+#> 1 a    
+#> 2 b    
+#> 3 c    
+#> 4 d    
+#> 5 e    
+#> 6 f
+
+

It’s rarer to see separate_longer_position() in the wild, but some older datasets do use a very compact format where each character is used to record a value:

+
+
df2 <- tibble(x = c("1211", "131", "21"))
+df2 |> 
+  separate_longer_position(x, width = 1)
+#> # A tibble: 9 × 1
+#>   x    
+#>   <chr>
+#> 1 1    
+#> 2 2    
+#> 3 1    
+#> 4 1    
+#> 5 1    
+#> 6 3    
+#> # … with 3 more rows
+
+
+ +
+

+Separating into columns

+

Separating a string into columns tends to be most useful when there are a fixed number of components in each string, and you want to spread them into columns. They are slightly more complicated than their longer equivalents because you need to name the columns. For example, in the following dataset, x is made up of a code, an edition number, and a year, separated by ".". To use separate_wider_delim() we supply the delimiter and the names in two arguments:

+
+
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
+df3 |> 
+  separate_wider_delim(
+    x,
+    delim = ".",
+    names = c("code", "edition", "year")
+  )
+#> # A tibble: 3 × 3
+#>   code  edition year 
+#>   <chr> <chr>   <chr>
+#> 1 a10   1       2022 
+#> 2 b10   2       2011 
+#> 3 e15   1       2015
+
+

If a specific piece is not useful, you can use an NA name to omit it from the results:

+
+
df3 |> 
+  separate_wider_delim(
+    x,
+    delim = ".",
+    names = c("code", NA, "year")
+  )
+#> # A tibble: 3 × 2
+#>   code  year 
+#>   <chr> <chr>
+#> 1 a10   2022 
+#> 2 b10   2011 
+#> 3 e15   2015
+
+

separate_wider_position() works a little differently, because you typically want to specify the width of each column. So you give it a named integer vector, where the name gives the name of the new column and the value is the number of characters it occupies. You can omit values from the output by not naming them, as shown in the sketch after this example:

+
+
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
+df4 |> 
+  separate_wider_position(
+    x,
+    widths = c(year = 4, age = 2, state = 2)
+  )
+#> # A tibble: 3 × 3
+#>   year  age   state
+#>   <chr> <chr> <chr>
+#> 1 2022  15    TX   
+#> 2 2021  22    LA   
+#> 3 2023  25    CA
+
+
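
And here’s a sketch of dropping the middle piece (the age) by leaving it unnamed:

+
+
df4 |> 
+  separate_wider_position(
+    x,
+    widths = c(year = 4, 2, state = 2)
+  )
+#> # A tibble: 3 × 2
+#>   year  state
+#>   <chr> <chr>
+#> 1 2022  TX   
+#> 2 2021  LA   
+#> 3 2023  CA
+
+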
+ +
+

+Diagnosing widening problems

+

separate_wider_delim() requires a fixed and known set of columns (the same principles apply to separate_wider_position() and separate_wider_regex()). What happens if some of the rows don’t have the expected number of pieces? There are two possible problems, too few or too many pieces, so separate_wider_delim() provides two arguments to help: too_few and too_many. Let’s first look at the too_few case with the following sample dataset:

+
+
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
+
+df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z")
+  )
+#> Error in `separate_wider_delim()`:
+#> ! Expected 3 pieces in each element of `x`.
+#> ! 2 values were too short.
+#> ℹ Use `too_few = "debug"` to diagnose the problem.
+#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
+
+

You’ll notice that we get an error, but the error gives us some suggestions as to how you might proceed. Let’s start by debugging the problem:

+
+
debug <- df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z"),
+    too_few = "debug"
+  )
+#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
+#> `x_remainder`.
+debug
+#> # A tibble: 5 × 6
+#>   x     y     z     x_ok  x_pieces x_remainder
+#>   <chr> <chr> <chr> <lgl>    <int> <chr>      
+#> 1 1-1-1 1     1     TRUE         3 ""         
+#> 2 1-1-2 1     2     TRUE         3 ""         
+#> 3 1-3   3     <NA>  FALSE        2 ""         
+#> 4 1-3-2 3     2     TRUE         3 ""         
+#> 5 1     <NA>  <NA>  FALSE        1 ""
+
+

When you use the debug mode you get three extra columns added to the output: x_ok, x_pieces, and x_remainder (if you separate a variable with a different name, you’ll get a different prefix). Here, x_ok lets you quickly find the inputs that failed:

+
+
debug |> filter(!x_ok)
+#> # A tibble: 2 × 6
+#>   x     y     z     x_ok  x_pieces x_remainder
+#>   <chr> <chr> <chr> <lgl>    <int> <chr>      
+#> 1 1-3   3     <NA>  FALSE        2 ""         
+#> 2 1     <NA>  <NA>  FALSE        1 ""
+
+

x_pieces tells us how many pieces were found, compared to the expected 3 (the length of names). x_remainder isn’t useful when there are too few pieces, but we’ll see it again shortly.

+

Sometimes looking at this debugging information will reveal a problem with your delimiter strategy or suggest that you need to do more preprocessing before separating. In that case, fix the problem upstream and make sure to remove too_few = "debug" to ensure that new problems become errors.

+

In other cases you may just want to fill in the missing pieces with NAs and move on. That’s the job of too_few = "align_start" and too_few = "align_end", which allow you to control where the NAs should go:

+
+
df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z"),
+    too_few = "align_start"
+  )
+#> # A tibble: 5 × 3
+#>   x     y     z    
+#>   <chr> <chr> <chr>
+#> 1 1     1     1    
+#> 2 1     1     2    
+#> 3 1     3     <NA> 
+#> 4 1     3     2    
+#> 5 1     <NA>  <NA>
+
+
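
For comparison, too_few = "align_end" pushes the NAs to the start of each row instead:

+
+
df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z"),
+    too_few = "align_end"
+  )
+#> # A tibble: 5 × 3
+#>   x     y     z    
+#>   <chr> <chr> <chr>
+#> 1 1     1     1    
+#> 2 1     1     2    
+#> 3 <NA>  1     3    
+#> 4 1     3     2    
+#> 5 <NA>  <NA>  1
+
+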

The same principles apply if you have too many pieces:

+
+
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
+
+df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z")
+  )
+#> Error in `separate_wider_delim()`:
+#> ! Expected 3 pieces in each element of `x`.
+#> ! 2 values were too long.
+#> ℹ Use `too_many = "debug"` to diagnose the problem.
+#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
+
+

But now when we debug the result, you can see the purpose of x_remainder:

+
+
debug <- df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z"),
+    too_many = "debug"
+  )
+#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
+#> `x_remainder`.
+debug |> filter(!x_ok)
+#> # A tibble: 2 × 6
+#>   x         y     z     x_ok  x_pieces x_remainder
+#>   <chr>     <chr> <chr> <lgl>    <int> <chr>      
+#> 1 1-3-5-6   3     5     FALSE        4 -6         
+#> 2 1-3-5-7-9 3     5     FALSE        5 -7-9
+
+

You have a slightly different set of options for handling too many pieces: you can either silently “drop” any additional pieces or “merge” them all into the final column:

+
+
df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z"),
+    too_many = "drop"
+  )
+#> # A tibble: 5 × 3
+#>   x     y     z    
+#>   <chr> <chr> <chr>
+#> 1 1     1     1    
+#> 2 1     1     2    
+#> 3 1     3     5    
+#> 4 1     3     2    
+#> 5 1     3     5
+
+
+df |> 
+  separate_wider_delim(
+    x,
+    delim = "-",
+    names = c("x", "y", "z"),
+    too_many = "merge"
+  )
+#> # A tibble: 5 × 3
+#>   x     y     z    
+#>   <chr> <chr> <chr>
+#> 1 1     1     1    
+#> 2 1     1     2    
+#> 3 1     3     5-6  
+#> 4 1     3     2    
+#> 5 1     3     5-7-9
+
+
+
+ +
+

+Letters

+

In this section, we’ll introduce you to functions that allow you to work with the individual letters within a string. You’ll learn how to find the length of a string, extract substrings, and handle long strings in plots and tables.

+ +
+

+Length

+

str_length() tells you the number of letters in the string:

+
+
str_length(c("a", "R for data science", NA))
+#> [1]  1 18 NA
+
+

You could use this with count() to find the distribution of lengths of US babynames, and then with filter() to look at the longest names (looking at these entries, we’d guess that the babynames data drops spaces or hyphens and truncates after 15 letters):

+
+
babynames |>
+  count(length = str_length(name), wt = n)
+#> # A tibble: 14 × 2
+#>   length        n
+#>    <int>    <int>
+#> 1      2   338150
+#> 2      3  8589596
+#> 3      4 48506739
+#> 4      5 87011607
+#> 5      6 90749404
+#> 6      7 72120767
+#> # … with 8 more rows
+
+babynames |> 
+  filter(str_length(name) == 15) |> 
+  count(name, wt = n, sort = TRUE)
+#> # A tibble: 34 × 2
+#>   name                n
+#>   <chr>           <int>
+#> 1 Franciscojavier   123
+#> 2 Christopherjohn   118
+#> 3 Johnchristopher   118
+#> 4 Christopherjame   108
+#> 5 Christophermich    52
+#> 6 Ryanchristopher    45
+#> # … with 28 more rows
+
+
+ +
+

+Subsetting

+

You can extract parts of a string using str_sub(string, start, end), where start and end are the positions where the substring should start and end. The start and end arguments are inclusive, so the length of the returned string will be end - start + 1:

+
+
x <- c("Apple", "Banana", "Pear")
+str_sub(x, 1, 3)
+#> [1] "App" "Ban" "Pea"
+
+

You can use negative values to count back from the end of the string: -1 is the last character, -2 is the second to last character, etc.

+
+
str_sub(x, -3, -1)
+#> [1] "ple" "ana" "ear"
+
+

Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:

+
+
str_sub("a", 1, 5)
+#> [1] "a"
+
+

We could use str_sub() with mutate() to find the first and last letter of each name:

+
+
babynames |> 
+  mutate(
+    first = str_sub(name, 1, 1),
+    last = str_sub(name, -1, -1)
+  )
+#> # A tibble: 1,924,665 × 7
+#>    year sex   name          n   prop first last 
+#>   <dbl> <chr> <chr>     <int>  <dbl> <chr> <chr>
+#> 1  1880 F     Mary       7065 0.0724 M     y    
+#> 2  1880 F     Anna       2604 0.0267 A     a    
+#> 3  1880 F     Emma       2003 0.0205 E     a    
+#> 4  1880 F     Elizabeth  1939 0.0199 E     h    
+#> 5  1880 F     Minnie     1746 0.0179 M     e    
+#> 6  1880 F     Margaret   1578 0.0162 M     t    
+#> # … with 1,924,659 more rows
+
+
+ +
+

+Long strings

+

Sometimes the reason you care about the length of a string is because you’re trying to fit it into a label on a plot or in a table. stringr provides two useful tools for cases where your string is too long:

+
  • str_trunc(x, 30) ensures that no string is longer than 30 characters, replacing any letters after 30 with ….

  • +
  • str_wrap(x, 30) wraps a string introducing new lines so that each line is at most 30 characters (it doesn’t hyphenate, however, so any word longer than 30 characters will make a longer line)

  • +

The following code shows these functions in action with a made up string:

+
+
x <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat."
+
+str_view(str_trunc(x, 30))
+#> [1] │ Lorem ipsum dolor sit amet,...
+str_view(str_wrap(x, 30))
+#> [1] │ Lorem ipsum dolor sit amet,
+#>     │ consectetur adipiscing
+#>     │ elit, sed do eiusmod tempor
+#>     │ incididunt ut labore et dolore
+#>     │ magna aliqua. Ut enim ad
+#>     │ minim veniam, quis nostrud
+#>     │ exercitation ullamco laboris
+#>     │ nisi ut aliquip ex ea commodo
+#>     │ consequat.
+
+
+ +
+

+Exercises

+
  1. Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
  2. +
  3. Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
  4. +
+
+ +
+

+Non-English text

+

So far, we’ve focused on English language text, which is particularly easy to work with for two reasons. Firstly, the English alphabet is fairly simple: there are just 26 letters. Secondly (and maybe more importantly), the computing infrastructure we use today was predominantly designed by English speakers. Unfortunately, we don’t have room for a full treatment of non-English languages, but we wanted to draw your attention to some of the biggest challenges you might encounter: encoding, letter variations, and locale-dependent functions.

+ +
+

+Encoding

+

When working with non-English text the first challenge is often the encoding. To understand what’s going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying representation of a string using charToRaw():

+
+
charToRaw("Hadley")
+#> [1] 48 61 64 6c 65 79
+
+

Each of these six hexadecimal numbers represents one letter: 48 is H, 61 is a, and so on. The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII. ASCII does a great job of representing English characters, because it’s the American Standard Code for Information Interchange.

+
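
You can also go the other way; a quick sketch decoding those bytes back into a string with base R’s rawToChar():

+
+
rawToChar(as.raw(c(0x48, 0x61, 0x64, 0x6c, 0x65, 0x79)))
+#> [1] "Hadley"
+
+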

Things aren’t so easy for languages other than English. In the early days of computing there were many competing standards for encoding non-English characters. For example, there were two different encodings for Europe: Latin1 (aka ISO-8859-1) was used for Western European languages and Latin2 (aka ISO-8859-2) was used for Central European languages. In Latin1, the byte b1 is “±”, but in Latin2, it’s “ą”! Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols like emojis.

+
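
You can see this for yourself with base R’s iconv(); a small sketch (encoding names may vary slightly across platforms):

+
+
x <- "\xb1"
+iconv(x, from = "latin1", to = "UTF-8")
+#> [1] "±"
+iconv(x, from = "latin2", to = "UTF-8")
+#> [1] "ą"
+
+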

readr uses UTF-8 everywhere. This is a good default but will fail for data produced by older systems that don’t use UTF-8. If this happens to you, your strings will look weird when you print them. Sometimes just one or two characters might be messed up; other times you’ll get complete gibberish. For example, here are two inline CSVs with unusual encodings (here we’re using the special \x to encode binary data directly into a string):

+
+
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
+read_csv(x1)
+#> # A tibble: 1 × 1
+#>   text                                       
+#>   <chr>                                      
+#> 1 "El Ni\xf1o was particularly bad this year"
+
+x2 <- "text\n\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
+read_csv(x2)
+#> # A tibble: 1 × 1
+#>   text                                      
+#>   <chr>                                     
+#> 1 "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
+
+

To read these correctly you specify the encoding via the locale argument:

+
+
read_csv(x1, locale = locale(encoding = "Latin1"))
+#> # A tibble: 1 × 1
+#>   text                                  
+#>   <chr>                                 
+#> 1 El Niño was particularly bad this year
+
+read_csv(x2, locale = locale(encoding = "Shift-JIS"))
+#> # A tibble: 1 × 1
+#>   text      
+#>   <chr>     
+#> 1 こんにちは
+
+

How do you find the correct encoding? If you’re lucky, it’ll be included somewhere in the data documentation. Unfortunately, that’s rarely the case, so readr provides guess_encoding() to help you figure it out. It’s not foolproof, and it works better when you have lots of text (unlike here), but it’s a reasonable place to start. Expect to try a few different encodings before you find the right one.

+
+
guess_encoding(x1)
+#> # A tibble: 1 × 2
+#>   encoding   confidence
+#>   <chr>           <dbl>
+#> 1 ISO-8859-1       0.41
+guess_encoding(x2)
+#> # A tibble: 1 × 2
+#>   encoding confidence
+#>   <chr>         <dbl>
+#> 1 KOI8-R         0.27
+
+

Encodings are a rich and complex topic, and we’ve only scratched the surface here. If you’d like to learn more we recommend reading the detailed explanation at http://kunststube.net/encoding/.

+
+ +
+

+Letter variations

+

If you’re working with individual letters (e.g. with str_length() and str_sub()) there’s an important challenge if you’re working with a language that has accents, because letters might be represented as an individual character or by combining an unaccented letter (e.g. u) with a diacritic mark (e.g. ¨). For example, this code shows two ways of representing ü that look identical:

+
+
u <- c("\u00fc", "u\u0308")
+str_view(u)
+#> [1] │ ü
+#> [2] │ ü
+
+

But they have different lengths and the first characters are different:

+
+
str_length(u)
+#> [1] 1 2
+str_sub(u, 1, 1)
+#> [1] "ü" "u"
+
+

Finally, note that these strings compare as different with ==, so stringr provides the handy str_equal() function:

+
+
u[[1]] == u[[2]]
+#> [1] FALSE
+
+str_equal(u[[1]], u[[2]])
+#> [1] TRUE
+
+
+ +
+

+Locale-dependent functions

+

Finally, there are a handful of stringr functions whose behavior depends on your locale. A locale is similar to a language, but includes an optional region specifier to handle regional variations within a language. A locale is specified by a lower-case language abbreviation, optionally followed by a _ and an upper-case region identifier. For example, “en” is English, “en_GB” is British English, and “en_US” is American English. If you don’t already know the code for your language, Wikipedia (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) has a good list, and you can see which are supported in stringr by looking at stringi::stri_locale_list().

+

Base R string functions automatically use the locale set by your operating system. This means that base R string functions do what you expect for your language, but your code might work differently if you share it with someone who lives in a different country. To avoid this problem, stringr defaults to English rules by using the “en” locale, and requires you to specify the locale argument to override it. Fortunately, there are two sets of functions where the locale really matters: changing case and sorting.

+

The rules for changing case are not the same in every language. For example, Turkish has two i’s: with and without a dot, and it capitalizes them in a different way to English:

+
+
str_to_upper(c("i", "ı"))
+#> [1] "I" "I"
+str_to_upper(c("i", "ı"), locale = "tr")
+#> [1] "İ" "I"
+
+

Sorting strings depends on the order of the alphabet, and the order of the alphabet is not the same in every language (sorting in languages that don’t have an alphabet, like Chinese, is more complicated still)! Here’s an example: in Czech, “ch” is a compound letter that appears after h in the alphabet.

+
+
str_sort(c("a", "c", "ch", "h", "z"))
+#> [1] "a"  "c"  "ch" "h"  "z"
+str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
+#> [1] "a"  "c"  "h"  "ch" "z"
+
+

This also comes up when sorting strings with arrange(), which is why it also has a locale argument.

+
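
For example, a sketch assuming dplyr 1.1.0 or later (where the argument is spelled .locale and is powered by stringi):

+
+
tibble(x = c("a", "c", "ch", "h", "z")) |> 
+  arrange(x, .locale = "cs")
+#> # A tibble: 5 × 1
+#>   x    
+#>   <chr>
+#> 1 a    
+#> 2 c    
+#> 3 h    
+#> 4 ch   
+#> 5 z
+
+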
+
+ +
+

+Summary

+

In this chapter you’ve learned about some of the power of the stringr package: you learned how to create, combine, and extract strings, and about some of the challenges you might face with non-English strings. Now it’s time to learn one of the most important and powerful tools for working with strings: regular expressions. Regular expressions are a very concise but very expressive language for describing patterns within strings, and are the topic of the next chapter.

+ + +
+
diff --git a/oreilly/transform.html b/oreilly/transform.html new file mode 100644 index 0000000..9aedea6 --- /dev/null +++ b/oreilly/transform.html @@ -0,0 +1,17 @@ +
+

Transform

After reading the first part of the book, you understand (at least superficially) the most important tools for doing data science. Now it’s time to start diving into the details. In this part of the book, you’ll learn about the most important types of variables that you’ll encounter inside a data frame and learn the tools you can use to work with them.

+
+ +

Our data science model, with transform highlighted in blue.

+
+Figure 1: The options for data transformation depend heavily on the type of data involved, the subject of this part of the book.
+
+
+

You can read these chapters as you need them; they’re designed to be largely standalone so that they can be read out of order.

  • #chp-logicals teaches you about logical vectors. These are the simplest type of vector, but are extremely powerful. You’ll learn how to create them with numeric comparisons, how to combine them with Boolean algebra, how to use them in summaries, and how to use them for conditional transformations.

  • +
  • #chp-numbers dives into tools for vectors of numbers, the powerhouse of data science. You’ll learn more about counting and a bunch of important transformation and summary functions.

  • +
  • #chp-strings will give you the tools to work with strings: you’ll slice them, you’ll dice them, and you’ll stick them back together again. This chapter mostly focuses on the stringr package, but you’ll also learn some more tidyr functions devoted to extracting data from strings.

  • +
  • #chp-regexps introduces you to regular expressions, a powerful tool for manipulating strings. This chapter will take you from thinking that a cat walked over your keyboard to reading and writing complex string patterns.

  • +
  • #chp-factors introduces factors: the data type that R uses to store categorical data. You use a factor when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.

  • +
  • #chp-datetimes will give you the key tools for working with dates and date-times. Unfortunately, the more you learn about date-times, the more complicated they seem to get, but with the help of the lubridate package, you’ll learn how to overcome the most common challenges.

  • +
  • #chp-missing-values discusses missing values in depth. We’ve discussed them a couple of times in isolation, but now it’s time to discuss them holistically, helping you come to grips with the difference between implicit and explicit missing values, and how and why you might convert between them.

  • +
  • #chp-joins finishes up this part of the book by giving you tools to join two (or more) data frames together. Learning about joins will force you to grapple with the idea of keys, and think about how you identify each row in a dataset.

  • +
diff --git a/oreilly/webscraping.html b/oreilly/webscraping.html new file mode 100644 index 0000000..703ce73 --- /dev/null +++ b/oreilly/webscraping.html @@ -0,0 +1,10 @@ +
+

Web scraping

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it. You can find the complete first edition at https://r4ds.had.co.nz.

+
diff --git a/oreilly/whole-game.html b/oreilly/whole-game.html new file mode 100644 index 0000000..6bfd906 --- /dev/null +++ b/oreilly/whole-game.html @@ -0,0 +1,14 @@ +
+

Whole game

Our goal in this part of the book is to give you a rapid overview of the main tools of data science: importing, tidying, transforming, and visualizing data, as shown in #fig-ds-whole-game. We want to show you the “whole game” of data science, giving you just enough of all the major pieces so that you can tackle real, if simple, datasets. The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.

+
+ +

A diagram displaying the data science cycle: Import -> Tidy -> Understand (which has the phases Transform -> Visualize -> Model in a cycle) -> Communicate. Surrounding all of these is Program. Import, Tidy, Transform, and Visualize are highlighted.

+
Figure 1: In this section of the book, you’ll learn how to import, tidy, transform, and visualize data.
+
+
+

Five chapters focus on the tools of data science:

  • Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data. In #chp-data-visualize you’ll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.

  • +
  • Visualisation alone is typically not enough, so in #chp-data-transform, you’ll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.

  • +
  • In #chp-data-tidy, you’ll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier. You’ll learn the underlying principles, and how to get your data into a tidy form.

  • +
  • Before you can transform and visualize your data, you need to first get your data into R. In #chp-data-import you’ll learn the basics of getting .csv files into R.

  • +
  • Finally, in #chp-EDA, you’ll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.

  • +

Nestled among these chapters are five other chapters that focus on your R workflow. In #chp-workflow-basics, #chp-workflow-pipes, #chp-workflow-style, and #chp-workflow-scripts, you’ll learn good workflow practices for writing and organizing your R code. These will set you up for success in the long run, as they’ll give you the tools to stay organized when you tackle real projects.

diff --git a/oreilly/workflow-basics.html b/oreilly/workflow-basics.html new file mode 100644 index 0000000..47ca32c --- /dev/null +++ b/oreilly/workflow-basics.html @@ -0,0 +1,161 @@ +
+

Workflow: basics

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+

You now have some experience running R code. We didn’t give you many details, but you’ve obviously figured out the basics, or you would’ve thrown this book away in frustration! Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.

Before we go any further, let’s make sure you’ve got a solid foundation in running R code, and that you know about some of the most helpful RStudio features.

+
+

+Coding basics

+

Let’s review some basics we’ve so far omitted in the interests of getting you plotting as quickly as possible. You can use R as a calculator:

+
+
1 / 200 * 30
+#> [1] 0.15
+(59 + 73 + 2) / 3
+#> [1] 44.66667
+sin(pi / 2)
+#> [1] 1
+
+

You can create new objects with the assignment operator <-:

+
+
x <- 3 * 4
+
+

You can combine multiple elements into a vector with c():

+
+
primes <- c(2, 3, 5, 7, 11, 13)
+
+

And basic arithmetic is applied to every element of the vector:

+
+
primes * 2
+#> [1]  4  6 10 14 22 26
+primes - 1
+#> [1]  1  2  4  6 10 12
+
+

All R statements where you create objects, assignment statements, have the same form:

+
+
object_name <- value
+
+

When reading that code, say “object name gets value” in your head.

+

You will make lots of assignments and <- is a pain to type. You can save time with RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automatically surrounds <- with spaces, which is a good code formatting practice. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.

+
+ +
+

+Comments

+

R will ignore any text after #. This allows you to write comments, text that is ignored by R but read by other humans. We’ll sometimes include comments in examples explaining what’s happening with the code.

+

Comments can be helpful for briefly describing what the subsequent code does.

+
+
# define primes
+primes <- c(2, 3, 5, 7, 11, 13)
+
+# multiply primes by 2
+primes * 2
+#> [1]  4  6 10 14 22 26
+
+

With short pieces of code like this, it might not be necessary to leave a comment for every single line of code. But as the code you’re writing gets more complex, comments can save you (and your collaborators) a lot of time in figuring out what was done in the code.

+

Use comments to explain the why of your code, not the how or the what. The what and how of your code are always possible to figure out, even if it might be tedious, by carefully reading the code. But if you describe the “what” in both your comments and your code, you’ll have to remember to update the comment and code in tandem. If you change the code and forget to update the comment, they’ll be inconsistent, which will lead to confusion when you come back to your code in the future.

+

Figuring out why something was done is much more difficult, if not impossible. For example, geom_smooth() has an argument called span, which controls the smoothness of the curve, with larger values yielding a smoother curve. Suppose you decide to change the value of span from its default of 0.75 to 0.3: it’s easy for a future reader to understand what is happening, but unless you note your thinking in a comment, no one will understand why you changed the default.

+
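
For instance, a hypothetical why-comment might look like the following sketch (the span value and the rationale are illustrative, not a recommendation):

+
+
library(tidyverse)
+
+# Use span = 0.3 rather than the default 0.75: the default oversmooths
+# this data and we care about the local variation.
+ggplot(mpg, aes(x = displ, y = hwy)) +
+  geom_point() +
+  geom_smooth(span = 0.3)
+
+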

For data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them. There’s no way to re-capture this knowledge from the code itself.

+
+ +
+

+What’s in a name?

+

Object names must start with a letter, and can only contain letters, numbers, _ and .. You want your object names to be descriptive, so you’ll need to adopt a convention for multiple words. We recommend snake_case where you separate lowercase words with _.

+
+
i_use_snake_case
+otherPeopleUseCamelCase
+some.people.use.periods
+And_aFew.People_RENOUNCEconvention
+
+

We’ll come back to names again when we talk more about code style in #chp-workflow-style.

+

You can inspect an object by typing its name:

+
+
x
+#> [1] 12
+
+

Make another assignment:

+
+
this_is_a_really_long_name <- 2.5
+
+

To inspect this object, try out RStudio’s completion facility: type “this”, press TAB, add characters until you have a unique prefix, then press return.

+

Ooops, you made a mistake! The value of this_is_a_really_long_name should be 3.5, not 2.5. Use another keyboard shortcut to help you fix it. Type “this” then press Cmd/Ctrl + ↑. Doing so will list all the commands you’ve typed that start with those letters. Use the arrow keys to navigate, then press enter to retype the command. Change 2.5 to 3.5 and rerun.

+

Make yet another assignment:

+
+
r_rocks <- 2 ^ 3
+
+

Let’s try to inspect it:

+
+
r_rock
+#> Error: object 'r_rock' not found
+R_rocks
+#> Error: object 'R_rocks' not found
+
+

This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions. Typos matter; R can’t read your mind and say “oh, they probably meant r_rocks when they typed r_rock”. Case matters; similarly R can’t read your mind and say “oh, they probably meant r_rocks when they typed R_rocks”.

+
+ +
+

+Calling functions

+

R has a large collection of built-in functions that are called like this:

+
+
function_name(arg1 = val1, arg2 = val2, ...)
+
+

Let’s try using seq(), which makes regular sequences of numbers and, while we’re at it, learn more helpful features of RStudio. Type se and hit TAB. A popup shows you possible completions. Specify seq() by typing more (a q) to disambiguate, or by using ↑/↓ arrows to select. Notice the floating tooltip that pops up, reminding you of the function’s arguments and purpose. If you want more help, press F1 to get all the details in the help tab in the lower right pane.

+

When you’ve selected the function you want, press TAB again. RStudio will add matching opening (() and closing ()) parentheses for you. Type the arguments 1, 10 and hit return.

+
+
seq(1, 10)
+#>  [1]  1  2  3  4  5  6  7  8  9 10
+
+

Type this code and notice that RStudio provides similar assistance with the paired quotation marks:

+
+
x <- "hello world"
+
+

Quotation marks and parentheses must always come in a pair. RStudio does its best to help you, but it’s still possible to mess up and end up with a mismatch. If this happens, R will show you the continuation character “+”:

+
> x <- "hello
++
+

The + tells you that R is waiting for more input; it doesn’t think you’re done yet. Usually, this means you’ve forgotten either a " or a ). Either add the missing pair, or press ESCAPE to abort the expression and try again.

+

Note that the environment tab in the upper right pane displays all of the objects that you’ve created:

+
+
+

Environment tab of RStudio which shows r_rocks, this_is_a_really_long_name, x, and y in the Global Environment.

+
+
+
+ +
+

+Exercises

+
  1. +

    Why does this code not work?

    +
    +
    my_variable <- 10
    +my_varıable
    +#> Error in eval(expr, envir, enclos): object 'my_varıable' not found
    +
    +

    Look carefully! (This may seem like an exercise in pointlessness, but training your brain to notice even the tiniest difference will pay off when programming.)

    +
  2. +
  3. +

    Tweak each of the following R commands so that they run correctly:

    +
    +
    libary(tidyverse)
    +
    +ggplot(dota = mpg) + 
    +  geom_point(maping = aes(x = displ, y = hwy))
    +
    +
  4. +
  5. Press Alt + Shift + K. What happens? How can you get to the same place using the menus?

  6. +
+ +
+

+Summary

+

You’ve now learned a little more about how R code works, and picked up some tips to help you understand your code when you come back to it in the future. In the next chapter, we’ll continue your data science journey by teaching you about dplyr, the tidyverse package that helps you transform data, whether it’s selecting important variables, filtering down to rows of interest, or computing summary statistics.

+ + +
+
diff --git a/oreilly/workflow-help.html b/oreilly/workflow-help.html new file mode 100644 index 0000000..1968ddf --- /dev/null +++ b/oreilly/workflow-help.html @@ -0,0 +1,81 @@ +
+

Workflow: Getting help

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+

This book is not an island; there is no single resource that will allow you to master R. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section describes a few tips on how to get help and how to keep learning.

+
+

+Google is your friend

+

If you get stuck, start with Google. Typically adding “R” to a query is enough to restrict it to relevant results: if the search isn’t useful, it often means that there aren’t any R-specific results available. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web. (If the error message isn’t in English, run Sys.setenv(LANGUAGE = "en") and re-run the code; you’re more likely to find help for English error messages.)

+

If Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [R] to restrict your search to questions and answers that use R.

+
+ +
+

+Making a reprex

+

If your googling doesn’t find anything useful, it’s a really good idea to prepare a reprex, short for minimal reproducible example. A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:

+
  • First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any library() calls and create all necessary objects. The easiest way to make sure you’ve done this is to use the reprex package.

  • +
  • Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler R object than the one you’re facing in real life or even using built-in data.

  • +

That sounds like a lot of work! And it can be, but it has a great payoff:

+
  • 80% of the time creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.

  • +
  • The other 20% of the time you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help!

  • +

When creating a reprex by hand, it’s easy to accidentally miss something that means your code can’t be run on someone else’s computer. Avoid this problem by using the reprex package which is installed as part of the tidyverse. Let’s say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):

+
+
y <- 1:4
+mean(y)
+
+

Then call reprex(), where the default target venue is GitHub:

+
reprex::reprex()
+

A nicely rendered HTML preview will display in RStudio’s Viewer (if you’re in RStudio) or your default browser otherwise. The relevant bit of GitHub-flavored Markdown is ready to be pasted from your clipboard (on RStudio Server or Cloud, you will need to copy this yourself):

+
``` r
+y <- 1:4
+mean(y)
+#> [1] 2.5
+```
+

Here’s what that Markdown would look like rendered in a GitHub issue:

+
+
y <- 1:4
+mean(y)
+#> [1] 2.5
+
+

Anyone else can copy, paste, and run this immediately.

+

There are three things you need to include to make your example reproducible: required packages, data, and code.

+
  1. Packages should be loaded at the top of the script, so it’s easy to see which ones the example needs. This is a good time to check that you’re using the latest version of each package; it’s possible you’ve discovered a bug that’s been fixed since you installed or last updated the package. For packages in the tidyverse, the easiest way to check is to run tidyverse_update().

  2. +
  3. +

    The easiest way to include data is to use dput() to generate the R code needed to recreate it (see the sketch after this list). For example, to recreate the mtcars dataset in R, perform the following steps:

    +
    1. Run dput(mtcars) in R
    2. +
    3. Copy the output
    4. +
    5. In reprex, type mtcars <- then paste.
    6. +

    Try and find the smallest subset of your data that still reveals the problem.

    +
  4. +
  5. +

    Spend a little bit of time ensuring that your code is easy for others to read:

    +
    • Make sure you’ve used spaces and your variable names are concise, yet informative.

    • +
    • Use comments to indicate where your problem lies.

    • +
    • Do your best to remove everything that is not related to the problem.

    • +

    The shorter your code is, the easier it is to understand, and the easier it is to fix.

    +
  6. +
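
Here’s the kind of output dput() produces on a small subset of mtcars (a sketch; the exact line wrapping may differ on your machine):

+
+
dput(head(mtcars[, c("mpg", "cyl")], 3))
+#> structure(list(mpg = c(21, 21, 22.8), cyl = c(6, 6, 4)), row.names = c("Mazda RX4",
+#> "Mazda RX4 Wag", "Datsun 710"), class = "data.frame")
+
+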

Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in.

+
+ +
+

+Investing in yourself

+

You should also spend some time preparing yourself to solve problems before they occur. Investing a little time in learning R each day will pay off handsomely in the long run. One way is to follow what the tidyverse team is doing on the tidyverse blog (https://www.tidyverse.org/blog/). To keep up with the R community more broadly, we recommend reading R Weekly (https://rweekly.org): it’s a community effort to aggregate the most interesting news in the R community each week.

+

If you’re an active Twitter user, you might also want to follow Hadley (@hadleywickham), Mine (@minebocek), or Garrett (@statgarrett), or follow @rstudiotips to keep up with new features in the IDE. If you want the full fire hose of new developments, you can also read the #rstats hashtag. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.

+
+ +
+

+Summary

+

This chapter concludes the Whole Game part of the book. You’ve now seen the most important parts of the data science process: visualization, transformation, tidying, and importing. Now you’ve got a holistic view of the whole process, and we start to get into the details of the small pieces.

+

The next part of the book, Transform, goes into depth on the different types of variables that you might encounter: logical vectors, numbers, strings, factors, and date-times, and covers important related topics like tibbles, regular expressions, missing values, and joins. There’s no need to read these chapters in order; dip in and out as needed for the specific data that you’re working with.

+ + +
+
diff --git a/oreilly/workflow-pipes.html b/oreilly/workflow-pipes.html new file mode 100644 index 0000000..ca9e2c9 --- /dev/null +++ b/oreilly/workflow-pipes.html @@ -0,0 +1,106 @@ +
+

Workflow: Pipes

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter is largely complete and just needs final proof reading. You can find the complete first edition at https://r4ds.had.co.nz.

+

The pipe, |>, is a powerful tool for clearly expressing a sequence of operations that transform an object. We briefly introduced pipes in the previous chapter, but before going too much farther, we want to give a few more details and discuss %>%, a predecessor to |>.

To add the pipe to your code, we recommend using the built-in keyboard shortcut Ctrl/Cmd + Shift + M. You’ll need to make one change to your RStudio options to use |> instead of %>% as shown in #fig-pipe-options; more on %>% shortly.

+
+ +

Screenshot showing the "Use native pipe operator" option which can be found on the "Editing" panel of the "Code" options.

+
+Figure 5.1: To insert |>, make sure the “Use native pipe operator” option is checked.
+
+
+
+
+

+Why use a pipe?

+

Each individual dplyr verb is quite simple, so solving complex problems typically requires combining multiple verbs. For example, the last chapter finished with a moderately complex pipe:

+
+
flights |>  
+  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
+  group_by(tailnum) |> 
+  summarise(
+    delay = mean(arr_delay, na.rm = TRUE),
+    n = n()
+  )
+
+

Even though this pipe has four steps, it’s easy to skim because the verbs come at the start of each line: start with the flights data, then filter, then group, then summarize.

+

What would happen if we didn’t have the pipe? We could nest each function call inside the previous call:

+
+
summarise(
+  group_by(
+    filter(
+      flights, 
+      !is.na(arr_delay), !is.na(tailnum)
+    ),
+    tailnum
+  ), 
+  delay = mean(arr_delay, na.rm = TRUE
+  ), 
+  n = n()
+)
+
+

Or we could use a bunch of intermediate variables:

+
+
flights1 <- filter(flights, !is.na(arr_delay), !is.na(tailnum))
+flights2 <- group_by(flights1, tailnum) 
+flights3 <- summarise(flights2,
+  delay = mean(arr_delay, na.rm = TRUE),
+  n = n()
+)
+
+

While both of these forms have their time and place, the pipe generally produces data analysis code that’s both easier to write and easier to read.

+
+ +
+

+magrittr and the %>% pipe

+

If you’ve been using the tidyverse for a while, you might be familiar with the %>% pipe provided by the magrittr package. The magrittr package is included in the core tidyverse, so you can use %>% whenever you load the tidyverse:

+
+
library(tidyverse)
+
+mtcars %>% 
+  group_by(cyl) %>%
+  summarise(n = n())
+#> # A tibble: 3 × 2
+#>     cyl     n
+#>   <dbl> <int>
+#> 1     4    11
+#> 2     6     7
+#> 3     8    14
+
+

For simple cases |> and %>% behave identically. So why do we recommend the base pipe? Firstly, because it’s part of base R, it’s always available for you to use, even when you’re not using the tidyverse. Secondly, |> is quite a bit simpler than %>%: in the time between the invention of %>% in 2014 and the inclusion of |> in R 4.1.0 in 2021, we gained a better understanding of the pipe. This allowed the base implementation to jettison infrequently used and less important features.

+
+ +
+

+|> vs %>% +

+

While |> and %>% behave identically for simple cases, there are a few important differences. These are most likely to affect you if you’re a long-term user of %>% who has taken advantage of some of the more advanced features. But they’re still good to know about even if you’ve never used %>% because you’re likely to encounter some of them when reading wild-caught code.

+
  • By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side. %>% allows you to change the placement with a . placeholder. For example, x %>% f(1) is equivalent to f(x, 1) but x %>% f(1, .) is equivalent to f(1, x). R 4.2.0 added a _ placeholder to the base pipe, with one additional restriction: the argument has to be named. For example, x |> f(1, y = _) is equivalent to f(1, y = x). (See the sketch after this list for a side-by-side comparison.)

  • +
  • +

    The base pipe’s _ placeholder is deliberately simple and can’t replicate many features of the %>% placeholder: you can’t pass it to multiple arguments, and it doesn’t have any special behavior when the placeholder is used inside another function. For example, df %>% split(.$var) is equivalent to split(df, df$var) and df %>% {split(.$x, .$y)} is equivalent to split(df$x, df$y).

    +

    With %>% you can use . on the left-hand side of operators like $, [[, [ (which you’ll learn about in #sec-subset-many), so you can extract a single column from a data frame with (e.g.) mtcars %>% .$cyl. A future version of R may add similar support for |> and _. For the special case of extracting a column out of a data frame, you can also use pull():

    +
    +
    mtcars |> pull(cyl)
    +#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
    +
    +
  • +
  • %>% allows you to drop the parentheses when calling a function with no other arguments; |> always requires the parentheses.

  • +
  • %>% allows you to start a pipe with . to create a function rather than immediately executing the pipe; this is not supported by the base pipe.

  • +
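
Here’s a small side-by-side sketch of the two placeholders (the base pipe version assumes R 4.2.0 or later):

+
+
# magrittr: "." can go in any argument
+mtcars %>% lm(mpg ~ cyl, data = .)
+
+# base pipe: "_" works only with a named argument
+mtcars |> lm(mpg ~ cyl, data = _)
+
+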

Luckily there’s no need to commit entirely to one pipe or the other — you can use the base pipe for the majority of cases where it’s sufficient, and use the magrittr pipe when you really need its special features.

+
+ +
+

+Summary

+

In this chapter, you’ve learned more about the pipe: why we recommend it and some of the history that led to |>. The pipe is important because you’ll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.

+

In the next chapter, we switch back to data science tools, learning about tidy data. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. Of course, life is never easy and most datasets that you encounter in the wild will not already be tidy. So we’ll also teach you how to use the tidyr package to tidy your untidy data.

+ + +
+
diff --git a/oreilly/workflow-scripts.html b/oreilly/workflow-scripts.html new file mode 100644 index 0000000..46fe111 --- /dev/null +++ b/oreilly/workflow-scripts.html @@ -0,0 +1,236 @@ +
+

Workflow: scripts and projects

+
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+RStudio server +

+ +

If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.

+

This chapter will introduce you to two very important tools for organizing your code: scripts and projects.

+
+

+Scripts

+

So far, you have used the console to run code. That’s a great place to start, but you’ll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it up by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you’ll see four panes, as in #fig-rstudio-script. The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.

+
+
+ +

RStudio IDE with Editor, Console, and Output highlighted.

+
Figure 9.1: Opening the script editor adds a new pane at the top-left of the IDE.
+
+
+
+ +
+

+Running code

+

The script editor is a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below. If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates not_cancelled. It will also move the cursor to the next statement (beginning with not_cancelled |>). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.

+
+
library(dplyr)
+library(nycflights13)
+
+not_cancelled <- flights |> 
+  filter(!is.na(dep_delay)█, !is.na(arr_delay))
+
+not_cancelled |> 
+  group_by(year, month, day) |> 
+  summarize(mean = mean(dep_delay))
+
+

Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you’ve captured all the important parts of your code in the script.

+

We recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include install.packages() in a script that you share. It’s very antisocial to change settings on someone else’s computer!

+

When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won’t even think about it.

+
+ +
+

+RStudio diagnostics

+

In the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:

+
+
+

Script editor with the script x y <- 10. A red X indicates that there is a syntax error. The syntax error is also highlighted with a red squiggly line.

+
+
+

Hover over the cross to see what the problem is:

+
+
+

Script editor with the script x y <- 10. A red X indicates that there is a syntax error. The syntax error is also highlighted with a red squiggly line. Hovering over the X shows a text box with the text unexpected token y and unexpected token <-.

+
+
+

RStudio will also let you know about potential problems:

+
+
+

Script editor with the script 3 == NA. A yellow exclamation mark indicates that there may be a potential problem. Hovering over the exclamation mark shows a text box with the text use is.na to check whether expression evaluates to NA.

+
+
+
+ +
+

+Saving and naming

+

RStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it’s a good idea to avoid Untitled1, Untitled2, Untitled3, and so on, and instead to save your scripts with informative names.

+

It might be tempting to name your files code.R or myscript.R, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:

+
  1. File names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.

  2. File names should be human readable: use file names to describe what’s in the file.

  3. File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.

For example, suppose you have the following files in a project folder.

+
alternative model.R
+code for exploratory analysis.r
+finalreport.qmd
+FinalReport.qmd
+fig 1.png
+Figure_02.png
+model_first_try.R
+run-first.r
+temp.txt
+

There are a variety of problems here: it’s hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (finalreport vs. FinalReport; not to mention that you’re tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a fun strip on this: https://phdcomics.com/comics/archive.php?comicid=1531), and some names don’t describe their contents (run-first and temp).

+

Here’s a better way of naming and organizing the same set of files:

+
01-load-data.R
+02-exploratory-analysis.R
+03-model-approach-1.R
+04-model-approach-2.R
+fig-01.png
+fig-02.png
+report-2022-03-20.qmd
+report-2022-04-02.qmd
+report-draft-notes.txt
+

Numbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and temp is renamed to report-draft-notes to better describe its contents.

+
+
+ +
+

+Projects

+

One day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously, and you’ll want to keep them separate. One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.

+

To handle these real life situations, you need to make two decisions:

+
  1. What is the source of truth? What will you save as your lasting record of what happened?

  2. Where does your analysis live?
+
+

+What is the source of truth?

+

As a beginning R user, it’s OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis. However, in the long run, you’ll be much better off if you ensure that your R scripts are the source of truth. With your R scripts (and your data files), you can recreate the environment. With only your environment, it’s much harder to recreate your R scripts: you’ll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you’ll have to carefully mine your R history.

+

To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running usethis::use_blank_slate() (if you don’t have usethis installed, you can install it with install.packages("usethis")) or by mimicking the options shown in #fig-blank-slate. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time. But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code. There’s nothing worse than discovering three months after the fact that you’ve only stored the results of an important calculation in your workspace, not the calculation itself in your code.

+
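If you go the usethis route, the setup is a single function call (assuming you’ve installed usethis as described above):

# Run once; configures RStudio to never save or restore .RData
usethis::use_blank_slate()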
+
+ +

RStudio preferences window where the option Restore .RData into workspace at startup is not checked. Also, the option Save workspace to .RData on exit is set to Never.

+
Figure 9.2: Copy these options in your RStudio options to always start your RStudio session with a clean slate.
+
+
+
+

There is a great pair of keyboard shortcuts that will work together to make sure you’ve captured the important parts of your code in the editor:

+
  1. Press Cmd/Ctrl + Shift + F10 to restart R.

  2. Press Cmd/Ctrl + Shift + S to re-run the current script.

We collectively use this pattern hundreds of times a week.

+

+RStudio server +


+ +

If you’re using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you’re closing R, but the server actually keeps it running in the background. The next time you return, you’ll be in exactly the same place you left. This makes it even more important to regularly restart R so that you’re starting with a fresh slate.

+ +
+ +
+

+Where does your analysis live?

+

R has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:

+
+
+

The Console tab shows the current working directory as ~/Documents/r4ds/r4ds.

+
+
+

And you can print this out in R code by running getwd():

+
+
getwd()
+#> [1] "/Users/hadley/Documents/r4ds/r4ds"
+
+

As a beginning R user, it’s OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you’re nine chapters into this book, and you’re no longer a rank beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, setting R’s working directory to the associated directory.

+

You can set the working directory from within R but we do not recommend it:

+
+
setwd("/path/to/my/CoolProject")
+
+

There’s a better way; a way that also puts you on the path to managing your R work like an expert. That way is the RStudio project.

+
+ +
+

+RStudio projects

+

Keeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via projects. Let’s make a project for you to use while you’re working through the rest of this book. Click File > New Project, then follow the steps shown in #fig-new-project.

+ +
+
+

Three screenshots of the New Project menu: the Create Project window with New Directory selected, the Project Type window with Empty Project selected, and the Create New Project window with the directory name r4ds, created as a subdirectory of the Desktop.

(a) First click New Directory.

(b) Then click New Project.

(c) Finally, fill in the directory (project) name, choose a good subdirectory for its home and click Create Project.
Figure 9.3: Create a new project by following these three steps.
+
+

Call your project r4ds and think carefully about which subdirectory you put the project in. If you don’t store it somewhere sensible, it will be hard to find it in the future!

+

Once this process is complete, you’ll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:

+
+
getwd()
+#> [1] "/Users/hadley/Documents/r4ds/r4ds"
+
+

Now enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Next, run the complete script, which will save a PDF and a CSV file into your project directory. Don’t worry about the details, you’ll learn them later in the book.

+
+
library(tidyverse)
+
+ggplot(diamonds, aes(carat, price)) + 
+  geom_hex()
+ggsave("diamonds.pdf")
+
+write_csv(diamonds, "diamonds.csv")
+
+

Quit RStudio. Inspect the folder associated with your project — notice the .Rproj file. Double-click that file to re-open the project. Notice you get back to where you left off: it’s the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you’re starting with a clean slate.

+

In your favorite OS-specific way, search your computer for diamonds.pdf and you will find the PDF (no surprise) but also the script that created it (diamonds.R). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files with R code and never with the mouse or the clipboard, you will be able to reproduce old work with ease!

+
+ +
+

+Relative and absolute paths

+

Once you’re inside a project, you should only ever use relative paths, not absolute paths. What’s the difference? A relative path is relative to the working directory, i.e. the project’s home. When Hadley wrote diamonds.R above it was a shortcut for /Users/hadley/Documents/r4ds/r4ds/diamonds.R. But importantly, if Mine ran this code on her computer, it would point to /Users/Mine/Documents/r4ds/r4ds/diamonds.R. This is why relative paths are important: they’ll work regardless of where the project ends up.

+
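For example, here are the two ways of pointing at the script you just created; the absolute version is a sketch of what it would look like on Hadley’s computer and will fail everywhere else:

# Relative path: resolved against the project's working directory
source("diamonds.R")

# Absolute path: avoid, it only works on one specific computer
# source("/Users/hadley/Documents/r4ds/r4ds/diamonds.R")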

Absolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g. C:) or two backslashes (e.g. \\servername) and on Mac/Linux they start with a slash “/” (e.g. /users/hadley). You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.

+

There’s another important difference between operating systems: how you separate the components of the path. Mac and Linux use slashes (e.g. plots/diamonds.pdf) and Windows uses backslashes (e.g. plots\diamonds.pdf). R can work with either type (no matter what platform you’re currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.

+
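One way to sidestep the separator problem entirely is to let R assemble paths for you; file.path() uses forward slashes on every platform:

file.path("plots", "diamonds.pdf")
#> [1] "plots/diamonds.pdf"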
+
+ +
+

+Summary

+

In summary, scripts and projects give you a solid workflow that will serve you well in the future:

+
  • Create one RStudio project for each data analysis project.

  • Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you’ve captured everything in your scripts.

  • Only ever use relative paths, not absolute paths.

Then everything you need is in one place and cleanly separated from all the other projects that you are working on.

+
+ +
+

+Exercises

+
  1. Go to the RStudio Tips Twitter account, https://twitter.com/rstudiotips and find one tip that looks interesting. Practice using it!

  2. What other common mistakes will RStudio diagnostics report? Read https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics to find out.
+ +
+

+Summary

+

In this chapter, you’ve learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you’ll learn to appreciate how a little up front organisation can save you a bunch of time down the road.

+

Next up, we’ll switch back to data science tooling to talk about exploratory data analysis (or EDA for short), a philosophy and set of tools that you can use with your data to start to get a sense of what’s going on.

+ + +
+
diff --git a/oreilly/workflow-style.html b/oreilly/workflow-style.html new file mode 100644 index 0000000..40708e8 --- /dev/null +++ b/oreilly/workflow-style.html @@ -0,0 +1,211 @@ +
+

Workflow: code style

+
+ +
+ +
+ +

You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at https://r4ds.had.co.nz.

+

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. This chapter will introduce you to the most important points of the tidyverse style guide (https://style.tidyverse.org), which is used throughout this book.

Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the styler package (https://styler.r-lib.org) by Lorenz Walthert. Once you’ve installed it with install.packages("styler"), an easy way to use it is via RStudio’s command palette. The command palette lets you use any built-in RStudio command, as well as many addins provided by packages. Open the palette by pressing Cmd/Ctrl + Shift + P, then type “styler” to see all the shortcuts provided by styler. #fig-styler shows the results.

+
+ +

A screenshot showing the command palette after typing "styler", showing the four styling tools provided by the package.

+
Figure 7.1: RStudio’s command palette makes it easy to access every RStudio command using only the keyboard.
+
+
+
+
library(tidyverse)
+library(nycflights13)
+
+
+

+Names

+

We talked briefly about names in #sec-whats-in-a-name. Remember that variable names (those created by <- and those created by mutate()) should use only lowercase letters, numbers, and _. Use _ to separate words within a name.

+
+
# Strive for:
+short_flights <- flights |> filter(air_time < 60)
+
+# Avoid:
+SHORTFLIGHTS <- flights |> filter(air_time < 60)
+
+

As a general rule of thumb, it’s better to prefer long, descriptive names that are easy to understand, rather than concise names that are fast to type. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.

+

If you have a bunch of names for related things, do your best to be consistent. It’s easy for inconsistencies to arise when you forget a previous convention, so don’t feel bad if you have to go back and rename things. In general, if you have a bunch of variables that are a variation on a theme you’re better off giving them a common prefix, rather than a common suffix, because autocomplete works best on the start of a variable.

+
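For example, these hypothetical summary variables share the prefix delay_, so typing delay_ and pressing Tab brings up the whole group:

# Strive for: a common prefix keeps related names together
delay_dep <- mean(flights$dep_delay, na.rm = TRUE)
delay_arr <- mean(flights$arr_delay, na.rm = TRUE)

# Avoid: a common suffix scatters them in autocomplete
# dep_delay_mean, arr_delay_mean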
+ +
+

+Spaces

+

Put spaces on either side of mathematical operators apart from ^ (i.e., +, -, ==, <, …), and around the assignment operator (<-).

+
+
# Strive for
+z <- (a + b)^2 / d
+
+# Avoid
+z<-( a + b ) ^ 2/d
+
+

Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma, just like in regular English.

+
+
# Strive for
+mean(x, na.rm = TRUE)
+
+# Avoid
+mean (x ,na.rm=TRUE)
+
+

It’s OK to add extra spaces if it improves alignment. For example, if you’re creating multiple variables in mutate(), you might want to add spaces so that all the = line up. This makes it easier to skim the code.

+
+
flights |> 
+  mutate(
+    speed      = air_time / distance,
+    dep_hour   = dep_time %/% 100,
+    dep_minute = dep_time %%  100
+  )
+
+
+ +
+

+Pipes

+

|> should always have a space before it and should typically be the last thing on a line. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 ft view by skimming the verbs on the left-hand side.

+
+
# Strive for 
+flights |>  
+  filter(!is.na(arr_delay), !is.na(tailnum)) |> 
+  count(dest)
+
+# Avoid
+flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)
+
+

If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.

+
+
# Strive for
+flights |>  
+  group_by(tailnum) |> 
+  summarize(
+    delay = mean(arr_delay, na.rm = TRUE),
+    n = n()
+  )
+
+# Avoid
+flights |>
+  group_by(
+    tailnum
+  ) |> 
+  summarize(delay = mean(arr_delay, na.rm = TRUE), n = n())
+
+

After the first step of the pipeline, indent each line by two spaces. If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.

+
+
# Strive for 
+flights |>  
+  group_by(tailnum) |> 
+  summarize(
+    delay = mean(arr_delay, na.rm = TRUE),
+    n = n()
+  )
+
+# Avoid
+flights|>
+  group_by(tailnum) |> 
+  summarize(
+             delay = mean(arr_delay, na.rm = TRUE), 
+             n = n()
+           )
+
+flights|>
+  group_by(tailnum) |> 
+  summarize(
+  delay = mean(arr_delay, na.rm = TRUE), 
+  n = n()
+  )
+
+

It’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.

+
+
# This fits compactly on one line
+df |> mutate(y = x + 1)
+
+# While this takes up 4x as many lines, it's easily extended to 
+# more variables and more steps in the future
+df |> 
+  mutate(
+    y = x + 1
+  )
+
+

Finally, be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name. The names will help cue the reader into what’s happening and make it easier to check that intermediate results are as expected. If you can give something an informative name, you generally should. Don’t expect to get it right the first time! This means breaking up long pipelines if there are intermediate states that can get good names.

+
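For example, you might split one long pipeline into two, with a named intermediate result; this is just a sketch of the pattern using the flights data loaded above:

# Name the intermediate state so the reader can inspect it
daily_delay <- flights |> 
  group_by(year, month, day) |> 
  summarize(delay = mean(arr_delay, na.rm = TRUE), .groups = "drop")

# Then build on it in a second, shorter pipeline
daily_delay |> 
  slice_max(delay, n = 5)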
+ +
+

+ggplot2

+

The same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |>.

+
+
flights |> 
+  group_by(month) |> 
+  summarize(
+    delay = mean(arr_delay, na.rm = TRUE)
+  ) |> 
+  ggplot(aes(month, delay)) +
+  geom_point() + 
+  geom_line()
+
+

Again, if you can’t fit all of the arguments to a function onto a single line, put each argument on its own line:

+
+
flights |> 
+  group_by(dest) |> 
+  summarize(
+    distance = mean(distance),
+    speed = mean(air_time / distance, na.rm = TRUE)
+  ) |> 
+  ggplot(aes(distance, speed)) +
+  geom_smooth(
+    method = "loess",
+    span = 0.5,
+    se = FALSE, 
+    color = "white", 
+    size = 4
+  ) +
+  geom_point()
+
+
+ +
+

+Sectioning comments

+

As your scripts get longer, you can use sectioning comments to break up your file into manageable pieces:

+
+
# Load data --------------------------------------
+
+# Plot data --------------------------------------
+
+

RStudio provides a keyboard shortcut to create these headers (Cmd/Ctrl + Shift + R), and will display them in the code navigation drop-down at the bottom-left of the editor, as shown in #fig-rstudio-sections.

+
+
+ +

+
Figure 7.2: After adding sectioning comments to your script, you can easily navigate to them using the code navigation tool in the bottom-left of the script editor.
+
+
+
+
+ +
+

+Exercises

+
  1. Restyle the following pipelines following the guidelines above.

     flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)

     flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
+ +
+

+Summary

+

In this chapter, you’ve learned the most important principles of code style. These may feel like a set of arbitrary rules to start with (because they are!) but over time, as you write more code and share code with more people, you’ll see how important a consistent style is. And don’t forget about the styler package: it’s a great way to quickly improve the quality of poorly styled code.

+

So far, we’ve worked with datasets bundled inside of R packages. This makes it easier to get some practice on pre-prepared data, but obviously your data won’t be available in this way. So in the next chapter, you’re going to learn how to load data from disk into your R session using the readr package.

+ + +
+
diff --git a/oreilly/wrangle.html b/oreilly/wrangle.html new file mode 100644 index 0000000..854a874 --- /dev/null +++ b/oreilly/wrangle.html @@ -0,0 +1,16 @@ +
+

Wrangle

In this part of the book, you’ll learn about data wrangling, the art of getting your data into R in a useful form for further work. In some cases, this is a relatively simple application of a package that does data import. But in more complex cases it encompasses both tidying and transformation as the native structure of the data might be quite far from the tidy rectangle you’d prefer to work with.

+
+ +

Our data science model with import, tidy, and transform, highlighted in blue and labelled with "wrangle".

+
Figure 1: Data wrangling is the combination of importing, tidying, and transforming.
+
+
+

This part of the book proceeds as follows:

  • In #chp-data-import, you’ll learn how to get plain-text data in rectangular formats from disk and into R.

  • In #chp-spreadsheets, you’ll learn how to get data from Excel spreadsheets and Google Sheets into R.

  • In #chp-databases, you’ll learn about getting data into R from databases.

  • In #chp-rectangling, you’ll learn how to work with hierarchical data that includes deeply nested lists, as is often created when your raw data is in JSON.

  • In #chp-webscraping, you’ll learn about harvesting data off the web and getting it into R.

Some other types of data are not covered in this book:

  • haven reads SPSS, Stata, and SAS files.

  • xml2 reads XML files.

For other file types, try the R data import/export manual (https://cran.r-project.org/doc/manuals/r-release/R-data.html) and the rio package (https://github.com/leeper/rio).