Typo + grammatical fixes + issue triage (#1217)

* Fix ex wording + grammatical, closes #1209

* Suppress warnings, closes #1210

* Update screenshot, closes #1211

* Grammatical

* Typos + grammatical

* Update workflow-basics.qmd

* Update workflow-basics.qmd

* Update workflow-basics.qmd

* Update workflow-help.qmd

* Update workflow-pipes.qmd
Mine Cetinkaya-Rundel 2023-01-05 00:26:14 -05:00 committed by GitHub
parent e3b8211853
commit b4bde71f35
GPG Key ID: 4AEE18F83AFDEB23
8 changed files with 85 additions and 84 deletions

@ -19,14 +19,14 @@ In this chapter, you will learn a consistent way to organize your data in R usin
Getting your data into this format requires some work up front, but that work pays off in the long term. Getting your data into this format requires some work up front, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.
In this chapter, you'll first learn the definition of tidy data and see it applied to simple toy dataset. In this chapter, you'll first learn the definition of tidy data and see it applied to a simple toy dataset.
Then we'll dive into the main tool you'll use for tidying data: pivoting. Then we'll dive into the primary tool you'll use for tidying data: pivoting.
Pivoting allows you to change the form of your data, without changing any of the values. Pivoting allows you to change the form of your data without changing any of the values.
We'll finish up with a discussion of usefully untidy data, and how you can create it if needed. We'll finish with a discussion of usefully untidy data and how you can create it if needed.
### Prerequisites ### Prerequisites
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. In this chapter, we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets.
tidyr is a member of the core tidyverse. tidyr is a member of the core tidyverse.
```{r} ```{r}
@ -41,7 +41,7 @@ From this chapter on, we'll suppress the loading message from `library(tidyverse
## Tidy data {#sec-tidy-data} ## Tidy data {#sec-tidy-data}
You can represent the same underlying data in multiple ways. You can represent the same underlying data in multiple ways.
The example below shows the same data organised in four different ways. The example below shows the same data organized in four different ways.
Each dataset shows the same values of four variables: *country*, *year*, *population*, and *cases* of TB (tuberculosis), but each dataset organizes the values in a different way. Each dataset shows the same values of four variables: *country*, *year*, *population*, and *cases* of TB (tuberculosis), but each dataset organizes the values in a different way.
<!-- TODO redraw as tables --> <!-- TODO redraw as tables -->
@ -62,7 +62,7 @@ One of them, `table1`, will be much easier to work with inside the tidyverse bec
There are three interrelated rules that make a dataset tidy: There are three interrelated rules that make a dataset tidy:
1. Each variable is a column; each column is a variable. 1. Each variable is a column; each column is a variable.
2. Each observation is row; each row is an observation. 2. Each observation is a row; each row is an observation.
3. Each value is a cell; each cell is a single value. 3. Each value is a cell; each cell is a single value.
@fig-tidy-structure shows the rules visually. @fig-tidy-structure shows the rules visually.
@ -88,17 +88,17 @@ There are two main advantages:
1. There's a general advantage to picking one consistent way of storing data. 1. There's a general advantage to picking one consistent way of storing data.
If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity. If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity.
2. There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine. 2. There's a specific advantage to placing variables in columns because it allows R's vectorized nature to shine.
As you learned in @sec-mutate and @sec-summarize, most built-in R functions work with vectors of values. As you learned in @sec-mutate and @sec-summarize, most built-in R functions work with vectors of values.
That makes transforming tidy data feel particularly natural. That makes transforming tidy data feel particularly natural.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data. dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
Here are a couple of small examples showing how you might work with `table1`. Here are a few small examples showing how you might work with `table1`.
```{r} ```{r}
#| fig-width: 5 #| fig-width: 5
#| fig-alt: > #| fig-alt: >
#| This figure shows the numbers of cases in 1999 and 2000 for #| This figure shows the number of cases in 1999 and 2000 for
#| Afghanistan, Brazil, and China, with year on the x-axis and number #| Afghanistan, Brazil, and China, with year on the x-axis and number
#| of cases on the y-axis. Each point on the plot represents the number #| of cases on the y-axis. Each point on the plot represents the number
#| of cases in a given country in a given year. The points for each #| of cases in a given country in a given year. The points for each

@ -108,7 +108,7 @@ Our ultimate goal in this chapter is to recreate the following visualization dis
#| fig-alt: > #| fig-alt: >
#| A scatterplot of body mass vs. flipper length of penguins, with a #| A scatterplot of body mass vs. flipper length of penguins, with a
#| smooth curve displaying the relationship between these two variables #| smooth curve displaying the relationship between these two variables
#| overlaid. The plot displays a positive, fairly linear, relatively #| overlaid. The plot displays a positive, fairly linear, and relatively
#| strong relationship between these two variables. Species (Adelie, #| strong relationship between these two variables. Species (Adelie,
#| Chinstrap, and Gentoo) are represented with different colors and #| Chinstrap, and Gentoo) are represented with different colors and
#| shapes. The relationship between body mass and flipper length is #| shapes. The relationship between body mass and flipper length is
@ -186,7 +186,7 @@ You'll learn a whole bunch of geoms throughout the book, particularly in @sec-la
```{r} ```{r}
#| fig-alt: > #| fig-alt: >
#| A scatterplot of body mass vs. flipper length of penguins. The plot #| A scatterplot of body mass vs. flipper length of penguins. The plot
#| displays a positive, linear, relatively strong relationship between #| displays a positive, linear, and relatively strong relationship between
#| these two variables. #| these two variables.
ggplot( ggplot(
@ -232,7 +232,7 @@ Throughout the book you will make many more ggplots and have many more opportuni
#| warning: false #| warning: false
#| fig-alt: > #| fig-alt: >
#| A scatterplot of body mass vs. flipper length of penguins. The plot #| A scatterplot of body mass vs. flipper length of penguins. The plot
#| displays a positive, fairly linear, relatively strong relationship #| displays a positive, fairly linear, and relatively strong relationship
#| between these two variables. Species (Adelie, Chinstrap, and Gentoo) #| between these two variables. Species (Adelie, Chinstrap, and Gentoo)
#| are represented with different colors. #| are represented with different colors.
@ -326,7 +326,7 @@ Other arguments match the aesthetic mappings, `x` is the x-axis label, `y` is th
#| fig-alt: > #| fig-alt: >
#| A scatterplot of body mass vs. flipper length of penguins, with a #| A scatterplot of body mass vs. flipper length of penguins, with a
#| smooth curve displaying the relationship between these two variables #| smooth curve displaying the relationship between these two variables
#| overlaid. The plot displays a positive, fairly linear, relatively #| overlaid. The plot displays a positive, fairly linear, and relatively
#| strong relationship between these two variables. Species (Adelie, #| strong relationship between these two variables. Species (Adelie,
#| Chinstrap, and Gentoo) are represented with different colors and #| Chinstrap, and Gentoo) are represented with different colors and
#| shapes. The relationship between body mass and flipper length is #| shapes. The relationship between body mass and flipper length is
@ -771,7 +771,7 @@ You will learn about many other geoms for visualizing distributions of variables
How can you see this information when you run `mpg`? How can you see this information when you run `mpg`?
2. Make a scatterplot of `hwy` vs. `displ` using the `mpg` data frame. 2. Make a scatterplot of `hwy` vs. `displ` using the `mpg` data frame.
Then, map a third, numerical variable to `color`, `size`, and `shape`. Next, map a third, numerical variable to `color`, then `size`, then both `color` and `size`, then `shape`.
How do these aesthetics behave differently for categorical vs. numerical variables? How do these aesthetics behave differently for categorical vs. numerical variables?
3. In the scatterplot of `hwy` vs. `displ`, what happens if you map a third variable to `linewidth`? 3. In the scatterplot of `hwy` vs. `displ`, what happens if you map a third variable to `linewidth`?
@ -781,7 +781,7 @@ You will learn about many other geoms for visualizing distributions of variables
5. Make a scatterplot of `bill_depth_mm` vs. `bill_length_mm` and color the points by `species`. 5. Make a scatterplot of `bill_depth_mm` vs. `bill_length_mm` and color the points by `species`.
What does adding coloring by species reveal about the relationship between these two variables? What does adding coloring by species reveal about the relationship between these two variables?
6. Why does the following yield two separate legends. 6. Why does the following yield two separate legends?
How would you fix it to combine the two legends? How would you fix it to combine the two legends?
```{r} ```{r}
@ -810,6 +810,7 @@ That's the job of `ggsave()`, which will save the most recent plot to disk:
```{r} ```{r}
#| fig-show: hide #| fig-show: hide
#| warning: false
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point() geom_point()

Binary file not shown. Size before: 88 KiB; after: 78 KiB.

@ -10,14 +10,14 @@ status("polishing")
You now have some experience running R code. You now have some experience running R code.
We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration! We didn't give you many details, but you've obviously figured out the basics, or you would've thrown this book away in frustration!
Frustration is natural when you start programming in R, because it is such a stickler for punctuation, and even one character out of place will cause it to complain. Frustration is natural when you start programming in R because it is such a stickler for punctuation, and even one character out of place will cause it to complain.
But while you should expect to be a little frustrated, take comfort in that this experience is both typical and temporary: it happens to everyone, and the only way to get over it is to keep trying. But while you should expect to be a little frustrated, take comfort in that this experience is typical and temporary: it happens to everyone, and the only way to get over it is to keep trying.
Before we go any further, let's make sure you've got a solid foundation in running R code, and that you know about some of the most helpful RStudio features. Before we go any further, let's ensure you've got a solid foundation in running R code and that you know some of the most helpful RStudio features.
## Coding basics ## Coding basics
Let's review some basics we've so far omitted in the interests of getting you plotting as quickly as possible. Let's review some basics we've omitted so far in the interest of getting you plotting as quickly as possible.
You can use R as a calculator: You can use R as a calculator:
```{r} ```{r}
@ -55,7 +55,7 @@ object_name <- value
When reading that code, say "object name gets value" in your head. When reading that code, say "object name gets value" in your head.
You will make lots of assignments and `<-` is a pain to type. You will make lots of assignments, and `<-` is a pain to type.
You can save time with RStudio's keyboard shortcut: Alt + - (the minus sign). You can save time with RStudio's keyboard shortcut: Alt + - (the minus sign).
Notice that RStudio automatically surrounds `<-` with spaces, which is a good code formatting practice. Notice that RStudio automatically surrounds `<-` with spaces, which is a good code formatting practice.
Code is miserable to read on a good day, so giveyoureyesabreak and use spaces. Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
@ -63,10 +63,10 @@ Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
## Comments ## Comments
R will ignore any text after `#`. R will ignore any text after `#`.
This allows to you to write **comments**, text that is ignored by R but read by other humans. This allows you to write **comments**, text that is ignored by R but read by other humans.
We'll sometimes include comments in examples explaining what's happening with the code. We'll sometimes include comments in examples explaining what's happening with the code.
Comments can be helpful for briefly describing what the subsequent code does. Comments can be helpful for briefly describing what the following code does.
```{r} ```{r}
# define primes # define primes
@ -76,26 +76,26 @@ primes <- c(2, 3, 5, 7, 11, 13)
primes * 2 primes * 2
``` ```
With short pieces of code like this, it might not be necessary to leave a command for every single line of code. With short pieces of code like this, leaving a comment for every single line of code might not be necessary.
But as the code you're writing gets more complex, comments can save you (and your collaborators) a lot of time in figuring out what was done in the code. But as the code you're writing gets more complex, comments can save you (and your collaborators) a lot of time figuring out what was done in the code.
Use comments to explain the *why* of your code, not the *how* or the *what*. Use comments to explain the *why* of your code, not the *how* or the *what*.
The *what* and *how* of code your is always possible to figure out, even if it might be tedious, by carefully reading the code. The *what* and *how* of your code are always possible to figure out, even if it might be tedious, by carefully reading it.
But if you describe the "what" in your comments and your code, you'll have to remember to carefully update the comment and code in tandem. But if you describe the "what" in your comments and your code, you'll have to remember to update the comment and code in tandem carefully.
If you change the code and forget to update the comment, they'll be inconsistent which will lead to confusion when you come back to your code in the future. If you change the code and forget to update the comment, they'll be inconsistent, leading to confusion when you return to your code in the future.
Figuring out *why* something was done is much more difficult, if not impossible. Figuring out *why* something was done is much more difficult, if not impossible.
For example, `geom_smooth()` has an argument called `span`, which controls the smoothness of the curve, with larger values yielding a smoother curve. For example, `geom_smooth()` has an argument called `span`, which controls the smoothness of the curve, with larger values yielding a smoother curve.
Suppose you decide to change the value of `span` from its default of 0.75 to 0.3: it's easy for a future reader to understand *what* is happening, but unless you note your thinking in a comment, no one will understand *why* you changed the default. Suppose you decide to change the value of `span` from its default of 0.75 to 0.3: it's easy for a future reader to understand *what* is happening, but unless you note your thinking in a comment, no one will understand *why* you changed the default.
For data analysis code, use comments to explain your overall plan of attack and record important insight as you encounter them. For data analysis code, use comments to explain your overall plan of attack and record important insights as you encounter them.
There's no way to re-capture this knowledge from the code itself. There's no way to re-capture this knowledge from the code itself.
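The `span` example above can be sketched in code. This is a minimal illustration, not code from the chapter: it assumes ggplot2 is installed and uses its built-in `mpg` data, and the reason given in the comment is hypothetical.

```r
library(ggplot2)

# Why (hypothetical rationale, for illustration only): the default span of
# 0.75 smoothed away a dip we wanted readers to see, so we use a smaller span
# that follows the data more closely.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3)

p
```

A comment such as `# set span to 0.3` would only restate what the code already says; the comment above records the reasoning that the code cannot.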
## What's in a name? {#sec-whats-in-a-name} ## What's in a name? {#sec-whats-in-a-name}
Object names must start with a letter, and can only contain letters, numbers, `_` and `.`. Object names must start with a letter and can only contain letters, numbers, `_`, and `.`.
You want your object names to be descriptive, so you'll need to adopt a convention for multiple words. You want your object names to be descriptive, so you'll need to adopt a convention for multiple words.
We recommend **snake_case** where you separate lowercase words with `_`. We recommend **snake_case**, where you separate lowercase words with `_`.
```{r} ```{r}
#| eval: false #| eval: false
@ -106,7 +106,7 @@ some.people.use.periods
And_aFew.People_RENOUNCEconvention And_aFew.People_RENOUNCEconvention
``` ```
We'll come back to names again when we talk more about code style in @sec-workflow-style. We'll return to names again when we discuss code style in @sec-workflow-style.
You can inspect an object by typing its name: You can inspect an object by typing its name:
@ -148,8 +148,8 @@ R_rocks
``` ```
This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions. This illustrates the implied contract between you and R: R will do the tedious computations for you, but in exchange, you must be completely precise in your instructions.
Typos matter; R can't read your mind and say "oh, they probably meant `r_rocks` when they typed `r_rock`". Typos matter; R can't read your mind and say, "oh, they probably meant `r_rocks` when they typed `r_rock`".
Case matters; similarly R can't read your mind and say "oh, they probably meant `r_rocks` when they typed `R_rocks`". Case matters; similarly, R can't read your mind and say, "oh, they probably meant `r_rocks` when they typed `R_rocks`".
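A small sketch of what this precision means in practice. It assumes `r_rocks` was created as `2^3`, and it uses `tryCatch()` only as a device to capture the error message so the script keeps running:

```r
r_rocks <- 2^3

# Look the object up with a typo and then with the wrong case. tryCatch()
# converts the error R would normally raise into its message text.
typo_result <- tryCatch(r_rock, error = function(e) conditionMessage(e))
case_result <- tryCatch(R_rocks, error = function(e) conditionMessage(e))

typo_result
case_result

# Only the exact name finds the object.
r_rocks
```

Both misspelled lookups fail the same way: R treats each as a different, undefined name.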
## Calling functions ## Calling functions
@ -161,10 +161,10 @@ R has a large collection of built-in functions that are called like this:
function_name(arg1 = val1, arg2 = val2, ...) function_name(arg1 = val1, arg2 = val2, ...)
``` ```
Let's try using `seq()`, which makes regular **seq**uences of numbers and, while we're at it, learn more helpful features of RStudio. Let's try using `seq()`, which makes regular **seq**uences of numbers, and while we're at it, learn more helpful features of RStudio.
Type `se` and hit TAB. Type `se` and hit TAB.
A popup shows you possible completions. A popup shows you possible completions.
Specify `seq()` by typing more (a `q`) to disambiguate, or by using ↑/↓ arrows to select. Specify `seq()` by typing more (a `q`) to disambiguate or by using ↑/↓ arrows to select.
Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose. Notice the floating tooltip that pops up, reminding you of the function's arguments and purpose.
If you want more help, press F1 to get all the details in the help tab in the lower right pane. If you want more help, press F1 to get all the details in the help tab in the lower right pane.
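Once `seq()` is selected, a call might look like the following sketch. The specific values are illustrative assumptions; the argument names `from`, `to`, and `by` are the ones the tooltip shows.

```r
# Named arguments, using the argument names from the tooltip.
seq(from = 1, to = 10)

# Arguments can also be matched by position.
one_to_ten <- seq(1, 10)

# `by` sets the step size between values.
evens <- seq(from = 2, to = 10, by = 2)

one_to_ten
evens
```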

@ -10,7 +10,7 @@ status("polishing")
This book is not an island; there is no single resource that will allow you to master R. This book is not an island; there is no single resource that will allow you to master R.
As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer.
This section describes a few tips on how to get help, and to help you keep learning. This section describes a few tips on how to get help and to help you keep learning.
## Google is your friend ## Google is your friend
@ -22,17 +22,17 @@ Chances are that someone else has been confused by it in the past, and there wil
(If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.) (If the error message isn't in English, run `Sys.setenv(LANGUAGE = "en")` and re-run the code; you're more likely to find help for English error messages.)
If Google doesn't help, try [Stack Overflow](https://stackoverflow.com). If Google doesn't help, try [Stack Overflow](https://stackoverflow.com).
Start by spending a little time searching for an existing answer, including `[R]` to restrict your search to questions and answers that use R. Start by spending a little time searching for an existing answer, including `[R]`, to restrict your search to questions and answers that use R.
## Making a reprex ## Making a reprex
If your googling doesn't find anything useful, it's a really good idea prepare a **reprex,** short for minimal **repr**oducible **ex**ample. If your googling doesn't find anything useful, it's a really good idea to prepare a **reprex,** short for minimal **repr**oducible **ex**ample.
A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it. A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
There are two parts to creating a reprex: There are two parts to creating a reprex:
- First, you need to make your code reproducible. - First, you need to make your code reproducible.
This means that you need to capture everything, i.e. include any `library()` calls and create all necessary objects. This means that you need to capture everything, i.e., include any `library()` calls and create all necessary objects.
The easiest way to make sure you've done this is to use the reprex package. The easiest way to make sure you've done this is using the reprex package.
- Second, you need to make it minimal. - Second, you need to make it minimal.
Strip away everything that is not directly related to your problem. Strip away everything that is not directly related to your problem.
@ -41,14 +41,14 @@ There are two parts to creating a reprex:
That sounds like a lot of work! That sounds like a lot of work!
And it can be, but it has a great payoff: And it can be, but it has a great payoff:
- 80% of the time creating an excellent reprex reveals the source of your problem. - 80% of the time, creating an excellent reprex reveals the source of your problem.
It's amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question. It's amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.
- The other 20% of time you will have captured the essence of your problem in a way that is easy for others to play with. - The other 20% of the time, you will have captured the essence of your problem in a way that is easy for others to play with.
This substantially improves your chances of getting help! This substantially improves your chances of getting help!
When creating a reprex by hand, it's easy to accidentally miss something that means your code can't be run on someone else's computer. When creating a reprex by hand, it's easy to accidentally miss something, meaning your code can't be run on someone else's computer.
Avoid this problem by using the reprex package which is installed as part of the tidyverse. Avoid this problem by using the reprex package, which is installed as part of the tidyverse.
Let's say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it): Let's say you copy this code onto your clipboard (or, on RStudio Server or Cloud, select it):
```{r} ```{r}
@ -87,8 +87,8 @@ Anyone else can copy, paste, and run this immediately.
There are three things you need to include to make your example reproducible: required packages, data, and code. There are three things you need to include to make your example reproducible: required packages, data, and code.
1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs. 1. **Packages** should be loaded at the top of the script so it's easy to see which ones the example needs.
This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed or last updated the package. This is a good time to check that you're using the latest version of each package; you may have discovered a bug that's been fixed since you installed or last updated the package.
For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`. For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`.
2. The easiest way to include **data** is to use `dput()` to generate the R code needed to recreate it. 2. The easiest way to include **data** is to use `dput()` to generate the R code needed to recreate it.
@ -96,21 +96,21 @@ There are three things you need to include to make your example reproducible: re
1. Run `dput(mtcars)` in R 1. Run `dput(mtcars)` in R
2. Copy the output 2. Copy the output
3. In reprex, type `mtcars <-` then paste. 3. In reprex, type `mtcars <-`, then paste.
Try and find the smallest subset of your data that still reveals the problem. Try and find the smallest subset of your data that still reveals the problem.
3. Spend a little bit of time ensuring that your **code** is easy for others to read: 3. Spend a little bit of time ensuring that your **code** is easy for others to read:
- Make sure you've used spaces and your variable names are concise, yet informative. - Make sure you've used spaces and your variable names are concise yet informative.
- Use comments to indicate where your problem lies. - Use comments to indicate where your problem lies.
- Do your best to remove everything that is not related to the problem. - Do your best to remove everything that is not related to the problem.
The shorter your code is, the easier it is to understand, and the easier it is to fix. The shorter your code is, the easier it is to understand and the easier it is to fix.
Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script in. Finish by checking that you have actually made a reproducible example by starting a fresh R session and copying and pasting your script.
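The `dput()` steps above can be sketched as a round trip. This is one possible way to check the idea, not part of the chapter: the column subset and the object names `small`, `code`, and `rebuilt` are assumptions made for illustration.

```r
# A small subset is usually enough to show a problem.
small <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that recreates the object; capture.output() stores
# that printed code as text instead of showing it in the console.
code <- capture.output(dput(small))
code

# Running the generated code rebuilds the data frame, which is what someone
# would get after pasting the output following `small <-` in a reprex.
rebuilt <- eval(parse(text = code))
all.equal(small, rebuilt)
```

In a real reprex you would paste the printed `dput()` output directly; the `eval(parse())` step here only confirms that the generated code recreates the data.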
## Investing in yourself ## Investing in yourself
@ -121,12 +121,12 @@ To keep up with the R community more broadly, we recommend reading [R Weekly](ht
If you're an active Twitter user, you might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)), Mine ([\@minebocek](https://twitter.com/minebocek)), Garrett ([\@statgarrett](https://twitter.com/statgarrett)), or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE. If you're an active Twitter user, you might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)), Mine ([\@minebocek](https://twitter.com/minebocek)), Garrett ([\@statgarrett](https://twitter.com/statgarrett)), or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
If you want the full fire hose of new developments, you can also read the ([`#rstats`](https://twitter.com/search?q=%23rstats)) hashtag. If you want the full fire hose of new developments, you can also read the ([`#rstats`](https://twitter.com/search?q=%23rstats)) hashtag.
This is one the key tools that Hadley and Mine use to keep up with new developments in the community. This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.
## Summary ## Summary
This chapter concludes the Whole Game part of the book. This chapter concludes the Whole Game part of the book.
You've now seen the most important parts of the data science process: visualization, transformation, tidying and importing. You've now seen the most important parts of the data science process: visualization, transformation, tidying and importing.
Now you've got a holistic view of whole process and we start to get into the the details of small pieces. Now you've got a holistic view of the whole process, and we start to get into the details of small pieces.
The next part of the book, Visualize, does a deeper dive into the grammar of graphics and creating data visualizations with ggplot2, showcases how to use the tools you've learned so far to conduct exploratory data analysis, and introduces good practices for creating plots for communication. The next part of the book, Visualize, does a deeper dive into the grammar of graphics and creating data visualizations with ggplot2, showcases how to use the tools you've learned so far to conduct exploratory data analysis, and introduces good practices for creating plots for communication.

@ -9,7 +9,7 @@ status("complete")
``` ```
The pipe, `|>`, is a powerful tool for clearly expressing a sequence of operations that transform an object. The pipe, `|>`, is a powerful tool for clearly expressing a sequence of operations that transform an object.
We briefly introduced pipes in the previous chapter, but before going too much farther, we want to give a few more details and discuss `%>%`, a predecessor to `|>`. We briefly introduced pipes in the previous chapter, but before going further, we want to give a few more details and discuss `%>%`, a predecessor to `|>`.
To add the pipe to your code, we recommend using the build-in keyboard shortcut Ctrl/Cmd + Shift + M. To add the pipe to your code, we recommend using the build-in keyboard shortcut Ctrl/Cmd + Shift + M.
You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in @fig-pipe-options; more on `%>%` shortly. You'll need to make one change to your RStudio options to use `|>` instead of `%>%` as shown in @fig-pipe-options; more on `%>%` shortly.
@ -78,7 +78,7 @@ flights3 <- summarize(flight2,
) )
``` ```
While both of these forms have their time and place, the pipe generally produces data analysis code that's both easier to write and easier to read. While both forms have their time and place, the pipe generally produces data analysis code that is easier to write and read.
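To make the comparison concrete, here is a minimal base R sketch (requires R 4.1.0 or later for `|>`). The vector and the steps are illustrative assumptions, and the two non-pipe forms shown are assumed to be the ones the passage refers to: nested calls and intermediate objects.

```r
x <- c(1, 4, 9, 16)

# Nested calls read from the inside out.
nested <- round(mean(sqrt(x)), 1)

# Intermediate objects read top to bottom but need a name for every step.
x_sqrt <- sqrt(x)
x_mean <- mean(x_sqrt)
stepwise <- round(x_mean, 1)

# The pipe reads top to bottom without the extra names.
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
identical(stepwise, piped)
```

All three forms compute the same value; they differ only in how easy the sequence of steps is to follow.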
## magrittr and the `%>%` pipe ## magrittr and the `%>%` pipe
@ -95,7 +95,7 @@ mtcars %>%
summarize(n = n()) summarize(n = n())
``` ```
For simple cases `|>` and `%>%` behave identically. For simple cases, `|>` and `%>%` behave identically.
So why do we recommend the base pipe? So why do we recommend the base pipe?
Firstly, because it's part of base R, it's always available for you to use, even when you're not using the tidyverse. Firstly, because it's part of base R, it's always available for you to use, even when you're not using the tidyverse.
Secondly, `|>` is quite a bit simpler than `%>%`: in the time between the invention of `%>%` in 2014 and the inclusion of `|>` in R 4.1.0 in 2021, we gained a better understanding of the pipe. Secondly, `|>` is quite a bit simpler than `%>%`: in the time between the invention of `%>%` in 2014 and the inclusion of `|>` in R 4.1.0 in 2021, we gained a better understanding of the pipe.
@ -103,12 +103,12 @@ This allowed the base implementation to jettison infrequently used and less impo
## `|>` vs. `%>%` ## `|>` vs. `%>%`
While `|>` and `%>%` behave identically for simple cases, there are a few important differences. While `|>` and `%>%` behave identically for simple cases, there are a few crucial differences.
These are most likely to affect you if you're a long-term user of `%>%` who has taken advantage of some of the more advanced features. These are most likely to affect you if you're a long-term user of `%>%` who has taken advantage of some of the more advanced features.
But they're still good to know about even if you've never used `%>%` because you're likely to encounter some of them when reading wild-caught code. But they're still good to know about even if you've never used `%>%` because you're likely to encounter some of them when reading wild-caught code.
- By default, the pipe passes the object on its left hand side to the first argument of the function on the right-hand side. - By default, the pipe passes the object on its left-hand side to the first argument of the function on the right-hand side.
`%>%` allows you change the placement with a `.` placeholder. `%>%` allows you to change the placement with a `.` placeholder.
For example, `x %>% f(1)` is equivalent to `f(x, 1)` but `x %>% f(1, .)` is equivalent to `f(1, x)`. For example, `x %>% f(1)` is equivalent to `f(x, 1)` but `x %>% f(1, .)` is equivalent to `f(1, x)`.
R 4.2.0 added a `_` placeholder to the base pipe, with one additional restriction: the argument has to be named. R 4.2.0 added a `_` placeholder to the base pipe, with one additional restriction: the argument has to be named.
For example, `x |> f(1, y = _)` is equivalent to `f(1, y = x)`. For example, `x |> f(1, y = _)` is equivalent to `f(1, y = x)`.
@ -116,7 +116,7 @@ But they're still good to know about even if you've never used `%>%` because you
- The `|>` placeholder is deliberately simple and can't replicate many features of the `%>%` placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function. - The `|>` placeholder is deliberately simple and can't replicate many features of the `%>%` placeholder: you can't pass it to multiple arguments, and it doesn't have any special behavior when the placeholder is used inside another function.
For example, `df %>% split(.$var)` is equivalent to `split(df, df$var)` and `df %>% {split(.$x, .$y)}` is equivalent to `split(df$x, df$y)`. For example, `df %>% split(.$var)` is equivalent to `split(df, df$var)` and `df %>% {split(.$x, .$y)}` is equivalent to `split(df$x, df$y)`.
With `%>%` you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in @sec-subset-many), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`. With `%>%`, you can use `.` on the left-hand side of operators like `$`, `[[`, `[` (which you'll learn about in @sec-subset-many), so you can extract a single column from a data frame with (e.g.) `mtcars %>% .$cyl`.
A future version of R may add similar support for `|>` and `_`. A future version of R may add similar support for `|>` and `_`.
For the special case of extracting a column out of a data frame, you can also use `dplyr::pull()`: For the special case of extracting a column out of a data frame, you can also use `dplyr::pull()`:
@ -128,13 +128,13 @@ But they're still good to know about even if you've never used `%>%` because you
- `%>%` allows you to start a pipe with `.` to create a function rather than immediately executing the pipe; this is not supported by the base pipe. - `%>%` allows you to start a pipe with `.` to create a function rather than immediately executing the pipe; this is not supported by the base pipe.
Luckily there's no need to commit entirely to one pipe or the other --- you can use the base pipe for the majority of cases where it's sufficient, and use the magrittr pipe when you really need its special features. Luckily there's no need to commit entirely to one pipe or the other --- you can use the base pipe for the majority of cases where it's sufficient and use the magrittr pipe when you really need its special features.
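
As a rough illustration of the placeholder rules in the list above, the following sketch uses only the base pipe, so it assumes R 4.2.0 or later. The function `f()` and the data frame `df` are made up for this example and do not come from the chapter.

```r
# Hypothetical helper, defined only for this illustration.
f <- function(x, y) paste0("x=", x, ", y=", y)

# By default the left-hand side becomes the first argument: f("a", "b").
default_call <- "a" |> f("b")

# With the `_` placeholder the argument must be named: f("b", y = "a").
placeholder_call <- "a" |> f("b", y = _)

# `_` can appear only once per call, so to use the piped value twice,
# one option is to wrap the step in an anonymous function.
df <- data.frame(x = 1:4, y = c("u", "v", "u", "v"))
groups <- df |> (\(d) split(d$x, d$y))()
```

The anonymous-function wrapper is only one possible workaround; assigning the intermediate result to a name and calling `split()` directly would work just as well.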
## `|>` vs `+` ## `|>` vs `+`
Sometimes we'll turn the end of a pipeline of data transformation into a plot. Sometimes we'll turn the end of a data transformation pipeline into a plot.
Watch for the transition from `|>` to `+`. Watch for the transition from `|>` to `+`.
We wish this transition wasn't necessary but unfortunately ggplot2 was created before the pipe was discovered. We wish this transition wasn't necessary, but unfortunately, ggplot2 was created before the pipe was discovered.
```{r} ```{r}
#| eval: false #| eval: false
@ -148,10 +148,10 @@ diamonds |>
## Summary ## Summary
In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to `|>`. In this chapter, you've learned more about the pipe: why we recommend it and some of the history that led to `|>`.
The pipe is important because you'll use it again and again throughout your analysis, but hopefully it will quickly become invisible and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it. The pipe is important because you'll use it again and again throughout your analysis, but hopefully, it will quickly become invisible, and your fingers will type it (or use the keyboard shortcut) without your brain having to think too much about it.
In the next chapter, we switch back to data science tools, learning about tidy data. In the next chapter, we switch back to data science tools, learning about tidy data.
Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse. Tidy data is a consistent way of organizing your data frames that is used throughout the tidyverse.
This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions. This consistency makes your life easier because once you have tidy data, it just works with the vast majority of tidyverse functions.
Of course, life is never easy and most datasets that you encounter in the wild will not already be tidy. Of course, life is never easy, and most datasets you encounter in the wild will not already be tidy.
So we'll also teach you how to use the tidyr package to tidy your untidy data. So we'll also teach you how to use the tidyr package to tidy your untidy data.


@ -7,14 +7,14 @@ source("_common.R")
status("polishing") status("polishing")
``` ```
This chapter will introduce you to two very important tools for organizing your code: scripts and projects. This chapter will introduce you to two essential tools for organizing your code: scripts and projects.
## Scripts ## Scripts
So far, you have used the console to run code. So far, you have used the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines.
To give yourself more room to work, use the script editor. To give yourself more room to work, use the script editor.
Open it up by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N. Open it up by clicking the File menu, selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes, as in @fig-rstudio-script. Now you'll see four panes, as in @fig-rstudio-script.
The script editor is a great place to put code you care about. The script editor is a great place to put code you care about.
Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.
@ -33,12 +33,12 @@ knitr::include_graphics("diagrams/rstudio/script.png", dpi = 270)
### Running code ### Running code
The script editor is a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations. The script editor is an excellent place for building complex ggplot2 plots or long sequences of dplyr manipulations.
The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter.
This executes the current R expression in the console. This executes the current R expression in the console.
For example, take the code below. For example, take the code below.
If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`. If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates `not_cancelled`.
It will also move the cursor to the next statement (beginning with `not_cancelled |>`). It will also move the cursor to the following statement (beginning with `not_cancelled |>`).
That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter. That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.
```{r} ```{r}
@ -58,9 +58,9 @@ not_cancelled |>
Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script. Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.
We recommend that you always start your script with the packages that you need. We recommend you always start your script with the packages you need.
That way, if you share your code with others, they can easily see which packages they need to install. That way, if you share your code with others, they can easily see which packages they need to install.
Note, however, that you should never include `install.packages()` in a script that you share. Note, however, that you should never include `install.packages()` in a script you share.
It's very antisocial to change settings on someone else's computer! It's very antisocial to change settings on someone else's computer!
When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts.
@ -68,7 +68,7 @@ Over time, sending code to the console in this way will become so natural that y
### RStudio diagnostics ### RStudio diagnostics
In script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar: In the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:
```{r} ```{r}
#| echo: false #| echo: false
@ -100,7 +100,7 @@ RStudio will also let you know about potential problems:
#| echo: false #| echo: false
#| out-width: ~ #| out-width: ~
#| fig-alt: > #| fig-alt: >
#| Script editor with the script 3 == NA. A yellow exclamation park #| Script editor with the script 3 == NA. A yellow exclamation mark
#| indicates that there may be a potential problem. Hovering over the #| indicates that there may be a potential problem. Hovering over the
#| exclamation mark shows a text box with the text use is.na to check #| exclamation mark shows a text box with the text use is.na to check
#| whether expression evaluates to NA. #| whether expression evaluates to NA.


@ -9,15 +9,15 @@ status("polishing")
``` ```
Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
Even as a very new programmer it's a good idea to work on your code style. Even as a very new programmer, it's a good idea to work on your code style.
Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else. Using a consistent style makes it easier for others (including future-you!) to read your work and is particularly important if you need to get help from someone else.
This chapter will introduce to the most important points of the [tidyverse style guide](https://style.tidyverse.org), which is used throughout this book. This chapter will introduce the most important points of the [tidyverse style guide](https://style.tidyverse.org), which is used throughout this book.
Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature.
Additionally, there are some great tools to quickly restyle existing code, like the [**styler**](https://styler.r-lib.org) package by Lorenz Walthert. Additionally, there are some great tools to quickly restyle existing code, like the [**styler**](https://styler.r-lib.org) package by Lorenz Walthert.
Once you've installed it with `install.packages("styler")`, an easy way to use it is via RStudio's **command palette**. Once you've installed it with `install.packages("styler")`, an easy way to use it is via RStudio's **command palette**.
The command palette lets you use any build-in RStudio command, as well as many addins provided by packages. The command palette lets you use any built-in RStudio command and many addins provided by packages.
Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all the shortcuts provided by styler. Open the palette by pressing Cmd/Ctrl + Shift + P, then type "styler" to see all the shortcuts offered by styler.
@fig-styler shows the results. @fig-styler shows the results.
```{r} ```{r}
@ -58,12 +58,12 @@ short_flights <- flights |> filter(air_time < 60)
SHORTFLIGHTS <- flights |> filter(air_time < 60) SHORTFLIGHTS <- flights |> filter(air_time < 60)
``` ```
As a general rule of thumb, it's better to prefer long, descriptive names that are easy to understand, rather than concise names that are fast to type. As a general rule of thumb, it's better to prefer long, descriptive names that are easy to understand rather than concise names that are fast to type.
Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation. Short names save relatively little time when writing code (especially since autocomplete will help you finish typing them), but it can be time-consuming when you come back to old code and are forced to puzzle out a cryptic abbreviation.
If you have a bunch of names for related things, do your best to be consistent. If you have a bunch of names for related things, do your best to be consistent.
It's easy for inconsistencies to arise when you forget a previous convention, so don't feel bad if you have to go back and rename things. It's easy for inconsistencies to arise when you forget a previous convention, so don't feel bad if you have to go back and rename things.
In general, if you have a bunch of variables that are a variation on a theme you're better off giving them a common prefix, rather than a common suffix, because autocomplete works best on the start of a variable. In general, if you have a bunch of variables that are a variation on a theme, you're better off giving them a common prefix rather than a common suffix because autocomplete works best on the start of a variable.
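
To make the prefix advice concrete, here is a small sketch with invented names and values; `delays` and the `delay_*` variables are hypothetical and are not taken from the flights data.

```r
# Made-up arrival delays in minutes, for illustration only.
delays <- c(5, 12, 0, 47, 3)

# A shared prefix groups related objects: typing `delay_` and pressing
# Tab lists all three in the autocomplete menu.
delay_mean <- mean(delays)
delay_median <- median(delays)
delay_max <- max(delays)

# With a shared suffix (mean_delay, median_delay, max_delay) the names
# start differently, so autocomplete can't gather them from one stem.
```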
## Spaces ## Spaces
@ -80,7 +80,7 @@ z<-( a + b ) ^ 2/d
``` ```
Don't put spaces inside or outside parentheses for regular function calls. Don't put spaces inside or outside parentheses for regular function calls.
Always put a space after a comma, just like in regular English. Always put a space after a comma, just like in standard English.
```{r} ```{r}
#| eval: false #| eval: false
@ -110,7 +110,7 @@ flights |>
## Pipes {#sec-pipes} ## Pipes {#sec-pipes}
`|>` should always have a space before it and should typically be the last thing on a line. `|>` should always have a space before it and should typically be the last thing on a line.
This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and to get a 50,000 ft view by skimming the verbs on the left-hand side. This makes it easier to add new steps, rearrange existing steps, modify elements within a step, and get a 50,000 ft view by skimming the verbs on the left-hand side.
```{r} ```{r}
#| eval: false #| eval: false
@ -125,7 +125,7 @@ flights|>filter(!is.na(arr_delay), !is.na(tailnum))|>count(dest)
``` ```
If the function you're piping into has named arguments (like `mutate()` or `summarize()`), put each argument on a new line. If the function you're piping into has named arguments (like `mutate()` or `summarize()`), put each argument on a new line.
If the function doesn't have named arguments (like `select()` or `filter()`) keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line. If the function doesn't have named arguments (like `select()` or `filter()`), keep everything on one line unless it doesn't fit, in which case you should put each argument on its own line.
```{r} ```{r}
#| eval: false #| eval: false