Whole game feedback from O'Reilly (#1057)

* Redraw data science process diagrams

* Polishing the whole game

* Add reference to TMWR

* Respond to visualization feedback

* Minor changes

* Better integrate workflow-scripts chapter

* Minor getting help polishing

* Update investing in yourself links

* Redraw RStudio screenshots

* More scripts/projects polishing

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
Hadley Wickham 2022-08-29 06:24:32 -04:00 committed by GitHub
parent 7d4f86ca66
commit 21e31429a5
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
45 changed files with 298 additions and 207 deletions

View File

@ -19,16 +19,33 @@ devtools::install_github("hadley/r4ds")
### Omnigraffle drawings
- Font: 12pt Ubuntu mono
- Font: 12pt Guardian Sans Condensed / Ubuntu mono
- Export as 300 dpi png.
- Website font is 18 px = 13.5 pt, so scale dpi to match font sizes: 270 = 300 \* 12 / 13.5
- Verified sizes are visually equivalent by screenshotting.
<!-- -->
- Website font is 18 px = 13.5 pt, so scale dpi to match font sizes: 270 ≈ 300 \* 12 / 13.5.
(I also verified this empirically by screenshotting.)
``` r
#| echo: FALSE
#| out.width: NULL
knitr::include_graphics("diagrams/transform.png", dpi = 270)
```
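The dpi scaling in the notes above can be checked with a few lines of R. This is only a sketch of the arithmetic: the variable names are illustrative and not part of the repository, and the px to pt conversion assumes the usual CSS ratio of 0.75 pt per px.

```r
# Illustrative names only; the values come from the notes above.
export_dpi <- 300        # dpi used when exporting the diagrams
diagram_pt <- 12         # font size used inside the diagrams
website_pt <- 18 * 0.75  # 18 px website font, at 0.75 pt per px = 13.5 pt

# Declaring a lower dpi makes the image render larger, so scale by the
# ratio of the diagram font size to the website font size.
display_dpi <- export_dpi * diagram_pt / website_pt
round(display_dpi)
```

The result is a little under the 270 used in the chunks, so 270 is best read as a rounded value.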
### Screenshots
- Make sure you're using a light theme.
For small interface elements (e.g., toolbars), zoom in twice.
- Screenshot with Cmd + Shift + 4.
- Don't need to set dpi:
``` r
#| echo: FALSE
#| out.width: NULL
knitr::include_graphics("screenshots/rstudio-wg.png")
```
## Code of Conduct

View File

@ -10,10 +10,18 @@ So far, you've learned the tools to get your data into R, tidy it into a form co
However, it doesn't matter how great your analysis is unless you can explain it to others: you need to **communicate** your results.
```{r}
#| label: fig-ds-communicate
#| echo: false
#| out-width: "75%"
#| fig-cap: >
#| Communication is the final part of the data science process; if you
#| can't communicate your results to other humans, it doesn't matter how
#| great your analysis is.
#| fig-alt: >
#| A diagram displaying the data science cycle with visualize and
#| communicate highlighted in blue.
#| out.width: NULL
knitr::include_graphics("diagrams/data-science-communicate.png")
knitr::include_graphics("diagrams/data-science/communicate.png", dpi = 270)
```
Communication is the theme of the following four chapters:

View File

@ -90,7 +90,7 @@ There are two main advantages:
If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity.
2. There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine.
As you learned in Sections \@ref(mutate) and \@ref(summarize), most built-in R functions work with vectors of values.
As you learned in @sec-mutate and @sec-summarize, most built-in R functions work with vectors of values.
That makes transforming tidy data feel particularly natural.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

View File

@ -333,8 +333,9 @@ If you have a bunch of inconsistently named columns and it would be painful to f
### `relocate()`
You can move variables around with `relocate()`.
By default it moves variables to the front:
Use `relocate()` to move variables around.
You might want to collect related variables together or move important variables to the front.
By default `relocate()` moves variables to the front:
```{r}
flights |>

View File

@ -214,7 +214,7 @@ In hindsight, these cars were unlikely to be hybrids since they have large engin
In the above example, we mapped `class` to the color aesthetic, but we could have mapped `class` to the size aesthetic in the same way.
In this case, the exact size of each point would reveal its class affiliation.
We get a *warning* here, because mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is generally not a good idea.
We get a *warning* here: mapping an unordered variable (`class`) to an ordered aesthetic (`size`) is generally not a good idea because it implies a ranking that does not in fact exist.
```{r}
#| fig-alt: >
@ -824,11 +824,16 @@ Other graphs, like bar charts, calculate new values to plot:
- boxplots compute a robust summary of the distribution and then display that summary as a specially formatted box.
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
@fig-vis-stat-bar shows how this process works with `geom_bar()`.
```{r}
#| label: fig-vis-stat-bar
#| echo: false
#| out-width: "100%"
#| fig-cap: >
#| When creating a bar chart we first start with the raw data, then
#| aggregate it to count the number of observations in each bar,
#| and finally map those computed variables to plot aesthetics.
#| fig-alt: >
#| A figure demonstrating three steps of creating a bar chart.
#| Step 1. geom_bar() begins with the diamonds data set. Step 2. geom_bar()
@ -1149,7 +1154,8 @@ There are three other coordinate systems that are occasionally helpful.
```
- `coord_quickmap()` sets the aspect ratio correctly for maps.
This is very important if you're plotting spatial data with ggplot2 (which unfortunately we don't have the space to cover in this book).
This is very important if you're plotting spatial data with ggplot2.
We don't have the space to discuss maps in this book, but you can learn more in the [Maps chapter](https://ggplot2-book.org/maps.html) of *ggplot2: Elegant graphics for data analysis*.
```{r}
#| layout-ncol: 2

Binary files not shown (image previews): six "Before" images (13 to 15 KiB each), one binary file without a preview, five "After" images (46 to 52 KiB each), and two "Before" images (1.1 MiB and 1.4 MiB).

View File

@ -0,0 +1,9 @@
x y <- 10
3 == NA

View File

@ -0,0 +1,16 @@
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: XeLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BIN
diagrams/rstudio.graffle Normal file

Binary file not shown.

Further binary files not shown: one without a preview, three "Before" images (459 KiB, 384 KiB, 454 KiB), and two "After" images (380 KiB, 328 KiB).

BIN
diagrams/rstudio/script.png Normal file

Binary file not shown ("After" image, 386 KiB).

View File

@ -14,17 +14,23 @@ After reading this book, you'll have the tools to tackle a wide variety of data
Data science is a huge field, and there's no way you can master it all by reading a single book.
The goal of this book is to give you a solid foundation in the most important tools, and enough knowledge to find the resources to learn more when necessary.
Our model of the tools needed in a typical data science project looks something like this:
Our model of the tools needed in a typical data science project looks something like @fig-ds-diagram.
```{r}
#| label: fig-ds-diagram
#| echo: false
#| fig-align: "center"
#| fig-cap: >
#| In our model of the data science process you start with data import
#| and tidying. Next you understand your data with an iterative cycle of
#| transforming, visualizing, and modeling. You finish the process
#| by communicating your results to other humans.
#| fig-alt: >
#| A diagram displaying the data science cycle: Import -> Tidy -> Understand
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
#| Communicate. Surrounding all of these is Program.
#| out.width: NULL
knitr::include_graphics("diagrams/data-science.png")
knitr::include_graphics("diagrams/data-science/base.png", dpi = 270)
```
First you must **import** your data into R.
@ -77,10 +83,13 @@ There are a number of important topics that this book doesn't cover.
We believe it's important to stay ruthlessly focused on the essentials so you can get up and running as quickly as possible.
That means this book can't cover every important topic.
### Modelling
### Modeling
<!--# TO DO: Say a few sentences about modelling. -->
To learn more about modeling, we highly recommend [Tidy Modeling with R](https://www.tmwr.org), by our colleagues Max Kuhn and Julia Silge.
This book will teach you the tidymodels family of packages, which, as you might guess from the name, share many conventions with the tidyverse packages we use in this book.
### Big data
This book proudly focuses on small, in-memory datasets.
@ -150,20 +159,22 @@ When a new version is available, RStudio will let you know.
It's a good idea to upgrade regularly so you can take advantage of the latest and greatest features.
For this book, make sure you have at least RStudio 2022.02.0.
When you start RStudio, you'll see two key regions in the interface: the console pane, and the output pane.
```{r}
#| echo: false
#| fig-align: "center"
#| fig-alt: >
#| The RStudio IDE with the panes Console and Output highlighted.
knitr::include_graphics("diagrams/rstudio-console.png")
```
When you start RStudio, @fig-rstudio-console, you'll see two key regions in the interface: the console pane, and the output pane.
For now, all you need to know is that you type R code in the console pane, and press enter to run it.
You'll learn more as we go along!
```{r}
#| label: fig-rstudio-console
#| echo: false
#| out-width: ~
#| fig-cap: >
#| The RStudio IDE has two key regions: type R code in the console pane
#| on the left, and look for plots in the output pane on the right.
#| fig-alt: >
#| The RStudio IDE with the panes Console and Output highlighted.
knitr::include_graphics("diagrams/rstudio/console.png", dpi = 270)
```
### The tidyverse
You'll also need to install some R packages.

View File

@ -10,10 +10,18 @@ In this part of the book, you'll improve your programming skills.
Programming is a cross-cutting skill needed for all data science work: you must use a computer to do data science; you cannot do it in your head, or with pencil and paper.
```{r}
#| label: fig-ds-program
#| echo: false
#| out-width: "75%"
#| fig-cap: >
#| Programming is the water in which all other components of the data
#| science process swim.
#| fig-alt: >
#| Our model of the data science process with program (import, tidy,
#| transform, visualize, model, and communicate, i.e. everything)
#| highlighted in blue.
#| out.width: NULL
knitr::include_graphics("diagrams/data-science-program.png")
knitr::include_graphics("diagrams/data-science/program.png", dpi = 270)
```
Programming produces code, and code is a tool of communication.

Binary files not shown. Seven images changed (Before -> After size): 22 KiB -> 22 KiB, 24 KiB -> 28 KiB, 9.7 KiB -> 9.7 KiB, 64 KiB -> 135 KiB, 60 KiB -> 83 KiB, 70 KiB -> 160 KiB, 18 KiB -> 26 KiB. One further image has only a "Before" preview (492 KiB).

View File

@ -6,39 +6,41 @@
source("_common.R")
```
The goal of the first part of this book is to introduce you to the data science workflow including data **importing**, **tidying**, and data **exploration** as quickly as possible.
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
The goal of data exploration is to generate many promising leads that you can later explore in more depth.
Our goal in this part of the book is to give you a rapid overview of the main tools of data science: **importing**, **tidying**, **transforming**, and **visualizing data**, as shown in @fig-ds-whole-game.
We want to show you the "whole game" of data science, giving you just enough of all the major pieces so that you can tackle real, if simple, data sets.
The later parts of the book will hit each of these topics in more depth, increasing the range of data science challenges that you can tackle.
```{r}
#| label: fig-ds-whole-game
#| echo: false
#| out.width: NULL
#| fig-cap: >
#| In this section of the book, you'll learn how to import,
#| tidy, transform, and visualize data.
#| fig-alt: >
#| A diagram displaying the data science cycle: Import -> Tidy -> Explore
#| (which has the phases Transform -> Visualize -> Model in a cycle) ->
#| Communicate. Surrounding all of these is Communicate. Explore is highlighted.
#| A diagram displaying the data science cycle: Import -> Tidy ->
#| Understand (which has the phases Transform -> Visualize -> Model in a
#| cycle) -> Communicate. Surrounding all of these is Program.
#| Import, Tidy, Transform, and Visualize are highlighted.
knitr::include_graphics("diagrams/data-science-explore.png")
knitr::include_graphics("diagrams/data-science/whole-game.png", dpi = 270)
```
<!--# TO DO: Update figure to include import and tidy as well. -->
In this part of the book, you will learn several useful tools that have an immediate payoff:
Five chapters focus on the tools of data science:
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
In [Chapter -@sec-data-visualisation] you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
In @sec-data-visualisation you'll dive into visualization, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
- Visualisation alone is typically not enough, so in [Chapter -@sec-data-transform], you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- Visualisation alone is typically not enough, so in @sec-data-transform, you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- In [Chapter -@sec-data-tidy], you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier.
- In @sec-data-tidy, you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualization, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.
- Before you can transform and visualize your data, you need to first get your data into R.
In [Chapter -@sec-data-import] you'll learn the basics of getting plain-text, rectangular data into R.
In @sec-data-import you'll learn the basics of getting `.csv` files into R.
- Finally, in [Chapter -@sec-exploratory-data-analysis], you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.
- Finally, in @sec-exploratory-data-analysis, you'll combine visualization and transformation with your curiosity and skepticism to ask and answer interesting questions about data.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet and details of modeling fall outside the scope of this book.
Nestled among these five chapters that teach you the tools for doing data science are three chapters that focus on your R workflow.
In [Chapter -@sec-workflow-basics], [Chapter -@sec-workflow-pipes], [Chapter -@sec-workflow-style], and [Chapter -@sec-workflow-scripts-projects], you'll learn good workflow practices for writing and organizing your R code.
Nestled among these chapters are four other chapters that focus on your R workflow.
In @sec-workflow-basics, @sec-workflow-pipes, @sec-workflow-style, and @sec-workflow-scripts-projects, you'll learn good workflow practices for writing and organizing your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.

View File

@ -55,9 +55,8 @@ object_name <- value
When reading that code, say "object name gets value" in your head.
You will make lots of assignments and `<-` is a pain to type.
Don't be lazy and use `=`; it will work, but it will cause confusion later.
Instead, use RStudio's keyboard shortcut: Alt + - (the minus sign).
Notice that RStudio automagically surrounds `<-` with spaces, which is a good code formatting practice.
You can save time with RStudio's keyboard shortcut: Alt + - (the minus sign).
Notice that RStudio automatically surrounds `<-` with spaces, which is a good code formatting practice.
Code is miserable to read on a good day, so giveyoureyesabreak and use spaces.
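As a small illustration of the spacing advice above, compare the two assignments in this sketch. The object name `seconds_per_day` is made up for the example.

```r
# Assignment with `<-`, surrounded by spaces
seconds_per_day <- 60 * 60 * 24

# The same assignment without spaces: it runs, but it is harder to read
seconds_per_day<-60*60*24

seconds_per_day
```

Both lines store the same value; only the first is easy on the eyes.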
## Comments

View File

@ -25,7 +25,7 @@ Start by spending a little time searching for an existing answer, including `[R]
## Making a reprex
If your googling doesn't find anything useful, it's a really good idea prepare a minimal reproducible example or **reprex**.
If your googling doesn't find anything useful, it's a really good idea to prepare a **reprex**, short for minimal **repr**oducible **ex**ample.
A good reprex makes it easier for other people to help you, and often you'll figure out the problem yourself in the course of making it.
There are two parts to creating a reprex:
@ -84,28 +84,18 @@ mean(y)
Anyone else can copy, paste, and run this immediately.
Instead of reading from the clipboard, you can:
- `reprex(mean(rnorm(10)))` to get code from expression.
- `reprex(input = "mean(rnorm(10))\n")` gets code from character vector (detected via length or terminating newline). Leading prompts are stripped from input source: `reprex(input = "> median(1:3)\n")` produces same output as `reprex(input = "median(1:3)\n")`
- `reprex(input = "my_reprex.R")` gets code from file
- Use one of the RStudio add-ins to use the selected text or current file.
There are three things you need to include to make your example reproducible: required packages, data, and code.
1. **Packages** should be loaded at the top of the script, so it's easy to see which ones the example needs.
This is a good time to check that you're using the latest version of each package; it's possible you've discovered a bug that's been fixed since you installed or last updated the package.
For packages in the tidyverse, the easiest way to check is to run `tidyverse_update()`.
2. The easiest way to include **data** in a question is to use `dput()` to generate the R code to recreate it.
For example, to recreate the `mtcars` dataset in R, I'd perform the following steps:
2. The easiest way to include **data** is to use `dput()` to generate the R code needed to recreate it.
For example, to recreate the `mtcars` dataset in R, perform the following steps:
1. Run `dput(mtcars)` in R
2. Copy the output
3. In my reproducible script, type `mtcars <-` then paste.
3. In your reprex, type `mtcars <-` then paste.
Try and find the smallest subset of your data that still reveals the problem.
@ -115,7 +105,7 @@ There are three things you need to include to make your example reproducible: re
- Use comments to indicate where your problem lies.
- Do your best to remove everything that is not related to the problem.\
- Do your best to remove everything that is not related to the problem.
The shorter your code is, the easier it is to understand, and the easier it is to fix.
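To illustrate the `dput()` step described above, here is a minimal sketch that works on a small subset of `mtcars` rather than the whole dataset. The object names `small` and `recreated` are hypothetical, and the round trip at the end is only there to show that the generated code really does rebuild the data.

```r
# Take a small subset that would still reveal the problem
small <- mtcars[1:3, c("mpg", "cyl")]

# Print R code that recreates the subset; this is what you paste
# into the reprex after typing `small <-`
dput(small)

# Round trip: turn the object into code, run that code, and compare
recreated <- eval(parse(text = deparse(small)))
all.equal(small, recreated)
```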
@ -125,10 +115,9 @@ Finish by checking that you have actually made a reproducible example by startin
You should also spend some time preparing yourself to solve problems before they occur.
Investing a little time in learning R each day will pay off handsomely in the long run.
One way is to follow what Hadley, Garrett, and everyone else at RStudio are doing on the [RStudio blog](https://blog.rstudio.org).
This is where we post announcements about new packages, new IDE features, and in-person courses.
You might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)), Mine ([\@minebocek](https://twitter.com/minebocek)), Garrett ([\@statgarrett](https://twitter.com/statgarrett)) on Twitter, or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
One way is to follow what the tidyverse team is doing on the [tidyverse blog](https://www.tidyverse.org/blog/).
To keep up with the R community more broadly, we recommend reading [R Weekly](https://rweekly.org): it's a community effort to aggregate the most interesting news in the R community each week.
To keep up with the R community more broadly, we recommend reading <http://www.r-bloggers.com>: it aggregates over 500 blogs about R from around the world.
If you're an active Twitter user, follow the ([`#rstats`](https://twitter.com/search?q=%23rstats)) hashtag.
Twitter is one of the key tools that Hadley and Mine use to keep up with new developments in the community.
If you're an active Twitter user, you might also want to follow Hadley ([\@hadleywickham](https://twitter.com/hadleywickham)), Mine ([\@minebocek](https://twitter.com/minebocek)), Garrett ([\@statgarrett](https://twitter.com/statgarrett)), or follow [\@rstudiotips](https://twitter.com/rstudiotips) to keep up with new features in the IDE.
If you want the full fire hose of new developments, you can also read the ([`#rstats`](https://twitter.com/search?q=%23rstats)) hashtag.
This is one of the key tools that Hadley and Mine use to keep up with new developments in the community.

View File

@ -7,68 +7,33 @@ source("_common.R")
status("polishing")
```
This chapter will introduce you to two very important tools for organizing your code: scripts and projects.
## Scripts
So far, you have used the console to run code.
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and dplyr pipes.
To give yourself more room to work, it's a great idea to use the script editor.
Open it up either by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes:
```{r}
#| echo: false
#| out-width: "75%"
#| fig-alt: >
#| RStudio IDE with Editor, Console, and Output highlighted.
knitr::include_graphics("diagrams/rstudio-editor.png")
```
That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines.
To give yourself more room to work, use the script editor.
Open it up by clicking the File menu, and selecting New File, then R script, or using the keyboard shortcut Cmd/Ctrl + Shift + N.
Now you'll see four panes, as in @fig-rstudio-script.
The script editor is a great place to put code you care about.
Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.
RStudio will automatically save the contents of the editor when you quit RStudio, and will automatically load it when you re-open.
Nevertheless, it's a good idea to save your scripts regularly and to back them up.
## Naming files
```{r}
#| label: fig-rstudio-script
#| echo: false
#| out-width: ~
#| fig-cap: >
#| Opening the script editor adds a new pane at the top-left of the
#| IDE.
#| fig-alt: >
#| RStudio IDE with Editor, Console, and Output highlighted.
knitr::include_graphics("diagrams/rstudio/script.png", dpi = 270)
```
Saving your code in a script requires creating a new file that you will need to name.
It might be tempting to name this file `code.R` or `myscript.R`, but you should think a bit harder before choosing a name for your file.
Three important principles for file naming are as follows:
### Running code
1. File names should be machine readable: Avoid spaces, punctuation, symbols, and accented character. Do not rely on case sensitivity to distinguish files. Make deliberate use of delimiters.
2. File names should be human readable: Use file names that describe what is in the file.
3. File names should play well with default ordering: Start file names with numbers that allow them to be sorted in the order they get used.
Suppose you have the following files in a project folder.
run-first.R
alternative model.R
code for exploratory analysis.R
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
temp.txt
There are a variety of problems here: the files are misordered, file names contain spaces, there are two files with basically the same name but different capitalization (`finalreport` vs. `FinalReport`), and some file names don't reflect their contents (`run-first` and `temp`).
Below is an alternative way of naming and organizing the same set of files.
01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
notes-on-report-draft.txt
report-2022-03-20.qmd
report-2022-04-02.qmd
Numbering and descriptive names that are similarly formatted allow for a more useful organization of the R scripts.
Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and `temp` is renamed to `notes-on-report-draft` to better describe its contents.
## Running code
The script editor is also a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations.
The script editor is a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations.
The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter.
This executes the current R expression in the console.
For example, take the code below.
@ -90,24 +55,24 @@ not_cancelled |>
summarize(mean = mean(dep_delay))
```
Instead of running your code expression-by-expression, you can also execute the complete script in one step: Cmd/Ctrl + Shift + S.
Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S.
Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.
I recommend that you always start your script with the packages that you need.
That way, if you share your code with others, they can easily see which packages they need to install.
Note, however, that you should never include `install.packages()` or `setwd()` in a script that you share.
Note, however, that you should never include `install.packages()` in a script that you share.
It's very antisocial to change settings on someone else's computer!
When working through future chapters, I highly recommend starting in the script editor and practicing your keyboard shortcuts.
Over time, sending code to the console in this way will become so natural that you won't even think about it.
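One possible way to follow this advice in a shared script is to report missing packages instead of installing them. The sketch below is an assumption about how you might do that, not something the chapter prescribes: the package names and the `needed` and `not_installed` objects are illustrative.

```r
# Packages this script relies on, listed at the top so they are easy to see
needed <- c("dplyr", "ggplot2")

# Check which ones are available without attaching or installing anything
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

# Tell the reader what to install; leave the installing to them
if (length(not_installed) > 0) {
  message("Please install: ", paste(not_installed, collapse = ", "))
}
```

In a real script you would follow this with the usual `library()` calls.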
## RStudio diagnostics
### RStudio diagnostics
The script editor will also highlight syntax errors with a red squiggly line and a cross in the sidebar:
In the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:
```{r}
#| echo: false
#| out-width: NULL
#| out-width: ~
#| fig-alt: >
#| Script editor with the script x y <- 10. A red X indicates that there is
#| a syntax error. The syntax error is also highlighted with a red squiggly line.
@ -119,6 +84,7 @@ Hover over the cross to see what the problem is:
```{r}
#| echo: false
#| out-width: ~
#| fig-alt: >
#| Script editor with the script x y <- 10. A red X indicates that there is
#| a syntax error. The syntax error is also highlighted with a red squiggly line.
@ -132,6 +98,7 @@ RStudio will also let you know about potential problems:
```{r}
#| echo: false
#| out-width: ~
#| fig-alt: >
#| Script editor with the script 3 == NA. A yellow exclamation mark
#| indicates that there may be a potential problem. Hovering over the
@ -141,50 +108,99 @@ RStudio will also let you know about potential problems:
knitr::include_graphics("screenshots/rstudio-diagnostic-warn.png")
```
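The script in the last screenshot, `3 == NA`, is presumably flagged because comparing a value with `NA` rarely does what you want. A short base R sketch of the behaviour:

```r
# Comparing anything with NA gives NA, not TRUE or FALSE
3 == NA
#> [1] NA

# Use is.na() to test for missing values instead
x <- c(3, NA)
is.na(x)
#> [1] FALSE  TRUE
```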
## Workflow: projects
### Saving and naming
RStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open.
Nevertheless, it's a good idea to avoid Untitled1, Untitled2, Untitled3, and so on, and instead save your scripts and give them informative names.
It might be tempting to name your files `code.R` or `myscript.R`, but you should think a bit harder before choosing a name for your file.
Three important principles for file naming are as follows:
1. File names should be **machine** readable: avoid spaces, symbols, and special characters. Don't rely on case sensitivity to distinguish files.
2. File names should be **human** readable: use file names to describe what's in the file.
3. File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.
For example, suppose you have the following files in a project folder.
alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt
There are a variety of problems here: it's hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (`finalreport` vs. `FinalReport`[^workflow-scripts-1]), and some names don't describe their contents (`run-first` and `temp`).
[^workflow-scripts-1]: Not to mention that you're tempting fate by using "final" in the name 😆 The comic *Piled Higher and Deeper* has a [fun strip on this](https://phdcomics.com/comics/archive.php?comicid=1531).
Here's a better way of naming and organizing the same set of files:
01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt
Numbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies.
Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and `temp` is renamed to `report-draft-notes` to better describe its contents.
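If you want to check a folder against the first principle, a rough sketch like the one below can help. The regular expression is just one possible definition of "machine readable" (letters, digits, dots, underscores, and hyphens only), and the object names are made up for this example.

```r
# File names from the problematic listing above
files <- c(
  "alternative model.R", "code for exploratory analysis.r",
  "finalreport.qmd", "FinalReport.qmd", "fig 1.png",
  "Figure_02.png", "model_first_try.R", "run-first.r", "temp.txt"
)

# Names containing spaces or other characters outside the allowed set
has_odd_chars <- !grepl("^[A-Za-z0-9._-]+$", files)
files[has_odd_chars]

# Names that differ only by capitalization
lower <- tolower(files)
case_clash <- lower %in% lower[duplicated(lower)]
files[case_clash]
```

To check a real project, you could replace the hard-coded vector with `list.files()`.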
## Projects
One day, you will need to quit R, go do something else, and return to your analysis later.
One day, you will be working on multiple analyses simultaneously that all use R and you want to keep them separate.
One day, you will be working on multiple analyses simultaneously and you want to keep them separate.
One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.
To handle these real life situations, you need to make two decisions:
1. What about your analysis is "real", i.e. what will you save as your lasting record of what happened?
1. What is the source of truth?
What will you save as your lasting record of what happened?
2. Where does your analysis "live"?
2. Where does your analysis live?
## What is real?
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) "real".
However, in the long run, you'll be much better off if you consider your R scripts as "real".
### What is the source of truth?
As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis.
However, in the long run, you'll be much better off if you ensure that your R scripts are the source of truth.
With your R scripts (and your data files), you can recreate the environment.
It's much harder to recreate your R scripts from your environment!
You'll either have to retype a lot of code from memory (inevitably, making mistakes along the way) or you'll have to carefully mine your R history.
With only your environment, it's much harder to recreate your R scripts: you'll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you'll have to carefully mine your R history.
To encourage this behavior, I highly recommend that you instruct RStudio not to preserve your workspace between sessions:
To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions.
You can do this either by running `usethis::use_blank_slate()`[^workflow-scripts-2] or by mimicking the options shown in @fig-blank-slate.
This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time.
But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
[^workflow-scripts-2]: If you don't have usethis installed, you can install it with `install.packages("usethis")`.
```{r}
#| label: fig-blank-slate
#| echo: false
#| fig-cap: >
#| Copy these options in your RStudio options to always start your
#| RStudio session with a clean slate.
#| fig-alt: >
#| RStudio preferences window where the option Restore .RData into workspace
#| at startup is not checked. Also, the option Save workspace to .RData
#| on exit is set to Never.
#| out-width: ~
knitr::include_graphics("screenshots/rstudio-workspace.png")
knitr::include_graphics("diagrams/rstudio/clean-slate.png", dpi = 270)
```
This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the results of the code that you ran last time.
But this short-term pain will save you long-term agony because it forces you to capture all important interactions in your code.
There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.
There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:
1. Press Cmd/Ctrl + Shift + F10 to restart RStudio.
2. Press Cmd/Ctrl + Shift + S to rerun the current script.
2. Press Cmd/Ctrl + Shift + S to re-run the current script.
I use this pattern hundreds of times a week.
We collectively use this pattern hundreds of times a week.
## Where does your analysis live?
### Where does your analysis live?
R has a powerful notion of the **working directory**.
This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.
@ -195,7 +211,7 @@ RStudio shows your current working directory at the top of the console:
#| fig-alt: >
#| The Console tab shows the current working directory as
#| ~/Documents/r4ds/r4ds.
#| out-width: ~
knitr::include_graphics("screenshots/rstudio-wd.png")
```
@ -203,82 +219,66 @@ And you can print this out in R code by running `getwd()`:
```{r}
#| eval: false
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
As a beginning R user, it's OK to let your home directory, documents directory, or any other weird directory on your computer be R's working directory.
But you're six chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organizing your analytical projects into directories and, when working on a project, setting R's working directory to the associated directory.
As a beginning R user, it's OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer.
But you're nine chapters into this book, and you're no longer a rank beginner.
Very soon now you should evolve to organizing your projects into directories and, when working on a project, setting R's working directory to the associated directory.
**I do not recommend it**, but you can also set the working directory from within R:
You can set the working directory from within R but **we do not recommend it**:
```{r}
#| eval: false
setwd("/path/to/my/CoolProject")
```
But you should never do this because there's a better way; a way that also puts you on the path to managing your R work like an expert.
There's a better way; a way that also puts you on the path to managing your R work like an expert.
That way is the **RStudio project**.
## Paths and directories
Paths and directories are a little complicated because there are two basic styles of paths: Mac/Linux and Windows.
There are three chief ways in which they differ:
1. The most important difference is how you separate the components of the path.
Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so I recommend always using the Linux/Mac style with forward slashes.
2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different.
In Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and in Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
3. The last minor difference is the place that `~` points to.
`~` is a convenient shortcut to your home directory.
Windows doesn't really have the notion of a home directory, so it instead points to your documents directory.
## RStudio projects
R experts keep all the files associated with a given project together --- input data, R scripts, analytical results, and figures.
This is such a wise and common practice that RStudio has built-in support for this via **projects**.
### RStudio projects
Keeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via **projects**.
Let's make a project for you to use while you're working through the rest of this book.
Click File \> New Project, then:
Click File \> New Project, then follow the steps shown in @fig-new-project.
```{r}
#| label: fig-new-project
#| echo: false
#| layout-ncol: 2
#| fig-cap: >
#| Create a new project by following these three steps.
#| fig-subcap:
#| - First click New Directory.
#| - Then click New Project.
#| - Finally, fill in the directory (project) name, choose a good
#| subdirectory for its home and click Create Project.
#| fig-alt: >
#| There are three screenshots of the New Project menu. In the first screenshot,
#| Three screenshots of the New Project menu. In the first screenshot,
#| the Create Project window is shown and New Directory is selected.
#| In the second screenshot, the Project Type window is shown and
#| Empty Project is selected. In the third screenshot, the Create New Project
#| window is shown and the directory name is given as r4ds and the project
#| is being created as subdirectory of the Desktop.
#| Empty Project is selected. In the third screenshot, the Create New
#| Project window is shown and the directory name is given as r4ds and
#| the project is being created as subdirectory of the Desktop.
#| out-width: ~
knitr::include_graphics("screenshots/rstudio-project-1.png")
knitr::include_graphics("screenshots/rstudio-project-2.png")
knitr::include_graphics("screenshots/rstudio-project-3.png")
```
Call your project `r4ds` and think carefully about which *subdirectory* you put the project in.
Call your project `r4ds` and think carefully about which subdirectory you put the project in.
If you don't store it somewhere sensible, it will be hard to find it in the future!
Once this process is complete, you'll get a new RStudio project just for this book.
Check that the "home" directory of your project is the current working directory:
Check that the "home" of your project is the current working directory:
```{r}
#| eval: false
getwd()
#> [1] "/Users/hadley/Documents/r4ds/r4ds"
```
Whenever you refer to a file using a relative path, R will look for it here.
Now enter the following commands in the script editor, and save the file, calling it "diamonds.R".
Next, run the complete script which will save a PDF and CSV file into your project directory.
Don't worry about the details, you'll learn them later in the book.
@ -300,28 +300,41 @@ Quit RStudio.
Inspect the folder associated with your project --- notice the `.Rproj` file.
Double-click that file to re-open the project.
Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open.
Because you followed my instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.
In your favorite OS-specific way, search your computer for `diamonds.pdf` and you will find the PDF (no surprise) but *also the script that created it* (`diamonds.R`).
This is a huge win!
One day, you will want to remake a figure or just understand where it came from.
If you rigorously save figures to files **with R code** and never with the mouse or the clipboard, you will be able to reproduce old work with ease!
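As a rough illustration of saving a figure with code rather than with the mouse, here is a minimal base R sketch. It is not the `diamonds.R` script from this chapter: it uses the built-in `mtcars` dataset, and `project_dir` is a hypothetical stand-in (a temporary folder) for your project's home directory, so that running it doesn't clutter a real project.

```r
# Hypothetical stand-in for a project directory; inside a real RStudio
# project you would simply use a relative path such as "mtcars-scatter.pdf".
project_dir <- file.path(tempdir(), "demo-project")
dir.create(project_dir, showWarnings = FALSE)

fig_path <- file.path(project_dir, "mtcars-scatter.pdf")

# Open a PDF graphics device, draw the plot, then close the device so the
# file is written to disk.
pdf(fig_path, width = 6, height = 4)
plot(
  mtcars$wt, mtcars$mpg,
  xlab = "Weight (1000 lbs)",
  ylab = "Miles per gallon"
)
dev.off()

# The figure now exists on disk, next to wherever this script is saved.
file.exists(fig_path)
```

Because every step is in the script, re-running it regenerates exactly the same file, which is what makes old figures reproducible.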
### Relative and absolute paths
Once you're inside a project, you should only ever use relative paths, not absolute paths.
What's the difference?
A relative path is **relative** to the working directory, i.e. the project's home.
When Hadley wrote `diamonds.R` above it was a shortcut for `/Users/hadley/Documents/r4ds/r4ds/diamonds.R`.
But importantly, if Mine ran this code on her computer, it would point to `/Users/Mine/Documents/r4ds/r4ds/diamonds.R`.
This is why relative paths are important: they'll work regardless of where the project ends up.
Absolute paths point to the same place regardless of your working directory.
They look a little different depending on your operating system.
On Windows they start with a drive letter (e.g. `C:`) or two backslashes (e.g. `\\servername`) and on Mac/Linux they start with a slash "/" (e.g. `/users/hadley`).
You should **never** use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
There's another important difference between operating systems: how you separate the components of the path.
Mac and Linux use slashes (e.g. `plots/diamonds.pdf`) and Windows uses backslashes (e.g. `plots\diamonds.pdf`).
R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes!
That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.
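The points about separators and relative paths can be sketched in base R as follows. The variable names are invented for illustration; the `plots/diamonds.pdf` path is the example used above.

```r
# file.path() joins components with a forward slash on every platform,
# so the resulting relative path is portable.
rel_path <- file.path("plots", "diamonds.pdf")
rel_path

# A relative path is resolved against the working directory. This shows
# the absolute location it would point to on *this* computer.
abs_path <- file.path(getwd(), rel_path)

# Inside an R string a backslash must be doubled: the two characters `\\`
# in the source code stand for one backslash in the actual path.
win_path <- "plots\\diamonds.pdf"
n_chars <- nchar(win_path)

# Converting a Windows-style path to forward slashes; fixed = TRUE treats
# the pattern as a literal backslash rather than a regular expression.
converted <- gsub("\\", "/", win_path, fixed = TRUE)
converted
```

Only `rel_path` belongs in a script you intend to share; `abs_path` will differ on every machine, which is exactly why absolute paths hinder sharing.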
## Summary
In summary, RStudio projects give you a solid workflow that will serve you well in the future:
- Create an RStudio project for each data analysis project.
- Keep data files there; we'll talk about loading them into R in \@ref(data-import).
- Keep scripts there; edit them, run them in bits or as a whole.
- Save your outputs (plots and cleaned data) there.
In summary, scripts and projects give you a solid workflow that will serve you well in the future:
- Create one RStudio project for each data analysis project.
- Save your scripts (with informative names) in the project, edit them, run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.
- Only ever use relative paths, not absolute paths.
Everything you need is in one place and cleanly separated from all the other projects that you are working on.
Then everything you need is in one place and cleanly separated from all the other projects that you are working on.
## Exercises

View File

@ -9,7 +9,7 @@ status("polishing")
Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
Even as a very new programmer it's a good idea to work on your code style.
Use a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else.
Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else.
This chapter will introduce you to the most important points of the [tidyverse style guide](https://style.tidyverse.org), which is used throughout this book.
Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature.

View File

@ -10,7 +10,19 @@ In this part of the book, you'll learn about data wrangling, the art of getting
In some cases, this is a relatively simple application of a package that does data import.
But in more complex cases it encompasses both tidying and transformation as the native structure of the data might be quite far from the tidy rectangle you'd prefer to work with.
![](diagrams/data-science-wrangle.png)
```{r}
#| label: fig-ds-wrangle
#| echo: false
#| fig-cap: >
#| Data wrangling is the combination of importing, tidying, and
#| transforming.
#| fig-alt: >
#| Our data science model with import, tidy, and transform, highlighted
#| in blue.
#| out.width: NULL
knitr::include_graphics("diagrams/data-science/wrangle.png", dpi = 270)
```
This part of the book proceeds as follows: