<section data-type="chapter" id="chp-workflow-scripts">
<h1><span id="sec-workflow-scripts-projects" class="quarto-section-identifier d-none d-lg-block"><span class="chapter-title">Workflow: scripts and projects</span></span></h1><p>You are reading the work-in-progress second edition of R for Data Science. This chapter should be readable but is currently undergoing final polishing. You can find the complete first edition at <a href="https://r4ds.had.co.nz" class="uri">https://r4ds.had.co.nz</a>.</p><p>This chapter will introduce you to two very important tools for organizing your code: scripts and projects.</p>
<section id="scripts" data-type="sect1">
<h1>
Scripts</h1>
<p>So far, you have used the console to run code. That's a great place to start, but you'll find it gets cramped pretty quickly as you create more complex ggplot2 graphics and longer dplyr pipelines. To give yourself more room to work, use the script editor. Open it by clicking the File menu, selecting New File, then R Script, or by using the keyboard shortcut Cmd/Ctrl + Shift + N. Now you'll see four panes, as in <a href="#fig-rstudio-script" data-type="xref">#fig-rstudio-script</a>. The script editor is a great place to put code you care about. Keep experimenting in the console, but once you have written code that works and does what you want, put it in the script editor.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-rstudio-script"><p><img src="diagrams/rstudio/script.png" alt="RStudio IDE with Editor, Console, and Output highlighted." width="521"/></p>
<figcaption>Opening the script editor adds a new pane at the top-left of the IDE.</figcaption>
</figure>
</div>
</div>
<section id="running-code" data-type="sect2">
<h2>
Running code</h2>
<p>The script editor is a great place to build up complex ggplot2 plots or long sequences of dplyr manipulations. The key to using the script editor effectively is to memorize one of the most important keyboard shortcuts: Cmd/Ctrl + Enter. This executes the current R expression in the console. For example, take the code below. If your cursor is at █, pressing Cmd/Ctrl + Enter will run the complete command that generates <code>not_cancelled</code>. It will also move the cursor to the next statement (beginning with <code>not_cancelled |&gt;</code>). That makes it easy to step through your complete script by repeatedly pressing Cmd/Ctrl + Enter.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">library(dplyr)
library(nycflights13)

not_cancelled &lt;- flights |&gt;
  filter(!is.na(dep_delay)█, !is.na(arr_delay))

not_cancelled |&gt;
  group_by(year, month, day) |&gt;
  summarize(mean = mean(dep_delay))</pre>
</div>
<p>Instead of running your code expression-by-expression, you can also execute the complete script in one step with Cmd/Ctrl + Shift + S. Doing this regularly is a great way to ensure that you've captured all the important parts of your code in the script.</p>
<p>We recommend that you always start your script with the packages that you need. That way, if you share your code with others, they can easily see which packages they need to install. Note, however, that you should never include <code><a href="https://rdrr.io/r/utils/install.packages.html">install.packages()</a></code> in a script that you share. It's very antisocial to change settings on someone else's computer!</p>
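<p>One way to follow both pieces of advice is sketched below. It is only an illustration: the <code>needed</code> vector is a hypothetical list of packages, and the check uses base R's <code>requireNamespace()</code> to report what is missing rather than installing anything on the reader's behalf.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Packages this (hypothetical) script depends on, listed at the very top
needed &lt;- c("dplyr", "nycflights13")

# Check which of them are available, without installing anything
installed &lt;- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Tell the reader what to install; the decision stays with them
if (!all(installed)) {
  message("Please install: ", paste(needed[!installed], collapse = ", "))
}</pre>
</div>
<p>In most scripts, plain <code>library()</code> calls at the top are enough; the check above only makes the failure message friendlier.</p>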
<p>When working through future chapters, we highly recommend starting in the script editor and practicing your keyboard shortcuts. Over time, sending code to the console in this way will become so natural that you won't even think about it.</p>
</section>
<section id="rstudio-diagnostics" data-type="sect2">
<h2>
RStudio diagnostics</h2>
<p>In the script editor, RStudio will highlight syntax errors with a red squiggly line and a cross in the sidebar:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-diagnostic.png" alt="Script editor with the script x y &lt;- 10. A red X indicates that there is a syntax error. The syntax error is also highlighted with a red squiggly line." width="148"/></p>
</div>
</div>
<p>Hover over the cross to see what the problem is:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-diagnostic-tip.png" alt="Script editor with the script x y &lt;- 10. A red X indicates that there is a syntax error. The syntax error is also highlighted with a red squiggly line. Hovering over the X shows a text box with the text unexpected token y and unexpected token &lt;-." width="232"/></p>
</div>
</div>
<p>RStudio will also let you know about potential problems:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-diagnostic-warn.png" alt="Script editor with the script 3 == NA. A yellow exclamation mark indicates that there may be a potential problem. Hovering over the exclamation mark shows a text box with the text use is.na to check whether expression evaluates to NA." width="439"/></p>
</div>
</div>
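<p>To see why RStudio flags a comparison such as <code>3 == NA</code>, it can help to run a small example. The vector <code>x</code> below is made up purely for illustration:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">x &lt;- c(3, NA, 7)

# Comparing anything with NA gives NA, so this never finds the missing value
x == NA
#&gt; [1] NA NA NA

# is.na() is the reliable way to test for missing values
is.na(x)
#&gt; [1] FALSE  TRUE FALSE</pre>
</div>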
</section>
<section id="saving-and-naming" data-type="sect2">
<h2>
Saving and naming</h2>
<p>RStudio automatically saves the contents of the script editor when you quit, and automatically reloads it when you re-open. Nevertheless, it's a good idea to avoid Untitled1, Untitled2, Untitled3, and so on, and instead save your scripts and give them informative names.</p>
<p>It might be tempting to name your files <code>code.R</code> or <code>myscript.R</code>, but you should think a bit harder before choosing a name for your file. Three important principles for file naming are as follows:</p>
<ol type="1"><li>File names should be <strong>machine</strong> readable: avoid spaces, symbols, and special characters. Don't rely on case sensitivity to distinguish files.</li>
<li>File names should be <strong>human</strong> readable: use file names to describe what's in the file.</li>
<li>File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.</li>
</ol><p>For example, suppose you have the following files in a project folder.</p>
<pre><code>alternative model.R
code for exploratory analysis.r
finalreport.qmd
FinalReport.qmd
fig 1.png
Figure_02.png
model_first_try.R
run-first.r
temp.txt</code></pre>
<p>There are a variety of problems here: it's hard to find which file to run first, file names contain spaces, there are two files with the same name but different capitalization (<code>finalreport</code> vs. <code>FinalReport</code><span data-type="footnote">Not to mention that you're tempting fate by using “final” in the name 😆 The comic Piled Higher and Deeper has a <a href="https://phdcomics.com/comics/archive.php?comicid=1531">fun strip on this</a>.</span>), and some names don't describe their contents (<code>run-first</code> and <code>temp</code>).</p>
<p>Here's a better way of naming and organizing the same set of files:</p>
<pre><code>01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt</code></pre>
<p>Numbering the key scripts makes it obvious in which order to run them, and a consistent naming scheme makes it easier to see what varies. Additionally, the figures are labelled similarly, the reports are distinguished by dates included in the file names, and <code>temp</code> is renamed to <code>report-draft-notes</code> to better describe its contents.</p>
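<p>The machine-readability principle can be checked mechanically. The following is only a sketch: the <code>files</code> vector is a hand-typed subset of the listings above, and the regular expression is one assumed definition of “machine readable” (letters, digits, dots, underscores, and hyphens only), not an official rule.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r">files &lt;- c(
  "alternative model.R",
  "finalreport.qmd",
  "FinalReport.qmd",
  "fig 1.png",
  "01-load-data.R"
)

# Names containing anything other than letters, digits, ".", "_" or "-"
has_odd_chars &lt;- grepl("[^A-Za-z0-9._-]", files)

# Names that collide with another name once case is ignored
lower &lt;- tolower(files)
case_clash &lt;- lower %in% lower[duplicated(lower)]

# Files that would benefit from renaming
files[has_odd_chars | case_clash]</pre>
</div>
<p>With this input, every file except <code>01-load-data.R</code> is flagged: two names contain spaces, and two differ only in capitalization.</p>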
</section>
</section>
<section id="projects" data-type="sect1">
<h1>
Projects</h1>
<p>One day, you will need to quit R, go do something else, and return to your analysis later. One day, you will be working on multiple analyses simultaneously and you want to keep them separate. One day, you will need to bring data from the outside world into R and send numerical results and figures from R back out into the world.</p>
<p>To handle these real life situations, you need to make two decisions:</p>
<ol type="1"><li><p>What is the source of truth? What will you save as your lasting record of what happened?</p></li>
<li><p>Where does your analysis live?</p></li>
</ol>
<section id="what-is-the-source-of-truth" data-type="sect2">
<h2>
What is the source of truth?</h2>
<p>As a beginning R user, it's OK to consider your environment (i.e. the objects listed in the environment pane) to be your analysis. However, in the long run, you'll be much better off if you ensure that your R scripts are the source of truth. With your R scripts (and your data files), you can recreate the environment. With only your environment, it's much harder to recreate your R scripts: you'll either have to retype a lot of code from memory (inevitably making mistakes along the way) or you'll have to carefully mine your R history.</p>
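<p>The idea that a script can recreate the environment can be demonstrated in miniature. This sketch writes a tiny, made-up script to a temporary file and then runs it with <code>source()</code> in a fresh, empty environment; the object names <code>x</code> and <code>total</code> are purely illustrative.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># A tiny script, saved to a temporary file for the sake of the example
script &lt;- tempfile(fileext = ".R")
writeLines(c("x &lt;- 1:10", "total &lt;- sum(x)"), script)

# Start from an empty environment and rebuild everything from the script
env &lt;- new.env()
source(script, local = env)

ls(env)
#&gt; [1] "total" "x"
env$total
#&gt; [1] 55</pre>
</div>
<p>Going the other way, from the objects back to the code that made them, has no such one-line solution.</p>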
<p>To help keep your R scripts as the source of truth for your analysis, we highly recommend that you instruct RStudio not to preserve your workspace between sessions. You can do this either by running <code><a href="https://usethis.r-lib.org/reference/use_blank_slate.html">usethis::use_blank_slate()</a></code><span data-type="footnote">If you don't have usethis installed, you can install it with <code>install.packages("usethis")</code>.</span> or by mimicking the options shown in <a href="#fig-blank-slate" data-type="xref">#fig-blank-slate</a>. This will cause you some short-term pain, because now when you restart RStudio, it will no longer remember the code that you ran last time. But this short-term pain saves you long-term agony because it forces you to capture all important interactions in your code. There's nothing worse than discovering three months after the fact that you've only stored the results of an important calculation in your workspace, not the calculation itself in your code.</p>
<div class="cell">
<div class="cell-output-display">
<figure id="fig-blank-slate"><p><img src="diagrams/rstudio/clean-slate.png" alt="RStudio preferences window where the option Restore .RData into workspace at startup is not checked. Also, the option Save workspace to .RData on exit is set to Never. " width="523"/></p>
<figcaption>Copy these options in your RStudio options to always start your RStudio session with a clean slate.</figcaption>
</figure>
</div>
</div>
<p>There is a great pair of keyboard shortcuts that will work together to make sure you've captured the important parts of your code in the editor:</p>
<ol type="1"><li>Press Cmd/Ctrl + Shift + F10 to restart R.</li>
<li>Press Cmd/Ctrl + Shift + S to re-run the current script.</li>
</ol><p>We collectively use this pattern hundreds of times a week.</p>
<div data-type="note"><h1>
RStudio server
</h1><p>If you're using RStudio server, your R session is never restarted by default. When you close your RStudio server tab, it might feel like you're closing R, but the server actually keeps it running in the background. The next time you return, you'll be in exactly the same place you left. This makes it even more important to regularly restart R so that you're starting with a fresh slate.</p></div>
</section>
<section id="where-does-your-analysis-live" data-type="sect2">
<h2>
Where does your analysis live?</h2>
<p>R has a powerful notion of the <strong>working directory</strong>. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. RStudio shows your current working directory at the top of the console:</p>
<div class="cell">
<div class="cell-output-display">
<p><img src="screenshots/rstudio-wd.png" alt="The Console tab shows the current working directory as ~/Documents/r4ds/r4ds. " width="321"/></p>
</div>
</div>
<p>And you can print this out in R code by running <code><a href="https://rdrr.io/r/base/getwd.html">getwd()</a></code>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">getwd()
#&gt; [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>
</div>
<p>As a beginning R user, it's OK to let your working directory be your home directory, documents directory, or any other weird directory on your computer. But you're nine chapters into this book, and you're no longer a rank beginner. Very soon now you should evolve to organizing your projects into directories and, when working on a project, setting R's working directory to the associated directory.</p>
<p>You can set the working directory from within R but <strong>we</strong> <strong>do not recommend it</strong>:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">setwd("/path/to/my/CoolProject")</pre>
</div>
<p>There's a better way; a way that also puts you on the path to managing your R work like an expert. That way is the <strong>RStudio</strong> <strong>project</strong>.</p>
</section>
<section id="rstudio-projects" data-type="sect2">
<h2>
RStudio projects</h2>
<p>Keeping all the files associated with a given project (input data, R scripts, analytical results, and figures) together in one directory is such a wise and common practice that RStudio has built-in support for this via <strong>projects</strong>. Let's make a project for you to use while you're working through the rest of this book. Click File &gt; New Project, then follow the steps shown in <a href="#fig-new-project" data-type="xref">#fig-new-project</a>.</p>
<figure id="fig-new-project"><div class="cell-output-display">
<div class="quarto-figure quarto-figure-center anchored">
<figure id="fig-new-project-1"><p><img src="screenshots/rstudio-project-1.png" alt="Three screenshots of the New Project menu. In the first screenshot, the Create Project window is shown and New Directory is selected. In the second screenshot, the Project Type window is shown and Empty Project is selected. In the third screenshot, the Create New Project window is shown and the directory name is given as r4ds and the project is being created as subdirectory of the Desktop. " data-ref-parent="fig-new-project" width="542"/></p>
<figcaption>(a) First click New Directory.</figcaption>
</figure></div>
</div>
<div class="cell-output-display">
<div class="quarto-figure quarto-figure-center anchored">
<figure id="fig-new-project-2"><p><img src="screenshots/rstudio-project-2.png" alt="Three screenshots of the New Project menu. In the first screenshot, the Create Project window is shown and New Directory is selected. In the second screenshot, the Project Type window is shown and Empty Project is selected. In the third screenshot, the Create New Project window is shown and the directory name is given as r4ds and the project is being created as subdirectory of the Desktop. " data-ref-parent="fig-new-project" width="545"/></p>
<figcaption>(b) Then click New Project.</figcaption>
</figure></div>
</div>
<div class="cell-output-display">
<div class="quarto-figure quarto-figure-center anchored">
<figure id="fig-new-project-3"><p><img src="screenshots/rstudio-project-3.png" alt="Three screenshots of the New Project menu. In the first screenshot, the Create Project window is shown and New Directory is selected. In the second screenshot, the Project Type window is shown and Empty Project is selected. In the third screenshot, the Create New Project window is shown and the directory name is given as r4ds and the project is being created as subdirectory of the Desktop. " data-ref-parent="fig-new-project" width="548"/></p>
<figcaption>(c) Finally, enter the directory name and choose the subdirectory in which to create the project.</figcaption>
</figure></div>
</div>
<figcaption class="figure-caption">Figure 9.3: Create a new project by following these three steps.</figcaption>
</figure>
<p>Call your project <code>r4ds</code> and think carefully about which subdirectory you put the project in. If you don't store it somewhere sensible, it will be hard to find it in the future!</p>
<p>Once this process is complete, you'll get a new RStudio project just for this book. Check that the “home” of your project is the current working directory:</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">getwd()
#&gt; [1] "/Users/hadley/Documents/r4ds/r4ds"</pre>
</div>
<p>Now enter the following commands in the script editor, and save the file, calling it “diamonds.R”. Next, run the complete script, which will save a PDF and a CSV file into your project directory. Don't worry about the details; you'll learn them later in the book.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="downlit">library(tidyverse)

ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")

write_csv(diamonds, "diamonds.csv")</pre>
</div>
<p>Quit RStudio. Inspect the folder associated with your project and notice the <code>.Rproj</code> file. Double-click that file to re-open the project. Notice you get back to where you left off: it's the same working directory and command history, and all the files you were working on are still open. Because you followed our instructions above, you will, however, have a completely fresh environment, guaranteeing that you're starting with a clean slate.</p>
<p>In your favorite OS-specific way, search your computer for <code>diamonds.pdf</code> and you will find the PDF (no surprise) but <em>also the script that created it</em> (<code>diamonds.R</code>). This is a huge win! One day, you will want to remake a figure or just understand where it came from. If you rigorously save figures to files <strong>with R code</strong> and never with the mouse or the clipboard, you will be able to reproduce old work with ease!</p>
</section>
<section id="relative-and-absolute-paths" data-type="sect2">
<h2>
Relative and absolute paths</h2>
<p>Once you're inside a project, you should only ever use relative paths, not absolute paths. What's the difference? A relative path is <strong>relative</strong> to the working directory, i.e. the project's home. When Hadley wrote <code>diamonds.R</code> above, it was a shortcut for <code>/Users/hadley/Documents/r4ds/r4ds/diamonds.R</code>. But importantly, if Mine ran this code on her computer, it would point to <code>/Users/Mine/Documents/r4ds/r4ds/diamonds.R</code>. This is why relative paths are important: they'll work regardless of where the project ends up.</p>
<p>Absolute paths point to the same place regardless of your working directory. They look a little different depending on your operating system. On Windows they start with a drive letter (e.g. <code>C:</code>) or two backslashes (e.g. <code>\\servername</code>) and on Mac/Linux they start with a slash “/” (e.g. <code>/users/hadley</code>). You should <strong>never</strong> use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.</p>
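<p>The rules just described can be turned into a rough check for absolute paths. This is a sketch only: <code>is_absolute()</code> is a hypothetical helper written for this example (it is not part of base R), the paths are invented, and the pattern covers only the three cases named above, so it ignores other forms such as paths starting with <code>~</code>.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># Hypothetical helper: TRUE if a path starts with a drive letter (C:),
# two backslashes (\\servername), or a forward slash (/users/hadley)
is_absolute &lt;- function(path) {
  grepl("^([A-Za-z]:|\\\\\\\\|/)", path)
}

paths &lt;- c(
  "diamonds.csv",
  "plots/diamonds.pdf",
  "/Users/hadley/Documents/r4ds",
  "C:/Users/hadley",
  "\\\\servername\\share"
)

is_absolute(paths)
#&gt; [1] FALSE FALSE  TRUE  TRUE  TRUE</pre>
</div>
<p>Only the first two paths are safe to use in a script that you share.</p>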
<p>There's another important difference between operating systems: how you separate the components of the path. Mac and Linux use slashes (e.g. <code>plots/diamonds.pdf</code>) and Windows uses backslashes (e.g. <code>plots\diamonds.pdf</code>). R can work with either type (no matter what platform you're currently using), but unfortunately, backslashes mean something special to R, and to get a single backslash in the path, you need to type two backslashes! That makes life frustrating, so we recommend always using the Linux/Mac style with forward slashes.</p>
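<p>A short illustration of the separator issue, using only base R functions. The file names are the ones from the examples above, and the variable names are made up for this sketch.</p>
<div class="cell">
<pre data-type="programlisting" data-code-language="r"># file.path() joins components with forward slashes
file.path("plots", "diamonds.pdf")
#&gt; [1] "plots/diamonds.pdf"

# In an R string, one backslash has to be typed as two
windows_style &lt;- "plots\\diamonds.pdf"
writeLines(windows_style)
#&gt; plots\diamonds.pdf

# Converting a Windows-style path to forward slashes
converted &lt;- gsub("\\", "/", windows_style, fixed = TRUE)
converted
#&gt; [1] "plots/diamonds.pdf"</pre>
</div>
<p>Building paths with <code>file.path()</code> or typing forward slashes directly avoids the doubled backslashes altogether.</p>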
</section>
</section>
<section id="summary" data-type="sect1">
<h1>
Summary</h1>
<p>In summary, scripts and projects give you a solid workflow that will serve you well in the future:</p>
<ul><li>Create one RStudio project for each data analysis project.</li>
<li>Save your scripts (with informative names) in the project, edit them, and run them in bits or as a whole. Restart R frequently to make sure you've captured everything in your scripts.</li>
<li>Only ever use relative paths, not absolute paths.</li>
</ul><p>Then everything you need is in one place and cleanly separated from all the other projects that you are working on.</p>
</section>
<section id="exercises" data-type="sect1">
<h1>
Exercises</h1>
<ol type="1"><li><p>Go to the RStudio Tips Twitter account, <a href="https://twitter.com/rstudiotips" class="uri">https://twitter.com/rstudiotips</a> and find one tip that looks interesting. Practice using it!</p></li>
<li><p>What other common mistakes will RStudio diagnostics report? Read <a href="https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics" class="uri">https://support.rstudio.com/hc/en-us/articles/205753617-Code-Diagnostics</a> to find out.</p></li>
</ol></section>
<section id="summary-1" data-type="sect1">
<h1>
Summary</h1>
<p>In this chapter, you've learned how to organize your R code in scripts (files) and projects (directories). Much like code style, this may feel like busywork at first. But as you accumulate more code across multiple projects, you'll learn to appreciate how a little up-front organisation can save you a bunch of time down the road.</p>
<p>Next up, we'll switch back to data science tooling to talk about exploratory data analysis (or EDA for short), a philosophy and set of tools that you can use with your data to start to get a sense of what's going on.</p>
</section>
</section>