diff --git a/DESCRIPTION b/DESCRIPTION index e2c7e15..bef1386 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -18,6 +18,7 @@ Imports: gapminder, ggrepel, ggridges, + ggthemes, hexbin, janitor, Lahman, diff --git a/_quarto.yml b/_quarto.yml index 93189ed..d30a763 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -71,7 +71,6 @@ book: chapters: - quarto.qmd - quarto-formats.qmd - - quarto-workflow.qmd format: html: diff --git a/base-R.qmd b/base-R.qmd index dfb1414..2f087d4 100644 --- a/base-R.qmd +++ b/base-R.qmd @@ -3,24 +3,25 @@ ```{r} #| results: "asis" #| echo: false + source("_common.R") status("complete") ``` To finish off the programming section, we're going to give you a quick tour of the most important base R functions that we don't otherwise discuss in the book. -These tools are particularly useful as you do more programming and will help you read code that you'll encounter in the wild. +These tools are particularly useful as you do more programming and will help you read code you'll encounter in the wild. This is a good place to remind you that the tidyverse is not the only way to solve data science problems. -We teach the tidyverse in this book because tidyverse packages share a common design philosophy, which increases the consistency across functions, making each new function or package a little easier to learn and use. +We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use. It's not possible to use the tidyverse without using base R, so we've actually already taught you a **lot** of base R functions: from `library()` to load packages, to `sum()` and `mean()` for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like `+`, `-`, `/`, `*`, `|`, `&`, and `!`. 
What we haven't focused on so far is base R workflows, so we will highlight a few of those in this chapter. -After you read this book you'll learn other approaches to the same problems using base R, data.table, and other packages. -You'll certainly encounter these other approaches when you start reading R code written by other people, particularly if you're using StackOverflow. +After you read this book, you'll learn other approaches to the same problems using base R, data.table, and other packages. +You'll undoubtedly encounter these other approaches when you start reading R code written by others, particularly if you're using StackOverflow. It's 100% okay to write code that uses a mix of approaches, and don't let anyone tell you otherwise! In this chapter, we'll focus on four big topics: subsetting with `[`, subsetting with `[[` and `$`, the apply family of functions, and `for` loops. -To finish off, we'll briefly discuss two important plotting functions. +To finish off, we'll briefly discuss two essential plotting functions. ### Prerequisites @@ -39,7 +40,7 @@ We'll then help you cement that knowledge by showing how various dplyr verbs are ### Subsetting vectors -There are five main types of things that you can subset a vector with, i.e. that can be the `i` in `x[i]`: +There are five main types of things that you can subset a vector with, i.e., that can be the `i` in `x[i]`: 1. **A vector of positive integers**. Subsetting with positive integers keeps the elements at those positions: @@ -76,7 +77,7 @@ There are five main types of things that you can subset a vector with, i.e. that x[x %% 2 == 0] ``` - Note that, unlike `filter()`, `NA` indices will be included in the output as `NA`s. + Unlike `filter()`, `NA` indices will be included in the output as `NA`s. 4. **A character vector**. If you have a named vector, you can subset it with a character vector: @@ -90,7 +91,7 @@ There are five main types of things that you can subset a vector with, i.e. that 5. 
**Nothing**. The final type of subsetting is nothing, `x[]`, which returns the complete `x`. - This is not useful for subsetting vectors, but as we'll see shortly it is useful when subsetting 2d structures like tibbles. + This is not useful for subsetting vectors, but as we'll see shortly, it is useful when subsetting 2d structures like tibbles. ### Subsetting data frames @@ -122,7 +123,7 @@ We'll come back to `$` shortly, but you should be able to guess what `df$x` does We need to use it here because `[` doesn't use tidy evaluation, so you need to be explicit about the source of the `x` variable. There's an important difference between tibbles and data frames when it comes to `[`. -In this book we've mostly used tibbles, which *are* data frames, but they tweak some older behaviors to make your life a little easier. +In this book, we've mainly used tibbles, which *are* data frames, but they tweak some behaviors to make your life a little easier. In most places, you can use "tibble" and "data frame" interchangeably, so when we want to draw particular attention to R's built-in data frame, we'll write `data.frame`. If `df` is a `data.frame`, then `df[, cols]` will return a vector if `col` selects a single column and a data frame if it selects more than one column. If `df` is a tibble, then `[` will always return a tibble. 
@@ -143,12 +144,13 @@ df1[, "x" , drop = FALSE] ### dplyr equivalents -A number of dplyr verbs are special cases of `[`: +Several dplyr verbs are special cases of `[`: - `filter()` is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values: ```{r} #| results: false + df <- tibble( x = c(2, 3, 1, 1, NA), y = letters[1:5], @@ -166,18 +168,20 @@ A number of dplyr verbs are special cases of `[`: ```{r} #| results: false + df |> arrange(x, y) # same as df[order(df$x, df$y), ] ``` - You can use `order(decreasing = TRUE)` to sort all columns in descending order or `-rank(col)` to individually sort columns in decreasing order. + You can use `order(decreasing = TRUE)` to sort all columns in descending order or `-rank(col)` to sort columns in decreasing order individually. - Both `select()` and `relocate()` are similar to subsetting the columns with a character vector: ```{r} #| results: false + df |> select(x, z) # same as @@ -196,6 +200,7 @@ df |> ```{r} #| results: false + # same as df |> subset(x > 1, c(y, z)) ``` @@ -206,7 +211,7 @@ This function was the inspiration for much of dplyr's syntax. 1. Create functions that take a vector as input and return: - a. The elements at even numbered positions. + a. The elements at even-numbered positions. b. Every element except the last value. c. Only even values (and no missing values). @@ -244,7 +249,7 @@ tb$z <- tb$x + tb$y tb ``` -There are a number of other base R approaches to creating new columns including with `transform()`, `with()`, and `within()`. +There are several other base R approaches to creating new columns including with `transform()`, `with()`, and `within()`. Hadley collected a few examples at . Using `$` directly is convenient when performing quick summaries. 
@@ -428,6 +433,7 @@ The basic structure of a `for` loop looks like this: ```{r} #| eval: false + for (element in vector) { # do something with element } @@ -438,6 +444,7 @@ For example, in @sec-save-database instead of using walk: ```{r} #| eval: false + paths |> walk(append_file) ``` @@ -445,6 +452,7 @@ We could have used a `for` loop: ```{r} #| eval: false + for (path in paths) { append_file(path) } @@ -506,6 +514,7 @@ Here's a quick example from the diamonds dataset: ```{r} #| dev: png + hist(diamonds$carat) plot(diamonds$carat, diamonds$price) diff --git a/communication.qmd b/communication.qmd index 3713925..966b9f2 100644 --- a/communication.qmd +++ b/communication.qmd @@ -217,41 +217,6 @@ ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + ``` Note the use of `hjust` and `vjust` to control the alignment of the label. -@fig-just shows all nine possible combinations. - -```{r} -#| label: fig-just -#| echo: false -#| fig-width: 4.5 -#| fig-asp: 0.5 -#| out-width: "60%" -#| fig-cap: > -#| All nine combinations of `hjust` and `vjust`. -#| fig-alt: > -#| A 1x1 grid. At (0,0) hjust is set to left and vjust is set to bottom. -#| At (0.5, 0) hjust is center and vjust is bottom and at (1, 0) hjust is -#| right and vjust is bottom. At (0, 0.5) hjust is left and vjust is -#| center, at (0.5, 0.5) hjust is center and vjust is center, and at (1, 0.5) -#| hjust is right and vjust is center. Finally, at (1, 0) hjust is left and -#| vjust is top, at (0.5, 1) hjust is center and vjust is top, and at (1, 1) -#| hjust is right and vjust is bottom. 
- -vjust <- c(bottom = 0, center = 0.5, top = 1) -hjust <- c(left = 0, center = 0.5, right = 1) - -df <- crossing(hj = names(hjust), vj = names(vjust)) |> - mutate( - y = vjust[vj], - x = hjust[hj], - label = paste0("hjust = '", hj, "'\n", "vjust = '", vj, "'") - ) - -ggplot(df, aes(x, y)) + - geom_point(color = "grey70", size = 5) + - geom_point(size = 0.5, color = "red") + - geom_text(aes(label = label, hjust = hj, vjust = vj), size = 4) + - labs(x = NULL, y = NULL) -``` However the annotated plot we made above is hard to read because the labels overlap with each other, and with the points. We can make things a little better by switching to `geom_label()` which draws a rectangle behind the text. @@ -342,28 +307,9 @@ ggplot(mpg, aes(x = displ, y = hwy)) + ) ``` -If you want to place the text exactly on the borders of the plot, you can use `+Inf` and `-Inf`. -Since we're no longer computing the positions from `mpg`, we can use `tibble()` to create the data frame: +If you want to place the text exactly on the borders of the plot, you can set `displ = Inf` and `hwy = Inf` in the tibble above, instead of the calculated maximum values. -```{r} -#| fig-alt: > -#| Scatterplot of highway fuel efficiency versus engine size of cars. On the -#| top right corner, flush against the corner, is an annotation that -#| reads "increasing engine size is related to decreasing fuel economy". -#| The text spans two lines. - -label_info <- tibble( - displ = Inf, - hwy = Inf, - label = "Increasing engine size is \nrelated to decreasing fuel economy." -) - -ggplot(mpg, aes(x = displ, y = hwy)) + - geom_point() + - geom_text(data = label_info, aes(label = label), vjust = "top", hjust = "right") -``` - -Alternatively, we can add the annotation without creating a new data frame, using `annotate()`. +We can alternatively add the annotation without creating a new data frame, using `annotate()`. 
This function adds a geom to a plot, but it doesn't map variables of a data frame to an aesthetic. The first argument of this function, `geom`, is the geometric object you want to use for annotation. @@ -608,7 +554,7 @@ The theme setting `legend.position` controls where the legend is drawn: ```{r} #| layout-ncol: 2 #| fig-width: 4 -#| fig-asp: 1 +#| fig-height: 2 #| fig-alt: > #| Four scatterplots of highway fuel efficiency versus engine size of cars #| where points are colored based on class of car. Clockwise, the legend @@ -1059,6 +1005,7 @@ Finally, we have also customized the heights of the various components of our pa Patchwork divides up the area you have allotted for your plot using this scale and places the components accordingly. ```{r} +#| fig-width: 10 #| fig-alt: > #| Five plots laid out such that first two plots are next to each other. Plots #| three and four are underneath them. And the fifth plot stretches under them. diff --git a/data-visualize.qmd b/data-visualize.qmd index 763cbeb..2a01ba7 100644 --- a/data-visualize.qmd +++ b/data-visualize.qmd @@ -49,10 +49,11 @@ library(tidyverse) You only need to install a package once, but you need to load it every time you start a new session. -In addition to tidyverse, we will also use the **palmerpenguins** package, which includes the `penguins` dataset containing body measurements for penguins on three islands in the Palmer Archipelago. +In addition to tidyverse, we will also use the **palmerpenguins** package, which includes the `penguins` dataset containing body measurements for penguins on three islands in the Palmer Archipelago, and the ggthemes package, which offers a colorblind safe color palette. 
```{r} library(palmerpenguins) +library(ggthemes) ``` ## First steps @@ -128,7 +129,8 @@ ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + y = "Body mass (g)", color = "Species", shape = "Species" - ) + ) + + scale_color_colorblind() ``` ### Creating a ggplot @@ -323,6 +325,7 @@ Note that the legend is automatically updated to reflect the different shapes of And finally, we can improve the labels of our plot using the `labs()` function in a new layer. Some of the arguments to `labs()` might be self explanatory: `title` adds a title and `subtitle` adds a subtitle to the plot. Other arguments match the aesthetic mappings, `x` is the x-axis label, `y` is the y-axis label, and `color` and `shape` define the label for the legend. +In addition, we can improve the color palette to be colorblind safe with the `scale_color_colorblind()` function from the ggthemes package. ```{r} #| warning: false @@ -345,11 +348,10 @@ ggplot( labs( title = "Body mass and flipper length", subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins", - x = "Flipper length (mm)", - y = "Body mass (g)", - color = "Species", - shape = "Species" - ) + x = "Flipper length (mm)", y = "Body mass (g)", + color = "Species", shape = "Species" + ) + + scale_color_colorblind() ``` We finally have a plot that perfectly matches our "ultimate goal"! diff --git a/data/penguins_species.xlsx b/data/penguins_species.xlsx deleted file mode 100644 index 4c8812a..0000000 Binary files a/data/penguins_species.xlsx and /dev/null differ diff --git a/layers.qmd b/layers.qmd index 9008c49..b4bbb56 100644 --- a/layers.qmd +++ b/layers.qmd @@ -3,6 +3,7 @@ ```{r} #| results: "asis" #| echo: false + source("_common.R") status("complete") ``` @@ -205,6 +206,7 @@ In the next section we dive deeper into geoms. 2. Why did the following code not result in a plot with blue points? 
```{r} + #| fig-show: hide #| fig-alt: > #| Scatterplot of highway fuel efficiency versus engine size of cars #| that shows a negative association. All points are red and @@ -254,7 +256,7 @@ To change the geom in your plot, change the geom function that you add to `ggplo For instance, to make the plots above, you can use this code: ```{r} -#| eval: false +#| fig-show: hide # Left ggplot(mpg, aes(x = displ, y = hwy)) + @@ -441,7 +443,8 @@ To learn more about any single geom, use the help (e.g. `?geom_smooth`). 2. Earlier in this chapter we used `show.legend` without explaining it: ```{r} - #| eval: false + #| fig-show: hide + ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(color = drv), show.legend = FALSE) ``` @@ -551,13 +554,11 @@ ggplot(mpg, aes(x = displ, y = hwy)) + 1. What happens if you facet on a continuous variable? 2. What do the empty cells in plot with `facet_grid(drv ~ cyl)` mean? - How do they relate to this plot? + Run the following code. + How do they relate to the resulting plot? ```{r} - #| fig-alt: > - #| Scatterplot of number of cycles versus type of drive train of cars. - #| The plot shows that there are no cars with 5 cylinders that are 4 - #| wheel drive or with 4 or 5 cylinders that are front wheel drive. + #| fig-show: hide ggplot(mpg) + geom_point(aes(x = drv, y = cyl)) @@ -567,7 +568,7 @@ ggplot(mpg, aes(x = displ, y = hwy)) + What does `.` do? ```{r} - #| eval: false + #| fig-show: hide ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + @@ -581,7 +582,7 @@ ggplot(mpg, aes(x = displ, y = hwy)) + 4. Take the first faceted plot in this section: ```{r} - #| eval: false + #| fig-show: hide ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + @@ -602,10 +603,7 @@ ggplot(mpg, aes(x = displ, y = hwy)) + What does this say about when to place a faceting variable across rows or columns? ```{r} - #| fig-alt: > - #| Two faceted plots, both visualizing highway fuel efficiency versus - #| engine size of cars, faceted by drive train. 
In the top plot, facet - #| are organized across rows and in the second, across columns. + #| fig-show: hide ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + @@ -616,13 +614,11 @@ ggplot(mpg, aes(x = displ, y = hwy)) + facet_grid(. ~ drv) ``` -7. Recreate this plot using `facet_wrap()` instead of `facet_grid()`. +7. Recreate the following plot using `facet_wrap()` instead of `facet_grid()`. How do the positions of the facet labels change? ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus engine size of cars, - #| faceted by type of drive train across rows. + #| fig-show: hide ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + @@ -770,7 +766,7 @@ Each stat is a function, so you can get help in the usual way, e.g. `?stat_bin`. In other words, what is the problem with these two graphs? ```{r} - #| eval: false + #| fig-show: hide ggplot(diamonds, aes(x = cut, y = after_stat(prop))) + geom_bar() @@ -785,7 +781,7 @@ You can color a bar chart using either the `color` aesthetic, or, more usefully, ```{r} #| layout-ncol: 2 -#| fig-width: 4 +#| fig-width: 5.5 #| fig-height: 2 #| fig-alt: > #| Two bar charts of cut of diamonds. In the first plot, the bars have colored @@ -822,7 +818,7 @@ If you don't want a stacked bar chart, you can use one of three other options: ` ```{r} #| layout-ncol: 2 - #| fig-width: 4 + #| fig-width: 5.5 #| fig-height: 2 #| fig-alt: > #| Two segmented bar charts of cut of diamonds, where each bar is filled @@ -844,28 +840,26 @@ If you don't want a stacked bar chart, you can use one of three other options: ` - `position = "fill"` works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups. - ```{r} - #| fig-alt: > - #| Segmented bar chart of cut of diamonds, where each bar is filled with - #| colors for the levels of clarity. 
Height of each bar is 1 and heights - #| of the colored segments are proportional to the proportion of diamonds - #| with a given clarity level within a given cut level. - - ggplot(diamonds, aes(x = cut, fill = clarity)) + - geom_bar(position = "fill") - ``` - - `position = "dodge"` places overlapping objects directly *beside* one another. This makes it easier to compare individual values. ```{r} + #| layout-ncol: 2 + #| fig-width: 5.5 + #| fig-height: 2 #| fig-alt: > - #| Dodged bar chart of cut of diamonds. Dodged bars are grouped by levels + #| On the left, segmented bar chart of cut of diamonds, where each bar is filled with + #| colors for the levels of clarity. Height of each bar is 1 and heights + #| of the colored segments are proportional to the proportion of diamonds + #| with a given clarity level within a given cut level. + #| On the right, dodged bar chart of cut of diamonds. Dodged bars are grouped by levels #| of cut (fair, good, very good, premium, and ideal). In each group there #| are eight bars, one for each level of clarity, and filled with a #| different color for each level. Heights of these bars represent the #| number of diamonds with a given level of cut and clarity. + ggplot(diamonds, aes(x = cut, fill = clarity)) + + geom_bar(position = "fill") ggplot(diamonds, aes(x = cut, fill = clarity)) + geom_bar(position = "dodge") ``` @@ -909,14 +903,11 @@ To learn more about a position adjustment, look up the help page associated with ### Exercises -1. What is the problem with this plot? +1. What is the problem with the following plot? How could you improve it? ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus city fuel efficiency - #| of cars that shows a positive association. The number of points - #| visible in this plot is less than the number of points in the dataset. 
+ #| fig-show: hide ggplot(mpg, aes(x = cty, y = hwy)) + geom_point() @@ -988,16 +979,12 @@ There are two other coordinate systems that are occasionally helpful. 2. What's the difference between `coord_quickmap()` and `coord_map()`? -3. What does the plot below tell you about the relationship between city and highway mpg? +3. What does the following plot tell you about the relationship between city and highway mpg? Why is `coord_fixed()` important? What does `geom_abline()` do? ```{r} - #| fig-alt: > - #| Scatterplot of highway fuel efficiency versus engine size of cars that - #| shows a negative association. The plot also has a straight line that - #| follows the trend of the relationship between the variables but does not - #| go through the cloud of points, it is beneath it. + #| fig-show: hide ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point() + diff --git a/quarto-workflow.qmd b/quarto-workflow.qmd deleted file mode 100644 index 39adfd5..0000000 --- a/quarto-workflow.qmd +++ /dev/null @@ -1,66 +0,0 @@ -# Quarto workflow {#sec-quarto-workflow} - -```{r} -#| results: "asis" -#| echo: false -source("_common.R") -status("complete") -``` - -Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the *console*, then capture what works in the *script editor*. -Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. -You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. -When you're happy, you move on and start a new chunk. - -Quarto is also important because it so tightly integrates prose and code. -This makes it a great **analysis notebook** because it lets you develop code and record your thoughts. -An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. -It: - -- Records what you did and why you did it. 
- Regardless of how great your memory is, if you don't record what you do, there will come a time when you have forgotten important details. - Write them down so you don't forget! - -- Supports rigorous thinking. - You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. - This also saves you time when you eventually write up your analysis to share with others. - -- Helps others understand your work. - It is rare to do data analysis by yourself, and you'll often be working as part of a team. - A lab notebook helps you share not only what you've done, but why you did it with your colleagues or lab mates. - -Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. -We've drawn on our own experiences and Colin Purrington's advice on lab notebooks () to come up with the following tips: - -- Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis. - -- Use the YAML header date field to record the date you started working on the notebook: - - ``` yaml - date: 2016-08-23 - ``` - - Use ISO8601 YYYY-MM-DD format so that's there no ambiguity. - Use it even if you don't normally write dates that way! - -- If you spend a lot of time on an analysis idea and it turns out to be a dead end, don't delete it! - Write up a brief note about why it failed and leave it in the notebook. - That will help you avoid going down the same dead end when you come back to the analysis in the future. - -- Generally, you're better off doing data entry outside of R. - But if you do need to record a small snippet of data, clearly lay it out using `tibble::tribble()`. - -- If you discover an error in a data file, never modify it directly, but instead write code to correct the value. - Explain why you made the fix. - -- Before you finish for the day, make sure you can render the notebook. 
- If you're using caching, make sure to clear the caches. - That will let you fix any problems while the code is still fresh in your mind. - -- If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you'll need to track the versions of the packages that your code uses. - A rigorous approach is to use **renv**, , which stores packages in your project directory. - A quick and dirty hack is to include a chunk that runs `sessionInfo()` --- that won't let you easily recreate your packages as they are today, but at least you'll know what they were. - -- You are going to create many, many, many analysis notebooks over the course of your career. - How are you going to organize them so you can find them again in the future? - We recommend storing them in individual projects, and coming up with a good naming scheme. diff --git a/quarto.qmd b/quarto.qmd index 475f076..4861b0d 100644 --- a/quarto.qmd +++ b/quarto.qmd @@ -447,6 +447,7 @@ plot <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() ```{r} #| echo: false #| fig-width: 4 +#| out-width: "50%" plot ``` @@ -454,6 +455,7 @@ plot ```{r} #| echo: false #| fig-width: 6 +#| out-width: "50%" plot ``` @@ -461,6 +463,7 @@ plot ```{r} #| echo: false #| fig-width: 8 +#| out-width: "50%" plot ``` @@ -515,9 +518,6 @@ Read the documentation for `?knitr::kable` to see the other ways in which you ca For even deeper customization, consider the **gt**, **huxtable**, **reactable**, **kableExtra**, **xtable**, **stargazer**, **pander**, **tables**, and **ascii** packages. Each provides a set of tools for returning formatted tables from R code. -There is also a rich set of options for controlling how figures are embedded. -You'll learn about these in @sec-graphics-communication. - ### Exercises @@ -737,6 +737,66 @@ As with the bibliography field, your csl file should contain a path to the file. Here we assume that the csl file is in the same directory as the .qmd file. 
A good place to find CSL style files for common bibliography styles is . +## Workflow + +Earlier, we discussed a basic workflow for capturing your R code where you work interactively in the *console*, then capture what works in the *script editor*. +Quarto brings together the console and the script editor, blurring the lines between interactive exploration and long-term code capture. +You can rapidly iterate within a chunk, editing and re-executing with Cmd/Ctrl + Shift + Enter. +When you're happy, you move on and start a new chunk. + +Quarto is also important because it so tightly integrates prose and code. +This makes it a great **analysis notebook** because it lets you develop code and record your thoughts. +An analysis notebook shares many of the same goals as a classic lab notebook in the physical sciences. +It: + +- Records what you did and why you did it. + Regardless of how great your memory is, if you don't record what you do, there will come a time when you have forgotten important details. + Write them down so you don't forget! + +- Supports rigorous thinking. + You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. + This also saves you time when you eventually write up your analysis to share with others. + +- Helps others understand your work. + It is rare to do data analysis by yourself, and you'll often be working as part of a team. + A lab notebook helps you share not only what you've done, but why you did it with your colleagues or lab mates. + +Much of the good advice about using lab notebooks effectively can also be translated to analysis notebooks. +We've drawn on our own experiences and Colin Purrington's advice on lab notebooks () to come up with the following tips: + +- Ensure each notebook has a descriptive title, an evocative file name, and a first paragraph that briefly describes the aims of the analysis. 
+ +- Use the YAML header date field to record the date you started working on the notebook: + + ``` yaml + date: 2016-08-23 + ``` + + Use ISO8601 YYYY-MM-DD format so that there's no ambiguity. + Use it even if you don't normally write dates that way! + +- If you spend a lot of time on an analysis idea and it turns out to be a dead end, don't delete it! + Write up a brief note about why it failed and leave it in the notebook. + That will help you avoid going down the same dead end when you come back to the analysis in the future. + +- Generally, you're better off doing data entry outside of R. + But if you do need to record a small snippet of data, clearly lay it out using `tibble::tribble()`. + +- If you discover an error in a data file, never modify it directly, but instead write code to correct the value. + Explain why you made the fix. + +- Before you finish for the day, make sure you can render the notebook. + If you're using caching, make sure to clear the caches. + That will let you fix any problems while the code is still fresh in your mind. + +- If you want your code to be reproducible in the long-run (i.e. so you can come back to run it next month or next year), you'll need to track the versions of the packages that your code uses. + A rigorous approach is to use **renv**, , which stores packages in your project directory. + A quick and dirty hack is to include a chunk that runs `sessionInfo()` --- that won't let you easily recreate your packages as they are today, but at least you'll know what they were. + +- You are going to create many, many, many analysis notebooks over the course of your career. + How are you going to organize them so you can find them again in the future? + We recommend storing them in individual projects, and coming up with a good naming scheme. + ## Learning more Quarto is still relatively young, and is still growing rapidly. 
diff --git a/quarto/markdown.qmd b/quarto/markdown.qmd index 538c47b..cd4f612 100644 --- a/quarto/markdown.qmd +++ b/quarto/markdown.qmd @@ -1,6 +1,10 @@ ## Text formatting -*italic* **bold** [underline]{.underline} ~~strikeout~~ [small caps]{.smallcaps} `code` superscript^2^ and subscript~2~ +*italic* **bold** ~~strikeout~~ `code` + +superscript^2^ subscript~2~ + +[underline]{.underline} [small caps]{.smallcaps} ## Headings diff --git a/screenshots/import-spreadsheets-penguins-species.png b/screenshots/import-spreadsheets-penguins-species.png deleted file mode 100644 index 1dde841..0000000 Binary files a/screenshots/import-spreadsheets-penguins-species.png and /dev/null differ diff --git a/spreadsheets.qmd b/spreadsheets.qmd index c2f61af..434efe7 100644 --- a/spreadsheets.qmd +++ b/spreadsheets.qmd @@ -55,6 +55,7 @@ For the rest of the chapter we will focus on using `read_excel()`. ```{r} #| label: fig-students-excel #| echo: false +#| fig-width: 5 #| fig-cap: > #| Spreadsheet called students.xlsx in Excel. #| fig-alt: > @@ -386,6 +387,7 @@ These can be turned off by setting `col_names` and `format_headers` arguments to ```{r} #| label: fig-bake-sale-excel #| echo: false +#| fig-width: 5 #| fig-cap: > #| Spreadsheet called bake_sale.xlsx in Excel. #| fig-alt: > @@ -405,84 +407,22 @@ read_excel("data/bake-sale.xlsx") ### Formatted output The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you're interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the **openxlsx** package. +We won't go into the details of using this package here, but we recommend reading for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx. + Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar. 
For example, function names are camelCase, multiple functions can't be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse. However, this is ok. -As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might need to use to accomplish specific goals in R. +As your R learning and usage expands outside of this book you will encounter lots of different styles used in various R packages that you might use to accomplish specific goals in R. A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package. -Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the `penguins` data frame. - -```{r} -#| message: false - -library(openxlsx) -library(palmerpenguins) - -# Create a workbook (spreadsheet) -penguins_species <- createWorkbook() - -# Add three sheets to the spreadsheet -addWorksheet(penguins_species, sheetName = "Adelie") -addWorksheet(penguins_species, sheetName = "Gentoo") -addWorksheet(penguins_species, sheetName = "Chinstrap") - -# Write data to each sheet -writeDataTable( - penguins_species, - sheet = "Adelie", - x = penguins |> filter(species == "Adelie") -) -writeDataTable( - penguins_species, - sheet = "Gentoo", - x = penguins |> filter(species == "Gentoo") -) -writeDataTable( - penguins_species, - sheet = "Chinstrap", - x = penguins |> filter(species == "Chinstrap") -) -``` - -This creates a workbook object: - -```{r} -penguins_species -``` - -And we can write this to this with `saveWorkbook()`. - -```{r} -#| eval: false - -saveWorkbook(penguins_species, "data/penguins-species.xlsx") -``` - -The resulting spreadsheet is shown in @fig-penguins-species. 
-By default, openxlsx formats the data as an Excel table. - -```{r} -#| label: fig-penguins-species -#| echo: false -#| fig-cap: > -#| Spreadsheet called penguins.xlsx in Excel. -#| fig-alt: > -#| A look at the penguins spreadsheet in Excel. The spreadsheet contains has -#| three sheets: Torgersen Island, Biscoe Island, and Dream Island. - -knitr::include_graphics("screenshots/import-spreadsheets-penguins-species.png") -``` - -See for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx. - ### Exercises 1. In an Excel file, create the following dataset and save it as `survey.xlsx`. - Alternatively, you can download it as an Excel file from [here](https://docs.google.com/spreadsheets/d/1yc5gL-a2OOBr8M7B3IsDNX5uR17vBHOyWZq6xSTG2G8/edit?usp=sharing). + Alternatively, you can download it as an Excel file from [here](https://docs.google.com/spreadsheets/d/1yc5gL-a2OOBr8M7B3IsDNX5uR17vBHOyWZq6xSTG2G8). ```{r} #| echo: false + #| fig-width: 4 #| fig-alt: > #| A spreadsheet with 3 columns (group, subgroup, and id) and 12 rows. #| The group column has two values: 1 (spanning 7 merged rows) and 2 @@ -512,10 +452,11 @@ See for an extensive ``` 2. In another Excel file, create the following dataset and save it as `roster.xlsx`. - Alternatively, you can download it as an Excel file from [here](https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE/edit?usp=sharing). + Alternatively, you can download it as an Excel file from [here](https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE). ```{r} #| echo: false + #| fig-width: 4 #| fig-alt: > #| A spreadsheet with 3 columns (group, subgroup, and id) and 12 rows. The #| group column has two values: 1 (spanning 7 merged rows) and 2 (spanning @@ -540,7 +481,7 @@ See for an extensive ``` 3. In a new Excel file, create the following dataset and save it as `sales.xlsx`. 
- Alternatively, you can download it as an Excel file from [here](https://docs.google.com/spreadsheets/d/1oCqdXUNO8JR3Pca8fHfiz_WXWxMuZAp3YiYFaKze5V0/edit?usp=sharing). + Alternatively, you can download it as an Excel file from [here](https://docs.google.com/spreadsheets/d/1oCqdXUNO8JR3Pca8fHfiz_WXWxMuZAp3YiYFaKze5V0). ```{r} #| echo: false @@ -647,7 +588,8 @@ gs4_deauth() ``` ```{r} -students <- read_sheet("https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/edit?usp=sharing") +students_url <- "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w" +students <- read_sheet(students_url) ``` `read_sheet()` will read the file in as a tibble. @@ -660,7 +602,7 @@ Just like we did with `read_excel()`, we can supply column names, NA strings, an ```{r} students <- read_sheet( - "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w/edit?usp=sharing", + students_url, col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"), skip = 1, na = c("", "N/A"), @@ -681,13 +623,14 @@ It's also possible to read individual sheets from Google Sheets as well. Let's read the penguins Google Sheet at , and specifically the "Torgersen Island" sheet in it. ```{r} -read_sheet("https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/edit?usp=sharing", sheet = "Torgersen Island") +penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY" +read_sheet(penguins_url, sheet = "Torgersen Island") ``` You can obtain a list of all sheets within a Google Sheet with `sheet_names()`: ```{r} -sheet_names("https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY/edit?usp=sharing") +sheet_names(penguins_url) ``` Finally, just like with `read_excel()`, we can read in a portion of a Google Sheet by defining a `range` in `read_sheet()`. 
@@ -740,7 +683,7 @@ For further authentication details, we recommend reading the documentation googl #| echo: false #| message: false - read_sheet("https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE/edit#gid=0") |> + read_sheet("https://docs.google.com/spreadsheets/d/1LgZ0Bkg9d_NK8uTdP2uHXm07kAlwx8-Ictf8NocebIE/") |> fill(group, subgroup) |> print(n = 12) ```