Technical review comments for import (#1345)

Includes fix for #1342
Hadley Wickham
2023-03-08 07:30:03 -06:00
committed by GitHub
parent 424665c929
commit 08c3cdf6f2
6 changed files with 102 additions and 150 deletions

@@ -10,9 +10,8 @@ status("complete")
## Introduction
So far, you have learned about importing data from plain text files, e.g. `.csv` and `.tsv` files.
Sometimes you need to analyze data that lives in a spreadsheet.
This chapter will introduce you to tools for working with data in Excel spreadsheets and Google Sheets.
In @sec-data-import you learned about importing data from plain text files like `.csv` and `.tsv`.
Now it's time to learn how to get data out of a spreadsheet, either an Excel spreadsheet or a Google Sheet.
This will build on much of what you've learned in @sec-data-import, but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: <https://doi.org/10.1080/00031305.2017.1375989>.
@@ -24,18 +23,16 @@ The best practices presented in this paper will save you much headache when you
In this section, you'll learn how to load data from Excel spreadsheets in R with the **readxl** package.
This package is non-core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.
Later, we'll also use the writexl package, which allows us to create Excel spreadsheets.
```{r}
#| message: false
library(readxl)
library(tidyverse)
```
**openxlsx**, **xlsx**, and **XLConnect** can also be used for reading data from and writing data to Excel spreadsheets.
We will discuss openxlsx in @sec-writing-to-excel.
The latter two packages require Java installed on your machine and the rJava package.
Due to potential challenges with installation, we recommend using alternative packages we're introducing in this chapter.
library(writexl)
```
### Getting started
@@ -201,6 +198,7 @@ knitr::include_graphics("screenshots/import-spreadsheets-penguins-islands.png")
```
You can read a single worksheet from a spreadsheet with the `sheet` argument in `read_excel()`.
The default, which we've been relying on up until now, is the first sheet.
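If you don't know the sheet names ahead of time, you can list them first. The following is a minimal sketch, assuming the `datasets.xlsx` example workbook that ships with readxl as a stand-in for the penguins spreadsheet used in this chapter:

```r
library(readxl)

# Assumption: datasets.xlsx is an example workbook bundled with readxl,
# used here in place of the chapter's data/penguins.xlsx.
path <- readxl_example("datasets.xlsx")

# List the worksheet names in the order they appear in the workbook.
sheets <- excel_sheets(path)
sheets

# With no `sheet` argument, read_excel() reads the first worksheet.
first_sheet <- read_excel(path)

# A worksheet can be selected by name or by position.
by_name <- read_excel(path, sheet = "mtcars")
by_position <- read_excel(path, sheet = 2)
```

Selecting by name is usually the safer choice, since it keeps working if someone reorders the sheets.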
```{r}
read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
@@ -280,45 +278,19 @@ deaths
```
The top four rows and the bottom four rows are not part of the data frame.
We could skip the top four rows with `skip`.
Note that we set `skip = 4` because the fifth row contains the column names, which `read_excel()` then uses as the header.
```{r}
read_excel(deaths_path, skip = 4)
```
We could also set `n_max` to omit the extraneous rows at the bottom.
```{r}
read_excel(deaths_path, skip = 4, n_max = 10)
```
Another approach is using cell ranges.
It's possible to eliminate these extraneous rows using the `skip` and `n_max` arguments, but we recommend using cell ranges.
In Excel, the top left cell is `A1`.
As you move across columns to the right, the cell label moves down the alphabet, i.e. `B1`, `C1`, etc.
And as you move down a column, the number in the cell label increases, i.e. `A2`, `A3`, etc.
The data we want to read in starts in cell `A5` and ends in cell `F15`.
In spreadsheet notation, this is `A5:F15`.
Here the data we want to read in starts in cell `A5` and ends in cell `F15`.
In spreadsheet notation, this is `A5:F15`, which we supply to the `range` argument:
- Supply this information to the `range` argument:
```{r}
#| results: "hide"
read_excel(deaths_path, range = "A5:F15")
```
- Specify rows:
```{r}
#| results: "hide"
read_excel(deaths_path, range = cell_rows(c(5, 15)))
```
```{r}
read_excel(deaths_path, range = "A5:F15")
```
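To see how these approaches relate, here is a small sketch that reads the same rectangle three ways and compares the results. It assumes that `deaths_path` points at the `deaths.xlsx` example bundled with readxl; if your file has a different layout, the row numbers will differ.

```r
library(readxl)

# Assumption: readxl's bundled deaths.xlsx has the same layout as the
# file referred to as `deaths_path` in this chapter.
deaths_path <- readxl_example("deaths.xlsx")

# Skip the notes above the header row, then read at most 10 data rows.
by_skip <- read_excel(deaths_path, skip = 4, n_max = 10)

# Describe the same rectangle (header row plus 10 data rows) as a cell range.
by_range <- read_excel(deaths_path, range = "A5:F15")

# Or constrain only the rows and let readxl work out the columns.
by_rows <- read_excel(deaths_path, range = cell_rows(c(5, 15)))

dim(by_range)
all.equal(by_skip, by_range)
```

A cell range states the rectangle in one place, which tends to be easier to check against the spreadsheet than a `skip` and `n_max` pair.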
### Data types
@@ -326,17 +298,17 @@ In CSV files, all values are strings.
This is not particularly true to the data, but it is simple: everything is a string.
The underlying data in Excel spreadsheets is more complex.
A cell can be one of five things:
A cell can be one of four things:
- A boolean, like TRUE, FALSE, or NA
- A boolean, like `TRUE`, `FALSE`, or `NA`.
- A number, like "10" or "10.5"
- A number, like "10" or "10.5".
- A datetime, which can also include time like "11/1/21" or "11/1/21 3:00 PM"
- A datetime, which can also include time like "11/1/21" or "11/1/21 3:00 PM".
- A text string, like "ten"
- A text string, like "ten".
When working with spreadsheet data, it's important to keep in mind that how the underlying data is stored can be very different than what you see in the cell.
When working with spreadsheet data, it's important to keep in mind that the underlying data can be very different than what you see in the cell.
For example, Excel has no notion of an integer.
All numbers are stored as floating-point values, but you can choose to display the data with a customizable number of decimal places.
Similarly, dates are actually stored as numbers, specifically the number of days since a fixed origin (in Excel's default 1900 date system), with times stored as fractions of a day.
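You rarely need to convert these numbers yourself because `read_excel()` does it for you, but seeing the arithmetic can help when a date column arrives as plain numbers. This is a base R sketch that assumes the workbook uses Excel's default 1900 date system, for which the conventional origin in R is `"1899-12-30"`; workbooks using the 1904 date system need a different origin.

```r
# Assumption: serial numbers come from Excel's default 1900 date system.
# The origin "1899-12-30" accounts for Excel's historical 1900 leap-year quirk.
excel_origin <- "1899-12-30"

# A whole number is a date; the fractional part is the time of day.
serial <- c(44501, 44501.625)

# Dates: drop the fraction and count days from the origin.
dates <- as.Date(floor(serial), origin = excel_origin)
dates

# Date-times: convert days to seconds (86400 per day) before adding to the origin.
datetimes <- as.POSIXct(serial * 86400, origin = excel_origin, tz = "UTC")
datetimes
```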
@@ -353,8 +325,8 @@ In these cases you can set the type for this column to `"list"`, which will load
### Data not in cell values
**tidyxl** is useful for importing non-tabular data from Excel files into R.
For example, tidyxl doesn't coerce a pivot table into a data frame.
Sometimes data is stored in more exotic ways, like the color of the cell background, or whether or not the text is bold.
In such cases, you might find the [tidyxl package](https://nacnudus.github.io/tidyxl/) useful.
See <https://nacnudus.github.io/spreadsheet-munging-strategies/> for more on strategies for working with non-tabular data from Excel.
### Writing to Excel {#sec-writing-to-excel}
@@ -371,12 +343,11 @@ bake_sale <- tibble(
bake_sale
```
You can write data back to disk as an Excel file using the `write_xlsx()` function from the **writexl** package.
You can write data back to disk as an Excel file using the `write_xlsx()` function from the [writexl package](https://docs.ropensci.org/writexl/):
```{r}
#| eval: false
library(writexl)
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
```
@@ -406,7 +377,7 @@ read_excel("data/bake-sale.xlsx")
### Formatted output
The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you're interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the **openxlsx** package.
The writexl package is a light-weight solution for writing a simple Excel spreadsheet, but if you're interested in additional features like writing to sheets within a spreadsheet and styling, you will want to use the [openxlsx package](https://ycphs.github.io/openxlsx).
We won't go into the details of using this package here, but we recommend reading <https://ycphs.github.io/openxlsx/articles/Formatting.html> for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.
Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar.
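To give a flavor of that workflow, here is a minimal sketch that writes a data frame to a named sheet and makes the header row bold. The data frame is a hypothetical stand-in for `bake_sale`, and the output goes to a temporary file; treat this as one possible starting point rather than a recommended recipe.

```r
library(openxlsx)

# Hypothetical data standing in for the chapter's bake_sale tibble.
bake_sale <- data.frame(
  item = c("brownie", "cupcake", "cookie"),
  quantity = c(10, 5, 8)
)

# Build the workbook in memory: create it, add a sheet, then write the data.
wb <- createWorkbook()
addWorksheet(wb, "bake_sale")
writeData(wb, sheet = "bake_sale", x = bake_sale)

# Style the header row (row 1) across every column of the data.
header_style <- createStyle(textDecoration = "bold")
addStyle(
  wb,
  sheet = "bake_sale",
  style = header_style,
  rows = 1,
  cols = seq_along(bake_sale),
  gridExpand = TRUE
)

# Nothing touches disk until the workbook is saved.
path <- tempfile(fileext = ".xlsx")
saveWorkbook(wb, path, overwrite = TRUE)
```

Unlike `write_xlsx()`, which writes a file in a single call, openxlsx builds up a workbook object step by step, which is what makes per-cell styling possible.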
@@ -578,8 +549,7 @@ This is the same dataset as in @fig-students-excel, except it's stored in a Goog
knitr::include_graphics("screenshots/import-googlesheets-students.png")
```
The first argument to `read_sheet()` is the URL of the file to read.
You can also access this file via <https://pos.it/r4ds-students>; note, however, that at the time of writing you can't read a sheet directly from a short link.
The first argument to `read_sheet()` is the URL of the file to read, and it returns a tibble:
```{r}
#| include: false
@@ -590,11 +560,6 @@ gs4_deauth()
```{r}
students_url <- "https://docs.google.com/spreadsheets/d/1V1nPp1tzOuutXFLb3G9Eyxi3qxeEhnOXUzL5_BcCQ0w"
students <- read_sheet(students_url)
```
`read_sheet()` will read the file in as a tibble.
```{r}
students
```
@@ -606,12 +571,8 @@ students <- read_sheet(
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = c("dcccc")
) |>
mutate(
age = if_else(age == "five", "5", age),
age = parse_number(age)
)
col_types = "dcccc"
)
students
```
@@ -620,7 +581,7 @@ Note that we defined column types a bit differently here, using short codes.
For example, "dcccc" stands for "double, character, character, character, character".
It's also possible to read individual sheets from a Google Sheet.
Let's read the penguins Google Sheet at <https://pos.it/r4ds-penguins>, and specifically the "Torgersen Island" sheet in it.
Let's read the "Torgersen Island" sheet from the [penguins Google Sheet](https://pos.it/r4ds-penguins):
```{r}
penguins_url <- "https://docs.google.com/spreadsheets/d/1aFu8lnD_g0yjF5O-K6SFgSEWiHPpgvFCF0NY9D6LXnY"
@@ -644,7 +605,8 @@ deaths
### Write sheets
You can write from R to Google Sheets with `write_sheet()`:
You can write from R to Google Sheets with `write_sheet()`.
The first argument is the data frame to write, and the second argument is the name (or other identifier) of the Google Sheet to write to:
```{r}
#| eval: false