Draft/outline of spreadsheets (#949)

* Add draft/outline of spreadsheets

* Finish reading from Excel

* Write to Excel + consistency edits

* Add pkgs for writing Excel files

* Release the 🐧s

* Reorganize to highlight tibble/data.frame diffs

* Write again

* Add spreadsheets reference + rename_with janitor

* Move janitor to TO DO in rectangular data

* Need to load tidyverse

* Show csv file

* Add TO DO note

* Use students instead of challenge file in this chp

* Quiet down some of the redundant messages

* Add bit on reading in multiple files

* Fix up students example

* Revert back to old dataset to match spreadsheets chapter

* Fix typos

* Use US spelling

* Comments from Hadley

* Incorporate Hadley's suggestions

Co-authored-by: Hadley Wickham <h.wickham@gmail.com>
Mine Cetinkaya-Rundel 2021-10-26 15:00:33 -04:00 committed by GitHub
parent 917864316d
commit 3ace9ce49b
20 changed files with 1076 additions and 493 deletions


@ -27,7 +27,12 @@ Imports:
pryr,
stringr,
tidyverse,
viridis
viridis,
readxl,
writexl,
openxlsx,
palmerpenguins,
janitor
Suggests:
bookdown,
bslib,


@ -38,19 +38,36 @@ These functions all have similar syntax: once you've mastered one, you can use t
For the rest of this chapter we'll focus on `read_csv()`.
Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
## Reading data from a file
Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like.
```{r echo = FALSE, message = FALSE}
read_lines("data/students.csv") %>% cat(sep = "\n")
```
Note that the `,`s separate the columns.
Table \@ref(tab:students-table) shows a representation of the same data as a table.
```{r students-table, echo = FALSE, message = FALSE}
read_csv("data/students.csv") %>%
knitr::kable(caption = "Data from the students.csv file as a table.")
```
The first argument to `read_csv()` is the most important: it's the path to the file to read.
```{r, message = TRUE}
heights <- read_csv("data/heights.csv")
students <- read_csv("data/students.csv")
```
When you run `read_csv()` it prints out a column specification that gives the name and type of each column.
That's an important part of readr, which we'll come back to in [parsing a file].
When you run `read_csv()` it prints out a message that tells you how many rows (excluding the header row) and columns the data has, along with the delimiter used and the column specifications (names of columns organized by the type of data the column contains).
It also prints out some information about how to retrieve the full column specification as well as how to quiet this message.
This message is an important part of readr, which we'll come back to in Section \@ref(parsing-a-file) on parsing a file.
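For example, assuming readr 2.0 or later, you can retrieve the full specification with `spec()` and silence the message with the `show_col_types` argument:

```{r, eval = FALSE}
# Retrieve the full column specification readr guessed
spec(students)

# Suppress the column-specification message on read
students <- read_csv("data/students.csv", show_col_types = FALSE)
```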
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
```{r}
```{r message = FALSE}
read_csv("a,b,c
1,2,3
4,5,6")
@ -62,7 +79,7 @@ There are two cases where you might want to tweak this behaviour:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
```{r}
```{r message = FALSE}
read_csv("The first line of metadata
The second line of metadata
x,y,z
@ -76,21 +93,21 @@ There are two cases where you might want to tweak this behaviour:
2. The data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
```{r}
```{r message = FALSE}
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [string basics].)
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in Chapter \@ref(strings).)
Alternatively you can pass `col_names` a character vector which will be used as the column names:
```{r}
```{r message = FALSE}
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```
Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:
```{r}
```{r message = FALSE}
read_csv("a,b,c\n1,2,.", na = ".")
```
@ -98,6 +115,50 @@ This is all you need to know to read \~75% of CSV files that you'll encounter in
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
### First steps
Let's take another look at the `students` data.
In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
This is something we can address using the `na` argument.
```{r message = FALSE}
students <- read_csv("data/students.csv", na = c("N/A", ""))
students
```
Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis.
For example, the column names in the `students` file we read in are formatted in non-standard ways.
You might consider renaming them one by one with `dplyr::rename()` or you might use the `janitor::clean_names()` function to turn them all into snake case at once.[^data-import-1]
This function takes in a data frame and returns a data frame with variable names converted to snake case.
[^data-import-1]: The [janitor](http://sfirke.github.io/janitor/) package is not part of the tidyverse, but it offers handy functions for data cleaning and works well within data pipelines that use `%>%`.
```{r message = FALSE}
library(janitor)
students %>%
clean_names()
```
Another common task after reading in data is to consider the variable types.
For example, `meal_plan` is a categorical variable with a known set of possible values.
In R, factors can be used to work with categorical variables.
We can convert this variable to a factor using the `factor()` function.
You'll learn more about factors in Chapter \@ref(factors).
```{r}
students <- students %>%
clean_names() %>%
mutate(meal_plan = factor(meal_plan))
students
```
Note that the values in the `meal_plan` variable have stayed exactly the same, but the type of variable denoted underneath the variable name has changed from character (`<chr>`) to factor (`<fct>`).
Before you move on to analyzing these data, you'll probably want to fix the `age` column as well: currently it's a character variable because of the one observation that is typed out as `five` instead of a numeric `5`.
We discuss how to fix this issue in Chapter \@ref(import-spreadsheets).
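If you'd rather not wait, one possible fix is sketched below. It mirrors the approach used in that chapter: replace the spelled-out value, then re-parse the column as a number.

```{r, eval = FALSE}
students %>%
  mutate(
    age = if_else(age == "five", "5", age),
    age = parse_number(age)
  )
```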
### Compared to base R
If you've used R before, you might wonder why we're not using `read.csv()`.
@ -142,453 +203,29 @@ There are a few good reasons to favour readr functions over the base equivalents
read_csv("a;b\n1;3")
```
## Parsing a vector
## Reading data from multiple files
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialised vector like a logical, integer, or date:
Sometimes your data is split across multiple files instead of being contained in a single file.
For example, you might have sales data for multiple months, with each month's data in a separate file: `01-sales.csv` for January, `02-sales.csv` for February, and `03-sales.csv` for March.
With `read_csv()` you can read all of these files in at once and stack them on top of each other in a single data frame.
```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
sales_files <- c("data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv")
read_csv(sales_files, id = "file")
```
These functions are useful in their own right, but are also an important building block for readr.
Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
With the additional `id` argument we have added a new column called `file` to the resulting data frame that identifies the file the data come from.
This is especially helpful in circumstances where the files you're reading in do not have an identifying column that can help you trace the observations back to their original sources.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing:
If you have many files you want to read in, it can get cumbersome to write out their names individually.
Instead, you can use the `dir_ls()` function from the fs package to find the files for you by matching a pattern in the file names.
```{r}
parse_integer(c("1", "231", ".", "456"), na = ".")
library(fs)
sales_files <- dir_ls("data", glob = "*sales.csv")
sales_files
```
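Once `dir_ls()` has found the files, you can pass the result straight to `read_csv()` as before; the sketch below simply combines the two steps.

```{r, eval = FALSE}
sales_files <- dir_ls("data", glob = "*sales.csv")
read_csv(sales_files, id = "file")
```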
If parsing fails, you'll get a warning:
```{r}
x <- parse_integer(c("123", "345", "abc", "123.45"))
```
And the failures will be missing in the output:
```{r}
x
```
If there are many parsing failures, you'll need to use `problems()` to get the complete set.
This returns a tibble, which you can then manipulate with dplyr.
```{r}
problems(x)
```
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input.
There are eight particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so I won't describe them here further.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
3. `parse_character()` seems so simple that it shouldn't be necessary.
But one complication makes it quite important: character encodings.
4. `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
5. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date & time specifications.
These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
### Numbers
It seems like it should be straightforward to parse a number, but three problems make it tricky:
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
2. Numbers are often surrounded by other characters that provide some context, like "\$1000" or "10%".
3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character you use for the decimal mark.
You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English).
An alternative approach would be to try and guess the defaults from your operating system.
This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number.
This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
```{r}
parse_number("$100")
parse_number("20%")
parse_number("It cost $123.45")
```
The final problem is addressed by the combination of `parse_number()` and the locale as `parse_number()` will ignore the "grouping mark":
```{r}
# Used in America
parse_number("$123,456,789")
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
```
### Strings {#readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don't understand UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
```{r}
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x1
x2
```
To fix the problem you need to specify the encoding in `parse_character()`:
```{r}
parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
```
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
```{r}
guess_encoding(charToRaw(x1))
guess_encoding(charToRaw(x2))
```
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Factors {#readr-factors}
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
```{r}
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
```
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#readr-strings) and [factors](#readr-factors) to clean them up.
### Dates, date-times, and times {#readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments:
- `parse_datetime()` expects an ISO8601 date-time.
ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
```{r}
parse_datetime("2010-10-01T2010")
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
```
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading <https://en.wikipedia.org/wiki/ISO_8601>
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
```{r}
parse_date("2010-10-01")
```
- `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier:
```{r}
library(hms)
parse_time("01:10 am")
parse_time("20:10:01")
```
Base R doesn't have a great built-in class for time data, so we use the one provided in the hms package.
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -\> 2000-2069, 70-99 -\> 1970-1999.
Month
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").
Day
: `%d` (2 digits).
: `%e` (optional leading space).
Time
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`).
Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time.
It is *not* Eastern Standard Time!
We'll come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
Non-digits
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions.
For example:
```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```
If you're using `%b` or `%B` with non-English month names, you'll need to set the `date_names` argument of `locale()`.
See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```
### Exercises
1. What are the most important arguments to `locale()`?
2. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
3. I didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
5. What's the difference between `read_csv()` and `read_csv2()`?
6. What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
7. Generate the correct format string to parse each of the following dates and times:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
```
## Parsing a file
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
1. How readr automatically guesses the type of each column.
2. How to override the default specification.
### Strategy
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess each column's type.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
```{r}
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser(c("12,352,561"))
str(parse_guess("2010-10-10"))
```
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and `-`).
- double: contains only valid doubles (including numbers like `4.5e-5`).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default `time_format`.
- date: matches the default `date_format`.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
### Problems
These defaults don't always work for larger files.
There are two basic problems:
1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general.
For example, you might have a column of doubles that only contains integers in the first 1000 rows.
2. The column might contain a lot of missing values.
If the first 1000 rows contain only `NA`s, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
```{r}
challenge <- read_csv(readr_example("challenge.csv"))
```
(Note the use of `readr_example()` which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures.
It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
```{r}
problems(challenge)
```
A good strategy is to work column by column until there are no problems remaining.
Here we can see that there are a lot of parsing problems with the `y` column.
If we look at the last few rows, you'll see that they're dates stored in a character vector:
```{r}
tail(challenge)
```
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r, eval = FALSE}
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_logical()
)
)
```
Then you can fix the type of the `y` column by specifying that `y` is a date column:
```{r}
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
```
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
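As a sketch of that strict workflow, using the same example file as above:

```{r, eval = FALSE}
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
stop_for_problems(challenge)  # throws an error if any value failed to parse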
### Other strategies
There are a few other general strategies to help you parse files:
- In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
challenge2
```
- Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
```
This is particularly useful in conjunction with `type_convert()`, which applies the parsing heuristics to the character columns in a data frame.
```{r}
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)
df
# Note the column types
type_convert(df)
```
- If you're reading a very large file, you might want to set `n_max` to a smallish number like 10,000 or 100,000.
That will accelerate your iterations while you eliminate common problems.
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
## Writing to a file
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
@ -600,19 +237,19 @@ Both functions increase the chances of the output file being read back in correc
If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it).
The most important arguments are `x` (the data frame to save), and `file` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
```{r, eval = FALSE}
write_csv(challenge, "challenge.csv")
write_csv(students, "students.csv")
```
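The Excel-friendly variant mentioned above works the same way; a minimal sketch:

```{r, eval = FALSE}
# Same interface as write_csv(), but prepends a UTF-8 byte order mark for Excel
write_excel_csv(students, "students.csv")
```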
Note that the type information is lost when you save to csv:
```{r, warning = FALSE}
challenge
write_csv(challenge, "challenge-2.csv")
read_csv("challenge-2.csv")
```{r, warning = FALSE, message = FALSE}
students
write_csv(students, "students-2.csv")
read_csv("students-2.csv")
```
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load the data in.
@ -622,34 +259,33 @@ There are two alternatives:
These store data in R's custom binary format called RDS:
```{r}
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
write_rds(students, "students.rds")
read_rds("students.rds")
```
2. The feather package implements a fast binary file format that can be shared across programming languages:
```{r, eval = FALSE}
library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 <NA>
#> 2 4172 <NA>
#> 3 3004 <NA>
#> 4 787 <NA>
#> 5 37 <NA>
#> 6 2332 <NA>
#> # ... with 1,994 more rows
write_feather(students, "students.feather")
read_feather("students.feather")
#> # A tibble: 6 × 5
#> student_id full_name favourite_food meal_plan age
#> <dbl> <chr> <chr> <fct> <dbl>
#> 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
#> 2 2 Barclay Lynn French fries Lunch only 5
#> 3 3 Jayendra Lyne NA Breakfast and lunch 7
#> 4 4 Leon Rossini Anchovies Lunch only NA
#> 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch 5
#> 6 6 Güvenç Attila Ice cream Lunch only 6
```
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in Chapter \@ref(list-columns)); feather currently does not.
```{r, include = FALSE}
file.remove("challenge-2.csv")
file.remove("challenge.rds")
file.remove("students-2.csv")
file.remove("students.rds")
```
## Other types of data
@ -658,11 +294,16 @@ To get other types of data into R, we recommend starting with the tidyverse pack
They're certainly not perfect, but they are a good place to start.
For rectangular data:
- **haven** reads SPSS, Stata, and SAS files.
- **readxl** reads Excel files (both `.xls` and `.xlsx`).
See Chapter \@ref(import-spreadsheets) for more on working with data stored in Excel spreadsheets.
- **readxl** reads excel files (both `.xls` and `.xlsx`).
- **googlesheets4** reads Google Sheets.
Also see Chapter \@ref(import-spreadsheets) for more on working with data stored in Google Sheets.
- **DBI**, along with a database-specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL**, etc.) allows you to run SQL queries against a database and return a data frame.
  See Chapter \@ref(import-databases) for more on working with databases.
- **haven** reads SPSS, Stata, and SAS files.
For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <https://jennybc.github.io/purrr-tutorial/>.
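As a quick, hedged illustration of the jsonlite approach (the inline JSON here is made up for the example):

```{r, eval = FALSE}
library(jsonlite)
# A JSON array of objects becomes a data frame
fromJSON('[{"x": 1, "y": "a"}, {"x": 2, "y": "b"}]')
```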

data/01-sales.csv (new file)

@ -0,0 +1,8 @@
month,year,brand,item,n
January,2019,1,1234,3
January,2019,1,8721,9
January,2019,1,1822,2
January,2019,2,3333,1
January,2019,2,2156,9
January,2019,2,3987,6
January,2019,2,3827,6

data/02-sales.csv (new file)

@ -0,0 +1,7 @@
month,year,brand,item,n
February,2019,1,1234,8
February,2019,1,8721,2
February,2019,1,1822,3
February,2019,2,3333,1
February,2019,2,2156,3
February,2019,2,3987,6

data/03-sales.csv (new file)

@ -0,0 +1,7 @@
month,year,brand,item,n
March,2019,1,1234,3
March,2019,1,3627,1
March,2019,1,8820,3
March,2019,2,7253,1
March,2019,2,8766,3
March,2019,2,8288,6

data/bake-sale.xlsx (new binary file; not shown)

data/penguins.xlsx (new binary file; not shown)

data/penguins_species.xlsx (new binary file; not shown)

data/students.csv (new file)

@ -0,0 +1,7 @@
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6

data/students.xlsx (new binary file; not shown)

Five new image files added (binary, not shown): 497 KiB, 984 KiB, 963 KiB, 794 KiB, 709 KiB.


@ -1,7 +1,467 @@
# Rectangular data {#import-rectangular}
Things that should be mentioned in this chapter:
- `rename_with`, use with janitor example: "Alternatively, we can also read the data in first and then rename the columns to follow the `snake_case` format with the `make_clean_names()` function from the **janitor** package. This is a handy approach if you have too many columns and don't want to write out the names of each, though it might not always result in the exact names you want for your columns, e.g. it won't shorten column names, it will only convert them to snake case."
- ...
<!--# Moved from original import chapter -->
```{r}
library(tidyverse)
```
## Parsing a vector
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialised vector like a logical, integer, or date:
```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
```
These functions are useful in their own right, but are also an important building block for readr.
Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing:
```{r}
parse_integer(c("1", "231", ".", "456"), na = ".")
```
If parsing fails, you'll get a warning:
```{r}
x <- parse_integer(c("123", "345", "abc", "123.45"))
```
And the failures will be missing in the output:
```{r}
x
```
If there are many parsing failures, you'll need to use `problems()` to get the complete set.
This returns a tibble, which you can then manipulate with dplyr.
```{r}
problems(x)
```
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input.
There are eight particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so I won't describe them here further.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
3. `parse_character()` seems so simple that it shouldn't be necessary.
But one complication makes it quite important: character encodings.
4. `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
5. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date & time specifications.
These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
### Numbers
It seems like it should be straightforward to parse a number, but three problems make it tricky:
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
2. Numbers are often surrounded by other characters that provide some context, like "\$1000" or "10%".
3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character you use for the decimal mark.
You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English).
An alternative approach would be to try and guess the defaults from your operating system.
This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number.
This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
```{r}
parse_number("$100")
parse_number("20%")
parse_number("It cost $123.45")
```
The final problem is addressed by the combination of `parse_number()` and the locale as `parse_number()` will ignore the "grouping mark":
```{r}
# Used in America
parse_number("$123,456,789")
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
```
### Strings {#readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don't understand UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
```{r}
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
x1
x2
```
To fix the problem you need to specify the encoding in `parse_character()`:
```{r}
parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
```
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
```{r}
guess_encoding(charToRaw(x1))
guess_encoding(charToRaw(x2))
```
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
### Factors {#readr-factors}
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
```{r}
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
```
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#readr-strings) and [factors](#readr-factors) to clean them up.
### Dates, date-times, and times {#readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments:
- `parse_datetime()` expects an ISO8601 date-time.
ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
```{r}
parse_datetime("2010-10-01T2010")
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
```
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading <https://en.wikipedia.org/wiki/ISO_8601>
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
```{r}
parse_date("2010-10-01")
```
- `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier:
```{r}
library(hms)
parse_time("01:10 am")
parse_time("20:10:01")
```
Base R doesn't have a great built-in class for time data, so we use the one provided in the hms package.
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.
Month
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").
Day
: `%d` (2 digits).
: `%e` (optional leading space).
Time
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`).
Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time.
It is *not* Eastern Standard Time!
We'll come back to this in \[time zones\].
: `%z` (as offset from UTC, e.g. `+0800`).
Non-digits
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions.
For example:
```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```
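These pieces compose; for example, here's a quick sketch mixing date and time components (separators in the format string are matched literally):

```{r}
parse_datetime("2015/06/30 14-05", "%Y/%m/%d %H-%M")
```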
If you're using `%b` or `%B` with non-English month names, you'll need to set the `date_names` argument of `locale()`.
See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```
### Exercises
1. What are the most important arguments to `locale()`?
2. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
3. I didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
5. What's the difference between `read_csv()` and `read_csv2()`?
6. What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
7. Generate the correct format string to parse each of the following dates and times:
```{r}
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
```
## Parsing a file
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
1. How readr automatically guesses the type of each column.
2. How to override the default specification.
### Strategy
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess each column's type.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
```{r}
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser(c("12,352,561"))
str(parse_guess("2010-10-10"))
```
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and `-`).
- double: contains only valid doubles (including numbers like `4.5e-5`).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default `time_format`.
- date: matches the default `date_format`.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
### Problems
These defaults don't always work for larger files.
There are two basic problems:
1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general.
For example, you might have a column of doubles that only contains integers in the first 1000 rows.
2. The column might contain a lot of missing values.
If the first 1000 rows contain only `NA`s, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
```{r}
challenge <- read_csv(readr_example("challenge.csv"))
```
(Note the use of `readr_example()` which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures.
It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
```{r}
problems(challenge)
```
A good strategy is to work column by column until there are no problems remaining.
Here we can see that there are a lot of parsing problems with the `y` column.
If we look at the last few rows, you'll see that they're dates stored in a character vector:
```{r}
tail(challenge)
```
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r, eval = FALSE}
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_logical()
)
)
```
Then you can fix the type of the `y` column by specifying that `y` is a date column:
```{r}
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
```
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
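`col_types` also accepts a compact string with one character per column (`d` = double, `D` = date, `c` = character, and so on), so the call above could be written more tersely:

```{r, eval = FALSE}
challenge <- read_csv(readr_example("challenge.csv"), col_types = "dD")
```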
### Other strategies
There are a few other general strategies to help you parse files:
- In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
challenge2
```
- Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:
```{r}
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
```
This is particularly useful in conjunction with `type_convert()`, which applies the parsing heuristics to the character columns in a data frame.
```{r}
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)
df
# Note the column types
type_convert(df)
```
- If you're reading a very large file, you might want to set `n_max` to a smallish number like 10,000 or 100,000.
That will accelerate your iterations while you eliminate common problems.
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
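A sketch of both fallbacks on the same example file:

```{r, eval = FALSE}
# Peek at the raw lines when parsing goes badly wrong
read_lines(readr_example("challenge.csv"), n_max = 5)

# Or read the whole file into a single string
read_file(readr_example("challenge.csv"))
```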
```{r, results = "asis", echo = FALSE}
status("drafting")
```
<!--# TO DO: Write chapter. -->


@ -1,5 +1,440 @@
# Spreadsheets {#import-spreadsheets}
## Introduction
So far you have learned about importing data from plain text files, e.g. `.csv` and `.tsv` files.
Sometimes you need to analyze data that lives in a spreadsheet.
In this chapter we will introduce you to tools for working with data in Excel spreadsheets and Google Sheets.
This will build on much of what you've learned in Chapter \@ref(data-import) and Chapter \@ref(import-rectangular), but we will also discuss additional considerations and complexities when working with data from spreadsheets.
If you or your collaborators are using spreadsheets for organizing data, we strongly recommend reading the paper "Data Organization in Spreadsheets" by Karl Broman and Kara Woo: <https://doi.org/10.1080/00031305.2017.1375989>.
The best practices presented in this paper will save you much headache down the line when you import the data from a spreadsheet into R to analyze and visualize.
## Excel
### Prerequisites
In this chapter, you'll learn how to load data from Excel spreadsheets in R with the **readxl** package.
This package is not part of the core tidyverse, so you need to load it explicitly, but it is installed automatically when you install the tidyverse package.
```{r}
library(readxl)
library(tidyverse)
```
**xlsx** and **XLConnect** can also be used for reading data from and writing data to Excel spreadsheets.
However, these two packages require Java installed on your machine as well as the rJava package.
Due to potential challenges with installation, we recommend the alternative packages introduced in this chapter.
### Getting started
Most of readxl's functions allow you to load Excel spreadsheets into R:
- `read_xls()` reads Excel files with `xls` format.
- `read_xlsx()` reads Excel files with `xlsx` format.
- `read_excel()` can read files with both `xls` and `xlsx` format. It guesses the file type based on the input.
These functions all have similar syntax, just like the other functions we have previously introduced for reading other types of files, e.g. `read_csv()`, `read_table()`, etc.
For the rest of the chapter we will focus on using `read_excel()`.
### Reading spreadsheets
Figure \@ref(fig:students-excel) shows what the spreadsheet we're going to read into R looks like in Excel.
```{r students-excel, fig.alt = "A look at the students spreadsheet in Excel. The spreadsheet contains information on 6 students, their ID, full name, favourite food, meal plan, and age.", fig.cap = "Spreadsheet called students.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-students.png")
```
The first argument to `read_excel()` is the path to the file to read.
```{r}
students <- read_excel("data/students.xlsx")
```
`read_excel()` will read the file in as a tibble.
```{r}
students
```
We have six students in the data and five variables on each student.
However, there are a few things we might want to address in this dataset:
1. The column names are all over the place.
You can provide column names that follow a consistent format with the `col_names` argument; we recommend `snake_case`.
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age")
)
```
Unfortunately, this didn't quite do the trick.
You now have the variable names we want, but what was previously the header row now shows up as the first observation in the data.
You can explicitly skip that row using the `skip` argument.
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1
)
```
2. In the `favourite_food` column, one of the observations is `N/A`, which stands for "not available" but it's currently not recognized as an `NA` (note the contrast between this `N/A` and the age of the fourth student in the list).
You can specify which character strings should be recognized as `NA`s with the `na` argument.
By default, only `""` (empty string, or, in the case of reading from a spreadsheet, an empty cell) is recognized as an `NA`.
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A")
)
```
3. One other remaining issue is that `age` is read in as a character variable, but it really should be numeric.
Just like with `read_csv()` and friends for reading data from flat files, you can supply a `col_types` argument to `read_excel()` and specify the column types for the variables you read in.
The syntax is a bit different, though.
Your options are `"skip"`, `"guess"`, `"logical"`, `"numeric"`, `"date"`, `"text"` or `"list"`.
```{r}
read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = c("numeric", "text", "text", "text", "numeric")
)
```
However, this didn't quite produce the desired result either.
By specifying that `age` should be numeric, we have turned the one cell with the non-numeric entry (which had the value `five`) into an `NA`.
In this case, we should read age in as `"text"` and then make the change once the data is loaded in R.
```{r}
students <- read_excel(
"data/students.xlsx",
col_names = c("student_id", "full_name", "favourite_food", "meal_plan", "age"),
skip = 1,
na = c("", "N/A"),
col_types = c("numeric", "text", "text", "text", "text")
)
students <- students %>%
mutate(
age = if_else(age == "five", "5", age),
age = parse_number(age)
)
students
```
It took us multiple steps and trial-and-error to load the data in exactly the format we want, and this is not unexpected.
Data science is an iterative process.
There is no way to know exactly what the data will look like until you load it and take a look at it.
Well, there is one way, actually.
You can open the file in Excel and take a peek.
That might be tempting, but we strongly recommend against it.
<!--# TO DO: Provide reason why it's not recommended. --> Instead, you should not be afraid of doing what we did here: load the data, take a peek, make adjustments to your code, load it again, and repeat until you're happy with the result.
### Reading individual sheets
An important feature that distinguishes spreadsheets from flat files is the notion of multiple sheets.
Figure \@ref(fig:penguins-islands) shows an Excel spreadsheet with multiple sheets.
The data come from the **palmerpenguins** package.
Each sheet contains information on penguins from a different island where data were collected.
```{r penguins-islands, fig.alt = "A look at the penguins spreadsheet in Excel. The spreadsheet contains three sheets: Torgersen Island, Biscoe Island, and Dream Island.", fig.cap = "Spreadsheet called penguins.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-penguins-islands.png")
```
You can read a single sheet from a spreadsheet with the `sheet` argument in `read_excel()`.
```{r}
read_excel("data/penguins.xlsx", sheet = "Torgersen Island")
```
Some variables that appear to contain numerical data are read in as characters due to the character string `"NA"` not being recognized as a true `NA`.
```{r}
penguins_torgersen <- read_excel("data/penguins.xlsx", sheet = "Torgersen Island", na = "NA")
penguins_torgersen
```
However, we cheated here a bit.
We looked inside the Excel spreadsheet, which is not a recommended workflow.
Instead, you can use `excel_sheets()` to get information on all sheets in an Excel spreadsheet, and then read the one(s) you're interested in.
```{r}
excel_sheets("data/penguins.xlsx")
```
Once you know the names of the sheets, you can read them in individually with `read_excel()`.
```{r}
penguins_biscoe <- read_excel("data/penguins.xlsx", sheet = "Biscoe Island", na = "NA")
penguins_dream <- read_excel("data/penguins.xlsx", sheet = "Dream Island", na = "NA")
```
In this case the full penguins dataset is spread across three sheets in the spreadsheet.
Each sheet has the same number of columns but different numbers of rows.
```{r}
dim(penguins_torgersen)
dim(penguins_biscoe)
dim(penguins_dream)
```
We can put them together with `bind_rows()`.
```{r}
penguins <- bind_rows(penguins_torgersen, penguins_biscoe, penguins_dream)
penguins
```
In Chapter \@ref(iteration) we'll talk about ways of doing this sort of task without repetitive code <!--# Check to make sure that's the right place to present it -->.
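As a quick preview, here is one way such a task might be condensed with purrr: map `read_excel()` over the sheet names returned by `excel_sheets()` and row-bind the results.
This is just a sketch; it assumes, as above, that every sheet uses the string `"NA"` for missing values and shares the same columns.

```{r}
# Read each sheet by name, then stack the resulting data frames
penguins_all <- excel_sheets("data/penguins.xlsx") %>%
  purrr::map_dfr(~ read_excel("data/penguins.xlsx", sheet = .x, na = "NA"))
dim(penguins_all)
```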
### Reading part of a sheet
Since many people use Excel spreadsheets for presentation as well as for data storage, it's quite common to find cell entries in a spreadsheet that are not part of the data you want to read into R.
Figure \@ref(fig:deaths-excel) shows such a spreadsheet: in the middle of the sheet is what looks like a data frame but there is extraneous text in cells above and below the data.
```{r deaths-excel, fig.alt = "A look at the deaths spreadsheet in Excel. The spreadsheet has four rows on top that contain non-data information; the text 'For the sake of consistency in the data layout, which is really a beautiful thing, I will keep making notes up here.' is spread across cells in these top four rows. Then, there is a data frame that includes information on deaths of 10 famous people, including their names, professions, ages, whether they have kids or not, and dates of birth and death. At the bottom, there are four more rows of non-data information; the text 'This has been really fun, but we're signing off now!' is spread across cells in these bottom four rows.", fig.cap = "Spreadsheet called deaths.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-deaths.png")
```
This spreadsheet is one of the example spreadsheets provided in the readxl package.
You can use the `readxl_example()` function to locate the spreadsheet on your system in the directory where the package is installed.
This function returns the path to the spreadsheet, which you can use in `read_excel()` as usual.
```{r}
deaths_path <- readxl_example("deaths.xlsx")
deaths <- read_excel(deaths_path)
deaths
```
The top four rows and the bottom four rows are not part of the data frame.
We can get rid of the rows at the top with `skip`.
Note that we set `skip = 4` since the first four rows contain front matter; the fifth row, which holds the column names, then becomes the header.
```{r}
read_excel(deaths_path, skip = 4)
```
We could also set `n_max` to omit the extraneous rows at the bottom.
```{r}
read_excel(deaths_path, skip = 4, n_max = 10)
```
Another approach is using cell ranges.
In Excel, the top left cell is `A1`.
As you move across columns to the right, the letter in the cell label advances through the alphabet, i.e.
`B1`, `C1`, etc.
And as you move down a column, the number in the cell label increases, i.e.
`A2`, `A3`, etc.
The data we want to read in starts in cell `A5` and ends in cell `F15`.
In spreadsheet notation, this is `A5:F15`.
There are a few ways to supply this range to `read_excel()`:

- Pass it as a string to the `range` argument:
```{r results = "hide"}
read_excel(deaths_path, range = "A5:F15")
```
- Specify rows:
```{r results = "hide"}
read_excel(deaths_path, range = cell_rows(c(5, 15)))
```
- Specify cells that mark the top-left and bottom-right corners of the data -- the top-left corner, `A5`, translates to `c(5, 1)` (5th row down, 1st column) and the bottom-right corner, `F15`, translates to `c(15, 6)`:
```{r results = "hide"}
read_excel(deaths_path, range = cell_limits(c(5, 1), c(15, 6)))
```
If you have control over the sheet, an even better way is to create a "named range".
This is useful within Excel because named ranges make repeated formulas easier to write, and they have some useful properties for creating dynamic charts and graphs as well.
Even if you're not working in Excel, named ranges can be useful for identifying which cells to read into R.
In the example above, the table we're reading in is named `Table1`, so we can read it in with the following.
**TO DO:** Add this once reading named ranges is implemented in readxl.
### Data types
In CSV files, all values are strings.
This is not particularly true to the data, but it is simple: everything is a string.
The underlying data in Excel spreadsheets is more complex.
A cell can be one of five things:
- A logical, like TRUE / FALSE
- A number, like "10" or "10.5"
- A date, which can also include time like "11/1/21" or "11/1/21 3:00 PM"
- A string, like "ten"
- A currency, which allows numeric values in a limited range and four decimal digits of fixed precision
When working with spreadsheet data, it's important to keep in mind that how the underlying data is stored can be very different from what you see in the cell.
For example, Excel has no notion of an integer.
All numbers are stored as floating point numbers, but you can choose to display the data with a customizable number of decimal places.
Similarly, dates are actually stored as numbers, specifically the number of days since January 1, 1900.
You can customize how you display the date by applying formatting in Excel.
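To make the serial-number representation concrete, here is a small sketch in R.
(One wrinkle we assume here: for compatibility with a historical Lotus 1-2-3 leap-year bug, the effective origin for converting Excel serial numbers is December 30, 1899.)

```{r}
# The serial number 44501, formatted as a date, displays as 11/1/21
as.Date(44501, origin = "1899-12-30")
```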
Confusingly, it's also possible to have something that looks like a number but is actually a string (e.g. type `'10` into a cell in Excel).
These differences between how the underlying data are stored vs. how they're displayed can cause surprises when the data are loaded into R.
By default readxl will guess the data type in a given column.
A recommended workflow is to let readxl guess the column types, confirm that you're happy with the guessed column types, and if not, go back and re-import specifying `col_types` as shown in Section \@ref(reading-spreadsheets).
Another challenge is when you have a column in your Excel spreadsheet that has a mix of these types, e.g. some cells are numeric, others text, others dates.
When importing such a column into R, readxl has to make a decision.
In these cases you can set the type for this column to `"list"`, which will load the column as a list of length-1 vectors, where the type of each element of the vector is guessed.
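For instance, here is a minimal sketch, assuming a hypothetical file `mixed.xlsx` whose single column mixes numbers, dates, and text:

```{r eval = FALSE}
# Each cell is loaded as a length-1 vector inside a list column,
# so every value keeps its individually guessed type
mixed <- read_excel("data/mixed.xlsx", col_types = "list")
mixed
```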
### Data not in cell values
The **tidyxl** package is useful for importing non-tabular data from Excel files into R.
Rather than coercing the contents of a sheet into a data frame, it imports each cell as its own row; for example, tidyxl doesn't force a pivot table into a data frame.
See <https://nacnudus.github.io/spreadsheet-munging-strategies/> for more on strategies for working with non-tabular data from Excel.
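For a sense of what tidyxl returns, here is a small sketch, assuming the tidyxl package is installed: `xlsx_cells()` imports one row per cell rather than a sheet-shaped data frame.

```{r eval = FALSE}
library(tidyxl)

# One row per cell, with its sheet, address, type, and the value
# stored in type-specific columns
cells <- xlsx_cells("data/penguins.xlsx")
cells %>%
  select(sheet, address, data_type, character, numeric)
```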
### Writing to Excel
Let's create a small data frame that we can then write out.
Note that `item` is a factor and `quantity` is an integer.
```{r}
bake_sale <- tibble(
item = factor(c("brownie", "cupcake", "cookie")),
  quantity = c(10L, 5L, 8L)
)
bake_sale
```
You can write data back to disk as an Excel file using the `write_xlsx()` function from the **writexl** package.
```{r eval = FALSE}
library(writexl)
write_xlsx(bake_sale, path = "data/bake-sale.xlsx")
```
Figure \@ref(fig:bake-sale-excel) shows what the data looks like in Excel.
Note that column names are included and bolded.
Both of these can be turned off by setting the `col_names` and `format_headers` arguments to `FALSE`, as sketched after the figure.
```{r bake-sale-excel, fig.alt = "Bake sale data frame created earlier in Excel.", fig.cap = "Spreadsheet called bake-sale.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-bake-sale.png")
```
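For example, a minimal sketch that turns both off (note that this would overwrite the file written above):

```{r eval = FALSE}
# Write the same data with no header row and no header formatting
write_xlsx(
  bake_sale,
  path = "data/bake-sale.xlsx",
  col_names = FALSE,
  format_headers = FALSE
)
```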
Just like reading from a CSV, information on data type is lost when we read the data back in.
This makes Excel files unreliable for caching interim results as well.
For alternatives, see Section \@ref(writing-to-a-file).
```{r}
read_excel("data/bake-sale.xlsx")
```
### Formatted output
The writexl package is a lightweight solution for writing a simple Excel spreadsheet, but if you're interested in additional features like writing to multiple sheets within a spreadsheet and styling, you will want to use the **openxlsx** package.
Note that this package is not part of the tidyverse so the functions and workflows may feel unfamiliar.
For example, function names are camelCase, multiple functions can't be composed in pipelines, and arguments are in a different order than they tend to be in the tidyverse.
However, this is ok.
As your R learning and usage expands beyond this book, you will encounter many different coding styles in the various packages you need to accomplish specific goals.
A good way of familiarizing yourself with the coding style used in a new package is to run the examples provided in function documentation to get a feel for the syntax and the output formats as well as reading any vignettes that might come with the package.
Below we show how to write a spreadsheet with three sheets, one for each species of penguins in the `penguins` data frame.
```{r message = FALSE}
library(openxlsx)
library(palmerpenguins)
# Create a workbook (spreadsheet)
penguins_species <- createWorkbook()
# Add three sheets to the spreadsheet
addWorksheet(penguins_species, sheetName = "Adelie")
addWorksheet(penguins_species, sheetName = "Gentoo")
addWorksheet(penguins_species, sheetName = "Chinstrap")
# Write data to each sheet
writeDataTable(
penguins_species,
sheet = "Adelie",
x = penguins %>% filter(species == "Adelie")
)
writeDataTable(
penguins_species,
sheet = "Gentoo",
x = penguins %>% filter(species == "Gentoo")
)
writeDataTable(
penguins_species,
sheet = "Chinstrap",
x = penguins %>% filter(species == "Chinstrap")
)
```
This creates a workbook object:
```{r}
penguins_species
```
And we can save it to a file with `saveWorkbook()`.
```{r eval = FALSE}
saveWorkbook(penguins_species, "data/penguins-species.xlsx")
```
The resulting spreadsheet is shown in Figure \@ref(fig:penguins-species).
By default, openxlsx formats the data as an Excel table.
```{r penguins-species, fig.alt = "A look at the penguins-species spreadsheet in Excel. The spreadsheet contains three sheets: Adelie, Gentoo, and Chinstrap.", fig.cap = "Spreadsheet called penguins-species.xlsx in Excel.", echo = FALSE}
knitr::include_graphics("images/import-spreadsheets-penguins-species.png")
```
See <https://ycphs.github.io/openxlsx/articles/Formatting.html> for an extensive discussion on further formatting functionality for data written from R to Excel with openxlsx.
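As a small taste of that functionality, here is a sketch that styles the workbook created above.
The specific choices (bold, centered header text on the Adelie sheet) are just illustrative; `createStyle()` and `addStyle()` offer many more options.

```{r eval = FALSE}
# Define a reusable style, apply it to the header row, and save
header_style <- createStyle(textDecoration = "bold", halign = "center")
addStyle(
  penguins_species,
  sheet = "Adelie",
  style = header_style,
  rows = 1,
  cols = 1:ncol(penguins),
  gridExpand = TRUE
)
saveWorkbook(penguins_species, "data/penguins-species.xlsx", overwrite = TRUE)
```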
### Exercises
1. Recreate the `bake_sale` data frame, and write it out to an Excel file using the `write.xlsx()` function from the openxlsx package.
2. What happens if you try to read in a file with `.xlsx` extension with `read_xls()`?
<!--# Need moar exercises -->
## Google Sheets
### Prerequisites
TO DO:
- use googlesheets4
- why 4?
### Getting started
TO DO:
- reading from public sheet with `read_sheet()` and `read_range()`
### Authentication
### Read sheets
### Write sheets
### Exercises
```{r, results = "asis", echo = FALSE}
status("drafting")
```

View File

@ -20,4 +20,4 @@ BuildType: Website
MarkdownWrap: Sentence
MarkdownCanonical: Yes
SpellingDictionary: en_GB
SpellingDictionary: en_US

BIN
students.feather Normal file

Binary file not shown.

View File

@ -124,40 +124,53 @@ nycflights13::flights %>%
### Subsetting
So far all the tools you've learned have worked with complete data frames.
If you want to pull out a single variable, you can use `pull()`.
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector.

```{r}
tb <- tibble(
  id = LETTERS[1:5],
  x1 = 1:5,
  y1 = 6:10
)
tb %>% pull(x1)
tb %>% pull(x1, name = id)
```

You can also use tools like `$` and `[[` to extract a variable.
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.

```{r}
# Extract by name
tb$x1
tb[["x1"]]
# Extract by position
tb[[1]]
```

To use these in a pipe, you'll need to use the special placeholder `.`:

```{r}
tb %>% .$x1
tb %>% .[["x1"]]
```
Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
In the following chunk `df` is a `data.frame` and `tb` is a `tibble`.
```{r}
df <- as.data.frame(tb)
# Partial match to existing variable name
tb$x # Warning + no match
df$x # Partial match, no warning
# Column doesn't exist
tb$z # Warning
df$z # No warning
```
## Interacting with older code