Merge pull request #924 from mine-cetinkaya-rundel/no-iris
Structural updates for 2e
@ -14,6 +14,7 @@ URL:
R (>= 3.1.0)
@ -623,6 +623,8 @@ It's possible to use a model to remove the very strong relationship between pric
The following code fits a model that predicts `price` from `carat` and then computes the residuals (the difference between the predicted value and the actual value).
The residuals give us a view of the price of the diamond, once the effect of carat has been removed.
<!--# TO DO: Replace modelr based workflow with tidymodels, as a sneak preview. -->
```{r, dev = "png"}
@ -643,8 +645,7 @@ ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
You'll learn how models, and the modelr package, work in the final part of the book, [model](#model-intro).
We're saving modelling for later because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
We're not discussing modelling in this book because understanding what models are and how they work is easiest once you have tools of data wrangling and programming in hand.
## ggplot2 calls
@ -660,7 +661,7 @@ Typically, the first one or two arguments to a function are so important that yo
The first two arguments to `ggplot()` are `data` and `mapping`, and the first two arguments to `aes()` are `x` and `y`.
In the remainder of the book, we won't supply those names.
That saves typing, and, by reducing the amount of boilerplate, makes it easier to see what's different between plots.
That's a really important programming concern that we'll come back in [functions].
That's a really important programming concern that we'll come back to in Chapter \@ref(functions).
Rewriting the previous plot more concisely yields:
@ -3,36 +3,48 @@ new_session: yes
rmd_files: [
@ -0,0 +1,16 @@
# Column-wise operations
## Introduction
<!--# TO DO: Write introduction. -->
### Prerequisites
In this chapter we'll continue using dplyr.
dplyr is a member of the core tidyverse.
```{r setup, message = FALSE}
<!--# TO DO: Write chapter around across, etc. -->
@ -99,6 +99,7 @@ ggplot(df, aes(x, y)) +
2. The `geom_smooth()` is somewhat misleading because the `hwy` for large engines is skewed upwards due to the inclusion of lightweight sports cars with big engines.
Use your modelling tools to fit and display a better model.
<!--# TO DO: Reconsider this exercise in light of removing modeling chapters. -->
3. Take an exploratory graphic that you've created in the last month, and add informative titles to make it easier for others to understand.
@ -495,11 +496,11 @@ Note that all colour scales come in two variety: `scale_colour_x()` and `scale_f
3. Change the display of the presidential terms by:
1. Combining the two variants shown above.
2. Improving the display of the y axis.
3. Labelling each term with the name of the president.
4. Adding informative plot labels.
5. Placing breaks every 4 years (this is trickier than it seems!).
a. Combining the two variants shown above.
b. Improving the display of the y axis.
c. Labelling each term with the name of the president.
d. Adding informative plot labels.
e. Placing breaks every 4 years (this is trickier than it seems!).
4. Use `override.aes` to make the legend on the following plot easier to see.
@ -2,7 +2,7 @@
# Introduction {#communicate-intro}
So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation, visualisation and modelling.
So far, you've learned the tools to get your data into R, tidy it into a form convenient for analysis, and then understand your data through transformation and visualisation.
However, it doesn't matter how great your analysis is unless you can explain it to others: you need to **communicate** your results.
```{r echo = FALSE, out.width = "75%"}
@ -0,0 +1,666 @@
# Data import {#data-import}
<!--# TO DO: This chapter got moved here from the wrangle section, make sure it makes sense in this new location, doesn't assume anything that comes after it. -->
## Introduction
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
### Prerequisites
In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.
```{r setup, message = FALSE}
## Getting started
Most of readr's functions are concerned with turning flat files into data frames:
- `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon separated files (common in countries where `,` is used as the decimal mark), `read_tsv()` reads tab delimited files, and `read_delim()` reads in files with any delimiter.
- `read_fwf()` reads fixed width files.
You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
`read_table()` reads a common variation of fixed width files where columns are separated by white space.
- `read_log()` reads Apache style log files.
(But also check out the webreadr package, which is built on top of `read_log()` and provides many more helpful tools.)
These functions all have similar syntax: once you've mastered one, you can use the others with ease.
For the rest of this chapter we'll focus on `read_csv()`.
Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
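As a brief illustration of that shared syntax, here is a minimal sketch of `read_fwf()` with both ways of describing the fields. The inline data and the column names `code` and `value` are made up for this example; `suppressMessages()` only hides the column specification print-out.

```r
library(readr)

# Hypothetical fixed width data: a 2-character code followed by a 3-digit value.
fixed <- "AB123\nCD456\n"

# Describe the fields by their widths ...
by_width <- suppressMessages(
  read_fwf(fixed, fwf_widths(c(2, 3), col_names = c("code", "value")))
)

# ... or by their start and end positions (1-based, inclusive).
by_position <- suppressMessages(
  read_fwf(
    fixed,
    fwf_positions(start = c(1, 3), end = c(2, 5), col_names = c("code", "value"))
  )
)

by_width
by_position
```

Both calls should describe the same two columns, so the results are expected to match.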
The first argument to `read_csv()` is the most important: it's the path to the file to read.
```{r, message = TRUE}
heights <- read_csv("data/heights.csv")
When you run `read_csv()` it prints out a column specification that gives the name and type of each column.
That's an important part of readr, which we'll come back to in [parsing a file].
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behaviour:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
read_csv("The first line of metadata
The second line of metadata
1,2,3", skip = 2)
read_csv("# A comment I want to skip
1,2,3", comment = "#")
2. The data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [string basics].)
Alternatively you can pass `col_names` a character vector which will be used as the column names:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
This is all you need to know to read \~75% of CSV files that you'll encounter in practice.
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
### Compared to base R
If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favour readr functions over the base equivalents:
- They are typically much faster (\~10x) than their base equivalents.
Long running jobs have a progress bar, so you can see what's happening.
If you're looking for raw speed, try `data.table::fread()`.
It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.
- They produce tibbles, and they don't convert character vectors to factors, use row names, or munge the column names.
These are common sources of frustration with the base R functions.
- They are more reproducible.
Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
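To make the column name point concrete, here is a small sketch that reads the same inline text with both functions. The data is invented for illustration, and whether base R converts strings to factors depends on your R version, so only the names and classes are compared.

```r
library(readr)

# The same inline text, with a space in the first column name.
csv_text <- "a b,c\n1,x\n2,y"

base_df <- read.csv(text = csv_text)
readr_df <- suppressMessages(read_csv(csv_text))

names(base_df)   # base R munges the name to make it syntactic
names(readr_df)  # readr keeps the name as written
class(readr_df)  # readr returns a tibble
```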
### Exercises
1. What function would you use to read a file where fields were separated with\
2. Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
3. What are the most important arguments to `read_fwf()`?
4. Sometimes strings in a CSV file contain commas.
To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
What argument to `read_csv()` do you need to specify to read the following text into a data frame?
```{r, eval = FALSE}
5. Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
```{r, eval = FALSE}
## Parsing a vector
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialised vector like a logical, integer, or date:
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
These functions are useful in their own right, but are also an important building block for readr.
Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing:
parse_integer(c("1", "231", ".", "456"), na = ".")
If parsing fails, you'll get a warning:
x <- parse_integer(c("123", "345", "abc", "123.45"))
And the failures will be missing in the output:
If there are many parsing failures, you'll need to use `problems()` to get the complete set.
This returns a tibble, which you can then manipulate with dplyr.
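A minimal sketch of that workflow, reusing the vector from the example above; `suppressWarnings()` is only there to keep the output short.

```r
library(readr)

# Two of these four strings are not valid integers.
x <- suppressWarnings(parse_integer(c("123", "345", "abc", "123.45")))

# The failures become NA in the result ...
x

# ... and problems() returns one row per failure, with the position,
# what the parser expected, and what it actually found.
failures <- problems(x)
failures
```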
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input.
There are nine particularly important parsers:
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so I won't describe them here further.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
3. `parse_character()` seems so simple that it shouldn't be necessary.
But one complication makes it quite important: character encodings.
4. `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
5. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date & time specifications.
These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
### Numbers
It seems like it should be straightforward to parse a number, but three problems make it tricky:
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
2. Numbers are often surrounded by other characters that provide some context, like "\$1000" or "10%".
3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character you use for the decimal mark.
You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
parse_double("1,23", locale = locale(decimal_mark = ","))
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English).
An alternative approach would be to try and guess the defaults from your operating system.
This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number.
This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
parse_number("It cost $123.45")
The final problem is addressed by the combination of `parse_number()` and the locale as `parse_number()` will ignore the "grouping mark":
# Used in America
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
### Strings {#readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
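You can check this with base R's `iconv()`, which converts a string from one encoding to another. This is only a sketch: the encoding names that `iconv()` accepts vary by platform, so `"latin1"` and `"ISO-8859-2"` are assumptions that hold on most systems.

```r
# A single byte, b1, with no encoding attached.
b1 <- rawToChar(as.raw(0xb1))

# Interpret the same byte under two different encodings.
as_latin1 <- iconv(b1, from = "latin1", to = "UTF-8")
as_latin2 <- iconv(b1, from = "ISO-8859-2", to = "UTF-8")

as_latin1
as_latin2

# The Unicode code points differ, even though the input byte is identical.
utf8ToInt(as_latin1)
utf8ToInt(as_latin2)
```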
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don't understand UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
To fix the problem you need to specify the encoding in `parse_character()`:
parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <>.
### Factors {#readr-factors}
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#readr-strings) and [factors](#readr-factors) to clean them up.
### Dates, date-times, and times {#readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
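The underlying numbers are easy to inspect with `as.numeric()`. The following sketch parses values close to the origin so the counts are easy to verify by hand:

```r
library(readr)

# One day after 1970-01-01.
days <- as.numeric(parse_date("1970-01-02"))

# One minute after midnight on 1970-01-01 (readr parses in UTC by default).
seconds_since_epoch <- as.numeric(parse_datetime("1970-01-01T00:01:00"))

# One minute after midnight, with no date attached.
seconds_since_midnight <- as.numeric(parse_time("00:01:00"))

c(days, seconds_since_epoch, seconds_since_midnight)
```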
When called without any additional arguments:
- `parse_datetime()` expects an ISO8601 date-time.
ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
# If time is omitted, it will be set to midnight
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading <>
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
- `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier:
parse_time("01:10 am")
Base R doesn't have a great built in class for time data, so we use the one provided in the hms package.
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -\> 2000-2069, 70-99 -\> 1970-1999.

Month
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").

Day
: `%d` (2 digits).
: `%e` (optional leading space).

Time
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`).
    Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time.
    It is *not* Eastern Standard Time!
    We'll come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).

Non-digits
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions.
For example:
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
If you're using `%b` or `%B` with non-English month names, you'll need to set the `date_names` argument (the first argument) of `locale()`.
See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
### Exercises
1. What are the most important arguments to `locale()`?
2. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
3. I didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
5. What's the difference between `read_csv()` and `read_csv2()`?
6. What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
7. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
## Parsing a file
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
1. How readr automatically guesses the type of each column.
2. How to override the default specification.
### Strategy
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to guess the type.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and `-`).
- double: contains only valid doubles (including numbers like `4.5e-5`).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default `time_format`.
- date: matches the default `date_format`.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
### Problems
These defaults don't always work for larger files.
There are two basic problems:
1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general.
For example, you might have a column of doubles that only contains integers in the first 1000 rows.
2. The column might contain a lot of missing values.
If the first 1000 rows contain only `NA`s, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
challenge <- read_csv(readr_example("challenge.csv"))
(Note the use of `readr_example()`, which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures.
It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
A good strategy is to work column by column until there are no problems remaining.
Here we can see that there are a lot of parsing problems with the `y` column.
If you look at the last few rows, you'll see that they're dates stored in a character vector:
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r, eval = FALSE}
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_logical()
  )
)
Then you can fix the type of the `y` column by specifying that `y` is a date column:
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
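The pieces above can be sketched together as follows. The inline data and the column names `d` and `n` are hypothetical; the point is only to show a parser and its matching column specification side by side, and how `stop_for_problems()` behaves with and without parsing failures.

```r
library(readr)

# The same parser used two ways: on a character vector that is already in R ...
parsed <- parse_date("2010-01-01")

# ... and as a column specification while reading (inline) data.
df <- read_csv(
  "d,n\n2010-01-01,1\n2011-06-30,2",
  col_types = cols(d = col_date(), n = col_double())
)

# There are no parsing failures here, so this returns quietly.
stop_for_problems(df)

# A vector with a parsing failure makes stop_for_problems() throw an error,
# which we catch here so the script can carry on.
bad <- suppressWarnings(parse_integer(c("1", "abc")))
result <- tryCatch(
  stop_for_problems(bad),
  error = function(e) conditionMessage(e)
)
result
```

In a real import script you would let the error propagate rather than catch it, so that a change in the data stops the script instead of silently producing missing values.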
### Other strategies
There are a few other general strategies to help you parse files:
- In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
- Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:
challenge2 <- read_csv(readr_example("challenge.csv"),
  col_types = cols(.default = col_character())
)
This is particularly useful in conjunction with `type_convert()`, which applies the parsing heuristics to the character columns in a data frame.
df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df

# Note the column types
type_convert(df)
- If you're reading a very large file, you might want to set `n_max` to a smallish number like 10,000 or 100,000.
That will accelerate your iterations while you eliminate common problems.
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
## Writing to a file
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:
- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it).
You can also specify how missing values are written with `na`, and whether you want to `append` to an existing file.
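Here is a small sketch of `na` and `append` in action. The data frames and the choice of `.` as the missing value marker are made up for illustration, and the file is written to a temporary location.

```r
library(readr)

path <- tempfile(fileext = ".csv")

first <- data.frame(x = c(1, NA), y = c("a", "b"))
second <- data.frame(x = 3, y = "c")

# Write missing values as "." instead of the default "NA".
write_csv(first, path, na = ".")

# Add more rows to the same file; the header is not written again.
write_csv(second, path, na = ".", append = TRUE)

lines <- readLines(path)
lines
```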
```{r, eval = FALSE}
write_csv(challenge, "challenge.csv")
Note that the type information is lost when you save to csv:
```{r, warning = FALSE}
write_csv(challenge, "challenge-2.csv")
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load in.
There are two alternatives:
1. `write_rds()` and `read_rds()` are uniform wrappers around the base functions `saveRDS()` and `readRDS()`.
These store data in R's custom binary format called RDS:
write_rds(challenge, "challenge.rds")
2. The feather package implements a fast binary file format that can be shared across programming languages:
```{r, eval = FALSE}
write_feather(challenge, "challenge.feather")
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 <NA>
#> 2 4172 <NA>
#> 3 3004 <NA>
#> 4 787 <NA>
#> 5 37 <NA>
#> 6 2332 <NA>
#> # ... with 1,994 more rows
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in Chapter \@ref(list-columns)); feather currently does not.
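To see the loss of type information for yourself, compare a round trip through CSV with a round trip through RDS. This sketch uses a made-up data frame with a factor column and writes to temporary files:

```r
library(readr)

df <- data.frame(
  grade = factor(c("low", "high", "low"), levels = c("low", "high"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

from_csv <- suppressMessages(read_csv(csv_path))
from_rds <- read_rds(rds_path)

class(from_csv$grade)  # the factor comes back as a character vector
class(from_rds$grade)  # the factor, and its level order, survive
levels(from_rds$grade)
```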
```{r, include = FALSE}
## Other types of data
To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:
- **haven** reads SPSS, Stata, and SAS files.
- **readxl** reads Excel files (both `.xls` and `.xlsx`).
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
For hierarchical data: use **jsonlite** (by Jeroen Ooms) for json, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <>.
For other file types, try the R data import/export manual and the **rio** package.
@ -1,4 +1,6 @@
# Tidy data
# Data tidying {#data-tidy}
<!--# Take out bit on missing values and move to missing values chapter. Maybe also move case study elsewhere? -->
## Introduction
@ -1,4 +1,4 @@
# Data transformation {#transform}
# Data transformation {#data-transform}
## Introduction
@ -229,13 +229,13 @@ filter(df, | x > 1)
1. Find all flights that
1. Had an arrival delay of two or more hours
2. Flew to Houston (`IAH` or `HOU`)
3. Were operated by United, American, or Delta
4. Departed in summer (July, August, and September)
5. Arrived more than two hours late, but didn't leave late
6. Were delayed by at least an hour, but made up over 30 minutes in flight
7. Departed between midnight and 6am (inclusive)
a. Had an arrival delay of two or more hours
b. Flew to Houston (`IAH` or `HOU`)
c. Were operated by United, American, or Delta
d. Departed in summer (July, August, and September)
e. Arrived more than two hours late, but didn't leave late
f. Were delayed by at least an hour, but made up over 30 minutes in flight
g. Departed between midnight and 6am (inclusive)
2. Another useful dplyr filtering helper is `between()`.
What does it do?
@ -423,7 +423,7 @@ There's no way to list every possible function that you might use, but here's a
- Logs: `log()`, `log2()`, `log10()`.
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
They also convert multiplicative relationships to additive, a feature we'll come back to in modelling.
They also convert multiplicative relationships to additive.
All else being equal, I recommend using `log2()` because it's easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving.
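A quick sketch of both properties, using invented values:

```r
x <- c(100, 200, 50)

# Doubling adds 1 on the log2 scale; halving subtracts 1.
steps <- log2(x) - log2(x[1])
steps

# A product on the original scale is a sum on the log scale.
a <- 8
b <- 32
log2(a * b)
log2(a) + log2(b)
```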
@ -564,7 +564,7 @@ Naming things is hard, so this slows down our analysis.
There's another way to tackle the same problem with the pipe, `%>%`:
delays <- flights %>%
sdelays <- flights %>%
group_by(dest) %>%
count = n(),
@ -580,7 +580,7 @@ As suggested by this reading, a good way to pronounce `%>%` when reading code is
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in [pipes].
We'll use piping frequently from now on because it considerably improves the readability of code, and we'll come back to it in more detail in Chapter \@ref(pipes).
Working with the pipe is one of the key criteria for belonging to the tidyverse.
The only exception is ggplot2: it was written before the pipe was discovered.
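The rewriting rule described above, `x %>% f(y)` into `f(x, y)`, is easy to check for yourself. This sketch loads magrittr directly (the pipe is also available whenever dplyr is loaded) and uses made-up values:

```r
library(magrittr)

x <- c(1.234, 5.678)

# x %>% f(y) is the same as f(x, y).
piped <- x %>% round(1)
nested <- round(x, 1)
identical(piped, nested)

# x %>% f(y) %>% g(z) is the same as g(f(x, y), z).
piped_twice <- x %>% round(1) %>% paste("units")
nested_twice <- paste(round(x, 1), "units")
identical(piped_twice, nested_twice)
```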
@ -0,0 +1,29 @@
# (PART) Data types {.unnumbered}
# Introduction {#data-types-intro}
In this part of the book, you'll learn about data types, ...
<!--# TO DO: Add a diagram? -->
This part of the book proceeds as follows:
- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**. You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
Data wrangling also encompasses data transformation, which you've already learned a little about.
Now we'll focus on new skills for specific types of data you will frequently encounter in practice:
- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets.
<!--# TO DO: Something about logicals and numbers -->
<!--# TO DO: Something about general vector tools -->
<!--# TO DO: Something about missing values -->
- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- Chapter \@ref(factors) will introduce factors -- how R stores categorical data.
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times.
@ -1,4 +1,4 @@
# Data visualisation
# Data visualisation {#data-visualisation}
## Introduction
@ -1,27 +0,0 @@
# (PART) Explore {.unnumbered}
# Introduction {#explore-intro}
The goal of the first part of this book is to get you up to speed with the basic tools of **data exploration** as quickly as possible.
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
The goal of data exploration is to generate many promising leads that you can later explore in more depth.
```{r echo = FALSE, out.width = "75%"}
In this part of the book you will learn some useful tools that have an immediate payoff:
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
In [data visualisation] you'll dive into visualisation, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
- Visualisation alone is typically not enough, so in [data transformation] you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- Finally, in [exploratory data analysis], you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet.
We'll come back to it in [modelling](#model-intro), once you're better equipped with more data wrangling and programming tools.
Nestled among these three chapters that teach you the tools of exploration are three chapters that focus on your R workflow.
In [workflow: basics], [workflow: scripts], and [workflow: projects] you'll learn good practices for writing and organising your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.
@ -127,7 +127,7 @@ df$d <- rescale01(df$d)
Compared to the original, this code is easier to understand and we've eliminated one class of copy-and-paste errors.
There is still quite a bit of duplication since we're doing the same thing to multiple columns.
We'll learn how to eliminate that duplication in [iteration], once you've learned more about R's data structures in [vectors].
We'll learn how to eliminate that duplication with iteration in Chapter \@ref(iteration), once you've learned more about R's data structures in Chapter \@ref(vectors).
Another advantage of functions is that if our requirements change, we only need to make the change in one place.
For example, we might discover that some of our variables include infinite values, and `rescale01()` fails:
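To make that concrete, here is a sketch of the failure and a one-place fix. The definition of `rescale01()` is assumed to be the usual range-based version, since it isn't shown in this excerpt, and `rescale01_finite()` is a hypothetical name used only here.

```r
# Assumed definition: rescale a numeric vector to lie between 0 and 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(1:10, Inf)
rescale01(x)   # the infinite value swamps the range

# Because the logic lives in one function, the fix happens in one place:
# ignore infinite values when computing the range
rescale01_finite <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01_finite(x)
```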
@ -0,0 +1,3 @@
# Databases {#import-databases}
<!--# TO DO: Write chapter. -->
@ -0,0 +1,3 @@
# Other types of data {#import-other}
<!--# TO DO: Write chapter. -->
@ -0,0 +1,3 @@
# Rectangular data {#import-rectangular}
<!--# TO DO: Write chapter. -->
@ -0,0 +1,3 @@
# Spreadsheets {#import-spreadsheets}
<!--# TO DO: Write chapter. -->
@ -0,0 +1,3 @@
# Web scraping {#import-webscrape}
<!--# TO DO: Write chapter. -->
@ -1,664 +1,21 @@
# Data import
# (PART) Import {.unnumbered}
## Introduction
# Introduction {#import-intro}
Working with data provided by R packages is a great way to learn the tools of data science, but at some point you want to stop learning and start working with your own data.
In this chapter, you'll learn how to read plain-text rectangular files into R.
Here, we'll only scratch the surface of data import, but many of the principles will translate to other forms of data.
We'll finish with a few pointers to packages that are useful for other types of data.
In this part of the book, you'll learn how to get your data into R.
We'll focus on plain-text rectangular formats, spreadsheets, databases, and web data.
### Prerequisites
<!--# TO DO: Decide if a diagram is needed, see wrangle-intro for reference. -->
In this chapter, you'll learn how to load flat files in R with the **readr** package, which is part of the core tidyverse.
This part of the book proceeds as follows:
```{r setup, message = FALSE}
- In Chapter \@ref(import-rectangular), you'll learn how to get plain-text data in rectangular formats from disk and into R.
## Getting started
- In Chapter \@ref(import-spreadsheets), you'll learn how to get data from Excel spreadsheets and Google Sheets into R.
Most of readr's functions are concerned with turning flat files into data frames:
- In Chapter \@ref(import-databases), you'll learn about getting data into R from databases.
<!--# TO DO: List which types of databases. -->
- `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon separated files (common in countries where `,` is used as the decimal place), `read_tsv()` reads tab delimited files, and `read_delim()` reads in files with any delimiter.
- In Chapter \@ref(import-webscrape), you'll learn about harvesting data off the web and getting it into R.
- `read_fwf()` reads fixed width files.
You can specify fields either by their widths with `fwf_widths()` or their position with `fwf_positions()`.
`read_table()` reads a common variation of fixed width files where columns are separated by white space.
- `read_log()` reads Apache style log files.
(But also check out [webreadr]( which is built on top of `read_log()` and provides many more helpful tools.)
These functions all have similar syntax: once you've mastered one, you can use the others with ease.
For the rest of this chapter we'll focus on `read_csv()`.
Not only are csv files one of the most common forms of data storage, but once you understand `read_csv()`, you can easily apply your knowledge to all the other functions in readr.
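As a rough illustration of that shared syntax, the sketch below reads the same tiny table with three different delimiters. The inline data is made up for this example, and wrapping it in `I()` assumes a recent version of readr, where `I()` marks a string as literal data rather than a file path.

```r
library(readr)

# Inline data wrapped in I() so readr treats it as literal text, not a path
semi <- read_csv2(I("x;y\n1,5;2\n"))               # semicolons, with `,` as the decimal mark
tabs <- read_tsv(I("x\ty\n1\t2\n"))                # tabs
pipes <- read_delim(I("x|y\n1|2\n"), delim = "|")  # any other single delimiter

semi
tabs
pipes
```

Only the function name, and for `read_delim()` the `delim` argument, changes between the three calls.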
The first argument to `read_csv()` is the most important: it's the path to the file to read.
```{r, message = TRUE}
heights <- read_csv("data/heights.csv")
When you run `read_csv()` it prints out a column specification that gives the name and type of each column.
That's an important part of readr, which we'll come back to in [parsing a file].
You can also supply an inline csv file.
This is useful for experimenting with readr and for creating reproducible examples to share with others:
In both cases `read_csv()` uses the first line of the data for the column names, which is a very common convention.
There are two cases where you might want to tweak this behaviour:
1. Sometimes there are a few lines of metadata at the top of the file.
You can use `skip = n` to skip the first `n` lines; or use `comment = "#"` to drop all lines that start with (e.g.) `#`.
read_csv("The first line of metadata
The second line of metadata
1,2,3", skip = 2)
read_csv("# A comment I want to skip
1,2,3", comment = "#")
2. The data might not have column names.
You can use `col_names = FALSE` to tell `read_csv()` not to treat the first row as headings, and instead label them sequentially from `X1` to `Xn`:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
(`"\n"` is a convenient shortcut for adding a new line. You'll learn more about it and other types of string escape in [string basics].)
Alternatively you can pass `col_names` a character vector which will be used as the column names:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
Another option that commonly needs tweaking is `na`: this specifies the value (or values) that are used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
This is all you need to know to read \~75% of CSV files that you'll encounter in practice.
You can also easily adapt what you've learned to read tab separated files with `read_tsv()` and fixed width files with `read_fwf()`.
To read in more challenging files, you'll need to learn more about how readr parses each column, turning them into R vectors.
### Compared to base R
If you've used R before, you might wonder why we're not using `read.csv()`.
There are a few good reasons to favour readr functions over the base equivalents:
- They are typically much faster (\~10x) than their base equivalents.
Long running jobs have a progress bar, so you can see what's happening.
If you're looking for raw speed, try `data.table::fread()`.
It doesn't fit quite so well into the tidyverse, but it can be quite a bit faster.
- They produce tibbles, and they don't convert character vectors to factors, use row names, or munge the column names.
These are common sources of frustration with the base R functions.
- They are more reproducible.
Base R functions inherit some behaviour from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
### Exercises
1. What function would you use to read a file where fields were separated with\
2. Apart from `file`, `skip`, and `comment`, what other arguments do `read_csv()` and `read_tsv()` have in common?
3. What are the most important arguments to `read_fwf()`?
4. Sometimes strings in a CSV file contain commas.
To prevent them from causing problems they need to be surrounded by a quoting character, like `"` or `'`. By default, `read_csv()` assumes that the quoting character will be `"`.
What argument to `read_csv()` do you need to specify to read the following text into a data frame?
```{r, eval = FALSE}
5. Identify what is wrong with each of the following inline CSV files.
What happens when you run the code?
```{r, eval = FALSE}
## Parsing a vector
Before we get into the details of how readr reads files from disk, we need to take a little detour to talk about the `parse_*()` functions.
These functions take a character vector and return a more specialised vector like a logical, integer, or date:
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
These functions are useful in their own right, but are also an important building block for readr.
Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing:
parse_integer(c("1", "231", ".", "456"), na = ".")
If parsing fails, you'll get a warning:
x <- parse_integer(c("123", "345", "abc", "123.45"))
And the failures will be missing in the output:
If there are many parsing failures, you'll need to use `problems()` to get the complete set.
This returns a tibble, which you can then manipulate with dplyr.
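A short sketch of that workflow, reusing the vector from above: the failed values show up as `NA`, and `problems()` gives one row per failure.

```r
library(readr)

x <- parse_integer(c("123", "345", "abc", "123.45"))  # warns about two failures
x              # the failed values come back as NA
problems(x)    # one row per failure

# problems() returns a tibble, so the usual tools apply
nrow(problems(x))
```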
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input.
There are nine particularly important parsers, which fall into five groups:
1. `parse_logical()` and `parse_integer()` parse logicals and integers respectively.
There's basically nothing that can go wrong with these parsers so I won't describe them here further.
2. `parse_double()` is a strict numeric parser, and `parse_number()` is a flexible numeric parser.
These are more complicated than you might expect because different parts of the world write numbers in different ways.
3. `parse_character()` seems so simple that it shouldn't be necessary.
But one complication makes it quite important: character encodings.
4. `parse_factor()` creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
5. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to parse various date & time specifications.
These are the most complicated because there are so many different ways of writing dates.
The following sections describe these parsers in more detail.
### Numbers
It seems like it should be straightforward to parse a number, but three problems make it tricky:
1. People write numbers differently in different parts of the world.
For example, some countries use `.` in between the integer and fractional parts of a real number, while others use `,`.
2. Numbers are often surrounded by other characters that provide some context, like "\$1000" or "10%".
3. Numbers often contain "grouping" characters to make them easier to read, like "1,000,000", and these grouping characters vary around the world.
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ from place to place.
When parsing numbers, the most important option is the character you use for the decimal mark.
You can override the default value of `.` by creating a new locale and setting the `decimal_mark` argument:
parse_double("1,23", locale = locale(decimal_mark = ","))
readr's default locale is US-centric, because generally R is US-centric (i.e. the documentation of base R is written in American English).
An alternative approach would be to try and guess the defaults from your operating system.
This is hard to do well, and, more importantly, makes your code fragile: even if it works on your computer, it might fail when you email it to a colleague in another country.
`parse_number()` addresses the second problem: it ignores non-numeric characters before and after the number.
This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
parse_number("It cost $123.45")
The final problem is addressed by the combination of `parse_number()` and the locale as `parse_number()` will ignore the "grouping mark":
# Used in America
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
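Putting the three ideas together, here is a minimal sketch that sets both marks in one locale. The object name `euro` and the input strings are hypothetical, chosen only for illustration.

```r
library(readr)

# A hypothetical locale for data that uses `,` as the decimal mark and `.` for grouping
euro <- locale(decimal_mark = ",", grouping_mark = ".")

parse_number("1.234,56", locale = euro)      # grouping marks are dropped
parse_number("EUR 1.234,56", locale = euro)  # leading text is ignored too
parse_number("20%")                          # default locale
```

Defining the locale once and passing it to every parser keeps the regional assumptions in a single, visible place.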
### Strings {#readr-strings}
It seems like `parse_character()` should be really simple --- it could just return its input.
Unfortunately life isn't so simple, as there are multiple ways to represent the same string.
To understand what's going on, we need to dive into the details of how computers represent strings.
In R, we can get at the underlying representation of a string using `charToRaw()`:
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on.
The mapping from hexadecimal number to character is called the encoding, and in this case the encoding is called ASCII.
ASCII does a great job of representing English characters, because it's the **American** Standard Code for Information Interchange.
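A small sketch of that byte-level view, using only base R. The string `"Hadley"` is chosen here because it starts with the bytes mentioned above.

```r
bytes <- charToRaw("Hadley")
bytes                 # one hexadecimal byte per character
rawToChar(bytes)      # and back again

# Under ASCII, each byte is just the character's code
utf8ToInt("H")        # 72 in decimal, which is 48 in hexadecimal
```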
Things get more complicated for languages other than English.
In the early days of computing there were many competing standards for encoding non-English characters, and to correctly interpret a string you needed to know both the values and the encoding.
For example, two common encodings are Latin1 (aka ISO-8859-1, used for Western European languages) and Latin2 (aka ISO-8859-2, used for Eastern European languages).
In Latin1, the byte `b1` is "±", but in Latin2, it's "ą"!
Fortunately, today there is one standard that is supported almost everywhere: UTF-8.
UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing.
This is a good default, but will fail for data produced by older systems that don't understand UTF-8.
If this happens to you, your strings will look weird when you print them.
Sometimes just one or two characters might be messed up; other times you'll get complete gibberish.
For example:
x1 <- "El Ni\xf1o was particularly bad this year"
x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
To fix the problem you need to specify the encoding in `parse_character()`:
parse_character(x1, locale = locale(encoding = "Latin1"))
parse_character(x2, locale = locale(encoding = "Shift-JIS"))
How do you find the correct encoding?
If you're lucky, it'll be included somewhere in the data documentation.
Unfortunately, that's rarely the case, so readr provides `guess_encoding()` to help you figure it out.
It's not foolproof, and it works better when you have lots of text (unlike here), but it's a reasonable place to start.
Expect to try a few different encodings before you find the right one.
The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).
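One way this might look in practice is sketched below. It assumes the stringi package is installed, which `guess_encoding()` relies on, and the guesses it returns can vary, so no particular output is shown.

```r
library(readr)

x1 <- "El Ni\xf1o was particularly bad this year"

# Returns a tibble of candidate encodings with a confidence score for each
guesses <- guess_encoding(charToRaw(x1))
guesses

# Try a plausible candidate and check that the result reads sensibly
fixed <- parse_character(x1, locale = locale(encoding = "Latin1"))
fixed
```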
Encodings are a rich and complex topic, and I've only scratched the surface here.
If you'd like to learn more I'd recommend reading the detailed explanation at <>.
### Factors {#readr-factors}
R uses factors to represent categorical variables that have a known set of possible values.
Give `parse_factor()` a vector of known `levels` to generate a warning whenever an unexpected value is present:
fruit <- c("apple", "banana")
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
But if you have many problematic entries, it's often easier to leave them as character vectors and then use the tools you'll learn about in [strings](#readr-strings) and [factors](#readr-factors) to clean them up.
### Dates, date-times, and times {#readr-datetimes}
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date-time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight).
When called without any additional arguments:
- `parse_datetime()` expects an ISO8601 date-time.
ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.
# If time is omitted, it will be set to midnight
This is the most important date/time standard, and if you work with dates and times frequently, I recommend reading <>
- `parse_date()` expects a four digit year, a `-` or `/`, the month, a `-` or `/`, then the day:
- `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier:
parse_time("01:10 am")
Base R doesn't have a great built in class for time data, so we use the one provided in the hms package.
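The defaults described in the list above can be sketched as follows. The input strings are illustrative, and each call uses no arguments beyond the string itself.

```r
library(readr)

parse_datetime("2010-10-01T2010")   # ISO8601 date-time
parse_datetime("20101010")          # no time given, so midnight is used
parse_date("2010-10-01")
parse_time("01:10 am")
parse_time("20:10:01")
```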
If these defaults don't work for your data you can supply your own date-time `format`, built up of the following pieces:
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -\> 2000-2069, 70-99 -\> 1970-1999.
: `%m` (2 digits).
: `%b` (abbreviated name, like "Jan").
: `%B` (full name, "January").
: `%d` (2 digits).
: `%e` (optional leading space).
: `%H` 0-23 hour.
: `%I` 0-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`).
Beware of abbreviations: if you're American, note that "EST" is a Canadian time zone that does not have daylight savings time.
It is *not* Eastern Standard Time!
We'll come back to this in [time zones].
: `%z` (as offset from UTC, e.g. `+0800`).
: `%.` skips one non-digit character.
: `%*` skips any number of non-digits.
The best way to figure out the correct format is to create a few examples in a character vector, and test with one of the parsing functions.
For example:
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
If you're using `%b` or `%B` with non-English month names, you'll need to set the `lang` argument to `locale()`.
See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
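As a further sketch of building a format from these pieces, the chunk below parses a made-up date-time string and shows how the same characters yield different dates under different formats.

```r
library(readr)

# Day first, dots as separators, then a 24-hour time
parse_datetime("07.03.2015 14:05", "%d.%m.%Y %H:%M")

# The same characters can mean different dates under different formats
parse_date("01/02/15", "%m/%d/%y")   # 2 January 2015
parse_date("01/02/15", "%d/%m/%y")   # 1 February 2015

# Month names in another language need a matching locale
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```

Testing a format against a couple of known values like this is usually quicker than reasoning about it in the abstract.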
### Exercises
1. What are the most important arguments to `locale()`?
2. What happens if you try and set `decimal_mark` and `grouping_mark` to the same character?
What happens to the default value of `grouping_mark` when you set `decimal_mark` to ","?
What happens to the default value of `decimal_mark` when you set the `grouping_mark` to "."?
3. I didn't discuss the `date_format` and `time_format` options to `locale()`.
What do they do?
Construct an example that shows when they might be useful.
4. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
5. What's the difference between `read_csv()` and `read_csv2()`?
6. What are the most common encodings used in Europe?
What are the most common encodings used in Asia?
Do some googling to find out.
7. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"
## Parsing a file
Now that you've learned how to parse an individual vector, it's time to return to the beginning and explore how readr parses a file.
There are two new things that you'll learn about in this section:
1. How readr automatically guesses the type of each column.
2. How to override the default specification.
### Strategy
readr guesses the type of each column with a heuristic: it reads the first 1000 rows and applies some (moderately conservative) rules to them.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
The heuristic tries each of the following types, stopping when it finds a match:
- logical: contains only "F", "T", "FALSE", or "TRUE".
- integer: contains only numeric characters (and `-`).
- double: contains only valid doubles (including numbers like `4.5e-5`).
- number: contains valid doubles with the grouping mark inside.
- time: matches the default `time_format`.
- date: matches the default `date_format`.
- date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
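A few more guesses, sketched with made-up inputs, show how the rules in the list above play out.

```r
library(readr)

guess_parser("2010-10-01")     # "date"
guess_parser("15:01")          # "time"
guess_parser("12,352,561")     # "number", because of the grouping marks
guess_parser("apple")          # "character", nothing else matched

# parse_guess() applies the guessed parser
str(parse_guess("2010-10-10"))
```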
### Problems
These defaults don't always work for larger files.
There are two basic problems:
1. The first thousand rows might be a special case, and readr guesses a type that is not sufficiently general.
For example, you might have a column of doubles that only contains integers in the first 1000 rows.
2. The column might contain a lot of missing values.
If the first 1000 rows contain only `NA`s, readr will guess that it's a logical vector, whereas you probably want to parse it as something more specific.
readr contains a challenging CSV that illustrates both of these problems:
challenge <- read_csv(readr_example("challenge.csv"))
(Note the use of `readr_example()`, which finds the path to one of the files included with the package.)
There are two printed outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures.
It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
A good strategy is to work column by column until there are no problems remaining.
Here we can see that there are a lot of parsing problems with the `y` column.
If you look at the last few rows, you'll see that they're dates stored in a character vector:
That suggests we need to use a date parser instead.
To fix the call, start by copying and pasting the column specification into your original call:
```{r, eval = FALSE}
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_logical()
  )
)
Then you can fix the type of the `y` column by specifying that `y` is a date column:
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
Every `parse_xyz()` function has a corresponding `col_xyz()` function.
You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
I highly recommend always supplying `col_types`, building up from the print-out provided by readr.
This ensures that you have a consistent and reproducible data import script.
If you rely on the default guesses and your data changes, readr will continue to read it in.
If you want to be really strict, use `stop_for_problems()`: that will throw an error and stop your script if there are any parsing problems.
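Here is a minimal sketch of both recommendations. The inline data is hypothetical and stands in for a real file, and `I()` assumes a recent version of readr.

```r
library(readr)

# Hypothetical inline data standing in for a real file
csv <- I("x,y\n1,2015-01-16\n2,2018-05-18\n")

df <- read_csv(csv, col_types = cols(
  x = col_double(),
  y = col_date()
))
spec(df)   # the specification that was actually used

# stop_for_problems() turns parsing failures into an error
bad <- suppressWarnings(parse_integer(c("1", "a")))
tryCatch(stop_for_problems(bad), error = function(e) conditionMessage(e))
```

With an explicit specification, a change in the data shows up as a parsing problem instead of a silently different column type.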
### Other strategies
There are a few other general strategies to help you parse files:
- In the previous example, we just got unlucky: if we look at just one more row than the default, we can correctly parse in one shot:
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
- Sometimes it's easier to diagnose problems if you just read in all the columns as character vectors:
challenge2 <- read_csv(readr_example("challenge.csv"),
  col_types = cols(.default = col_character())
)
This is particularly useful in conjunction with `type_convert()`, which applies the parsing heuristics to the character columns in a data frame.
df <- tribble(
  ~x,  ~y,
  "1", "1.21",
  "2", "2.32",
  "3", "4.56"
)
df
# Note the column types
type_convert(df)
- If you're reading a very large file, you might want to set `n_max` to a smallish number like 10,000 or 100,000.
That will accelerate your iterations while you eliminate common problems.
- If you're having major parsing problems, sometimes it's easier to just read into a character vector of lines with `read_lines()`, or even a character vector of length 1 with `read_file()`.
Then you can use the string parsing skills you'll learn later to parse more exotic formats.
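The last two strategies might look like the sketch below, which uses the `challenge.csv` example file that ships with readr. The variable names are chosen only for this illustration.

```r
library(readr)

path <- readr_example("challenge.csv")

# Read only the first few rows while you iterate on the column specification
head_rows <- read_csv(path, n_max = 10, col_types = cols(.default = col_character()))
nrow(head_rows)

# Or fall back to raw text: one string per line, or one string for the whole file
first_lines <- read_lines(path, n_max = 3)
whole_file <- read_file(path)
length(first_lines)
length(whole_file)
```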
## Writing to a file
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`.
Both functions increase the chances of the output file being read back in correctly by:
- Always encoding strings in UTF-8.
- Saving dates and date-times in ISO8601 format so they are easily parsed elsewhere.
If you want to export a csv file to Excel, use `write_excel_csv()` --- this writes a special character (a "byte order mark") at the start of the file which tells Excel that you're using the UTF-8 encoding.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it).
You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
```{r, eval = FALSE}
write_csv(challenge, "challenge.csv")
Note that the type information is lost when you save to csv:
```{r, warning = FALSE}
write_csv(challenge, "challenge-2.csv")
This makes CSVs a little unreliable for caching interim results---you need to recreate the column specification every time you load in.
There are two alternatives:
1. `write_rds()` and `read_rds()` are uniform wrappers around the base functions `saveRDS()` and `readRDS()`.
These store data in R's custom binary format called RDS:
write_rds(challenge, "challenge.rds")
2. The feather package implements a fast binary file format that can be shared across programming languages:
```{r, eval = FALSE}
write_feather(challenge, "challenge.feather")
#> # A tibble: 2,000 x 2
#> x y
#> <dbl> <date>
#> 1 404 <NA>
#> 2 4172 <NA>
#> 3 3004 <NA>
#> 4 787 <NA>
#> 5 37 <NA>
#> 6 2332 <NA>
#> # ... with 1,994 more rows
Feather tends to be faster than RDS and is usable outside of R.
RDS supports list-columns (which you'll learn about in [many models]); feather currently does not.
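To see the difference between the two kinds of format, the sketch below round-trips a small data frame through CSV and RDS. The data frame and the temporary paths are hypothetical, made up for this example.

```r
library(readr)

# A small hypothetical data frame with a factor column
df <- data.frame(id = 1:2, grp = factor(c("a", "b")))

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(df, csv_path)
write_rds(df, rds_path)

class(read_csv(csv_path)$grp)   # the factor comes back as character
class(read_rds(rds_path)$grp)   # RDS preserves the factor
```

CSV stores only text, so column types have to be guessed or respecified on the way back in; a binary format stores them alongside the data.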
```{r, include = FALSE}
## Other types of data
To get other types of data into R, we recommend starting with the tidyverse packages listed below.
They're certainly not perfect, but they are a good place to start.
For rectangular data:
- **haven** reads SPSS, Stata, and SAS files.
- **readxl** reads Excel files (both `.xls` and `.xlsx`).
- **DBI**, along with a database specific backend (e.g. **RMySQL**, **RSQLite**, **RPostgreSQL** etc) allows you to run SQL queries against a database and return a data frame.
For hierarchical data: use **jsonlite** (by Jeroen Ooms) for JSON, and **xml2** for XML.
Jenny Bryan has some excellent worked examples at <>.
For other file types, try the [R data import/export manual]( and the [**rio**]( package.
- We'll close up the part with a brief discussion on other types of data and pointers for how to get them into R in Chapter \@ref(import-other).
@ -13,8 +13,8 @@ documentclass: book
# Welcome {.unnumbered}
<a href=""><img src="cover.png" alt="Buy from amazon" class="cover" width="250" height="375"/></a> This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
In this book, you will find a practicum of skills for data science.
[{.cover width="250"}]( This is the website for the work-in-progress 2nd edition of **"R for Data Science"**. This book will teach you how to do data science with R: You'll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
<!--# TO DO: Should "model it" stay here? Omitted? Mentioned with an explanation as to where to go for modeling? --> In this book, you will find a practicum of skills for data science.
Just as a chemist learns how to clean test tubes and stock a lab, you'll learn how to clean data and draw plots---and many other things besides.
These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R.
You'll learn how to use the grammar of graphics, literate programming, and reproducible research to save time.
@ -140,7 +140,6 @@ Hypothesis confirmation is hard for two reasons:
2. You can only use an observation once to confirm a hypothesis.
As soon as you use it more than once you're back to doing exploratory analysis.
This means to do hypothesis confirmation you need to "preregister" (write out in advance) your analysis plan, and not deviate from it even when you have seen the data.
We'll talk a little about some strategies you can use to make this easier in [modelling](#model-intro).
It's common to think about modelling as a tool for hypothesis confirmation, and visualisation as a tool for hypothesis generation.
But that's a false dichotomy: models are often used for exploration, and with a little care you can use visualisation for confirmation.
@ -2,7 +2,7 @@
## Introduction
In [functions], we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
In Chapter \@ref(functions), we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting.
Reducing code duplication has three main benefits:
1. It's easier to see the intent of your code, because your eyes are drawn to what's different, not what stays the same.
@ -100,10 +100,10 @@ Then we'll move on some variations of the for loop that help you solve other pro
1. Write for loops to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`.
3. Compute the number of unique values in each column of `iris`.
4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
a. Compute the mean of every column in `mtcars`.
b. Determine the type of each column in `nycflights13::flights`.
c. Compute the number of unique values in each column of `palmerpenguins::penguins`.
d. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
Think about the output, sequence, and body **before** you start writing the loop.
@ -132,13 +132,9 @@ Then we'll move on some variations of the for loop that help you solve other pro
3. Combine your function writing and for loop skills:
1. Write a for loop that `print()`s the lyrics to the children's song "Alice the camel".
2. Convert the nursery rhyme "ten in the bed" to a function.
Generalise it to any number of people in any sleeping structure.
3. Convert the song "99 bottles of beer on the wall" to a function.
Generalise to any number of any vessel containing any liquid on any surface.
a. Write a for loop that `print()`s the lyrics to the children's song "Alice the camel".
b. Convert the nursery rhyme "ten in the bed" to a function. Generalise it to any number of people in any sleeping structure.
c. Convert the song "99 bottles of beer on the wall" to a function. Generalise to any number of any vessel containing any liquid on any surface.
4. It's common to see for loops that don't preallocate the output and instead increase the length of a vector at each step:
@ -168,7 +164,7 @@ There are four variations on the basic theme of the for loop:
### Modifying an existing object
Sometimes you want to use a for loop to modify an existing object.
For example, remember our challenge from [functions].
For example, remember our challenge from Chapter \@ref(functions) on functions.
We wanted to rescale every column in a data frame:
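As a minimal sketch of this pattern, the loop below rescales every column of a small data frame in place. The helper name `rescale01()` and the example data are assumptions made for illustration; the point is that the output already exists, so the loop only needs a sequence (`seq_along(df)`) and a body that assigns back with `[[`.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative data; the column names and values are made up.
df <- data.frame(a = c(1, 5, 10), b = c(2, 4, 6))

# seq_along(df) iterates over column positions, and `[[<-` replaces
# each column of the existing object with its rescaled version.
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}

df
```

Using `[[` rather than `[` on both sides makes it clear that each iteration works with a single column.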
@ -346,14 +342,14 @@ However, it is good to know they exist so that you're prepared for problems wher
What if the names are not unique?
3. Write a function that prints the mean of each numeric column in a data frame, along with its name.
For example, `show_mean(iris)` would print:
For example, `show_mean(mpg)` would print:
```{r, eval = FALSE}
#> Sepal.Length: 5.84
#> Sepal.Width: 3.06
#> Petal.Length: 3.76
#> Petal.Width: 1.20
#> displ: 3.47
#> year: 2004
#> cyl: 5.89
#> cty: 16.86
(Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?)
@ -634,10 +630,10 @@ I focus on purrr functions here because they have more consistent names and argu
1. Write code that uses one of the map functions to:
1. Compute the mean of every column in `mtcars`.
2. Determine the type of each column in `nycflights13::flights`.
3. Compute the number of unique values in each column of `iris`.
4. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
a. Compute the mean of every column in `mtcars`.
b. Determine the type of each column in `nycflights13::flights`.
c. Compute the number of unique values in each column of `palmerpenguins::penguins`.
d. Generate 10 random normals from distributions with means of -10, 0, 10, and 100.
2. How can you create a single vector that for each column in a data frame indicates whether or not it's a factor?
@ -909,11 +905,11 @@ A number of functions work with **predicate** functions that return either a sin
`keep()` and `discard()` keep elements of the input where the predicate is `TRUE` or `FALSE` respectively:
iris %>%
gss_cat %>%
keep(is.factor) %>%
iris %>%
gss_cat %>%
discard(is.factor) %>%
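For a self-contained illustration of the same idea, the sketch below applies `keep()` and `discard()` to a small made-up data frame rather than `gss_cat`; the data and the variable names are assumptions for illustration only. Because a data frame is a list of columns, the predicate `is.factor` is tested once per column.

```r
library(purrr)

# Illustrative data frame; names and values are made up.
df <- data.frame(
  id = 1:3,
  group = factor(c("a", "b", "a")),
  score = c(2.5, 3.1, 4.8)
)

# keep() retains the columns for which the predicate is TRUE,
# discard() retains the columns for which it is FALSE.
factor_cols <- keep(df, is.factor)
other_cols <- discard(df, is.factor)

names(factor_cols)
names(other_cols)
```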
@ -1028,3 +1024,7 @@ x %>% accumulate(`+`)
What causes the bugs?
## Case study
<!--# Case study on web scraping from multiple pages -->
@ -0,0 +1,16 @@
# List columns
## Introduction
<!--# TO DO: Write introduction. -->
### Prerequisites
In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets.
tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
<!--# TO DO: Write chapter around unnest, hoist, etc. -->
@ -0,0 +1,3 @@
# Logicals and numbers
## Introduction
@ -0,0 +1,3 @@
# Missing values
## Introduction
@ -0,0 +1,15 @@
# Preface to the second edition {.unnumbered}
Welcome to the second edition of "R for Data Science".
## Major changes {.unnumbered}
- The first part is renamed to "whole game" to reflect the entire data science cycle, including a chapter on data import.
- The wrangle part highlights improvements to dplyr that make data scientists' lives even easier, such as new functions for rectangling data, working with list columns, and column-wise and row-wise operations.
- Data import also gains a whole part that goes beyond importing rectangular data to include chapters on working with spreadsheets, databases, and web scraping.
- The iteration chapter gains a new case study on web scraping from multiple pages.
- The modeling part has been removed. For modeling, we recommend using packages from tidymodels and reading Tidy Modeling with R by Max Kuhn and Julia Silge to learn more about them.
## Acknowledgements {.unnumbered}
*TO DO: Add acknowledgements.*
@ -28,18 +28,18 @@ But this doesn't mean you should rewrite every function: you need to balance wha
In the following four chapters, you'll learn skills that will allow you to both tackle new programs and to solve existing problems with greater clarity and ease:
1. In [pipes], you will dive deep into the **pipe**, `%>%`, and learn more about how it works, what the alternatives are, and when not to use it.
1. In Chapter \@ref(pipes), you will dive deep into the **pipe**, `%>%`, and learn more about how it works, what the alternatives are, and when not to use it.
2. Copy-and-paste is a powerful tool, but you should avoid doing it more than twice.
Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
Instead, in [functions], you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
Instead, in Chapter \@ref(functions), you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by [vectors].
3. As you start to write more powerful functions, you'll need a solid grounding in R's **data structures**, provided by vectors, which we discuss in Chapter \@ref(vectors).
You must master the four common atomic vectors, the three important S3 classes built on top of them, and understand the mysteries of the list and data frame.
4. Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
You need tools for **iteration** that let you do similar things again and again.
These tools include for loops and functional programming, which you'll learn about in [iteration].
These tools include for loops and functional programming, which you'll learn about in Chapter \@ref(iteration).
## Learning more
@ -0,0 +1,16 @@
# Rectangling data
## Introduction
<!--# TO DO: Write introduction. -->
### Prerequisites
In this chapter we'll continue using tidyr, which also provides a bunch of tools to rectangle your datasets.
tidyr is a member of the core tidyverse.
```{r setup, message = FALSE}
<!--# TO DO: Write chapter around unnest, hoist, etc. -->
@ -167,11 +167,11 @@ For example, in this data there's a many-to-many relationship between airlines a
2. Identify the keys in the following datasets
1. `Lahman::Batting`,
2. `babynames::babynames`
3. `nasaweather::atmos`
4. `fueleconomy::vehicles`
5. `ggplot2::diamonds`
a. `Lahman::Batting`,
b. `babynames::babynames`
c. `nasaweather::atmos`
d. `fueleconomy::vehicles`
e. `ggplot2::diamonds`
(You might need to install some packages and read some documentation.)
@ -124,9 +124,9 @@ If you forget, you can get to a handy reference sheet with *Help \> Markdown Qui
2. Using the R Markdown quick reference, figure out how to:
1. Add a footnote.
2. Add a horizontal rule.
3. Add a block quote.
a. Add a footnote.
b. Add a horizontal rule.
c. Add a block quote.
3. Copy and paste the contents of `diamond-sizes.Rmd` from <> into a local R Markdown document.
Check that you can run it, then add text after the frequency polygon that describes its most striking features.
@ -314,10 +314,10 @@ For example, I'll search for `\bsum\b` to avoid matching `summarise`, `summary`,
2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
1. Start with "y".
2. End with "x"
3. Are exactly three letters long. (Don't cheat by using `str_length()`!)
4. Have seven letters or more.
a. Start with "y".
b. End with "x"
c. Are exactly three letters long. (Don't cheat by using `str_length()`!)
d. Have seven letters or more.
Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
@ -360,14 +360,10 @@ str_view(c("grey", "gray"), "gr(e|a)y")
1. Create regular expressions to find all words that:
1. Start with a vowel.
2. That only contain consonants.
(Hint: thinking about matching "not"-vowels.)
3. End with `ed`, but not with `eed`.
4. End with `ing` or `ise`.
a. Start with a vowel.
b. That only contain consonants. (Hint: think about matching "not"-vowels.)
c. End with `ed`, but not with `eed`.
d. End with `ing` or `ise`.
2. Empirically verify the rule "i before e except after c".
@ -423,16 +419,16 @@ str_view(x, 'C[LX]+?')
2. Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
1. `^.*$`
2. `"\\{.+\\}"`
3. `\d{4}-\d{2}-\d{2}`
4. `"\\\\{4}"`
a. `^.*$`
b. `"\\{.+\\}"`
c. `\d{4}-\d{2}-\d{2}`
d. `"\\\\{4}"`
3. Create regular expressions to find all words that:
1. Start with three consonants.
2. Have three or more vowels in a row.
3. Have two or more vowel-consonant pairs in a row.
a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.
4. Solve the beginner regexp crosswords at <>.
@ -454,19 +450,17 @@ str_view(fruit, "(..)\\1", match = TRUE)
1. Describe, in words, what these expressions will match:
1. `(.)\1\1`
2. `"(.)(.)\\2\\1"`
3. `(..)\1`
4. `"(.).\\1.\\1"`
5. `"(.)(.)(.).*\\3\\2\\1"`
a. `(.)\1\1`
b. `"(.)(.)\\2\\1"`
c. `(..)\1`
d. `"(.).\\1.\\1"`
e. `"(.)(.)(.).*\\3\\2\\1"`
2. Construct regular expressions to match words that:
1. Start and end with the same character.
2. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
3. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
a. Start and end with the same character.
b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice.)
c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s.)
## Tools
@ -666,11 +660,9 @@ The second function will have the suffix `_all`.
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple `str_detect()` calls.
1. Find all words that start or end with `x`.
2. Find all words that start with a vowel and end with a consonant.
3. Are there any words that contain at least one of each different vowel?
a. Find all words that start or end with `x`.
b. Find all words that start with a vowel and end with a consonant.
c. Are there any words that contain at least one of each different vowel?
2. What word has the highest number of vowels?
What word has the highest proportion of vowels?
@ -723,7 +715,7 @@ It returns a list:
str_extract_all(more, colour_match)
You'll learn more about lists in [lists](#lists) and [iteration].
You'll learn more about lists in Section \@ref(lists) on lists and Chapter \@ref(iteration) on iteration.
If you use `simplify = TRUE`, `str_extract_all()` will return a matrix with short matches expanded to the same length as the longest:
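A minimal sketch of the difference, using made-up input strings and a hypothetical pattern name (`colour_pattern`) rather than the objects defined earlier in the chapter:

```r
library(stringr)

# Illustrative inputs: the first string has one match, the second has two.
demo <- c("a red apple", "blue sky over a green field")
colour_pattern <- "red|green|blue"

# Default: a list with one character vector per input string.
as_list <- str_extract_all(demo, colour_pattern)

# simplify = TRUE: a character matrix with one row per input string;
# rows with fewer matches are padded with empty strings.
as_matrix <- str_extract_all(demo, colour_pattern, simplify = TRUE)

as_list
as_matrix
```

The matrix form is convenient when you want a rectangular result, but the padding means you can no longer tell a missing match from an empty one by length alone.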
@ -1048,8 +1040,8 @@ The main difference is the prefix: `str_` vs. `stri_`.
1. Find the stringi functions that:
1. Count the number of words.
2. Find duplicated strings.
3. Generate random text.
a. Count the number of words.
b. Find duplicated strings.
c. Generate random text.
2. How do you control the language that `stri_sort()` uses for sorting?
@ -26,7 +26,7 @@ Most other R packages use regular data frames, so you might want to coerce a dat
You can do that with `as_tibble()`:
You can create a new tibble from individual vectors with `tibble()`.
@ -184,13 +184,10 @@ With tibbles, `[` always returns another tibble.
4. Practice referring to non-syntactic names in the following data frame by:
1. Extracting the variable called `1`.
2. Plotting a scatterplot of `1` vs `2`.
3. Creating a new column called `3` which is `2` divided by `1`.
4. Renaming the columns to `one`, `two` and `three`.
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
annoying <- tibble(
@ -0,0 +1,3 @@
# General vector tools
## Introduction
@ -412,14 +412,10 @@ The distinction between `[` and `[[` is most important for lists, as we'll see s
4. Create functions that take a vector as input and return:
1. The last value.
Should you use `[` or `[[`?
2. The elements at even numbered positions.
3. Every element except the last value.
4. Only even numbers (and no missing values).
a. The last value. Should you use `[` or `[[`?
b. The elements at even numbered positions.
c. Every element except the last value.
d. Only even numbers (and no missing values).
5. Why is `x[-which(x > 0)]` not the same as `x[x <= 0]`?
@ -561,8 +557,8 @@ knitr::include_graphics("images/pepper-3.jpg")
1. Draw the following lists as nested sets:
1. `list(a, b, list(c, d), list(e, f))`
2. `list(list(list(list(list(list(a))))))`
a. `list(a, b, list(c, d), list(e, f))`
b. `list(list(list(list(list(list(a))))))`
2. What happens if you subset a tibble as if you're subsetting a list?
What are the key differences between a list and a tibble?
@ -0,0 +1,32 @@
# (PART) Whole game {.unnumbered}
# Introduction {#explore-intro}
The goal of the first part of this book is to get you up to speed with the basic tools of **data exploration** as quickly as possible.
Data exploration is the art of looking at your data, rapidly generating hypotheses, quickly testing them, then repeating again and again and again.
The goal of data exploration is to generate many promising leads that you can later explore in more depth.
```{r echo = FALSE, out.width = "75%"}
In this part of the book you will learn some useful tools that have an immediate payoff:
- Visualisation is a great place to start with R programming, because the payoff is so clear: you get to make elegant and informative plots that help you understand data.
In Chapter \@ref(data-visualisation) you'll dive into visualisation, learning the basic structure of a ggplot2 plot, and powerful techniques for turning data into plots.
- Visualisation alone is typically not enough, so in Chapter \@ref(data-transform) you'll learn the key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
- In Chapter \@ref(data-tidy), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.
- Before you can transform and visualise your data, you need to first get your data into R.
In Chapter \@ref(data-import) you'll learn the basics of getting plain-text rectangular data into R.
- Finally, in Chapter \@ref(exploratory-data-analysis), you'll combine visualisation and transformation with your curiosity and scepticism to ask and answer interesting questions about data.
Modelling is an important part of the exploratory process, but you don't have the skills to effectively learn or apply it yet so we will not cover it in this part.
Nestled among these five chapters that teach you the tools of exploration are three chapters that focus on your R workflow.
In Chapters \@ref(workflow-basics), \@ref(workflow-scripts), and \@ref(workflow-projects), you'll learn good workflow practices for writing and organising your R code.
These will set you up for success in the long run, as they'll give you the tools to stay organised when you tackle real projects.
@ -51,7 +51,7 @@ some.people.use.periods
We'll come back to code style later, in [functions].
We'll come back to code style later, in Chapter \@ref(functions) on functions.
You can inspect an object by typing its name:
@ -10,25 +10,34 @@ There are three main parts to data wrangling:
<!--# TO DO: Redo the diagram without highlighting import. -->
This part of the book proceeds as follows:
- In [tibbles], you'll learn about the variant of the data frame that we use in this book: the **tibble**.
- In Chapter \@ref(tibbles), you'll learn about the variant of the data frame that we use in this book: the **tibble**.
You'll learn what makes them different from regular data frames, and how you can construct them "by hand".
- In [data import], you'll learn how to get your data from disk and into R.
We'll focus on plain-text rectangular formats, but will give you pointers to packages that help with other types of data.
- In [tidy data], you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier.
- In Chapter \@ref(tidy-data), you'll learn about tidy data, a consistent way of storing your data that makes transformation, visualisation, and modelling easier.
You'll learn the underlying principles, and how to get your data into a tidy form.
- In Chapter \@ref(rectangle-data), you'll learn about hierarchical data formats and how to turn them into rectangular data via unnesting.
- Chapter \@ref(column-wise-operations) will give you tools for performing the same operation on multiple columns.
- Chapter \@ref(row-wise-operations) will give you tools for performing operations over rows.
Data wrangling also encompasses data transformation, which you've already learned a little about.
Now we'll focus on new skills for three specific types of data you will frequently encounter in practice:
- [Relational data] will give you tools for working with multiple interrelated datasets.
- Chapter \@ref(relational-data) will give you tools for working with multiple interrelated datasets.
- [Strings] will introduce regular expressions, a powerful tool for manipulating strings.
- Chapter \@ref(list-columns) will give you tools for working with list columns --- data stored in columns of a tibble as lists.
- [Factors] are how R stores categorical data.
- Chapter \@ref(strings) will give you tools for working with strings and introduce regular expressions, a powerful tool for manipulating strings.
- Chapter \@ref(factors) will introduce factors --- how R stores categorical data.
They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
- [Dates and times] will give you the key tools for working with dates and date-times.
- Chapter \@ref(dates-and-times) will give you the key tools for working with dates and date-times.
<!--# TO DO: Revisit bullet points about new chapters. -->