Working with existing data is a great way to learn the tools, but you can't apply the tools to your own data unless you can get it into R. In this chapter, we'll focus on the readr package for reading plain-text rectangular files from disk. This only scratches the surface of the ways you can load data into R, but it's the most common way to get data, and many of the principles will translate to the other forms of data import.
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
Notice that `read_csv()` uses the first line of the data for column headings. This is a very common convention. There are two cases where you might want to tweak this behaviour:
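Two typical tweaks are skipping lines of metadata at the top of the file and supplying column names when the data doesn't have any. Here's a quick sketch of both, using inline strings in place of files:

```{r}
# Skip over lines of explanatory metadata at the top of the file
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)

# Supply column names yourself when the data doesn't have a header row
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
```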
This is all you need to know to read ~50% of csv files that you'll encounter in practice. To read in the rest, you'll need to learn more about how readr parses each individual column, turning a character vector into the most appropriate type.
If you've used R before, you might wonder why we're not using `read.csv()`. There are a few good reasons to favour readr functions over the base equivalents:
Before we get into the details of how readr reads files from disk, we're going to take a little detour to talk about the `parse_*()` functions. These functions all take a character vector and return something more specialised like logical, integer, or date:
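For instance, here are a few quick examples of the kind of conversion each parser performs:

```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
str(parse_integer(c("1", "2", "3")))
str(parse_date(c("2010-01-01", "1979-10-14")))
```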
These functions are useful in their own right, but are also an important building block for readr. Once you've learned how the individual parsers work in this section, we'll circle back and see how they fit together to parse a complete file in the next section.
Like all functions in the tidyverse, the `parse_*()` functions are uniform: the first argument is a character vector to parse, and the `na` argument specifies which strings should be treated as missing.
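For example, to treat `.` as a missing value when parsing integers:

```{r}
parse_integer(c("1", "231", ".", "456"), na = ".")
```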
To get more details about the problems, use `problems()`, which returns a tibble. That's useful if you have many parsing failures because you can use dplyr to figure out the common patterns.
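Here's a small sketch: a parse that fails for some inputs, followed by `problems()` on the result:

```{r}
x <- parse_integer(c("123", "345", "abc", "123.45"))
problems(x)
```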
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input. There are eight particularly important parsers:
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ around the world. For parsing numbers, the most important option is what character you use for the decimal mark:
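For example, parsing the same value written with a period and with a comma as the decimal mark:

```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))
```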
The default locale in readr is US-centric, because R generally is US-centric (i.e. the documentation of base R is written in American English). An alternative approach would be to try and guess the defaults from your operating system. This is hard to do well, but more importantly makes your code fragile: it might work on your computer, but might fail when you email it to a colleague in another country.
`parse_number()` addresses problem two: it ignores non-numeric characters before and after the number. This is particularly useful for currencies and percentages, but also works to extract numbers embedded in text.
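A few illustrative calls:

```{r}
parse_number("$100")
parse_number("20%")
parse_number("It cost $123.45")
```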
The final problem is addressed by the combination of `parse_number()` and the locale: `parse_number()` will also ignore the "grouping mark" used to break large numbers into readable chunks:
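For example, using the grouping conventions of a few different countries:

```{r}
# Used in America
parse_number("$123,456,789")
# Used in many parts of Europe
parse_number("123.456.789", locale = locale(grouping_mark = "."))
# Used in Switzerland
parse_number("123'456'789", locale = locale(grouping_mark = "'"))
```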
It seems like `parse_character()` should be really simple - it could just return its input. Unfortunately life isn't so simple, as there are multiple ways to represent the same string. To understand what's going on, we need to dive into the details of how computers represent strings. In R, we can get at the underlying binary representation of a string using `charToRaw()`:
```{r}
charToRaw("Hadley")
```
Each hexadecimal number represents a byte of information: `48` is H, `61` is a, and so on. This encoding, from hexadecimal number to character, is called ASCII. ASCII does a great job of representing English characters.
Unfortunately you can only represent 256 different values with a single byte of information, and there are many more characters than that when you look across languages. That means to represent a character in other languages you need to use multiple bytes of information. The way multiple bytes are used to encode a character is called the "encoding".
In the early days of computing there were many different ways of representing non-English characters, which caused a lot of confusion. Fortunately, today there is one standard that is supported almost everywhere: UTF-8. UTF-8 can encode just about every character used by humans today, as well as many extra symbols (like emoji!).
readr uses UTF-8 everywhere: it assumes your data is UTF-8 encoded when you read it, and always uses it when writing. This is a good default, but it will fail for data produced by older systems that don't understand UTF-8. You can tell this has happened because the data looks weird when you print it in R: sometimes you get complete gibberish, and sometimes only one or two characters are messed up.
How do you find the correct encoding? If you're lucky, it'll be included somewhere in the data documentation. But that rarely happens, so readr provides `guess_encoding()` to help you figure it out. It's not foolproof, and it works better when you have lots of text, but it's a reasonable place to start. Even then you may need to try a couple of different encodings before you get the right one.
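Here's a sketch with a string encoded in Latin1 (note the `\xf1` byte standing in for "ñ"); `guess_encoding()` takes a file path or a raw vector:

```{r}
x1 <- "El Ni\xf1o was particularly bad this year"
guess_encoding(charToRaw(x1))
parse_character(x1, locale = locale(encoding = "Latin1"))
```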
Encodings are a rich and complex topic, and I've only scratched the surface here. We'll come back to encodings again in [[Encoding]], but if you'd like to learn more I'd recommend reading the detailed explanation at <http://kunststube.net/encoding/>.
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
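Roughly speaking, `parse_datetime()` expects an ISO8601 date-time, `parse_date()` expects a four-digit year, a `-` or `/`, the month, a `-` or `/`, then the day, and `parse_time()` expects the hour, `:`, minutes, optionally `:` and seconds, and an optional am/pm specifier. For example:

```{r}
parse_datetime("2010-10-01T2010")
parse_date("2010-10-01")
parse_time("01:10 am")
```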
The best way to figure out the correct string is to create a few examples in a character vector and test them with one of the parsing functions.
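For example, here's a deliberately ambiguous date parsed three different ways:

```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```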
If you're using `%b` or `%B` with non-English month names, you'll need to set the `date_names` argument of `locale()`. See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.
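For example, parsing a date written with a French month name:

```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```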
Now that you've learned how to parse an individual vector, it's time to turn back and explore how readr parses a file. There are two new things that you'll learn about in this section:
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to work out the type of each column.
You can emulate this process with a character vector using `guess_parser()`, which returns readr's best guess, and `parse_guess()` which uses that guess to parse the column:
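For example, a few guesses and one guided parse:

```{r}
guess_parser("2010-10-01")
guess_parser("15:01")
guess_parser(c("TRUE", "FALSE"))
guess_parser(c("1", "5", "9"))
guess_parser(c("12,352,561"))
str(parse_guess("2010-10-10"))
```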
(Note that the details will change a little from version to version as we tweak the guesses to provide the best balance between false positives and false negatives.)
Note the two outputs: you see the column specification that readr used, and you can see all the problems. It's always a good idea to explicitly pull out the `problems()` so you can explore them in more depth:
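As a sketch, here's that workflow using the `challenge.csv` file that ships with readr (assumed here as the example file): reading it in, then pulling out the problems.

```{r}
challenge <- read_csv(readr_example("challenge.csv"))
problems(challenge)
```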
A good strategy is to work column by column until there are no problems remaining. Here we can see that there are a lot of parsing problems with the `x` column: there are trailing characters after the integer value. That suggests we need to use a double parser instead.
Every `parse_xyz()` function has a corresponding `col_xyz()` function. You use `parse_xyz()` when the data is in a character vector in R already; you use `col_xyz()` when you want to tell readr how to load the data.
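Building on the sketch above, an explicit specification for `challenge.csv` might look like this:

```{r}
challenge <- read_csv(
  readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)
```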
I highly recommend building up a complete column specification using the printout provided by readr. This ensures that you have a consistent and reproducible way of reading the file; if you rely on the default guesses and your data changes, readr will continue to read it in. If you want to be really strict, use `stop_for_problems()`: that will throw an error if there are any parsing problems.
The most important arguments are `x` (the data frame to save), and `path` (the location to save it). You can also specify how missing values are written with `na`, and if you want to `append` to an existing file.
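A minimal round trip, reusing the `challenge` data frame from the sketches above; note that the column specification is lost on the way back in, because `read_csv()` has to guess the types again from scratch:

```{r}
write_csv(challenge, "challenge.csv")
read_csv("challenge.csv")
```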
This makes CSVs a little unreliable for caching interim results: you need to recreate the column specification every time you load them in. There are two alternatives:
Feather tends to be faster than RDS and is usable outside of R. RDS supports list-columns (which you'll learn about in [[Many models]]), which feather currently does not.
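A sketch of both, again using the `challenge` data frame; `write_rds()` and `read_rds()` are readr's wrappers around base R serialisation, while `write_feather()` and `read_feather()` come from the feather package:

```{r}
library(feather)
write_feather(challenge, "challenge.feather")
read_feather("challenge.feather")

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
```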
To get other types of data into R, we recommend starting with the packages listed below. They're certainly not perfect, but they are a good place to start as they are fully fledged members of the tidyverse.