Working with existing data is a great way to learn the tools, but you can't apply them to your own data unless you can get it into R. In this chapter, we'll focus on the readr package for reading plain-text rectangular files from disk. This only scratches the surface of the ways you can load data into R, but it's the most common way to get data, and many of the principles will translate to other forms of data import.
These functions all have similar syntax: once you've mastered one, you can use the others with ease. For the rest of this chapter we'll focus on `read_csv()`. If you understand how to use this function, it will be straightforward to apply your knowledge to all the other functions in readr.
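For example, you can experiment with `read_csv()` by supplying an inline csv file, which keeps the data right next to the code:

```{r}
read_csv("a,b,c
1,2,3
4,5,6")
```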
Notice that `read_csv()` uses the first line of the data for the column names. This is a very common convention. But there are two cases where you might want to tweak this behaviour:

1. Sometimes there are a few lines of metadata at the top of the file. You can
   use `skip = n` to skip the first `n` lines.

2. The data might not have column names. You can use `col_names = FALSE` to
   tell `read_csv()` not to treat the first row as headings.
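Here are both tweaks in action, again using inline files so you can see exactly what's going on:

```{r}
# Case 1: skip two lines of metadata at the top of the file
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)

# Case 2: no column names; readr generates them sequentially
read_csv("1,2,3\n4,5,6", col_names = FALSE)
```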
This is all you need to know to read ~50% of the csv files that you'll encounter in practice. To read the rest, you'll need to learn more about how readr converts the strings it reads from a file into the most appropriate column type.
If you've used R before, you might wonder why we're not using `read.csv()` here. There are a few good reasons to favour readr functions over the base equivalents:
Before we get to how readr reads files from disk, we're going to take a little detour to talk about the `parse_*()` functions. These work with character vectors: they're useful in their own right, but are particularly important for experimentation. Once you've learned how the individual parsers work, we'll circle back and see how they fit together to parse an entire file.
Parsing occurs after leading and trailing whitespace has been removed (unless you override this with `trim_ws = FALSE`) and after the missing values listed in `na` have been converted to `NA`:
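```{r}
# "  2  " is trimmed before parsing; "." is listed in `na`, so it becomes NA
parse_integer(c("1", "  2  ", "."), na = ".")
```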
Parsing logicals and integers is straightforward. Parsing numbers and characters is slightly more complicated than you might expect, and dates and date times are quite a bit more complex. There's also a factor parser.
1. People often add extra characters to make numbers easier to read, like
   "1,000,000", and these characters differ in different parts
   of the world.
To address problem 1, readr has the notion of a "locale", an object that bundles together all of the various things that differ from place to place. When parsing numbers, the most important of these is the character you use for the decimal mark:
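```{r}
parse_double("1.23")
parse_double("1,23", locale = locale(decimal_mark = ","))

# parse_number() also ignores grouping marks and other non-numeric characters
parse_number("$1,000,000")
parse_number("123.456.789", locale = locale(grouping_mark = "."))
```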
(The defaults are American-centric because R is. Trying to adapt automatically to your default is hard, and makes code fragile because it might work on your computer, but might not when you email it to a colleague in another country.)
It seems like `parse_character()` should be really simple - it could just return its input. Unfortunately there's one tricky bit: encoding. The encoding of a string determines how it is represented in binary. You can see the underlying representation of a string in R using `charToRaw()`:
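```{r}
charToRaw("Hadley")
```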
Each hexadecimal number represents a byte of information. All English characters can be encoded in a single byte basically because most early computer technology was developed in the US.
Unfortunately you can only represent a maximum of 256 values with a single byte of information, and there are many more characters than that used across languages (some languages by themselves need far more than 256 characters: Chinese, for example, uses over 20,000). That means that to represent a character you may need to use multiple bytes of information. The way multiple bytes are used to encode a character is called the "encoding".
In the early days of computing there were many different ways of representing non-English characters, which caused a lot of confusion. Fortunately, nowadays there is one standard that is supported almost everywhere: UTF-8. This is the default encoding used on Mac and Linux:
readr uses UTF-8 everywhere: it assumes it by default when reading, and always uses it when writing. However, you may be reading data produced by a system that doesn't understand UTF-8. To read such data, you'll need to specify the encoding yourself. You can use `guess_encoding()` to help figure it out: the more data you have, the more likely it is to be correct. You may need to try a couple of different encodings before you get the right one.
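For example, here's a string encoded in Latin1; `guess_encoding()` points us in the right direction, and `locale(encoding = ...)` does the rest:

```{r}
x <- "El Ni\xf1o was particularly bad this year"
guess_encoding(charToRaw(x))
parse_character(x, locale = locale(encoding = "Latin1"))
```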
readr provides three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight). Without any extra arguments, the defaults are:
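```{r}
parse_datetime("2010-10-01T2010")
parse_date("2010-10-01")
parse_time("01:10 am")
```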
The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:
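```{r}
parse_date("01/02/15", "%m/%d/%y")
parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```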
If you're using `%b` or `%p` and you're in a non-English locale, you can set the values with `locale()`. readr comes bundled with a bunch of languages (see `date_names_langs()` for the list), or you can create your own with `date_names()`. (Using month names seems to be relatively uncommon outside of Europe.)
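For example, parsing a French date:

```{r}
parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```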
readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and applies some (moderately conservative) rules to each one. This is fast, and fairly robust. If readr detects the wrong type, you'll get warning messages. readr prints out the first five, and you can access the complete set with `problems()`:
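```{r}
# challenge.csv is a deliberately tricky file bundled with readr;
# the exact problems reported will depend on your readr version
challenge <- read_csv(readr_example("challenge.csv"))
problems(challenge)
```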
Typically, you'll see a lot of warnings if readr has guessed a column type incorrectly. This most often occurs when the first 1000 rows are different from the rest of the data: perhaps there's a lot of missing data there, or maybe your data is mostly numeric but a few rows have characters. Fortunately, it's easy to fix these problems using the `col_types` argument.
(Note that if you have a very large file, you might want to set `n_max` to 10,000 or 100,000. That will speed up iteration while you're tracking down common problems.)
Specifying `col_types` looks like this:
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols(
  x = col_integer(),
  treatment = col_character()
))
```
You can use the following types of columns:

* `col_integer()` (i) and `col_double()` (d) specify integers and doubles.
* `col_logical()` (l) parses TRUE, T, FALSE and F into a logical vector.
* `col_number()` (n) is a more flexible parser for numbers embedded in other
strings. It will look for the first number in a string, ignoring non-numeric
prefixes and suffixes. It will also ignore the grouping mark specified by
the locale (see below for more details).
* `col_factor()` (f) allows you to load data directly into a factor if you know
what the levels are.
* `col_skip()` (_, -) completely ignores a column.
* `col_date()` (D), `col_datetime()` (T) and `col_time()` (t) parse into dates,
date times, and times as described below.
You might have noticed that each column parser has a one letter abbreviation, which you can use instead of the full function call (assuming you're happy with the default arguments):
```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols(
x = "i",
treatment = "c"
))
```
(If you just have a few columns, you can supply a single string that gives the type for each column: `i__dc`. See the documentation for more details. It's not as easy to understand as the `cols()` specification, so I'm not going to describe it further here.)
By default, any column not mentioned in `cols()` will be guessed. If you'd rather those columns were simply not read in, use `cols_only()`. In that case, you can use `col_guess()` (?) for any column whose type you want guessed but still read in.
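A minimal sketch, reusing the hypothetical `mypath.csv` from above (`x` and `y` are just illustrative column names):

```{r, eval = FALSE}
read_csv("mypath.csv", col_types = cols_only(
  x = col_integer(),
  y = col_guess()
))
```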
Each `col_XYZ()` function also has a corresponding `parse_XYZ()` that you can use on a character vector. This makes it easier to explore what each of the parsers does interactively.
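For example, you can explore what `col_number()` will do to a column by calling `parse_number()` directly:

```{r}
parse_number(c("$1,000", "20%", "It cost 123.45"))
```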
readr also provides a number of functions for reading files off disk directly into character vectors:
* `read_file()` reads an entire file into a character vector of length one.
* `read_lines()` reads a file into a character vector with one element per
line.
These are useful if you have a plain text file with an unusual format. Often you can use `read_lines()` to read into a character vector, and then use the regular expression skills you'll learn in [[strings]] to pull out the pieces that you need.
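For example, using a small csv file that's bundled with readr:

```{r}
lines <- read_lines(readr_example("mtcars.csv"))
head(lines, 3)
```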
`read_file_raw()` and `read_lines_raw()` work similarly but return raw vectors, which are useful if you need to work with binary data.
### Converting
`type_convert()` applies the same parsing heuristics to the character columns in a data frame. It's useful if you've loaded data "by hand", and now want to convert character columns to the appropriate type:
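For example:

```{r}
df <- tibble::tibble(x = c("1", "2", "3"), y = c("1.21", "2.32", "4.56"))
type_convert(df)
```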
readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`. These are considerably faster than the base R equivalents, never write row names, and automatically quote only when needed.
If you want to export a csv file to Excel, use `write_excel_csv()` - this writes a special character (a "byte order mark") at the start of the file which forces Excel to use UTF-8.
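All three work the same way: give them a data frame and a path (the paths here are just illustrative):

```{r, eval = FALSE}
write_csv(mtcars, "mtcars.csv")
write_excel_csv(mtcars, "mtcars-excel.csv")
```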
We have worked on a number of packages to make importing data into R as easy as possible. These packages are certainly not perfect, but they are the best place to start because they behave as similarly as possible to readr.
There are two common forms of hierarchical data: XML and JSON. We recommend using xml2 and jsonlite respectively. These packages are performant, safe, and (relatively) easy to use. To work with them effectively in R, you'll need to learn how to handle the hierarchical data structures they produce.
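A tiny taste of each (the inline JSON and XML are made-up examples):

```{r}
jsonlite::fromJSON('{"package": "readr", "parsers": ["double", "date"]}')
xml2::read_xml("<parsers><p>double</p><p>date</p></parsers>")
```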
If your data lives in a database, you'll need to use the DBI package. DBI provides a common interface that works with many different types of database. R's support is particularly good for open source databases (e.g. RPostgres, RMySQL, RSQLite, MonetDBLite).
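Here's a minimal sketch with RSQLite, using an in-memory database so nothing touches disk:

```{r, eval = FALSE}
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
dbGetQuery(con, "SELECT cyl, mpg FROM mtcars LIMIT 3")
dbDisconnect(con)
```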