More on import

parent 2ebaee835d
commit aa74378924

146 import.Rmd

@@ -38,6 +38,8 @@ The first argument to `read_csv()` is the most important: it's the path to the f
heights <- read_csv("data/heights.csv")
```

When you run `read_csv()`, it prints how it has read each column. We'll come back to that in a little bit.

Readr can automatically decompress files ending in `.zip`, `.gz`, `.bz2`, and `.xz`.

You can also supply an inline csv file. This is useful for experimenting and creating reproducible examples:

@@ -99,9 +101,28 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
    your operating system, so code that works on your computer might not
    work on another computer.

### Exercises

1. What function would you use to read a file where fields were
    separated with "|"?

1. Apart from `file`, `skip`, and `comment`, what other arguments do
    `read_csv()` and `read_tsv()` have in common?

1. Sometimes strings in a csv file contain commas. To prevent them from
    causing problems they need to be surrounded by a quoting character, like
    `"` or `'`. By convention, `read_csv()` assumes that the quoting
    character will be `"`, and if you want to change it you'll need to
    use `read_delim()` instead. What arguments do you need to specify
    to read the following text into a data frame?

    ```{r}
    "x,y\n1,'a,b'"
    ```

## Parsing a vector

Before we get to how readr reads files from disk, we're going to take a little detour to talk about the `parse_*()` functions. These functions all take a character vector and return something more specialised like logical, integer, or date:
Before we get into the details of how readr reads files from disk, we're going to take a little detour to talk about the `parse_*()` functions. These functions all take a character vector and return something more specialised like logical, integer, or date:

```{r}
str(parse_logical(c("TRUE", "FALSE", "NA")))
@@ -117,14 +138,25 @@ Like all functions in the tidyverse, the `parse_*()` functions are uniform: the
parse_integer(c("1", "231", ".", "456"), na = ".")
```

If parsing fails, you'll get a warning, and can use the `problems()` function to get more details. `problems()` returns a tibble, so you can easily explore it using dplyr.
If parsing fails, you'll get a warning:

```{r}
x <- parse_integer(c("123", "345", "abc", "123.45"))
```

And the failures will be missing in the output:

```{r}
x
```

To get more details about the problems, use `problems()`, which returns a tibble. That's useful if you have many parsing failures because you can use dplyr to figure out the common patterns.

```{r}
problems(x)
```
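
As a rough illustration of exploring parsing failures with dplyr, the sketch below counts failures by the reason readr reports. The input vector `raw` is made up for this example, and the code assumes that readr and dplyr are installed and that the tibble returned by `problems()` has an `expected` column.

```r
library(readr)
library(dplyr)

# A made-up vector with several kinds of bad input (illustrative only)
raw <- c("123", "abc", "4.5", "def", "7", "8.9")

# parse_integer() warns and returns NA for the four values that are not integers
x <- parse_integer(raw)

# problems() returns one row per failure; count() tallies them by reported reason
failures <- problems(x) %>%
  count(expected, sort = TRUE)

failures
```

With a long vector, a summary like this usually points at a small number of recurring causes rather than a long list of individual rows.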

There are eight particularly important parsers:
Using parsers is mostly a matter of understanding what's available and how they deal with different types of input. There are eight particularly important parsers:

1. `parse_logical()` and `parse_integer()` parse logicals and integers
    respectively. There's basically nothing that can go wrong with them
@@ -136,7 +168,7 @@ There are eight particularly important parsers:
    ways.

1. `parse_character()` seems so simple that it shouldn't be necessary. But
    one complication makes it important: character encodings.
    one complication makes it quite important: character encodings.

1. `parse_datetime()`, `parse_date()`, and `parse_time()` allow you to
    parse various date & time specifications. These are the most complicated
@@ -146,9 +178,11 @@ The following sections describe the parsers in more detail.

### Numbers

There are three tricky bits to numbers:
It seems like it should be straightforward to parse a number, but three factors make it tricky:

1. People write numbers differently in different parts of the world.
    Some countries use `.` in between the integer and fractional parts of
    a real number, while others use `,`.

1. Numbers are often surrounded by other characters that provide some
    context, like "$1000" or "10%".
@@ -156,7 +190,7 @@ There are three tricky bits to numbers:
1. Numbers often contain "grouping" characters to make them easier to read,
    like "1,000,000", and the characters differ around the world.
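
To make the three problems concrete, here is a small sketch, assuming readr is installed. It relies on readr's `locale()` and `parse_number()`; the input strings and variable names are made up for illustration.

```r
library(readr)

# 1. Decimal mark: tell readr that "," separates integer and fractional parts
decimal_comma <- parse_double("1,23", locale = locale(decimal_mark = ","))

# 2. Surrounding characters: parse_number() ignores the text around the number
dollars <- parse_number("$1000")
percent <- parse_number("10%")

# 3. Grouping marks: "." is used to group digits in much of Europe
grouped <- parse_number("1.000.000", locale = locale(grouping_mark = "."))
```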

To address the first problem, readr has the notion of a "locale", an object which specifies parsing options that differ around the world. For parsing numbers, the most important option is what character you use for the decimal place:
To address the first problem, readr has the notion of a "locale", an object that specifies parsing options that differ around the world. For parsing numbers, the most important option is what character you use for the decimal mark:

```{r}
parse_double("1.23")
@@ -222,47 +256,73 @@ guess_encoding(charToRaw(x2))

The first argument to `guess_encoding()` can either be a path to a file, or, as in this case, a raw vector (useful if the strings are already in R).

If you'd like to learn more, I'd recommend <http://kunststube.net/encoding/>.

### Dates, date times, and times

There are three options depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:
You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (i.e. the number of seconds since midnight). The defaults read:

* Date times: an [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time.
* Date: a year, optional separator, month, optional separator, day.
* Time: an hour, optional colon, minute, optional colon,
  optional seconds, optional am/pm.
* `parse_datetime()`: an
    [ISO8601](https://en.wikipedia.org/wiki/ISO_8601) date time. This
    is the most important date/time standard, and I recommend that you get
    a little familiar with it.

    ```{r}
    parse_datetime("2010-10-01T2010")
    # If time is omitted, it will be set to midnight
    parse_datetime("20101010")
    ```

* `parse_date()`: a year, optional separator, month, optional separator,
    day.

    ```{r}
    parse_date("2010-10-01")
    ```

* `parse_time()`: an hour, optional colon, minute,
    optional colon, optional seconds, optional am/pm. Base R doesn't have
    a great built-in class for time data, so we use the one provided in the
    hms package.

For example:

```{r}
parse_datetime("2010-10-01T2010")
parse_date("2010-10-01")
parse_time("20:10:01")
```
    ```{r}
    library(hms)
    parse_time("20:10:01")
    ```

If these defaults don't work for your data you can supply your own date time formats, built up of the following pieces:

* Year: `%Y` (4 digits). `%y` (2 digits); 00-69 -> 2000-2069,
  70-99 -> 1970-1999.
Year
: `%Y` (4 digits).
: `%y` (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

* Month: `%m` (2 digits), `%b` (abbreviated name), `%B` (full name).
Month
: `%m` (2 digits)
: `%b` (abbreviated name, like "Jan")
: `%B` (full name, "January").

* Day: `%d` (2 digits), `%e` (optional leading space).
Day

* Hour: `%H`.
: `%d` (2 digits)
: `%e` (optional leading space)

* Minutes: `%M`.
Time

* Seconds: `%S` (integer seconds), `%OS` (partial seconds).
: `%H` 0-23 hour.
: `%I` 1-12, must be used with `%p`.
: `%p` AM/PM indicator.
: `%M` minutes.
: `%S` integer seconds.
: `%OS` real seconds.
: `%Z` Time zone (as name, e.g. `America/Chicago`). Beware abbreviations:
  if you're American, note that "EST" is a Canadian time zone that does not
  have daylight savings time. It is _not_ Eastern Standard Time!
: `%z` (as offset from UTC, e.g. `+0800`).

* Time zone: `%Z` (as name, e.g. `America/Chicago`), `%z` (as offset from UTC,
  e.g. `+0800`). If you're American, note that "EST" is a Canadian time zone
  that does not have daylight savings time. It is _not_ Eastern Standard
  Time!
Non-digits

* AM/PM indicator: `%p`.

* Non-digits: `%.` skips one non-digit character, `%*` skips any number of
  non-digits.
: `%.` skips one non-digit character
: `%*` skips any number of non-digits.
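
As a sketch of how these pieces combine, the calls below build three formats from the specifiers listed above. The input strings and variable names are made up for illustration, and the code assumes readr is installed and that the default locale (English names, UTC) is in effect.

```r
library(readr)

# 12-hour clock: %I needs %p to know whether the time is AM or PM
dt <- parse_datetime("01/02/2015 08:30 PM", "%d/%m/%Y %I:%M %p")

# Offset from UTC: %z reads "+0800" and the result is stored in UTC
dt_utc <- parse_datetime("2015-02-01 20:30:00 +0800", "%Y-%m-%d %H:%M:%S %z")

# %. skips exactly one non-digit character between the date parts
d <- parse_date("2015x02x01", "%Y%.%m%.%d")
```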

The best way to figure out the correct string is to create a few examples in a character vector, and test with one of the parsing functions. For example:

@@ -272,12 +332,26 @@ parse_date("01/02/15", "%d/%m/%y")
parse_date("01/02/15", "%y/%m/%d")
```

If you're using `%b` or `%p`, and you're in a non-English locale, you can set the values with the `lang` argument to `locale()`. readr comes bundled with a bunch: `date_names_langs()`, or create your own with `date_names()`.
If you're using `%b` or `%B` with non-English month names, you'll need to set the `lang` argument to `locale()`. See the list of built-in languages in `date_names_langs()`, or if your language is not already included, create your own with `date_names()`.

```{r}
locale("fr")
locale("fr", asciify = TRUE)

parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
```
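
For a language that is not bundled, a custom set of names might look roughly like the sketch below. The Latin names and the `latin` object are illustrative placeholders chosen for this example, not something readr ships; the code assumes that `date_names()` expects day names starting with Sunday and that the result can be passed to `locale()` through its `date_names` argument.

```r
library(readr)

# Illustrative Latin names (placeholders for this example); days start with Sunday
latin <- date_names(
  mon = c("Ianuarius", "Februarius", "Martius", "Aprilis", "Maius", "Iunius",
          "Iulius", "Augustus", "September", "October", "November", "December"),
  mon_ab = c("Ian", "Feb", "Mar", "Apr", "Mai", "Iun",
             "Iul", "Aug", "Sep", "Oct", "Nov", "Dec"),
  day = c("dies Solis", "dies Lunae", "dies Martis", "dies Mercurii",
          "dies Iovis", "dies Veneris", "dies Saturni"),
  day_ab = c("Sol", "Lun", "Mar", "Mer", "Iov", "Ven", "Sat")
)

# Pass the custom names to locale() so that %B can match them
d <- parse_date("1 Ianuarius 2015", "%d %B %Y", locale = locale(date_names = latin))
```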

### Exercises

1. What are the most important options to `locale()`? If you live outside the
    US, create a new locale object that encapsulates the settings for the
    data files you read most commonly.

1. I didn't discuss the `date_format` and `time_format` options to
    `locale()`. What do they do? Construct an example that shows when they
    might be useful.

1. What are the most common encodings used in Europe? What are the
    most common encodings used in Asia?

## Parsing a file

@@ -375,7 +449,7 @@ These are useful if you have a plain text file with an unusual format. Often you
`type_convert()` applies the same parsing heuristics to the character columns in a data frame. It's useful if you've loaded data "by hand", and now want to convert character columns to the appropriate type:

```{r}
df <- tibble(x = c("1", "2", "3"), y = c("1.21", "2.32", "4.56"))
df <- tibble::tibble(x = c("1", "2", "3"), y = c("1.21", "2.32", "4.56"))
df
# Note the column types
type_convert(df)