Merge branch 'master' of github.com:hadley/r4ds
commit e8b4bbb905
import.Rmd (16 lines changed)
@@ -18,7 +18,7 @@ library(readr)
 
 Most of readr's functions are concerned with turning flat files into data frames:
 
-* `read_csv()` reads comma delimited files, `read_csv2()` reads semi-colon
+* `read_csv()` reads comma delimited files, `read_csv2()` reads semicolon
 separated files (common in countries where `,` is used as the decimal place),
 `read_tsv()` reads tab delimited files, and `read_delim()` reads in files
 with any delimiter.
@@ -108,7 +108,7 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
 your operating system and environment variables, so import code that works
 on your computer might not work on someone else's.
 
-### Exericses
+### Exercises
 
 1. What function would you use to read a file where fields were separated with
 "|"?
@@ -119,7 +119,7 @@ If you've used R before, you might wonder why we're not using `read.csv()`. Ther
 1. What is the most important argument to `read_fwf()` that we haven't already
 discussed?
 
-1. Some times strings in a csv file contain commas. To prevent them from
+1. Sometimes strings in a csv file contain commas. To prevent them from
 causing problems they need to be surrounded by a quoting character, like
 `"` or `'`. By convention, `read_csv()` assumes that the quoting
 character will be `"`, and if you want to change it you'll need to
@@ -281,7 +281,7 @@ Encodings are a rich and complex topic, and I've only scratched the surface here
 
 You pick between three parsers depending on whether you want a date (the number of days since 1970-01-01), a date time (the number of seconds since midnight 1970-01-01), or a time (the number of seconds since midnight):
 
-* `parse_datetime()` expects an ISO8601 date time. ISO8691 is an
+* `parse_datetime()` expects an ISO8601 date time. ISO8601 is an
 international standard in which the components of a date are
 organised from biggest to smallest: year, month, day, hour, minute,
 second.
@@ -427,7 +427,7 @@ These defaults don't always work for larger files. There are two basic problems:
 a column of doubles that only contains integers in the first 1000 rows.
 
 1. The column might contain a lot of missing values. If the first 1000
-rows contains on `NA`s, readr will guess that it's a character
+rows contains only `NA`s, readr will guess that it's a character
 vector, whereas you probably want to parse it as something more
 specific.
 
@@ -439,7 +439,7 @@ challenge <- read_csv(readr_example("challenge.csv"))
 
 (Note the use of `readr_example()` which finds the path to one of the files included with the package)
 
-There are two outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It's always a good idea to explicitly pull out the `problems()` so you can explore them in more depth:
+There are two outputs: the column specification generated by looking at the first 1000 rows, and the first five parsing failures. It's always a good idea to explicitly pull out the `problems()`, so you can explore them in more depth:
 
 ```{r}
 problems(challenge)
@@ -543,7 +543,7 @@ There are a few other general strategies to help you parse files:
 
 readr also comes with two useful functions for writing data back to disk: `write_csv()` and `write_tsv()`. They:
 
-* Are faster than the base R equvalents.
+* Are faster than the base R equivalents.
 
 * Never write rownames, and quote only when needed.
 
@@ -610,7 +610,7 @@ file.remove("challenge.rds")
 
 To get other types of data into R, we recommend starting with the tidyverse packages listed below. They're certainly not perfect, but they are a good place to start.
 
-For rectanuglar data:
+For rectangular data:
 
 * haven reads SPSS, Stata, and SAS files.
 
@@ -235,7 +235,7 @@ Graphically, that looks like:
 knitr::include_graphics("diagrams/join-outer.png")
 ```
 
-The most commonly used join is the left join: you use this whenever you lookup additional data out of another table, because it preserves the original observations even when there isn't a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
+The most commonly used join is the left join: you use this whenever you look up additional data out of another table, because it preserves the original observations even when there isn't a match. The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
 
 Another way to depict the different types of joins is with a Venn diagram:
 
@@ -383,7 +383,7 @@ dplyr | SQL
 
 Note that "INNER" and "OUTER" are optional, and often omitted.
 
-Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests SQL supports a wide range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).
+Joining different variables between the tables, e.g. `inner_join(x, y, by = c("a" = "b"))` uses a slightly different syntax in SQL: `SELECT * FROM x INNER JOIN y ON x.a = y.b`. As this syntax suggests SQL supports a wider range of join types than dplyr because you can connect the tables using constraints other than equality (sometimes called non-equijoins).
 
 ## Filtering joins {#filtering-joins}
 