Fix/data import (#1412)

* - read_csv() recognizes "NA" (Namibia) as NA (not available), as na = c("", "NA") by default.
  - this chapter is not about stringr, but about readr.
  - "seldom" is missing in the context.
  - missing comma of "for example".
  - list(.default = col_character()) can also work.
  - use another_csv again.
  - make tibble() and tribble() examples identical
* make exactly same as tibble() example
* Update data-import.qmd
* Update data-import.qmd
* Update data-import.qmd
* Update data-import.qmd

---------

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
parent e5a847f7b3
commit ae818956b2

@@ -90,7 +90,7 @@ students
 
 In the `favourite.food` column, there are a bunch of food items, and then the character string `N/A`, which should have been a real `NA` that R will recognize as "not available".
 This is something we can address using the `na` argument.
-By default `read_csv()` only recognizes empty strings (`""`) as `NA`s, we want it to also recognize the character string `"N/A`
+By default, `read_csv()` only recognizes empty strings (`""`) in this dataset as `NA`s, we want it to also recognize the character string `"N/A"`.
 
 ```{r}
 #| message: false

@@ -319,7 +319,7 @@ This heuristic works well if you have a clean dataset, but in real life, you'll
 ### Missing values, column types, and problems
 
 The most common way column detection fails is that a column contains unexpected values, and you get a character column instead of a more specific type.
-One of the most common causes for this is a missing value, recorded using something other than the `NA` that stringr expects.
+One of the most common causes for this is a missing value, recorded using something other than the `NA` that readr expects.
 
 Take this simple 1 column CSV file as an example:
 

@@ -373,13 +373,13 @@ read_csv(simple_csv, na = ".")
 readr provides a total of nine column types for you to use:
 
 - `col_logical()` and `col_double()` read logicals and real numbers. They're relatively rarely needed (except as above), since readr will usually guess them for you.
-- `col_integer()` reads integers. We distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
-- `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn't make sense to (e.g.) divide it in half, for example a phone number, social security number, credit card number, etc.
+- `col_integer()` reads integers. We seldom distinguish integers and doubles in this book because they're functionally equivalent, but reading integers explicitly can occasionally be useful because they occupy half the memory of doubles.
+- `col_character()` reads strings. This is sometimes useful to specify explicitly when you have a column that is a numeric identifier, i.e. long series of digits that identifies some object, but it doesn't make sense to (e.g.) divide it in half, for example, a phone number, social security number, credit card number, etc.
 - `col_factor()`, `col_date()`, and `col_datetime()` create factors, dates, and date-times respectively; you'll learn more about those when we get to those data types in @sec-factors and @sec-dates-and-times.
 - `col_number()` is a permissive numeric parser that will ignore non-numeric components, and is particularly useful for currencies. You'll learn more about it in @sec-numbers.
 - `col_skip()` skips a column so it's not included in the result, which can be useful for speeding up reading the data if you have a large CSV file and you only want to use some of the columns.
 
-It's also possible to override the default column by switching from `list()` to `cols()`:
+It's also possible to override the default column by switching from `list()` to `cols()` and specifying `.default`:
 
 ```{r}
 another_csv <- "

@@ -396,8 +396,7 @@ Another useful helper is `cols_only()` which will read in only the columns you s
 
 ```{r}
 read_csv(
-  "x,y,z
-  1,2,3",
+  another_csv,
   col_types = cols_only(x = col_character())
 )
 ```

@@ -526,9 +525,9 @@ This makes it possible to lay out small amounts of data in an easy to read form:
 ```{r}
 tribble(
   ~x, ~y, ~z,
-  "h", 1, 0.08,
-  "m", 2, 0.83,
-  "g", 5, 0.60,
+  1, "h", 0.08,
+  2, "m", 0.83,
+  5, "g", 0.60
 )
 ```
 