Move important tibble content earlier in the book (#1110)

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
Hadley Wickham 2022-10-24 13:16:14 -05:00 committed by GitHub
parent c0461b11bd
commit f93a5daeeb
3 changed files with 83 additions and 97 deletions


@ -201,6 +201,29 @@ There are a few good reasons to favor readr functions over the base equivalents:
- They are more reproducible.
Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
### Non-syntactic names
It's possible for a CSV file to have column names that are not valid R variable names; we refer to these as **non-syntactic** names.
For example, the variables might not start with a letter or they might contain unusual characters like a space:
```{r}
df <- read_csv("data/non-syntactic.csv", col_types = list())
df
```
You'll notice that they print surrounded by backticks, which you'll need to use when referring to them in other functions:
```{r}
df |> relocate(`2000`, .after = `:)`)
```
These values only need special handling when they appear in column names.
If you turn them into data (e.g. with `pivot_longer()`) they are just regular strings:
```{r}
df |> pivot_longer(everything())
```
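As a minimal sketch of how the backticks carry over to other dplyr verbs, the chunk below builds a tibble inline with the same column names as `data/non-syntactic.csv`, so it runs without the file. The replacement names `smiley`, `blank`, and `year` are made up for illustration.

```{r}
library(tidyverse)

# Same column names as data/non-syntactic.csv, built inline (an assumption
# made so the example is self-contained)
df_inline <- tibble(
  `:)` = "smile",
  `x y` = "space",
  `2000` = "number"
)

# Backticks are needed wherever the name is used as a variable
df_inline |> pull(`2000`)

# Renaming to syntactic names (hypothetical ones here) removes the need
# for backticks in later code
df_renamed <- df_inline |>
  rename(smiley = `:)`, blank = `x y`, year = `2000`)
names(df_renamed)
```

Renaming early is often the more comfortable option when you expect to refer to the columns many times.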
### Exercises
1. What function would you use to read a file where fields were separated with "\|"?
@ -232,6 +255,20 @@ There are a few good reasons to favor readr functions over the base equivalents:
read_csv("a;b\n1;3")
```
6. Practice referring to non-syntactic names in the following data frame by:
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
```
## Reading data from multiple files {#sec-readr-directory}
Sometimes your data is split across multiple files instead of being contained in a single file.
@ -326,9 +363,50 @@ file.remove("students-2.csv")
file.remove("students.rds")
```
## Data entry
Sometimes you'll need to assemble a tibble "by hand", doing a little data entry in your R script.
There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows.
`tibble()` works by column:
```{r}
tibble(
x = c(1, 2, 5),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.60)
)
```
Note that every column in a tibble must be the same size, so you'll get an error if they're not:
```{r}
#| error: true
tibble(
x = c(1, 2),
y = c("h", "m", "g"),
z = c(0.08, 0.83, 0.6)
)
```
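The one exception to the same-size rule is a length-1 value, which `tibble()` recycles to match the other columns. A small sketch, with the object name `recycled` chosen only for illustration:

```{r}
library(tibble)

# y and z each have length 1, so they are repeated to match the 5 values in x
recycled <- tibble(
  x = 1:5,
  y = "a",
  z = TRUE
)
recycled
```

Only length-1 values are recycled this way; a length-2 column next to a length-3 column still produces the error shown above.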
Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for **tr**ansposed t**ibble**, which lets you lay out your data row by row.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:
```{r}
tribble(
~x, ~y, ~z,
1, "h", 0.08,
2, "m", 0.83,
5, "g", 0.60,
)
```
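To check that the two layouts describe the same data, here is a small sketch that enters the values from the `tibble()` example both ways and compares the results. The object names `by_column` and `by_row` are illustrative, and the comparison goes through `as.data.frame()` only to keep it independent of any tibble-specific comparison methods.

```{r}
library(tibble)

# Column-wise entry
by_column <- tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

# Row-wise entry of the same values
by_row <- tribble(
  ~x, ~y,  ~z,
  1,  "h", 0.08,
  2,  "m", 0.83,
  5,  "g", 0.60
)

# Compare the two after converting to plain data frames
all.equal(as.data.frame(by_column), as.data.frame(by_row))
```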
We'll use `tibble()` and `tribble()` later in the book to construct small examples to demonstrate how various functions work.
## Summary
In this chapter, you've learned how to use readr to load rectangular flat files from disk into R.
In this chapter, you've learned how to load CSV files with `read_csv()` and to do your own data entry with `tibble()` and `tribble()`.
You've learned how CSV files work, some of the problems you might encounter, and how to overcome them.
We'll come back to data import a few times in this book: @sec-import-databases will show you how to load data from databases, @sec-import-spreadsheets from Excel and Google Sheets, @sec-rectangling from JSON, and @sec-scraping from websites.

data/non-syntactic.csv Normal file

@ -0,0 +1,2 @@
:),x y,2000
smile,space,number


@ -27,86 +27,6 @@ In this chapter we'll explore the **tibble** package, part of the core tidyverse
library(tidyverse)
```
## Creating tibbles
If you need to make a tibble "by hand", you can use `tibble()` or `tribble()`.
`tibble()` works by assembling individual vectors:
```{r}
x <- c(1, 2, 5)
y <- c("a", "b", "h")
tibble(x, y)
```
You can also optionally name the inputs, provide data inline with `c()`, and perform computation:
```{r}
tibble(
x1 = x,
x2 = c(10, 15, 25),
y = sqrt(x1^2 + x2^2)
)
```
Every column in a data frame or tibble must be the same length, so you'll get an error if the lengths are different:
```{r}
#| error: true
tibble(
x = c(1, 5),
y = c("a", "b", "c")
)
```
As the error suggests, individual values will be recycled to the same length as everything else:
```{r}
tibble(
x = 1:5,
y = "a",
z = TRUE
)
```
Another way to create a tibble is with `tribble()`, which is short for **tr**ansposed t**ibble**.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:
```{r}
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
```
Finally, if you have a regular `data.frame` you can turn it into a tibble with `as_tibble()`:
```{r}
as_tibble(mtcars)
```
The inverse of `as_tibble()` is `as.data.frame()`; it converts a tibble back into a regular `data.frame`.
## Non-syntactic names
It's possible for a tibble to have column names that are not valid R variable names, names that are **non-syntactic**.
For example, the variables might not start with a letter or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
```
You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
## Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
@ -244,24 +164,10 @@ If you hit one of those functions, just use `as.data.frame()` to turn your tibbl
3. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the reference variable from a tibble?
4. Practice referring to non-syntactic names in the following data frame by:
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
```
5. What does `tibble::enframe()` do?
4. What does `tibble::enframe()` do?
When might you use it?
6. What option controls how many additional column names are printed at the footer of a tibble?
5. What option controls how many additional column names are printed at the footer of a tibble?
## Summary