Move important tibble content earlier in the book (#1110)

Co-authored-by: Mine Cetinkaya-Rundel <cetinkaya.mine@gmail.com>
Hadley Wickham 2022-10-24 13:16:14 -05:00 committed by GitHub
parent c0461b11bd
commit f93a5daeeb
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 83 additions and 97 deletions


@ -201,6 +201,29 @@ There are a few good reasons to favor readr functions over the base equivalents:
- They are more reproducible.
Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else's.
### Non-syntactic names
It's possible for a CSV file to have column names that are not valid R variable names; we refer to these as **non-syntactic** names.
For example, the variables might not start with a letter or they might contain unusual characters like a space:
```{r}
df <- read_csv("data/non-syntactic.csv", col_types = list())
df
```
You'll notice that they print surrounded by backticks, which you'll need to use when referring to them in other functions:
```{r}
df |> relocate(`2000`, .after = `:)`)
```
These values only need special handling when they appear in column names.
If you turn them into data (e.g. with `pivot_longer()`) they are just regular strings:
```{r}
df |> pivot_longer(everything())
```
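As a minimal sketch of the backtick rule, assuming a hand-built tibble as a hypothetical stand-in for `data/non-syntactic.csv` (the replacement names `smiley`, `x_y` and `y2000` are made up for illustration), you might write something like:

```r
library(tibble)
library(dplyr)

# Hypothetical stand-in for the contents of data/non-syntactic.csv
df <- tibble(`:)` = "smile", `x y` = "space", `2000` = "number")

# Backticks are needed wherever the name is parsed as R code
df$`2000`
df |> select(`x y`)

# When the name is supplied as a string, no backticks are needed
df[["x y"]]

# One way to avoid backticks later on: rename to syntactic names
# (the new names here are illustrative only)
df_clean <- df |>
  rename(smiley = `:)`, x_y = `x y`, y2000 = `2000`)
names(df_clean)
```

Renaming up front is only one option; whether it is worth doing probably depends on how often you need to refer to the columns afterwards.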
### Exercises
1. What function would you use to read a file where fields were separated with "\|"?
@ -232,6 +255,20 @@ There are a few good reasons to favor readr functions over the base equivalents:
read_csv("a;b\n1;3")
```
6. Practice referring to non-syntactic names in the following data frame by:
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
```
## Reading data from multiple files {#sec-readr-directory}
Sometimes your data is split across multiple files instead of being contained in a single file.
@ -326,9 +363,50 @@ file.remove("students-2.csv")
file.remove("students.rds")
```
## Data entry
Sometimes you'll need to assemble a tibble "by hand" by doing a little data entry in your R script.
There are two useful functions to help you do this, which differ in whether you lay out the tibble by columns or by rows.
`tibble()` works by column:
```{r}
tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)
```
Note that every column in a tibble must be the same size, so you'll get an error if they're not:
```{r}
#| error: true
tibble(
  x = c(1, 2),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.6)
)
```
Laying out the data by column can make it hard to see how the rows are related, so an alternative is `tribble()`, short for **tr**ansposed t**ibble**, which lets you lay out your data row by row.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy-to-read form:
```{r}
tribble(
  ~x, ~y, ~z,
  1, "h", 0.08,
  2, "m", 0.83,
  5, "g", 0.60,
)
```
We'll use `tibble()` and `tribble()` later in the book to construct small examples to demonstrate how various functions work.
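As a rough sketch of how the two layouts relate, the following builds the same small data set both ways and compares the results. The object names `by_col`, `by_row` and `recycled` are hypothetical, chosen only for this illustration:

```r
library(tibble)

# Column-wise entry: one vector per column
by_col <- tibble(
  x = c(1, 2, 5),
  y = c("h", "m", "g"),
  z = c(0.08, 0.83, 0.60)
)

# Row-wise entry: one line of code per row
by_row <- tribble(
  ~x, ~y,  ~z,
  1,  "h", 0.08,
  2,  "m", 0.83,
  5,  "g", 0.60
)

# Both calls should describe the same data
all.equal(by_col, by_row)

# The same-size rule has one exception: length-1 values are recycled
recycled <- tibble(x = 1:3, y = "a")
nrow(recycled)
```

Which form is easier to read tends to depend on the shape of the data: `tribble()` is usually more comfortable for a handful of short rows, while `tibble()` scales better to longer columns.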
## Summary
In this chapter, you've learned how to use readr to load rectangular flat files from disk into R.
In this chapter, you've learned how to load CSV files with `read_csv()` and to do your own data entry with `tibble()` and `tribble()`.
You've learned how csv files work, some of the problems you might encounter, and how to overcome them.
We'll come to data import a few times in this book: @sec-import-databases will show you how to load data from databases, @sec-import-spreadsheets from Excel and googlesheets, @sec-rectangling from JSON, and @sec-scraping from websites.

data/non-syntactic.csv Normal file

@ -0,0 +1,2 @@
:),x y,2000
smile,space,number


@ -27,86 +27,6 @@ In this chapter we'll explore the **tibble** package, part of the core tidyverse
library(tidyverse)
```
## Creating tibbles
If you need to make a tibble "by hand", you can use `tibble()` or `tribble()`.
`tibble()` works by assembling individual vectors:
```{r}
x <- c(1, 2, 5)
y <- c("a", "b", "h")
tibble(x, y)
```
You can also optionally name the inputs, provide data inline with `c()`, and perform computation:
```{r}
tibble(
x1 = x,
x2 = c(10, 15, 25),
y = sqrt(x1^2 + x2^2)
)
```
Every column in a data frame or tibble must be same length, so you'll get an error if the lengths are different:
```{r}
#| error: true
tibble(
x = c(1, 5),
y = c("a", "b", "c")
)
```
As the error suggests, individual values will be recycled to the same length as everything else:
```{r}
tibble(
x = 1:5,
y = "a",
z = TRUE
)
```
Another way to create a tibble is with `tribble()`, which is short for **tr**ansposed t**ibble**.
`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
This makes it possible to lay out small amounts of data in an easy to read form:
```{r}
tribble(
~x, ~y, ~z,
"a", 2, 3.6,
"b", 1, 8.5
)
```
Finally, if you have a regular `data.frame` you can turn it into a tibble with `as_tibble()`:
```{r}
as_tibble(mtcars)
```
The inverse of `as_tibble()` is `as.data.frame()`; it converts a tibble back into a regular `data.frame`.
## Non-syntactic names
It's possible for a tibble to have column names that are not valid R variable names, names that are **non-syntactic**.
For example, the variables might not start with a letter or they might contain unusual characters like a space.
To refer to these variables, you need to surround them with backticks, `` ` ``:
```{r}
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
```
You'll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
## Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
@ -244,24 +164,10 @@ If you hit one of those functions, just use `as.data.frame()` to turn your tibbl
3. If you have the name of a variable stored in an object, e.g. `var <- "mpg"`, how can you extract the reference variable from a tibble?
4. Practice referring to non-syntactic names in the following data frame by: 4. What does `tibble::enframe()` do?
a. Extracting the variable called `1`.
b. Plotting a scatterplot of `1` vs `2`.
c. Creating a new column called `3` which is `2` divided by `1`.
d. Renaming the columns to `one`, `two` and `three`.
```{r}
annoying <- tibble(
`1` = 1:10,
`2` = `1` * 2 + rnorm(length(`1`))
)
```
5. What does `tibble::enframe()` do?
When might you use it?
6. What option controls how many additional column names are printed at the footer of a tibble? 5. What option controls how many additional column names are printed at the footer of a tibble?
## Summary