Polishing tibbles
								tibble.Rmd
							@@ -1,9 +1,13 @@
 | 
				
			|||||||
# Tibbles
 | 
					# Tibbles
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r, results = "asis", echo = FALSE}
 | 
				
			||||||
 | 
					status("complete")
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Introduction
 | 
					## Introduction
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
 | 
					Throughout this book we work with "tibbles" instead of R's traditional `data.frame`.
 | 
				
			||||||
Tibbles *are* data frames, but they tweak some older behaviours to make life a little easier.
 | 
					Tibbles *are* data frames, but they tweak some older behaviors to make your life a little easier.
 | 
				
			||||||
R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
 | 
					R is an old language, and some things that were useful 10 or 20 years ago now get in your way.
 | 
				
			||||||
It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
 | 
					It's difficult to change base R without breaking existing code, so most innovation occurs in packages.
 | 
				
			||||||
Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
 | 
					Here we will describe the **tibble** package, which provides opinionated data frames that make working in the tidyverse a little easier.
 | 
				
			||||||
@@ -21,30 +25,48 @@ library(tidyverse)
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
## Creating tibbles
 | 
					## Creating tibbles
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Almost all of the functions that you'll use in this book produce tibbles, as tibbles are one of the unifying features of the tidyverse.
 | 
					If you need to make a tibble "by hand", you can use `tibble()` or `tribble()`.
 | 
				
			||||||
Most other R packages use regular `data.frame`s, so you might want to coerce a `data.frame` to a tibble.
 | 
					`tibble()` works by assembling individual vectors:
 | 
				
			||||||
You can do that with `as_tibble()`:
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
as_tibble(mtcars)
 | 
					x <- c(1, 2, 5)
 | 
				
			||||||
 | 
					y <- c("a", "b", "h")
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					tibble(x, y)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
You can create a new tibble from individual vectors with `tibble()`.
 | 
					You can also optionally name the inputs, provide data inline with `c()`, and perform computation:
 | 
				
			||||||
`tibble()` will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown in this example:
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
 | 
					tibble(
 | 
				
			||||||
 | 
					  x1 = x,
 | 
				
			||||||
 | 
					  x2 = c(10, 15, 25),
 | 
				
			||||||
 | 
					  y = sqrt(x1^2 + x2^2)
 | 
				
			||||||
 | 
					)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					Every column in a data frame or tibble must be the same length, so you'll get an error if the lengths are different:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r, error = TRUE}
 | 
				
			||||||
 | 
					tibble(
 | 
				
			||||||
 | 
					  x = c(1, 5),
 | 
				
			||||||
 | 
					  y = c("a", "b", "c")
 | 
				
			||||||
 | 
					)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					As the error suggests, individual values will be recycled to the same length as everything else:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
tibble(
 | 
					tibble(
 | 
				
			||||||
  x = 1:5,
 | 
					  x = 1:5,
 | 
				
			||||||
  y = 1, 
 | 
					  y = "a",
 | 
				
			||||||
  z = x ^ 2 + y
 | 
					  z = TRUE
 | 
				
			||||||
)
 | 
					)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
If you're already familiar with `data.frame()`, note that `tibble()` does less: it never changes the names of variables and it never creates row names.
 | 
					Another way to create a tibble is with `tribble()`, which is short for **tr**ansposed tibble.
 | 
				
			||||||
 | 
					`tribble()` is customized for data entry in code: column headings start with `~` and entries are separated by commas.
 | 
				
			||||||
Another way to create a tibble is with `tribble()`, short for **tr**ansposed tibble.
 | 
					This makes it possible to lay out small amounts of data in an easy to read form:
 | 
				
			||||||
`tribble()` is customized for data entry in code: column headings start with `~`) and entries are separated by commas.
 | 
					 | 
				
			||||||
This makes it possible to lay out small amounts of data in easy to read form:
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
tribble(
 | 
					tribble(
 | 
				
			||||||
@@ -54,10 +76,18 @@ tribble(
 | 
				
			|||||||
)
 | 
					)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Non-syntactic names
 | 
					Finally, if you have a regular `data.frame` you can turn it into a tibble with `as_tibble()`:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
It's possible for a tibble to have column names that are not valid R variable names, aka **non-syntactic** names.
 | 
					```{r}
 | 
				
			||||||
For example, they might not start with a letter, or they might contain unusual characters like a space.
 | 
					as_tibble(mtcars)
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					The inverse of `as_tibble()` is `as.data.frame()`; it converts a tibble back into a regular `data.frame`.
 | 
				
			||||||
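A minimal sketch of that round trip, assuming the tibble package is installed (`mtcars` ships with base R). The object names `cars_tbl` and `cars_df` are made up for this illustration. Note that `as_tibble()` drops row names by default, so the car names stored as row names in `mtcars` are not carried over.

```r
library(tibble)

# data.frame -> tibble: the result gains the tibble classes.
cars_tbl <- as_tibble(mtcars)
class(cars_tbl)

# tibble -> data.frame: the tibble classes are removed again.
cars_df <- as.data.frame(cars_tbl)
class(cars_df)

# Row names were dropped on the way in, so they do not come back.
has_rownames(cars_tbl)
```

If the row names matter, `as_tibble()` accepts a `rownames` argument that stores them in a regular column instead.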
 | 
					
 | 
				
			||||||
 | 
					## Non-syntactic names
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					It's possible for a tibble to have column names that are not valid R variable names, known as **non-syntactic** names.
 | 
				
			||||||
 | 
					For example, the variables might not start with a letter or they might contain unusual characters like a space.
 | 
				
			||||||
To refer to these variables, you need to surround them with backticks, `` ` ``:
 | 
					To refer to these variables, you need to surround them with backticks, `` ` ``:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
@@ -74,12 +104,13 @@ You'll also need the backticks when working with these variables in other packag
 | 
				
			|||||||
## Tibbles vs. data.frame
 | 
					## Tibbles vs. data.frame
 | 
				
			||||||
 | 
					
 | 
				
			||||||
There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
 | 
					There are two main differences in the usage of a tibble vs. a classic `data.frame`: printing and subsetting.
 | 
				
			||||||
 | 
					If these differences cause problems when working with older packages, you can turn a tibble back to a regular data frame with `as.data.frame()`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Printing
 | 
					### Printing
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
 | 
					Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.
 | 
				
			||||||
This makes it much easier to work with large data.
 | 
					This makes it much easier to work with large data.
 | 
				
			||||||
In addition to its name, each column reports its type, a nice feature borrowed from `str()`:
 | 
					In addition to its name, each column reports its type, a nice feature inspired by `str()`:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
tibble(
 | 
					tibble(
 | 
				
			||||||
@@ -91,7 +122,7 @@ tibble(
 | 
				
			|||||||
)
 | 
					)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Where possible, they also use color to draw your eye to important differences.
 | 
					Where possible, tibbles also use color to draw your eye to important differences.
 | 
				
			||||||
One of the most important distinctions is between the string `"NA"` and the missing value, `NA`:
 | 
					One of the most important distinctions is between the string `"NA"` and the missing value, `NA`:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
@@ -106,7 +137,9 @@ First, you can explicitly `print()` the data frame and control the number of row
 | 
				
			|||||||
`width = Inf` will display all columns:
 | 
					`width = Inf` will display all columns:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
nycflights13::flights |> 
 | 
					library(nycflights13)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					flights |> 
 | 
				
			||||||
  print(n = 10, width = Inf)
 | 
					  print(n = 10, width = Inf)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -123,15 +156,13 @@ A final option is to use RStudio's built-in data viewer to get a scrollable view
 | 
				
			|||||||
This is also often useful at the end of a long chain of manipulations.
 | 
					This is also often useful at the end of a long chain of manipulations.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r, eval = FALSE}
 | 
					```{r, eval = FALSE}
 | 
				
			||||||
nycflights13::flights |> 
 | 
					flights |> View()
 | 
				
			||||||
  View()
 | 
					 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Subsetting
 | 
					### Extracting variables
 | 
				
			||||||
 | 
					
 | 
				
			||||||
So far all the tools you've learned have worked with complete data frames.
 | 
					So far all the tools you've learned have worked with complete data frames.
 | 
				
			||||||
If you want to pull out a single variable, you can use `dplyr::pull()`.
 | 
					If you want to pull out a single variable, you can use `dplyr::pull()`:
 | 
				
			||||||
`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector (you'll learn more about those in Chapter \@ref(vectors).
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
tb <- tibble(
 | 
					tb <- tibble(
 | 
				
			||||||
@@ -140,11 +171,17 @@ tb <- tibble(
 | 
				
			|||||||
  y1  = 6:10
 | 
					  y1  = 6:10
 | 
				
			||||||
)
 | 
					)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
tb |> pull(x1)
 | 
					tb |> pull(x1) # by name
 | 
				
			||||||
 | 
					tb |> pull(1)  # by position
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					`pull()` also takes an optional `name` argument that specifies the column to be used as names for a named vector, which you'll learn about in Chapter \@ref(vectors).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					```{r}
 | 
				
			||||||
tb |> pull(x1, name = id)
 | 
					tb |> pull(x1, name = id)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Alternatively, you can use base R tools like `$` and `[[`.
 | 
					You can also use the base R tools `$` and `[[`.
 | 
				
			||||||
`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
 | 
					`[[` can extract by name or position; `$` only extracts by name but is a little less typing.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
@@ -157,35 +194,29 @@ tb[[1]]
 | 
				
			|||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
 | 
					Compared to a `data.frame`, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
 | 
				
			||||||
In the following chunk `df` is a `data.frame` and `tb` is a `tibble`.
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					# Tibbles complain a lot:
 | 
				
			||||||
 | 
					tb$x
 | 
				
			||||||
 | 
					tb$z
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					# Data frames use partial matching and don't complain if a column doesn't exist
 | 
				
			||||||
df <- as.data.frame(tb)
 | 
					df <- as.data.frame(tb)
 | 
				
			||||||
 | 
					df$x
 | 
				
			||||||
# Partial match to existing variable name
 | 
					df$z
 | 
				
			||||||
tb$x # Warning + no match
 | 
					 | 
				
			||||||
df$x # Warning + partial match
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
# Column doesn't exist
 | 
					 | 
				
			||||||
tb$z # Warning
 | 
					 | 
				
			||||||
df$z # No warning
 | 
					 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Interacting with older code
 | 
					For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Some older functions don't work with tibbles.
 | 
					### Subsetting
 | 
				
			||||||
If you encounter one of these functions, use `as.data.frame()` to turn a tibble back to a `data.frame`:
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					Lastly, there are some important differences when using `[`.
 | 
				
			||||||
class(as.data.frame(tb))
 | 
					With `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector, which is a common source of bugs.
 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
The main reason that some older functions don't work with tibble is the `[` function.
 | 
					 | 
				
			||||||
We don't use `[` much in this book because for data frames, `dplyr::filter()` and `dplyr::select()` typically allow you to solve the same problems with clearer code.
 | 
					 | 
				
			||||||
With base R `data.frame`s, `[` sometimes returns a `data.frame`, and sometimes returns a vector.
 | 
					 | 
				
			||||||
With tibbles, `[` always returns another tibble.
 | 
					With tibbles, `[` always returns another tibble.
 | 
				
			||||||
 | 
					This can sometimes cause problems when working with older code.
 | 
				
			||||||
 | 
					If you hit one of those functions, just use `as.data.frame()` to turn your tibble back to a `data.frame`.
 | 
				
			||||||
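The `[` difference can be seen in a small sketch. The objects `df_demo` and `tb_demo` are hypothetical and not part of the chapter; the example assumes the tibble package is installed.

```r
library(tibble)

df_demo <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb_demo <- as_tibble(df_demo)

# Selecting a single column from a data.frame simplifies to a bare vector.
class(df_demo[, "x"])

# The same call on a tibble still returns a tibble.
class(tb_demo[, "x"])

# With a data.frame you have to ask for the unsimplified result explicitly.
class(df_demo[, "x", drop = FALSE])
```

Older functions that expect `[` to simplify to a vector may fail when handed a tibble, which is when converting with `as.data.frame()` helps.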
 | 
					
 | 
				
			||||||
## Exercises
 | 
					### Exercises
 | 
				
			||||||
 | 
					
 | 
				
			||||||
1.  How can you tell if an object is a tibble?
 | 
					1.  How can you tell if an object is a tibble?
 | 
				
			||||||
    (Hint: try printing `mtcars`, which is a regular `data.frame`).
 | 
					    (Hint: try printing `mtcars`, which is a regular `data.frame`).
 | 
				
			||||||
 
 | 
				
			|||||||