# Vector tools
## Introduction
`%in%`
## Counts
You've seen `n()`, which takes no arguments and returns the size of the current group.
To count the number of non-missing values, use `sum(!is.na(x))`.
To count the number of distinct (unique) values, use `n_distinct(x)`.
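As a quick illustration of the difference between these counts, here is a minimal sketch on a made-up vector (`x` is hypothetical, not part of the flights data).
Note that `n_distinct()` treats `NA` as a value of its own unless you set `na.rm = TRUE`.

```{r}
library(dplyr)

# Hypothetical toy vector with repeats and missing values
x <- c(3, 1, NA, 3, 7, NA, 1)

length(x)                    # all elements, including NA: 7
sum(!is.na(x))               # non-missing values: 5
n_distinct(x)                # distinct values, NA counted once: 4
n_distinct(x, na.rm = TRUE)  # distinct non-missing values: 3
```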
```{r}
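library(dplyr)
library(nycflights13)
# Assumed setup: flights with a recorded departure and arrival delay
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
# Which destinations have the most carriers?
not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))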
```
Counts are so useful that dplyr provides a simple helper if all you want is a count:
```{r}
not_cancelled %>%
  count(dest)
```
Just like with `group_by()`, you can also provide multiple variables to `count()`.
```{r}
not_cancelled %>%
  count(carrier, dest)
```
You can optionally provide a weight variable with `wt`, in which case `count()` sums that variable within each group instead of counting rows.
For example, you could use this to "count" (sum) the total number of miles a plane flew:
```{r}
not_cancelled %>%
  count(tailnum, wt = distance)
```
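To make the weighted count less mysterious, the sketch below compares it with an explicit `group_by()` plus `summarise()` on a tiny made-up table.
The tibble `trips` and its values are hypothetical; only the column names mirror the flights data.

```{r}
library(dplyr)

# Hypothetical toy data: three flights by two planes
trips <- tibble(
  tailnum  = c("A1", "A1", "B2"),
  distance = c(100, 250, 400)
)

# Weighted count: sums `distance` within each `tailnum`
weighted <- trips %>%
  count(tailnum, wt = distance)

# The same totals written out by hand
by_hand <- trips %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

weighted
by_hand
```

Both results have one row per plane, with `n` equal to 350 for `A1` and 400 for `B2`.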
## Window functions
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
They are most useful in conjunction with `group_by()`, so that the offsets are computed within each group.
```{r}
(x <- 1:10)
lag(x)
lead(x)
```
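The running difference and change detection mentioned above can be sketched on a small made-up vector (`x` here is hypothetical).
The first element has no predecessor, so both results start with `NA` unless you supply a `default`.

```{r}
library(dplyr)

# Hypothetical readings with some repeated values
x <- c(5, 5, 8, 8, 8, 2)

x - lag(x)                   # NA  0  3  0  0 -6
x != lag(x)                  # NA FALSE TRUE FALSE FALSE TRUE
x - lag(x, default = x[1])   # 0  0  3  0  0 -6
```

Make sure dplyr is attached before calling `lag()`; otherwise R picks up `stats::lag()`, which is meant for time series objects and behaves differently.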
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
```{r}
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
min_rank(desc(y))
```
If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, and `ntile()`.
See their help pages for more details.
```{r}
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
```
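The variants differ mainly in how they treat ties, which is easier to see side by side.
The sketch below reuses the same `y` and adds `ntile()`, which the chunk above leaves out; the choice of two buckets is arbitrary.

```{r}
library(dplyr)

y <- c(1, 2, 2, NA, 3, 4)

tibble(
  y,
  min_rank   = min_rank(y),    # ties share a rank, then a gap: 1 2 2 NA 4 5
  row_number = row_number(y),  # ties broken by position:       1 2 3 NA 4 5
  dense_rank = dense_rank(y),  # ties share a rank, no gap:     1 2 2 NA 3 4
  ntile_2    = ntile(y, 2)     # two roughly equal buckets:     1 1 1 NA 2 2
)
```

In every case the missing value keeps a rank of `NA`.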
- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
For example, we can find the first and last departure for each day:
```{r}
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
```
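The default value is the main reason to prefer these helpers over plain indexing.
Here is a minimal sketch on a made-up two-element vector (`x` and the fallback value `0` are hypothetical):

```{r}
library(dplyr)

# Hypothetical group with only two elements
x <- c(10, 20)

first(x)                # 10
last(x)                 # 20
x[3]                    # NA: base indexing past the end
nth(x, 3)               # NA: the position does not exist
nth(x, 3, default = 0)  # 0: the supplied fallback
```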
These functions are complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:
```{r}
not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
```
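To see which rows the rank filter keeps, it can help to run the same pipeline on a table small enough to check by eye.
The tibble `deps` below is made up for illustration; only `dep_time` mirrors a real column name.

```{r}
library(dplyr)

# Hypothetical departures over two days
deps <- tibble(
  day      = c(1, 1, 1, 2, 2),
  flight   = c("a", "b", "c", "d", "e"),
  dep_time = c(900, 600, 1200, 700, 1500)
)

extremes <- deps %>%
  group_by(day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))

extremes
```

Flight `a` is dropped because it is neither the earliest nor the latest departure on day 1; the remaining rows keep every column, including `flight`.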