# Vector tools

## Introduction

`%in%`

```{r}
library(tidyverse)
library(nycflights13)

not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
```

## Counts

- Counts: You've seen `n()`, which takes no arguments, and returns the size of the current group.
  To count the number of non-missing values, use `sum(!is.na(x))`.
  To count the number of distinct (unique) values, use `n_distinct(x)`.

```{r}
# Which destinations have the most carriers?
not_cancelled %>%
  group_by(dest) %>%
  summarise(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))
```
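
The difference between these counts is easiest to see on a small vector. The following is a minimal sketch on made-up values (the vector `x` is hypothetical, not taken from `flights`); note that `n_distinct()` treats `NA` as a value of its own unless you set `na.rm = TRUE`.

```{r}
library(dplyr)

# Hypothetical values: one missing entry and one repeated value
x <- c(10, NA, 20, 20, 30)

n_total         <- length(x)                    # every element, including the NA
n_present       <- sum(!is.na(x))               # non-missing elements only
n_unique        <- n_distinct(x)                # distinct values, NA counted as one
n_unique_non_na <- n_distinct(x, na.rm = TRUE)  # distinct values, NA dropped

c(n_total, n_present, n_unique, n_unique_non_na)
```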

Counts are so useful that dplyr provides a simple helper if all you want is a count:

```{r}
not_cancelled %>%
  count(dest)
```

Just like with `group_by()`, you can also provide multiple variables to `count()`.

```{r}
not_cancelled %>%
  count(carrier, dest)
```

You can optionally provide a weight variable.
For example, you could use this to "count" (sum) the total number of miles a plane flew:

```{r}
not_cancelled %>%
  count(tailnum, wt = distance)
```
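
To see what the weight does, it can help to write the same computation out by hand. This sketch uses a small made-up table (`trips` and its values are hypothetical) and compares `count()` with `wt` against an explicit `group_by()` plus `summarise()`:

```{r}
library(dplyr)

# Hypothetical toy data: three flights by two planes
trips <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

# Weighted count: sums `distance` within each `tailnum`
by_count <- trips %>%
  count(tailnum, wt = distance)

# The same totals written out with group_by() and summarise()
by_summarise <- trips %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))

by_count
```

Both versions return one row per `tailnum` with the summed distance in `n`.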

## Window functions

- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values.
  This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
  They are most useful in conjunction with `group_by()`, which you'll learn about shortly.

```{r}
(x <- 1:10)
lag(x)
lead(x)
```
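
The two idioms mentioned above, running differences and change detection, can be sketched on a short vector. The values below are made up for illustration; the first element has no predecessor, so both results start with `NA` unless you supply `default`.

```{r}
library(dplyr)

# Hypothetical readings
x <- c(3, 3, 5, 5, 5, 2)

# Running difference: each value minus the one before it
diffs <- x - lag(x)

# Change detection: TRUE wherever the value differs from its predecessor
changed <- x != lag(x)

# Supplying `default` fills the first position instead of leaving NA
diffs_filled <- x - lag(x, default = x[1])

diffs
changed
diffs_filled
```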

- Ranking: there are a number of ranking functions, but you should start with `min_rank()`.
  It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th).
  The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.

```{r}
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
min_rank(desc(y))
```

If `min_rank()` doesn't do what you need, look at the variants `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`.
See their help pages for more details.

```{r}
row_number(y)
dense_rank(y)
percent_rank(y)
cume_dist(y)
```
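
The variants differ mainly in how they treat ties, which is easier to see side by side. This is a small sketch that redefines the same `y` so the chunk stands alone; it also shows `ntile()`, which needs a number of buckets as its second argument (two is an arbitrary choice here).

```{r}
library(dplyr)

y <- c(1, 2, 2, NA, 3, 4)

ranks <- tibble(
  y          = y,
  min_rank   = min_rank(y),    # ties share the lowest rank, then a gap follows
  dense_rank = dense_rank(y),  # ties share a rank, no gap afterwards
  row_number = row_number(y),  # ties are broken by position
  bucket     = ntile(y, 2)     # split the non-missing values into 2 buckets
)

ranks
```

In every column the missing value keeps a missing rank.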

- Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
  These work similarly to `x[1]`, `x[2]`, and `x[length(x)]` but let you set a default value if that position does not exist (e.g. you're trying to get the 3rd element from a group that only has two elements).
  For example, we can find the first and last departure for each day:

```{r}
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
```
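
The default value mentioned above only matters when the requested position is missing. Here is a minimal sketch with a made-up two-element vector (`dep` and the fallback value `-1` are hypothetical choices for illustration):

```{r}
library(dplyr)

# Hypothetical group with only two departures
dep <- c(517, 533)

third_base    <- dep[3]                     # plain subsetting gives NA
third_default <- nth(dep, 3, default = -1)  # nth() lets you pick the fallback
first_dep     <- first(dep)
last_dep      <- last(dep)

c(third_base, third_default, first_dep, last_dep)
```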

These functions are complementary to filtering on ranks.
Filtering gives you all variables, with each observation in a separate row:

```{r}
not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
```

### dplyr

```{r}
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
```

- Find the worst members of each group:

```{r}
flights_sml %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
```

- Find all groups bigger than a threshold:

```{r}
popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() > 365)
popular_dests
```

- Standardise to compute per group metrics:

```{r}
popular_dests %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, dest, arr_delay, prop_delay)
```

A grouped filter is a grouped mutate followed by an ungrouped filter.
I generally avoid grouped filters except for quick and dirty manipulations; otherwise it's hard to check that you've done the manipulation correctly.
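
That equivalence can be sketched on a toy table. The data frame `df`, its columns, and the threshold of two rows are all hypothetical; the point is only that the two pipelines keep the same rows, and that the mutate version leaves an intermediate column you can inspect before filtering.

```{r}
library(dplyr)

# Hypothetical toy data: group "a" has three rows, group "b" has one
df <- tibble(
  g = c("a", "a", "a", "b"),
  x = 1:4
)

# Grouped filter: keep only groups with more than two rows
via_filter <- df %>%
  group_by(g) %>%
  filter(n() > 2) %>%
  ungroup()

# Grouped mutate followed by an ungrouped filter
via_mutate <- df %>%
  group_by(g) %>%
  mutate(size = n()) %>%
  ungroup() %>%
  filter(size > 2) %>%
  select(-size)

via_filter
```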

Functions that work most naturally in grouped mutates and filters are known as window functions (vs. the summary functions used for summaries).
You can learn more about useful window functions in the corresponding vignette: `vignette("window-functions")`.

### Exercises

1. Find the 10 most delayed flights using a ranking function.
   How do you want to handle ties?
   Carefully read the documentation for `min_rank()`.

2. Which plane (`tailnum`) has the worst on-time record?

3. What time of day should you fly if you want to avoid delays as much as possible?

4. For each destination, compute the total minutes of delay.
   For each flight, compute the proportion of the total delay for its destination.

5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave.
   Using `lag()`, explore how the delay of a flight is related to the delay of the immediately preceding flight.

6. Look at each destination.
   Can you find flights that are suspiciously fast?
   (i.e. flights that represent a potential data entry error).
   Compute the air time of a flight relative to the shortest flight to that destination.
   Which flights were most delayed in the air?

7. Find all destinations that are flown by at least two carriers.
   Use that information to rank the carriers.

8.