TR feedback 2 (#1318)
This commit is contained in:
parent bf07203845
commit 7cd62150c0
@@ -122,7 +122,7 @@ Thanks to arrow, this code will work regardless of how large the underlying dataset is
 But it's currently rather slow: on Hadley's computer, it took \~10s to run.
 That's not terrible given how much data we have, but we can make it much faster by switching to a better format.

-## The parquet format
+## The parquet format {#sec-parquet}

 To make this data easier to work with, let's switch to the parquet file format and split it up into multiple files.
 The following sections will first introduce you to parquet and partitioning, and then apply what we've learned to the Seattle library data.
@@ -174,6 +174,15 @@ You can also force the creation of a date-time from a date by supplying a timezone:
 ymd("2017-01-31", tz = "UTC")
 ```

+Here I use the UTC[^datetimes-3] timezone which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude[^datetimes-4].
+It doesn't use daylight savings time, making it a bit easier to compute with.
+
+[^datetimes-3]: You might wonder what UTC stands for.
+    It's a compromise between the English "Coordinated Universal Time" and French "Temps Universel Coordonné".
+
+[^datetimes-4]: No prizes for guessing which country came up with the longitude system.
+
 ### From individual components

 Instead of a single string, sometimes you'll have the individual components of the date-time spread across multiple columns.
@@ -300,6 +309,7 @@ The next section will look at how arithmetic works with date-times.
 ### Getting components

 You can pull out individual parts of the date with the accessor functions `year()`, `month()`, `mday()` (day of the month), `yday()` (day of the year), `wday()` (day of the week), `hour()`, `minute()`, and `second()`.
 These are effectively the opposites of `make_datetime()`.

 ```{r}
 datetime <- ymd_hms("2026-07-08 12:34:56")
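# Editorial sketch (not part of this diff): the accessors named above, applied
# to the datetime just created; assumes lubridate is attached.
year(datetime)    # 2026
month(datetime)   # 7
mday(datetime)    # 8
yday(datetime)    # 189
wday(datetime)    # 4, i.e. Wednesday (numeric, with 1 = Sunday by default)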
@@ -629,8 +639,8 @@ We can fix this by adding `days(1)` to the arrival time of each overnight flight:
 flights_dt <- flights_dt |>
   mutate(
     overnight = arr_time < dep_time,
-    arr_time = arr_time + days(if_else(overnight, 0, 1)),
-    sched_arr_time = sched_arr_time + days(overnight * 1)
+    arr_time = arr_time + days(!overnight),
+    sched_arr_time = sched_arr_time + days(overnight)
   )
 ```
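An editorial aside, since this trips up newcomers: the trick in both expressions is that R coerces `TRUE`/`FALSE` to 1/0, so `days()` yields a one-day period exactly where the condition holds. A minimal sketch, assuming lubridate is attached:

```r
overnight <- c(TRUE, FALSE)
days(overnight)   # "1d 0H 0M 0S" "0d 0H 0M 0S"
days(!overnight)  # "0d 0H 0M 0S" "1d 0H 0M 0S"
```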
@@ -643,9 +653,10 @@ flights_dt |>

 ### Intervals {#sec-intervals}

-It's obvious what `dyears(1) / ddays(365)` should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365 days worth of seconds.
+What does `dyears(1) / ddays(365)` return?
+It's not quite one, because `dyears()` is defined as the number of seconds per average year, which is 365.25 days.

-What should `years(1) / days(1)` return?
+What does `years(1) / days(1)` return?
 Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366!
 There's not quite enough information for lubridate to give a single clear answer.
 What it does instead is give an estimate:
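A sketch of the values involved (from lubridate's documented behavior; not part of this diff):

```r
dyears(1) / ddays(365)  # 1.000685: a duration year is 365.25 days' worth of seconds
years(1) / days(1)      # 365.25, printed with a message that it's only an estimate
```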
@@ -676,8 +687,8 @@ y2024 / days(1)

 ### Exercises

-1.  Explain `days(overnight * 1)` to someone who has just started learning R.
-    How does it work?
+1.  Explain `days(!overnight)` and `days(overnight)` to someone who has just started learning R.
+    What is the key fact you need to know?

 2.  Create a vector of dates giving the first day of every month in 2015.
     Create a vector of dates giving the first day of every month in the *current* year.
@@ -19,6 +19,8 @@ Writing a function has three big advantages over using copy-and-paste:

 3.  You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

+4.  It makes it easier to reuse work from project-to-project, increasing your productivity over time.
+
 A good rule of thumb is to consider writing a function whenever you've copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
 In this chapter, you'll learn about three useful types of functions:
@@ -327,12 +329,7 @@ Once you start writing functions, there are two RStudio shortcuts that are super useful:
 3.  Given a vector of birthdates, write a function to compute the age in years.

 4.  Write your own functions to compute the variance and skewness of a numeric vector.
-    Variance is defined as $$
-    \mathrm{Var}(x) = \frac{1}{n - 1} \sum_{i=1}^n (x_i - \bar{x})^2 \text{,}
-    $$ where $\bar{x} = (\sum_i^n x_i) / n$ is the sample mean.
-    Skewness is defined as $$
-    \mathrm{Skew}(x) = \frac{\frac{1}{n-2}\left(\sum_{i=1}^n(x_i - \bar x)^3\right)}{\mathrm{Var}(x)^{3/2}} \text{.}
-    $$
+    You can look up the definitions on Wikipedia or elsewhere.

 5.  Write `both_na()`, a summary function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
@@ -340,8 +337,12 @@ Once you start writing functions, there are two RStudio shortcuts that are super useful:
 Why are they useful even though they are so short?

 ```{r}
-is_directory <- function(x) file.info(x)$isdir
-is_readable <- function(x) file.access(x, 4) == 0
+is_directory <- function(x) {
+  file.info(x)$isdir
+}
+is_readable <- function(x) {
+  file.access(x, 4) == 0
+}
 ```

 ## Data frame functions
@@ -484,7 +485,8 @@ count_prop <- function(df, var, sort = FALSE) {
 diamonds |> count_prop(clarity)
 ```

-This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables in `...`.
+This function has three arguments: `df`, `var`, and `sort`, and only `var` needs to be embraced because it's passed to `count()` which uses data-masking for all variables.
 Note that we use a default value for `sort` so that if the user doesn't supply their own value it will default to `FALSE`.
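For reference, the `count_prop()` under discussion is roughly this (an editorial sketch reconstructed from the surrounding text, assuming dplyr is attached):

```r
count_prop <- function(df, var, sort = FALSE) {
  df |>
    count({{ var }}, sort = sort) |>  # `var` is embraced; `sort` is an ordinary value
    mutate(prop = n / sum(n))         # share of rows in each group
}
```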

 Or maybe you want to find the sorted unique values of a variable for a subset of the data.
 Rather than supplying a variable and a value to do the filtering, we'll allow the user to supply a condition:
@@ -499,8 +501,6 @@ unique_where <- function(df, condition, var) {

 # Find all the destinations in December
 flights |> unique_where(month == 12, dest)
-# Which months did plane N14228 fly in?
-flights |> unique_where(tailnum == "N14228", month)
 ```

 Here we embrace `condition` because it's passed to `filter()` and `var` because it's passed to `distinct()` and `arrange()`.
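For reference, a `unique_where()` matching that description would look something like this sketch (assuming dplyr):

```r
unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>  # condition is data-masked by filter()
    distinct({{ var }}) |>      # var is data-masked by distinct() ...
    arrange({{ var }})          # ... and by arrange()
}
```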
@@ -509,7 +509,7 @@ We've made all these examples to take a data frame as the first argument, but if
 For example, the following function always works with the flights dataset and always selects `time_hour`, `carrier`, and `flight` since they form the compound primary key that allows you to identify a row.

 ```{r}
-flights_sub <- function(rows, cols) {
+subset_flights <- function(rows, cols) {
   flights |>
     filter({{ rows }}) |>
     select(time_hour, carrier, flight, {{ cols }})
@@ -527,7 +527,10 @@ You might try writing something like:
 count_missing <- function(df, group_vars, x_var) {
   df |>
     group_by({{ group_vars }}) |>
-    summarize(n_miss = sum(is.na({{ x_var }})))
+    summarize(
+      n_miss = sum(is.na({{ x_var }})),
+      .groups = "drop"
+    )
 }

 flights |>
@@ -541,7 +544,10 @@ We can work around that problem by using the handy `pick()` function, which allows you to use tidy-selection inside data-masking functions:
 count_missing <- function(df, group_vars, x_var) {
   df |>
     group_by(pick({{ group_vars }})) |>
-    summarize(n_miss = sum(is.na({{ x_var }})))
+    summarize(
+      n_miss = sum(is.na({{ x_var }})),
+      .groups = "drop"
+    )
 }

 flights |>
@@ -605,7 +611,7 @@ While our examples have mostly focused on dplyr, tidy evaluation also underpins ggplot2

 ```{r}
 #| eval: false
-weather |> standardise_time(sched_dep_time)
+weather |> standardize_time(sched_dep_time)
 ```

 2.  For each of the following functions list all arguments that use tidy evaluation and describe whether they use data-masking or tidy-selection: `distinct()`, `count()`, `group_by()`, `rename_with()`, `slice_min()`, `slice_sample()`.
@@ -697,9 +703,9 @@ hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
 diamonds |> hex_plot(carat, price, depth)
 ```

-### Combining with dplyr
+### Combining with other tidyverse

-Some of the most useful helpers combine a dash of dplyr with ggplot2.
+Some of the most useful helpers combine a dash of data manipulation with ggplot2.
 For example, you might want to draw a vertical bar chart where you automatically sort the bars in frequency order using `fct_infreq()`.
 Since the bar chart is vertical, we also need to reverse the usual order to get the highest values at the top:
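A helper along those lines might look like this (editorial sketch, assuming ggplot2 and forcats are attached; `:=` is required because the embraced variable appears on the left-hand side of the assignment):

```r
sorted_bars <- function(df, var) {
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>  # most frequent level first
    ggplot(aes(y = {{ var }})) +                            # highest values at the top
    geom_bar()
}

diamonds |> sorted_bars(clarity)
```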
@@ -839,7 +845,7 @@ This makes it very obvious that something unusual is happening.

 ```{r}
 f1 <- function(string, prefix) {
-  substr(string, 1, nchar(prefix)) == prefix
+  str_sub(string, 1, str_length(prefix)) == prefix
 }

 f3 <- function(x, y) {
@@ -851,6 +857,7 @@ This makes it very obvious that something unusual is happening.

 3.  Make a case for why `norm_r()`, `norm_d()` etc. would be better than `rnorm()`, `dnorm()`.
     Make a case for the opposite.
+    How could you make the names even clearer?

 ## Summary
@@ -144,7 +144,7 @@ Let's motivate this problem with a simple example: what happens if we have some missing values?

 ```{r}
 rnorm_na <- function(n, n_na, mean = 0, sd = 1) {
-  sample(c(rnorm(n - n_na, mean = mean, sd = 1), rep(NA, n_na)))
+  sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na)))
 }
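# Editorial note (not part of this diff): before the fix above, `sd` was
# silently ignored, so e.g. rnorm_na(5, 2, mean = 10, sd = 100) always drew
# with sd = 1; it now varies on the requested scale.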

 df_miss <- tibble(
@@ -397,22 +397,21 @@ If needed, you could `pivot_wider()` this back to the original form.

 ### Exercises

-1.  Compute the number of unique values in each column of `palmerpenguins::penguins`.
+1.  Practice your `across()` skills by:

-2.  Compute the mean of every column in `mtcars`.
+    1.  Computing the number of unique values in each column of `palmerpenguins::penguins`.

-3.  Group `diamonds` by `cut`, `clarity`, and `color` then count the number of observations and the mean of each numeric column.
+    2.  Computing the mean of every column in `mtcars`.

-4.  What happens if you use a list of functions, but don't name them?
+    3.  Grouping `diamonds` by `cut`, `clarity`, and `color` then counting the number of observations and computing the mean of each numeric column.
+
+2.  What happens if you use a list of functions in `across()`, but don't name them?
     How is the output named?

-5.  It is possible to use `across()` inside `filter()` where it's equivalent to `if_all()`.
-    Can you explain why?
-
-6.  Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
+3.  Adjust `expand_dates()` to automatically remove the date columns after they've been expanded.
     Do you need to embrace any arguments?

-7.  Explain what each step of the pipeline in this function does.
+4.  Explain what each step of the pipeline in this function does.
     What special feature of `where()` are we taking advantage of?

 ```{r}
@@ -656,6 +655,7 @@ write_csv(gapminder, "gapminder.csv")
 ```

 Now when you come back to this problem in the future, you can read in a single csv file.
+For larger and richer datasets, using parquet might be a better choice than `.csv`, as discussed in @sec-parquet.

 ```{r}
 #| include: false
@@ -733,7 +733,9 @@ files <- paths |>
 ```

 Then a very useful strategy is to capture the structure of the data frames so that you can explore it using your data science skills.
-One way to do so is with this handy `df_types` function that returns a tibble with one row for each column:
+One way to do so is with this handy `df_types` function[^iteration-6] that returns a tibble with one row for each column:
+
+[^iteration-6]: We're not going to explain how it works, but if you look at the docs for the functions used, you should be able to puzzle it out.

 ```{r}
 df_types <- function(df) {
@@ -744,7 +746,7 @@ df_types <- function(df) {
   )
 }

-df_types(starwars)
+df_types(gapminder)
 ```

 You can then apply this function to all of the files, and maybe do some pivoting to make it easier to see where the differences are.
@@ -952,9 +954,9 @@ carat_histogram <- function(df) {
 carat_histogram(by_clarity$data[[1]])
 ```

-Now we can use `map()` to create a list of many plots[^iteration-6] and their eventual file paths:
+Now we can use `map()` to create a list of many plots[^iteration-7] and their eventual file paths:

-[^iteration-6]: You can print `by_clarity$plot` to get a crude animation --- you'll get one plot for each element of `plots`.
+[^iteration-7]: You can print `by_clarity$plot` to get a crude animation --- you'll get one plot for each element of `plots`.
+    NOTE: this didn't happen for me.

 ```{r}
joins.qmd (32 changed lines)
@@ -200,8 +200,7 @@ Surrogate keys can be particularly useful when communicating to other humans: it's
 ## Basic joins {#sec-mutating-joins}

 Now that you understand how data frames are connected via keys, we can start using joins to better understand the `flights` dataset.
-dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `semi_join()`, `anti_join(), and full_join()`.
-They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame.
+dplyr provides six join functions: `left_join()`, `inner_join()`, `right_join()`, `full_join()`, `semi_join()`, and `anti_join()`. They all have the same interface: they take a pair of data frames (`x` and `y`) and return a data frame.
 The order of the rows and columns in the output is primarily determined by `x`.

 In this section, you'll learn how to use one mutating join, `left_join()`, and two filtering joins, `semi_join()` and `anti_join()`.
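To make the shared interface concrete, here's a hypothetical toy example (editorial aside, assuming the tidyverse is attached):

```r
df1 <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
df2 <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Same interface for all six joins: a pair of data frames in, a data frame out
df1 |> left_join(df2, join_by(key))  # keeps all rows of df1; val_y is NA for key 3
```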
@@ -305,6 +304,10 @@ In older code you might see a different way of specifying the join keys, using a character vector:

 Now that it exists, we prefer `join_by()` since it provides a clearer and more flexible specification.

+`inner_join()`, `right_join()`, and `full_join()` have the same interface as `left_join()`.
+The difference is which rows they keep: the left join keeps all rows in `x`, the right join keeps all rows in `y`, the full join keeps all rows in either `x` or `y`, and the inner join keeps only rows that occur in both `x` and `y`.
+We'll come back to these in more detail later.
+
 ### Filtering joins

 As you might guess, the primary action of a **filtering join** is to filter the rows.
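For instance (editorial sketch, assuming nycflights13 is loaded), the two filtering joins keep or drop rows of `x` without adding any columns:

```r
flights |> semi_join(airports, join_by(dest == faa))  # flights whose dest is a known airport
flights |> anti_join(airports, join_by(dest == faa))  # flights whose dest is not
```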
@@ -464,9 +467,6 @@ knitr::include_graphics("diagrams/join/setup2.png", dpi = 270)

 In an actual join, matches will be indicated with dots, as in @fig-join-inner.
 The number of dots equals the number of matches, which in turn equals the number of rows in the output, a new data frame that contains the key, the x values, and the y values.
-The join shown here is a so-called **equi** **inner join**, where rows match if the keys are equal, so that the output contains only the rows with keys that appear in both `x` and `y`.
-Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just call it an inner join.
-We'll come back to non-equi joins in @sec-non-equi-joins.

 ```{r}
 #| label: fig-join-inner
@@ -572,6 +572,10 @@ However, this is not a great representation because while it might jog your memory
 knitr::include_graphics("diagrams/join/venn.png", dpi = 270)
 ```

+The joins shown here are the so-called **equi joins**, where rows match if the keys are equal.
+Equi-joins are the most common type of join, so we'll typically omit the equi prefix, and just say "inner join" rather than "equi inner join".
+We'll come back to non-equi joins in @sec-non-equi-joins.
+
 ### Row matching

 So far we've explored what happens if a row in `x` matches zero or one rows in `y`.
@@ -620,8 +624,6 @@ df1 |>
   inner_join(df2, join_by(key))
 ```

-This is one reason we like `left_join()` --- if it runs without warning, you know that each row of the output matches the row in the same position in `x`.
-
 You can gain further control over row matching with two arguments:

 -   `unmatched` controls what happens when a row in `x` fails to match any rows in `y`. It defaults to `"drop"` which will silently drop any unmatched rows.
@@ -850,7 +852,7 @@ That leads to the following party days:
 ```{r}
 parties <- tibble(
   q = 1:4,
-  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
+  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03"))
 )
 ```
@@ -859,7 +861,7 @@ Now imagine that you have a table of employee birthdays:
 ```{r}
 employees <- tibble(
   name = sample(babynames::babynames$name, 100),
-  birthday = lubridate::ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
+  birthday = ymd("2022-01-01") + (sample(365, 100, replace = TRUE) - 1)
 )
 employees
 ```
@@ -896,9 +898,9 @@ So it might be better to be explicit about the date ranges that each party spans:
 ```{r}
 parties <- tibble(
   q = 1:4,
-  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
-  start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
-  end = lubridate::ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
+  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
+  start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
+  end = ymd(c("2022-04-03", "2022-07-11", "2022-10-02", "2022-12-31"))
 )
 parties
 ```
@@ -917,9 +919,9 @@ Ooops, there is an overlap, so let's fix that problem and continue:
 ```{r}
 parties <- tibble(
   q = 1:4,
-  party = lubridate::ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
-  start = lubridate::ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
-  end = lubridate::ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
+  party = ymd(c("2022-01-10", "2022-04-04", "2022-07-11", "2022-10-03")),
+  start = ymd(c("2022-01-01", "2022-04-04", "2022-07-11", "2022-10-03")),
+  end = ymd(c("2022-04-03", "2022-07-10", "2022-10-02", "2022-12-31"))
 )
 ```
@@ -544,7 +544,7 @@ if_else(TRUE, "a", 1)

 case_when(
   x < -1 ~ TRUE,
-  x > 0 ~ lubridate::now()
+  x > 0 ~ now()
 )
 ```
@@ -71,7 +71,7 @@ coalesce(x, 0)
 Sometimes you'll hit the opposite problem where some concrete value actually represents a missing value.
 This typically arises in data generated by older software that doesn't have a proper way to represent missing values, so it must instead use some special value like 99 or -999.

-If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`.
+If possible, handle this when reading in the data, for example, by using the `na` argument to `readr::read_csv()`, e.g. `read_csv(path, na = "99")`.
 If you discover the problem later, or your data source doesn't provide a way to handle it on read, you can use `dplyr::na_if()`:

 ```{r}
@@ -206,7 +206,7 @@ For example, imagine we have a dataset that contains some health information about a group of people:
 health <- tibble(
   name = c("Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"),
   smoker = factor(c("no", "no", "no", "no", "no"), levels = c("yes", "no")),
-  age = c(34L, 88L, 75L, 47L, 56L),
+  age = c(34, 88, 75, 47, 56),
 )
 ```
@@ -248,6 +248,7 @@ The same problem comes up more generally with `dplyr::group_by()`.
 And again you can use `.drop = FALSE` to preserve all factor levels:

 ```{r}
+#| warning: false
 health |>
   group_by(smoker, .drop = FALSE) |>
   summarize(
@@ -36,7 +36,7 @@ In the following three chapters, you'll learn skills to improve your programming:

 1.  Copy-and-paste is a powerful tool, but you should avoid doing it more than twice.
     Repeating yourself in code is dangerous because it can easily lead to errors and inconsistencies.
-    Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated code so that it can be easily reused.
+    Instead, in @sec-functions, you'll learn how to write **functions** which let you extract out repeated tidyverse code so that it can be easily reused.

 2.  Functions extract out repeated code, but you often need to repeat the same actions on different inputs.
     You need tools for **iteration** that let you do similar things again and again.