commit bf474ecffc
parent 2069938079

numbers.qmd
@@ -33,7 +33,7 @@ library(nycflights13)
 ## Making numbers
 
 In most cases, you'll get numbers already recorded in one of R's numeric types: integer or double.
-In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or something has gone wrong in your data import process.
+In some cases, however, you'll encounter them as strings, possibly because you've created them by pivoting from column headers or because something has gone wrong in your data import process.
 
 readr provides two useful functions for parsing strings into numbers: `parse_double()` and `parse_number()`.
 Use `parse_double()` when you have numbers that have been written as strings:
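As a quick aside (a sketch for context, not taken from the patch): `parse_double()` handles numbers written as plain strings, while `parse_number()` ignores surrounding text and grouping marks.

```r
library(readr)

# Numbers written as plain strings
parse_double(c("1.2", "5.6", "1e3"))

# Numbers surrounded by currency symbols, units, or grouping marks
parse_number(c("$1,234", "USD 3,513", "59%"))
```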
@@ -62,7 +62,7 @@ flights |> count(dest)
 
 (Despite the advice in @sec-workflow-style, we usually put `count()` on a single line because it's usually used at the console for a quick check that a calculation is working as expected.)
 
-If you want to see the most common values add `sort = TRUE`:
+If you want to see the most common values, add `sort = TRUE`:
 
 ```{r}
 flights |> count(dest, sort = TRUE)
@@ -225,7 +225,7 @@ In R, `%/%` does integer division and `%%` computes the remainder:
 1:10 %% 3
 ```
 
-Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into and `hour` and `minute`:
+Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into `hour` and `minute`:
 
 ```{r}
 flights |>
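The chunk is cut off here; a minimal sketch of the idea, assuming `sched_dep_time` is stored in the `HHMM` form used by nycflights13 (e.g. `517` is 5:17 am):

```r
library(dplyr)
library(nycflights13)

flights |>
  mutate(
    hour = sched_dep_time %/% 100,  # integer division drops the minutes
    minute = sched_dep_time %% 100, # remainder keeps only the minutes
    .keep = "used"
  )
```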
@@ -273,12 +273,12 @@ starting <- 100
 interest <- 1.05
 
 money <- tibble(
-  year = 2000 + 1:50,
-  money = starting * interest^(1:50)
+  year = 1:50,
+  money = starting * interest ^ year
 )
 ```
 
-If you plot this data, you'll get an exponential curve:
+If you plot this data, you'll get an exponential curve showing how your money grows year by year at an interest rate of 1.05:
 
 ```{r}
 ggplot(money, aes(year, money)) +
@@ -293,12 +293,12 @@ ggplot(money, aes(year, money)) +
   scale_y_log10()
 ```
 
-This a straight line because a little algebra reveals that `log(money) = log(starting) + n * log(interest)`, which matches the pattern for a line, `y = m * x + b`.
+This a straight line because a little algebra reveals that `log10(money) = log10(interest) * year + log10(starting)`, which matches the pattern for a line, `y = m * x + b`.
 This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.
 
 If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
 We recommend using `log2()` or `log10()`.
-`log2()` is easy to interpret because difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g) 3 is 10\^3 = 1000.
+`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
 
 The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
 
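As a sanity check (not part of the patch), the rewritten identity and the inverse relationships can be verified numerically:

```r
starting <- 100
interest <- 1.05
year <- 1:50

# log10(money) equals log10(interest) * year + log10(starting)
all.equal(
  log10(starting * interest^year),
  log10(interest) * year + log10(starting)
)

# Inverses: exp() undoes log(), 2^ undoes log2(), 10^ undoes log10()
exp(log(10))
2^log2(10)
10^log10(10)
```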
@@ -339,7 +339,7 @@ floor(x)
 ceiling(x)
 ```
 
-These functions don't have a digits argument, so you can instead scale down, round, and then scale back up:
+These functions don't have a `digits` argument, so you can instead scale down, round, and then scale back up:
 
 ```{r}
 # Round down to nearest two digits
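The chunk is truncated here; a sketch of the scale-round-scale idiom it introduces (the value of `x` is an assumption for illustration):

```r
x <- 123.456

# Round down to two decimal places: scale down, floor, scale back up
floor(x / 0.01) * 0.01

# Round up to two decimal places
ceiling(x / 0.01) * 0.01
```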
@@ -583,8 +583,8 @@ df |>
 
 3.  What time of day should you fly if you want to avoid delays as much as possible?
 
-4.  What does `flights |> group_by(dest() |> filter(row_number() < 4)` do?
-    What does `flights |> group_by(dest() |> filter(row_number(dep_delay) < 4)` do?
+4.  What does `flights |> group_by(dest) |> filter(row_number() < 4)` do?
+    What does `flights |> group_by(dest) |> filter(row_number(dep_delay) < 4)` do?
 
 5.  For each destination, compute the total minutes of delay.
     For each flight, compute the proportion of the total delay for its destination.
@@ -607,8 +607,7 @@ df |>
 ```
 
 7.  Look at each destination.
-    Can you find flights that are suspiciously fast?
-    (i.e. flights that represent a potential data entry error).
+    Can you find flights that are suspiciously fast (i.e. flights that represent a potential data entry error)?
     Compute the air time of a flight relative to the shortest flight to that destination.
     Which flights were most delayed in the air?
 
@@ -618,7 +617,7 @@ df |>
 ## Numeric summaries
 
 Just using the counts, means, and sums that we've introduced already can get you a long way, but R provides many other useful summary functions.
-Here are a selection that you might find useful.
+Here is a selection that you might find useful.
 
 ### Center
 
@@ -629,7 +628,7 @@ Depending on the shape of the distribution of the variable you're interested in,
 For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
 
 @fig-mean-vs-median compares the mean vs the median when looking at the hourly vs median departure delay.
-The median delay is always smaller than the mean delay because because flights sometimes leave multiple hours late, but never leave multiple hours early.
+The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.
 
 ```{r}
 #| label: fig-mean-vs-median
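The figure chunk is cut off above; one plausible way to compute the kind of summary it describes (grouping by scheduled hour is an assumption here, not the book's exact code):

```r
library(dplyr)
library(nycflights13)

flights |>
  group_by(hour) |>
  summarize(
    mean = mean(dep_delay, na.rm = TRUE),
    median = median(dep_delay, na.rm = TRUE),
    n = n()
  )
```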
@@ -666,7 +665,7 @@ For these reasons, the mode tends not to be used by statisticians and there's no
 
 What if you're interested in locations other than the center?
 `min()` and `max()` will give you the largest and smallest values.
-Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find a value that's greater than 95% of the values.
+Another powerful tool is `quantile()` which is a generalization of the median: `quantile(x, 0.25)` will find the value of `x` that is greater than 25% of the values, `quantile(x, 0.5)` is equivalent to the median, and `quantile(x, 0.95)` will find the value that's greater than 95% of the values.
 
 For the `flights` data, you might want to look at the 95% quantile of delays rather than the maximum, because it will ignore the 5% of most delayed flights which can be quite extreme.
 
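A small illustration of `quantile()` and of the 95% quantile idea (a sketch; the per-destination grouping is an assumption, not from the patch):

```r
library(dplyr)
library(nycflights13)

x <- c(1, 2, 5, 8, 10)
quantile(x, 0.25)
quantile(x, 0.5)   # same as median(x)
quantile(x, 0.95)

# 95% quantile of departure delay per destination: less sensitive to
# a handful of extreme delays than max()
flights |>
  group_by(dest) |>
  summarize(
    delay_q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    n = n()
  )
```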
@@ -767,7 +766,7 @@ Finally, don't forget what you learned in @sec-sample-size: whenever creating nu
 
 ### Positions
 
-There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at specific position.
+There's one final type of summary that's useful for numeric vectors, but also works with every other type of value: extracting a value at a specific position.
 You can do this with the base R `[` function, but we're not going to cover it in detail until @sec-subset-many, because it's a very powerful and general function.
 For now we'll introduce three specialized functions that you can use to extract values at a specified position: `first(x)`, `last(x)`, and `nth(x, n)`.
 
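A short sketch of the three helpers named above (the daily grouping is an assumption for illustration, not from the patch):

```r
library(dplyr)
library(nycflights13)

flights |>
  group_by(year, month, day) |>
  summarize(
    first_dep = first(dep_time),   # first departure of the day
    fifth_dep = nth(dep_time, 5),  # fifth departure of the day
    last_dep = last(dep_time),     # last departure of the day
    .groups = "drop"
  )
```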