Reduce numbers content
		@@ -27,7 +27,6 @@ Imports:
 | 
				
			|||||||
    openxlsx,
 | 
					    openxlsx,
 | 
				
			||||||
    palmerpenguins,
 | 
					    palmerpenguins,
 | 
				
			||||||
    repurrrsive (>= 1.1.0),
 | 
					    repurrrsive (>= 1.1.0),
 | 
				
			||||||
    slider,
 | 
					 | 
				
			||||||
    tidymodels,
 | 
					    tidymodels,
 | 
				
			||||||
    tidyverse,
 | 
					    tidyverse,
 | 
				
			||||||
    writexl
 | 
					    writexl
 | 
				
			||||||
 
 | 
				
			|||||||
							
								
								
									
numbers.qmd
							@@ -262,43 +262,10 @@ flights |>
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
### Logarithms
 | 
					### Logarithms
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude.
 | 
					Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude and convert exponential growth to linear growth.
 | 
				
			||||||
They also convert exponential growth to linear growth.
 | 
					In R, you have a choice of three logarithms: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
 | 
				
			||||||
For example, take compounding interest --- the amount of money you have at `year + 1` is the amount of money you had at `year` multiplied by the interest rate.
 | 
					 | 
				
			||||||
That gives a formula like `money = starting * interest ^ year`:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
starting <- 100
 | 
					 | 
				
			||||||
interest <- 1.05
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
money <- tibble(
 | 
					 | 
				
			||||||
  year = 1:50,
 | 
					 | 
				
			||||||
  money = starting * interest ^ year
 | 
					 | 
				
			||||||
)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
If you plot this data, you'll get an exponential curve showing how your money grows year by year at an interest rate of 1.05:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
ggplot(money, aes(x = year, y = money)) +
 | 
					 | 
				
			||||||
  geom_line()
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
Log transforming the y-axis gives a straight line:
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
ggplot(money, aes(x = year, y = money)) +
 | 
					 | 
				
			||||||
  geom_line() + 
 | 
					 | 
				
			||||||
  scale_y_log10()
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
This is a straight line because a little algebra reveals that `log10(money) = log10(interest) * year + log10(starting)`, which matches the pattern for a line, `y = m * x + b`.
 | 
					 | 
				
			||||||
This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
If you're log-transforming your data with dplyr you have a choice of three logarithms provided by base R: `log()` (the natural log, base e), `log2()` (base 2), and `log10()` (base 10).
 | 
					 | 
				
			||||||
We recommend using `log2()` or `log10()`.
 | 
					We recommend using `log2()` or `log10()`.
 | 
				
			||||||
`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
 | 
					`log2()` is easy to interpret because a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving; whereas `log10()` is easy to back-transform because (e.g.) 3 is 10\^3 = 1000.
 | 
				
			||||||
 | 
					 | 
				
			||||||
The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
 | 
					The inverse of `log()` is `exp()`; to compute the inverse of `log2()` or `log10()` you'll need to use `2^` or `10^`.
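
The relationships described above can be checked with a minimal sketch. The vector `x` and the intermediate names are illustrative assumptions, not part of the chapter; only base R functions are used.

```r
# Hypothetical values spanning several orders of magnitude
x <- c(1, 10, 100, 1000)

# log10() is easy to back-transform: 10^ inverts it
log10_x <- log10(x)
back_10 <- 10^log10_x

# log2(): a difference of 1 on the log scale is a doubling on the original scale
log2_step <- log2(8) - log2(4)

# log() is the natural log; exp() inverts it, and 2^ inverts log2()
back_e <- exp(log(x))
back_2 <- 2^log2(x)

all.equal(back_10, x)
all.equal(back_e, x)
all.equal(back_2, x)
```

`all.equal()` is used rather than `==` because floating point round trips may differ from the input by a tiny amount.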
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Rounding {#sec-rounding}
 | 
					### Rounding {#sec-rounding}
 | 
				
			||||||
@@ -406,20 +373,6 @@ cumsum(x)
 | 
				
			|||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
If you need more complex rolling or sliding aggregates, try the [slider](https://davisvaughan.github.io/slider/) package by Davis Vaughan.
 | 
					If you need more complex rolling or sliding aggregates, try the [slider](https://davisvaughan.github.io/slider/) package by Davis Vaughan.
 | 
				
			||||||
The following example illustrates some of its features.
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
```{r}
 | 
					 | 
				
			||||||
library(slider)
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
# Same as a cumulative sum
 | 
					 | 
				
			||||||
slide_vec(x, sum, .before = Inf)
 | 
					 | 
				
			||||||
# Sum the current element and the one before it
 | 
					 | 
				
			||||||
slide_vec(x, sum, .before = 1)
 | 
					 | 
				
			||||||
# Sum the current element and the two before and after it
 | 
					 | 
				
			||||||
slide_vec(x, sum, .before = 2, .after = 2)
 | 
					 | 
				
			||||||
# Only compute if the window is complete
 | 
					 | 
				
			||||||
slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
 | 
					 | 
				
			||||||
```
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Exercises
 | 
					### Exercises
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -430,9 +383,10 @@ slide_vec(x, sum, .before = 2, .after = 2, .complete = TRUE)
 | 
				
			|||||||
    Do they use degrees or radians?
 | 
					    Do they use degrees or radians?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
3.  Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they're not really continuous numbers.
 | 
					3.  Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they're not really continuous numbers.
 | 
				
			||||||
    You can see the basic problem in this plot: there's a gap between each hour.
 | 
					    You can see the basic problem by running the code below: there's a gap between each hour.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
    ```{r}
 | 
					    ```{r}
 | 
				
			||||||
 | 
					    #| eval: false
 | 
				
			||||||
    flights |> 
 | 
					    flights |> 
 | 
				
			||||||
      filter(month == 1, day == 1) |> 
 | 
					      filter(month == 1, day == 1) |> 
 | 
				
			||||||
      ggplot(aes(x = sched_dep_time, y = dep_delay)) +
 | 
					      ggplot(aes(x = sched_dep_time, y = dep_delay)) +
 | 
				
			||||||
@@ -625,7 +579,7 @@ An alternative is to use the `median()`, which finds a value that lies in the "m
 | 
				
			|||||||
Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center.
 | 
					Depending on the shape of the distribution of the variable you're interested in, mean or median might be a better measure of center.
 | 
				
			||||||
For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
 | 
					For example, for symmetric distributions we generally report the mean while for skewed distributions we usually report the median.
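
As a small illustration of why skew matters, here is a sketch using a made-up vector of delays (the values are hypothetical and are not taken from `flights`). A single very late departure pulls the mean up sharply while barely moving the median.

```r
# Hypothetical departure delays in minutes, with one extreme value
delays <- c(-5, -3, -2, 0, 1, 4, 10, 240)

mean_delay <- mean(delays)
median_delay <- median(delays)

mean_delay    # 30.625
median_delay  # 0.5
```

Dropping the 240-minute delay would change the mean substantially but leave the median close to where it was, which is the sense in which the median is the more robust summary for skewed data.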
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@fig-mean-vs-median compares the mean vs. the median when looking at the hourly vs. median departure delay.
 | 
					@fig-mean-vs-median compares the mean vs. the median departure delay for each destination.
 | 
				
			||||||
The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.
 | 
					The median delay is always smaller than the mean delay because flights sometimes leave multiple hours late, but never leave multiple hours early.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
@@ -709,12 +663,13 @@ The distribution is so skewed that we have to zoom in to see the bulk of the dat
 | 
				
			|||||||
This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.
 | 
					This suggests that the mean is unlikely to be a good summary and we might prefer the median instead.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| echo: false
 | 
				
			||||||
#| label: fig-flights-dist
 | 
					#| label: fig-flights-dist
 | 
				
			||||||
#| fig-cap: >
 | 
					#| fig-cap: >
 | 
				
			||||||
#|   The distribution of `dep_delay` appears highly skewed to the right in 
 | 
					#|   (Left) The histogram of the full data is extremely skewed making it
 | 
				
			||||||
#|   both histograms.
 | 
					#|   hard to get any details. (Right) Zooming into delays of less than two
 | 
				
			||||||
#| fig-subcap: ["Histogram shows the full range of delays.", 
 | 
					#|   hours makes it possible to see what's happening with the bulk of the
 | 
				
			||||||
#|   "Histogram is zoomed in to show delays less than 2 hours."]
 | 
					#|   observations.
 | 
				
			||||||
#| fig-alt: >
 | 
					#| fig-alt: >
 | 
				
			||||||
#|   Two histograms of `dep_delay`. On the left, it's very hard to see
 | 
					#|   Two histograms of `dep_delay`. On the left, it's very hard to see
 | 
				
			||||||
#|   any pattern except that there's a very large spike around zero, the
 | 
					#|   any pattern except that there's a very large spike around zero, the
 | 
				
			||||||
@@ -724,18 +679,19 @@ This suggests that the mean is unlikely to be a good summary and we might prefer
 | 
				
			|||||||
#|   see that the spike occurs slightly below zero (i.e. most flights
 | 
					#|   see that the spike occurs slightly below zero (i.e. most flights
 | 
				
			||||||
#|   leave a couple of minutes early), but there's still a very steep
 | 
					#|   leave a couple of minutes early), but there's still a very steep
 | 
				
			||||||
#|   decay after that.
 | 
					#|   decay after that.
 | 
				
			||||||
#| layout-ncol: 2
 | 
					#| fig-asp: 0.5
 | 
				
			||||||
#| fig-width: 4
 | 
					library(patchwork)
 | 
				
			||||||
#| fig-height: 2
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
flights |>
 | 
					full <- flights |>
 | 
				
			||||||
  ggplot(aes(x = dep_delay)) + 
 | 
					  ggplot(aes(x = dep_delay)) + 
 | 
				
			||||||
  geom_histogram(binwidth = 15)
 | 
					  geom_histogram(binwidth = 15, na.rm = TRUE)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
flights |>
 | 
					delayed120 <- flights |>
 | 
				
			||||||
  filter(dep_delay < 120) |> 
 | 
					  filter(dep_delay < 120) |> 
 | 
				
			||||||
  ggplot(aes(x = dep_delay)) + 
 | 
					  ggplot(aes(x = dep_delay)) + 
 | 
				
			||||||
  geom_histogram(binwidth = 5)
 | 
					  geom_histogram(binwidth = 5)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					full + delayed120
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
It's also a good idea to check that distributions for subgroups resemble the whole.
 | 
					It's also a good idea to check that distributions for subgroups resemble the whole.
 | 
				
			||||||
 
 | 
				
			|||||||