Changes from @mine-cetinkaya-rundel
parent fb8f3e5884, commit 9cf3badbf0
EDA.Rmd | 36
@@ -93,7 +93,8 @@ ggplot(data = diamonds) +
 The height of the bars displays how many observations occurred with each x value. You can compute these values manually with `dplyr::count()`:
 
 ```{r}
-diamonds %>% count(cut)
+diamonds %>%
+  count(cut)
 ```
 
 A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, use a histogram:
@@ -106,7 +107,8 @@ ggplot(data = diamonds) +
 You can compute this by hand by combining `dplyr::count()` and `ggplot2::cut_width()`:
 
 ```{r}
-diamonds %>% count(cut_width(carat, 0.5))
+diamonds %>%
+  count(cut_width(carat, 0.5))
 ```
 
 A histogram divides the x axis into equally spaced bins and then uses the height of each bar to display the number of observations that fall in each bin. In the graph above, the tallest bar shows that almost 30,000 observations have a `carat` value between 0.25 and 0.75, which are the left and right edges of the bar.
@@ -114,7 +116,8 @@ A histogram divides the x axis into equally spaced bins and then uses the height
 You can set the width of the intervals in a histogram with the `binwidth` argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns. For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.
 
 ```{r}
-smaller <- diamonds %>% filter(carat < 3)
+smaller <- diamonds %>%
+  filter(carat < 3)
 
 ggplot(data = smaller, mapping = aes(x = carat)) +
   geom_histogram(binwidth = 0.1)
@@ -123,10 +126,12 @@ ggplot(data = smaller, mapping = aes(x = carat)) +
 If you wish to overlay multiple histograms in the same plot, I recommend using `geom_freqpoly()` instead of `geom_histogram()`. `geom_freqpoly()` performs the same calculation as `geom_histogram()`, but instead of displaying the counts with bars, uses lines instead. It's much easier to understand overlapping lines than bars.
 
 ```{r}
-ggplot(data = smaller, mapping = aes(x = carat)) +
+ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
   geom_freqpoly(binwidth = 0.1)
 ```
 
+There are a few challenges with this type of plot, which we'll come back to in [visualising a categorical and a continuous variable](#cat-cont).
+
 Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? I've put together a list below of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your **curiosity** (What do you want to learn more about?) as well as your **skepticism** (How could this be misleading?).
 
 ### Typical values
@@ -202,7 +207,8 @@ unusual
 
 The `y` variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can't have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
 
 When you discover an outlier, it's a good idea to trace it back as far as possible. You'll be in a much stronger analytical position if you can figure out why it happened. If you can't figure it out, and want to just move on with your analysis, replace it with a missing value, which we'll discuss in the next section.
+It's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to replace them with missing values, and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
 
 ### Exercises
 
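For reference, replacing the implausible `y` values with missing values, the approach discussed in the next section, might look like this minimal sketch using `mutate()` and `ifelse()`:

```{r, eval = FALSE}
# Treat widths below 3mm or above 20mm as missing rather than dropping whole rows
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```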
@@ -227,8 +233,9 @@ If you've encountered unusual values in your dataset, and simply want to move on
 
 1. Drop the entire row with the strange values:
 
-    ```{r}
-    diamonds2 <- diamonds %>% filter(between(y, 3, 20))
+    ```{r, eval = FALSE}
+    diamonds2 <- diamonds %>%
+      filter(between(y, 3, 20))
     ```
 
     I don't recommend this option because just because one measurement
@@ -289,7 +296,7 @@ However this plot isn't great because there are many more non-cancelled flights
 
 If variation describes the behavior _within_ a variable, covariation describes the behavior _between_ variables. **Covariation** is the tendency for the values of two or more variables to vary together in a related way. The best way to spot covariation is to visualise the relationship between two or more variables. How you do that should again depend on the type of variables involved.
 
-### A categorical and continuous variable
+### A categorical and continuous variable {#cat-cont}
 
 It's common to want to explore the distribution of a continuous variable broken down by a categorical variable, as in the previous frequency polygon. The default appearance of `geom_freqpoly()` is not that useful for that sort of comparison because the height is given by the count. That means if one of the groups is much smaller than the others, it's hard to see the differences in shape. For example, let's explore how the price of a diamond varies with its quality:
 
@@ -343,14 +350,16 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
 
 We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot). It supports the counterintuitive finding that better quality diamonds are cheaper on average! In the exercises, you'll be challenged to figure out why.
 
-`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Most factors are unordered, so it's fair game to reorder to display the results better. For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
+`cut` is an ordered factor: fair is worse than good, which is worse than very good and so on. Many categorical variables don't have an intrinsic order, so you might want to reorder them to make a more informative display. One way to do that is with the `reorder()` function.
+
+For example, take the `class` variable in the `mpg` dataset. You might be interested to know how highway mileage varies across classes:
 
 ```{r}
 ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
   geom_boxplot()
 ```
 
-Covariation will appear as a systematic change in the medians or IQRs of the boxplots. To make the trend easier to see, reorder `x` variable with `reorder()`. This code reorders the `class` based on the median value of `hwy` in each group.
+To make the trend easier to see, we can reorder `class` based on the median value of `hwy`:
 
 ```{r fig.height = 3}
 ggplot(data = mpg) +
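The reordering call that the text refers to would look roughly like this (a sketch using base R's `reorder()` with its `FUN` argument):

```{r, eval = FALSE}
# Reorder class by the median highway mileage of each group before boxplotting
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
```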
@@ -410,7 +419,8 @@ The size of each circle in the plot displays how many observations occurred at e
 Another approach is to compute the count with dplyr:
 
 ```{r}
-diamonds %>% count(color, cut)
+diamonds %>%
+  count(color, cut)
 ```
 
 Then visualise with `geom_tile()` and the fill aesthetic:
@@ -419,7 +429,7 @@ Then visualise with `geom_tile()` and the fill aesthetic:
 diamonds %>%
   count(color, cut) %>%
   ggplot(mapping = aes(x = color, y = cut)) +
-    geom_tile(aes(fill = n))
+  geom_tile(aes(fill = n))
 ```
 
 If the categorical variables are unordered, you might want to use the seriation package to simultaneously reorder the rows and columns in order to more clearly reveal interesting patterns. For larger plots, you might want to try the d3heatmap or heatmaply packages, which create interactive plots.
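A rough sketch of the seriation idea (assuming the seriation package is installed; the method chosen here is just one reasonable option for a non-negative count matrix):

```{r, eval = FALSE}
library(seriation)

# Build a colour x cut count matrix, find an ordering that places similar rows
# and columns next to each other, and permute the matrix accordingly
counts <- xtabs(n ~ color + cut, data = count(diamonds, color, cut))
ord <- seriate(counts, method = "BEA_TSP")
permute(counts, ord)
```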
@@ -580,7 +590,7 @@ Sometimes we'll turn the end of pipeline of data transformation into a plot. Wat
 diamonds %>%
   count(cut, clarity) %>%
   ggplot(aes(clarity, cut, fill = n)) +
-    geom_tile()
+  geom_tile()
 ```
 
 If you want to learn more about ggplot2, I'd highly recommend grabbing a copy of the ggplot2 book: <https://amzn.com/331924275X>. It's been recently updated, so includes dplyr and tidyr code, and has much more space to explore all the facets of visualisation. Unfortunately the book isn't generally available for free, but if you have a connection to a university you can probably get an electronic version for free through SpringerLink.
 
@@ -287,8 +287,10 @@ update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
 If values are too big, they will roll over:
 
 ```{r}
-ymd("2015-02-01") %>% update(mday = 30)
-ymd("2015-02-01") %>% update(hour = 400)
+ymd("2015-02-01") %>%
+  update(mday = 30)
+ymd("2015-02-01") %>%
+  update(hour = 400)
 ```
 
 You can use `update()` to show the distribution of flights across the course of the day for every day of the year:
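For intuition, the expected roll-over results would be along these lines (a sketch; the exact printed form depends on lubridate):

```{r, eval = FALSE}
ymd("2015-02-01") %>%
  update(mday = 30)   # rolls past the end of February, giving "2015-03-02"
ymd("2015-02-01") %>%
  update(hour = 400)  # 400 hours = 16 days and 16 hours, giving "2015-02-17 16:00:00 UTC"
```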
@@ -116,8 +116,10 @@ my_model <- function(df) {
 mod <- my_model(df)
 rmse(mod, df)
 
-grid <- df %>% expand(x = seq_range(x, 50))
-preds <- grid %>% add_predictions(mod, var = "y")
+grid <- df %>%
+  expand(x = seq_range(x, 50))
+preds <- grid %>%
+  add_predictions(mod, var = "y")
 
 df %>%
   ggplot(aes(x, y)) +
@@ -156,10 +158,11 @@ But do you think this model will do well if we apply it to new data from the sam
 In real life you can't easily go out and recollect your data. There are two approaches to help you get around this problem. I'll introduce them briefly here, and then we'll go into more depth in the following sections.
 
 ```{r}
-boot <- bootstrap(df, 100) %>% mutate(
-  mod = map(strap, my_model),
-  pred = map2(list(grid), mod, add_predictions)
-)
+boot <- bootstrap(df, 100) %>%
+  mutate(
+    mod = map(strap, my_model),
+    pred = map2(list(grid), mod, add_predictions)
+  )
 
 boot %>%
   unnest(pred) %>%
@@ -125,7 +125,8 @@ sim1_dist <- function(a1, a2) {
   measure_distance(c(a1, a2), sim1)
 }
 
-models <- models %>% mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
+models <- models %>%
+  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
 models
 ```
 
@@ -245,7 +246,8 @@ It's also useful to see what the model doesn't capture, the so called residuals
 To visualise the predictions from a model, we start by generating an evenly spaced grid of values that covers the region where our data lies. The easiest way to do that is to use `modelr::data_grid()`. Its first argument is a data frame, and for each subsequent argument it finds the unique variables and then generates all combinations:
 
 ```{r}
-grid <- sim1 %>% data_grid(x)
+grid <- sim1 %>%
+  data_grid(x)
 grid
 ```
 
@@ -254,7 +256,8 @@ grid
 Next we add predictions. We'll use `modelr::add_predictions()` which takes a data frame and a model. It adds the predictions from the model to a new column in the data frame:
 
 ```{r}
-grid <- grid %>% add_predictions(sim1_mod)
+grid <- grid %>%
+  add_predictions(sim1_mod)
 grid
 ```
 
@@ -275,7 +278,8 @@ The flip-side of predictions are __residuals__. The predictions tells you the pa
 We add residuals to the data with `add_residuals()`, which works much like `add_predictions()`. Note, however, that we use the original dataset, not a manufactured grid. This is because to compute residuals we need actual y values.
 
 ```{r}
-sim1 <- sim1 %>% add_residuals(sim1_mod)
+sim1 <- sim1 %>%
+  add_residuals(sim1_mod)
 sim1
 ```
 
@@ -392,7 +396,8 @@ ggplot(sim2, aes(x)) +
 You can't make predictions about levels that you didn't observe. Sometimes you'll do this by accident so it's good to recognise this error message:
 
 ```{r, error = TRUE}
-tibble(x = "e") %>% add_predictions(mod2)
+tibble(x = "e") %>%
+  add_predictions(mod2)
 ```
 
 ### Interactions (continuous and categorical)
 
@@ -222,7 +222,8 @@ ggplot(daily, aes(wday, n)) +
 Next we compute and visualise the residuals:
 
 ```{r}
-daily <- daily %>% add_residuals(mod)
+daily <- daily %>%
+  add_residuals(mod)
 daily %>%
   ggplot(aes(date, resid)) +
   geom_ref_line(h = 0) +
@@ -248,7 +249,8 @@ Note the change in the y-axis: now we are seeing the deviation from the expected
 1. There are some days with far fewer flights than expected:
 
     ```{r}
-    daily %>% filter(resid < -100)
+    daily %>%
+      filter(resid < -100)
     ```
 
     If you're familiar with American public holidays, you might spot New Year's
@@ -301,7 +303,8 @@ term <- function(date) {
   )
 }
 
-daily <- daily %>% mutate(term = term(date))
+daily <- daily %>%
+  mutate(term = term(date))
 
 daily %>%
   filter(wday == "Sat") %>%
@@ -367,10 +370,11 @@ If you're experimenting with many models and many visualisations, it's a good id
 
 ```{r}
 compute_vars <- function(data) {
-  data %>% mutate(
-    term = term(date),
-    wday = wday(date, label = TRUE)
-  )
+  data %>%
+    mutate(
+      term = term(date),
+      wday = wday(date, label = TRUE)
+    )
 }
 ```
 
@@ -413,7 +417,8 @@ How do you decide how many parameters to use for the spline? You can either eith
     How would these days generalise to another year?
 
     ```{r}
-    daily %>% filter(resid > 80)
+    daily %>%
+      filter(resid > 80)
    ```
 
 1. Create a new variable that splits the `wday` variable into terms, but only
@@ -156,8 +156,10 @@ by_country
 This has a big advantage: because all the related objects are stored together, you don't need to manually keep them in sync when you filter or arrange. The semantics of the data frame takes care of that for you:
 
 ```{r}
-by_country %>% filter(continent == "Europe")
-by_country %>% arrange(continent, country)
+by_country %>%
+  filter(continent == "Europe")
+by_country %>%
+  arrange(continent, country)
 ```
 
 If your list of data frames and list of models were separate objects, you have to remember that whenever you re-order or subset one vector, you need to re-order or subset all the others in order to keep them in sync. If you forget, your code will continue to work, but it will give the wrong answer!
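To see the hazard the text describes, imagine pulling the columns out into stand-alone objects (a sketch; not how the chapter recommends working):

```{r, eval = FALSE}
# Hypothetical parallel objects instead of one nested data frame
datasets <- by_country$data
models   <- by_country$model

datasets <- rev(datasets)  # reorder one of them...
# ...and `models` is now silently misaligned: models[[1]] no longer matches datasets[[1]]
```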
@@ -167,9 +169,10 @@ If your list of data frames and list of models were separate objects, you have t
 Previously we computed the residuals of a single model with a single dataset. Now we have 142 data frames and 142 models. To compute the residuals, we need to call `add_residuals()` with each model-data pair:
 
 ```{r}
-by_country <- by_country %>% mutate(
-  resids = map2(data, model, add_residuals)
-)
+by_country <- by_country %>%
+  mutate(
+    resids = map2(data, model, add_residuals)
+  )
 by_country
 ```
 
@@ -233,7 +236,8 @@ glance
 With this data frame in hand, we can start to look for models that don't fit well:
 
 ```{r}
-glance %>% arrange(r.squared)
+glance %>%
+  arrange(r.squared)
 ```
 
 The worst models all appear to be in Africa. Let's double check that with a plot. Here we have a relatively small number of observations and a discrete variable, so `geom_jitter()` is effective:
@@ -435,7 +439,10 @@ The advantage of this structure is that it generalises in a straightforward way
 Now if you want to iterate over names and values in parallel, you can use `map2()`:
 
 ```{r}
-df %>% mutate(smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1])))
+df %>%
+  mutate(
+    smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
+  )
 ```
 
@@ -243,13 +243,14 @@ The pipe is provided by the magrittr package, by Stefan Milton Bache. Most of th
 * For assignment magrittr provides the `%<>%` operator which allows you to
    replace code like:
 
-    ```R
-    mtcars <- mtcars %>% transform(cyl = cyl * 2)
+    ```{r, eval = FALSE}
+    mtcars <- mtcars %>%
+      transform(cyl = cyl * 2)
     ```
 
    with
 
-    ```R
+    ```{r, eval = FALSE}
    mtcars %<>% transform(cyl = cyl * 2)
    ```
 
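One small caveat worth noting here: `%<>%` is not re-exported by dplyr, so magrittr itself has to be attached before the shorthand form works:

```{r, eval = FALSE}
library(magrittr)

mtcars %<>% transform(cyl = cyl * 2)  # equivalent to mtcars <- mtcars %>% transform(cyl = cyl * 2)
```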
@@ -120,15 +120,25 @@ A variable can be both a primary key _and_ a foreign key. For example, `origin`
 Once you've identified the primary keys in your tables, it's good practice to verify that they do indeed uniquely identify each observation. One way to do that is to `count()` the primary keys and look for entries where `n` is greater than one:
 
 ```{r}
-planes %>% count(tailnum) %>% filter(n > 1)
-weather %>% count(year, month, day, hour, origin) %>% filter(n > 1)
+planes %>%
+  count(tailnum) %>%
+  filter(n > 1)
+
+weather %>%
+  count(year, month, day, hour, origin) %>%
+  filter(n > 1)
 ```
 
 Sometimes a table doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. For example, what's the primary key in the `flights` table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
 
 ```{r}
-flights %>% count(year, month, day, flight) %>% filter(n > 1)
-flights %>% count(year, month, day, tailnum) %>% filter(n > 1)
+flights %>%
+  count(year, month, day, flight) %>%
+  filter(n > 1)
+
+flights %>%
+  count(year, month, day, tailnum) %>%
+  filter(n > 1)
 ```
 
 When starting to work with this data, I had naively assumed that each flight number would be only used once per day: that would make it much easier to communicate problems with a specific flight. Unfortunately that is not the case! If a table lacks a primary key, it's sometimes useful to add one with `mutate()` and `row_number()`. That makes it easier to match observations if you've done some filtering and want to check back in with the original data. This is called a __surrogate key__.
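A surrogate key of the kind described above might be added like this (a sketch; the column name is arbitrary):

```{r, eval = FALSE}
flights %>%
  mutate(flight_id = row_number()) %>%
  select(flight_id, everything())
```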
@@ -160,7 +170,9 @@ The first tool we'll look at for combining a pair of tables is the __mutating jo
 Like `mutate()`, the join functions add variables to the right, so if you have a lot of variables already, the new variables won't get printed out. For these examples, we'll make it easier to see what's going on by creating a narrower dataset:
 
 ```{r}
-(flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier))
+flights2 <- flights %>%
+  select(year:day, hour, origin, dest, tailnum, carrier)
+flights2
 ```
 
 (When you're in RStudio, you can use `View()` to avoid this problem).
@@ -224,7 +236,8 @@ knitr::include_graphics("diagrams/join-inner.png")
 The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use `by` to tell dplyr which variable is the key:
 
 ```{r}
-x %>% inner_join(y, by = "key")
+x %>%
+  inner_join(y, by = "key")
 ```
 
 The most important property of an inner join is that unmatched rows are not included in the result. This means that generally inner joins are not appropriate for use in analysis because it's too easy to lose observations.
 
@@ -301,7 +314,8 @@ So far, the pairs of tables have always been joined by a single variable, and th
     `origin`.
 
     ```{r}
-    flights2 %>% left_join(weather)
+    flights2 %>%
+      left_join(weather)
     ```
 
 * A character vector, `by = "x"`. This is like a natural join, but uses only
@@ -310,7 +324,8 @@ So far, the pairs of tables have always been joined by a single variable, and th
     `tailnum`.
 
     ```{r}
-    flights2 %>% left_join(planes, by = "tailnum")
+    flights2 %>%
+      left_join(planes, by = "tailnum")
     ```
 
     Note that the `year` variables (which appear in both input data frames,
@@ -327,8 +342,11 @@ So far, the pairs of tables have always been joined by a single variable, and th
     need to specify which one we want to join to:
 
     ```{r}
-    flights2 %>% left_join(airports, c("dest" = "faa"))
-    flights2 %>% left_join(airports, c("origin" = "faa"))
+    flights2 %>%
+      left_join(airports, c("dest" = "faa"))
+
+    flights2 %>%
+      left_join(airports, c("origin" = "faa"))
     ```
 
 ### Exercises
 
@@ -414,7 +432,8 @@ top_dest
 Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:
 
 ```{r}
-flights %>% filter(dest %in% top_dest$dest)
+flights %>%
+  filter(dest %in% top_dest$dest)
 ```
 
 But it's difficult to extend that approach to multiple variables. For example, imagine that you'd found the 10 days with the highest average delays. How would you construct the filter statement that used `year`, `month`, and `day` to match it back to `flights`?
 
@@ -422,7 +441,8 @@ But it's difficult to extend that approach to multiple variables. For example, i
 Instead you can use a semi-join, which connects the two tables like a mutating join, but instead of adding new columns, only keeps the rows in `x` that have a match in `y`:
 
 ```{r}
-flights %>% semi_join(top_dest)
+flights %>%
+  semi_join(top_dest)
 ```
 
 Graphically, a semi-join looks like this:
 
tidy.Rmd | 37
@@ -138,7 +138,8 @@ To tidy a dataset like this, we need to __gather__ those column into a new pair
 Together those parameters generate the call to `gather()`:
 
 ```{r}
-table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
+table4a %>%
+  gather(`1999`, `2000`, key = "year", value = "cases")
 ```
 
 The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways to select columns, see [select](#select).
 
@@ -152,14 +153,17 @@ knitr::include_graphics("images/tidy-9.png")
 We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:
 
 ```{r}
-table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
+table4b %>%
+  gather(`1999`, `2000`, key = "year", value = "population")
 ```
 
 To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you'll learn about in [relational data].
 
 ```{r}
-tidy4a <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
-tidy4b <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
+tidy4a <- table4a %>%
+  gather(`1999`, `2000`, key = "year", value = "cases")
+tidy4b <- table4b %>%
+  gather(`1999`, `2000`, key = "year", value = "population")
 left_join(tidy4a, tidy4b)
 ```
 
|
@ -215,7 +219,8 @@ As you might have guessed from the common `key` and `value` arguments, `spread()
|
|||
1. Why does this code fail?
|
||||
|
||||
```{r, error = TRUE}
|
||||
table4a %>% gather(1999, 2000, key = "year", value = "cases")
|
||||
table4a %>%
|
||||
gather(1999, 2000, key = "year", value = "cases")
|
||||
```
|
||||
|
||||
1. Why does spreading this tibble fail?
|
||||
|
@@ -431,15 +436,16 @@ The best place to start is almost always to gathering together the columns that
 So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`. We don't know what those values represent yet, so we'll give them the generic name `"key"`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm` just so we can focus on the values that are present.
 
 ```{r}
-who1 <- who %>% gather(new_sp_m014:newrel_f65, key = "key", value = "cases",
-                       na.rm = TRUE)
+who1 <- who %>%
+  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
 who1
 ```
 
 We can get some hint of the structure of the values in the new `key` column:
 
 ```{r}
-who1 %>% count(key)
+who1 %>%
+  count(key)
 ```
 
 You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:
 
@@ -474,28 +480,33 @@ You might be able to parse this out by yourself with a little thought and some e
 We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.
 
 ```{r}
-who2 <- who1 %>% mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
+who2 <- who1 %>%
+  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
 who2
 ```
 
 We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.
 
 ```{r}
-who3 <- who2 %>% separate(key, c("new", "type", "sexage"), sep = "_")
+who3 <- who2 %>%
+  separate(key, c("new", "type", "sexage"), sep = "_")
 who3
 ```
 
 Then we might as well drop the `new` column because it's consistent in this dataset. While we're dropping columns, let's also drop `iso2` and `iso3` since they're redundant.
 
 ```{r}
-who3 %>% count(new)
-who4 <- who3 %>% select(-new, -iso2, -iso3)
+who3 %>%
+  count(new)
+who4 <- who3 %>%
+  select(-new, -iso2, -iso3)
 ```
 
 Next we'll separate `sexage` into `sex` and `age` by splitting after the first character:
 
 ```{r}
-who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
+who5 <- who4 %>%
+  separate(sexage, c("sex", "age"), sep = 1)
 who5
 ```
 
@@ -192,7 +192,7 @@ filter(df, is.na(x) | x > 1)
 
 1. Find all flights that
 
-    1. Were delayed by more two hours
+    1. Had an arrival delay of two or more hours.
     1. Flew to Houston (`IAH` or `HOU`)
     1. Were operated by United, American, or Delta
    1. Departed in summer (July, August, and September)
 
@@ -276,13 +276,7 @@ There are a number of helper functions you can use within `select()`:
 
 See `?select` for more details.
 
-It's possible to use `select()` to rename variables:
-
-```{r}
-select(flights, tail_num = tailnum)
-```
-
-But because `select()` drops all the variables not explicitly mentioned, it's not that useful. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
+`select()` can be used to rename variables, but it's rarely useful because it drops all the variables not explicitly mentioned. Instead, use `rename()`, which is a variant of `select()` that keeps all the variables that aren't explicitly mentioned:
 
 ```{r}
 rename(flights, tail_num = tailnum)
@@ -619,15 +613,16 @@ RStudio tip: a useful keyboard shortcut is Cmd/Ctrl + Shift + P. This resends th
 
 --------------------------------------------------------------------------------
 
-There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player. When I plot the skill of the batter against the number of times batted, you see two patterns:
+There's another common variation of this type of pattern. Let's look at how the average performance of batters in baseball is related to the number of times they're at bat. Here I use data from the __Lahman__ package to compute the batting average (number of hits / number of attempts) of every major league baseball player.
+
+When I plot the skill of the batter (measured by the batting average, `ba`) against the number of opportunities to hit the ball (measured by at bat, `ab`), you see two patterns:
 
 1. As above, the variation in our aggregate decreases as we get more
    data points.
 
-2. There's a positive correlation between skill (batting average, `ba`) and
-   number of opportunities to hit the ball (at bat, `ab`). This is because
-   teams control who gets to play, and obviously they'll pick their best
-   players.
+2. There's a positive correlation between skill (`ba`) and opportunities to
+   hit the ball (`ab`). This is because teams control who gets to play,
+   and obviously they'll pick their best players.
 
 ```{r}
 # Convert to a tibble so it prints nicely
@@ -650,7 +645,8 @@ batters %>%
 This also has important implications for ranking. If you naively sort on `desc(ba)`, the people with the best batting averages are clearly lucky, not skilled:
 
 ```{r}
-batters %>% arrange(desc(ba))
+batters %>%
+  arrange(desc(ba))
 ```
 
 You can find a good explanation of this problem at <http://varianceexplained.org/r/empirical_bayes_baseball/> and <http://www.evanmiller.org/how-not-to-sort-by-average-rating.html>.
 
@@ -744,7 +740,8 @@ Just using means, counts, and sum can get you a long way, but R provides many ot
    a count:
 
     ```{r}
-    not_cancelled %>% count(dest)
+    not_cancelled %>%
+      count(dest)
     ```
 
     You can optionally provide a weight variable. For example, you could use
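The weighted variant this text is leading up to looks roughly like this: passing `wt` makes `count()` sum that variable instead of counting rows, for example the total distance flown by each plane:

```{r, eval = FALSE}
not_cancelled %>%
  count(tailnum, wt = distance)
```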
@@ -813,6 +810,11 @@ daily %>%
 
    Which is more important: arrival delay or departure delay?
 
+1. Come up with another approach that will give you the same output as
+   `not_cancelled %>% count(dest)` and
+   `not_cancelled %>% count(tailnum, wt = distance)` (without using
+   `count()`).
+
 1. Our definition of cancelled flights (`!is.na(dep_delay) & !is.na(arr_delay)`
    ) is slightly suboptimal. Why? Which is the most important column?
 
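One possible approach to that exercise (a sketch) is to spell out what `count()` does using `group_by()` and `summarise()`:

```{r, eval = FALSE}
# Same output as not_cancelled %>% count(dest)
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())

# Same output as not_cancelled %>% count(tailnum, wt = distance)
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
```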
@@ -67,7 +67,7 @@ With ggplot2, you begin a plot with the function `ggplot()`. `ggplot()` creates
 
 You complete your graph by adding one or more layers to `ggplot()`. The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot. You'll learn a whole bunch of them throughout this chapter.
 
-Each geom function in ggplot2 takes a `mapping` argument. This defines how variables in your dataset are mapped to visual properties. You must always use `mapping` in conjunction with `aes()`. The `x` and `y` arguments of `aes()` describe which variables to map to the x and y axes of your plot, and ggplot2 will look for those variables in your dataset, `mpg`.
+Each geom function in ggplot2 takes a `mapping` argument. This defines how variables in your dataset are mapped to visual properties. The `mapping` argument is always paired with `aes()`, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the `data` argument, in this case, `mpg`.
 
 Let's turn this code into a reusable template for making graphs with ggplot2. To make a graph, replace the bracketed sections in the code below with a dataset, a geom function, or a set of mappings.
 
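Concretely, the pattern that paragraph describes looks like this minimal example (using the `mpg` data the text mentions):

```{r, eval = FALSE}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
```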
@@ -575,7 +575,7 @@ The stacking is performed automatically by the __position adjustment__ specified
    slightly transparent by setting `alpha` to a small value, or completely
    transparent by setting `fill = NA`.
 
-    ```{r fig.width = 3, out.width = "50%", fig.align = "default"}
+    ```{r out.width = "50%", fig.align = "default"}
    ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
      geom_bar(alpha = 1/5, position = "identity")
    ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +