272 lines
9.1 KiB
Plaintext
272 lines
9.1 KiB
Plaintext
# Factors
|
|
|
|
## Introduction
|
|
|
|
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors with non-alphabetical order.
|
|
|
|
Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.
|
|
|
|
Factors aren't as common in the tidyverse, because no function will automatically turn a character vector into a factor. It is, however, a good idea to use factors when appropriate, and controlling their levels can be particularly useful for tailoring visualisations of categorical data.
|
|
|
|
### Prerequisites
|
|
|
|
To work with factors, we'll use the __forcats__ packages (tools for dealing **cat**egorical variables + anagram of factors). It provides a wide range of helpers for working with factors. We'll also use ggplot2 because factors are particularly important for visualisation.
|
|
|
|
```{r setup, message = FALSE}
|
|
# devtools::install_github("hadley/forcats")
|
|
library(forcats)
|
|
library(ggplot2)
|
|
library(dplyr)
|
|
```
|
|
|
|
## Creating factors
|
|
|
|
There are two ways to create a factor: during import with readr, using `col_factor()`, or after the fact, turning a string into a factor. Often you'll need to do a little experimetation, so I recommend starting with strings.
|
|
|
|
To turn a string into a factor, call `factor()`, supplying list of possible values:
|
|
|
|
```{r}
|
|
x <- c("pear", "apple", "banana", "apple", "pear", "apple")
|
|
factor(x, levels = c("apple", "banana", "pear"))
|
|
```
|
|
|
|
Any values not in the list of levels will be silently converted to `NA`:
|
|
|
|
```{r}
|
|
factor(x, levels = c("apple", "banana"))
|
|
```
|
|
|
|
If you omit the levels, they'll be taken from the data in alphabetical order:
|
|
|
|
```{r}
|
|
factor(x)
|
|
```
|
|
|
|
Sometimes you'd prefer that the order of the levels match the order of the first appearnace in the data. You can do that during creation by setting levels to `unique(x)`, or after the with `fct_inorder()`:
|
|
|
|
```{r}
|
|
factor(x, levels = unique(x))
|
|
|
|
f <- factor(x)
|
|
f <- fct_inorder(f)
|
|
f
|
|
```
|
|
|
|
You can access the levels of the factor with `levels()`:
|
|
|
|
```{r}
|
|
levels(f)
|
|
```
|
|
|
|
## General Social Survey
|
|
|
|
For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](http://gss.norc.org), which is a long-running US survey run by the the independent research organization NORC at the University of Chicago. The survey has thousands of questions, and in `gss_cat` I've selected a handful of variables to illustrate some common challenges you'll hit when working with factors.
|
|
|
|
```{r}
|
|
gss_cat
|
|
```
|
|
|
|
Note that the order of levels is preserved in operations like `count()`:
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
count(race)
|
|
```
|
|
|
|
And in visualisations like `geom_bar()`:
|
|
|
|
```{r}
|
|
ggplot(gss_cat, aes(race)) +
|
|
geom_bar()
|
|
```
|
|
|
|
By default, ggplot2 will drop levels that don't have any values. You can force them to appear with:
|
|
|
|
```{r}
|
|
ggplot(gss_cat, aes(race)) +
|
|
geom_bar() +
|
|
scale_x_discrete(drop = FALSE)
|
|
```
|
|
|
|
Unfortunatealy dplyr doesn't yet have a `drop` option, but it will in the future.
|
|
|
|
### Exercise
|
|
|
|
|
|
## Modifying factor order
|
|
|
|
The levels of a factor can be meaningful or arbitary:
|
|
|
|
* arbitrary: where the order of the factor levels is arbitrary, like race, sex,
|
|
or religion. You have to pick an order for display, but it doesn't mean
|
|
anything.
|
|
|
|
* meaningful: where the order of levels reflects an underlying order like
|
|
party affiliation (from strong republican - indepedent - strong democrat)
|
|
or income (from low to high)
|
|
|
|
Generally, you should avoid jumbling the order if it's meaningful.
|
|
|
|
Let's take a look with a concrete example. Here I compute the average number of tv hours for each religion:
|
|
|
|
```{r}
|
|
relig <- gss_cat %>%
|
|
group_by(relig) %>%
|
|
summarise(
|
|
age = mean(age, na.rm = TRUE),
|
|
tvhours = mean(tvhours, na.rm = TRUE),
|
|
n = n()
|
|
)
|
|
|
|
ggplot(relig, aes(tvhours, relig)) + geom_point()
|
|
```
|
|
|
|
This plot is a little hard to take in because the order of religion is basically arbitary. We can improve it by reordering the levels of `relig`. This makes it much easier to see that "Don't know" seems to watch much more, and Hinduism & Other Eastern religions watch much less.
|
|
|
|
```{r}
|
|
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +
|
|
geom_point()
|
|
```
|
|
|
|
What if we do the same thing for income levels?
|
|
|
|
```{r}
|
|
rincome <- gss_cat %>%
|
|
group_by(rincome) %>%
|
|
summarise(
|
|
age = mean(age, na.rm = TRUE),
|
|
tvhours = mean(tvhours, na.rm = TRUE),
|
|
n = n()
|
|
)
|
|
|
|
ggplot(rincome, aes(age, rincome)) + geom_point()
|
|
```
|
|
|
|
Arbitrarily reordering the levels isn't a good idea!
|
|
|
|
```{r}
|
|
ggplot(rincome, aes(age, fct_reorder(rincome, age))) + geom_point()
|
|
```
|
|
|
|
But it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. Why do you think the average age for "Not applicable" is so high?
|
|
|
|
```{r}
|
|
ggplot(rincome, aes(age, fct_relevel(rincome, "Not applicable"))) +
|
|
geom_point()
|
|
```
|
|
|
|
Another variation of `fct_reorder()` is useful when you are colouring the lines on a plot. Using `fct_reorder2()` makes the line colours nicely match the order of the legend.
|
|
|
|
```{r, fig.align = "default", out.width = "50%"}
|
|
by_age <- gss_cat %>%
|
|
group_by(age, marital) %>%
|
|
count() %>%
|
|
mutate(prop = n / sum(n))
|
|
|
|
ggplot(by_age, aes(age, prop, colour = marital)) +
|
|
geom_line()
|
|
|
|
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
|
|
geom_line() +
|
|
labs(colour = "marital")
|
|
```
|
|
|
|
Finally, for bar plots, you can use `fct_infreq()` to order levels in increasing frequency. You may want to combine with `fct_rev()`.
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
|
|
ggplot(aes(marital)) +
|
|
geom_bar()
|
|
```
|
|
|
|
### Exercises
|
|
|
|
1. There are some suspiciously high numbers in `tvhours`. Is the mean a good
|
|
summary?
|
|
|
|
1. For each factor in `gss_cat` identify whether the order is arbitrary
|
|
or meaningful.
|
|
|
|
1. Recreate the display of marital status by age, using `geom_area()` instead
|
|
of `geom_line()`. What do you need to change to the plot? How might you
|
|
tweak the levels?
|
|
|
|
## Modifying factor levels
|
|
|
|
More powerful than changing the orders of the levels is to change their values. This allows you to clarify labels for publication, and collapse levels for high-level displays.
|
|
|
|
The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take the `gss_cat$partyid`:
|
|
|
|
```{r}
|
|
gss_cat %>% count(partyid)
|
|
```
|
|
|
|
The levels are little hard to read. Let's tweak them to be longer and more consistent. Any levels that aren't explicitly mentioned will be left as is.
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
mutate(partyid = fct_recode(partyid,
|
|
"Republican, strong" = "Strong republican",
|
|
"Republican, weak" = "Not str republican",
|
|
"Independent, near rep" = "Ind,near rep",
|
|
"Independent, near dem" = "Ind,near dem",
|
|
"Democrat, weak" = "Not str democrat",
|
|
"Democrat, strong" = "Strong democrat"
|
|
)) %>%
|
|
count(partyid)
|
|
```
|
|
|
|
You can assign multiple old levels to the same new level:
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
mutate(partyid = fct_recode(partyid,
|
|
"Republican, strong" = "Strong republican",
|
|
"Republican, weak" = "Not str republican",
|
|
"Independent, near rep" = "Ind,near rep",
|
|
"Independent, near dem" = "Ind,near dem",
|
|
"Democrat, weak" = "Not str democrat",
|
|
"Democrat, strong" = "Strong democrat",
|
|
"Other" = "No answer",
|
|
"Other" = "Don't know",
|
|
"Other" = "Other party"
|
|
)) %>%
|
|
count(partyid)
|
|
```
|
|
|
|
You must use this technique with extreme care: if you group together categories that are truly different you will end up with misleading results.
|
|
|
|
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant. For each new variable, you can provide a vector of old levels:
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
mutate(partyid = fct_collapse(partyid,
|
|
other = c("No answer", "Don't know", "Other party"),
|
|
rep = c("Strong republican", "Not str republican"),
|
|
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
|
|
dem = c("Not str democrat", "Strong democrat")
|
|
)) %>%
|
|
count(partyid)
|
|
```
|
|
|
|
Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
mutate(relig = fct_lump(relig)) %>%
|
|
count(relig)
|
|
```
|
|
|
|
The default behaviour is to lump together all the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not super helpful: it is true that the majority of Americans are protestant, but we've probably over collapsed.
|
|
|
|
Instead, we can use the `n` parameter to specify how many groups (excluding other) we want to keep:
|
|
|
|
```{r}
|
|
gss_cat %>%
|
|
mutate(relig = fct_lump(relig, n = 5)) %>%
|
|
count(relig, sort = TRUE)
|
|
```
|
|
|
|
### Exercises
|