r4ds/factors.Rmd

# Factors

## Introduction

In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors with non-alphabetical order.

Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.  

Factors aren't as common in the tidyverse, because no function will automatically turn a character vector into a factor. It is, however, a good idea to use factors when appropriate, and controlling their levels can be particularly useful for tailoring visualisations of categorical data.

### Prerequisites

To work with factors, we'll use the __forcats__ packages (tools for dealing **cat**egorical variables + anagram of factors). It provides a wide range of helpers for working with factors.  We'll also use ggplot2 because factors are particularly important for visualisation.

```{r setup, message = FALSE}
# devtools::install_github("hadley/forcats")
library(forcats)
library(ggplot2)
library(dplyr)
```

## Creating factors

There are two ways to create a factor: during import with readr, using `col_factor()`, or after the fact, turning a string into a factor. Often you'll need to do a little experimetation, so I recommend starting with strings.

To turn a string into a factor, call `factor()`, supplying list of possible values:

```{r}
x <- c("pear", "apple", "banana", "apple", "pear", "apple")
factor(x, levels = c("apple", "banana", "pear"))
```

Any values not in the list of levels will be silently converted to `NA`:

```{r}
factor(x, levels = c("apple", "banana"))
```

If you omit the levels, they'll be taken from the data in alphabetical order:

```{r}
factor(x)
```

Sometimes you'd prefer that the order of the levels match the order of the first appearnace in the data. You can do that during creation by setting levels to `unique(x)`, or after the with `fct_inorder()`:

```{r}
factor(x, levels = unique(x))

f <- factor(x)
f <- fct_inorder(f)
f
```

You can access the levels of the factor with `levels()`:

```{r}
levels(f)
```

## General Social Survey

For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](http://gss.norc.org), which is a long-running US survey run by the the independent research organization NORC at the University of Chicago. The survey has thousands of questions, and in `gss_cat` I've selected a handful of variables to illustrate some common challenges you'll hit when working with factors.

```{r}
gss_cat
```

Note that the order of levels is preserved in operations like `count()`:

```{r}
gss_cat %>% 
  count(race)
```

And in visualisations like `geom_bar()`:

```{r}
ggplot(gss_cat, aes(race)) + 
  geom_bar()
```

By default, ggplot2 will drop levels that don't have any values. You can force them to appear with:

```{r}
ggplot(gss_cat, aes(race)) + 
  geom_bar() + 
  scale_x_discrete(drop = FALSE)
```

Unfortunatealy dplyr doesn't yet have a `drop` option, but it will in the future.

### Exercise


## Modifying factor order

The levels of a factor can be meaningful or arbitary:

* arbitrary: where the order of the factor levels is arbitrary, like race, sex,
  or religion. You have to pick an order for display, but it doesn't mean 
  anything.

* meaningful: where the order of levels reflects an underlying order like
  party affiliation (from strong republican - indepedent - strong democrat)
  or income (from low to high)

Generally, you should avoid jumbling the order if it's meaningful. 

Let's take a look with a concrete example. Here I compute the average number of tv hours for each religion:

```{r}
relig <- gss_cat %>% 
  group_by(relig) %>% 
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig, aes(tvhours, relig)) + geom_point()
```

This plot is a little hard to take in because the order of religion is basically arbitary.  We can improve it by reordering the levels of `relig`. This makes it much easier to see that "Don't know" seems to watch much more, and Hinduism & Other Eastern religions watch much less.

```{r}
ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) + 
  geom_point()
```

What if we do the same thing for income levels?

```{r}
rincome <- gss_cat %>% 
  group_by(rincome) %>% 
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(rincome, aes(age, rincome)) + geom_point()
```

Arbitrarily reordering the levels isn't a good idea!

```{r}
ggplot(rincome, aes(age, fct_reorder(rincome, age))) + geom_point()
```

But it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. Why do you think the average age for "Not applicable" is so high?

```{r}
ggplot(rincome, aes(age, fct_relevel(rincome, "Not applicable"))) + 
  geom_point()
```

Another variation of `fct_reorder()` is useful when you are colouring the lines on a plot. Using `fct_reorder2()` makes the line colours nicely match the order of the legend.

```{r, fig.align = "default", out.width = "50%"}
by_age <- gss_cat %>% 
  group_by(age, marital) %>% 
  count() %>% 
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) + 
  geom_line()

ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) + 
  geom_line() +
  labs(colour = "marital")
```

Finally, for bar plots, you can use `fct_infreq()` to order levels in increasing frequency. You may want to combine with `fct_rev()`.

```{r}
gss_cat %>% 
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>% 
  ggplot(aes(marital)) +
    geom_bar()
```

### Exercises

1.  There are some suspiciously high numbers in `tvhours`. Is the mean a good 
    summary?

1.  For each factor in `gss_cat` identify whether the order is arbitrary
    or meaningful.

1.  Recreate the display of marital status by age, using `geom_area()` instead
    of `geom_line()`. What do you need to change to the plot? How might you
    tweak the levels?

## Modifying factor levels

More powerful than changing the orders of the levels is to change their values. This allows you to clarify labels for publication, and collapse levels for high-level displays.

The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take the `gss_cat$partyid`:

```{r}
gss_cat %>% count(partyid)
```

The levels are little hard to read. Let's tweak them to be longer and more consistent. Any levels that aren't explicitly mentioned will be left as is.

```{r}
gss_cat %>% 
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>% 
  count(partyid)
```

You can assign multiple old levels to the same new level:

```{r}
gss_cat %>% 
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party" 
  )) %>% 
  count(partyid)
``` 

You must use this technique with extreme care: if you group together categories that are truly different you will end up with misleading results.

If you want to collapse a lot of levels, `fct_collapse()` is a useful variant. For each new variable, you can provide a vector of old levels: 

```{r}
gss_cat %>% 
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"), 
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>% 
  count(partyid)
```

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:

```{r}
gss_cat %>% 
  mutate(relig = fct_lump(relig)) %>% 
  count(relig)
```

The default behaviour is to lump together all the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not super helpful: it is true that the majority of Americans are protestant, but we've probably over collapsed.

Instead, we can use the `n` parameter to specify how many groups (excluding other) we want to keep:

```{r}
gss_cat %>% 
  mutate(relig = fct_lump(relig, n = 5)) %>% 
  count(relig, sort = TRUE)
```

### Exercises
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			`# Factors`

			`## Introduction`

			`In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors with non-alphabetical order.`

			Historically, factors were much easier to work with than characters so many functions in base R automatically convert characters to factors (controlled by the dread `stringsAsFactors` argument). To get more historical context, you might want to read [_stringsAsFactors: An unauthorized biography_](http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/) by Roger Peng or [_stringsAsFactors = \<sigh\>_](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh) by Thomas Lumley.

			`Factors aren't as common in the tidyverse, because no function will automatically turn a character vector into a factor. It is, however, a good idea to use factors when appropriate, and controlling their levels can be particularly useful for tailoring visualisations of categorical data.`

			`### Prerequisites`

			`To work with factors, we'll use the __forcats__ packages (tools for dealing categorical variables + anagram of factors). It provides a wide range of helpers for working with factors. We'll also use ggplot2 because factors are particularly important for visualisation.`

			```{r setup, message = FALSE}
			`# devtools::install_github("hadley/forcats")`
			`library(forcats)`
			`library(ggplot2)`
			`library(dplyr)`
			```

			`## Creating factors`

			There are two ways to create a factor: during import with readr, using `col_factor()`, or after the fact, turning a string into a factor. Often you'll need to do a little experimetation, so I recommend starting with strings.

			To turn a string into a factor, call `factor()`, supplying list of possible values:

			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`x <- c("pear", "apple", "banana", "apple", "pear", "apple")`
			`factor(x, levels = c("apple", "banana", "pear"))`
			```

			Any values not in the list of levels will be silently converted to `NA`:
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
Filling in some text about factors 2016-08-17 23:23:57 +08:00			```{r}
			`factor(x, levels = c("apple", "banana"))`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`If you omit the levels, they'll be taken from the data in alphabetical order:`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`factor(x)`
			```

			Sometimes you'd prefer that the order of the levels match the order of the first appearnace in the data. You can do that during creation by setting levels to `unique(x)`, or after the with `fct_inorder()`:

			```{r}
			`factor(x, levels = unique(x))`

			`f <- factor(x)`
			`f <- fct_inorder(f)`
			`f`
			```

			You can access the levels of the factor with `levels()`:

			```{r}
			`levels(f)`
			```

			`## General Social Survey`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
Filling in some text about factors 2016-08-17 23:23:57 +08:00			For the rest of this chapter, we're going to focus on `forcats::gss_cat`. It's a sample of variables from the [General Social Survey](http://gss.norc.org), which is a long-running US survey run by the the independent research organization NORC at the University of Chicago. The survey has thousands of questions, and in `gss_cat` I've selected a handful of variables to illustrate some common challenges you'll hit when working with factors.
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`gss_cat`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			Note that the order of levels is preserved in operations like `count()`:
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
			`gss_cat %>%`
			`count(race)`
			```

			And in visualisations like `geom_bar()`:

			```{r}
			`ggplot(gss_cat, aes(race)) +`
			`geom_bar()`
			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`By default, ggplot2 will drop levels that don't have any values. You can force them to appear with:`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
			`ggplot(gss_cat, aes(race)) +`
			`geom_bar() +`
			`scale_x_discrete(drop = FALSE)`
			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			Unfortunatealy dplyr doesn't yet have a `drop` option, but it will in the future.

			`### Exercise`

Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			`## Modifying factor order`

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`The levels of a factor can be meaningful or arbitary:`

			`* arbitrary: where the order of the factor levels is arbitrary, like race, sex,`
			`or religion. You have to pick an order for display, but it doesn't mean`
			`anything.`

			`* meaningful: where the order of levels reflects an underlying order like`
			`party affiliation (from strong republican - indepedent - strong democrat)`
			`or income (from low to high)`

			`Generally, you should avoid jumbling the order if it's meaningful.`

			`Let's take a look with a concrete example. Here I compute the average number of tv hours for each religion:`

Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```{r}
			`relig <- gss_cat %>%`
			`group_by(relig) %>%`
			`summarise(`
			`age = mean(age, na.rm = TRUE),`
			`tvhours = mean(tvhours, na.rm = TRUE),`
			`n = n()`
			`)`

			`ggplot(relig, aes(tvhours, relig)) + geom_point()`
			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			This plot is a little hard to take in because the order of religion is basically arbitary. We can improve it by reordering the levels of `relig`. This makes it much easier to see that "Don't know" seems to watch much more, and Hinduism & Other Eastern religions watch much less.

			```{r}
			`ggplot(relig, aes(tvhours, fct_reorder(relig, tvhours))) +`
			`geom_point()`
			```

			`What if we do the same thing for income levels?`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
			`rincome <- gss_cat %>%`
			`group_by(rincome) %>%`
			`summarise(`
			`age = mean(age, na.rm = TRUE),`
			`tvhours = mean(tvhours, na.rm = TRUE),`
			`n = n()`
			`)`

			`ggplot(rincome, aes(age, rincome)) + geom_point()`
			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`Arbitrarily reordering the levels isn't a good idea!`

			```{r}
			`ggplot(rincome, aes(age, fct_reorder(rincome, age))) + geom_point()`
			```
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
Filling in some text about factors 2016-08-17 23:23:57 +08:00			But it does make sense to pull "Not applicable" to the front with the other special levels. You can use `fct_relevel()`. Why do you think the average age for "Not applicable" is so high?
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`ggplot(rincome, aes(age, fct_relevel(rincome, "Not applicable"))) +`
			`geom_point()`
			```

			Another variation of `fct_reorder()` is useful when you are colouring the lines on a plot. Using `fct_reorder2()` makes the line colours nicely match the order of the legend.

			```{r, fig.align = "default", out.width = "50%"}
			`by_age <- gss_cat %>%`
			`group_by(age, marital) %>%`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			`count() %>%`
			`mutate(prop = n / sum(n))`

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`ggplot(by_age, aes(age, prop, colour = marital)) +`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			`geom_line()`

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +`
			`geom_line() +`
			`labs(colour = "marital")`
			```
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
Filling in some text about factors 2016-08-17 23:23:57 +08:00			Finally, for bar plots, you can use `fct_infreq()` to order levels in increasing frequency. You may want to combine with `fct_rev()`.

			```{r}
			`gss_cat %>%`
			`mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%`
			`ggplot(aes(marital)) +`
			`geom_bar()`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`### Exercises`

			1. There are some suspiciously high numbers in `tvhours`. Is the mean a good
			`summary?`

			1. For each factor in `gss_cat` identify whether the order is arbitrary
			`or meaningful.`

			1. Recreate the display of marital status by age, using `geom_area()` instead
			of `geom_line()`. What do you need to change to the plot? How might you
			`tweak the levels?`

Start banging out factors chapter 2016-08-17 06:06:51 +08:00			`## Modifying factor levels`

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`More powerful than changing the orders of the levels is to change their values. This allows you to clarify labels for publication, and collapse levels for high-level displays.`

			The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take the `gss_cat$partyid`:

			```{r}
			`gss_cat %>% count(partyid)`
			```
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`The levels are little hard to read. Let's tweak them to be longer and more consistent. Any levels that aren't explicitly mentioned will be left as is.`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`gss_cat %>%`
			`mutate(partyid = fct_recode(partyid,`
			`"Republican, strong" = "Strong republican",`
			`"Republican, weak" = "Not str republican",`
			`"Independent, near rep" = "Ind,near rep",`
			`"Independent, near dem" = "Ind,near dem",`
			`"Democrat, weak" = "Not str democrat",`
			`"Democrat, strong" = "Strong democrat"`
			`)) %>%`
			`count(partyid)`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`You can assign multiple old levels to the same new level:`

			```{r}
			`gss_cat %>%`
			`mutate(partyid = fct_recode(partyid,`
			`"Republican, strong" = "Strong republican",`
			`"Republican, weak" = "Not str republican",`
			`"Independent, near rep" = "Ind,near rep",`
			`"Independent, near dem" = "Ind,near dem",`
			`"Democrat, weak" = "Not str democrat",`
			`"Democrat, strong" = "Strong democrat",`
			`"Other" = "No answer",`
			`"Other" = "Don't know",`
			`"Other" = "Other party"`
			`)) %>%`
			`count(partyid)`
			```

			`You must use this technique with extreme care: if you group together categories that are truly different you will end up with misleading results.`

			If you want to collapse a lot of levels, `fct_collapse()` is a useful variant. For each new variable, you can provide a vector of old levels:
Start banging out factors chapter 2016-08-17 06:06:51 +08:00
			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`gss_cat %>%`
			`mutate(partyid = fct_collapse(partyid,`
			`other = c("No answer", "Don't know", "Other party"),`
			`rep = c("Strong republican", "Not str republican"),`
			`ind = c("Ind,near rep", "Independent", "Ind,near dem"),`
			`dem = c("Not str democrat", "Strong democrat")`
			`)) %>%`
			`count(partyid)`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			Sometimes you just want to lump together all the small groups to make a plot or table simpler. That's the job of `fct_lump()`:

Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```{r}
Filling in some text about factors 2016-08-17 23:23:57 +08:00			`gss_cat %>%`
			`mutate(relig = fct_lump(relig)) %>%`
			`count(relig)`
			```

			`The default behaviour is to lump together all the smallest groups, ensuring that the aggregate is still the smallest group. In this case it's not super helpful: it is true that the majority of Americans are protestant, but we've probably over collapsed.`

			Instead, we can use the `n` parameter to specify how many groups (excluding other) we want to keep:

			```{r}
			`gss_cat %>%`
			`mutate(relig = fct_lump(relig, n = 5)) %>%`
			`count(relig, sort = TRUE)`
Start banging out factors chapter 2016-08-17 06:06:51 +08:00			```

Filling in some text about factors 2016-08-17 23:23:57 +08:00			`### Exercises`