Factors polishing
Add alt text. Update code style
factors.Rmd
							@@ -2,13 +2,12 @@
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
## Introduction
 | 
					## Introduction
 | 
				
			||||||
 | 
					
 | 
				
			||||||
In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values.
 | 
					Factors are used for categorical variables, variables that have a fixed and known set of possible values.
 | 
				
			||||||
They are also useful when you want to display character vectors in a non-alphabetical order.
 | 
					They are also useful when you want to display character vectors in a non-alphabetical order.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Historically, factors were much easier to work with than characters.
 | 
					If, after reading this chapter, you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
 | 
				
			||||||
As a result, many of the functions in base R automatically convert characters to factors.
 | 
					This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
 | 
				
			||||||
This means that factors often crop up in places where they're not actually helpful.
 | 
					An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
 | 
				
			||||||
Fortunately, you don't need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Prerequisites
 | 
					### Prerequisites
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -19,12 +18,6 @@ It provides tools for dealing with **cat**egorical variables (and it's an anagra
 | 
				
			|||||||
library(tidyverse)
 | 
					library(tidyverse)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Learning more
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
If you want to learn more about factors, I recommend reading Amelia McNamara and Nicholas Horton's paper, [*Wrangling categorical data in R*](https://peerj.com/preprints/3163/).
 | 
					 | 
				
			||||||
This paper lays out some of the history discussed in [*stringsAsFactors: An unauthorized biography*](https://simplystatistics.org/posts/2015-07-24-stringsasfactors-an-unauthorized-biography/) and [*stringsAsFactors = \<sigh\>*](http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh), and compares the tidy approaches to categorical data outlined in this book with base R methods.
 | 
					 | 
				
			||||||
An early version of the paper helped motivate and scope the forcats package; thanks Amelia & Nick!
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
## Creating factors
 | 
					## Creating factors
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Imagine that you have a variable that records month:
 | 
					Imagine that you have a variable that records month:
 | 
				
			||||||
@@ -103,8 +96,8 @@ levels(f2)
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
## General Social Survey
 | 
					## General Social Survey
 | 
				
			||||||
 | 
					
 | 
				
			||||||
For the rest of this chapter, we're going to focus on `forcats::gss_cat`.
 | 
					For the rest of this chapter, we're going to use `forcats::gss_cat`.
 | 
				
			||||||
It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
 | 
					It's a sample of data from the [General Social Survey](http://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
 | 
				
			||||||
The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
 | 
					The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
@@ -124,6 +117,10 @@ gss_cat |>
 | 
				
			|||||||
Or with a bar chart:
 | 
					Or with a bar chart:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: > 
 | 
				
			||||||
 | 
					#|   A bar chart showing the distribution of race. There are ~2000
 | 
				
			||||||
 | 
					#|   records with race "Other", 3000 with race "Black", and over
 | 
				
			||||||
 | 
					#|   15,000 with race "White".
 | 
				
			||||||
ggplot(gss_cat, aes(race)) +
 | 
					ggplot(gss_cat, aes(race)) +
 | 
				
			||||||
  geom_bar()
 | 
					  geom_bar()
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
@@ -132,6 +129,9 @@ By default, ggplot2 will drop levels that don't have any values.
 | 
				
			|||||||
You can force them to display with:
 | 
					You can force them to display with:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: >
					#|   The same bar chart as the last plot, but now with a missing bar on
					#|   the far right with label "Not applicable".
 | 
				
			||||||
ggplot(gss_cat, aes(race)) +
 | 
					ggplot(gss_cat, aes(race)) +
 | 
				
			||||||
  geom_bar() +
 | 
					  geom_bar() +
 | 
				
			||||||
  scale_x_discrete(drop = FALSE)
 | 
					  scale_x_discrete(drop = FALSE)
 | 
				
			||||||
@@ -142,8 +142,7 @@ In dplyr::count() set the `.drop` option to `FALSE`, to show these.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
gss_cat |> 
 | 
					gss_cat |> 
 | 
				
			||||||
  count(race, 
 | 
					  count(race, .drop = FALSE)
 | 
				
			||||||
        .drop = FALSE)
 | 
					 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
 | 
					When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels.
 | 
				
			||||||
@@ -160,14 +159,18 @@ Those operations are described in the sections below.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
3.  Which `relig` does `denom` (denomination) apply to?
 | 
					3.  Which `relig` does `denom` (denomination) apply to?
 | 
				
			||||||
    How can you find out with a table?
 | 
					    How can you find out with a table?
 | 
				
			||||||
    How can you find out with a visualisation?
 | 
					    How can you find out with a visualization?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
## Modifying factor order
 | 
					## Modifying factor order
 | 
				
			||||||
 | 
					
 | 
				
			||||||
It's often useful to change the order of the factor levels in a visualisation.
 | 
					It's often useful to change the order of the factor levels in a visualization.
 | 
				
			||||||
For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
 | 
					For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: >
 | 
				
			||||||
 | 
					#|   A scatterplot with tvhours on the x-axis and religion on the y-axis.
 | 
				
			||||||
 | 
					#|   The y-axis is ordered seemingly arbitrarily, making it hard to get
 | 
				
			||||||
 | 
					#|   any sense of overall pattern.
 | 
				
			||||||
relig_summary <- gss_cat |>
 | 
					relig_summary <- gss_cat |>
 | 
				
			||||||
  group_by(relig) |>
 | 
					  group_by(relig) |>
 | 
				
			||||||
  summarise(
 | 
					  summarise(
 | 
				
			||||||
@@ -176,7 +179,8 @@ relig_summary <- gss_cat |>
 | 
				
			|||||||
    n = n()
 | 
					    n = n()
 | 
				
			||||||
  )
 | 
					  )
 | 
				
			||||||
 | 
					
 | 
				
			||||||
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
 | 
					ggplot(relig_summary, aes(tvhours, relig)) + 
 | 
				
			||||||
 | 
					  geom_point()
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
It is difficult to interpret this plot because there's no overall pattern.
 | 
					It is difficult to interpret this plot because there's no overall pattern.
 | 
				
			||||||
@@ -188,6 +192,10 @@ We can improve it by reordering the levels of `relig` using `fct_reorder()`.
 | 
				
			|||||||
-   Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
 | 
					-   Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
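
As a rough illustration of the arguments listed above, here is a minimal sketch on made-up data (the vectors `f` and `x` are hypothetical and not part of `gss_cat`). It assumes a forcats release in which the summary function is passed as `.fun`; the factor and the values are passed positionally.

```r
library(forcats)

# Toy data (hypothetical): two x values for each level of f.
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 2, 3, 100)

# With the default summary (median): a = 6, b = 1.5, c = 51.5,
# so levels are sorted by increasing median.
by_median <- fct_reorder(f, x)
levels(by_median)

# With a different summary function (min): a = 5, b = 1, c = 3.
by_min <- fct_reorder(f, x, .fun = min)
levels(by_min)
```

The two calls give different level orders because level `c` has a large median but a small minimum.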
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: >
 | 
				
			||||||
 | 
					#|   The same scatterplot as above, but now the religion is displayed in
 | 
				
			||||||
 | 
					#|   increasing order of tvhours. "Other eastern" has the fewest tvhours
 | 
				
			||||||
 | 
					#|   under 2, and "Don't know" has the highest (over 5).
 | 
				
			||||||
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
 | 
					ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
 | 
				
			||||||
  geom_point()
 | 
					  geom_point()
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
@@ -207,6 +215,11 @@ relig_summary |>
 | 
				
			|||||||
What if we create a similar plot looking at how average age varies across reported income level?
 | 
					What if we create a similar plot looking at how average age varies across reported income level?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: >
 | 
				
			||||||
 | 
					#|   A scatterplot with age on the x-axis and income on the y-axis. Income
 | 
				
			||||||
 | 
					#|   has been reordered in order of average age which doesn't make much
 | 
				
			||||||
 | 
					#|   sense. One section of the y-axis goes from $6000-6999, then <$1000, 
 | 
				
			||||||
 | 
					#|   then $8000-9999.
 | 
				
			||||||
rincome_summary <- gss_cat |>
 | 
					rincome_summary <- gss_cat |>
 | 
				
			||||||
  group_by(rincome) |>
 | 
					  group_by(rincome) |>
 | 
				
			||||||
  summarise(
 | 
					  summarise(
 | 
				
			||||||
@@ -215,7 +228,8 @@ rincome_summary <- gss_cat |>
 | 
				
			|||||||
    n = n()
 | 
					    n = n()
 | 
				
			||||||
  )
 | 
					  )
 | 
				
			||||||
 | 
					
 | 
				
			||||||
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
 | 
					ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + 
 | 
				
			||||||
 | 
					  geom_point()
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Here, arbitrarily reordering the levels isn't a good idea!
 | 
					Here, arbitrarily reordering the levels isn't a good idea!
 | 
				
			||||||
@@ -227,22 +241,43 @@ You can use `fct_relevel()`.
 | 
				
			|||||||
It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
 | 
					It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
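
A small sketch of that behaviour on a made-up factor (the `income` vector below is hypothetical, chosen only to mimic the shape of `rincome`):

```r
library(forcats)

# Hypothetical factor; factor() sorts the levels alphabetically by default.
income <- factor(c("Low", "High", "Not applicable", "Medium"))
levels(income)

# Move "Not applicable" to the front; the other levels keep their
# existing relative order.
moved <- fct_relevel(income, "Not applicable")
levels(moved)
```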
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: >
 | 
				
			||||||
 | 
					#|   The same scatterplot but now "Not applicable" is displayed at the
 | 
				
			||||||
 | 
					#|   bottom of the y-axis. Generally there is a positive association
 | 
				
			||||||
 | 
					#|   between income and age, and the income band with the highest average
 | 
				
			||||||
 | 
					#|   age is "Not applicable".
 | 
				
			||||||
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
 | 
					ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
 | 
				
			||||||
  geom_point()
 | 
					  geom_point()
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Why do you think the average age for "Not applicable" is so high?
 | 
					Why do you think the average age for "Not applicable" is so high?
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Another type of reordering is useful when you are colouring the lines on a plot.
 | 
					Another type of reordering is useful when you are coloring the lines on a plot.
 | 
				
			||||||
`fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values.
 | 
					`fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values.
 | 
				
			||||||
This makes the plot easier to read because the line colours line up with the legend.
 | 
					This makes the plot easier to read because the line colors line up with the legend.
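
To make the rule concrete, here is a minimal sketch on made-up data (the vectors `g`, `x`, and `y` are hypothetical). It relies on the defaults of `fct_reorder2()`, which, as far as the forcats documentation describes them, pick the `y` value at the largest `x` within each level and sort in descending order.

```r
library(forcats)

# Hypothetical data: three groups, each observed at x = 1 and x = 2.
g <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(9, 1, 0, 3, 5, 2)

# At the largest x (x = 2) the y values are a = 1, b = 3, c = 2,
# so the level with the highest final y comes first.
reordered <- fct_reorder2(g, x, y)
levels(reordered)
```

In a line plot this puts the legend entries in the same top-to-bottom order as the line endings on the right-hand side.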
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r, fig.align = "default", out.width = "50%", fig.width = 4}
 | 
					```{r, fig.align = "default", out.width = "50%", fig.width = 4}
 | 
				
			||||||
 | 
					#| fig.alt: 
 | 
				
			||||||
 | 
					#| - >
 | 
				
			||||||
 | 
					#|     A line plot with age on the x-axis and proportion on the y-axis.
 | 
				
			||||||
 | 
					#|     There is one line for each category of marital status: no answer,
 | 
				
			||||||
 | 
					#|     never married, separated, divorced, widowed, and married. It is
 | 
				
			||||||
 | 
					#|     a little hard to read the plot because the order of the legend is 
 | 
				
			||||||
 | 
					#|     unrelated to the lines on the plot.    
 | 
				
			||||||
 | 
					#| - >
 | 
				
			||||||
 | 
					#|     Rearranging the legend makes the plot easier to read because the
 | 
				
			||||||
 | 
					#|     legend colors now match the order of the lines on the far right 
 | 
				
			||||||
 | 
					#|     of the plot. You can see some unsurprising patterns: the proportion
 | 
				
			||||||
 | 
					#|     never married decreases with age, married forms an upside down U 
 | 
				
			||||||
 | 
					#|     shape, and widowed starts off low but increases steeply after age
 | 
				
			||||||
 | 
					#|     60.
 | 
				
			||||||
by_age <- gss_cat |>
 | 
					by_age <- gss_cat |>
 | 
				
			||||||
  filter(!is.na(age)) |>
 | 
					  filter(!is.na(age)) |>
 | 
				
			||||||
  count(age, marital) |>
 | 
					  count(age, marital) |>
 | 
				
			||||||
  group_by(age) |>
 | 
					  group_by(age) |>
 | 
				
			||||||
  mutate(prop = n / sum(n))
 | 
					  mutate(
 | 
				
			||||||
 | 
					    prop = n / sum(n)
 | 
				
			||||||
 | 
					  )
 | 
				
			||||||
 | 
					
 | 
				
			||||||
ggplot(by_age, aes(age, prop, colour = marital)) +
 | 
					ggplot(by_age, aes(age, prop, colour = marital)) +
 | 
				
			||||||
  geom_line(na.rm = TRUE)
 | 
					  geom_line(na.rm = TRUE)
 | 
				
			||||||
@@ -256,6 +291,10 @@ Finally, for bar plots, you can use `fct_infreq()` to order levels in increasing
 | 
				
			|||||||
You may want to combine with `fct_rev()`.
 | 
					You may want to combine with `fct_rev()`.
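
A minimal sketch of the combination on a made-up factor (the `status` vector is hypothetical). It assumes `fct_infreq()` puts the most frequent level first, which is how current forcats releases behave, so wrapping it in `fct_rev()` yields least-to-most-common order.

```r
library(forcats)

# Hypothetical factor: "c" appears three times, "a" twice, "b" once.
status <- factor(c("b", "a", "a", "c", "c", "c"))

# Most frequent level first.
most_first <- fct_infreq(status)
levels(most_first)

# Reverse the level order so the least frequent level comes first.
least_first <- fct_rev(most_first)
levels(least_first)
```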
 | 
				
			||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
 | 
					#| fig.alt: >
 | 
				
			||||||
 | 
					#|   A bar chart of marital status ordered from least to most common:
 | 
				
			||||||
 | 
					#|   no answer (~0), separated (~1,000), widowed (~2,000), divorced 
 | 
				
			||||||
 | 
					#|   (~3,000), never married (~5,000), married (~10,000).
 | 
				
			||||||
gss_cat |>
 | 
					gss_cat |>
 | 
				
			||||||
  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
 | 
					  mutate(marital = marital |> fct_infreq() |> fct_rev()) |>
 | 
				
			||||||
  ggplot(aes(marital)) +
 | 
					  ggplot(aes(marital)) +
 | 
				
			||||||
@@ -288,14 +327,16 @@ Let's tweak them to be longer and use a parallel construction.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
gss_cat |>
 | 
					gss_cat |>
 | 
				
			||||||
  mutate(partyid = fct_recode(partyid,
 | 
					  mutate(
 | 
				
			||||||
 | 
					    partyid = fct_recode(partyid,
 | 
				
			||||||
      "Republican, strong"    = "Strong republican",
 | 
					      "Republican, strong"    = "Strong republican",
 | 
				
			||||||
      "Republican, weak"      = "Not str republican",
 | 
					      "Republican, weak"      = "Not str republican",
 | 
				
			||||||
      "Independent, near rep" = "Ind,near rep",
 | 
					      "Independent, near rep" = "Ind,near rep",
 | 
				
			||||||
      "Independent, near dem" = "Ind,near dem",
 | 
					      "Independent, near dem" = "Ind,near dem",
 | 
				
			||||||
      "Democrat, weak"        = "Not str democrat",
 | 
					      "Democrat, weak"        = "Not str democrat",
 | 
				
			||||||
      "Democrat, strong"      = "Strong democrat"
 | 
					      "Democrat, strong"      = "Strong democrat"
 | 
				
			||||||
  )) |>
 | 
					    )
 | 
				
			||||||
 | 
					  ) |>
 | 
				
			||||||
  count(partyid)
 | 
					  count(partyid)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -305,7 +346,8 @@ To combine groups, you can assign multiple old levels to the same new level:
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
gss_cat |>
 | 
					gss_cat |>
 | 
				
			||||||
  mutate(partyid = fct_recode(partyid,
 | 
					  mutate(
 | 
				
			||||||
 | 
					    partyid = fct_recode(partyid,
 | 
				
			||||||
      "Republican, strong"    = "Strong republican",
 | 
					      "Republican, strong"    = "Strong republican",
 | 
				
			||||||
      "Republican, weak"      = "Not str republican",
 | 
					      "Republican, weak"      = "Not str republican",
 | 
				
			||||||
      "Independent, near rep" = "Ind,near rep",
 | 
					      "Independent, near rep" = "Ind,near rep",
 | 
				
			||||||
@@ -315,7 +357,8 @@ gss_cat |>
 | 
				
			|||||||
      "Other"                 = "No answer",
 | 
					      "Other"                 = "No answer",
 | 
				
			||||||
      "Other"                 = "Don't know",
 | 
					      "Other"                 = "Don't know",
 | 
				
			||||||
      "Other"                 = "Other party"
 | 
					      "Other"                 = "Other party"
 | 
				
			||||||
  )) |>
 | 
					    )
 | 
				
			||||||
 | 
					  ) |>
 | 
				
			||||||
  count(partyid)
 | 
					  count(partyid)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -326,12 +369,14 @@ For each new variable, you can provide a vector of old levels:
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
```{r}
 | 
					```{r}
 | 
				
			||||||
gss_cat |>
 | 
					gss_cat |>
 | 
				
			||||||
  mutate(partyid = fct_collapse(partyid,
 | 
					  mutate(
 | 
				
			||||||
 | 
					    partyid = fct_collapse(partyid,
 | 
				
			||||||
      other = c("No answer", "Don't know", "Other party"),
 | 
					      other = c("No answer", "Don't know", "Other party"),
 | 
				
			||||||
      rep = c("Strong republican", "Not str republican"),
 | 
					      rep = c("Strong republican", "Not str republican"),
 | 
				
			||||||
      ind = c("Ind,near rep", "Independent", "Ind,near dem"),
 | 
					      ind = c("Ind,near rep", "Independent", "Ind,near dem"),
 | 
				
			||||||
      dem = c("Not str democrat", "Strong democrat")
 | 
					      dem = c("Not str democrat", "Strong democrat")
 | 
				
			||||||
  )) |>
 | 
					    )
 | 
				
			||||||
 | 
					  ) |>
 | 
				
			||||||
  count(partyid)
 | 
					  count(partyid)
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 
 | 
				
			|||||||