In this chapter, you will learn a consistent way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. We'll also need to use a little dplyr, as is common when tidying data.
You can represent the same underlying data in multiple ways. The example below shows the same data organised in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organises the values in a different way.
These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier to work with inside the tidyverse. There are three interrelated rules which make a dataset tidy:

1.  Each variable must have its own column.
1.  Each observation must have its own row.
1.  Each value must have its own cell.
These three rules are interrelated because it's impossible to only satisfy two of the three rules. That interrelationship leads to an even simpler set of practical instructions:

1.  Put each dataset in a tibble.
1.  Put each variable in a column.

Why ensure that your data is tidy? There are two main advantages:
1. There's a general advantage to just picking one consistent way of storing
data. If you have a consistent data structure, it's easier to learn the
tools that work with it because they have an underlying uniformity.
1. There's a specific advantage to placing variables in columns because
it allows R's vectorised nature to shine. As you learned in [useful
creation functions] and [useful summary functions], most built-in R
functions work with a vector of values. That makes transforming tidy
data feel particularly natural.
In this example, it's `table1` that has the tidy representation, because each of the four columns represents a variable. This form is the easiest to work with in dplyr or ggplot2. It's also well suited for modelling, as you'll learn later. In fact, the way that R's modelling functions work was an inspiration for the tidy data format. Here are a couple of small examples of how you might work with this data. Think about how you'd achieve the same result with the other representations.
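For instance (a sketch, assuming the tidyverse is loaded so that `table1` and the dplyr and ggplot2 functions are available):

```{r}
# Compute rate per 10,000 people
table1 %>% 
  mutate(rate = cases / population * 10000)

# Compute total cases per year
table1 %>% 
  count(year, wt = cases)

# Visualise changes over time
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
```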
The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored in order to make data entry, not data analysis, easy.
The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
One of the most common messy-data problems is that some variables will not be in the columns: one variable might be spread across multiple columns, or the variables for one observation might be scattered across multiple rows. To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`.
A common problem is a dataset where some of the column names are not names of variables, but _values_ of a variable. Take `table4a`, for example: the column names `1999` and `2000` represent values of the `year` variable.
The columns to gather are specified with `dplyr::select()` style notation. Here there are only two columns, so we list them by name. `1999` and `2000` are non-syntactic names so we have to surround them in backticks. To refresh your memory of the other ways you can select columns, see [select](#select).
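Together, those pieces generate the call to `gather()`: the columns to gather, plus `key` (the name of the variable whose values form the current column names, here `year`) and `value` (the name of the variable whose values are spread over the cells, here `cases`):

```{r}
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
```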
In the final result, the gathered columns are dropped, and we get new `key` and `value` variables. Otherwise, the relationships between the original variables are preserved.
To combine the tidied versions of `table4a` and `table4b` into a single tibble, we need to use `dplyr::left_join()`, which you'll learn about in [relational data].
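A sketch of that workflow; `left_join()` matches the two tidied tables on their common `country` and `year` columns:

```{r}
tidy4a <- table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% 
  gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
```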
Spreading is the opposite of gathering. You use it when the variables for one observation are scattered across multiple rows. For example, take `table2`. An observation is a country in a year, but each observation is spread across two rows.
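To tidy it, we need two pieces of information: the column that contains variable names (the `key` column, here `type`), and the column that contains values from multiple variables (the `value` column, here `count`):

```{r}
table2 %>% 
  spread(key = type, value = count)
```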
As you might have guessed from the common `key` and `value` arguments, `spread()` and `gather()` are complements. `gather()` makes wide tables narrower and longer; `spread()` makes long tables shorter and wider.
You've learned how to tidy `table2` and `table4`, but not `table3`. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function. You'll also learn about the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
We need to use `separate()` to tidy `table3`, which combines values of *cases* and *population* in the same column. `separate()` takes a data frame, the name of the column to separate, and the names of the columns to separate into:
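```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"))
```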
By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:
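```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
```

(Formally, `sep` is interpreted as a regular expression, which matters for characters like `.` that have special meanings.)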
Look carefully at the column types: you'll notice that `cases` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:
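```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
```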
You can also pass a vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`. You can use this arrangement to separate the last two digits of each year:
```{r}
table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
```
### Unite
`unite()` does the opposite of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently than `separate()`, but it's still a useful tool to have in your back pocket.
We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
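```{r}
table5 %>% 
  unite(new, century, year)
```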
In this case we also need to use the `sep` argument. The default will place an underscore (`_`) between the values from different columns. Here we don't want any separator so we use `""`:
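```{r}
table5 %>% 
  unite(new, century, year, sep = "")
```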
Changing the representation of a dataset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

*   __Explicitly__, i.e. flagged with `NA`.
*   __Implicitly__, i.e. simply not present in the data.
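Let's illustrate this idea with a very simple made-up dataset of quarterly stock returns (the numbers are purely illustrative):

```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```

There are two missing values in this dataset: the return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains `NA`; the return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.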
One way to think about the difference is with this Zen-like koan: An implicit missing value is the presence of an absence; an explicit missing value is the absence of a presence.
The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:
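```{r}
stocks %>% 
  spread(year, return)  # the 2016 Q1 return now appears as an explicit NA
```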
Because these explicit missing values may not be important in other representations of the data, you can set `na.rm = TRUE` in `gather()` to turn explicit missing values implicit:
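```{r}
stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
```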
`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
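A sketch using the `stocks` data from above:

```{r}
stocks %>% 
  complete(year, qtr)  # adds an explicit NA row for 2016 Q1
```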
There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward.
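Here's a small made-up example (the names are purely illustrative):

```{r}
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
treatment
```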
You can fill in these missing values with `fill()`. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
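A sketch using the `treatment` data above:

```{r}
treatment %>% 
  fill(person)  # replaces each NA with the most recent non-missing name
```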
To finish off the chapter, let's pull together everything you've learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains reported tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available for download at <http://www.who.int/tb/country/data/download/en/>.
There's a wealth of epidemiological information in this dataset, but it's challenging to work with the data in the form that it's provided:
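```{r}
who
```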
This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll typically need to string together multiple verbs.
The best place to start is almost always to gather together the columns that are not variables. Let's have a look at what we've got:
* It looks like `country`, `iso2`, and `iso3` are redundant ways of specifying
the same variable, the `country`.
* `year` is clearly also a variable.
* We don't know what all the other columns are yet, but given the structure
  in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`)
  these are likely to be values, not variables.
So we need to gather together all the columns from `new_sp_m014` to `newrel_f65`. We don't yet know what these things mean, so for now we'll use the generic name `key`. We know the cells represent the count of cases, so we'll use the variable `cases`. There are a lot of missing values in the current representation, so for now we'll use `na.rm = TRUE` just so we can focus on the values that are present.
```{r}
who1 <- who %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
who1
```
We can get some hint of the structure of the values in the new `key` column by counting them:
```{r}
who1 %>% count(key)
```
You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

1.  The first three letters of each column denote whether the column
    contains new or old cases of TB. In this dataset, each column contains
    new cases.

1.  The next two letters describe the type of TB: `rel` (relapse),
    `ep` (extrapulmonary TB), `sn` (smear negative pulmonary TB), and
    `sp` (smear positive pulmonary TB).

1.  The sixth letter gives the sex of the TB patients: `m` (male) or
    `f` (female).

1.  The remaining numbers give the age group: `014` = 0--14, `1524` = 15--24,
    `2534` = 25--34, `3544` = 35--44, `4554` = 45--54, `5564` = 55--64, and
    `65` = 65 or older.
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.
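A sketch of that fix, followed by a first pass of `separate()` to split each code at the underscores (the intermediate names `who2` and `who3` are just for exposition):

```{r}
who2 <- who1 %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel"))

who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
```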
Then we might as well drop the `new` column because it's constant in this dataset. While we're dropping columns, let's also drop `iso2` and `iso3` since they're redundant.
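A sketch of those two steps, plus the one remaining step: splitting `sexage` into `sex` and `age` after the first character:

```{r}
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
```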
The `who` dataset is now tidy as each variable is a column. It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R. Typically you wouldn't assign each step to a new variable. Instead you'd join everything together in one big pipeline:
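```{r}
who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```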
Before we continue on to other topics, it's worth talking a little bit about non-tidy data. Early in the chapter, I used the pejorative term "messy" to refer to non-tidy data. That's an oversimplification: there are lots of useful and well-founded data structures that are not tidy data.
There are two main reasons to use other data structures:

*   Alternative representations may have substantial performance or space
    advantages.

*   Specialised fields have evolved their own conventions for storing data
    that may be quite different to the conventions of tidy data.

Either of these reasons means you'll need something other than a tibble (or data frame). If your data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to use other structures; tidy data is not the only way.
If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data/>