r4ds/tidy.Rmd

# Tidy data

## Introduction

> "Happy families are all alike; every unhappy family is unhappy in its
> own way." --– Leo Tolstoy

> "Tidy datasets are all alike, but every messy dataset is messy in its
> own way." --– Hadley Wickham

In this chapter, you will learn a consistent way to organise your data in R, a organisation called __tidy data__.  Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.

### Prerequisites

In this chapter we'll focus on tidyr, a package that provides a bundle of tools to help tidy messy datasets. We'll also need to use a little dplyr, as is common when tidying data.

```{r setup}
library(tidyr)
library(dplyr)
```

## Tidy data

You can represent the same underlying data in multiple ways. For example, the datasets below show the same data organized in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values in different way.

```{r}
table1
table2
table3

# Spread across two tibbles
table4a  # cases
table4b  # population
```

These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier work with inside the tidyverse. There are three interrelated rules which make a dataset tidy:

1.  Each variable has its own column.
1.  Each observation has its own row.
1.  Each value has its own cell.

```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-1.png")
```

These three rules are interrelated because it's impossible to only satisfy two of the three rules. That interrelationship leads to even simpler set of practical instructions:

1.  Put each dataset in a tibble.
1.  Put each variable in a column.

The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored to optimise 

Once you have your data in tidy form, it's easy to manipulate it with dplyr or visualise it with ggplot2:

```{r}
# Compute rate 
table1 %>% 
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>% 
  count(year, wt = cases)

# Visualise changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))
```

There are two advantages to tidy data:

1.  There's a general advantage to just picking one consistent way of storing
    data. If you have a consistent data structure, you can design tools that
    work with that data without having to translate it into different 
    structures.
    
1.  There's a specific advantage to placing variables in columns because
    it allows R's vectorised nature to shine. As you learned in [useful
    creation functions] and [useful summary functions], most built-in R
    functions work with a vector of values. That makes transforming tidy 
    data feel particularly natural.

As you'll learn later, tidy data is also very well suited for modelling, and in fact, the way that R's modelling functions work was an inspiration for the tidy data format.

### Exercises

1.  Using prose, describe how the variables and observations are organised in
    each of the sample tables.

1.  Compute the `rate` for `table2`, and `table4a` + `table4b`. 
    You will need to perform four operations:

    1.  Extract the number of TB cases per country per year.
    2.  Extract the matching population per country per year.
    3.  Divide cases by population, and multiply by 10000.
    5.  Store back in the appropriate place.
    
    Which is easiest? Which is hardest?

1.  Recreate the plot showing change in cases over time using `table2`
    instead of `table1`. What do you need to do first?

## Spreading and gathering

Now that you understand the basic principles of tidy data, it's time to learn the tools that allow you to transform untidy datasets into tidy datasets. 

The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.

One of the most messy-data common problems is that you'll find some variables are not in the columns. One variable might be spread across multiple columns, or you might find that a set of variables is spread over the rows. To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`. But before we can describe how they work, you need to understand the idea of the key-value pair.

### Key-value

A key-value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So, for example, this would be a key-value pair:

    Password: 0123456789

`0123456789` is the value, and it is associated with the key `Password`.

Data values form natural key-value pairs. The value is the value of the pair and the variable that the value describes is the key. So for example, you could decompose `table1` into a group of key-value pairs, like this:

    Country: Afghanistan
    Country: Brazil
    Country: China
    Year: 1999
    Year: 2000
    Year: 2001
    Population:   19987071
    Population:   20595360
    Population:  172006362
    Population:  174504898
    Population: 1272915272
    Population: 1280428583
    Cases:    745
    Cases:   2666
    Cases:  37737
    Cases:  80488
    Cases: 212258
    Cases: 213766

However, the key-value pairs would cease to be a useful dataset because you no longer know which values belong to the same observation.

Every cell in a table of data contains one half of a key-value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn't need to be the case for untidy data. Consider `table2`.

```{r}
table2
```

In `table2`, the `key` column contains only keys (and not just because the column is labeled `key`). Conveniently, the `value` column contains the values associated with those keys.

You can use the `spread()` function to tidy this layout.

### Spreading

`spread()` turns a pair of key:value columns into a set of tidy columns. To use `spread()`, pass a data frame, and the pair of key-value columns. This is particularly easy `table2` because the columns are already called key and value!

```{r}
spread(table2, key = key, value = value)
```

You can see that `spread()` maintains each of the relationships expressed in the original dataset. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations.

```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-8.png")
```

In general, you'll use `spread()` when you have a column that contains variable names, the `key` column, and a column that contains the values of that variable, the `value` column. Here's another simple example:

```{r}
weather <- frame_data(
  ~ day, ~measurement, ~record,
  "Jan 1", "temp", 31,
  "Jan 1", "precip", 0,
  "Jan 2", "temp", 35,
  "Jan 2", "precip", 5
)
weather %>% 
  spread(key = measurement, value = record)
```

The result of `spread()` without the `key` and `value` columns that you specified. Instead, it will have one new variable for each unique value in the `key` column.

### Gathering

`gather()` is the opposite of `spread()`. `gather()` collects a set of column names and places them into a single "key" column. It also collects the values associated with those columns and places them into a single value column. Let's use `gather()` to tidy `table4`.

```{r}
table4a
```

`gather()` takes a data frame, the names of the new key and value variables to create, and set a columns to gather:

```{r}
table4a %>% gather(key = "year", value = "cases", `1999`:`2000`)
```

Here, the column names (`key`) represent the years, and the cell values (`value`) represents the number of cases. We specify the columns to gather with `dplyr::select()` style notation: use all columns from "1999" to "2000". (Note that these are non-syntactic names so we have to surround in backticks.) To refresh your memory of the other ways you can select columns, see [select](#select).

`gather()` returns a copy of the data frame with the specified columns removed, and two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original dataset.

```{r, echo = FALSE, out.width = "100%"}
knitr::include_graphics("images/tidy-9.png")
```

Just like `spread()`, gather maintains each of the relationships in the original dataset. This time `table4` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion. `gather()` also maintains each of the observations in the original dataset, organizing them in a tidy fashion.

We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:

```{r}
table4b %>% gather(key = "year", value = "population", `1999`:`2000`)
```

It's easy to combine the `table4a` and `table4b` into a single single data frame because the new versions are both tidy. We'll use `dplyr::left_join()`,  which you'll learn about in [relational data].

```{r}
tidy4a <- table4a %>% gather("year", "cases", `1999`:`2000`)
tidy4b <- table4b %>% gather("year", "population", `1999`:`2000`)
left_join(tidy4a, tidy4b)
```

### Exercises

1.  Why are `gather()` and `spread()` not perfectly symmetrical?  
    Carefully consider the following example:
    
    ```{r, eval = FALSE}
    stocks <- data_frame(
      year   = c(2015, 2015, 2016, 2016),
      half  = c(   1,    2,     1,    2),
      return = c(1.88, 0.59, 0.92, 0.17)
    )
    stocks %>% 
      spread(year, return) %>% 
      gather("year", "return", `2015`:`2016`)
    ```

1.  Both `spread()` and `gather()` have a `convert` argument. What does it 
    do?

1.  Why does spreading this tibble fail?

    ```{r}
    people <- frame_data(
      ~name, ~key, ~value,
      "Phillip Woods", "age", 45,
      "Phillip Woods", "height", 186,
      "Phillip Woods", "age", 50,
      "Jessica Cordero", "age", 37,
      "Jessica Cordero", "height", 156
    )
    ```

1.  Tidy the simple tibble below. Do you need to spread or gather it?
    What are the variables?

    ```{r}
    preg <- frame_data(
      ~pregnant, ~male, ~female,
      "yes",     NA,    10,
      "no",      20,    12
    )
    ```

## Separating and uniting

You may have noticed that we skipped `table3` in the last section. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function.  In this section, we'll discuss the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.

### Separate

`separate()` pulls apart one column into multiple variables, by separating wherever a separator character appears.

![](images/tidy-17.png)

We need to use `separate()` to tidy `table3`, which combines values of *cases* and *population* in the same column. `separate()` take a data frame, the name of the column to separate, and the names of the columns to seperate into:

```{r}
table3

table3 %>% 
  separate(rate, into = c("cases", "population"))
```

By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:

```{r eval=FALSE}
table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
```

Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:

```{r}
table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
```

You can also pass an integer or vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`. You can use this arrangement to separate the last two digits of each year.

```{r}
table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
```

### Unite

`unite()` does the opposite of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently that `separate()`, but it's still a useful tool to have in your back pocket.

![](images/tidy-18.png)

We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:

```{r}
table5
table5 %>% 
  unite(new, century, year)
```

In this case we also need to use the `sep` arguent. The default is will place an underscore (`_`) between values from separate columns. Here we don't want any separate so we use `""`:

```{r}
table5 %>% 
  unite(new, century, year, sep = "")
```

`unite()` returns a copy of the data frame that includes the new column, but not the columns used to build the new column.

### Exercises

1.  What do the `extra` and `fill` arguments do in `separate()`? 
    Experiment with the various options for the following two toy datasets.
    
    ```{r, eval = FALSE}
    tibble::tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
      separate(x, c("one", "two", "three"))
    
    tibble::tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
     separate(x, c("one", "two", "three"))
    ```

1.  Both `unite()` and `separate()` have a `remove` argument. What does it
    do? When would you set it to `FALSE`?

1.  Compare and contrast `separate()` and `extract()`.  Why are there
    three variations of separation, but only one unite?

## Missing values

Changing the representation of a dataset brings up an important fact about missing values. There are two types of missing values:

* __Explicit__ missing values are flagged with `NA`.
* __Implicit__ missing values are simply not present in the data.

Let's illustrate this idea with a very simple data set:

```{r}
stocks <- data_frame(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
```

There are two missing values in this dataset:

* The return for the fourth quarter of 2015 is explicitly missing, because
  the cell where its value should be instead contains `NA`.
  
* The return for the first quarter of 2016 is implicitly missing, because it
  simply does not appear in the dataset.
  
One way to think about the difference is this Zen-like koan: An implicit missing value is the presence of an absence; an explicit missing value is the absence of a presence.

The way that a dataset is represented can make implicit values explicit. For exmaple, we can make the implicit missing value explicit putting years in the columns:

```{r}
stocks %>% 
  spread(year, return)
```

Because these explicit missing values may not be important in other representations of the data, you can set `na.rm = TRUE` in `gather()` to turn explicit missing values implicit:

```{r}
stocks %>% 
  spread(year, return) %>% 
  gather(year, return, `2015`:`2016`, na.rm = TRUE)
```

An important tool to making missing values explicit in tidy data is `complete()`:

```{r}
stocks %>% 
  complete(year, qtr)
```

`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.

There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate the the previous value should be carried forward:

```{r}
treatment <- frame_data(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
```

You can fill in these missing values with `fill()`. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimese called last observation carried forward).

```{r}
treatment %>% 
  fill(person)
```

### Exercises

1.  Compare and contrast the `fill` arguments to `spread()` and `complete()`. 

1.  What does the direction argument to `fill()` do?

## Case Study

The `who` dataset in tidyr contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes in the *2014 World Health Organization Global Tuberculosis Report*, available for download at <www.who.int/tb/country/data/download/en/>. The data provides a wealth of epidemiological information, but it's challenging to work with the data in the form that it's provided:

```{r}
who
```

This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy. The most unique feature of `who` is its coding system. Columns five through sixty encode four separate pieces of information in their column names:

1.  The first three letters of each column denote whether the column 
    contains new or old cases of TB. In this dataset, each column contains 
    new cases.

1.  The next two letters describe the type of case being counted. We will 
    treat each of these as a separate variable.
    
    *   `rel` stands for cases of relapse
    *   `ep` stands for cases of extrapulmonary TB
    *   `sn` stands for cases of pulmonary TB that could not be diagnosed by 
        a pulmonary smear (smear negative)
    *   `sp` stands for cases of pulmonary TB that could be diagnosed be 
        a pulmonary smear (smear positive)

3.  The sixth letter describes the sex of TB patients. The dataset groups 
    cases by males (`m`) and females (`f`).

4.  The remaining numbers describe the age group of TB patients. The dataset
    groups cases into seven age groups:
    
    * `014` = 0 -- 14 years old
    * `1524` = 15 -- 24 years old
    * `2534` = 25 -- 34 years old
    * `3544` = 35 -- 44 years old
    * `4554` = 45 -- 54 years old
    * `5564` = 55 -- 64 years old
    * `65` = 65 or older

The `who` dataset is untidy in multiple ways, so we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll typically need to string together multiple verbs. 

Let's start by gathering the columns that are not variables. This is almost always the best place to start when tidying a new dataset. Here we'll use `na.rm` just so we can focus on the values that are present, not the many missing values.

```{r}
who1 <- who %>% gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE)
who1
```

We need to make a minor fix to the format of the column names: unfortunately the names are inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.

```{r}
who2 <- who1 %>% mutate(code = stringr::str_replace(code, "newrel", "new_rel"))
who2
```

We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.

```{r}
who3 <- who2 %>% separate(code, c("new", "type", "sexage"), sep = "_")
who3
```

Then we might as well drop the `new` colum because it's consistent in this dataset:

```{r}
who3 %>% count(new)
who4 <- who3 %>% select(-new)
```

The second pass will split `sexage` after the first character to create two columns, a sex column and an age column.

```{r}
who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
who5
```

The `rel`, `ep`, `sn`, and `sp` keys are all contained in the same column. We can now move the keys into their own column names with `spread()`.

```{r}
who6 <- who5 %>% spread(type, value)
who6
```

The `who` dataset is now tidy. It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R. 

Typically you wouldn't assign each step to a new variable. Instead you'd join everything together in one big pipeline:

```{r}
who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new) %>% 
  separate(sexage, c("sex", "age"), sep = 1) %>% 
  spread(var, value)
```

### Exercises

1.  In this case study I set `na.rm = TRUE` just to make it easier to
    check that we had the correct values. Is this reasonable? Think about
    how missing values are represented in this dataset. What's the difference
    between an `NA` and zero? Do you think we should use `fill = 0` in
    the final `spread()` step?

1.  What happens if you neglect the `mutate()` step? How might you use the
    `fill` argument to `gather()`?

1.  Compute the total number of cases of tb across all four diagnoses methods.
    You can perform the computation either before or after the final
    `spread()`. What are the advantages and disadvantages of each location?

## Non-tidy data

Before you go on further, it's worth talking a little bit about non-tidy data. Early in the chapter, I used the perjorative term "messy" to refer to non-tidy data. But that is an oversimplification: there are lots of useful and well founded data structures that are not tidy data.

There are two mains reasons to use other data structures:

* Alternative, non-tidy, representations maybe have substantial performance
  or memory advantages.
  
* Specialised fields have evolved their own conventions for storing data
  that may be quite different to the conventions of  tidy data.

Generally, however, these reason will require the usage of something other than a tibble or a data frame. If you data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to other structures; tidy data is not the only way.

If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data/>
-												Need heading

											
										
										
											2015-07-29 03:21:36 +08:00
+								# Tidy data
-												Add missing intro subheads

											
										
										
											2016-07-24 22:16:08 +08:00
+								## Introduction
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								> "Happy families are all alike; every unhappy family is unhappy in its
 								> own way." --– Leo Tolstoy
-												Fix typos (#146)


											
										
										
											2016-07-23 00:24:48 +08:00
+								> "Tidy datasets are all alike, but every messy dataset is messy in its
-												Consistent quote form

											
										
										
											2016-07-12 06:32:36 +08:00
+								> own way." --– Hadley Wickham
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								In this chapter, you will learn a consistent way to organise your data in R, a organisation called __tidy data__.  Getting your data into this format requires some upfront work, but that work pays off in the long-term. Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								This chapter will give you a practical introduction to tidy data and the accompanying tools in the __tidyr__ package. If you'd like to learn more about the underlying theory, you might enjoy the *Tidy Data* paper published in the Journal of Statistical Software, <http://www.jstatsoft.org/v59/i10/paper>.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Consistent chapter intro layout

											
										
										
											2016-07-19 21:01:50 +08:00
+								### Prerequisites
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								In this chapter we'll focus on tidyr, a package that provides a bundle of tools to help tidy messy datasets. We'll also need to use a little dplyr, as is common when tidying data.
 								```{r setup}
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								library(tidyr)
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								library(dplyr)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Start moving towards Hadley style

											
										
										
											2015-07-29 03:15:28 +08:00
+								## Tidy data
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								You can represent the same underlying data in multiple ways. For example, the datasets below show the same data organized in four different ways. Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values in different way.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
 								table1
 								table2
 								table3
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								# Spread across two tibbles
 								table4a  # cases
 								table4b  # population
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								These are all representations of the same underlying data, but they are not equally easy to use. One dataset, the tidy dataset, will be much easier work with inside the tidyverse. There are three interrelated rules which make a dataset tidy:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Each variable has its own column.
 .  Each observation has its own row.
 .  Each value has its own cell.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tweak figures through out book

											
										
										
											2016-07-18 22:52:55 +08:00
+								```{r, echo = FALSE, out.width = "100%"}
-												Local bookdown working

											
										
										
											2015-12-12 03:28:10 +08:00
+								knitr::include_graphics("images/tidy-1.png")
 								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								These three rules are interrelated because it's impossible to only satisfy two of the three rules. That interrelationship leads to even simpler set of practical instructions:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Put each dataset in a tibble.
 .  Put each variable in a column.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								The principles of tidy data seem so obvious that you might wonder if you'll ever encounter a dataset that isn't tidy. Unfortunately, while the principles are obvious in hindsight, it took Hadley over 5 years of struggling with many datasets to figure out these very simple principles. Most datasets that you will encounter in real life will not be tidy, either because the creator was not aware of the principles of tidy data, or because the data is stored to optimise
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Once you have your data in tidy form, it's easy to manipulate it with dplyr or visualise it with ggplot2:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								# Compute rate
 								table1 %>%
 								  mutate(rate = cases / population * 10000)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								# Compute cases per year
 								table1 %>%
 								  count(year, wt = cases)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								# Visualise changes over time
 								library(ggplot2)
 								ggplot(table1, aes(year, cases)) +
 								  geom_line(aes(group = country), colour = "grey50") +
 								  geom_point(aes(colour = country))
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								There are two advantages to tidy data:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  There's a general advantage to just picking one consistent way of storing
 								    data. If you have a consistent data structure, you can design tools that
 								    work with that data without having to translate it into different
 								    structures.
 .  There's a specific advantage to placing variables in columns because
 								    it allows R's vectorised nature to shine. As you learned in [useful
 								    creation functions] and [useful summary functions], most built-in R
 								    functions work with a vector of values. That makes transforming tidy
 								    data feel particularly natural.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								As you'll learn later, tidy data is also very well suited for modelling, and in fact, the way that R's modelling functions work was an inspiration for the tidy data format.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Exercises
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Using prose, describe how the variables and observations are organised in
 								    each of the sample tables.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Compute the `rate` for `table2`, and `table4a` + `table4b`.
 								    You will need to perform four operations:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Extract the number of TB cases per country per year.
 .  Extract the matching population per country per year.
 .  Divide cases by population, and multiply by 10000.
 .  Store back in the appropriate place.
 								    Which is easiest? Which is hardest?
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  Recreate the plot showing change in cases over time using `table2`
 								    instead of `table1`. What do you need to do first?
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								## Spreading and gathering
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Now that you understand the basic principles of tidy data, it's time to learn the tools that allow you to transform untidy datasets into tidy datasets.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								The first step to tidying any dataset is to study it and figure out what the variables are. Sometimes this is easy; other times you'll need to consult with the people who originally generated the data.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								One of the most messy-data common problems is that you'll find some variables are not in the columns. One variable might be spread across multiple columns, or you might find that a set of variables is spread over the rows. To fix these problems, you'll need the two most important functions in tidyr: `gather()` and `spread()`. But before we can describe how they work, you need to understand the idea of the key-value pair.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Key-value
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Fix typos (#146)


											
										
										
											2016-07-23 00:24:48 +08:00
+								A key-value pair is a simple way to record information. A pair contains two parts: a *key* that explains what the information describes, and a *value* that contains the actual information. So, for example, this would be a key-value pair:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Update tidy.Rmd

typos
											
										
										
											2016-01-29 05:10:04 +08:00
+								    Password: 0123456789
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								`0123456789` is the value, and it is associated with the key `Password`.
-												Fix typos (#146)


											
										
										
											2016-07-23 00:24:48 +08:00
+								Data values form natural key-value pairs. The value is the value of the pair and the variable that the value describes is the key. So for example, you could decompose `table1` into a group of key-value pairs, like this:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								    Country: Afghanistan
 								    Country: Brazil
 								    Country: China
 								    Year: 1999
 								    Year: 2000
 								    Year: 2001
 								    Population:   19987071
 								    Population:   20595360
 								    Population:  172006362
 								    Population:  174504898
 								    Population: 1272915272
 								    Population: 1280428583
 								    Cases:    745
 								    Cases:   2666
 								    Cases:  37737
 								    Cases:  80488
 								    Cases: 212258
 								    Cases: 213766
-												Update tidy.Rmd

typos
											
										
										
											2016-01-29 05:10:04 +08:00
-												Fix typos (#146)


											
										
										
											2016-07-23 00:24:48 +08:00
+								However, the key-value pairs would cease to be a useful dataset because you no longer know which values belong to the same observation.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Fix typos (#146)


											
										
										
											2016-07-23 00:24:48 +08:00
+								Every cell in a table of data contains one half of a key-value pair, as does every column name. In tidy data, each cell will contain a value and each column name will contain a key, but this doesn't need to be the case for untidy data. Consider `table2`.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
 								table2
 								```
-												Update tidy.Rmd

typos
											
										
										
											2016-01-29 05:10:04 +08:00
+								In `table2`, the `key` column contains only keys (and not just because the column is labeled `key`). Conveniently, the `value` column contains the values associated with those keys.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								You can use the `spread()` function to tidy this layout.
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Spreading
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								`spread()` turns a pair of key:value columns into a set of tidy columns. To use `spread()`, pass a data frame, and the pair of key-value columns. This is particularly easy `table2` because the columns are already called key and value!
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								spread(table2, key = key, value = value)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								You can see that `spread()` maintains each of the relationships expressed in the original dataset. The output contains the four original variables, *country*, *year*, *population*, and *cases*, and the values of these variables are grouped according to the original observations.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tweak figures through out book

											
										
										
											2016-07-18 22:52:55 +08:00
+								```{r, echo = FALSE, out.width = "100%"}
-												Local bookdown working

											
										
										
											2015-12-12 03:28:10 +08:00
+								knitr::include_graphics("images/tidy-8.png")
 								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								In general, you'll use `spread()` when you have a column that contains variable names, the `key` column, and a column that contains the values of that variable, the `value` column. Here's another simple example:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								weather <- frame_data(
 								  ~ day, ~measurement, ~record,
 								  "Jan 1", "temp", 31,
 								  "Jan 1", "precip", 0,
 								  "Jan 2", "temp", 35,
 								  "Jan 2", "precip", 5
 								)
 								weather %>%
 								  spread(key = measurement, value = record)
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								The result of `spread()` without the `key` and `value` columns that you specified. Instead, it will have one new variable for each unique value in the `key` column.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Gathering
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								`gather()` is the opposite of `spread()`. `gather()` collects a set of column names and places them into a single "key" column. It also collects the values associated with those columns and places them into a single value column. Let's use `gather()` to tidy `table4`.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								table4a
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								`gather()` takes a data frame, the names of the new key and value variables to create, and set a columns to gather:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								table4a %>% gather(key = "year", value = "cases", `1999`:`2000`)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Here, the column names (`key`) represent the years, and the cell values (`value`) represents the number of cases. We specify the columns to gather with `dplyr::select()` style notation: use all columns from "1999" to "2000". (Note that these are non-syntactic names so we have to surround in backticks.) To refresh your memory of the other ways you can select columns, see [select](#select).
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								`gather()` returns a copy of the data frame with the specified columns removed, and two new columns: a "key" column that contains the former column names of the removed columns, and a value column that contains the former values of the removed columns. `gather()` repeats each of the former column names (as well as each of the original columns) to maintain each combination of values that appeared in the original dataset.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tweak figures through out book

											
										
										
											2016-07-18 22:52:55 +08:00
+								```{r, echo = FALSE, out.width = "100%"}
-												Local bookdown working

											
										
										
											2015-12-12 03:28:10 +08:00
+								knitr::include_graphics("images/tidy-9.png")
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Just like `spread()`, gather maintains each of the relationships in the original dataset. This time `table4` only contained three variables, *country*, *year* and *cases*. Each of these appears in the output of `gather()` in a tidy fashion. `gather()` also maintains each of the observations in the original dataset, organizing them in a tidy fashion.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								We can use `gather()` to tidy `table4b` in a similar fashion. The only difference is the variable stored in the cell values:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								table4b %>% gather(key = "year", value = "population", `1999`:`2000`)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								It's easy to combine the `table4a` and `table4b` into a single single data frame because the new versions are both tidy. We'll use `dplyr::left_join()`,  which you'll learn about in [relational data].
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								tidy4a <- table4a %>% gather("year", "cases", `1999`:`2000`)
 								tidy4b <- table4b %>% gather("year", "population", `1999`:`2000`)
 								left_join(tidy4a, tidy4b)
 								```
 								### Exercises
 .  Why are `gather()` and `spread()` not perfectly symmetrical?
 								    Carefully consider the following example:
 								    ```{r, eval = FALSE}
 								    stocks <- data_frame(
 								      year   = c(2015, 2015, 2016, 2016),
 								      half  = c(   1,    2,     1,    2),
 								      return = c(1.88, 0.59, 0.92, 0.17)
 								    )
 								    stocks %>%
 								      spread(year, return) %>%
 								      gather("year", "return", `2015`:`2016`)
 								    ```
 .  Both `spread()` and `gather()` have a `convert` argument. What does it
 								    do?
 .  Why does spreading this tibble fail?
 								    ```{r}
 								    people <- frame_data(
 								      ~name, ~key, ~value,
 								      "Phillip Woods", "age", 45,
 								      "Phillip Woods", "height", 186,
 								      "Phillip Woods", "age", 50,
 								      "Jessica Cordero", "age", 37,
 								      "Jessica Cordero", "height", 156
 								    )
 								    ```
 .  Tidy the simple tibble below. Do you need to spread or gather it?
 								    What are the variables?
 								    ```{r}
 								    preg <- frame_data(
 								      ~pregnant, ~male, ~female,
 								      "yes",     NA,    10,
 								      "no",      20,    12
 								    )
 								    ```
 								## Separating and uniting
 								You may have noticed that we skipped `table3` in the last section. `table3` has a different problem: we have one column (`rate`) that contains two variables (`cases` and `population`). To fix this problem, we'll need the `separate()` function.  In this section, we'll discuss the inverse of `separate()`: `unite()`, which you use if a single variable is spread across multiple columns.
 								### Separate
 								`separate()` pulls apart one column into multiple variables, by separating wherever a separator character appears.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								![](images/tidy-17.png)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								We need to use `separate()` to tidy `table3`, which combines values of *cases* and *population* in the same column. `separate()` take a data frame, the name of the column to separate, and the names of the columns to seperate into:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								table3
 								table3 %>%
 								  separate(rate, into = c("cases", "population"))
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								By default, `separate()` will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, `separate()` split the values of `rate` at the forward slash characters. If you wish to use a specific character to separate a column, you can pass the character to the `sep` argument of `separate()`. For example, we could rewrite the code above as:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r eval=FALSE}
 								table3 %>%
 								  separate(rate, into = c("cases", "population"), sep = "/")
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Look carefully at the column types: you'll notice that `case` and `population` are character columns. This is the default behaviour in `separate()`: it leaves the type of the column as is. Here, however, it's not very useful those really are numbers. We can ask `separate()` to try and convert to better types using `convert = TRUE`:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								table3 %>%
 								  separate(rate, into = c("cases", "population"), convert = TRUE)
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								You can also pass an integer or vector of integers to `sep`. `separate()` will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative value start at -1 on the far-right of the strings. When using integers to separate strings, the length of `sep` should be one less than the number of names in `into`. You can use this arrangement to separate the last two digits of each year.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								table3 %>%
 								  separate(year, into = c("century", "year"), sep = 2)
 								```
 								### Unite
 								`unite()` does the opposite of `separate()`: it combines multiple columns into a single column. You'll need it much less frequently that `separate()`, but it's still a useful tool to have in your back pocket.
 								![](images/tidy-18.png)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								We can use `unite()` to rejoin the *century* and *year* columns that we created in the last example. That data is saved as `tidyr::table5`. `unite()` takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in `dplyr::select()` style:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								table5
 								table5 %>%
 								  unite(new, century, year)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								In this case we also need to use the `sep` arguent. The default is will place an underscore (`_`) between values from separate columns. Here we don't want any separate so we use `""`:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								table5 %>%
 								  unite(new, century, year, sep = "")
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								`unite()` returns a copy of the data frame that includes the new column, but not the columns used to build the new column.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Exercises
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  What do the `extra` and `fill` arguments do in `separate()`?
 								    Experiment with the various options for the following two toy datasets.
 								    ```{r, eval = FALSE}
 								    tibble::tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
 								      separate(x, c("one", "two", "three"))
 								    tibble::tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
 								     separate(x, c("one", "two", "three"))
 								    ```
 .  Both `unite()` and `separate()` have a `remove` argument. What does it
 								    do? When would you set it to `FALSE`?
 .  Compare and contrast `separate()` and `extract()`.  Why are there
 								    three variations of separation, but only one unite?
 								## Missing values
 								Changing the representation of a dataset brings up an important fact about missing values. There are two types of missing values:
 								* __Explicit__ missing values are flagged with `NA`.
 								* __Implicit__ missing values are simply not present in the data.
 								Let's illustrate this idea with a very simple data set:
 								```{r}
 								stocks <- data_frame(
 								  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
 								  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
 								  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
 								)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								There are two missing values in this dataset:
 								* The return for the fourth quarter of 2015 is explicitly missing, because
 								  the cell where its value should be instead contains `NA`.
 								* The return for the first quarter of 2016 is implicitly missing, because it
 								  simply does not appear in the dataset.
 								One way to think about the difference is this Zen-like koan: An implicit missing value is the presence of an absence; an explicit missing value is the absence of a presence.
 								The way that a dataset is represented can make implicit values explicit. For exmaple, we can make the implicit missing value explicit putting years in the columns:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								stocks %>%
 								  spread(year, return)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Because these explicit missing values may not be important in other representations of the data, you can set `na.rm = TRUE` in `gather()` to turn explicit missing values implicit:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								stocks %>%
 								  spread(year, return) %>%
 								  gather(year, return, `2015`:`2016`, na.rm = TRUE)
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								An important tool to making missing values explicit in tidy data is `complete()`:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								stocks %>%
 								  complete(year, qtr)
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								`complete()` takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit `NA`s where necessary.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate the the previous value should be carried forward:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								treatment <- frame_data(
 								  ~ person,           ~ treatment, ~response,
 								  "Derrick Whitmore", 1,           7,
 								  NA,                 2,           10,
 								  NA,                 3,           9,
 								  "Katherine Burke",  1,           4
 								)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								You can fill in these missing values with `fill()`. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimese called last observation carried forward).
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								treatment %>%
 								  fill(person)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Exercises
 .  Compare and contrast the `fill` arguments to `spread()` and `complete()`.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  What does the direction argument to `fill()` do?
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Start moving towards Hadley style

											
										
										
											2015-07-29 03:15:28 +08:00
+								## Case Study
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								The `who` dataset in tidyr contains cases of tuberculosis (TB) reported between 1995 and 2013 sorted by country, age, and gender. The data comes in the *2014 World Health Organization Global Tuberculosis Report*, available for download at <www.who.int/tb/country/data/download/en/>. The data provides a wealth of epidemiological information, but it's challenging to work with the data in the form that it's provided:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Start moving towards Hadley style

											
										
										
											2015-07-29 03:15:28 +08:00
+								```{r}
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								who
 								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								This is a very typical example of data you are likely to encounter in real life. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy. The most unique feature of `who` is its coding system. Columns five through sixty encode four separate pieces of information in their column names:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  The first three letters of each column denote whether the column
 								    contains new or old cases of TB. In this dataset, each column contains
 								    new cases.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  The next two letters describe the type of case being counted. We will
 								    treat each of these as a separate variable.
 								    *   `rel` stands for cases of relapse
 								    *   `ep` stands for cases of extrapulmonary TB
 								    *   `sn` stands for cases of pulmonary TB that could not be diagnosed by
 								        a pulmonary smear (smear negative)
 								    *   `sp` stands for cases of pulmonary TB that could be diagnosed be
 								        a pulmonary smear (smear positive)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  The sixth letter describes the sex of TB patients. The dataset groups
 								    cases by males (`m`) and females (`f`).
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+.  The remaining numbers describe the age group of TB patients. The dataset
 								    groups cases into seven age groups:
 								    * `014` = 0 -- 14 years old
 								    * `1524` = 15 -- 24 years old
 								    * `2534` = 25 -- 34 years old
 								    * `3544` = 35 -- 44 years old
 								    * `4554` = 45 -- 54 years old
 								    * `5564` = 55 -- 64 years old
 								    * `65` = 65 or older
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								The `who` dataset is untidy in multiple ways, so we'll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you'll typically need to string together multiple verbs.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Let's start by gathering the columns that are not variables. This is almost always the best place to start when tidying a new dataset. Here we'll use `na.rm` just so we can focus on the values that are present, not the many missing values.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								who1 <- who %>% gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE)
 								who1
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								We need to make a minor fix to the format of the column names: unfortunately the names are inconsistent because instead of `new_rel_` we have `newrel` (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about `str_replace()` in [strings], but the basic idea is pretty simple: replace the string "newrel" with "new_rel". This makes all variable names consistent.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								```{r}
 								who2 <- who1 %>% mutate(code = stringr::str_replace(code, "newrel", "new_rel"))
 								who2
 								```
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								who3 <- who2 %>% separate(code, c("new", "type", "sexage"), sep = "_")
 								who3
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								Then we might as well drop the `new` colum because it's consistent in this dataset:
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								who3 %>% count(new)
 								who4 <- who3 %>% select(-new)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
 								The second pass will split `sexage` after the first character to create two columns, a sex column and an age column.
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
 								who5
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
 								The `rel`, `ep`, `sn`, and `sp` keys are all contained in the same column. We can now move the keys into their own column names with `spread()`.
 								```{r}
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								who6 <- who5 %>% spread(type, value)
 								who6
 								```
 								The `who` dataset is now tidy. It is far from clean (for example, it contains several redundant columns and many missing values), but it will now be much easier to work with in R.
 								Typically you wouldn't assign each step to a new variable. Instead you'd join everything together in one big pipeline:
 								```{r}
 								who %>%
 								  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
 								  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
 								  separate(code, c("new", "var", "sexage")) %>%
 								  select(-new) %>%
 								  separate(sexage, c("sex", "age"), sep = 1) %>%
 								  spread(var, value)
-												Adding first draft of tidy data chapter

											
										
										
											2015-07-29 00:16:58 +08:00
+								```
-												Tidy chapter updates

											
										
										
											2016-07-26 00:28:05 +08:00
+								### Exercises
 .  In this case study I set `na.rm = TRUE` just to make it easier to
 								    check that we had the correct values. Is this reasonable? Think about
 								    how missing values are represented in this dataset. What's the difference
 								    between an `NA` and zero? Do you think we should use `fill = 0` in
 								    the final `spread()` step?
 .  What happens if you neglect the `mutate()` step? How might you use the
 								    `fill` argument to `gather()`?
 .  Compute the total number of cases of tb across all four diagnoses methods.
 								    You can perform the computation either before or after the final
 								    `spread()`. What are the advantages and disadvantages of each location?
 								## Non-tidy data
 								Before you go on further, it's worth talking a little bit about non-tidy data. Early in the chapter, I used the perjorative term "messy" to refer to non-tidy data. But that is an oversimplification: there are lots of useful and well founded data structures that are not tidy data.
 								There are two mains reasons to use other data structures:
 								* Alternative, non-tidy, representations maybe have substantial performance
 								  or memory advantages.
 								* Specialised fields have evolved their own conventions for storing data
 								  that may be quite different to the conventions of  tidy data.
 								Generally, however, these reason will require the usage of something other than a tibble or a data frame. If you data does fit naturally into a rectangular structure composed of observations and variables, I think tidy data should be your default choice. But there are good reasons to other structures; tidy data is not the only way.
 								If you'd like to learn more about non-tidy data, I'd highly recommend this thoughtful blog post by Jeff Leek: <http://simplystatistics.org/2016/02/17/non-tidy-data/>