In this chapter, you will learn a consistent way to organize your data in R using a system called **tidy data**.
Getting your data into this format requires some work up front, but that work pays off in the long term.
Once you have tidy data and the tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.
In this chapter, you'll first learn the definition of tidy data and see it applied to a simple toy dataset.
Then we'll dive into the main tool you'll use for tidying data: pivoting.
Pivoting allows you to change the form of your data, without changing any of the values.
We'll finish up with a discussion of usefully untidy data, and how you can create it if needed.
If you particularly enjoy this chapter and want to learn more about the underlying theory, you can read about the history and theoretical underpinnings in the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.
Each dataset shows the same values of four variables: *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
You haven't yet learned all the functions you'd need to actually perform these operations, but you should still be able to think through the transformations you'd need.
2. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a *lot* of time working with data.
tidyr provides two functions for pivoting data: `pivot_longer()`, which makes datasets **longer** by increasing rows and reducing columns, and `pivot_wider()` which makes datasets **wider** by increasing columns and reducing rows.
- `cols` specifies which columns need to be pivoted (the columns that aren't variables) using the same syntax as `select()`. In this case, we could say `!c(artist, track, date.entered)` or `starts_with("wk")`; a sketch combining all three arguments follows this list.
- `names_to` names the variable stored in the column names.
- `values_to` names the variable stored in the cell values.
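Putting these together, here's a minimal sketch using the `billboard` dataset that ships with tidyr (each `wk` column records a song's chart rank in one week):

```{r}
library(tidyverse)

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )
```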
This dataset records information about tuberculosis diagnoses collected by the WHO.
There are two columns that are easy to interpret: `country` and `year`.
They are followed by 56 columns like `sp_m_014`, `ep_m_4554`, and `rel_m_3544`.
If you stare at these columns for long enough, you'll notice there's a pattern.
Each column name is made up of three pieces separated by `_`.
The first piece, `sp`/`rel`/`ep`, describes the method used for the `diagnosis`, the second piece, `m`/`f`, is the `gender`, and the third piece, `014`/`1524`/`2534`/`3544`/`4554`/`5564`/`65`, is the `age` range.
So in this case we have six variables: two variables are already columns, three variables are contained in the column names, and one variable is in the cell values.
This requires two changes to our call to `pivot_longer()`: `names_to` gets a vector of column names and `names_sep` describes how to split the variable name up into pieces:
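Assuming the data is tidyr's `who2` dataset, the call might look like this:

```{r}
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```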
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in Chapter \@ref(regular-expressions).
The new challenge in this dataset is that the column names contain the names of two variables (`dob`, `name`) and the values of another (`child`, with values 1 and 2).
Again we need to supply a vector to `names_to`, but this time we use the special `".value"` sentinel.
This overrides the usual `values_to` argument and keeps the first component of the column name as a variable name.
We again use `values_drop_na = TRUE`, since the shape of the input forces the creation of explicit missing values (e.g. for families with only one child), and `parse_number()` to convert (e.g.) `child1` into 1.
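A sketch, assuming the data is tidyr's `household` dataset (columns `family`, `dob_child1`, `dob_child2`, `name_child1`, `name_child2`; `parse_number()` comes from readr, loaded with the tidyverse):

```{r}
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  ) |>
  mutate(child = parse_number(child))
```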
We'll start with `cms_patient_experience`, a dataset from the Centers for Medicare & Medicaid Services that provides information about patient experiences:
Neither of these variables will make particularly great variable names: `measure_cd` doesn't hint at the meaning of the variable and `measure_title` is a long sentence containing spaces.
`pivot_wider()` has the opposite interface to `pivot_longer()`: we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from`):
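A sketch, assuming (as in tidyr's `cms_patient_experience`) that the values live in a `prf_rate` column:

```{r}
cms_patient_experience |>
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )
```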
The output doesn't look quite right as we still seem to have multiple rows for each organization.
That's because, by default, `pivot_wider()` will attempt to preserve all the existing columns, including `measure_title`, which has six distinct values for each organization.
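One fix is to supply `id_cols` yourself; here (continuing the sketch) the two organization columns share the prefix "org":

```{r}
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
```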
The connection between the position of the row in the input and the cell in the output is much weaker than in `pivot_longer()` because the rows and columns are primarily determined by the values of variables, not their location.
To begin the process, `pivot_wider()` needs to first figure out what will go in the rows and columns.
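To make this concrete, we'll use a tiny example dataset; the specific values are illustrative assumptions, chosen so that id "B" has no "z" entry (which matters below):

```{r}
df <- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "B", "y", 2,
  "B", "x", 3,
  "A", "y", 4,
  "A", "z", 5,
)
```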
Finding the column names is easy: it's just the values of `name`.
```{r}
df |>
  distinct(name)
```
By default, the rows in the output are formed by all variables that aren't going into the names or the values.
These are called the `id_cols` and we'll come back to this argument shortly.
```{r}
df |>
  select(-name, -value) |>
  distinct()
```
`pivot_wider()` then uses this data to generate an empty data frame:
```{r}
df |>
  select(-name, -value) |>
  distinct() |>
  mutate(x = NA, y = NA, z = NA)
```
It then fills in all the missing values using the data in the input.
In this case, not every cell in the output has a corresponding value in the input as there's no entry for id "B" and name "z", so that cell remains missing.
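Putting it all together, the full call on our sketch data is:

```{r}
df |>
  pivot_wider(
    names_from = name,
    values_from = value
  )
```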
It's interesting that `pivot_wider()` can turn implicit missing values into explicit missing values and we'll come back to this idea in Chapter \@ref(missing-values).
You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output.
The example below has two rows that correspond to id "A" and name "x":
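A sketch of such an input, redefining `df` (the values are again assumed for illustration):

```{r}
df <- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "A", "x", 2,
  "A", "y", 3,
  "B", "x", 4,
  "B", "y", 5,
)

# pivot_wider() warns that the values are not uniquely identified
# and returns list-columns
df |>
  pivot_wider(
    names_from = name,
    values_from = value
  )
```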
The following sections will show a few examples of `pivot_wider()` making usefully untidy data for presenting data to other humans, for multivariate statistics, and just for pragmatically solving data manipulation challenges.
As you've seen, `dplyr::count()` produces tidy data --- it makes one row for each group, with one column for each grouping variable, and one column for the number of observations:
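A sketch using ggplot2's `diamonds` dataset (the dataset choice is an assumption; any counted data works). Suppose we also compute a proportion and then try to widen the counts for display:

```{r}
diamonds |>
  count(clarity, color) |>
  mutate(prop = n / sum(n)) |>
  pivot_wider(
    names_from = color,
    values_from = n
  )
```

The output still has one row for each clarity-color pair, rather than one row per clarity.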
That's because, by default, `pivot_wider()` uses all the unmentioned columns to identify a row in the new dataset.
To get the display you're looking for, you can either `select()` off the variables you don't care about, or use the `id_cols` argument to explicitly define which columns identify each row in the result:
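Continuing the sketch above:

```{r}
diamonds |>
  count(clarity, color) |>
  mutate(prop = n / sum(n)) |>
  pivot_wider(
    id_cols = clarity,
    names_from = color,
    values_from = n
  )
```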
Most classical multivariate statistical methods (like dimension reduction and clustering) require a matrix representation of your data, where each column is a time point, a location, a gene, or a species.
Sometimes these formats have substantial performance or space advantages or sometimes they're just necessary to get closer to the underlying matrix mathematics.
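For example, here's a sketch that assumes the gapminder package: we pivot so that each column is a year, then move `country` into the row names with `tibble::column_to_rownames()`:

```{r}
library(gapminder)

col_year <- gapminder |>
  mutate(gdpPercap = log10(gdpPercap)) |>
  pivot_wider(
    id_cols = country,
    names_from = year,
    values_from = gdpPercap
  ) |>
  column_to_rownames("country")

head(col_year)
```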
This produces a data frame, because tibbles don't support row names[^data-tidy-1].
[^data-tidy-1]: tibbles don't use row names because they only work for a subset of important cases: when observations can be identified by a single character vector.
We're now ready to cluster with (e.g.) `kmeans()`:
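Continuing the sketch (`centers = 6` is an arbitrary choice):

```{r}
cluster <- kmeans(col_year, centers = 6)
summary(cluster)
```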
Extracting the data out of this object into a form you can work with is a challenge we'll need to come back to later in the book, once you've learned more about lists.
This partly comes back to our original definition of tidy data, where I said tidy data has one variable in each column, but I didn't actually define what a variable is (and it's surprisingly hard to do so).
It's totally fine to be pragmatic and to say a variable is whatever makes your analysis easiest.
So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for computations involving a variable number of values (like sums or means), it's usually easier if it's in rows.
Let's explore this idea by looking at `cms_patient_care`, which has a similar structure to `cms_patient_experience`:
```{r}
cms_patient_care
```
It contains information about 9 measures (`beliefs_addressed`, `composite_process`, `dyspena_treatment`, ...) on 14 different facilities (identified by `ccn` with name given by `facility_name`).
However, compared to `cms_patient_experience`, each measurement is recorded with both a normalized score, which is the percentage of patients affected, and a denominator, which is the number of patients.
For example, `beliefs_addressed`, which is the "documentation in the clinical record of a discussion of spiritual/religious concerns or documentation that the patient/caregiver did not want to discuss", records the percentage of patients in `observed` and the total number of patients that the metric applies to in `denominator`.
Depending on what you want to do next, you might find any of the following three structures useful:
- If you want to compute the number of patients that had a positive answer to the question, you might pivot `type` into the columns:
```{r}
cms_patient_care |>
  pivot_wider(
    names_from = type,
    values_from = score
  ) |>
  mutate(
    numerator = round(observed / 100 * denominator)
  )
```
- If you wanted to display the distribution of each metric, you might keep it as is:
```{r}
cms_patient_care |>
  filter(type == "observed") |>
  ggplot(aes(score)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ measure_abbr) +
  xlim(85, NA)
```
- If you wanted to explore how different metrics are related, you might put the measure names in the columns:
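A sketch:

```{r}
cms_patient_care |>
  filter(type == "observed") |>
  select(-type) |>
  pivot_wider(
    names_from = measure_abbr,
    values_from = score
  )
```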