Working on tidy data
parent
3267221ebb
commit
2c56ac830c
195
data-tidy.Rmd
|
@ -33,6 +33,8 @@ You can represent the same underlying data in multiple ways.
|
||||||
The example below shows the same data organised in four different ways.
|
The example below shows the same data organised in four different ways.
|
||||||
Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
|
Each dataset shows the same values of four variables *country*, *year*, *population*, and *cases*, but each dataset organizes the values in a different way.
|
||||||
|
|
||||||
|
<!-- TODO redraw as tables -->
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
table1
|
table1
|
||||||
table2
|
table2
|
||||||
|
@ -45,6 +47,7 @@ table4b # population
|
||||||
|
|
||||||
These are all representations of the same underlying data, but they are not equally easy to use.
|
These are all representations of the same underlying data, but they are not equally easy to use.
|
||||||
One of them, `table1`, will be much easier to work with inside the tidyverse because it's tidy.
|
One of them, `table1`, will be much easier to work with inside the tidyverse because it's tidy.
|
||||||
|
|
||||||
There are three interrelated rules that make a dataset tidy:
|
There are three interrelated rules that make a dataset tidy:
|
||||||
|
|
||||||
1. Each variable is a column; each column is a variable.
|
1. Each variable is a column; each column is a variable.
|
||||||
|
@ -115,7 +118,7 @@ ggplot(table1, aes(year, cases)) +
|
||||||
|
|
||||||
1. Using prose, describe how the variables and observations are organised in each of the sample tables.
|
1. Using prose, describe how the variables and observations are organised in each of the sample tables.
|
||||||
|
|
||||||
2. Compute the `rate` for `table2`, and `table4a` + `table4b`.
|
2. Sketch out the process you'd use to calculate the `rate` for `table2`, and `table4a` + `table4b`.
|
||||||
You will need to perform four operations:
|
You will need to perform four operations:
|
||||||
|
|
||||||
a. Extract the number of TB cases per country per year.
|
a. Extract the number of TB cases per country per year.
|
||||||
|
@ -123,9 +126,7 @@ ggplot(table1, aes(year, cases)) +
|
||||||
c. Divide cases by population, and multiply by 10000.
|
c. Divide cases by population, and multiply by 10000.
|
||||||
d. Store back in the appropriate place.
|
d. Store back in the appropriate place.
|
||||||
|
|
||||||
Which representation is easiest to work with?
|
You haven't yet learned all the functions you'd need to actually perform these operations, but you should still be able to think through the transformations you'd need.
|
||||||
Which is hardest?
|
|
||||||
Why?
|
|
||||||
|
|
||||||
3. Recreate the plot showing change in cases over time using `table2` instead of `table1`.
|
3. Recreate the plot showing change in cases over time using `table2` instead of `table1`.
|
||||||
What do you need to do first?
|
What do you need to do first?
|
||||||
|
@ -147,8 +148,6 @@ Sometimes this is easy; other times you'll need to consult with the people who o
|
||||||
Next, you'll **pivot** your data into a tidy form, with variables in the columns and observations in the rows.
|
Next, you'll **pivot** your data into a tidy form, with variables in the columns and observations in the rows.
|
||||||
|
|
||||||
tidyr provides two functions for pivoting data: `pivot_longer()`, which makes datasets **longer** by increasing rows and reducing columns, and `pivot_wider()` which makes datasets **wider** by increasing columns and reducing rows.
|
tidyr provides two functions for pivoting data: `pivot_longer()`, which makes datasets **longer** by increasing rows and reducing columns, and `pivot_wider()` which makes datasets **wider** by increasing columns and reducing rows.
|
||||||
`pivot_longer()` is very useful for tidying data; `pivot_wider()` is more useful for making non-tidy data (we'll come back to this in Section \@ref(rectangle-data)), but is occasionally also needed for tidying..
|
|
||||||
|
|
||||||
The following sections work through the use of `pivot_longer()` and `pivot_wider()` to tackle a wide range of realistic datasets.
|
The following sections work through the use of `pivot_longer()` and `pivot_wider()` to tackle a wide range of realistic datasets.
|
||||||
These examples are drawn from `vignette("pivot", package = "tidyr")` which includes more variations and more challenging problems.
|
These examples are drawn from `vignette("pivot", package = "tidyr")` which includes more variations and more challenging problems.
|
||||||
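As a minimal sketch of the two directions, assuming a small made-up table (the `wide` data and its values are hypothetical, not one of the datasets used in this chapter), `pivot_longer()` moves the year columns into rows and `pivot_wider()` moves them back:

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: one row per country, one column per year.
wide <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",           10,      20,
  "B",           30,      40
)

# Longer: the two year columns become two rows per country
# (more rows, fewer columns).
long <- wide |>
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )

# Wider: reverses the step above, moving the years back into the columns
# (more columns, fewer rows).
wide_again <- long |>
  pivot_wider(names_from = year, values_from = cases)
```

Here `long` has four rows and three columns, while `wide_again` has the same shape as the starting table.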
|
|
||||||
|
@ -327,6 +326,18 @@ who2 |>
|
||||||
|
|
||||||
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in Chapter \@ref(regular-expressions).
|
An alternative to `names_sep` is `names_pattern`, which you can use to extract variables from more complicated naming scenarios, once you've learned about regular expressions in Chapter \@ref(regular-expressions).
|
||||||
|
|
||||||
|
Conceptually, this is only a minor variation on the simple case as illustrated by Figure \@ref(fig:pivot-multiple-names).
|
||||||
|
Now, instead of the column names pivoting into a single column, they pivot into multiple columns.
|
||||||
|
|
||||||
|
```{r pivot-multiple-names}
|
||||||
|
#| echo: FALSE
|
||||||
|
#| out.width: NULL
|
||||||
|
#| fig.cap: >
|
||||||
|
#| Pivoting with many variables in the column names means that each
|
||||||
|
#| column name now fills in values in multiple output columns.
|
||||||
|
knitr::include_graphics("diagrams/tidy-data/multiple-names.png", dpi = 270)
|
||||||
|
```
|
||||||
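To make this concrete, here is a small sketch with made-up data (the `counts` table and its `sex_age` naming scheme are assumptions for illustration only). Each column name carries two pieces of information, so `names_to` lists two output columns and `names_sep` says where to split:

```r
library(tibble)
library(tidyr)

# Hypothetical data: each column name packs two variables as "sex_age".
counts <- tribble(
  ~id, ~m_young, ~m_old, ~f_young, ~f_old,
  "A",        1,      2,        3,      4,
  "B",        5,      6,        7,      8
)

# Every column name is split at "_" and fills in two output columns,
# `sex` and `age`; the cell values go into `count`.
long <- counts |>
  pivot_longer(
    cols = !id,
    names_to = c("sex", "age"),
    names_sep = "_",
    values_to = "count"
  )
```

The result has one row per combination of `id`, `sex`, and `age`, so the two input rows become eight output rows.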
|
|
||||||
### Data and variable names in the column headers
|
### Data and variable names in the column headers
|
||||||
|
|
||||||
The next step up in complexity is when the column names include a mix of variable values and variable names.
|
The next step up in complexity is when the column names include a mix of variable values and variable names.
|
||||||
|
@ -354,6 +365,20 @@ household |>
|
||||||
|
|
||||||
We again use `values_drop_na = TRUE`, since the shape of the input forces the creation of explicit missing variables (e.g. for families with only one child), and `parse_number()` to convert (e.g.) `child1` into 1.
|
We again use `values_drop_na = TRUE`, since the shape of the input forces the creation of explicit missing variables (e.g. for families with only one child), and `parse_number()` to convert (e.g.) `child1` into 1.
|
||||||
|
|
||||||
|
Figure \@ref(fig:pivot-names-and-values) illustrates what's happening in a simpler example.
|
||||||
|
The column names in the input now contribute to both values and variable names in the output.
|
||||||
|
|
||||||
|
```{r pivot-names-and-values}
|
||||||
|
#| echo: FALSE
|
||||||
|
#| out.width: NULL
|
||||||
|
#| fig.cap: >
|
||||||
|
#| Pivoting with `names_to = c(".value", "id")` splits the column names
|
||||||
|
#| into two components: the first part determines the output column
|
||||||
|
#| name (`x` or `y`), and the second part determines the value of the
|
||||||
|
#| `id` column.
|
||||||
|
knitr::include_graphics("diagrams/tidy-data/multiple-names.png", dpi = 270)
|
||||||
|
```
|
||||||
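The same idea can be tried on a tiny made-up table (the `measures` data and the `grp` column are hypothetical). The sketch below assumes column names of the form `x_1`, `y_1`, and so on, where the part before the underscore should become an output column name and the part after it a value of `id`:

```r
library(tibble)
library(tidyr)

# Hypothetical data: column names mix a variable name (x or y)
# with a value (1 or 2).
measures <- tribble(
  ~grp, ~x_1, ~y_1, ~x_2, ~y_2,
  "a",     1,   10,    2,   20,
  "b",     3,   30,    4,   40
)

# ".value" tells pivot_longer() that the first component of each name
# is the name of an output column; the second component goes into `id`.
long <- measures |>
  pivot_longer(
    cols = !grp,
    names_to = c(".value", "id"),
    names_sep = "_"
  )
```

Note that no `values_to` is needed here: the output value columns (`x` and `y`) get their names from the input column names.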
|
|
||||||
### Widening data
|
### Widening data
|
||||||
|
|
||||||
So far we've used `pivot_longer()` to solve the common class of problems where values have ended up in column names.
|
So far we've used `pivot_longer()` to solve the common class of problems where values have ended up in column names.
|
||||||
|
@ -375,10 +400,10 @@ cms_patient_experience |>
|
||||||
distinct(measure_cd, measure_title)
|
distinct(measure_cd, measure_title)
|
||||||
```
|
```
|
||||||
|
|
||||||
Neither of these variables make particularly great variable names: `measure_cd` doesn't hint at the meaning of the variable and `measure_title` is a long sentence containing spaces.
|
Neither of these variables will make particularly great variable names: `measure_cd` doesn't hint at the meaning of the variable and `measure_title` is a long sentence containing spaces.
|
||||||
We'll use `measure_cd` for now.
|
We'll use `measure_cd` for now.
|
||||||
|
|
||||||
`pivot_wider()` has the opposite interface to `pivot_longer()` we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from)`:
|
`pivot_wider()` has the opposite interface to `pivot_longer()`: we need to provide the existing columns that define the values (`values_from`) and the column name (`names_from`):
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
cms_patient_experience |>
|
cms_patient_experience |>
|
||||||
|
@ -388,8 +413,8 @@ cms_patient_experience |>
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
The output doesn't look quite right as we still seem to have multiple rows for each organistaion.
|
The output doesn't look quite right as we still seem to have multiple rows for each organization.
|
||||||
That's because, by default, `pivot_wider()` will attempt to preservere all the existing columns including `measure_title` which has six distinct observations.
|
That's because, by default, `pivot_wider()` will attempt to preserve all the existing columns including `measure_title` which has six distinct observations.
|
||||||
To fix this problem we need to tell `pivot_wider()` which columns identify each row; in this case that's the variables starting with `org`:
|
To fix this problem we need to tell `pivot_wider()` which columns identify each row; in this case that's the variables starting with `org`:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
|
@ -401,41 +426,98 @@ cms_patient_experience |>
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Widening multiple variables
|
### How does `pivot_wider()` work?
|
||||||
|
|
||||||
`cms_patient_care` has a similar structure:
|
To understand how `pivot_wider()` works, let's again start with a very simple dataset:
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
cms_patient_care
|
df <- tribble(
|
||||||
|
~id, ~name, ~value,
|
||||||
|
"A", "x", 1,
|
||||||
|
"B", "y", 2,
|
||||||
|
"B", "x", 3,
|
||||||
|
"A", "y", 4,
|
||||||
|
"A", "z", 5,
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
Depending on what you want to do next I think there are three meaningful ways:
|
We'll take the values from "value" and the names from "name":
|
||||||
|
|
||||||
```{r}
|
```{r}
|
||||||
cms_patient_care |>
|
df |>
|
||||||
pivot_wider(
|
pivot_wider(
|
||||||
names_from = type,
|
names_from = name,
|
||||||
values_from = score
|
values_from = value
|
||||||
)
|
|
||||||
|
|
||||||
cms_patient_care |>
|
|
||||||
pivot_wider(
|
|
||||||
names_from = measure_abbr,
|
|
||||||
values_from = score
|
|
||||||
)
|
|
||||||
|
|
||||||
cms_patient_care |>
|
|
||||||
pivot_wider(
|
|
||||||
names_from = c(measure_abbr, type),
|
|
||||||
values_from = score
|
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
We'll come back to this idea in the next section; for different analysis purposes you may want to consider different things to be variables
|
The connection between the position of the row in the input and the cell in the output is much weaker than in `pivot_longer()` because the rows and columns are primarily determined by the values of variables, not their location.
|
||||||
|
|
||||||
|
To begin the process, `pivot_wider()` first needs to figure out what will go in the rows and columns.
|
||||||
|
Finding the column names is easy: it's just the values of `name`.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df |>
|
||||||
|
distinct(name)
|
||||||
|
```
|
||||||
|
|
||||||
|
By default, the rows in the output are formed by all variables that aren't going into the names or the values.
|
||||||
|
These are called the `id_cols` and we'll come back to this argument shortly.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df |>
|
||||||
|
select(-name, -value) |>
|
||||||
|
distinct()
|
||||||
|
```
|
||||||
|
|
||||||
|
`pivot_wider()` then uses this data to generate an empty data frame:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df |>
|
||||||
|
select(-name, -value) |>
|
||||||
|
distinct() |>
|
||||||
|
mutate(x = NA, y = NA, z = NA)
|
||||||
|
```
|
||||||
|
|
||||||
|
It then fills in all the missing values using the data in the input.
|
||||||
|
In this case, not every cell in the output has a corresponding value in the input as there's no entry for id "B" and name "z", so that cell remains missing.
|
||||||
|
It's interesting that `pivot_wider()` can turn implicit missing values into explicit missing values and we'll come back to this idea in Chapter \@ref(missing-values).
|
||||||
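A short sketch of that last point, using a copy of the small dataset from above (renamed `df_small` here so it doesn't clash with other objects). It also uses the `values_fill` argument of `pivot_wider()`, which hasn't been introduced in this chapter; whether a fill value such as 0 is appropriate is an assumption that depends on what a missing combination means in your data:

```r
library(tibble)
library(tidyr)

# Copy of the simple dataset: there is no row for id "B" and name "z".
df_small <- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "B", "y", 2,
  "B", "x", 3,
  "A", "y", 4,
  "A", "z", 5
)

# Default behaviour: the implicit missing value becomes an explicit NA.
wide <- df_small |>
  pivot_wider(names_from = name, values_from = value)

# Assumption for illustration: treat a missing combination as 0.
filled <- df_small |>
  pivot_wider(names_from = name, values_from = value, values_fill = 0)
```

In `wide`, the `z` cell for id "B" is `NA`; in `filled`, the same cell holds 0.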
|
|
||||||
|
You might also wonder what happens if there are multiple rows in the input that correspond to one cell in the output.
|
||||||
|
The example below has two rows that correspond to id "A" and name "x":
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df <- tribble(
|
||||||
|
~id, ~name, ~value,
|
||||||
|
"A", "x", 1,
|
||||||
|
"A", "x", 2,
|
||||||
|
"A", "y", 3,
|
||||||
|
"B", "x", 4,
|
||||||
|
"B", "y", 5,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
If we attempt to pivot this we get an output that contains list-columns, which you'll learn more about in Chapter \@ref(list-columns):
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df |> pivot_wider(
|
||||||
|
names_from = name,
|
||||||
|
values_from = value
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
Since you don't know how to work with this sort of data yet, you'll want to follow the hint in the warning to figure out where the problem is:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
df %>%
|
||||||
|
group_by(id, name) %>%
|
||||||
|
summarise(n = n(), .groups = "drop") %>%
|
||||||
|
filter(n > 1L)
|
||||||
|
```
|
||||||
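Once you've found the duplicates, one possible way forward (a sketch, not the only option) is to tell `pivot_wider()` how to collapse them with its `values_fn` argument. The example below assumes that averaging the duplicated values is meaningful, which may well not be true for your data; often the better fix is to repair the input. The name `df_dup` is just a copy of the dataset above:

```r
library(tibble)
library(tidyr)

# Copy of the dataset with two rows for id "A" and name "x".
df_dup <- tribble(
  ~id, ~name, ~value,
  "A", "x", 1,
  "A", "x", 2,
  "A", "y", 3,
  "B", "x", 4,
  "B", "y", 5
)

# Assumption for illustration: duplicates are summarised by their mean,
# so each output cell holds a single number instead of a list.
wide <- df_dup |>
  pivot_wider(
    names_from = name,
    values_from = value,
    values_fn = mean
  )
```

With this choice the `x` cell for id "A" holds 1.5, the mean of 1 and 2, and the output contains ordinary numeric columns.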
|
|
||||||
## Untidy data
|
## Untidy data
|
||||||
|
|
||||||
While I showed a couple of examples of using `pivot_wider()` to make tidy data, it's real strength is making **untidy** data.
|
While `pivot_wider()` is occasionally useful for making tidy data, its real strength is making **untidy** data.
|
||||||
While that sounds like a bad thing, untidy isn't a pejorative term: there are many untidy data structures that are extremely useful.
|
While that sounds like a bad thing, untidy isn't a pejorative term: there are many untidy data structures that are extremely useful.
|
||||||
Tidy data is a great starting point for most analyses but it's not the only data format you'll ever need.
|
Tidy data is a great starting point for most analyses but it's not the only data format you'll ever need.
|
||||||
|
|
||||||
|
@ -584,3 +666,54 @@ It's totally fine to be pragmatic and to say a variable is whatever makes your a
|
||||||
So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
|
So if you're stuck figuring out how to do some computation, maybe it's time to switch up the organisation of your data.
|
||||||
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for those with a variable number of values (like sums or means) it's usually easier in rows.
|
For computations involving a fixed number of values (like computing differences or ratios), it's usually easier if the data is in columns; for those with a variable number of values (like sums or means) it's usually easier in rows.
|
||||||
Don't be afraid to untidy, transform, and re-tidy if needed.
|
Don't be afraid to untidy, transform, and re-tidy if needed.
|
||||||
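As a small illustration of this rule of thumb, here is a sketch with made-up data (the `scores` table and its column names are hypothetical). A difference between two fixed values is easiest once those values sit side by side in columns, while a mean over however many values there are is easiest with the values in rows:

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hypothetical data: one row per person and time point.
scores <- tribble(
  ~id, ~time,    ~score,
  "A", "before",     10,
  "A", "after",      15,
  "B", "before",     20,
  "B", "after",      18
)

# Fixed number of values (a difference): widen first, then subtract columns.
change <- scores |>
  pivot_wider(names_from = time, values_from = score) |>
  mutate(diff = after - before)

# Variable number of values (a mean): keep the rows and summarise by group.
averages <- scores |>
  group_by(id) |>
  summarise(mean_score = mean(score))
```

If the rest of the analysis needs the long form again, `change` can be pivoted back with `pivot_longer()`.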
|
|
||||||
|
Let's explore this idea by looking at `cms_patient_care`, which has a similar structure to `cms_patient_experience`:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
cms_patient_care
|
||||||
|
```
|
||||||
|
|
||||||
|
It contains information about 9 measures (`beliefs_addressed`, `composite_process`, `dyspena_treatment`, ...) on 14 different facilities (identified by `ccn` with name given by `facility_name`).
|
||||||
|
However, compared to `cms_patient_experience`, each measurement is recorded with both a normalized score, which is the percentage of patients affected, and a denominator, which is the number of patients.
|
||||||
|
For example, `beliefs_addressed`, which is the "documentation in the clinical record of a discussion of spiritual/religious concerns or documentation that the patient/caregiver did not want to discuss", records the percentage of patients in `observed` and the total number of patients that the metric applies to in `denominator`.
|
||||||
|
|
||||||
|
Depending on what you want to do next you might find any of the following three structures useful:
|
||||||
|
|
||||||
|
- If you want to compute the number of patients that had a positive answer to the question, you might pivot `type` into the columns:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
cms_patient_care |>
|
||||||
|
pivot_wider(
|
||||||
|
names_from = type,
|
||||||
|
values_from = score
|
||||||
|
) |>
|
||||||
|
mutate(
|
||||||
|
numerator = round(observed / 100 * denominator)
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
- If you wanted to display the distribution of each metric, you might keep it as is:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
cms_patient_care |>
|
||||||
|
filter(type == "observed") |>
|
||||||
|
ggplot(aes(score)) +
|
||||||
|
geom_histogram(binwidth = 2) +
|
||||||
|
facet_wrap(~ measure_abbr) +
|
||||||
|
xlim(85, NA)
|
||||||
|
```
|
||||||
|
|
||||||
|
- If you wanted to explore how different metrics are related, you might put the measure names in the columns:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
cms_patient_care |>
|
||||||
|
filter(type == "observed") |>
|
||||||
|
select(-type) |>
|
||||||
|
pivot_wider(
|
||||||
|
names_from = measure_abbr,
|
||||||
|
values_from = score
|
||||||
|
) |>
|
||||||
|
ggplot(aes(dyspnea_screening, dyspena_treatment)) +
|
||||||
|
geom_point() +
|
||||||
|
coord_equal()
|
||||||
|
```
|
||||||
|
|
Binary file not shown.
Binary file not shown.
Size: 58 KiB
Binary file not shown.
Size: 42 KiB